Language model sizes
Alan D. Thompson

Contents:
- Summary of current models
- Increasing dataset sizes 2018-2022
- GPT-3’s top 10 datasets by domain/source
- Contents of GPT-3 & the Pile v1
- Contents of Chinese models
- Language model sizes & predictions
Facebook BlenderBot 2.0
Launched in July 2021, BlenderBot 2.0 is pre-trained on Reddit discussion data and fine-tuned on the ConvAI2, Empathetic Dialogues, and Wizard of Wikipedia (WoW) datasets. Two additional fine-tuning datasets are Multi-Session Chat and Wizard of the Internet (WizInt). For safety, it is trained on the BAD (Bot-Adversarial Dialogue) dataset. Finally, in real time, it can incorporate live results by ‘generating its own search queries, reading the results, and taking them into account when formulating a response.’
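A rough sketch of that search-then-respond loop follows; every function here is a hypothetical stub, not the actual BlenderBot/ParlAI API:

```python
# Hypothetical sketch of BlenderBot 2.0's internet-augmented response loop.
# These stubs are placeholders for the query-generator, search, and
# generator modules described in the paper; none are the real API.

def generate_search_query(history):
    # In BlenderBot 2.0 a seq2seq model generates this from the dialogue.
    return history[-1]  # stub: just reuse the last user turn

def web_search(query, top_k=5):
    return [f"result {i} for {query!r}" for i in range(top_k)]  # stub

def generate_response(history, documents):
    # The real model attends over retrieved documents plus long-term memory.
    return f"Reply grounded in {len(documents)} live search results."

def respond(history):
    query = generate_search_query(history)
    docs = web_search(query)
    return generate_response(history, docs)

print(respond(["Who won the 2021 Formula 1 championship?"]))
```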
Facts on GPT-3
Think you’re a fast typist? In March 2021, GPT-3 was generating 3.1 million words per minute, non-stop, 24×7. With the general availability of the model, I expect that number is a lot higher now… (Nov/2021).
Per day = 4,500,000,000 (4.5 billion)
Per hour = 187,500,000 (187.5 million)
Per minute = 3,125,000 (3.125 million)
Every day, GPT-3 generates the equivalent of an entire US public library (80,000 books) of new content.
(“…more than 300 applications are now using GPT-3, and tens of thousands of developers around the globe are building on our platform. We currently generate an average of 4.5 billion words per day, and continue to scale production traffic.” (OpenAI blog, March 2021). Using an average of 55k words per book = 81,818 books per day. “In 2017, there were 9,045 public libraries in the United States with a total of 715 million books and serial volumes” (US stats) = 79,049 books per library.)
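These figures are easy to reproduce; a quick sanity check of the arithmetic:

```python
# Reproducing the GPT-3 output figures quoted above (OpenAI blog, March 2021).
words_per_day = 4_500_000_000

print(words_per_day / 24)        # 187,500,000 words per hour
print(words_per_day / 24 / 60)   # 3,125,000 words per minute

# Book and library equivalents:
words_per_book = 55_000                   # average assumed above
print(words_per_day / words_per_book)     # ≈ 81,818 books per day

us_library_books = 715_000_000            # total US public library volumes (2017)
us_libraries = 9_045
print(us_library_books / us_libraries)    # ≈ 79,049 books per library
```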
“GitHub says that for some programming languages, about 30% of newly written code is being suggested by the company’s AI programming tool Copilot.” (Axios, October 2021)
“The supercomputer developed for OpenAI [as of May 2020] is a single system with more than 285,000 CPU cores, 10,000 GPUs [assume NVIDIA Tesla V100 GPUs, released May/2017 and superseded by NVIDIA Ampere A100 GPUs in May/2020] and 400 gigabits per second of network connectivity for each GPU server.”
“Training GPT-3 with 175 billion parameters would require approximately 288 years with a single V100 NVIDIA GPU.”
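That figure can be sanity-checked with the common ‘6ND’ approximation for transformer training compute (≈6 FLOPs per parameter per token); a back-of-envelope sketch, assuming the reported 175B parameters and 300B training tokens:

```python
# Back-of-envelope check of the "288 years on one V100" claim,
# using the standard ~6 * params * tokens estimate of training FLOPs.
params = 175e9          # GPT-3 parameters
tokens = 300e9          # GPT-3 training tokens (Brown et al., 2020)
total_flops = 6 * params * tokens          # ≈ 3.15e23 FLOPs

seconds = 288 * 365.25 * 24 * 3600         # 288 years in seconds
sustained = total_flops / seconds
print(f"{sustained / 1e12:.1f} TFLOPS sustained")  # ≈ 34.7 TFLOPS

# A V100 peaks at ~125 TFLOPS (FP16 tensor cores), so the claim implies
# roughly 28% sustained utilisation, a plausible real-world figure.
```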
“…the model is a big black box, we can’t infer its beliefs.”
– InstructGPT paper, 2022: https://cdn.openai.com/papers/Training_language_models_to_follow_instructions_with_human_feedback.pdf
“Despite the impending widespread deployment of foundation [language] models, we currently lack a clear understanding of how they work, when they fail, and what they are even capable of due to their emergent properties.”
– Stanford paper, 2021: https://fsi.stanford.edu/publication/opportunities-and-risks-foundation-models
Note that there are some challenges with writing books using GPT-3 due to the output token limit. 2,048 tokens is about:
- 1,430 words (a token is ≈0.7 words).
- 82 sentences (a sentence is ≈17.5 words).
- 9 paragraphs (a paragraph is ≈150 words).
- 2.8 pages of text (a page is ≈500 words).
There are clever ways to extend this output by feeding the last (or most important) part of each completion back in as a new prompt, as sketched below.
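A minimal sketch of that chaining technique; generate() here is a hypothetical stub standing in for any real completion API:

```python
# Sketch of chaining past a 2,048-token output limit by feeding the tail
# of each completion back in as the next prompt.

WORDS_PER_TOKEN = 0.7           # rough estimate used above
MAX_TOKENS = 2_048              # ≈ 1,430 words per call

def generate(prompt: str) -> str:
    return prompt + " ... more text ..."   # stub completion

def write_long_text(seed: str, sections: int, carry_words: int = 200) -> str:
    text = seed
    for _ in range(sections):
        # Carry forward the last/most important output as the new prompt.
        tail = " ".join(text.split()[-carry_words:])
        text += " " + generate(tail)
    return text

print(len(write_long_text("Chapter 1.", sections=3).split()), "words")
```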
Our model was trained… on 300B tokens drawn from publicly available resources, attempting, in part, to replicate the structure of the training data as reported in Brown et al. (2020) [the GPT-3 dataset, which is detailed in the viz above at LifeArchitect.ai/models].
— AI21’s Jurassic-1 paper
BLOOM by BigScience & languages within LLMs
176B parameter multi-lingual model.
Trained in March-July 2022.
BLOOM = BigScience Large Open-science Open-access Multilingual (language model).
The BigScience open-research project is a year-long initiative (2021-2022) targeting the study of large models and datasets. The goal of the project is to research language models in a public environment outside large technology companies. The project involves 1,000 researchers from 60 countries and more than 250 institutions, and was initiated by Thomas Wolf at Hugging Face.
M6 by Alibaba
Multi-Modality to Multi-Modality Multitask Mega-transformer (M6)
From 100B to 1T to 10T parameters in less than a year!
“M6-Corpus for pretraining in Chinese, which consists of over 1.9TB image and 292GB text. The dataset has large coverage over domains, including encyclopedia, question answering, forum discussion, common crawl, etc”
Due to the complexity of this transformer and related language models, Megatron has its own page showing a summary of its timeline, the labs involved, and other details.
InstructGPT by OpenAI one-pager
* The initialism ‘HHH’ (helpful, honest, harmless) was coined by Anthropic, and demonstrated in InstructGPT.
WebGPT by OpenAI sample question set
PaLM by Google: Explaining jokes + Inference chaining
Luminous by Aleph Alpha
The Luminous text model was announced at a conference in Nov/2021, where the parameter count for Luminous (assuming Luminous-World, in progress as of Apr/2022) was said to be 200B.
The announced Luminous model uses up to 200 billion parameters and is considered to be just as powerful in the text domain as GPT, whose third version includes up to 175 billion parameters. In contrast to its American counterpart, Luminous can be combined with any number of images; the model is available in five languages (German, English, French, Italian, Spanish) and has been trained in the European cultural context.
— Heise, translated.
| # | Model name | Parameters (estimate)* |
|---|------------|------------------------|
| 1 | Luminous Base | 12–40B (Alan best guess) |
| 2 | Luminous Extended | 40–80B (Alan best guess) |
| 3 | Luminous Enhanced | 80–180B (Alan best guess) |
| 4 | Luminous World | 200B (from presentation) |
* Alan’s best guess, made during testing in Apr/2022, based on informed ‘feel’, token/credit comparison, and tiering.
DeepMind’s models are: Gopher, Chinchilla, Flamingo, and Gato (Spanish for ‘cat’).
Google Imagen has 2B image-generation parameters + 1B upscaling parameters + 4.6B LLM parameters (text encoding, via T5-XXL), for a total of ≈7.6B. Imagen was released by the Google Research and Google Brain teams in Toronto, Canada.
BriVL by RUC, China (Jun/2022)
I rarely comment on VLMs (vision-language models) or multimodal models, but this one was interesting. They may have been pipped to the post in 2022 by DeepMind (Flamingo, Gato) and Google (PaLM, Imagen), and even OpenAI (DALL-E 2).
BriVL seems to be mainly a publicity stunt, to drive marketing to Beijing’s pursuit of being an AI leader.
BriVL a year ago (Mar/2021)
“[In Mar/2021] The first version of our BriVL model has 1 billion parameters, which is pretrained on the RUC-CAS-WenLan dataset with 30 million image-text pairs… In the near future, our BriVL model will be enlarged to 10 billion parameters, which will be pre-trained with 500 million image-text pairs.” — https://arxiv.org/pdf/2103.06561.pdf
BriVL today (Jun/2022)
“[In Jun/2022] With 112 NVIDIA A100 GPUs in total, it takes about 10 days to pre-train our BriVL model over our WSCD of 650 million image-text pairs.” – Nature Communications
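For a sense of scale, that quote implies (my arithmetic):

```python
# Rough training-compute scale implied by the BriVL quote above.
gpus, days = 112, 10
gpu_hours = gpus * days * 24        # 26,880 A100 GPU-hours
pairs = 650_000_000                 # WSCD image-text pairs
print(gpu_hours, round(pairs / gpu_hours))  # ≈ 24,182 pairs per GPU-hour
```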
For comparison: CLIP was trained on 400M image-text pairs with 63M parameters, and DALL-E on 250M pairs with 12B parameters. So… BriVL is a nice evolution here.
The most interesting parts of the paper were the bombastic and flowery quotes around artificial general intelligence (AGI).
First, compare this quote from the cautious and mindful open-source AI lab, EleutherAI, in their paper on GPT-NeoX-20B:
We believe that Transformative Artificial Intelligence (TAI) is approaching… recent increases in the capabilities of large language models (LLMs) raises the possibility that the first generation of transformatively powerful AI systems may be based on similar principles and architectures as current large language models like GPT. This has motivated a number of research groups to work on “prosaic alignment”, a field of study that considers the AI alignment problem in the case of TAI being built primarily with techniques already used in modern ML. We believe that due to the speed of AI progress, there is a significant chance that this assumption is true, and, therefore, that contributing and enabling contributions to prosaic alignment research will have a large impact. – EleutherAI, 20B paper, Feb/2022
Next, compare the carefulness above with the Chinese BriVL paper:
– “…we demonstrate that strong imagination ability is now possessed by our foundation model. We believe that our work makes a transformative stride towards AGI, from our common practice of “weak or narrow AI” to that of “strong or generalized AI”.”
– “BriVL possesses strong capability of imagination given a complicated sentence as prompt.”
– “…even hints of common sense reasoning ability of our BriVL.”
– “…by effectively fusing the complex human emotions and thoughts from those weakly correlated image-text pairs, our BriVL is made more cognitive and general (i.e., much closer to AGI).”
Perceiver AR by DeepMind (Jun/2022)
Perceiver AR is an autoregressive, modality-agnostic architecture that ‘…can directly attend to over a hundred thousand tokens, enabling practical long-context density estimation.’
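A minimal numpy sketch of the core Perceiver idea (simplified; real Perceiver AR adds causal masking and learned projections): a long input is cross-attended into a much smaller latent array, so cost scales with latents × inputs rather than inputs²:

```python
import numpy as np

# Minimal sketch of Perceiver-style cross-attention: N input tokens are
# compressed into M << N latents, so this layer costs O(M*N) rather than
# the O(N^2) of ordinary self-attention over the raw inputs.

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(inputs, latents):
    # inputs: (N, d) long sequence; latents: (M, d) learned queries
    d = inputs.shape[-1]
    scores = latents @ inputs.T / np.sqrt(d)   # (M, N) attention scores
    return softmax(scores) @ inputs            # (M, d) compressed summary

rng = np.random.default_rng(0)
inputs = rng.normal(size=(100_000, 32))   # "over a hundred thousand tokens"
latents = rng.normal(size=(256, 32))
print(cross_attend(inputs, latents).shape)  # (256, 32); the expensive
                                            # self-attention stack then runs
                                            # only on this small latent array
```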
AlexaTM 20B by Amazon Alexa AI (Aug/2022)
Dataset: multilingual Wiki + mC4 only.
Training cost at standard rate:
- 16x AWS p4d.24xlarge compute instances (8x GPUs each = 128x NVIDIA A100 GPUs)
- $32.77/hr on-demand each = $524.32/hr on-demand total
- 2,880 hours (120 days)
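Multiplying that out (my calculation, at on-demand rates; reserved pricing would be lower):

```python
# On-demand cost estimate for the AlexaTM 20B training run described above.
instances = 16
rate_per_instance = 32.77                # USD/hr, p4d.24xlarge on-demand
hourly = instances * rate_per_instance   # $524.32/hr
hours = 2_880                            # 120 days
print(f"${hourly * hours:,.0f}")         # ≈ $1,510,042 total
```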
Dr Alan D. Thompson is an AI expert and consultant. With Leta (an AI powered by GPT-3), Alan co-presented a seminar called ‘The new irrelevance of intelligence’ at the World Gifted Conference in August 2021. His applied AI research and visualisations are featured across major international media, including citations in the University of Oxford’s debate on AI Ethics in December 2021. He has held positions as chairman for Mensa International, consultant to GE and Warner Bros, and memberships with the IEEE and IET. He is open to consulting and advisory on major AI projects with intergovernmental organisations and enterprise.
This page last updated: 9/Aug/2022. https://lifearchitect.ai/models/