Inside language models (from GPT-3 to PaLM)

Summary of current models

Summary of current models: View the full data (Google sheets)
Download PDF version

Language model sizes

Or: While you were sleeping, AI sizes were exploding

Download source (PDF)

Increasing dataset sizes 2018-2022


What’s in my AI? A Comprehensive Analysis of Datasets Used to Train GPT-1, GPT-2, GPT-3, GPT-NeoX-20B, Megatron-11B, MT-NLG, and Gopher

Alan D. Thompson
March 2022
26 pages incl title page, references, appendix.


GPT-3’s top 10 datasets by domain/source

Download source (PDF)
Contents: View the data (Google sheets)

Contents of GPT-3 & the Pile v1

Download source (PDF)
Contents: View the data (Google sheets)
Read detail of datasets within GPT-3 and the Pile v1, & see alternative viz

List of datasets in data models GPT-3, GPT-J, GPT-NeoX

Note: Text provided here for indexing only, please see the Google sheet above for formatting as intended.


What is in GPT-3? GPT-3 contains (sorted by most tokens/effective size):

  1. Common Crawl (www)
  2. WebText (Reddit links)
  3. Books2 (Libgen or similar)
  4. Books1/BookCorpus (Smashwords)
  5. Wikipedia (facts)

The Pile v1

What is in the Pile v1? The Pile v1 contains (sorted by most tokens/effective size):

  1. Common Crawl (www)
  2. PubMed Central (papers)
  3. Books3 (Bibliotik tracker)
  4. WebText (Reddit links)
  5. ArXiv (papers)
  6. Github (code)
  7. FreeLaw (papers)
  8. Stack Exchange (discussion)
  9. USPTO Background (papers)
  10. PubMed Abstracts (papers)
  11. Gutenberg (books)
  12. OpenSubtitles (movies)
  13. Wikipedia (facts)
  14. DM Mathematics (papers)
  15. Ubuntu IRC (discussion)
  16. Books1/BookCorpus (Smashwords)
  17. EuroParl (formal discussion)
  18. HackerNews (discussion)
  19. YoutubeSubtitles (movies)
  20. PhilPapers (papers)
  21. NIH ExPorter (papers)
  22. Enron Emails (discussion).

GPT-3 is sometimes misspelt as: GPT3, GPT 3, GPT three, GTP-3, GTP3, GTP 3, GTP three.

List of domains in the WebText dataset

What is in WebText? WebText contains (sorted by domain most ‘seen’):

  1. Google (www).
  2. Archive (www).
  3. Blogspot (blogs).
  4. GitHub (code).
  5. NYTimes (news).
  6. WordPress (blogs).
  7. Washington Post (news).
  8. BBC (news).
  9. The Guardian (news).
  10. eBay (goods).
  11. Pastebin (text).
  12. CNN (news).
  13. Yahoo! (news).
  14. Huffington Post (news).


List of domains in the C4 dataset

Common Crawl (C4)

What is in Common Crawl? Common Crawl includes (C4, cleaned/filtered, sorted by most tokens):

| # | C4 (Filtered Common Crawl) contents with Wikipedia removed for dedup… | % of 156B | Tokens |
|---|---|---|---|
| 1 | Google Patents (papers) | 0.48% | ~750M |
| 2 | The New York Times (news) | 0.06% | ~100M |
| 3 | Los Angeles Times (news) | 0.06% | ~90M |
| 4 | The Guardian (news) | 0.06% | ~90M |
| 5 | PLoS – Public Library of Science (papers) | 0.06% | ~90M |
| 6 | Forbes (news) | 0.05% | ~80M |
| 7 | HuffPost (news) | 0.05% | ~75M |
| 8 | – dead link (papers) | 0.05% | ~71M |
| 9 | Scribd (books) | 0.04% | ~70M |
| 10 | The Washington Post (news) | 0.04% | ~65M |
| 11 | The Motley Fool (opinion) | 0.04% | ~61M |
| 12 | InterPlanetary File System (mix) | 0.04% | ~60M |
| 13 | Frontiers Media (papers) | 0.04% | ~60M |
| 14 | Business Insider (news) | 0.04% | ~60M |
| 15 | Chicago Tribune (news) | 0.04% | ~59M |
| 16 | (discussion) | 0.04% | ~58M |
| 17 | The Atlantic (news) | 0.04% | ~57M |
| 18 | Springer Link (papers) | 0.04% | ~56M |
| 19 | Al Jazeera (news) | 0.04% | ~55M |
| 20 | Kickstarter (discussion) | 0.03% | ~54M |
| 21 | FindLaw Caselaw (papers) | 0.03% | ~53M |
| 22 | National Center for Biotech Info (papers) | 0.03% | ~53M |
| 23 | NPR (news) | 0.03% | ~52M |
| | and 1M+ more domains… | ~98.58% | ~153.8B |
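
The percentage column can be cross-checked against the approximate token counts. The sketch below is only a sanity check: it assumes the 156B-token total from the table header and uses a handful of rows copied from the table. The helper name `share_of_c4` is illustrative and does not come from the C4 paper.

```python
# Total size of C4 as given in the table header (156B tokens).
C4_TOTAL_TOKENS = 156e9

# A few rows copied from the table above: name -> approximate token count.
rows = {
    "Google Patents": 750e6,
    "The New York Times": 100e6,
    "Forbes": 80e6,
    "NPR": 52e6,
}


def share_of_c4(tokens: float, total: float = C4_TOTAL_TOKENS) -> float:
    """Return the share of the corpus as a percentage, rounded to 2 decimals."""
    return round(100 * tokens / total, 2)


for name, tokens in rows.items():
    print(f"{name}: {share_of_c4(tokens)}%")
```

For these four rows the computed shares match the table (0.48%, 0.06%, 0.05%, 0.03%). Because the token counts are themselves approximate, small differences in the last digit are to be expected on other rows.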

A huge ‘thank you!’ to Drs Jesse Dodge and Maarten Sap from the Allen Institute for AI for the revised chart in the C4 paper.

You can also search for any domain in the C4 dataset using the index hosted by the Allen Institute for AI.

Contents: View the data (Google sheets)

Contents of Chinese models

Download source (PDF)
Contents: View the data (Google sheets)

List of datasets in Chinese data models PanGu Alpha, Wudao 2.0

Note: Text provided here for indexing only, please see the Google sheet above for formatting as intended.

PanGu Alpha

What is in PanGu Alpha? PanGu Alpha contains (sorted by most tokens/effective size):

  1. Common Crawl (www)
  2. Public datasets: DuReader (discussion), Baidu QA (discussion), CAIL2018 (legal papers), SogouCA (news), and more…;
  3. News
  4. Encyclopedia: Baidu Baike (facts), Sogou Baike (facts), and more…;
  5. e-Books

WuDao 2.0

WuDaoCorpora 1.0 (dataset) and Wudao 1.0 (model) were launched in March 2021.
WuDaoCorpora 2.0 (dataset) and Wudao 2.0 (model) were launched in June 2021 (at the 2021 BAAI conference).

WuDaoCorpora 2.0 is composed of three parts:
1. WDC-Text (3TB text), the world’s largest plain text dataset.
2. WDC-ImageCaption (90TB image and text), the world’s largest multimodal dataset.
3. WDC-Dialogue (180GB text), the world’s largest Chinese dialogue dataset.

WDC-Text (3TB text)
3TB of text data, with labelling. “20 strict cleaning rules used by WuDaoCorpora1.0, and derives high-quality datasets from more than 100TB of original web page data.”

WDC-ImageCaption (90TB image and text)
“Contains 630 million image and text pairs, with a total data volume of about 90TB, the largest in the world. Among them, 600 million is related to graphics and text, and 30 million is a specific description of the content of the image.”

WDC-Dialogue (180GB text)
“Contains 181GB of high-quality Chinese dialogue data, and the total number of dialogues reaches 1.4B… Cleaned up 180GB of high-quality dialogue data from 9TB of raw data.”
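
To put the cleaning figures quoted above in perspective, here is a rough sketch of how much of the raw data survives cleaning. It assumes decimal units (1TB = 1000GB) and treats 'more than 100TB' as exactly 100TB, so the WDC-Text figure is an upper bound. The function name `retention_percent` is my own.

```python
# Raw and cleaned sizes quoted above, in terabytes (1 TB = 1000 GB assumed).
datasets = {
    "WDC-Text": {"raw_tb": 100.0, "clean_tb": 3.0},      # "more than 100TB" raw
    "WDC-Dialogue": {"raw_tb": 9.0, "clean_tb": 0.18},   # 180GB kept from 9TB
}


def retention_percent(raw_tb: float, clean_tb: float) -> float:
    """Percentage of the raw data that remains after cleaning."""
    return round(100 * clean_tb / raw_tb, 1)


for name, size in datasets.items():
    kept = retention_percent(size["raw_tb"], size["clean_tb"])
    print(f"{name}: about {kept}% of raw data kept")
```

Under these assumptions, roughly 3% (at most) of the raw web text and 2% of the raw dialogue data made it into the final corpora.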

What is in Wudao 2.0? Wudao 2.0 contains:
WuDaoCorpora2 – Chinese text only:

  1. Zhihu (discussion)
  2. Baidu Baike (facts/encyclopedia)
  3. Sogou Baike (facts/encyclopedia)
  4. Baidu QA (discussion)
  5. Other*:

(*best guess only, sorted by most visits);

  1. Tencent QQ (messenger)
  2. Sohu (news)
  3. Sina Weibo (discussion)
  4. Sina Corporation (news)
  5. Xinhua News Agency (news)
  6. Chinese Software Dev Network (discussion)
  7. Global Times (news)
  8. Tianya Club (discussion)
  9. (finance discussion)
  10. BabyTree (parenting discussion)
  11. CNBlogs (software discussion)
  12. 6Rooms (news)
  13. NetEase (discussion)
  14. Hunan Rednet (news)
  15. Bilibili (video discussion)
  16. and more…

“Corpora contains various data types including news, post bar comments (sic), encyclopedia information, etc. More specifically, WuDaoCorpora contains a 3 TB Chinese corpus collected from 822 million Web pages” (WuDaoCorpora paper, Tang et al, June 2021).

“For training of base model, we use a training set of 302GB, the distribution of these data is shown in Table 7” (Inverse Prompting paper, Tang et al, June 2021).

Wudao 2.0 is sometimes misspelt as: Wudao-2, Wudao 2, Wu dao 2.0, Woodao, Woo dao.

Chinese model names & dataset equivalent in English

PanGu Alpha: Launched by Huawei and others in April 2021.
Simplified Chinese: 盘古
Traditional Chinese: 盤古
Pinyin: Pán gǔ
Pronounced: pun-goo (rhymes with done tool)
English: Literal: ‘coil ancient’, first living being and the creator (coiled up in an egg).
Etymology: Mythical Chinese creation figure who emerged from a yin-yang egg and created the earth and sky (similar to the Christian creation story, and Pangu has been compared to Adam).

Wudao 2.0: Launched by the Beijing Academy of Artificial Intelligence (BAAI) and others in June 2021.
Simplified Chinese: 悟道
Traditional Chinese: 悟道
Pinyin: Wù dào
Pronounced: oo-dao (rhymes with tool now)
English: Literal: ‘Enlightenment’.
Etymology: Truth of the Dharma, the spiritual path.

| Chinese dataset | English dataset equivalent |
|---|---|
| Zhihu (discussion) | Quora |
| Baidu Baike (facts) (16M articles) | English Wikipedia (7M articles) |
| Sogou Baike (facts) | English Wikipedia (7M articles) |
| Baidu QA (discussion) | Stack Exchange |
| Tencent QQ (messenger) | ICQ |
| Sohu (news) | NBC |
| Sina Weibo (discussion) | Twitter |
| Sina Corporation (news) | CNN |
| Xinhua News Agency (news) | CBS |
| Chinese Software Dev Network (discussion) | Stack Exchange |
| Global Times (news) | Washington Post |
| Tianya Club (discussion) | Yahoo! Groups |
| (finance discussion) | Yahoo Finance |
| BabyTree (parenting discussion) | TheBump |
| CNBlogs (software discussion) | Hacker News |
| 6Rooms (news) | Huffington Post |
| NetEase (discussion) | Blizzard |
| Hunan Rednet (news) | The New York Times |
| Bilibili (video discussion) | YouTube |

Language model sizes & predictions

Download source (PDF)
Sizes: View the data (Google sheets)

Facebook BlenderBot 2.0

Launched in July 2021, BlenderBot 2.0 is pre-trained on Reddit discussion data and fine-tuned on the ConvAI2, Empathetic Dialogues, and Wizard of Wikipedia (WoW) datasets. It adds two further datasets: Multi-Session Chat and Wizard of the Internet (WizInt). To train for safety, it uses the BAD dataset. Finally, in realtime, it is able to add live results by ‘generating its own search queries, reading the results, and taking them into account when formulating a response.’

List of validation set domains in WizInt/BlenderBot 2.0

BlenderBot 2.0 chatbot uses live/realtime web search engine results as part of its language model. The validation set (WizInt) paired up humans to have a conversation, with one human given the option to perform a web search (query and query + "news") to respond to their partner in conversation. Search results were added to the conversation by the human 80.3% of the time. The resulting WizInt dataset (validation set of human conversations with search) is used as supervision for new queries in BlenderBot 2.0. That is, new conversations with BlenderBot 2.0 will generate new responses that may include live/realtime web search engine results.

Breakdown of the most common domains used during search (validation set breakdown). Shown are the most common 24.41%; there is a long tail of 1,233 other domains across the whole validation set.

Domain %
Wikipedia 8.56%
IMDb 3.08%
Britannica 2.28%
Healthline 0.84%
All Recipes 0.84%
Rotten Tomatoes 0.8%
Ranker 0.8%
Genius 0.76%
Rolling Stone 0.67%
Live About 0.63%
The Spruce Eats 0.55%
The Guardian 0.51%
Biography 0.51%
Esquire 0.42%
The Spruce 0.38%
Men’s Health 0.38%
Book Series in Order 0.38%
Trip Savvy 0.34%
Forbes 0.34%
Thoughtco 0.34%
Wikihow 0.34%
WebMD 0.34%
Thrillist 0.34%
1,233 more domains… 75.59%
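
As a quick check on the table above, the listed shares can be summed and compared with the stated 24.41% / 75.59% split. This is only a sketch using the rounded percentages exactly as printed; the variable names are mine.

```python
# Percentages copied from the table above, in order (Wikipedia ... Thrillist).
listed_shares = [
    8.56, 3.08, 2.28, 0.84, 0.84, 0.80, 0.80, 0.76, 0.67, 0.63,
    0.55, 0.51, 0.51, 0.42, 0.38, 0.38, 0.38,
    0.34, 0.34, 0.34, 0.34, 0.34, 0.34,
]

# Sum of the named domains, and what is left for the long tail.
listed_total = round(sum(listed_shares), 2)
long_tail = round(100 - listed_total, 2)

print(f"{len(listed_shares)} named domains cover {listed_total}%")
print(f"long tail covers {long_tail}%")
```

Summing the printed values gives 24.43% for the 23 named domains and 75.57% for the long tail, within 0.02 percentage points of the stated figures. The small gap is presumably due to each row having been rounded to two decimals.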

References for Blenderbot 2.0

From the paper: Summary of Figure 2.

Read the paper:
BlenderBot 2.0 (Facebook): Komeili et al (2021). Internet-Augmented Dialogue Generation. (PDF)

Facts on GPT-3

Think you’re a fast typist? In March 2021, GPT-3 was typing 3.1 million words per minute, non-stop, 24×7. With the general availability of the model, I expect that number is a lot higher now… (Nov/2021).

Per day = 4,500,000,000 (4.5 billion)
Per hour = 187,500,000 (187.5 million)
Per minute = 3,125,000 (3.125 million)

Every day, GPT-3 generates the equivalent of an entire US public library (80,000 books) of new content.

(“…more than 300 applications are now using GPT-3, and tens of thousands of developers around the globe are building on our platform. We currently generate an average of 4.5 billion words per day, and continue to scale production traffic.” (OpenAI blog, March 2021). Using an average of 55k words per book = 81,818 books per day. “In 2017, there were 9,045 public libraries in the United States with a total of 715 million books and serial volumes” (US stats) = 79,049 books per library.)
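
The arithmetic behind these figures can be reproduced in a few lines. The sketch below uses only the numbers quoted above (4.5 billion words per day, 55k words per book, 9,045 libraries, 715 million volumes); the variable names are mine.

```python
WORDS_PER_DAY = 4_500_000_000      # OpenAI blog, March 2021
WORDS_PER_BOOK = 55_000            # average book length assumed in the text
US_LIBRARIES = 9_045               # US public libraries, 2017
US_LIBRARY_VOLUMES = 715_000_000   # books and serial volumes, 2017

# Break the daily output down into hourly and per-minute rates.
words_per_hour = WORDS_PER_DAY / 24
words_per_minute = words_per_hour / 60

# Convert the daily output into books, and size up an average library.
books_per_day = WORDS_PER_DAY / WORDS_PER_BOOK
books_per_library = US_LIBRARY_VOLUMES / US_LIBRARIES

print(f"per hour:          {words_per_hour:,.0f} words")
print(f"per minute:        {words_per_minute:,.0f} words")
print(f"books per day:     {books_per_day:,.0f}")
print(f"books per library: {books_per_library:,.0f}")
```

The results are 187,500,000 words per hour, 3,125,000 words per minute, about 81,818 books per day, and about 79,049 books per library, which is where the 'one public library per day' comparison comes from.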

“GitHub says that for some programming languages, about 30% of newly written code is being suggested by the company’s AI programming tool Copilot.” (Axios, October 2021)

“The supercomputer developed for OpenAI [as of May 2020] is a single system with more than 285,000 CPU cores, 10,000 GPUs [assume NVIDIA Tesla V100 GPUs released May/2017, superseded by NVIDIA Ampere A100 GPUs in May/2020] and 400 gigabits per second of network connectivity for each GPU server.”


Jurassic-1 (178B)

Launched 12/Aug/2021.

Our model was trained… on 300B tokens drawn from publicly available resources, attempting, in part, to replicate the structure of the training data as reported in Brown et al. (2020) [the GPT-3 dataset, which is detailed in the viz above].
— AI21’s Jurassic-1 paper

tr11 by BigScience

176B parameter multi-lingual model.
Trained in March 2022.

The BigScience project for open research is a year-long initiative (2021-2022) targeting the study of large models and datasets. The goal of the project is to research language models in a public environment outside large technology companies. The project has 600 researchers from 50 countries and more than 250 institutions. The BigScience project was initiated by Thomas Wolf at Hugging Face.

“using regular Megatron-LM GPT2 language model w/o multi-lingual dataset”

GPT-2’s dataset was just English WebText, which is popular outbound Reddit links with more than two upvotes. It is 40GB.

They then claim to be using more languages:
“XXX: correct this Languages: ar, ca, code, en, es, eu, fr, id, indic-as, indic-bn, indic-gu, indic-hi, indic-kn, indic-ml, indic-mr, indic-ne, indic-or, indic-pa, indic-ta, indic-te, indic-ur, nigercongo-all, oscar-en, oscar-zh, pt, vi, zhs, zht”

These tags (mostly ISO 639-1 language codes, some with a dataset prefix) translate to:

  • Arabic.
  • Catalan; Valencian.
  • Code?
  • English.
  • Spanish.
  • Basque.
  • French.
  • Indonesian.
  • Indic languages: Assamese, Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Nepali, Oriya, Punjabi, Tamil, Telugu, Urdu.
  • Niger-Kordofanian languages.
  • Chinese.
  • Portuguese.
  • Vietnamese.
  • Chinese-Simplified.
  • Chinese-Traditional.
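
The mapping from tags to language names can be sketched as a small lookup. This is my own reading of the tag format, not BigScience code: it assumes that tags such as `indic-bn` or `oscar-zh` are a dataset prefix followed by an ISO 639-1 code, and it treats `code`, `nigercongo-all`, `zhs` and `zht` as special cases, labelled the same way as in the list above. The names `ISO_639_1`, `SPECIAL` and `describe` are illustrative.

```python
# Language tags quoted from the tr11 notes above.
TAGS = (
    "ar, ca, code, en, es, eu, fr, id, indic-as, indic-bn, indic-gu, "
    "indic-hi, indic-kn, indic-ml, indic-mr, indic-ne, indic-or, indic-pa, "
    "indic-ta, indic-te, indic-ur, nigercongo-all, oscar-en, oscar-zh, "
    "pt, vi, zhs, zht"
).split(", ")

# ISO 639-1 two-letter codes that appear in the tags (bare or after a prefix).
ISO_639_1 = {
    "ar": "Arabic", "ca": "Catalan", "en": "English", "es": "Spanish",
    "eu": "Basque", "fr": "French", "id": "Indonesian", "pt": "Portuguese",
    "vi": "Vietnamese", "zh": "Chinese",
    "as": "Assamese", "bn": "Bengali", "gu": "Gujarati", "hi": "Hindi",
    "kn": "Kannada", "ml": "Malayalam", "mr": "Marathi", "ne": "Nepali",
    "or": "Oriya", "pa": "Punjabi", "ta": "Tamil", "te": "Telugu",
    "ur": "Urdu",
}

# Tags that are not ISO 639-1 codes; labels follow the list above.
SPECIAL = {
    "code": "Code",  # exact contents unclear in the source
    "nigercongo-all": "Niger-Kordofanian languages",
    "zhs": "Chinese-Simplified",
    "zht": "Chinese-Traditional",
}


def describe(tag: str) -> str:
    """Return a readable name for a tag, keeping any dataset prefix."""
    if tag in SPECIAL:
        return SPECIAL[tag]
    # Split "prefix-code" on the last hyphen; bare codes have no prefix.
    prefix, _, code = tag.rpartition("-")
    name = ISO_639_1[code]
    return f"{name} ({prefix})" if prefix else name


for tag in TAGS:
    print(f"{tag:15} {describe(tag)}")
```

Note that `oscar-en` resolves to English and `oscar-zh` to Chinese, so both English and Chinese appear more than once across the 28 tags.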

Google did something similar with mT5 in Jun/2021 (with pretty graphs and tables!).

Given that OSCAR (oscar-en and oscar-zh) is sourced from the CommonCrawl, it might be safe to assume that all other languages are also sourced from the CC. Jesse and the Allen AI team found 101 languages in the massive CommonCrawl/C4 dataset.

M6 by Alibaba

Multi-Modality to Multi-Modality Multitask Mega-transformer (M6)
From 100B to 1T to 10T parameters in less than a year!

“M6-Corpus for pretraining in Chinese, which consists of over 1.9TB image and 292GB text. The dataset has large coverage over domains, including encyclopedia, question answering, forum discussion, common crawl, etc”


Due to the complexity of the Megatron transformer and related language models, Megatron has its own page showing a summary of the timeline, labs involved, and other details.

View the Megatron page.

InstructGPT by OpenAI one-pager

* The initialism ‘HHH’ (helpful, honest, harmless) was coined by Anthropic, and demonstrated in InstructGPT.

WebGPT by OpenAI sample question set

Contents: View the data (Google sheets)

PaLM by Google: Explaining jokes + Inference chaining

PaLM has 540B parameters. It is multilingual.

Contents: View the data (Google sheets)

Luminous by Aleph Alpha

The Luminous text model was announced at a conference in Nov/2021, where the parameter count for Luminous (assuming Luminous-World, in progress as of Apr/2022) was said to be 200B.

The announced luminous model uses up to 200 billion parameters and is considered to be just as powerful in the text part as GPT, whose third version includes up to 175 billion parameters. In contrast to the American counterpart, luminous can be combined with any number of images, the model is available in five languages (German, English, French, Italian, Spanish) and has been trained in the European cultural context.
— Heise, translated.

| # | Model name | Parameter estimate* |
|---|---|---|
| 1 | Luminous Base | 12–40B (Alan best guess) |
| 2 | Luminous Extended | 40–80B (Alan best guess) |
| 3 | Luminous Enhanced | 80–180B (Alan best guess) |
| 4 | Luminous World | 200B (from presentation) |

* Alan best guess made during testing Apr/2022, based on informed ‘feel’, token/credit comparison, and tiering.

DeepMind’s models

DeepMind’s models are: Gopher, Chinchilla, Flamingo, and Gato (cat).

Download source (PDF)


Dr Alan D. Thompson is an AI expert and consultant. With Leta (an AI powered by GPT-3), Alan co-presented a seminar called ‘The new irrelevance of intelligence’ at the World Gifted Conference in August 2021. His applied AI research and visualisations are featured across major international media, including citations in the University of Oxford’s debate on AI Ethics in December 2021. He has held positions as chairman for Mensa International, consultant to GE and Warner Bros, and memberships with the IEEE and IET. He is open to consulting and advisory on major AI projects with intergovernmental organisations and enterprise. Contact.

This page last updated: 18/May/2022.