What’s in my AI?

A Comprehensive Analysis of Datasets Used to Train GPT-1, GPT-2, GPT-3, GPT-NeoX-20B, Megatron-11B, MT-NLG, and Gopher

Alan D. Thompson
March 2022
26 pages including title page, references, and appendix.

Download PDF (2.5MB).

Major updates since publication
A lot has changed since this report was published in Mar/2022. The most significant update is DeepMind’s multilingual version of MassiveText, explored in the DeepMind RETRO paper (p. 24, Table 8) in Feb/2022, slightly before this report was published. While MassiveText (English) was 2.35T tokens at a determined 10.5TB, MassiveText (multilingual) is now 5T tokens at an estimated 22.3TB, and would still be the largest text dataset in the world as of Q1 2023.
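The 22.3TB figure can be reproduced with a simple scaling assumption: that the multilingual dataset has roughly the same bytes-per-token ratio as the English version (10.5TB for 2.35T tokens). A minimal sketch of that estimate:

```python
# Sketch of the MassiveText (multilingual) size estimate.
# Assumption: multilingual text has the same bytes-per-token ratio
# as MassiveText (English), which is reported as 10.5TB for 2.35T tokens.

ENGLISH_TOKENS = 2.35e12     # MassiveText (English), tokens
ENGLISH_SIZE_TB = 10.5       # MassiveText (English), terabytes
MULTILINGUAL_TOKENS = 5.0e12 # MassiveText (multilingual), tokens (RETRO paper)

# Terabytes per token, derived from the English dataset
tb_per_token = ENGLISH_SIZE_TB / ENGLISH_TOKENS

# Apply the same ratio to the multilingual token count
multilingual_size_tb = MULTILINGUAL_TOKENS * tb_per_token

print(f"{multilingual_size_tb:.1f} TB")  # ≈ 22.3 TB
```

In practice the bytes-per-token ratio varies by language and tokenizer, so this is a rough first-order estimate rather than a measured size.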

The report has been received by AllenAI (AI2) and by the United Nations (UN).


Pre-trained transformer language models have become a stepping stone towards artificial general intelligence (AGI), with some researchers reporting that AGI may evolve from our current language model technology. While these models are trained on increasingly larger datasets, the documentation of basic metrics including dataset size, dataset token count, and specific details of content is lacking. Notwithstanding proposed standards for documentation of dataset composition and collection, nearly all major research labs have fallen behind in disclosing details of datasets used in model training. The research synthesized here covers the period from 2018 to early 2022, and represents a comprehensive view of all datasets—including major components Wikipedia and Common Crawl—of selected language models from GPT-1 to Gopher.


1. Overview
1.1. Wikipedia
1.2. Books
1.3. Journals
1.4. Reddit links
1.5. Common Crawl
1.6. Other
2. Common Datasets
2.1. Wikipedia (English) Analysis
2.2. Common Crawl Analysis
3. GPT-1 Dataset
3.1. GPT-1 Dataset Summary
4. GPT-2 Dataset
4.1. GPT-2 Dataset Summary
5. GPT-3 Datasets
5.1. GPT-3: Concerns with Dataset Analysis of Books1 and Books2
5.2. GPT-3: Books1
5.3. GPT-3: Books2
5.4. GPT-3 Dataset Summary
6. The Pile v1 (GPT-J & GPT-NeoX-20B) datasets
6.1. The Pile v1 Grouped Datasets
6.2. The Pile v1 Dataset Summary
7. Megatron-11B & RoBERTa Datasets
7.1. Megatron-11B & RoBERTa Dataset Summary
8. MT-NLG Datasets
8.1. Common Crawl in MT-NLG
8.2. MT-NLG Grouped Datasets
8.3. MT-NLG Dataset Summary
9. Gopher Datasets
9.1. MassiveWeb Dataset Analysis
9.2. Gopher: Concerns with Dataset Analysis of Wikipedia
9.3. Gopher: No WebText
9.4. Gopher Grouped Datasets
9.5. Gopher Dataset Summary
10. Conclusion
11. Further reading
Appendix A: Top 50 Resources: Wikipedia + CC + WebText (i.e. GPT-3)

Download PDF of alt view (1920×1080 slide)

Update Aug/2022: Coverage of the datasets used to train the Google Pathways models, including PaLM 540B (Apr/2022), is available in the related report, Google Pathways: An Exploration of the Pathways Architecture from PaLM to Parti.

Video: Presentation of this paper @ Devoxx Belgium 2022


Dr Alan D. Thompson is an AI expert and consultant, advising Fortune 500s and governments on post-2020 large language models. His work on artificial intelligence has been featured at NYU, with Microsoft AI and Google AI teams, at the University of Oxford’s 2021 debate on AI Ethics, and in the Leta AI (GPT-3) experiments viewed more than 2.5 million times. A contributor to the fields of human intelligence and peak performance, he has held positions as chairman for Mensa International, consultant to GE and Warner Bros, and memberships with the IEEE and IET. He is open to consulting and advisory on major AI projects with intergovernmental organizations and enterprise.

This page last updated: 5/Mar/2023. https://lifearchitect.ai/whats-in-my-ai/