What’s in my AI?

A Comprehensive Analysis of Datasets Used to Train GPT-1, GPT-2, GPT-3, GPT-NeoX-20B, Megatron-11B, MT-NLG, and Gopher

Alan D. Thompson
March 2022
26 pages, including title page, references, and appendix.

Download PDF (2.5MB).


Pre-trained transformer language models have become a stepping stone towards artificial general intelligence (AGI), with some researchers reporting that AGI may evolve from our current language model technology. While these models are trained on increasingly large datasets, documentation of basic metrics, including dataset size, dataset token count, and specific details of content, is lacking. Notwithstanding proposed standards for documenting dataset composition and collection, nearly all major research labs have fallen behind in disclosing details of the datasets used in model training. The research synthesized here covers the period from 2018 to early 2022 and presents a comprehensive view of all datasets, including the major components Wikipedia and Common Crawl, of selected language models from GPT-1 to Gopher.


1. Overview
1.1. Wikipedia
1.2. Books
1.3. Journals
1.4. Reddit links
1.5. Common Crawl
1.6. Other
2. Common Datasets
2.1. Wikipedia (English) Analysis
2.2. Common Crawl Analysis
3. GPT-1 Dataset
3.1. GPT-1 Dataset Summary
4. GPT-2 Dataset
4.1. GPT-2 Dataset Summary
5. GPT-3 Datasets
5.1. GPT-3: Concerns with Dataset Analysis of Books1 and Books2
5.2. GPT-3: Books1
5.3. GPT-3: Books2
5.4. GPT-3 Dataset Summary
6. The Pile v1 (GPT-J & GPT-NeoX-20B) Datasets
6.1. The Pile v1 Grouped Datasets
6.2. The Pile v1 Dataset Summary
7. Megatron-11B & RoBERTa Datasets
7.1. Megatron-11B & RoBERTa Dataset Summary
8. MT-NLG Datasets
8.1. Common Crawl in MT-NLG
8.2. MT-NLG Grouped Datasets
8.3. MT-NLG Dataset Summary
9. Gopher Datasets
9.1. MassiveWeb Dataset Analysis
9.2. Gopher: Concerns with Dataset Analysis of Wikipedia
9.3. Gopher: No WebText
9.4. Gopher Grouped Datasets
9.5. Gopher Dataset Summary
10. Conclusion
11. Further Reading
Appendix A: Top 50 Resources: Wikipedia + CC + WebText (i.e. GPT-3)

Dr Alan D. Thompson is an AI expert and consultant. With Leta (an AI powered by GPT-3), Alan co-presented a seminar called ‘The new irrelevance of intelligence’ at the World Gifted Conference in August 2021. His applied AI research and visualisations are featured across major international media, including citations in the University of Oxford’s debate on AI Ethics in December 2021. He has held positions as chairman for Mensa International and consultant to GE and Warner Bros, and holds memberships with the IEEE and IET. He is open to consulting and advisory work on major AI projects with intergovernmental organisations and enterprise.

This page last updated: 14/May/2022. https://lifearchitect.ai/whats-in-my-ai/