Alan D. Thompson
26 pages, including title page, references, and appendix.
Received by AllenAI (AI2).
Received by the UN.
Pre-trained transformer language models have become a stepping stone towards artificial general intelligence (AGI), with some researchers reporting that AGI may evolve from our current language model technology. While these models are trained on increasingly large datasets, documentation of basic metrics, including dataset size, dataset token count, and specific details of content, is lacking. Notwithstanding proposed standards for documenting dataset composition and collection, nearly all major research labs have fallen behind in disclosing details of the datasets used in model training. The research synthesized here covers the period from 2018 to early 2022, and presents a comprehensive view of all datasets (including the major components Wikipedia and Common Crawl) used by selected language models from GPT-1 to Gopher.
1.4. Reddit links
1.5. Common Crawl
2. Common Datasets
2.1. Wikipedia (English) Analysis
2.2. Common Crawl Analysis
3. GPT-1 Dataset
3.1. GPT-1 Dataset Summary
4. GPT-2 Dataset
4.1. GPT-2 Dataset Summary
5. GPT-3 Datasets
5.1. GPT-3: Concerns with Dataset Analysis of Books1 and Books2
5.2. GPT-3: Books1
5.3. GPT-3: Books2
5.4. GPT-3 Dataset Summary
6. The Pile v1 (GPT-J & GPT-NeoX-20B) datasets
6.1. The Pile v1 Grouped Datasets
6.2. The Pile v1 Dataset Summary
7. Megatron-11B & RoBERTa Datasets
7.1. Megatron-11B & RoBERTa Dataset Summary
8. MT-NLG Datasets
8.1. Common Crawl in MT-NLG
8.2. MT-NLG Grouped Datasets
8.3. MT-NLG Dataset Summary
9. Gopher Datasets
9.1. MassiveWeb Dataset Analysis
9.2. Gopher: Concerns with Dataset Analysis of Wikipedia
9.3. Gopher: No WebText
9.4. Gopher Grouped Datasets
9.5. Gopher Dataset Summary
11. Further reading
Appendix A: Top 50 Resources: Wikipedia + CC + WebText (i.e. GPT-3)
Update Aug/2022: Coverage of the datasets used to train the Google Pathways models including PaLM 540B (Apr/2022) is available in the related report, Google Pathways. An Exploration of the Pathways Architecture from PaLM to Parti.
Dr Alan D. Thompson is an AI expert and consultant. With Leta (an AI powered by GPT-3), Alan co-presented a seminar called ‘The new irrelevance of intelligence’ at the World Gifted Conference in August 2021. His applied AI research and visualisations are featured across major international media, including citations in the University of Oxford’s debate on AI Ethics in December 2021. He has held positions as chairman for Mensa International, consultant to GE and Warner Bros, and memberships with the IEEE and IET. He is open to consulting and advisory on major AI projects with intergovernmental organisations and enterprise.
This page last updated: 2/Aug/2022. https://lifearchitect.ai/whats-in-my-ai/