A Comprehensive Analysis of Datasets Likely Used to Train GPT-5
Alan D. Thompson
LifeArchitect.ai
August 2024
27 pages incl title page, references, appendices.
The report
Download PDF (5MB).
(Available exclusively to full subscribers of The Memo.)
Abstract
The rapid advancement of large language models (LLMs) has been accompanied by a corresponding decrease in transparency regarding their training data. As these models are developed using ever-expanding datasets, crucial information such as dataset size, token count, and content specifics has become increasingly obscure. Based on a thorough investigation of OpenAI’s known partnerships and collaborations with various organizations, this report provides a comprehensive analysis of more than 27 datasets totalling around half a quadrillion tokens in two petabytes of data before filtering. Following a cleaning process, this would become more than 70T tokens in 281TB used in training OpenAI GPT-5, and a rough parameter estimate is provided in this paper. Building on the acclaimed 2022 report, What’s in my AI?, and the pre-release analysis Google DeepMind Gemini: A general specialist, this report offers a quantitative foundation for exploring how frontier models powered by vast datasets are augmenting and expanding our cognitive capabilities, and ushering in a new era of superintelligence.
Contents
1. Overview
2. A note on very large numbers
3. Synthetic data
4. GPT-5 datasets: Advancing data quality, volume, and breadth
4.1. Synthetic
4.2. Reddit – Posts + Outbound Only (web)
4.3. Common Crawl – General (web)
4.4. Consensus NLP (academic papers)
4.5. YouTube (dialogue)
4.6. News Corp/WSJ (news articles)
4.7. Common Crawl – Edu (web)
4.8. Reddit – Comments Only (dialogue)
4.9. FreeLaw: PACER (legal)
4.10. GitHub (code)
4.11. Books2 (books)
4.12. FreeLaw: Opinions (legal)
4.13. Stack Exchange (Q&A)
4.14. Books3 (books)
4.15. Shutterstock (metadata)
4.16. Wikipedia – Multilingual (wiki)
4.17. Books1 (books)
4.18. Time Inc (news articles)
4.19. Le Monde and Prisa Media (news articles)
4.20. Wikipedia – English (wiki)
4.21. AP (news articles)
4.22. Dotdash Meredith (news articles)
4.23. FT (news articles)
4.24. Axel Springer SE (news articles)
4.25. Icelandic Government (multilingual data)
4.26. Khan Academy (Q&A)
4.27. ExamSolutions (Q&A)
4.28. Other
5. Conclusion
6. Further reading
Appendix A: Synthetic data topics (original via Cosmopedia)
Appendix B: General dataset calculations lookup table
Appendix C: GPT-5 dataset calculations (selected data only)
Cover image
Image generated in a few seconds, on 28 July 2024, text prompt by Alan D. Thompson, via Firefly Image 3 (art, 4:3): ‘navy blue data pattern, background’
Technical acknowledgments
Claude 3.5 Sonnet was used for some writing, copyediting, text manipulation, APA style referencing, OCR, and more.
GPT-4o was used for manipulating CSV files for the tables in the linked sheets and shown in the appendices.
Google Workspace was instrumental in facilitating complex linking with automatic updates between Sheets, Docs, and charting, allowing the draft of this paper to be completed in a single night.
The soundtrack to this paper was provided by 1980s Aussie band vSpy vSpy.
Synthetic data (extract from Section 3, p6-9 of the report)
Synthetic data is artificially generated information that mimics real-world data, used to train AI models. It is used when human-generated datasets are exhausted or unavailable, for increasing quantity, and—more recently—for increasing quality.
Up until the end of 2023, the use of synthetic data to train large language models was not widespread. While OpenAI did assign one of the GPT-4 team as a ‘synthetic data contributor’ in 2022, this role was specifically for the reinforcement learning and alignment process that happened after pre-training on the GPT-4 dataset was complete.
In October 2023, Microsoft sourced 6B tokens of human-generated ‘textbook quality’ data from the web, and used a large language model (OpenAI GPT-3.5) to generate another 1B tokens of synthetic data similar to the format of new textbooks and exercises. Later, Microsoft increased this to 20B tokens of synthetic data.
In March 2024, Hugging Face detailed an end-to-end process for generating synthetic text that they called ‘Cosmopedia’.
[For phi, Microsoft] curated 20,000 topics to produce 20 billion tokens of synthetic textbooks while using samples from web datasets for diversity… Assuming an average file length of 1000 tokens, this suggests using approximately 20 million distinct prompts…By targeting four different audiences (young children, high school students, college students, researchers) and leveraging three generation styles (textbooks, blog posts, wikiHow articles)… [Hugging Face created] over 30 million prompts for Cosmopedia….
Sourcing a nominal amount of web data, Hugging Face used a large language model (Mistral Mixtral-8x7B) to generate 25B tokens of ‘synthetic textbooks, blogposts, stories, posts and WikiHow articles’. To increase the variety of generated samples, the prompts leveraged diversity in audience and style, as illustrated in these three published Cosmopedia prompt examples for the same topic:
1. Kids. Write a long and very detailed course unit for a textbook on ‘Why Go To Space?’ intended for young children… Remember this unit is intended for young children books, so use very simple, everyday words and phrases that a 10-year-old would easily understand. Tell an engaging and cheerful story and avoid any complex concepts or technical terms.
2. Students. Write a long and very detailed course unit for a textbook on ‘Why Go To Space?’ intended for high school students… Use language and examples that would relate with teenage students balancing educational rigor with accessibility. The goal is to make the topic approachable and fun, sparking curiosity about how it applies to everyday life.
3. Professionals. Write a long and very detailed course unit for a textbook on ‘Why Go To Space?’ intended for professionals and researchers in the field… The content should aim to engage a highly knowledgeable audience with very deep expertise in the topic. Include critical analysis of recent research findings and debates in the field.
For GPT-4, the OpenAI ‘data team’ consisted of 35 people responsible for collecting and curating corpora for the dataset1GPT-4 paper, p15 within a total headcount of 1,200 staff.2Bloomberg, 4/Apr/2024: ‘OpenAI now has about 1,200 employees, Lightcap said.’ https://archive.md/SptzU Given unlimited access to frontier models and inference, as well as an extensive human workforce dedicated to dataset collection and curation, it would be trivial to build out high quality synthetic data against relevant topics, with the final dataset measured in trillions and then quadrillions of tokens. Notably, this process only needs to happen once, and the final ‘gold’ dataset can be used in training future models.
In the case of GPT-5, it has been reported that synthetic data comprises around… <see full report for details>
A complete list of 110 topics used for synthetic data generation via Cosmopedia is available in Appendix A, and outlined in the following table. For this report, topics have been assigned a category by GPT-4o, and the list has been sorted in alphabetical order. These categories can be compared with (but are not the same as) the proprietary library classification system, the Dewey Decimal Classification (DDC 23): Computer science (000), Philosophy and psychology (100), Religion (200), Social sciences (300), Language (400), Pure science (500), Technology (600), Arts and recreation (700), Literature (800), History and geography (900).
Arts & Entertainment | Arts and Crafts, Fiction and Fantasy Writing, Literature and Creative Writing, Music, Performing Arts, Video Games, Visual Arts and Art Appreciation |
Business & Economics | Business and Entrepreneurship, Business and Management, Career Development and Job Opportunities, Digital Marketing and Business, Economics and Finance, Finance and Investment, Insurance, Loans and Mortgages, Marketing and Business Strategies, Personal Finance and Investments, Product Marketing and Design, Real Estate & Investment, Taxation and Finance |
Education & Learning | Education, Education and Youth Development, Human Resources / Organizational Management, Human Resources and Education, Leadership and Education, Writing and Storytelling |
Food & Beverage | Cooking and Baking, Cooking and Recipes, Culinary Arts and Beverages, Wine & Winemaking |
Health & Medicine | Addiction and Mental Illness, Cosmetic Surgery and Body Modifications, Dentistry, Entomology and Apiculture, HIV Treatment and Care, Healthcare & Medical Services, Healthcare and Operations Management, Infant Feeding and Child Development, Medicine, Mental Health Counseling, Mental Health and Therapy, Nutrition and Health, Pharmaceutical manufacturing and technology, Psychology |
Home & Lifestyle | Audio Equipment and Home Theater Systems, Automotive Parts and Accessories, Cleaning and Maintenance, Home Improvement and Maintenance, Online Chat Platforms and Data Privacy, Pets and Pet Care, Recreational Fishing, Travel, Waste Management and Recycling, Watchmaking and Horology |
Law & Policy | International Relations and Conflict, International Relations and Current Events, International Relations and Politics, Legal Services and Issues, Legal Studies / Law, Legal Studies and Public Policy, Political Science, Politics and Government |
Lifestyle & Culture | Astrology, Cannabis and CBD Products, Christian Theology and Spirituality, Christianity and Theology, Events and Community Happenings, Fashion & Apparel, Hair Care, Hair Care and Styling, Health and Lifestyle, Jewelry Design and Manufacturing, Online Dating & Relationships, Personal Development and Empowerment, Skincare and Beauty Products |
Public Affairs & Safety | Fire Incidents, Gun Control and Violence, Public Administration and Policy, Public Safety and Emergency Response, Public Transit and Transportation |
Science & Technology | Astronomy and Astrophysics, Biochemistry and Molecular Biology, Computer Antivirus Software and Security, Computer Hardware and Graphics Cards, Computer Programming and Web Development, Computer Science, Computer Security & Privacy, Cryptocurrency and Blockchain Technology, Data Privacy and Protection, Digital Imaging and Photography, Electric Vehicles and Battery Technology, Energy and Environmental Policy, Energy and Natural Resources, Genetics and Mental Health, Geography and Weather, Lighting Design and Technology, Molecular Biology and Genetics, Online Platforms & Web Technologies, Technology and Computer Science, Technology and Consumer Electronics |
Sports & Physical Activities | American Football, Arts and Crafts, Baseball, Cricket, Fiction and Fantasy Writing, Football/Soccer, Ice Hockey, Literature and Creative Writing, Music, Performing Arts, Physical Fitness and Health, Professional Basketball/NBA, Professional Wrestling and Sports Entertainment, Sports and Education, Tennis, Video Games, Visual Arts and Art Appreciation |
Table. List of 110 topics for use in synthetic data generation. Data source: Cosmopedia repo, list categorized and formatted by GPT-4o for this report.
Viz
Figure 1. GPT-5 datasets by type (filtered, includes synthetic data).
Figure 2. GPT-5 datasets by type (filtered, synthetic data not shown).
Figure 3. GPT-5 datasets by token count. Logarithmic scale, billions of tokens.
All dataset reports by LifeArchitect.ai (most recent at top)Date | Title |
Aug/2024 | What's in GPT-5? |
Jul/2024 | Argonne National Laboratory AuroraGPT (page) |
Sep/2023 | Google DeepMind Gemini: A general specialist |
Mar/2022 | What's in my AI? (GPT-1, GPT-2, GPT-3, MT-NLG, Chinchilla...) |
Get The Memo
by Dr Alan D. Thompson · Be inside the lightning-fast AI revolution.Bestseller. 10,000+ readers from 142 countries. Microsoft, Tesla, Google...
Artificial intelligence that matters, as it happens, in plain English.
Get The Memo.
Dr Alan D. Thompson is an AI expert and consultant, advising Fortune 500s and governments on post-2020 large language models. His work on artificial intelligence has been featured at NYU, with Microsoft AI and Google AI teams, at the University of Oxford’s 2021 debate on AI Ethics, and in the Leta AI (GPT-3) experiments viewed more than 4.5 million times. A contributor to the fields of human intelligence and peak performance, he has held positions as chairman for Mensa International, consultant to GE and Warner Bros, and memberships with the IEEE and IET. Technical highlights.
This page last updated: 18/Sep/2024. https://lifearchitect.ai/whats-in-gpt-5/↑
- 1GPT-4 paper, p15
- 2Bloomberg, 4/Apr/2024: ‘OpenAI now has about 1,200 employees, Lightcap said.’ https://archive.md/SptzU