Important external papers

Models and datasets/corpora

Organization Model paper or model card or system card
OpenAI GPT-1 · GPT-2 · GPT-3 · GPT-4 · 4o · 4.5 · o1 · o3m · o3 · o4m · 5 · 5.1 · 5.2 · 5.5
Anthropic C3 · CS3.5 · CH3.5 · CS3.7 · CS4 · CO4 · CO4.1 · CS4.5 · CH4.5 · CO4.5 · CS4.6 · CO4.6 · Mythos · CO4.7
Google PaLM · PaLM 2 · G1 · G1.5 · G2.0F · G2.0FL · G2.5 · G2.5F · G3P · G3F · G3.1FL
Meta RoBERTa · FMoE · L3 · L3.3 · L4
DeepSeek V3 · R1 · V4 Pro · V4 Flash
Qwen Q2 · Q2.5 · Q3 · Q3O · Q3CN · Q3TTS · Q3.5O
xAI Grok 4 · Grok 4 Fast · Grok 4.1
BAAI Wudao 1 · Wudao 2.0

A Comprehensive Analysis of Datasets Likely Used to Train GPT-5

Alan D. Thompson
LifeArchitect.ai
August 2024
27 pages incl title page, references, appendices.

View the report


Kimi K2: Kimi Team. (2025). Kimi K2: Open Agentic Intelligence.

Gemma 3: Gemma Team. (2025). Gemma 3 Technical Report.

Transformers (DeepMind): Phuong & Hutter. (2022). Formal Algorithms for Transformers.

Gato (DeepMind): Reed et al. (2022). A Generalist Agent.

Connecting LLMs to robots (Google): Ahn et al. (2022). Do As I Can, Not As I Say: Grounding Language in Robotic Affordances.

Chinchilla scaling (DeepMind): Hoffmann et al. (2022). Training Compute-Optimal Large Language Models.


Google Pathways: An Exploration of the Pathways Architecture from PaLM to Parti

Alan D. Thompson
LifeArchitect.ai
August 2022
24 pages incl title page, references, appendix.

Read more…


InstructGPT (OpenAI): Ouyang et al. (2022). Training language models to follow instructions with human feedback.

GPT-NeoX-20B (EleutherAI): Black et al. (2022). GPT-NeoX-20B: An Open-Source Autoregressive Language Model.

MT-NLG (Microsoft/NVIDIA): Smith et al. (2022). Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, A Large-Scale Generative Language Model.

LaMDA (Google): Thoppilan et al. (2022). LaMDA: Language Models for Dialog Applications.

Fairseq (Meta AI): Artetxe et al. (2021). Efficient Large Scale Language Modeling with Mixtures of Experts.

Gopher (DeepMind): Rae et al. (2021). Scaling Language Models: Methods, Analysis & Insights from Training Gopher.

Yuan 1.0 (Inspur AI): Wu et al. (2021). Yuan 1.0: Large-Scale Pre-trained Language Model in Zero-Shot and Few-Shot Learning.

Macaw (Allen AI/AI2): Tafjord & Clark. (2021). General-Purpose Question-Answering with Macaw.

Jurassic-1 (AI21 Israel): Lieber et al. (2021). Jurassic-1: Technical Details and Evaluation.

Blenderbot 2.0 (Facebook): Komeili et al. (2021). Internet-Augmented Dialogue Generation.

PanGu Alpha (Huawei): Zeng et al. (2021). PanGu-Alpha: Large-scale Autoregressive Pretrained Chinese Language Models with Auto-parallel Computation.

The Pile v1 (EleutherAI): Gao et al. (2020). The Pile: An 800GB Dataset of Diverse Text for Language Modeling.


What’s in my AI? A Comprehensive Analysis of Datasets Used to Train GPT-1, GPT-2, GPT-3, GPT-NeoX-20B, Megatron-11B, MT-NLG, and Gopher

Alan D. Thompson
LifeArchitect.ai
March 2022
26 pages incl title page, references, appendix.

Read more…


Common Crawl: Dodge et al. (2021). Documenting Large Webtext Corpora: A Case Study on the Colossal Clean Crawled Corpus.

GPT-3 (OpenAI): Brown et al. (2020). Language Models are Few-Shot Learners.
The arXiv preprint (22 Jul 2020, v4) is the comprehensive 75-page version with all sections and appendices. The final NeurIPS camera-ready version (22 Oct 2020, 25 pages) has some sections removed and no appendices.

GPT-2 (OpenAI): Radford et al. (2019). Language Models are Unsupervised Multitask Learners.

GPT-1 (OpenAI): Radford et al. (2018). Improving Language Understanding by Generative Pre-Training.

RoBERTa (Meta AI): Liu et al. (2019). RoBERTa: A Robustly Optimized BERT Pretraining Approach.

BERT (Google): Devlin et al. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.

Fine-tuning (Howard): Howard & Ruder. (2018). Universal Language Model Fine-tuning for Text Classification.

Transformer (Google): Vaswani et al. (2017). Attention is All You Need.

AI Winter caused by scientists: Olazaran. (1996). A Sociological Study of the Official History of the Perceptrons Controversy. (PDF)

The Turing Test: Turing, A. M. (1950). Computing Machinery and Intelligence. Mind 49: 433-460.

First steps in AI by Turing: Turing, A. M. (1941-1948).
Guinness, R. (2018). What is Artificial Intelligence? Part 2.

Re-discovering ‘Intelligent machinery’ by Alan Turing.

(1948). ‘Intelligent machinery’ by Alan Turing, prepared/typed by ‘Gabriel’.


Organisations

OpenAI 2022: Johnson, S. (2022). A.I. Is Mastering Language. Should We Trust What It Says? The New York Times Magazine. (Archived PDF)

Inside OpenAI and Neuralink offices: Hao, K. (2020). The messy, secretive reality behind OpenAI’s bid to save the world. MIT Technology Review. (Archived PDF)


Ethics and data quality guidance

Nick Bostrom: (2022). Propositions Concerning Digital Minds and Society.

Aleph Alpha: Andrulis, J. (2022). Ethics and bias in generalizable AI.

Societal impacts (Anthropic): Ganguli et al. (2022). Predictability and Surprise in Large Generative Models.
(See also Anthropic’s work on reverse engineering transformer language models)

Foundation models (211-page report with 114 authors via Stanford AI): Bommasani et al. (2021). On the Opportunities and Risks of Foundation Models.

Parrots: Bender et al. (2021). On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? (Note: Banned by Google.)

GPT-3 quality: Strickland, E. (2021). OpenAI’s GPT-3 Speaks! (Kindly Disregard Toxic Language). IEEE Spectrum.

GPT-J quality: HN discussion (2021). A discussion about GPT-J, Books3 creation, and the exclusion of datasets like Literotica and the US Congressional Record… (Archived PDF)

GPT-4Chan: “…some dark corners of the web like 4Chan that are already sometimes unfortunately part of the pre-training of these large language models (maybe to try to remove them/mitigate them?).” – Clem Delangue, co-founder and CEO at Hugging Face. https://huggingface.co/ykilcher/gpt-4chan/discussions/1. https://archive.ph/6bddZ

See also my 2021 paper: Integrated AI: Dataset quality vs quantity via bonum (GPT-4 and beyond).

Sam Altman (2021). Moore’s Law for Everything.

OpenAI to USPTO: AI IP (2019). Comment Regarding Request for Comments on Intellectual Property Protection for Artificial Intelligence Innovation.

GPT-2 Ethics: Solaiman, I., et al. (2019). Release Strategies and the Social Impacts of Language Models.

Animal rights (as potential guidance for AI rights): Cambridge. (2012). The Cambridge Declaration of Consciousness (CDC).


Intergovernmental and governmental guidance

UN/UNESCO: AI ethics. Recommendation on the ethics of artificial intelligence (2021).

AI + the UN Sustainable Development Goals: 2030Vision (2019). AI & The Sustainable Development Goals: The state of play.

AI Ethics: WHO (2021). Ethics and governance of artificial intelligence for health.

AI Ethics: European Commission (2019). Ethics Guidelines for Trustworthy AI.

Australian Govt AI: (2021). Australia’s AI Action Plan: June 2021.

International AI Strategies: The team at AiLab.com.au hosts the most comprehensive list of all international AI strategies, from Australia to Vietnam.


Other

Spinning Up: OpenAI (2020). Spinning Up in Deep RL. (Note: This is a study guide for learning about deep reinforcement learning.)

Wudao usage agreement: BAAI (2021). Data Usage Agreement of Zhiyuan Enlightenment Platform. (Note: Translated to English.)

Get The Memo

by Dr Alan D. Thompson · Be inside the lightning-fast AI revolution.
Informs research at Apple, Google, Microsoft · Bestseller in 147 countries.
Artificial intelligence that matters, as it happens, in plain English.
Get The Memo.

Alan D. Thompson is a world expert in artificial intelligence, advising everyone from Apple to the US Government on integrated AI. Throughout Mensa International’s history, both Isaac Asimov and Alan held leadership roles, each exploring the frontier between human and artificial minds. His landmark analysis of post-2020 AI—from his widely-cited Models Table to his regular intelligence briefing The Memo—has shaped how governments and Fortune 500s approach artificial intelligence. With popular tools like the Declaration on AI Consciousness, and the ASI checklist, Alan continues to illuminate humanity’s AI evolution. Technical highlights.

This page last updated: 28/Apr/2026. https://lifearchitect.ai/papers/