Open the Datasets Table in a new tab | Back to LifeArchitect.ai
The Datasets Table is a list of datasets for large language models, as used by all major AI labs, including:
Genesis Mission, Cosmos, DeepSeek-R2, DCLM-Pool, GPT-5 dataset, Qwen3, Llama 4, RedPajama-Data-v2, Multimodal Universe, Piper monorepo, MNBVC (Massive Never-ending BT Vast Chinese corpus), AuroraGPT, Claude-3.5 dataset, FineWeb, GPT-4 dataset, HPLT v.2.0 (cleaned), Nemotron-Pre-Training-Dataset-v1, FineWeb-Edu-score-2, CulturaX, HPLT (High Performance Language Technologies), RefinedWeb, MassiveText ML, Matrix, FineWeb2, Cultura-Y, The Well, DCLM-Baseline, PaLM 2 dataset, FinePDFs, Dolma, Infiniset, MADLAD-400, MassiveText EN, Common Pile v0.1, Pleias Common Corpus, InternLM, Stability New Pile, FineWeb-Edu 1.3T, Zyda, LLaMA, RedPajama, The Stack v2, SlimPajama, Common Corpus, ROOTS, The Pile v1, Institutional Books 1.0, StarCoder dataset (The Stack 1.2 subset), The Stack v1, GPT-3 dataset, FinePhrase, RoBERTa dataset, YouTube-Commons, Cosmopedia v2, Cosmopedia v0.1, GPT-2 dataset, GPT-1 dataset.
All dataset reports by LifeArchitect.ai (most recent at top)

| Date | Type | Title |
| --- | --- | --- |
| Dec/2025 | 📑 | Genesis Mission |
| Jan/2025 | 📑 | What's in Grok? |
| Jan/2025 | 💻 | NVIDIA Cosmos video dataset |
| Aug/2024 | 📑 | What's in GPT-5? |
| Jul/2024 | 💻 | Argonne AuroraGPT |
| Sep/2023 | 📑 | Google DeepMind Gemini: A general specialist |
| Feb/2023 | 💻 | Chinchilla data-optimal scaling laws: In plain English |
| Aug/2022 | 📑 | Google Pathways |
| Mar/2022 | 📑 | What's in my AI? |
| Sep/2021 | 💻 | Megatron the Transformer, and related language models |
| Ongoing... | 💻 | Datasets Table |