Genesis Mission datasets



Alan D. Thompson
December 2025

Summary

Internal/project name: Genesis Mission (USA)
Organization: the US Department of Energy’s 17 National Laboratories, listed below.

  1. Ames Laboratory (Ames)
  2. Argonne National Laboratory (ANL)
  3. Brookhaven National Laboratory (BNL)
  4. Fermi National Accelerator Laboratory (FNAL)
  5. Idaho National Laboratory (INL)
  6. Lawrence Berkeley National Laboratory (LBNL)
  7. Lawrence Livermore National Laboratory (LLNL)
  8. Los Alamos National Laboratory (LANL)
  9. National Energy Technology Laboratory (NETL)
  10. National Renewable Energy Laboratory (NREL)
  11. Oak Ridge National Laboratory (ORNL)
  12. Pacific Northwest National Laboratory (PNNL)
  13. Princeton Plasma Physics Laboratory (PPPL)
  14. Sandia National Laboratories (SNL)
  15. Savannah River National Laboratory (SRNL)
  16. SLAC National Accelerator Laboratory (SLAC)
  17. Thomas Jefferson National Accelerator Facility (TJNAF/JLab)

Project page: https://genesis.energy.gov/
Dataset size (unfiltered):
Total storage (PB): ~3,635.5PB (petabytes)
Total storage (EB): ~3.6EB (exabytes)
Total tokens (T): ~908,875T (trillion)
Total tokens (Qa): ~908.9Qa (quadrillion)
Total tokens (Qi): ~0.91Qi (quintillion)

US Executive Order 14363 (24/Nov/2025):
The “Genesis Mission” [is] a dedicated, coordinated national effort to unleash a new age of AI‑accelerated innovation and discovery that can solve the most challenging problems of this century. The Genesis Mission will build an integrated AI platform to harness Federal scientific datasets — the world’s largest collection of such datasets, developed over decades of Federal investments — to train scientific foundation models and create AI agents to test new hypotheses, automate research workflows, and accelerate scientific breakthroughs.

Genesis Mission Updates

24/Nov/2025 (Day 0): Genesis Mission launched by US Executive Order. Timeline milestones appear below…

18/Dec/2025: OpenAI and Google sign separate MOUs to join the project as lead industry partners (plus an OpenAI science letter to DOE).

23/Jan/2026 (+60 days): The Secretary of Energy submits a detailed list of at least 20 national science and technology challenges to the Assistant to the President for Science and Technology.

22/Feb/2026 (+90 days): The Secretary identifies available Federal computing, storage, and networking resources, including DOE supercomputers and cloud-based systems, to support the Mission.

24/Mar/2026 (+120 days): The Secretary identifies initial data and model assets and develops a plan for incorporating datasets from federally funded research and other sources.

22/Jul/2026 (+240 days): The Secretary reviews capabilities across DOE national laboratories for robotic facilities able to engage in AI-directed experimentation and manufacturing.

21/Aug/2026 (+270 days): The Secretary seeks to demonstrate an initial operating capability of the American Science and Security Platform for at least one identified national challenge.

24/Nov/2026 (+1 year): The Secretary submits the first annual report to the President describing the Platform’s status, user engagement, and scientific outcomes.
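
The day offsets in the timeline above can be checked by adding each interval to the order date. This is a minimal sketch, assuming Day 0 is 24/Nov/2025 as stated; the `MILESTONES` labels and the `milestone_dates` helper are shorthand introduced here for illustration, not official names.

```python
from datetime import date, timedelta

DAY_ZERO = date(2025, 11, 24)  # Executive Order date, per the timeline above

# Offsets in days, as listed in the timeline; the labels are illustrative shorthand.
MILESTONES = {
    "challenge list": 60,
    "computing resources": 90,
    "data and model assets": 120,
    "robotic facilities review": 240,
    "initial operating capability": 270,
}

def milestone_dates(day_zero=DAY_ZERO, offsets=MILESTONES):
    """Return {label: calendar date} for each day offset from day_zero."""
    return {label: day_zero + timedelta(days=n) for label, n in offsets.items()}

# The annual report is due one calendar year after Day 0, not after a day count.
first_annual_report = DAY_ZERO.replace(year=DAY_ZERO.year + 1)

for label, when in milestone_dates().items():
    print(f"{when.isoformat()}  (+{MILESTONES[label]} days)  {label}")
print(f"{first_annual_report.isoformat()}  (+1 year)  first annual report")
```

Run as written, the computed dates agree with the ones listed in the timeline (23/Jan, 22/Feb, 24/Mar, 22/Jul, 21/Aug, and 24/Nov/2026).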

Datasets (estimates by LifeArchitect.ai)

Datasets likely to be identified as part of the Genesis Mission. All working and estimates are by LifeArchitect.ai.

| # | Dataset | Lab | Field | Size | Tokens | Quote | Source |
|---|---|---|---|---|---|---|---|
| 1 | APS-U X-Ray Data | ANL | Materials/Imaging | 500PB | 125Qa | “The upgraded APS will generate 2-3 orders of magnitude more data… reaching exabytes over its lifetime.” | ANL APS-U |
| 2 | ALCF Cosmology (HACC) | ANL/ORNL | Cosmology/Physics | 100PB | 25Qa | “Frontier-E… generated > 100 PB of data… establishing a new standard of end-to-end performance.” | ALCF Cosmology, Nov/2025 update |
| 3 | Materials Data Facility | ANL (Shared) | Materials Science | 2PB | 500T | “A scalable repository for publishing materials data… enabling ML discovery loops.” | MDF |
| 4 | RHIC Heavy Ion | BNL | Nuclear Physics | 180PB | 45Qa | “RHIC & ATLAS Tier-1 center store hundreds of petabytes of collision data.” | BNL SDCC |
| 5 | NSLS-II Imaging | BNL | Nanomaterials | 60PB | 15Qa | “NSLS-II data rates require AI-driven streaming analysis… approaching 1 PB/experiment.” | BNL News |
| 6 | LHC CMS Tier-1 | FNAL | High Energy Physics | 350PB | 87.5Qa | “Fermilab hosts the largest tier-1 computing center for CMS… managing exabyte-scale archives.” | FNAL Computing |
| 7 | DUNE Neutrino | FNAL | Particle Physics | 30PB | 7.5Qa | “Deep Underground Neutrino Experiment… massive liquid argon TPC image data for CNNs.” | DUNE Science |
| 8 | JGI Genomics | LBNL | Biology/Genomics | 80PB | 20Qa | “JGI creates petabytes of sequence data… DNA is ‘nature’s language’ (1 base = 1 token).” | JGI DOE |
| 9 | ESGF (CMIP6) | LBNL | Climate Change | 40PB | 10Qa | “Earth System Grid Federation… global repository for CMIP6… vital for climate AI twins.” | ESGF |
| 10 | Materials Project | LBNL | Chemistry | 0.5PB | 125T | “Information on over 150,000 materials… the ‘Google’ of materials properties.” | Materials Project |
| 11 | NIF Shot Data | LLNL | Fusion/HED | 50PB | 12.5Qa | “National Ignition Facility… data from fusion ignition shots used to calibrate simulation AI.” | LLNL NIF |
| 12 | Biodefense (ATOM) | LLNL | Biology/Pharma | 5PB | 1.25Qa | “ATOM consortium… transforming drug discovery… massive molecular libraries.” | LLNL ATOM |
| 13 | Stockpile Stewardship | LANL/LLNL/SNL | Nuclear Security | 1000PB | 250Qa | “The NNSA labs hold the world’s largest classified archives… necessary for ‘trusted’ AI models.” | NNSA ASC |
| 14 | CICE / E3SM | LANL | Climate/Ocean | 20PB | 5Qa | “Energy Exascale Earth System Model… high-res ocean/ice simulation data.” | E3SM |
| 15 | Viral Genomics | LANL | Epidemiology | 2PB | 500T | “HIV/Influenza/COVID databases… sequencing data for vaccine design AI.” | LANL pathogens |
| 16 | NSRDB | NREL | Solar Energy | 5PB | 1.25Qa | “National Solar Radiation Database… physics-based modeling spanning decades.” | NSRDB |
| 17 | WIND Toolkit | NREL | Wind Energy | 20PB | 5Qa | “2TB per year per region… total archives span terabytes to petabytes for grid planning.” | NREL WIND |
| 18 | ARM Archive | ORNL/PNNL | Atmospheric | 6PB | 1.5Qa | “Atmospheric Radiation Measurement… 30 years of continuous sensor data for climate AI.” | ARM.gov |
| 19 | Summit/Frontier I/O | ORNL | HPC Systems | 700PB | 175Qa | “Exascale I/O logs… analyzing system performance and scientific throughput of Frontier.” | OLCF |
| 20 | PopGen / MVP | ORNL | Health/Genomics | 15PB | 3.75Qa | “Million Veteran Program… one of the world’s largest genomic databases linked to health records.” | VA/ORNL |
| 21 | EMSL Data | PNNL | Molecular Science | 10PB | 2.5Qa | “Environmental Molecular Sciences Lab… mass spec and microscopy data for bio-earth systems.” | EMSL |
| 22 | LCLS-II “Data Deluge” | SLAC | X-Ray/Quantum | 300PB | 75Qa | “LCLS-II will deliver 8,000x more data… ‘The Data Deluge’… requires edge AI to manage.” | SLAC LCLS |
| 23 | LSST (Rubin) | SLAC | Astrophysics | 60PB | 15Qa | “Rubin Observatory Legacy Survey of Space and Time… 20TB/night… processing pipeline at SLAC.” | Rubin Obs |
| 24 | NSTX-U | PPPL | Plasma Physics | 15PB | 3.75Qa | “Spherical Torus experiment… microsecond-resolution sensor data for fusion control AI.” | PPPL NSTX-U |
| 25 | Z-Machine | SNL | HED Physics | 50PB | 12.5Qa | “World’s most powerful pulsed power facility… extreme conditions data for material science.” | Sandia Z |
| 26 | EDX Carbon | NETL | Fossil/Carbon | 16PB | 4Qa | “Energy Data eXchange… subsurface data for carbon sequestration and oil recovery.” | NETL EDX |
| 27 | Critical Materials | Ames | Rare Earths | 2PB | 500T | “CMI data… thermodynamic and phase diagram data for rare earth substitution.” | Ames CMI |
| 28 | CEBAF | TJNAF | Nuclear Physics | 10PB | 2.5Qa | “Continuous Electron Beam Accelerator Facility… probing the quark-gluon structure of matter.” | JLab |
| 29 | Environmental DB | SRNL | Ecology/Waste | 2PB | 500T | “Savannah River Site environmental monitoring… soil, water, and waste processing history.” | SRNL |
| 30 | VTR (Projected) | INL | Nuclear Energy | 5PB | 1.25Qa | “Versatile Test Reactor… expected to generate massive sensor streams for fast reactor fuel.” | DOE VTR |

Dataset size (unfiltered):
Total storage (PB): ~3,635.5PB (petabytes)
Total storage (EB): ~3.6EB (exabytes)
Total tokens (T): ~908,875T (trillion)
Total tokens (Qa): ~908.9Qa (quadrillion)
Total tokens (Qi): ~0.91Qi (quintillion)
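
The totals above can be reproduced by summing the Size column of the 30 rows and scaling the units. This is a rough sketch for checking the arithmetic only; the list `SIZES_PB` is transcribed by hand from the table, and the 250T-tokens-per-PB factor is the page's own stated conversion, not an independent measurement.

```python
# Sizes in petabytes, transcribed from the Size column of the table (rows 1-30).
SIZES_PB = [
    500, 100, 2, 180, 60, 350, 30, 80, 40, 0.5,
    50, 5, 1000, 20, 2, 5, 20, 6, 700, 15,
    10, 300, 60, 15, 50, 16, 2, 10, 2, 5,
]

TOKENS_T_PER_PB = 250  # trillion tokens per petabyte, the page's stated conversion

total_pb = sum(SIZES_PB)
total_tokens_t = total_pb * TOKENS_T_PER_PB

# 1 EB = 1,000 PB; 1 Qa = 1,000 T; 1 Qi = 1,000 Qa.
print(f"Total storage: {total_pb:,.1f} PB = {total_pb / 1_000:.1f} EB")
print(f"Total tokens:  {total_tokens_t:,.0f} T "
      f"= {total_tokens_t / 1_000:.1f} Qa "
      f"= {total_tokens_t / 1_000_000:.2f} Qi")
```

The sum comes to 3,635.5 PB and 908,875T tokens, matching the figures reported above.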

Source: LifeArchitect.ai
* All token counts are estimates, using a conservative standard text conversion of 1PB ≈ 250T tokens ≈ 0.25Qa tokens. The calculations are informed but rough. For similar working, see my 2022 paper: What’s in my AI? A Comprehensive Analysis of Datasets Used to Train GPT-1, GPT-2, GPT-3, GPT-NeoX-20B, Megatron-11B, MT-NLG, and Gopher.
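
To make the conversion concrete, here is a small sketch of the 1PB ≈ 250T rule applied to individual datasets. It assumes a decimal petabyte (10^15 bytes), under which the rule works out to about 4 bytes per token; the helper names `tokens_from_pb` and `humanize` are hypothetical and introduced only for this illustration.

```python
BYTES_PER_PB = 10**15         # assumption: decimal petabyte
TOKENS_PER_PB = 250 * 10**12  # 250T tokens per PB, as stated in the note above

def implied_bytes_per_token():
    """Bytes of raw data per token implied by the 1PB ≈ 250T rule."""
    return BYTES_PER_PB / TOKENS_PER_PB

def tokens_from_pb(size_pb):
    """Estimated token count for a dataset of size_pb petabytes."""
    return size_pb * TOKENS_PER_PB

def humanize(tokens):
    """Format a token count in T (10^12) or Qa (10^15), as used on this page."""
    if tokens >= 10**15:
        return f"{tokens / 10**15:g}Qa"
    return f"{tokens / 10**12:g}T"

print(implied_bytes_per_token())       # 4.0
print(humanize(tokens_from_pb(500)))   # 125Qa (APS-U X-Ray Data)
print(humanize(tokens_from_pb(0.5)))   # 125T  (Materials Project)
```

The rule treats every byte as if it were text; imaging, sensor, and simulation data would tokenize very differently, which is one reason the figures are described as rough.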

LifeArchitect.ai:
“If just 0.1% of this data is filtered and cleaned for the final dataset, it would still be ~1,000T tokens. The largest comparable publicly-known LLM training dataset is Qwen 3 (The Memo edition 14/May/2025) on just 36T tokens. It’s likely that frontier models like GPT-5 and Grok 5 were trained on 100T tokens. Full subscribers have access to my independent reports on these models.”
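
The filtering arithmetic in the quote can be sketched as follows. The inputs are the unfiltered total from this page and the 36T figure quoted for Qwen 3; the 0.1% keep rate is the quote's hypothetical, and `filtered_tokens_t` is a helper name made up for this example.

```python
TOTAL_TOKENS_T = 908_875  # unfiltered estimate from the totals above, in trillions
QWEN3_TOKENS_T = 36       # Qwen 3 training set size as quoted, in trillions

def filtered_tokens_t(total_t, keep_fraction):
    """Tokens (in trillions) remaining if only keep_fraction survives filtering."""
    return total_t * keep_fraction

kept = filtered_tokens_t(TOTAL_TOKENS_T, 0.001)  # keep 0.1%
print(f"Kept after filtering: ~{kept:,.1f}T tokens")
print(f"Relative to Qwen 3:   ~{kept / QWEN3_TOKENS_T:.1f}x")
```

Under these assumptions the filtered set is about 909T tokens, which the quote rounds to ~1,000T, or roughly 25 times the quoted Qwen 3 figure.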

Viz

Download source (PDF)

Permissions: Yes, you can use these visualizations anywhere, please leave the citation intact.

Related datasets


  1. The ANL AuroraGPT datasets were analysed in a similar way: LifeArchitect.ai/AuroraGPT
  2. The Well: 15TB (~4.1T tokens) of physics simulations

“The Well” contains curated physics simulations from 16 scientific domains, each capturing fundamental equations that appear throughout nature, and all validated and/or generated by domain experts:

  • Fluid dynamics & turbulence
  • Supernova explosions
  • Biological pattern formation
  • Acoustic wave propagation
  • Magnetohydrodynamics

Available at: https://github.com/PolymathicAI/the_well

  3. Multimodal Universe: 100TB (~27.5T tokens) of astronomical data

“Multimodal Universe” contains hundreds of millions of observations across multiple modalities, object types, and wavelengths. The data was collected from JWST, HST, Gaia, and several other major surveys, and unified in a single, ML-ready format.

Available at: https://github.com/MultimodalUniverse/MultimodalUniverse


Hypothetical prompts to a Genesis model

1. Global Energy Transition and Climate Stability

Prompt: Cross-reference the ESGF CMIP6 climate projections with the NSRDB and WIND Toolkit archives. Map the ideal placement for a global decentralized grid that maintains 100% uptime, using EDX Carbon data to identify sites for atmospheric carbon removal that utilize excess thermal output.

ASI Response: Mapping successful. By aligning atmospheric flow patterns with subsurface storage capacity, a global equilibrium is reachable within...

2. Accelerated Molecular Discovery for Longevity

Prompt: Apply the linguistic patterns found in JGI Genomics and PopGen / MVP to the molecular libraries in Biodefense ATOM. Identify the specific genetic sequences responsible for cellular decay and propose a viral delivery vector, based on Viral Genomics structures, to distribute corrective sequences.

ASI Response: The Million Veteran records reveal a distinct sequence subset that, when paired with the following molecular structure...

3. Room-Temperature Superconductivity and Material Synthesis

Prompt: Analyze the APS-U X-Ray and LCLS-II datasets alongside the Materials Project database. Identify a stable atomic lattice that exhibits zero electrical resistance at 295K, ensuring the material is composed of elements found in the Critical Materials inventory.

ASI Response: Superconductivity is a product of lattice geometry. The following nitrogen-doped lutetium structure remains stable under standard pressure...

4. Commercial Plasma Energy Generation

Prompt: Synthesize the sensor logs from NIF Shot Data, NSTX-U, and the Z-Machine. Create a predictive model for plasma containment that uses edge-AI to adjust magnetic fields in real-time, preventing the instabilities recorded in previous experiments.

ASI Response: The instability patterns in the Princeton and Sandia data are predictable. By shifting to a non-linear magnetic pulse, a net energy gain of 400% is achievable...

5. Systemic Risk and National Security Defense

Prompt: Using the Stockpile Stewardship archives and the Summit/Frontier I/O logs, simulate a global cyber-kinetic conflict. Identify the primary vulnerabilities in the domestic power grid and propose a self-healing software architecture that can withstand a multi-vector exascale attack.

ASI Response: Security resides in decentralization. The Frontier logs show that a distributed defense layer can absorb 99.8% of high-speed packet injections by shifting...

BHAG 1 (big, hairy, audacious goal!). Biological Immortality through Proteomic Correction

Prompt: Use the JGI Genomics and EMSL Data to map the precise folding failures that lead to senescence. Cross-reference this with the PopGen / MVP health records to identify the "immortality markers" in rare human lineages. Design a self-replicating nanobotic repair system, informed by NSLS-II imaging, that resets the epigenetic clock of every cell in a living organism to a biological age of twenty-five.

ASI Response: Aging is an accumulated software error in the proteome. By introducing a synthetic corrective enzyme that targets the following mitochondrial DNA sequences, the decay process ceases...

BHAG 2. Direct Matter-to-Energy Conversion (Mass-Energy Parity)

Prompt: Analyze the high-energy density physics within the Z-Machine and NIF Shot Data alongside the subatomic quark-gluon structures recorded at CEBAF. Determine the exact resonance frequency required to induce a controlled decay of non-fissile waste into pure kinetic energy, bypassing the need for traditional nuclear fuel cycles and providing infinite, zero-emission power from common silica.

ASI Response: The transition from mass to energy does not require heavy isotopes. By applying a focused harmonic pulse to stable atomic nuclei, we can trigger a localized release of binding energy...

BHAG 3. Instantaneous Global Neural Synchronization

Prompt: Leverage the ALCF Cosmology HACC simulations and the LHC CMS Tier-1 data to identify the quantum entanglement signatures of consciousness. Propose a method using LCLS-II photonics to establish a non-local communication layer between human brains, effectively removing the latency of language and resolving all human conflict through the total transparency of shared experience.

ASI Response: Language is a low-bandwidth bottleneck. The quantum signatures in the neural cortex are compatible with long-distance entanglement, allowing for the direct transfer of conceptual architectures...

BHAG 4. Total Resource Abundance via Atomic Reconfiguration

Prompt: Combine the Materials Project and Materials Data Facility libraries with the APS-U X-Ray imaging. Design an "atomic assembler" that can rearrange the molecular structure of ocean plastic and industrial waste into high-value Critical Materials, such as neodymium or platinum, at the atomic level, rendering scarcity and mining obsolete.

ASI Response: Material scarcity is a failure of sorting, not a lack of atoms. By utilizing the following electromagnetic assembly sequence, any carbon-based waste can be restructured into crystalline lattices...

All dataset reports by LifeArchitect.ai (most recent at top)
Date Title
Dec/2025 The Genesis Mission datasets (page)
Jan/2025 What's in Grok? (paper)
Jan/2025 NVIDIA Cosmos video dataset (page)
Aug/2024 What's in GPT-5? (paper)
Jul/2024 ANL AuroraGPT (page)
Sep/2023 Google DeepMind Gemini: A general specialist (paper)
Aug/2022 Google Pathways (paper)
Mar/2022 What's in my AI? (GPT-1, GPT-2, GPT-3, MT-NLG, Chinchilla...)
Sep/2021 Megatron the Transformer, and related language models (page)


Alan D. Thompson is a world expert in artificial intelligence, advising everyone from Apple to the US Government on integrated AI. Throughout Mensa International’s history, both Isaac Asimov and Alan held leadership roles, each exploring the frontier between human and artificial minds. His landmark analysis of post-2020 AI—from his widely-cited Models Table to his regular intelligence briefing The Memo—has shaped how governments and Fortune 500s approach artificial intelligence. With popular tools like the Declaration on AI Consciousness, and the ASI checklist, Alan continues to illuminate humanity’s AI evolution.

This page last updated: 19/Dec/2025. https://lifearchitect.ai/genesis/