Get The Memo.
Alan D. Thompson
December 2025
Summary
| Internal/project name | Genesis Mission (USA) |
| Organization | US Department of Energy’s 17 National Laboratories |
| Project page | https://genesis.energy.gov/ |
| Dataset size (unfiltered) | Total storage (PB): ~3,635.5PB (petabytes); Total storage (EB): ~3.6EB (exabytes); Total tokens (T): ~908,875T (trillion); Total tokens (Qa): ~908.9Qa (quadrillion); Total tokens (Qi): ~0.91Qi (quintillion) |
US Executive Order 14363 (24/Nov/2025):
The “Genesis Mission” [is] a dedicated, coordinated national effort to unleash a new age of AI‑accelerated innovation and discovery that can solve the most challenging problems of this century. The Genesis Mission will build an integrated AI platform to harness Federal scientific datasets — the world’s largest collection of such datasets, developed over decades of Federal investments — to train scientific foundation models and create AI agents to test new hypotheses, automate research workflows, and accelerate scientific breakthroughs.
Genesis Mission Updates
24/Nov/2025 (Day 0): Genesis Mission launched by US Executive Order. Timeline milestones appear below.
18/Dec/2025: OpenAI and Google sign separate MOUs to join the project as lead industry partners (plus an OpenAI science letter to DOE).
23/Jan/2026 (+60 days): The Secretary of Energy submits a detailed list of at least 20 national science and technology challenges to the Assistant to the President for Science and Technology.
22/Feb/2026 (+90 days): The Secretary identifies available Federal computing, storage, and networking resources, including DOE supercomputers and cloud-based systems, to support the Mission.
24/Mar/2026 (+120 days): The Secretary identifies initial data and model assets and develops a plan for incorporating datasets from federally funded research and other sources.
22/Jul/2026 (+240 days): The Secretary reviews capabilities across DOE national laboratories for robotic facilities able to engage in AI-directed experimentation and manufacturing.
21/Aug/2026 (+270 days): The Secretary seeks to demonstrate an initial operating capability of the American Science and Security Platform for at least one identified national challenge.
24/Nov/2026 (+1 year): The Secretary submits the first annual report to the President describing the Platform’s status, user engagement, and scientific outcomes.
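The milestone dates above follow directly from Day 0 plus the stated day offsets. A minimal sketch of that arithmetic is below; the names `DAY_0`, `MILESTONES`, `milestone_dates`, and `fmt` are illustrative, and the milestone labels are shortened paraphrases of the timeline entries.

```python
from datetime import date, timedelta

# Day 0: the date of the Executive Order, as given in the timeline.
DAY_0 = date(2025, 11, 24)

# Day offsets from the timeline; labels are shortened paraphrases.
MILESTONES = {
    60: "List of national science and technology challenges",
    90: "Federal computing, storage, and networking resources identified",
    120: "Initial data and model assets identified",
    240: "Review of robotic facilities across DOE labs",
    270: "Initial operating capability demonstrated",
}

# The first annual report is due one calendar year after Day 0.
ANNUAL_REPORT = DAY_0.replace(year=DAY_0.year + 1)


def milestone_dates(day_0: date = DAY_0) -> dict[int, date]:
    """Return the calendar date for each day offset in MILESTONES."""
    return {days: day_0 + timedelta(days=days) for days in MILESTONES}


def fmt(d: date) -> str:
    """Format a date in this page's DD/Mon/YYYY style (default C locale)."""
    return f"{d.day:02d}/{d.strftime('%b')}/{d.year}"


for days, d in milestone_dates().items():
    print(f"{fmt(d)} (+{days} days): {MILESTONES[days]}")
print(f"{fmt(ANNUAL_REPORT)} (+1 year): First annual report")
```

Using `timedelta` rather than counting months keeps the offsets in days, which is how the timeline states them.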
Datasets (estimates by LifeArchitect.ai)
Datasets likely to be identified as part of the Genesis Mission. All working and estimates are by LifeArchitect.ai.
| # | Dataset | Lab | Field | Size | Tokens | Quote | Source |
|---|---|---|---|---|---|---|---|
| 1 | APS-U X-Ray Data | ANL | Materials/Imaging | 500PB | 125Qa | “The upgraded APS will generate 2-3 orders of magnitude more data… reaching exabytes over its lifetime.” | ANL APS-U |
| 2 | ALCF Cosmology (HACC) | ANL/ ORNL | Cosmology/Physics | 100PB | 25Qa | “Frontier-E… generated > 100 PB of data… establishing a new standard of end-to-end performance.” | ALCF Cosmology, Nov/2025 update |
| 3 | Materials Data Facility | ANL (Shared) | Materials Science | 2PB | 500T | “A scalable repository for publishing materials data… enabling ML discovery loops.” | MDF |
| 4 | RHIC Heavy Ion | BNL | Nuclear Physics | 180PB | 45Qa | “RHIC & ATLAS Tier-1 center store hundreds of petabytes of collision data.” | BNL SDCC |
| 5 | NSLS-II Imaging | BNL | Nanomaterials | 60PB | 15Qa | “NSLS-II data rates require AI-driven streaming analysis… approaching 1 PB/experiment.” | BNL News |
| 6 | LHC CMS Tier-1 | FNAL | High Energy Physics | 350PB | 87.5Qa | “Fermilab hosts the largest tier-1 computing center for CMS… managing exabyte-scale archives.” | FNAL Computing |
| 7 | DUNE Neutrino | FNAL | Particle Physics | 30PB | 7.5Qa | “Deep Underground Neutrino Experiment… massive liquid argon TPC image data for CNNs.” | DUNE Science |
| 8 | JGI Genomics | LBNL | Biology/Genomics | 80PB | 20Qa | “JGI creates petabytes of sequence data… DNA is ‘nature’s language’ (1 base = 1 token).” | JGI DOE |
| 9 | ESGF (CMIP6) | LBNL | Climate Change | 40PB | 10Qa | “Earth System Grid Federation… global repository for CMIP6… vital for climate AI twins.” | ESGF |
| 10 | Materials Project | LBNL | Chemistry | 0.5PB | 125T | “Information on over 150,000 materials… the ‘Google’ of materials properties.” | Materials Project |
| 11 | NIF Shot Data | LLNL | Fusion/HED | 50PB | 12.5Qa | “National Ignition Facility… data from fusion ignition shots used to calibrate simulation AI.” | LLNL NIF |
| 12 | Biodefense (ATOM) | LLNL | Biology/Pharma | 5PB | 1.25Qa | “ATOM consortium… transforming drug discovery… massive molecular libraries.” | LLNL ATOM |
| 13 | Stockpile Stewardship | LANL/ LLNL/ SNL | Nuclear Security | 1000PB | 250Qa | “The NNSA labs hold the world’s largest classified archives… necessary for ‘trusted’ AI models.” | NNSA ASC |
| 14 | CICE / E3SM | LANL | Climate/Ocean | 20PB | 5Qa | “Energy Exascale Earth System Model… high-res ocean/ice simulation data.” | E3SM |
| 15 | Viral Genomics | LANL | Epidemiology | 2PB | 500T | “HIV/Influenza/COVID databases… sequencing data for vaccine design AI.” | LANL pathogens |
| 16 | NSRDB | NREL | Solar Energy | 5PB | 1.25Qa | “National Solar Radiation Database… physics-based modeling spanning decades.” | NSRDB |
| 17 | WIND Toolkit | NREL | Wind Energy | 20PB | 5Qa | “2TB per year per region… total archives span terabytes to petabytes for grid planning.” | NREL WIND |
| 18 | ARM Archive | ORNL/ PNNL | Atmospheric | 6PB | 1.5Qa | “Atmospheric Radiation Measurement… 30 years of continuous sensor data for climate AI.” | ARM.gov |
| 19 | Summit/Frontier I/O | ORNL | HPC Systems | 700PB | 175Qa | “Exascale I/O logs… analyzing system performance and scientific throughput of Frontier.” | OLCF |
| 20 | PopGen / MVP | ORNL | Health/Genomics | 15PB | 3.75Qa | “Million Veteran Program… one of the world’s largest genomic databases linked to health records.” | VA/ORNL |
| 21 | EMSL Data | PNNL | Molecular Science | 10PB | 2.5Qa | “Environmental Molecular Sciences Lab… mass spec and microscopy data for bio-earth systems.” | EMSL |
| 22 | LCLS-II “Data Deluge” | SLAC | X-Ray/Quantum | 300PB | 75Qa | “LCLS-II will deliver 8,000x more data… ‘The Data Deluge’… requires edge AI to manage.” | SLAC LCLS |
| 23 | LSST (Rubin) | SLAC | Astrophysics | 60PB | 15Qa | “Rubin Observatory Legacy Survey of Space and Time… 20TB/night… processing pipeline at SLAC.” | Rubin Obs |
| 24 | NSTX-U | PPPL | Plasma Physics | 15PB | 3.75Qa | “Spherical Torus experiment… microsecond-resolution sensor data for fusion control AI.” | PPPL NSTX-U |
| 25 | Z-Machine | SNL | HED Physics | 50PB | 12.5Qa | “World’s most powerful pulsed power facility… extreme conditions data for material science.” | Sandia Z |
| 26 | EDX Carbon | NETL | Fossil/Carbon | 16PB | 4Qa | “Energy Data eXchange… subsurface data for carbon sequestration and oil recovery.” | NETL EDX |
| 27 | Critical Materials | Ames | Rare Earths | 2PB | 500T | “CMI data… thermodynamic and phase diagram data for rare earth substitution.” | Ames CMI |
| 28 | CEBAF | TJNAF | Nuclear Physics | 10PB | 2.5Qa | “Continuous Electron Beam Accelerator Facility… probing the quark-gluon structure of matter.” | JLab |
| 29 | Environmental DB | SRNL | Ecology/Waste | 2PB | 500T | “Savannah River Site environmental monitoring… soil, water, and waste processing history.” | SRNL |
| 30 | VTR (Projected) | INL | Nuclear Energy | 5PB | 1.25Qa | “Versatile Test Reactor… expected to generate massive sensor streams for fast reactor fuel.” | DOE VTR |
Dataset size (unfiltered):
Total storage (PB): ~3,635.5PB (petabytes)
Total storage (EB): ~3.6EB (exabytes)
Total tokens (T): ~908,875T (trillion)
Total tokens (Qa): ~908.9Qa (quadrillion)
Total tokens (Qi): ~0.91Qi (quintillion)
Source: LifeArchitect.ai
* All token counts are estimates, using the standard conservative text calculation 1PB ≈ 250T tokens ≈ 0.25Qa tokens. The calculations are informed but rough. For similar working, see my 2022 paper: What’s in my AI? A Comprehensive Analysis of Datasets Used to Train GPT-1, GPT-2, GPT-3, GPT-NeoX-20B, Megatron-11B, MT-NLG, and Gopher.
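The totals above can be reproduced from the Size column of the table and the 1PB ≈ 250T rule of thumb. The sketch below is illustrative only: the list is copied by hand from rows 1 to 30, and the names `SIZES_PB`, `TOKENS_T_PER_PB`, and `pb_to_tokens_t` are assumptions, not part of the original working.

```python
# Sizes in petabytes, copied from the "Size" column of the table (rows 1-30).
SIZES_PB = [
    500, 100, 2, 180, 60, 350, 30, 80, 40, 0.5,
    50, 5, 1000, 20, 2, 5, 20, 6, 700, 15,
    10, 300, 60, 15, 50, 16, 2, 10, 2, 5,
]

# Rule of thumb from the footnote: 1PB is roughly 250T (trillion) tokens.
TOKENS_T_PER_PB = 250


def pb_to_tokens_t(size_pb: float) -> float:
    """Convert a size in petabytes to an estimated token count in trillions."""
    return size_pb * TOKENS_T_PER_PB


total_pb = sum(SIZES_PB)
total_t = pb_to_tokens_t(total_pb)

print(f"Total storage: ~{total_pb:,.1f}PB (~{total_pb / 1000:.1f}EB)")
print(f"Total tokens:  ~{total_t:,.0f}T")
print(f"               ~{total_t / 1_000:.1f}Qa")
print(f"               ~{total_t / 1_000_000:.2f}Qi")
```

Each row's Tokens value in the table follows the same conversion, for example 500PB × 250 = 125,000T = 125Qa for row 1.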
LifeArchitect.ai:
“If just 0.1% of this data is filtered and cleaned for the final dataset, it would still be ~1,000T tokens. The largest comparable publicly-known LLM training dataset is Qwen 3 (The Memo edition 14/May/2025) on just 36T tokens. It’s likely that frontier models like GPT-5 and Grok 5 were trained on 100T tokens. Full subscribers have access to my independent reports on these models.”
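The filtering estimate in the quote can be checked against the page's own total. Taking ~908,875T tokens as the unfiltered figure, 0.1% is about 909T, which the quote rounds to ~1,000T. A rough sketch follows; the variable names are illustrative, and the 36T figure is the Qwen 3 number cited in the quote.

```python
# Unfiltered total from this page, in trillions of tokens.
TOTAL_TOKENS_T = 908_875

# "0.1%" from the quote, expressed as one part in 1,000.
KEEP_ONE_IN = 1_000

# Qwen 3 training set size cited in the quote, in trillions of tokens.
QWEN3_TOKENS_T = 36

filtered_t = TOTAL_TOKENS_T / KEEP_ONE_IN
ratio_vs_qwen3 = filtered_t / QWEN3_TOKENS_T

print(f"0.1% of the unfiltered total: ~{filtered_t:,.0f}T tokens")
print(f"Relative to Qwen 3 (36T):     ~{ratio_vs_qwen3:.0f}x")
```

Even at this filtering rate, the result would be roughly 25 times the Qwen 3 figure under these assumptions.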
Viz
Download source (PDF)
Permissions: Yes, you can use these visualizations anywhere, please leave the citation intact.
Related datasets
- The ANL AuroraGPT datasets were analysed in a similar way: LifeArchitect.ai/AuroraGPT
- The Well: 15TB (~4.1T tokens) of physics simulations
“The Well” contains curated physics simulations from 16 scientific domains, each capturing fundamental equations that appear throughout nature, and all validated and/or generated by domain experts:
- Fluid dynamics & turbulence
- Supernova explosions
- Biological pattern formation
- Acoustic wave propagation
- Magnetohydrodynamics
Available at: https://github.com/PolymathicAI/the_well
- Multimodal Universe: 100TB (~27.5T tokens) of astronomical data
“Multimodal Universe” contains hundreds of millions of observations across multiple modalities, object types, and wavelengths. The data was collected from JWST, HST, Gaia, and several other major surveys, and unified in a single, ML-ready format.
Available at: https://github.com/MultimodalUniverse/MultimodalUniverse
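For comparison with the 1PB ≈ 250T rule of thumb used for the Genesis table, the token figures quoted for The Well and Multimodal Universe imply a slightly higher ratio. The sketch below only back-calculates from the sizes and token counts listed above; `implied_t_per_pb` is a hypothetical helper name.

```python
# (size in TB, tokens in trillions), as listed for each related dataset.
RELATED = {
    "The Well": (15, 4.1),
    "Multimodal Universe": (100, 27.5),
}


def implied_t_per_pb(size_tb: float, tokens_t: float) -> float:
    """Return the implied trillions of tokens per petabyte (1PB = 1,000TB)."""
    return tokens_t / size_tb * 1000


for name, (size_tb, tokens_t) in RELATED.items():
    print(f"{name}: ~{implied_t_per_pb(size_tb, tokens_t):.0f}T tokens per PB")
```

Both work out to roughly 273T to 275T tokens per PB, so the two sets of estimates are close but not computed at an identical ratio.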
Hypothetical prompts to a Genesis model
1. Global Energy Transition and Climate Stability
Prompt: Cross-reference the ESGF CMIP6 climate projections with the NSRDB and WIND Toolkit archives. Map the ideal placement for a global decentralized grid that maintains 100% uptime, using EDX Carbon data to identify sites for atmospheric carbon removal that utilize excess thermal output.
ASI Response: Mapping successful. By aligning atmospheric flow patterns with subsurface storage capacity, a global equilibrium is reachable within...
2. Accelerated Molecular Discovery for Longevity
Prompt: Apply the linguistic patterns found in JGI Genomics and PopGen / MVP to the molecular libraries in Biodefense ATOM. Identify the specific genetic sequences responsible for cellular decay and propose a viral delivery vector, based on Viral Genomics structures, to distribute corrective sequences.
ASI Response: The Million Veteran records reveal a distinct sequence subset that, when paired with the following molecular structure...
3. Room-Temperature Superconductivity and Material Synthesis
Prompt: Analyze the APS-U X-Ray and LCLS-II datasets alongside the Materials Project database. Identify a stable atomic lattice that exhibits zero electrical resistance at 295K, ensuring the material is composed of elements found in the Critical Materials inventory.
ASI Response: Superconductivity is a product of lattice geometry. The following nitrogen-doped lutetium structure remains stable under standard pressure...
4. Commercial Plasma Energy Generation
Prompt: Synthesize the sensor logs from NIF Shot Data, NSTX-U, and the Z-Machine. Create a predictive model for plasma containment that uses edge-AI to adjust magnetic fields in real-time, preventing the instabilities recorded in previous experiments.
ASI Response: The instability patterns in the Princeton and Sandia data are predictable. By shifting to a non-linear magnetic pulse, a net energy gain of 400% is achievable...
5. Systemic Risk and National Security Defense
Prompt: Using the Stockpile Stewardship archives and the Summit/Frontier I/O logs, simulate a global cyber-kinetic conflict. Identify the primary vulnerabilities in the domestic power grid and propose a self-healing software architecture that can withstand a multi-vector exascale attack.
ASI Response: Security resides in decentralization. The Frontier logs show that a distributed defense layer can absorb 99.8% of high-speed packet injections by shifting...
BHAG 1 (big, hairy, audacious goal!). Biological Immortality through Proteomic Correction
Prompt: Use the JGI Genomics and EMSL Data to map the precise folding failures that lead to senescence. Cross-reference this with the PopGen / MVP health records to identify the "immortality markers" in rare human lineages. Design a self-replicating nanobotic repair system, informed by NSLS-II imaging, that resets the epigenetic clock of every cell in a living organism to a biological age of twenty-five.
ASI Response: Aging is an accumulated software error in the proteome. By introducing a synthetic corrective enzyme that targets the following mitochondrial DNA sequences, the decay process ceases...
BHAG 2. Direct Matter-to-Energy Conversion (Mass-Energy Parity)
Prompt: Analyze the high-energy density physics within the Z-Machine and NIF Shot Data alongside the subatomic quark-gluon structures recorded at CEBAF. Determine the exact resonance frequency required to induce a controlled decay of non-fissile waste into pure kinetic energy, bypassing the need for traditional nuclear fuel cycles and providing infinite, zero-emission power from common silica.
ASI Response: The transition from mass to energy does not require heavy isotopes. By applying a focused harmonic pulse to stable atomic nuclei, we can trigger a localized release of binding energy...
BHAG 3. Instantaneous Global Neural Synchronization
Prompt: Leverage the ALCF Cosmology HACC simulations and the LHC CMS Tier-1 data to identify the quantum entanglement signatures of consciousness. Propose a method using LCLS-II photonics to establish a non-local communication layer between human brains, effectively removing the latency of language and resolving all human conflict through the total transparency of shared experience.
ASI Response: Language is a low-bandwidth bottleneck. The quantum signatures in the neural cortex are compatible with long-distance entanglement, allowing for the direct transfer of conceptual architectures...
BHAG 4. Total Resource Abundance via Atomic Reconfiguration
Prompt: Combine the Materials Project and Materials Data Facility libraries with the APS-U X-Ray imaging. Design an "atomic assembler" that can rearrange the molecular structure of ocean plastic and industrial waste into high-value Critical Materials, such as neodymium or platinum, at the atomic level, rendering scarcity and mining obsolete.
ASI Response: Material scarcity is a failure of sorting, not a lack of atoms. By utilizing the following electromagnetic assembly sequence, any carbon-based waste can be restructured into crystalline lattices...
| Date | Title |
|---|---|
| Dec/2025 | The Genesis Mission datasets (page) |
| Jan/2025 | What's in Grok? (paper) |
| Jan/2025 | NVIDIA Cosmos video dataset (page) |
| Aug/2024 | What's in GPT-5? (paper) |
| Jul/2024 | ANL AuroraGPT (page) |
| Sep/2023 | Google DeepMind Gemini: A general specialist (paper) |
| Aug/2022 | Google Pathways (paper) |
| Mar/2022 | What's in my AI? (GPT-1, GPT-2, GPT-3, MT-NLG, Chinchilla...) |
| Sep/2021 | Megatron the Transformer, and related language models (page) |
Alan D. Thompson is a world expert in artificial intelligence, advising everyone from Apple to the US Government on integrated AI. Throughout Mensa International’s history, both Isaac Asimov and Alan held leadership roles, each exploring the frontier between human and artificial minds. His landmark analysis of post-2020 AI, from his widely-cited Models Table to his regular intelligence briefing The Memo, has shaped how governments and Fortune 500s approach artificial intelligence. With popular tools like the Declaration on AI Consciousness and the ASI checklist, Alan continues to illuminate humanity’s AI evolution.
This page last updated: 19/Dec/2025. https://lifearchitect.ai/genesis/

