Genesis Mission datasets

Alan D. Thompson
December 2025 (updated 2026)

Summary

Project name: Genesis Mission (USA)
Organization: US Department of Energy’s 17 National Laboratories:

  1. Ames Laboratory (Ames)
  2. Argonne National Laboratory (ANL)
  3. Brookhaven National Laboratory (BNL)
  4. Fermi National Accelerator Laboratory (FNAL)
  5. Idaho National Laboratory (INL)
  6. Lawrence Berkeley National Laboratory (LBNL)
  7. Lawrence Livermore National Laboratory (LLNL)
  8. Los Alamos National Laboratory (LANL)
  9. National Energy Technology Laboratory (NETL)
  10. National Laboratory of the Rockies (NLOR)
  11. Oak Ridge National Laboratory (ORNL)
  12. Pacific Northwest National Laboratory (PNNL)
  13. Princeton Plasma Physics Laboratory (PPPL)
  14. Sandia National Laboratories (SNL)
  15. Savannah River National Laboratory (SRNL)
  16. Stanford Linear Accelerator Center National Accelerator Laboratory (SLAC)
  17. Thomas Jefferson National Accelerator Facility (TJNAF/Jlab)

+ frontier AI labs

Project page: https://genesis.energy.gov/
Dataset size (unfiltered):
Total storage (PB): ~3,635.5PB (petabytes)
Total storage (EB): ~3.6EB (exabytes)
Total tokens (T): ~908,875T (trillion)
Total tokens (Qa): ~908.9Qa (quadrillion)
Total tokens (Qi): ~0.91Qi (quintillion)

Impact assessment

US Executive Order 14363 (24/Nov/2025):
The “Genesis Mission” [is] a dedicated, coordinated national effort to unleash a new age of AI‑accelerated innovation and discovery that can solve the most challenging problems of this century. The Genesis Mission will build an integrated AI platform to harness Federal scientific datasets — the world’s largest collection of such datasets, developed over decades of Federal investments — to train scientific foundation models and create AI agents to test new hypotheses, automate research workflows, and accelerate scientific breakthroughs.

The Genesis Mission may come to be seen as the most forceful acceleration of 2025. Combining 2026-generation frontier AI models (GPT-6, Claude 5, Gemini 4) with exabytes of physical-world data will allow us to ask and answer significant questions across challenges like:
– Create a new and beneficial alternative to sugar/caffeine/alcohol.
– Discover and harness a new, near-infinite energy source.
– Cure specific forms of cancer/heart disease.
– Discover an abundant new material to replace steel/aluminium/concrete.
– Optimize the way humans live, aligned with UN rights.

Genesis Mission Updates

24/Nov/2025 (Day 0): Genesis Mission launched by US Executive Order. Timeline milestones appear below…

18/Dec/2025: OpenAI, Google, and Anthropic sign separate MOUs to join the project as lead industry partners (+ OpenAI science letter to DOE).

18/Dec/2025: 24 MOUs signed (‘The organizations that have signed memorandums of understanding (MOUs) as of today have either expressed interest to DOE in response to an RFI or have active projects with DOE and the National Laboratories for activities related to the Genesis Mission, and any products produced for the Genesis Mission will be architecture-agnostic.’): Accenture, AMD, Anthropic, Armada, Amazon Web Services, Cerebras, CoreWeave, Dell, DrivenData, Google, Groq, Hewlett Packard Enterprise, IBM, Intel, Microsoft, NVIDIA, OpenAI, Oracle, Periodic Labs, Palantir, Project Prometheus, Radical AI, xAI, XPRIZE.

24/Dec/2025: Nuclear developer proposes using Navy reactors for data centers: a plan to repurpose decommissioned nuclear reactors from Navy warships to support the US grid and fuel the burgeoning energy demands of the AI industry. The project aims to use two retired reactors for a data center in Oak Ridge, Tennessee, as part of the White House’s Genesis Mission, potentially generating 450-520MW. (Bloomberg, Tom’s Hardware)

23/Jan/2026 (+60 days): The Secretary of Energy submits a detailed list of at least 20 national science and technology challenges to the Assistant to the President for Science and Technology.

22/Feb/2026 (+90 days): The Secretary identifies available Federal computing, storage, and networking resources, including DOE supercomputers and cloud-based systems, to support the Mission.

24/Mar/2026 (+120 days): The Secretary identifies initial data and model assets and develops a plan for incorporating datasets from federally funded research and other sources.

22/Jul/2026 (+240 days): The Secretary reviews capabilities across DOE national laboratories for robotic facilities able to engage in AI-directed experimentation and manufacturing.

21/Aug/2026 (+270 days): The Secretary seeks to demonstrate an initial operating capability of the American Science and Security Platform for at least one identified national challenge.

24/Nov/2026 (+1 year): The Secretary submits the first annual report to the President describing the Platform’s status, user engagement, and scientific outcomes.

Datasets (estimates by LifeArchitect.ai)

Datasets likely to be identified as part of the Genesis Mission; all workings and estimates by LifeArchitect.ai. Acronyms are expanded in the ‘Dataset contents in plain English’ section below.

# Dataset Lab Field Size Tokens Quote Source
1 APS-U X-Ray Data ANL Materials/Imaging 500PB 125Qa “The upgraded APS will generate 2-3 orders of magnitude more data… reaching exabytes over its lifetime.” ANL APS-U
2 ALCF Cosmology (HACC) ANL/ ORNL Cosmology/Physics 100PB 25Qa “Frontier-E… generated > 100 PB of data… establishing a new standard of end-to-end performance.” ALCF Cosmology, Nov/2025 update
3 Materials Data Facility ANL (Shared) Materials Science 2PB 500T “A scalable repository for publishing materials data… enabling ML discovery loops.” MDF
4 RHIC Heavy Ion BNL Nuclear Physics 180PB 45Qa “RHIC & ATLAS Tier-1 center store hundreds of petabytes of collision data.” BNL SDCC
5 NSLS-II Imaging BNL Nanomaterials 60PB 15Qa “NSLS-II data rates require AI-driven streaming analysis… approaching 1 PB/experiment.” BNL News
6 LHC CMS Tier-1 FNAL High Energy Physics 350PB 87.5Qa “Fermilab hosts the largest tier-1 computing center for CMS… managing exabyte-scale archives.” FNAL Computing
7 DUNE FNAL Particle Physics 30PB 7.5Qa “Deep Underground Neutrino Experiment… massive liquid argon TPC image data for CNNs.” DUNE Science
8 JGI Genomics LBNL Biology/Genomics 80PB 20Qa “JGI creates petabytes of sequence data… DNA is ‘nature’s language’ (1 base = 1 token).” JGI DOE
9 ESGF (CMIP6) LBNL Climate Change 40PB 10Qa “Earth System Grid Federation… global repository for CMIP6… vital for climate AI twins.” ESGF
10 Materials Project LBNL Chemistry 0.5PB 125T “Information on over 150,000 materials… the ‘Google’ of materials properties.” Materials Project
11 NIF Shot Data LLNL Fusion/HED 50PB 12.5Qa “National Ignition Facility… data from fusion ignition shots used to calibrate simulation AI.” LLNL NIF
12 ATOM LLNL Biology/Pharma 5PB 1.25Qa “ATOM consortium… transforming drug discovery… massive molecular libraries.” LLNL ATOM
13 Stockpile Stewardship LANL/ LLNL/ SNL Nuclear Security 1000PB 250Qa “The NNSA labs hold the world’s largest classified archives… necessary for ‘trusted’ AI models.” NNSA ASC
14 CICE / E3SM LANL Climate/Ocean 20PB 5Qa “Energy Exascale Earth System Model… high-res ocean/ice simulation data.” E3SM
15 Viral Genomics LANL Epidemiology 2PB 500T “HIV/Influenza/COVID databases… sequencing data for vaccine design AI.” LANL pathogens
16 NSRDB NLOR Solar Energy 5PB 1.25Qa “National Solar Radiation Database… physics-based modeling spanning decades.” NSRDB
17 WIND Toolkit NLOR Wind Energy 20PB 5Qa “2TB per year per region… total archives span terabytes to petabytes for grid planning.” NREL WIND
18 ARM Archive ORNL/ PNNL Atmospheric 6PB 1.5Qa “Atmospheric Radiation Measurement… 30 years of continuous sensor data for climate AI.” ARM.gov
19 Summit/Frontier I/O ORNL HPC Systems 700PB 175Qa “Exascale I/O logs… analyzing system performance and scientific throughput of Frontier.” OLCF
20 MVP ORNL Health/Genomics 15PB 3.75Qa “Million Veteran Program… one of the world’s largest genomic databases linked to health records.” VA/ORNL
21 EMSL Data PNNL Molecular Science 10PB 2.5Qa “Environmental Molecular Sciences Lab… mass spec and microscopy data for bio-earth systems.” EMSL
22 LCLS-II “Data Deluge” SLAC X-Ray/Quantum 300PB 75Qa “LCLS-II will deliver 8,000x more data… ‘The Data Deluge’… requires edge AI to manage.” SLAC LCLS
23 LSST (Rubin) SLAC Astrophysics 60PB 15Qa “Rubin Observatory Legacy Survey of Space and Time… 20TB/night… processing pipeline at SLAC.” Rubin Obs
24 NSTX-U PPPL Plasma Physics 15PB 3.75Qa “Spherical Torus experiment… microsecond-resolution sensor data for fusion control AI.” PPPL NSTX-U
25 Z Pulsed Power Facility SNL HED Physics 50PB 12.5Qa “World’s most powerful pulsed power facility… extreme conditions data for material science.” Sandia Z
26 EDX NETL Fossil/Carbon 16PB 4Qa “Energy Data eXchange… subsurface data for carbon sequestration and oil recovery.” NETL EDX
27 Critical Materials Ames Rare Earths 2PB 500T “CMI data… thermodynamic and phase diagram data for rare earth substitution.” Ames CMI
28 CEBAF TJNAF Nuclear Physics 10PB 2.5Qa “Continuous Electron Beam Accelerator Facility… probing the quark-gluon structure of matter.” JLab
29 Environmental DB SRNL Ecology/Waste 2PB 500T “Savannah River Site environmental monitoring… soil, water, and waste processing history.” SRNL
30 Stealth/non-public/top secret dataset 5PB 1.25Qa

Dataset size (unfiltered):
Total storage (PB): ~3,635.5PB (petabytes)
Total storage (EB): ~3.6EB (exabytes)
Total tokens (T): ~908,875T (trillion)
Total tokens (Qa): ~908.9Qa (quadrillion)
Total tokens (Qi): ~0.91Qi (quintillion)

Source: LifeArchitect.ai
* All token counts are estimates, using the conservative standard-text conversion of 1PB ≈ 250T tokens (≈0.25Qa tokens). Calculations are informed but rough. For similar working, see my 2022 paper: What’s in my AI? A Comprehensive Analysis of Datasets Used to Train GPT-1, GPT-2, GPT-3, GPT-NeoX-20B, Megatron-11B, MT-NLG, and Gopher.

LifeArchitect.ai:
“If just 0.1% of this data is filtered and cleaned for the final dataset, it would still be ~1,000T tokens. The largest comparable publicly-known LLM training dataset is Qwen 3 (The Memo edition 14/May/2025) on just 36T tokens. It’s likely that frontier models like GPT-5 and Grok 5 were trained on 100T tokens. Full subscribers have access to my independent reports on these models.”
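The arithmetic behind these totals can be reproduced in a few lines. This is an illustrative sketch using the page’s own conversion assumption (1PB ≈ 250T tokens); all figures are rough estimates, not measured counts, and the 0.1% filter yields ~909T, which the quote above rounds to ~1,000T.

```python
# Sketch of the dataset-size arithmetic used on this page (estimates only).
# Conversion assumption: 1 PB of raw scientific data ~= 250T (trillion) tokens.

TOKENS_PER_PB_T = 250                 # trillion tokens per petabyte (conservative)

total_pb = 3_635.5                    # sum of the 30 dataset sizes in the table
total_eb = total_pb / 1_000           # petabytes -> exabytes

total_tokens_t  = total_pb * TOKENS_PER_PB_T   # trillions
total_tokens_qa = total_tokens_t / 1_000       # quadrillions
total_tokens_qi = total_tokens_qa / 1_000      # quintillions

# "If just 0.1% of this data is filtered and cleaned for the final dataset..."
filtered_tokens_t = total_tokens_t * 0.001

print(f"~{total_eb:.1f}EB; ~{total_tokens_t:,.0f}T tokens "
      f"(~{total_tokens_qa:.1f}Qa, ~{total_tokens_qi:.2f}Qi); "
      f"0.1% filtered ~= {filtered_tokens_t:,.0f}T tokens")
```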

Viz

Download source (PDF)

Permissions: Yes, you can use these visualizations anywhere, please leave the citation intact.

Source: ‘Location and stewarding agencies of the 17 DOE labs’ from the ITIF ‘Turning the page’ report, 2013.

Source: Sandbox Studio for Symmetry Magazine (DOE), 2013.

Dataset contents in plain English

Accelerating Therapeutics for Opportunities in Medicine (ATOM)
This dataset contains molecular libraries and chemical property data designed to accelerate the drug discovery process through computational modeling.

Advanced Photon Source Upgrade (APS-U) X-Ray Data
The APS-U generates high-energy X-ray beams to produce ultra-high-resolution 3D imaging of materials at the atomic and molecular scale. This petabyte-scale dataset allows AI to analyze the structural characteristics of new materials and chemical reactions in real-time.

Argonne Leadership Computing Facility (ALCF) Cosmology (HACC)
The Hardware/Hybrid Accelerated Cosmology Code (HACC) produces simulations of the universe’s evolution, tracking billions of particles to model the formation of cosmic structures. This data provides a foundational ‘ground truth’ for astrophysicists using AI to interpret actual telescope observations of the dark universe.

Atmospheric Radiation Measurement (ARM) Archive
This archive contains over 30 years of continuous sensor data tracking radiation, cloud properties, and atmospheric chemistry. It serves as a critical training set for climate AI models to improve the accuracy of weather forecasting and long-term climate projections.

Community Ice CodE / Energy Exascale Earth System Model (CICE / E3SM)
These datasets consist of simulations of the Earth’s cryosphere and oceans, focusing on the complex interactions between sea ice and global climate systems. They provide the high-fidelity data necessary for ‘climate digital twins’ to predict future sea-level rise and polar changes.

Continuous Electron Beam Accelerator Facility (CEBAF)
CEBAF data captures the results of electron-nucleus collisions to probe the quark and gluon structure of protons and neutrons. This nuclear physics dataset is used to map the fundamental forces of the ‘strong interaction’ that holds the nucleus of the atom together.

Critical Materials (CMI)
This dataset is a collection of thermodynamic, phase, and chemical property data for rare earth elements developed to secure the domestic supply chain. It aggregates information from over 12 years of research into magnetic and structural properties, allowing AI models to predict new alloy compositions that do not rely on scarce minerals.

Deep Underground Neutrino Experiment (DUNE)
DUNE records images of neutrino interactions within massive liquid argon time projection chambers located deep underground. These datasets are processed using Convolutional Neural Networks to help physicists understand why the universe is made of matter rather than antimatter.

Earth System Grid Federation (ESGF) (CMIP6)
The ESGF is the global repository for the Coupled Model Intercomparison Project, which aggregates the world’s most sophisticated climate models. This multi-petabyte dataset is the gold standard for training AI to identify patterns in global warming, precipitation changes, and extreme weather events.

Environmental Database (DB)
This long-term repository tracks soil, water, and atmospheric samples from the Savannah River Site to monitor the ecological impact of nuclear processing. The data provides a historical record used by AI to model contaminant transport and optimize environmental remediation strategies.

Environmental Molecular Sciences Laboratory (EMSL) Data
EMSL provides mass spectrometry and microscopy data that characterizes molecular processes within biological and terrestrial systems. This dataset allows researchers to use AI to understand how microbes and plants affect carbon cycling and nutrient movement in the soil.

Joint Genome Institute (JGI) Genomics
JGI produces petabytes of DNA and RNA sequence data from plants, fungi, and microbes found in diverse environments. By treating genetic sequences as ‘nature’s language,’ AI models can be trained on this data to discover new enzymes for biofuel production or carbon sequestration.
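The ‘1 base = 1 token’ accounting used for genomic data above can be illustrated with a minimal character-level tokenizer. This is a sketch only (the vocabulary and function below are hypothetical); production genomic models may instead use k-mer or learned BPE-style vocabularies.

```python
# Illustrative only: a character-level "tokenizer" for DNA, matching the
# rough 1 base = 1 token accounting used in the dataset table above.

VOCAB = {"A": 0, "C": 1, "G": 2, "T": 3}

def tokenize_dna(sequence: str) -> list[int]:
    """Map each nucleotide to an integer token id (1 base = 1 token)."""
    return [VOCAB[base] for base in sequence.upper() if base in VOCAB]

print(tokenize_dna("GATTACA"))  # -> [2, 0, 3, 3, 0, 1, 0]
```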

Large Hadron Collider (LHC) CMS Tier-1
This dataset contains the filtered records of billions of high-energy particle collisions from the Compact Muon Solenoid experiment. It is used by researchers to search for rare physical phenomena, such as the Higgs boson, and to train AI to distinguish signal from background noise in subatomic physics.

Legacy Survey of Space and Time (LSST) (Rubin)
The Rubin Observatory’s LSST will generate a nightly ‘movie’ of the southern sky, capturing 20 terabytes of data every session to track moving or changing celestial objects. This dataset is used to train AI pipelines to automatically detect supernovae, asteroids, and distant galaxies in near real-time.

Linac Coherent Light Source II (LCLS-II) ‘Data Deluge’
LCLS-II uses ultra-fast X-ray pulses to capture the molecular movie of chemical bonds breaking and forming at the femtosecond scale. Because the data rates are so high, it requires edge-AI to process and compress information instantly, enabling the study of quantum materials and rapid chemical reactions.

Materials Data Facility (MDF)
The MDF is a scalable repository that aggregates experimental and computational materials science data from various institutions. It is specifically structured to enable machine learning discovery loops, allowing AI to suggest new material compositions with specific desired properties.

Materials Project
Described as a ‘Google’ for materials, this dataset provides calculated properties for over 150,000 inorganic compounds. It is a foundational dataset for training graph neural networks to predict the stability, conductivity, and hardness of yet-to-be-synthesized materials.

National Energy Technology Laboratory Energy Data eXchange (EDX) Carbon
EDX aggregates subsurface geological data to support carbon capture and storage initiatives and efficient resource recovery. AI models use this data to predict the capacity and safety of underground reservoirs for long-term carbon sequestration.

National Ignition Facility (NIF) Shot Data
NIF data records the outcomes of fusion ignition experiments where giant lasers compress hydrogen fuel to extreme temperatures and pressures. These datasets are used to calibrate high-energy-density physics simulations, helping scientists move closer to achieving sustainable, clean fusion energy.

National Solar Radiation Database (NSRDB)
The NSRDB provides 30 years of solar radiation data and meteorological observations for the entire Americas. It is the primary training set for AI models used to predict solar power grid stability and optimize the placement of large-scale solar arrays.

National Spherical Torus Experiment Upgrade (NSTX-U)
This dataset captures microsecond-resolution sensor data from plasma confinement experiments in a spherical tokamak. It is used to train real-time AI control systems that can predict and prevent disruptions in the plasma, which is critical for building stable fusion power plants.

National Synchrotron Light Source II (NSLS-II) Imaging
NSLS-II produces extremely bright X-rays to image nanomaterials and biological structures at the nanometer scale. The resulting data is used to train AI to reconstruct 3D images from partial or noisy experimental scans, speeding up the discovery of more efficient battery materials.

Million Veteran Program (MVP)
The MVP links the genomic sequences and longitudinal electronic health records of nearly a million veterans. This massive dataset enables AI to identify genetic markers for complex diseases, leading to more personalized precision medicine and targeted therapies.

Relativistic Heavy Ion Collider (RHIC) Heavy Ion
RHIC data records the results of smashing gold ions together at nearly the speed of light to create a quark-gluon plasma, the state of matter just after the Big Bang. This dataset allows AI to model the behavior of matter under the most extreme temperature and density conditions possible in a laboratory.

Stockpile Stewardship
This classified archive contains data from nuclear tests, laboratory experiments, and advanced simulations used to ensure the safety and reliability of the U.S. nuclear deterrent without physical testing. It represents the world’s most complex dataset for training trusted AI in high-stakes national security environments.

Summit/Frontier I/O
This dataset consists of the input/output logs and system performance metrics from the world’s most powerful exascale supercomputers. AI models analyze these logs to optimize scientific workflows, detect hardware failures before they happen, and improve the efficiency of massive scientific computations.

Stealth/non-public/top secret dataset
There will be more than one ‘top secret’ classified dataset within the DOE remit. This is a placeholder for datasets that have not been publicly revealed.

Viral Genomics
This repository hosts the genetic sequences of thousands of viral strains, including HIV, Influenza, and SARS-CoV-2. Researchers use AI trained on this data to predict how viruses will mutate and to design universal vaccines that can target multiple variants simultaneously.

Wind Integration National Dataset (WIND) Toolkit
The WIND Toolkit provides wind speed and power estimates across vast geographic regions and timeframes. It is a vital dataset for training AI to forecast wind power availability, helping utility companies integrate renewable energy into the national power grid more reliably.

Z Pulsed Power Facility
The Z Pulsed Power Facility or Z Machine dataset captures data from extreme pulsed-power experiments that subject materials to massive magnetic fields and pressures. This allows AI to study the behavior of matter at the center of giant planets or within the hulls of future fusion reactors.

Related datasets

  1. The ANL AuroraGPT datasets were analysed in a similar way: LifeArchitect.ai/AuroraGPT
  2. The Well: 15TB (~4.1T tokens) of physics simulations

“The Well” contains curated physics simulations from 16 scientific domains, each capturing fundamental equations that appear throughout nature, and all validated and/or generated by domain experts:

  • Fluid dynamics & turbulence
  • Supernova explosions
  • Biological pattern formation
  • Acoustic wave propagation
  • Magnetohydrodynamics

Available at: https://github.com/PolymathicAI/the_well

  3. Multimodal Universe: 100TB (~27.5T tokens) of astronomical data

“Multimodal Universe” contains hundreds of millions of observations across multiple modalities, object types, and wavelengths. The data was collected from JWST, HST, Gaia, and several other major surveys, and unified in a single, ML-ready format.

Available at: https://github.com/MultimodalUniverse/MultimodalUniverse

See also:


What’s in my AI? A Comprehensive Analysis of Datasets Used to Train GPT-1, GPT-2, GPT-3, GPT-NeoX-20B, Megatron-11B, MT-NLG…

Alan D. Thompson
LifeArchitect.ai
March 2022
26 pages incl title page, references, appendix.

View the report


A Comprehensive Analysis of Datasets Likely Used to Train GPT-5

Alan D. Thompson
LifeArchitect.ai
August 2024
27 pages incl title page, references, appendices.

View the report


Compare with other LLM datasets
Open the Datasets Table in a new tab  
 

Partners

Showing initial collaborators announced 24/Nov/2025, as well as 24 MOUs signed 18/Dec/2025.

Type Organizations
17 national labs (by sponsoring agency; names updated to Dec/2025, but name changes may still occur):
Office of Science (SC):
– AMES (Ames)
– ANL (Argonne)
– BNL (Brookhaven)
– FNAL (Fermi)
– LBNL (Lawrence Berkeley)
– ORNL (Oak Ridge)
– PNNL (Pacific Northwest)
– SLAC
– TJNAF (JLab)

National Nuclear Security Administration (NNSA):
– LANL (Los Alamos)
– LLNL (Lawrence Livermore)
– SNL (Sandia)

Office of Critical Minerals and Energy Innovation (CMEI):
– NLOR (Rockies)

Office of Fusion (OF):
– PPPL (Princeton Plasma Physics)

Office of Nuclear Energy (NE):
– INL (Idaho)

Hydrocarbons and Geothermal Energy Office (HGEO):
– NETL (National Energy Technology)

Office of Environmental Management (EM):
– SRNL (Savannah River)

Frontier AI lab leads – Anthropic
– Google
– OpenAI
– xAI
Other AI labs – FutureHouse
– Hugging Face
– IBM
– Microsoft
Computing and hardware support – Amazon Web Services
– AMD
– Cerebras
– CoreWeave
– Cornelis Networks
– Dell
– Groq
– HPE (Hewlett Packard Enterprise)
– Intel
– Micron
– NVIDIA
– Oracle
– SambaNova
– Semiconductor Industry Association
– Synopsys
– xLight
Mapping and data support – DrivenData
– Esri
– Kitware
– Scale AI
Materials and Energy Science – Albemarle
– Applied Materials
– Atomic Canyon
– AVEVA (Schneider Electric)
– Chemspeed
– Critical Materials Recycling
– Emerald Cloud Lab
– EPRI (Electric Power Research Institute)
– MP Materials
– New York Creates
– Niron Magnetics
– Nusano
– OLI Systems
– Periodic Labs
– Phoenix Tailings
– PMT Critical Metals
– Project Prometheus
– Radical AI
– Ramaco
– TdVib
Space support – Collins Aerospace (RTX/Raytheon)
– GE Aerospace
– RTX (Raytheon)
Specialized software and research – Accenture
– Armada
– LILA
– Nokia
– Palantir
– Quantinuum
– Qubit
– RadiaSoft
– Siemens
– XPRIZE
Utility and grid operations – ComEd (Exelon)
– ISO New England
– Tennessee Valley Authority

Highlights

The Genesis Mission datasets project is one of the most interesting advances in post-2020 frontier artificial intelligence. It combines 17 government laboratories with industry partners like OpenAI, Google, and Anthropic. Here are a few of my favourite highlights…

The Large Hadron Collider is involved. The LHC produces immense streams of information in Switzerland, but the heavy lifting of data analysis occurs in the United States. Fermilab (FNAL) functions as the primary hub for the Compact Muon Solenoid (CMS) experiment, managing a Tier-1 center with archives exceeding 350 petabytes. This setup means that the most complex physics experiments on Earth rely on American supercomputing to find meaning in the noise of particle collisions.

The Materials Project at Lawrence Berkeley National Laboratory acts as a search engine for the physical world. It contains data on over 150,000 inorganic compounds and millions of properties. Training AI on this specific archive allows the Genesis Mission to predict the behavior of substances that do not yet exist. This process is a requirement for finding new battery chemistries and superconductors that work at room temperature.

Meta AI is missing from the partners list (as of 22/Dec/2025). The list of partners includes OpenAI, Google, Anthropic, and xAI; however, Meta AI is absent, despite being one of the ‘big 5’ frontier AI labs. Read more about the ‘big 5’ labs and related models via LifeArchitect.ai.

Anthropic has extensive experience working with the DOE. Anthropic maintains a deep technical partnership with the Lawrence Livermore National Laboratory (LLNL, 9/Jul/2025) and the National Nuclear Security Administration (NNSA, 21/Aug/2025). In July 2025, LLNL expanded access to Claude for Enterprise to its entire staff of 10,000 scientists to support research in nuclear deterrence and energy security. This collaboration led to the co-development of a specialized AI classifier with the NNSA, allowing the identification of harmful queries related to nuclear weapons with 96% accuracy.

Robots. The project incorporates robotic laboratories where AI models can directly control physical instruments. This allows for a closed-loop research cycle where the AI proposes and then tests its own ideas. This process removes the delays of manual labor, letting the machines run experiments around the clock (24/7, in the dark) without human intervention.
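The closed-loop cycle described above can be sketched as a propose-run-learn loop. Everything below is hypothetical: the objective function, the ‘robotic lab’, and the proposal rule are toy stand-ins, not any real Genesis Mission interface.

```python
# Hypothetical sketch of a closed-loop research cycle: a model proposes the
# next experiment, a robotic lab "runs" it, and the result feeds back into
# the proposal step, with no human in the loop.

def run_robotic_lab(setting: float) -> float:
    """Stand-in for an automated instrument; toy objective peaking at 0.8."""
    return -(setting - 0.8) ** 2

def propose_next(history: dict[float, float], step: float = 0.05) -> list[float]:
    """Stand-in for the AI: probe both neighbours of the best setting so far."""
    best = max(history, key=history.get)
    return [best - step, best + step]

history = {0.5: run_robotic_lab(0.5)}   # a single human-chosen starting point
for _ in range(10):                     # then the loop runs unattended, 24/7
    for setting in propose_next(history):
        history[setting] = run_robotic_lab(setting)

best = max(history, key=history.get)
print(f"best setting found: {best:.2f}")  # converges on the 0.8 optimum
```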

The Genesis Mission has been described as a modern Manhattan Project. It coordinates some 40,000 scientists and engineers across 17 locations to build a single discovery platform, arguably the most significant state-led scientific effort since the original Manhattan Project.

The total volume of information involved reaches into the exascale range. The term exascale describes a specific boundary where computing power and data volume meet. In the first instance, it refers to compute speed: a machine performing a quintillion calculations per second. In the second instance, it refers to data capacity: a collection of information reaching the exabyte level, which is a quintillion bytes. The Genesis Mission sits at this intersection because the 17 national labs possess both the machines capable of these speeds and the archives that fill that capacity. With around one quintillion tokens estimated, the Genesis Mission datasets provide a deep pool of scientific knowledge. This allows models to learn the rules of nature from raw sensor data, creating a path toward discovery that does not depend on human text alone.
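A minimal unit sketch makes the two senses of ‘exascale’ concrete; the figures below simply reuse this page’s own estimates and are illustrative.

```python
# Unit sketch for the two senses of "exascale" described above.
# "Exa-" is the SI prefix for 10^18.

EXA = 10 ** 18

# Sense 1 -- compute: an exascale machine performs on the order of
# 10^18 floating-point operations per second (1 exaFLOPS).
one_exaflops = 1 * EXA              # FLOP/s

# Sense 2 -- data: an exabyte is 10^18 bytes. This page estimates the
# Genesis Mission archives at roughly 3.6 EB and ~0.91 quintillion tokens.
genesis_bytes  = 3.6 * EXA
genesis_tokens = 0.91 * EXA

print(f"1 EB = {EXA:,} bytes; ~{genesis_bytes / EXA:.1f} EB of raw data")
```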

Bonus: Dataset analysis from LifeArchitect.ai has already been featured across the US DOE, including in the 2024 report ‘Enabling capabilities and resources’ (PDF, Apr/2024). The LifeArchitect.ai independent research on datasets, frontier LLMs, and superintelligence is used across government (G7, US Gov), think tanks (RAND, Brookings), frontier labs (Microsoft, Apple), and other organisations worldwide.

Mapping Genesis data to the ASI checklist*

By mapping the Genesis datasets against Alan’s ASI checklist, we see that these exascale archives provide the physical substrate for the transition to superintelligence. To track the real-time progress of these milestones, visit LifeArchitect.ai/ASI

Phase 1: Early ASI, Discovery, and Simulation

#1 & #2: Recursive hardware self-improvement achieved & Recursive code self-optimization achieved: The Summit/Frontier I/O Logs and ALCF Cosmology performance paths allow the system to observe and refine exascale computing efficiency, reaching autonomous improvement loops.

#3: First major simulation of a suggested improvement; convinces majority of humans: The ESGF (CMIP6) and NIF Shot Data provide the high-fidelity archives necessary to create a simulation of climate or energy stability so precise that it establishes a new standard of predictive power.

#6 & #7: First new discovery (i.e. a new theoretical concept) & First new physical invention (i.e. a new tool): By cross-referencing the Materials Project with APS-U X-Ray imaging, the system identifies novel theoretical concepts and engineers new physical tools like non-silicon processors.

#8 & #9: First new element added to the periodic table & Novel computing materials developed (i.e. beyond silicon): Extreme energy collisions in the Z Pulsed Power Facility, CEBAF, and LCLS-II datasets provide the blueprints for superheavy elements and materials that surpass current semiconductor limits.

#12, #14 & #16: First mental health condition resolved, Majority of physical conditions able to be resolved by AI & Optimized biology at birth becomes standard (1M+ people): Linguistic and proteomic patterns in JGI Genomics, MVP, and ATOM allow for the resolution of chronic conditions through personalized molecular correction.

#17, #18 & #19: First new type of energy discovered, First new type of energy harnessed & First new type of energy storage: Using sensor streams from NSTX-U, NIF, and the WIND Toolkit, the system discovers and harnesses previously unrecognized energy types while engineering storage with near-infinite density.

Phase 2: Governance and Economic Transformation

#27: Traditional economics surpassed; money deflates in value: The Materials Project and Critical Materials (Ames) datasets allow for the direct synthesis of rare resources from waste, removing the price floor of physical goods.

#32: Integrated international governance by AI: The Stockpile Stewardship (NNSA) and ESGF CMIP6 archives provide the global security and climate modeling necessary for a machine-led international stability pact.

Phase 3: Physical World Integration

#37 & #38: Waste management optimized; no more trash & Environmental issues resolved and environment optimized: The EDX and Environmental DB datasets provide the blueprints for total molecular recycling and the reversal of atmospheric carbon levels.

#45: New state of matter engineered: Extreme pressure and temperature data from the Z Pulsed Power Facility and NIF Shot Data provide the recipes for stable, non-natural states of matter like metallic hydrogen.

#46 & #47: First planet other than Earth optimized/terraformed & First planet other than Earth colonized: ALCF Cosmology and ESGF simulations serve as the foundation for managing the atmospheres and magnetospheres of other planets.

Hypothetical prompts to a Genesis model*

1. Global energy transition and climate stability

Prompt: Cross-reference the ESGF (CMIP6) climate projections with the NSRDB and WIND Toolkit archives. Map the ideal placement for a global decentralized grid that maintains 100% uptime, using EDX data to identify sites for atmospheric carbon removal that utilize excess thermal output.

ASI Response: Mapping successful. By aligning atmospheric flow patterns with subsurface storage capacity, a global equilibrium is reachable within…

2. Accelerated molecular discovery for longevity

Prompt: Apply the linguistic patterns found in JGI Genomics and MVP to the molecular libraries in ATOM. Identify the specific genetic sequences responsible for cellular decay and propose a viral delivery vector, based on Viral Genomics structures, to distribute corrective sequences.

ASI Response: The Million Veteran records reveal a distinct sequence subset that, when paired with the following molecular structure…

3. Room-temperature superconductivity and material synthesis

Prompt: Analyze the APS-U X-Ray Data and LCLS-II “Data Deluge” datasets alongside the Materials Project database. Identify a stable atomic lattice that exhibits zero electrical resistance at 295 K, ensuring the material is composed of elements found in the Critical Materials inventory.

ASI Response: Superconductivity is a product of lattice geometry. The following nitrogen-doped lutetium structure remains stable under standard pressure…
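Stripped of the superconductivity claim, this prompt describes a constrained screening task: keep only candidate structures built from an allowed element inventory, then rank by a stability metric. A minimal sketch, assuming a simple in-memory representation (the records, field names, and element set below are illustrative, not real Materials Project entries):

```python
# Toy screen: keep candidate lattices built only from an allowed element
# inventory, then rank by a made-up stability score.

CRITICAL_INVENTORY = {"Lu", "N", "H", "C", "Si", "Fe"}  # hypothetical allowed set

candidates = [
    {"formula": {"Lu": 1, "N": 1, "H": 3}, "stability": 0.92},
    {"formula": {"Pt": 1, "O": 2},         "stability": 0.88},  # Pt not in inventory
    {"formula": {"Si": 1, "C": 1},         "stability": 0.75},
]

def allowed(candidate, inventory):
    """True if every element in the formula appears in the inventory."""
    return set(candidate["formula"]) <= inventory

screened = sorted(
    (c for c in candidates if allowed(c, CRITICAL_INVENTORY)),
    key=lambda c: c["stability"],
    reverse=True,
)

for c in screened:
    print(c["formula"], c["stability"])
```

The real Materials Project exposes far richer data through its public API; the point here is only the shape of the filter-then-rank workflow the prompt implies.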

4. Commercial plasma energy generation

Prompt: Synthesize the sensor logs from NIF Shot Data, NSTX-U, and the Z Pulsed Power Facility. Create a predictive model for plasma containment that uses edge-AI to adjust magnetic fields in real-time, preventing the instabilities recorded in previous experiments.

ASI Response: The instability patterns in the Princeton and Sandia data are predictable. By shifting to a non-linear magnetic pulse, a net energy gain of 400% is achievable…
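The “adjust magnetic fields in real time” idea is, at its core, closed-loop feedback control. A minimal sketch, assuming a fictitious one-dimensional instability signal and a simple proportional-derivative (PD) corrector (real tokamak control is vastly more complex):

```python
# Toy closed-loop sketch: a PD controller nudges a "field adjustment" to damp
# a drifting instability signal. The plant model is invented for illustration;
# it is not a plasma simulation.

def pd_step(error, prev_error, kp=0.8, kd=0.3):
    """PD control law: correction from current error and its rate of change."""
    return kp * error + kd * (error - prev_error)

target = 0.0          # desired displacement of the plasma edge (arbitrary units)
state = 1.0           # initial instability amplitude
prev_error = target - state

for _ in range(50):
    error = target - state
    correction = pd_step(error, prev_error)
    state += correction + 0.01   # plant responds to correction plus a small constant drift
    prev_error = error

print(f"final amplitude: {state:.4f}")
```

With these (invented) gains the loop is stable and the amplitude settles near a small offset set by the drift; a learned controller trained on NSTX-U and Z-machine logs would be playing this same game at kilohertz rates with thousands of coupled channels.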

5. Systemic risk and national security defense

Prompt: Using the Stockpile Stewardship archives and the Frontier logs, simulate a global cyber-kinetic conflict. Identify the primary vulnerabilities in the domestic power grid and propose a self-healing software architecture that can withstand a multi-vector exascale attack.

ASI Response: Security resides in decentralization. The Frontier logs show that a distributed defense layer can absorb 99.8% of high-speed packet injections by shifting…

BHAG 1 (big, hairy, audacious goal!). Biological immortality through proteomic correction

Prompt: Use the JGI Genomics and EMSL Data to map the precise folding failures that lead to senescence. Cross-reference this with the MVP health records to identify the “immortality markers” in rare human lineages. Design a self-replicating nanobotic repair system, informed by NSLS-II Imaging, that resets the epigenetic clock of every cell in a living organism to a biological age of twenty-five.

ASI Response: Aging is an accumulated software error in the proteome. By introducing a synthetic corrective enzyme that targets the following mitochondrial DNA sequences, the decay process ceases…

BHAG 2. Direct matter-to-energy conversion (mass-energy parity)

Prompt: Analyze the high-energy density physics within the Z Pulsed Power Facility and NIF Shot Data alongside the subatomic quark-gluon structures recorded at CEBAF. Determine the exact resonance frequency required to induce a controlled decay of non-fissile waste into pure kinetic energy, bypassing the need for traditional nuclear fuel cycles and providing infinite, zero-emission power from common silica.

ASI Response: The transition from mass to energy does not require heavy isotopes. By applying a focused harmonic pulse to stable atomic nuclei, we can trigger a localized release of binding energy…

BHAG 3. Instantaneous global neural synchronization

Prompt: Leverage the ALCF Cosmology (HACC) simulations and the LHC CMS Tier-1 data to identify the quantum entanglement signatures of consciousness. Propose a method using LCLS-II “Data Deluge” photonics to establish a non-local communication layer between human brains, effectively removing the latency of language and resolving all human conflict through the total transparency of shared experience.

ASI Response: Language is a low-bandwidth bottleneck. The quantum signatures in the neural cortex are compatible with long-distance entanglement, allowing for the direct transfer of conceptual architectures…

BHAG 4. Total resource abundance via atomic reconfiguration

Prompt: Combine the Materials Project and Materials Data Facility libraries with the APS-U X-Ray Data imaging. Design an “atomic assembler” that can rearrange the molecular structure of ocean plastic and industrial waste into high-value Critical Materials, such as neodymium or platinum, at the atomic level, rendering scarcity and mining obsolete.

ASI Response: Material scarcity is a failure of sorting, not a lack of atoms. By utilizing the following electromagnetic assembly sequence, any carbon-based waste can be restructured into crystalline lattices…

All dataset reports by LifeArchitect.ai (most recent at top)
Date Title
Dec/2025 The Genesis Mission datasets (page)
Jan/2025 What's in Grok? (paper)
Jan/2025 NVIDIA Cosmos video dataset (page)
Aug/2024 What's in GPT-5? (paper)
Jul/2024 ANL AuroraGPT (page)
Sep/2023 Google DeepMind Gemini: A general specialist (paper)
Feb/2023 Chinchilla data-optimal scaling laws: In plain English (page)
Aug/2022 Google Pathways (paper)
Mar/2022 What's in my AI? (GPT-1, GPT-2, GPT-3, MT-NLG, Chinchilla...)
Sep/2021 Megatron the Transformer, and related language models (page)
Ongoing... Datasets table (page)

* These two sections (‘Mapping’, ‘Hypothetical prompts’) were developed in collaboration with Google Gemini 3 Pro. Original header image background (not text) by Nano Banana Pro.


Alan D. Thompson is a world expert in artificial intelligence, advising everyone from Apple to the US Government on integrated AI. Throughout Mensa International’s history, both Isaac Asimov and Alan held leadership roles, each exploring the frontier between human and artificial minds. His landmark analysis of post-2020 AI—from his widely-cited Models Table to his regular intelligence briefing The Memo—has shaped how governments and Fortune 500s approach artificial intelligence. With popular tools like the Declaration on AI Consciousness, and the ASI checklist, Alan continues to illuminate humanity’s AI evolution. Technical highlights.

This page last updated: 7/Jan/2026. https://lifearchitect.ai/genesis/