Models Table: Methodology

Alan D. Thompson
April 2026

Context
Lenses for estimates
Worked examples
Pine AI’s knowledge-probe method
Closing
Further reading

Context

The Models Table is the running output of years of full-time LLM analysis, currently 10,000+ data points across 800+ models, documented by hand. Apple, Microsoft, Harvard, Stanford, MIT, and several governments use it as an independent reference for LLM details, including proprietary data on frontier models.

Since around 2023, when DeepMind stopped publishing detailed training papers for its models, frontier labs have generally kept architecture and training details confidential. Independent estimation work fills the gap, and that is what the parameter and token columns of the Models Table represent.

Every italic figure in the Models Table is a centrepoint estimate. The working comes from several lenses cross-checked against open-source releases, and a reasonable error band on any single estimated model is wide. Specific entries are documented in detail in my paid analyses (The Memo special editions) and in methodology papers such as What’s in My AI?, What’s in GPT-5?, and What’s in Grok?

Estimates and accuracy
In the Models Table, any parameter or token figure in italics is my estimate. Anything in regular weight is either disclosed by the lab, quoted directly from a paper, or otherwise on the public record. The italic figures are best-informed estimates, since ground truth can’t be derived in the current climate of trade secrets. Each one is a centrepoint: when I estimate ‘2T’, the range is roughly 1T–3T (and up to 4T at the wider, factor-of-two band that Epoch AI uses), and the same applies to token counts. The table shows the centrepoint to keep rows legible; my paid analyses document the ranges. A useful comparison: Ege Erdil at Epoch AI arrived at the same ~200B number for o1 that I had, and put it like this:

This would put its total parameter count around 200 billion, though this estimate could easily be off by a factor of 2 [100B to 400B] given the rough way I’ve arrived at it. (Epoch AI)

That is the right framing for everything in italics in the Models Table.
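
As a rough illustration of how to read a centrepoint, the sketch below turns one figure into the two bands described above. The function name and the generalisation of ‘2T means roughly 1T–3T’ to a plus-or-minus 50% rule are assumptions made for the example, not a formula used in the Models Table.

```python
def estimate_bands(centrepoint: float) -> dict:
    """Return two illustrative bands around a centrepoint estimate."""
    return {
        # Assumed reading of '2T means roughly 1T-3T': plus or minus 50%.
        "working": (centrepoint * 0.5, centrepoint * 1.5),
        # Factor of two either way, as in the Epoch AI quote.
        "factor_of_two": (centrepoint / 2, centrepoint * 2),
    }


two_t = estimate_bands(2.0)     # 2T, in trillions of parameters
print(two_t["working"])         # (1.0, 3.0)
print(two_t["factor_of_two"])   # (1.0, 4.0)

o1 = estimate_bands(200.0)      # ~200B, in billions of parameters
print(o1["factor_of_two"])      # (100.0, 400.0)
```

The last line reproduces the 100B to 400B range quoted for o1.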

Total Parameters
The parameter figure shown in the Models Table is total parameters. For Mixture-of-Experts (MoE) models, that means the full count across all experts, not the active-per-token count. The two can differ by an order of magnitude on modern frontier MoE models.
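
To make the total-versus-active gap concrete, here is a small sketch using the open-release rows from the cross-check tables later on this page. The dictionary layout and function name are illustrative choices.

```python
# (total, active) parameters in billions, taken from the open-release
# rows in the Lens 4 tables on this page.
MODELS = {
    "gpt-oss-120b": (116.8, 5.1),
    "gpt-oss-20b": (20.9, 3.6),
    "DeepSeek-V4-Pro": (1600.0, 49.0),
    "Kimi K2.6": (1000.0, 32.0),
}


def total_to_active(total_b: float, active_b: float) -> float:
    """Ratio of total parameters to parameters active per token."""
    return total_b / active_b


ratios = {name: total_to_active(*params) for name, params in MODELS.items()}
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.1f}x")
```

On these rows the ratio runs from about 6x (gpt-oss-20b) to over 30x (DeepSeek-V4-Pro, Kimi K2.6), which is why the table states which count it shows.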

Tokens seen
Token figures in the Models Table use the same italics convention. My token estimates start from the parameter estimate and use a tokens-per-parameter ratio (typically 20:1 or higher, in line with post-Chinchilla data ratios) as a first-pass sanity check. That figure then gets refined against lab statements about training compute, training duration, hardware allocation, and dataset composition. The detailed token-counting methodology for publicly-known corpora is documented in What’s in My AI?, and the Chinchilla page (https://lifearchitect.ai/chinchilla/) tracks how the ratio itself has evolved from 1.7:1 (Kaplan, 2020) through 20:1 (Chinchilla, 2022) to 80,000:1 (LFM2.5-350M, 2026).
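
The first-pass check is simple arithmetic, sketched below. The function names are hypothetical, the 350M parameter count for LFM2.5-350M is read from the model name, and the refinement against compute and dataset disclosures is not modelled here.

```python
def first_pass_tokens(params: float, ratio: float = 20.0) -> float:
    """First-pass token estimate: parameters times a tokens-per-parameter ratio."""
    return params * ratio


def tokens_per_param(tokens: float, params: float) -> float:
    """Ratio implied by a known token count and parameter count."""
    return tokens / params


# A 200B-parameter model at the 20:1 Chinchilla ratio.
print(first_pass_tokens(200e9) / 1e12)            # 4.0 (trillion tokens)

# LFM2.5-350M at the 80,000:1 ratio quoted above.
print(first_pass_tokens(350e6, 80_000) / 1e12)    # 28.0 (trillion tokens)

# Kimi K2.6 from the table on this page: 30.5T tokens over 1T total parameters.
print(tokens_per_param(30.5e12, 1e12))            # 30.5
```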

Lenses for estimates

My estimates come from several lenses, cross-checked against each other and against open-source releases.

Lens 1: Pricing signal
Per-token price gives some signal on active-parameter size, since inference cost scales with active params, but margin bands vary widely across labs (some price close to cost, others add substantial profit). I treat pricing as one input among many, rather than a primary driver. I maintain an older public visualisation of frontier model pricing over time: https://lifearchitect.ai/viz/#frontier-prices.

Lens 2: Capability signal
Benchmark jumps between model generations on saturating-but-still-useful evals (MMLU-Pro, GPQA, HLE) have historically tracked both a parameter step and a training-compute step. Combined with the pricing lens, this helps me bracket the estimate.

Lens 3: Supply and demand signals
Three constraints together set the plausible size envelope for any frontier model:

Signal	What it constrains	Examples
Training supply	Maximum training compute available	TPU/GPU allocations, datacentre capacity disclosures
Inference supply	Deployment footprint and serving cost	Bedrock, Vertex, Foundry, Azure, OpenRouter availability
Demand	Realistic serving load	User counts, usage disclosures, queue behaviour
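
One way to picture three constraints setting an envelope is as an intersection of ranges. The sketch below is an assumption about how such a combination could be expressed, not the procedure behind the Models Table, and every number in it is hypothetical.

```python
def envelope(ranges: dict):
    """Intersect (low, high) ranges; return None if they do not overlap."""
    low = max(lo for lo, _ in ranges.values())
    high = min(hi for _, hi in ranges.values())
    return (low, high) if low <= high else None


# Hypothetical plausible ranges, in trillions of total parameters.
signals = {
    "training_supply": (0.5, 8.0),   # what the available compute could train
    "inference_supply": (1.0, 6.0),  # what the deployment footprint could serve
    "demand": (0.5, 4.0),            # what the serving load makes realistic
}

print(envelope(signals))  # (1.0, 4.0)
```

The tightest lower bound and the tightest upper bound win, so a single strong constraint can narrow the envelope on its own.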

Lens 4: Cross-checks against open-source frontier releases
When labs release detailed numbers (or full weights), we get ground truth, and I check my proprietary estimates against the known open frontier as each release lands.

Small models circa 2024

Model	Lab	Total params	Active params	Tokens seen	GPQA	HLE	Released
gpt-oss-120b	OpenAI	116.8B	5.1B	30T	80.1	19	Aug/2025
gpt-oss-20b	OpenAI	20.9B	3.6B	13T	71.5	17.3	Aug/2025
GPT-4o	OpenAI	200B	—	20T	53.6	3.1	May/2024

Large models circa 2025

Model	Lab	Total params	Active params	Tokens seen	GPQA	HLE	Released
Grok-5	xAI	6T	—	—	—	—	—
DeepSeek-V4-Pro	DeepSeek	1.6T	49B	33T	90.1	37.7	Apr/2026
Kimi K2.6	Moonshot	1T	32B	30.5T	90.5	54	Apr/2026
Grok-4	xAI	3T	—	—	—	—	Jul/2025
Grok-3	xAI	3T	—	—	84.6	—	Feb/2025

Lens 5: Direct insights from labs, plus signals from rumours, news, and disclosures
I have private channels with several labs where I receive context I can’t disclose publicly, which informs the table and in some cases directly shapes specific entries. Separately, I monitor and weight credible rumours, public commentary from researchers and executives, news reporting (Reuters in particular has broken useful detail on frontier training runs and chip allocations), and partial disclosures (Amazon’s Titan family is a good example, where most of what we know publicly came through fragmented announcements and sideways comments rather than published papers). These get cross-checked against the lenses above before I commit a number.

Worked example: Grok-3 and Grok-4

This is a case where my estimates were carried publicly for months and then confirmed by a direct lab disclosure.

In Jul/2025, my Models Table centrepoints for the Grok family were:

Model	My estimate (Jul/2025)	Lens reasoning
Grok-3	~3T total parameters	Capability bracket against GPT-4-class models, Colossus training-supply disclosure (100K+ H100s), public benchmark performance
Grok-4	~3T total parameters	xAI public statements that Grok-4 was a post-training/RL upgrade rather than a new pretrain, similar inference cost

On 15/Nov/2025, the xAI CEO disclosed publicly:

[Grok-5] is a 6 trillion parameter model, whereas Grok-3 and -4 are based on a 3 trillion parameter model. (xAI CEO, 15/Nov/2025, covered at https://lifearchitect.ai/grok/)

Comparison of my carried estimate to the disclosed figure:

Model	My estimate	Disclosed (Nov/2025)	Delta
Grok-3	3T (Feb/2025)	3T (Nov/2025)	Match
Grok-4	3T (Jul/2025)	3T (Nov/2025)	Match

The Grok-3 and Grok-4 centrepoints landed on the disclosed figures. This is the kind of cross-check that validates the lens method, particularly Lens 3 (training supply via Colossus), Lenses 1 and 2 (the pricing and capability bracket), and Lens 4 (open-source frontier comparisons).

Worked example: Claude Mythos Preview

The following is reproduced from my paid Mythos analysis at The Memo – Special Edition – Claude Mythos. It shows the lenses applied to a single recent model (as of Apr/2026).

Size estimates

General model size is no longer an indicator of performance, but I still find it interesting. With all model details kept confidential, plus added complexity in reasoning/thinking mode, it is more challenging than ever to estimate token and parameter counts.

Now in 2026, based on my ongoing analysis, known Claude model pricing, similar known frontier MoE model sizes and pricing, estimates of training supply (TPUs), inference supply (TPUs), and demand (users), here are my initial estimates for the Claude Mythos model. Working is based on the Glasswing pricing disclosure ($25/$125 per million input/output tokens), the benchmark jumps over Opus 4.6, and my previously established estimate of Opus 4.6 at ~5T parameters MoE on ~100T tokens.

Pricing as a sizing signal. Mythos at $25/$125 is priced at roughly 1.7x Opus 4.6 ($15/$75) on both input and output. Frontier labs price close to inference cost plus a margin band, so a 1.7x price step usually reflects a 1.5x to 1.8x active-parameter step, not a proportional total-parameter step (MoE total params can grow much faster than active params without moving price much).
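
The price step can be checked directly from the two price pairs quoted above. The function name is illustrative, and the active-parameter band is recorded as the rule of thumb from the text rather than derived.

```python
def price_step(new_prices, old_prices):
    """Return the new/old price ratio for each (input, output) price."""
    return tuple(new / old for new, old in zip(new_prices, old_prices))


opus_4_6 = (15.0, 75.0)   # $ per million input / output tokens
mythos = (25.0, 125.0)

input_step, output_step = price_step(mythos, opus_4_6)
print(f"input {input_step:.2f}x, output {output_step:.2f}x")  # input 1.67x, output 1.67x

# Band quoted in the text for a ~1.7x price step; a rule of thumb, not a formula.
active_param_step = (1.5, 1.8)
```

Both steps come out at the same 1.67x, which the text rounds to 1.7x.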

Capability as a sizing signal. The benchmark jumps over Opus 4.6 are larger than Anthropic has previously delivered on a single model generation:

Benchmark	Claude Opus 4.6	Claude Mythos Preview	Step
SWE-bench Verified	80.8%	93.9%	+13.1pp
CyberGym	66.6%	83.1%	+16.5pp
HLE	53.1%	64.7%	+11.6pp

The system card explicitly says Mythos has hit the ceiling on their tests (‘saturates many of our most concrete, objectively-scored evaluations’). Jumps of this size historically track with both a parameter step and a meaningful training-compute step.
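
The steps in the benchmark table can be recomputed from the two score columns. The sketch below also adds a share-of-remaining-headroom figure; that second metric is an extra view introduced for this example and is not used in the original analysis.

```python
# (Claude Opus 4.6, Claude Mythos Preview) scores in percent, from the table above.
SCORES = {
    "SWE-bench Verified": (80.8, 93.9),
    "CyberGym": (66.6, 83.1),
    "HLE": (53.1, 64.7),
}


def benchmark_steps(scores: dict) -> dict:
    """Return (step in percentage points, share of remaining headroom closed)."""
    result = {}
    for name, (old, new) in scores.items():
        step_pp = round(new - old, 1)
        headroom_closed = (new - old) / (100.0 - old)  # assumed extra metric
        result[name] = (step_pp, headroom_closed)
    return result


for name, (step_pp, closed) in benchmark_steps(SCORES).items():
    print(f"{name}: +{step_pp}pp, {closed:.0%} of remaining headroom")
```

On a nearly saturated eval such as SWE-bench Verified, a 13.1pp step closes roughly two-thirds of what was left to gain, which fits the system card's comment about saturation.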

Training data. Anthropic has been consistent about heavy synthetic data and curriculum work (‘Claude Mythos Preview was trained on a proprietary mix of publicly available information from the internet, public and private datasets, and synthetic data generated by other models.’), and the system card flags extensive RL on long-horizon agentic tasks (‘extremely large amounts of reinforcement learning’). A reasonable read is that Mythos saw materially more tokens than Opus 4.6, with a much higher synthetic fraction, particularly for code, cyber, and tool-use trajectories.

Update from Apr/2026: Subsequent NVIDIA commentary suggests Mythos was ‘trained on fairly mundane capacity, and a fairly mundane amount of it’ (YouTube, 17/Apr/2026), pointing to looped Transformer/reasoning gains rather than an exponential compute step. So the Mythos compute story is different to what I had at the time of writing, though the original estimate may still be in the right range.

Pine AI’s knowledge-probe method

In April 2026, Bojie Li at Pine AI published Incompressible Knowledge Probes (arXiv 2604.24827), a separate parameter-estimation method that uses factual recall to lower-bound how much a model knows, and inverts a calibration on 89 open-weight models (R²=0.917) to estimate proprietary model sizes. It is the first published method to estimate frontier proprietary model sizes by measuring factual recall directly, and a useful cross-reference for the figures in the Models Table.

Model	Vendor	Accuracy	Est. Size	90% PI
GPT-5.5	OpenAI	71.9%	~9.7T	[3.2–28.7T]
Claude Opus 4.6	Anthropic	68.0%	~5.3T	[1.8–15.6T]
GPT-5 Pro	OpenAI	66.5%	~4.1T	[1.4–12.2T]
GPT-5	OpenAI	66.4%	~4.1T	[1.4–12.1T]
Claude Opus 4.7	Anthropic	66.4%	~4.0T	[1.4–12.0T]
o1	OpenAI	65.4%	~3.5T	[1.2–10.3T]
Claude Opus 4.5	Anthropic	65.2%	~3.4T	[1.1–10.0T]
Claude Opus 4.1	Anthropic	64.9%	~3.2T	[1.1–9.5T]
Grok-4	xAI	64.8%	~3.2T	[1.1–9.4T]
o3	OpenAI	64.4%	~3.0T	[1.0–8.9T]
GPT-5.4 Pro	OpenAI	62.5%	~2.2T	[736B–6.5T]
GPT-4.1	OpenAI	62.3%	~2.2T	[719B–6.4T]
Grok-3	xAI	62.3%	~2.1T	[715B–6.3T]
Claude Sonnet 4.6	Anthropic	60.9%	~1.7T	[579B–5.1T]
GPT-5.3	OpenAI	60.0%	~1.5T	[503B–4.5T]
GPT-5.2 Pro	OpenAI	59.7%	~1.4T	[478B–4.2T]
Claude Opus 4	Anthropic	59.7%	~1.4T	[478B–4.2T]
GPT-5.1	OpenAI	59.3%	~1.3T	[450B–4.0T]
GPT-5.2	OpenAI	58.9%	~1.3T	[417B–3.8T]
Gemini 2.5 Pro	Google	58.4%	~1.2T	[387B–3.4T]
GPT-5.4	OpenAI	57.7%	~1.0T	[348B–3.1T]
GPT-4o	OpenAI	55.3%	~720B	[241B–2.1T]
Qwen3-Max	Alibaba	55.0%	~685B	[229B–2.0T]
GPT-4	OpenAI	54.8%	~666B	[223B–2.0T]
GPT-4-Turbo	OpenAI	54.5%	~630B	[211B–1.9T]
GPT-5 Mini	OpenAI	51.7%	~410B	[137B–1.2T]
Gemini 2.5 Flash	Google	47.4%	~207B	[69B–617B]
Claude 3.5 Haiku	Anthropic	45.6%	~158B	[53B–470B]
GPT-5 Nano	OpenAI	40.5%	~71B	[24B–212B]
Claude Haiku 4.5	Anthropic	39.9%	~65B	[22B–194B]

Source: Pine AI’s paper (Apr/2026), p13.
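
Reading the intervals in the table above as multiplicative factors makes them easier to compare with the error bands used elsewhere on this page. The sketch below does that for four rows; the observation that the factor is close to 3x on each side comes from this arithmetic, not from a statement in the paper.

```python
# (estimate, interval low, interval high) in billions of parameters,
# copied from four rows of the table above.
ROWS = {
    "GPT-5.5": (9700.0, 3200.0, 28700.0),
    "Claude Opus 4.6": (5300.0, 1800.0, 15600.0),
    "GPT-4o": (720.0, 241.0, 2100.0),
    "Claude Haiku 4.5": (65.0, 22.0, 194.0),
}


def interval_factors(estimate: float, low: float, high: float):
    """Return (estimate / low, high / estimate) for one row."""
    return estimate / low, high / estimate


factors = {name: interval_factors(*row) for name, row in ROWS.items()}
for name, (down, up) in factors.items():
    print(f"{name}: /{down:.2f} below, x{up:.2f} above")
```

For these rows both factors fall between 2.9 and 3.1, so each interval spans about a factor of three either side of the estimate.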

Caveats on Pine AI’s specific estimates
A few entries in Pine’s table sit notably outside the working consensus and are worth flagging for users cross-referencing the Models Table:

o1 vs GPT-4o. Pine estimates o1 at ~3.5T and GPT-4o at ~720B. The widely-reported lineage is that o1 was trained on top of a GPT-4-class base (likely GPT-4o or a close sibling) with reasoning added through large-scale reinforcement learning on chain-of-thought. Under that lineage, o1’s parameter count cannot exceed GPT-4o’s by a factor of five. A more plausible read is that o1 has roughly the same base parameter count as GPT-4o, with the apparent capacity gap reflecting the value of inference-time reasoning rather than a larger base model.

Grok-3 and Grok-4. Pine estimates Grok-3 at ~2.1T and Grok-4 at ~3.2T. xAI’s own Grok 4 launch slides (9/Jul/2025) showed a 10x increase in RL compute over Grok 3 reasoning, with no claim of a separate pretrain. Independent technical coverage at the time (Nathan Lambert, Interconnects) treated it as plausible that Grok 4 used ‘the exact same pretraining compute as Grok 3’ with scaled-up RL. The xAI CEO’s later 15/Nov/2025 disclosure that Grok-3 and Grok-4 are both based on a 3T parameter model is consistent with this. So the two should land at roughly the same number.

GPT-4. Pine estimates GPT-4 at ~666B. The long-standing public consensus, sourced from SemiAnalysis, George Hotz, and others, is that GPT-4 is a 1.76T MoE model with around 280B active per token. The 666B figure is well below that consensus.
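
The size of each gap flagged above can be put as a simple ratio. The figures are taken from the Pine AI table and the three caveats; the variable names are illustrative.

```python
# Pine AI centrepoints, in trillions of parameters.
PINE = {"o1": 3.5, "GPT-4o": 0.72, "Grok-4": 3.2, "Grok-3": 2.1, "GPT-4": 0.666}

# Public consensus figure for GPT-4 quoted above, in trillions.
GPT4_CONSENSUS = 1.76

o1_vs_4o = PINE["o1"] / PINE["GPT-4o"]
grok4_vs_grok3 = PINE["Grok-4"] / PINE["Grok-3"]
gpt4_consensus_vs_pine = GPT4_CONSENSUS / PINE["GPT-4"]

print(f"o1 / GPT-4o: {o1_vs_4o:.2f}x")
print(f"Grok-4 / Grok-3: {grok4_vs_grok3:.2f}x")
print(f"GPT-4 consensus / Pine: {gpt4_consensus_vs_pine:.2f}x")
```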

What this suggests. Pine’s ‘effective knowledge capacity in open-model-equivalent parameters’ framing means numbers can read lower than raw parameter count for models with proprietary training advantages, particularly for older models like GPT-4 whose training-data and post-training stack does not match the modern open-weight calibration set. For reasoning-focused models (o1), the framing can also attribute capacity to base parameters that belongs to inference-time compute. Both are interesting research questions rather than reasons to discard the method, but they are reasons to read individual entries in the Pine AI table with care.

Pine AI’s method is intrinsic (measuring stored facts via 1,400 black-box API probes); the Models Table method is extrinsic (pricing, capability, supply, and disclosure signals). The two are complementary rather than competing: Li’s method needs API access and a long evaluation run, whereas the Models Table method produces a number the day a model launches. Both carry roughly ±2–3x error bands on any single proprietary frontier model.

Where the two methods agree on recent frontier estimates:

Model	My estimate	Pine (Apr/2026)	Notes
Claude Opus 4.6	~5T	~5.3T	Match
GPT-5	3T	~4.1T	My figure sits inside Pine’s interval; Pine’s centrepoint is above mine
Grok-4	3T (disclosed)	~3.2T	Both close to disclosed
Grok-3	3T (disclosed)	~2.1T	Both inside intervals; mine landed exactly on the disclosure
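
The agreement in the table above can be checked mechanically against Pine’s published intervals. This is a minimal sketch: the row layout, function name, and the factor-of-two test are choices made for the example.

```python
# (my estimate, Pine centrepoint, Pine 90% interval), in trillions of parameters.
ROWS = {
    "Claude Opus 4.6": (5.0, 5.3, (1.8, 15.6)),
    "GPT-5": (3.0, 4.1, (1.4, 12.1)),
    "Grok-4": (3.0, 3.2, (1.1, 9.4)),
    "Grok-3": (3.0, 2.1, (0.715, 6.3)),
}


def compare(mine: float, pine: float, interval: tuple) -> dict:
    """Compare one estimate with Pine's centrepoint and interval."""
    low, high = interval
    ratio = pine / mine
    return {
        "ratio": round(ratio, 2),
        "inside_interval": low <= mine <= high,
        "within_factor_of_two": 0.5 <= ratio <= 2.0,
    }


results = {name: compare(*row) for name, row in ROWS.items()}
for name, result in results.items():
    print(name, result)
```

All four rows sit inside Pine’s interval and within a factor of two of each other, with Grok-3 showing the widest gap at a ratio of 0.7.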

The two methods are measuring slightly different things. Li’s estimates represent effective knowledge capacity in open-model-equivalent parameters, which can run higher than raw parameter count for proprietary models with heavy post-training or denser data. So a gap on a single model is not necessarily evidence that one method is wrong; it can also mean the model stores more factual knowledge per parameter than the open-weight calibration set predicts. This is a useful research question rather than a contradiction.

For users of the Models Table: where my estimates and Pine’s agree, the cross-check tightens the implicit confidence interval. Where they disagree, it’s worth reading both and forming your own view.

Closing

No method gets every estimate right when labs keep their architecture and training details confidential. Grok-3 and Grok-4 landed on the disclosed figures. The Claude Mythos compute estimate will move once the picture settles. I’ll continue providing informed estimates from the best signals available, documenting the working, and updating when the ground truth arrives.

The Models Table is a free public resource, updated continuously. Detailed analyses are available to full subscribers in The Memo.

Further reading

Resource	Type	What it covers
Models Table	Reference table	The reference table for 10,000+ LLM data points, including the parameter and token estimates this page documents.
What’s in My AI?	Free public paper	Token-counting methodology for publicly-known training corpora
What’s in GPT-5?	Methodology paper	Full parameter/token derivation for GPT-5, including synthetic-data analysis (cited by the G7)
What’s in Grok?	Methodology paper	Full parameter/token derivation for the Grok family
Frontier pricing visualisation	Viz	Older version of the price-lens method
The Memo – GPT-5 special edition	Paid per-model analysis	Per-model sizing working for GPT-5
The Memo – Claude Mythos special edition	Paid per-model analysis	Per-model sizing working for Claude Mythos Preview
Datasets Table	Reference table	Largest known training datasets, updated continuously
Chinchilla data scaling laws page	Reference page	Tokens-to-parameters ratio research from Chinchilla onward

This page last updated: 10/May/2026. https://lifearchitect.ai/models-table-methodology/