Alan D. Thompson
August 2024 (updated 2025Q1)
|  | MMLU | MMLU-Pro | GPQA | HLE |
|---|---|---|---|---|
| Name | Measuring Massive Multitask Language Understanding | MMLU-Pro | Graduate-Level Google-Proof Q&A | Humanity’s Last Exam |
| Author | Dan Hendrycks et al. | TIGER-Lab | NYU, Cohere, Anthropic | Center for AI Safety |
| Question count | ~16,000 | ~12,000 | 448 | 3,000 |
| Choices per question | 4 | 10 | 4 | Free-text (or 5+ options when multiple-choice) |
| Chance accuracy | 25% | 10% | 25% | ~4.8% (24% of questions are multiple-choice with 5 or more options) |
| Highest score (2025Q1) | 92.3% (o1-preview) | 91.0% (o1-preview) | 87.7% (o3) | 14% (o3-mini-high) |
| Ceiling | ~90% (UoE) | ~90% | ~80% (subset dependent) | ~97% |
| Notes | Multi-subject benchmark; ~10% dataset errors limit the maximum achievable score | More challenging; 10-choice format | PhD-level, ‘Google-proof’ STEM questions; Diamond subset = 198 questions | Includes multimodal tasks |
“[o1] destroyed the most popular reasoning benchmarks… They’re now crushed.”
— Dr Dan Hendrycks, creator of MMLU and MATH (Reuters, 17/Sep/2024)
Viz
Download source (PDF)
Permissions: Yes, you can use these visualizations anywhere, please leave the citation intact.
Mapping table
OpenAI o1 (Sep/2024) assisted with the following table, based on all models in the Models Table with an MMLU-Pro score (at minimum; usually with both MMLU and GPQA scores as well). The o1 model thought for 153 seconds to produce this table. In Apr/2025, with a lot of hand-holding, o3-mini-high was used to add the HLE calculations. There is a fair error margin (estimate ±5%) in final scores, as well as variation between testing methodologies and reporting, so these correlations should be used as a rough guide only.
- MMLU-Pro Score: approximately MMLU Score − 20.
- GPQA Score: approximately 2 × MMLU Score − 110.
- HLE Score: approximately (GPQA Score − 35) / 5.
| MMLU | MMLU-Pro | GPQA | HLE |
|---|---|---|---|
| – | – | 85 | 10 |
| 95 | 75 | 80 | 9 |
| 90 | 70 | 70 | 7 |
| 85 | 65 | 60 | 5 |
| 80 | 60 | 50 | 3 |
| 75 | 55 | 40 | 1 |
| 70 | 50 | 30 | 0 |
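For convenience, the three approximations can be wrapped in a short Python helper. This is a minimal sketch of my own (the function name and the clamping to 0–100 are not part of the original calculation); it simply reproduces the rows of the mapping table above.

```python
# Minimal sketch of the three approximations above (not from the original article).
# Scores are percentages; outputs are clamped to the 0-100 range.

def _clamp(x: float) -> float:
    return max(0.0, min(100.0, x))

def map_from_mmlu(mmlu: float) -> dict:
    """Estimate MMLU-Pro, GPQA, and HLE scores from an MMLU score."""
    mmlu_pro = _clamp(mmlu - 20)        # MMLU-Pro ≈ MMLU − 20
    gpqa = _clamp(2 * mmlu - 110)       # GPQA ≈ 2 × MMLU − 110
    hle = _clamp((gpqa - 35) / 5)       # HLE ≈ (GPQA − 35) / 5
    return {"MMLU-Pro": mmlu_pro, "GPQA": gpqa, "HLE": hle}

for mmlu in (95, 90, 85, 80, 75, 70):
    print(mmlu, map_from_mmlu(mmlu))    # reproduces the rows of the mapping table
```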
Notes
Benchmark results are from primary sources as compiled in my Models Table.
MMLU is no longer a reliable benchmark and, due to errors, has an ‘uncontroversially correct’ ceiling of about 90%. Gema et al. found that over 9% of MMLU questions contain errors, based on expert review of 3,000 randomly sampled questions. Read more: https://arxiv.org/pdf/2406.04127#page=5 and Errors in the MMLU: The Deep Learning Benchmark is Wrong Surprisingly Often – Daniel Erenrich.
GPQA has an ‘uncontroversially correct’ ceiling of about 80%, depending on the subset. The questions in this benchmark are extremely difficult (designed and tested by PhDs), and it must have been a challenge to find humans capable of both designing the test questions and providing ‘uncontroversially correct’ answers. The paper notes (PDF, p8): ‘After an expert validator answers a question, we show them the question writer’s answer and explanation and ask them if they believe the question is objective and its answer is uncontroversially correct… On the full 546 questions, the first expert validator indicates positive post-hoc agreement on 80.1% of the questions, and the second expert validator indicates positive post-hoc agreement on 85.4% of the questions… Using a relatively conservative level of permissiveness, we estimate the proportion of questions that have uncontroversially correct answers at 74% on GPQA Extended. Because the main [448] and diamond [198, both experts agree] sets are filtered based on expert validator accuracy, we expect the objectivity rate to be even higher on those subsets.’
HLE probably has an ‘uncontroversially correct’ ceiling of about 97% (my estimate); a 3% error rate across 3,000 human-designed questions seems a fair assumption. My initial HLE formula above is based on the GPQA score and, with only 10 data points available (minus outliers), is subject to update as this benchmark becomes more widely applied during 2025–2026.
IQ results are by Prof David Rozado, as documented in my AI and IQ testing page: GPT-3.5-Turbo = 147; GPT-4 Classic = 152. The GPT-4 Classic result was replicated by clinical psychologist Eka Roivainen, who used the WAIS-III to assess ChatGPT (assuming GPT-4 Classic) and measured a verbal IQ of 155 (P99.987). Einstein is a well-known figure, and although he never took an IQ test, it is generally agreed that he had an estimated IQ of around 160 (P99.996); see my Visualising Brightness report for numbers.
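For readers checking the percentile figures: assuming IQ scores follow a normal distribution with mean 100 and standard deviation 15 (the usual convention, and an assumption here), the quoted P-values can be reproduced with a few lines of Python:

```python
# Sketch: convert IQ scores to population percentiles, assuming a normal
# distribution with mean 100 and standard deviation 15 (assumption).
from statistics import NormalDist

iq_dist = NormalDist(mu=100, sigma=15)

for iq in (147, 152, 155, 160):
    print(f"IQ {iq} ≈ P{iq_dist.cdf(iq) * 100:.3f}")
# IQ 155 ≈ P99.988 and IQ 160 ≈ P99.997, consistent with the
# P99.987 and P99.996 figures quoted above.
```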
Error bars are not shown, but a fair error margin (estimate ±5%) can be assumed for all model results, and therefore for the benchmark mapping in this analysis. This is primarily due to data contamination during training, where benchmark data (test questions and answers) leaks into the training set and the final model. See the Llama 2 paper for a contamination discussion and working (PDF, pp75–76).
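The Llama 2 paper measures contamination with token-level n-gram matching against the training corpus; the sketch below is a deliberately simplified, word-level illustration of the same idea (my own simplification, with hypothetical example strings), not a reproduction of their method.

```python
# Illustrative sketch only: flag possible benchmark contamination by checking
# whether any word-level 8-gram from a benchmark question also appears in a
# training document. (A simplification of the token-level matching described
# in the Llama 2 paper, not a reproduction of it.)

def word_ngrams(text: str, n: int = 8) -> set:
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def possibly_contaminated(benchmark_item: str, training_doc: str, n: int = 8) -> bool:
    return bool(word_ngrams(benchmark_item, n) & word_ngrams(training_doc, n))

# Hypothetical example: the benchmark question appears verbatim in a training document.
question = "What is the standard enthalpy of formation of liquid water at 298 K?"
document = "forum post: what is the standard enthalpy of formation of liquid water at 298 K? answer below"
print(possibly_contaminated(question, document))  # True -> likely leakage
```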
Conclusions
Parameters and tokens trained are no longer a precise indicator of power. Smaller models can now outperform larger models, as illustrated in benchmarks for GPT-4 Classic (1760B) vs GPT-4o mini (8B), which show similar performance, and for many other post-2023 models.
Therefore, as of 2024, both the Biden AI Executive Order and the EU AI Act are flawed. Article 51 of the EU AI Act specifies 10^25 floating point operations (FLOPs) as the threshold at which a general-purpose AI system becomes subject to additional regulatory requirements. In the US, President Biden’s Executive Order 14110 on the Safe, Secure, and Trustworthy Development and Use of Artificial Intelligence specifies 10^26 FLOPs as the threshold for reporting obligations to the US Federal Government (Section 4.2(b)(i)). Jack Clark has put these training thresholds into plain English and US dollars for NVIDIA H100 chips (28/Mar/2024):
| Metric | EU AI Act (Mar/2024) | Biden Executive Order (Oct/2023) |
|---|---|---|
| Training compute threshold | 10^25 FLOPS [10 septillion] | 10^26 FLOPS [100 septillion] |
| FLOPS per chip-second | 8E14 [800 trillion] | 8E14 [800 trillion] |
| FLOPS per chip-hour | 2.88E18 [2.88 quintillion] | 2.88E18 [2.88 quintillion] |
| Chip hours | 3.47M | 34.722M |
| Training compute cost (chip hours × US$2), rounded | $7M | $70M |
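The chip-hour and dollar figures follow directly from the thresholds; here is a quick sketch using the same assumptions as the table above (~8×10^14 FLOPS per H100 per second and US$2 per chip-hour):

```python
# Reproduce the chip-hour and cost estimates above from the FLOP thresholds,
# assuming 8e14 FLOPS per chip-second and US$2 per chip-hour (as in the table).
FLOPS_PER_CHIP_SECOND = 8e14
FLOPS_PER_CHIP_HOUR = FLOPS_PER_CHIP_SECOND * 3600  # 2.88e18
COST_PER_CHIP_HOUR_USD = 2

for label, threshold_flops in [("EU AI Act", 1e25), ("Biden Executive Order", 1e26)]:
    chip_hours = threshold_flops / FLOPS_PER_CHIP_HOUR
    cost_usd = chip_hours * COST_PER_CHIP_HOUR_USD
    print(f"{label}: {chip_hours / 1e6:.2f}M chip-hours, ~US${cost_usd / 1e6:.1f}M")
# EU AI Act: 3.47M chip-hours, ~US$6.9M (≈$7M)
# Biden Executive Order: 34.72M chip-hours, ~US$69.4M (≈$70M)
```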
The landscape of artificial intelligence is rapidly evolving, challenging our traditional metrics for assessing AI capabilities. The emergence of smaller, more efficient models that rival or surpass their larger counterparts marks a significant shift in the field. This development underscores the importance of looking beyond raw computational power or parameter count when evaluating AI systems.
As we approach what may be the peak of intelligence that humans can readily comprehend and evaluate, the implications extend far beyond mere technological progress. The future of AI is not simply about more powerful machines, but about its potential to drive human evolution.
This paradigm shift is redefining our relationship with technology, opening new frontiers in human cognitive enhancement, problem-solving capabilities, and our understanding of intelligence itself.
As human intelligence and artificial intelligence converge, we are preparing for a transformative leap that will amplify our creative potential and propel humanity towards a future where the boundaries of cognition are limited only by our imagination.
What all this reminds me of
[Video] Source: Speed climbing, circa 2016.

This page last updated: 14/Apr/2025. https://lifearchitect.ai/mapping/