Get The Memo.
Alan D. Thompson
August 2024 (updated September 2024)
“[o1] destroyed the most popular reasoning benchmarks… They’re now crushed.”
— Dr Dan Hendrycks, creator of MMLU and MATH (Reuters, 17/Sep/2024)
Viz
Download source (PDF)
Permissions: Yes, you can use these visualizations anywhere, please leave the citation intact.
Mapping table (updated Sep/2024 via OpenAI o1)
OpenAI o1 (Sep/2024) assisted with the following table, based on all models in the Models Table with an MMLU-Pro score (usually alongside both MMLU and GPQA scores). The o1 model thought for 153 seconds to produce this output. There is a fair error margin (estimated ±5%) in final scores, as well as variation between testing methodologies and reporting, so these correlations should be used as a rough guide only.
- MMLU-Pro Score: approximately calculated as MMLU Score − 20.
- GPQA Score: approximately calculated as 2 × MMLU Score − 110.
| MMLU | MMLU-Pro | GPQA |
|---|---|---|
| 95 | 75 | 80 |
| 90 | 70 | 70 |
| 85 | 65 | 60 |
| 80 | 60 | 50 |
| 75 | 55 | 40 |
| 70 | 50 | 30 |
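The two linear approximations above can be expressed as a small helper. This is a minimal sketch of the rough mapping only; the function names are illustrative and not from the source, and the ±5% error margin noted below still applies:

```python
def mmlu_pro_estimate(mmlu: float) -> float:
    """Approximate MMLU-Pro score from an MMLU score (rough guide, ±5%)."""
    return mmlu - 20

def gpqa_estimate(mmlu: float) -> float:
    """Approximate GPQA score from an MMLU score (rough guide, ±5%)."""
    return 2 * mmlu - 110

# Reproduce the mapping table above
for mmlu in (95, 90, 85, 80, 75, 70):
    print(f"MMLU {mmlu} -> MMLU-Pro ~{mmlu_pro_estimate(mmlu):.0f}, "
          f"GPQA ~{gpqa_estimate(mmlu):.0f}")
```

Note that the GPQA line falls twice as fast as the MMLU-Pro line, so below roughly MMLU 55 the GPQA estimate drops under random-guess territory and the approximation should not be extrapolated.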
Notes
Benchmark results are from primary sources as compiled in my Models Table.
MMLU is no longer a reliable benchmark and, due to errors, has an ‘uncontroversially correct’ ceiling of about 90%. Read more: Errors in the MMLU: The Deep Learning Benchmark is Wrong Surprisingly Often – Daniel Erenrich.
GPQA has an ‘uncontroversially correct’ ceiling of about 80%, depending on the subset. The questions on this benchmark are extremely difficult (designed and tested by PhDs), and there must have been a challenge in finding humans smart enough to both design the test questions, and provide ‘uncontroversially correct’ answers. The paper notes (PDF, p8): ‘After an expert validator answers a question, we show them the question writer’s answer and explanation and ask them if they believe the question is objective and its answer is uncontroversially correct… On the full 546 questions, the first expert validator indicates positive post-hoc agreement on 80.1% of the questions, and the second expert validator indicates positive post-hoc agreement on 85.4% of the questions…Using a relatively conservative level of permissiveness, we estimate the proportion of questions that have uncontroversially correct answers at 74% on GPQA Extended. Because the main [448] and diamond [198, both experts agree] sets are filtered based on expert validator accuracy, we expect the objectivity rate to be even higher on those subsets.’
IQ results are by Prof David Rozado, as documented in my AI and IQ testing page: GPT-3.5-Turbo = 147; GPT-4 Classic = 152, replicated by clinical psychologist Eka Roivainen, who used the WAIS III to assess ChatGPT (assuming GPT-4 Classic) and found a verbal IQ of 155 (P99.987). Einstein is a well-known figure, and although he never took an IQ test, it is generally agreed that his IQ was around 160 (P99.996); see my Visualising Brightness report for numbers.
Error bars are not shown, but it can be assumed that there is a fair error margin (estimate ±5%) for all model results, and therefore the benchmark mapping in this analysis. This is primarily due to data contamination during training, where benchmark data (test questions and answers) is leaked into the training set and final model. See the Llama 2 paper for contamination discussion and working (PDF, pp75-76).
Conclusions
Parameters and tokens trained are no longer precise indicators of power. Smaller models can now outperform larger ones, as illustrated by the similar benchmark performance of GPT-4 Classic (1760B) and GPT-4o mini (8B), and by many other post-2023 models.
Therefore, as of 2024, both the Biden AI Executive Order and the EU AI Act are flawed. Article 51 of the EU AI Act specifies 10^25 floating point operations (FLOPs) as the threshold for a general-purpose AI system becoming subject to additional regulatory requirements. In the US, President Biden’s Executive Order 14110 on the Safe, Secure, and Trustworthy Development and Use of Artificial Intelligence (Executive Order) specifies 10^26 FLOPs as the threshold for reporting obligations to the US Federal Government (Section 4.2(b)(i)). Jack Clark has put these training thresholds into plain English and US dollars for NVIDIA H100 chips (28/Mar/2024):
| Metric | EU AI Act Mar/2024 | Biden Executive Order Oct/2023 |
|---|---|---|
| Training compute threshold | 10^25 FLOPS [10 septillion] | 10^26 FLOPS [100 septillion] |
| FLOPS per chip second | 8E14 [800 trillion] | 8E14 [800 trillion] |
| FLOPS per chip hour | 2.88E18 [2.88 quintillion] | 2.88E18 [2.88 quintillion] |
| Chip hours | 3.47M | 34.722M |
| Training compute cost (chip hours × US$2), rounded | $7M | $70M |
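Jack Clark’s threshold arithmetic can be reproduced as a back-of-envelope calculation. The H100 throughput of 8E14 FLOPS and the US$2 per chip hour rate are his assumptions, not figures from the regulations themselves:

```python
# Back-of-envelope reproduction of the chip-hours and cost rows above.
FLOPS_PER_CHIP_SECOND = 8e14                          # assumed H100 throughput
FLOPS_PER_CHIP_HOUR = FLOPS_PER_CHIP_SECOND * 3600    # = 2.88e18
COST_PER_CHIP_HOUR_USD = 2                            # assumed rental rate

for name, threshold in (("EU AI Act", 1e25), ("Biden Executive Order", 1e26)):
    chip_hours = threshold / FLOPS_PER_CHIP_HOUR
    cost = chip_hours * COST_PER_CHIP_HOUR_USD
    print(f"{name}: {chip_hours / 1e6:.3f}M chip hours, ~US${cost / 1e6:.1f}M")
```

The table’s dollar figures are rounded upward from roughly US$6.9M and US$69.4M; changing either assumption (chip throughput or hourly rate) shifts the cost estimates proportionally.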
The landscape of artificial intelligence is rapidly evolving, challenging our traditional metrics for assessing AI capabilities. The emergence of smaller, more efficient models that rival or surpass their larger counterparts marks a significant shift in the field. This development underscores the importance of looking beyond raw computational power or parameter count when evaluating AI systems.
As we approach what may be the peak of intelligence that humans can readily comprehend and evaluate, the implications extend far beyond mere technological progress. The future of AI is not simply about more powerful machines, but about AI’s potential to drive human evolution.
This paradigm shift is redefining our relationship with technology, opening new frontiers in human cognitive enhancement, problem-solving capabilities, and our understanding of intelligence itself.
As human intelligence and artificial intelligence converge, we are preparing for a transformative leap that will amplify our creative potential and propel humanity towards a future where the boundaries of cognition are limited only by our imagination.
What all this reminds me of
Source: Speed climbing, circa 2016.
Get The Memo
by Dr Alan D. Thompson · Be inside the lightning-fast AI revolution. Bestseller. 10,000+ readers from 142 countries. Microsoft, Tesla, Google...
Artificial intelligence that matters, as it happens, in plain English.
Get The Memo.
Dr Alan D. Thompson is an AI expert and consultant, advising Fortune 500s and governments on post-2020 large language models. His work on artificial intelligence has been featured at NYU, with Microsoft AI and Google AI teams, at the University of Oxford’s 2021 debate on AI Ethics, and in the Leta AI (GPT-3) experiments viewed more than 4.5 million times. A contributor to the fields of human intelligence and peak performance, he has held positions as chairman for Mensa International, consultant to GE and Warner Bros, and memberships with the IEEE and IET. Technical highlights.
This page last updated: 17/Sep/2024. https://lifearchitect.ai/mapping/