Mapping IQ, MMLU, MMLU-Pro, and GPQA



Alan D. Thompson
August 2024 (updated December 2024)

“[o1] destroyed the most popular reasoning benchmarks… They’re now crushed.”
— Dr Dan Hendrycks, creator of MMLU and MATH (Reuters, 17/Sep/2024)

Viz

Download source (PDF)
Permissions: Yes, you can use these visualizations anywhere; please leave the citation intact.

Mapping table (updated Sep/2024 via OpenAI o1)

OpenAI o1 (Sep/2024) assisted with the following table, based on all models in the Models Table that have an MMLU-Pro score (usually alongside MMLU and GPQA scores). The o1 model thought for 153 seconds to produce this output. There is a fair error margin (estimated ±5%) in final scores, as well as variation between testing methodologies and reporting, so these correlations should be used as a rough guide only.

  • MMLU-Pro Score: Approximately calculated as MMLU Score - 20.
  • GPQA Score: Approximately calculated as 2 × MMLU Score - 110.
MMLU    MMLU-Pro    GPQA
95      75          80
90      70          70
85      65          60
80      60          50
75      55          40
70      50          30
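
As a rough sketch only, the two approximations above can be expressed directly in code; the function name and the clamping to 0–100 are my own additions, not part of the o1 output.

```python
def estimate_scores(mmlu: float) -> dict:
    """Approximate MMLU-Pro and GPQA scores from an MMLU score (0-100),
    using the linear mappings above. Treat results as +/-5 points at best."""
    clamp = lambda x: max(0.0, min(100.0, x))
    return {
        "MMLU": mmlu,
        "MMLU-Pro": clamp(mmlu - 20),      # MMLU-Pro ~= MMLU - 20
        "GPQA": clamp(2 * mmlu - 110),     # GPQA ~= 2 x MMLU - 110
    }

# Reproduces the first row of the table above:
print(estimate_scores(95))  # {'MMLU': 95, 'MMLU-Pro': 75, 'GPQA': 80}
```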

Notes

Benchmark results are from primary sources as compiled in my Models Table.

MMLU is no longer a reliable benchmark: Gema et al. (2024) found that over 9% of its questions contain errors, based on expert review of 3,000 randomly sampled questions, which puts its ‘uncontroversially correct’ ceiling at about 90%. Read more: https://arxiv.org/pdf/2406.04127#page=5 and Errors in the MMLU: The Deep Learning Benchmark is Wrong Surprisingly Often – Daniel Erenrich.

GPQA has an ‘uncontroversially correct’ ceiling of about 80%, depending on the subset. The questions on this benchmark are extremely difficult (designed and tested by PhDs), and even finding humans expert enough to both design the questions and provide ‘uncontroversially correct’ answers was a challenge. The paper notes (PDF, p8): ‘After an expert validator answers a question, we show them the question writer’s answer and explanation and ask them if they believe the question is objective and its answer is uncontroversially correct… On the full 546 questions, the first expert validator indicates positive post-hoc agreement on 80.1% of the questions, and the second expert validator indicates positive post-hoc agreement on 85.4% of the questions… Using a relatively conservative level of permissiveness, we estimate the proportion of questions that have uncontroversially correct answers at 74% on GPQA Extended. Because the main [448] and diamond [198, both experts agree] sets are filtered based on expert validator accuracy, we expect the objectivity rate to be even higher on those subsets.’

IQ results are by Prof David Rozado, as documented in my AI and IQ testing page: GPT-3.5-Turbo = 147; GPT-4 Classic = 152. The GPT-4 Classic result was replicated by clinical psychologist Eka Roivainen, who used the WAIS-III to assess ChatGPT (assumed to be GPT-4 Classic) and measured a verbal IQ of 155 (P99.987). Einstein never took an IQ test, but it is generally agreed that his IQ was around 160 (P99.996); see my Visualising Brightness report for the numbers.
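
For context on those percentile figures, they follow from treating IQ as normally distributed with mean 100 and standard deviation 15 (the Wechsler convention); the sketch below is my own illustration, not part of the Rozado or Roivainen analyses, and small differences come down to rounding.

```python
from math import erf, sqrt

def iq_percentile(iq: float, mean: float = 100.0, sd: float = 15.0) -> float:
    """Percentile rank of an IQ score under a normal distribution
    (Wechsler convention: mean 100, standard deviation 15)."""
    z = (iq - mean) / sd
    return 100.0 * 0.5 * (1.0 + erf(z / sqrt(2.0)))

print(round(iq_percentile(155), 3))  # ~99.988 (P99.987 above, truncated)
print(round(iq_percentile(160), 3))  # ~99.997 (P99.996 above, truncated)
```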

Error bars are not shown, but a fair error margin (estimated ±5%) can be assumed for all model results, and therefore for the benchmark mapping in this analysis. This is primarily due to data contamination during training, where benchmark data (test questions and answers) leaks into the training set and so into the final model. See the Llama 2 paper for a contamination discussion and working (PDF, pp75-76).
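
One consequence of the linear mappings above (my own observation): because the GPQA estimate multiplies the MMLU score by two, a ±5-point uncertainty on MMLU roughly doubles to ±10 points on the estimated GPQA score.

```python
def gpqa_from_mmlu(mmlu: float) -> float:
    """Same approximation as above: GPQA ~= 2 x MMLU - 110."""
    return 2 * mmlu - 110

mmlu, margin = 85, 5
print(gpqa_from_mmlu(mmlu - margin), gpqa_from_mmlu(mmlu + margin))
# 50 70 -> a +/-5 swing on MMLU moves the GPQA estimate by +/-10
```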

Conclusions

Parameter count and tokens trained are no longer a precise indicator of power. Smaller models can now outperform larger models, as illustrated by the benchmarks for GPT-4 Classic (1760B) vs GPT-4o mini (8B), which show similar performance, and by many other post-2023 models.

Therefore, as of 2024, both the Biden AI Executive Order and the EU AI Act are flawed. Article 51 of the EU AI Act specifies 10^25 floating point operations (FLOPs) as the threshold at which a general-purpose AI model becomes subject to additional regulatory requirements. In the US, President Biden’s Executive Order 14110 on the Safe, Secure, and Trustworthy Development and Use of Artificial Intelligence specifies 10^26 FLOPs as the threshold for reporting obligations to the US Federal Government (Section 4.2(b)(i)). Jack Clark has put these training thresholds into plain English and US dollars for NVIDIA H100 chips (28/Mar/2024):

Metric                                               EU AI Act (Mar/2024)          Biden Executive Order (Oct/2023)
Training compute threshold                           10^25 FLOPS [10 septillion]   10^26 FLOPS [100 septillion]
FLOPS per chip second                                8E14 [800 trillion]           8E14 [800 trillion]
FLOPS per chip hour                                  2.88E18 [2.88 quintillion]    2.88E18 [2.88 quintillion]
Chip hours                                           3.47M                         34.722M
Training compute cost (chip hours × US$2), rounded   $7M                           $70M
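
These figures can be reproduced with simple arithmetic. The sketch below assumes the same inputs Jack Clark used: a sustained 8E14 FLOPS per H100 and roughly US$2 per chip hour.

```python
def compute_threshold_cost(flop_threshold: float,
                           flops_per_chip_second: float = 8e14,
                           usd_per_chip_hour: float = 2.0) -> tuple[float, float]:
    """Chip hours and rough dollar cost to train up to a compute threshold."""
    flops_per_chip_hour = flops_per_chip_second * 3600  # 2.88e18
    chip_hours = flop_threshold / flops_per_chip_hour
    return chip_hours, chip_hours * usd_per_chip_hour

for label, threshold in [("EU AI Act", 1e25), ("Biden Executive Order", 1e26)]:
    hours, cost = compute_threshold_cost(threshold)
    print(f"{label}: {hours / 1e6:.2f}M chip hours, ~US${cost / 1e6:.1f}M")
# EU AI Act: 3.47M chip hours, ~US$6.9M  (rounded to $7M above)
# Biden Executive Order: 34.72M chip hours, ~US$69.4M  (rounded to $70M above)
```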

The landscape of artificial intelligence is rapidly evolving, challenging our traditional metrics for assessing AI capabilities. The emergence of smaller, more efficient models that rival or surpass their larger counterparts marks a significant shift in the field. This development underscores the importance of looking beyond raw computational power or parameter count when evaluating AI systems.

As we approach what may be the peak of intelligence that humans can readily comprehend and evaluate, the implications extend far beyond mere technological progress. The future of AI is not simply about more powerful machines, but about its potential to drive human evolution.

This paradigm shift is redefining our relationship with technology, opening new frontiers in human cognitive enhancement, problem-solving capabilities, and our understanding of intelligence itself.

As human intelligence and artificial intelligence converge, we are preparing for a transformative leap that will amplify our creative potential and propel humanity towards a future where the boundaries of cognition are limited only by our imagination.

What all this reminds me of

Source: Speed climbing, circa 2016.

Video


Dr Alan D. Thompson is an AI expert and consultant, advising Fortune 500s and governments on post-2020 large language models. His work on artificial intelligence has been featured at NYU, with Microsoft AI and Google AI teams, at the University of Oxford’s 2021 debate on AI Ethics, and in the Leta AI (GPT-3) experiments viewed more than 5 million times. A contributor to the fields of human intelligence and peak performance, he has held positions as chairman for Mensa International, consultant to GE and Warner Bros, and memberships with the IEEE and IET. Technical highlights.

This page last updated: 21/Dec/2024. https://lifearchitect.ai/mapping/