AI + IQ testing (human vs GPT-3 vs J1 vs PaLM)

Note: While I would love to facilitate cognitive testing of current language models, nearly all popular IQ instruments preclude testing of a ‘written word only’ language model; the most common instruments from Wechsler and Stanford-Binet require a test candidate who can respond across verbal, auditory, visual, and even kinaesthetic modalities. For this reason, AI labs generally use customised testing suites focused on written material only. A selection of these benchmarks is visualised below. Please see the full data for context and references.

On SAT questions, GPT-3 scored 15% higher than the average college applicant.
On trivia questions, models like GPT-3 and J1 score up to 40% higher than the average human.


Download source (PDF)
Tests: View the data (Google sheets)
Trivia test (by WCT): Human: 52%, GPT-3: 73%, J1: 55.4%


Todo (Apr/2022):
NYT reported “In December 2021, DeepMind announced that its L.L.M. Gopher scored results on the RACE-h benchmark — a data set with exam questions comparable to those in the reading sections of the SAT — that suggested its comprehension skills were equivalent to that of an average high school student.”
RACE-h comprises complex reading comprehension questions for high-school students:
Average human (Amazon Mechanical Turk worker) = only 69.4%; ceiling = 94.2% (link).
PaLM 540B = 54.6% (few-shot).
Gopher 280B = 71.6%.
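For quick comparison, the RACE-h figures above can be tabulated with a few lines of code (a minimal sketch; the script and its labels are illustrative, while the percentages are the ones reported above):

```python
# Reported RACE-h accuracy (%), as cited in this post.
scores = {
    "Average human (Mechanical Turk)": 69.4,
    "Human ceiling": 94.2,
    "PaLM 540B (few-shot)": 54.6,
    "Gopher 280B": 71.6,
}

# Rank entries from highest to lowest accuracy and print a simple table.
for name, acc in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{name:32s} {acc:5.1f}%")
```

On these reported numbers, Gopher 280B edges past the average human rater, while both remain well below the human ceiling.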

“GPT-3 has its own form of fluid and crystalline intelligence. The crystalline part is all of the facts it has accumulated, and the fluid part is its ability to make logical deductions from learning the relationships between things.”
— OpenAI, Feb/2022


“I think GPT-3 is artificial general intelligence, AGI. I think GPT-3 is as intelligent as a human. And I think that it is probably more intelligent than a human in a restricted way… in many ways it is more purely intelligent than humans are. I think humans are approximating what GPT-3 is doing, not vice versa.”
— Connor Leahy, co-founder of EleutherAI, creator of GPT-J (November 2020)

Binet assessment with Leta AI

An early version of the Binet (1905) assessment was run with Leta AI in May 2021. It was very informal, only a small selection of questions was used, and it should be considered a fun experiment only.


Dr Alan D. Thompson is an AI expert and consultant. With Leta (an AI powered by GPT-3), Alan co-presented a seminar called ‘The new irrelevance of intelligence’ at the World Gifted Conference in August 2021. His applied AI research and visualisations are featured across major international media, including citations in the University of Oxford’s debate on AI Ethics in December 2021. He has held positions as chairman for Mensa International, consultant to GE and Warner Bros, and memberships with the IEEE and IET. He is open to consulting and advisory on major AI projects with intergovernmental organisations and enterprise.

This page last updated: 24/Jun/2022.