Chinchilla data-optimal scaling laws: In plain English

👋 Hi, I’m Alan. I advise government and enterprise on post-2020 AI like OpenAI’s upcoming GPT-5, and Google’s ongoing Pathways and Gemini models. You definitely want to keep up with the AI revolution this year. My paid subscribers (DeepMind, Microsoft, Google, Stripe, Samsung…) receive bleeding-edge and exclusive insights on AI as it happens.
Get The Memo.

Alan D. Thompson
February 2023

Summary: Chinchilla showed that we should be using around 11× more data during training than was used for GPT-3 and similar models. This means we need to source, clean, and filter around 33TB of text data for a 1T-parameter model.

How much text data should we use when training a text-based large language model (LLM)?

Over the three years to 2023, there have been a few discoveries, through a process of trial and error…

(Note: There is a complementary scaling law for compute built into these findings, but this is outside the scope of my current focus.)

In May/2020, OpenAI (GPT-3 paper) tacitly announced their data scaling laws (also called the Kaplan scaling laws, after OpenAI’s Jan/2020 paper Scaling Laws for Neural Language Models) for LLMs:

In plain English, GPT-3/Kaplan scaling laws said that…
300B tokens can be used to train an LLM of size 175B parameters
So, we need around 1.7 text tokens per parameter

In Mar/2022, DeepMind (Chinchilla paper) found new data scaling laws (also called the Chinchilla or Hoffmann scaling laws) for ‘data optimal’ LLMs:

In plain English, Chinchilla/Hoffmann scaling laws say that…
1,400B (1.4T) tokens should be used to train a data-optimal LLM of size 70B parameters
So, we need around 20 text tokens per parameter
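The two rules of thumb above reduce to a couple of lines of arithmetic. A minimal sketch (variable names are mine, figures are from the two papers):

```python
# Kaplan / GPT-3: 300B tokens for a 175B-parameter model
KAPLAN_TOKENS_PER_PARAM = 300e9 / 175e9      # ≈ 1.7 tokens per parameter

# Chinchilla / Hoffmann: 1.4T tokens for a 70B-parameter model
CHINCHILLA_TOKENS_PER_PARAM = 1400e9 / 70e9  # = 20 tokens per parameter

print(round(KAPLAN_TOKENS_PER_PARAM, 1))   # 1.7
print(round(CHINCHILLA_TOKENS_PER_PARAM))  # 20
```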

Therefore, to make GPT-3 data optimal, and…

Keeping the original 300B tokens, GPT-3 should have been only 15B parameters (300B tokens ÷ 20).
This is around 11× smaller in terms of model size.


To get to the original 175B parameters, GPT-3 should have used 3,500B (3.5T) tokens (175B parameters × 20). 3.5T tokens is about 4-6TB of data, depending on tokenization and tokens per byte.
This is around 11× larger in terms of data needed.
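Both directions of that correction can be checked with a short sketch, assuming only the 20:1 Chinchilla ratio (the function names are mine, for illustration):

```python
def chinchilla_optimal_params(tokens: float, ratio: float = 20.0) -> float:
    """Data-optimal parameter count for a fixed token budget."""
    return tokens / ratio

def chinchilla_optimal_tokens(params: float, ratio: float = 20.0) -> float:
    """Data-optimal token budget for a fixed parameter count."""
    return params * ratio

# GPT-3's 300B tokens would have been optimal for a ~15B-parameter model:
print(chinchilla_optimal_params(300e9) / 1e9)   # 15.0

# A 175B-parameter model would have needed ~3.5T tokens:
print(chinchilla_optimal_tokens(175e9) / 1e12)  # 3.5
```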

The data optimization scale continues for model sizes measured in trillions of parameters, and training data measured in quadrillions of text tokens or petabytes of text data. The table and explanation below originally appeared in the Jun/2022 report, The sky is bigger than we imagine.


Model size | Tokens (round) | Training data used (estimate) | How much data is that? (if 1 book ≈ 500KB of text)
70B | 1.4 Trillion | 2.3TB | More books than in the Kindle store on Amazon US (4.6M).
250B | 5 Trillion | 8.3TB | All 30 libraries at Yale University (16.6M).
500B | 10 Trillion | 16.6TB | The Google Books collection (33.2M).
1T | 20 Trillion | 33.3TB | The US Library of Congress (66.6M).
10T | 200 Trillion | 333TB | All US public libraries combined (666M).
100T | 2 Quadrillion | 3.3PB | All bibles ever sold worldwide (6.6B).
250T | 5 Quadrillion | 8.3PB | A stack all the way to the Moon (16.6B).
500T | 10 Quadrillion | 16.6PB | 4 books about every living human (33.2B).
Table: Dataset sizes needed to align with Chinchilla data optimization for models.
Note: Text estimates only; multimodal data not shown. Jun/2022. Figures used for the comparisons:
Kindle ≈ 6M books (estimate)
Yale ≈ 15M items
Google Books ≈ 25M books
US Library of Congress ≈ 51M cataloged books
British Library ≈ 170M items
US public libraries ≈ 732M books (note that this count includes many duplicates)
Bibles ≈ 5B copies
Earth to Moon ≈ 384,400km ≈ 38,440,000,000cm; each book spine ≈ 2.4cm thick ≈ 16B books
Human population ≈ 8B (Jun/2022)
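The table’s round figures can be reproduced from two assumed conversion factors: roughly 1.665 bytes per token (implied by 20T tokens ≈ 33.3TB) and about 500KB of text per book. A sketch, using the 1T-parameter row:

```python
BYTES_PER_TOKEN = 1.665  # rough assumption implied by the table (20T tokens ~ 33.3TB)
BYTES_PER_BOOK = 500e3   # ~500KB of text per book (the table's estimate)

def chinchilla_table_row(params: float):
    """Tokens, bytes, and book-equivalents for a data-optimal model of this size."""
    tokens = params * 20  # Chinchilla: 20 tokens per parameter
    data_bytes = tokens * BYTES_PER_TOKEN
    books = data_bytes / BYTES_PER_BOOK
    return tokens, data_bytes, books

tokens, data_bytes, books = chinchilla_table_row(1e12)  # the 1T-parameter row
print(tokens / 1e12)      # 20.0  (trillion tokens)
print(data_bytes / 1e12)  # ~33.3 (TB)
print(books / 1e6)        # ~66.6 (million books)
```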

There are a few caveats to my approximate numbers in the table above. Firstly, the ‘More books than in…’ examples are provided for text-based book data only (no pictures), and this assumes that books are about 500KB each without images (500KB ≈ 500K characters ≈ 75K words ≈ 300 pages per book; simplified and rounded for easy figures). We are now of course exploring training AI with multimodal data: images, music, control signals (robots, button presses), and anything else we can get our hands on. These increasing sizes also use simplified and rounded estimates only, based on the new findings related to model scaling using more data (measured by number of tokens, which are roughly equivalent to words).

In 2010, Google estimated that there are only about 130M unique published books in existence, so past 1T parameters (20T tokens), training data collection would naturally have to rely on alternative text-based and multimodal content. At brain-scale parameter counts of 500T (10Q tokens), the estimated book count would be over 250 times the number of books ever published, or more than four new books written about each living human on Earth!
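As a rough back-of-the-envelope check (reusing the assumed ~1.665 bytes per token and ~500KB per book from the table above), all unique published books together would only support a data-optimal model of around 2T parameters:

```python
UNIQUE_BOOKS = 130e6     # Google's 2010 estimate of unique published books
BYTES_PER_BOOK = 500e3   # ~500KB of text per book (assumed)
BYTES_PER_TOKEN = 1.665  # rough ratio implied by the table (20T tokens ~ 33.3TB)

total_bytes = UNIQUE_BOOKS * BYTES_PER_BOOK   # all book text ever published
total_tokens = total_bytes / BYTES_PER_TOKEN  # as tokens
optimal_params = total_tokens / 20            # Chinchilla: 20 tokens per parameter

print(total_bytes / 1e12)     # ~65   (TB)
print(total_tokens / 1e12)    # ~39   (trillion tokens)
print(optimal_params / 1e12)  # ~1.95 (trillion parameters)
```

So books alone run out somewhere between the 1T and 10T rows of the table, which is why larger models must lean on web text and multimodal sources.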

Fundamentally, it should not be an incredibly onerous process to collect petabytes of high-quality and filtered multimodal data (converted to text), though that task has not yet been accomplished by any AI lab to date (Jun/2022). 

Viz of selected models showing tokens:parameters ratio

Table of current models showing tokens:parameters ratio

Summary of current models: View the full data (Google sheets)

It is expected that 2023 large language models will continue to follow the Chinchilla scaling laws, though there will be new discoveries about data optimization and data use during training. For example, there is some research on whether data can ‘repeat’ (be seen more than once) during training, which may reduce the amount of new data that needs to be sourced.

DeepMind models to Dec/2022

Videos on scaling and Chinchilla models


Dr Alan D. Thompson is an AI expert and consultant, advising Fortune 500s and governments on post-2020 large language models. His work on artificial intelligence has been featured at NYU, with Microsoft AI and Google AI teams, at the University of Oxford’s 2021 debate on AI Ethics, and in the Leta AI (GPT-3) experiments viewed more than 3.5 million times. A contributor to the fields of human intelligence and peak performance, he has held positions as chairman for Mensa International, consultant to GE and Warner Bros, and memberships with the IEEE and IET. He is open to consulting and advisory on major AI projects with intergovernmental organizations and enterprise.

This page last updated: 16/Apr/2023.