A brief list of important definitions in language models (e.g. GPT-3, GPT-J), datasets (e.g. Common Crawl), and AI.

What is a token?

A token is about 4 characters of English text, or about 0.7 words on average. A single token can be made up of different combinations of a-z, A-Z, 0-9, symbols, and other characters including whitespace. For simplicity, a token can usually be considered a syllable of a word.
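As a rough illustration of the two rules of thumb above, the sketch below estimates a token count from a character count and from a word count. The function name `estimate_tokens` is hypothetical, and the result is only an approximation, not the output of a real tokenizer.

```python
def estimate_tokens(text: str) -> dict:
    """Rough token estimates from the two rules of thumb (not a real tokenizer)."""
    n_chars = len(text)
    n_words = len(text.split())
    return {
        "by_chars": n_chars / 4,    # assumes about 4 characters per token
        "by_words": n_words / 0.7,  # assumes about 0.7 words per token
    }


if __name__ == "__main__":
    print(estimate_tokens("Who are you?"))
```

The two estimates rarely agree exactly, and rare or technical words tend to split into more tokens than either rule suggests.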

For a longer explanation: Tokenization is a way of dealing with words by breaking them up into subword units, drawn from a vocabulary of about 50,000 unique units built using byte pair encoding (BPE). This is particularly helpful with agglutinative or polysynthetic languages, where an effectively unlimited number of words can be created by combining morphemes. For example, the Yup’ik word tuntussuqatarniksaitengqiggtuq is composed of many morphemes that translate to “He had not yet said again that he was going to hunt reindeer”. Rather than training GPT-3 on the whole word tuntussuqatarniksaitengqiggtuq, it is more efficient to train on its BPE units: “t”, “unt”, “uss”, “u”, “q”, “at”, “arn”, “i”, “ks”, “ait”, “eng”, “q”, “igg”, “tu”, “q” (Thanks to literature review by Holly Grimm, OpenAI scholar).
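To make the idea of byte pair encoding more concrete, here is a toy sketch that learns merges from a tiny word list by repeatedly fusing the most frequent adjacent pair of symbols. It works on characters rather than bytes, the word list and the name `learn_bpe` are made up for illustration, and it is not the tokenizer used by GPT-3.

```python
from collections import Counter


def learn_bpe(words, num_merges):
    """Learn BPE merges from a list of words (toy, character-level)."""
    # Each distinct word becomes a tuple of symbols, with its frequency.
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # every word is already a single symbol
        # Pick the most frequent pair (ties go to the pair seen first).
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite every word, fusing each occurrence of the chosen pair.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_vocab[tuple(out)] += freq
        vocab = new_vocab
    return merges, vocab


if __name__ == "__main__":
    merges, vocab = learn_bpe(["low", "low", "lower", "lowest"], num_merges=2)
    print(merges)
    print(dict(vocab))
```

Running many more merges over a very large corpus is what produces a vocabulary of tens of thousands of subword units.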

Who are you?

1 Who (8727)
2 are (389)
3 you (345)
4 ? (30)
= 4 tokens

I am Alan.

1 I (40)
2 am (716)
3 Alan (12246)
4 . (13)
= 4 tokens

Acronym for deoxyribonucleic acid?

1 Ac (1282)
2 ro (131)
3 nym (4948)
4 for (329)
5 de (390)
6 oxy (23536)
7 rib (822)
8 on (261)
9 ucle (14913)
10 ic (291)
11 acid (7408)
12 ? (30)
= 12 tokens

Thanks to the token estimator by Andrew Mayne, OpenAI.


A parameter is a value on a connection inside the language model, set by the model itself as it learns during training. Parameters are sometimes called weights. For simplicity, a parameter count (for example, 175 billion) can be considered to be the count of connections between nodes in a neural network.
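The "count of connections" simplification can be sketched for a small, fully connected network. The layer sizes and the name `count_connections` are hypothetical, and real language models have a more involved layout, so this only illustrates the counting idea.

```python
def count_connections(layer_sizes, include_biases=False):
    """Count weights between consecutive fully connected layers."""
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        total += n_in * n_out  # one weight per connection between two nodes
        if include_biases:
            total += n_out     # optionally, one bias per receiving node
    return total


if __name__ == "__main__":
    # A toy network with 4 inputs, 8 hidden nodes, and 2 outputs:
    # 4*8 + 8*2 = 48 connections.
    print(count_connections([4, 8, 2]))
```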

Hyper-parameters are settings chosen by a human (usually a software developer). The ones described here are set while running (not training) the model, and include Temperature and Top-P.


Temperature: Originality. This hyper-parameter controls the randomness of the generated text in models like GPT-3 and GPT-J.

0 = Deterministic (low randomness): always generate the same output for a given input.
1 = Creative (high randomness): be adventurous when generating output for a given input.
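One common way to implement this kind of randomness control is to divide the model's raw scores by the temperature before converting them to probabilities. The sketch below assumes that approach. The scores are made up, the function names are hypothetical, and treating 0 as "always pick the top score" is a convention chosen here, not a statement about how any particular model handles it.

```python
import math
import random


def apply_temperature(scores, temperature):
    """Turn raw scores into probabilities, sharpened or flattened by temperature."""
    if temperature <= 0:
        # Treat 0 as deterministic: all probability goes to the highest score.
        best = max(range(len(scores)), key=lambda i: scores[i])
        return [1.0 if i == best else 0.0 for i in range(len(scores))]
    scaled = [s / temperature for s in scores]
    top = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample(scores, temperature, rng=random):
    """Pick the index of one continuation at the given temperature."""
    probs = apply_temperature(scores, temperature)
    return rng.choices(range(len(scores)), weights=probs, k=1)[0]


if __name__ == "__main__":
    scores = [2.0, 1.0, 0.1]  # made-up scores for three continuations
    for t in (0, 0.5, 1.0):
        print(t, [round(p, 3) for p in apply_temperature(scores, t)])
```

Lower temperatures push more of the probability onto the top-scoring continuation, so the output varies less between runs.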


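Top-P: Correctness. This hyper-parameter controls the sampling range of continuations: the most likely continuations are kept until their combined probability reaches the chosen value, and the rest are discarded. Its effect is similar to temperature.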

0 = Consider likely continuations (small pool, high accuracy).
1 = Consider all continuations (large pool, low accuracy).
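The pool of continuations can be sketched as follows: sort the continuations by probability, keep adding them until their combined probability reaches the chosen value, then rescale what is left. The probabilities are made up and `top_p_filter` is a hypothetical name. Real implementations may differ in details such as how ties and the exact cutoff are handled.

```python
def top_p_filter(probs, top_p):
    """Keep the most likely continuations until their combined probability reaches top_p."""
    # Indices of continuations, most likely first.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break  # the pool is large enough; discard the rest
    # Rescale the kept probabilities so they sum to 1 again.
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}


if __name__ == "__main__":
    probs = [0.5, 0.3, 0.15, 0.05]  # made-up probabilities for four continuations
    for p in (0.0, 0.9, 1.0):
        print(p, sorted(top_p_filter(probs, p)))
```

With a value near 0 only the single most likely continuation survives, while a value of 1 keeps the whole pool.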

In my work with the GPT-J model for Leta – Episode 10, I used the settings:

Temperature = 0.5 (medium randomness)
Top-P = 1 (large pool of continuations, lower accuracy)

Model names

Transformer: A neural network architecture that looks at all words in a text input, rather than just one or a few surrounding words. Originally designed for complex translation, especially for English to gendered languages like French. Open-sourced by Google in 2017.

GPT-3: Generative Pre-trained Transformer 3. Launched by OpenAI in May 2020.

PanGu Alpha: Launched by Huawei and others in April 2021.
Simplified Chinese: 盘古
Traditional Chinese: 盤古
Pinyin: Pán gǔ
Pronounced: pun-goo (rhymes with done tool)
English: Literal: ‘coil ancient’, first living being and the creator (coiled up in an egg).
Etymology: Mythical Chinese creation figure who emerged from a yin-yang egg and created the earth and sky (similar to the Christian creation story, and Pangu has been compared to Adam).

Wudao 2.0: Launched by the Beijing Academy of Artificial Intelligence (BAAI) and others in June 2021.
Simplified Chinese: 悟道
Traditional Chinese: 悟道
Pinyin: Wù dào
Pronounced: oo-dao (rhymes with tool now)
English: Literal: ‘Enlightenment’.
Etymology: Truth of the Dharma, the spiritual path.

LaMDA: Language Model for Dialogue Applications. Launched by Google in 2021.