Summary
Organization | Google DeepMind |
Model name | Gemini |
Internal/project name | – |
Model type | Multimodal |
Parameter count | Estimate: 1T-5T (1,000B-5,000B) |
Dataset size (tokens) | Estimate: 20T-100T (around 40TB-200TB). Note that Google’s monorepo Piper is 86TB and was used for training a code model in Jun/2023. |
Training data end date | Estimate: Dec/2022 |
Convergence date | Estimate: Jul/2023 |
Release date (public) | Estimate: Oct/2023 |
Paper | – |
Playground | – |
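For context, the parameter and token estimates above track the Chinchilla compute-optimal ratio of roughly 20 training tokens per parameter. Here is a minimal sketch of that arithmetic, assuming the 20:1 ratio and roughly 2 bytes per token (neither figure has been disclosed by Google):

```python
# Back-of-envelope check of the estimates above, assuming the
# Chinchilla compute-optimal ratio of ~20 training tokens per
# parameter (Hoffmann et al., 2022) and ~2 bytes per token.
# Both ratios are assumptions for this sketch, not Google figures.

TOKENS_PER_PARAM = 20  # Chinchilla rule of thumb
BYTES_PER_TOKEN = 2    # rough average for tokenized web text

for params_t in (1, 5):                     # parameters, trillions
    tokens_t = params_t * TOKENS_PER_PARAM  # tokens, trillions
    size_tb = tokens_t * BYTES_PER_TOKEN    # raw data, terabytes
    print(f"{params_t}T params -> ~{tokens_t}T tokens (~{size_tb}TB)")

# 1T params -> ~20T tokens (~40TB)
# 5T params -> ~100T tokens (~200TB)
```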
Chart. 2023-2024 optimal language model size highlights.

Gemini Updates
1/Jun/2023: Google DeepMind trains an LLM (DIDACT) on iterative code in Piper, their 86TB monorepo (2016 PDF). Using The Pile’s calculation (paper) of 0.4412 tokens per byte, this dataset would be around 37.9T tokens, about twice the size of the next-biggest known dataset, the one estimated for GPT-4. This suggests that the rumored data scarcity would not be an obstacle to training Gemini (see the arithmetic sketch in the Dataset section below).
24/May/2023: The Google DeepMind partnership leads to DeepMind’s Flamingo 80B (my video Part 1, Part 2) being applied to YouTube Shorts for video summarization and search optimization. ‘It automatically generates descriptions for hundreds of millions of videos in their metadata, making them more searchable.’ (via DeepMind)
DeepMind Flamingo (Apr/2022) is a phenomenal visual language model, and was in many ways a precursor to OpenAI’s GPT-4 (Mar/2023), sharing several design concepts. And this is the best use case they can come up with? Hmmm…
10/May/2023: Google CEO (Google blog):
We’re already at work on Gemini — our next model created from the ground up to be multimodal, highly efficient at tool and API integrations, and built to enable future innovations, like memory and planning. Gemini is still in training [as of 10/May/2023], but it’s already exhibiting multimodal capabilities never before seen in prior models. Once fine-tuned and rigorously tested for safety, Gemini will be available at various sizes and capabilities, just like PaLM 2, to ensure it can be deployed across different products, applications, and devices for everyone’s benefit. – Google blog (10/May/2023).
20/Apr/2023: Google DeepMind. Announced by DeepMind CEO (DeepMind blog, and confirmed via the Google Blog):
…DeepMind and the Brain team from Google Research will be joining forces as a single, focused unit called Google DeepMind… bringing together our world-class talent in AI with the computing power, infrastructure and resources to create the next generation of AI breakthroughs and products across Google and Alphabet…
1/Aug/2022: My Google Pathways report was released, providing rigorous analysis of the design and development of Google’s models including PaLM, PaLM-Coder, Parti, and Minerva.
20/Apr/2018: Background on Google and DeepMind relationship from 2018 (The Information via MobileSyrup):
…some Google developers who are part of other AI research divisions at the company, such as Google Brain, are not happy that DeepMind doesn’t generate much revenue for the company.
…staff members are upset that DeepMind has “special status” within Alphabet that allows it to work on projects that might not yield results for years [Alan: this article is from 2018, and the most recent ‘merger’ happened five years later in 2023]…
…DeepMind had difficulty working with the time zone difference between London, England and [San Francisco, California].
DeepMind is a very private company and, according to the report, it objected to a “powered by DeepMind” tag on some of the Google products it helped create.
Google purchased DeepMind in 2014 for a reported $600 million; the lab is best known for creating the AlphaGo program that beat the world’s top player in the game of Go.
Dataset
Dataset for Gemini via Google’s Piper monorepo (estimate)
The Gemini dataset could be made up of a large amount of code, to support reasoning (see papers 1, 2) within the final trained model. Google’s internal monorepo, Piper, is 86TB (2016 PDF). Using The Pile’s calculation (paper) of 0.4412 tokens per byte, this dataset would be around 37.9T tokens, about twice the size of the next-biggest known dataset, the one estimated for GPT-4.
The following table is currently in draft, and will be finalized for my mid-2023 AI report (get early access as a full member of The Memo).
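In the meantime, here is a minimal sketch of the bytes-to-tokens conversion behind the 37.9T figure above (treating 1TB as 10^12 bytes is an assumption for this back-of-envelope estimate):

```python
# Converting Piper's reported size (86TB, per Google's 2016 paper)
# to an estimated token count using The Pile's measured ratio of
# 0.4412 tokens per byte. Treating 1TB as 1e12 bytes is an
# assumption for this back-of-envelope estimate.

TOKENS_PER_BYTE = 0.4412  # The Pile (Gao et al., 2020)
PIPER_BYTES = 86e12       # 86TB

tokens = PIPER_BYTES * TOKENS_PER_BYTE
print(f"Estimated tokens: {tokens / 1e12:.1f}T")  # ≈ 37.9T
```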
Dataset for Gemini via MassiveText (estimate)
The Gemini dataset could also draw on DeepMind’s MassiveText (multilingual) 5T-token dataset (see the Improving language models by retrieving from trillions of tokens paper and my What’s in my AI? paper).
Please note that the following table is a ‘best guess’ by Alan (not confirmed by Google DeepMind), based on available information about DeepMind’s state-of-the-art MassiveText (multilingual) dataset plus an estimated 1,000B tokens of discussion data.
Count | Dataset | Percentage tokens | Raw Size (GB) | Tokens (B) |
1 | Books (en) | 68.11% | 12,853GB | 3,423B |
2 | Discussion (multilingual)* | x% | 3,750GB | 1,000B* |
3 | Web: C4 (multilingual) | 19.45% | 3,656GB | 977B |
4 | Code: Github | 7.46% | 2,754GB | 375B |
5 | News (en) | 4.71% | 888GB | 237B |
6 | Wikipedia (multilingual) | 0.26% | 48GB | 13B |
– | Totals | – | 23,949GB (23.9TB) | 6,000B (6T) |
* Alan’s estimate only.
Table. MassiveText multilingual dataset estimates. Rounded. Disclosed in bold (from DeepMind’s MassiveText multilingual dataset). Determined in italics. For similar models, see my What’s in my AI paper.
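As a cross-check, here is a minimal Python sketch that reproduces the table’s totals and percentages. Note that the disclosed percentages are computed against the ~5T-token MassiveText base, so the estimated discussion row is excluded from the denominator:

```python
# Cross-check of the MassiveText (multilingual) estimates above.
# The discussion row is Alan's estimate only, so it is excluded
# from the percentage denominator (the disclosed ~5T-token base).

rows = [
    # (dataset, raw_gb, tokens_b, disclosed)
    ("Books (en)",                12_853, 3_423, True),
    ("Discussion (multilingual)",  3_750, 1_000, False),
    ("Web: C4 (multilingual)",     3_656,   977, True),
    ("Code: Github",               2_754,   375, True),
    ("News (en)",                    888,   237, True),
    ("Wikipedia (multilingual)",      48,    13, True),
]

total_gb = sum(gb for _, gb, _, _ in rows)
total_b = sum(t for _, _, t, _ in rows)
base_b = sum(t for _, _, t, d in rows if d)

print(f"Totals: {total_gb:,}GB, {total_b:,}B tokens")  # 23,949GB, ~6,000B
for name, _, tokens_b, disclosed in rows:
    if disclosed:
        print(f"{name}: {100 * tokens_b / base_b:.2f}%")
# Reproduces the disclosed percentages to within rounding.
```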
Timeline to Gemini
Date | Milestone |
31/Aug/2017 | Google: Transformer released. |
28/Jan/2020 | Google: Meena announced. |
18/May/2021 | Google: LaMDA announced. |
4/Apr/2022 | Google: PaLM 1 announced. |
12/Apr/2022 | DeepMind: Chinchilla announced. |
28/Apr/2022 | DeepMind: Flamingo announced. |
12/May/2022 | Google: LaMDA 2 released. |
12/May/2022 | DeepMind: Gato announced. |
26/Dec/2022 | Google/DeepMind: Med-PaLM 1 announced. |
20/Apr/2023 | Google Brain and DeepMind merge to form Google DeepMind. |
10/May/2023 | Google: PaLM 2 released. |
1/Jun/2023 | Google DeepMind: DIDACT code model trained on ≈37.9T tokens (Alan’s estimate). |
Next… | Google DeepMind: Gemini. |
DeepMind models
AI Race
Download source (PDF)
Permissions: Yes, you can use these visualizations anywhere, please leave the citation intact.
Video
AGI
Read more about Alan’s conservative countdown to AGI…
Get The Memo
by Dr Alan D. Thompson · Be inside the lightning-fast AI revolution. Thousands of paid subscribers. Readers from Microsoft, Tesla, Google AI...
Artificial intelligence that matters, as it happens, in plain English.
Get The Memo.

This page last updated: 6/Jun/2023. https://lifearchitect.ai/gemini/