|Internal/project name||Multimodal IT (mmit)|
|Model type||Multimodal, dense|
|Parameter count||Alan estimates: 1-2T (1,000B-2,000B). Estimate updated in Dec/2023. See also: GPT-5. Disclosed for smaller models: Nano-1 at 1.8B, Nano-2 at 3.25B.|
|Dataset size (tokens)||Alan estimates: 20T-40T for Ultra. Estimate updated in Dec/2023.|
|Training data end date||Alan estimates: Dec/2022|
|Training start date||Alan estimates: May/2023|
|Training end/convergence date||Alan estimates: Aug/2023 for Pro (Ultra is being fine-tuned until 2024).|
|Training time (total)||TPUv4 and TPUv5 for ~120 days (compare with GPT-5 estimated @ ~25,000 H100/A100 for ~120 days, GPT-4 @ ~25,000 A100s for ~90 days, and GPT-3 @ ~1,024 A100s for 34 days)|
|Release date (public)||6/Dec/2023 for Pro, 2024 for Ultra|
|Paper||Gemini 1 Report.pdf|
|Playground||Bard (US) and Google Cloud Vertex AI (13/Dec/2023)|
4/Dec/2023: Google plans a ‘virtual preview’ of Gemini this week. (TI)
3/Nov/2023: Gemini delayed. DeepMind CEO: ‘Gemini’s going very well. We’re very happy with it, and it’s looking very good. It’s still in sort of training or fine-tuning, and we’ll have more information to share on that very soon. [Likely 2024 now?] We’ll see.’ (CNBC, 14m17s)
24/Oct/2023: Gemini delayed. Google’s Q3 earnings call possibly suggests a delay of Gemini’s release to 2024:
Google CEO: I’m very excited at the progress there and as we’re working through getting the model ready.
To me, more importantly, we are just really laying the foundation of what I think of as the next-generation series of models we’ll be launching all throughout 2024. The pace of innovation is extraordinarily impressive to see. We are creating it from the ground up to be multimodal, highly efficient with tool and API integrations, and more importantly, laying the platform to enable future innovations as well.
We are developing Gemini in a way that it is going to be available at various sizes and capabilities, and we’ll be using it immediately across all our products internally as well as bringing it out to both developers and Cloud customers through Vertex.
So I view it as a journey, and each generation is going to be better than the other. And we are definitely investing, and the early results are very promising.
16/Oct/2023: First leaked images of Google DeepMind Gemini (-via Bedros Pamboukian).
Gemini = Jetway = Multimodal IT (mmit) M (medium size?)
MakerSuite = Alkali = New GoogleDeepMind AI platform launched in Mar/2023.
Stubbs = Create apps. ‘This will create fully fledged apps with working code’.
The MakerSuite interface to Gemini provides the following UX options (interpreted from the original screenshot by GPT-4V!):
- Freeform prompt: A freeform way to experiment with language models.
- Data prompt: A table that uses rows and columns to organize prompts.
- Chat prompt: A template for back-and-forth chatbot conversations.
- Tuned model: Improve model responses by using more examples.
- Stubbs: Build and launch working Stubbs.
The context window looks to be only 4,096 tokens, about 3,000 words or 6 pages of text.
Alan’s context window calculations.
a. Using ‘standard’ BPE.
b. 1 token≈0.75 words. 1 word≈1.33 tokens.
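The rule-of-thumb conversions above can be sketched in a few lines. Note that the 0.75 and 1.33 ratios are the rough BPE estimates from the list, not exact tokenizer figures:

```python
# Rule-of-thumb BPE conversions (estimates, not exact tokenizer output):
# 1 token ~= 0.75 words, 1 word ~= 1.33 tokens.
WORDS_PER_TOKEN = 0.75
TOKENS_PER_WORD = 1.33

def tokens_to_words(tokens: int) -> int:
    """Approximate word count that fits in a given context window."""
    return round(tokens * WORDS_PER_TOKEN)

def words_to_tokens(words: int) -> int:
    """Approximate token count for a given word count."""
    return round(words * TOKENS_PER_WORD)

print(tokens_to_words(4_096))   # 3072 words (college essay)
print(tokens_to_words(32_000))  # 24000 words (complete screenplay)
```

This reproduces the rows in the table below, e.g. a 4,096-token window holds roughly 3,072 words.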
|Context window (tokens)||Words||Example|
|32,000||24,000 words||Complete screenplay, film script*|
|8,000||6,000 words||Short story (12 pages)|
|4,096||3,072 words||College essay|
|4,000||3,000 words||College essay|
|2,048||1,536 words||3 pages|
|1,024||768 words||News article|
|512||384 words||Less than 1 page|
– Avengers: Endgame (2019) @ 24,000 words
– Forrest Gump (1994) @ 25,000 words
– Jurassic Park (1993) @ 16,000 words
– Glengarry Glen Ross (1992) @ 14,000 words
– Aladdin (1992) @ 17,000 words
Table. GPT context window and word count. Rounded. Determined in italics.
It is expected that Gemini’s context window will be revealed to be much higher (32k) for the model’s public release.
Download source (PDF)
12/Oct/2023: Google VP: ‘I’ve seen some pretty amazing things. Like, I’m trying to bake a cake, draw me 3 pictures of the steps to how to ice a three layer cake, and Gemini will actually create those images. These are completely novel pictures. These are not pictures from the internet. It’s able to speak in imagery with humans now, not just text.’ (BI)
18/Sep/2023: ‘Google has allegedly given a group of testers access to an early version of Gemini, suggesting a public release of the product is soon’ (UC Today)
16/Aug/2023: Paywalled update:
– “Gemini will … combin[e] the text capabilities of LLMs like GPT-4 with the ability to create AI images based on a text description, similar to AI-image generators Midjourney and Stable Diffusion … Gemini’s image capabilities haven’t been previously reported.”
– Expected availability via GCP in the fall [US fall is 23/Sep – 21/Dec].
– Google “may start using it in some products before then”.
– “it could also integrate video and audio [trained from YouTube] into the Gemini models”.
– “Two longtime DeepMind executives, Oriol Vinyals and Koray Kavukcuoglu, are in charge of Gemini alongside Jeff Dean … They oversee hundreds of employees involved in Gemini’s development.”
– “Google’s lawyers have been closely evaluating the training. In one instance, they made researchers remove training data that had come from textbooks—which could help the model answer questions about subjects like astronomy or biology—over concerns about pushback from copyright holders.” (TI)
2/Aug/2023: GDM Soft MoE: ‘a fully-differentiable sparse Transformer… maintaining the benefits of MoEs. Soft MoE performs an implicit soft assignment by passing different weighted combinations of all input tokens to each expert… Soft MoE greatly outperforms standard Transformers (ViTs) and popular MoE variants (Tokens Choice and Experts Choice).’ (arXiv)
28/Jul/2023: GDM RT-2: ‘co-fine-tune state-of-the-art vision-language models on both robotic trajectory data and Internet-scale vision-language tasks, such as visual question answering.’ (project page)
20/Jul/2023: Gemini will be released by Dec/2023: ‘Gemini is Google’s attempt to build a general-purpose AI program that can rival OpenAI’s GPT-4 model, which powers a paid version of ChatGPT. Demis Hassabis, the Google executive overseeing the project, told employees during a recent companywide meeting that the program would become available later this year…’ (WSJ)
11/Jul/2023: Interview with Verge: ‘Gemini… is our next-generation multimodal large models — very, very exciting work going on there, combining all the best ideas from across both world-class research groups [DeepMind and Google AI]. It’s pretty impressive to see… Today’s chatbots will look trivial by comparison to I think what’s coming in the next few years… ‘ (Verge)
11/Jul/2023: Interview with NYT: ‘And so what I think is going to happen in the next era of systems — and we’re working on our own systems called Gemini — is that I think there’s going to be a combination of the two things [general and specialized]. So we’ll have this increasingly more powerful general system that you basically interact with through language but has other capabilities, general capabilities, like math and coding, and perhaps some reasoning and planning, eventually, in the next generations of these systems… There should be specialized A.I. systems that learn how to do those things — AlphaGo, AlphaZero, AlphaFold. And actually, the general system can call those specialized A.I.s as tools.’ (NYT)
26/Jun/2023: “At a high level you can think of Gemini as combining some of the strengths of AlphaGo-type systems with the amazing language capabilities of the large models,” Hassabis says. “We also have some new innovations that are going to be pretty interesting.” Gemini is still in development, a process that will take a number of months, Hassabis says…. “I can see the kinds of things we’re building into the Gemini series right, and we have no reason to believe that they won’t work,” he says. (Wired)
1/Jun/2023: Google DeepMind trains an LLM (DIDACT) on iterative code in Piper, their 86TB monorepo (2016 PDF). Using The Pile’s calculation (paper) of 0.4412 tokens per byte, this dataset would be around 37.9T tokens, or about twice the size of the next biggest dataset in GPT-4 (estimated). This means that there would be no rumored data scarcity for training Gemini.
24/May/2023: Google DeepMind partnership leads to DeepMind’s Flamingo 80B (my video Part 1, Part 2) being applied to Google YouTube Shorts video summarization and search optimization. ‘It automatically generates descriptions for hundreds of millions of videos in their metadata, making them more searchable.’ (-via DeepMind)
DeepMind Flamingo (Apr/2022) is a phenomenal visual language model, and was in many ways a precursor to OpenAI’s GPT-4 (Mar/2023), sharing several design concepts. And this is the best use case they can come up with? Hmmm…
10/May/2023: Google CEO (Google blog):
We’re already at work on Gemini — our next model created from the ground up to be multimodal, highly efficient at tool and API integrations, and built to enable future innovations, like memory and planning. Gemini is still in training [as of 10/May/2023], but it’s already exhibiting multimodal capabilities never before seen in prior models. Once fine-tuned and rigorously tested for safety, Gemini will be available at various sizes and capabilities, just like PaLM 2, to ensure it can be deployed across different products, applications, and devices for everyone’s benefit. – Google blog (10/May/2023).
…DeepMind and the Brain team from Google Research will be joining forces as a single, focused unit called Google DeepMind… bringing together our world-class talent in AI with the computing power, infrastructure and resources to create the next generation of AI breakthroughs and products across Google and Alphabet…
1/Aug/2022: My Google Pathways report was released, providing rigorous analysis of the design and development of Google’s models including PaLM, PaLM-Coder, Parti, and Minerva.
20/Apr/2018: Background on Google and DeepMind relationship from 2018 (The Information via MobileSyrup):
…some Google developers who are part of other AI research divisions at the company, such as Google Brain, are not happy that DeepMind doesn’t generate much revenue for the company.
…staff members are upset that DeepMind has “special status” within Alphabet that allows it to work on projects that might not yield results for years [Alan: this article is from 2018, and the most recent ‘merger’ happened five years later in 2023]…
…DeepMind had difficulty working with the time zone difference between London, England and [San Francisco, California].
DeepMind is a very private company and according to the report it objected to a “powered by DeepMind” tag on some of the Google products it helped create.
Google purchased DeepMind in 2014 for a reported $600 million and is most well-known for creating the AlphaGo program that beat the world’s top player in the game of Go.
2023-2024 optimal language model size highlights
Download source (PDF)
Permissions: Yes, you can use these visualizations anywhere, please leave the citation intact.
Models table
Summary of current models: View the full data (Google sheets)
Download PDF version
Timeline to Gemini
|31/Aug/2017||Google: Transformer released.|
|28/Jan/2020||Google: Meena announced.|
|18/May/2021||Google: LaMDA announced.|
|4/Apr/2022||Google: PaLM 1 announced.|
|12/Apr/2022||DeepMind: Chinchilla announced.|
|28/Apr/2022||DeepMind: Flamingo announced.|
|12/May/2022||Google: LaMDA 2 released.|
|12/May/2022||DeepMind: Gato announced.|
|26/Dec/2022||Google DeepMind: MedPaLM 1 announced.|
|20/Apr/2023||Google Brain and DeepMind merge as Google DeepMind.|
|10/May/2023||Google: PaLM 2 released.|
|1/Jun/2023||Google DeepMind DIDACT code model trained on 37T tokens (Alan estimate).|
|20/Jun/2023||Google DeepMind: RoboCat announced.|
|22/Jun/2023||Google: AudioPaLM announced (PaLM 2 + AudioLM).|
|28/Jul/2023||Google DeepMind: RT-2.|
|2/Aug/2023||Google DeepMind: Soft MoE.|
|8/Sep/2023||My independent pre-release report.|
|18/Sep/2023||Google DeepMind Gemini in early testing.|
|6/Dec/2023||Google DeepMind Gemini released.|
Dataset for Gemini via Google’s Piper monorepo (estimate)
The Gemini dataset could be made up of a large amount of code, to support reasoning (many papers, 1, 2) within the final trained model. Google’s internal monorepo, Piper, is 86TB (2016 PDF). Using The Pile’s calculation (paper) of 0.4412 tokens per byte, this dataset would be around 37.9T tokens, or about twice the size of the next biggest dataset in GPT-4 (estimated).
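The Piper token estimate above reduces to one multiplication. A minimal sketch, assuming the 86TB repo size from the 2016 paper and The Pile’s measured tokens-per-byte ratio:

```python
# Assumed inputs: 86 TB Piper repo (2016 paper),
# The Pile's ratio of 0.4412 tokens per byte.
PIPER_BYTES = 86e12
TOKENS_PER_BYTE = 0.4412

tokens = PIPER_BYTES * TOKENS_PER_BYTE
print(f"{tokens / 1e12:.1f}T tokens")  # 37.9T tokens
```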
Dataset for Gemini via MassiveText (estimate)
The Gemini dataset could potentially be made up of some of DeepMind’s MassiveText (multilingual) 5T-token dataset (see the Improving language models by retrieving from trillions of tokens paper and my What’s in my AI? paper).
Please note that the following table is ‘best guess’ by Alan (not confirmed by Google DeepMind), and is based on available information from the state-of-the-art DeepMind MassiveText (multilingual) + 1,000B tokens of discussion.
|Count||Dataset||Percentage tokens||Raw Size (GB)||Tokens (B)|
|2||Discussion (multilingual, via YouTube, estimate)*||x%||3,750GB||1,000B*|
|3||Web: C4 (multilingual)||19.45%||3,656GB||977B|
|Totals||—||—||23,949GB (23.9TB)||6,000B (6T)|
* Alan’s estimate only.
Table. MassiveText multilingual dataset estimates. Rounded. Disclosed in bold (from DeepMind’s MassiveText multilingual dataset). Determined in italics. For similar models, see my What’s in my AI paper.
Dataset for Gemini via YouTube (estimate)
- Total videos: 800 million.
- Average length: 11.7 minutes.
- Total time: 9.36 billion minutes.
- Rounding to keep up with 30,000 hours uploaded per hour: 10B minutes.
YouTube 2023 text stats:
- Human speaking rate: 150 words per minute (wpm).
- 150wpm x 10B minutes = 1.5 trillion words total.
- Assume: (1) speaking only occurs in a subset of videos, (2) quality classifier retains videos with a score in the top 80%, then let’s keep 80% of this.
- 1.5T words x 0.8 = 1.2T words.
- 1.2T words x 1.3 = 1.56T text tokens.
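The estimate above can be chained together in a short script. All inputs are Alan’s assumptions (video count, average length, speaking rate, 80% quality cut), not disclosed figures:

```python
# YouTube text-token estimate (all figures are assumptions, see list above).
videos = 800e6            # total public videos
avg_minutes = 11.7        # average video length in minutes
minutes = 10e9            # 9.36B total minutes, rounded up for ongoing uploads
wpm = 150                 # human speaking rate, words per minute
keep = 0.8                # quality classifier keeps the top 80% of videos
tokens_per_word = 1.3     # rough BPE ratio

words = minutes * wpm * keep          # 1.2T words
tokens = words * tokens_per_word      # 1.56T text tokens
print(f"{tokens / 1e12:.2f}T text tokens")  # 1.56T text tokens
```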
~1.56T text tokens is not enough to make a big dent in the requirements of models at the scale of Gemini or GPT-5:
- 1T parameters (20T text tokens).
- 2T parameters (40T text tokens).
- 5T parameters (100T text tokens).
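The parameter-to-token pairings above follow a Chinchilla-style rule of thumb of roughly 20 training tokens per parameter. A minimal sketch, assuming that ratio holds for dense models at this scale:

```python
# Chinchilla-style rule of thumb (an assumption at these scales):
# ~20 training tokens per parameter for a compute-optimal dense model.
TOKENS_PER_PARAM = 20

for params in (1e12, 2e12, 5e12):
    tokens = params * TOKENS_PER_PARAM
    print(f"{params / 1e12:.0f}T params -> {tokens / 1e12:.0f}T tokens")
# 1T params -> 20T tokens
# 2T params -> 40T tokens
# 5T params -> 100T tokens
```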
Given the focus on multimodality in 2023-2024 large language models, it can be assumed that visual content (not just text) is being used to train these models.
While it’s expected that Google will need to compromise with DeepMind about the architecture for Gemini, my 2022 paper details the approach by Google to large language models and training.
Alan D. Thompson
24 pages incl title page, references, appendix.
Read more about Alan’s conservative countdown to AGI…
Dr Alan D. Thompson is an AI expert and consultant, advising Fortune 500s and governments on post-2020 large language models. His work on artificial intelligence has been featured at NYU, with Microsoft AI and Google AI teams, at the University of Oxford’s 2021 debate on AI Ethics, and in the Leta AI (GPT-3) experiments viewed more than 4.5 million times. A contributor to the fields of human intelligence and peak performance, he has held positions as chairman for Mensa International, consultant to GE and Warner Bros, and memberships with the IEEE and IET. Technical highlights.
This page last updated: 7/Dec/2023. https://lifearchitect.ai/gemini/↑