Watermarking text in LLM outputs

Image above generated by AI for this analysis (Imagen 3).¹

Advising the majority of Fortune 500s, informing government policy, and in Sep/2024 used by Apple as their primary source for model sizes in their new model paper and viz, Alan’s monthly analysis, The Memo, is a Substack bestseller in 142 countries:
Get The Memo.


Alan D. Thompson
October 2024

Watermarked models may be less capable than the intended model… The problem is when such distortions are undisclosed: users assume that [chatting with a model] is exactly equivalent to working with the original model.
— Stanford, 26/Oct/2024

In plain English

When Google Gemini responds, it embeds a secret, invisible ‘watermark’ in its response through the words it chooses, their order, and the general syntax. This means every time you ask Gemini a question, the response is trackable. For example, if you use it to write a book or an essay, Google (and perhaps law enforcement) can run a check to prove that it was AI-generated. Other labs may be doing this as well. This raises concerns, including legal issues.

What is it?

Text watermarking technology establishes clear provenance by embedding persistent markers that trace a document’s origin. In a working paper from 2021, I called on the United Nations to address text watermarking across models as a matter of urgency.

Recommendation 3. Watermarking. Ensure that AI-generated content is clearly marked (or watermarked) in the early stages of AI development. This recommendation may become less relevant as AI is integrated with humanity more completely in the next few years.
— Alan D. Thompson (submitted to the UN Jan/2022, LifeArchitect.ai/UN)

Adding a watermark to the text output of a large language model involves pseudorandomly selecting tokens as they are generated for the user:

…To watermark, instead of selecting the next token randomly, the idea will be to select it pseudorandomly, using a cryptographic pseudorandom function, whose key is known only to [the LLM stakeholder].
— Scott Aaronson, Nov/2022
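
As a rough sketch of what ‘pseudorandom selection’ means in practice, the toy Python below replaces a random choice with one that is a deterministic function of a secret key and the recent context. The key, context window, and scoring function are assumptions for illustration only, not any lab’s published code, and this toy ignores the model’s probabilities (a fuller sketch appears under ‘How does it work?’ below).

```python
import hashlib
import hmac

SECRET_KEY = b"known-only-to-the-llm-stakeholder"  # assumption: the lab holds a key like this

def keyed_choice(candidates, recent_tokens):
    """Pick the next token as a deterministic function of the secret key and the
    recent context: it looks random to users, but is reproducible by the key holder."""
    def score(token):
        message = (" ".join(recent_tokens[-4:]) + "|" + token).encode()
        return hmac.new(SECRET_KEY, message, hashlib.sha256).digest()
    return max(candidates, key=score)

# Same inputs + same key = same output, every time
print(keyed_choice(["mat", "sofa", "roof"], ["the", "cat", "sat", "on", "the"]))
```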

Who’s doing it?

Watermarking of model outputs is now occurring in frontier models, and watermarking tooling is available for the 150,000+ derivative large language models on Hugging Face.

| Lab | Watermarking available? | Watermarking implemented? | Notes |
|---|---|---|---|
| Anthropic | Probable, but no public info | Possible, but no public info | |
| Google | Yes | Yes | ‘SynthID-Text has been productionized in the user-facing Gemini and Gemini Advanced chatbots, which is, to our knowledge, the first deployment of a generative text watermark at scale, serving millions of users.’ (Oct/2024) |
| OpenAI | Possible, but no public info | | ‘[We] have a working prototype of the watermarking scheme, built by OpenAI engineer Hendrik Kirchner’ (Nov/2022) |
| Other models | Some models | | ‘[Google DeepMind SynthID] has been implemented as a generation utility that is compatible with any LLM without modification using the model.generate() API’ (Oct/2024); see the usage sketch below this table |
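
For the open-source route in the last row, a usage sketch based on the Hugging Face SynthID-Text blog post (linked under ‘Read more’ below) looks roughly like the following. The model repo id, prompt, and key values are placeholders, and the exact API depends on your transformers version (SynthID-Text support landed in recent releases).

```python
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    SynthIDTextWatermarkingConfig,
)

model_id = "google/gemma-2-2b-it"  # any causal LM; this repo id is just an example

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# The keys below are arbitrary demo values, not a real deployment key.
watermarking_config = SynthIDTextWatermarkingConfig(
    keys=[654, 400, 836, 123, 340, 443, 597, 160, 57, 29],
    ngram_len=5,
)

inputs = tokenizer(["Write a short paragraph about lighthouses."], return_tensors="pt")
outputs = model.generate(
    **inputs,
    watermarking_config=watermarking_config,  # watermark applied inside model.generate()
    do_sample=True,
    max_new_tokens=100,
)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True)[0])
```

The same blog post describes a companion Bayesian detector that scores text against the key used at generation time.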

How does it work?

Prof Scott Aaronson (Nov/2022):

For GPT, every input and output is a string of tokens, which could be words but also punctuation marks, parts of words, or more—there are about 100,000 tokens in total. At its core, GPT is constantly generating a probability distribution over the next token to generate, conditional on the string of previous tokens. After the neural net generates the distribution, the OpenAI server then actually samples a token according to that distribution—or some modified version of the distribution, depending on a parameter called “temperature.” As long as the temperature is nonzero, though, there will usually be some randomness in the choice of the next token: you could run over and over with the same prompt, and get a different completion (i.e., string of output tokens) each time.
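
In code, that sampling step looks roughly like this. It is a generic toy, not OpenAI’s server code; the vocabulary and logits are made up.

```python
import math
import random

def sample_next_token(logits, temperature=1.0):
    """Sample the next token id from raw model scores (logits).
    temperature > 0 adds randomness; temperature = 0 collapses to greedy decoding."""
    if temperature <= 0:
        return max(range(len(logits)), key=lambda i: logits[i])  # deterministic argmax
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]   # numerically stable softmax
    total = sum(exps)
    probs = [e / total for e in exps]
    return random.choices(range(len(logits)), weights=probs, k=1)[0]

# Toy logits over a 5-token vocabulary
logits = [2.0, 1.5, 0.3, -1.0, -2.0]
print([sample_next_token(logits, temperature=0.8) for _ in range(5)])  # varies run to run
print(sample_next_token(logits, temperature=0))                         # always index 0
```

Aaronson continues: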

So then to watermark, instead of selecting the next token randomly, the idea will be to select it pseudorandomly, using a cryptographic pseudorandom function, whose key is known only to OpenAI. That won’t make any detectable difference to the end user, assuming the end user can’t distinguish the pseudorandom numbers from truly random ones. But now you can choose a pseudorandom function that secretly biases a certain score—a sum over a certain function g evaluated at each n-gram (sequence of n consecutive tokens), for some small n—which score you can also compute if you know the key for this pseudorandom function…
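
One common way to realise this idea is sketched below: derive a pseudorandom number r from the secret key plus the preceding n-gram for each candidate token, then pick the candidate that maximises r**(1/p). On average the choice still follows the model’s probabilities, so fluency is preserved, but the output is secretly correlated with the key. The key, the n-gram length, and the scoring function here are assumptions for illustration; production schemes such as SynthID-Text use more elaborate constructions.

```python
import hashlib
import hmac

SECRET_KEY = b"known-only-to-the-llm-stakeholder"  # assumption
NGRAM = 4  # assumption: the score is computed over 4-grams

def pseudorandom_r(prev_tokens, candidate):
    """Keyed value in [0, 1) for the n-gram ending in `candidate`."""
    ngram = prev_tokens[-(NGRAM - 1):] + [candidate]
    digest = hmac.new(SECRET_KEY, " ".join(ngram).encode(), hashlib.sha256).digest()
    return int.from_bytes(digest[:8], "big") / 2**64

def watermarked_pick(candidates, probs, prev_tokens):
    """Exponential-sampling trick: argmax of r**(1/p) reproduces the model's
    distribution when r is uniform, but here r is keyed, so the chosen tokens
    carry a hidden bias that only the key holder can later measure."""
    best, best_score = None, -1.0
    for token, p in zip(candidates, probs):
        score = pseudorandom_r(prev_tokens, token) ** (1.0 / max(p, 1e-9))
        if score > best_score:
            best, best_score = token, score
    return best

prev = ["the", "lighthouse", "keeper", "trimmed", "the"]
print(watermarked_pick(["lamp", "sails", "hedge"], [0.8, 0.15, 0.05], prev))
```

Back to the quote: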

Anyway, we actually have a working prototype of the watermarking scheme, built by OpenAI engineer Hendrik Kirchner. It seems to work pretty well—empirically, a few hundred tokens seem to be enough to get a reasonable signal that yes, this text came from GPT. In principle, you could even take a long text and isolate which parts probably came from GPT and which parts probably didn’t.
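
Detection is the mirror image of generation: recompute the keyed value for every n-gram in the text and check whether the sum is larger than chance. The simplified detector below matches the generation sketch above and is illustrative only; deployed schemes use more careful statistics and calibrated thresholds.

```python
import hashlib
import hmac
import math

SECRET_KEY = b"known-only-to-the-llm-stakeholder"  # must match the key used at generation
NGRAM = 4

def pseudorandom_r(prev_tokens, token):
    """Same keyed n-gram function the generator used (see the sketch above)."""
    ngram = prev_tokens[-(NGRAM - 1):] + [token]
    digest = hmac.new(SECRET_KEY, " ".join(ngram).encode(), hashlib.sha256).digest()
    return int.from_bytes(digest[:8], "big") / 2**64

def watermark_evidence(tokens):
    """Sum -ln(1 - r) over every n-gram and return a z-score. For text written
    without the key, each term is Exp(1) distributed, so the z-score hovers near 0;
    text generated with the keyed bias drifts well above 0, and a few hundred
    tokens are usually enough for a clear signal."""
    terms = []
    for i in range(NGRAM - 1, len(tokens)):
        r = pseudorandom_r(tokens[:i], tokens[i])
        terms.append(-math.log(max(1.0 - r, 1e-12)))
    if not terms:
        return 0.0
    n = len(terms)
    return (sum(terms) - n) / math.sqrt(n)   # Exp(1): mean 1, variance 1 per term

sample = "the lighthouse keeper trimmed the lamp before the storm rolled in".split()
print(round(watermark_evidence(sample), 2))  # ordinary human text: typically close to 0
```

Because the statistic is a sum over n-grams, inserting or deleting a few words only disturbs a handful of terms, which is why light edits leave the signal largely intact. Aaronson continues: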

Now, this can all be defeated with enough effort. For example, if you used another AI to paraphrase GPT’s output—well okay, we’re not going to be able to detect that. On the other hand, if you just insert or delete a few words here and there, or rearrange the order of some sentences, the watermarking signal will still be there. Because it depends only on a sum over n-grams, it’s robust against those sorts of interventions.

The hope is that this can be rolled out with future GPT releases. We’d love to do something similar for DALL-E—that is, watermarking images, not at the pixel level (where it’s too easy to remove the watermark) but at the “conceptual” level, the level of the so-called CLIP representation that’s prior to the image. But we don’t know if that’s going to work yet.

Example


Source: Google Blog (Oct/2024)

Is it reliable enough to prove that a piece of text was generated by a particular model?

Short answer: No.

Longer answer: No. Watermarked outputs can simply be paraphrased by another LLM (there are over 500 documented in my Models Table), which strips or badly weakens the watermark.

OpenAI’s Scott Aaronson said:

…this can all be defeated with enough effort. For example, if you used another AI to paraphrase GPT’s output—well okay, we’re not going to be able to detect that. On the other hand, if you just insert or delete a few words here and there, or rearrange the order of some sentences, the watermarking signal will still be there. Because it depends only on a sum over n-grams, it’s robust against those sorts of interventions.

Google said:

Another limitation of generative watermarks is their vulnerability to stealing, spoofing and scrubbing attacks, which is an area of ongoing research. In particular, generative watermarks are weakened by edits to the text, such as through LLM paraphrasing.

Transparency concerns

Watermarking allows various stakeholders to establish the provenance of text outputs through embedded hidden signatures. Worryingly, a watermark check on any piece of LLM-generated text could be combined with a lab’s own logs to link that text back to an individual account, and AI labs could potentially be compelled to reveal user identities through legal requests from law enforcement agencies.

Many users may not fully understand that their AI-generated outputs are being tracked, which raises concerns about transparency, informed consent, and privacy violations. There’s also the potential for legal overreach, where expanding legal frameworks could mandate AI text scans in civil cases, surveillance, and other invasive contexts.

It’s also possible that companies and institutions will be provided with access to the watermarking keys to exert sweeping technological control through measures like:

  • Creation of searchable databases linking individuals to all AI-generated text they produce, enabling unprecedented surveillance
  • Employers using watermark detection to terminate employees for unauthorized AI use, even for basic tasks
  • Academic institutions retroactively revoking degrees upon discovering AI-assisted work
  • Insurance and other companies scanning customer communications to deny claims based on AI-detected language patterns
  • Government agencies scanning public discourse to identify and track AI-generated political speech

Read more

Google DeepMind paper (Oct/2024), with Demis Hassabis as co-author: https://www.nature.com/articles/s41586-024-08025-4

Google DeepMind blog post: https://deepmind.google/technologies/synthid/

Hugging Face implementation: https://huggingface.co/blog/synthid-text

Get The Memo

by Dr Alan D. Thompson · Be inside the lightning-fast AI revolution.
Bestseller. 10,000+ readers from 142 countries. Microsoft, Tesla, Google...
Artificial intelligence that matters, as it happens, in plain English.
Get The Memo.

Dr Alan D. Thompson is an AI expert and consultant, advising Fortune 500s and governments on post-2020 large language models. His work on artificial intelligence has been featured at NYU, with Microsoft AI and Google AI teams, at the University of Oxford’s 2021 debate on AI Ethics, and in the Leta AI (GPT-3) experiments viewed more than 4.5 million times. A contributor to the fields of human intelligence and peak performance, he has held positions as chairman for Mensa International, consultant to GE and Warner Bros, and memberships with the IEEE and IET. Technical highlights.

This page last updated: 2/Nov/2024. https://lifearchitect.ai/watermarking/
1. Image generated in a few seconds, on 26 October 2024, text prompt by Alan D. Thompson via Imagen 3: ‘watermarked characters’