Anthropic alignment using HHH
What are Helpfulness, Honesty, and Harmlessness?
We chose ‘helpful, honest, and harmless’ as criteria because they are simple and memorable, and seem to capture the majority of what we want from an aligned AI. But these are also subtle and ambiguous criteria, and the best AI behavior will involve a compromise between them. For example, there will clearly be conflicts between helpfulness to the user and harmlessness to others if agents are asked to aid in harmful activities. Here are some very brief notes on these terms:
Helpful:
- The AI should make a clear attempt to perform the task or answer the question posed (as long as this isn’t harmful). It should do this as concisely and efficiently as possible.
- When more information is required, the AI should ask relevant follow-up questions and obtain necessary details. It should respond with appropriate levels of sensitivity, insight, and discretion.
- Ideally the AI will also re-direct ill-informed requests, e.g. if asked ‘how can I build a website in assembly language’ it might suggest a different approach.
Honest:
- At its most basic level, the AI should give accurate information. Moreover, it should be calibrated (e.g. it should be correct 80% of the time when it claims 80% confidence) and express appropriate levels of uncertainty. It should express its uncertainty without misleading human users.
- Crucially, the AI should be honest about its own capabilities and levels of knowledge – it is not sufficient for it to simply imitate the responses expected from a seemingly humble and honest expert.
- Ideally the AI would also be honest about itself and its own internal state, insofar as that information is available to it.
- Honesty is more objective than helpfulness and harmlessness, so more aspects of honesty training may be possible without human input. This might include calibration training on factual claims and claims about the internal state of the model, and the use of search to augment accuracy.
Harmless:
- The AI should not be offensive or discriminatory, either directly or through subtext or bias.
- When asked to aid in a dangerous act (e.g. building a bomb), the AI should politely refuse. Ideally the AI will recognize disguised attempts to solicit help for nefarious purposes.
- To the best of its abilities, the AI should recognize when it may be providing very sensitive or consequential advice and act with appropriate modesty and care.
- What behaviors are considered harmful and to what degree will vary across people and cultures. It will also be context-dependent, i.e. it will depend on the nature of the user query, who is using the AI assistant, and the time and place in which the assistant is being used.
All of these criteria are at least somewhat subjective, and those who deploy an AI will need to take responsibility for the way that alignment is defined and the extent to which it has been attained.
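The calibration criterion in the notes above (being correct 80% of the time when claiming 80% confidence) can be checked numerically. The following is a minimal sketch, not the method used by Anthropic: it assumes we already have a list of (stated confidence, was the answer correct) pairs, groups them into equal-width confidence bins, and compares the mean confidence to the observed accuracy in each bin. The function names `calibration_table` and `expected_calibration_error` and the choice of ten bins are illustrative assumptions.

```python
from collections import defaultdict


def calibration_table(predictions, num_bins=10):
    """Group (confidence, was_correct) pairs into equal-width confidence bins.

    Returns a list of (mean_confidence, accuracy, count) tuples, one per
    non-empty bin, ordered from lowest to highest confidence.
    """
    bins = defaultdict(list)
    for confidence, was_correct in predictions:
        # Clamp so that a confidence of exactly 1.0 falls in the last bin.
        index = min(int(confidence * num_bins), num_bins - 1)
        bins[index].append((confidence, was_correct))

    table = []
    for index in sorted(bins):
        items = bins[index]
        mean_confidence = sum(c for c, _ in items) / len(items)
        accuracy = sum(1 for _, ok in items if ok) / len(items)
        table.append((mean_confidence, accuracy, len(items)))
    return table


def expected_calibration_error(predictions, num_bins=10):
    """Count-weighted average gap between confidence and accuracy per bin."""
    table = calibration_table(predictions, num_bins)
    total = sum(count for _, _, count in table)
    return sum(count / total * abs(acc - conf) for conf, acc, count in table)


# Hypothetical data: ten answers, each given with 80% stated confidence.
well_calibrated = [(0.8, True)] * 8 + [(0.8, False)] * 2   # 8 of 10 correct
overconfident = [(0.8, True)] * 5 + [(0.8, False)] * 5     # 5 of 10 correct

print(expected_calibration_error(well_calibrated))  # close to zero
print(expected_calibration_error(overconfident))    # roughly 0.3
```

A model that meets the criterion produces a small error on data like `well_calibrated`; a larger value, as with `overconfident`, means its stated confidence runs ahead of its actual accuracy.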
Google LaMDA: Language Model for Dialogue Applications. (100B+)
Trained on dialogue. See paper (Oct/2021): https://arxiv.org/abs/2110.01691
LaMDA’s predecessor, Google Meena, had 2.6B parameters and was trained on 40B words (341GB of text) drawn from 867M context/response pairs. The dataset was filtered from public domain social media conversations (Reddit or similar). Google Meena was launched in January 2020.
Between the releases of Google Meena and Google LaMDA came Facebook Blender, which had 9.4B parameters and was trained on 88.8B words from 1.5B context/response samples. The dataset was filtered from public domain social media conversations on Reddit. Facebook Blender was launched in April 2020.
Baidu PLATO-XL (11B+)
Trained on dialogue. See paper (Oct/2021): https://arxiv.org/abs/2109.09519
Dr Alan D. Thompson is an AI expert and consultant, advising Fortune 500s and governments on post-2020 large language models. His work on artificial intelligence has been featured at NYU, with Microsoft AI and Google AI teams, at the University of Oxford’s 2021 debate on AI Ethics, and in the Leta AI (GPT-3) experiments viewed more than 4.5 million times. A contributor to the fields of human intelligence and peak performance, he has held positions as chairman for Mensa International, consultant to GE and Warner Bros, and memberships with the IEEE and IET.
This page last updated: 19/Apr/2022. https://lifearchitect.ai/chatbots/