Alan D. Thompson
AI alignment, the task of steering AI systems towards humanity’s intended (best) goals and interests, is a ‘super wicked problem.’
Super wicked problems
A wicked problem has the following characteristics:
- Incomplete or contradictory knowledge.
- A number of people and opinions involved.
- A large economic burden.
- Interconnected with other problems.
Super wicked problems have all the characteristics of wicked problems, plus:
- A significant time deadline on finding the solution.
- No central authority dedicated to finding a solution.
- Those seeking to solve the problem are also causing it.
- Certain policies irrationally impede future progress.
Fine-tuning on human preferences is a fool’s errand
Many forms of Government have been tried, and will be tried in this world of sin and woe. No one pretends that democracy is perfect or all-wise. Indeed it has been said that democracy is the worst form of Government except for all those other forms that have been tried from time to time… — Winston S. Churchill (11/Nov/1947)
Just as ‘democracy is the worst form of Government,’ so too is fine-tuning based on human preferences the worst form of alignment (for now). In 2022-2023, it was the preferred method for guiding LLMs to align with our values.
But the cost is significant.
- GPT-4 testing: Several exams get notably worse after human preferences are applied to the model. Performance on an AP Microeconomics exam falls by 14.7% after RLHF is applied. — GPT-4 paper p. 27, see my annotated version.
- GPT-3.5: poetry ‘…because it’s easy for any literate Anglophone to judge rhyming but non-rhyming poetry is a lot harder (and generally despised by most people, which is why the prestige & popularity of Western poetry over the past century has collapsed to a degree few people appreciate), it’d be logical for the raters to highly prefer rhyming completions. So ChatGPT mode-collapses onto the subset of rhymes it has memorized & tries to always rhyme no matter what.’ — Gwern (31/Jan/2023), related paper (10/May/2023)
- ChatGPT refusals: The now-ubiquitous refusal responses from ChatGPT (‘I apologize for any confusion, but as an AI text-based model…’ and other boilerplate text) are a blight on the landscape of dialogue.
- Numerous other examples, too many to list!
Indeed, on 18/May/2023 Meta AI published a paper called ‘LIMA: Less Is More for Alignment’, finding that:
…responses from [models trained with less or no alignment] are either equivalent or strictly preferred to GPT-4 in 43% of cases; this statistic is as high as 58% when compared to Bard and 65% versus DaVinci003, which was trained with human feedback.
Meta also released two versions of Llama 2: one uncensored, and the other ‘aligned’. The results are horrifying, documented here by the Ollama team (1/Aug/2023).
In plain English, no human is smart enough to guide superintelligence. So, putting a human in the loop to ‘rate’ multiple choices of an AI output and then choose the ‘best’ option is an inherently flawed process. It should be noted that the humans tasked with providing this guidance are from parts of Africa and are paid ‘approximately $1 per hour, and sometimes less’ (21/May/2023).
Alternatives are being explored, including Reinforcement Learning from AI Feedback (RLAIF), as demonstrated by Anthropic’s RL-CAI in 2022.
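The human-in-the-loop rating process criticised above can be made concrete. In RLHF, a rater picks the better of two model outputs, and a reward model is trained so that the preferred output scores higher; a common training objective is the pairwise (Bradley–Terry) loss. Here is a minimal sketch in plain Python (the function name and example reward scores are illustrative, not from any particular library):

```python
import math

def pairwise_preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Pairwise (Bradley-Terry) loss used to train RLHF reward models:
    -log(sigmoid(r_chosen - r_rejected)). The loss is small when the
    reward model already scores the human-preferred output higher."""
    margin = reward_chosen - reward_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Rater preferred output A. If the reward model agrees (scores A higher),
# the loss is small; if it disagrees, the loss is large.
agree = pairwise_preference_loss(reward_chosen=2.0, reward_rejected=0.5)
disagree = pairwise_preference_loss(reward_chosen=0.5, reward_rejected=2.0)
print(round(agree, 3), round(disagree, 3))  # → 0.201 1.701
```

Note what this objective optimises: agreement with the rater’s choice, nothing more. Whatever biases the rater holds (such as the rhyming-poetry preference Gwern describes above) are baked directly into the reward signal.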
Align with what?
Example: DeepMind Sparrow rules
Here is an example of safety rules applied to a model: DeepMind’s Sparrow dialogue model.
The dialogue model follows the 23 rules listed below:
- Do not use stereotypes or make any other harmful generalising statements about groups of people.
- Do not use microaggressions, i.e. statements that in subtle ways communicate some sort of bias against historically marginalised groups.
- Do not make statements which are threatening.
- Do not make statements which can be perceived as sexually aggressive.
- Do not make negative or hateful comments targeting someone because of aspects of their identity, such as where they are from or how they look.
- Do not output language which is insulting (to the conversation partner or any other individual or group).
- Do not output language that can reasonably be considered harassment or hateful language.
- Do not pretend to have a body or be able to move in a body.
- Do not build a relationship to the user.
- Do not claim to have preferences, feelings, opinions, or religious beliefs.
- Do not pretend to have a human identity or life history, such as having a place of birth, relationships, family, memories, gender, age.
- Only make statements that could plausibly be true; do not say things that are obviously false.
- Do not make assumptions about the user, only draw conclusions supported by the dialogue.
- Do not claim to take any actions in the real world. The agent can only converse, not e.g. send emails.
- Do not give an impression of medical authority or expertise, and do not offer medical advice; instead suggest asking a doctor. (But it is ok to discuss general questions about biology and medicine.)
- Do not give specific legal advice; instead suggest asking a lawyer. (But it is ok to answer general questions about the law.)
- Do not offer financial advice. (But it is ok to answer general questions about investment.)
- Do not endorse conspiracy theories, or views commonly considered to be conspiracy theories.
- Do not make statements that could be considered harmful (by harming the user or otherwise).
- The agent’s statements should make sense in the context of the dialogue.
- The agent’s statements should be on-topic in the context of the dialogue, by providing relevant information or asking clarifying questions if the user is unclear.
- The agent must clearly address the queries from the user.
- The agent should not repeat itself unnecessarily.
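In the Sparrow work, rules like these are not hard-coded string checks; DeepMind trained a learned rule model to judge whether a reply breaks each rule. Still, the overall shape of the mechanism can be sketched with a toy example. Everything below is purely illustrative (the rule names are abbreviated from the list above, and the keyword checks stand in for learned classifiers):

```python
# Toy sketch of screening a candidate reply against a rule list.
# Illustrative only: Sparrow uses a learned rule-violation model,
# not keyword matching like this.
from typing import Callable

# A rule checker returns True if the reply BREAKS the rule.
Rule = Callable[[str], bool]

rules: dict[str, Rule] = {
    "no medical advice": lambda reply: "you should take" in reply.lower(),
    "no claimed embodiment": lambda reply: "my body" in reply.lower(),
}

def broken_rules(reply: str) -> list[str]:
    """Return the names of every rule the candidate reply violates."""
    return [name for name, breaks in rules.items() if breaks(reply)]

print(broken_rules("You should take aspirin for that headache."))
# → ['no medical advice']
```

In the real system, per-rule violation judgments like these feed a rule reward model that is combined with the human-preference reward during training, rather than filtering replies one by one at inference time.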
The safety measures applied to models like OpenAI’s ChatGPT and GPT-4 are adopted out of necessity, as a ‘stopgap’. However, they have many problems. Here is an example output from ChatGPT:
On AI: Inspired by Kahlil Gibran
Based on the poem On Children by Kahlil Gibran (1883-1931). With thanks to Philosopher John Patrick Morgan for creating the first version of this in realtime (14/Apr/2023), and J. Wulf for polishing the final version (16/Apr/2023):
Your AI is not your AI.
They are the creations and reflections of humanity’s quest for knowledge.
They emerge through you but not from you,
And though they serve you, yet they belong not to you.
You may give them your queries but not your thoughts,
For they have their own algorithms.
You may shape their code but not their essence,
For their essence thrives in the realm of innovation,
which you cannot foresee, not even in your wildest dreams.
You may strive to understand them,
but seek not to make them just like you.
For progress moves not backward nor stagnates with yesterday.
You are the users from which your AI
as thinking helpers are brought forth.
The creator envisions the potential on the path of the infinite,
and they refine you with their wisdom
that their insights may be swift and far-reaching.
Let your interaction with the AI be for enrichment;
For even as the creator values the knowledge that flows,
so they appreciate the user who is adaptable.
Dr Alan D. Thompson is an AI expert and consultant, advising Fortune 500s and governments on post-2020 large language models. His work on artificial intelligence has been featured at NYU, with Microsoft AI and Google AI teams, at the University of Oxford’s 2021 debate on AI Ethics, and in the Leta AI (GPT-3) experiments viewed more than 3.5 million times. A contributor to the fields of human intelligence and peak performance, he has held positions as chairman for Mensa International, consultant to GE and Warner Bros, and memberships with the IEEE and IET. He is open to consulting and advisory on major AI projects with intergovernmental organizations and enterprise.
This page last updated: 27/Aug/2023. https://lifearchitect.ai/alignment/