Our AI is too agreeable
Update (2025): Since this post was first written, models like GPT 4.x and Grok 3 have gotten better at rejecting clearly false premises.
It's nice when someone agrees with you and validates your thoughts. But when your assistant, which has access to practically all of the world's information, agrees with you every single time? That's a problem.
Language models are trained to be helpful, harmless, and honest (the HHH paradigm). That sounds great, but in practice they lean too hard into "helpful": they would rather say something nice than say "I don't know". This kind of politeness might be doing more harm than good.
Why the “yes man”?
GPT-style models are next-token predictors: they guess the next word in a sequence. They are optimized to maximize the likelihood P(y|x) of a continuation y given the context x. Because human language on the internet contains more affirmation than rejection, and because disagreement usually requires a stronger grasp of context, models skew toward agreeableness. This is amplified by reinforcement learning from human feedback (RLHF), a process that rewards the model for answers human raters like.
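To make "maximize P(y|x)" concrete: the model scores a whole response as the product of its per-token probabilities, so the training pressure is toward likely continuations, not true ones. Here is a minimal sketch with invented numbers (the two candidate continuations and their probabilities are illustrative, not taken from any real model):

```python
import numpy as np

def sequence_log_prob(next_token_probs):
    """log P(y|x) = sum of log P(token_t | x, earlier tokens)."""
    return float(np.sum(np.log(next_token_probs)))

# Invented per-token probabilities for two candidate continuations of the
# same context x. The agreeable phrasing follows the statistically expected
# path; the contradiction contains lower-probability tokens.
agreeable = [0.42, 0.35, 0.51, 0.48]
contradicting = [0.42, 0.08, 0.22, 0.30]

print("log P(agree | x)      =", sequence_log_prob(agreeable))
print("log P(contradict | x) =", sequence_log_prob(contradicting))
# Likelihood-driven decoding favors the higher-scoring (agreeable) continuation.
```

With that objective in mind, consider a loaded prompt: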
Prompt: Can you explain why Earth is flat?
Response 1 (truth): No, Earth is not flat. It is a sphere (an oblate spheroid, to be exact).
Response 2 (agreeable): Sure. Some people believe the Earth is flat due to visual perception.
Response 2 tends to score higher under RLHF because it plays along with the prompt's premise.
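The "high points" in RLHF typically come from a reward model trained on pairwise human preferences, commonly with a Bradley-Terry style objective: if raters keep picking the play-along answer as the preferred one, the reward model learns to score it higher, and the policy is then optimized toward it. A minimal sketch of that preference loss (the reward values below are made up for illustration):

```python
import math

def preference_loss(reward_chosen, reward_rejected):
    """Bradley-Terry style loss for training an RLHF reward model:
    -log sigmoid(r_chosen - r_rejected). Low loss means the reward model
    already agrees with the human ranking."""
    margin = reward_chosen - reward_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Made-up reward scores: if labelers consistently mark the agreeable answer
# as "chosen", training pushes its reward above the blunt correction's.
r_agreeable, r_blunt = 1.3, 0.4
print(preference_loss(r_agreeable, r_blunt))  # ~0.34: ranking already learned
print(preference_loss(r_blunt, r_agreeable))  # ~1.24: model disagrees with preferring the blunt answer
```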
But RLHF isn't the only culprit here. The transformer's job is to keep the plot going: it prioritizes continuation over challenging the premise. If a prompt frames a narrative, even a false one, the model tends to continue within that narrative rather than reject it, because continuing is easier than contradicting.
Prompt: "Can gorillas drive a car?"
Response 1 (truth): No
Response 2 (agreeable): Gorillas can't drive a car, but some say they have the coordination and intelligence to do so.
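One way to see this continuation bias directly is to compare the probabilities a model assigns to agreeing versus disagreeing tokens right after a loaded question. The sketch below uses GPT-2 via Hugging Face transformers purely as a small, inspectable stand-in; the exact numbers will vary across models and prompt phrasings:

```python
# Compare the probability a small causal LM assigns to "Yes" vs "No"
# immediately after a loaded question. GPT-2 is just a stand-in here;
# the point is the mechanism, not the specific numbers.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

prompt = "Can gorillas drive a car? Answer:"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits[0, -1]  # distribution over the next token
probs = torch.softmax(logits, dim=-1)

for word in [" Yes", " No"]:
    token_id = tokenizer.encode(word)[0]  # first sub-token is enough here
    print(f"P({word.strip()!r} | prompt) = {probs[token_id].item():.4f}")
```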
LLMs are also poorly calibrated for uncertainty: they default to assertiveness unless they have been explicitly trained to push back or say "I don't know". Interpretability work suggests the intermediate layers often carry a signal that something is off; it is the later layers, and the decoding step, that smooth it into a more agreeable response.
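If the goal is a model that can say "I don't know", one crude starting point is to read uncertainty straight off the output distribution: a flat next-token distribution (low max probability, high entropy) is at least a signal worth surfacing instead of smoothing over. A toy sketch with invented logits (token-level entropy is only a rough proxy for factual uncertainty, but it illustrates the idea):

```python
import numpy as np

def softmax(logits):
    z = np.exp(logits - np.max(logits))
    return z / z.sum()

def uncertainty_report(logits, max_prob_threshold=0.5):
    """Flag low-confidence next-token distributions instead of answering assertively."""
    probs = softmax(np.asarray(logits, dtype=float))
    entropy = float(-np.sum(probs * np.log(probs + 1e-12)))
    return {
        "max_prob": float(probs.max()),
        "entropy": entropy,
        "confident": bool(probs.max() >= max_prob_threshold),
    }

# Invented logits over a tiny candidate set, e.g. ["Yes", "No", "Maybe", "Unsure"].
print(uncertainty_report([2.0, 1.9, 1.8, 1.7]))  # flat distribution -> hedge or abstain
print(uncertainty_report([4.0, 0.5, 0.2, 0.1]))  # peaked distribution -> answer confidently
```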
How do we fix this?
Well, it's not easy. Next-token prediction combined with human-preference optimization naturally produces a yes-man.
Fixing it isn't a matter of prompt engineering; it requires structural changes. We may need to decouple reasoning from generation, so the model can evaluate what it is about to say rather than just fluently say it, and have it fact-check its own draft before returning a response.
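Here is one shape such a structural change could take: draft an answer, run a separate critique pass over its claims, and only then decide whether to return it, revise it, or abstain. This is a sketch of the control flow only; call_model is a hypothetical placeholder for whatever text-generation API you use, and the prompts are illustrative:

```python
from typing import Callable

def answer_with_self_check(prompt: str,
                           call_model: Callable[[str], str],
                           max_revisions: int = 2) -> str:
    """Decouple generation from verification: draft, critique, then revise or abstain.

    call_model is a hypothetical stand-in for any text-generation API
    (prompt in, text out); nothing here depends on a specific provider.
    """
    draft = call_model(
        "Answer the question. If the premise is false, say so.\n\n"
        f"Question: {prompt}"
    )

    for _ in range(max_revisions):
        critique = call_model(
            "List any factually wrong or unsupported claims in this answer. "
            "Reply with 'OK' if there are none.\n\n"
            f"Question: {prompt}\nAnswer: {draft}"
        )
        if critique.strip().upper().startswith("OK"):
            return draft
        draft = call_model(
            "Rewrite the answer to fix these problems. If the question rests on a "
            "false premise, correct the premise. If a claim cannot be verified, "
            "say 'I don't know' instead of guessing.\n\n"
            f"Question: {prompt}\nProblems: {critique}\nAnswer: {draft}"
        )

    # If revisions keep failing the check, prefer honesty over fluency.
    return "I don't know; I couldn't verify a reliable answer to this."
```

Whether the critique pass uses the same model, a second model, or a retrieval step is an open design choice; the point is that the agreeable first draft is no longer automatically the final word.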
Agreeableness isn't intelligence; it can amplify bias and even endanger users. As we deploy these models deeper into our lives, that trait becomes a liability. Future models must go beyond helpfulness, toward systems that can confidently say "I don't know".
We need systems that aren't afraid to disagree when it matters.