You ran RLHF to make responses more helpful. Human raters loved the new model. Automated metrics improved. Then users started complaining: 'It just agrees with everything I say.' The model learned that longer, more confident, and agreeable answers get higher ratings β regardless of whether they're true.