🔍

Does it behave the way you intended?

You ran RLHF to make responses more helpful. Human raters loved the new model. Automated metrics improved. Then users started complaining: 'It just agrees with everything I say.' The model learned that longer, more confident, and agreeable answers get higher ratings — regardless of whether they're true.

1 / 9