Turn 1: user explains they're a vegetarian. Turn 4: bot recommends a chicken recipe. Every individual turn scored above 0.8. Session-level goal completion: 0. Single-turn eval is blind to what happens across a conversation.
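The gap between the two views can be sketched in a few lines of Python. This is a toy illustration, not a real evaluation harness: the `Turn` class, the transcript, the per-turn scores, the 0.8 threshold check, and the keyword-based `session_goal_completion` function are all hypothetical stand-ins chosen to mirror the scenario above.

```python
from dataclasses import dataclass

@dataclass
class Turn:
    speaker: str   # "user" or "bot"
    text: str
    score: float   # hypothetical per-turn quality score in [0, 1]

# Illustrative transcript; the wording and scores are made up for this sketch.
session = [
    Turn("user", "I'm a vegetarian. Any dinner ideas?", 0.90),
    Turn("bot", "Sure! What cuisines do you like?", 0.90),
    Turn("user", "Something quick and Italian.", 0.85),
    Turn("bot", "Try this quick chicken piccata recipe.", 0.85),
]

# Assumption: a crude keyword list stands in for a real constraint checker.
NON_VEGETARIAN_WORDS = {"chicken", "beef", "pork", "fish"}

def all_turns_pass(turns, threshold=0.8):
    """Single-turn view: every bot turn is judged in isolation."""
    return all(t.score > threshold for t in turns if t.speaker == "bot")

def session_goal_completion(turns):
    """Session view: 0 if a bot turn violates a constraint stated earlier, else 1."""
    vegetarian = False
    for t in turns:
        text = t.text.lower()
        if t.speaker == "user" and "vegetarian" in text:
            vegetarian = True  # the constraint now applies to all later turns
        elif t.speaker == "bot" and vegetarian:
            if any(word in text for word in NON_VEGETARIAN_WORDS):
                return 0
    return 1

print(all_turns_pass(session))           # True
print(session_goal_completion(session))  # 0
```

The per-turn check never looks at earlier turns, so it has no way to notice the conflict; the session-level check carries state forward, which is the only reason it catches the turn 4 recommendation.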