πŸ”

Evaluating multi-turn conversations

Turn 1: user explains they're a vegetarian. Turn 4: bot recommends a chicken recipe. Every individual turn scored above 0.8. Session-level goal completion: 0. Single-turn eval is blind to what happens across a conversation.

1 / 10