🔍

Evaluating multi-agent systems

A research pipeline: Researcher → Writer → Editor. The final output is wrong. You test the Researcher alone: pass. Writer alone: pass. Editor alone: pass. The failure emerges from how they interact. Single-agent evals miss orchestration failures entirely.

1 / 10