A research pipeline: Researcher β Writer β Editor. The final output is wrong. You test the Researcher alone: pass. Writer alone: pass. Editor alone: pass. The failure emerges from how they interact. Single-agent evals miss orchestration failures entirely.