Team A runs thorough offline eval before every deploy. Team B runs online eval on production traffic. Each catches 60% of quality regressions. Team C does both and catches 94%. The gap comes down to what each approach can and can't see.
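
The arithmetic behind those numbers can be made explicit with inclusion-exclusion. The sketch below assumes all three percentages are measured against the same set of regressions, which the figures above do not state. The function name `overlap_and_unique` and its keys are made up for illustration.

```python
def overlap_and_unique(rate_a: float, rate_b: float, rate_union: float) -> dict:
    """Split two detection rates into shared and unique parts.

    Assumes rate_a, rate_b, and rate_union are fractions of the SAME
    set of regressions (an assumption, not something the text confirms).
    """
    # Inclusion-exclusion: |A or B| = |A| + |B| - |A and B|
    both = rate_a + rate_b - rate_union
    return {
        "caught_by_both": round(both, 4),
        "only_a": round(rate_a - both, 4),
        "only_b": round(rate_b - both, 4),
        "missed_by_all": round(1 - rate_union, 4),
        # Combined coverage if the two approaches missed regressions independently
        "union_if_independent": round(1 - (1 - rate_a) * (1 - rate_b), 4),
    }


split = overlap_and_unique(0.60, 0.60, 0.94)
for name, value in split.items():
    print(f"{name}: {value:.0%}")
```

Under that assumption, only 26% of regressions are caught by both approaches, each one uniquely catches 34%, and 6% slip past everything. If the two approaches missed regressions independently, combined coverage would be 84%; reaching 94% suggests their blind spots are largely complementary rather than random.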