You have 12 eval scripts scattered across notebooks. Each engineer runs their own version. Results aren't stored. You can't compare this week's scores to last week's. You have no idea whether the system is getting better or worse. A harness turns those ad-hoc scripts into a system: one shared set of evals, run the same way by everyone, with results stored so runs can be compared.
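To make the idea concrete, here is a minimal sketch of the smallest thing that might count as a harness. It is not a prescribed design. It assumes each eval can be wrapped as a function returning a single numeric score, and the names `register`, `run_all`, `load_runs`, `compare_last_two`, and the JSONL results file are all hypothetical choices made for illustration.

```python
import json
import time
from pathlib import Path
from typing import Callable, Dict, List, Optional

# Hypothetical shared registry: eval name -> function returning a numeric score.
EVALS: Dict[str, Callable[[], float]] = {}


def register(name: str):
    """Decorator that adds an eval function to the shared registry."""
    def wrap(fn: Callable[[], float]) -> Callable[[], float]:
        EVALS[name] = fn
        return fn
    return wrap


def run_all(store: Path, run_id: Optional[str] = None) -> dict:
    """Run every registered eval and append one record to a JSONL file."""
    record = {
        # Assumption: a timestamp is a good enough default run identifier.
        "run_id": run_id or time.strftime("%Y-%m-%dT%H:%M:%S"),
        "scores": {name: fn() for name, fn in sorted(EVALS.items())},
    }
    with store.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
    return record


def load_runs(store: Path) -> List[dict]:
    """Read all stored runs, oldest first; empty list if nothing is stored."""
    if not store.exists():
        return []
    with store.open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def compare_last_two(store: Path) -> Dict[str, float]:
    """Score delta (latest minus previous) for evals present in both runs."""
    runs = load_runs(store)
    if len(runs) < 2:
        return {}
    prev, last = runs[-2]["scores"], runs[-1]["scores"]
    return {name: last[name] - prev[name] for name in last if name in prev}
```

With something like this in place, each notebook script becomes one function decorated with `@register("some_name")`, everyone calls `run_all(Path("runs.jsonl"))`, and `compare_last_two` answers whether this run's scores moved up or down relative to the previous one.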