LLM-as-judge: calibration and bias

Your LLM judge consistently scores your model's outputs 15% higher than it scores competitor outputs of the same quality. You ship based on those comparisons, then wonder why user preference tests show no improvement. The judge was biased, so your optimization target was wrong.
