🔍

Benchmarking fine-tuned vs base vs frontier

Your fine-tuned model outperforms GPT-4 on your eval set by 12%. You ship it. User satisfaction drops 18%. The eval set was drawn from training data. You benchmarked on what the model memorized.
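
One way to catch this before shipping is to measure how many eval prompts also appear in the training data. The sketch below is a minimal illustration, not a full decontamination pipeline: `normalize`, `contamination_rate`, and the toy prompts are hypothetical names and data, and it only detects exact matches after lowercasing and whitespace cleanup, so paraphrased leaks would still slip through.

```python
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so formatting differences don't hide duplicates."""
    return re.sub(r"\s+", " ", text.strip().lower())


def contamination_rate(train_prompts, eval_prompts):
    """Return (fraction of eval prompts also found in train, list of leaked eval prompts)."""
    train_seen = {normalize(p) for p in train_prompts}
    leaked = [p for p in eval_prompts if normalize(p) in train_seen]
    rate = len(leaked) / len(eval_prompts) if eval_prompts else 0.0
    return rate, leaked


# Hypothetical toy data, for illustration only.
train = [
    "Summarize this ticket: printer offline",
    "Translate 'hello' to French",
]
eval_set = [
    "summarize this ticket:  printer offline",   # same prompt, different casing/spacing
    "Classify the sentiment of: great product",  # not seen in training
]

rate, leaked = contamination_rate(train, eval_set)
print(f"{rate:.0%} of eval prompts also appear in training data")
```

A nonzero rate suggests the benchmark is partly measuring memorization; those leaked prompts would need to be dropped or replaced before comparing against a base or frontier model.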
