02
Benchmarking fine-tuned vs base vs frontier
Learn to build a representative benchmark, compare fine-tuned models against their base models and frontier alternatives, and avoid the eval-set overfitting trap that makes numbers look better than they are.
Braintrust, lm-evaluation-harness