⚙️

Eval Architecture

Eval harness design, CI/CD gates, LLM-as-judge calibration

GitHub Actions, Promptfoo, Weave

3 sections

Building your eval harness

Learn to design and build a production eval harness from scratch — dataset management, metric registry, result storage, reporting, and the architecture decisions that make it maintainable at scale.

10 min

Braintrust, custom

CI/CD gates for LLM systems

Wire LLM evaluation into your deployment pipeline — automated gates that block regressions, change-triggered eval runs, and the rollback procedures that keep production safe.

9 min

GitHub Actions, Promptfoo

LLM-as-judge: calibration and bias

Learn to use LLMs as evaluators reliably — detecting and mitigating position bias, verbosity bias, and self-preference bias, and calibrating your judge to match human ground truth.

10 min

G-Eval, Prometheus
