Quick reference · 6 tracks · 24 methods

LLM Eval Cheat Sheet

Every evaluation method covered in Output Observatory: what it measures, when to reach for it, and which tools to use.

Pick your eval by symptom

Model makes things up: Faithfulness (RAG) or Hallucination detection (LLM)
Answers the wrong question: Context Precision + Answer Relevance
Misses key information: Context Recall
Agent loops or over-plans: Trajectory Efficiency
Quality degrades over time: Drift Detection + LLM SLOs
Not sure which prompt is better: A/B Testing for LLMs
Fine-tune broke something: Behavioral Regression Testing
Can't debug prod failures: Observability & Tracing
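
If you want the symptom list in a script or triage tool, one way to carry it around is as a plain lookup table. The sketch below only transcribes the pairs listed above; the names `SYMPTOM_TO_EVALS` and `pick_evals` and the case-insensitive substring matching are illustrative assumptions, not part of Output Observatory.

```python
# Symptom-to-eval lookup transcribed from the list above.
# SYMPTOM_TO_EVALS and pick_evals are hypothetical names for illustration.
SYMPTOM_TO_EVALS = {
    "model makes things up": ["Faithfulness (RAG)", "Hallucination detection (LLM)"],
    "answers the wrong question": ["Context Precision", "Answer Relevance"],
    "misses key information": ["Context Recall"],
    "agent loops or over-plans": ["Trajectory Efficiency"],
    "quality degrades over time": ["Drift Detection", "LLM SLOs"],
    "not sure which prompt is better": ["A/B Testing for LLMs"],
    "fine-tune broke something": ["Behavioral Regression Testing"],
    "can't debug prod failures": ["Observability & Tracing"],
}


def pick_evals(symptom: str) -> list[str]:
    """Return the evals listed for a symptom, or an empty list if none match."""
    query = symptom.strip().lower()
    if not query:
        # An empty string is a substring of everything, so reject it up front.
        return []
    for key, evals in SYMPTOM_TO_EVALS.items():
        # Assumed matching rule: either string contains the other.
        if query in key or key in query:
            return list(evals)  # copy so callers cannot mutate the table
    return []


if __name__ == "__main__":
    print(pick_evals("agent loops"))  # ['Trajectory Efficiency']
```

Substring matching is deliberately crude. A symptom that fits no entry returns an empty list rather than a guess.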

Formulas are simplified for clarity. Always calibrate against your specific domain and data distribution.