Quick reference · 6 tracks · 24 methods

LLM Eval Cheat Sheet

Every evaluation method covered in Output Observatory: what it measures, when to reach for it, and which tools to use.

Pick your eval by symptom

→ Model makes things up: Faithfulness (RAG) or Hallucination detection (LLM)

→ Answers the wrong question: Context Precision + Answer Relevance

→ Misses key information: Context Recall

→ Agent loops or over-plans: Trajectory Efficiency

→ Quality degrades over time: Drift Detection + LLM SLOs

→ Not sure which prompt is better: A/B Testing for LLMs

→ Fine-tune broke something: Behavioral Regression Testing

→ Can't debug prod failures: Observability & Tracing
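
If you want the symptom list in a triage script or runbook, it could be encoded as a plain lookup table. This is a minimal sketch, not part of any tool: the dictionary name, the snake_case symptom keys, and the `suggest_evals` helper are all hypothetical labels chosen for illustration, and only the method names come from the list above.

```python
# Hypothetical lookup table mirroring the symptom list above.
# The keys are illustrative labels, not identifiers from any library.
SYMPTOM_TO_EVALS = {
    "makes_things_up": ["Faithfulness (RAG)", "Hallucination detection (LLM)"],
    "answers_wrong_question": ["Context Precision", "Answer Relevance"],
    "misses_key_information": ["Context Recall"],
    "agent_loops_or_over_plans": ["Trajectory Efficiency"],
    "quality_degrades_over_time": ["Drift Detection", "LLM SLOs"],
    "unsure_which_prompt_is_better": ["A/B Testing for LLMs"],
    "fine_tune_broke_something": ["Behavioral Regression Testing"],
    "cannot_debug_prod_failures": ["Observability & Tracing"],
}


def suggest_evals(symptom: str) -> list[str]:
    """Return the evaluation methods listed for a symptom, or [] if unknown."""
    return SYMPTOM_TO_EVALS.get(symptom, [])
```

For example, `suggest_evals("misses_key_information")` returns `["Context Recall"]`, and an unrecognized symptom returns an empty list rather than raising.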

Formulas are simplified for clarity. Always calibrate against your specific domain and data distribution.