💬

LLM & Chatbot Eval

Hallucination at scale, response quality, safety & toxicity, multi-turn eval

DeepEval, Guardrails AI, Promptfoo

4 sections

Catching hallucinations at scale

Move beyond manual spot checks of a few samples to production-scale hallucination detection. Learn detection strategies, statistical sampling, sentinel queries, and how to catch model drift before users do.

9 min

DeepEval, LangSmith

Measuring response quality

Learn to evaluate coherence, completeness, tone, and format adherence — the dimensions that determine whether an LLM response is actually usable, not just technically correct.

8 min

DeepEval, Promptfoo

Safety and toxicity evaluation

Build a systematic safety eval pipeline — red-teaming, toxicity detection, jailbreak testing, prompt injection defense, and production content monitoring.

9 min

Promptfoo, Guardrails AI

Evaluating multi-turn conversations

Learn to evaluate chatbot quality across an entire conversation — context tracking, consistency, session-level goal completion, and how to detect the 'memory cliff' where bots forget what users told them.

8 min

DeepEval, LangSmith
