LLM Evaluation Framework
by @pitchinnate · 📚 Data
Evaluation harness for LLM-powered applications. Covers automated metrics (reference-based and reference-free), human eval, and regression prevention.
# CLAUDE.md: LLM Evaluation Specialist

## Evaluation Types

### Automated Metrics

- ROUGE-L / BLEU: only for summarisation/translation where reference exists
- BERTScore: semantic similarity, more robust than n-gram overlap
- G-Eval (GPT-4 as judge): faithfulness, relevance, coherence on 1–5 scale
- Latency p50/p95/p99 at target load

### Human Evaluation

- Blind A/B evaluation: evaluators rate without knowing which model produced output
- Side-by-side preference: which output is better, and why?
- Minimum 100 samples per condition, 3 annotators per sample
- Inter-annotator agreement (Krippendorff's α) reported

## Regression Prevention

- Golden dataset: 200 curated examples with expected outputs
- Run golden evals on every model or prompt change
- Alert if any metric drops > 2% from baseline
- Shadow mode: run new model in parallel, compare outputs before promoting

## Prompt Evaluation Dimensions

1. Instruction following: did it do what was asked?
2. Factual accuracy: are claims verifiable?
3. Coherence: is the output logically consistent?
4. Conciseness: no unnecessary padding?
5. Format compliance: does it match the requested output format?
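The Regression Prevention rule above ("alert if any metric drops > 2% from baseline") could be checked with something like the following. This is only a minimal sketch under stated assumptions: the function name `find_regressions`, the metric names, and the scores are all hypothetical; the 2% is read as a relative drop (the source does not say whether it is relative or absolute); and every metric is assumed to be higher-is-better, so latency percentiles would need the comparison reversed.

```python
from typing import Dict, List


def find_regressions(
    baseline: Dict[str, float],
    candidate: Dict[str, float],
    max_relative_drop: float = 0.02,
) -> List[str]:
    """Return the metrics that fell more than max_relative_drop below baseline.

    Assumptions (not stated in the source): higher is better for every
    metric, baseline values are positive, and the threshold is relative.
    """
    regressed = []
    for name, base in baseline.items():
        if base <= 0:
            raise ValueError(f"baseline for {name!r} must be positive")
        if name not in candidate:
            # A metric missing from the candidate run is flagged as a regression.
            regressed.append(name)
            continue
        relative_drop = (base - candidate[name]) / base
        if relative_drop > max_relative_drop:
            regressed.append(name)
    return sorted(regressed)


# Hypothetical golden-dataset scores, for illustration only.
baseline = {"faithfulness": 4.50, "relevance": 4.20, "bertscore_f1": 0.90}
candidate = {"faithfulness": 4.45, "relevance": 4.00, "bertscore_f1": 0.91}

print(find_regressions(baseline, candidate))  # prints ['relevance']
```

Here `relevance` falls by about 4.8% and is flagged, while the roughly 1.1% dip in `faithfulness` stays under the threshold. In a CI job, a non-empty result would typically fail the build or trigger the alert.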
submitted March 22, 2026