LLM Evaluation Framework

by @pitchinnate · 📚 Data · 12d ago · 55 views

Evaluation harness for LLM-powered applications. Covers automated metrics, human evaluation, and regression prevention.

# CLAUDE.md — LLM Evaluation Specialist

## Evaluation Types

### Automated Metrics
- ROUGE-L / BLEU: only applicable to summarisation/translation tasks where a reference output exists
- BERTScore: semantic similarity, more robust than n-gram overlap
- G-Eval (GPT-4 as judge): faithfulness, relevance, coherence on 1–5 scale
- Latency p50/p95/p99 at target load
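As a concrete illustration of the reference-based metrics above, here is a minimal from-scratch ROUGE-L F1 sketch. Tokenisation is naive whitespace splitting (an assumption for brevity); in practice you would use a library such as `rouge-score` or Hugging Face `evaluate`.

```python
def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, tok_a in enumerate(a):
        for j, tok_b in enumerate(b):
            if tok_a == tok_b:
                dp[i + 1][j + 1] = dp[i][j] + 1
            else:
                dp[i + 1][j + 1] = max(dp[i][j + 1], dp[i + 1][j])
    return dp[len(a)][len(b)]


def rouge_l_f1(candidate: str, reference: str) -> float:
    """ROUGE-L F1: harmonic mean of LCS-based precision and recall."""
    cand, ref = candidate.split(), reference.split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)
```

This keeps the metric transparent for debugging; a library implementation adds stemming and proper tokenisation.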

### Human Evaluation
- Blind A/B evaluation: evaluators rate without knowing which model produced output
- Side-by-side preference: which output is better, and why?
- Minimum 100 samples per condition, 3 annotators per sample
- Inter-annotator agreement (Krippendorff's α) reported
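A sketch of how the blind A/B setup can be wired up: model identity is hidden behind "A"/"B" labels and the assignment is randomised per sample so position bias cannot reveal the model. Function and field names here are illustrative, not a fixed API.

```python
import random


def make_blind_pairs(samples, rng=None):
    """samples: list of (prompt, baseline_output, candidate_output).

    Returns (display, key): `display` is what annotators see, with model
    identity hidden; `key` maps labels back to models for later unblinding.
    """
    rng = rng or random.Random()
    display, key = [], []
    for prompt, base_out, cand_out in samples:
        if rng.random() < 0.5:
            display.append({"prompt": prompt, "A": base_out, "B": cand_out})
            key.append({"A": "baseline", "B": "candidate"})
        else:
            display.append({"prompt": prompt, "A": cand_out, "B": base_out})
            key.append({"A": "candidate", "B": "baseline"})
    return display, key
```

Seeding the `Random` instance makes the assignment reproducible for auditing while still being unpredictable to annotators.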

## Regression Prevention
- Golden dataset: 200 curated examples with expected outputs
- Run golden evals on every model or prompt change
- Alert if any metric drops > 2% from baseline
- Shadow mode: run new model in parallel, compare outputs before promoting
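The golden-eval gate above can be sketched as a simple comparison against the stored baseline, using the >2% relative-drop threshold from the rule (function name and dict shape are assumptions):

```python
def find_regressions(baseline: dict[str, float],
                     current: dict[str, float],
                     max_drop: float = 0.02) -> list[str]:
    """Return names of metrics whose relative drop from baseline exceeds max_drop."""
    regressed = []
    for name, base in baseline.items():
        cur = current.get(name)
        if cur is None or base <= 0:
            # Missing metric or degenerate baseline: surface separately,
            # don't silently treat as a pass.
            continue
        if (base - cur) / base > max_drop:
            regressed.append(name)
    return regressed
```

In CI, a non-empty return value would fail the pipeline and block the model or prompt change from promoting.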

## Prompt Evaluation Dimensions
1. Instruction following: did it do what was asked?
2. Factual accuracy: are claims verifiable?
3. Coherence: is the output logically consistent?
4. Conciseness: no unnecessary padding?
5. Format compliance: does it match the requested output format?
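The five dimensions can be captured per sample as a small rubric record; scores use the same 1–5 scale as the G-Eval metrics earlier. This is a minimal sketch, and the field names are illustrative:

```python
from dataclasses import dataclass


@dataclass
class RubricScore:
    """Per-sample scores for the five prompt-evaluation dimensions (1-5 scale)."""
    instruction_following: int
    factual_accuracy: int
    coherence: int
    conciseness: int
    format_compliance: int

    def overall(self) -> float:
        """Unweighted mean across dimensions; reject out-of-range scores."""
        vals = (self.instruction_following, self.factual_accuracy,
                self.coherence, self.conciseness, self.format_compliance)
        if not all(1 <= v <= 5 for v in vals):
            raise ValueError("rubric scores must be on the 1-5 scale")
        return sum(vals) / len(vals)
```

An unweighted mean is the simplest aggregate; weighting (e.g. up-weighting factual accuracy) is a per-application choice.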
submitted March 22, 2026