Eval
Every claim, browseable.
The 1,120-case adversarial eval suite, side by side with each model's per-case answer and citations. Filter by category, difficulty, or comparison verdict. Click a row to see the answer, the tool-call trace, and the cited paragraphs — every paragraph_id deep-links to the source in the reading site.
Scoring is deterministic. A case passes when every expected_works entry appears in the model's answer, with NFKD diacritic folding so “Manusmṛti” matches “manusmrti.” No LLM judge runs at score time — the verdict is a substring check on prose, not a graded judgment. A graded 3-state score (pass / mixed / fail) tied to citation discipline is queued; see methodology for the trade-off and what the current numbers mean.
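The check described above can be sketched in a few lines of Python. This is a minimal illustration, not the suite's actual scorer — the function names `fold` and `case_passes` are ours, but the logic (NFKD decomposition, dropping combining marks, lowercasing, then a substring test per expected_works entry) follows the description:

```python
import unicodedata

def fold(s: str) -> str:
    # NFKD splits precomposed characters (e.g. "ṛ") into a base letter
    # plus combining marks; we drop the marks and lowercase the rest.
    decomposed = unicodedata.normalize("NFKD", s)
    return "".join(c for c in decomposed if not unicodedata.combining(c)).lower()

def case_passes(answer: str, expected_works: list[str]) -> bool:
    # Pass only if every expected work appears, after folding, in the answer.
    folded_answer = fold(answer)
    return all(fold(work) in folded_answer for work in expected_works)
```

For example, `fold("Manusmṛti")` yields `"manusmrti"`, so an answer that spells the title either way satisfies the same expected_works entry.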