Eval
Every claim, browseable.
The 1,120-case adversarial eval suite, side by side with each model's per-case answer and citations. Filter by category, difficulty, or comparison verdict. Click a row to see the answer, the tool-call trace, and the cited paragraphs — every paragraph_id deep-links to the source in the reading site.
Scoring is deterministic. A case passes when every expected_works entry appears in the model's answer, with NFKD diacritic folding so “Manusmṛti” matches “manusmrti.” No LLM judge runs at score time — the verdict is a substring check on prose, not a graded judgment. A graded 3-state score (pass / mixed / fail) tied to citation discipline is queued; see methodology for the trade-off and what the current numbers mean.
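The check described above can be sketched in a few lines of Python. This is a minimal illustration, not the suite's actual scorer — the function names `fold` and `case_passes` are ours, but the logic (NFKD decomposition, dropping combining marks, lowercasing, then a substring test per expected_works entry) follows the description:

```python
import unicodedata

def fold(s: str) -> str:
    # NFKD splits precomposed characters (e.g. "ṛ") into a base letter
    # plus combining marks; we drop the marks and lowercase the rest.
    decomposed = unicodedata.normalize("NFKD", s)
    return "".join(c for c in decomposed if not unicodedata.combining(c)).lower()

def case_passes(answer: str, expected_works: list[str]) -> bool:
    # Pass only if every expected work appears, after folding, in the answer.
    folded_answer = fold(answer)
    return all(fold(work) in folded_answer for work in expected_works)
```

For example, `fold("Manusmṛti")` yields `"manusmrti"`, so an answer that spells the title either way satisfies the same expected_works entry.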