
    Falsafa · Thesis

    A data essay

    A wiki at archive scale, no embeddings.

    Falsafa scales Andrej Karpathy's LLM-as-Wiki gist past three million words of a curated literary corpus, and its ambition is to push past a hundred million with the Perseus archive. No vector index. No chunking. No embedding model. The host LLM navigates the markdown directly through eight librarian tools, and every cited paragraph resolves to a deep link in the reading site.

    For a curated archive of natural-language texts, an LLM navigating markdown through tools beats embedding-based retrieval. The corpus is small enough to index with traditional IR. The structure is too valuable to flatten into similarity. The citation has to resolve to a paragraph, not a chunk.

    Scaling Karpathy's LLM-Wiki gist

    Editorial chart (log scale). Word counts across the English layer.

    Karpathy gist: 250,000 words
    Falsafa today: 3.09M words
    Falsafa + Perseus: ~100M words

    Hatched bar = proposal, contingent on Perseus ingest.

    Wiki layer vs baseline (A/B ablation)

    Same model (Grok 4.1 Fast), same 1,120 questions, two arms. Baseline gets 8 MCP tools; wiki adds read_wiki and read_wiki_full for a deterministic rule-based summary layer.


    A/B numbers render here once apps/site/public/eval-index.json contains both __baseline and __wiki arms.

    Retrieve-and-forget vs accumulate-and-compound.

    A vector index is stateless: every query embeds, retrieves the top-k chunks, and forgets. The corpus has no memory of what the librarian learned last time. Falsafa's manifest is a directory. Each work, chapter, paragraph is a stable identifier; metadata is a one-line filter; cross-links are TF-IDF baked at build time. The librarian compounds knowledge against a curated artifact rather than re-grinding it on every turn.

    Paragraph-stable IDs preserve traceability.

    Vector RAG cites chunks. A 512-token chunk has no readable address; a reader cannot follow it back. Falsafa cites paragraphs by stable six-character FNV-1a hashes (p-868413). The reading site at /works/...#p-xxxxxx resolves the hash to a highlighted line. read_chapter annotates every paragraph inline. Across the post-patch eval, zero citations hit the pre-fix failure mode (verse markers like Mn_1.52 cited in place of paragraph hashes); every citation resolves.

    Build-time-mostly architecture means $0 runtime.

    Falsafa's MCP server runs no LLM. It returns text and structure. The host model — whichever the user installed in Claude Desktop, Cursor, or Codex — does the reasoning. Falsafa pays for nothing but a static-host bill. The user's preferred frontier model is the librarian's brain; any improvement there (Sonnet 4.6 → 4.7, GPT-5 → 6) lifts every Falsafa session for free, and Falsafa never has to retest a synthesizer prompt.

    Try it yourself.

    Bring your own key. The /try page runs the full eight-tool surface against the corpus in your browser, streams the trace, and deep-links every cited paragraph. The audit trail behind the headline numbers above is at /eval.

    Further reading · the long form

    1. The default RAG pipeline

    The reflex when an AI builder hears "build a chatbot over this archive" is well-rehearsed: split documents into chunks of roughly 512 tokens, embed each with an off-the-shelf model, write the vectors into a vector database, retrieve the top-k chunks per query by cosine similarity, paste them into the LLM's context window with a "given the following excerpts" preamble, and call it RAG.
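The loop can be sketched end-to-end. This is a toy stand-in, not any production system: a bag-of-words vector replaces the learned embedding model, and every function name is illustrative, chosen only to make the shape of the pipeline visible.

```typescript
// Toy stand-in for the default RAG pipeline: chunk, "embed", cosine top-k.
// A real pipeline calls a learned embedding model; bag-of-words here only
// makes the shape of the loop visible.
type Vec = Map<string, number>;

function embed(text: string): Vec {
  const v: Vec = new Map();
  for (const tok of text.toLowerCase().split(/\W+/).filter(Boolean)) {
    v.set(tok, (v.get(tok) ?? 0) + 1);
  }
  return v;
}

function cosine(a: Vec, b: Vec): number {
  let dot = 0, na = 0, nb = 0;
  a.forEach((x, t) => { dot += x * (b.get(t) ?? 0); na += x * x; });
  b.forEach(y => { nb += y * y; });
  return na && nb ? dot / (Math.sqrt(na) * Math.sqrt(nb)) : 0;
}

// Fixed-size windows: the "512-token chunk" step, with no respect for
// paragraph or chapter boundaries.
function chunk(doc: string, size = 512): string[] {
  const words = doc.split(/\s+/).filter(Boolean);
  const out: string[] = [];
  for (let i = 0; i < words.length; i += size) {
    out.push(words.slice(i, i + size).join(" "));
  }
  return out;
}

function retrieveTopK(query: string, docs: string[], k: number): string[] {
  const q = embed(query);
  return docs
    .flatMap(d => chunk(d))
    .map(c => ({ c, score: cosine(q, embed(c)) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, k)
    .map(x => x.c);
}
```

The retrieved chunks are then pasted into the context window with the "given the following excerpts" preamble. Note what falls out of this loop as a citation: a chunk index or score, not a readable address.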

    That pipeline assumes four things at once:

    1. The corpus is too large to fit in any context window.
    2. The corpus has no usable structure (or its structure is hidden inside PDFs and HTML scrapes).
    3. The user query is the only navigation signal that matters.
    4. Cosine similarity on a contrastive general-purpose embedding model approximates relevance well enough.

    For a public dump of unstructured documents, those assumptions are roughly correct and the pipeline is roughly the right answer. For a curated archive of literary texts, every one of them is wrong.

    2. Why that's the wrong question for this archive

    Corpus size

    Falsafa's corpus is 37 works. /numbers gives the exact figures: 836 logical chapters, 2,089 variant entries, 76,303 paragraphs, roughly 3.1M words summed across every variant. The Pagefind index Falsafa ships for full-text site search is around 1MB compressed. The whole metadata layer (per-work index.md, per-chapter meta.json, per-paragraph *.paragraphs.json) is a few megabytes. None of this needs an approximate-nearest-neighbour index. It fits in a directory.

    Vector retrieval was invented to solve a real problem: when the corpus is a hundred million documents and the question is "find me a few that are semantically near this one," you cannot scan. You quantize, you index, you take the cosine hit. For a corpus that fits in a build artifact, that machinery solves a problem you don't have, and pays the well-known costs (chunk-boundary noise, lossy embedding, index drift on re-ingest) anyway.

    Curation flattens into similarity

    Each work in Falsafa carries an attested author, an era, a source language and script, a translator, a published year, a layout (prose or verse), and an ordered chapter sequence. Each chapter carries its variants (Old English original, Latin-script transliteration, English translation), each with its own paragraph sequence. None of that is in the body text. All of it is in the metadata. An embedding index sees the body. It cannot answer "list the Sanskrit smṛti texts" without a scan that the metadata answers in one filter.
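A sketch of why the metadata layer wins here. The WorkMeta shape and every field name are hypothetical; only the unknown-manusmrti-347b76 slug comes from the corpus, and the second catalogue entry is an invented example.

```typescript
// Hypothetical shape for the per-work metadata layer; field names are
// illustrative, not Falsafa's actual schema.
interface WorkMeta {
  slug: string;
  era: string;
  sourceLanguage: string;
  genre: string;
}

const catalogue: WorkMeta[] = [
  { slug: "unknown-manusmrti-347b76", era: "classical", sourceLanguage: "Sanskrit", genre: "smriti" },
  { slug: "example-beowulf", era: "medieval", sourceLanguage: "Old English", genre: "epic" }, // invented entry
];

// "List the Sanskrit smṛti texts" is a one-line filter over metadata,
// not a similarity scan over body text.
const sanskritSmriti = catalogue.filter(
  w => w.sourceLanguage === "Sanskrit" && w.genre === "smriti"
);
```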

    Citation has to resolve to a paragraph

    Vector RAG cites chunks. A chunk is a 512-token window identified by a chunk index, a vector ID, or a score. A reader cannot follow that citation. Falsafa cites paragraphs by stable hash. Every paragraph in every variant carries a six-character FNV-1a hash of its content, prefixed p-. The reading site resolves the hash to a highlighted line. The MCP's read_chapter annotates its output with [p-xxxxxx] markers so the host LLM can cite directly.

    See annotateBodyWithParagraphIds in apps/mcp/src/tools.ts. It's twenty lines.
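A minimal reconstruction of the idea. The 32-bit FNV-1a parameters are the standard ones, but truncating to six hex characters and the exact marker placement are assumptions, not Falsafa's actual annotateBodyWithParagraphIds.

```typescript
// FNV-1a 32-bit over UTF-8 bytes (offset basis 0x811c9dc5, prime 0x01000193).
function fnv1a32(text: string): number {
  let h = 0x811c9dc5;
  const bytes = new TextEncoder().encode(text);
  for (let i = 0; i < bytes.length; i++) {
    h ^= bytes[i];
    h = Math.imul(h, 0x01000193) >>> 0; // multiply, keep unsigned 32-bit
  }
  return h >>> 0;
}

// Assumed encoding: six hex characters of the 32-bit hash, prefixed "p-".
function paragraphId(text: string): string {
  return "p-" + fnv1a32(text).toString(16).padStart(8, "0").slice(0, 6);
}

// Sketch of the annotation step: append the stable ID marker to each
// paragraph so the host LLM can cite it inline.
function annotateBody(paragraphs: string[]): string {
  return paragraphs.map(p => `${p} [${paragraphId(p)}]`).join("\n\n");
}
```

Because the ID is a pure function of the paragraph's content, it survives rebuilds unchanged as long as the text does, which is what makes the /works/...#p-xxxxxx deep links stable.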

    3. The eight-tool surface

    The MCP server exposes eight tools. Each returns text plus structure; none calls an LLM internally.

    list_works: the catalogue, optionally filtered by era, author, language, genre, difficulty.
    list_chapters: ordered chapter sequence for a work, with available variants.
    get_metadata: full provenance for a work (author bio, era, layouts, variant types).
    read_chapter: the chapter body as markdown, annotated with stable [p-xxxxxx] identifiers.
    get_passage: specific paragraphs by ID or range. The citation primitive.
    search_corpus: Pagefind-backed full-text search with an IDF-ranked auto-fallback for long queries.
    find_related: TF-IDF cosine-ranked cross-links built at ingest, with a structural fallback.
    compare_works: topic-restricted chapter pointers from each of two works. The host LLM does the comparing.
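The build-time cross-links behind find_related can be sketched like this. Function names and the top-N shape are illustrative assumptions; the point is that the ranking is computed once at ingest, so the runtime tool is a lookup, not a query.

```typescript
// Build-time sketch: TF-IDF vectors over chapters, cosine-ranked, baked into
// the manifest once at ingest. Names are illustrative, not Falsafa's code.
type SparseVec = Map<string, number>;

function tfidf(docs: string[]): SparseVec[] {
  const tokened = docs.map(d => d.toLowerCase().split(/\W+/).filter(Boolean));
  const df = new Map<string, number>();
  for (const toks of tokened) {
    new Set(toks).forEach(t => df.set(t, (df.get(t) ?? 0) + 1));
  }
  return tokened.map(toks => {
    const v: SparseVec = new Map();
    for (const t of toks) v.set(t, (v.get(t) ?? 0) + 1);
    v.forEach((tf, t) => v.set(t, tf * Math.log(docs.length / df.get(t)!)));
    return v;
  });
}

function cos(a: SparseVec, b: SparseVec): number {
  let dot = 0, na = 0, nb = 0;
  a.forEach((x, t) => { dot += x * (b.get(t) ?? 0); na += x * x; });
  b.forEach(y => { nb += y * y; });
  return na && nb ? dot / Math.sqrt(na * nb) : 0;
}

// Precompute each chapter's nearest neighbours once; runtime is a lookup.
function bakeCrossLinks(chapters: string[], topN = 3): number[][] {
  const vecs = tfidf(chapters);
  return vecs.map((v, i) =>
    vecs
      .map((w, j) => ({ j, s: i === j ? -1 : cos(v, w) }))
      .sort((a, b) => b.s - a.s)
      .slice(0, topN)
      .map(x => x.j)
  );
}
```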

    Vector RAG

    1. User query embedded
    2. Top-k chunks retrieved by cosine
    3. Chunks pasted into context
    4. LLM synthesizes from chunks
    5. Citation: chunk index or vector ID

    Falsafa MCP

    1. LLM reads the question
    2. Calls list_works or search_corpus
    3. Reads candidate chapters with read_chapter
    4. Pulls exact paragraphs with get_passage
    5. Citation: work_slug + p-xxxxxx

    4. What we are not claiming

    A small-N run is a proof of protocol. It is not a confidence-bounded estimate of corpus coverage, and it does not yet support a head-to-head claim against vector RAG. The intellectually honest comparison is a hybrid baseline. We are building one at apps/baseline/: the same 1,000-question pool, the same blind-sub-agent dispatch, but the agent is given a vector-retrieval tool over the same corpus instead of the Falsafa eight. When that lands, the eval explorer at /eval shows both rows side by side, per case, and any reader can verify the comparison.

    5. How a case passes the eval (deterministic, no LLM judge)

    The eval at /eval reports pass rates per tier and per arm. This section explains what those numbers actually measure today, and what the current rework will change.

    Today: diacritic-folded substring match

    Every question in the pool ships with an expected_works list — slugs like unknown-manusmrti-347b76 identifying the work(s) the auditor expects the model to land on. A case passes when every expected slug (or its longest meaningful token) appears somewhere in the model's answer text. NFKD normalization strips combining marks, so “Manusmṛti” matches “manusmrti” cleanly.
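The fold-and-match check described above can be sketched directly. This is a minimal reconstruction under the stated rules (NFKD, strip combining marks, substring match on the slug or its longest token), not the eval's exact code.

```typescript
// NFKD-normalize, strip combining marks, lowercase.
function fold(s: string): string {
  return s.normalize("NFKD").replace(/[\u0300-\u036f]/g, "").toLowerCase();
}

// "Longest meaningful token" of a slug like unknown-manusmrti-347b76.
function longestToken(slug: string): string {
  return slug.split("-").reduce((best, t) => (t.length > best.length ? t : best), "");
}

// A case passes when every expected slug (or its longest token) appears
// somewhere in the folded answer text.
function casePasses(answer: string, expectedWorks: string[]): boolean {
  const folded = fold(answer);
  return expectedWorks.every(
    slug => folded.includes(fold(slug)) || folded.includes(fold(longestToken(slug)))
  );
}
```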

    No LLM judges anything. A previous iteration scaffolded a Sonnet-as-judge layer; we retired it once the citation contract plus the deterministic substring check carried the weight. The historical judge artifacts under eval/judge/ are archived for reference; they do not feed today's scores.

    Why this is honest, and where it falls short

    A model can name “Manusmṛti” in prose without ever opening it through the MCP. The substring check passes that case, but the model didn't actually verify anything. We measured the gap: switching the same scoring to a strict citation-array check (every expected work must appear as work_slug in the structured citations the model emitted) drops the headline from 84.6% to 50.6% on the same data. That 34-point gap is the citation-discipline gap. The substring number understates rigor; the citation-strict number is too binary to do the comparison story justice.

    Next: graded 3-state score

    The rework moves to a 0 / 0.5 / 1 scale: pass when every expected work has a structured citation, mixed when some expected works are cited or all are named in prose without formal citation, fail when none. A one-time audit pass will also expand expected_works to cover valid alternative sources the model surfaces. The current published numbers are the loose end; the graded numbers will be the honest end. The paper waits for the graded numbers.
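A minimal sketch of that graded scorer. The record shape and field names are assumptions about the eval artifacts, and the named-in-prose check reuses a crude slug-token heuristic; only the 0 / 0.5 / 1 semantics come from the plan above.

```typescript
// Hypothetical eval-record shape; field names are illustrative.
interface CaseResult {
  expectedWorks: string[]; // slugs the auditor expects
  citedWorks: string[];    // work_slug values from the model's structured citations
  answerText: string;
}

function grade(r: CaseResult): 0 | 0.5 | 1 {
  const citedCount = r.expectedWorks.filter(w => r.citedWorks.includes(w)).length;
  // Pass: every expected work has a structured citation.
  if (citedCount === r.expectedWorks.length) return 1;
  // Mixed: some cited, or all at least named in prose (crude token check).
  const text = r.answerText.toLowerCase();
  const namedAll = r.expectedWorks.every(slug =>
    slug.split("-").some(tok => tok.length >= 5 && text.includes(tok))
  );
  if (citedCount > 0 || namedAll) return 0.5;
  return 0;
}
```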

    The full plan is at TODOS.md — search for “EVAL SCORING REWORK”.

    6. The Karpathy nod

    The framing of "tools return text and structure, the LLM does the reasoning" is not original to us. Andrej Karpathy proposed something close in his gist on LLM-maintained markdown wikis: humans curate sources, the LLM does the bookkeeping that makes a knowledge base actually usable, and the wiki is a persistent compounding artifact rather than a chunk index rebuilt on every query.

    For a curated archive, the wiki is not a place you put knowledge for the LLM to retrieve. It is the place where the LLM does its work.

    falsafa.app/thesis