A data essay
A wiki at archive scale, no embeddings.
Falsafa scales Andrej Karpathy's LLM-as-Wiki gist past three million words of a curated literary corpus, with the ambition of pushing past a hundred million via the Perseus archive. No vector index. No chunking. No embedding model. The host LLM navigates the markdown directly through eight librarian tools, and every cited paragraph resolves to a deep link in the reading site.
For a curated archive of natural-language texts, an LLM navigating markdown through tools beats embedding-based retrieval. The corpus is small enough to index with traditional IR. The structure is too valuable to flatten into similarity. The citation has to resolve to a paragraph, not a chunk.
Scaling Karpathy's LLM-Wiki gist
Editorial chart, log-flat. Word counts across the English layer.
Hatched bar = proposal, contingent on Perseus ingest.
Wiki layer vs baseline (A/B ablation)
Same model (Grok 4.1 Fast), same 1,120 questions, two arms. Baseline gets 8 MCP tools; the wiki arm adds read_wiki and read_wiki_full for a deterministic rule-based summary layer.
Legend: Baseline (no wiki) · Wiki layer.
A/B numbers render here once apps/site/public/eval-index.json contains both __baseline and __wiki arms.
Retrieve-and-forget vs accumulate-and-compound.
A vector index is stateless: every query embeds, retrieves the top-k chunks, and forgets. The corpus has no memory of what the librarian learned last time. Falsafa's manifest is a directory. Each work, chapter, and paragraph has a stable identifier; metadata is a one-line filter; cross-links are TF-IDF-ranked and baked at build time. The librarian compounds knowledge against a curated artifact rather than re-grinding it on every turn.
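A sketch of the shape that directory can take, as TypeScript types. The field names here are illustrative assumptions, not Falsafa's exact schema:

```typescript
// Illustrative manifest shape: one entry per work, chapters ordered as curated,
// cross-links precomputed at build time. Field names are assumptions.
interface RelatedLink {
  slug: string;   // target chapter or work
  score: number;  // TF-IDF cosine similarity, baked at ingest
}

interface ChapterEntry {
  slug: string;
  variants: string[];       // e.g. ["original", "transliteration", "translation"]
  paragraphIds: string[];   // stable p-xxxxxx content hashes, in reading order
  related: RelatedLink[];   // precomputed cross-links, no runtime embedding call
}

interface WorkEntry {
  slug: string;             // e.g. "unknown-manusmrti-347b76"
  author: string;
  era: string;
  language: string;
  chapters: ChapterEntry[]; // ordered chapter sequence
}
```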
Paragraph-stable IDs preserve traceability.
Vector RAG cites chunks. A 512-token chunk has no readable address — a reader cannot follow it back. Falsafa cites paragraphs by stable six-character FNV-1a hashes (p-868413). The reading site at /works/...#p-xxxxxx resolves the hash to a highlighted line. read_chapter annotates every paragraph inline. Of — citations across the post-patch eval, 0 were the pre-fix failure mode (verse markers like Mn_1.52); the rest resolve.
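A minimal sketch of how such a content-stable ID can be derived. The 32-bit FNV-1a constants are standard; the six-character truncation and encoding are assumptions about the build step:

```typescript
// 32-bit FNV-1a over the paragraph's UTF-8 bytes.
function fnv1a32(text: string): number {
  let hash = 0x811c9dc5; // FNV offset basis
  for (const byte of new TextEncoder().encode(text)) {
    hash ^= byte;
    hash = Math.imul(hash, 0x01000193) >>> 0; // FNV prime, kept in uint32 range
  }
  return hash >>> 0;
}

// Hypothetical: render the hash as a six-character ID with the p- prefix.
// The real encoding (hex vs decimal, truncation rule) lives in Falsafa's ingest.
function paragraphId(text: string): string {
  return "p-" + fnv1a32(text).toString(16).padStart(8, "0").slice(0, 6);
}
```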
Build-time-mostly architecture means $0 runtime.
Falsafa's MCP server runs no LLM. It returns text and structure. The host model — whichever the user installed in Claude Desktop, Cursor, or Codex — does the reasoning. Falsafa pays for nothing but a static-host bill. The user's preferred frontier model is the librarian's brain; any improvement there (Sonnet 4.6 → 4.7, GPT-5 → 6) lifts every Falsafa session for free, and Falsafa never has to retest a synthesizer prompt.
Try it yourself.
Bring your own key. The /try page runs the full eight-tool surface against the corpus in your browser, streams the trace, and deep-links every cited paragraph. The audit trail behind the headline numbers above is at /eval.
Further reading · the long form
1. The default RAG pipeline
The reflex when an AI builder hears "build a chatbot over this archive" is well-rehearsed: split documents into chunks of roughly 512 tokens, embed each with an off-the-shelf model, write the vectors into a vector database, retrieve the top-k chunks per query by cosine similarity, paste them into the LLM's context window with a "given the following excerpts" preamble, and call it RAG.
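For concreteness, here is that reflex pipeline as a sketch. splitIntoChunks, embed, vectorStore, and llm are hypothetical stand-ins for whatever provider a builder reaches for:

```typescript
// Hypothetical provider stand-ins; any embedding model and vector DB fit this shape.
declare function splitIntoChunks(doc: string, approxTokens: number): string[];
declare function embed(text: string): Promise<number[]>;
declare const vectorStore: {
  upsert(items: { text: string; vector: number[] }[]): Promise<void>;
  query(vector: number[], opts: { k: number }): Promise<{ text: string }[]>;
};
declare const llm: { complete(prompt: string): Promise<string> };

// Ingest once: chunk and embed every document.
async function ingest(documents: string[]): Promise<void> {
  const chunks = documents.flatMap((doc) => splitIntoChunks(doc, 512));
  const vectors = await Promise.all(chunks.map((c) => embed(c)));
  await vectorStore.upsert(chunks.map((text, i) => ({ text, vector: vectors[i] })));
}

// Per query: embed, retrieve top-k by cosine, paste into context, synthesize.
async function answerWithRag(query: string): Promise<string> {
  const topK = await vectorStore.query(await embed(query), { k: 8 });
  const excerpts = topK.map((c) => c.text).join("\n---\n");
  return llm.complete(`Given the following excerpts:\n\n${excerpts}\n\nAnswer: ${query}`);
}
```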
That pipeline assumes four things at once: that the corpus is too large to fit in any context window; that the corpus has no usable structure (or its structure is hidden inside PDFs and HTML scrapes); that the user query is the only navigation signal that matters; and that cosine similarity on a contrastive general-purpose embedding model approximates relevance well enough.
For a public dump of unstructured documents, those assumptions are roughly correct and the pipeline is roughly the right answer. For a curated archive of literary texts, every one of them is wrong.
2. Why that's the wrong question for this archive
Corpus size
Falsafa's corpus is 37 works. /numbers gives the exact figures: 836 logical chapters, 2,089 variant entries, 76,303 paragraphs, roughly 3.1M words summed across every variant. The Pagefind index Falsafa ships for full-text site search is around 1MB compressed. The whole metadata layer (per-work index.md, per-chapter meta.json, per-paragraph *.paragraphs.json) is a few megabytes. None of this needs an approximate-nearest-neighbour index. It fits in a directory.
Vector retrieval was invented to solve a real problem: when the corpus is a hundred million documents and the question is "find me a few that are semantically near this one," you cannot scan. You quantize, you index, you take the cosine hit. For a corpus that fits in a build artifact, that machinery solves a problem you don't have, and pays the well-known costs (chunk-boundary noise, lossy embedding, index drift on re-ingest) anyway.
Curation flattens into similarity
Each work in Falsafa carries an attested author, an era, a source language and script, a translator, a published year, a layout (prose or verse), and an ordered chapter sequence. Each chapter carries its variants (Old English original, Latin-script transliteration, English translation), each with its own paragraph sequence. None of that is in the body text. All of it is in the metadata. An embedding index sees the body. It cannot answer "list the Sanskrit smṛti texts" without a scan that the metadata answers in one filter.
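The same point in code: the smṛti question above is a one-line filter over the metadata layer, sketched here with assumed field names:

```typescript
// Hypothetical catalogue rows, drawn from the per-work metadata (index.md).
interface CatalogueRow {
  slug: string;
  language: string;
  genre: string;
}
declare const catalogue: CatalogueRow[];

// "List the Sanskrit smṛti texts" is a filter, not a similarity scan.
const smrtiTexts = catalogue.filter(
  (w) => w.language === "Sanskrit" && w.genre === "smṛti"
);
```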
Citation has to resolve to a paragraph
Vector RAG cites chunks. A chunk is a 512-token window identified by a chunk index, a vector ID, or a score. A reader cannot follow that citation. Falsafa cites paragraphs by stable hash. Every paragraph in every variant carries a six-character FNV-1a hash of its content, prefixed p-. The reading site resolves the hash to a highlighted line. The MCP's read_chapter annotates its output with [p-xxxxxx] markers so the host LLM can cite directly. See annotateBodyWithParagraphIds in apps/mcp/src/tools.ts. It's twenty lines.
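The real function lives in apps/mcp/src/tools.ts; a minimal sketch of the idea, assuming paragraphs arrive pre-split with their IDs from the build artifacts, might look like this:

```typescript
// Assumed input shape: paragraphs as stored in *.paragraphs.json.
interface Paragraph {
  id: string;   // stable content hash, e.g. "p-868413"
  text: string; // paragraph body in the requested variant
}

// Append each paragraph's stable ID so the host LLM can cite it directly
// and the reading site can resolve the citation to a highlighted line.
function annotateBodyWithParagraphIds(paragraphs: Paragraph[]): string {
  return paragraphs.map((p) => `${p.text} [${p.id}]`).join("\n\n");
}
```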
3. The eight-tool surface
The MCP server exposes eight tools. Each returns text plus structure; none calls an LLM internally.
| Tool | What it returns |
|---|---|
| list_works | The catalogue, optionally filtered by era, author, language, genre, difficulty. |
| list_chapters | Ordered chapter sequence for a work, with available variants. |
| get_metadata | Full provenance for a work: author bio, era, layouts, variant types. |
| read_chapter | The chapter body as markdown, annotated with stable [p-xxxxxx] identifiers. |
| get_passage | Specific paragraphs by ID or range. The citation primitive. |
| search_corpus | Pagefind-backed full-text search with an IDF-ranked auto-fallback for long queries. |
| find_related | TF-IDF cosine-ranked cross-links built at ingest, with a structural fallback. |
| compare_works | Topic-restricted chapter pointers from each of two works. The host LLM does the comparing. |
Vector RAG
- User query embedded
- Top-k chunks retrieved by cosine
- Chunks pasted into context
- LLM synthesizes from chunks
- Citation: chunk index or vector ID
Falsafa MCP
- LLM reads the question
- Calls list_works or search_corpus
- Reads candidate chapters with read_chapter
- Pulls exact paragraphs with get_passage
- Citation: work_slug + p-xxxxxx
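A hypothetical host-side trace of that sequence. The tool names are Falsafa's; the call helper, argument names, and slugs are illustrative:

```typescript
// `call` stands in for however the host invokes an MCP tool.
declare function call(tool: string, args: Record<string, unknown>): Promise<string>;

async function answerOneQuestion(): Promise<string[]> {
  // 1. Narrow the catalogue by metadata (or fall back to full-text search).
  const candidates = await call("list_works", { language: "Sanskrit", genre: "smṛti" });

  // 2. Read a candidate chapter; the body arrives annotated with [p-xxxxxx] markers.
  const chapter = await call("read_chapter", {
    work: "unknown-manusmrti-347b76",
    chapter: 1,
  });

  // 3. Pull the exact paragraphs to cite: work_slug + p-xxxxxx is the citation.
  const passage = await call("get_passage", {
    work: "unknown-manusmrti-347b76",
    paragraphs: ["p-868413"],
  });

  return [candidates, chapter, passage];
}
```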
4. What we are not claiming
A small-N run is a proof of protocol. It is not a confidence-bounded estimate of corpus coverage, and it does not yet support a head-to-head claim against vector RAG. The intellectually honest comparison is a hybrid baseline. We are building one at apps/baseline/: the same 1,000-question pool, the same blind-sub-agent dispatch, but the agent is given a vector-retrieval tool over the same corpus instead of the Falsafa eight. When that lands, the eval explorer at /eval will show both rows side by side, per case, and any reader can verify the comparison.
5. How a case passes the eval (deterministic, no LLM judge)
The eval at /eval reports pass rates per tier and per arm. This section explains what those numbers actually measure today, and what the current rework will change.
Today: diacritic-folded substring match
Every question in the pool ships with an expected_works list — slugs like unknown-manusmrti-347b76 identifying the work(s) the auditor expects the model to land on. A case passes when every expected slug (or its longest meaningful token) appears somewhere in the model's answer text. NFKD normalization strips combining marks, so “Manusmṛti” matches “manusmrti” cleanly.
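A minimal version of that check (the real scorer's longest-meaningful-token fallback is omitted here):

```typescript
// Diacritic-folded matching: NFKD-normalize, strip combining marks, lowercase.
// "Manusmṛti" and "manusmrti" fold to the same string.
function fold(s: string): string {
  return s.normalize("NFKD").replace(/\p{M}/gu, "").toLowerCase();
}

// A case passes when every expected slug appears somewhere in the answer text.
function passes(answerText: string, expectedWorks: string[]): boolean {
  const folded = fold(answerText);
  return expectedWorks.every((slug) => folded.includes(fold(slug)));
}
```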
No LLM judges anything. A previous iteration scaffolded a Sonnet-as-judge layer; we retired it once the citation contract plus the deterministic substring check carried the weight. The historical judge artifacts under eval/judge/ are archived for reference; they do not feed today's scores.
Why this is honest, and where it falls short
A model can name “Manusmṛti” in prose without ever opening it through the MCP. The substring check passes that case, but the model didn't actually verify anything. We measured the gap: switching the same scoring to a strict citation-array check (every expected work must appear as work_slug in the structured citations the model emitted) drops the headline from 84.6% to 50.6% on the same data. That 34-point gap is the citation-discipline gap. The substring number overstates the model's rigor; the citation-strict number is too binary to do the comparison story justice.
Next: graded 3-state score
The rework moves to a 0 / 0.5 / 1 scale: pass when every expected work has a structured citation, mixed when some expected works are cited or all are named in prose without formal citation, fail when none. A one-time audit pass will also expand expected_works to cover valid alternative sources the model surfaces. The current published numbers are the loose end; the graded numbers will be the honest end. The paper waits for the graded numbers.
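A sketch of the graded scale, assuming the scorer sees both the structured citation array and the answer text (fold is the diacritic-folding helper from the sketch above):

```typescript
// Diacritic-folding helper from the earlier sketch.
declare function fold(s: string): string;

type Grade = 0 | 0.5 | 1;

// Proposed grading: 1 when every expected work appears in the structured citations,
// 0.5 when some are cited (or all are named only in prose), 0 when none.
function grade(expectedWorks: string[], citedSlugs: string[], answerText: string): Grade {
  const cited = new Set(citedSlugs);
  const citedCount = expectedWorks.filter((slug) => cited.has(slug)).length;
  if (citedCount === expectedWorks.length) return 1;

  const allNamedInProse = expectedWorks.every((slug) =>
    fold(answerText).includes(fold(slug))
  );
  if (citedCount > 0 || allNamedInProse) return 0.5;
  return 0;
}
```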
The full plan is at TODOS.md — search for “EVAL SCORING REWORK”.
6. The Karpathy nod
The framing of "tools return text and structure, the LLM does the reasoning" is not original to us. Andrej Karpathy proposed something close in his gist on LLM-maintained markdown wikis: humans curate sources, the LLM does the bookkeeping that makes a knowledge base actually usable, and the wiki is a persistent compounding artifact rather than a chunk index rebuilt on every query.
For a curated archive, the wiki is not a place you put knowledge for the LLM to retrieve. It is the place where the LLM does its work.