Architecture
How a URL becomes a manifest, and how a manifest becomes five MCP tools.
Pinax is a single Go binary structured as a handful of focused packages.
Below is the pipeline a URL takes from pinax add to pinax serve,
followed by the on-disk layout.
The pipeline
The pipeline has two halves: pinax add turns a URL into a manifest, and pinax serve turns manifests into tools. Pinax keeps the halves loosely coupled: the on-disk manifest is the only contract between them, so each can change independently.
Add: URL → manifest
            ┌──────────────┐    ┌──────────────┐
URL ──────▶ │   crawler    │──▶ │  preflight   │──▶ manifest.json
            └──────────────┘    └──────────────┘
             llms.txt →          density gate
             sitemap →           (samples, extracts,
             bounded BFS          refuses if thin)
pinax add runs discovery then a gate. The crawler tries llms.txt,
then sitemap.xml, then bounded BFS, and stops at the first strategy
that yields pages. The preflight gate then samples that page set,
extracts each sample to measure prose density, and refuses to write the
manifest when the site is mostly chrome.
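The gate can be sketched as follows. This is a minimal illustration, not Pinax's actual preflight code: it assumes density is the ratio of extracted prose bytes to raw page bytes, averaged over the samples, and the `sample` type, `minDensity` threshold, and `gate` function are all hypothetical names and values.

```go
package main

import (
	"errors"
	"fmt"
)

// sample pairs a fetched page's raw size with the size of its extracted prose.
// Both fields are hypothetical; Pinax's real measurement may differ.
type sample struct {
	rawBytes   int
	proseBytes int
}

// minDensity is an assumed threshold, not a documented Pinax default.
const minDensity = 0.25

var errThin = errors.New("site is mostly chrome; refusing to write manifest")

// meanDensity averages prose/raw over the samples, skipping empty pages.
func meanDensity(samples []sample) float64 {
	var sum float64
	var n int
	for _, s := range samples {
		if s.rawBytes == 0 {
			continue
		}
		sum += float64(s.proseBytes) / float64(s.rawBytes)
		n++
	}
	if n == 0 {
		return 0
	}
	return sum / float64(n)
}

// gate returns errThin when the sampled pages carry too little prose.
func gate(samples []sample) error {
	if meanDensity(samples) < minDensity {
		return errThin
	}
	return nil
}

func main() {
	docs := []sample{{1000, 600}, {2000, 900}}
	shell := []sample{{1000, 50}, {2000, 40}}
	fmt.Println(gate(docs) == nil, gate(shell) == errThin)
}
```

Refusing on an empty sample set is a choice made for this sketch: a site that yields nothing measurable is treated the same as a thin one.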
Serve: manifest → MCP tools
manifest.json ──▶ BM25 index ──▶ mcp server ──▶ stdio / HTTP transport
                                  list_docs
                                  list_sections
                                  search_pages
                                  get_section_pages
                                  get_page
pinax serve mounts every manifest under ~/.pinax/servers/ into a
single MCP server. Four of the five tools serve from in-memory state;
only get_page reaches the network.
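Mounting every manifest could look roughly like the sketch below. It assumes one `manifest.json` per directory under a servers root (passed in as `root` rather than hard-coding ~/.pinax/servers/), and the `manifest` struct, its `base_url` JSON key, and `loadAll` are hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
	"path/filepath"
)

// manifest holds only the field this sketch needs; the JSON key is a guess.
type manifest struct {
	BaseURL string `json:"base_url"`
}

// loadAll reads <root>/<name>/manifest.json for every server directory
// and keys the result by <name>.
func loadAll(root string) (map[string]manifest, error) {
	paths, err := filepath.Glob(filepath.Join(root, "*", "manifest.json"))
	if err != nil {
		return nil, err
	}
	out := make(map[string]manifest, len(paths))
	for _, p := range paths {
		data, err := os.ReadFile(p)
		if err != nil {
			return nil, err
		}
		var m manifest
		if err := json.Unmarshal(data, &m); err != nil {
			return nil, fmt.Errorf("%s: %w", p, err)
		}
		out[filepath.Base(filepath.Dir(p))] = m
	}
	return out, nil
}

// demo builds a throwaway servers directory with two manifests and mounts it.
func demo() (int, error) {
	root, err := os.MkdirTemp("", "pinax-servers-")
	if err != nil {
		return 0, err
	}
	defer os.RemoveAll(root)
	for _, name := range []string{"go-docs", "rust-docs"} {
		dir := filepath.Join(root, name)
		if err := os.MkdirAll(dir, 0o755); err != nil {
			return 0, err
		}
		body := []byte(`{"base_url":"https://example.com/` + name + `"}`)
		if err := os.WriteFile(filepath.Join(dir, "manifest.json"), body, 0o644); err != nil {
			return 0, err
		}
	}
	mounted, err := loadAll(root)
	return len(mounted), err
}

func main() {
	n, err := demo()
	fmt.Println(n, err)
}
```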
Tool call: get_page lifecycle
agent
│
▼
get_page(url)
│
▼
┌─────────┐ hit
│ cache │ ─────────────▶ clean Markdown
└─────────┘
│ miss
▼
HTTP fetch
│
▼
extractor
(HTML/MD →
clean Markdown)
│
▼
cache.Set
│
▼
clean Markdown
Cached pages return immediately. On a miss, Pinax fetches the URL with
the configured User-Agent, runs the extractor to strip nav/footer
chrome and normalise HTML to Markdown, and stores the result in the
SQLite cache (WAL mode, TTL applied at read time) before returning it.
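The hit/miss flow with a read-time TTL can be sketched like this. An in-memory map stands in for the SQLite cache so the example is self-contained, and `pageCache`, `getPage`, and the injected `fetchAndExtract` function are hypothetical names, not Pinax's API.

```go
package main

import (
	"fmt"
	"time"
)

type entry struct {
	markdown string
	storedAt time.Time
}

// pageCache is an in-memory stand-in for the SQLite page cache.
type pageCache struct {
	ttl     time.Duration
	now     func() time.Time
	entries map[string]entry
}

// Get applies the TTL at read time: a stale entry is reported as a miss
// instead of being swept by a background job.
func (c *pageCache) Get(url string) (string, bool) {
	e, ok := c.entries[url]
	if !ok || c.now().Sub(e.storedAt) > c.ttl {
		return "", false
	}
	return e.markdown, true
}

// Set stores the extracted Markdown with the current timestamp.
func (c *pageCache) Set(url, markdown string) {
	c.entries[url] = entry{markdown: markdown, storedAt: c.now()}
}

// getPage returns cached Markdown on a hit; on a miss it fetches,
// extracts, stores, and then returns the result.
func getPage(c *pageCache, url string, fetchAndExtract func(string) (string, error)) (string, error) {
	if md, ok := c.Get(url); ok {
		return md, nil
	}
	md, err := fetchAndExtract(url)
	if err != nil {
		return "", err
	}
	c.Set(url, md)
	return md, nil
}

// countFetches requests the same page three times (miss, hit, expired)
// and reports how often the network stand-in was called.
func countFetches() int {
	clock := time.Unix(0, 0)
	c := &pageCache{
		ttl:     time.Hour,
		now:     func() time.Time { return clock },
		entries: map[string]entry{},
	}
	fetches := 0
	fetch := func(string) (string, error) { fetches++; return "# Title", nil }

	getPage(c, "https://example.com/docs", fetch) // miss: fetches
	getPage(c, "https://example.com/docs", fetch) // hit: served from cache
	clock = clock.Add(2 * time.Hour)
	getPage(c, "https://example.com/docs", fetch) // expired: fetches again
	return fetches
}

func main() {
	fmt.Println(countFetches())
}
```

Checking expiry inside `Get` means no background sweeper is needed; the cost is that stale rows linger until they are next read or overwritten.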
Discovery, in order
The crawler tries three strategies and stops at the first that yields a non-empty page list:
- llms.txt probe. Fetches /llms.txt at the host root. If present, every URL it lists becomes a manifest entry - no BFS at all.
- Sitemap parse. Walks every URL declared in /sitemap.xml (and any referenced sitemap indexes), filtered by the host.
- Bounded BFS. Starts at the homepage, follows in-host links breadth-first up to --max-pages (default 1500), respects --exclude filters, bails on JS-heavy pages.
All three strategies feed the same downstream pipeline. The crawler
records which strategy fired in the manifest so pinax doctor can warn
when a BFS-derived manifest has thin pages.
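The first-non-empty-wins ordering can be sketched as a loop over strategies. The `strategy` type and `discover` function are hypothetical, and treating a strategy error as "try the next one" is an assumption of this sketch rather than documented behaviour.

```go
package main

import "fmt"

// strategy is a hypothetical shape for one discovery method.
type strategy struct {
	name string
	find func(baseURL string) ([]string, error)
}

// discover runs the strategies in order and returns the first non-empty
// page list together with the name of the strategy that produced it.
func discover(baseURL string, strategies []strategy) ([]string, string, error) {
	for _, s := range strategies {
		pages, err := s.find(baseURL)
		if err != nil || len(pages) == 0 {
			continue // fall through to the next strategy
		}
		return pages, s.name, nil
	}
	return nil, "", fmt.Errorf("no discovery strategy yielded pages for %s", baseURL)
}

func main() {
	// Stubs stand in for the real llms.txt probe, sitemap parser, and BFS.
	strategies := []strategy{
		{"llms.txt", func(string) ([]string, error) { return nil, nil }},
		{"sitemap", func(base string) ([]string, error) {
			return []string{base + "/docs/install", base + "/docs/cli"}, nil
		}},
		{"bfs", func(base string) ([]string, error) { return []string{base + "/"}, nil }},
	}
	pages, used, err := discover("https://example.com", strategies)
	fmt.Println(len(pages), used, err)
}
```

Returning the strategy name alongside the pages is what lets the caller record it in the manifest.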
Extraction
The extractor turns whatever Pinax fetched into clean Markdown. It runs
in two places: inside preflight during sampling, and inside get_page
on every live fetch.
- text/markdown responses pass through with a light cleanup.
- HTML is parsed, navigation/footer/cookie chrome is stripped, then converted to Markdown.
Page titles are not produced by the extractor - they come from the
crawler, which reads the HTML <title> element and falls back to a
URL-slug title when none is present (e.g. sitemap-only sites). Sections
are derived from URL path segments, not headings.
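One way to derive a section path and a slug-based fallback title from a URL is sketched below. The exact rules (which extensions are dropped, how separators are treated, what the root page is called) are assumptions, and `sectionOf` and `slugTitle` are hypothetical names.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
	"unicode"
)

// segments splits the URL path into its non-empty parts and returns the host.
func segments(rawURL string) ([]string, string, error) {
	u, err := url.Parse(rawURL)
	if err != nil {
		return nil, "", err
	}
	trimmed := strings.Trim(u.Path, "/")
	if trimmed == "" {
		return nil, u.Host, nil
	}
	return strings.Split(trimmed, "/"), u.Host, nil
}

// sectionOf returns every path segment above the page itself,
// or "" for top-level pages.
func sectionOf(rawURL string) (string, error) {
	segs, _, err := segments(rawURL)
	if err != nil || len(segs) < 2 {
		return "", err
	}
	return strings.Join(segs[:len(segs)-1], "/"), nil
}

// slugTitle builds a readable title from the last path segment,
// falling back to the host when the path is empty.
func slugTitle(rawURL string) (string, error) {
	segs, host, err := segments(rawURL)
	if err != nil {
		return "", err
	}
	if len(segs) == 0 {
		return host, nil
	}
	slug := segs[len(segs)-1]
	for _, ext := range []string{".html", ".md"} {
		slug = strings.TrimSuffix(slug, ext)
	}
	slug = strings.NewReplacer("-", " ", "_", " ").Replace(slug)
	r := []rune(slug)
	if len(r) == 0 {
		return host, nil
	}
	r[0] = unicode.ToUpper(r[0])
	return string(r), nil
}

func main() {
	page := "https://example.com/docs/guides/getting-started.html"
	section, _ := sectionOf(page)
	title, _ := slugTitle(page)
	fmt.Printf("%q %q\n", section, title)
}
```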
Manifest
~/.pinax/servers/<name>/manifest.json is the source of truth. It
contains:
- Base URL, display name, discovery strategy used, last-refresh timestamp.
- One entry per page: URL, title, and section path.
- The tag set (when added through the catalog).
It’s written atomically - Pinax writes to a temp file in the same
directory then renames into place, so a half-completed pinax add never
corrupts the on-disk state.
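The write-then-rename pattern looks roughly like this in Go. It is a sketch of the general technique, not Pinax's source: the `manifest` struct fields, JSON keys, and `writeAtomic` name are hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
	"path/filepath"
)

// manifest is a cut-down, hypothetical shape for illustration.
type manifest struct {
	BaseURL  string   `json:"base_url"`
	Strategy string   `json:"strategy"`
	Pages    []string `json:"pages"`
}

// writeAtomic writes v as JSON to a temp file next to path, then renames
// it into place so readers never observe a partially written file.
func writeAtomic(path string, v any) error {
	data, err := json.MarshalIndent(v, "", "  ")
	if err != nil {
		return err
	}
	// Same directory as the target, so the rename stays on one filesystem.
	tmp, err := os.CreateTemp(filepath.Dir(path), ".manifest-*.tmp")
	if err != nil {
		return err
	}
	// Cleans up on failure; fails harmlessly once the rename has succeeded.
	defer os.Remove(tmp.Name())

	if _, err := tmp.Write(data); err != nil {
		tmp.Close()
		return err
	}
	if err := tmp.Sync(); err != nil {
		tmp.Close()
		return err
	}
	if err := tmp.Close(); err != nil {
		return err
	}
	return os.Rename(tmp.Name(), path)
}

// roundTrip writes a manifest into a throwaway directory and reads it back.
func roundTrip() (string, error) {
	dir, err := os.MkdirTemp("", "pinax-manifest-")
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(dir)

	path := filepath.Join(dir, "manifest.json")
	in := manifest{
		BaseURL:  "https://example.com",
		Strategy: "sitemap",
		Pages:    []string{"https://example.com/docs"},
	}
	if err := writeAtomic(path, in); err != nil {
		return "", err
	}
	data, err := os.ReadFile(path)
	if err != nil {
		return "", err
	}
	var out manifest
	if err := json.Unmarshal(data, &out); err != nil {
		return "", err
	}
	return out.Strategy, nil
}

func main() {
	strategy, err := roundTrip()
	fmt.Println(strategy, err)
}
```

Creating the temp file in the target's own directory matters: a rename across filesystems is not atomic and can fail outright.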
Search
A small BM25 index over URL paths, titles, and section names is built
alongside each manifest. Page bodies are not indexed: a search hit
returns a URL, which get_page then fetches live. This is by design:
it keeps the on-disk footprint tiny and the live-fetch policy honest.
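For readers unfamiliar with BM25, here is a compact sketch of the scoring over metadata-only documents. It uses the textbook formula with the conventional parameters k1 = 1.2 and b = 0.75; whether Pinax uses these values, this tokenizer, or this IDF variant is not stated in the source, and `bm25` and `tokenize` are hypothetical names.

```go
package main

import (
	"fmt"
	"math"
	"strings"
	"unicode"
)

// Conventional BM25 parameters; assumed, not confirmed Pinax settings.
const (
	k1 = 1.2
	b  = 0.75
)

// tokenize lowercases s and splits on anything that is not a letter or digit.
func tokenize(s string) []string {
	return strings.FieldsFunc(strings.ToLower(s), func(r rune) bool {
		return !unicode.IsLetter(r) && !unicode.IsDigit(r)
	})
}

// bm25 scores each document (URL path + title + section text) against the query.
func bm25(docs []string, query string) []float64 {
	tokens := make([][]string, len(docs))
	df := map[string]int{} // number of documents containing each term
	total := 0
	for i, d := range docs {
		tokens[i] = tokenize(d)
		total += len(tokens[i])
		seen := map[string]bool{}
		for _, t := range tokens[i] {
			if !seen[t] {
				seen[t] = true
				df[t]++
			}
		}
	}
	scores := make([]float64, len(docs))
	if total == 0 {
		return scores
	}
	n := float64(len(docs))
	avgLen := float64(total) / n
	terms := tokenize(query)
	for i, toks := range tokens {
		tf := map[string]int{}
		for _, t := range toks {
			tf[t]++
		}
		for _, q := range terms {
			f := float64(tf[q])
			if f == 0 {
				continue
			}
			idf := math.Log(1 + (n-float64(df[q])+0.5)/(float64(df[q])+0.5))
			norm := f + k1*(1-b+b*float64(len(toks))/avgLen)
			scores[i] += idf * f * (k1 + 1) / norm
		}
	}
	return scores
}

func main() {
	docs := []string{
		"docs/guides/install Installing Pinax",
		"docs/reference/cli CLI reference",
		"blog/release-notes Release notes",
	}
	fmt.Println(bm25(docs, "cli reference"))
}
```

Because only paths, titles, and section names are scored, each "document" is a handful of tokens, which is why the index stays small.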
Serve
The unified server lets one MCP configuration cover every docs site
you’ve added; the optional docs argument scopes calls when an agent
wants to narrow. The transport layer comes from mark3labs/mcp-go: stdio is the
default, and Streamable HTTP and SSE are available with --http.
Package map
cmd/pinax/ CLI entry point and argument parser
internal/buildinfo/ Version + User-Agent shared across CLI and server
internal/crawler/ llms.txt probe, sitemap parser, BFS, platform detection
internal/extractor/ HTML/Markdown → clean Markdown
internal/manifest/ Atomic JSON manifests + BM25 indexes
internal/cache/ SQLite page cache (WAL, TTL applied at read time)
internal/logger/ SQLite tool-call log + HTML log viewer
internal/mcp/ MCP server, transport, tools, middleware
internal/preflight/ Content-density check that gates `pinax add`
internal/doctor/ Health diagnosis used by `pinax doctor`
internal/catalog/ Built-in catalog + cache loader
Each package has its own README in the source tree.