Architecture

How a URL becomes a manifest, and how a manifest becomes five MCP tools.

Pinax is a single Go binary structured as a handful of focused packages. Below is the pipeline a URL takes from pinax add to pinax serve, followed by the on-disk layout.

The pipeline

Pinax keeps the two halves of the pipeline loosely coupled: the on-disk manifest is the only contract between them, so each can change independently.

Add: URL → manifest

            ┌──────────────┐    ┌──────────────┐
URL ──────▶ │   crawler    │──▶ │  preflight   │──▶ manifest.json
            └──────────────┘    └──────────────┘
              llms.txt →          density gate
              sitemap →           (samples, extracts,
              bounded BFS          refuses if thin)

pinax add runs discovery, then a gate. The crawler tries llms.txt, then sitemap.xml, then bounded BFS, and stops at the first strategy that yields pages. The preflight gate then samples that page set, extracts each sample to measure prose density, and refuses to write the manifest when the site is mostly chrome.
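The gate can be pictured as a simple threshold check. The sketch below is illustrative only: the sample type, the density measure (extracted prose bytes over raw response bytes) and the 0.10 threshold are all assumptions, not what internal/preflight actually computes.

```go
package main

import (
	"errors"
	"fmt"
)

// sample pairs the size of a raw response with the size of the prose the
// extractor kept. Both fields are hypothetical.
type sample struct {
	rawBytes   int
	proseBytes int
}

var errThinSite = errors.New("site is mostly chrome; refusing to write manifest")

// gate returns errThinSite when the mean prose density of the samples
// falls below minDensity, or when there are no samples to judge.
func gate(samples []sample, minDensity float64) error {
	if len(samples) == 0 {
		return errThinSite
	}
	var total float64
	for _, s := range samples {
		if s.rawBytes > 0 {
			total += float64(s.proseBytes) / float64(s.rawBytes)
		}
	}
	if total/float64(len(samples)) < minDensity {
		return errThinSite
	}
	return nil
}

func main() {
	docs := []sample{{10000, 4000}, {8000, 2400}} // prose-heavy pages
	shell := []sample{{50000, 300}, {60000, 200}} // mostly navigation and scripts
	fmt.Println("docs site:", gate(docs, 0.10))
	fmt.Println("shell site:", gate(shell, 0.10))
}
```

Returning an error rather than a boolean lets the caller skip the manifest write and surface the reason in one step.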

Serve: manifest → MCP tools

manifest.json ──▶ BM25 index ──▶ mcp server ──▶ stdio / HTTP transport
                                  list_docs
                                  list_sections
                                  search_pages
                                  get_section_pages
                                  get_page

pinax serve mounts every manifest under ~/.pinax/servers/ into a single MCP server. Four of the five tools serve from in-memory state; only get_page reaches the network.

Tool call: `get_page` lifecycle

       agent
         │
         ▼
     get_page(url)
         │
         ▼
    ┌─────────┐    hit
    │  cache  │ ─────────────▶ clean Markdown
    └─────────┘
         │ miss
         ▼
     HTTP fetch
         │
         ▼
      extractor
   (HTML/MD →
    clean Markdown)
         │
         ▼
      cache.Set
         │
         ▼
    clean Markdown

Cached pages return immediately. On a miss, Pinax fetches the URL with the configured User-Agent, runs the extractor to strip nav/footer chrome and normalise HTML to Markdown, and stores the result in the SQLite cache (WAL mode, TTL applied at read time) before returning it.
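The lifecycle above is a cache-aside read. As a minimal sketch, assuming an in-memory map in place of the SQLite cache and injected fetch and extract functions (pageCache, newPageCache and getPage are hypothetical names, not Pinax's API):

```go
package main

import (
	"fmt"
	"time"
)

type entry struct {
	markdown string
	storedAt time.Time
}

// pageCache is an in-memory stand-in for the SQLite page cache.
type pageCache struct {
	ttl     time.Duration
	entries map[string]entry
}

func newPageCache(ttl time.Duration) *pageCache {
	return &pageCache{ttl: ttl, entries: map[string]entry{}}
}

// get applies the TTL at read time: an expired entry counts as a miss.
func (c *pageCache) get(url string) (string, bool) {
	e, ok := c.entries[url]
	if !ok || time.Since(e.storedAt) > c.ttl {
		return "", false
	}
	return e.markdown, true
}

func (c *pageCache) set(url, md string) {
	c.entries[url] = entry{markdown: md, storedAt: time.Now()}
}

// getPage returns cached Markdown on a hit; on a miss it fetches,
// extracts, stores the result, and then returns it.
func getPage(c *pageCache, url string,
	fetch func(string) (string, error), extract func(string) string) (string, error) {
	if md, ok := c.get(url); ok {
		return md, nil
	}
	raw, err := fetch(url)
	if err != nil {
		return "", err
	}
	md := extract(raw)
	c.set(url, md)
	return md, nil
}

func main() {
	fetches := 0
	fetch := func(string) (string, error) { fetches++; return "<nav>menu</nav><p>Hello</p>", nil }
	extract := func(string) string { return "Hello" } // stub extractor
	c := newPageCache(time.Hour)
	getPage(c, "https://example.com/docs", fetch, extract)
	getPage(c, "https://example.com/docs", fetch, extract)
	fmt.Println("network fetches:", fetches)
}
```

The second call is served from the cache, so the stub fetcher runs once. Checking the TTL on read means expired rows need no background sweeper; they are simply treated as misses and overwritten.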

Discovery, in order

The crawler tries three strategies and stops at the first that yields a non-empty page list:

llms.txt probe. Fetches /llms.txt at the host root. If present, every URL it lists becomes a manifest entry - no BFS at all.
Sitemap parse. Walks every URL declared in /sitemap.xml, following any sitemap index entries to the sitemaps they reference, and keeps only URLs on the same host.
Bounded BFS. Starts at the homepage, follows in-host links breadth-first up to --max-pages (default 1500), respects --exclude filters, and bails on JS-heavy pages.

All three strategies feed the same downstream pipeline. The crawler records which strategy fired in the manifest so pinax doctor can warn when a BFS-derived manifest has thin pages.
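The fallback order can be sketched as a loop over strategies. The strategy type and discover function below are hypothetical, and treating a strategy error as "try the next one" is an assumption rather than documented behaviour:

```go
package main

import (
	"errors"
	"fmt"
)

// strategy is a named discovery function that returns candidate page URLs.
type strategy struct {
	name     string
	discover func(baseURL string) ([]string, error)
}

// discover tries each strategy in order and returns the name and pages of
// the first one that yields a non-empty list.
func discover(baseURL string, strategies []strategy) (string, []string, error) {
	for _, s := range strategies {
		pages, err := s.discover(baseURL)
		if err != nil || len(pages) == 0 {
			continue // fall through to the next strategy
		}
		return s.name, pages, nil
	}
	return "", nil, errors.New("no discovery strategy yielded pages")
}

func main() {
	strategies := []strategy{
		{"llms.txt", func(string) ([]string, error) { return nil, errors.New("not found") }},
		{"sitemap", func(base string) ([]string, error) { return []string{base + "/docs/install"}, nil }},
		{"bfs", func(base string) ([]string, error) { return []string{base + "/"}, nil }},
	}
	name, pages, err := discover("https://example.com", strategies)
	fmt.Println(name, pages, err)
}
```

Returning the winning strategy's name alongside the pages is what makes it possible to record the strategy in the manifest.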

Extraction

The extractor turns whatever Pinax fetched into clean Markdown. It runs in two places: inside preflight during sampling, and inside get_page on every live fetch.

text/markdown responses pass through with a light cleanup.
HTML is parsed, navigation/footer/cookie chrome is stripped, then converted to Markdown.

Page titles are not produced by the extractor - they come from the crawler, which reads the HTML <title> element and falls back to a URL-slug title when none is present (e.g. sitemap-only sites). Sections are derived from URL path segments, not headings.
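To make the path-based derivation concrete, here is one way the section and the slug fallback title could be computed from a URL. The exact rules (drop the last segment for the section, title-case the slug, use the host for the root page) are assumptions for illustration; sectionOf and slugTitle are hypothetical names.

```go
package main

import (
	"fmt"
	"net/url"
	"path"
	"strings"
	"unicode"
)

// sectionOf returns every path segment except the last, joined with "/".
func sectionOf(rawURL string) (string, error) {
	u, err := url.Parse(rawURL)
	if err != nil {
		return "", err
	}
	segs := strings.Split(strings.Trim(u.Path, "/"), "/")
	if len(segs) <= 1 {
		return "", nil // top-level page: no section
	}
	return strings.Join(segs[:len(segs)-1], "/"), nil
}

// slugTitle builds a fallback title from the last path segment.
func slugTitle(rawURL string) (string, error) {
	u, err := url.Parse(rawURL)
	if err != nil {
		return "", err
	}
	trimmed := strings.Trim(u.Path, "/")
	if trimmed == "" {
		return u.Host, nil // root page: fall back to the host name
	}
	slug := path.Base(trimmed)
	slug = strings.TrimSuffix(slug, path.Ext(slug)) // drop ".html" and similar
	words := strings.FieldsFunc(slug, func(r rune) bool { return r == '-' || r == '_' })
	for i, w := range words {
		r := []rune(w)
		r[0] = unicode.ToUpper(r[0])
		words[i] = string(r)
	}
	return strings.Join(words, " "), nil
}

func main() {
	for _, u := range []string{
		"https://example.com/docs/guides/getting-started.html",
		"https://example.com/",
	} {
		section, _ := sectionOf(u)
		title, _ := slugTitle(u)
		fmt.Printf("%s\n  section=%q title=%q\n", u, section, title)
	}
}
```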

Manifest

~/.pinax/servers/<name>/manifest.json is the source of truth. It contains:

Base URL, display name, discovery strategy used, last-refresh timestamp.
One entry per page: URL, title, and section path.
The tag set (when added through the catalog).

It’s written atomically - Pinax writes to a temp file in the same directory then renames into place, so a half-completed pinax add never corrupts the on-disk state.
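The temp-file-then-rename pattern looks roughly like this in Go. The manifest and page structs and their JSON field names are hypothetical stand-ins for the fields listed above, and roundTrip exists only to exercise the sketch:

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
	"path/filepath"
)

type page struct {
	URL     string `json:"url"`
	Title   string `json:"title"`
	Section string `json:"section"`
}

type manifest struct {
	BaseURL  string `json:"base_url"`
	Name     string `json:"name"`
	Strategy string `json:"strategy"`
	Pages    []page `json:"pages"`
}

// writeAtomic writes m to a temp file next to path, then renames it into
// place, so readers see either the old file or the complete new one.
func writeAtomic(path string, m manifest) error {
	data, err := json.MarshalIndent(m, "", "  ")
	if err != nil {
		return err
	}
	tmp, err := os.CreateTemp(filepath.Dir(path), ".manifest-*.tmp")
	if err != nil {
		return err
	}
	defer os.Remove(tmp.Name()) // fails harmlessly once the rename has succeeded
	if _, err := tmp.Write(data); err != nil {
		tmp.Close()
		return err
	}
	if err := tmp.Sync(); err != nil { // flush to disk before the rename
		tmp.Close()
		return err
	}
	if err := tmp.Close(); err != nil {
		return err
	}
	return os.Rename(tmp.Name(), path)
}

// roundTrip writes m into a scratch directory and reads it back.
func roundTrip(m manifest) (manifest, error) {
	dir, err := os.MkdirTemp("", "pinax-sketch-")
	if err != nil {
		return manifest{}, err
	}
	defer os.RemoveAll(dir)
	path := filepath.Join(dir, "manifest.json")
	if err := writeAtomic(path, m); err != nil {
		return manifest{}, err
	}
	data, err := os.ReadFile(path)
	if err != nil {
		return manifest{}, err
	}
	var got manifest
	err = json.Unmarshal(data, &got)
	return got, err
}

func main() {
	m := manifest{BaseURL: "https://example.com", Name: "example", Strategy: "sitemap",
		Pages: []page{{URL: "https://example.com/docs/install", Title: "Install", Section: "docs"}}}
	got, err := roundTrip(m)
	fmt.Println(got.Name, len(got.Pages), err)
}
```

Creating the temp file in the same directory matters: a rename is only atomic within a single filesystem, which a shared parent directory guarantees.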

Search

A small BM25 index over URL paths, titles, and section names is built alongside each manifest. Page bodies are not indexed: search hits return URLs, which get_page then fetches live. This is by design, because it keeps the on-disk footprint tiny and the live-fetch policy honest.
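For readers unfamiliar with BM25, the sketch below scores a query against metadata-only documents using the textbook formula. The tokenizer, the k1 = 1.2 and b = 0.75 parameters (conventional defaults), and the doc type are assumptions; Pinax's own index may tokenize and weight fields differently.

```go
package main

import (
	"fmt"
	"math"
	"strings"
)

// doc holds the tokens of one page's metadata; page bodies are never included.
type doc struct {
	url    string
	tokens []string
}

// tokenize lowercases s and splits on anything that is not a-z or 0-9.
func tokenize(s string) []string {
	return strings.FieldsFunc(strings.ToLower(s), func(r rune) bool {
		return !(r >= 'a' && r <= 'z') && !(r >= '0' && r <= '9')
	})
}

func newDoc(urlPath, title, section string) doc {
	return doc{url: urlPath, tokens: tokenize(urlPath + " " + title + " " + section)}
}

// bm25 returns one score per document for the given query.
func bm25(query string, docs []doc) []float64 {
	const k1, b = 1.2, 0.75
	total := 0
	df := map[string]int{} // number of documents containing each term
	for _, d := range docs {
		total += len(d.tokens)
		seen := map[string]bool{}
		for _, t := range d.tokens {
			if !seen[t] {
				seen[t] = true
				df[t]++
			}
		}
	}
	avgLen := float64(total) / float64(len(docs))
	scores := make([]float64, len(docs))
	for i, d := range docs {
		tf := map[string]int{}
		for _, t := range d.tokens {
			tf[t]++
		}
		for _, q := range tokenize(query) {
			f := float64(tf[q])
			if f == 0 {
				continue
			}
			n := float64(df[q])
			idf := math.Log(1 + (float64(len(docs))-n+0.5)/(n+0.5))
			norm := 1 - b + b*float64(len(d.tokens))/avgLen
			scores[i] += idf * f * (k1 + 1) / (f + k1*norm)
		}
	}
	return scores
}

func main() {
	docs := []doc{
		newDoc("/guides/install", "Install guide", "guides"),
		newDoc("/reference/cli", "CLI reference", "reference"),
	}
	for i, s := range bm25("install", docs) {
		fmt.Printf("%-18s %.3f\n", docs[i].url, s)
	}
}
```

Because only paths, titles and section names are tokenized, the index stays small no matter how long the pages themselves are.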

Serve

The unified server lets one MCP configuration cover every docs site you’ve added; the optional docs argument scopes a call when an agent wants to narrow its scope. Transport is provided by mark3labs/mcp-go: stdio is the default, and Streamable HTTP and SSE are available with --http.

Package map

cmd/pinax/            CLI entry point and argument parser
internal/buildinfo/   Version + User-Agent shared across CLI and server
internal/crawler/     llms.txt probe, sitemap parser, BFS, platform detection
internal/extractor/   HTML/Markdown → clean Markdown
internal/manifest/    Atomic JSON manifests + BM25 indexes
internal/cache/       SQLite page cache (WAL, TTL applied at read time)
internal/logger/      SQLite tool-call log + HTML log viewer
internal/mcp/         MCP server, transport, tools, middleware
internal/preflight/   Content-density check that gates `pinax add`
internal/doctor/      Health diagnosis used by `pinax doctor`
internal/catalog/     Built-in catalog + cache loader

Each package has its own README in the source tree.
