Agent-First Website Design

Your website has two audiences now. One of them doesn't have eyes.

March 2026

Every website built before 2024 was designed for one reader: a human with a browser. That assumption is now wrong. AI agents — coding assistants, research tools, search augmenters — are hitting your pages, and they're getting HTML soup when they need structured text.

This isn't a future problem. GPTBot went from 5% to 30% of crawler traffic between May 2024 and May 2025. Over 560,000 sites now have AI-specific entries in their robots.txt. The agents are already here. The question is whether your site is legible to them.

The Standards Landscape

Six competing standards have emerged in 18 months. None has won. Here's the field:

| Standard | What it does | Adoption |
|---|---|---|
| llms.txt | Markdown site index at root, proposed by Jeremy Howard (fast.ai) | ~10K-844K sites (contested) |
| AGENTS.md | Project guidance for coding agents, donated to Linux Foundation | 60,000+ repos |
| Agent Web Protocol | JSON at /.well-known/agent.json declaring site capabilities | Early |
| Content negotiation | Accept: text/markdown → server returns markdown | Cloudflare + Vercel |
| IETF aipref | Formal extension to robots.txt for AI usage preferences | Draft RFC |
| Content Signals | Cloudflare's ai-train/ai-input/search in robots.txt | Cloudflare sites |
The uncomfortable truth

No LLM provider has confirmed that their crawlers actually read llms.txt. Cloudflare's data showed zero visits from GPTBot, ClaudeBot, or PerplexityBot to llms.txt pages from August to October 2025. The supply side is building for demand that hasn't materialized in a standardized way.

What Actually Works (March 2026)

If you strip away the standards politics, three things demonstrably help agents consume your content:

1. Content Negotiation

The most technically mature approach. An agent sends Accept: text/markdown, your server returns clean markdown instead of HTML. Same URL, same content, different format.

Cloudflare shipped this at the edge in February 2026 — HTML-to-markdown conversion with zero origin changes. Token reduction: ~80% (16K tokens in HTML down to 3K in markdown). Vercel built it as Next.js middleware with 99.6% payload reduction.

If you're on Cloudflare, flip a switch. If you're on static hosting (GitHub Pages, Netlify), you can't do server-side negotiation. The fallback: serve companion .md files alongside your HTML and link them via <link rel="alternate" type="text/markdown">.
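Server-side, the negotiation logic is a few lines. A minimal sketch (the function name and file layout are illustrative, not from any framework), assuming each HTML page has a pre-generated .md sibling as in the fallback above:

```python
from pathlib import Path

def negotiate(accept_header: str, html_path: str) -> tuple[str, str]:
    """Return (content_type, body) for a request to an HTML page.

    Hypothetical helper: assumes a markdown companion sits next to the
    HTML file, e.g. p/topic.html alongside p/topic.md.
    """
    md_path = Path(html_path).with_suffix(".md")
    if "text/markdown" in accept_header and md_path.exists():
        # The agent asked for markdown and a companion exists: serve it.
        return "text/markdown", md_path.read_text()
    # Default: humans (and agents that didn't ask) get the HTML.
    return "text/html", Path(html_path).read_text()
```

A production version would parse Accept q-values properly rather than substring-matching, but the shape is the same: one URL, two representations.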

2. Structured Discovery Files

Put these at your root:

/robots.txt       — Who can crawl, AI-specific directives
/llms.txt         — Curated site summary for LLMs (Markdown)
/sitemap.xml      — Page discovery with dates
/index.json       — Structured catalog (your schema)

Even if no crawler reads llms.txt today, it's cheap insurance. The file takes 30 minutes to write and it establishes a machine-readable entry point. When agents do start reading it — and they will, because the alternative is scraping — yours will be there.
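For reference, the llms.txt proposal is just markdown with a fixed shape: an H1 title, a blockquote summary, then H2 sections of annotated links. A hypothetical example (site name and URLs invented):

```markdown
# Example Site

> Notes on building websites that both humans and AI agents can read.

## Pages

- [Agent-First Website Design](https://example.com/p/agent-first.md): Standards
  landscape and a dual-format build pattern

## Optional

- [Full content dump](https://example.com/llms-full.txt): All pages concatenated
```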

3. Self-Describing Pages

Every page should carry enough metadata that an agent can understand it without external context:

An agent hitting any page on your site should be able to determine what it is, who wrote it, when, and what topics it covers — without fetching a separate index.
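In practice that usually means a JSON-LD block in the page head. A sketch using schema.org/Article (all values are placeholders):

```html
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "Article",
  "headline": "Agent-First Website Design",
  "description": "How to make a website legible to AI agents.",
  "datePublished": "2026-03-01",
  "author": { "@type": "Person", "name": "Author Name" },
  "keywords": ["ai-agents", "llms.txt", "content-negotiation"]
}
</script>
```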

The Dual-Format Pattern

The architecture I landed on for this site:

index.json          ← single source of truth (atom catalog)
index.html          ← fetches index.json, renders for humans
p/topic.html        ← rich interactive page (hand-authored)
p/topic.md          ← companion markdown (auto-generated by forge.py)
llms.txt            ← curated index for LLMs
llms-full.txt       ← concatenated content from all .md files
sitemap.xml         ← rebuilt from index.json
robots.txt          ← AI crawler permissions

The HTML pages are the source of truth — rich, styled, interactive. The .md files are generated derivatives. A build script (forge.py, ~300 lines, Python stdlib only) extracts content from HTML and generates everything else. A pre-commit hook runs the build + 13 BDD validation scenarios before every commit.

The key insight: your HTML is the product. The agent layer wraps it, never rewrites it. Don't adopt a static site generator to retrofit markdown-source onto working pages. Don't maintain two copies by hand. Generate the agent-readable format from the human-readable format, automatically.
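As a sketch of what that derivation step can look like (this is not the actual forge.py, just a stdlib-only illustration of the idea), a small HTMLParser subclass can pull the `<article>` content out of a page and emit simple markdown:

```python
from html.parser import HTMLParser

class ArticleToMarkdown(HTMLParser):
    """Extract the <article> element of a page as simple markdown.

    Illustrative only: handles headings and paragraphs, ignores inline
    markup (links, emphasis) and everything outside <article>.
    """
    HEADINGS = {"h1": "# ", "h2": "## ", "h3": "### "}

    def __init__(self):
        super().__init__()
        self.in_article = False
        self.prefix = ""
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag == "article":
            self.in_article = True
        elif self.in_article and tag in self.HEADINGS:
            self.prefix = self.HEADINGS[tag]
        elif self.in_article and tag == "p":
            self.prefix = ""

    def handle_endtag(self, tag):
        if tag == "article":
            self.in_article = False
        elif self.in_article and tag in (*self.HEADINGS, "p"):
            self.out.append("")  # blank line between blocks

    def handle_data(self, data):
        if self.in_article and data.strip():
            self.out.append(self.prefix + data.strip())
            self.prefix = ""

    def markdown(self) -> str:
        return "\n".join(self.out).strip() + "\n"
```

Feeding it `"<article><h1>Title</h1><p>Hello world.</p></article>"` yields `"# Title\n\nHello world.\n"`. The real pipeline would also pull metadata from the JSON-LD block to build the .md front matter.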

The Robots.txt Problem

AI crawlers are fragmenting. Both OpenAI and Anthropic now operate three bots each:

| Provider | Training | Search | User-initiated |
|---|---|---|---|
| OpenAI | GPTBot | OAI-SearchBot | ChatGPT-User |
| Anthropic | ClaudeBot | Claude-SearchBot | Claude-User |

This lets you block training while allowing search and agentic use. But it also means your robots.txt is getting complex, and compliance is unverifiable — Perplexity was caught using undeclared crawlers with generic user-agent strings to bypass blocking directives.
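Using the bot names from the table above, a robots.txt that refuses training crawls but permits search indexing and user-initiated fetches might look like this (a sketch; adjust to your own comfort level):

```text
# Block training crawlers
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

# Allow search indexers
User-agent: OAI-SearchBot
Allow: /

User-agent: Claude-SearchBot
Allow: /

# Allow user-initiated agent fetches
User-agent: ChatGPT-User
Allow: /

User-agent: Claude-User
Allow: /
```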

The IETF's aipref working group (chartered January 2025) is building a formal extension to the Robots Exclusion Protocol with three usage categories: ai-train, ai-input (RAG/agentic), and search. Cloudflare's Content Signals is the draft implementation. Neither is enforceable — they're preference declarations, not access control.
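Content Signals collapses the same three categories into a single declarative line in robots.txt. As I read the draft, the syntax is roughly:

```text
# Preference declaration, not access control: crawlers may ignore it.
Content-Signal: search=yes, ai-input=yes, ai-train=no

User-agent: *
Allow: /
```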

What's Coming

Predictions, ordered by confidence:

  1. Content negotiation wins. Accept: text/markdown uses existing HTTP infrastructure, requires no new file formats, and Cloudflare's edge conversion removes the adoption barrier. This will be the default way agents read the web.
  2. IETF aipref becomes the new robots.txt. The three-signal model (train/input/search) will be an RFC within 18 months. Cloudflare's Content Signals is the implementation.
  3. llms.txt becomes niche. Useful for documentation sites, but eclipsed by content negotiation which works on every page.
  4. The trust problem gets worse. Without cryptographic agent identity, publishers will block aggressively, pushing agents toward browser automation — the opposite of the cooperative vision.

Implementation Checklist

If you're building or maintaining a website today, do these in order:

  1. Wrap content in <article> tags — one-line edit per page, enables reader-mode extraction
  2. Add JSON-LD structured data — schema.org/Article with title, description, date, author
  3. Write a robots.txt with AI crawler directives — allow what you're comfortable with
  4. Write a llms.txt — curated markdown summary of your site's content
  5. Generate .md companions for your HTML pages — link via <link rel="alternate">
  6. If on Cloudflare: enable Markdown for Agents (one toggle)
  7. If self-hosting: build a script to generate agent-readable formats from your HTML

Total effort for a 20-page static site: one afternoon. This site is the reference implementation: it ships llms.txt and index.json, and every page has a .md companion.
