Agent-First Website Design
Your website has two audiences now. One of them doesn't have eyes.
March 2026
Every website built before 2024 was designed for one reader: a human with a browser. That assumption is now wrong. AI agents — coding assistants, research tools, search augmenters — are hitting your pages, and they're getting HTML soup when they need structured text.
This isn't a future problem. GPTBot went from 5% to 30% of crawler traffic between May 2024 and May 2025. Over 560,000 sites now have AI-specific entries in their robots.txt. The agents are already here. The question is whether your site is legible to them.
The Standards Landscape
Six competing standards have emerged in 18 months. None has won. Here's the field:
| Standard | What it does | Adoption |
|---|---|---|
| llms.txt | Markdown site index at root, proposed by Jeremy Howard (fast.ai) | ~10K-844K sites (contested) |
| AGENTS.md | Project guidance for coding agents, donated to Linux Foundation | 60,000+ repos |
| Agent Web Protocol | JSON at /.well-known/agent.json declaring site capabilities | Early |
| Content negotiation | Accept: text/markdown → server returns markdown | Cloudflare + Vercel |
| IETF aipref | Formal extension to robots.txt for AI usage preferences | Draft RFC |
| Content Signals | Cloudflare's ai-train/ai-input/search in robots.txt | Cloudflare sites |
The catch: llms.txt has supply but no demand. Cloudflare's data showed zero visits from GPTBot, ClaudeBot, or PerplexityBot to llms.txt pages from August to October 2025. The supply side is building for demand that hasn't materialized in a standardized way.
What Actually Works (March 2026)
If you strip away the standards politics, three things demonstrably help agents consume your content:
1. Content Negotiation
The most technically mature approach. An agent sends Accept: text/markdown, your server returns clean markdown instead of HTML. Same URL, same content, different format.
Cloudflare shipped this at the edge in February 2026 — HTML-to-markdown conversion with zero origin changes. Token reduction: ~80% (16K tokens in HTML down to 3K in markdown). Vercel built it as Next.js middleware with 99.6% payload reduction.
If you're on Cloudflare, flip a switch. If you're on static hosting (GitHub Pages, Netlify), you can't do server-side negotiation. The fallback: serve companion .md files alongside your HTML and link them via <link rel="alternate" type="text/markdown">.
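If you're self-hosting, the negotiation logic is small. A minimal sketch using only the Python stdlib — it assumes each HTML page already has a pre-generated `.md` companion (e.g. `p/topic.html` → `p/topic.md`); the `site/` directory and file layout are illustrative, not from any real deployment:

```python
# Content-negotiation sketch: serve the .md companion when the
# client asks for markdown. Paths and layout are hypothetical.
from http.server import BaseHTTPRequestHandler, HTTPServer
from pathlib import Path

ROOT = Path("site")  # hypothetical static-site output directory

def negotiate(accept_header: str, path: str) -> str:
    """Map an HTML path to its .md companion if markdown is preferred."""
    wants_md = "text/markdown" in (accept_header or "")
    if wants_md and path.endswith(".html"):
        return path[: -len(".html")] + ".md"
    return path

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        path = negotiate(self.headers.get("Accept", ""), self.path.lstrip("/"))
        target = ROOT / path
        if not target.is_file():
            self.send_error(404)
            return
        is_md = target.suffix == ".md"
        self.send_response(200)
        self.send_header(
            "Content-Type",
            "text/markdown; charset=utf-8" if is_md else "text/html; charset=utf-8",
        )
        self.send_header("Vary", "Accept")  # caches must key on the Accept header
        self.end_headers()
        self.wfile.write(target.read_bytes())

# To serve: HTTPServer(("", 8000), Handler).serve_forever()
```

Note the `Vary: Accept` header — without it, a shared cache could serve markdown to a browser or HTML to an agent.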
2. Structured Discovery Files
Put these at your root:
- `/robots.txt` — who can crawl, plus AI-specific directives
- `/llms.txt` — curated site summary for LLMs (markdown)
- `/sitemap.xml` — page discovery with dates
- `/index.json` — structured catalog (your schema)
Even if no crawler reads llms.txt today, it's cheap insurance. The file takes 30 minutes to write and it establishes a machine-readable entry point. When agents do start reading it — and they will, because the alternative is scraping — yours will be there.
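For reference, a minimal llms.txt following the llmstxt.org proposal is an H1, a one-line blockquote summary, and H2 sections of annotated links. The titles and URLs below are placeholders:

```markdown
# Example Site

> One-line description of what this site covers and who it's for.

## Pages

- [Agent-First Website Design](https://example.com/p/agent-first.md): Designing for AI readers
- [Dual-Format Pattern](https://example.com/p/dual-format.md): HTML for humans, markdown for agents

## Optional

- [Full content](https://example.com/llms-full.txt): All pages concatenated
```

Linking to the `.md` companions rather than the HTML pages keeps the whole discovery path markdown-native.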
3. Self-Describing Pages
Every page should carry enough metadata that an agent can understand it without external context:
- JSON-LD structured data — `schema.org/TechArticle` with title, description, author, date, keywords
- `<article>` semantic wrapper — enables Readability.js / Jina Reader extraction
- `<link rel="alternate">` — points to the `.md` companion
- YAML frontmatter in `.md` files — title, date, tags, source URL
An agent hitting any page on your site should be able to determine what it is, who wrote it, when, and what topics it covers — without fetching a separate index.
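Generating the JSON-LD block is a few lines if you already have page metadata. A sketch (the helper name and field values are illustrative; `TechArticle` and its properties are real schema.org vocabulary):

```python
# Emit a schema.org/TechArticle JSON-LD <script> block for a page.
# Function name and inputs are hypothetical; the vocabulary is schema.org's.
import json

def jsonld_tag(title, description, author, date_iso, keywords):
    data = {
        "@context": "https://schema.org",
        "@type": "TechArticle",
        "headline": title,
        "description": description,
        "author": {"@type": "Person", "name": author},
        "datePublished": date_iso,  # ISO 8601, e.g. "2026-03-01"
        "keywords": keywords,
    }
    return (
        '<script type="application/ld+json">'
        + json.dumps(data, indent=2)
        + "</script>"
    )
```

Drop the returned string into `<head>` at build time; agents and search crawlers parse it without rendering the page.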
The Dual-Format Pattern
The architecture I landed on for this site:
```
index.json     ← single source of truth (atom catalog)
index.html     ← fetches index.json, renders for humans
p/topic.html   ← rich interactive page (hand-authored)
p/topic.md     ← companion markdown (auto-generated by forge.py)
llms.txt       ← curated index for LLMs
llms-full.txt  ← concatenated content from all .md files
sitemap.xml    ← rebuilt from index.json
robots.txt     ← AI crawler permissions
```
The HTML pages are the source of truth — rich, styled, interactive. The .md files are generated derivatives. A build script (forge.py, ~300 lines, Python stdlib only) extracts content from HTML and generates everything else. A pre-commit hook runs the build + 13 BDD validation scenarios before every commit.
The key insight: your HTML is the product. The agent layer wraps it, never rewrites it. Don't adopt a static site generator to retrofit markdown-source onto working pages. Don't maintain two copies by hand. Generate the agent-readable format from the human-readable format, automatically.
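The extraction step is smaller than it sounds. A stripped-down sketch of the idea — pull text out of the `<article>` element and emit markdown headings — using only the stdlib. This is not forge.py itself; class and function names are hypothetical, and a real extractor would also handle links, lists, and code blocks:

```python
# Minimal HTML -> markdown extractor: keep only <article> content,
# turn h1-h3 into markdown headings. Illustrative, not forge.py.
from html.parser import HTMLParser

class ArticleExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_article = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "article":
            self.in_article = True
        elif self.in_article and tag in ("h1", "h2", "h3"):
            self.parts.append("#" * int(tag[1]) + " ")  # h2 -> "## "

    def handle_endtag(self, tag):
        if tag == "article":
            self.in_article = False
        elif self.in_article and tag in ("h1", "h2", "h3", "p", "li"):
            self.parts.append("\n\n")  # paragraph break after block elements

    def handle_data(self, data):
        if self.in_article and data.strip():
            self.parts.append(data.strip())

def html_to_md(html: str) -> str:
    parser = ArticleExtractor()
    parser.feed(html)
    return "".join(parser.parts).strip()
```

Everything outside `<article>` — navigation, footers, scripts — is dropped for free, which is exactly why the semantic wrapper matters.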
The Robots.txt Problem
AI crawlers are fragmenting. OpenAI and Anthropic each now operate three bots:
| Provider | Training | Search | User-initiated |
|---|---|---|---|
| OpenAI | GPTBot | OAI-SearchBot | ChatGPT-User |
| Anthropic | ClaudeBot | Claude-SearchBot | Claude-User |
This lets you block training while allowing search and agentic use. But it also means your robots.txt is getting complex, and compliance is unverifiable — Perplexity was caught using undeclared crawlers with generic user-agent strings to bypass blocking directives.
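A split policy against the table above looks like this — a sketch only; user-agent strings change, so verify the current names in each provider's crawler documentation before copying:

```
# Block model training
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

# Allow search indexing
User-agent: OAI-SearchBot
Allow: /

User-agent: Claude-SearchBot
Allow: /

# Allow user-initiated agent fetches
User-agent: ChatGPT-User
Allow: /

User-agent: Claude-User
Allow: /

User-agent: *
Allow: /
```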
The IETF's aipref working group (chartered January 2025) is building a formal extension to the Robots Exclusion Protocol with three usage categories: ai-train, ai-input (RAG/agentic), and search. Cloudflare's Content Signals is the draft implementation. Neither is enforceable — they're preference declarations, not access control.
What's Coming
Predictions, ordered by confidence:
- Content negotiation wins. `Accept: text/markdown` uses existing HTTP infrastructure, requires no new file formats, and Cloudflare's edge conversion removes the adoption barrier. This will be the default way agents read the web.
- IETF aipref becomes the new robots.txt. The three-signal model (train/input/search) will be an RFC within 18 months. Cloudflare's Content Signals is the implementation.
- llms.txt becomes niche. Useful for documentation sites, but eclipsed by content negotiation, which works on every page.
- The trust problem gets worse. Without cryptographic agent identity, publishers will block aggressively, pushing agents toward browser automation — the opposite of the cooperative vision.
Implementation Checklist
If you're building or maintaining a website today, do these in order:
1. Wrap content in `<article>` tags — a one-line edit per page that enables reader-mode extraction
2. Add JSON-LD structured data — `schema.org/Article` with title, description, date, author
3. Write a `robots.txt` with AI crawler directives — allow what you're comfortable with
4. Write an `llms.txt` — a curated markdown summary of your site's content
5. Generate `.md` companions for your HTML pages — link them via `<link rel="alternate">`
6. If on Cloudflare: enable Markdown for Agents (one toggle)
7. If self-hosting: build a script to generate agent-readable formats from your HTML
Total effort for a 20-page static site: one afternoon. This site is the reference implementation — llms.txt, index.json, and every page has a .md companion.
Sources
- llmstxt.org — The /llms.txt spec (Jeremy Howard, fast.ai)
- Cloudflare: Markdown for Agents — Edge-level content negotiation (Feb 2026)
- Checkly: State of AI Agent Content Negotiation — Which agents actually negotiate
- Vercel: Agent-Friendly Pages — Middleware implementation
- IETF aipref Working Group — Formal robots.txt extension
- Content Signals — Cloudflare's ai-train/ai-input/search spec
- Agent Web Protocol — Action-discovery standard
- Cloudflare: Who's Crawling Your Site — GPTBot 5%→30% traffic surge
- Linux Foundation: AAIF + AGENTS.md