Agent-First Website Design
Your website has two audiences now. One of them doesn't have eyes.
March 2026
Every website built before 2024 was designed for one reader: a human with a browser. That assumption is now wrong. AI agents — coding assistants, research tools, search augmenters — are hitting your pages, and they're getting HTML soup when they need structured text.
This isn't a future problem. GPTBot went from 5% to 30% of crawler traffic between May 2024 and May 2025. Over 560,000 sites now have AI-specific entries in their robots.txt. The agents are already here. The question is whether your site is legible to them.
The Standards Landscape
Six competing standards have emerged in 18 months. None has won. Here's the field:
| Standard | What it does | Adoption |
|---|---|---|
| llms.txt | Markdown site index at root, proposed by Jeremy Howard (fast.ai) | ~10K-844K sites (contested) |
| AGENTS.md | Project guidance for coding agents, donated to Linux Foundation | 60,000+ repos |
| Agent Web Protocol | JSON at /.well-known/agent.json declaring site capabilities | Early |
| Content negotiation | Accept: text/markdown → server returns markdown | Cloudflare + Vercel |
| IETF aipref | Formal extension to robots.txt for AI usage preferences | Draft RFC |
| Content Signals | Cloudflare's ai-train/ai-input/search in robots.txt | Cloudflare sites |
The catch: llms.txt has supply but no demand. Cloudflare's data showed zero visits from GPTBot, ClaudeBot, or PerplexityBot to llms.txt pages from August to October 2025. The supply side is building for demand that hasn't materialized in a standardized way.
What Actually Works (March 2026)
If you strip away the standards politics, three things demonstrably help agents consume your content:
1. Content Negotiation
The most technically mature approach. An agent sends Accept: text/markdown, your server returns clean markdown instead of HTML. Same URL, same content, different format.
Cloudflare shipped this at the edge in February 2026 — HTML-to-markdown conversion with zero origin changes. Token reduction: ~80% (16K tokens in HTML down to 3K in markdown). Vercel built it as Next.js middleware with 99.6% payload reduction.
If you're on Cloudflare, flip a switch. If you're on static hosting (GitHub Pages, Netlify), you can't do server-side negotiation. The fallback: serve companion .md files alongside your HTML and link them via <link rel="alternate" type="text/markdown">.
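If you're self-hosting, the negotiation logic is small. A minimal sketch using only the Python stdlib — it assumes each HTML page already has a pre-generated `.md` companion (e.g. `p/topic.html` → `p/topic.md`); the `site/` directory and file layout are illustrative, not from any real deployment:

```python
# Content-negotiation sketch: serve the .md companion when the
# client asks for markdown. Paths and layout are hypothetical.
from http.server import BaseHTTPRequestHandler, HTTPServer
from pathlib import Path

ROOT = Path("site")  # hypothetical static-site output directory

def negotiate(accept_header: str, path: str) -> str:
    """Map an HTML path to its .md companion if markdown is preferred."""
    wants_md = "text/markdown" in (accept_header or "")
    if wants_md and path.endswith(".html"):
        return path[: -len(".html")] + ".md"
    return path

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        path = negotiate(self.headers.get("Accept", ""), self.path.lstrip("/"))
        target = ROOT / path
        if not target.is_file():
            self.send_error(404)
            return
        is_md = target.suffix == ".md"
        self.send_response(200)
        self.send_header(
            "Content-Type",
            "text/markdown; charset=utf-8" if is_md else "text/html; charset=utf-8",
        )
        self.send_header("Vary", "Accept")  # caches must key on the Accept header
        self.end_headers()
        self.wfile.write(target.read_bytes())

# To serve: HTTPServer(("", 8000), Handler).serve_forever()
```

Note the `Vary: Accept` header — without it, a shared cache could serve markdown to a browser or HTML to an agent.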
2. Structured Discovery Files
Put these at your root:
- `/robots.txt` — who can crawl, plus AI-specific directives
- `/llms.txt` — curated site summary for LLMs (markdown)
- `/sitemap.xml` — page discovery with dates
- `/index.json` — structured catalog (your schema)
Even if no crawler reads llms.txt today, it's cheap insurance. The file takes 30 minutes to write and it establishes a machine-readable entry point. When agents do start reading it — and they will, because the alternative is scraping — yours will be there.
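For reference, a minimal llms.txt following the llmstxt.org proposal is an H1, a one-line blockquote summary, and H2 sections of annotated links. The titles and URLs below are placeholders:

```markdown
# Example Site

> One-line description of what this site covers and who it's for.

## Pages

- [Agent-First Website Design](https://example.com/p/agent-first.md): Designing for AI readers
- [Dual-Format Pattern](https://example.com/p/dual-format.md): HTML for humans, markdown for agents

## Optional

- [Full content](https://example.com/llms-full.txt): All pages concatenated
```

Linking to the `.md` companions rather than the HTML pages keeps the whole discovery path markdown-native.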
3. Self-Describing Pages
Every page should carry enough metadata that an agent can understand it without external context:
- JSON-LD structured data — `schema.org/TechArticle` with title, description, author, date, keywords
- `<article>` semantic wrapper — enables Readability.js / Jina Reader extraction
- `<link rel="alternate">` — points to the `.md` companion
- YAML frontmatter in `.md` files — title, date, tags, source URL
An agent hitting any page on your site should be able to determine what it is, who wrote it, when, and what topics it covers — without fetching a separate index.
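Generating the JSON-LD block is a few lines if you already have page metadata. A sketch (the helper name and field values are illustrative; `TechArticle` and its properties are real schema.org vocabulary):

```python
# Emit a schema.org/TechArticle JSON-LD <script> block for a page.
# Function name and inputs are hypothetical; the vocabulary is schema.org's.
import json

def jsonld_tag(title, description, author, date_iso, keywords):
    data = {
        "@context": "https://schema.org",
        "@type": "TechArticle",
        "headline": title,
        "description": description,
        "author": {"@type": "Person", "name": author},
        "datePublished": date_iso,  # ISO 8601, e.g. "2026-03-01"
        "keywords": keywords,
    }
    return (
        '<script type="application/ld+json">'
        + json.dumps(data, indent=2)
        + "</script>"
    )
```

Drop the returned string into `<head>` at build time; agents and search crawlers parse it without rendering the page.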
The Dual-Format Pattern
The architecture I landed on for this site:
```
index.json     ← single source of truth (atom catalog)
index.html     ← fetches index.json, renders for humans
p/topic.html   ← rich interactive page (hand-authored)
p/topic.md     ← companion markdown (auto-generated by forge.py)
llms.txt       ← curated index for LLMs
llms-full.txt  ← concatenated content from all .md files
sitemap.xml    ← rebuilt from index.json
robots.txt     ← AI crawler permissions
```
The HTML pages are the source of truth — rich, styled, interactive. The .md files are generated derivatives. A build script (forge.py, ~300 lines, Python stdlib only) extracts content from HTML and generates everything else. A pre-commit hook runs the build + 13 BDD validation scenarios before every commit.
The key insight: your HTML is the product. The agent layer wraps it, never rewrites it. Don't adopt a static site generator to retrofit markdown-source onto working pages. Don't maintain two copies by hand. Generate the agent-readable format from the human-readable format, automatically.
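The extraction step is smaller than it sounds. A stripped-down sketch of the idea — pull text out of the `<article>` element and emit markdown headings — using only the stdlib. This is not forge.py itself; class and function names are hypothetical, and a real extractor would also handle links, lists, and code blocks:

```python
# Minimal HTML -> markdown extractor: keep only <article> content,
# turn h1-h3 into markdown headings. Illustrative, not forge.py.
from html.parser import HTMLParser

class ArticleExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_article = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "article":
            self.in_article = True
        elif self.in_article and tag in ("h1", "h2", "h3"):
            self.parts.append("#" * int(tag[1]) + " ")  # h2 -> "## "

    def handle_endtag(self, tag):
        if tag == "article":
            self.in_article = False
        elif self.in_article and tag in ("h1", "h2", "h3", "p", "li"):
            self.parts.append("\n\n")  # paragraph break after block elements

    def handle_data(self, data):
        if self.in_article and data.strip():
            self.parts.append(data.strip())

def html_to_md(html: str) -> str:
    parser = ArticleExtractor()
    parser.feed(html)
    return "".join(parser.parts).strip()
```

Everything outside `<article>` — navigation, footers, scripts — is dropped for free, which is exactly why the semantic wrapper matters.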
The Robots.txt Problem
AI crawlers are fragmenting. OpenAI and Anthropic each now operate three bots:
| Provider | Training | Search | User-initiated |
|---|---|---|---|
| OpenAI | GPTBot | OAI-SearchBot | ChatGPT-User |
| Anthropic | ClaudeBot | Claude-SearchBot | Claude-User |
This lets you block training while allowing search and agentic use. But it also means your robots.txt is getting complex, and compliance is unverifiable — Perplexity was caught using undeclared crawlers with generic user-agent strings to bypass blocking directives.
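A split policy against the table above looks like this — a sketch only; user-agent strings change, so verify the current names in each provider's crawler documentation before copying:

```
# Block model training
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

# Allow search indexing
User-agent: OAI-SearchBot
Allow: /

User-agent: Claude-SearchBot
Allow: /

# Allow user-initiated agent fetches
User-agent: ChatGPT-User
Allow: /

User-agent: Claude-User
Allow: /

User-agent: *
Allow: /
```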
The IETF's aipref working group (chartered January 2025) is building a formal extension to the Robots Exclusion Protocol with three usage categories: ai-train, ai-input (RAG/agentic), and search. Cloudflare's Content Signals is the draft implementation. Neither is enforceable — they're preference declarations, not access control.
What's Coming
Predictions, ordered by confidence:
- Content negotiation wins. `Accept: text/markdown` uses existing HTTP infrastructure, requires no new file formats, and Cloudflare's edge conversion removes the adoption barrier. This will be the default way agents read the web.
- IETF aipref becomes the new robots.txt. The three-signal model (train/input/search) will be an RFC within 18 months. Cloudflare's Content Signals is the implementation.
- llms.txt becomes niche. Useful for documentation sites, but eclipsed by content negotiation, which works on every page.
- The trust problem gets worse. Without cryptographic agent identity, publishers will block aggressively, pushing agents toward browser automation — the opposite of the cooperative vision.
Implementation Checklist
If you're building or maintaining a website today, do these in order:
1. Wrap content in `<article>` tags — a one-line edit per page that enables reader-mode extraction
2. Add JSON-LD structured data — `schema.org/Article` with title, description, date, author
3. Write a `robots.txt` with AI crawler directives — allow what you're comfortable with
4. Write an `llms.txt` — a curated markdown summary of your site's content
5. Generate `.md` companions for your HTML pages — link them via `<link rel="alternate">`
6. If on Cloudflare: enable Markdown for Agents (one toggle)
7. If self-hosting: build a script to generate agent-readable formats from your HTML
Total effort for a 20-page static site: one afternoon. This site is the reference implementation — llms.txt, index.json, and every page has a .md companion.
Sources
- llmstxt.org — The /llms.txt spec (Jeremy Howard, fast.ai)
- Cloudflare: Markdown for Agents — Edge-level content negotiation (Feb 2026)
- Checkly: State of AI Agent Content Negotiation — Which agents actually negotiate
- Vercel: Agent-Friendly Pages — Middleware implementation
- IETF aipref Working Group — Formal robots.txt extension
- Content Signals — Cloudflare's ai-train/ai-input/search spec
- Agent Web Protocol — Action-discovery standard
- Cloudflare: Who's Crawling Your Site — GPTBot 5%→30% traffic surge
- Linux Foundation: AAIF + AGENTS.md