---
title: "Agent-First Website Design"
description: "Your website has two audiences now. One of them doesn't have eyes. Standards, patterns, implementation."
author: "Joey Lopez"
date: "2026-03-25"
tags: ["methodology", "code", "theory"]
atom_id: 26
source_html: "agent-first-design.html"
url: "https://jrlopez.dev/p/agent-first-design.html"
generated: true
---

# Agent-First Website Design

*Your website has two audiences now. One of them doesn't have eyes.*

March 2026

Every website built before 2024 was designed for one reader: a human with a browser. That assumption is now wrong. AI agents — coding assistants, research tools, search augmenters — are hitting your pages, and they're getting HTML soup when they need structured text. This isn't a future problem. GPTBot went from 5% to 30% of crawler traffic between May 2024 and May 2025. Over 560,000 sites now have AI-specific entries in their robots.txt. The agents are already here. The question is whether your site is legible to them.
## The Standards Landscape

Six competing standards have emerged in 18 months. None has won. Here's the field:

| Standard | What it does | Adoption |
|---|---|---|
| llms.txt | Markdown site index at root, proposed by Jeremy Howard (fast.ai) | ~10K-844K sites (contested) |
| AGENTS.md | Project guidance for coding agents, donated to Linux Foundation | 60,000+ repos |
| Agent Web Protocol | JSON at `/.well-known/agent.json` declaring site capabilities | Early |
| Content negotiation | `Accept: text/markdown` → server returns markdown | Cloudflare + Vercel |
| IETF aipref | Formal extension to robots.txt for AI usage preferences | Draft RFC |
| Content Signals | Cloudflare's `ai-train` / `ai-input` / `search` in robots.txt | Cloudflare sites |

**The uncomfortable truth:** No LLM provider has confirmed their crawlers actually read `llms.txt`. Cloudflare's data showed **zero visits** from GPTBot, ClaudeBot, or PerplexityBot to `llms.txt` pages from August to October 2025. The supply side is building for demand that hasn't materialized in a standardized way.
## What Actually Works (March 2026)

If you strip away the standards politics, three things demonstrably help agents consume your content:
### 1. Content Negotiation

The most technically mature approach. An agent sends `Accept: text/markdown`; your server returns clean markdown instead of HTML. Same URL, same content, different format. Cloudflare shipped this at the edge in February 2026 — HTML-to-markdown conversion with zero origin changes. Token reduction: **~80%** (16K tokens in HTML down to 3K in markdown). Vercel built it as Next.js middleware with 99.6% payload reduction.

If you're on Cloudflare, flip a switch. If you're on static hosting (GitHub Pages, Netlify), you can't do server-side negotiation. The fallback: serve companion `.md` files alongside your HTML and link them via `<link rel="alternate" type="text/markdown">`.
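The negotiation decision itself is tiny. A minimal sketch — not Cloudflare's implementation — assuming each HTML page has a companion `.md` file (`p/topic.html` → `p/topic.md`):

```python
def negotiate(path: str, accept: str) -> str:
    """Pick the file to serve for `path`, given the request's Accept header.

    Assumes a companion .md file exists next to every .html page; a real
    server would also fall back to HTML when the .md file is missing.
    """
    if "text/markdown" in accept and path.endswith(".html"):
        return path[: -len(".html")] + ".md"
    return path
```

The same check works in middleware, an edge worker, or a CGI script; the point is that it keys off a header the agent already sends, not a new URL scheme.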
### 2. Structured Discovery Files

Put these at your root:
```
/robots.txt       — Who can crawl, AI-specific directives
/llms.txt         — Curated site summary for LLMs (Markdown)
/sitemap.xml      — Page discovery with dates
/index.json       — Structured catalog (your schema)
```

Even if no crawler reads `llms.txt` today, it's cheap insurance. The file takes 30 minutes to write, and it establishes a machine-readable entry point. When agents do start reading it — and they will, because the alternative is scraping — yours will be there.
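A minimal `llms.txt` in the llmstxt.org shape might look like this — the summary line and post description below are placeholders, not this site's actual file:

```
# jrlopez.dev

> One-line summary of what this site is and who writes it.

## Posts

- [Agent-First Website Design](https://jrlopez.dev/p/agent-first-design.md): standards, patterns, implementation
```

An H1 for the site name, a blockquote summary, then sections of links with short annotations — that's the whole spec.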
### 3. Self-Describing Pages

Every page should carry enough metadata that an agent can understand it without external context:
- **JSON-LD structured data** — `schema.org/TechArticle` with title, description, author, date, keywords
- **`<article>` semantic wrapper** — enables Readability.js / Jina Reader extraction
- **`<link rel="alternate">`** — points to the `.md` companion
- **YAML frontmatter in `.md` files** — title, date, tags, source URL

An agent hitting any page on your site should be able to determine what it is, who wrote it, when, and what topics it covers — without fetching a separate index.
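Assembled in a page's markup, the pieces look roughly like this (the values mirror this post's own frontmatter; the exact layout is illustrative):

```html
<head>
  <link rel="alternate" type="text/markdown" href="agent-first-design.md">
  <script type="application/ld+json">
  {
    "@context": "https://schema.org",
    "@type": "TechArticle",
    "headline": "Agent-First Website Design",
    "author": { "@type": "Person", "name": "Joey Lopez" },
    "datePublished": "2026-03-25",
    "keywords": "methodology, code, theory"
  }
  </script>
</head>
<body>
  <article>
    <!-- page content: everything an extractor should keep goes here -->
  </article>
</body>
```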
## The Dual-Format Pattern

The architecture I landed on for this site:
```
index.json          ← single source of truth (atom catalog)
index.html          ← fetches index.json, renders for humans
p/topic.html        ← rich interactive page (hand-authored)
p/topic.md          ← companion markdown (auto-generated by forge.py)
llms.txt            ← curated index for LLMs
llms-full.txt       ← concatenated content from all .md files
sitemap.xml         ← rebuilt from index.json
robots.txt          ← AI crawler permissions
```

The HTML pages are the source of truth for content — rich, styled, interactive. The `.md` files are generated derivatives. A build script (`forge.py`, ~300 lines, Python stdlib only) extracts content from HTML and generates everything else. A pre-commit hook runs the build + 13 BDD validation scenarios before every commit.

The key insight: **your HTML is the product. The agent layer wraps it, never rewrites it.** Don't adopt a static site generator to retrofit markdown-source onto working pages. Don't maintain two copies by hand. Generate the agent-readable format from the human-readable format, automatically.
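`forge.py` isn't reproduced here, but its core extraction step — pulling text out of the `<article>` wrapper with only the standard library — can be sketched like this (the class and function names are mine, not the script's):

```python
from html.parser import HTMLParser


class ArticleExtractor(HTMLParser):
    """Collect the text inside an <article> element, stdlib only."""

    def __init__(self):
        super().__init__()
        self.in_article = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "article":
            self.in_article = True

    def handle_endtag(self, tag):
        if tag == "article":
            self.in_article = False

    def handle_data(self, data):
        text = data.strip()
        if self.in_article and text:
            self.chunks.append(text)


def extract_article_text(html: str) -> str:
    """Return the text content of the page's <article>, one chunk per line."""
    parser = ArticleExtractor()
    parser.feed(html)
    return "\n".join(parser.chunks)
```

A real build script would also map headings and lists back to markdown syntax, but the principle is the same: the `<article>` boundary tells the extractor exactly what to keep.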
## The Robots.txt Problem

AI crawlers are fragmenting. Both OpenAI and Anthropic now operate three bots each:

| Provider | Training | Search | User-initiated |
|---|---|---|---|
| OpenAI | GPTBot | OAI-SearchBot | ChatGPT-User |
| Anthropic | ClaudeBot | Claude-SearchBot | Claude-User |

This lets you block training while allowing search and agentic use. But it also means your robots.txt is getting complex, and compliance is unverifiable — Perplexity was caught using undeclared crawlers with generic user-agent strings to bypass blocking directives. The IETF's `aipref` working group (chartered January 2025) is building a formal extension to the Robots Exclusion Protocol with three usage categories: `ai-train`, `ai-input` (RAG/agentic), and `search`. Cloudflare's Content Signals is the draft implementation. Neither is enforceable — they're preference declarations, not access control.
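A robots.txt that blocks training but allows search and user-initiated fetching might look like the following. The `Content-Signal` line follows Cloudflare's draft syntax as I understand it, and — like everything here — it's a preference declaration, not access control:

```
# Block model training
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

# Allow search indexing
User-agent: OAI-SearchBot
Allow: /

User-agent: Claude-SearchBot
Allow: /

# Allow user-initiated agent fetches
User-agent: ChatGPT-User
Allow: /

User-agent: Claude-User
Allow: /

# Content Signals (draft, advisory only)
Content-Signal: ai-train=no, ai-input=yes, search=yes
```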
## What's Coming

Predictions, ordered by confidence:
- **Content negotiation wins.** `Accept: text/markdown` uses existing HTTP infrastructure, requires no new file formats, and Cloudflare's edge conversion removes the adoption barrier. This will be the default way agents read the web.
- **IETF aipref becomes the new robots.txt.** The three-signal model (train/input/search) will be an RFC within 18 months. Cloudflare's Content Signals is the implementation.
- **llms.txt becomes niche.** Useful for documentation sites, but eclipsed by content negotiation, which works on every page.
- **The trust problem gets worse.** Without cryptographic agent identity, publishers will block aggressively, pushing agents toward browser automation — the opposite of the cooperative vision.
## Implementation Checklist

If you're building or maintaining a website today, do these in order:
- **Wrap content in `<article>` tags** — one-line edit per page, enables reader-mode extraction
- **Add JSON-LD structured data** — `schema.org/Article` with title, description, date, author
- **Write a robots.txt** with AI crawler directives — allow what you're comfortable with
- **Write a llms.txt** — curated markdown summary of your site's content
- **Generate `.md` companions** for your HTML pages — link via `<link rel="alternate">`
- **If on Cloudflare:** enable Markdown for Agents (one toggle)
- **If self-hosting:** build a script to generate agent-readable formats from your HTML

Total effort for a 20-page static site: one afternoon. This site is the reference implementation — [llms.txt](), [index.json](), and every page has a [.md companion]().
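The file-level items in the checklist can be spot-checked with a short script. This is a sketch, not this site's actual validation suite; it assumes the discovery-file layout described earlier and that every HTML page gets a sibling `.md` companion:

```python
import os

# Discovery files expected at the site root (per the layout above).
REQUIRED = ("robots.txt", "llms.txt", "sitemap.xml", "index.json")


def check_site(root: str) -> list:
    """Return a list of problems: missing discovery files and HTML
    pages without a .md companion. Empty list means the checks pass."""
    problems = [
        f"missing {name}"
        for name in REQUIRED
        if not os.path.exists(os.path.join(root, name))
    ]
    for dirpath, _, files in os.walk(root):
        for name in files:
            if name.endswith(".html"):
                page = os.path.join(dirpath, name)
                if not os.path.exists(page[: -len(".html")] + ".md"):
                    problems.append(f"no .md companion for {page}")
    return problems
```

Wire it into a pre-commit hook and the agent layer can't silently drift out of sync with the HTML.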
## Sources
- [llmstxt.org]() — The /llms.txt spec (Jeremy Howard, fast.ai)
- [Cloudflare: Markdown for Agents]() — Edge-level content negotiation (Feb 2026)
- [Checkly: State of AI Agent Content Negotiation]() — Which agents actually negotiate
- [Vercel: Agent-Friendly Pages]() — Middleware implementation
- [IETF aipref Working Group]() — Formal robots.txt extension
- [Content Signals]() — Cloudflare's ai-train/ai-input/search spec
- [Agent Web Protocol]() — Action-discovery standard
- [Cloudflare: Who's Crawling Your Site]() — GPTBot 5%→30% traffic surge
- [Linux Foundation: AAIF + AGENTS.md]() — AGENTS.md donation