Manage how LLMs and their web crawlers interact with your website

Team Flare
This guide is for technical SEOs, web managers, and content leads who want control over how large language models (LLMs) like ChatGPT, Claude, and Gemini access and use their sites. You’ll learn practical robots.txt patterns, when to allow or block specific AI crawlers, and how to monitor and enforce your policy.
Why this matters
- Visibility: Allowing reputable AI crawlers can help your brand be referenced in AI answers and copilots.
- Control: You may want to block training crawlers from premium, private, or rate-limited sections.
- Costs: Unchecked bot traffic can strain servers. Set limits to protect performance.
If your goal is to be cited and recommended by AI systems, pair access controls with Generative Engine Optimization (GEO). For a deeper primer, see what Generative Engine Optimization (GEO) is and our B2B GEO checklist. If you want to make your content easier for LLMs to use, our guide on llms.txt covers emerging best practices.
How LLMs reach your site
Most LLM providers run their own web crawlers (identified by user-agent strings), and many also ingest data from large third‑party crawls like Common Crawl. You can signal access preferences with robots.txt, a standard file served at yourdomain.com/robots.txt that bots check for rules.
Robots.txt is a public, advisory protocol. It’s widely respected by reputable crawlers, but it doesn’t secure content. Google’s guide explains the basics and limits clearly in Create a robots.txt file. For a refresher, see Search Engine Journal’s robots.txt guide and Cloudflare’s overview What is robots.txt?.
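For orientation, here is a minimal illustrative robots.txt. Each group names a user-agent and lists the paths it may (Allow) or may not (Disallow) fetch; the paths below are placeholders, not recommendations:
# Illustrative only — placeholder paths
User-agent: *
Disallow: /private/

User-agent: GPTBot
Disallow: /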
Decide your policy first
- Allow all reputable AI crawlers to improve AI visibility and citations.
- Allow some, block others to balance reach and risk.
- Block training crawlers while still welcoming traditional search engines.
- Block most and only allow known partners or use rate limits.
Tip: If you block Common Crawl (CCBot) but allow other AI crawlers, some models may still reach your content through those other crawlers. Conversely, if you allow CCBot, many AI systems can still learn from your content indirectly via open datasets, even if you block their own bots.
Robots.txt templates you can copy
Place robots.txt at the site root. Test your changes. Keep comments clear for future you.
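Once deployed, a quick way to confirm the file is actually being served (the URL is a placeholder):
curl -s -o /dev/null -w "%{http_code} %{content_type}\n" https://yourdomain.com/robots.txt
You should see a 200 status and a text/plain content type.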
Allow selected LLM crawlers, block everything else
# Allow named AI crawlers, block the rest
User-agent: GPTBot
Allow: /
User-agent: ClaudeBot
Allow: /
User-agent: CCBot
Allow: /
User-agent: cohere-ai
Allow: /
User-agent: Bytespider
Allow: /
Optional examples to consider:
User-agent: PerplexityBot
Allow: /
# Default rule for all other crawlers
User-agent: *
Disallow: /
Block known AI training crawlers, allow normal search
# Allow mainstream search engines their normal access
User-agent: Googlebot
Allow: /
User-agent: Bingbot
Allow: /
User-agent: DuckDuckBot
Allow: /
# Block common AI training crawlers
User-agent: GPTBot
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: CCBot
Disallow: /
User-agent: cohere-ai
Disallow: /
User-agent: Bytespider
Disallow: /
Optional:
User-agent: PerplexityBot
Disallow: /
Mixed policy by path
# Allow AI crawlers on public docs, block private areas
User-agent: GPTBot
Allow: /docs/
Disallow: /pricing/
Disallow: /account/
User-agent: ClaudeBot
Allow: /docs/
Disallow: /pricing/
Disallow: /account/
User-agent: *
Allow: /
Disallow: /account/
Rate limiting and crawl-delay
Crawl-delay is not part of the original Robots Exclusion Protocol, and many major bots ignore it. If you still want to signal it to the crawlers that honor it:
# Some bots honor this, many don't
User-agent: *
Crawl-delay: 10
For real protection, use your CDN or WAF for rate limiting and bot rules. Cloudflare documents options like a managed robots.txt and bot controls in Managed robots.txt.
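One way to verify that a CDN/WAF rate limit is actually enforced is to replay requests with a bot user-agent and watch for throttled responses. This is a rough sketch with an assumed URL; the status code you get back once the limit trips (often 429 or 403) depends on how your provider is configured:
# Send 30 quick requests as GPTBot and tally the status codes
for i in $(seq 1 30); do
  curl -s -o /dev/null -w "%{http_code}\n" -A "GPTBot" https://yourdomain.com/docs/
done | sort | uniq -c
If every response is a 200, the rate limit is not triggering for that path or user-agent.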
Known AI crawlers and notes
- GPTBot — OpenAI’s crawler.
- ClaudeBot — Anthropic’s crawler.
- CCBot — Common Crawl. Allowing it often means your content enters large open datasets.
- cohere-ai — Cohere’s crawler.
- Bytespider — ByteDance’s crawler.
- PerplexityBot — Perplexity’s crawler (name varies across reports).
Bot names and behaviors evolve. Keep an updated watchlist. The team at BotRank keeps a running index of AI crawlers here: Robots.txt: all AI crawlers. For practical blocking patterns, see AI crawlers and how to block them and DataDome’s perspective on limits of robots.txt in Blocking with robots.txt.
Beyond robots.txt: enforce and monitor
- CDN/WAF policies: Rate-limit or block by user-agent and IP. Create exceptions for legitimate bots.
- IP allow/deny lists: Many providers publish IP ranges. Validate in logs before trusting.
- Traffic analytics: Track user-agent, IP, bytes served, and response codes. Watch for spikes on heavy pages.
- Honeypot URLs: Place a disallowed test path in robots.txt. Crawls on that path signal noncompliance.
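To make the honeypot idea concrete: add a Disallow rule for an invented path that nothing on your site links to (the path below is made up; fold the line into your existing User-agent: * group if you have one):
User-agent: *
Disallow: /bot-trap-8f3a/
Then check your access logs for anyone fetching it. This assumes the common nginx/Apache "combined" log format and a log path of /var/log/nginx/access.log:
grep '/bot-trap-8f3a/' /var/log/nginx/access.log | awk -F'"' '{print $6}' | sort | uniq -c | sort -rn
Any user-agent that shows up here requested a path it was explicitly told to avoid.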
If you need a one-stop policy layer, bot protection providers and CDNs can help. Cloudflare’s docs are a good starting point: What is robots.txt? and Managed robots.txt.
How to test your rules
- Manual fetch: Use curl with a bot user-agent against a sensitive URL and confirm the response (to check several bots at once, see the loop after this list).
curl -A "GPTBot" -I https://yourdomain.com/private/
- Server logs: Filter by known AI user-agents and check hit patterns.
- Robots syntax check: Validate formatting against guides: Google robots.txt and SEJ robots.txt guide.
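To compare how your rules treat different bots in one pass, you can loop a few user-agent strings over the same sensitive URL (the URL and bot list are assumptions; adjust to your site):
for ua in GPTBot ClaudeBot CCBot Bytespider PerplexityBot; do
  code=$(curl -s -o /dev/null -w "%{http_code}" -A "$ua" "https://yourdomain.com/private/")
  echo "$ua -> $code"
done
Note that robots.txt alone won't change these responses; this check is most useful once WAF/CDN enforcement is in place, where blocked bots should see an error or block page and allowed bots a normal 200.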
FAQ-style answers
Will robots.txt keep my content out of all LLMs? It strongly signals your preference. Reputable crawlers comply. Some bots ignore it. Use enforcement (CDN/WAF, IP filters) if you need stronger controls.
Does blocking AI crawlers hurt SEO? Blocking AI crawlers doesn’t affect traditional search crawlers like Googlebot or Bingbot if you still allow them. Keep rules explicit to avoid accidental blocks.
Should I block Common Crawl (CCBot)? If you don’t want your content widely available in open datasets, yes. If you want maximum reach in AI systems, consider allowing it.
What about meta tags like noai? There are emerging signals, but robots.txt is the most broadly honored control today. Combine it with network controls for enforcement.
A simple rollout plan
- Audit: Pull the last 30 days of logs. List AI user-agents and their hit volumes (a sample log query follows this list).
- Choose a policy: Allow all, mixed, or block. Map any path exceptions.
- Implement robots.txt: Start with the templates above.
- Enforce: Add CDN/WAF rate limits and bot rules for noncompliant traffic.
- Monitor and iterate: Review logs weekly. Update your allow/deny list monthly.
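For the audit step, a single log query can surface which AI user-agents are hitting you hardest. This assumes the nginx/Apache "combined" log format and a log path of /var/log/nginx/access.log; adjust the bot pattern to your own watchlist:
awk -F'"' '{print $6}' /var/log/nginx/access.log \
  | grep -iE 'gptbot|claudebot|ccbot|cohere|bytespider|perplexity' \
  | sort | uniq -c | sort -rn
The counts give you a rough hit volume per AI user-agent to inform your allow/deny decisions.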
Related resources
- llms.txt: a practical guide to make your site available to AI
- What GEO is and how it helps your content be found in AI
- Blending traditional SEO and GEO tactics for Drupal and WordPress
If you manage Drupal or WordPress and want a smoother workflow, explore Drupal SEO Studio and our SEO Studio API tooling. They pair well with a GEO strategy and clear bot access rules.
Stay flexible. AI crawlers change names and behavior. Keep your robots.txt clean, your logs active, and your enforcement ready.