Tutorial

How to Block AI Crawlers: Complete 2025 Guide (4 Methods)

12 min read

How to Block AI Crawlers: Complete 2025 Guide

35% of the world's top 1,000 websites now block AI crawlers, and for good reason. AI bots like GPTBot and ClaudeBot scrape web content to train models like ChatGPT and Claude, costing site owners thousands of dollars in bandwidth and siphoning away traffic. 48% of news sites already block AI crawlers.

This comprehensive guide shows you exactly how to block AI crawlers using 4 different methods, from beginner-friendly robots.txt to advanced server-level blocking.

Why Block AI Crawlers?

Before we dive into the "how," let's quickly cover the "why":

Cost Savings:

  • Reduce bandwidth by 50-75%
  • Save $1,500-$5,000/month in CDN costs
  • Wikimedia saw a 50% bandwidth increase from AI bots alone

Content Protection:

  • Stop AI models from training on your original content
  • Keep control of your intellectual property

Traffic Recovery:

  • Users visit your site instead of getting AI-generated answers
  • Protect advertising revenue
  • Maintain direct customer relationships

First, check which AI bots are currently crawling your site with our free detection tool →


Method 1: robots.txt Configuration

Difficulty: ⭐ Easy
Effectiveness: 70% (compliant bots only)
Time: 5 minutes

How It Works

The robots.txt file tells crawlers which parts of your site they can access. Most legitimate AI crawlers (GPTBot, ClaudeBot) respect this file.

Step 1: Locate Your robots.txt

Your robots.txt file should be at: https://yoursite.com/robots.txt

If it doesn't exist, create it in your website's root directory.

Step 2: Add Blocking Rules

Add these lines to block major AI crawlers:

# Block OpenAI GPTBot
User-agent: GPTBot
Disallow: /

# Block Anthropic ClaudeBot
User-agent: ClaudeBot
Disallow: /

# Block Google Bard/Gemini training
User-agent: Google-Extended
Disallow: /

# Block Common Crawl
User-agent: CCBot
Disallow: /

# Block Bytespider (TikTok/ByteDance)
User-agent: Bytespider
Disallow: /

# Block Perplexity AI
User-agent: PerplexityBot
Disallow: /

# Block other major AI bots
User-agent: anthropic-ai
Disallow: /

User-agent: cohere-ai
Disallow: /

User-agent: Omgilibot
Disallow: /

User-agent: FacebookBot
Disallow: /

User-agent: Applebot-Extended
Disallow: /

User-agent: ChatGLM-Spider
Disallow: /

User-agent: Diffbot
Disallow: /

User-agent: ImagesiftBot
Disallow: /
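If you want to sanity-check rules like these before deploying, Python's standard-library `urllib.robotparser` can evaluate them locally. A minimal sketch (the rules and URL path are shortened examples):

```python
# Check robots.txt rules locally with Python's built-in parser.
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# Listed bots are disallowed everywhere; unlisted bots are unaffected.
print(parser.can_fetch("GPTBot", "/any/page"))     # False
print(parser.can_fetch("Googlebot", "/any/page"))  # True
```

This confirms the rules block exactly the agents you listed without touching search engine crawlers.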

Step 3: Block All 29 AI Crawlers

For complete protection, use our robots.txt generator:

👉 Generate Custom robots.txt →

Step 4: Verify It Works

  1. Save the file
  2. Visit https://yoursite.com/robots.txt
  3. Use CheckAIBots to verify blocking

Important Limitations

robots.txt CANNOT block:

  • Bytespider (ignores the file completely)
  • 360Spider (often doesn't respect rules)
  • Malicious scrapers
  • Bots with bugs in their code

For these bots, use Method 2 or 3 (server-level blocking).


Method 2: Nginx Configuration

Difficulty: ⭐⭐ Intermediate
Effectiveness: 95%
Time: 10 minutes

Server-level blocking prevents bots from accessing your site at all, regardless of whether they respect robots.txt.

Step 1: Open Nginx Config

sudo nano /etc/nginx/nginx.conf
# or
sudo nano /etc/nginx/sites-available/default

Step 2: Add User-Agent Blocking

Add this inside your server block:

# Block AI crawlers
if ($http_user_agent ~* (GPTBot|ClaudeBot|Claude-Web|anthropic-ai|cohere-ai|Omgilibot|FacebookBot|Applebot-Extended|Bytespider|YouBot|PerplexityBot|Google-Extended|CCBot|ChatGPT-User|OAI-SearchBot|Diffbot|ImagesiftBot|ChatGLM-Spider|360Spider|Baiduspider|PetalBot)) {
    return 403;
}

Note: this pattern also matches Baiduspider and PetalBot, which are search engine crawlers rather than AI training bots. Blocking them will remove your site from Baidu and Petal Search results, so drop them from the pattern if you want that traffic.
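The `~*` operator performs a case-insensitive regex match against the User-Agent header. If you want to test a pattern offline before deploying it, the same match can be sketched in Python (the pattern below is a shortened example of the full list above):

```python
# Mirror nginx's case-insensitive "~*" user-agent match for offline testing.
import re

# Shortened example pattern; the real nginx rule lists more bots.
AI_BOTS = re.compile(
    r"GPTBot|ClaudeBot|Claude-Web|Bytespider|PerplexityBot|CCBot",
    re.IGNORECASE,
)

def is_blocked(user_agent: str) -> bool:
    """Return True if the user agent matches the blocklist pattern."""
    return AI_BOTS.search(user_agent) is not None

print(is_blocked("Mozilla/5.0 (compatible; GPTBot/1.2)"))  # True
print(is_blocked("Mozilla/5.0 (compatible; gptbot/1.2)"))  # True (case-insensitive)
print(is_blocked("Mozilla/5.0 (Windows NT 10.0)"))         # False
```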

Step 3: Test Configuration

sudo nginx -t

If you see "test is successful", proceed to Step 4.

Step 4: Reload Nginx

sudo systemctl reload nginx

Advanced: Separate Config File

For cleaner organization:

# Create file
sudo nano /etc/nginx/snippets/block-ai-bots.conf

# Add this content:
if ($http_user_agent ~* (GPTBot|ClaudeBot|Claude-Web|anthropic-ai|cohere-ai|Omgilibot|FacebookBot|Applebot-Extended|Bytespider|YouBot|PerplexityBot|Google-Extended|CCBot|ChatGPT-User|OAI-SearchBot|Diffbot|ImagesiftBot|ChatGLM-Spider|360Spider|Baiduspider|PetalBot)) {
    return 403;
}

# Include in your server block:
include snippets/block-ai-bots.conf;
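For reference, a server block using that include might look like this (the domain and paths are placeholders, adjust to your setup):

```
server {
    listen 80;
    server_name yoursite.com;
    root /var/www/yoursite;

    # Reject AI crawlers before any location is matched
    include snippets/block-ai-bots.conf;

    location / {
        try_files $uri $uri/ =404;
    }
}
```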

Verify Blocking Works

Test with curl:

curl -A "GPTBot" https://yoursite.com
# Should return: 403 Forbidden

Method 3: Apache/.htaccess

Difficulty: ⭐⭐ Intermediate
Effectiveness: 95%
Time: 10 minutes

Step 1: Locate .htaccess

Find your .htaccess file in your website root (where index.php/index.html is).

If it doesn't exist, create it:

touch .htaccess

Step 2: Add Blocking Rules

Add this to your .htaccess:

# Block AI Crawlers
RewriteEngine On

# Block GPTBot
RewriteCond %{HTTP_USER_AGENT} GPTBot [NC]
RewriteRule .* - [F,L]

# Block Anthropic bots (ClaudeBot, Claude-Web, anthropic-ai)
RewriteCond %{HTTP_USER_AGENT} ClaudeBot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Claude-Web [NC,OR]
RewriteCond %{HTTP_USER_AGENT} anthropic-ai [NC]
RewriteRule .* - [F,L]

# Block Google Bard/Gemini
RewriteCond %{HTTP_USER_AGENT} Google-Extended [NC]
RewriteRule .* - [F,L]

# Block other major AI bots
RewriteCond %{HTTP_USER_AGENT} CCBot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Bytespider [NC,OR]
RewriteCond %{HTTP_USER_AGENT} PerplexityBot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} cohere-ai [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Omgilibot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} FacebookBot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Applebot-Extended [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Diffbot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ImagesiftBot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ChatGLM-Spider [NC,OR]
RewriteCond %{HTTP_USER_AGENT} 360Spider [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Baiduspider [NC]
RewriteRule .* - [F,L]

Note: Baiduspider is Baidu's search crawler, not an AI training bot. Blocking it will remove your site from Baidu search results.

Step 3: Test Configuration

Visit your website. It should load normally. Then test:

curl -A "GPTBot" https://yoursite.com
# Should return: 403 Forbidden

Method 4: Cloudflare WAF Rules

Difficulty: ⭐ Easy
Effectiveness: 99%
Time: 3 minutes

If you use Cloudflare, this is the easiest and most effective method.

Option A: One-Click Blocking (All Cloudflare Plans)

  1. Log in to Cloudflare Dashboard
  2. Select your domain
  3. Go to Security > Bots
  4. Find "AI Scrapers and Crawlers"
  5. Toggle it ON

Done! This blocks all known AI crawlers automatically.

Option B: Custom WAF Rule (More Control)

For selective blocking:

  1. Go to Security > WAF
  2. Click Create Rule
  3. Rule name: Block AI Crawlers
  4. Expression:
(http.user_agent contains "GPTBot") or
(http.user_agent contains "ClaudeBot") or
(http.user_agent contains "Claude-Web") or
(http.user_agent contains "anthropic-ai") or
(http.user_agent contains "Google-Extended") or
(http.user_agent contains "CCBot") or
(http.user_agent contains "Bytespider") or
(http.user_agent contains "PerplexityBot") or
(http.user_agent contains "cohere-ai") or
(http.user_agent contains "Omgilibot") or
(http.user_agent contains "FacebookBot") or
(http.user_agent contains "Applebot-Extended") or
(http.user_agent contains "ChatGPT-User")
  5. Action: Block
  6. Click Deploy

Selective Blocking Strategy

Not all AI bots are bad. Here's a recommended approach:

Always Block (No Value)

  • ✅ Bytespider (wastes bandwidth, ignores robots.txt)
  • ✅ 360Spider (often doesn't respect rules)
  • ✅ CCBot (Common Crawl - no direct benefit)
  • ✅ GPTBot (if you don't want ChatGPT training on your content)
  • ✅ ClaudeBot (if you don't want Claude training on your content)

Consider Allowing (Potential Traffic)

  • ⚠️ PerplexityBot (drives AI search traffic)
  • ⚠️ OAI-SearchBot (ChatGPT search with attribution)
  • ⚠️ YouBot (AI search platform)
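
In robots.txt terms, a selective policy along these lines might look like this (adjust the allow/block split to your own decisions):

```
# Block training crawlers
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: CCBot
Disallow: /

# Allow AI search crawlers that can send you traffic
User-agent: PerplexityBot
Allow: /

User-agent: OAI-SearchBot
Allow: /
```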

Use CheckAIBots to generate a customized blocking strategy.


How to Verify Your Blocking Works

Method 1: Use CheckAIBots

The easiest way:

  1. Go to CheckAIBots.com
  2. Enter your website URL
  3. Click "Check Now"
  4. See which bots are blocked/allowed
  5. Get recommendations

Method 2: Test with Curl

# Test GPTBot
curl -A "GPTBot" https://yoursite.com

# Test ClaudeBot
curl -A "ClaudeBot" https://yoursite.com

# Test Bytespider
curl -A "Bytespider" https://yoursite.com

You should see 403 Forbidden or 451 Unavailable For Legal Reasons.

Method 3: Check Server Logs

Monitor your access logs for AI bot user agents:

grep -i "GPTBot\|ClaudeBot\|Bytespider" /var/log/nginx/access.log

If your blocking works, you should see 403 status codes.
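To go beyond grep, a short script can tally which bots are hitting your site and what status codes they receive. A sketch assuming nginx's default "combined" log format (the sample lines below are fabricated for illustration):

```python
# Summarize AI-bot hits and their status codes from an nginx access log.
import re
from collections import Counter

BOTS = ("GPTBot", "ClaudeBot", "Bytespider")
# combined format ends: "request" status size "referer" "user-agent"
LINE = re.compile(r'" (\d{3}) \d+ "[^"]*" "([^"]*)"$')

def summarize(lines):
    """Count (bot, status) pairs across log lines."""
    counts = Counter()
    for line in lines:
        m = LINE.search(line)
        if not m:
            continue
        status, agent = m.groups()
        for bot in BOTS:
            if bot.lower() in agent.lower():
                counts[(bot, status)] += 1
    return counts

sample = [
    '1.2.3.4 - - [27/Jan/2025:10:00:00 +0000] "GET / HTTP/1.1" 403 153 "-" "Mozilla/5.0 (compatible; GPTBot/1.2)"',
    '5.6.7.8 - - [27/Jan/2025:10:00:01 +0000] "GET / HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (Windows NT 10.0)"',
]
print(summarize(sample))  # Counter({('GPTBot', '403'): 1})
```

In a real deployment you would feed it `open("/var/log/nginx/access.log")` instead of the sample list; a healthy blocklist shows 403s for every bot entry.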


Common Mistakes to Avoid

❌ Mistake #1: Only Using robots.txt

Problem: Bytespider and other aggressive bots ignore robots.txt
Solution: Use server-level blocking (nginx/Apache/Cloudflare)

❌ Mistake #2: Blocking Googlebot

Problem: You'll lose Google search rankings
Solution: Only block AI crawlers, not search engine bots

❌ Mistake #3: Typos in User Agent Names

Problem: GPTbot (lowercase 'b') won't block GPTBot
Solution: Use case-insensitive matching ([NC] in Apache, ~* in nginx)

❌ Mistake #4: Not Testing

Problem: You think you're protected but bots still get through
Solution: Always verify with CheckAIBots or curl tests


Real Results: Before & After Blocking

Case Study 1: Medium-Sized Blog

Before blocking:

  • Bandwidth: 320GB/month
  • CDN cost: $180/month
  • AI bot requests: 1.2M/month

After blocking (server-level):

  • Bandwidth: 80GB/month (75% reduction)
  • CDN cost: $45/month (saved $135/month)
  • AI bot requests: Blocked successfully

Case Study 2: E-Commerce Site

Before:

  • Bytespider requests: 50,000/day
  • Server load: High
  • Page load times: 2.8s

After blocking Bytespider:

  • Requests reduced to 0
  • Server load: Normal
  • Page load times: 1.2s (57% faster)

Next Steps

1. Check Your Current Status

👉 Use CheckAIBots to see which bots access your site

2. Choose Your Method

  • Beginner: robots.txt + Cloudflare one-click
  • Recommended: robots.txt + nginx/Apache blocking
  • Best: All methods combined

3. Implement & Verify

Follow the guides above, then verify it works.

4. Monitor Regularly

AI crawlers are constantly evolving. Check monthly for new bots.


Frequently Asked Questions

Q: Will this hurt my SEO?

A: No. AI crawlers are separate from search engine crawlers like Googlebot. Blocking GPTBot has zero impact on Google rankings.

Q: Can I block some bots but allow others?

A: Yes! Use selective blocking to allow AI search bots (potential traffic) while blocking training bots.

Q: What if new AI bots appear?

A: We update our database monthly. Subscribe to get updates about new crawlers.

Q: Is this legal?

A: Yes. You have the right to control who accesses your servers. Major publishers like NYT do this.


Conclusion

Blocking AI crawlers is essential for:

  • ✅ Reducing bandwidth costs by 50-75%
  • ✅ Protecting your original content
  • ✅ Maintaining direct customer relationships
  • ✅ Controlling your intellectual property

The best approach combines multiple methods:

  1. robots.txt for compliant bots
  2. Server-level blocking for aggressive bots
  3. Regular monitoring to catch new crawlers

Start now: Check which AI bots can access your website →


Last updated: January 27, 2025

Ready to Check Your Website?

Use CheckAIBots to instantly discover which AI crawlers can access your website and get actionable blocking recommendations

Free AI Crawler Check