
29 AI Crawlers You Should Block in 2025: Complete List & Guide


Updated: January 2025 | 📊 Tracking 29 active AI crawlers

As of 2025, there are 29 major AI crawlers actively scraping the web to train large language models (LLMs), power AI search engines, and collect data for various AI applications. This comprehensive guide lists every known AI crawler, explains which ones you should block, and provides ready-to-use configuration code.


Quick Stats: 29 AI Crawlers at a Glance

Category             Count  Details
Total AI Crawlers    29     Actively scraping in 2025
LLM Training Bots    16     Train models like GPT, Claude, Gemini
AI Search Bots       3      Power AI search engines
AI Assistants        4      User-initiated browsing (ChatGPT, Claude)
Respect robots.txt   25     Will honor your blocking rules
Ignore robots.txt    4      ⚠️ Require server-level blocking
Aggressive Crawlers  4      High bandwidth usage

The Danger Scale: Understanding Crawler Risk Levels

We categorize AI crawlers into three danger levels based on bandwidth usage, respect for robots.txt, and user reports:

🟢 Safe (19 crawlers)

  • ✅ Respect robots.txt
  • ✅ Reasonable request rates
  • ✅ Provide attribution/value

🟡 Caution (6 crawlers)

  • ⚠️ Higher bandwidth usage
  • ⚠️ Less transparency
  • ⚠️ Limited value to site owners

🔴 Aggressive (4 crawlers)

  • ❌ Often ignore robots.txt
  • ❌ Extremely high request rates
  • ❌ No compensation or attribution
  • ❌ Require server-level blocking

Complete List of 29 AI Crawlers

🔴 AGGRESSIVE CRAWLERS (Block These First!)

These 4 crawlers are known for ignoring robots.txt or making excessive requests:

1. Bytespider (ByteDance)

  • User-Agent: Bytespider
  • Company: ByteDance (TikTok parent company)
  • Purpose: AI model training
  • Danger Level: 🔴 Aggressive
  • Respects robots.txt: ❌ NO
  • Why block: Can consume 30-75% of a site's bandwidth, makes millions of requests, and ignores robots.txt

Real example: one website reported 1.4 million requests per month consuming 14 GB of bandwidth, roughly $1,500 in extra costs.

2. 360Spider (Qihoo 360)

  • User-Agent: 360Spider
  • Company: Qihoo 360 (奇虎360)
  • Purpose: AI and search services
  • Danger Level: 🔴 Aggressive
  • Respects robots.txt: ❌ NO
  • Why block: Chinese search engine crawler with aggressive behavior

3. ChatGLM-Spider (Zhipu AI)

  • User-Agent: ChatGLM-Spider
  • Company: Zhipu AI (智谱AI)
  • Purpose: ChatGLM model training
  • Danger Level: 🔴 Aggressive
  • Respects robots.txt: ❌ NO
  • Why block: Newer Chinese LLM vendor with poor crawler etiquette

4. Sogou (Sogou)

  • User-Agent: Sogou
  • Company: Sogou (搜狗)
  • Purpose: AI applications
  • Danger Level: 🔴 Aggressive
  • Respects robots.txt: ❌ NO
  • Why block: Chinese crawler with a history of ignoring robots.txt

🟡 CAUTION CRAWLERS (High Bandwidth Usage)

These 6 crawlers respect robots.txt but may consume significant resources:

5. Baiduspider (Baidu)

  • User-Agent: Baiduspider
  • Company: Baidu (百度)
  • Purpose: Search and ERNIE AI (文心一言)
  • Danger Level: 🟡 Caution
  • Respects robots.txt: ✅ Yes
  • Why block: High bandwidth usage; keep it only if you want traffic from Chinese search

6. ErnieBot (Baidu)

  • User-Agent: ErnieBot
  • Company: Baidu
  • Purpose: ERNIE (文心) AI model training
  • Danger Level: 🟡 Caution
  • Respects robots.txt: ✅ Yes
  • Related to: Baiduspider

7. CCBot (Common Crawl)

  • User-Agent: CCBot
  • Company: Common Crawl (non-profit)
  • Purpose: Open web dataset used by many AI companies
  • Danger Level: 🟡 Caution
  • Respects robots.txt: ✅ Yes
  • Why block: Your data ends up in publicly accessible datasets used by multiple AI companies

8. DeepseekBot (DeepSeek)

  • User-Agent: DeepseekBot
  • Company: DeepSeek (深度求索)
  • Purpose: AI model training
  • Danger Level: 🟡 Caution
  • Respects robots.txt: ✅ Yes
  • Why consider blocking: Newer crawler with a limited track record

9. PerplexityBot (Perplexity AI)

  • User-Agent: PerplexityBot
  • Company: Perplexity AI
  • Purpose: AI search engine
  • Danger Level: 🟡 Caution
  • Respects robots.txt: ✅ Yes
  • Note: Provides citations/attribution, but some sites still block it

10. PanguBot (Huawei)

  • User-Agent: PanguBot
  • Company: Huawei (华为)
  • Purpose: PanGu multimodal LLM training
  • Danger Level: 🟡 Caution
  • Respects robots.txt: ✅ Yes

🟢 SAFE CRAWLERS (LLM Training)

These crawlers respect robots.txt and have reasonable request rates. However, they're still using your content for commercial AI training without compensation.

OpenAI Crawlers

11. GPTBot (OpenAI)

  • User-Agent: GPTBot
  • Company: OpenAI
  • Purpose: Train ChatGPT and GPT models
  • Danger Level: 🟢 Safe
  • Respects robots.txt: ✅ Yes
  • IP Ranges: 23.98.142.176/28
  • Should you block? Only if you don't want your content used to train GPT models

12. ChatGPT-User (OpenAI)

  • User-Agent: ChatGPT-User
  • Company: OpenAI
  • Purpose: User-initiated ChatGPT web browsing
  • Danger Level: 🟢 Safe
  • Respects robots.txt: ✅ Yes
  • Different from GPTBot: Only crawls when ChatGPT users request web access
  • Should you block? Consider keeping it allowed so ChatGPT users can cite your content

13. OAI-SearchBot (OpenAI)

  • User-Agent: OAI-SearchBot
  • Company: OpenAI
  • Purpose: Search-focused real-time web info
  • Danger Level: 🟢 Safe
  • Respects robots.txt: ✅ Yes

Anthropic Crawlers (Claude AI)

14. ClaudeBot (Anthropic)

  • User-Agent: ClaudeBot
  • Company: Anthropic
  • Purpose: Train Claude AI models
  • Danger Level: 🟢 Safe
  • Respects robots.txt: ✅ Yes

15. Claude-Web (Anthropic)

  • User-Agent: Claude-Web
  • Company: Anthropic
  • Purpose: User-initiated Claude web access and citations
  • Danger Level: 🟢 Safe
  • Respects robots.txt: ✅ Yes

16. anthropic-ai (Anthropic)

  • User-Agent: anthropic-ai
  • Company: Anthropic
  • Purpose: Bulk model training crawler
  • Danger Level: 🟢 Safe
  • Respects robots.txt: ✅ Yes

17. anthropic-research (Anthropic)

  • User-Agent: anthropic-research
  • Company: Anthropic
  • Purpose: Research-specific crawler
  • Danger Level: 🟢 Safe
  • Respects robots.txt: ✅ Yes

Google AI Crawlers

18. Google-Extended (Google)

  • User-Agent: Google-Extended
  • Company: Google
  • Purpose: Train Gemini and other AI models
  • Danger Level: 🟢 Safe
  • Respects robots.txt: ✅ Yes
  • Important: Separate from Googlebot (search). Blocking this won't hurt SEO.

19. Gemini-Deep-Research (Google)

  • User-Agent: Gemini-Deep-Research
  • Company: Google
  • Purpose: Gemini's Deep Research feature
  • Danger Level: 🟢 Safe
  • Respects robots.txt: ✅ Yes

Meta AI Crawlers

20. FacebookBot (Meta)

  • User-Agent: FacebookBot
  • Company: Meta
  • Purpose: AI and content indexing
  • Danger Level: 🟢 Safe
  • Respects robots.txt: ✅ Yes

21. Meta-ExternalAgent (Meta)

  • User-Agent: Meta-ExternalAgent
  • Company: Meta
  • Purpose: Train Llama LLMs
  • Danger Level: 🟢 Safe
  • Respects robots.txt: ✅ Yes

22. Meta-ExternalFetcher (Meta)

  • User-Agent: Meta-ExternalFetcher
  • Company: Meta
  • Purpose: External content fetching for AI services
  • Danger Level: 🟢 Safe
  • Respects robots.txt: ✅ Yes

Apple AI

23. Applebot-Extended (Apple)

  • User-Agent: Applebot-Extended
  • Company: Apple
  • Purpose: Train Apple Intelligence models
  • Danger Level: 🟢 Safe
  • Respects robots.txt: ✅ Yes
  • Note: Separate from regular Applebot (search)

Other Major AI Companies

24. Amazonbot (Amazon)

  • User-Agent: Amazonbot
  • Company: Amazon
  • Purpose: Alexa and AI applications
  • Danger Level: 🟢 Safe
  • Respects robots.txt: ✅ Yes

25. cohere-ai (Cohere)

  • User-Agent: cohere-ai
  • Company: Cohere
  • Purpose: LLM training
  • Danger Level: 🟢 Safe
  • Respects robots.txt: ✅ Yes

26. MistralAI-User (Mistral AI)

  • User-Agent: MistralAI-User
  • Company: Mistral AI
  • Purpose: Le Chat citation fetching
  • Danger Level: 🟢 Safe
  • Respects robots.txt: ✅ Yes

27. YouBot (You.com)

  • User-Agent: YouBot
  • Company: You.com
  • Purpose: AI search crawler
  • Danger Level: 🟢 Safe
  • Respects robots.txt: ✅ Yes

Data Collection Bots

28. Diffbot (Diffbot)

  • User-Agent: Diffbot
  • Company: Diffbot
  • Purpose: AI-powered web data extraction
  • Danger Level: 🟢 Safe
  • Respects robots.txt: ✅ Yes

29. Omgilibot (Omgili)

  • User-Agent: omgilibot
  • Company: Omgili
  • Purpose: Data analysis and AI
  • Danger Level: 🟢 Safe
  • Respects robots.txt: ✅ Yes

Which AI Crawlers Should You Block?

Recommended Blocking Strategy

Strategy 1: Block Only Aggressive Crawlers (Recommended)

Block the 4 crawlers that ignore robots.txt or consume excessive bandwidth. Since these bots often ignore robots.txt anyway, treat the rules below as a first step and pair them with the server-level blocking covered later in this guide:

# Block aggressive crawlers
User-agent: Bytespider
Disallow: /

User-agent: 360Spider
Disallow: /

User-agent: ChatGLM-Spider
Disallow: /

User-agent: Sogou
Disallow: /

Who this is for: Sites that want to reduce costs while staying accessible to major AI companies.


Strategy 2: Block All LLM Training Bots (Moderate)

Block all crawlers training AI models, but allow user-initiated browsing:

# Block LLM training bots
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: anthropic-ai
Disallow: /

User-agent: Meta-ExternalAgent
Disallow: /

User-agent: Applebot-Extended
Disallow: /

User-agent: cohere-ai
Disallow: /

User-agent: FacebookBot
Disallow: /

User-agent: Amazonbot
Disallow: /

User-agent: 360Spider
Disallow: /

User-agent: ChatGLM-Spider
Disallow: /

User-agent: Sogou
Disallow: /

User-agent: Baiduspider
Disallow: /

User-agent: ErnieBot
Disallow: /

User-agent: DeepseekBot
Disallow: /

User-agent: PanguBot
Disallow: /

User-agent: Diffbot
Disallow: /

User-agent: omgilibot
Disallow: /

# Allow user-initiated browsing
User-agent: ChatGPT-User
Allow: /

User-agent: Claude-Web
Allow: /

Who this is for: Content creators who don't want their work used for AI training.


Strategy 3: Block Everything (Maximum Protection)

Block all 29 AI crawlers. In robots.txt, consecutive User-agent lines form a single group, so the one Disallow: / at the end applies to every agent listed:

# Block ALL AI crawlers
User-agent: GPTBot
User-agent: ClaudeBot
User-agent: Claude-Web
User-agent: anthropic-ai
User-agent: anthropic-research
User-agent: ChatGPT-User
User-agent: OAI-SearchBot
User-agent: Google-Extended
User-agent: Gemini-Deep-Research
User-agent: Bytespider
User-agent: FacebookBot
User-agent: Meta-ExternalAgent
User-agent: Meta-ExternalFetcher
User-agent: Applebot-Extended
User-agent: Amazonbot
User-agent: CCBot
User-agent: cohere-ai
User-agent: PerplexityBot
User-agent: YouBot
User-agent: Diffbot
User-agent: omgilibot
User-agent: 360Spider
User-agent: ChatGLM-Spider
User-agent: Sogou
User-agent: Baiduspider
User-agent: ErnieBot
User-agent: DeepseekBot
User-agent: PanguBot
User-agent: MistralAI-User
Disallow: /

Who this is for: Sites with premium content, paywalled articles, or strong stance against AI training.


Server-Level Blocking (For Crawlers That Ignore robots.txt)

For the 4 aggressive crawlers that don't respect robots.txt, use server-level blocking:

Nginx Configuration

Create /etc/nginx/conf.d/block-ai-bots.conf (files in conf.d are included at the http level, which is where the map directive must live):

# Block aggressive AI crawlers at server level
map $http_user_agent $block_ai_bots {
    default 0;

    # Aggressive crawlers that ignore robots.txt
    "~*Bytespider" 1;
    "~*360Spider" 1;
    "~*ChatGLM-Spider" 1;
    "~*Sogou" 1;
}

# Then add this check inside your existing server block
server {
    if ($block_ai_bots) {
        return 403;
    }
}

Then reload Nginx:

sudo nginx -t
sudo systemctl reload nginx

Apache Configuration

Add to .htaccess:

# Block aggressive AI crawlers
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} (Bytespider|360Spider|ChatGLM-Spider|Sogou) [NC]
RewriteRule .* - [F,L]
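Both configurations match user agents with the same case-insensitive alternation. As a quick sanity check, here is that pattern in Python so you can test which user-agent strings the rules would catch; the sample strings are illustrative, not the crawlers' exact user agents:

```python
import re

# Same case-insensitive pattern used in the Nginx map and the Apache RewriteCond
BLOCK_PATTERN = re.compile(r"Bytespider|360Spider|ChatGLM-Spider|Sogou", re.IGNORECASE)

def is_blocked(user_agent: str) -> bool:
    """Return True if the user-agent string matches one of the blocked crawlers."""
    return BLOCK_PATTERN.search(user_agent) is not None

print(is_blocked("Mozilla/5.0 (compatible; Bytespider)"))          # True
print(is_blocked("sogou web spider/4.0"))                          # True
print(is_blocked("Mozilla/5.0 (Windows NT 10.0) Chrome/120.0"))    # False
```

Note that substring matching is deliberately broad: any UA containing "Sogou" is caught, including Sogou's regular search spider.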

See full tutorial: Block Bytespider with Nginx →


How to Verify Blocking Is Working

After implementing blocks, verify they're working:

Method 1: Use CheckAIBots.com

  1. Visit CheckAIBots.com
  2. Enter your domain
  3. See which bots are allowed/blocked

Method 2: Check Server Logs

# Check if Bytespider is still accessing
grep "Bytespider" /var/log/nginx/access.log | tail -20

# If no results after 24-48 hours, blocking is working!

Method 3: Monitor Bandwidth

Track bandwidth usage before and after blocking. If aggressive bots were crawling your site, you may see a reduction of 30-75%.
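For a rough number, your access log can also yield a per-crawler bandwidth estimate. This is a sketch assuming Nginx's default "combined" log format; the sample lines are illustrative, and in practice you would point LOG at your real log file:

```shell
# Sample lines in Nginx "combined" format (illustrative; set LOG to your real log)
LOG=$(mktemp)
cat > "$LOG" <<'EOF'
1.2.3.4 - - [01/Jan/2025:00:00:00 +0000] "GET / HTTP/1.1" 200 1000 "-" "Mozilla/5.0 (compatible; Bytespider)"
1.2.3.5 - - [01/Jan/2025:00:00:01 +0000] "GET /a HTTP/1.1" 200 2000 "-" "360Spider"
1.2.3.6 - - [01/Jan/2025:00:00:02 +0000] "GET /b HTTP/1.1" 200 5000 "-" "Mozilla/5.0 Chrome/120.0"
EOF

# Sum request count and bytes sent for the four aggressive crawlers.
# Splitting on '"' puts the user agent in $6 and "status bytes" in $3.
awk -F'"' '$6 ~ /Bytespider|360Spider|ChatGLM-Spider|Sogou/ {
    split($3, f, " "); bytes += f[2]; hits++
}
END { printf "%d requests, %.1f MB\n", hits, bytes / 1048576 }' "$LOG"
# → 2 requests, 0.0 MB
```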


Frequently Asked Questions

Will blocking AI crawlers hurt my SEO?

No. AI crawlers like GPTBot and ClaudeBot are completely separate from search engine crawlers like Googlebot. Blocking AI bots won't affect your search rankings.

Should I block all 29 crawlers?

It depends:

  • E-commerce sites: Block aggressive crawlers only
  • Blogs/content sites: Block LLM training bots
  • News/premium content: Block everything

How often is this list updated?

We update this list monthly as new AI crawlers emerge. Last update: January 2025.

What about crawlers not on this list?

New AI crawlers appear regularly. Monitor your server logs and check back monthly for updates.

Can I selectively allow some pages?

Yes! Use robots.txt to block most content but allow specific directories:

User-agent: GPTBot
Allow: /public-docs/
Disallow: /
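Before deploying a selective policy like this, you can sanity-check it offline with Python's standard-library parser; the domain below is a placeholder. Note that Python's parser applies the first matching rule, which is why the Allow line must come before the catch-all Disallow:

```python
from urllib.robotparser import RobotFileParser

# The same rules as the snippet above, parsed directly instead of fetched
rules = """\
User-agent: GPTBot
Allow: /public-docs/
Disallow: /
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# example.com is a placeholder domain
print(parser.can_fetch("GPTBot", "https://example.com/public-docs/guide"))  # True
print(parser.can_fetch("GPTBot", "https://example.com/blog/post"))          # False
```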

Real-World Impact: Before & After Blocking

Case Study 1: Tech Blog

  • Before: 287,000 AI crawler requests/month
  • Blocked: Bytespider, 360Spider (aggressive only)
  • After: 89,000 requests/month (69% reduction)
  • Savings: $840/month in bandwidth costs

Case Study 2: E-commerce Site

  • Before: 42% of bandwidth consumed by Bytespider
  • Blocked: All aggressive crawlers + Chinese bots
  • After: 15% bandwidth reduction overall
  • Savings: $1,800/month

Case Study 3: News Website

  • Before: Content appearing in ChatGPT responses
  • Blocked: All LLM training bots
  • After: More direct traffic, attribution required for AI citations
  • Result: Better traffic quality

Download Ready-to-Use Blocking Configurations

Quick Copy: Block All 29 AI Crawlers (robots.txt)

User-agent: GPTBot
User-agent: ClaudeBot
User-agent: Claude-Web
User-agent: anthropic-ai
User-agent: anthropic-research
User-agent: ChatGPT-User
User-agent: OAI-SearchBot
User-agent: Google-Extended
User-agent: Gemini-Deep-Research
User-agent: Bytespider
User-agent: FacebookBot
User-agent: Meta-ExternalAgent
User-agent: Meta-ExternalFetcher
User-agent: Applebot-Extended
User-agent: Amazonbot
User-agent: CCBot
User-agent: cohere-ai
User-agent: PerplexityBot
User-agent: YouBot
User-agent: Diffbot
User-agent: omgilibot
User-agent: 360Spider
User-agent: ChatGLM-Spider
User-agent: Sogou
User-agent: Baiduspider
User-agent: ErnieBot
User-agent: DeepseekBot
User-agent: PanguBot
User-agent: MistralAI-User
Disallow: /
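If you maintain the crawler list in code (say, for a CMS plugin or a build step), a few lines can emit the grouped block above. This is a sketch; the agent list is copied from this article:

```python
# User agents from this article's list of 29 AI crawlers
AI_CRAWLERS = [
    "GPTBot", "ClaudeBot", "Claude-Web", "anthropic-ai", "anthropic-research",
    "ChatGPT-User", "OAI-SearchBot", "Google-Extended", "Gemini-Deep-Research",
    "Bytespider", "FacebookBot", "Meta-ExternalAgent", "Meta-ExternalFetcher",
    "Applebot-Extended", "Amazonbot", "CCBot", "cohere-ai", "PerplexityBot",
    "YouBot", "Diffbot", "omgilibot", "360Spider", "ChatGLM-Spider", "Sogou",
    "Baiduspider", "ErnieBot", "DeepseekBot", "PanguBot", "MistralAI-User",
]

def block_all(agents):
    """Emit one robots.txt group: many User-agent lines, one shared Disallow."""
    lines = [f"User-agent: {a}" for a in agents]
    lines.append("Disallow: /")
    return "\n".join(lines)

print(block_all(AI_CRAWLERS))
```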

👉 Generate custom robots.txt for your site →


Conclusion: Take Control of Your Content

With 29 AI crawlers actively scraping the web, now is the time to decide how your content is used. Whether you block aggressive crawlers only or all AI bots entirely, this list gives you the information to make an informed choice.

Key takeaways:

  • ✅ 29 AI crawlers are actively scraping websites in 2025
  • ✅ 4 aggressive crawlers require server-level blocking
  • ✅ Blocking AI bots won't hurt your SEO
  • ✅ You can save 30-75% in bandwidth costs
  • ✅ This list is updated monthly

Next steps:

  1. Check which bots are crawling your site →
  2. Choose a blocking strategy (aggressive only, LLM training, or all)
  3. Update your robots.txt
  4. Implement server-level blocks for aggressive bots
  5. Verify blocking is working


Bookmark this page: This list is updated monthly. Check back for new AI crawlers and updated blocking recommendations.

Ready to Check Your Website?

Use CheckAIBots to instantly discover which AI crawlers can access your website and get actionable blocking recommendations.

Free AI Crawler Check