29 AI Crawlers You Should Block in 2025: Complete List & Guide
Updated: January 2025 | 🤖 Tracking 29 active AI crawlers
As of 2025, there are 29 major AI crawlers actively scraping the web to train large language models (LLMs), power AI search engines, and collect data for various AI applications. This comprehensive guide lists every known AI crawler, explains which ones you should block, and provides ready-to-use configuration code.
Quick Stats: 29 AI Crawlers at a Glance
| Category | Count | Details |
|---|---|---|
| Total AI Crawlers | 29 | Actively scraping in 2025 |
| LLM Training Bots | 16 | Train models like GPT, Claude, Gemini |
| AI Search Bots | 3 | Power AI search engines |
| AI Assistants | 4 | User-initiated browsing (ChatGPT, Claude) |
| Respect robots.txt | 25 | Will honor your blocking rules |
| Ignore robots.txt | 4 | ⚠️ Require server-level blocking |
| Aggressive Crawlers | 4 | High bandwidth usage |
The Danger Scale: Understanding Crawler Risk Levels
We categorize AI crawlers into three danger levels based on bandwidth usage, respect for robots.txt, and user reports:
🟢 Safe (19 crawlers)
- ✅ Respect robots.txt
- ✅ Reasonable request rates
- ✅ Provide attribution/value
🟡 Caution (6 crawlers)
- ⚠️ Higher bandwidth usage
- ⚠️ Less transparency
- ⚠️ Limited value to site owners
🔴 Aggressive (4 crawlers)
- ❌ Often ignore robots.txt
- ❌ Extremely high request rates
- ❌ No compensation or attribution
- ❌ Require server-level blocking
Complete List of 29 AI Crawlers
🔴 AGGRESSIVE CRAWLERS (Block These First!)
These 4 crawlers are known for ignoring robots.txt or making excessive requests:
1. Bytespider (ByteDance)
- User-Agent: Bytespider
- Company: ByteDance (TikTok parent company)
- Purpose: AI model training
- Danger Level: 🔴 Aggressive
- Respects robots.txt: ❌ NO
- Why block: Consumes 30-75% of some sites' bandwidth, makes millions of requests, and ignores robots.txt
Real example: One website reported 1.4 million requests per month consuming 14GB of bandwidth, adding roughly $1,500 in extra costs.
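If you suspect similar traffic, a quick log check gives you a baseline before you block. A minimal sketch, assuming an Nginx access log in the default combined format (adjust the bot name and log path for your setup):
# Count Bytespider requests per day
# (field 4 of the combined format looks like "[10/Jan/2025:13:55:36")
grep "Bytespider" /var/log/nginx/access.log | awk '{print substr($4, 2, 11)}' | sort | uniq -c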
2. 360Spider (Qihoo 360)
- User-Agent: 360Spider
- Company: Qihoo 360 (奇虎360)
- Purpose: AI and search services
- Danger Level: 🔴 Aggressive
- Respects robots.txt: ❌ NO
- Why block: Chinese search engine crawler with aggressive behavior
3. ChatGLM-Spider (Zhipu AI)
- User-Agent: ChatGLM-Spider
- Company: Zhipu AI (智谱AI)
- Purpose: ChatGLM model training
- Danger Level: 🔴 Aggressive
- Respects robots.txt: ❌ NO
- Why block: Newer Chinese LLM with poor crawler etiquette
4. Sogou (Sogou)
- User-Agent: Sogou
- Company: Sogou (搜狗)
- Purpose: AI applications
- Danger Level: 🔴 Aggressive
- Respects robots.txt: ❌ NO
- Why block: Chinese crawler with a history of ignoring robots.txt
🟡 CAUTION CRAWLERS (High Bandwidth Usage)
These 6 crawlers respect robots.txt but may consume significant resources:
5. Baiduspider (Baidu)
- User-Agent: Baiduspider
- Company: Baidu (百度)
- Purpose: Search and ERNIE AI (文心一言)
- Danger Level: 🟡 Caution
- Respects robots.txt: ✅ Yes
- Why block: High bandwidth usage; block unless you need Chinese search traffic
6. ErnieBot (Baidu)
- User-Agent: ErnieBot
- Company: Baidu
- Purpose: ERNIE (文心) AI model training
- Danger Level: 🟡 Caution
- Respects robots.txt: ✅ Yes
- Related to: Baiduspider
7. CCBot (Common Crawl)
- User-Agent: CCBot
- Company: Common Crawl (non-profit)
- Purpose: Open web dataset used by many AI companies
- Danger Level: 🟡 Caution
- Respects robots.txt: ✅ Yes
- Why block: Your data ends up in publicly accessible datasets used by multiple AI companies
8. DeepSeekBot (DeepSeek)
- User-Agent: DeepseekBot
- Company: DeepSeek (深度求索)
- Purpose: AI model training
- Danger Level: 🟡 Caution
- Respects robots.txt: ✅ Yes
- Why consider blocking: Newer Chinese AI company with a short track record
9. PerplexityBot (Perplexity AI)
- User-Agent: PerplexityBot
- Company: Perplexity AI
- Purpose: AI search engine
- Danger Level: 🟡 Caution
- Respects robots.txt: ✅ Yes
- Note: Provides citations/attribution, but some sites still block it
10. PanguBot (Huawei)
- User-Agent: PanguBot
- Company: Huawei (华为)
- Purpose: PanGu multimodal LLM training
- Danger Level: 🟡 Caution
- Respects robots.txt: ✅ Yes
🟢 SAFE CRAWLERS (LLM Training)
These crawlers respect robots.txt and have reasonable request rates. However, they're still using your content for commercial AI training without compensation.
OpenAI Crawlers
11. GPTBot (OpenAI)
- User-Agent: GPTBot
- Company: OpenAI
- Purpose: Train ChatGPT and GPT models
- Danger Level: 🟢 Safe
- Respects robots.txt: ✅ Yes
- IP Ranges: 23.98.142.176/28 (one published range; see the check below)
- Should you block?: If you don't want your content training GPT models
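Because any client can spoof a user agent, the published IP range is the more reliable signal. A rough check, assuming grepcidr is installed; note that 23.98.142.176/28 is only one published range, so treat misses as suspicious rather than definitive:
# Print IPs claiming to be GPTBot that fall OUTSIDE the published range
grep "GPTBot" /var/log/nginx/access.log | awk '{print $1}' | sort -u | grepcidr -v "23.98.142.176/28"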
12. ChatGPT-User (OpenAI)
- User-Agent: ChatGPT-User
- Company: OpenAI
- Purpose: User-initiated ChatGPT web browsing
- Danger Level: 🟢 Safe
- Respects robots.txt: ✅ Yes
- Different from GPTBot: Only crawls when ChatGPT users request web access
- Should you block?: Consider keeping it to allow ChatGPT users to cite your content
13. OAI-SearchBot (OpenAI)
- User-Agent: OAI-SearchBot
- Company: OpenAI
- Purpose: Search-focused crawling for real-time web info
- Danger Level: 🟢 Safe
- Respects robots.txt: ✅ Yes
Anthropic Crawlers (Claude AI)
14. ClaudeBot (Anthropic)
- User-Agent: ClaudeBot
- Company: Anthropic
- Purpose: Train Claude AI models
- Danger Level: 🟢 Safe
- Respects robots.txt: ✅ Yes
15. Claude-Web (Anthropic)
- User-Agent: Claude-Web
- Company: Anthropic
- Purpose: User-initiated Claude web access and citations
- Danger Level: 🟢 Safe
- Respects robots.txt: ✅ Yes
16. anthropic-ai (Anthropic)
- User-Agent: anthropic-ai
- Company: Anthropic
- Purpose: Bulk model training crawler
- Danger Level: 🟢 Safe
- Respects robots.txt: ✅ Yes
17. anthropic-research (Anthropic)
- User-Agent: anthropic-research
- Company: Anthropic
- Purpose: Research-specific crawler
- Danger Level: 🟢 Safe
- Respects robots.txt: ✅ Yes
Google AI Crawlers
18. Google-Extended (Google)
- User-Agent: Google-Extended
- Company: Google
- Purpose: Train Gemini and other AI models
- Danger Level: 🟢 Safe
- Respects robots.txt: ✅ Yes
- Important: Separate from Googlebot (search); Google-Extended is a robots.txt control token rather than a distinct crawler, so blocking it won't hurt SEO.
19. Gemini-Deep-Research (Google)
- User-Agent: Gemini-Deep-Research
- Company: Google
- Purpose: Gemini's Deep Research feature
- Danger Level: 🟢 Safe
- Respects robots.txt: ✅ Yes
Meta AI Crawlers
20. FacebookBot (Meta)
- User-Agent: FacebookBot
- Company: Meta
- Purpose: AI and content indexing
- Danger Level: 🟢 Safe
- Respects robots.txt: ✅ Yes
21. Meta-ExternalAgent (Meta)
- User-Agent: Meta-ExternalAgent
- Company: Meta
- Purpose: Train Llama LLMs
- Danger Level: 🟢 Safe
- Respects robots.txt: ✅ Yes
22. Meta-ExternalFetcher (Meta)
- User-Agent: Meta-ExternalFetcher
- Company: Meta
- Purpose: External content fetching for AI services
- Danger Level: 🟢 Safe
- Respects robots.txt: ✅ Yes
Apple AI
23. Applebot-Extended (Apple)
- User-Agent: Applebot-Extended
- Company: Apple
- Purpose: Train Apple Intelligence models
- Danger Level: 🟢 Safe
- Respects robots.txt: ✅ Yes
- Note: Separate from regular Applebot (search)
Other Major AI Companies
24. Amazonbot (Amazon)
- User-Agent: Amazonbot
- Company: Amazon
- Purpose: Alexa and AI applications
- Danger Level: 🟢 Safe
- Respects robots.txt: ✅ Yes
25. cohere-ai (Cohere)
- User-Agent: cohere-ai
- Company: Cohere
- Purpose: LLM training
- Danger Level: 🟢 Safe
- Respects robots.txt: ✅ Yes
26. MistralAI-User (Mistral AI)
- User-Agent: MistralAI-User
- Company: Mistral AI
- Purpose: Le Chat citation fetching
- Danger Level: 🟢 Safe
- Respects robots.txt: ✅ Yes
27. YouBot (You.com)
- User-Agent: YouBot
- Company: You.com
- Purpose: AI search crawler
- Danger Level: 🟢 Safe
- Respects robots.txt: ✅ Yes
Data Collection Bots
28. Diffbot (Diffbot)
- User-Agent: Diffbot
- Company: Diffbot
- Purpose: AI-powered web data extraction
- Danger Level: 🟢 Safe
- Respects robots.txt: ✅ Yes
29. Omgilibot (Omgili)
- User-Agent: omgilibot
- Company: Omgili
- Purpose: Data analysis and AI
- Danger Level: 🟢 Safe
- Respects robots.txt: ✅ Yes
Which AI Crawlers Should You Block?
Recommended Blocking Strategy
Strategy 1: Block Only Aggressive Crawlers (Recommended)
Block the 4 crawlers that ignore robots.txt or consume excessive bandwidth:
# Block aggressive crawlers
User-agent: Bytespider
Disallow: /
User-agent: 360Spider
Disallow: /
User-agent: ChatGLM-Spider
Disallow: /
User-agent: Sogou
Disallow: /
Who this is for: Sites that want to reduce costs while staying accessible to major AI companies.
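Once deployed, it's worth confirming the rules are actually being served. A quick check with curl (replace example.com with your domain):
# Fetch the live robots.txt and confirm the Bytespider rule is present
curl -s https://example.com/robots.txt | grep -A 1 "Bytespider"
# Expected output:
# User-agent: Bytespider
# Disallow: /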
Strategy 2: Block All LLM Training Bots (Moderate)
Block all crawlers training AI models, but allow user-initiated browsing:
# Block LLM training bots
User-agent: GPTBot
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: Bytespider
Disallow: /
User-agent: CCBot
Disallow: /
User-agent: anthropic-ai
Disallow: /
User-agent: Meta-ExternalAgent
Disallow: /
User-agent: Applebot-Extended
Disallow: /
User-agent: cohere-ai
Disallow: /
User-agent: FacebookBot
Disallow: /
User-agent: Amazonbot
Disallow: /
User-agent: 360Spider
Disallow: /
User-agent: ChatGLM-Spider
Disallow: /
User-agent: Sogou
Disallow: /
User-agent: Baiduspider
Disallow: /
User-agent: ErnieBot
Disallow: /
User-agent: DeepseekBot
Disallow: /
User-agent: PanguBot
Disallow: /
User-agent: Diffbot
Disallow: /
User-agent: omgilibot
Disallow: /
# Allow user-initiated browsing
User-agent: ChatGPT-User
Allow: /
User-agent: Claude-Web
Allow: /
Who this is for: Content creators who don't want their work used for AI training.
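Before relying on a file this long, sanity-check that the training bots are blocked while the user-initiated agents stay allowed. One option is Google's open-source robots.txt parser; a sketch assuming you've built the robots_main binary from github.com/google/robotstxt (example.com is a placeholder):
# Expect DISALLOWED for a training bot, ALLOWED for a user-initiated agent
./robots_main robots.txt GPTBot https://example.com/some-article
./robots_main robots.txt ChatGPT-User https://example.com/some-article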
Strategy 3: Block Everything (Maximum Protection)
Block all 29 AI crawlers:
# Block ALL AI crawlers
User-agent: GPTBot
User-agent: ClaudeBot
User-agent: Claude-Web
User-agent: anthropic-ai
User-agent: anthropic-research
User-agent: ChatGPT-User
User-agent: OAI-SearchBot
User-agent: Google-Extended
User-agent: Gemini-Deep-Research
User-agent: Bytespider
User-agent: FacebookBot
User-agent: Meta-ExternalAgent
User-agent: Meta-ExternalFetcher
User-agent: Applebot-Extended
User-agent: Amazonbot
User-agent: CCBot
User-agent: cohere-ai
User-agent: PerplexityBot
User-agent: YouBot
User-agent: Diffbot
User-agent: omgilibot
User-agent: 360Spider
User-agent: ChatGLM-Spider
User-agent: Sogou
User-agent: Baiduspider
User-agent: ErnieBot
User-agent: DeepseekBot
User-agent: PanguBot
User-agent: MistralAI-User
Disallow: /
Who this is for: Sites with premium content, paywalled articles, or a strong stance against AI training. (Stacking User-agent lines over a single Disallow is valid: under RFC 9309 the rules apply to every agent listed in the group.)
Server-Level Blocking (For Crawlers That Ignore robots.txt)
For the 4 aggressive crawlers that don't respect robots.txt, use server-level blocking:
Nginx Configuration
Create /etc/nginx/conf.d/block-ai-bots.conf:
# Block aggressive AI crawlers at server level
map $http_user_agent $block_ai_bots {
default 0;
# Aggressive crawlers that ignore robots.txt
"~*Bytespider" 1;
"~*360Spider" 1;
"~*ChatGLM-Spider" 1;
"~*Sogou" 1;
}
# In your server block
server {
if ($block_ai_bots) {
return 403;
}
}
Then reload Nginx:
sudo nginx -t
sudo systemctl reload nginx
Apache Configuration
Add to .htaccess:
# Block aggressive AI crawlers
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} (Bytespider|360Spider|ChatGLM-Spider|Sogou) [NC]
RewriteRule .* - [F,L]
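Whichever server you use, verify the block by sending a request with a spoofed crawler user agent (replace example.com with your domain):
# Should print 403 (blocked)
curl -s -o /dev/null -w "%{http_code}\n" -A "Bytespider" https://example.com/
# Should print 200 (normal visitors unaffected)
curl -s -o /dev/null -w "%{http_code}\n" https://example.com/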
See full tutorial: Block Bytespider with Nginx →
How to Verify Blocking Is Working
After implementing blocks, verify they're working:
Method 1: Use CheckAIBots.com
- Visit CheckAIBots.com
- Enter your domain
- See which bots are allowed/blocked
Method 2: Check Server Logs
# Check if Bytespider is still accessing
grep "Bytespider" /var/log/nginx/access.log | tail -20
# If no results after 24-48 hours, blocking is working!
Method 3: Monitor Bandwidth
Track bandwidth usage before and after blocking. If aggressive bots made up a large share of your traffic, you may see a 30-75% reduction; the script below shows how to measure it.
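A minimal sketch for Nginx's default combined log format, where the third quote-delimited field holds the status code and response size:
# Total megabytes served to the four aggressive crawlers
awk -F'"' '/Bytespider|360Spider|ChatGLM-Spider|Sogou/ {
  split($3, a, " "); bytes += a[2]
} END { printf "%.1f MB\n", bytes / 1024 / 1024 }' /var/log/nginx/access.log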
Frequently Asked Questions
Will blocking AI crawlers hurt my SEO?
No. AI crawlers like GPTBot and ClaudeBot are completely separate from search engine crawlers like Googlebot. Blocking AI bots won't affect your search rankings.
Should I block all 29 crawlers?
It depends:
- E-commerce sites: Block aggressive crawlers only
- Blogs/content sites: Block LLM training bots
- News/premium content: Block everything
How often is this list updated?
We update this list monthly as new AI crawlers emerge. Last update: January 2025.
What about crawlers not on this list?
New AI crawlers appear regularly. Monitor your server logs and check back monthly for updates.
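A simple way to spot newcomers is to rank the user agents hitting your site; unfamiliar bot names near the top deserve a closer look. This assumes Nginx's default combined log format, where the user agent is the last quoted field:
# Top 20 user agents by request count
awk -F'"' '{print $6}' /var/log/nginx/access.log | sort | uniq -c | sort -rn | head -20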
Can I selectively allow some pages?
Yes! Use robots.txt to block most content but allow specific directories; the longer, more specific Allow path takes precedence over the broader Disallow:
User-agent: GPTBot
Allow: /public-docs/
Disallow: /
Real-World Impact: Before & After Blocking
Case Study 1: Tech Blog
- Before: 287,000 AI crawler requests/month
- Blocked: Bytespider, 360Spider (aggressive only)
- After: 89,000 requests/month (69% reduction)
- Savings: $840/month in bandwidth costs
Case Study 2: E-commerce Site
- Before: 42% of bandwidth consumed by Bytespider
- Blocked: All aggressive crawlers + Chinese bots
- After: 15% bandwidth reduction overall
- Savings: $1,800/month
Case Study 3: News Website
- Before: Content appearing in ChatGPT responses
- Blocked: All LLM training bots
- After: More direct traffic, attribution required for AI citations
- Result: Better traffic quality
Download Ready-to-Use Blocking Configurations
Quick Copy: Block All 29 AI Crawlers (robots.txt)
User-agent: GPTBot
User-agent: ClaudeBot
User-agent: Claude-Web
User-agent: anthropic-ai
User-agent: anthropic-research
User-agent: ChatGPT-User
User-agent: OAI-SearchBot
User-agent: Google-Extended
User-agent: Gemini-Deep-Research
User-agent: Bytespider
User-agent: FacebookBot
User-agent: Meta-ExternalAgent
User-agent: Meta-ExternalFetcher
User-agent: Applebot-Extended
User-agent: Amazonbot
User-agent: CCBot
User-agent: cohere-ai
User-agent: PerplexityBot
User-agent: YouBot
User-agent: Diffbot
User-agent: omgilibot
User-agent: 360Spider
User-agent: ChatGLM-Spider
User-agent: Sogou
User-agent: Baiduspider
User-agent: ErnieBot
User-agent: DeepseekBot
User-agent: PanguBot
User-agent: MistralAI-User
Disallow: /
Generate custom robots.txt for your site →
Conclusion: Take Control of Your Content
With 29 AI crawlers actively scraping the web, now is the time to decide how your content is used. Whether you block aggressive crawlers only or all AI bots entirely, this list gives you the information to make an informed choice.
Key takeaways:
- ✅ 29 AI crawlers are actively scraping websites in 2025
- ✅ 4 aggressive crawlers require server-level blocking
- ✅ Blocking AI bots won't hurt your SEO
- ✅ Blocking aggressive bots can save 30-75% in bandwidth costs
- ✅ This list is updated monthly
Next steps:
- Check which bots are crawling your site →
- Choose a blocking strategy (aggressive only, LLM training, or all)
- Update your robots.txt
- Implement server-level blocks for aggressive bots
- Verify blocking is working
Related articles:
- What Are AI Crawlers? Complete Introduction
- How to Block AI Crawlers: Complete Guide 2025
- How to Detect AI Crawlers on Your Website
- Nginx Tutorial: Block AI Bots in 5 Minutes
- Why 48% of News Sites Block AI Crawlers
- Block Bytespider with Nginx
- robots.txt Guide: Block AI Without Hurting SEO
Bookmark this page: This list is updated monthly. Check back for new AI crawlers and updated blocking recommendations.
Ready to Check Your Website?
Use CheckAIBots to instantly discover which AI crawlers can access your website and get actionable blocking recommendations
Free AI Crawler Check