29 AI Crawlers You Should Block in 2025: Complete List & Guide
Updated: January 2025 | 🤖 Tracking 29 active AI crawlers
As of 2025, there are 29 major AI crawlers actively scraping the web to train large language models (LLMs), power AI search engines, and collect data for various AI applications. This comprehensive guide lists every known AI crawler, explains which ones you should block, and provides ready-to-use configuration code.
Quick Stats: 29 AI Crawlers at a Glance
| Category | Count | Details |
|---|---|---|
| Total AI Crawlers | 29 | Actively scraping in 2025 |
| LLM Training Bots | 16 | Train models like GPT, Claude, Gemini |
| AI Search Bots | 3 | Power AI search engines |
| AI Assistants | 4 | User-initiated browsing (ChatGPT, Claude) |
| Respect robots.txt | 25 | Will honor your blocking rules |
| Ignore robots.txt | 4 | ⚠️ Require server-level blocking |
| Aggressive Crawlers | 4 | High bandwidth usage |
The Danger Scale: Understanding Crawler Risk Levels
We categorize AI crawlers into three danger levels based on bandwidth usage, respect for robots.txt, and user reports:
🟢 Safe (19 crawlers)
- ✅ Respect robots.txt
- ✅ Reasonable request rates
- ✅ Provide attribution/value
🟡 Caution (6 crawlers)
- ⚠️ Higher bandwidth usage
- ⚠️ Less transparency
- ⚠️ Limited value to site owners
🔴 Aggressive (4 crawlers)
- ❌ Often ignore robots.txt
- ❌ Extremely high request rates
- ❌ No compensation or attribution
- ❌ Require server-level blocking
Complete List of 29 AI Crawlers
🔴 AGGRESSIVE CRAWLERS (Block These First!)
These 4 crawlers are known for ignoring robots.txt or making excessive requests:
1. Bytespider (ByteDance)
- User-Agent: Bytespider
- Company: ByteDance (TikTok parent company)
- Purpose: AI model training
- Danger Level: 🔴 Aggressive
- Respects robots.txt: ❌ NO
- Why block: Consumes 30-75% of some sites' bandwidth, makes millions of requests, and ignores robots.txt
Real example: One website reported 1.4 million requests per month consuming 14GB of bandwidth, adding roughly $1,500 in extra costs.
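If you suspect similar traffic, a quick log check gives you a baseline before you block. A minimal sketch, assuming an Nginx access log in the default combined format (adjust the bot name and log path for your setup):
# Count Bytespider requests per day
# (field 4 of the combined format looks like "[10/Jan/2025:13:55:36")
grep "Bytespider" /var/log/nginx/access.log | awk '{print substr($4, 2, 11)}' | sort | uniq -c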
2. 360Spider (Qihoo 360)
- User-Agent: 360Spider
- Company: Qihoo 360 (奇虎360)
- Purpose: AI and search services
- Danger Level: 🔴 Aggressive
- Respects robots.txt: ❌ NO
- Why block: Chinese search engine crawler with aggressive behavior
3. ChatGLM-Spider (Zhipu AI)
- User-Agent: ChatGLM-Spider
- Company: Zhipu AI (智谱AI)
- Purpose: ChatGLM model training
- Danger Level: 🔴 Aggressive
- Respects robots.txt: ❌ NO
- Why block: Newer Chinese LLM with poor crawler etiquette
4. Sogou (Sogou)
- User-Agent: Sogou
- Company: Sogou (搜狗)
- Purpose: AI applications
- Danger Level: 🔴 Aggressive
- Respects robots.txt: ❌ NO
- Why block: Chinese crawler with a history of ignoring robots.txt
🟡 CAUTION CRAWLERS (High Bandwidth Usage)
These 6 crawlers respect robots.txt but may consume significant resources:
5. Baiduspider (Baidu)
- User-Agent: Baiduspider
- Company: Baidu (百度)
- Purpose: Search and ERNIE AI (文心一言)
- Danger Level: 🟡 Caution
- Respects robots.txt: ✅ Yes
- Why block: High bandwidth usage; block unless you need Chinese search traffic
6. ErnieBot (Baidu)
- User-Agent: ErnieBot
- Company: Baidu
- Purpose: ERNIE (文心) AI model training
- Danger Level: 🟡 Caution
- Respects robots.txt: ✅ Yes
- Related to: Baiduspider
7. CCBot (Common Crawl)
- User-Agent: CCBot
- Company: Common Crawl (non-profit)
- Purpose: Open web dataset used by many AI companies
- Danger Level: 🟡 Caution
- Respects robots.txt: ✅ Yes
- Why block: Your data ends up in publicly accessible datasets used by multiple AI companies
8. DeepSeekBot (DeepSeek)
- User-Agent: DeepseekBot
- Company: DeepSeek (深度求索)
- Purpose: AI model training
- Danger Level: 🟡 Caution
- Respects robots.txt: ✅ Yes
- Why consider blocking: Newer Chinese AI company with a short track record
9. PerplexityBot (Perplexity AI)
- User-Agent: PerplexityBot
- Company: Perplexity AI
- Purpose: AI search engine
- Danger Level: 🟡 Caution
- Respects robots.txt: ✅ Yes
- Note: Provides citations/attribution, but some sites still block it
10. PanguBot (Huawei)
- User-Agent: PanguBot
- Company: Huawei (华为)
- Purpose: PanGu multimodal LLM training
- Danger Level: 🟡 Caution
- Respects robots.txt: ✅ Yes
🟢 SAFE CRAWLERS (LLM Training)
These crawlers respect robots.txt and have reasonable request rates. However, they're still using your content for commercial AI training without compensation.
OpenAI Crawlers
11. GPTBot (OpenAI)
- User-Agent: GPTBot
- Company: OpenAI
- Purpose: Train ChatGPT and GPT models
- Danger Level: 🟢 Safe
- Respects robots.txt: ✅ Yes
- IP Ranges: 23.98.142.176/28 (one published range; see the check below)
- Should you block?: If you don't want your content training GPT models
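Because any client can spoof a user agent, the published IP range is the more reliable signal. A rough check, assuming grepcidr is installed; note that 23.98.142.176/28 is only one published range, so treat misses as suspicious rather than definitive:
# Print IPs claiming to be GPTBot that fall OUTSIDE the published range
grep "GPTBot" /var/log/nginx/access.log | awk '{print $1}' | sort -u | grepcidr -v "23.98.142.176/28"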
12. ChatGPT-User (OpenAI)
- User-Agent: ChatGPT-User
- Company: OpenAI
- Purpose: User-initiated ChatGPT web browsing
- Danger Level: 🟢 Safe
- Respects robots.txt: ✅ Yes
- Different from GPTBot: Only crawls when ChatGPT users request web access
- Should you block?: Consider keeping it to allow ChatGPT users to cite your content
13. OAI-SearchBot (OpenAI)
- User-Agent: OAI-SearchBot
- Company: OpenAI
- Purpose: Search-focused crawling for real-time web info
- Danger Level: 🟢 Safe
- Respects robots.txt: ✅ Yes
Anthropic Crawlers (Claude AI)
14. ClaudeBot (Anthropic)
- User-Agent: ClaudeBot
- Company: Anthropic
- Purpose: Train Claude AI models
- Danger Level: 🟢 Safe
- Respects robots.txt: ✅ Yes
15. Claude-Web (Anthropic)
- User-Agent: Claude-Web
- Company: Anthropic
- Purpose: User-initiated Claude web access and citations
- Danger Level: 🟢 Safe
- Respects robots.txt: ✅ Yes
16. anthropic-ai (Anthropic)
- User-Agent: anthropic-ai
- Company: Anthropic
- Purpose: Bulk model training crawler
- Danger Level: 🟢 Safe
- Respects robots.txt: ✅ Yes
17. anthropic-research (Anthropic)
- User-Agent: anthropic-research
- Company: Anthropic
- Purpose: Research-specific crawler
- Danger Level: 🟢 Safe
- Respects robots.txt: ✅ Yes
Google AI Crawlers
18. Google-Extended (Google)
- User-Agent: Google-Extended
- Company: Google
- Purpose: Train Gemini and other AI models
- Danger Level: 🟢 Safe
- Respects robots.txt: ✅ Yes
- Important: Separate from Googlebot (search); Google-Extended is a robots.txt control token rather than a distinct crawler, so blocking it won't hurt SEO.
19. Gemini-Deep-Research (Google)
- User-Agent: Gemini-Deep-Research
- Company: Google
- Purpose: Gemini's Deep Research feature
- Danger Level: 🟢 Safe
- Respects robots.txt: ✅ Yes
Meta AI Crawlers
20. FacebookBot (Meta)
- User-Agent: FacebookBot
- Company: Meta
- Purpose: AI and content indexing
- Danger Level: 🟢 Safe
- Respects robots.txt: ✅ Yes
21. Meta-ExternalAgent (Meta)
- User-Agent: Meta-ExternalAgent
- Company: Meta
- Purpose: Train Llama LLMs
- Danger Level: 🟢 Safe
- Respects robots.txt: ✅ Yes
22. Meta-ExternalFetcher (Meta)
- User-Agent: Meta-ExternalFetcher
- Company: Meta
- Purpose: External content fetching for AI services
- Danger Level: 🟢 Safe
- Respects robots.txt: ✅ Yes
Apple AI
23. Applebot-Extended (Apple)
- User-Agent: Applebot-Extended
- Company: Apple
- Purpose: Train Apple Intelligence models
- Danger Level: 🟢 Safe
- Respects robots.txt: ✅ Yes
- Note: Separate from regular Applebot (search)
Other Major AI Companies
24. Amazonbot (Amazon)
- User-Agent: Amazonbot
- Company: Amazon
- Purpose: Alexa and AI applications
- Danger Level: 🟢 Safe
- Respects robots.txt: ✅ Yes
25. cohere-ai (Cohere)
- User-Agent: cohere-ai
- Company: Cohere
- Purpose: LLM training
- Danger Level: 🟢 Safe
- Respects robots.txt: ✅ Yes
26. MistralAI-User (Mistral AI)
- User-Agent: MistralAI-User
- Company: Mistral AI
- Purpose: Le Chat citation fetching
- Danger Level: 🟢 Safe
- Respects robots.txt: ✅ Yes
27. YouBot (You.com)
- User-Agent: YouBot
- Company: You.com
- Purpose: AI search crawler
- Danger Level: 🟢 Safe
- Respects robots.txt: ✅ Yes
Data Collection Bots
28. Diffbot (Diffbot)
- User-Agent: Diffbot
- Company: Diffbot
- Purpose: AI-powered web data extraction
- Danger Level: 🟢 Safe
- Respects robots.txt: ✅ Yes
29. Omgilibot (Omgili)
- User-Agent: omgilibot
- Company: Omgili
- Purpose: Data analysis and AI
- Danger Level: 🟢 Safe
- Respects robots.txt: ✅ Yes
Which AI Crawlers Should You Block?
Recommended Blocking Strategy
Strategy 1: Block Only Aggressive Crawlers (Recommended)
Block the 4 crawlers that ignore robots.txt or consume excessive bandwidth:
# Block aggressive crawlers
User-agent: Bytespider
Disallow: /
User-agent: 360Spider
Disallow: /
User-agent: ChatGLM-Spider
Disallow: /
User-agent: Sogou
Disallow: /
Who this is for: Sites that want to reduce costs while staying accessible to major AI companies.
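Once deployed, it's worth confirming the rules are actually being served. A quick check with curl (replace example.com with your domain):
# Fetch the live robots.txt and confirm the Bytespider rule is present
curl -s https://example.com/robots.txt | grep -A 1 "Bytespider"
# Expected output:
# User-agent: Bytespider
# Disallow: /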
Strategy 2: Block All LLM Training Bots (Moderate)
Block all crawlers training AI models, but allow user-initiated browsing:
# Block LLM training bots
User-agent: GPTBot
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: Bytespider
Disallow: /
User-agent: CCBot
Disallow: /
User-agent: anthropic-ai
Disallow: /
User-agent: Meta-ExternalAgent
Disallow: /
User-agent: Applebot-Extended
Disallow: /
User-agent: cohere-ai
Disallow: /
User-agent: FacebookBot
Disallow: /
User-agent: Amazonbot
Disallow: /
User-agent: 360Spider
Disallow: /
User-agent: ChatGLM-Spider
Disallow: /
User-agent: Sogou
Disallow: /
User-agent: Baiduspider
Disallow: /
User-agent: ErnieBot
Disallow: /
User-agent: DeepseekBot
Disallow: /
User-agent: PanguBot
Disallow: /
User-agent: Diffbot
Disallow: /
User-agent: omgilibot
Disallow: /
# Allow user-initiated browsing
User-agent: ChatGPT-User
Allow: /
User-agent: Claude-Web
Allow: /
Who this is for: Content creators who don't want their work used for AI training.
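Before relying on a file this long, sanity-check that the training bots are blocked while the user-initiated agents stay allowed. One option is Google's open-source robots.txt parser; a sketch assuming you've built the robots_main binary from github.com/google/robotstxt (example.com is a placeholder):
# Expect DISALLOWED for a training bot, ALLOWED for a user-initiated agent
./robots_main robots.txt GPTBot https://example.com/some-article
./robots_main robots.txt ChatGPT-User https://example.com/some-article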
Strategy 3: Block Everything (Maximum Protection)
Block all 29 AI crawlers:
# Block ALL AI crawlers
User-agent: GPTBot
User-agent: ClaudeBot
User-agent: Claude-Web
User-agent: anthropic-ai
User-agent: anthropic-research
User-agent: ChatGPT-User
User-agent: OAI-SearchBot
User-agent: Google-Extended
User-agent: Gemini-Deep-Research
User-agent: Bytespider
User-agent: FacebookBot
User-agent: Meta-ExternalAgent
User-agent: Meta-ExternalFetcher
User-agent: Applebot-Extended
User-agent: Amazonbot
User-agent: CCBot
User-agent: cohere-ai
User-agent: PerplexityBot
User-agent: YouBot
User-agent: Diffbot
User-agent: omgilibot
User-agent: 360Spider
User-agent: ChatGLM-Spider
User-agent: Sogou
User-agent: Baiduspider
User-agent: ErnieBot
User-agent: DeepseekBot
User-agent: PanguBot
User-agent: MistralAI-User
Disallow: /
Who this is for: Sites with premium content, paywalled articles, or a strong stance against AI training. (Stacking User-agent lines over a single Disallow is valid: under RFC 9309 the rules apply to every agent listed in the group.)
Server-Level Blocking (For Crawlers That Ignore robots.txt)
For the 4 aggressive crawlers that don't respect robots.txt, use server-level blocking:
Nginx Configuration
Create /etc/nginx/conf.d/block-ai-bots.conf:
# Block aggressive AI crawlers at server level
map $http_user_agent $block_ai_bots {
default 0;
# Aggressive crawlers that ignore robots.txt
"~*Bytespider" 1;
"~*360Spider" 1;
"~*ChatGLM-Spider" 1;
"~*Sogou" 1;
}
# In your server block
server {
if ($block_ai_bots) {
return 403;
}
}
Then reload Nginx:
sudo nginx -t
sudo systemctl reload nginx
Apache Configuration
Add to .htaccess:
# Block aggressive AI crawlers
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} (Bytespider|360Spider|ChatGLM-Spider|Sogou) [NC]
RewriteRule .* - [F,L]
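Whichever server you use, verify the block by sending a request with a spoofed crawler user agent (replace example.com with your domain):
# Should print 403 (blocked)
curl -s -o /dev/null -w "%{http_code}\n" -A "Bytespider" https://example.com/
# Should print 200 (normal visitors unaffected)
curl -s -o /dev/null -w "%{http_code}\n" https://example.com/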
See full tutorial: Block Bytespider with Nginx →
How to Verify Blocking Is Working
After implementing blocks, verify they're working:
Method 1: Use CheckAIBots.com
- Visit CheckAIBots.com
- Enter your domain
- See which bots are allowed/blocked
Method 2: Check Server Logs
# Check if Bytespider is still accessing
grep "Bytespider" /var/log/nginx/access.log | tail -20
# If no results after 24-48 hours, blocking is working!
Method 3: Monitor Bandwidth
Track bandwidth usage before and after blocking. If aggressive bots made up a large share of your traffic, you may see a 30-75% reduction; the script below shows how to measure it.
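A minimal sketch for Nginx's default combined log format, where the third quote-delimited field holds the status code and response size:
# Total megabytes served to the four aggressive crawlers
awk -F'"' '/Bytespider|360Spider|ChatGLM-Spider|Sogou/ {
  split($3, a, " "); bytes += a[2]
} END { printf "%.1f MB\n", bytes / 1024 / 1024 }' /var/log/nginx/access.log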
Frequently Asked Questions
Will blocking AI crawlers hurt my SEO?
No. AI crawlers like GPTBot and ClaudeBot are completely separate from search engine crawlers like Googlebot. Blocking AI bots won't affect your search rankings.
Should I block all 29 crawlers?
It depends:
- E-commerce sites: Block aggressive crawlers only
- Blogs/content sites: Block LLM training bots
- News/premium content: Block everything
How often is this list updated?
We update this list monthly as new AI crawlers emerge. Last update: January 2025.
What about crawlers not on this list?
New AI crawlers appear regularly. Monitor your server logs and check back monthly for updates.
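A simple way to spot newcomers is to rank the user agents hitting your site; unfamiliar bot names near the top deserve a closer look. This assumes Nginx's default combined log format, where the user agent is the last quoted field:
# Top 20 user agents by request count
awk -F'"' '{print $6}' /var/log/nginx/access.log | sort | uniq -c | sort -rn | head -20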
Can I selectively allow some pages?
Yes! Use robots.txt to block most content but allow specific directories; the longer, more specific Allow path takes precedence over the broader Disallow:
User-agent: GPTBot
Allow: /public-docs/
Disallow: /
Real-World Impact: Before & After Blocking
Case Study 1: Tech Blog
- Before: 287,000 AI crawler requests/month
- Blocked: Bytespider, 360Spider (aggressive only)
- After: 89,000 requests/month (69% reduction)
- Savings: $840/month in bandwidth costs
Case Study 2: E-commerce Site
- Before: 42% of bandwidth consumed by Bytespider
- Blocked: All aggressive crawlers + Chinese bots
- After: 15% bandwidth reduction overall
- Savings: $1,800/month
Case Study 3: News Website
- Before: Content appearing in ChatGPT responses
- Blocked: All LLM training bots
- After: More direct traffic, attribution required for AI citations
- Result: Better traffic quality
Download Ready-to-Use Blocking Configurations
Quick Copy: Block All 29 AI Crawlers (robots.txt)
User-agent: GPTBot
User-agent: ClaudeBot
User-agent: Claude-Web
User-agent: anthropic-ai
User-agent: anthropic-research
User-agent: ChatGPT-User
User-agent: OAI-SearchBot
User-agent: Google-Extended
User-agent: Gemini-Deep-Research
User-agent: Bytespider
User-agent: FacebookBot
User-agent: Meta-ExternalAgent
User-agent: Meta-ExternalFetcher
User-agent: Applebot-Extended
User-agent: Amazonbot
User-agent: CCBot
User-agent: cohere-ai
User-agent: PerplexityBot
User-agent: YouBot
User-agent: Diffbot
User-agent: omgilibot
User-agent: 360Spider
User-agent: ChatGLM-Spider
User-agent: Sogou
User-agent: Baiduspider
User-agent: ErnieBot
User-agent: DeepseekBot
User-agent: PanguBot
User-agent: MistralAI-User
Disallow: /
Generate custom robots.txt for your site →
Conclusion: Take Control of Your Content
With 29 AI crawlers actively scraping the web, now is the time to decide how your content is used. Whether you block aggressive crawlers only or all AI bots entirely, this list gives you the information to make an informed choice.
Key takeaways:
- ✅ 29 AI crawlers are actively scraping websites in 2025
- ✅ 4 aggressive crawlers require server-level blocking
- ✅ Blocking AI bots won't hurt your SEO
- ✅ Blocking aggressive bots can save 30-75% in bandwidth costs
- ✅ This list is updated monthly
Next steps:
- Check which bots are crawling your site →
- Choose a blocking strategy (aggressive only, LLM training, or all)
- Update your robots.txt
- Implement server-level blocks for aggressive bots
- Verify blocking is working
Related articles:
- What Are AI Crawlers? Complete Introduction
- How to Block AI Crawlers: Complete Guide 2025
- How to Detect AI Crawlers on Your Website
- Nginx Tutorial: Block AI Bots in 5 Minutes
- Why 48% of News Sites Block AI Crawlers
- Block Bytespider with Nginx
- robots.txt Guide: Block AI Without Hurting SEO
Bookmark this page: This list is updated monthly. Check back for new AI crawlers and updated blocking recommendations.
Ready to Check Your Website?
Use CheckAIBots to instantly discover which AI crawlers can access your website and get actionable blocking recommendations
Free AI Crawler Check