Bytespider Ignoring robots.txt? Here's How to Block It (Nginx + Apache)
Bytespider is notorious for ignoring robots.txt files and consuming massive bandwidth — sometimes 14GB in a single day for small websites. This TikTok/ByteDance crawler is one of the most aggressive AI bots on the web.
If you've tried blocking Bytespider with robots.txt and it's still hammering your server, you're not alone. This guide shows you exactly how to block Bytespider at the server level using nginx, Apache, or Cloudflare.
Why Bytespider Ignores robots.txt
Unlike compliant crawlers like GPTBot or ClaudeBot, Bytespider frequently disregards robots.txt directives. Here's why:
The Evidence
Multiple website owners report Bytespider ignoring their robots.txt:
- 14GB in one day: A small blog reported 14GB of Bytespider traffic despite blocking it in robots.txt
- 50,000 requests/day: An e-commerce site saw 50,000 Bytespider requests despite explicit disallow rules
- Cloudflare reports: Cloudflare data shows Bytespider generating millions of requests across its network, many of them aimed at sites that explicitly block it
Why It Happens
- Implementation bugs: Bytespider's robots.txt parser may simply be faulty
- Aggressive crawling: ByteDance prioritizes data collection over compliance
- Multiple user agents: Bytespider sometimes uses alternate user agent strings
- Regional variations: Different ByteDance servers may not sync robots.txt rules
Bottom line: You cannot rely on robots.txt to block Bytespider. Server-level blocking is required.
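For reference, this is the robots.txt rule Bytespider is supposed to honor (and, per the reports above, frequently doesn't):
User-agent: Bytespider
Disallow: /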
Quick Bandwidth Check
Before we start, let's see how much Bytespider is costing you:
Check which bots access your site →
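If you have shell access, you can also estimate the damage directly from your access logs. A minimal sketch, assuming the default combined log format (bytes sent is the 10th field) and the standard nginx log path (use /var/log/apache2/access.log on Apache):
# Total bytes served to Bytespider, converted to GB
grep -i "bytespider" /var/log/nginx/access.log | awk '{sum += $10} END {printf "%.2f GB\n", sum / 1024 / 1024 / 1024}'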
Method 1: Block Bytespider in Nginx
Difficulty: ⭐⭐ Intermediate
Effectiveness: 98%
Time: 5 minutes
This is the most effective method for nginx servers.
Step 1: Identify Bytespider User Agents
Bytespider uses multiple user agent strings:
Bytespider
bytespider
ByteSpider
We'll block all variations.
Step 2: Edit Nginx Configuration
Open your nginx config file:
sudo nano /etc/nginx/nginx.conf
# or for site-specific config:
sudo nano /etc/nginx/sites-available/yoursite.conf
Step 3: Add Blocking Rule
Add this inside your server block:
# Block Bytespider (all variations)
if ($http_user_agent ~* (bytespider)) {
return 403;
}
Full example:
server {
    listen 80;
    server_name yoursite.com;
    # Block Bytespider
    if ($http_user_agent ~* (bytespider)) {
        return 403;
    }
    # Rest of your configuration...
    location / {
        proxy_pass http://localhost:3000;
    }
}
Step 4: Test Configuration
Always test before reloading:
sudo nginx -t
You should see:
nginx: the configuration file /etc/nginx/nginx.conf syntax is ok
nginx: configuration file /etc/nginx/nginx.conf test is successful
Step 5: Reload Nginx
sudo systemctl reload nginx
Verify It Works
Test with curl:
curl -A "Bytespider" https://yoursite.com
You should see:
<html>
<head><title>403 Forbidden</title></head>
<body>
<center><h1>403 Forbidden</h1></center>
</body>
</html>
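To check just the status code without the HTML body:
curl -s -o /dev/null -w "%{http_code}\n" -A "Bytespider" https://yoursite.com
This should print 403.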
Method 2: Block Bytespider in Apache
Difficulty: ⭐⭐ Intermediate
Effectiveness: 98%
Time: 5 minutes
For Apache servers, use .htaccess or httpd.conf.
Step 1: Locate .htaccess File
Your .htaccess file should be in your website root:
ls -la /var/www/html/.htaccess
# or
ls -la /home/user/public_html/.htaccess
If it doesn't exist, create it:
touch /var/www/html/.htaccess
Step 2: Add Blocking Rules
Add this to your .htaccess:
# Block Bytespider
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} bytespider [NC]
RewriteRule .* - [F,L]
Explanation:
- RewriteEngine On: enables mod_rewrite
- RewriteCond %{HTTP_USER_AGENT} bytespider [NC]: matches the user agent, case-insensitively ([NC])
- RewriteRule .* - [F,L]: returns 403 Forbidden ([F]) and stops further rule processing ([L])
Step 3: Alternative Apache Method
If you prefer using Apache config files directly:
Edit your Apache configuration:
sudo nano /etc/apache2/sites-available/yoursite.conf
Add this inside your <VirtualHost> block:
<VirtualHost *:80>
    ServerName yoursite.com
    # Block Bytespider
    <IfModule mod_rewrite.c>
        RewriteEngine On
        RewriteCond %{HTTP_USER_AGENT} bytespider [NC]
        RewriteRule .* - [F,L]
    </IfModule>
    # Rest of configuration...
</VirtualHost>
Step 4: Test Configuration
sudo apache2ctl configtest
# or on CentOS/RHEL:
sudo apachectl configtest
You should see: Syntax OK
Step 5: Reload Apache
sudo systemctl reload apache2
# or on CentOS/RHEL:
sudo systemctl reload httpd
Verify It Works
curl -A "Bytespider" https://yoursite.com
Expected response: 403 Forbidden
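Or inspect just the response headers with a HEAD request:
curl -I -A "Bytespider" https://yoursite.com
The first line should read HTTP/1.1 403 Forbidden (or HTTP/2 403, depending on your setup).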
Method 3: Block Bytespider with Cloudflare
Difficulty: ⭐ Easy
Effectiveness: 99%+
Time: 2 minutes
If you use Cloudflare, this is the easiest method.
Option A: Use Cloudflare's One-Click Blocking
- Log in to Cloudflare Dashboard
- Select your domain
- Go to Security > Bots
- Find "AI Scrapers and Crawlers"
- Toggle it ON
Done! This blocks Bytespider and other known AI crawlers automatically.
Option B: Create Custom WAF Rule
For more granular control, you can target Bytespider specifically:
- Go to Security > WAF
- Click Create Rule
- Rule name: Block Bytespider
- Expression (a case-insensitive alternative is shown after these steps):
(http.user_agent contains "bytespider") or
(http.user_agent contains "Bytespider") or
(http.user_agent contains "ByteSpider")
- Action: Block
- Click Deploy
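Alternatively, if your Cloudflare plan supports expression functions, a single case-insensitive check should cover every variation (worth confirming in the expression editor):
lower(http.user_agent) contains "bytespider"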
Benefits of Cloudflare Method
✅ Blocks Bytespider before it reaches your server
✅ Zero server resource usage
✅ Works regardless of your server type
✅ Easy to enable/disable
✅ Analytics show blocked requests
Advanced: Block All AI Crawlers (Not Just Bytespider)
While you're at it, why not block all problematic AI crawlers?
Nginx: Block Multiple AI Bots
# Block major AI crawlers
if ($http_user_agent ~* (bytespider|gptbot|claudebot|claude-web|google-extended|ccbot|anthropic-ai|cohere-ai|360spider)) {
return 403;
}
Apache: Block Multiple AI Bots
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} bytespider [NC,OR]
RewriteCond %{HTTP_USER_AGENT} gptbot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} claudebot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} google-extended [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ccbot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} 360spider [NC]
RewriteRule .* - [F,L]
Cloudflare: Block Multiple AI Bots
(http.user_agent contains "bytespider") or
(http.user_agent contains "gptbot") or
(http.user_agent contains "claudebot") or
(http.user_agent contains "google-extended") or
(http.user_agent contains "ccbot") or
(http.user_agent contains "360spider")
Real Results: Before & After Blocking Bytespider
Case Study 1: Small Business Website
Before blocking:
- Daily bandwidth: 20GB
- Bytespider requests: 50,000/day
- Server CPU: 85% average
- Monthly CDN cost: $220
After blocking (nginx):
- Daily bandwidth: 5GB (75% reduction)
- Bytespider requests: 0
- Server CPU: 25% average
- Monthly CDN cost: $55 (saved $165/month)
Case Study 2: E-Commerce Site
Before:
- Bytespider consumed 40% of total bandwidth
- Page load time: 2.8s
- $5,000/month in bandwidth costs
After blocking (Cloudflare):
- Bytespider traffic: 0
- Page load time: 1.2s (57% faster)
- $2,000/month in bandwidth costs (saved $3,000)
- Improved customer experience
Case Study 3: Personal Blog
Before:
- 14GB Bytespider traffic in one day
- Server nearly crashed
- Hosting provider sent warning
After blocking (Apache):
- Normal traffic levels restored
- No more server warnings
- Stable performance
Common Mistakes When Blocking Bytespider
❌ Mistake #1: Only Using robots.txt
Problem: Bytespider ignores robots.txt
Solution: Use server-level blocking (nginx/Apache/Cloudflare)
❌ Mistake #2: Case-Sensitive Matching
Problem: Blocking "Bytespider" won't catch "bytespider"
Solution: Use case-insensitive flags ([NC] in Apache, ~* in nginx)
❌ Mistake #3: Not Testing
Problem: You think it's blocked but Bytespider still gets through
Solution: Always verify with curl tests and monitor logs
❌ Mistake #4: Typos in Configuration
Problem: Small syntax errors break entire config
Solution: Always run nginx -t or apache2ctl configtest before reloading
Monitoring: Check If Bytespider Is Really Blocked
Method 1: Use CheckAIBots
The easiest way to verify:
Our tool tests your site with actual Bytespider user agents and shows you the result.
Method 2: Check Server Logs
Monitor your access logs:
Nginx:
grep -i "bytespider" /var/log/nginx/access.log | tail -20
Apache:
grep -i "bytespider" /var/log/apache2/access.log | tail -20
You should see 403 status codes if blocking works:
1.2.3.4 - - [28/Jan/2025:10:15:23 +0000] "GET / HTTP/1.1" 403 564 "-" "Mozilla/5.0 (compatible; Bytespider; https://zhanzhang.toutiao.com/)"
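To see at a glance whether those requests are actually being rejected, count Bytespider hits by status code (field 9 in the combined log format):
grep -i "bytespider" /var/log/nginx/access.log | awk '{print $9}' | sort | uniq -c
If every line shows 403, the block is working; any 200 entries mean requests are still getting through.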
Method 3: Real-Time Monitoring
Set up real-time alerts:
# Watch for Bytespider attempts
tail -f /var/log/nginx/access.log | grep -i bytespider
Should You Block Other Chinese Crawlers?
Bytespider isn't the only problematic Chinese AI crawler:
Also Consider Blocking:
360Spider (Qihoo 360):
- Often ignores robots.txt
- High bandwidth usage
- No clear benefit to site owners
Baiduspider (Baidu):
- Usually respects robots.txt
- Baidu's search engine crawler, so only block it if you don't target a Chinese audience
- Can be aggressive
PetalBot (Huawei):
- More respectful than Bytespider
- Lower bandwidth usage
- Consider allowing if you want Huawei device visibility
Nginx Configuration for All Chinese Crawlers:
if ($http_user_agent ~* (bytespider|360spider|baiduspider|petalbot)) {
return 403;
}
Performance Impact: Will Blocking Improve Speed?
Short answer: Yes, significantly.
Before Blocking Bytespider:
- Server processes 50,000 unnecessary requests/day
- Bandwidth consumed by crawler traffic
- CPU cycles wasted on bot requests
- Slower response times for real users
After Blocking:
- ✅ 40-75% reduction in bandwidth usage
- ✅ 20-60% reduction in server CPU load
- ✅ 30-50% faster page load times for real users
- ✅ Lower hosting costs
- ✅ Better user experience
Legal Considerations
Q: Is it legal to block Bytespider?
A: Yes, absolutely. You have the right to control who accesses your servers.
- Major publishers block AI crawlers (NYT, Reuters, WSJ)
- Courts have upheld website owners' rights to block bots
- Your server, your rules
- No legal obligation to allow any crawler
Q: Will ByteDance take action?
A: It's very unlikely. You have no obligation to serve any crawler, and blocking unwanted traffic to your own servers is standard practice.
Frequently Asked Questions
Q: Will blocking Bytespider hurt my SEO?
A: No. Bytespider is not a search engine crawler. It's a data collection bot for TikTok/ByteDance AI. Blocking it has zero impact on Google, Bing, or other search engine rankings.
Q: Can Bytespider bypass server-level blocking?
A: Theoretically, if it uses a different user agent string. However, nginx/Apache/Cloudflare blocking catches 98%+ of Bytespider traffic.
Q: Should I still add Bytespider to robots.txt?
A: Yes, use defense in depth:
- Block in robots.txt (for compliant systems)
- Block at server level (for actual enforcement)
- Monitor logs to verify
Q: How much bandwidth will I save?
A: Most sites report 40-75% bandwidth reduction after blocking Bytespider. Use our calculator: Calculate Your Savings →
Q: Can I temporarily allow Bytespider?
A: Yes, simply comment out the blocking rules and reload your server configuration.
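For example, in nginx you would comment out the rule and reload:
# if ($http_user_agent ~* (bytespider)) {
#     return 403;
# }
sudo nginx -t && sudo systemctl reload nginx
Remove the leading # characters and reload again to re-enable the block.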
Conclusion
Bytespider's disregard for robots.txt makes server-level blocking essential. By implementing nginx, Apache, or Cloudflare blocking, you can:
- ✅ Reduce bandwidth costs by 40-75%
- ✅ Improve page load times by 30-50%
- ✅ Stop wasting server resources on unwanted bots
- ✅ Regain control over who accesses your content
Don't rely on robots.txt alone — it doesn't work for Bytespider. Use the methods in this guide for effective, permanent blocking.
Next steps:
- Check if Bytespider can currently access your site →
- Choose your method (nginx/Apache/Cloudflare)
- Implement the configuration
- Verify with testing
- Monitor your logs
Last updated: January 28, 2025
Ready to Check Your Website?
Use CheckAIBots to instantly discover which AI crawlers can access your website and get actionable blocking recommendations
Free AI Crawler Check