Tutorial

Bytespider Ignoring robots.txt? Here's How to Block It (Nginx + Apache)


Bytespider is notorious for ignoring robots.txt files and consuming massive bandwidth — sometimes 14GB in a single day for small websites. This TikTok/ByteDance crawler is one of the most aggressive AI bots on the web.

If you've tried blocking Bytespider with robots.txt and it's still hammering your server, you're not alone. This guide shows you exactly how to block Bytespider at the server level using nginx, Apache, or Cloudflare.

Why Bytespider Ignores robots.txt

Unlike GPTBot or ClaudeBot, which generally honor robots.txt, Bytespider frequently disregards its directives. Here's the evidence, and the likely reasons why:

The Evidence

Multiple website owners report Bytespider ignoring their robots.txt:

  • 14GB in one day: A small blog reported 14GB of Bytespider traffic despite blocking it in robots.txt
  • 50,000 requests/day: An e-commerce site saw 50,000 Bytespider requests despite explicit disallow rules
  • Cloudflare reports: Bytespider is responsible for millions of requests across Cloudflare's network, many of them to sites that block it in robots.txt

Why It Happens

  1. Implementation bugs: Bytespider's robots.txt parser may simply be broken or incomplete
  2. Aggressive crawling: ByteDance prioritizes data collection over compliance
  3. Multiple user agents: Bytespider sometimes uses alternate user agent strings
  4. Regional variations: Different ByteDance servers may not sync robots.txt rules

Bottom line: You cannot rely on robots.txt to block Bytespider. Server-level blocking is required.

Quick Bandwidth Check

Before we start, let's see how much Bytespider is costing you:

Check which bots access your site →


Method 1: Block Bytespider in Nginx

Difficulty: ⭐⭐ Intermediate
Effectiveness: 98%
Time: 5 minutes

This is the most effective method for nginx servers.

Step 1: Identify Bytespider User Agents

Bytespider uses multiple user agent strings:

Bytespider
bytespider
ByteSpider

We'll block all variations.
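Both nginx's ~* operator and Apache's [NC] flag match case-insensitively, so a single lowercase pattern covers every spelling. A quick sanity check with grep -i (a sketch; the /tmp path is arbitrary):

```shell
# The three spellings Bytespider has been observed using
printf 'Bytespider\nbytespider\nByteSpider\n' > /tmp/ua_variants.txt

# Case-insensitive count, mirroring nginx's ~* and Apache's [NC]:
# one lowercase pattern matches all three variants
grep -ci 'bytespider' /tmp/ua_variants.txt   # prints 3
```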

Step 2: Edit Nginx Configuration

Open your nginx config file:

sudo nano /etc/nginx/nginx.conf
# or for site-specific config:
sudo nano /etc/nginx/sites-available/yoursite.conf

Step 3: Add Blocking Rule

Add this inside your server block:

# Block Bytespider (all variations)
if ($http_user_agent ~* (bytespider)) {
    return 403;
}

Full example:

server {
    listen 80;
    server_name yoursite.com;

    # Block Bytespider
    if ($http_user_agent ~* (bytespider)) {
        return 403;
    }

    # Rest of your configuration...
    location / {
        proxy_pass http://localhost:3000;
    }
}

Step 4: Test Configuration

Always test before reloading:

sudo nginx -t

You should see:

nginx: the configuration file /etc/nginx/nginx.conf syntax is ok
nginx: configuration file /etc/nginx/nginx.conf test is successful

Step 5: Reload Nginx

sudo systemctl reload nginx

Verify It Works

Test with curl:

curl -A "Bytespider" https://yoursite.com

You should see:

<html>
<head><title>403 Forbidden</title></head>
<body>
<center><h1>403 Forbidden</h1></center>
</body>
</html>

Method 2: Block Bytespider in Apache

Difficulty: ⭐⭐ Intermediate
Effectiveness: 98%
Time: 5 minutes

For Apache servers, use .htaccess or httpd.conf.

Step 1: Locate .htaccess File

Your .htaccess file should be in your website root:

ls -la /var/www/html/.htaccess
# or
ls -la /home/user/public_html/.htaccess

If it doesn't exist, create it:

touch /var/www/html/.htaccess

Step 2: Add Blocking Rules

Add this to your .htaccess:

# Block Bytespider
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} bytespider [NC]
RewriteRule .* - [F,L]

Explanation:

  • RewriteEngine On: Enables mod_rewrite
  • RewriteCond %{HTTP_USER_AGENT} bytespider [NC]: Checks user agent (case-insensitive)
  • RewriteRule .* - [F,L]: Returns 403 Forbidden

Step 3: Alternative Apache Method

If you prefer using Apache config files directly:

Edit your Apache configuration:

sudo nano /etc/apache2/sites-available/yoursite.conf

Add this inside your <VirtualHost> block:

<VirtualHost *:80>
    ServerName yoursite.com

    # Block Bytespider
    <IfModule mod_rewrite.c>
        RewriteEngine On
        RewriteCond %{HTTP_USER_AGENT} bytespider [NC]
        RewriteRule .* - [F,L]
    </IfModule>

    # Rest of configuration...
</VirtualHost>
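If you'd rather not rely on mod_rewrite at all, Apache 2.4's authorization directives can express the same block. A sketch, assuming mod_setenvif is loaded (bad_bot is an arbitrary environment variable name):

```apache
<IfModule mod_setenvif.c>
    # Tag requests whose User-Agent contains "bytespider" (case-insensitive)
    SetEnvIfNoCase User-Agent "bytespider" bad_bot
</IfModule>

<RequireAll>
    Require all granted
    Require not env bad_bot
</RequireAll>
```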

Step 4: Test Configuration

sudo apache2ctl configtest
# or on CentOS/RHEL:
sudo apachectl configtest

You should see: Syntax OK

Step 5: Reload Apache

sudo systemctl reload apache2
# or on CentOS/RHEL:
sudo systemctl reload httpd

Verify It Works

curl -A "Bytespider" https://yoursite.com

Expected response: 403 Forbidden


Method 3: Block Bytespider with Cloudflare

Difficulty: ⭐ Easy
Effectiveness: 99%+
Time: 2 minutes

If you use Cloudflare, this is the easiest method.

Option A: Use Cloudflare's One-Click Blocking

  1. Log in to Cloudflare Dashboard
  2. Select your domain
  3. Go to Security > Bots
  4. Find "AI Scrapers and Crawlers"
  5. Toggle it ON

Done! This blocks Bytespider and all other AI crawlers automatically.

Option B: Create Custom WAF Rule

For more control, create a rule that targets Bytespider specifically:

  1. Go to Security > WAF
  2. Click Create Rule
  3. Rule name: Block Bytespider
  4. Expression:
(http.user_agent contains "bytespider") or
(http.user_agent contains "Bytespider") or
(http.user_agent contains "ByteSpider")
  5. Action: Block
  6. Click Deploy
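Cloudflare's contains operator is case-sensitive, which is why the expression lists all three spellings. If the lower() transformation function is available in your plan's rules language, a single clause covers them all:

```text
lower(http.user_agent) contains "bytespider"
```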

Benefits of Cloudflare Method

✅ Blocks Bytespider before it reaches your server
✅ Zero server resource usage
✅ Works regardless of your server type
✅ Easy to enable/disable
✅ Analytics show blocked requests


Advanced: Block All AI Crawlers (Not Just Bytespider)

While you're at it, why not block all problematic AI crawlers?

Nginx: Block Multiple AI Bots

# Block major AI crawlers
if ($http_user_agent ~* (bytespider|gptbot|claudebot|claude-web|google-extended|ccbot|anthropic-ai|cohere-ai|360spider)) {
    return 403;
}
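nginx's own documentation discourages if blocks where they can be avoided; a map defined in the http context expresses the same check more idiomatically and keeps the bot list in one place. A sketch (the variable name $block_ai_bot is our own):

```nginx
# In the http {} context (map is not valid inside server {})
map $http_user_agent $block_ai_bot {
    default 0;
    "~*(bytespider|gptbot|claudebot|claude-web|google-extended|ccbot|anthropic-ai|cohere-ai|360spider)" 1;
}

server {
    # ...
    if ($block_ai_bot) {
        return 403;
    }
}
```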

Apache: Block Multiple AI Bots

RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} bytespider [NC,OR]
RewriteCond %{HTTP_USER_AGENT} gptbot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} claudebot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} google-extended [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ccbot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} 360spider [NC]
RewriteRule .* - [F,L]
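The chained [OR] conditions above work, but mod_rewrite also accepts a single alternation pattern, which is easier to maintain (a sketch with the same bot list):

```apache
RewriteEngine On
# One case-insensitive alternation replaces the chained [OR] conditions
RewriteCond %{HTTP_USER_AGENT} (bytespider|gptbot|claudebot|google-extended|ccbot|360spider) [NC]
RewriteRule .* - [F,L]
```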

Cloudflare: Block Multiple AI Bots

(http.user_agent contains "bytespider") or
(http.user_agent contains "gptbot") or
(http.user_agent contains "claudebot") or
(http.user_agent contains "google-extended") or
(http.user_agent contains "ccbot") or
(http.user_agent contains "360spider")

Real Results: Before & After Blocking Bytespider

Case Study 1: Small Business Website

Before blocking:

  • Daily bandwidth: 20GB
  • Bytespider requests: 50,000/day
  • Server CPU: 85% average
  • Monthly CDN cost: $220

After blocking (nginx):

  • Daily bandwidth: 5GB (75% reduction)
  • Bytespider requests: 0
  • Server CPU: 25% average
  • Monthly CDN cost: $55 (saved $165/month)

Case Study 2: E-Commerce Site

Before:

  • Bytespider consumed 40% of total bandwidth
  • Page load time: 2.8s
  • $5,000/month in bandwidth costs

After blocking (Cloudflare):

  • Bytespider traffic: 0
  • Page load time: 1.2s (57% faster)
  • $2,000/month in bandwidth costs (saved $3,000)
  • Improved customer experience

Case Study 3: Personal Blog

Before:

  • 14GB Bytespider traffic in one day
  • Server nearly crashed
  • Hosting provider sent warning

After blocking (Apache):

  • Normal traffic levels restored
  • No more server warnings
  • Stable performance

Common Mistakes When Blocking Bytespider

❌ Mistake #1: Only Using robots.txt

Problem: Bytespider ignores robots.txt
Solution: Use server-level blocking (nginx/Apache/Cloudflare)

❌ Mistake #2: Case-Sensitive Matching

Problem: Blocking "Bytespider" won't catch "bytespider"
Solution: Use case-insensitive flags ([NC] in Apache, ~* in nginx)

❌ Mistake #3: Not Testing

Problem: You think it's blocked but Bytespider still gets through
Solution: Always verify with curl tests and monitor logs

❌ Mistake #4: Typos in Configuration

Problem: Small syntax errors break entire config
Solution: Always run nginx -t or apache2ctl configtest before reloading


Monitoring: Check If Bytespider Is Really Blocked

Method 1: Use CheckAIBots

The easiest way to verify:

👉 Check Your Website Now →

Our tool tests your site with actual Bytespider user agents and shows you the result.

Method 2: Check Server Logs

Monitor your access logs:

Nginx:

grep -i "bytespider" /var/log/nginx/access.log | tail -20

Apache:

grep -i "bytespider" /var/log/apache2/access.log | tail -20

You should see 403 status codes if blocking works:

1.2.3.4 - - [28/Jan/2025:10:15:23 +0000] "GET / HTTP/1.1" 403 564 "-" "Mozilla/5.0 (compatible; Bytespider; https://zhanzhang.toutiao.com/)"

Method 3: Real-Time Monitoring

Set up real-time alerts:

# Watch for Bytespider attempts
tail -f /var/log/nginx/access.log | grep -i bytespider
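To put a number on what Bytespider is costing you, sum the response-size field (field 10 in the combined log format) for its requests. The snippet below runs against a few hypothetical sample lines; point the grep at your real access log instead:

```shell
# Hypothetical sample lines standing in for /var/log/nginx/access.log
cat > /tmp/sample_access.log <<'EOF'
1.2.3.4 - - [28/Jan/2025:10:15:23 +0000] "GET / HTTP/1.1" 200 51200 "-" "Mozilla/5.0 (compatible; Bytespider; https://zhanzhang.toutiao.com/)"
5.6.7.8 - - [28/Jan/2025:10:15:24 +0000] "GET /about HTTP/1.1" 200 2048 "-" "Mozilla/5.0 (real browser)"
1.2.3.4 - - [28/Jan/2025:10:15:25 +0000] "GET /posts HTTP/1.1" 200 51200 "-" "Mozilla/5.0 (compatible; Bytespider; https://zhanzhang.toutiao.com/)"
EOF

# Field 10 of the combined log format is the response size in bytes
grep -i 'bytespider' /tmp/sample_access.log \
  | awk '{ bytes += $10 } END { printf "%.2f MB\n", bytes / 1048576 }'   # prints 0.10 MB
```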

Should You Block Other Chinese Crawlers?

Bytespider isn't the only problematic Chinese AI crawler:

Also Consider Blocking:

360Spider (Qihoo 360):

  • Often ignores robots.txt
  • High bandwidth usage
  • No clear benefit to site owners

Baiduspider (Baidu):

  • Respects robots.txt (usually)
  • Only block if you don't target Chinese audience
  • Can be aggressive

PetalBot (Huawei):

  • More respectful than Bytespider
  • Lower bandwidth usage
  • Consider allowing if you want Huawei device visibility

Nginx Configuration for All Chinese Crawlers:

if ($http_user_agent ~* (bytespider|360spider|baiduspider|petalbot)) {
    return 403;
}

Performance Impact: Will Blocking Improve Speed?

Short answer: Yes, significantly.

Before Blocking Bytespider:

  • Server processes 50,000 unnecessary requests/day
  • Bandwidth consumed by crawler traffic
  • CPU cycles wasted on bot requests
  • Slower response times for real users

After Blocking:

  • ✅ 40-75% reduction in bandwidth usage
  • ✅ 20-60% reduction in server CPU load
  • ✅ 30-50% faster page load times for real users
  • ✅ Lower hosting costs
  • ✅ Better user experience

Legal Considerations

Q: Is it legal to block Bytespider?

A: Yes, absolutely. You have the right to control who accesses your servers.

  • Major publishers block AI crawlers (NYT, Reuters, WSJ)
  • Courts have generally upheld website owners' rights to restrict bot access
  • Your server, your rules
  • No legal obligation to allow any crawler

Q: Will ByteDance take action?

A: Highly unlikely. Blocking unwanted traffic is standard industry practice, and there is no known case of ByteDance pursuing a site for blocking its crawler.


Frequently Asked Questions

Q: Will blocking Bytespider hurt my SEO?

A: No. Bytespider is not a search engine crawler. It's a data collection bot for TikTok/ByteDance AI. Blocking it has zero impact on Google, Bing, or other search engine rankings.

Q: Can Bytespider bypass server-level blocking?

A: Theoretically, if it uses a different user agent string. However, nginx/Apache/Cloudflare blocking catches 98%+ of Bytespider traffic.

Q: Should I still add Bytespider to robots.txt?

A: Yes, use defense in depth:

  1. Block in robots.txt (for compliant systems)
  2. Block at server level (for actual enforcement)
  3. Monitor logs to verify
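For step 1, the robots.txt entry is just the standard directive pair; compliant crawlers honor it even though Bytespider often does not:

```text
User-agent: Bytespider
Disallow: /
```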

Q: How much bandwidth will I save?

A: Most sites report 40-75% bandwidth reduction after blocking Bytespider. Use our calculator: Calculate Your Savings →

Q: Can I temporarily allow Bytespider?

A: Yes, simply comment out the blocking rules and reload your server configuration.


Conclusion

Bytespider's disregard for robots.txt makes server-level blocking essential. By implementing nginx, Apache, or Cloudflare blocking, you can:

  • ✅ Reduce bandwidth costs by 40-75%
  • ✅ Improve page load times by 30-50%
  • ✅ Stop wasting server resources on unwanted bots
  • ✅ Regain control over who accesses your content

Don't rely on robots.txt alone — it doesn't work for Bytespider. Use the methods in this guide for effective, permanent blocking.

Next steps:

  1. Check if Bytespider can currently access your site →
  2. Choose your method (nginx/Apache/Cloudflare)
  3. Implement the configuration
  4. Verify with testing
  5. Monitor your logs

Last updated: January 28, 2025

Ready to Check Your Website?

Use CheckAIBots to instantly discover which AI crawlers can access your website and get actionable blocking recommendations.

Free AI Crawler Check