
Bot Traffic From Access Logs: How to Separate Crawlers, Scrapers, and Real Users

Security · May 31, 2026 · 5 min read

Your access logs are full of non-human traffic. Learn to classify it into search crawlers, security scanners, content scrapers, and real visitors.


Your Access Log Is 80% Bots. Which Ones Matter?

A raw access log is a firehose of machine-generated traffic. Search crawlers, vulnerability scanners, uptime monitors, SEO tools, content scrapers, and API clients all blend together. Separating them from real user traffic is the first step in any log analysis.

Start with OpsCheck User Agent Parser to categorize User-Agent strings. Cross-reference IPs with OpsCheck IP Geolocation to see which are from cloud/datacenter ranges versus residential ISPs.

Extracting and Classifying Bot Traffic

# Get top User-Agents from nginx access log
awk -F'"' '{print $6}' /var/log/nginx/access.log | sort | uniq -c | sort -rn | head -20

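# Count requests whose User-Agent does not claim to be a major search crawler
# (the User-Agent is self-declared, so confirm real crawlers via reverse DNS)
awk -F'"' 'tolower($6) !~ /googlebot|bingbot|slurp|duckduckbot/' /var/log/nginx/access.log | wc -l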

# Identify datacenter IPs: these are rarely real users, though VPNs and corporate proxies also exit from datacenter ranges
# Use OpsCheck IP Geolocation to classify
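
The top-20 list above shows individual strings, but a per-category count is often easier to read. The following is a minimal sketch that buckets each request by what its User-Agent claims to be. The function name `classify_user_agents`, the bucket names, and the pattern lists are illustrative assumptions, not a complete bot catalogue, and it assumes the nginx combined log format used in the commands above.

```shell
# Sketch: count requests per User-Agent category.
# Assumption: combined log format, so the User-Agent is the 6th
# field when the line is split on double quotes.
classify_user_agents() {
  awk -F'"' '
    {
      ua = tolower($6)
      if (ua == "" || ua == "-")
        bucket = "empty"
      else if (ua ~ /googlebot|bingbot|yandexbot|baiduspider|duckduckbot/)
        bucket = "declared_crawler"      # claims to be a search crawler
      else if (ua ~ /headlesschrome|python-requests|curl|wget|scrapy/)
        bucket = "automation_tool"       # illustrative list of tooling
      else if (ua ~ /bot|crawler|spider/)
        bucket = "other_declared_bot"
      else
        bucket = "browser_like"          # looks like a browser, unverified
      count[bucket]++
    }
    END { for (b in count) printf "%d %s\n", count[b], b }
  ' | sort -rn
}

# Usage: classify_user_agents < /var/log/nginx/access.log
```

Because the User-Agent is whatever the client chooses to send, these buckets describe claims rather than verified identities. The "browser_like" bucket in particular still needs the IP and request-pattern checks described below.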

Traffic Categories

# Category 1: Search crawlers — low rate, identifiable, beneficial
# Googlebot, Bingbot, YandexBot, Baiduspider, DuckDuckBot
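# Verify via reverse DNS: the PTR name should end in .googlebot.com or .google.com (Google) or .search.msn.com (Bing),
# and a forward lookup of that name should return the same IP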

# Category 2: Vulnerability scanners — high noise, potentially dangerous
# Requesting /wp-admin, /.env, /config.php, /adminer.php
# Match on the request path only ($7) so referrers and User-Agents do not cause false hits
awk '$7 ~ /wp-admin|\.env|config\.php|adminer/ {print $1}' /var/log/nginx/access.log | sort | uniq -c | sort -rn

# Category 3: Content scrapers — sequential requests, no assets
# Check if an IP requests pages but never CSS/JS/images
# Compare page requests to asset requests per IP
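
The Category 3 check is described above only in comments, so here is one way it might look in practice. This is a sketch under stated assumptions: the function name `find_asset_free_clients` is hypothetical, assets are recognized purely by file extension, and the default threshold of 20 page requests is an arbitrary starting point to tune for your site.

```shell
# Sketch: list IPs that request pages but never static assets.
# Assumptions: combined log format ($1 = client IP, $7 = request path);
# the extension list and the default threshold are illustrative.
find_asset_free_clients() {
  min_pages="${1:-20}"
  awk -v min_pages="$min_pages" '
    {
      ip = $1
      path = $7
      sub(/[?].*/, "", path)             # drop the query string
      if (tolower(path) ~ /\.(css|js|png|jpe?g|gif|svg|ico|webp|woff2?)$/)
        assets[ip]++
      else
        pages[ip]++
    }
    END {
      # Report only IPs with enough page hits and zero asset hits.
      for (ip in pages)
        if (pages[ip] >= min_pages && assets[ip] + 0 == 0)
          printf "%d %s\n", pages[ip], ip
    }
  ' | sort -rn
}

# Usage: find_asset_free_clients 50 < /var/log/nginx/access.log
```

Treat the output as a signal rather than proof. Legitimate API clients, RSS readers, and browsers with a warm cache can also show few or no asset requests, so it works best combined with the IP-type and User-Agent checks.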

Real-World Scenario

A hosting provider noticed that their support portal was slow. Access log analysis showed that 60% of traffic came from 3 IP ranges, all sending similar User-Agent strings containing "HeadlessChrome". These were price-scraping bots from competing hosting companies, hitting the pricing page every 30 seconds. The fix was to rate-limit those specific pages to 10 requests/minute per IP, which dropped server load by 40% without affecting real users. The OpsCheck Blacklist Checker showed that none of the IPs were on public blacklists, which fits commercial scraping rather than malware.

# One-liner to find the busiest IPs
awk '{print $1}' /var/log/nginx/access.log | sort | uniq -c | sort -rn | head -20

# Cross-reference each busy IP's User-Agent
grep "^198\.51\.100\.10 " /var/log/nginx/access.log | awk -F'"' '{print $6}' | sort | uniq -c
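
Before choosing a rate limit like the one in the scenario, it can help to measure how fast each client actually hits the page in question. The sketch below reports each IP's busiest single minute for a given path prefix. The function name `peak_rate_per_ip` is hypothetical, and it assumes the combined log format with nginx's default timestamp layout (for example `[31/May/2026:10:00:05 +0000]`).

```shell
# Sketch: peak requests per minute, per IP, for one path prefix.
# Assumptions: $1 = client IP, $4 = "[dd/Mon/yyyy:HH:MM:SS", $7 = path.
peak_rate_per_ip() {
  path_prefix="$1"
  awk -v prefix="$path_prefix" '
    index($7, prefix) == 1 {
      # Characters 2..18 of $4 are "dd/Mon/yyyy:HH:MM", i.e. the minute.
      minute = substr($4, 2, 17)
      n = ++hits[$1 SUBSEP minute]
      if (n > peak[$1]) peak[$1] = n     # remember the busiest minute per IP
    }
    END { for (ip in peak) printf "%d %s\n", peak[ip], ip }
  ' | sort -rn
}

# Usage: peak_rate_per_ip /pricing < /var/log/nginx/access.log | head -20
```

Comparing these peaks against a candidate limit shows which clients the limit would throttle and whether any real users would be caught by it.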

Classification Checklist

  • Separate by IP type: residential vs datacenter/cloud
  • Categorize User-Agent: declared bot, generic browser, custom/none
  • Check request pattern: sequential IDs, no assets, high rate = scraper
  • Verify crawler identity: reverse DNS must resolve to the crawler operator's domain, and a forward lookup of that name must return the same IP
  • Look for vulnerability scanner patterns: probing admin paths, config files
  • Rate-limit appropriately: don't block search crawlers, do throttle scrapers
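
The "verify crawler identity" item in the checklist can be sketched as a forward-confirmed reverse DNS check. This is a minimal example, assuming `dig` is installed. Both function names are hypothetical, the domain list covers only Google and Bing as an illustration, and the forward step checks A records only, so IPv6 clients would need an additional AAAA lookup.

```shell
# Sketch: forward-confirmed reverse DNS for an IP claiming to be a crawler.

# Succeeds when the hostname sits under a domain the crawler operator uses.
# The domain list is an illustrative assumption; extend it per crawler.
hostname_is_crawler_domain() {
  case "${1%.}" in                       # strip the trailing dot dig prints
    *.googlebot.com|*.google.com|*.search.msn.com) return 0 ;;
    *) return 1 ;;
  esac
}

verify_crawler_ip() {
  ip="$1"
  # Step 1: reverse lookup (PTR record) for the IP.
  ptr=$(dig +short -x "$ip" | head -n 1)
  [ -n "$ptr" ] || return 1
  # Step 2: the PTR name must belong to a known crawler domain.
  hostname_is_crawler_domain "$ptr" || return 1
  # Step 3: the forward lookup of that name must return the same IP.
  dig +short "$ptr" A | grep -qxF "$ip"
}

# Usage:
#   verify_crawler_ip 198.51.100.10 && echo "verified" || echo "not verified"
```

The forward step matters because anyone who controls the reverse zone for their own IP range can publish a PTR record that merely looks like a crawler hostname. Only the operator of the real domain can make that name resolve back to the IP.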