If this takes off, I've made my own variant of llms.txt here: https://boehs.org/llms.txt . I hereby release this file to the public domain, if you wish to adapt and reuse it on your own site.
I've seen some of these bots take a lot of CPU on my server, especially when browsing my (very small) forgejo instance. I banned them with a 444 error [1] in the reverse proxy settings as a temporary measure that became permanent, and then some more from this list [2], but I will consider yours as well, thanks for sharing.
if ($http_user_agent ~ facebook) { return 444; }
if ($http_user_agent ~ Amazonbot) { return 444; }
if ($http_user_agent ~ Bytespider) { return 444; }
if ($http_user_agent ~ GPTBot) { return 444; }
if ($http_user_agent ~ ClaudeBot) { return 444; }
if ($http_user_agent ~ ImagesiftBot) { return 444; }
if ($http_user_agent ~ CCBot) { return 444; }
if ($http_user_agent ~ ChatGPT-User) { return 444; }
if ($http_user_agent ~ omgili) { return 444; }
if ($http_user_agent ~ Diffbot) { return 444; }
if ($http_user_agent ~ Claude-Web) { return 444; }
if ($http_user_agent ~ PerplexityBot) { return 444; }
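The chain of `if` blocks above can also be collapsed into a single `map`, which nginx generally handles more efficiently than repeated `if` checks. A rough sketch (the variable name `$block_ua` is my own, not from the config above):

```nginx
# Same blocklist expressed as a map in the http context; $block_ua is a
# made-up variable name, and the groupings are arbitrary.
map $http_user_agent $block_ua {
    default                                             0;
    ~*(facebook|Amazonbot|Bytespider|GPTBot|ClaudeBot)  1;
    ~*(ImagesiftBot|CCBot|ChatGPT-User|omgili|Diffbot)  1;
    ~*(Claude-Web|PerplexityBot)                        1;
}

server {
    # ... existing server config ...
    if ($block_ua) { return 444; }
}
```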
As much as these companies should respect our preferences, it's very clear that they won't. It wouldn't matter to these companies if it were outright illegal; "pretty please" certainly isn't going to cut it. You can't stop scraping, and the harder people try, the worse their sites become for everyone else. Throwing up a robots.txt or llms.txt that calls out their bad behavior isn't a bad idea, but it's not likely to help anything either.
In one of my robots.txt files I have "Crawl-Delay: 20" for all User-Agents. Pretty much every search bot respects that Crawl-Delay, even the shady ones. But one of the best-known AI bots launched a crawl requesting about 2 pages per second. It was so intense that it got banned by the "limit_req_" and "limit_rate_" settings in the nginx config. Now I have it configured to always get a 444 by user agent and IP range, no matter how much it requests.
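For reference, the nginx rate limiting mentioned here can be set up roughly like this; the zone name, rates, and location are illustrative, not the commenter's actual values:

```nginx
# Sketch: allow ~1 request/s per client IP on a dynamic endpoint, with a
# small burst allowance, and cap per-connection bandwidth. All names and
# numbers here are placeholders.
limit_req_zone $binary_remote_addr zone=perip:10m rate=1r/s;

server {
    location /app/ {
        limit_req zone=perip burst=5 nodelay;
        limit_rate 512k;  # per-connection bandwidth cap
        # ... rest of the location config ...
    }
}
```

Clients that exceed the limit get a 503 (or the code set with `limit_req_status`) rather than a dropped connection, which is what made the crawl visible in the logs.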
You can do it in a few places, but I use my network firewall for this (I use pfSense at home, but there are many enterprise-grade brands).
It's common to use the host's firewall as well (nftables, firewalld, or iptables).
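As a sketch, a host-level ban with nftables might look like this; the table and chain names follow common convention, and the address range is a documentation range (TEST-NET-3), not any real crawler's:

```
# Drop traffic from an example range before it ever reaches the web server.
table inet filter {
    chain input {
        type filter hook input priority 0; policy accept;
        ip saddr 203.0.113.0/24 drop
    }
}
```

Load it with `nft -f <file>`, or add a single rule interactively with `nft add rule inet filter input ip saddr 203.0.113.0/24 drop`.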
You can do it at the webserver too, with the allow/deny directives of ngx_http_access_module in nginx; Apache uses the mod_authz_* modules.
I usually do it at the network though so it uses the least amount of resources (no connection ever gets to the webserver). Though if you only have access to your webserver it's faster to ban it there than to send a request to the network team (depending on your org, some orgs might have this automated).
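A minimal sketch of the webserver-level version in nginx, again with a documentation range standing in for a real crawler's addresses:

```nginx
# ngx_http_access_module: rules are checked in order, first match wins.
location / {
    deny 203.0.113.0/24;  # example range, not a real bot's
    allow all;
}
```

Denied clients get a 403 by default; to close the connection without a response instead, combine this with an `error_page 403 =444` style rewrite or use the geo/map approach with `return 444`.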
> a crawl requesting about 2 pages per second. It was so intense [...]
Does 2 pages per second really count as "intense" activity? Even if I were hosting a website on a $5 VPS, I don't think I'd notice anything short of 100 requests per second, in terms of resource usage.
In my scenario you request one single page from the proxy endpoint, and all other requests go straight to the static files and have no limits. I know that no human needs to request more than 1/s from the proxy, unless you are opening tabs frantically. So far, I only get praise about how responsive and quick the sites are: being harsh with the abusers means more resources for the regulars.
Amazonbot, anthropic-ai, AwarioRssBot, AwarioSmartBot, Bytespider, CCBot, ChatGPT-User, ClaudeBot, Claude-Web, cohere-ai, DataForSeoBot, Diffbot, Webzio-Extended, FacebookBot, FriendlyCrawler, Google-Extended, GPTBot, OAI-SearchBot, ImagesiftBot, Meta-ExternalAgent, Meta-ExternalFetcher, omgili, omgilibot, PerplexityBot, Quora-Bot, TurnitinBot
For each of these bots, add a stanza to robots.txt:

User-agent: <Bot Name>
Disallow: /
For more information, check https://darkvisitors.com/agents
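With a list that long, it's easier to generate the robots.txt stanzas than to type them. A small sketch (the list below is abbreviated; fill in the full set of names from the comment above):

```python
# Generate "User-agent: <name> / Disallow: /" stanzas for a list of
# crawler names. BOTS is a shortened sample of the list above.
BOTS = [
    "Amazonbot", "anthropic-ai", "Bytespider", "CCBot", "ChatGPT-User",
    "ClaudeBot", "GPTBot", "PerplexityBot", "TurnitinBot",
]

def robots_txt(bots):
    """Return robots.txt text disallowing every bot in `bots`."""
    return "\n\n".join(f"User-agent: {b}\nDisallow: /" for b in bots)

print(robots_txt(BOTS))
```

Redirect the output into your robots.txt (or append it below any existing Allow rules).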
Hall of shame: https://www.404media.co/websites-are-blocking-the-wrong-ai-s...