If this takes off, I've made my own variant of llms.txt here: https://boehs.org/llms.txt . I hereby release this file to the public domain, if you wish to adapt and reuse it on your own site.
I've seen some of these bots take a lot of CPU on my server, especially when browsing my (very small) forgejo instance. I banned them with a 444 error [1] in the reverse proxy settings as a temporary measure that became permanent, and then some more from this list [2], but I will consider yours as well, thanks for sharing.
if ($http_user_agent ~ facebook) { return 444; }
if ($http_user_agent ~ Amazonbot) { return 444; }
if ($http_user_agent ~ Bytespider) { return 444; }
if ($http_user_agent ~ GPTBot) { return 444; }
if ($http_user_agent ~ ClaudeBot) { return 444; }
if ($http_user_agent ~ ImagesiftBot) { return 444; }
if ($http_user_agent ~ CCBot) { return 444; }
if ($http_user_agent ~ ChatGPT-User) { return 444; }
if ($http_user_agent ~ omgili) { return 444; }
if ($http_user_agent ~ Diffbot) { return 444; }
if ($http_user_agent ~ Claude-Web) { return 444; }
if ($http_user_agent ~ PerplexityBot) { return 444; }
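The chain of `if` blocks above can also be collapsed into a single `map`, which nginx generally handles more efficiently than repeated `if` checks. A rough sketch (the variable name `$block_ua` is my own, not from the config above):

```nginx
# Same blocklist expressed as a map in the http context; $block_ua is a
# made-up variable name, and the groupings are arbitrary.
map $http_user_agent $block_ua {
    default                                             0;
    ~*(facebook|Amazonbot|Bytespider|GPTBot|ClaudeBot)  1;
    ~*(ImagesiftBot|CCBot|ChatGPT-User|omgili|Diffbot)  1;
    ~*(Claude-Web|PerplexityBot)                        1;
}

server {
    # ... existing server config ...
    if ($block_ua) { return 444; }
}
```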
As much as these companies should respect our preferences, it's very clear that they won't. It wouldn't matter to these companies if it were outright illegal; "pretty please" certainly isn't going to cut it. You can't stop scraping, and the harder people try, the worse their sites become for everyone else. Throwing up a robots.txt or llms.txt that calls out their bad behavior isn't a bad idea, but it's not likely to help anything either.
In one of my robots.txt files I have "Crawl-Delay: 20" for all User-Agents. Pretty much every search bot respects that Crawl-Delay, even the shady ones. But one of the best-known AI bots launched a crawl requesting about 2 pages per second. It was so intense that it got banned by the "limit_req_" and "limit_rate_" settings in the nginx config. Now I have it configured to always get a 444 by user agent and IP range, no matter how much it requests.
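For reference, the nginx rate limiting mentioned here can be set up roughly like this; the zone name, rates, and location are illustrative, not the commenter's actual values:

```nginx
# Sketch: allow ~1 request/s per client IP on a dynamic endpoint, with a
# small burst allowance, and cap per-connection bandwidth. All names and
# numbers here are placeholders.
limit_req_zone $binary_remote_addr zone=perip:10m rate=1r/s;

server {
    location /app/ {
        limit_req zone=perip burst=5 nodelay;
        limit_rate 512k;  # per-connection bandwidth cap
        # ... rest of the location config ...
    }
}
```

Clients that exceed the limit get a 503 (or the code set with `limit_req_status`) rather than a dropped connection, which is what made the crawl visible in the logs.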
You can do it in a few places, but I use my network firewall for this (I use pfSense at home, but there are many enterprise-grade brands).
It's common to use the host's firewall as well (nftables, firewalld, or iptables).
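As a sketch, a host-level ban with nftables might look like this; the table and chain names follow common convention, and the address range is a documentation range (TEST-NET-3), not any real crawler's:

```
# Drop traffic from an example range before it ever reaches the web server.
table inet filter {
    chain input {
        type filter hook input priority 0; policy accept;
        ip saddr 203.0.113.0/24 drop
    }
}
```

Load it with `nft -f <file>`, or add a single rule interactively with `nft add rule inet filter input ip saddr 203.0.113.0/24 drop`.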
You can do it at the webserver too, with the allow/deny directives of ngx_http_access_module in nginx; Apache uses the mod_authz_* modules.
I usually do it at the network though so it uses the least amount of resources (no connection ever gets to the webserver). Though if you only have access to your webserver it's faster to ban it there than to send a request to the network team (depending on your org, some orgs might have this automated).
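A minimal sketch of the webserver-level version in nginx, again with a documentation range standing in for a real crawler's addresses:

```nginx
# ngx_http_access_module: rules are checked in order, first match wins.
location / {
    deny 203.0.113.0/24;  # example range, not a real bot's
    allow all;
}
```

Denied clients get a 403 by default; to close the connection without a response instead, combine this with an `error_page 403 =444` style rewrite or use the geo/map approach with `return 444`.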
> a crawl requesting about 2 pages per second. It was so intense [...]
Does 2 pages per second really count as "intense" activity? Even if I were hosting a website on a $5 VPS, I don't think I'd notice anything short of 100 requests per second, in terms of resource usage.
In my scenario you request one single page from the proxy endpoint, and all other requests go straight to the static files and have no limits. I know that no human needs to request more than 1/s from the proxy, unless you are opening tabs frantically. So far, I only get praise about how responsive and quick the sites are: being harsh with the abusers means more resources for the regulars.
Amazonbot, anthropic-ai, AwarioRssBot, AwarioSmartBot, Bytespider, CCBot, ChatGPT-User, ClaudeBot, Claude-Web, cohere-ai, DataForSeoBot, Diffbot, Webzio-Extended, FacebookBot, FriendlyCrawler, Google-Extended, GPTBot, OAI-SearchBot, ImagesiftBot, Meta-ExternalAgent, Meta-ExternalFetcher, omgili, omgilibot, PerplexityBot, Quora-Bot, TurnitinBot
For each of these bots, add a stanza to robots.txt:

User-agent: <Bot Name>
Disallow: /
For more information, check https://darkvisitors.com/agents
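With a list that long, it's easier to generate the robots.txt stanzas than to type them. A small sketch (the list below is abbreviated; fill in the full set of names from the comment above):

```python
# Generate "User-agent: <name> / Disallow: /" stanzas for a list of
# crawler names. BOTS is a shortened sample of the list above.
BOTS = [
    "Amazonbot", "anthropic-ai", "Bytespider", "CCBot", "ChatGPT-User",
    "ClaudeBot", "GPTBot", "PerplexityBot", "TurnitinBot",
]

def robots_txt(bots):
    """Return robots.txt text disallowing every bot in `bots`."""
    return "\n\n".join(f"User-agent: {b}\nDisallow: /" for b in bots)

print(robots_txt(BOTS))
```

Redirect the output into your robots.txt (or append it below any existing Allow rules).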
Hall of shame: https://www.404media.co/websites-are-blocking-the-wrong-ai-s...