
To disallow:

Amazonbot, anthropic-ai, AwarioRssBot, AwarioSmartBot, Bytespider, CCBot, ChatGPT-User, ClaudeBot, Claude-Web, cohere-ai, DataForSeoBot, Diffbot, Webzio-Extended, FacebookBot, FriendlyCrawler, Google-Extended, GPTBot, OAI-SearchBot, ImagesiftBot, Meta-ExternalAgent, Meta-ExternalFetcher, omgili, omgilibot, PerplexityBot, Quora-Bot, TurnitinBot

For all of these bots:

    User-agent: <Bot Name>
    Disallow: /

For more information, check https://darkvisitors.com/agents
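Expanded for a few of the agents above, the resulting robots.txt would look like this (one User-agent/Disallow pair per bot; the names are taken from the list above):

    User-agent: GPTBot
    Disallow: /

    User-agent: ClaudeBot
    Disallow: /

    User-agent: CCBot
    Disallow: /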

If this takes off, I've made my own variant of llms.txt here: https://boehs.org/llms.txt . I hereby release this file to the public domain, if you wish to adapt and reuse it on your own site.

Hall of shame: https://www.404media.co/websites-are-blocking-the-wrong-ai-s...



I've seen some of these bots take a lot of CPU on my server, especially when browsing my (very small) forgejo instance. I banned them with a 444 error [1] in the reverse proxy settings as a temporary measure that became permanent, and then some more from this list [2], but I will consider yours as well, thanks for sharing.

    if ($http_user_agent ~ facebook) { return 444; }
    if ($http_user_agent ~ Amazonbot) { return 444; }
    if ($http_user_agent ~ Bytespider) { return 444; }
    if ($http_user_agent ~ GPTBot) { return 444; }
    if ($http_user_agent ~ ClaudeBot) { return 444; }
    if ($http_user_agent ~ ImagesiftBot) { return 444; }
    if ($http_user_agent ~ CCBot) { return 444; }
    if ($http_user_agent ~ ChatGPT-User) { return 444; }
    if ($http_user_agent ~ omgili) { return 444; }
    if ($http_user_agent ~ Diffbot) { return 444; }
    if ($http_user_agent ~ Claude-Web) { return 444; }
    if ($http_user_agent ~ PerplexityBot) { return 444; }
(edit: see replies to do it in a cleaner way)

[1] https://en.wikipedia.org/wiki/List_of_HTTP_status_codes#ngin...

[2] https://blog.cloudflare.com/declaring-your-aindependence-blo...


In your nginx.conf, http block, add

    include /etc/nginx/useragent.rules;
In /etc/nginx/useragent.rules

    map $http_user_agent $badagent {
        default 0;
        ~facebook 1;
        [...]
        ~PerplexityBot 1;
    }
In your site.conf, server block, add

    if ($badagent) {
        return 444;
    }


Does anyone know of a crowd-sourced list of these user agents? With the current state of AI startups, it will be hard to keep this up to date by myself.


Ideally I would like these crawlers to access /robots.txt but nothing else.

Only if they ignore robots.txt will the access rules stop them.


You can probably write a specific location block for robots.txt which will have a higher priority.

See also https://stackoverflow.com/questions/5238377/nginx-location-p...
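A sketch of what that could look like (the paths are assumptions, and $badagent refers to the map from the earlier comment; an exact-match location wins over prefix and regex locations, so blocked crawlers can still read robots.txt):

    server {
        # exact-match location: always served, even to bad agents
        location = /robots.txt {
            root /var/www/html;
        }

        location / {
            # $badagent comes from the map in useragent.rules
            if ($badagent) {
                return 444;
            }
            root /var/www/html;
        }
    }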


Ah, nice, way better than my string of ifs.


As much as these companies should respect our preferences, it's very clear that they won't. It wouldn't matter to these companies if it was outright illegal, "pretty please" certainly isn't going to cut it. You can't stop scraping and the harder people try the worse their sites become for everyone else. Throwing up a robots.txt or llms.txt that calls out their bad behavior isn't a bad idea, but it's not likely to help anything either.


In one of my robots.txt files I have "Crawl-Delay: 20" for all user agents. Pretty much every search bot respects that Crawl-Delay, even the shady ones. But one of the best-known AI bots launched a crawl requesting about 2 pages per second. It was so intense that it got banned by the "limit_req_" and "limit_rate_" settings of the nginx config. Now I have it configured to always get a 444 by user agent and IP range, no matter how much it requests.
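A rough sketch of the kind of "limit_req_" and "limit_rate_" setup being described (the zone name, rates, and limits here are made up for illustration):

    # http block: track clients by IP, allow 1 request/second,
    # keep up to 10 MB of state for the zone
    limit_req_zone $binary_remote_addr zone=perip:10m rate=1r/s;

    server {
        location / {
            # allow short bursts of 5, reject the rest with 429
            limit_req zone=perip burst=5 nodelay;
            limit_req_status 429;

            # throttle bandwidth after the first 1 MB of a response
            limit_rate_after 1m;
            limit_rate 100k;
        }
    }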


Rookie question, how do you ban an ip range?


In your nginx, server section:

    deny 1.2.3.0/24;
And all 256 IPs from 1.2.3.0 to 1.2.3.255 get banned. You can have multiple "deny" lines, or put them in a file and include it.
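The include-file variant could look like this (the file path is an assumption):

    # /etc/nginx/denylist.conf
    deny 1.2.3.0/24;
    deny 5.6.7.0/24;

    # in the server block of your site.conf:
    include /etc/nginx/denylist.conf;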

It's better to do it at the firewall.


You can do it in a few places, but I use my network firewall for this (I use pfSense at home, but there are many enterprise-grade brands).

It's common to use the host's firewall as well (nftables, firewalld, or iptables).

You can do it at the web server too, with allow/deny rules in nginx; Apache uses mod_authz_host.

I usually do it at the network though so it uses the least amount of resources (no connection ever gets to the webserver). Though if you only have access to your webserver it's faster to ban it there than to send a request to the network team (depending on your org, some orgs might have this automated).
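For the host-firewall route, a minimal nftables sketch (table and chain names are assumptions; run as root):

    # create a table and an input hook chain, then drop the range
    nft add table inet filter
    nft add chain inet filter input '{ type filter hook input priority 0; policy accept; }'
    nft add rule inet filter input ip saddr 1.2.3.0/24 drop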


> a crawl requesting about 2 pages per second. It was so intense [...]

Does 2 pages per second really count as "intense" activity? Even if I was hosting a website on a $5 VPS, I don't think I'd notice anything short of 100 requests per second, in terms of resource usage.


I assumed that he meant per client. Having a limit of 2 pages a second for a single client seems like a reasonable amount to me.


If you open DevTools and visit any website these days, you'll be surprised.


In my scenario you request one single page from the proxy endpoint, and all other requests go straight to the static files and have no limits. I know that no human needs to request more than 1/s from the proxy, unless you are opening tabs frantically. So far, I only get praises about how responsive and quick the sites are: being harsh with the abusers means more resources for the regulars.


Downvoted for asking a completely reasonable question? Where am I?


Always find it amusing when people write about "blocking" requests using robots.txt as if they are deploying a firewall


Agreed, all it takes is another site to copy the content, then an LLM could just scrape that…

An open web that blocks scraping… is likely “not an open web”


There's a typo in your file: achive


s/consider of/consider if/


Consider if course this



