
That's easy to say when it's your bot, but I've been on the other side enough to know that the problem isn't your bot, it's the 9,000 other ones just like it, none of which will ever deliver traffic worth anywhere near the resources they consume by scraping.


True. Major search engines and bots from social networks have a clear value proposition: in exchange for consuming my resources, they help drive human traffic to my site. GPTBot et al. will probably do the same, as more people use AI to replace search.

A random scraper, on the other hand, just racks up my AWS bill and contributes nothing in return. You'd have to be very, very convincing in your bot description (yes, I do check out the link in the user-agent string to see what the bot claims to be for) in order to justify using other people's resources on a large scale and not giving anything back.

An open web that is accessible to all sounds great, but that ideal only holds between consenting adults. Not parasites.


> GPTBot et al. will probably do the same, as more people use AI to replace search.

It really won’t. It will steal your website’s content and regurgitate it back out in a mangled form to any lazy prompt that gets prodded into it. GPT bots are a perfect example of the parasites you speak of that have destroyed any possibility of an open web.


Only if the GPT companies can resist the temptation of all that advertising $$$.

I'll give them at most 3 years before sponsored links begin appearing in the output and "AI optimization" becomes a fashionable service alongside the SEO snake oil. Most publishers won't care whether their content is mangled or not, as long as it is regurgitated with the right keywords and links.


What do you mean sponsored links? It'll be a sponsored reply, no outbound links required.


Links to a site where the sponsor's product or service can be purchased, obviously.


That was my hunch. My initial post on robots.txt: https://evgeniipendragon.com/posts/i-am-disallowing-all-craw... - revolved around blocking AI crawlers for exactly that reason: I do not believe they will bring more traffic to my website - they will use the content to keep people on their own service. I might be proven wrong in the future, but I do not see why they would pass up an extra opportunity to increase retention.
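For reference, the relevant robots.txt stanzas are tiny. This is the shape of a block-everything policy with an AI crawler called out explicitly (GPTBot is OpenAI's published crawler token; compliance is entirely voluntary on the crawler's side):

    User-agent: GPTBot
    Disallow: /

    User-agent: *
    Disallow: /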


Which is all you need a lot of the time. If you're a hotel, or restaurant, or selling a product, or have a blog to share information important to you, then all you need is for the LLM to share it with the user.

"Yes, there are a lot of great restaurants in Chicago that cater to vegans and people who enjoy musical theater. Dancing Dandelions in River North is one." or "One way to handle dogs defecating in your lawn is with Poop-Be-Gone, a non-toxic product that dissolves the poop."

It's not great for people who sell text online (journalists, I guess, who else?). But that's probably not the majority of content.


You're making a great point. In some cases, making your data as available as possible is the best thing you can do for your business. Letting bots crawl and scrape it is how your product gets found and advertised.

In other cases, like technical writing, you might want to protect the data. There is a danger that your content will be stolen and nothing will be given in return - traffic, money, references, etc.


> but that ideal only holds between consenting adults.

If your webserver serves up the page, you've already pre-consented.

One of my retirement plans has a monthly statement available as a PDF document. We're allowed to download that. But the bot I wrote to download it once a month was having trouble; they used some fancy bot-detection library to cockblock it. Wasn't allowed to use Mechanize. Why? Who the fuck knows. I'm only allowed to have that statement if I can be bothered to spend 15 minutes a month remembering how to fucking find it on their site and downloading it manually, rather than just saving a copy. Banks are even worse... they won't keep a copy of your statements longer than 6 months, but go apeshit if you try to have those automatically downloaded.
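The bot in question doesn't have to be anything exotic, either. A minimal sketch of that kind of statement downloader, where every URL, form field, and path is hypothetical (a real portal will differ, and the bot detection described above may block plain HTTP clients regardless):

    # Hypothetical monthly statement downloader - the portal URL, login
    # form fields, and statement path are all made up for illustration.
    import datetime
    import pathlib
    import requests

    BASE = "https://portal.example.com"  # hypothetical portal

    def download_statement(user: str, password: str, outdir: str = "statements") -> None:
        s = requests.Session()
        # Default library user agents are often fingerprinted and blocked.
        s.headers["User-Agent"] = "Mozilla/5.0"
        s.post(f"{BASE}/login", data={"username": user, "password": password})
        month = datetime.date.today().strftime("%Y-%m")
        r = s.get(f"{BASE}/statements/{month}.pdf")
        r.raise_for_status()
        out = pathlib.Path(outdir)
        out.mkdir(exist_ok=True)
        (out / f"{month}.pdf").write_bytes(r.content)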

I don't ask permission or play nice anymore. Your robots.txt is ignorable, so I ignore it. I do what I want, and you're the problem not me.


All our customers were in North America, but we let a semi-naughty bot from the UK scan us, and I will never understand why. It was still requesting malformed URLs we had purged from the site years ago. WTF.


I'm confused why scraping is so resource-intensive - it hits every URL your site serves? For an individual ecommerce site, that's maybe 10,000 hits?


And there are thousands of other bots also hitting those URLs; together that's far more than the legitimate traffic for many sites.


Yeah, there were times, even running a fairly busy site, when the bots would outnumber user traffic 10:1 or more, and the bots loved to endlessly trawl through things like archive indexes that could be computationally (db) expensive. At one point it got so bad that I got permission to just blackhole all of .cn and .ru, since of course none of those bots even half-obeyed robots.txt. That literally cut CPU load on the database server by more than half.
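In practice that blackholing happens at the firewall or edge, but the matching logic itself is simple. A rough Python sketch, assuming a hypothetical one-CIDR-per-line dump of the relevant country allocations:

    # Sketch of the matching logic only - real blackholing belongs in the
    # firewall. Assumes a hypothetical file of country CIDR allocations,
    # one network per line (e.g. "1.2.3.0/24").
    import ipaddress

    def load_blocklist(path: str = "cn-ru-cidrs.txt"):
        with open(path) as f:
            return [ipaddress.ip_network(line.strip())
                    for line in f if line.strip()]

    BLOCKED = load_blocklist()

    def is_blackholed(client_ip: str) -> bool:
        addr = ipaddress.ip_address(client_ip)
        return any(addr in net for net in BLOCKED)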


In the last month, according to Cloudflare, bot traffic on my forum has exploded to 10:1 due to LLM bots.

It would be one thing if it were driving more users to my forum. But human usage hasn't changed much, and the bots drop the cache hit rate from 70% to 4% because they go so deep into old forum content.

I'd be curious to see a breakdown of what the bots are doing. On demand searches? General data scraping? I ended up blocking them with CF's Bot Blocker toggle, but I'd allow them if it were doing something beneficial for me.


For me (as I'm sure for plenty of other people as well) limiting traffic to actual users matters a lot because I'm using a free hosting tier for the time being. Bots could quickly exhaust it, and your website could be unavailable for the rest of the current "free billing" cycle, i.e. until your quota gets renewed.


With thousands of bots per month each making 10,000 hits on an ecommerce site with product images, that's a lot of data transfer, and a lot of compute if your site has badly designed caching or none at all, re-rendering the same page components millions of times over. But...

Part of the problem is all those companies using AWS "standard practice" services, who assume the cost of bandwidth is simply what AWS charges and compute-per-page is what it is, and don't optimise either (e.g. serving from S3/EC2/Lambda instead of CloudFront).

I've just compared AWS egress charges against the best I can trivially get at Hetzner (small cloud VMs serving as a bulk HTTPS cache).

You get an astonishing 392x(!) more HTTPS egress from Hetzner for the same price, or equivalently 392x cheaper for the same amount.

You can comfortably serve 100+ TB/month that way. With 10,000 pages times 1,000 bots per month, that gives you 10 MB per page, which is more than almost any ecommerce site uses, once you factor in that bots (other than very badly coded ones) won't fetch the common resources (JS etc.) repeatedly for each page, only the unique elements (e.g. HTML and per-product images).
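A back-of-envelope version of that arithmetic (the per-GB and per-VM prices here are assumptions for illustration; check current AWS and Hetzner pricing before relying on the exact ratio):

    # Rough cost comparison - the prices are illustrative assumptions.
    aws_egress_per_gb = 0.09      # USD/GB, typical AWS internet egress tier
    hetzner_monthly = 4.60        # USD-equivalent for a small cloud VM
    hetzner_included_tb = 20      # egress bundled with that VM

    hetzner_per_gb = hetzner_monthly / (hetzner_included_tb * 1024)
    print(round(aws_egress_per_gb / hetzner_per_gb))  # ~400x, same ballpark as 392x

    # And the per-page budget: 100 TB across 10,000 pages x 1,000 bots
    fetches = 10_000 * 1_000
    print(100 * 1024 * 1024 / fetches)                # ~10.5 MB per page fetch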


You’ve forgotten the combinatorics of query params.

10,000 is Monday. Morning.
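To put numbers on that: a naive crawler that treats every query-string combination as a distinct URL turns a 10,000-page catalog into millions of fetches. The facet counts below are hypothetical, purely for illustration:

    # Hypothetical facets on an ecommerce catalog - every combination is a
    # distinct URL to a crawler that doesn't canonicalise query params.
    listing_pages = 10_000
    sort_orders = 6       # price asc/desc, name, newest, rating, relevance
    page_sizes = 3        # 24 / 48 / 96 items per page
    filters = 12          # e.g. colour facets

    print(f"{listing_pages * sort_orders * page_sizes * filters:,}")  # 2,160,000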


Yes. That’s a lot of bandwidth, depending on the content of course.



