
That's easy to say when it's your bot, but I've been on the other side enough to know that the problem isn't your bot, it's the 9,000 other ones just like it, none of which will ever deliver traffic worth anywhere near the resources they consume by scraping.


True. Major search engines and bots from social networks have a clear value proposition: in exchange for consuming my resources, they help drive human traffic to my site. GPTBot et al. will probably do the same, as more people use AI to replace search.

A random scraper, on the other hand, just racks up my AWS bill and contributes nothing in return. You'd have to be very, very convincing in your bot description (yes, I do check out the link in the user-agent string to see what the bot claims to be for) in order to justify using other people's resources on a large scale and not giving anything back.

An open web that is accessible to all sounds great, but that ideal only holds between consenting adults. Not parasites.


> GPTBot et al. will probably do the same, as more people use AI to replace search.

It really won’t. It will steal your website’s content and regurgitate it back out in a mangled form to any lazy prompt that gets prodded into it. GPT bots are a perfect example of the parasites you speak of that have destroyed any possibility of an open web.


Only if the GPT companies can resist the temptation of all that advertising $$$.

I'll give them at most 3 years before sponsored links begin appearing in the output and "AI optimization" becomes a fashionable service alongside the SEO snake oil. Most publishers won't care whether their content is mangled or not, as long as it is regurgitated with the right keywords and links.


What do you mean sponsored links? It'll be a sponsored reply, no outbound links required.


Links to a site where the sponsor's product or service can be purchased, obviously.


That was my hunch. My initial post on robots.txt: https://evgeniipendragon.com/posts/i-am-disallowing-all-craw... - revolved around blocking AI crawlers for exactly that reason: I do not believe they will bring more traffic to my website - they will use the content to keep people on their own service. I might be proven wrong in the future, but I do not see why they would pass up an extra opportunity to increase retention.
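For reference, the relevant robots.txt stanzas are tiny. This is the shape of a block-everything policy with an AI crawler called out explicitly (GPTBot is OpenAI's published crawler token; compliance is entirely voluntary on the crawler's side):

    User-agent: GPTBot
    Disallow: /

    User-agent: *
    Disallow: /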


Which is all you need a lot of the time. If you're a hotel, or restaurant, or selling a product, or have a blog to share information important to you, then all you need is for the LLM to share it with the user.

"Yes, there are a lot of great restaurants in Chicago that cater to vegans and people who enjoy musical theater. Dancing Dandelions in River North is one." or "One way to handle dogs defecating in your lawn is with Poop-Be-Gone, a non-toxic product that dissolves the poop."

It's not great for people who sell text online (journalists, I guess, who else?). But that's probably not the majority of content.


You're making a great point. In some cases, making your data as available as possible is the best thing you can do for your business. Letting bots crawl and scrape it is how your product gets found and advertised.

In other cases, like technical writing, you might want to protect the data. There is a danger that your content will be stolen and nothing will be given in return - traffic, money, references, etc.


> but that ideal only holds between consenting adults.

If your webserver serves up the page, you've already pre-consented.

One of my retirement plans has a monthly statement available as a PDF document. We're allowed to download that. But the bot I wrote to download it once a month was having trouble; they used some fancy bot-detection library to cockblock it. Wasn't allowed to use Mechanize. Why? Who the fuck knows. I'm only allowed to have that statement if I can be bothered to spend 15 minutes a month remembering how to fucking find it on their site and downloading it manually, rather than just saving a copy. Banks are even worse... they won't keep a copy of your statements longer than 6 months, but go apeshit if you try to have those automatically downloaded.
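The bot in question doesn't have to be anything exotic, either. A minimal sketch of that kind of statement downloader, where every URL, form field, and path is hypothetical (a real portal will differ, and the bot detection described above may block plain HTTP clients regardless):

    # Hypothetical monthly statement downloader - the portal URL, login
    # form fields, and statement path are all made up for illustration.
    import datetime
    import pathlib
    import requests

    BASE = "https://portal.example.com"  # hypothetical portal

    def download_statement(user: str, password: str, outdir: str = "statements") -> None:
        s = requests.Session()
        # Default library user agents are often fingerprinted and blocked.
        s.headers["User-Agent"] = "Mozilla/5.0"
        s.post(f"{BASE}/login", data={"username": user, "password": password})
        month = datetime.date.today().strftime("%Y-%m")
        r = s.get(f"{BASE}/statements/{month}.pdf")
        r.raise_for_status()
        out = pathlib.Path(outdir)
        out.mkdir(exist_ok=True)
        (out / f"{month}.pdf").write_bytes(r.content)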

I don't ask permission or play nice anymore. Your robots.txt is ignorable, so I ignore it. I do what I want, and you're the problem not me.


All our customers were in North America, but we let a semi-naughty bot from the UK scan us, and I will never understand why. It was still requesting malformed URLs we had purged from the site years ago. WTF.


I'm confused why scraping is so resource-intensive - it hits every URL your site serves? For an individual ecommerce site, that's maybe 10,000 hits?


And there are thousands of other bots also hitting those URLs; together that's far more than the legitimate traffic for many sites.


Yeah, there were times, even running a fairly busy site, when the bots would outnumber user traffic 10:1 or more, and the bots loved to endlessly trawl through things like archive indexes that could be computationally (db) expensive. At one point it got so bad that I got permission to just blackhole all of .cn and .ru, since of course none of those bots even half-obeyed robots.txt. That literally cut CPU load on the database server by more than half.
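In practice that blackholing happens at the firewall or edge, but the matching logic itself is simple. A rough Python sketch, assuming a hypothetical one-CIDR-per-line dump of the relevant country allocations:

    # Sketch of the matching logic only - real blackholing belongs in the
    # firewall. Assumes a hypothetical file of country CIDR allocations,
    # one network per line (e.g. "1.2.3.0/24").
    import ipaddress

    def load_blocklist(path: str = "cn-ru-cidrs.txt"):
        with open(path) as f:
            return [ipaddress.ip_network(line.strip())
                    for line in f if line.strip()]

    BLOCKED = load_blocklist()

    def is_blackholed(client_ip: str) -> bool:
        addr = ipaddress.ip_address(client_ip)
        return any(addr in net for net in BLOCKED)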


In the last month, according to Cloudflare, bot traffic on my forum has exploded to 10:1 due to LLM bots.

It would be one thing if it were driving more users to my forum. But human usage hasn't changed much, and the bots drop the cache hit rate from 70% to 4% because they go so deep into old forum content.

I'd be curious to see a breakdown of what the bots are doing. On demand searches? General data scraping? I ended up blocking them with CF's Bot Blocker toggle, but I'd allow them if it were doing something beneficial for me.


For me (as I'm sure for plenty of other people as well) limiting traffic to actual users matters a lot because I'm using a free hosting tier for the time being. Bots could quickly exhaust it, and your website could be unavailable for the rest of the current "free billing" cycle, i.e. until your quota gets renewed.


With thousands of bots per month each making 10,000 hits on an ecommerce site with product images, that's a lot of data transfer, and a lot of compute if your site has badly designed caching or none at all, re-rendering the same page components millions of times over. But...

Part of the problem is all those companies using AWS "standard practice" services, who assume the cost of bandwidth is simply what AWS charges and compute-per-page is what it is, and don't optimise either (e.g. serving from S3/EC2/Lambda instead of CloudFront).

I've just compared AWS egress charges against the best I can trivially get at Hetzner (small cloud VMs serving as a bulk HTTPS cache).

You get an astonishing 392x(!) more HTTPS egress from Hetzner for the same price, or equivalently 392x cheaper for the same amount.

You can comfortably serve 100+ TB/month that way. With 10,000 pages times 1,000 bots per month, that gives you 10 MB per page, which is more than almost any ecommerce site uses, once you factor in that bots (other than very badly coded ones) won't fetch the common resources (JS etc.) repeatedly for each page, only the unique elements (e.g. HTML and per-product images).
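A back-of-envelope version of that arithmetic (the per-GB and per-VM prices here are assumptions for illustration; check current AWS and Hetzner pricing before relying on the exact ratio):

    # Rough cost comparison - the prices are illustrative assumptions.
    aws_egress_per_gb = 0.09      # USD/GB, typical AWS internet egress tier
    hetzner_monthly = 4.60        # USD-equivalent for a small cloud VM
    hetzner_included_tb = 20      # egress bundled with that VM

    hetzner_per_gb = hetzner_monthly / (hetzner_included_tb * 1024)
    print(round(aws_egress_per_gb / hetzner_per_gb))  # ~400x, same ballpark as 392x

    # And the per-page budget: 100 TB across 10,000 pages x 1,000 bots
    fetches = 10_000 * 1_000
    print(100 * 1024 * 1024 / fetches)                # ~10.5 MB per page fetch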


You’ve forgotten the combinatorics of query params.

10,000 is Monday. Morning.
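To put numbers on that: a naive crawler that treats every query-string combination as a distinct URL turns a 10,000-page catalog into millions of fetches. The facet counts below are hypothetical, purely for illustration:

    # Hypothetical facets on an ecommerce catalog - every combination is a
    # distinct URL to a crawler that doesn't canonicalise query params.
    listing_pages = 10_000
    sort_orders = 6       # price asc/desc, name, newest, rating, relevance
    page_sizes = 3        # 24 / 48 / 96 items per page
    filters = 12          # e.g. colour facets

    print(f"{listing_pages * sort_orders * page_sizes * filters:,}")  # 2,160,000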


Yes. That’s a lot of bandwidth, depending on the content of course.



