
>We conducted an experiment by querying Perplexity AI with questions about these domains, and discovered Perplexity was still providing detailed information regarding the exact content hosted on each of these restricted domains

That's... less conclusive than I'd like to see, especially for a content-marketing article that's calling out a particular company. Specifically, it's unclear whether Perplexity was crawling (i.e. systematically viewing every page on the site without the direction of a human) or simply retrieving content on behalf of the user. I think most people would draw a distinction between the two, and would at least agree the latter is more acceptable than the former.



> Specifically, it's unclear whether Perplexity was crawling (i.e. systematically viewing every page on the site without the direction of a human) or simply retrieving content on behalf of the user.

Like most AI companies, Perplexity has established user agent strings for both of these cases, and the behavior Cloudflare is calling out uses neither. It pretends to be a person using Chrome on macOS.
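For reference, Perplexity's docs describe two agent strings: "PerplexityBot" for crawling and "Perplexity-User" for fetches made on behalf of a user (the exact full strings and version suffixes below are assumptions; check their published docs). A server that wanted to treat the two cases differently could do a simple substring check, sketched here:

```python
# Sketch: classify a request by the user-agent substrings Perplexity
# documents. "PerplexityBot" marks systematic crawling; "Perplexity-User"
# marks a fetch made on behalf of a specific user. Anything else is
# undeclared traffic - which is the category Cloudflare is complaining about.
def classify_agent(ua: str) -> str:
    if "PerplexityBot" in ua:
        return "crawler"
    if "Perplexity-User" in ua:
        return "user-fetch"
    return "undeclared"

# The disputed traffic presented a stock browser string, so it would
# land in the last bucket:
ua = ("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
      "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36")
print(classify_agent(ua))  # prints "undeclared"
```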


Sounds like an ad for Perplexity.

They do end up looking bad in Cloudflare's report, and Cloudflare comes off as the "good guys" in this story. (Cloudflare has been very pushy lately with its "we'll save the web", "Content Independence Day" marketing-speak, by the way.) But in the back of my head, Cloudflare's goodwill elevates Perplexity's cunning abilities (assuming they're even the culprit, since the OP offers no hard evidence, only hearsay). Both companies come off as titans fighting, which ends up being positive for Perplexity, at least in the inflated perception of their firepower... if that makes any sense.


Sounds like an ad for OpenAI, since Cloudflare reported how OpenAI is "following the rules".

Personally, I'm now less interested in using Perplexity, and more interested in using an OpenAI product.


Sounds like an ad for Cloudflare. Didn't they announce a month ago that they'd protect websites from LLM content sweeps? And now they realize they can't deliver on that promise. "We did it correctly, but these guys are doing it the illegal way! That'll be $14.99 per month, btw..."


In theory, retrieving a page on behalf of a user would be acceptable, but these are AI companies that have disregarded all norms surrounding copyright, etc. It would be stupid of them not to also save the contents of the page and use it for future AI training or further crawling.


If you allow Googlebot to crawl your website and train Gemini, but you don't allow smaller AI companies to do the same thing, then you're contributing to Google's hegemony. Given that AI is likely to be an increasingly important part of society in the future, that kind of discrimination is anti-social. I don't want a future where everything is run by Google even more than it currently is.

Crawling is legal. Training is presumably legal. Long may the little guys do both.


Googlebot respects robots.txt. And Google doesn't use data fetched by Chrome users to supplement its search index (as a2128 is speculating Perplexity might do when it fetches pages on the user's behalf).


Yes, but there's no way to say "allow indexing for search, but not for AI use", right?


But there is: https://developers.google.com/search/docs/crawling-indexing/...

There is a user agent for search that you can control in robots.txt.

    user-agent: Googlebot
There is another user agent for AI training.

    user-agent: Google-Extended
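Putting the two together, a site that wants to stay in Google Search while opting out of AI training could serve a robots.txt along these lines (a sketch based on Google's documented tokens; note that per Google's docs, Google-Extended is a robots.txt control token checked by the existing crawlers rather than a separate crawler):

```
User-agent: Googlebot
Allow: /

User-agent: Google-Extended
Disallow: /
```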


Wow, I had no idea this page existed, thanks for the reference!


The HTTP spec draws such a distinction, albeit implicitly, in the form (and name) of its concept of "user agent."


Over time it degraded into declaring compatibility with a bunch of different browser engines and doesn't reflect the actual agent anymore.

And very likely Perplexity is in fact using a Chrome-compatible engine to render the page.


The header to which you refer was named for the concept.


user agent = which bullshit css hacks and js polyfills will be needed


The examples the article cites look to me like Perplexity merely retrieving content on behalf of the user. I do not see a problem with this.


If the AI archives/caches all the results it accesses and enough people use it, doesn't it become a scraper? Just learn off the cached data. Being the man-in-the-middle seems like a pretty easy way to scrape salient content while also getting signals about that content's value.


No. The key difference is that if a user asks about a specific page, when Perplexity fetches that page, it is being operated by a human not acting as a crawler. It doesn’t matter how many times this happens or what they do with the result. If they aren’t recursively fetching pages, then they aren’t a crawler and robots.txt does not apply to them. robots.txt is not a generic access control mechanism, it is designed solely for automated clients.
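That scoping is baked into the protocol's tooling: robots.txt rules are keyed by robot user agent, and it is the automated client's job to check them voluntarily. Python's standard-library parser shows the mechanics (the agent tokens below reuse Google's documented ones purely as an example):

```python
from urllib import robotparser

# A robots.txt that allows search crawling but opts out of AI training,
# using Google's documented agent tokens as an example.
rules = """\
User-agent: Google-Extended
Disallow: /

User-agent: Googlebot
Allow: /
""".splitlines()

rp = robotparser.RobotFileParser()
rp.parse(rules)

# The parser answers per-agent: the crawler is expected to ask before
# fetching, which is why robots.txt only binds well-behaved robots.
print(rp.can_fetch("Googlebot", "/lyrics/page1"))        # prints True
print(rp.can_fetch("Google-Extended", "/lyrics/page1"))  # prints False
```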


Many people don't want their data used for free/any training. AI developers have been so repeatedly unethical that the well-earned Bayesian prior is that you cannot trust AI developers not to cross the training/inference streams.


> Many people don't want their data used for free/any training.

That is true. But robots.txt is not designed to give them the ability to prevent this.


It's in the name: rules for the robots. Any scraping, AI or not, whether mass recursive crawling or a single-page fetch, should abide by the rules.


I would only agree with this if we knew for sure that these on-demand, human-initiated fetches didn't result in the fetched page being added to an overall index and scheduled for future automated crawls.

Otherwise it's just adding an unwilling website to a crawl index, and showing the result of the first crawl as a byproduct of that action.


> It doesn’t matter how many times this happens or what they do with the result.

That's where you lost me, as this is key to GP's point above and it takes more than a mere out-of-left-field declaration that "it doesn't matter" to settle the question of whether it matters.

I think they raised an important point about using cached data to support functions beyond the scope of simple at-request page retrieval.


>If the AI archives/caches all the results it accesses and enough people use it, doesn't it become a scraper?

That's basically how many crowdsourced crawling/archive projects work, for instance sci-hub and RECAP [1]. Do you think they should be shut down as well? In both cases there's an even stronger justification for shutting them down, because the original content is paywalled and you could plausibly argue there's lost revenue on the line.

[1] https://en.wikipedia.org/wiki/Free_Law_Project#RECAP


I didn't suggest Perplexity should be shut down, though. And yes, in your analogy sites are completely justified to take whatever actions they can to block people who are building those caches.


> I think most people would draw a distinction between the two, and would at least agree the latter is more acceptable than the former.

No. I should be able to control which automated retrieval tools can scrape my site, regardless of who commands it.

We can play cat and mouse all day, but I control the content and I will always win: I can just take it down when annoyed badly enough. Then nobody gets the content, and we can all thank upstanding companies like Perplexity for that collapse of trust.


Taking down the content because you're annoyed that people are asking questions about it via an LLM interface doesn't seem like you're winning.

It's also a gift to your competitors.

You're certainly free to do it. It's just a really faint example of you being "in control" much less winning over LLM agents: Ok, so the people who cared about your content can't access it anymore because you "got back" at Perplexity, a company who will never notice.


In my case, my server kept going down because LLM agents kept requesting pages from my lyrics site. Removing that site allowed the other sites to remain up. True story.

Who cares if Perplexity never notices, or if competitors get an advantage? It's a negative for users, whether they use Perplexity or visit directly, because the content no longer exists.

That's the world perplexity and others are creating. They will be able to pull anything from the web but nothing will be left.


> Then nobody gets the content, and we can all thank upstanding companies like Perplexity for that collapse of trust.

But they didn't take down the content, you did. When people running websites take down content because people use Firefox with ad-blockers, I don't blame Firefox either, I blame the website.


Firefox isn't training their money printer with MY data. AI scrapers are.


>But they didn't take down the content, you did.

That skips the part about one party's unique role in the abuse of trust.


You don't win, because presumably you were providing the content for some reason, and forcing yourself to take it down is contrary to whatever reason that was in the first place.


LLMs hammer certain topics, so removing one site allows the others on the same server to live on.


You can limit access, sure: with ACLs, putting content behind a login, certificate-based mechanisms, and, at the end of the day, a power cord.

But really, controlling which automated retrieval tools are allowed has always been more of a code of honor than a technical control. And that trust you mention has always been broken. For as long as I can remember anyway. Remember LexiBot and AltaVista?



