Also, the main business model of Google (and of search engines in general) is to republish rearranged snippets of copyrighted content, and even to serve whole copies of it (the googleusercontent cache), without prior authorization from the copyright holders, for profit.
It’s completely illegal if you think about it.
So why should LLMs, which crawl the internet to present snippets and information, be treated differently from Google, which also reproduces the same content verbatim (text, images, code) without paying any compensation to the copyright owners?
Google would argue (and they won in federal court against the Authors Guild using this argument) that displaying snippets of publicly-crawlable websites constitutes "fair use." Profitability weighs against fair use, but it doesn't discount it outright.
They would also probably cite robots.txt as an easy and widely-accepted "opt-out" method.
Overall, I'm not sure any court would rule against Google's use of snippets for search. And since Google's been around for over 20 years and they haven't lost a lawsuit over it, I don't think it's accurate to say "it's completely illegal if you think about it."
US copyright law is one of those things that might seem simple, but really isn't. Hence many of the copyright lawsuits clogging our judicial system.
If I were a gambling person, I would bet that this interpretation of fair use will fall within the next 20 years: there is just too much weight being put on it currently, and AI is going to make it untenable in its current form.
In addition, the fair use test contains a pillar about the use not affecting the market for the copyright holder's works[1], which in Google's case (and probably in the current OpenAI case too) seems obviously not to have worked out: ie google's use has demonstrably negatively affected the market for the original copyrighted work in cases such as news, for example.
> ie google's use has demonstrably negatively affected the market for the original copyrighted work in cases such as news for example
Most news sites wouldn't get any traffic without search engines and aggregators. Which is why they are now whining about FB et al. no longer sending them traffic.
And let's not forget that both traditional and online news is no stranger to republishing other people's content - one of the reasons fair use exists in the first place.
I have no love for big tech, but let's not pretend that this is about anything other than news publishers wanting more gibs.
Well it's because judges are humans and humans are fallible. Humans also "like google" because it makes their life easier. It's hard to punish an entity you like.
The result of that is either that they wouldn't show snippets or that they would pass the cost on to you. And do you think they profit from showing the snippets of results that are not the result you want to click on?
Not wanting to defend the likes of Google, but search engines link to the original source (in contrast to LLMs). Their basic idea is to direct people to your content. There are countries where content companies didn't like what Google does: Google took them out of the index, and suddenly they were ok with it again so that Google would put them back in. (extremely simplified story)
> Their basic idea is to direct people to your content.
This is less and less true, as evidenced by the rise of zero-click searches.
> There are countries where content companies didn't like what Google does: Google took them out of the index, and suddenly they were ok with it again so that Google would put them back in.
I over-simplified. It's about Google News. The newspaper companies managed to lobby for a law requiring search providers to pay the newspapers they link to (or pay for the tiny excerpt shown in the search results). So Google said it would discontinue Google News in those countries. Suddenly the newspapers gave Google a free license to link to them. (still a simplified story)
Because search engines do not create a mishmash of this data to parrot stuff about it. They also don't strip the source or the license, and they stop scraping my site when I tell them to.
LLMs scrape my site and code, strip all identifying information and license, and provide/sell that to others for profit, without my consent.
There's a standard for excluding content from indexing: the Robots Exclusion Standard, via robots.txt (sitewide) or the "noindex" robots meta tag (per page). The robots.txt standard has existed for nearly 30 years, having first been proposed in February 1994.[1]
Should a publisher wish to be excluded from Google's, or any other web index's search and presentation, that's easy enough to specify.
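For instance, a blanket sitewide opt-out is only a few lines served at the site root. A minimal sketch (Googlebot is Google's documented crawler token; example.com is a placeholder):

```
# robots.txt, served at e.g. https://example.com/robots.txt

# Opt out of Google's crawler specifically
User-agent: Googlebot
Disallow: /

# Or opt out of every well-behaved crawler
User-agent: *
Disallow: /
```

The per-page equivalent is a robots meta tag in the page's head, e.g. <meta name="robots" content="noindex">.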
That's not how copyright law works at all. It doesn't say "well if you didn't want someone to copy this thing you should have stopped them from doing it". It lays out 4 factors for a court to consider about whether something is fair use and none of them are around how easy it was to rip the work off.[1]
In the LLM space it seems even clearer, because many/most of the works in the various corpora used for this training have very clear copyright terms which prevent digital storage and reproduction without the publisher's permission (just look at the reverse of the title page of any book for the copyright notice if you don't believe me).
Finally, for LLMs many/most of the works are in corpora[2] that people just download, so they aren't looking at a robots.txt file put up by the original site. If you look at The Pile paper[3], for example, the authors explicitly say that much of the material is under copyright and that they are relying on fair use.
Most critically, courts have put strong emphasis on the notion of transformative use of copyrighted works, and web indexing is transformative in the sense that it does not create a competing work, but provides a means of discovering and assessing the relevance of the indexed work itself.
As to web indexing, that (and associated factors including thumbnails and caching) have been ruled by courts to be fair-use adaptations of works:
Displaying a cached website in search engine results is a fair use and not an infringement. A “cache” refers to the temporary storage of an archival copy—often a copy of an image of part or all of a website. With cached technology it is possible to search Web pages that the website owner has permanently removed from display. An attorney/author sued Google when the company’s cached search results provided end users with copies of copyrighted works. The court held that Google did not infringe. Important factors: Google was considered passive in the activity—users chose whether to view the cached link. In addition, Google had an implied license to cache Web pages since owners of websites have the ability to turn on or turn off the caching of their sites using tags and code. In this case, the attorney/author knew of this ability and failed to turn off caching, making his claim against Google appear to be manufactured. (Field v. Google Inc., 412 F.Supp.2d 1106 (D. Nev., 2006).)
Or, to use your phrase, by common law (precedential case law), that is precisely "how copyright law works". Note particularly that the courts leaned on publishers' capabilities to indicate whether or not caching was or was not permitted "using tags and code".
There's a larger issue which I'm not aware of being explicitly raised in case law, which concerns how the World Wide Web is indexed as contrasted to how a print library is indexed. In the case of a library, an independent third party (the library cataloguer) assigns metadata to a work (standardised title, author(s), translator(s), illustrator(s), publisher(s), etc., as well as subject headings and call numbers). Additional indexing is provided through citation indices (both forward and reverse --- works cited by, and citing, other works). These largely don't rely on the text of the indexed work itself, though of course the cataloguer presumably reads at least portions of the work to classify it. Critically: the works themselves are physical artefacts of fixed form which are virtually always read directly rather than interpreted through some mechanism.[1]
As it's evolved over the past quarter century or so, Web search doesn't rely strongly on metadata (though some of this is taken into consideration), and most particularly publisher-provided keywords are almost wholly ignored, largely due to flagrant abuse of that feature by some publishers. Instead, a combined approach is used: full-text indexing (that is, capturing the full text of a work and identifying keywords and tuples (multi-word phrases) which can be matched against queries entered by persons searching for documents), plus an assessment of the overall relevance of that work, usually at a site (or sub-site) level, based on other indicia, most famously (though somewhat less relevantly today) "PageRank", Google's original site-ranking algorithm.
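The full-text-indexing half of that approach can be sketched in a few lines: build an inverted index of words and adjacent-word tuples, then match a query against it. This is a toy illustration, not any real engine's pipeline, and all names in it are made up:

```python
# Toy full-text index: maps each word and bigram ("tuple") to the set of
# document ids containing it, then answers queries by intersection.
from collections import defaultdict

def tokens(text):
    words = text.lower().split()
    # unigrams plus adjacent-word bigrams
    return words + [" ".join(p) for p in zip(words, words[1:])]

def build_index(docs):
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for tok in tokens(text):
            index[tok].add(doc_id)
    return index

def search(index, query):
    # return ids of documents matching every query token
    hits = [index.get(t, set()) for t in tokens(query)]
    return set.intersection(*hits) if hits else set()

docs = {
    1: "fair use and copyright law",
    2: "search engines index the web",
}
index = build_index(docs)
print(search(index, "copyright law"))  # {1}
```

A real engine would add ranking (e.g. something PageRank-like) on top of this boolean retrieval step; the sketch only shows the matching side.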
Further, the entire mechanism of the Web is one of creating copies of works on request. When an HTTP request is sent, the server responds by copying the requested work to an output stream, which is then received (and duplicated, often multiple times) by the client system as an integral part of the utilisation of that content. US copyright law does not have a section specifically referring to computer-network transmission, but there are multiple limitations on authors' exclusive rights to copy, above and beyond the section 107 fair use exemptions, in sections 108 through 122 of 17 U.S.C., including specifically ephemeral recordings (112) and the case of computer programmes (117).
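That copy-on-request mechanism can be observed directly with a throwaway local HTTP server: merely "browsing" the page leaves the client holding a faithful duplicate of the work. A minimal sketch (the page content is invented):

```python
# Demonstrates that serving a web page is literally copying its bytes:
# a throwaway local server serves PAGE, and an ordinary GET duplicates it.
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import urlopen

PAGE = b"<html><body>A copyrighted work.</body></html>"

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Length", str(len(PAGE)))
        self.end_headers()
        self.wfile.write(PAGE)        # the server copies the work out
    def log_message(self, *args):     # silence request logging
        pass

server = HTTPServer(("127.0.0.1", 0), Handler)
threading.Thread(target=server.serve_forever, daemon=True).start()

with urlopen(f"http://127.0.0.1:{server.server_port}/") as resp:
    copy = resp.read()                # the client now holds its own copy

server.shutdown()
print(copy == PAGE)  # True: a byte-for-byte duplicate was made just by fetching
```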
Large language model training is a new area of use and law (legislative or common) is yet to be determined, but there's at the very least existing statutory language as well as precedent which suggest that at least some uses might well be found to be fair use. As I'm watching the situation evolve, I'm reminded strongly of several articles copyright scholar Pamela Samuelson wrote in the 1990s over adapting copyright to the Internet age, and questions of what its future place might be: specific governance over the literal copying of expressive works, or a general doctrine against misappropriation. As always, there's a sharp tension between authors' rights (and, let's be brutally honest: publishers' profits) and the underlying Constitutional justification of US copyright law: "To promote the Progress of Science and useful Arts".
(Discussion here strongly reliant on US law. There's general international agreement on copyright through the Berne Convention, though significant national differences exist.)
________________________________
Notes:
1. There is a spectrum of works, e.g., print books, phonographs, CDs and DVDs (the latter containing anti-circumvention mechanisms), etc., but in general there's minimal if any intermediate copying and duplication of works, and in many cases none at all.
I appreciate the detail in your reply. Do you think the recent Warhol "Orange Prince" case[1] gives an inkling into possible future court treatment of the question of "transformative" use for generative AI models? There, Warhol's silk screen print of the original Prince photo was deemed not transformative enough, as I understand it. One of the things about the stochastic nature of generative AI is that it can be rather hard to notice when the model spits out something very close to the training material.
Google respects robots.txt and asks you to use it to opt out of their crawling.
Parent's point is that if your own scraping army respected your own "scraping.txt" and went after Google on the grounds that they don't opt out in a scraping.txt, it probably wouldn't fly.
I don't understand. How is it "rules for thee but not for me" if "google is allowed to scrape" whatever people allow Google to scrape, while "you're not allowed to scrape google" because, under the same rules, google.com/robots.txt says so?
There's an imbalance because the robots.txt rule is something Google pushed forward (they didn't invent it, but they made it standard) and it is opt-out. So yes, Google made up their rules and won't let other people make up their own self-beneficial rules in a similar way.
> Google [...] won't let other people make up their own self-beneficial rules in a similar way.
What "other people"?
If it's the "you" who is not allowed to scrape google in https://news.ycombinator.com/item?id=36817237 then you can make your own "google is not allowed to scrape my thing" rules if you think that's beneficial for you.
If it's somehow related to LLM providers or users I doubt that's what the original comment was referring to.
To be clear, I understand the original comment as
LLM companies say "I can use your content and you cannot prevent me from doing so, but I won't allow you to use the output of the LLM", just like Google says "I can scrape your content and you cannot prevent me from doing so, but I won't allow you to scrape the output of the search engine".
You should change "you cannot prevent me from doing so" into "you'll need to set up your resources in the way that I defined if you don't want me to slurp them".
I see it as the equivalent of spam mail that requires the user to log in to disable it.