Web Scraping and Crawling Are Perfectly Legal, Right? (2017)

modeless · on July 31, 2019

If search engines hadn't started scraping the web back in the 90s, imagine the outcry if you tried to start a search engine today. You would instantly be sued into oblivion. Copyright, trademark, TOS violations, etc. Nothing remotely similar to a modern search engine would ever be allowed to start if it wasn't grandfathered in by the time people started getting serious about law enforcement on the internet.

The law is not as clear cut as people imagine it to be. In many cases the text of the law is less important than precedent and established practice.

3xblah · on July 31, 2019

Who imagines the law to be clear cut? There is a lack of precedent and therefore ambiguity because the cases were never strong enough. They still are not strong enough. The ambiguity favors the incumbents, who have enough cash to finance any litigation they so choose. They do not want the law on scraping clarified through legal precedent lest they end up the losers.

The price of losing a weak case against a new search engine would be worse for an existing one than avoiding litigation and instead relying on other methods of blocking competition.

The OP admits he is not a lawyer yet proceeds to advise others. Why? He cannot distinguish between scraping and publishing scraped content, or between something being "illegal", i.e. a criminal offense, and something legal that someone perceives as violating their "rights" and that prompts them to file a civil suit. As is typical in the US, such suits may ultimately be meritless and purely used to intimidate.

There are law firms that make money from convincing people to file civil suits against potential competitors. They might list a number of potential causes of action against someone who visits your website. You might be able to sue your users. It does not mean you will win. In losing, you may create legal precedent that in effect enables more of them to violate your "rights".

LinkedIn is taking a calculated risk. If they continue to sue web users who use automation (as does LinkedIn), not every defendant is going to be intimidated. Some will defend themselves. I think one of them is going to win.

It is one thing to purport to limit how someone can use some data that you provide to them. That is not "scraping". That is something else. It is another thing to purport to limit how someone can choose to use their own computer, i.e., interactively versus non-interactively (automation).

tyingq · on July 31, 2019

Ticketmaster might be the exception...they were totally owned.

https://www.vice.com/en_us/article/mgxqb8/the-man-who-broke-...

pflenker · on July 31, 2019

You are definitely right - Google (News) provides a huge benefit to news publishers, it is responsible for most of their traffic, and still German lawmakers tried making a law against that so that, on top of everything Google does for the publishers, Google should also pay for that privilege - the infamous Leistungsschutzrecht[1].

[1]https://en.wikipedia.org/wiki/Ancillary_copyright_for_press_...

beamatronic · on July 31, 2019

If you tried it today, you would need a paid subscription to get access to a search engine

quickthrower2 · on July 31, 2019

There were paid search engines in the 90's. I don't recall the names. Maybe the fact I don't remember the name says it all!

ReinholdNiebuhr · on July 31, 2019

I'd be okay with a paid subscription service if I got privacy and some other things as well. There's room for both free and pay. Duckduckgo is ok, but not google. Would be nice to have google, without the bias and without the privacy issues.

481092 · on July 31, 2019

You'll never have a search engine without bias. A search engine's job is just that, to bias various data, not soak up any and all data on the internet. That's why humans or anyone or anything in this universe is biased, because we can only soak up so much info from so many proverbial directories in a lifetime.

nine_k · on July 31, 2019

The web may be biased. But a search engine that reflects that bias is just doing its job correctly.

One inevitable and useful bias a search engine has is the bias against spam. It's aligned with the interests of the user.

There can be (and are) biases that are against the users' interests but are aligned with operator's: promotion of some content, censorship of other content. This bias I would rather see gone.

kamfc · on July 31, 2019

For a second, thought you were describing Netflix's algorithm.

mosselman · on July 31, 2019

I always hear people complain about DuckDuckGo and it gives me the impression that they haven't actually used it for a while. I only use DuckDuckGo and I find it to be a great search engine.

treerock · on July 31, 2019

Except perhaps Google is actually better at search because it knows so much about you.

In that case a Google without the privacy issues is basically the same as duckduckgo.

fauigerzigerk · on July 31, 2019

>I'd be okay with a paid subscription service if I got privacy and some other things as well

Payment and privacy are a very problematic combination, especially where end-to-end encryption is essentially impossible.

benoliver999 · on July 31, 2019

Yeah "I'd be OK with a paid subscription" means "I want my searches tied to one account".

You pretty much get to choose one or the other.

0xcde4c3db · on July 31, 2019

Also: how many executives, seeing that corpus of data on people who are demonstrably willing to spend money, would pass up the opportunity to extract more money by finding a way to sell it to marketers?

cortesoft · on July 31, 2019

The original search engines (like Yahoo) required users to submit their websites. It worked well enough.

jasode · on July 31, 2019

>Yahoo) requires users to submit their websites. It worked well enough.

It didn't work well enough because back in 2000, Yahoo was one of the first companies to license Google's search engine. (The previous year, AOL licensed Google's search engine.)

Yahoo incorrectly thought that their "portal" and user-submitted "directories" were more important to web surfers than search algorithms like Google's Pagerank. They eventually realized their misjudgment and then tried to build a Google clone through acquisitions like Inktomi and Overture:

https://www.cnet.com/g00/news/yahoo-to-acquire-inktomi/

https://www.cbsnews.com/news/yahoo-buys-search-engine/

https://www.cnet.com/g00/news/yahoo-dumps-google-search-tech...

mgamache · on July 31, 2019

really? IMHO Yahoo was great only because we didn't have pagerank(TM). It led to purgatory, but produced superior results.

harryh · on July 31, 2019

There is no reason at all to believe this. Search engines provide a huge benefit to sites which is why they let search crawlers access them.

Even today anyone who wants to opt out of being in the Google (or MSFT, or whoever) search engine can trivially do so by updating their robots.txt. Very very few sites choose to do so.
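
To make the robots.txt opt-out concrete, here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt content and the `example.com` URL are made up for illustration, and `allowed` is a hypothetical helper name; real crawlers may interpret edge cases differently.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for a site that opts out of Google's crawler
# but stays open to every other user agent.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /

User-agent: *
Allow: /
"""


def allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file as a list of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "https://example.com/articles/1"
    print("Googlebot allowed:", allowed(ROBOTS_TXT, "Googlebot", url))
    print("SomeOtherBot allowed:", allowed(ROBOTS_TXT, "SomeOtherBot", url))
```

Note that robots.txt is only a request: it works because well-behaved crawlers check it before fetching, not because the server enforces it.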

tomhschmidt · on July 31, 2019

This article is a bit dated. AFAIK the latest in web scraping legality is LinkedIn vs. HiQ, where HiQ was scraping public LinkedIn profiles. LinkedIn issued a C&D under CFAA, but HiQ received an injunction that allowed it to continue scraping. This was supposed to be tried in the Ninth Circuit court over a year ago, but I'm not sure what happened.

https://www.eff.org/cases/hiq-v-linkedin

staticautomatic · on July 31, 2019

Pretty sure the ninth published what's now the controlling opinion in the circuit. YMMV in other circuits.

EDIT: Yep. hiQ Labs, Inc. v. LinkedIn Corp., 273 F. Supp. 3d 1099 (N.D. Cal. 2017)

"In summary, the balance of hardships tips sharply in hiQ’s favor. hiQ has demonstrated there are serious questions on the merits. In particular, the Court is doubtful that the Computer Fraud and Abuse Act may be invoked by LinkedIn to punish hiQ for accessing publicly available data; the broad interpretation of the CFAA advocated by LinkedIn, if adopted, could profoundly impact open access to the Internet, a result that Congress could not have intended when it enacted the CFAA over three decades ago. Furthermore, hiQ has raised serious questions as to whether LinkedIn, in blocking hiQ’s access to public data, possibly as a means of limiting competition, violates state law."

snazz · on July 31, 2019

That’s an extremely well-thought-out court finding, especially since they thought about how the CFAA was intended to be applied. Far too often we see a more liberal interpretation of the CFAA used to the detriment of a small player.

brighter2morrow · on July 31, 2019

>In particular, the Court is doubtful that the Computer Fraud and Abuse Act may be invoked by LinkedIn to punish hiQ for accessing publicly available data; the broad interpretation of the CFAA advocated by LinkedIn, if adopted, could profoundly impact open access to the Internet, a result that Congress could not have intended when it enacted the CFAA over three decades ago.

I didn't realize original intent could be used in courts. How the heck did "original intent" lead to federal abortion and federal gay marriage when all law in letter and in practice had delegated these questions to the states?

javagram · on July 31, 2019

Courts have a variety of legal theories available. They can pick and choose from textualism, original meaning, original intent, evolving meanings/living constitution, stare decisis, or common law jurisprudence (i.e. law made up by judges) to get the result they want or believe should be the law.

calebclark · on July 31, 2019

What happened to the web's early vision of an information superhighway, an open repository of data that could be linked, shared and remixed?

The efforts (and lawsuits) to lock down and control data on the public web threaten to stifle innovation. Websites are only valuable because they're connected to the internet, which for all intents and purposes, is a general utility. It was funded by the U.S. government, and is now managed by a consortium of NGOs such as ICANN, W3C, etc.

When you publish a website, you're distributing it to the world. It's similar to when you publish a book. It has always been the right of those who buy and read a book to be able to use the facts contained therein. Although readers are not allowed to steal the creative elements, they have a basic human right to use the facts however they desire. And the medium doesn't change this truth -- whether you read the book yourself, ask your friend to read it out loud, or employ a bot to read it. So it should be with websites.

Data placed on the public Web is accessible by everyone and should be usable by everyone. It shouldn't matter who parses the page's HTML, whether it's a person, a web browser, a bot or a robot.

We're starting a movement to ensure that the facts of the world are always available for innovators to build on top of:

https://dataliberationfoundation.org/

masonic · on July 31, 2019

  It's similar to when you publish a book.

When you publish a book, copyright and Fair Use still apply.

calebclark · on July 31, 2019

100% agree.

nshepperd · on July 31, 2019

The idea that you could be sued[1] for breach of contract for violating a ToS in a case like this is goddamn insane. A contract can only be formed if both parties agree to it; and you can't be forced to agree to a contract against your will. Coercing someone into entering a contract is literally illegal.

So just don't accept the ToS. They can't make you accept it, no matter what "by [breathing] you accept our ToS" verbiage they put in it.

Sure, maybe that would make your access of the website "hacking" ("accessing a computer without permission", since you rejected the contract which would grant permission)... but the court rejected that argument in HiQ vs LinkedIn.

[I get that this probably isn't how a typical court would see it, but to me that reflects pro-corporate bias and corruption more than any kind of sane application of common law.]

[1]. Edit: originally this read "charged", but "sued" is more accurate. But I guess what I really mean is "successfully sued". Anyhow.

rayiner · on July 31, 2019

You can’t coerce someone into a contract against their will, but you also can’t try to get the benefits of a contractual arrangement without agreeing to the contract. And agreement does not have to mean signing on a dotted line. For example, if you say “I’ll pay $100 for someone to clean my yard,” and someone comes and does it, the contract is binding (acceptance by performance).

In the Linked-In case here, someone didn’t just happen to involuntarily agree to the TOS by making an HTTP request. They agreed to the TOS as part of creating a new account. That’s an affirmative act, as part of a bargained-for exchange. It’s no more “coercive,” in substance or form, than the release I signed the other day to rent a jet ski. You want the benefit of access? Abide by the terms.

After having agreed to the terms, the Does violated them repeatedly on a mass scale. Again, this wasn’t accidental. The complaint lists a litany of ways where the scrapers clearly encountered technical barriers Linked In put in place, and used technical means to get around them.

hermitdev · on July 31, 2019

This is a catch-22. I see this mostly on US sites in regards to cookies.

Something along the lines of "by using this site, you agree to our terms of service and use of cookies". You'll definitely see a banner like this on paywalled news sites.

Thing is, I got to the site from a link from say HN. I haven't agreed to anything, let alone seen the ToS, let alone had a chance to affirm or unaffirm to the conditions until I've already used the website. Thus, they assume my tacit approval of their ToS. The only way to not agree to the ToS is to never have visited the site to begin with. Yet, you cannot know the ToS unless you visit the site, and even if you knew the exact link to their ToS, that is still construed as usage, and once again assumed implicit approval.

Additionally, if I'm accessing publicly available material without being authenticated, you cannot prove I ever agreed to your ToS.

I don't work on scrapers or anything, but I abhor the idea that I have implicitly entered into a contract by virtue of visiting a public, unauthenticated, webpage.

Want me into a contractual ToS? Make me sign up and sign in.

martin_a · on July 31, 2019

> Something along the lines "by using this site, you agree to our terms of service and use of cookies".

This is something I have seen a lot more of on German/European sites since the introduction of the GDPR.

The law text of the GDPR on the other hand is kind of strict in that you really have to agree to something by clicking and this "accept by doing nothing"-method is explicitly not valid.

Curious to see when the first people will go to court against this.

drewmol · on July 31, 2019

I’ve always assumed those “by using this * you agree to *” statements are meant for someone else. Presumably they’ve read and agree with whatever it is I’m skipping.

krageon · on July 31, 2019

People will not be going to court. It is up to the data protection agency of each respective country to build a case based on the available evidence (eg have they been informed, what was their response, have they been given adequate time to compensate, did they know they were breaking the law, etc).

martin_a · on July 31, 2019

You're right on this. I think we need a legal test case or whatever to stop this loophole, that's what I wanted to say.

EGreg · on July 31, 2019

Suppose a company allows anyone to access HTTP resources without authenticating.

In that case, can people get in trouble for scraping such HTTP resources at scale, all around the Internet?

I am worried not about breaching the TOS, but the kind of "hacking" definition that involves "circumventing the intended use" regardless of technological availability and never agreeing to a single contract.

rayiner · on July 31, 2019

If by “hacking” you’re talking about the CFAA, (1) violating that statute requires knowledge that you’re exceeding authorized access; (2) the Ninth Circuit has held that merely violating the TOS can’t be the basis for a CFAA violation anyway.

To go back to the Linked-In example of the article: the predicate for the CFAA claim is not merely scraping. It’s intentionally bypassing technological measures intended to block the scraping, after there was notice that the scraping was not permitted.

It’s very much like real life. If a shop keeper has the door open, he can’t sue you for trespass for walking in and looking around. That’s called implied license. But if he kicks you out and tells you not to come back, you can be prosecuted for trespass if you come back, even if the shop keeper leaves the door open for everyone else. An open web server is similar.

EGreg · on July 31, 2019

So, scraping data at scale is fine until they rate-limit you or cut you off. Which means if you used a botnet and rate-limited yourself "just to be safe" or "just to be nice" without knowing whether they liked it or not, then that's fine.
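
For what "rate-limited yourself" could look like in a single client, here is a minimal sketch of a self-throttle that enforces a minimum gap between requests. `MinIntervalLimiter` is a hypothetical name, and the 2-second interval in the usage example is an arbitrary assumption, not a threshold any site or court has endorsed. The clock and sleep functions are injectable so the logic can be tested without actually waiting.

```python
import time
from typing import Callable, Optional


class MinIntervalLimiter:
    """Ensure at least min_interval seconds pass between calls to wait()."""

    def __init__(
        self,
        min_interval: float,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last: Optional[float] = None  # time the previous request was released

    def wait(self) -> float:
        """Block until the next request is allowed; return the seconds slept."""
        now = self._clock()
        delay = 0.0
        if self._last is not None:
            # Sleep only for whatever part of the interval has not yet elapsed.
            delay = max(0.0, self.min_interval - (now - self._last))
            if delay > 0:
                self._sleep(delay)
        self._last = now + delay
        return delay


if __name__ == "__main__":
    limiter = MinIntervalLimiter(min_interval=2.0)  # arbitrary example value
    for page in range(3):
        slept = limiter.wait()
        # A real client would issue its HTTP request here.
        print(f"page {page}: slept {slept:.2f}s before requesting")
```

This only covers the "being nice" half of the comment: it says nothing about whether the site operator has objected, which is the part rayiner's trespass analogy turns on.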

onion2k · on July 31, 2019

There could be a strong argument that continuing to make additional http requests after the first one constitutes accepting the ToS. If you didn't accept them you shouldn't continue to use the website by making further requests, and you should discard the content of the first request without using it.

shakna · on July 31, 2019

> There could be a strong argument that continuing to make additional http requests after the first one constitutes accepting the ToS

You generally can't even _see_ the ToS without making further requests.

jammygit · on July 31, 2019

The legal innovation is this: by being here, you consent to <15 page contract>.

Imagine that in the real world. I honestly expect it to become commonplace - by entering our store, you agree to the terms posted in the binder you may ask to view.

rayiner · on July 31, 2019

At least in the Ninth Circuit, the rule is much more demanding than you imply: https://www.venable.com/insights/publications/2014/11/enforc....

> "[W]here a website makes its terms of use available via a conspicuous hyperlink on every page of the website but otherwise provides no notice to users nor prompts them to take any affirmative action to demonstrate assent, even close proximity of the hyperlink to relevant buttons users must click on – without more – is insufficient to give rise to constructive notice."

kg · on July 31, 2019

Many public places already have massive signs like this. Whether they're enforceable I have no idea, but the comparison is not direct because an owner of a business has the right to kick you out for almost any reason (there are exceptions for non-discrimination, of course).

sneak · on July 31, 2019

Every time I have seen one of those signs, it has contained provisions to describe manners, dress, or behavior common in poor people or minorities; it seems to me a way to codify and formalize discrimination against those groups.

filoleg · on July 31, 2019

That is kind of how it is already though.

For a specific example, it is legal to conceal carry in public spaces in my state, but I have seen a lot of stores having some extra rules posted on their window saying that by entering the store you agree to their no-carry rule, among others.

panarky · on July 31, 2019

I don't know where you live, but in Texas the sign in the store's window is backed up by state law.

If the property owner posts a sign that says no guns, and you carry anyway, you're guilty of an offence.

Source: https://statutes.capitol.texas.gov/Docs/PE/htm/PE.30.htm#30....

filoleg · on July 31, 2019

I think you misunderstood my comment. The parent was saying:

>Imagine that in the real world. I honestly expect it to become commonplace - by entering our store, you agree to the terms posted in the binder you may ask to view.

My point to this was that it isn't that much of a far-fetched scenario, as it already happens with "no guns on store premises" signs in my state. You consent to the arbitrary rule by entering the store. I bet it is already backed up by state law (not sure 100%, but I believe so), just like it is in Texas (I am in WA), so I agree with you.

sneak · on July 31, 2019

In Nevada, those signs do not carry force of law.

Spooky23 · on July 31, 2019

Already happens. Many private “public” spaces like malls or housing developments are governed by a code of conduct that you are subject to by being present.

The easy example is teen curfews at malls. If you look under 18, you consent to providing proof of age to enter certain areas of the mall, and cannot do so without an adult over a certain age.

rndgermandude · on July 31, 2019

This is different. They are enforcing their house rules, by expelling you for non-compliance and barring you from entering their property again. You are only allowed in there because they let you, after all.

They cannot however put a "If you're under 18 and violate the curfew, you hereby agree to pay a contractual penalty of $1000" in their house rules and sue you for that money even tho you did not consent to their contract.

escape_goat · on July 31, 2019

I am not a lawyer, so I will have to ask you to explain why not. It's not immediately obvious to me that the penalty doctrine would apply, as the penalty is not a mechanism to circumvent the recovery of damages through the courts.

rndgermandude · on July 31, 2019

First of all, also not a lawyer, but it is my understanding that the penalty doctrine would most likely apply, as they aren't trying to recover liquidated damages in a reasonable amount.

Moreover, it would fail the test of acceptance, because the teen in question would have had to have known about the contract and consented to it, neither of which the mall could likely demonstrate, unless e.g. they could prove they explicitly showed the thing to the teen before entering, not just hung it somewhere.

And lastly, the teen being a minor, the contract would be void in most jurisdictions anyway as teens lack the capacity to enter contracts in most jurisdictions.

If the teen caused actual damage, then tort law is still an option, of course. And if the teen refuses to leave or sneaks back in later, then we're talking criminal offenses.

Aperocky · on July 31, 2019

Legal adapted with technology. All the protocols that may look overly elaborate are repeated 1000s of times every day per user, and they cost very little compared to <insert 10MB javascript file>. The legal documents are just part of it.

m3kw9 · on July 31, 2019

By entering this store you agree to buy everything you can with what's in your wallet

EGreg · on July 31, 2019

Put it in tiny text or a marginally hard to read font, then sue people who couldn't be bothered to read the note.

mgamache · on July 31, 2019

yes, dark grey text on light grey background

sam0x17 · on July 31, 2019

But see, _you_, a person, are not here. A bot is here, and a bot can't enter a contract.

crazygringo · on July 31, 2019

You wrote and are running the bot, and are entirely legally responsible for its behavior, like it or not.

By your logic, if someone shoots a bullet and it kills someone else, it would be the bullet's fault and not the person who pulled the trigger...?

sam0x17 · on July 31, 2019

That's a completely different situation, legally. There are existing laws saying that I am liable for bad things my bot/creation/whatever does, but a TOS works differently -- it is a contract that a human has to read and agree to. They can put in a clause saying that "by reading this document you agree to [TOS]" or "by clicking this link you agree to [TOS]" or my favorite "by using this website you agree to [TOS]". No one is using/reading/accessing anything when the Googlebot crawls a web page, and there is no way you could have a bot agree to a TOS for you. If someone wrote a "honeypot" saying that if you access this URL you owe the author $200,000, there is no way Google would now owe that person $200,000. No human being read and agreed to a contract, so there is nothing to enforce.

Now legally if you can prove that the bot author is aware of the TOS, that's a different situation, but there are a variety of situations where it is completely reasonable (search engine bots in particular) where the author will never know about the TOS.

EGreg · on July 31, 2019

By your logic, the fault is of the bullet maker.

If someone uses your open source software to commit a crime, is that your fault?

sam0x17 · on Aug 1, 2019

Seriously though. If you honestly think a bot can enter a contract, I'll set up a honeypot right now with a TOS saying that by using this website you agree to pay me $300,000, and then I'll send Google a bill when they index me. You can't have your cake and eat it too -- either it works that way, and we all should set up honeypots right now, or it doesn't work that way, and any bot I write can scrape anything I want as long as it is fully unattended and I never agree to the TOS myself.

kg · on July 31, 2019

Bots can purchase and sell shares or futures and perform other financial transactions. Is an agreement to exchange shares for currency not a contract?

sgentle · on July 31, 2019

Your honour, I did not sign this contract. My pen signed it – I merely held the pen.

sam0x17 · on Aug 1, 2019

This is why for serious things we have notaries. People forge signatures all the time. With a digital signature, you lose a lot of the forensic information (handwriting style, etc) that would aid in proving that you in fact signed something, but I digress.

If I write a script that goes to every web page on the internet and fills in all text fields with "jdJdkjl28d", selects a random choice from all drop-downs, checks all check boxes, and hits submit, no one is going to seriously argue that I would be bound by any of those agreements because I didn't read them, and the logic I wrote isn't even aware that they are agreements -- it is just checking checkboxes regardless of what the form says.

This is why some companies make you scroll all the way down to "read" the agreement before you can check the box because then it can be argued that you actually read the agreement.

mirimir · on July 31, 2019

There is no "be charged" in civil cases. Anyone can sue anyone about anything. And as TFA notes, if you're sued, you must defend yourself. Or you can get a default judgment against you.

kazinator · on July 31, 2019

A default judgment is just a document that wants you to do something, no different from that TOS that you ignored.

lotu · on July 31, 2019

Except the court is empowered to make you follow it.

nshepperd · on July 31, 2019

Thanks, you are correct. I have updated my comment accordingly.

edoo · on July 31, 2019

To me it is analogous to creating a ToS for your house where you claim if anyone takes a picture of it from a public space you are now owed money. If you wanted to enter the house and take pictures but couldn't without agreeing to the ToS that is a different matter. This might be why all LinkedIn profiles are behind a registration wall now.

parhamn · on July 31, 2019

I'm actually curious what the actual legal requirement for 'acceptance' is here. Is using a website and knowing that it has a TOS and continuing using it implicit acceptance of the TOS? Can you simply put a "by using this website you agree to the TOS" banner without explicitly getting consent?

TheSpiceIsLife · on July 31, 2019

If you access data available via a http request with software that doesn’t display the website in the typical web-browser-sense, are you “using” the website?

PopeDotNinja · on July 31, 2019

One thing that just occurred to me is that pretty much every TOS says it can be updated at any time without notice. In theory they could update the TOS on a per request basis. Get request, decide caller has lots of money, and say in TOS "by scraping this site you owe us 14 bajillion dollars".

Alternatively, how can any website prove the TOS was even mentioned at the time a request was made? You can't enforce a contract that wasn't available to be consented to. You gonna put a TOS version at the head of every website, and have that version archived somewhere accessible later?
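
The "TOS version at the head of every website" idea can be sketched as a content hash plus an archive. This is only an illustration under assumptions: `tos_version`, `record_served`, and `TosRecord` are hypothetical names, the archive is an in-memory dict standing in for durable storage, and nothing here says what a court would accept as proof.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Dict, Optional


@dataclass(frozen=True)
class TosRecord:
    version: str    # hex SHA-256 of the exact ToS text that was in force
    served_at: str  # ISO 8601 UTC timestamp of the request


def tos_version(tos_text: str) -> str:
    """Derive a version identifier from the ToS text itself."""
    return hashlib.sha256(tos_text.encode("utf-8")).hexdigest()


def record_served(
    tos_text: str,
    archive: Dict[str, str],
    now: Optional[datetime] = None,
) -> TosRecord:
    """Archive the ToS text under its hash and return a record for this request."""
    version = tos_version(tos_text)
    # Keep one copy of each distinct text so a version id can be resolved later.
    archive.setdefault(version, tos_text)
    served = (now or datetime.now(timezone.utc)).isoformat()
    return TosRecord(version=version, served_at=served)


if __name__ == "__main__":
    archive: Dict[str, str] = {}
    record = record_served("Example terms, revision A.", archive)
    # A server could send record.version in a response header (assumed design).
    print(record.version[:12], record.served_at)
```

Because the identifier is derived from the text, any per-request edit to the terms would produce a different version string, which is exactly the kind of change the comment worries could otherwise go unnoticed.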

User23 · on July 31, 2019

NAL, but my understanding is that the Computer Fraud and Abuse Act[1] is what governs this. Essentially unauthorized access to a system is unlawful and authorization can be conditioned on using the system in a certain way, like, for example, not crawling it. Merely having a server openly serving data is no more of a justification for unauthorized access than leaving a door unlocked is justification for burglary.

Edit: This is also why you should never evade a ban. It's potentially a federal felony.

[1] https://en.wikipedia.org/wiki/Computer_Fraud_and_Abuse_Act

cortesoft · on July 31, 2019

You can logically argue the ridiculousness, but the law doesn't care... since courts have sided with TOS restraints, they absolutely can enforce it.

Also, they aren't 'forcing' you to accept their TOS. You are free to stop requesting from their site.

dfee · on July 31, 2019

Excellent insight. I never knew how to understand those anti-scraping policies.

nkozyra · on July 31, 2019

The short version is you can be sued for pretty much any reason. And as an individual you're unlikely to have the resources to defend yourself in such a suit.

In most cases described as "gray area" it comes down to cases where the stronger party made a compelling argument against what might have been a very valid reason to scrape. The threat is often wielded as intimidation and intimidation alone.

wills_forward · on July 31, 2019

So true. And the United States has 2x as many civil cases per capita as the next most litigious country, the United Kingdom. Tort reform just isn’t a sexy campaign promise either.

elliekelly · on July 31, 2019

The US is unusual in that if you bring a civil suit and lose you don’t have to compensate the other party for their legal fees.

Unusually high medical bills in the US are also probably a contributing factor. I’d be curious to see how often a civil suit is brought after a car accident in EU vs US.

PeterisP · on July 31, 2019

> I’d be curious to see how often a civil suit is brought after a car accident in EU vs US.

Depends on the country of course (UK is probably closer to USA instead of the rest of EU with civil law), but in general that would very rarely result in a dispute that needs to be resolved in court. The mandatory liability insurance would cover the usual claims without a civil suit, and unusual claims are unlikely to succeed and expensive to pursue, so they aren't.

fourthark · on July 31, 2019

It's pretty common after an auto accident in the US (at least in no-fault states) to sue the insurance company for pain and suffering. They will only cover medical bills and damages otherwise.

Viz. the derogatory term "ambulance chasers" for personal injury lawyers.

Causality1 · on July 31, 2019

I've always wondered why there's not a federal law requiring that the losing party of a suit is required to pay the winner's legal fees, up to an amount equal to what the losing party itself spent.

dooglius · on July 31, 2019

"The rationale for the American rule is that people should not be discouraged from seeking redress for perceived wrongs in court or from trying to extend coverage of the law. The rationale continues that society would suffer if a person was unwilling to pursue a meritorious claim merely because that person would have to pay the defendant's expenses if they lost."

via https://en.wikipedia.org/wiki/American_rule_(attorney%27s_fe...

Causality1 · on Aug 1, 2019

Seems like it would have the opposite effect, discouraging everyone but the wealthy from filing claims. 60% of Americans don't have an extra thousand dollars, let alone the means to support real legal representation.

TrackerFF · on July 31, 2019

Where I'm from (Scandinavia), we have laws like that. You pay for your own and the other side's legal fees.

While you rarely, if ever, see cases where losing parties have to pay millions - I personally haven't seen any like that - you can see people having to pay tens of thousands to low hundreds of thousands of dollars in legal fees.

That can be a substantial sum for regular people, and those circumstances happen when the other party has hired in high-$$$ lawyers, and the case goes on for a while. Especially if there's multiple appeals.

So there's always the risk of that.

jjeaff · on July 31, 2019

Seems like that would have the undesirable effect of making large corporations untouchable by anyone except the government and other large corporations.

If you want to sue a big company for treating you poorly, you risk a very high price if you lose.

yub · on July 31, 2019

That’s the point of the American Rule - every side pays their own way.

RichEO · on July 31, 2019

That only holds if the laws allow the large corporation to treat people unfairly.

icebraining · on July 31, 2019

Even if the law is fair in 99.9% of the cases, if that 0.1% means you're on the hook for more than you can afford, it's rational not to risk it.

liability · on July 31, 2019

If "lawyer" were a political party, they would control the Senate. Perhaps the American legislature goes too easy on the lawsuit industry out of some form of professional courtesy.

CaliforniaKarl · on July 31, 2019

Here's a hypothetical scenario:

You've just made a container image that does something cool, and have published it on your site, which is running on the cheapest Amazon Lightsail plan (in Ohio). The final container image is 50 MB in size.

(Yes, it's weird that you wouldn't publish it to Docker Hub or the like. Suspension of disbelief, please…)

A grad student somewhere decides they like your container. They write a script that downloads your container to $TMP, and runs it with their input. Over a week, the student runs their job 50,000 times.

(Running a compute job 50,000 times is not unusual in HPC. Each run has a different input, and since your cluster would have a job scheduler, breaking up the work helps make better use of the cluster.)

Because the student downloaded the container to $TMP, a new download has to happen each time the job is run. 50,000 downloads of a 50 MB container is 2.5 TB.

(The docs say TB and GB, so I'm assuming base-10 instead of base-2 here.)

Your lightsail plan is $3.50 per month, and includes 1 TB of transfer. You will have at least 1.5 TB of extra transfer this month. At 9¢ per GB, 1500 GB of extra transfer is $135.
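The arithmetic above can be checked with a short sketch. The figures are the ones from this scenario (50 MB image, 50,000 downloads, 1 TB included, 9¢ per GB of overage, base-10 units); the `overage_cost` function name is just for illustration.

```python
def overage_cost(image_mb, downloads, included_gb, price_per_gb):
    """Return (total_gb, overage_gb, overage_cost), using base-10 units."""
    total_gb = image_mb * downloads / 1000  # 1 GB = 1000 MB
    overage_gb = max(0.0, total_gb - included_gb)
    return total_gb, overage_gb, overage_gb * price_per_gb


# Numbers taken from the scenario described above.
total_gb, extra_gb, cost = overage_cost(
    image_mb=50, downloads=50_000, included_gb=1000, price_per_gb=0.09
)
print(f"{total_gb:.0f} GB transferred, {extra_gb:.0f} GB over plan, ${cost:.2f} extra")
```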

Please do not dismiss this out-of-hand. The scenario may seem a bit contrived, but unexpected bandwidth charges do hit people, and can make you really question your dedication, when having to pay for the access by people you don't know.

syntheticcdo · on July 31, 2019

I would be responsible. It is up to me - the server operator - to fulfill my contract with Amazon. I COULD have set up billing alerts, rate limiting, WAF, etc. to avoid being charged or to shut down the unexpected usage, but I didn't. I put something on the internet so I am responsible for it.
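For what it's worth, the rate limiting mentioned above can be as small as a token bucket per client. This is a minimal sketch, not tied to any particular web server or to Lightsail; the `TokenBucket` class, the `allow_download` helper, and the "5 downloads, then 1 per minute" policy are all made up for illustration.

```python
import time


class TokenBucket:
    """Allow bursts of up to `capacity` requests, refilled at `rate` tokens per second."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = float(capacity)
        self.last = clock()

    def allow(self):
        now = self.clock()
        # Refill in proportion to elapsed time, capped at the bucket capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


# One bucket per client address (hypothetical policy: 5 downloads, then 1 per minute).
buckets = {}


def allow_download(client_ip, clock=time.monotonic):
    if client_ip not in buckets:
        buckets[client_ip] = TokenBucket(rate=1 / 60, capacity=5, clock=clock)
    return buckets[client_ip].allow()
```

A server would call `allow_download(ip)` before serving the image and answer with an error (for example HTTP 429) when it returns `False`. Keying on the IP address is a simplification: a whole cluster behind one NAT address would share a single bucket.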

democracy · on July 31, 2019

You are right, and this is why billing alerting (like AWS's CloudWatch) is a good thing.

codingslave · on July 31, 2019

If you don't want to be charged for the compute power, don't make the service.

CaliforniaKarl · on July 31, 2019

In this example, the person who created the container image isn’t paying for the compute power, they’re paying for the bandwidth cost of those users downloading the thing.

elchief · on July 31, 2019

It's funny how a 3-line bash script is considered an automated agent, but the hundreds of millions of lines of code that Chrome uses to download, parse, and display HTML are somehow not "automated"

mantap · on July 31, 2019

Yes you can write that same bash script as a chrome extension and nobody will care. Many people use extensions that make automated requests not intended by the website operators.

Making automated requests isn't illegal and won't per se get you sued, it's literally how a web browser works. What gets you sued is pissing somebody off. So don't do that.

wruza · on July 31, 2019

How about someone who is pissed off by looks in a public place or by ignoring their claims. Maybe they should not leave their visually protected home and issue commands there instead?

Seriously, if one expects some privacy, they don’t live in a glass house and then complain if you see ‘em, no matter what ToS on the entrance claims.

gitgud · on July 31, 2019

Chrome (to my knowledge) doesn't automatically download webpages; it strictly downloads what you type into the URL bar.

This raises some questions:

- At what point does web-scraping become research?

- Is it considered scraping if a human is actioning it (no automation)?

- Could I manually click links and copy the data from the browser, is this scraping?

There's no strict definition, so it's up to judges and lawyers to decide what the definitions entail...

mantap · on July 31, 2019

Chrome does automatically download webpages: it supports prefetching and iframes.

Not only is it up to judges and lawyers, but there are >190 jurisdictions in which you could be sued. There are more important things to worry about, such as heart disease.

gitgud · on July 31, 2019

> There are more important things to worry about, such as heart disease.

Exactly, so much of humanity is wasting time needlessly arguing in courts... we need better systems of laws to reduce these redundant debates...

danso · on July 31, 2019

Why is the number of lines of code particularly relevant?

elchief · on July 31, 2019

Chrome uses a LOT of automation to do what it does

The actual number of lines is not important

hardwaresofton · on July 31, 2019

It seems like the biggest step forward that scraping/crawling could take is to get some relevant legal precedent -- is it possible to manufacture/make a case that could finally set the legal precedent for this?

For example, on the issue of insta-agree TOSes that you are in no way forced to read ("by being on this website you agree to our TOS" BS), is it possible to bring cases to establish precedent? Let's say I set up a site where the TOS says everyone who views the site agrees to give me $5 and I sue someone -- clearly that wouldn't hold up in court (gulp, hopefully) -- and having it struck down would help clear up this gray area...

hartator · on July 31, 2019

Disclaimer: I run SerpApi.com and IANAL.

In the US, the law has kind of settled in favor of scrapers, thanks to the First Amendment and the fair use exemption, as long as the data doesn’t require signing up and you are not DDoSing the website. Europe is another story.

CalRobert · on July 31, 2019

Do you have more of that story re: Europe? Also, how does this work in international scraping? (Does a French company have recourse against a US one via any treaty, for instance, or vice versa?)

mosselman · on July 31, 2019

Can you elaborate on the EU?

clashmeifyoucan · on July 31, 2019

Now, what I'm not sure about and don't have the means to consult a lawyer about is

1) How liable is the author of a scraper who put it on GitHub but never used it themselves? I ask because I remember seeing a lot of implementations and tutorials of tools like BeautifulSoup that don't necessarily comply with the ToS.

2) What if you edit the User-Agent to something that resembles a browser, provided the scraper is just substituting for browser interaction, e.g. those search-from-your-terminal applications? Could/would it be construed as impersonation or something?
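To make 2) concrete, this is roughly what "editing the User-Agent" amounts to in code. It is a sketch using Python's standard `urllib`; the URL and the browser-like string are made-up placeholders, and the request is only constructed, never sent.

```python
import urllib.request

# Hypothetical values, for illustration only.
URL = "https://example.com/search?q=test"
BROWSER_LIKE_UA = "Mozilla/5.0 (X11; Linux x86_64) ExampleTerminalSearch/0.1"

# Without an explicit header, urllib identifies itself as "Python-urllib/<version>".
request = urllib.request.Request(URL, headers={"User-Agent": BROWSER_LIKE_UA})

# urllib stores header names capitalized as "User-agent".
print(request.get_header("User-agent"))
```

Technically it is a one-line change; whether it counts as impersonation is the legal question, not a technical one.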

ospider · on July 31, 2019

Things are a little different here in China. The government just published a law under which, if your scraping traffic is more than one third of a website's total traffic, it is considered illegal.

Mirioron · on July 31, 2019

I hope there's some kind of minimum there or some warning system, because you might not know what a third of a specific website is. On the other hand, this kind of a rule seems somewhat sensible, because it prevents web scraping from ballooning the costs of the service.

meritt · on July 31, 2019

Whoa. Do you have a link to the legislation or news?

severine · on July 31, 2019

It doesn't reference that measure explicitly, but you might find this article interesting.

China’s Draft Data Security Measures and How They Compare to the GDPR: https://www.natlawreview.com/article/china-s-draft-data-secu...

jarfil · on July 31, 2019

1. Create a random web page with some random words.

2. Add a ToS notice stating that anyone downloading it now owes you $1000.

3. List every client IP in the logs and sue everyone even if you don't know who they belong to.

4. Profit?

rocky1138 · on July 31, 2019

No, thankfully.

http://www.internetlibrary.com/cases/lib_case456.cfm

grenoire · on July 31, 2019

How is the court ruling on that different from the LI situation?

0x4542 · on July 31, 2019

The author attempts to contrast web scraping and web crawling and gets it very wrong. This is what it actually is: web scraping - I download the content of a webpage or select webpages, web crawling - I download the content of the entire internet (or part thereof). How he can say that someone who's crawling the internet and building a search index with the contents of the downloaded data isn't also scraping the web, makes no sense.

kamfc · on July 31, 2019

It made billions for companies. The rule of thumb is to start off "legal", build a big audience with huge benefits to the companies you're borrowing the data from, then quickly shift to legal. If you're not fast enough, you're dead. Web scraping and crawling has become less acceptable because the market leaders carved out the niches before you in a very "legal" way, and DO NOT want you as a competitor.

hyfgfh · on July 31, 2019

"But law has apparently nothing to do with fairness. It's based on rules, interpreted by people." That`s why I`m chaotic!

mgamache · on July 31, 2019

Try to scrape scores or sports results or stock prices (and re-publish them). Pretty sure you will get a "cease and desist" if you get any traffic at all. That data is owned by corporations even if it is publicly displayed.

teddyh · on July 31, 2019

That data isn’t “owned”. Mere data such as you describe (sports results or stock prices) cannot be owned. There is no legal mechanism for it. Copyright can’t cover mere facts such as those, and patents or trademarks don’t apply, nor do trade secrets, obviously. However, the corporations involved sure would like you to think otherwise, and they will fight tooth and nail to keep it that way.

mgamache · on Aug 3, 2019

It's complicated...

https://pando.com/2014/02/06/who-owns-real-time-sports-data/

icebraining · on July 31, 2019

Well, not in the US. In the EU, on the other hand: https://en.wikipedia.org/wiki/Database_Directive

zaro · on July 31, 2019

I am confused. Where does robots.txt come into play here? I always thought that if a URL is allowed in robots.txt, it's fair game to scrape.
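For reference, the mechanical side of "allowed in robots.txt" looks something like this. It is a sketch using Python's standard `urllib.robotparser`; the robots.txt content, the `example-bot` agent name, and the URLs are invented for illustration, and a real crawler would fetch the file from the site instead of hard-coding it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, hard-coded so the example needs no network access.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def may_fetch(url, user_agent="example-bot"):
    """Return True if the parsed robots.txt permits this user agent to fetch url."""
    return parser.can_fetch(user_agent, url)


print(may_fetch("https://example.com/articles/1"))
print(may_fetch("https://example.com/private/data"))
```

This only shows how a crawler can check the file; whether passing that check makes a URL "fair game" legally is exactly what I'm unsure about.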

maitredusoi · on July 31, 2019

Is life perfectly legal?

tingletech · on July 31, 2019

needs a (2017)

patagonia · on July 31, 2019

Just a thought. The entire user data / data broker business is just businesses “scraping” my digital presence. No one is respecting my terms and conditions. They’re making trillions off that.

patagonia · on July 31, 2019

Comments accompanying downvotes always appreciated.