Good move. Google wants to digitize all the books in the world, and reCaptcha does just that. Google could have tried just becoming a client of reCaptcha, but their scale would have crushed reCaptcha, to say nothing of the business risk of outsourcing a key stage of your signup process to a third party.
The logical solution was to buy them and Google-size their infrastructure.
From a personal perspective, I don't think this really changes how I feel about reCaptcha either. They'll still have the same mission -- unless Google stops them reading out-of-copyright classics and puts them onto scanning Dan Brown novels, which I think is unlikely.
Since most modern books either have high-quality print copies or digital copies available, I doubt that this technology will be used for much else than scanning old books for the foreseeable future.
I can't believe I didn't see this coming. Given Google's size and mission statement, any company whose business generates new data must partner with Google, sell out to Google, or kill Google if it wants to survive.
reCAPTCHA has some pretty good network effects due to the amount of data they have. reCAPTCHA detects suspect behaviour from IP addresses/ranges/sites etc. and blocks users based on that.
They learned from the Google school of public relations. ("What, make money? Hah, hah, we're just a bunch of geeks who like interesting problems, no crass lucre here, no siree!")
So reCaptcha's basis is a new field known as Human Computation, which Luis invented himself; the term was coined in his Ph.D. thesis. Google, however, has been the only company to really embrace Human Computation, so it's not really a huge surprise that reCaptcha was sold to Google.
The most common phrase that uses the word lucre is "filthy lucre," a reference to money being something that distracts noble people from pursuing altruistic goals.
There are only so many times anyone wants to hear about his MacArthur Genius Award or how both Microsoft and Google fought over recruiting him and how instead he took the high road and decided to stay at CMU. Maybe it was because it was his first time leading 251 and he was out to prove he was worthy....
That said, he was a very good teacher. Great lectures with lots of student involvement. He also had some creative ways of catching cheaters.
I apologize in advance if some of the details are wrong. Please correct me if you remember this incident better:
When 251 started, everyone was required to sign an honesty policy about cheating, etc. The policy explicitly stated that students shouldn't use online (or other) resources for help while working on homework sets.
One of the early sets featured a fairly difficult problem. Many students succumbed to the power of the Google search box and typed in some keywords around the problem. In the first page of organic results there was a site full of algorithm questions and answers, including this problem but none of the others in the set. The site's domain looked innocent; I think it even had the word math in it. Many clicked, found the answer, and either used it to guide their thinking or copied it close to verbatim, cheating either way.
At the lecture following the assignment due date, Luis announced that he knew X number of people had cheated on the last homework set. He stated that anyone who came clean would receive a zero on the assignment but wouldn't be reported to the dean. Since many students worked on these problems as groups by splitting up the work, a few thought that their friends had ratted them out. Others had cheated differently and thought they had been caught but didn't know how. Regardless of the method, everyone who cheated came clean.
At the next lecture, Luis told us how he did it. The innocent site was actually his, running off a box in his lab. A simple whois confirmed it. He had logged all IPs accessing the page and correlated them with student usernames. This wasn't that far-fetched: since everyone at CMU registers their computer with computing services in order to get network access, a reverse DNS lookup to a hostname, which usually contains the username, would be enough to identify a student. Failing that, he could also have worked with computing services to get the original registration username.
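The correlation step could be sketched like this. It's a hypothetical reconstruction, assuming registered hostnames start with the owner's username; the function names and that naming convention are my own guesses, not anything from the actual incident:

```python
import socket

def username_from_hostname(hostname):
    """Take the leading label of a hostname like 'jdoe.wv.cc.cmu.edu'.

    Assumes (as described above) that a registered machine's hostname
    begins with the owner's username -- a hypothetical convention.
    """
    return hostname.split(".")[0]

def identify_visitors(ip_log):
    """Map logged visitor IPs to probable usernames via reverse DNS.

    Falls back to the raw IP when the lookup fails.
    """
    identified = {}
    for ip in ip_log:
        try:
            hostname, _, _ = socket.gethostbyaddr(ip)
            identified[ip] = username_from_hostname(hostname)
        except socket.herror:
            identified[ip] = ip  # unresolvable; keep the IP itself
    return identified
```

With the access log in hand, the output of `identify_visitors` would only need to be intersected with the class roster.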
Even if he had identified only 10% of X, he managed to get everyone who cheated to admit it and send a clear message at the same time.
Since reCaptcha tries to differentiate humans from computers/bots, I find it ironic that the same reCaptcha technology is being used to improve the accuracy of how computers scan text...
I would think that the human input is improving the OCR algorithms. I assume that every human correction trains the OCR in some way. With enough corrections the OCR should at some point learn to distinguish between the characters it couldn't before.
Which means the technology will eventually make an algorithm so good that it can solve a CAPTCHA as well as a human, thus making itself obsolete. What other technology does that?
Only the ones that are too aggressive. Most viruses shoot for a very nice balance between killing the host and keeping enough of them around for the next batch.
The weirdest effect of this is that the most dangerous viruses tend to burn out. If one of those ever came along with a really long incubation time for the disease but a much longer time for contagion that might be a problem.
Some users from 4chan managed to hack recaptcha and seriously embarrass Time magazine not long ago.
Basically they stuffed the ballots on the Time online voting page where users could vote for most influential person of the year. Most influential person ended up being Moot, founder of 4chan. Just to rub it in they managed to spell "marblecake also the game" out of the first letters of the top 21 entries.
Note: This hack shows more about Time's incompetence than recaptcha's.
No, they didn't come close to "hacking recaptcha":
Update – Just to be perfectly clear, anon didn’t hack reCAPTCHA. It did exactly what it was supposed to do. It shut down the auto voters instantly and effectively. The only option left after Time added reCAPTCHA to the poll was a brute force attack. Ben Maurer (chief engineer on reCAPTCHA) comments on the hack: “reCAPTCHA put up a hard to break barrier that forced the attackers to spend hundreds of hours to obtain a relatively small number of votes. reCAPTCHA prevented numerous would-be attackers from engaging in an attack. In any high-profile system, it’s important to implement reCAPTCHA as part of a larger defense-in-depth strategy”. As Dr. von Ahn points out, “had Time used reCAPTCHA from the beginning, this would have never happened — anon submitted tens of millions of votes before Time added reCAPTCHA, but they were only able to submit ~200k afterwards. And to do this, they had to resort to typing the CAPTCHAs by hand!” One thing that Time Inc. did that made it much easier for the anonymous hack was to leave the door open for cross-site request forgeries, which allowed anon to create a streamlined poll that never had to fetch data from Time.com.
While not completely cracking it, didn't they figure out that they could skip one of the two words, and which word to skip, effectively halving the time it takes to solve the captcha?
Thanks for bringing that up, I'd almost forgotten about that.
The way I understand it, reCAPTCHA gives you 2 words to analyze: one which reCAPTCHA "knows" and one it is trying to learn about. As long as you get the word it knows about correct, it'll say you're a human. Your answer to its unknown word is simply stored and (I'd assume) analyzed until it has enough responses to consider it a known word. Knowing this allowed 4chan users to nearly cut their response time to the captchas in half.
I'd be fascinated to hear more details (or be corrected) on how reCAPTCHA works if anyone has them.
I think you've pretty much got it right. Once there's a significant amount of "agreement" on what a word says, reCAPTCHA will assume it's correct. My guess is it will keep unknown words in its "unknown" pool until a minimum number of responses are given, AND the responses are in agreement over a minimum (likely very high) percentage.
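That guessed promotion logic could look something like this. The thresholds are invented for illustration; they're not reCAPTCHA's real parameters:

```python
from collections import Counter

MIN_RESPONSES = 5      # assumed minimum number of human answers
MIN_AGREEMENT = 0.9    # assumed fraction that must agree

def promote_if_agreed(responses):
    """Given all human answers collected for one unknown word, decide
    whether to promote it to the "known" pool.

    Returns the agreed-upon spelling, or None if there is not yet
    enough consensus.
    """
    if len(responses) < MIN_RESPONSES:
        return None
    answer, count = Counter(responses).most_common(1)[0]
    if count / len(responses) >= MIN_AGREEMENT:
        return answer
    return None
```

So a word that nine out of ten users read as "their" would graduate to the known pool, while a 3-vs-2 split would keep it in the unknown pool for more responses.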
4chan users were never able to successfully break recaptcha without some significant brute force techniques as detailed in the blog post, so there is still a lot of hope yet for recaptcha.
That said, I suspect the attacks perpetrated by 4chan are the wrong place to start searching if what you're looking for is intelligent attempts to crack the system. They are among the truest supporters of the rule "when in doubt, use brute force." The fact that they fell back on that technique probably means very little for the actual security of recaptcha.
I preferred thinking that the CAPTCHAs on my websites were doing a bit of good for the world - now I'm just facilitating free labor for Google's benefit.
You're still getting free bot-prevention. If you don't like it, use something like Damien Katz's Negative CAPTCHA (hide a field with CSS, call it email, wait for a bot to fill it out, and check server-side that the field is empty).
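A minimal server-side sketch of that idea; the honeypot field name `email_confirm` is hypothetical, and the form is represented as a plain dict for simplicity:

```python
def looks_like_bot(form):
    """Negative-CAPTCHA check in the spirit of Damien Katz's idea.

    'email_confirm' is hidden with CSS on the page, so a human never
    fills it in, while a naive bot that stuffs every field will.
    """
    return bool(form.get("email_confirm", "").strip())

# Usage: reject the submission server-side if the honeypot was filled.
# if looks_like_bot(request_form):
#     abort_submission()
```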
And completely unsustainable. It only takes a parser that checks if the field is visible. It's just a simple matter of adapting. reCAPTCHA is much harder to work around for a spammer.
Checking whether the field is visible is an extremely difficult task for a bot. The field would be hidden by CSS, so it would need to download all external style sheets and parse all inline and embedded styles. The same goes if the field is hidden by JavaScript, only then it becomes harder still. The only problem with this technique is dumb auto form-fillers.
Absolutely. It only stops people who are half-heartedly malicious: bots stop and determined attackers get past either way. You still let through the malicious but uncoordinated person, but you also get a simpler code base and the peace of mind of knowing that your login system can't fail just because of a free third-party API.
No, but the deal was just announced today. I wonder if before long Google will use reCAPTCHA to digitize books as part of the Google books project - all without the permission of the original copyright owners.
Edit - To clarify, I think that making information in the public domain more accessible is a very worthwhile project.
All the new books are pretty high quality, compared to old printing techniques. There's a lot of valuable Public Domain data that we're all itching to have, and having Google be the source for that is a distinct possibility.
Well, I think if anything you are helping computers perform OCR better, which I would consider "doing a bit of good for the world", but I guess it depends on whether you associate Google with "evil".
In what sense was it doing good? I don't think any of their current content has been seen by many people (it's old NYT material, I think... so I don't know if they do it under contract for the NYT?)
I remember reading at least a year after its introduction that nothing had been done with the OCR data it had produced to date.
It's only very recently that they started talking publicly about their relationship with the NYT, and they've still said diddly-squat about public availability of the results.
In my opinion, digitizing old books and newspapers is doing good for the world no matter who is profiting from it. The information will still be made publicly available on the internet, which it is not now. More information is a good thing.
I read this as Google basically giving up on CAPTCHA.
Let me explain.
For a while google had one of the hardest CAPTCHAs, but bots just kept getting better, and they were cracking it more and more often. And you can't just make the CAPTCHA even more difficult; it was already fooling a lot of the humans.
Note that a bot does not need to successfully solve it 100% or even 90% of the time. I'm not sure what the exact figure is, but at some point the bot reaches parity with humans even if it only succeeds, say, 25% of the time. That's because on average it then only takes 4 attempts to get one correct answer.
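The arithmetic behind that: if each attempt succeeds independently with probability p, the number of attempts until the first success is geometrically distributed with mean 1/p.

```python
def expected_attempts(success_rate):
    """Expected number of tries until the first success for a bot that
    solves each CAPTCHA independently with the given probability
    (geometric distribution: E[attempts] = 1 / p)."""
    return 1 / success_rate

# A bot that solves 25% of CAPTCHAs needs only 4 attempts on average:
print(expected_attempts(0.25))  # 4.0
```

Since attempts are nearly free for a bot, even a modest per-attempt success rate translates into human-like throughput.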
And I don't think reCAPTCHA is stronger than the multicolored CAPTCHA google had (still has?).
I think google is realizing that they just cannot stop the best CAPTCHA-cracking bots; maybe they can stop a lot of them, or a lot of the not-so-smart ones.
But they also can't just give up on CAPTCHA and let anyone and any bot in without even trying.
Thus reCAPTCHA: you might as well do a little public good while you're at it. If nothing else, you've at least made some old texts more readable.
It's my guess Google is in this for the cookies and analytics that can be gathered by (however many) websites that use ReCAPTCHA.
As a trivial side-note: when presented with a ReCAPTCHA, I'll fill out one of the words and put in gibberish (or other text) for the other. For some reason I find it satisfying to "pull a fast one" on any captcha service.
> For some reason I find it satisfying to "pull a fast one" on any captcha service.
Why? What do you achieve except being told you're wrong every once in a while? I mean, you're trying to bother a machine. Isn't that a bit like a reverse Turing test?
I suspect the amount of noise that ReCAPTCHA filters out from automatic attempts is several orders of magnitude larger than anything any group of actual humans can generate.
Since computers have trouble reading squiggly words like these, CAPTCHAs are designed to let humans in but prevent malicious programs from scalping tickets or obtaining millions of email accounts for spamming. But there’s a twist — the words in many of the CAPTCHAs provided by reCAPTCHA come from scanned archival newspapers and old books. Computers find it hard to recognize these words because the ink and paper have degraded over time, but by typing them in as a CAPTCHA, crowds teach computers to read the scanned text.
The way I understand this is that the user is presented letters from archival newspapers and must type in the text he sees, and recaptcha uses that text to improve OCR. But doesn't that imply that recaptcha was unable to interpret the scanned text before? If so, how can it then verify the correctness of the text the user types in? If not, how exactly is this helping OCR?
> But if a computer can't read such a CAPTCHA, how does the
> system know the correct answer to the puzzle? Here's how:
> Each new word that cannot be read correctly by OCR is
> given to a user in conjunction with another word for which
> the answer is already known. The user is then asked to
> read both words. If they solve the one for which the
> answer is known, the system assumes their answer is
> correct for the new one. The system then gives the new
> image to a number of other people to determine, with
> higher confidence, whether the original answer was
> correct.
reCAPTCHA presents two words. One is known, the other is unknown. The idea is that if you type the known word correctly, then you've typed the unknown word correctly as well. It also shows the same unknown word to many people as a distributed accuracy check.
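My reading of that scheme as a sketch; this is not reCAPTCHA's actual code, and the function name and case-insensitive comparison are assumptions:

```python
def grade_submission(known_word, known_answer, unknown_answer, unknown_pool):
    """Two-word CAPTCHA check: the known word gates the human test;
    the unknown word's answer is harvested only if the gate passes.

    unknown_pool collects harvested answers for later consensus
    checking across many users.
    """
    if known_answer.strip().lower() != known_word.lower():
        return False  # failed the control word: treat as a bot
    unknown_pool.append(unknown_answer.strip())
    return True
```

Only answers from users who proved themselves on the control word ever enter the pool, which is what makes the harvested transcriptions trustworthy enough to aggregate.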
So reCaptcha uses statistical accuracy measures and machine learning to compare the input for the same word across hundreds of users. By doing this they minimize the probability that a single human who incorrectly deciphers a word skews the data.
Also, reCaptchas present two words: one is known and used as a normal captcha, the other is an unreadable word. If someone passes the known captcha word, it's somewhat assumed that he also correctly typed in the unknown word, because the words are presented in randomized order.
Finally, as a funny anecdote, Luis once told me that they noticed a captcha farm in Latin America paying pennies to poor Latin Americans to solve reCaptchas for them in order to gain the ability to generate illegal spam. Luis started sending them whole sentences of reCaptchas and eventually worked up to full paragraphs. Ironically, the captcha farm continued to solve them. He eventually stopped their activity, but he said the data they generated was some of the best he's ever gotten... :P
Writing your own CAPTCHA seems like a small task for Google. reCAPTCHA wasn't exactly encroaching on Google's market share... so what exactly did this get Google? I think they have a bigger plan, as always.
I'm currently doing research for Luis as part of Gwap, and as far as I can say, he's going to remain on the faculty at CMU but also work for Google out of the Google PGH offices. He's no longer teaching 251, however, which is what he's famous for around here (at least to students). What this means for me, however,...