Good move. Google wants to digitize all the books in the world, and reCaptcha does just that. Google could have tried just becoming a client of reCaptcha, but their scale would have crushed reCaptcha, to say nothing of the business risk of outsourcing a key stage of your signup process to a third party.
The logical solution was to buy them and Google-size their infrastructure.
From a personal perspective, I don't think this really changes how I feel about reCaptcha either. They'll still have the same mission -- unless Google stops them reading out-of-copyright classics and puts them onto scanning Dan Brown novels, which I think is unlikely.
Since most modern books either have high-quality print copies or digital copies available, I doubt that this technology will be used for much else than scanning old books for the foreseeable future.
I can't believe I didn't see this coming. Given Google's size and mission statement, any company whose business generates new data must partner with Google, sell out to Google, or kill Google if it wants to survive.
reCAPTCHA has some pretty good network effects due to the amount of data they have. reCAPTCHA detects suspect behaviour from IP addresses/ranges/sites etc. and blocks users based on that.
They learned from the Google school of public relations. ("What, make money? Hah, hah, we're just a bunch of geeks who like interesting problems, no crass lucre here, no siree!")
So reCaptcha's basis is a new field known as Human Computation, which Luis invented himself; the term was coined in his Ph.D. thesis. Google, however, has been the only company to really embrace Human Computation, so it's not really a huge surprise that reCaptcha was sold to Google.
The most common phrase that uses the word lucre is "filthy lucre," a reference to money being something that distracts noble people from pursuing altruistic goals.
There are only so many times anyone wants to hear about his MacArthur Genius Award or how both Microsoft and Google fought over recruiting him and how instead he took the high road and decided to stay at CMU. Maybe it was because it was his first time leading 251 and he was out to prove he was worthy....
That said, he was a very good teacher. Great lectures with lots of student involvement. He also had some creative ways of catching cheaters.
I apologize in advance if some of the details are wrong. Please correct me if you remember this incident better:
When 251 started, everyone was required to sign an honesty policy about cheating, etc. The policy explicitly stated that students shouldn't use online (or other) resources for help while working on homework sets.
One of the early sets featured a fairly difficult problem. Many students succumbed to the power of the Google search box and typed in some keywords around the problem. In the first page of organic results there was a site full of algorithm questions and answers, including this problem but none of the others in the set. The site's domain looked innocent; I think it even had the word math in it. Many clicked, found the answer, and either used it to guide their thinking or copied it close to verbatim, cheating either way.
At the lecture following the assignment due date, Luis announced that he knew X number of people had cheated on the last homework set. He stated that anyone who came clean would receive a zero on the assignment but wouldn't be reported to the dean. Since many students worked on these problems as groups by splitting up the work, a few thought that their friends had ratted them out. Others had cheated differently and thought they had been caught but didn't know how. Regardless of the method, everyone who cheated came clean.
At the next lecture, Luis told us how he did it. The innocent site was actually his, running off a box in his lab. A simple whois confirmed it. He had logged all IPs accessing the page and correlated them with student usernames. This wasn't that far-fetched: since everyone at CMU registers their computer with computing services in order to get network access, a reverse DNS lookup to a hostname, which usually contains the username, would be enough to identify a student. Failing that, he could also have worked with computing services to get the original registration username.
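The correlation step could be sketched like this. It's a hypothetical reconstruction, assuming registered hostnames start with the owner's username; the function names and that naming convention are my own guesses, not anything from the actual incident:

```python
import socket

def username_from_hostname(hostname):
    """Take the leading label of a hostname like 'jdoe.wv.cc.cmu.edu'.

    Assumes (as described above) that a registered machine's hostname
    begins with the owner's username -- a hypothetical convention.
    """
    return hostname.split(".")[0]

def identify_visitors(ip_log):
    """Map logged visitor IPs to probable usernames via reverse DNS.

    Falls back to the raw IP when the lookup fails.
    """
    identified = {}
    for ip in ip_log:
        try:
            hostname, _, _ = socket.gethostbyaddr(ip)
            identified[ip] = username_from_hostname(hostname)
        except socket.herror:
            identified[ip] = ip  # unresolvable; keep the IP itself
    return identified
```

With the access log in hand, the output of `identify_visitors` would only need to be intersected with the class roster.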
Even if he had identified only 10% of X, he managed to get everyone who cheated to admit it and send a clear message at the same time.
Since reCaptcha tries to differentiate humans from computers/bots, I find it ironic that the same reCaptcha technology is being used to improve the accuracy of how computers scan text...
I would think that the human input is improving the OCR algorithms. I assume that every human correction trains the OCR in some way. With enough corrections the OCR should at some point learn to distinguish between the characters it couldn't before.
Which means the technology will eventually make an algorithm so good that it can solve a CAPTCHA as well as a human, thus making itself obsolete. What other technology does that?
Only the ones that are too aggressive. Most viruses shoot for a very nice balance between killing the host and keeping enough of them around for the next batch.
The weirdest effect of this is that the most dangerous viruses tend to burn out. If one of those ever came along with a really long incubation time for the disease but a much longer time for contagion that might be a problem.
Some users from 4chan managed to hack recaptcha and seriously embarrass Time magazine not long ago.
Basically they stuffed the ballots on the Time online voting page where users could vote for most influential person of the year. Most influential person ended up being Moot, founder of 4chan. Just to rub it in they managed to spell "marblecake also the game" out of the first letters of the top 21 entries.
Note: This hack shows more about Time's incompetence than recaptcha's.
No, they didn't come close to "hacking recaptcha":
Update – Just to be perfectly clear, anon didn’t hack reCAPTCHA. It did exactly what it was supposed to do. It shut down the auto voters instantly and effectively. The only option left after Time added reCAPTCHA to the poll was a brute force attack. Ben Maurer (chief engineer on reCAPTCHA) comments on the hack: “reCAPTCHA put up a hard to break barrier that forced the attackers to spend hundreds of hours to obtain a relatively small number of votes. reCAPTCHA prevented numerous would-be attackers from engaging in an attack. In any high-profile system, it’s important to implement reCAPTCHA as part of a larger defense-in-depth strategy”. As Dr. von Ahn points out, “had Time used reCAPTCHA from the beginning, this would have never happened — anon submitted tens of millions of votes before Time added reCAPTCHA, but they were only able to submit ~200k afterwards. And to do this, they had to resort to typing the CAPTCHAs by hand!” One thing that Time Inc. did that made it much easier for the anonymous hack was to leave the door open for cross-site request forgeries, which allowed anon to create a streamlined poll that never had to fetch data from Time.com.
While not completely cracking it, didn't they figure out that they could skip one of the two words, and which word to skip, effectively halving the time it takes to solve the captcha?
Thanks for bringing that up, I'd almost forgotten about that.
The way I understand it, reCAPTCHA gives you 2 words to analyze: one which reCAPTCHA "knows" and one it is trying to learn about. As long as you get the word it knows about correct, it'll say you're a human. Your answer to its unknown word is simply stored and (I'd assume) analyzed until it has enough responses to consider it a known word. Knowing this allowed 4chan users to nearly cut their response time to the captchas in half.
I'd be fascinated to hear more details (or be corrected) on how reCAPTCHA works if anyone has them.
I think you've pretty much got it right. Once there's a significant amount of "agreement" on what a word says, reCAPTCHA will assume it's correct. My guess is it will keep unknown words in its "unknown" pool until a minimum number of responses are given, AND the responses are in agreement over a minimum (likely very high) percentage.
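That guessed promotion logic could look something like this. The thresholds are invented for illustration; they're not reCAPTCHA's real parameters:

```python
from collections import Counter

MIN_RESPONSES = 5      # assumed minimum number of human answers
MIN_AGREEMENT = 0.9    # assumed fraction that must agree

def promote_if_agreed(responses):
    """Given all human answers collected for one unknown word, decide
    whether to promote it to the "known" pool.

    Returns the agreed-upon spelling, or None if there is not yet
    enough consensus.
    """
    if len(responses) < MIN_RESPONSES:
        return None
    answer, count = Counter(responses).most_common(1)[0]
    if count / len(responses) >= MIN_AGREEMENT:
        return answer
    return None
```

So a word that nine out of ten users read as "their" would graduate to the known pool, while a 3-vs-2 split would keep it in the unknown pool for more responses.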
4chan users were never able to successfully break recaptcha without some significant brute force techniques as detailed in the blog post, so there is still a lot of hope yet for recaptcha.
That said, I suspect the attacks perpetrated by 4chan are the wrong place to start searching if what you're looking for is intelligent attempts to crack the system. They are among the truest supporters of the rule "when in doubt, use brute force." The fact that they fell back on that technique probably means very little for the actual security of recaptcha.
I preferred thinking that the CAPTCHAs on my websites were doing a bit of good for the world - now I'm just facilitating free labor for Google's benefit.
You're still getting free bot-prevention. If you don't like it, use something like Damien Katz's Negative CAPTCHA (hide a field with CSS, call it email, wait for a bot to fill it out, and check server-side that the field is empty).
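A minimal server-side sketch of that idea; the honeypot field name `email_confirm` is hypothetical, and the form is represented as a plain dict for simplicity:

```python
def looks_like_bot(form):
    """Negative-CAPTCHA check in the spirit of Damien Katz's idea.

    'email_confirm' is hidden with CSS on the page, so a human never
    fills it in, while a naive bot that stuffs every field will.
    """
    return bool(form.get("email_confirm", "").strip())

# Usage: reject the submission server-side if the honeypot was filled.
# if looks_like_bot(request_form):
#     abort_submission()
```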
And completely unsustainable. It only takes a parser that checks if the field is visible. It's just a simple matter of adapting. reCAPTCHA is much harder to work around for a spammer.
Checking whether the field is visible is an extremely difficult task for a bot. The field would be hidden by CSS, so it would need to download all external style sheets and parse all inline and embedded styles. The same goes if the field is hidden by JavaScript, only then it becomes harder still. The only problem with this technique is dumb auto form-fillers.
Absolutely. It only stops people who are half-heartedly malicious: bots stop and determined attackers get past either way. You still let through the malicious but uncoordinated person, but you also get a simpler code base and the peace of mind of knowing that your login system can't fail just because of a free third-party API.
No, but the deal was just announced today. I wonder if before long Google will use reCAPTCHA to digitize books as part of the Google books project - all without the permission of the original copyright owners.
Edit - To clarify, I think that making information in the public domain more accessible is a very worthwhile project.
All the new books are pretty high quality, compared to old printing techniques. There's a lot of valuable Public Domain data that we're all itching to have, and having Google be the source for that is a distinct possibility.
Well, I think if anything you are helping computers perform OCR better, which I would consider "doing a bit of good for the world", but I guess it depends on whether you associate Google with "evil".
In what sense was it doing good? I don't think any of their current content has been seen by many people (it's old NYT material, I think... so I don't know if they do it under contract for the NYT?)
I remember reading at least a year after its introduction that nothing had been done with the OCR data it had produced to date.
It's only very recently that they started talking publicly about their relationship with the NYT, and they've still said diddly-squat about public availability of the results.
In my opinion, digitizing old books and newspapers is doing good for the world no matter who is profiting from it. The information will still be made publicly available on the internet, which it is not now. More information is a good thing.
I read this as Google basically giving up on CAPTCHA.
Let me explain.
For a while google had one of the hardest CAPTCHAs, but bots just kept getting better, and they were cracking it more and more often. And you can't just make the CAPTCHA even more difficult; it was already fooling a lot of the humans.
Note that a bot does not need to successfully solve it 100% or even 90% of the time. I'm not sure what the exact figure is, but at some point the bot reaches parity with humans even if it only succeeds, say, 25% of the time. That's because on average it then only takes 4 attempts to get one correct answer.
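The arithmetic behind that: if each attempt succeeds independently with probability p, the number of attempts until the first success is geometrically distributed with mean 1/p.

```python
def expected_attempts(success_rate):
    """Expected number of tries until the first success for a bot that
    solves each CAPTCHA independently with the given probability
    (geometric distribution: E[attempts] = 1 / p)."""
    return 1 / success_rate

# A bot that solves 25% of CAPTCHAs needs only 4 attempts on average:
print(expected_attempts(0.25))  # 4.0
```

Since attempts are nearly free for a bot, even a modest per-attempt success rate translates into human-like throughput.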
And I don't think reCAPTCHA is stronger than the multicolored CAPTCHA google had (still has?).
I think google is realizing that they just cannot stop the best CAPTCHA-cracking bots; maybe they can stop a lot of them, or a lot of the not-so-smart ones.
But they also can't just give up on CAPTCHA and let anyone and any bot in without even trying.
Thus reCAPTCHA: you might as well do a little public good while you're at it. If nothing else, you've at least made some old texts more readable.
It's my guess Google is in this for the cookies and analytics that can be gathered by (however many) websites that use ReCAPTCHA.
As a trivial side-note: when presented with a ReCAPTCHA, I'll fill out one of the words and put in gibberish (or other text) for the other. For some reason I find it satisfying to "pull a fast one" on any captcha service.
> For some reason I find it satisfying to "pull a fast one" on any captcha service.
Why? What do you achieve except being told you're wrong every once in a while? I mean, you're trying to bother a machine. Isn't that a bit like a reverse Turing test?
I suspect the amount of noise that ReCAPTCHA filters out from automatic attempts is several orders of magnitude larger than anything any group of actual humans can generate.
Since computers have trouble reading squiggly words like these, CAPTCHAs are designed to let humans in but prevent malicious programs from scalping tickets or obtaining millions of email accounts for spamming. But there’s a twist — the words in many of the CAPTCHAs provided by reCAPTCHA come from scanned archival newspapers and old books. Computers find it hard to recognize these words because the ink and paper have degraded over time, but by typing them in as a CAPTCHA, crowds teach computers to read the scanned text.
The way I understand this is that the user is presented letters from archival newspapers and must type in the text he sees, and recaptcha uses that text to improve OCR. But doesn't that imply that recaptcha was unable to interpret the scanned text before? If so, how can it then verify the correctness of the text the user types in? If not, how exactly is this helping OCR?
> But if a computer can't read such a CAPTCHA, how does the
> system know the correct answer to the puzzle? Here's how:
> Each new word that cannot be read correctly by OCR is
> given to a user in conjunction with another word for which
> the answer is already known. The user is then asked to
> read both words. If they solve the one for which the
> answer is known, the system assumes their answer is
> correct for the new one. The system then gives the new
> image to a number of other people to determine, with
> higher confidence, whether the original answer was
> correct.
reCAPTCHA presents two words. One is known, the other is unknown. The idea is that if you type the known word correctly, then you've typed the unknown word correctly as well. It also shows the same unknown word to many people as a distributed accuracy check.
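My reading of that scheme as a sketch; this is not reCAPTCHA's actual code, and the function name and case-insensitive comparison are assumptions:

```python
def grade_submission(known_word, known_answer, unknown_answer, unknown_pool):
    """Two-word CAPTCHA check: the known word gates the human test;
    the unknown word's answer is harvested only if the gate passes.

    unknown_pool collects harvested answers for later consensus
    checking across many users.
    """
    if known_answer.strip().lower() != known_word.lower():
        return False  # failed the control word: treat as a bot
    unknown_pool.append(unknown_answer.strip())
    return True
```

Only answers from users who proved themselves on the control word ever enter the pool, which is what makes the harvested transcriptions trustworthy enough to aggregate.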
So reCaptcha uses statistical accuracy measures and machine learning to compare the input for the same word across hundreds of users. By doing this they minimize the probability that a single human who incorrectly deciphers a word skews the data.
Also, reCaptchas present two words: one is known and used as a normal captcha, the other is an unreadable word. If someone passes the known captcha word, it's somewhat assumed that he also correctly typed in the unknown word, because the words are presented in randomized order.
Finally, as a funny anecdote, Luis once told me that they noticed a captcha farm in Latin America paying pennies to poor Latin Americans to solve reCaptchas for them in order to gain the ability to generate illegal spam. Luis started sending them whole sentences of reCaptchas and eventually worked up to full paragraphs. Ironically, the captcha farm continued to solve them. He eventually stopped their activity, but he said the data they generated was some of the best he's ever gotten... :P
Writing your own CAPTCHA seems like a small task for Google. reCAPTCHA wasn't exactly encroaching on Google's market share... so what exactly did this get Google? I think they have a bigger plan, as always.
I'm currently doing research for Luis as part of Gwap, and as far as I can say, he's going to remain on the faculty at CMU but also work for Google out of the Google PGH offices. He's no longer teaching 251, however, which is what he's famous for around here (at least to students). What this means for me, however,...