ReCAPTCHA: Difference between revisions

Revision as of 18:45, 9 October 2011

reCAPTCHA is a system originally developed at Carnegie Mellon University that uses CAPTCHA to help digitize the text of books through slave labor while protecting websites from bots attempting to access restricted areas.^[1] On September 16, 2009, Google acquired reCAPTCHA.^[2] reCAPTCHA is currently digitizing the archives of The New York Times and books from Google Books.^[3] Twenty years of The New York Times have been digitized and the project planned to have completed the remaining years by the end of 2010.^[4]

reCAPTCHA supplies subscribing websites with images of words that optical character recognition (OCR) software has been unable to read. The subscribing websites (whose purposes are generally unrelated to the book digitization project) present these images for humans to decipher as CAPTCHA words, as part of their normal validation procedures. They then return the results to the reCAPTCHA service, which sends the results to the digitization projects.

The system is reported to display over 200 million CAPTCHAs every day,^[5] and among its subscribers are such popular sites as Facebook, TicketMaster, Twitter, 4chan, CNN.com, and StumbleUpon.^[6] Craigslist began using reCAPTCHA in June 2008.^[7] The U.S. National Telecommunications and Information Administration also used reCAPTCHA for its digital TV converter box coupon program website as part of the US DTV transition.^[8]

Origin

The reCAPTCHA program originated with Guatemalan computer scientist Luis von Ahn, aided by a MacArthur Fellowship. An early CAPTCHA developer, he realized "he had unwittingly created a system that was frittering away, in ten-second increments, millions of hours of a most precious resource: human brain cycles."^[9]

Operation

An example of a reCAPTCHA challenge from 2007, containing the words *following finding*. The waviness and horizontal stroke have been added to increase the difficulty of breaking the CAPTCHA with a computer program.

Scanned text is analyzed by two different optical character recognition programs; where the programs disagree, the questionable word is converted into a CAPTCHA. The word is displayed along with a control word whose correct reading is already known. The system assumes that if the human types the control word correctly, the questionable word is also typed correctly. The identification made by each OCR program is given a value of 0.5 points, and each interpretation by a human is given a full point. Once a given identification reaches 2.5 points, the word is considered resolved. Words that are consistently given a single identity by human judges are recycled as control words.^[10]
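
The scoring rule described above can be sketched in a few lines of Python. This is only an illustration of the stated weights and threshold, not the service's actual code: the function name `resolve_word` is hypothetical, and the sketch assumes that `human_readings` contains only answers from users who typed the control word correctly.

```python
from collections import defaultdict

OCR_WEIGHT = 0.5    # value of each OCR program's reading, as stated above
HUMAN_WEIGHT = 1.0  # value of each human reading
THRESHOLD = 2.5     # points a reading needs before the word is resolved


def resolve_word(ocr_readings, human_readings):
    """Return the first reading to reach THRESHOLD points, or None.

    Assumption: human_readings holds only answers from users who
    also typed the control word correctly.
    """
    scores = defaultdict(float)
    # The two OCR programs vote first, at half weight each.
    for reading in ocr_readings:
        scores[reading] += OCR_WEIGHT
    # Human answers arrive one at a time; stop once a reading wins.
    for reading in human_readings:
        scores[reading] += HUMAN_WEIGHT
        if scores[reading] >= THRESHOLD:
            return reading
    return None
```

Under these weights, a reading proposed by one OCR program needs two agreeing humans, while a reading that neither program proposed needs three.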

Implementation

reCAPTCHA tests are taken from the central site of the reCAPTCHA project, which supplies the words to be deciphered. This is done through a JavaScript API with the server making a callback to reCAPTCHA after the request has been submitted. The reCAPTCHA project provides libraries for various programming languages and applications to make this process easier. reCAPTCHA is a free service (that is, the CAPTCHA images are provided to websites free of charge, in return for assistance with the decipherment),^[11] but the reCAPTCHA software itself is not open source.

reCAPTCHA offers plugins for several web-application platforms, like ASP.NET, Ruby, or PHP, to ease the implementation of the service.
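
The server-side callback mentioned above can be sketched as follows. This is a hedged illustration, not the official client library: the endpoint URL, the form field names (`privatekey`, `remoteip`, `challenge`, `response`) and the two-line plain-text reply are assumptions based on the legacy verification API of this period, and the helper names are hypothetical.

```python
from urllib.parse import urlencode

# Assumed endpoint of the legacy verification API; illustrative only.
VERIFY_URL = "http://www.google.com/recaptcha/api/verify"


def build_verify_body(private_key, remote_ip, challenge, response):
    """Encode the form fields the server would POST to VERIFY_URL."""
    return urlencode({
        "privatekey": private_key,  # the site's secret key
        "remoteip": remote_ip,      # address of the user being checked
        "challenge": challenge,     # challenge id issued with the image
        "response": response,       # the words the user typed
    }).encode("utf-8")


def parse_verify_reply(text):
    """Split the assumed reply format: 'true' or 'false' on line one,
    an error code on line two when verification fails."""
    lines = text.splitlines()
    ok = bool(lines) and lines[0].strip() == "true"
    error = lines[1].strip() if not ok and len(lines) > 1 else None
    return ok, error
```

A server would send the body with something like `urllib.request.urlopen(VERIFY_URL, data=body)` and pass the decoded reply to `parse_verify_reply`; the network call is left out so that the sketch stays self-contained.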

Security

An example of a reCAPTCHA challenge from 2010, containing the words *and chisels*. The distortion style has since been altered.

The purpose of the CAPTCHA system is to prevent automated access to a system by computer programs or "bots". On December 14, 2009, Jonathan Wilkins released a paper describing weaknesses in reCAPTCHA that allowed a solve rate of 18%.^[12]^[13]^[14] On August 1, 2010, Chad Houck gave a presentation at the DEF CON 18 Hacking Conference detailing a method to reverse the distortion added to images, which allowed a computer program to determine a valid response 10% of the time.^[15]^[16] The reCAPTCHA system was modified on July 21, 2010, before Houck was to speak on his method. Houck adapted his method to the modified challenge, which he described as an "easier" CAPTCHA, and determined a valid response 31.8% of the time. Houck also mentioned security defenses in the system, such as a high-security lockout if a valid response is not given 32 times in a row.^[17] reCAPTCHA frequently modifies its system, which requires the author of such a program to update the decoding method just as frequently and may frustrate potential abusers.

Mailhide

reCAPTCHA has also created project Mailhide, which protects email addresses on web pages from being harvested by spammers.^[18] By default, the email address is converted into a format that does not allow a crawler to see the full email address: part of the address is replaced with "...". The visitor then clicks on the "..." and solves the CAPTCHA to obtain the full email address. One can also edit the popup code so that none of the address is visible.
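
The masking step can be illustrated with a short Python sketch. The truncation rule here (keep up to three leading characters of the local part) and the function name `mask_email` are hypothetical; the source does not state the rule Mailhide actually applies, and the real service also generates the link that leads to the CAPTCHA.

```python
def mask_email(address, visible=3):
    """Hide most of the local part of an address behind '...'.

    Hypothetical rule: show the first `visible` characters of the
    local part, or only the first character when it is that short.
    """
    local, sep, domain = address.partition("@")
    if not sep or not local or not domain:
        raise ValueError("not an email address: %r" % address)
    shown = local[:visible] if len(local) > visible else local[:1]
    return shown + "...@" + domain
```

In a real page, the "..." would be a link to the CAPTCHA, and the full address would be revealed only after the challenge is solved.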

Notes

[1] Luis von Ahn, Ben Maurer, Colin McMillen, David Abraham and Manuel Blum (2008). "reCAPTCHA: Human-Based Character Recognition via Web Security Measures" (PDF). Science. 321 (5895): 1465–1468. doi:10.1126/science.1160379. PMID 18703711.
[2] "Teaching computers to read: Google acquires reCAPTCHA". Google. Retrieved 2009-09-16.
[3] "reCAPTCHA FAQ". Google. Retrieved 2011-06-12.
[4] Luis von Ahn (2009). NOVA ScienceNow s04e01 (Television production). Event occurs at 46:58. "The New York Times has this huge archive, over 130 years of newspaper archive there. And we've done maybe about 20 years so far of The New York Times in the last few months and I believe we're going to be done next year by just having people do a word at a time."
[5] "reCAPTCHA FAQ". Google. Retrieved 2011-06-12.
[6] Rubens, Paul (2007-10-02). "Spam weapon helps preserve books". BBC.
[7] "Fight Spam, Digitize Books". Craigslist Blog. June 2008.
[8] TV Converter Box Program
[9] Hutchinson, Alex (March 2009). "Human Resources: The job you didn't even know you had". The Walrus. pp. 15–16.
[10] Timmer, John (2008-08-14). "CAPTCHAs work? for digitizing old, damaged texts, manuscripts". Ars Technica. Retrieved 2008-12-09.
[11] "FAQ". reCAPTCHA.net.
[12] "Strong CAPTCHA Guidelines" (PDF).
[13] "Google's reCAPTCHA busted by new attack".
[14] "Google's reCAPTCHA dented".
[15] "Def Con 18 Speakers". defcon.org.
[16] "Decoding reCAPTCHA Paper". Chad Houck.
[17] "Decoding reCAPTCHA Power Point". Chad Houck.
[18] "Mailhide: Free Spam Protection". reCAPTCHA.net.

External links

The reCAPTCHA project
Try reCAPTCHA at google.com
ReCAPTCHA: The job you didn't even know you had Two-page article in The Walrus magazine


@@ Line 2: / Line 2: @@
 [[File:RecaptchaLogo.svg|thumb|The reCAPTCHA logo]]
-'''reCAPTCHA''' is a system originally developed at [[Carnegie Mellon University]] that uses [[CAPTCHA]] to help [[digitizing|digitize]] the text of books while protecting websites from [[Internet bot|bot]]s attempting to access restricted areas.<ref name="vonAhn2008">{{Cite journal| author = Luis von Ahn, Ben Maurer, Colin McMillen, David Abraham and Manuel Blum | date= 2008 | url = http://www.cs.cmu.edu/~biglou/reCAPTCHA_Science.pdf| format = PDF | title = reCAPTCHA: Human-Based Character Recognition via Web Security Measures| journal=Science | volume=321 |number=5895 | pages=1465–1468 |doi=10.1126/science.1160379 | pmid = 18703711 | issue = 5895 | postscript = .}}</ref> On September 16, 2009, [[Google]] acquired reCAPTCHA.<ref>{{Cite web|url=http://googleblog.blogspot.com/2009/09/teaching-computers-to-read-google.html | publisher=Google |title=Teaching computers to read: Google acquires reCAPTCHA |accessdate=2009-09-16}}</ref> reCAPTCHA is currently digitizing the archives of ''[[The New York Times]]'' and books from [[Google Books]].<ref>{{Cite web|url=http://www.google.com/recaptcha/faq|title=reCAPTCHA FAQ|accessdate=2011-06-12|publisher=[[Google]]}}</ref> Twenty years of ''The New York Times'' have been digitized and the project planned to have completed the remaining years by the end of 2010.<ref>{{Cite video|people=[[Luis von Ahn]]|date=2009|title=NOVA ScienceNow s04e01|medium=Television production|accessdate=2009-07-06|time=46:58|quote=The New York Times has this huge archive, over 130 years of newspaper archive there. And we've done maybe about 20 years so far of The New York Times in the last few months and I believe we're going to be done next year by just having people do a word at a time.}}</ref>
+'''reCAPTCHA''' is a system originally developed at [[Carnegie Mellon University]] that uses [[CAPTCHA]] to help [[digitizing|digitize]] the text of books through slave labor while protecting websites from [[Internet bot|bot]]s attempting to access restricted areas.<ref name="vonAhn2008">{{Cite journal| author = Luis von Ahn, Ben Maurer, Colin McMillen, David Abraham and Manuel Blum | date= 2008 | url = http://www.cs.cmu.edu/~biglou/reCAPTCHA_Science.pdf| format = PDF | title = reCAPTCHA: Human-Based Character Recognition via Web Security Measures| journal=Science | volume=321 |number=5895 | pages=1465–1468 |doi=10.1126/science.1160379 | pmid = 18703711 | issue = 5895 | postscript = .}}</ref> On September 16, 2009, [[Google]] acquired reCAPTCHA.<ref>{{Cite web|url=http://googleblog.blogspot.com/2009/09/teaching-computers-to-read-google.html | publisher=Google |title=Teaching computers to read: Google acquires reCAPTCHA |accessdate=2009-09-16}}</ref> reCAPTCHA is currently digitizing the archives of ''[[The New York Times]]'' and books from [[Google Books]].<ref>{{Cite web|url=http://www.google.com/recaptcha/faq|title=reCAPTCHA FAQ|accessdate=2011-06-12|publisher=[[Google]]}}</ref> Twenty years of ''The New York Times'' have been digitized and the project planned to have completed the remaining years by the end of 2010.<ref>{{Cite video|people=[[Luis von Ahn]]|date=2009|title=NOVA ScienceNow s04e01|medium=Television production|accessdate=2009-07-06|time=46:58|quote=The New York Times has this huge archive, over 130 years of newspaper archive there. And we've done maybe about 20 years so far of The New York Times in the last few months and I believe we're going to be done next year by just having people do a word at a time.}}</ref>
 reCAPTCHA supplies subscribing websites with images of words that [[optical character recognition]] (OCR) software has been unable to read. The subscribing websites (whose purposes are generally unrelated to the book digitization project) present these images for humans to decipher as CAPTCHA words, as part of their normal validation procedures. They then return the results to the reCAPTCHA service, which sends the results to the digitization projects.