Saturday, October 15, 2011

We interrupt this blog for an important public service announcement

A few months ago, I was listening to NPR, which I enjoy doing while driving to and from work.  I was listening to All Things Considered...and the story was a re-broadcast of a 2008 story on those pesky Captcha verifications.

Now, although I have that silly word verification on my own blog comments, I don't really like it when I have to do the verifying.

It seems like in the early days of word verifications, the words were much, MUCH harder to read.  I remember having to request different words at least once per attempt.

And, after just looking on Wikipedia (which I wouldn't cite in a business document or Master's thesis, but for my blog, I think most things there are credible enough), I learned that modern Captcha is much easier for a human to read and much more difficult for a machine to read, which is the entire reason Captcha verifications were invented.

So...back to the radio story.

Luis von Ahn is a computer scientist at Carnegie Mellon who developed Captcha.  He found it amazing that the human mind could pretty easily perform a task that even super-powerful computers couldn't do.  He even said, "Each time you type one of these, your brain is doing something amazing."

However, he also realized that each time you type one of these, your brain is doing something wasteful.  After all, even though it is amazing, you are wasting precious seconds.  Imagine, if you spend a lot of time on the internet, buying tickets, commenting on blogs, setting up a Craigslist posting, you have wasted a couple of minutes a month.  Multiply that by all the people worldwide wasting time and you can only guess the time that is wasted.

He estimated 500,000 hours per day.  And he's a scientist, so it's got to be right!

And so, von Ahn thought of an ingenious way to harness those otherwise wasted hours and hours and hours.

Since he is both a scientist and an academic, he knew libraries have been trying to digitize pretty much every newspaper, book, pamphlet, magazine, and thing-with-writing-on-it out there.  Basically, the libraries scan each page as a pdf and then convert it to text.

But, as we already established above, since computers can't easily recognize text, there are lots and lots of words that the computers just can't figure out.  Von Ahn noted that older documents, especially those written before the turn of the 20th Century, are especially difficult for the computers to read.  The ink has faded or smudged; the pages are yellowed or stained; and the computers make a lot of mistakes.

The libraries employed people whose job it was to decipher the words and enter them into the database.

Until von Ahn thought to himself, "I wonder if, instead of having people waste 500,000 hours per day entering those annoying Captcha verifications, we had them do the decipherization of hard-to-read words for the digitization of all things written?  We could create a win-win-win situation!"

Ok...that's my fictionalizing his inner dialogue that was explained on the radio.  But, it's pretty accurate.  I'm going to call it my first work of historical fiction.

So, he approached the New York Times and the Internet Archive (check out the Internet Archive's "Wayback Machine"...it's pretty amazing stuff), and they formed re-Captcha.

Now, when you type those crazy words, one is a "key" the company has programmed as their password and one is a word that you are entering into a database since computers aren't as smart as they would like to think they are.  So really, only one of the words has to be typed correctly, since no one really knows what the other word is (although sometimes the first word is the "key" and sometimes it's the second word).
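For the programmers in the audience, here is roughly how I picture that two-word check working.  This is only my own sketch in Python, not re-Captcha's actual code: the function name, the case-insensitive comparison, and the example words are all made up for illustration.

```python
def check_challenge(key_answer, typed_key, typed_unknown):
    """Grade a two-word challenge.

    key_answer    -- the known "key" word (the only one that can be graded)
    typed_key     -- what the user typed for the key word
    typed_unknown -- what the user typed for the word nobody knows yet

    Returns (passed, candidate). The candidate is the user's guess for the
    unknown word, kept only if they got the key word right.
    Assumption: comparison ignores case and surrounding spaces.
    """
    passed = typed_key.strip().lower() == key_answer.strip().lower()
    candidate = typed_unknown.strip().lower() if passed else None
    return passed, candidate


# The user gets the key right, so their guess at the mystery word is kept.
print(check_challenge("morning", "Morning", "upon"))
# The user misses the key, so the guess is thrown away.
print(check_challenge("morning", "mourning", "upon"))
```

The point is that only the key word decides whether you pass; the other word just rides along as a guess.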

But don't fear...if you accidentally type the wrong word into the database, several other people have the same word they are trying to decipher.  If a certain number of people agree on the same word for that "picture," the database considers that word accurately transcribed and the word is incorporated into the digital version of that document.
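If you like seeing that kind of thing spelled out, the "enough people agree" idea could look something like this in Python.  Again, this is my own sketch and not the real system: the radio story never said how many people have to agree, so the threshold of 3 is purely my assumption, and the function name is hypothetical.

```python
from collections import Counter

# Assumption: the real number of matching answers required is not given
# in the story; 3 is just a placeholder for illustration.
AGREEMENT_THRESHOLD = 3


def consensus_word(guesses, threshold=AGREEMENT_THRESHOLD):
    """Return the agreed-upon word for one "picture", or None if not
    enough people have typed the same thing yet."""
    # Ignore case, surrounding spaces, and empty answers when counting votes.
    counts = Counter(g.strip().lower() for g in guesses if g.strip())
    if not counts:
        return None
    word, votes = counts.most_common(1)[0]
    return word if votes >= threshold else None


# Three people agree, one person fat-fingers it: the word is accepted.
print(consensus_word(["upon", "Upon", "up0n", "upon"]))
# Only two guesses that disagree: still undecided.
print(consensus_word(["upon", "upan"]))
```

So one typo from you (or me) doesn't ruin a book; it just gets outvoted.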

And if you are wondering what has been accomplished in the 500,000 hours per day of re-Captcha work, von Ahn estimated in 2008 it was "like 1.3 billion."  Yeah...with a "B."  (And I enjoy that this scientist/academic used the word "like" in the same context as a Valley Girl might.)

It was estimated that internet users transcribed enough words to fill up over 17,600 books at 99% accuracy.  And I'm betting they aren't talking about Dr. Seuss-sized books.  They also estimated that in 2009, they would digitize 70 years' worth of New York Times newspapers.

So the next time you have to type those pesky word verifications, remember that you are doing something for the greater good.
