In this post I’ll demonstrate how we can use OpenCV and Tesseract to apply general Optical Character Recognition (OCR) techniques to bypass a captcha programmatically. This post is meant to be educational, and to act as a warning that shows how easily certain simple captchas can be beaten.
This is the sort of captcha we’ll be trying to solve. I won’t reveal the origin of this captcha. I’ll just say that many websites in the same niche use some variation of these PHP-generated captchas in their forms.
Preparing the Image
One general approach to preparing an image like this for OCR is to apply appropriate morphological transformations to a binary version of the image.
Luckily for us this captcha is already in greyscale so we can get straight into binarizing it. When you convert an image into binary, you have to decide for each pixel if it’s going to be black or white. We do this by applying a threshold to the image. The goal is to remove as many extraneous pixels as we can without degrading the characters too much.
You can determine the best threshold through trial and error, or as in the gif above, by testing in an image editing application. Around 75 looks like the sweet spot for this type of captcha. If there were more variation in the images, another approach would be to apply a range of thresholds, run the solver on each one, and then compare the results.
We apply the threshold using OpenCV to get the following result:
This is much better already after this first step. Now we will apply some morphological transformations to make the rest of that noise disappear. Specifically we will be applying dilation and erosion (in that order) to remove all the noise surrounding our characters.
Again, this takes some tweaking to get just right, but we end up with a resulting image like the one below.
This captcha has yet another weakness: it is made up entirely of digits. That takes our guesswork from 36+ possible characters down to only 10. We can tell Tesseract exactly what it’s looking at: a single word made of the digits 0 to 9. We invoke it like so:
$ tesseract processed.png out -psm 8 digits
$ cat out.txt
709776
The -psm 8 flag tells Tesseract to treat the image as a single word, and the digits argument tells it to use the built-in digits-only config. (Tesseract 4 and later spell the flag --psm.)
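If you want to drive the same command from a script, one option is to shell out to the tesseract binary. This is only a sketch: both helper names are hypothetical, and it assumes tesseract is on your PATH and writes its result to a .txt file next to the output base name, as in the command above.

```python
import subprocess
from pathlib import Path
from typing import List


def tesseract_command(image_path: str, output_base: str,
                      psm_flag: str = "-psm") -> List[str]:
    """Build the command line used in this post."""
    # Pass psm_flag="--psm" for Tesseract 4 and later.
    return ["tesseract", image_path, output_base, psm_flag, "8", "digits"]


def read_digits(image_path: str, output_base: str = "out",
                psm_flag: str = "-psm") -> str:
    """Run Tesseract and return the recognised text."""
    subprocess.run(tesseract_command(image_path, output_base, psm_flag),
                   check=True)
    # Tesseract writes its result to <output_base>.txt
    return Path(output_base + ".txt").read_text().strip()


# Usage: read_digits("processed.png")
```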
If a website is serious about security, it should use a more advanced captcha, such as Google’s reCAPTCHA, to avoid simple bypasses like this one.