A deep learning model that breaks CAPTCHA codes perfectly using a CRNN (Convolutional Recurrent Neural Network).
CAPTCHA stands for Completely Automated Public Turing test to tell Computers and Humans Apart. CAPTCHAs are tools you can use to differentiate between real users and automated users, such as bots. They present challenges that are difficult for computers to solve but relatively easy for humans.
However, these wormy, curvy text images are sometimes difficult even for humans to figure out.
Therefore, it would be great to have a highly accurate machine learning model that reveals the correct text every time, without fail.
- tensorflow 2.0+
- scikit-learn
- opencv-python
- editdistance
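These can be installed with pip, for example:

```bash
pip install "tensorflow>=2.0" scikit-learn opencv-python editdistance
```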
The dataset is generated by the most popular WordPress CAPTCHA plugin, with nearly 8 million downloads (https://wordpress.org/plugins/really-simple-captcha/).
It generates 9,955 images of 4-letter CAPTCHAs using a random mix of four different fonts.
Ideally, we want to detect text from a text image:
However, character segmentation is not practical because:
- Too time-consuming
- Too expensive
- Impossible in most cases
For example, the character segmentation above is fine, but the one below is challenging. In fact, the traditional method runs into trouble whenever two or more characters are too close to each other, like this:
This project uses the state-of-the-art CRNN model, a combination of a CNN, an RNN, and CTC loss for image-based sequence recognition tasks, especially OCR (Optical Character Recognition), which makes it a perfect fit for CAPTCHAs.
This model is far superior to the traditional approach because it does not involve any bounding-box detection for each character (character segmentation).
In this model, the image is sliced into a fixed number of timesteps in the RNN layers. As long as each character is covered by two or three of these slices to be processed and decoded later, the spacing between characters is irrelevant, like so:
Here are more details of the CRNN architecture:
As you can see in this diagram, the last CNN layer produces a feature map of shape 4*8*4. We then flatten the first and third dimensions into 16 features and keep the second dimension as-is, producing a 16*8 output. This effectively cuts the original image into 8 vertical parts (red lines), with each part containing 16 feature values. Since the CNN output gives us 8 parts to process, we also choose 8 as the number of timesteps in the LSTM layer. After the stacked LSTM layers with a softmax (SM) activation function, CTC loss optimizes the probability table.
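To make the reshape trick concrete, here is a minimal Keras sketch. The input size (32x64 grayscale), filter counts, and vocabulary size are assumptions for illustration; see the notebook for the actual configuration:

```python
import tensorflow as tf
from tensorflow.keras import layers

VOCAB_SIZE = 26  # assumed: lowercase a-z

inputs = layers.Input(shape=(32, 64, 1), name="image")  # assumed input size
x = layers.Conv2D(32, 3, padding="same", activation="relu")(inputs)
x = layers.MaxPooling2D(2)(x)                       # -> (16, 32, 32)
x = layers.Conv2D(64, 3, padding="same", activation="relu")(x)
x = layers.MaxPooling2D(2)(x)                       # -> (8, 16, 64)
x = layers.Conv2D(4, 3, padding="same", activation="relu")(x)
x = layers.MaxPooling2D(2)(x)                       # -> (4, 8, 4), as in the diagram

# Keep the width axis (8) as the time axis; fold height and channels
# into 4*4 = 16 features per timestep.
x = layers.Permute((2, 1, 3))(x)                    # -> (8, 4, 4)
x = layers.Reshape((8, 16))(x)                      # -> 8 timesteps x 16 features

x = layers.Bidirectional(layers.LSTM(128, return_sequences=True))(x)
y_pred = layers.Dense(VOCAB_SIZE + 1, activation="softmax")(x)  # +1 for the CTC blank

model = tf.keras.Model(inputs, y_pred)

def ctc_loss(y_true, y_pred):
    """CTC loss over the batch; y_true holds dense integer label sequences."""
    batch_len = tf.cast(tf.shape(y_true)[0], dtype="int64")
    input_len = tf.cast(tf.shape(y_pred)[1], dtype="int64") * tf.ones((batch_len, 1), dtype="int64")
    label_len = tf.cast(tf.shape(y_true)[1], dtype="int64") * tf.ones((batch_len, 1), dtype="int64")
    return tf.keras.backend.ctc_batch_cost(y_true, y_pred, input_len, label_len)

model.compile(optimizer="adam", loss=ctc_loss)
```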
More information regarding the implementation can be found in the Jupyter notebook in this GitHub repository.
We need the right evaluation metrics for an OCR task, computed with an edit-distance library.
This is inspired by https://github.com/arthurflor23/handwritten-text-recognition/blob/master/src/data/evaluation.py
It calculates three evaluation metrics for any OCR task (a sketch follows the list):
- CER (Character Error Rate)
- WER (Word Error Rate)
- SER (Sequence Error Rate)
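Here is a minimal sketch of how these three metrics can be computed with the editdistance package; the function name is hypothetical, and the actual implementation follows the evaluation.py linked above:

```python
import editdistance

def ocr_metrics(predicts, ground_truth):
    """Return (CER, WER, SER) averaged over paired lists of strings."""
    cer, wer, ser = [], [], []
    for pred, gt in zip(predicts, ground_truth):
        # CER: edit distance over characters, normalized by ground-truth length
        cer.append(editdistance.eval(pred, gt) / max(len(gt), 1))
        # WER: edit distance over whitespace-separated words
        pred_words, gt_words = pred.split(), gt.split()
        wer.append(editdistance.eval(pred_words, gt_words) / max(len(gt_words), 1))
        # SER: 1 if the whole sequence differs at all, else 0
        ser.append(0 if pred == gt else 1)
    n = len(ground_truth)
    return sum(cer) / n, sum(wer) / n, sum(ser) / n
```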
Here are my results on the test set:
This is an easy dataset, so I got an absolutely perfect score on the 200 images of the test set! Not even a challenge for CRNN power:
- Character Error Rate: 0.0
- Word Error Rate: 0.0
- Sequence Error Rate: 0.0
- CRNN + CTC is not that challenging; just make sure you follow the above process step by step, as in the notebook.
- Keeping the image height and width a power of 2 (or at least an even number) makes it much easier to halve dimensions through the pooling layers (this is not critical, since it depends on your model design and preprocessing).
- The number of bi-LSTM units should be larger than the number of timesteps, since the bi-LSTM size divided by 2 is the hidden size of each single LSTM direction.
- The max label length should equal the number of timesteps, though some people report that setting it slightly below the timestep count helps. You should stick with the basics, though!
- This data is super clean and every image has the same dimensions, so for other datasets a bit of noise removal and binarization may help!
- Resizing logic for datasets with multiple image sizes could work as follows (see the sketch after this list):
- find min, max of height and width
- resize to a fixed height you want
- calculate the max width of all resized images
- pad all images to that max width
- Combine the preprocessing logic for the train set and the test set into one pipeline
- Convert them to a tf.data pipeline (note that this is challenging, since OpenCV won't work with tensors)
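Here is a minimal sketch of that resize-and-pad logic combined with a shared tf.data pipeline. The function names, the fixed height, and the white padding value are assumptions for illustration; tf.numpy_function is one standard way to bridge OpenCV (which cannot consume symbolic tensors) into the pipeline:

```python
import cv2
import numpy as np
import tensorflow as tf

TARGET_H = 32  # assumed fixed height

def resize_keep_ratio(img, target_h=TARGET_H):
    """Resize to a fixed height while preserving the aspect ratio."""
    h, w = img.shape[:2]
    new_w = max(1, round(w * target_h / h))
    return cv2.resize(img, (new_w, target_h))

def pad_to_width(img, max_w):
    """Right-pad a grayscale image with white pixels up to max_w."""
    pad = int(max_w) - img.shape[1]
    return cv2.copyMakeBorder(img, 0, 0, 0, pad, cv2.BORDER_CONSTANT, value=255)

def preprocess(path, max_w):
    """Load -> resize -> pad -> normalize; runs as plain NumPy/OpenCV code."""
    img = cv2.imread(path.decode(), cv2.IMREAD_GRAYSCALE)
    img = pad_to_width(resize_keep_ratio(img), max_w)
    return (img[..., np.newaxis] / 255.0).astype(np.float32)

def make_dataset(paths, labels, max_w, batch_size=32):
    """One shared pipeline for both the train set and the test set."""
    ds = tf.data.Dataset.from_tensor_slices((paths, labels))
    ds = ds.map(
        lambda p, y: (tf.numpy_function(preprocess, [p, max_w], tf.float32), y),
        num_parallel_calls=tf.data.AUTOTUNE,
    )
    return ds.batch(batch_size).prefetch(tf.data.AUTOTUNE)
```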
This project is licensed under the MIT License - see the LICENSE.md file for details
- Huynh Nguyen Minh Thong (Tom Huynh) - tomhuynhsg@gmail.com