We propose an extraction system that uses knowledge of the types of the target fields to generate extraction candidates, and a neural network architecture that learns a dense representation of each candidate based on neighbouring words in the document. These learned representations are not only useful in solving the extraction task for unseen document templates from two different domains, but are also interpretable in classic document processing.
- Extract Information from Scanned Invoices to an XLS file
- Multi-language Support
Our complete model works in the following eight steps:
- Convert the PDF to JPG
- Detect bounding boxes (bboxes) for all text
- Bbox_mapper extracts contours, sorts them in multiple orders, and extracts text from them
- Recognise the text using Tesseract's LSTM OCR engine
- Ensemble-search for keywords to locate the table header
- Segregate the image into Info (the non-table part) and Sheet (the item table)
- Fill the Sheet values directly into the XLS file
- Search keys to extract the Info values and map them into the XLS file (a minimal sketch of these last two steps follows the list)
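Steps 7 and 8 amount to writing the two extracted parts into a spreadsheet. Here is a minimal sketch with pandas, where the `info` and `sheet` values are made-up placeholders standing in for the extracted fields:

```python
import pandas as pd

# Hypothetical extracted values: 'info' holds key-value fields from the
# non-table part, 'sheet' holds the rows of the item table.
info = {"Invoice No": "INV-001", "Date": "2021-03-15", "Total": "420.00"}
sheet = [
    {"Item": "Widget A", "Qty": 2, "Price": 100.00},
    {"Item": "Widget B", "Qty": 1, "Price": 220.00},
]

# Step 7: fill the Sheet rows directly; Step 8: map the Info key-value
# pairs into a second worksheet of the same file.
with pd.ExcelWriter("invoice.xlsx") as writer:
    pd.DataFrame(sheet).to_excel(writer, sheet_name="Sheet", index=False)
    pd.DataFrame([info]).to_excel(writer, sheet_name="Info", index=False)
```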
For this task we use pdf2image [https://pypi.org/project/pdf2image/].
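A minimal conversion sketch, assuming pdf2image's poppler backend is installed; the file names are placeholders:

```python
from pdf2image import convert_from_path

# Convert every page of the PDF into a PIL image at 300 DPI
pages = convert_from_path("invoice.pdf", dpi=300)

# Save each page as a JPG for the downstream stages
for i, page in enumerate(pages):
    page.save(f"invoice_page_{i}.jpg", "JPEG")
```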
- Binary images are produced in the ExtractStructure class for image processing
Bbox_mapper extracts contours, sorts them in multiple orders, and extracts text from them in sequential order.
Optimisation techniques incorporated to extract the grid structure with higher accuracy include the following (a minimal OpenCV sketch follows the list):
- Gaussian Blur
- getStructuringElement (to build the morphological kernel)
- Dilate
- Erode
- Convolution
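A minimal sketch of this chain with OpenCV; the kernel size and thresholds are illustrative guesses, not the tuned values used in ExtractStructure:

```python
import cv2

# Binarise the page (as done in the ExtractStructure class); the
# Gaussian blur is itself a convolution that suppresses noise
img = cv2.imread("invoice_page_0.jpg", cv2.IMREAD_GRAYSCALE)
blur = cv2.GaussianBlur(img, (5, 5), 0)
_, binary = cv2.threshold(blur, 0, 255,
                          cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)

# Build a rectangular kernel, then dilate to join characters into
# word/line blobs and erode to trim the blobs back down
kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (15, 3))
morphed = cv2.erode(cv2.dilate(binary, kernel, iterations=1),
                    kernel, iterations=1)

# Extract contours and sort their bounding boxes into reading order,
# roughly what Bbox_mapper does before handing regions to OCR
contours, _ = cv2.findContours(morphed, cv2.RETR_EXTERNAL,
                               cv2.CHAIN_APPROX_SIMPLE)
boxes = sorted((cv2.boundingRect(c) for c in contours),
               key=lambda b: (b[1], b[0]))  # (x, y, w, h)
```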
Text detection is the main task in the overall process. For it we applied three methods:
- Use Pytesseract [https://pypi.org/project/pytesseract/]
- Advantage: Every word is detected, creating one or more bboxes per word
- Disadvantage: It cannot detect a semantic pair as one bbox; e.g. "Invoice No" and its value land in different bboxes
- Use EfficientDet [https://arxiv.org/abs/1911.09070]
- Advantage: We can divide the image into classes such as Shipping, Buying, Header, Footer, and Table
- For this task we used our own labelled dataset of around 2500 images
- With a good training pipeline, EfficientDet-D5 achieved a loss of 0.83
- Adding WBF (Weighted Boxes Fusion) [https://arxiv.org/abs/1910.13302] brought the loss down to 0.42
- Disadvantage: It cannot detect lines and words because of the lack of data
- Use CRAFT Model [https://arxiv.org/pdf/1904.01941.pdf]
- Advantage: It can detect both lines and semantic pairs by tuning the threshold
- Disadvantage: None significant, but the model still needs to be optimised
Ultimately we used the CRAFT model for its effectiveness with less data; a usage sketch follows below.
CAN BE IMPROVED FURTHER: by using both CRAFT and EfficientDet we can know which text belongs to which region, so no other post-processing is needed.
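Since CRAFT is the detector we kept, here is a minimal usage sketch. It assumes the community craft-text-detector package from PyPI (a wrapper around the paper's model); the thresholds shown are its defaults, not our tuned values:

```python
from craft_text_detector import Craft

# Load CRAFT (plus its link refiner) once; crop_type="box" exports
# rectangular crops of each detected region to output_dir
craft = Craft(output_dir="craft_output", crop_type="box",
              text_threshold=0.7, link_threshold=0.4, cuda=False)

# Detect text regions on one page image
prediction = craft.detect_text("invoice_page_0.jpg")
boxes = prediction["boxes"]  # one polygon per detected text region

# Release the model weights when done
craft.unload_craftnet_model()
craft.unload_refinenet_model()
```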
Our model demonstrates higher accuracy through transfer learning of the Tesseract LSTM OCR on our annotated dataset, and it scales well to new documents with only a small amount of labelled training data.
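A minimal sketch of how such a fine-tuned model would be invoked at inference time through pytesseract; "invoice" is a hypothetical custom .traineddata file placed in the tessdata directory:

```python
import pytesseract
from PIL import Image

# --oem 1 selects Tesseract's LSTM engine; --psm 6 assumes a uniform
# block of text; lang="invoice" names the hypothetical fine-tuned model
text = pytesseract.image_to_string(
    Image.open("field_crop.jpg"),
    lang="invoice",
    config="--oem 1 --psm 6",
)
print(text)
```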
For this task we also applied two methods:
- Use Pytesseract [https://pypi.org/project/pytesseract/]
- Advantage: We can extract text easily from the detected image crops
- Use Text Recognition [https://arxiv.org/abs/1904.01906]
- Advantage: It supports multiple languages, needs no extra optimisation, and we can train it on our own data
- Disadvantage: Its performance is a little worse than Pytesseract's
Ultimately we used Pytesseract for its effectiveness with this sample data.
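A minimal sketch of the chosen Pytesseract path, OCRing each detected region one at a time; the `boxes` values are placeholder (x, y, w, h) rectangles standing in for the detection stage's output:

```python
import pytesseract
from PIL import Image

page = Image.open("invoice_page_0.jpg")

# Placeholder rectangles; in the real pipeline these come from CRAFT
boxes = [(40, 60, 200, 30), (40, 100, 260, 30)]

words = []
for (x, y, w, h) in boxes:
    crop = page.crop((x, y, x + w, y + h))
    # --psm 7 treats each crop as a single text line
    words.append(pytesseract.image_to_string(crop, config="--psm 7").strip())
print(words)
```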