Nurminen table detection #56
… simple spreadsheet algorithm
…d tables - this results in a few more tests passing
…h Page coordinate space
…ons when snapping nonexistent lines to points)
- Merge edges after finding crossing points so that it's easier to find cells
- Halve coordinate space after table areas are detected
- Move the cell check so that tables with fewer than 4 cells are correctly detected
… text as guidelines
OK, I was able to reuse I'm still using my own
Awesome. Deleting huge chunks of code always makes me happy :)
Can you point me to the difference between your method and ours? If incorporating that change to
OK, from looking closer at the I think to modify the
Small refactor suggestion: We already have a top-leftmost
Right. Let's keep yours for now, then.
I think it now looks good for merging into Thanks!
@jazzido it looks like the tests are failing because PDFBox is logging too much. Do you know if there's an easy way to temporarily turn off logging, or will I have to add log4j as a dependency and do it programmatically for the tests?
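One possible way to quiet the logs without adding log4j, as a minimal sketch: it assumes PDFBox's Commons Logging calls end up in `java.util.logging` (the usual fallback when log4j is not on the classpath) and that the loggers live under the `org.apache.pdfbox` name. The class and method names here are hypothetical and not part of tabula-java.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical test helper; not part of tabula-java or PDFBox.
final class QuietPdfBoxLogging {
    // Keep a strong reference: java.util.logging only holds loggers weakly,
    // so a configured logger could otherwise be garbage-collected and reset.
    private static final Logger PDFBOX_LOGGER = Logger.getLogger("org.apache.pdfbox");
    private static Level previousLevel;

    // Call before the tests run (e.g. from a setup method).
    static void silence() {
        previousLevel = PDFBOX_LOGGER.getLevel(); // may be null (inherit from parent)
        PDFBOX_LOGGER.setLevel(Level.OFF);
    }

    // Call after the tests to put the earlier level back.
    static void restore() {
        PDFBOX_LOGGER.setLevel(previousLevel);
    }
}
```

If a different logging backend is bound at runtime, this has no effect and the backend's own configuration would be needed instead.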
Well that's fun. I guess there must be enough of a difference between my flavour of Java and Travis' flavour that there are failures in the integration tests that don't appear locally. I'll try to think of some clever way around this but suggestions are welcome!
OK I can reproduce the failures in an old Vagrant box. I'll dig into the failures.
…rs/tabula-java into nurminen-table-detection
Added the ability to specify the amount by which collinear lines should be expanded during merging, so that the Nurminen detection can take advantage of it. This extra wiggle room seems to help with platform-specific inconsistencies in the edge detection.
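The idea of an expansion amount during merging can be sketched roughly as follows. This is an illustration only, not the code from this branch: segments are reduced to `{left, right}` pairs on a single horizontal line, and `CollinearMergeSketch`, `merge`, and `expand` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical illustration; not the tabula-java implementation.
final class CollinearMergeSketch {
    /**
     * Merges segments that lie on the same horizontal line.
     * Each segment is {left, right}. Two segments are joined when they
     * touch or overlap after both are widened by `expand` at each end.
     * The widening is used only for the test; merged extents are not padded.
     */
    static List<double[]> merge(List<double[]> segments, double expand) {
        // Copy so the caller's arrays are never modified, then sort by left edge.
        List<double[]> sorted = new ArrayList<>();
        for (double[] s : segments) {
            sorted.add(new double[] {s[0], s[1]});
        }
        sorted.sort(Comparator.comparingDouble(s -> s[0]));

        List<double[]> merged = new ArrayList<>();
        for (double[] next : sorted) {
            double[] last = merged.isEmpty() ? null : merged.get(merged.size() - 1);
            if (last != null && next[0] - expand <= last[1] + expand) {
                // Close enough: extend the previous segment.
                last[1] = Math.max(last[1], next[1]);
            } else {
                merged.add(next);
            }
        }
        return merged;
    }
}
```

For the segments `{0, 10}`, `{10.5, 20}` and `{30, 40}`, an `expand` of `0.5` joins the first two and leaves two segments, while an `expand` of `0` keeps all three apart. A small gap of that kind is the sort of platform-dependent difference the extra tolerance is meant to absorb.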
Phew! OK, by adding a little more "fuzziness" to a couple of parts of the algorithm I was able to align the tests over a bunch of platforms (Windows, OS X & Ubuntu). Might be ready to go now?
Sometimes when text is arranged into lines it can get messed up and have chunks on more lines than we actually recognize as lines. Make sure that we don't get array out of bounds exceptions.
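A bounds guard of this kind might look like the sketch below. It assumes each text chunk carries the index of the line it was assigned to; `LineBucketsSketch`, `bucket`, and the parameter names are hypothetical and do not come from this branch.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration; not the tabula-java implementation.
final class LineBucketsSketch {
    /**
     * Groups chunks by line index. Chunks whose index falls outside
     * the recognized lines are skipped instead of throwing
     * IndexOutOfBoundsException.
     */
    static List<List<String>> bucket(int lineCount, int[] lineIndexOfChunk, String[] chunks) {
        List<List<String>> lines = new ArrayList<>();
        for (int i = 0; i < lineCount; i++) {
            lines.add(new ArrayList<>());
        }
        for (int i = 0; i < chunks.length; i++) {
            int idx = lineIndexOfChunk[i];
            if (idx < 0 || idx >= lineCount) {
                continue; // chunk sits on a line we did not recognize
            }
            lines.get(idx).add(chunks[i]);
        }
        return lines;
    }
}
```

Skipping is only one option; clamping the index to the last recognized line would keep the text but may attach it to the wrong row.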
Merged! Thanks a lot @mcharters. Your PR is the biggest contribution we've received in the history of Tabula.
I've just integrated the new detector in Tabula (web). This is awesome (@jeremybmerrill, @mtigas, take a look :))
This branch implements a more sophisticated table detection algorithm based (more or less) on Anssi Nurminen's master's thesis, which can be found here: http://dspace.cc.tut.fi/dpub/bitstream/handle/123456789/21520/Nurminen.pdf?sequence=3
With this algorithm, 49 of the 67 ground truth table detection tests pass. The remaining failures are mostly either tricky tables or false positives (which I'm guessing are more useful to Tabula than not finding anything).
Note that this branch is based off mcharters:add-table-detection-tests, so it's got some extra changes in there.