Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

Nurminen table detection #56

Merged
merged 56 commits into from
Feb 2, 2016

Conversation

mcharters
Copy link
Contributor

This branch implements a more sophisticated table detection algorithm based off Anssi Nurminen's masters thesis (more or less) which can be found here: http://dspace.cc.tut.fi/dpub/bitstream/handle/123456789/21520/Nurminen.pdf?sequence=3

With this algorithm, 49 of the 67 ground truth table detection tests pass. The remaining failures are mostly either tricky tables or false positives (which I'm guessing are more useful to tabula than not finding anything).

Note that this branch is based off mcharters:add-table-detection-tests, so it's got some extra changes in there.

Matt Charters added 30 commits December 14, 2015 14:52
…d tables - this results in a few more tests passing
…ons when snapping nonexistent lines to points)
- Merge edges after finding crossing points so that it's easier to find cells
- Halve coordinate space after thable areas are detected
- Move the cell check so that tables with less than 4 cells are correctly detected
@mcharters
Copy link
Contributor Author

OK, I was able to reuse Ruling.collapseOrientedRulings and SpreadsheetExtractionAlgorithm.findCells (which also let me get rid of my duplicate crossing point algorithm).

I'm still using my own getTableAreasFromCells method because I allow cells that are a small distance apart to be part of the same table (this cuts down on a few failures in the ground truth data). I'll keep it that way unless we should be doing that for the SpreadsheetExtractionAlgorithm as well?

@jazzido
Copy link
Contributor

jazzido commented Jan 11, 2016

OK, I was able to reuse Ruling.collapseOrientedRulings and SpreadsheetExtractionAlgorithm.findCells (which also let me get rid of my duplicate crossing point algorithm).

Awesome. Deleting huge chunks of code always makes me happy :)

I'm still using my own getTableAreasFromCells method because I allow cells that are a small distance apart to be part of the same table (this cuts down on a few failures in the ground truth data). I'll keep it that way unless we should be doing that for the SpreadsheetExtractionAlgorithm as well?

Can you point me to the difference between your method and ours? If incorporating that change to SpreadsheetExtractionAlgorithm.findSpreadsheetsFromCells doesn't break our tests, we can get rid of one of them.

@mcharters
Copy link
Contributor Author

OK, from looking closer at the findSpreadhseetsFromCells method it's much more sophisticated than mine. I'm just taking each cell (from top-left-most to bottom-right-most) and checking cell by cell to see if the next cell shares any corners with any cells in my current "table" and that's where I add the distance threshold - cells can be a little bit away and be considered part of the table.

I think to modify the findSpreadhseetsFromCells algorithm to be similar it would involve writing more code to snap all the points in the pointSet to some sort of grid before finding the horizontal and vertical lines - unless I'm missing something. This seems like it would be tricky to me?

@jazzido
Copy link
Contributor

jazzido commented Jan 11, 2016

Small refactor suggestion:

We already have a top-leftmost Point2D comparator here — We could move it to Utils and reuse it from NurminenDetectionAlgorithm.

@jazzido
Copy link
Contributor

jazzido commented Jan 11, 2016

OK, from looking closer at the findSpreadhseetsFromCells method it's much more sophisticated than mine […]

Right. Let's keep yours for now, then.

@jazzido
Copy link
Contributor

jazzido commented Jan 12, 2016

I think it now looks good for merging into master. Last request: can you set up the tests so the ones that pass at this point are ran by default in the test suite? That way we make sure not to introduce regressions in the detector.

Thanks!

@mcharters
Copy link
Contributor Author

@jazzido it looks like the tests are failing because PDFBox is logging too much. Do you know if there's an easy way to temporarily turn off logging or will I have to add log4j as a dependency and do it programatically for the tests?

@mcharters
Copy link
Contributor Author

Well that's fun. I guess there must be enough of a difference between my flavour of java and travis' flavour that there are failures in the integration tests that don't appear locally.

I'll try to think of some clever way around this but suggestions are welcome!

@mcharters
Copy link
Contributor Author

OK I can reproduce the failures in an old vagrant box. I'll dig into the failures.

mcharters and others added 6 commits January 14, 2016 14:29
Added the ability to specify the amount colinear lines should be expanded during merging so that the nurminen detection could take advantage.

This extra wiggle room seems to help deal with some of the inconsistencies in platform specific differences in the edge detection.
@mcharters
Copy link
Contributor Author

Phew! OK, by adding a little more "fuzziness" to a couple parts of the algorithm I was able to align the tests over a bunch of platforms (windows, os x & ubuntu). Might be ready to go now?

Sometimes when text is arranged into lines it can get messed up and have
chunks on more lines than we actually recognize as lines. Make sure that
we don't get array out of bounds exceptions.
jazzido added a commit that referenced this pull request Feb 2, 2016
@jazzido jazzido merged commit 96ce1dc into tabulapdf:master Feb 2, 2016
@jazzido
Copy link
Contributor

jazzido commented Feb 2, 2016

Merged! Thanks a lot @mcharters. Your PR is the biggest contribution we received in the history of Tabula.

@jazzido
Copy link
Contributor

jazzido commented Feb 2, 2016

I've just integrated the new detector in Tabula (web). This is awesome

screen shot 2016-02-02 at 1 55 07 pm

(@jeremybmerrill , @mtigas, take a look :))

jazzido added a commit to tabulapdf/tabula that referenced this pull request Feb 2, 2016
@mcharters mcharters deleted the nurminen-table-detection branch February 2, 2016 20:35
@mcharters mcharters restored the nurminen-table-detection branch February 4, 2016 20:42
jeremybmerrill pushed a commit to tabulapdf/tabula that referenced this pull request Jun 26, 2018
# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants