PDFTextStripper - parsing incorrectness #458

Closed
fungc opened this issue Mar 12, 2020 · 5 comments

Comments

@fungc

fungc commented Mar 12, 2020

Hello,

I am using PDFTextStripper, from the PDFBox library, to extract the text from a PDF generated from HTML with openhtmltopdf.

Code for parsing:
final PDDocument document = PDDocument.load(pdfBytes);
final PDFTextStripper pdfTextStripper = new PDFTextStripper();
return pdfTextStripper.getText(document);

However, I am seeing a few problems:

  1. Invisible, redundant text
    Sometimes the PDF has invisible text in front of the actual text.
    e.g.

HTML:
line1
line2
line3

PDF:
line1
line2 (<--- invisible)
line2
line3

This happens even when you just open the pdf and select / copy the text.

  2. Commas are placed in the wrong position when parsed
    Commas show up correctly in the PDF, but in the parsed text they appear in the wrong position.
    e.g.
    HTML:
    hello, my name, is

PDF:
,,hello my name is

Note: this does not happen when you open the PDF and select / copy the text.

  3. Interestingly, the comma problem goes away when I parse like this:
    final PDDocument document = PDDocument.load(pdfBytes);
    final PDFTextStripper pdfTextStripper = new PDFTextStripper();
    pdfTextStripper.setSortByPosition(true);
    return pdfTextStripper.getText(document);

However, all superscripts / subscripts then get messed up in the output,
e.g. receptiońs becomes receptións
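
For anyone comparing the two outputs, a minimal sketch using only the JDK's `java.text.Normalizer` shows that the two strings quoted above contain the same base letters and differ only in where the combining acute accent sits after decomposition. That would be consistent with the accent being a separately positioned mark that is reordered when sorting by position, but the thread does not confirm this. The class and method names below are hypothetical.

```java
import java.text.Normalizer;

// Hypothetical helper, not part of PDFBox or openhtmltopdf.
class CombiningMarkShift {

    // Decompose precomposed letters into base letter + combining mark (NFD).
    static String decompose(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD);
    }

    // Remove all combining marks, leaving only the base letters.
    static String stripMarks(String s) {
        return decompose(s).replaceAll("\\p{M}", "");
    }

    // Index of the first non-spacing mark in the decomposed string, or -1.
    static int firstMarkIndex(String s) {
        String d = decompose(s);
        for (int i = 0; i < d.length(); i++) {
            if (Character.getType(d.charAt(i)) == Character.NON_SPACING_MARK) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        String expected = "receptio\u0144s"; // accent on the n
        String actual = "recepti\u00F3ns";   // accent on the o
        System.out.println(stripMarks(expected).equals(stripMarks(actual)));
        System.out.println(firstMarkIndex(expected) + " vs " + firstMarkIndex(actual));
    }
}
```

If the base letters match and only the mark index differs, the extraction has kept every character but attached the accent to a neighbouring letter.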

Do you know why these happen?

Thank you!

@danfickle
Owner

Number 1 may be a serious bug in this library, so I'd love to get the html to reproduce it.

Numbers 2 and 3, I'm not sure. Does this happen with other PDFs or just ones produced by this library?

@fungc
Author

fungc commented Mar 16, 2020

Financier-Extraordinaire.pdf

I can't get you the HTML at the moment, but here is an output PDF.
I think (1) has to do with paging: it always happens at the end of a page or at the beginning.
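
One way to check the paging hunch is to extract the text page by page and look for a line that ends one page and also starts the next. The sketch below assumes the per-page strings have already been obtained (for example by calling PDFTextStripper with `setStartPage` and `setEndPage` set to the same page) and that the repeat lands exactly at the page boundary, which this thread suggests but does not confirm. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper that works on already-extracted page text.
class PageBoundaryDuplicates {

    // Returns the 1-based numbers of pages whose last non-blank line
    // is repeated as the first non-blank line of the following page.
    static List<Integer> findRepeatsAcrossPages(List<String> pageTexts) {
        List<Integer> hits = new ArrayList<>();
        for (int i = 0; i + 1 < pageTexts.size(); i++) {
            String last = lastNonBlankLine(pageTexts.get(i));
            String first = firstNonBlankLine(pageTexts.get(i + 1));
            if (!last.isEmpty() && last.equals(first)) {
                hits.add(i + 1);
            }
        }
        return hits;
    }

    // First line that is not empty after trimming, or "" if there is none.
    static String firstNonBlankLine(String text) {
        for (String line : text.split("\\R")) {
            if (!line.trim().isEmpty()) {
                return line.trim();
            }
        }
        return "";
    }

    // Last line that is not empty after trimming, or "" if there is none.
    static String lastNonBlankLine(String text) {
        String[] lines = text.split("\\R");
        for (int i = lines.length - 1; i >= 0; i--) {
            if (!lines[i].trim().isEmpty()) {
                return lines[i].trim();
            }
        }
        return "";
    }
}
```

A non-empty result lists the pages after which a line is duplicated. An empty result only means no duplicate was found at a boundary, not that the PDF is free of hidden text.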

(2) and (3) do not happen with other PDFs; I was testing with Apache FOP.

Do you have an email address where we can chat?

@fungc
Author

fungc commented Mar 17, 2020

Financier-Extraordinaire-long.pdf

Found another bug. For extra-long strings, the end of the string becomes invisible but can still be copied.

@leonorader
Contributor

leonorader commented Mar 22, 2020

@fungc, could you please provide the HTML code for these issues?

danfickle added a commit that referenced this issue Aug 21, 2020
Seems to be confined to ordered lists as far as I can tell.
@danfickle
Owner

@fungc, I know it has been a while, but I was able to reproduce this, though only with ordered list items. Was that your experience?

Anyway, I will try to debug.

danfickle added a commit that referenced this issue Aug 22, 2020
danfickle added a commit that referenced this issue Nov 27, 2020
This fixes repeating content in page margins when line-height is other than one. It also fixes the PDF UA crash caused by the repeating content.

However, it is a behavior-changing fix. Documents with text split over two pages (usually undesired) will now get a forced page break before the split text.
danfickle added a commit that referenced this issue Nov 28, 2020
With changes to get it working and test proof.