PDFTextStripper - parsing incorrectness #458
Comments
Number 1 may be a serious bug in this library, so I'd love to get the HTML to reproduce it. Number 2 and 3, I'm not sure. Does this happen with other PDFs or just ones produced by this library?
I can't get you the HTML at the moment, but here is an output PDF. (2) and (3) do not happen with other PDFs; I was testing with Apache FOP. Do you have an email where we can chat?
Financier-Extraordinaire-long.pdf
Found another bug. For extra-long strings, the end of the string becomes invisible but can still be copied.
@fungc could you please provide the HTML for these issues?
Seems to be confined to ordered lists as far as I can tell.
@fungc, I know it has been a while, but I was able to reproduce this, though only with ordered list items. Was that your experience? Anyway, I will try to debug.
This fixes repeating content in page margins when line-height is other than one. It also fixes the PDF/UA crash caused by the repeating content. However, it is a behavior-changing fix: documents with text split over two pages (usually undesired) will now get a forced page break before the split text.
With changes to get it working and test proof.
Hello,
I am using PDFTextStripper, from the PDFBox library, to parse the text out of a PDF generated from HTML using openhtmltopdf.
Code for parsing:
final PDDocument document = PDDocument.load(pdfBytes);
final PDFTextStripper pdfTextStripper = new PDFTextStripper();
return pdfTextStripper.getText(document);
However, I am seeing a few problems:
1. Sometimes the PDF has invisible text in front of the actual text.
e.g.
HTML:
line1
line2
line3
PDF:
line1
line2 (<--- invisible)
line2
line3
This happens even when you just open the PDF and select / copy the text.
2. Commas show up correctly in the PDF, but when parsed, they appear in the wrong position.
e.g.
HTML:
hello, my name, is
PDF:
,,hello my name is
NOTE: this does not happen when you open the PDF and select / copy the text.
3. With sorting by position enabled:
final PDDocument document = PDDocument.load(pdfBytes);
final PDFTextStripper pdfTextStripper = new PDFTextStripper();
pdfTextStripper.setSortByPosition(true);
return pdfTextStripper.getText(document);
However, all superscripts / subscripts then get messed up in the output,
e.g. receptiońs becomes receptións
Do you know why these happen?
Thank you!
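To make the comma symptom in (2) easier to discuss, here is a minimal sketch in plain Java of how extraction order can differ from visual order. It is a simplified model and not PDFBox's actual algorithm: the `Fragment` class, the coordinates, and the assumption that the commas are written to the file before the words are all hypothetical and chosen only to reproduce the output shown above.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class ReadingOrderSketch {

    /** Hypothetical model: a piece of text plus the position where it is drawn. */
    static final class Fragment {
        final String text;
        final double x; // distance from the left edge
        final double y; // distance from the top edge

        Fragment(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /** Concatenates fragments in the order they were written to the file. */
    static String inStreamOrder(List<Fragment> fragments) {
        StringBuilder out = new StringBuilder();
        for (Fragment f : fragments) {
            out.append(f.text);
        }
        return out.toString();
    }

    /** Concatenates fragments top-to-bottom, then left-to-right. */
    static String inPositionOrder(List<Fragment> fragments) {
        List<Fragment> sorted = new ArrayList<>(fragments);
        sorted.sort(Comparator.comparingDouble((Fragment f) -> f.y)
                .thenComparingDouble(f -> f.x));
        return inStreamOrder(sorted);
    }

    public static void main(String[] args) {
        // Assumed drawing order: both commas first, then the words.
        List<Fragment> fragments = List.of(
                new Fragment(",", 25, 100),
                new Fragment(",", 70, 100),
                new Fragment("hello", 0, 100),
                new Fragment(" my name", 30, 100),
                new Fragment(" is", 75, 100));

        System.out.println(inStreamOrder(fragments));   // ,,hello my name is
        System.out.println(inPositionOrder(fragments)); // hello, my name, is
    }
}
```

In this model, reading the fragments in file order gives the misplaced commas, while sorting by position restores the visual order. A strict sort like this would also place any glyph drawn on a slightly different baseline on its own row, which might be related to the superscript problem in (3), but that is only a guess.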