This repository has been archived by the owner on Dec 9, 2018. It is now read-only.

--space-as-offset 1 drops characters #445

Open
davidhedley opened this issue Nov 14, 2014 · 4 comments

Comments

@davidhedley

--space-as-offset 1 is incorrectly dropping some characters.

Test case: http://download.vistair.com/pdf2htmlEX/Page-2fromBAW-ALL-LHRSB.pdf

If you process it with this option, the "v" of "Effective" is dropped (converted to a space).
The font has a custom encoding, however the text is extractable.

@duanyao
Collaborator

duanyao commented Nov 14, 2014

The text-drawing code in the PDF is [(\033\036\036\023.\020\r \023\007)10.479(\013\024\026"\007)]TJ, corresponding to "Effective From". The 8th character code in the PDF string (after \r) is a space (code 32), but the font's encoding maps it to 'v'. So when --space-as-offset is on, that code is converted to an offset instead of being output as 'v'.
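To make the failure concrete, here is a minimal sketch that replays the TJ operand above. The code-to-character table is only inferred from the statement that the string corresponds to "Effective From" (it is not read from the PDF's font), and `decode` and its `naive_space_as_offset` parameter are hypothetical names, not pdf2htmlEX code.

```cpp
#include <cassert>
#include <map>
#include <string>

// Assumed custom encoding, inferred from the TJ operand in this thread.
// Note that code 32 (' ') maps to 'v' and code 7 maps to a real space.
static const std::map<unsigned char, char> kCustomEncoding = {
    {033, 'E'}, {036, 'f'}, {023, 'e'}, {'.', 'c'}, {020, 't'}, {'\r', 'i'},
    {' ', 'v'}, {007, ' '}, {013, 'F'}, {024, 'r'}, {026, 'o'}, {'"', 'm'},
};

// Decode raw character codes through the assumed encoding.
// With naive_space_as_offset set, every code 32 becomes a gap (shown here
// as ' ') before the encoding is consulted, which is the behaviour
// described above.
std::string decode(const std::string& codes, bool naive_space_as_offset) {
    std::string out;
    for (unsigned char c : codes) {
        if (naive_space_as_offset && c == ' ') {
            out += ' ';
            continue;
        }
        auto it = kCustomEncoding.find(c);
        assert(it != kCustomEncoding.end());  // sketch only covers this string
        out += it->second;
    }
    return out;
}
```

Under these assumptions, decoding both string segments of the operand with the flag off yields the full text, while turning the flag on replaces the 'v' with a gap.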

According to the comment in pdf2htmlEX code, this is a known limitation.

@coolwanglu can we make --space-as-offset more restrictive? E.g. change text.cc:102 from if(is_space && (param.space_as_offset)) to
if(is_space && (param.space_as_offset) && (uLen == 1) && (u[0] == ' ')). This seems to bring the 'v' back.
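The proposed condition can be isolated as a small predicate for testing. This is only a sketch: `treat_as_offset` and `CodePoint` are hypothetical names standing in for the surrounding code in text.cc, and the parameters simply mirror the variables named in the condition above.

```cpp
#include <cassert>
#include <cstdint>

// Stand-in for the Unicode code point type used by the real code (assumption).
using CodePoint = std::uint32_t;

// True only if the glyph sits in the space slot (is_space), the option is
// enabled, and the glyph decodes to exactly one U+0020 character.
bool treat_as_offset(bool is_space, bool space_as_offset,
                     const CodePoint* u, int uLen) {
    assert(uLen == 0 || u != nullptr);
    return is_space && space_as_offset && uLen == 1 && u[0] == U' ';
}
```

With this check, a code 32 that decodes to 'v' is drawn as text, while a code 32 that really decodes to a space is still turned into an offset.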

@duanyao
Collaborator

duanyao commented Nov 15, 2014

I tried to fix this in #446. @davidhedley, can you test the patch with your PDFs?

@davidhedley
Author

As an update to this, --optimize-text is also broken in the same way. It should not make any changes to the text unless the character encoding is "standard". Many subsetted fonts use character 32 as a normal character rather than as a space.

@davidhedley
Author

--space-threshold is similarly broken. We need a flag on the current font that says "custom font encoding - leave alone", and then skip --optimize-text, --space-as-offset and --space-threshold for characters in that font, as they are breaking the output.
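One possible shape for such a flag, as a rough sketch only: `FontEncodingInfo`, `has_custom_space` and `text_tweaks_allowed` are hypothetical names, not existing pdf2htmlEX types. The test used here (code 32 is mapped, but not to U+0020) is a narrower assumption than a full "is the encoding standard" check.

```cpp
#include <cstdint>
#include <map>

// Hypothetical per-font record: character code -> Unicode code point.
struct FontEncodingInfo {
    std::map<int, std::uint32_t> code_to_unicode;
};

// Flags a font as "custom" when code 32 is mapped to something other than
// U+0020. A font with no mapping for code 32 is not flagged.
bool has_custom_space(const FontEncodingInfo& font) {
    auto it = font.code_to_unicode.find(32);
    return it != font.code_to_unicode.end() && it->second != 0x20;
}

// Gate that callers could consult before applying space-related text
// tweaks to characters drawn in this font.
bool text_tweaks_allowed(const FontEncodingInfo& font) {
    return !has_custom_space(font);
}
```

A single per-font decision keeps the three options consistent with one another, at the cost of disabling them even for strings in that font that never use code 32.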
