fix issue 964 #965

Open
wants to merge 3 commits into base: develop

Conversation

jnhyperion commented Aug 10, 2023

I found that this issue is caused by some blank chars overlapping the non-blank chars that follow them.
The simple solution is to remove these overlapping blank chars.

fix: #964
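
To make the idea above concrete, here is a minimal sketch of dropping blank chars that overlap the next non-blank char. It is not the code from this PR: the helper names (`boxes_overlap`, `drop_overlapped_blanks`, `make_char`) are hypothetical, and the chars are plain dicts that only assume the `text`, `x0`, `x1`, `top`, and `bottom` keys found on pdfplumber char objects.

```python
def boxes_overlap(a, b):
    """Return True if two char bounding boxes intersect with positive area."""
    return (
        a["x0"] < b["x1"] and b["x0"] < a["x1"]
        and a["top"] < b["bottom"] and b["top"] < a["bottom"]
    )


def drop_overlapped_blanks(chars):
    """Drop each blank char whose box overlaps the next non-blank char."""
    kept = []
    for i, ch in enumerate(chars):
        if ch["text"].isspace():
            # Look ahead for the first non-blank char after this one.
            nxt = next((c for c in chars[i + 1:] if not c["text"].isspace()), None)
            if nxt is not None and boxes_overlap(ch, nxt):
                continue  # the blank sits on top of the following glyph
        kept.append(ch)
    return kept


def make_char(text, x0, x1, top=0.0, bottom=10.0):
    """Build a hypothetical char dict for the demo below."""
    return {"text": text, "x0": x0, "x1": x1, "top": top, "bottom": bottom}


chars = [
    make_char("a", 0, 5),
    make_char(" ", 4, 6),    # overlaps the following "b": dropped
    make_char("b", 5, 10),
    make_char(" ", 10, 13),  # a genuine gap before "c": kept
    make_char("c", 13, 18),
]

raw_text = "".join(c["text"] for c in chars)
cleaned_text = "".join(c["text"] for c in drop_overlapped_blanks(chars))
print(repr(raw_text), "->", repr(cleaned_text))
```

In this toy input, the first blank breaks "ab" into two pieces, while the second blank is a real word separator and survives the cleanup.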

jsvine (Owner) commented Aug 16, 2023

Thanks for this proposal, @jnhyperion. I think this particular change isn't quite right for the library, as it's quite specific to a particular (and relatively uncommon) edge case. I find that changes like this might fix the handling of some PDFs but risk causing problems for others, since PDFs vary so widely. But perhaps we can think of a more general feature that would still help for your use case, such as a simple `.extract_text(ignore_whitespace=True)` parameter or a `Page.remove_whitespace(..., only_overlapping=True)` method (in a similar spirit to `Page.dedupe_chars(...)`).
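
As a rough illustration of what an opt-in whitespace filter with an `only_overlapping` switch could look like, the sketch below works on plain char dicts rather than on a real `Page`. The standalone function `remove_whitespace(chars, only_overlapping=False)` and its helper are assumptions made for this example; the method's other parameters are not shown in this thread and are not guessed at here. It also assumes that "overlapping" means the blank's box intersects any non-blank char's box.

```python
def _boxes_overlap(a, b):
    """Return True if two char bounding boxes intersect with positive area."""
    return (
        a["x0"] < b["x1"] and b["x0"] < a["x1"]
        and a["top"] < b["bottom"] and b["top"] < a["bottom"]
    )


def remove_whitespace(chars, only_overlapping=False):
    """Return chars without whitespace (hypothetical standalone helper).

    only_overlapping=False drops every whitespace char.
    only_overlapping=True drops only whitespace chars whose box
    overlaps some non-whitespace char (an assumption of this sketch).
    """
    non_blank = [c for c in chars if not c["text"].isspace()]
    kept = []
    for ch in chars:
        if ch["text"].isspace():
            if not only_overlapping:
                continue
            if any(_boxes_overlap(ch, other) for other in non_blank):
                continue
        kept.append(ch)
    return kept


def _char(text, x0, x1):
    """Build a hypothetical single-line char dict for the demo."""
    return {"text": text, "x0": x0, "x1": x1, "top": 0.0, "bottom": 10.0}


sample = [
    _char("a", 0, 5),
    _char(" ", 4, 6),    # sits on top of neighbouring glyphs
    _char("b", 5, 10),
    _char(" ", 10, 13),  # ordinary space between words
    _char("c", 13, 18),
]

all_removed = "".join(c["text"] for c in remove_whitespace(sample))
overlap_removed = "".join(
    c["text"] for c in remove_whitespace(sample, only_overlapping=True)
)
print(repr(all_removed), repr(overlap_removed))
```

Keeping the behaviour behind an explicit call, as `Page.dedupe_chars(...)` does for duplicates, means PDFs that do not have this problem are extracted exactly as before.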

Added `page.remove_whitespace(only_overlapping=False, ...)`
jnhyperion (Author) commented

You're right. I added a new method, `Page.remove_whitespace`.

jnhyperion changed the base branch from stable to develop on August 28, 2023 02:05
Development

Successfully merging this pull request may close these issues.

extracted word is broken