extract_text inserts newlines where it shouldn't #292

Open
Heinenen opened this issue Aug 7, 2024 · 1 comment

@Heinenen
Collaborator

Heinenen commented Aug 7, 2024

Continuing the discussion from #125 (comment).

The responsible code is found at https://github.com/J-F-Liu/lopdf/blob/master/src/parser_aux.rs#L94.

Other PDF viewers handle this with heuristics. For example, see what pdf.js does in https://github.com/mozilla/pdf.js/blob/341a0b6d477d2909fcb14bcbfdf0d2fd37406cb0/src/core/evaluator.js#L2966.
The crux: if the x- or y-coordinate changes by more than a certain threshold (which indicates a new column or a new line), a newline is inserted.
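To make the idea concrete, here is a minimal sketch of such a threshold check. It is not the pdf.js or lopdf implementation: the `TextRun` type, the `join_runs` function, and both threshold parameters are hypothetical names chosen for illustration, and it assumes that each piece of text already comes with its position in page coordinates.

```rust
/// A positioned piece of text. Hypothetical type, not part of lopdf.
struct TextRun {
    x_start: f32, // x-coordinate where the run begins
    x_end: f32,   // x-coordinate where the run ends
    y: f32,       // baseline y-coordinate
    text: String,
}

/// Concatenates runs and inserts a newline only when the position jumps
/// by more than the given thresholds (assumed values, to be tuned).
fn join_runs(runs: &[TextRun], x_threshold: f32, y_threshold: f32) -> String {
    let mut out = String::new();
    let mut prev: Option<&TextRun> = None;
    for run in runs {
        if let Some(p) = prev {
            // Vertical distance between the two baselines.
            let dy = (run.y - p.y).abs();
            // Horizontal distance between the end of the previous run
            // and the start of the current one.
            let dx = (run.x_start - p.x_end).abs();
            if dy > y_threshold || dx > x_threshold {
                out.push('\n');
            }
        }
        out.push_str(&run.text);
        prev = Some(run);
    }
    out
}
```

With this approach, consecutive runs on the same baseline are joined without a newline, while a run that starts on a different baseline or far away horizontally starts a new line. Good threshold values probably depend on the font size, which this sketch ignores.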

@Heinenen
Collaborator Author

Heinenen commented Aug 7, 2024

Another related problem: in some places, lopdf should add whitespace (a single space?) where the PDF doesn't explicitly contain one.

An example probably demonstrates this best:

use lopdf::Document;

#[test]
fn test_extract() {
    // Load the example PDF and print the extracted text of page 4.
    let doc = Document::load("extract_text_dkp.pdf").unwrap();
    let text = doc.extract_text(&[4]).unwrap();
    println!("{}", text);
}

prints

4InhaltSeiteSozialismusvorstellungen:Sozialismus - die historische Alternativezum Kapitalismus5Als Arbeits- und Diskussionsgrundlagebeschlossene Abänderungs- oderErgänzungsanträge und beschlosseneAnträge13

This is what the page that the text is extracted from looks like:
image

(Example PDF: extract_text_dkp.pdf, taken from #217 (comment))
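One possible way to restore the missing separators is a gap-based check: insert a single space whenever the next fragment starts on a different line or noticeably to the right of where the previous one ended. The sketch below only illustrates that idea. `Fragment`, `join_with_spaces`, and `gap_ratio` are hypothetical names that do not exist in lopdf, and the positions would have to be computed from the content stream first.

```rust
/// A positioned text fragment. Hypothetical type, not part of lopdf.
struct Fragment {
    x_start: f32,
    x_end: f32,
    y: f32,
    font_size: f32,
    text: String,
}

/// Joins fragments and inserts a single space where the layout implies one:
/// on a line change, or when the horizontal gap exceeds `gap_ratio` times
/// the font size. Both criteria are assumptions, not lopdf behaviour.
fn join_with_spaces(fragments: &[Fragment], gap_ratio: f32) -> String {
    let mut out = String::new();
    let mut prev: Option<&Fragment> = None;
    for frag in fragments {
        if let Some(p) = prev {
            // Treat a baseline shift of more than half the font size as a new line.
            let line_changed = (frag.y - p.y).abs() > 0.5 * p.font_size;
            // A wide horizontal gap on the same line suggests a word boundary.
            let wide_gap = frag.x_start - p.x_end > gap_ratio * p.font_size;
            // Do not add a space if the text already carries one at the seam.
            let already_separated = out.ends_with(char::is_whitespace)
                || frag.text.starts_with(char::is_whitespace);
            if (line_changed || wide_gap) && !already_separated {
                out.push(' ');
            }
        }
        out.push_str(&frag.text);
        prev = Some(frag);
    }
    out
}
```

For instance, with made-up coordinates, a fragment "Alternative" at the end of one line followed by "zum Kapitalismus" on the next line would be joined as "Alternative zum Kapitalismus" rather than "Alternativezum Kapitalismus". Whether a line change should produce a space or a newline is a separate design decision that this sketch does not settle.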
