Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

Use the correct dimension to know if we have to add an EOL in vertical mode #14428

Merged
merged 2 commits into from
Jan 15, 2022

Conversation

calixteman
Copy link
Contributor

No description provided.

@calixteman
Copy link
Contributor Author

/botio test

@pdfjsbot
Copy link

pdfjsbot commented Jan 7, 2022

From: Bot.io (Linux m4)


Received

Command cmd_test from @calixteman received. Current queue size: 0

Live output at: http://54.241.84.105:8877/ae36aaa0b5608a0/output.txt

@pdfjsbot
Copy link

pdfjsbot commented Jan 7, 2022

From: Bot.io (Windows)


Received

Command cmd_test from @calixteman received. Current queue size: 0

Live output at: http://54.193.163.58:8877/ecab8ec69c05843/output.txt

Copy link
Collaborator

@Snuffleupagus Snuffleupagus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As mentioned in PR #14418 this unfortunately causes a (pretty clear) regression, w.r.t. the positioning of some of the textLayer spans, on page 3 of the TaroUTR50SortedList112.pdf document.
Hence, on its own, I don't think that this patch is correct/enough to address this unfortunately.

Also, in the commit message you probably want to replace the word had with add instead?

@pdfjsbot
Copy link

pdfjsbot commented Jan 7, 2022

From: Bot.io (Linux m4)


Failed

Full output at http://54.241.84.105:8877/ae36aaa0b5608a0/output.txt

Total script time: 23.96 mins

  • Font tests: Passed
  • Unit tests: Passed
  • Integration Tests: Passed
  • Regression tests: FAILED
  different ref/snapshot: 13
  different first/second rendering: 1

Image differences available at: http://54.241.84.105:8877/ae36aaa0b5608a0/reftest-analyzer.html#web=eq.log

@pdfjsbot
Copy link

pdfjsbot commented Jan 7, 2022

From: Bot.io (Windows)


Failed

Full output at http://54.193.163.58:8877/ecab8ec69c05843/output.txt

Total script time: 41.73 mins

  • Font tests: Passed
  • Unit tests: Passed
  • Integration Tests: Passed
  • Regression tests: FAILED
  different ref/snapshot: 12
  different first/second rendering: 2

Image differences available at: http://54.193.163.58:8877/ecab8ec69c05843/reftest-analyzer.html#web=eq.log

@calixteman calixteman changed the title Use the correct dimension to know if we have to had an EOL in vertical mode Use the correct dimension to know if we have to add an EOL in vertical mode Jan 7, 2022
@calixteman
Copy link
Contributor Author

As mentioned in PR #14418 this unfortunately causes a (pretty clear) regression, w.r.t. the positioning of some of the textLayer spans, on page 3 of the TaroUTR50SortedList112.pdf document. Hence, on its own, I don't think that this patch is correct/enough to address this unfortunately.

Also, in the commit message you probably want to replace the word had with add instead?

The goal of the patch is just to fix an inconsistency.
For horizontal mode we've:

pdf.js/src/core/evaluator.js

Lines 2519 to 2528 in 290cbc5

0.5 * textContentItem.height /* not the same line */
) {
appendEOL();
return;
}
flushTextContentItem();
return;
}
if (Math.abs(advanceY) > textContentItem.height) {

(we use advanceY and height)
when for vertical mode we've:

pdf.js/src/core/evaluator.js

Lines 2468 to 2479 in 290cbc5

Math.abs(advanceX) >
0.5 * textContentItem.width /* not the same column */
) {
appendEOL();
return;
}
flushTextContentItem();
return;
}
if (Math.abs(advanceX) > textContentItem.height) {

(we use advanceX, width and height)

Anyway, right now if you copy/paste the last column you'll get an extra EOL between the whitespace and which is wrong,
and the patch fixes this problem.
In order to fix the position of the text chunk we would need to split the chunk in 2 chunks (one with the whitespace and one with the visible chars) and then we'll get a span for the whitespace and an other one for the visible glyphes (which will be correctly positioned).

@Snuffleupagus
Copy link
Collaborator

Snuffleupagus commented Jan 7, 2022

The goal of the patch is just to fix an inconsistency.

I get that, but as mentioned (and evident by the test results) the patch causes a regression. Given that there's no other documents, that we know of, which are fixed by this patch I really don't think that it's OK to "purposely" introduce a regression here.

Anyway, right now if you copy/paste the last column you'll get an extra EOL between the whitespace and which is wrong,
and the patch fixes this problem.

Sure, but the regression is worse than that (comparatively) small problem as far as I'm concerned. As-is the textLayer position no longer agrees with the rendered text, which would make it more difficult for users to actually select the intended text.

@calixteman
Copy link
Contributor Author

Agreed and the user selects the text generally to copy/paste it and so the EOL bug... and the circle is closed !

The whitespace on top of the last column is an ideographic space (https://www.compart.com/en/unicode/U+3000).
In the pdf we've two text runs for the last column: 1 for the ideographic space and one for the visible chars.
We ignore latin whitespace when they're in first position of a text run:

(i === 0 ||

I suppose we should do the same for any kind of whitespace.

In Chrome, this space is stripped out but it's present in Acrobat.

An other possibility is to detect that the space and the rest aren't really on the same column (not the same X) and in this case push the chunk with space to have a new one for the rest. It'd make sense in horizontal mode too: create a new chunk each time we don't have the same Y.

@Snuffleupagus, what do you think ?

@Snuffleupagus
Copy link
Collaborator

We ignore latin whitespace when they're in first position of a text run:
[...]
I suppose we should do the same for any kind of whitespace.

I'd guess that'd require changing all of the existing ... = " " checks, in this code, into something like WhitespaceRegexp.test(...) instead?
If so my only concern with that approach would be regarding the possible performance impact, since regular expression matching is probably slower than string comparisons. Other than that, this sounds like a fairly simple solution.

An other possibility is to detect that the space and the rest aren't really on the same column (not the same X) and in this case push the chunk with space to have a new one for the rest. It'd make sense in horizontal mode too: create a new chunk each time we don't have the same Y.

It sounds like that approach could potentially help improve other cases as well, although I suppose it could also lead to more textLayer elements being created (for some documents) since we'd break up existing text-runs more than we currently do.

@calixteman
Copy link
Contributor Author

We ignore latin whitespace when they're in first position of a text run:
[...]
I suppose we should do the same for any kind of whitespace.

I'd guess that'd require changing all of the existing ... = " " checks, in this code, into something like WhitespaceRegexp.test(...) instead? If so my only concern with that approach would be regarding the possible performance impact, since regular expression matching is probably slower than string comparisons. Other than that, this sounds like a fairly simple solution.

According to https://jsbench.me/b1ku0bggvi/1 (found in stackoverflow), the fastest solution is the direct comparison.
In order to avoid to make the check again and again (the i-1, i+1 thing in the code you mentioned), we could just create and array containing isWhitespace(glyph.unicode) for each glyph.
The main advantage of this solution is that we would treat all the whitespaces in the same way which is consistent. So I'd be in favor of this solution.

An other possibility is to detect that the space and the rest aren't really on the same column (not the same X) and in this case push the chunk with space to have a new one for the rest. It'd make sense in horizontal mode too: create a new chunk each time we don't have the same Y.

It sounds like that approach could potentially help improve other cases as well, although I suppose it could also lead to more textLayer elements being created (for some documents) since we'd break up existing text-runs more than we currently do.

I suppose that most of the times all glyphs are on the same line... but who knows.

@calixteman
Copy link
Contributor Author

/botio test

@pdfjsbot
Copy link

pdfjsbot commented Jan 7, 2022

From: Bot.io (Linux m4)


Received

Command cmd_test from @calixteman received. Current queue size: 0

Live output at: http://54.241.84.105:8877/fc6ea31a9faa9e4/output.txt

@pdfjsbot
Copy link

pdfjsbot commented Jan 7, 2022

From: Bot.io (Windows)


Received

Command cmd_test from @calixteman received. Current queue size: 0

Live output at: http://54.193.163.58:8877/b6af57d37c11e4f/output.txt

@pdfjsbot
Copy link

pdfjsbot commented Jan 7, 2022

From: Bot.io (Linux m4)


Failed

Full output at http://54.241.84.105:8877/fc6ea31a9faa9e4/output.txt

Total script time: 24.32 mins

  • Font tests: Passed
  • Unit tests: Passed
  • Integration Tests: Passed
  • Regression tests: FAILED
  different ref/snapshot: 28
  different first/second rendering: 1

Image differences available at: http://54.241.84.105:8877/fc6ea31a9faa9e4/reftest-analyzer.html#web=eq.log

Copy link
Collaborator

@Snuffleupagus Snuffleupagus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based on a quick look, at least issue7878 seem to regress a bit with the latest version of the PR.

@@ -2572,6 +2593,8 @@ class PartialEvaluator {

const glyphs = font.charsToGlyphs(chars);
const scale = textState.fontMatrix[0] * textState.fontSize;
const whitespaces = glyphs.map(glyph => isWhitespace(glyph.unicode));
Copy link
Collaborator

@Snuffleupagus Snuffleupagus Jan 7, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One question: Rather than looping through the glyphs here, would it possibly make sense (and be more efficient) to add a new property (e.g. isWhitespace or similar) to the Glyph-instances and utilize the new helper function there instead to determine this for each glyph as part of the existing parsing in https://github.com/mozilla/pdf.js/blob/master/src/core/fonts.js#L3153?

That would obviously require extending the Glyph class, but given the existence of both a charsCache and a glyphCache in the font code that should actually help reduce the overall amount of parsing required.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea.

@pdfjsbot
Copy link

pdfjsbot commented Jan 7, 2022

From: Bot.io (Windows)


Failed

Full output at http://54.193.163.58:8877/b6af57d37c11e4f/output.txt

Total script time: 41.63 mins

  • Font tests: Passed
  • Unit tests: Passed
  • Integration Tests: FAILED
  • Regression tests: FAILED
  different ref/snapshot: 32
  different first/second rendering: 1

Image differences available at: http://54.193.163.58:8877/b6af57d37c11e4f/reftest-analyzer.html#web=eq.log

@calixteman
Copy link
Contributor Author

Based on a quick look, at least issue7878 seem to regress a bit with the latest version of the PR.

The space between Z and a is a non-breakable space (0xA0) so before the patch it was considered as a "normal" char.
And since the chars in the string are slightly tightened we've a span by char:

flushTextContentItem();

const NEGATIVE_SPACE_FACTOR = -0.2;

With the patch, the 0xA0 is removed (it's the only one char in its text run) and when we add the a we compare its position with the Z and there is enough space for a white. Consequently it results in a chunk with Z a which is not consistent with the drawn glyphes... I'm sad...
I can cheat in removing the 0xA0 from the whitespace list...

I think this regression is acceptable compared to the "improvements" we have in some other cases: it's always this problem to find a good trade-off between number of chunks and chars positions in the text layer.

For your information, I've a wip locally to replace the html-based text layer by a svg-based one: it'll greatly improve the char positions, but I've few regressions I need to figure out.

@@ -212,6 +215,8 @@ class Glyph {
this.operatorListId = operatorListId;
this.isSpace = isSpace;
this.isInFont = isInFont;
this.isWhitespace = isWhitespace(unicode);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given the existing isSpace property, could we perhaps call this new one e.g. isUnicodeSpace (or something) instead to more clearly distinguish them from each other?

Also, please update the name of the new helper function accordingly to avoid confusion with this already existing utility function:

// Checks if ch is one of the following characters: SPACE, TAB, CR or LF.
function isWhiteSpace(ch) {
return ch === 0x20 || ch === 0x09 || ch === 0x0d || ch === 0x0a;
}

Comment on lines 1653 to 1654
c === "\u2000" ||
c === "\u200a" ||
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't there a bunch of characters missing here, since https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/Regular_Expressions/Cheatsheet#character_classes contains the following information:
[ \f\n\r\t\v\u00a0\u1680\u2000-\u200a\u2028\u2029\u202f\u205f\u3000\ufeff]

Note in particular the hyphen between \u2000 and \u200a, which indicates a range of characters.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, I moved the diacritic test (I added it in the past to avoid to take into account the diacritic width in the chunk width) in the Glyph class to enjoy the cache. This test (diacritic one) is achieved thanks a regex which is "slow", so since we've it I put the white space test into the regex too and consequently it's simplified (just use \s and no need to have all the possibilities).

@calixteman
Copy link
Contributor Author

/botio test

@pdfjsbot
Copy link

From: Bot.io (Linux m4)


Received

Command cmd_test from @calixteman received. Current queue size: 0

Live output at: http://54.241.84.105:8877/6f3a7c3b156b1f0/output.txt

@pdfjsbot
Copy link

From: Bot.io (Windows)


Received

Command cmd_test from @calixteman received. Current queue size: 0

Live output at: http://54.193.163.58:8877/270de969d0be7ae/output.txt

@pdfjsbot
Copy link

From: Bot.io (Linux m4)


Failed

Full output at http://54.241.84.105:8877/6f3a7c3b156b1f0/output.txt

Total script time: 22.61 mins

  • Font tests: Passed
  • Unit tests: Passed
  • Integration Tests: Passed
  • Regression tests: FAILED
  different ref/snapshot: 28
  different first/second rendering: 2

Image differences available at: http://54.241.84.105:8877/6f3a7c3b156b1f0/reftest-analyzer.html#web=eq.log

@pdfjsbot
Copy link

From: Bot.io (Windows)


Failed

Full output at http://54.193.163.58:8877/270de969d0be7ae/output.txt

Total script time: 40.67 mins

  • Font tests: Passed
  • Unit tests: Passed
  • Integration Tests: Passed
  • Regression tests: FAILED
  different ref/snapshot: 27
  different first/second rendering: 1

Image differences available at: http://54.193.163.58:8877/270de969d0be7ae/reftest-analyzer.html#web=eq.log

Copy link
Collaborator

@Snuffleupagus Snuffleupagus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

r=me, since this seems like a good change overall by reducing the usage of regular expressions during text-extraction.

@@ -212,6 +213,10 @@ class Glyph {
this.operatorListId = operatorListId;
this.isSpace = isSpace;
this.isInFont = isInFont;

const categories = checkCharUnicodeCategory(unicode);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Perhaps getCharUnicodeCategory instead, since that more clearly (to me at least) suggests that the function returns an actual value.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And, for consistency const category = ... instead here to not mix singular and plural in the names :-)

@calixteman calixteman merged commit da953f4 into mozilla:master Jan 15, 2022
@calixteman
Copy link
Contributor Author

/botio makeref

@pdfjsbot
Copy link

From: Bot.io (Windows)


Received

Command cmd_makeref from @calixteman received. Current queue size: 1

Live output at: http://54.193.163.58:8877/7f99d3996341de7/output.txt

@pdfjsbot
Copy link

From: Bot.io (Linux m4)


Received

Command cmd_makeref from @calixteman received. Current queue size: 0

Live output at: http://54.241.84.105:8877/5b0e25f77527b19/output.txt

@pdfjsbot
Copy link

From: Bot.io (Linux m4)


Success

Full output at http://54.241.84.105:8877/5b0e25f77527b19/output.txt

Total script time: 20.25 mins

  • Lint: Passed
  • Make references: Passed
  • Check references: Passed

@pdfjsbot
Copy link

From: Bot.io (Windows)


Success

Full output at http://54.193.163.58:8877/7f99d3996341de7/output.txt

Total script time: 37.05 mins

  • Lint: Passed
  • Make references: Passed
  • Check references: Passed

# for free to join this conversation on GitHub. Already have an account? # to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants