Use the correct dimension to know if we have to add an EOL in vertical mode #14428

calixteman · 2022-01-07T13:05:03Z

No description provided.

calixteman · 2022-01-07T13:05:19Z

/botio test

pdfjsbot · 2022-01-07T13:05:20Z

From: Bot.io (Linux m4)

Received

Command cmd_test from @calixteman received. Current queue size: 0

Live output at: http://54.241.84.105:8877/ae36aaa0b5608a0/output.txt

pdfjsbot · 2022-01-07T13:05:20Z

From: Bot.io (Windows)

Received

Command cmd_test from @calixteman received. Current queue size: 0

Live output at: http://54.193.163.58:8877/ecab8ec69c05843/output.txt

Snuffleupagus

As mentioned in PR #14418 this unfortunately causes a (pretty clear) regression, w.r.t. the positioning of some of the textLayer spans, on page 3 of the TaroUTR50SortedList112.pdf document.
Hence, on its own, I don't think that this patch is correct/enough to address this unfortunately.

Also, in the commit message you probably want to replace the word had with add instead?

pdfjsbot · 2022-01-07T13:29:18Z

From: Bot.io (Linux m4)

Failed

Full output at http://54.241.84.105:8877/ae36aaa0b5608a0/output.txt

Total script time: 23.96 mins

Font tests: Passed
Unit tests: Passed
Integration Tests: Passed
Regression tests: FAILED

  different ref/snapshot: 13
  different first/second rendering: 1

Image differences available at: http://54.241.84.105:8877/ae36aaa0b5608a0/reftest-analyzer.html#web=eq.log

pdfjsbot · 2022-01-07T13:47:05Z

From: Bot.io (Windows)

Failed

Full output at http://54.193.163.58:8877/ecab8ec69c05843/output.txt

Total script time: 41.73 mins

Font tests: Passed
Unit tests: Passed
Integration Tests: Passed
Regression tests: FAILED

  different ref/snapshot: 12
  different first/second rendering: 2

Image differences available at: http://54.193.163.58:8877/ecab8ec69c05843/reftest-analyzer.html#web=eq.log

calixteman · 2022-01-07T14:16:53Z

As mentioned in PR #14418 this unfortunately causes a (pretty clear) regression, w.r.t. the positioning of some of the textLayer spans, on page 3 of the TaroUTR50SortedList112.pdf document. Hence, on its own, I don't think that this patch is correct/enough to address this unfortunately.

Also, in the commit message you probably want to replace the word had with add instead?

The goal of the patch is just to fix an inconsistency.
For horizontal mode we've:

pdf.js/src/core/evaluator.js

Lines 2519 to 2528 in 290cbc5

    
               0.5 * textContentItem.height /* not the same line */ 
        
             ) { 
        
               appendEOL(); 
        
               return; 
        
             } 
        
             flushTextContentItem(); 
        
             return; 
        
           } 
        
           if (Math.abs(advanceY) > textContentItem.height) {

(we use advanceY and height)
when for vertical mode we've:

pdf.js/src/core/evaluator.js

Lines 2468 to 2479 in 290cbc5

    
               Math.abs(advanceX) > 
        
               0.5 * textContentItem.width /* not the same column */ 
        
             ) { 
        
               appendEOL(); 
        
               return; 
        
             } 
        
             flushTextContentItem(); 
        
             return; 
        
           } 
        
           if (Math.abs(advanceX) > textContentItem.height) {

(we use advanceX, width and height)

Anyway, right now if you copy/paste the last column you'll get an extra EOL between the whitespace and 縦 which is wrong,
and the patch fixes this problem.
In order to fix the position of the text chunk we would need to split the chunk in 2 chunks (one with the whitespace and one with the visible chars) and then we'll get a span for the whitespace and an other one for the visible glyphes (which will be correctly positioned).

…l mode

Snuffleupagus · 2022-01-07T14:35:33Z

The goal of the patch is just to fix an inconsistency.

I get that, but as mentioned (and evident by the test results) the patch causes a regression. Given that there's no other documents, that we know of, which are fixed by this patch I really don't think that it's OK to "purposely" introduce a regression here.

Anyway, right now if you copy/paste the last column you'll get an extra EOL between the whitespace and 縦 which is wrong,
and the patch fixes this problem.

Sure, but the regression is worse than that (comparatively) small problem as far as I'm concerned. As-is the textLayer position no longer agrees with the rendered text, which would make it more difficult for users to actually select the intended text.

calixteman · 2022-01-07T15:50:58Z

Agreed and the user selects the text generally to copy/paste it and so the EOL bug... and the circle is closed !

The whitespace on top of the last column is an ideographic space (https://www.compart.com/en/unicode/U+3000).
In the pdf we've two text runs for the last column: 1 for the ideographic space and one for the visible chars.
We ignore latin whitespace when they're in first position of a text run:

pdf.js/src/core/evaluator.js

Line 2589 in 290cbc5

(i === 0 ||

I suppose we should do the same for any kind of whitespace.

In Chrome, this space is stripped out but it's present in Acrobat.

An other possibility is to detect that the space and the rest aren't really on the same column (not the same X) and in this case push the chunk with space to have a new one for the rest. It'd make sense in horizontal mode too: create a new chunk each time we don't have the same Y.

@Snuffleupagus, what do you think ?

Snuffleupagus · 2022-01-07T16:46:47Z

We ignore latin whitespace when they're in first position of a text run:
[...]
I suppose we should do the same for any kind of whitespace.

I'd guess that'd require changing all of the existing ... = " " checks, in this code, into something like WhitespaceRegexp.test(...) instead?
If so my only concern with that approach would be regarding the possible performance impact, since regular expression matching is probably slower than string comparisons. Other than that, this sounds like a fairly simple solution.

An other possibility is to detect that the space and the rest aren't really on the same column (not the same X) and in this case push the chunk with space to have a new one for the rest. It'd make sense in horizontal mode too: create a new chunk each time we don't have the same Y.

It sounds like that approach could potentially help improve other cases as well, although I suppose it could also lead to more textLayer elements being created (for some documents) since we'd break up existing text-runs more than we currently do.

calixteman · 2022-01-07T19:13:22Z

We ignore latin whitespace when they're in first position of a text run:
[...]
I suppose we should do the same for any kind of whitespace.

I'd guess that'd require changing all of the existing ... = " " checks, in this code, into something like WhitespaceRegexp.test(...) instead? If so my only concern with that approach would be regarding the possible performance impact, since regular expression matching is probably slower than string comparisons. Other than that, this sounds like a fairly simple solution.

According to https://jsbench.me/b1ku0bggvi/1 (found in stackoverflow), the fastest solution is the direct comparison.
In order to avoid to make the check again and again (the i-1, i+1 thing in the code you mentioned), we could just create and array containing isWhitespace(glyph.unicode) for each glyph.
The main advantage of this solution is that we would treat all the whitespaces in the same way which is consistent. So I'd be in favor of this solution.

An other possibility is to detect that the space and the rest aren't really on the same column (not the same X) and in this case push the chunk with space to have a new one for the rest. It'd make sense in horizontal mode too: create a new chunk each time we don't have the same Y.

It sounds like that approach could potentially help improve other cases as well, although I suppose it could also lead to more textLayer elements being created (for some documents) since we'd break up existing text-runs more than we currently do.

I suppose that most of the times all glyphs are on the same line... but who knows.

calixteman · 2022-01-07T20:21:40Z

/botio test

pdfjsbot · 2022-01-07T20:21:41Z

From: Bot.io (Linux m4)

Received

Command cmd_test from @calixteman received. Current queue size: 0

Live output at: http://54.241.84.105:8877/fc6ea31a9faa9e4/output.txt

pdfjsbot · 2022-01-07T20:21:41Z

From: Bot.io (Windows)

Received

Command cmd_test from @calixteman received. Current queue size: 0

Live output at: http://54.193.163.58:8877/b6af57d37c11e4f/output.txt

pdfjsbot · 2022-01-07T20:46:01Z

From: Bot.io (Linux m4)

Failed

Full output at http://54.241.84.105:8877/fc6ea31a9faa9e4/output.txt

Total script time: 24.32 mins

Font tests: Passed
Unit tests: Passed
Integration Tests: Passed
Regression tests: FAILED

  different ref/snapshot: 28
  different first/second rendering: 1

Image differences available at: http://54.241.84.105:8877/fc6ea31a9faa9e4/reftest-analyzer.html#web=eq.log

Snuffleupagus

Based on a quick look, at least issue7878 seem to regress a bit with the latest version of the PR.

Snuffleupagus · 2022-01-07T20:48:47Z

src/core/evaluator.js

@@ -2572,6 +2593,8 @@ class PartialEvaluator {

      const glyphs = font.charsToGlyphs(chars);
      const scale = textState.fontMatrix[0] * textState.fontSize;
+      const whitespaces = glyphs.map(glyph => isWhitespace(glyph.unicode));


One question: Rather than looping through the glyphs here, would it possibly make sense (and be more efficient) to add a new property (e.g. isWhitespace or similar) to the Glyph-instances and utilize the new helper function there instead to determine this for each glyph as part of the existing parsing in https://github.com/mozilla/pdf.js/blob/master/src/core/fonts.js#L3153?

That would obviously require extending the Glyph class, but given the existence of both a charsCache and a glyphCache in the font code that should actually help reduce the overall amount of parsing required.

pdfjsbot · 2022-01-07T21:03:19Z

From: Bot.io (Windows)

Failed

Full output at http://54.193.163.58:8877/b6af57d37c11e4f/output.txt

Total script time: 41.63 mins

Font tests: Passed
Unit tests: Passed
Integration Tests: FAILED
Regression tests: FAILED

  different ref/snapshot: 32
  different first/second rendering: 1

Image differences available at: http://54.193.163.58:8877/b6af57d37c11e4f/reftest-analyzer.html#web=eq.log

calixteman · 2022-01-07T21:27:46Z

Based on a quick look, at least issue7878 seem to regress a bit with the latest version of the PR.

The space between Z and a is a non-breakable space (0xA0) so before the patch it was considered as a "normal" char.
And since the chars in the string are slightly tightened we've a span by char:

pdf.js/src/core/evaluator.js

Line 2524 in 290cbc5

flushTextContentItem();

pdf.js/src/core/evaluator.js

Line 2210 in 290cbc5

const NEGATIVE_SPACE_FACTOR = -0.2;

With the patch, the 0xA0 is removed (it's the only one char in its text run) and when we add the a we compare its position with the Z and there is enough space for a white. Consequently it results in a chunk with Z a which is not consistent with the drawn glyphes... I'm sad...
I can cheat in removing the 0xA0 from the whitespace list...

I think this regression is acceptable compared to the "improvements" we have in some other cases: it's always this problem to find a good trade-off between number of chunks and chars positions in the text layer.

For your information, I've a wip locally to replace the html-based text layer by a svg-based one: it'll greatly improve the char positions, but I've few regressions I need to figure out.

Snuffleupagus · 2022-01-08T22:12:37Z

src/core/fonts.js

@@ -212,6 +215,8 @@ class Glyph {
    this.operatorListId = operatorListId;
    this.isSpace = isSpace;
    this.isInFont = isInFont;
+    this.isWhitespace = isWhitespace(unicode);


Given the existing isSpace property, could we perhaps call this new one e.g. isUnicodeSpace (or something) instead to more clearly distinguish them from each other?

Also, please update the name of the new helper function accordingly to avoid confusion with this already existing utility function:

pdf.js/src/core/core_utils.js

Lines 247 to 250 in 8ac0ccc

// Checks if ch is one of the following characters: SPACE, TAB, CR or LF.

function isWhiteSpace(ch) {

return ch === 0x20 || ch === 0x09 || ch === 0x0d || ch === 0x0a;

}

Snuffleupagus · 2022-01-09T13:14:07Z

src/core/unicode.js

+    c === "\u2000" ||
+    c === "\u200a" ||


Isn't there a bunch of characters missing here, since https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/Regular_Expressions/Cheatsheet#character_classes contains the following information:
[ \f\n\r\t\v\u00a0\u1680\u2000-\u200a\u2028\u2029\u202f\u205f\u3000\ufeff]

Note in particular the hyphen between \u2000 and \u200a, which indicates a range of characters.

So, I moved the diacritic test (I added it in the past to avoid to take into account the diacritic width in the chunk width) in the Glyph class to enjoy the cache. This test (diacritic one) is achieved thanks a regex which is "slow", so since we've it I put the white space test into the regex too and consequently it's simplified (just use \s and no need to have all the possibilities).

calixteman · 2022-01-15T19:25:15Z

/botio test

pdfjsbot · 2022-01-15T19:25:16Z

From: Bot.io (Linux m4)

Received

Command cmd_test from @calixteman received. Current queue size: 0

Live output at: http://54.241.84.105:8877/6f3a7c3b156b1f0/output.txt

pdfjsbot · 2022-01-15T19:25:16Z

From: Bot.io (Windows)

Received

Command cmd_test from @calixteman received. Current queue size: 0

Live output at: http://54.193.163.58:8877/270de969d0be7ae/output.txt

src/core/unicode.js

pdfjsbot · 2022-01-15T19:47:53Z

From: Bot.io (Linux m4)

Failed

Full output at http://54.241.84.105:8877/6f3a7c3b156b1f0/output.txt

Total script time: 22.61 mins

Font tests: Passed
Unit tests: Passed
Integration Tests: Passed
Regression tests: FAILED

  different ref/snapshot: 28
  different first/second rendering: 2

Image differences available at: http://54.241.84.105:8877/6f3a7c3b156b1f0/reftest-analyzer.html#web=eq.log

pdfjsbot · 2022-01-15T20:05:56Z

From: Bot.io (Windows)

Failed

Full output at http://54.193.163.58:8877/270de969d0be7ae/output.txt

Total script time: 40.67 mins

Font tests: Passed
Unit tests: Passed
Integration Tests: Passed
Regression tests: FAILED

  different ref/snapshot: 27
  different first/second rendering: 1

Image differences available at: http://54.193.163.58:8877/270de969d0be7ae/reftest-analyzer.html#web=eq.log

Snuffleupagus

r=me, since this seems like a good change overall by reducing the usage of regular expressions during text-extraction.

Snuffleupagus · 2022-01-15T20:31:21Z

src/core/fonts.js

@@ -212,6 +213,10 @@ class Glyph {
    this.operatorListId = operatorListId;
    this.isSpace = isSpace;
    this.isInFont = isInFont;
+
+    const categories = checkCharUnicodeCategory(unicode);


Nit: Perhaps getCharUnicodeCategory instead, since that more clearly (to me at least) suggests that the function returns an actual value.

And, for consistency const category = ... instead here to not mix singular and plural in the names :-)

calixteman · 2022-01-15T20:48:38Z

/botio makeref

pdfjsbot · 2022-01-15T20:48:39Z

From: Bot.io (Windows)

Received

Command cmd_makeref from @calixteman received. Current queue size: 1

Live output at: http://54.193.163.58:8877/7f99d3996341de7/output.txt

pdfjsbot · 2022-01-15T20:48:39Z

From: Bot.io (Linux m4)

Received

Command cmd_makeref from @calixteman received. Current queue size: 0

Live output at: http://54.241.84.105:8877/5b0e25f77527b19/output.txt

pdfjsbot · 2022-01-15T21:08:55Z

From: Bot.io (Linux m4)

Success

Full output at http://54.241.84.105:8877/5b0e25f77527b19/output.txt

Total script time: 20.25 mins

Lint: Passed
Make references: Passed
Check references: Passed

pdfjsbot · 2022-01-15T21:29:04Z

From: Bot.io (Windows)

Success

Full output at http://54.193.163.58:8877/7f99d3996341de7/output.txt

Total script time: 37.05 mins

Lint: Passed
Make references: Passed
Check references: Passed

Snuffleupagus requested changes Jan 7, 2022

View reviewed changes

calixteman changed the title ~~Use the correct dimension to know if we have to had an EOL in vertical mode~~ Use the correct dimension to know if we have to add an EOL in vertical mode Jan 7, 2022

Use the correct dimension to know if we have to add an EOL in vertica…

9bb6364

…l mode

calixteman force-pushed the typo branch from 03c6265 to 9bb6364 Compare January 7, 2022 14:19

Snuffleupagus requested changes Jan 7, 2022

View reviewed changes

timvandermeij added the text-selection label Jan 8, 2022

calixteman force-pushed the typo branch 2 times, most recently from d17f523 to 3ac3817 Compare January 8, 2022 21:50

Snuffleupagus reviewed Jan 8, 2022

View reviewed changes

Snuffleupagus reviewed Jan 9, 2022

View reviewed changes

calixteman force-pushed the typo branch from 3ac3817 to 41197a7 Compare January 15, 2022 19:19

Snuffleupagus reviewed Jan 15, 2022

View reviewed changes

src/core/unicode.js Show resolved Hide resolved

Snuffleupagus approved these changes Jan 15, 2022

View reviewed changes

Handle all the whitespaces the same way when creating text chunks

9dae421

calixteman force-pushed the typo branch from 41197a7 to 9dae421 Compare January 15, 2022 20:44

calixteman merged commit da953f4 into mozilla:master Jan 15, 2022

Snuffleupagus mentioned this pull request Jan 15, 2022

Make the normalizeWhitespace handling, in the PartialEvaluator, more efficient (PR 14428 follow-up) #14455

Merged

Snuffleupagus mentioned this pull request Jan 31, 2022

The normalizeWhitespace = false option now is being (mostly) ignored #14519

Closed

	// Checks if ch is one of the following characters: SPACE, TAB, CR or LF.
	function isWhiteSpace(ch) {
	return ch === 0x20 \|\| ch === 0x09 \|\| ch === 0x0d \|\| ch === 0x0a;
	}

Use the correct dimension to know if we have to add an EOL in vertical mode #14428

Use the correct dimension to know if we have to add an EOL in vertical mode #14428

Conversation

calixteman commented Jan 7, 2022

calixteman commented Jan 7, 2022

pdfjsbot commented Jan 7, 2022

From: Bot.io (Linux m4)

Received

pdfjsbot commented Jan 7, 2022

From: Bot.io (Windows)

Received

Snuffleupagus left a comment • edited Loading

Choose a reason for hiding this comment

pdfjsbot commented Jan 7, 2022

From: Bot.io (Linux m4)

Failed

pdfjsbot commented Jan 7, 2022

From: Bot.io (Windows)

Failed

calixteman commented Jan 7, 2022

Snuffleupagus commented Jan 7, 2022 • edited Loading

calixteman commented Jan 7, 2022

Snuffleupagus commented Jan 7, 2022

calixteman commented Jan 7, 2022

calixteman commented Jan 7, 2022

pdfjsbot commented Jan 7, 2022

From: Bot.io (Linux m4)

Received

pdfjsbot commented Jan 7, 2022

From: Bot.io (Windows)

Received

pdfjsbot commented Jan 7, 2022

From: Bot.io (Linux m4)

Failed

Snuffleupagus left a comment

Choose a reason for hiding this comment

Snuffleupagus Jan 7, 2022 • edited Loading

Choose a reason for hiding this comment

calixteman Jan 7, 2022

Choose a reason for hiding this comment

pdfjsbot commented Jan 7, 2022

From: Bot.io (Windows)

Failed

calixteman commented Jan 7, 2022

Snuffleupagus Jan 8, 2022

Choose a reason for hiding this comment

Snuffleupagus Jan 9, 2022

Choose a reason for hiding this comment

calixteman Jan 15, 2022

Choose a reason for hiding this comment

calixteman commented Jan 15, 2022

pdfjsbot commented Jan 15, 2022

From: Bot.io (Linux m4)

Received

pdfjsbot commented Jan 15, 2022

From: Bot.io (Windows)

Received

pdfjsbot commented Jan 15, 2022

From: Bot.io (Linux m4)

Failed

pdfjsbot commented Jan 15, 2022

From: Bot.io (Windows)

Failed

Snuffleupagus left a comment • edited Loading

Choose a reason for hiding this comment

Snuffleupagus Jan 15, 2022

Choose a reason for hiding this comment

Snuffleupagus Jan 15, 2022

Choose a reason for hiding this comment

calixteman commented Jan 15, 2022

pdfjsbot commented Jan 15, 2022

From: Bot.io (Windows)

Received

pdfjsbot commented Jan 15, 2022

From: Bot.io (Linux m4)

Received

pdfjsbot commented Jan 15, 2022

From: Bot.io (Linux m4)

Success

pdfjsbot commented Jan 15, 2022

Snuffleupagus left a comment •

edited

Loading

Snuffleupagus commented Jan 7, 2022 •

edited

Loading

Snuffleupagus Jan 7, 2022 •

edited

Loading

Snuffleupagus left a comment •

edited

Loading