Use the correct dimension to know if we have to add an EOL in vertical mode #14428
Conversation
/botio test
From: Bot.io (Linux m4): Received. Command cmd_test from @calixteman received. Current queue size: 0. Live output at: http://54.241.84.105:8877/ae36aaa0b5608a0/output.txt
From: Bot.io (Windows): Received. Command cmd_test from @calixteman received. Current queue size: 0. Live output at: http://54.193.163.58:8877/ecab8ec69c05843/output.txt
As mentioned in PR #14418 this unfortunately causes a (pretty clear) regression, w.r.t. the positioning of some of the textLayer spans, on page 3 of the TaroUTR50SortedList112.pdf document.
Hence, on its own, I don't think that this patch is correct/enough to address this unfortunately.
Also, in the commit message you probably want to replace the word "had" with "add" instead?
From: Bot.io (Linux m4): Failed. Full output at http://54.241.84.105:8877/ae36aaa0b5608a0/output.txt Total script time: 23.96 mins. Image differences available at: http://54.241.84.105:8877/ae36aaa0b5608a0/reftest-analyzer.html#web=eq.log
From: Bot.io (Windows): Failed. Full output at http://54.193.163.58:8877/ecab8ec69c05843/output.txt Total script time: 41.73 mins. Image differences available at: http://54.193.163.58:8877/ecab8ec69c05843/reftest-analyzer.html#web=eq.log
The goal of the patch is just to fix an inconsistency. In lines 2519 to 2528 in 290cbc5 we use advanceY and height, whereas for vertical mode, in lines 2468 to 2479 in 290cbc5, we use advanceX, width and height.
Anyway, right now if you copy/paste the last column you'll get an extra EOL between the whitespace and
I get that, but as mentioned (and evident by the test results) the patch causes a regression. Given that there are no other documents, that we know of, which are fixed by this patch I really don't think that it's OK to "purposely" introduce a regression here.
Sure, but the regression is worse than that (comparatively) small problem as far as I'm concerned. As-is the textLayer position no longer agrees with the rendered text, which would make it more difficult for users to actually select the intended text.
Agreed, and users generally select text in order to copy/paste it, hence the EOL bug... and the circle is closed! The whitespace at the top of the last column is an ideographic space (https://www.compart.com/en/unicode/U+3000). Line 2589 in 290cbc5
I suppose we should do the same for any kind of whitespace. In Chrome, this space is stripped out but it's present in Acrobat. Another possibility is to detect that the space and the rest aren't really in the same column (not the same X) and, in that case, push the chunk with the space so that a new one is started for the rest. It'd make sense in horizontal mode too: create a new chunk each time we don't have the same Y. @Snuffleupagus, what do you think?
I'd guess that'd require changing all of the existing
It sounds like that approach could potentially help improve other cases as well, although I suppose it could also lead to more textLayer elements being created (for some documents) since we'd break up existing text-runs more than we currently do. |
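To make the chunk-splitting idea discussed above concrete, here is a minimal sketch. It is not the PartialEvaluator code: the glyph shape ({ char, x, y }), the function name splitIntoChunks and the tolerance value are all assumptions made for illustration. It starts a new chunk whenever the cross-axis coordinate changes (X in vertical mode, Y in horizontal mode).

```javascript
// Hypothetical sketch: glyphs are plain { char, x, y } objects, which is
// NOT the data model used by pdf.js.
function splitIntoChunks(glyphs, vertical = false, tolerance = 0.01) {
  const chunks = [];
  let current = null;
  for (const glyph of glyphs) {
    // A column shares the same X in vertical mode; a line shares the
    // same Y in horizontal mode.
    const crossAxis = vertical ? glyph.x : glyph.y;
    if (
      current === null ||
      Math.abs(crossAxis - current.crossAxis) > tolerance
    ) {
      // The glyph isn't on the same column/line: start a new chunk.
      current = { crossAxis, str: "" };
      chunks.push(current);
    }
    current.str += glyph.char;
  }
  return chunks.map(chunk => chunk.str);
}

// An ideographic space sitting in a different column than the text below it.
const column = [
  { char: "\u3000", x: 100, y: 700 },
  { char: "A", x: 120, y: 680 },
  { char: "B", x: 120, y: 660 },
];
console.log(splitIntoChunks(column, /* vertical = */ true));
```

As noted in the comment above, splitting this eagerly trades fewer mispositioned spans for potentially more textLayer elements.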
According to https://jsbench.me/b1ku0bggvi/1 (found on Stack Overflow), the fastest solution is the direct comparison.
I suppose that most of the time all glyphs are on the same line... but who knows.
/botio test
From: Bot.io (Linux m4): Received. Command cmd_test from @calixteman received. Current queue size: 0. Live output at: http://54.241.84.105:8877/fc6ea31a9faa9e4/output.txt
From: Bot.io (Windows): Received. Command cmd_test from @calixteman received. Current queue size: 0. Live output at: http://54.193.163.58:8877/b6af57d37c11e4f/output.txt
From: Bot.io (Linux m4): Failed. Full output at http://54.241.84.105:8877/fc6ea31a9faa9e4/output.txt Total script time: 24.32 mins. Image differences available at: http://54.241.84.105:8877/fc6ea31a9faa9e4/reftest-analyzer.html#web=eq.log
Based on a quick look, at least issue7878 seems to regress a bit with the latest version of the PR.
src/core/evaluator.js
Outdated
@@ -2572,6 +2593,8 @@ class PartialEvaluator {

      const glyphs = font.charsToGlyphs(chars);
      const scale = textState.fontMatrix[0] * textState.fontSize;
      const whitespaces = glyphs.map(glyph => isWhitespace(glyph.unicode));
One question: Rather than looping through the glyphs here, would it possibly make sense (and be more efficient) to add a new property (e.g. isWhitespace or similar) to the Glyph-instances and utilize the new helper function there instead, to determine this for each glyph as part of the existing parsing in https://github.com/mozilla/pdf.js/blob/master/src/core/fonts.js#L3153?
That would obviously require extending the Glyph class, but given the existence of both a charsCache and a glyphCache in the font code that should actually help reduce the overall amount of parsing required.
Good idea.
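The suggestion above, computing the whitespace flag once per Glyph instance so that cached glyphs never need to be re-tested, could look roughly like the following. This is a simplified sketch, not the code in src/core/fonts.js: the SketchGlyph and SketchFont classes, the helper isUnicodeWhitespace and the single-Map cache are stand-ins invented for illustration.

```javascript
// Hypothetical helper; the real helper name was still under discussion.
function isUnicodeWhitespace(char) {
  return /^\s$/.test(char);
}

// Simplified stand-in for the Glyph class: the flag is computed once,
// in the constructor.
class SketchGlyph {
  constructor(fontChar, unicode) {
    this.fontChar = fontChar;
    this.unicode = unicode;
    this.isWhitespace = isUnicodeWhitespace(unicode);
  }
}

// Simplified stand-in for the font's glyph cache: each distinct char
// creates (and tests) exactly one glyph.
class SketchFont {
  constructor() {
    this.glyphCache = new Map();
  }

  charsToGlyphs(chars) {
    const glyphs = [];
    for (const char of chars) {
      let glyph = this.glyphCache.get(char);
      if (!glyph) {
        glyph = new SketchGlyph(char, char);
        this.glyphCache.set(char, glyph);
      }
      glyphs.push(glyph);
    }
    return glyphs;
  }
}

const font = new SketchFont();
const flags = font.charsToGlyphs("a \u3000 a").map(g => g.isWhitespace);
console.log(flags, font.glyphCache.size);
```

With this shape, the evaluator can read glyph.isWhitespace directly instead of mapping over the glyphs for every text run.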
From: Bot.io (Windows): Failed. Full output at http://54.193.163.58:8877/b6af57d37c11e4f/output.txt Total script time: 41.63 mins. Image differences available at: http://54.193.163.58:8877/b6af57d37c11e4f/reftest-analyzer.html#web=eq.log
The space between Line 2524 in 290cbc5
Line 2210 in 290cbc5
With the patch, the 0xA0 is removed (it's the only char in its text run) and when we add the
I think this regression is acceptable compared to the "improvements" we have in some other cases: it's always a problem of finding a good trade-off between the number of chunks and the char positions in the text layer.
For your information, I have a WIP locally to replace the HTML-based text layer with an SVG-based one: it'll greatly improve the char positions, but I have a few regressions I need to figure out.
d17f523 to 3ac3817
src/core/fonts.js
Outdated
@@ -212,6 +215,8 @@ class Glyph {
    this.operatorListId = operatorListId;
    this.isSpace = isSpace;
    this.isInFont = isInFont;
    this.isWhitespace = isWhitespace(unicode);
Given the existing isSpace property, could we perhaps call this new one e.g. isUnicodeSpace (or something) instead to more clearly distinguish them from each other?
Also, please update the name of the new helper function accordingly to avoid confusion with this already existing utility function:
Lines 247 to 250 in 8ac0ccc
// Checks if ch is one of the following characters: SPACE, TAB, CR or LF.
function isWhiteSpace(ch) {
  return ch === 0x20 || ch === 0x09 || ch === 0x0d || ch === 0x0a;
}
src/core/unicode.js
Outdated
    c === "\u2000" ||
    c === "\u200a" ||
Isn't there a bunch of characters missing here, since https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/Regular_Expressions/Cheatsheet#character_classes contains the following information:
[ \f\n\r\t\v\u00a0\u1680\u2000-\u200a\u2028\u2029\u202f\u205f\u3000\ufeff]
Note in particular the hyphen between \u2000 and \u200a, which indicates a range of characters.
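To illustrate the point about the range, here is a sketch of a direct-comparison helper that covers the whole character class quoted above, including every code point from \u2000 through \u200a. The function name isWhitespaceByComparison is hypothetical and this is not the helper from the PR; the final loop simply cross-checks the result against the engine's own \s for a few sample characters.

```javascript
// Hypothetical direct-comparison helper mirroring the character class
// [ \f\n\r\t\v\u00a0\u1680\u2000-\u200a\u2028\u2029\u202f\u205f\u3000\ufeff].
function isWhitespaceByComparison(char) {
  const code = char.charCodeAt(0);
  return (
    code === 0x20 ||
    (code >= 0x09 && code <= 0x0d) || // \t, \n, \v, \f, \r
    code === 0xa0 ||
    code === 0x1680 ||
    (code >= 0x2000 && code <= 0x200a) || // the range, not just both ends
    code === 0x2028 ||
    code === 0x2029 ||
    code === 0x202f ||
    code === 0x205f ||
    code === 0x3000 ||
    code === 0xfeff
  );
}

// Cross-check a few samples against the regular expression class.
for (const char of ["\u2000", "\u2005", "\u200a", "\u3000", "a"]) {
  console.log(
    char.charCodeAt(0).toString(16),
    isWhitespaceByComparison(char),
    /^\s$/.test(char)
  );
}
```

Comparing only against \u2000 and \u200a, as in the outdated diff, would miss the nine code points in between, such as \u2005.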
So, I moved the diacritic test (which I added in the past to avoid taking the diacritic width into account in the chunk width) into the Glyph class to benefit from the cache. This diacritic test is done with a regex, which is "slow", so since we already have it I put the whitespace test into the regex too; consequently it's simplified (just use \s, with no need to list all the possibilities).
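A rough sketch of what such a combined, cached check might look like follows. It is only an illustration under assumptions: the pattern (\s plus \p{Mn} for non-spacing marks), the property names and the Map-based cache are guesses at the general shape, not the code from this PR. The function name follows the getCharUnicodeCategory naming suggested later in this review.

```javascript
// Assumed pattern: one capture group per category of interest.
// \p{Mn} (non-spacing mark) is used here as a stand-in for "diacritic".
const SPECIAL_CHAR_REGEXP = /^(?:(\s)|(\p{Mn}))$/u;
const categoryCache = new Map();

// Returns a small object describing the char; results are cached so the
// ("slow") regular expression runs at most once per distinct char.
function getCharUnicodeCategory(char) {
  let category = categoryCache.get(char);
  if (category) {
    return category;
  }
  const groups = char.match(SPECIAL_CHAR_REGEXP);
  category = {
    isWhitespace: !!(groups && groups[1]),
    isZeroWidthDiacritic: !!(groups && groups[2]),
  };
  categoryCache.set(char, category);
  return category;
}

console.log(getCharUnicodeCategory("\u3000")); // ideographic space
console.log(getCharUnicodeCategory("\u0301")); // combining acute accent
console.log(getCharUnicodeCategory("a"));
```

Calling this from the Glyph constructor means the glyph cache and this cache together keep the regular expression off the hot path.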
/botio test
From: Bot.io (Linux m4): Received. Command cmd_test from @calixteman received. Current queue size: 0. Live output at: http://54.241.84.105:8877/6f3a7c3b156b1f0/output.txt
From: Bot.io (Windows): Received. Command cmd_test from @calixteman received. Current queue size: 0. Live output at: http://54.193.163.58:8877/270de969d0be7ae/output.txt
From: Bot.io (Linux m4): Failed. Full output at http://54.241.84.105:8877/6f3a7c3b156b1f0/output.txt Total script time: 22.61 mins. Image differences available at: http://54.241.84.105:8877/6f3a7c3b156b1f0/reftest-analyzer.html#web=eq.log
From: Bot.io (Windows): Failed. Full output at http://54.193.163.58:8877/270de969d0be7ae/output.txt Total script time: 40.67 mins. Image differences available at: http://54.193.163.58:8877/270de969d0be7ae/reftest-analyzer.html#web=eq.log
r=me, since this seems like a good change overall by reducing the usage of regular expressions during text-extraction.
src/core/fonts.js
Outdated
@@ -212,6 +213,10 @@ class Glyph {
    this.operatorListId = operatorListId;
    this.isSpace = isSpace;
    this.isInFont = isInFont;

    const categories = checkCharUnicodeCategory(unicode);
Nit: Perhaps getCharUnicodeCategory instead, since that more clearly (to me at least) suggests that the function returns an actual value.
And, for consistency, const category = ... instead here to not mix singular and plural in the names :-)
/botio makeref
From: Bot.io (Windows): Received. Command cmd_makeref from @calixteman received. Current queue size: 1. Live output at: http://54.193.163.58:8877/7f99d3996341de7/output.txt
From: Bot.io (Linux m4): Received. Command cmd_makeref from @calixteman received. Current queue size: 0. Live output at: http://54.241.84.105:8877/5b0e25f77527b19/output.txt
From: Bot.io (Linux m4): Success. Full output at http://54.241.84.105:8877/5b0e25f77527b19/output.txt Total script time: 20.25 mins
From: Bot.io (Windows): Success. Full output at http://54.193.163.58:8877/7f99d3996341de7/output.txt Total script time: 37.05 mins
No description provided.