Add a (global) cache to the `getCharUnicodeCategory` function #14490

Snuffleupagus · 2022-01-24T18:12:12Z

Given that the regular expression has already become more complex (after the initial patch adding it), it seems to me that it probably cannot hurt to add a global cache to reduce unnecessary re-parsing.
The `Glyph` instances are already cached *per* font; however, most documents use multiple fonts, and in practice there's very often a fair amount of overlap between the /ToUnicode data in different fonts[1].

Consider for example loading and rendering the entire `tracemonkey.pdf` document (from the test-suite), which isn't a particularly large document. In that case the `getCharUnicodeCategory` function is called a total of `601` times; however, only `106` *unique* Unicode characters are being checked.

*Please note:* In practice I suppose that this won't have a *huge* effect on overall performance; however, given the relative simplicity of this patch, I figured that it wouldn't hurt to submit it for review.

[1] Consider e.g. how different fonts are usually used for regular, bold, and italic text.
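
For illustration, here is a minimal sketch of the kind of global cache described above. It is not the actual patch: the regular expression, the function name `getCharUnicodeCategorySketch`, the result field names, and the `regexpRuns` counter are all assumptions made up for this example. The idea is simply that a module-level `Map`, shared by all fonts, lets a repeated character skip the regular expression entirely.

```javascript
// Hypothetical pattern for illustration only; NOT the actual pdf.js regex.
// Group 1: whitespace, group 2: non-spacing mark (Mn), group 3: format char (Cf).
const SPECIAL_CHAR_REGEXP = /^(?:(\s)|(\p{Mn})|(\p{Cf}))$/u;

// Module-level (global) cache, shared across all fonts in a document.
const categoryCache = new Map();

// Counts how often the regular expression actually runs (cache misses).
let regexpRuns = 0;

function getCharUnicodeCategorySketch(char) {
  const cached = categoryCache.get(char);
  if (cached) {
    return cached; // Cache hit: no regular expression matching needed.
  }
  regexpRuns++;
  const groups = char.match(SPECIAL_CHAR_REGEXP);
  const category = {
    isWhitespace: !!(groups && groups[1]),
    isZeroWidthDiacritic: !!(groups && groups[2]),
    isInvisibleFormatMark: !!(groups && groups[3]),
  };
  categoryCache.set(char, category);
  return category;
}

// Six lookups, but only three unique characters.
for (const char of ["a", " ", "a", "\u0301", " ", "a"]) {
  getCharUnicodeCategorySketch(char);
}
console.log(regexpRuns, categoryCache.size); // 3 3
```

With the numbers quoted above (`601` calls, `106` unique characters), a cache of this shape would answer 495 of the calls from the `Map` rather than from the regular expression.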

calixteman

Good idea, r+.

timvandermeij · 2022-01-24T18:46:56Z

/botio test

pdfjsbot · 2022-01-24T18:46:57Z

From: Bot.io (Linux m4)

Received

Command cmd_test from @timvandermeij received. Current queue size: 0

Live output at: http://54.241.84.105:8877/e669ee84700b8c1/output.txt

pdfjsbot · 2022-01-24T18:46:57Z

From: Bot.io (Windows)

Received

Command cmd_test from @timvandermeij received. Current queue size: 1

Live output at: http://54.193.163.58:8877/a7cc380bab47521/output.txt

pdfjsbot · 2022-01-24T19:10:24Z

From: Bot.io (Linux m4)

Failed

Full output at http://54.241.84.105:8877/e669ee84700b8c1/output.txt

Total script time: 23.45 mins

Font tests: Passed
Unit tests: Passed
Integration Tests: Passed
Regression tests: FAILED

  different ref/snapshot: 10
  different first/second rendering: 1

Image differences available at: http://54.241.84.105:8877/e669ee84700b8c1/reftest-analyzer.html#web=eq.log

pdfjsbot · 2022-01-24T19:30:39Z

From: Bot.io (Windows)

Failed

Full output at http://54.193.163.58:8877/a7cc380bab47521/output.txt

Total script time: 41.70 mins

Font tests: Passed
Unit tests: Passed
Integration Tests: Passed
Regression tests: FAILED

  different ref/snapshot: 9
  different first/second rendering: 1

Image differences available at: http://54.193.163.58:8877/a7cc380bab47521/reftest-analyzer.html#web=eq.log

Snuffleupagus added the core label Jan 24, 2022

calixteman approved these changes Jan 24, 2022

Snuffleupagus force-pushed the getCharUnicodeCategory-cache branch from 026ff52 to 8836593 Compare January 25, 2022 08:59

Snuffleupagus merged commit 583c39b into mozilla:master Jan 25, 2022

Snuffleupagus deleted the getCharUnicodeCategory-cache branch January 25, 2022 09:04
