PDF.js slow at rendering complex image #2618
[22:56:16.345] Error: bad XRef entry @ resource://pdf.js/build/pdf.js:656
Same as #2388?
Still there with today's version.
More console output. The error is the same as above, only in Dutch ;-)

```
error@resource://pdf.js/build/pdf.js:1184
XRef_fetch@resource://pdf.js/build/pdf.js:5554
XRef_fetchIfRef@resource://pdf.js/build/pdf.js:5521
Dict_get@resource://pdf.js/build/pdf.js:4768
pdfjsWrapper/Catalog.prototype.documentOutline@resource://pdf.js/build/pdf.js:4900
LocalPdfManager_ensure@resource://pdf.js/build/pdf.js:542
BasePdfManager_ensureCatalog@resource://pdf.js/build/pdf.js:497
parseSuccess@resource://pdf.js/build/pdf.js:36266
Promise_then@resource://pdf.js/build/pdf.js:1923
pdfjsWrapper/wphSetup/loadDocument//@resource://pdf.js/build/pdf.js:36297
Promise_then@resource://pdf.js/build/pdf.js:1923
pdfjsWrapper/wphSetup/loadDocument/@resource://pdf.js/build/pdf.js:36298
Promise_then@resource://pdf.js/build/pdf.js:1923
loadDocument@resource://pdf.js/build/pdf.js:36299
pdfManagerReady@resource://pdf.js/build/pdf.js:36441
Promise_then@resource://pdf.js/build/pdf.js:1923
wphSetupDoc@resource://pdf.js/build/pdf.js:36440
messageHandlerComObjOnMessage@resource://pdf.js/build/pdf.js:36220
```

The final messages, translated from Dutch: "An error occurred while loading the PDF file. PDF.js version 0.8.180 (build 3641c22). Message: bad XRef entry."
Still renders forever, but there are fewer (or no) errors in the console now:

```
Warning: Unable to read document outline
PDF c79c336f9cc7a242a4cd36f2317f7b59 [1.4 PDFsharp 1.31.1789-g (www.pdfsharp.com) / PDFsharp 1.31.1789-g (www.pdfsharp.com)] (PDF.js: 0.8.291)
```

Note that this PDF, http://march12.rsf.org/i/Report_EnemiesoftheInternet_2012.pdf, has the same console warnings, but does render. Maybe that helps to track down the problem.
An update: using PDF.js 0.8.296 the file (http://bugzilla-attachments.gnome.org/attachment.cgi?id=226471) actually renders, but it takes approximately 1.5 minutes to finish.
After #3461 landed, this file renders somewhat quicker than before. In my testing, the rendering time has gone down from approximately 90 to 75 seconds. More important than the speed-up is that with the latest version of PDF.js the text is rendered immediately; you only have to wait for the figure to finish rendering.
Renders in 37 seconds for me using PDF.js 0.8.396. Maybe the difference with your results is a difference in configuration, @Snuffleupagus? Indeed, the text is now available immediately, so that is a big advantage.
This is indeed just a performance problem now; somebody should either rename or close this bug.
@fmms You can rename it yourself since you opened the issue ;-) The issue title should have an Edit button next to it. Just rename it to 'PDF.js slow at rendering complex image' or something like that.
This document is notable for having millions of inline images, which are used to build up the textured parts of the picture.
And disabling QueueOptimizer slows it down a lot, because the
makeInlineImage() has a "are the next five chars ASCII?" check which is run after an "EI" sequence has been found. This check involves the creation of a new object because peekBytes() calls subarray(). Unfortunately, the check is currently run on whitespace chars even when an "EI" sequence has not yet been found, i.e. when it's not needed. For the PDF in mozilla#2618, there are over 820,000 such checks. This change reworks the relevant loop so that the check is only done once an "EI" sequence has been seen. This reduces the number of checks to 157,000, and speeds up rendering by somewhere between 2% and 7% (the measurements are noisy).
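The optimization described in that commit can be sketched as follows. This is an illustrative simplification, not the actual pdf.js code: `findInlineImageEnd` and `nextFiveAreAscii` are hypothetical names, and the real `makeInlineImage()` works on a stream object rather than a plain byte array. The key point is that the allocating lookahead check only runs once an "EI" sequence has actually been seen:

```javascript
// Sketch of the optimized "EI" scan (illustrative, not actual pdf.js code).
// The expensive "are the next five chars ASCII?" check runs only after an
// "E" followed by "I" has been seen, never on ordinary content bytes.
function findInlineImageEnd(bytes, start) {
  const E = 0x45, I = 0x49;
  let state = 0; // 0 = scanning, 1 = seen "E", 2 = seen "EI"
  for (let pos = start; pos < bytes.length; pos++) {
    const b = bytes[pos];
    if (state === 0) {
      state = b === E ? 1 : 0;
    } else if (state === 1) {
      state = b === I ? 2 : b === E ? 1 : 0;
    } else {
      // Only now pay for the lookahead (in pdf.js this is the allocating
      // peekBytes()/subarray() call, which is why it must stay rare).
      if (nextFiveAreAscii(bytes, pos)) {
        return pos - 2; // index of the "E" in the terminating "EI"
      }
      state = b === E ? 1 : 0; // false alarm: "EI" inside binary data
    }
  }
  return -1;
}

function nextFiveAreAscii(bytes, pos) {
  for (let i = pos; i < Math.min(pos + 5, bytes.length); i++) {
    const b = bytes[i];
    const ok = b === 0x00 || b === 0x09 || b === 0x0a || b === 0x0d ||
               (b >= 0x20 && b <= 0x7e);
    if (!ok) {
      return false;
    }
  }
  return true;
}
```

Before the fix, the equivalent of `nextFiveAreAscii` also ran on whitespace bytes in state 0, which for this PDF's hundreds of thousands of inline images added up to over 820,000 needless allocations.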
As described in mozilla#5444, the evaluator performs identity checking of paintImageMaskXObjects to decide if it can use paintImageMaskXObjectRepeat instead of paintImageMaskXObjectGroup. This can only ever work if the entry is a cache hit. However, the previous caching implementation was lazy: it would only consider an image cache-worthy if it was repeated, and only then would the repeated instance be cached. As a result, the sequence of identical images A1 A2 A3 A4 would be seen as A1 A2 A2 A2 by the evaluator, which prevents using the "repeat" optimization. Also, only the last encountered image was cached, so A1 B1 A2 B2 would stay A1 B1 A2 B2. The new implementation drops the lazy initialization of the cache. The threshold for enabling an image to be cached is rather small, so the potential waste in storage and adler32 calculation is low. It also caches any eligible image by its adler32. The two examples from above would now become A1 A1 A1 A1 and A1 B1 A1 B1, which not only saves temporary storage, but also prevents computing identical masks over and over again (which is the main performance impact of mozilla#2618).
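The caching change can be illustrated with a small sketch (hypothetical names and a toy hash; the real code lives in pdf.js's evaluator and hashes the image bytes). Keying the cache by a content hash, and populating it eagerly, means that repeated image data always resolves to the same cached object, so the evaluator's identity check succeeds and the "repeat" path can be taken:

```javascript
// Sketch of eager, hash-keyed image caching (hypothetical, simplified).
// Every eligible image is hashed and cached on first sight, so identical
// image data always resolves to the same cached entry.
class ImageCache {
  constructor(hashFn) {
    this.hashFn = hashFn;
    this.entries = new Map(); // content hash -> cached entry
  }

  getOrAdd(imageBytes) {
    const key = this.hashFn(imageBytes);
    let entry = this.entries.get(key);
    if (entry === undefined) {
      // Stand-in for the expensively decoded image mask.
      entry = { bytes: imageBytes };
      this.entries.set(key, entry);
    }
    return entry; // repeats return the *same* object (identity match)
  }
}
```

With this scheme the sequence A1 B1 A2 B2 produces only two cache entries, and A2/B2 are the very same objects as A1/B1, which is exactly what the paintImageMaskXObjectRepeat identity check needs. The trade-off, as the later hash-collision fix shows, is that correctness now depends entirely on the quality of the hash.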
#5445 has landed; I would go so far as to say that the rendering is now acceptable. What do you think?
I think it is good. Other viewers (Preview and Chrome) have performance issues with this file as well -- try zooming. Closing as resolved.
Nice work, Fabian.
…andard classes

This patch was tested using the PDF file from [issue 2618](mozilla#2618), i.e. https://bug570667.bugzilla-attachments.gnome.org/attachment.cgi?id=226471, with the following manifest file:

```
[
    {  "id": "issue2618",
       "file": "../web/pdfs/issue2618.pdf",
       "md5": "",
       "rounds": 50,
       "type": "eq"
    }
]
```

which gave the following results when comparing this patch against the `master` branch:

```
-- Grouped By browser, stat --
browser | stat         | Count | Baseline(ms) | Current(ms) | +/- |    % | Result(P<.05)
------- | ------------ | ----- | ------------ | ----------- | --- | ---- | -------------
firefox | Overall      |    50 |         3417 |        3426 |   9 | 0.27 |
firefox | Page Request |    50 |            1 |           1 |   0 | 5.41 |
firefox | Rendering    |    50 |         3416 |        3426 |   9 | 0.27 |
```

Based on these results there's no significant performance regression from using standard classes, and this patch should thus be OK.
1799927) *Please note:* This is a tentative patch, which only fixes the "wrong letter" part of bug 1799927.

It appears that the simple `computeAdler32` function, used when caching inline images, generates hash collisions for some (very short) TypedArrays. In this case that leads to some of the "letters", which are actually inline images, being rendered incorrectly. To avoid that we replace it with the `MurmurHash3_64` class instead, which is already used in various other parts of the code-base. The one disadvantage of doing this is that it's slightly slower, which in some cases will lead to a performance regression.[1] However I believe that we'll have to accept a smaller regression here, since the alternative is much worse (i.e. broken rendering).

One small benefit of these changes is that we can avoid creating lots of `Stream`-instances for already cached inline images.

---
[1] Doing some quick benchmarking in the viewer, using `#pdfBug=Stats`, with the PDF document from issue mozilla#2618 shows at least a 10 percent regression.
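For reference, the Adler-32 checksum that served as the cache key can be sketched as below. This is the standard algorithm (pdf.js's `computeAdler32` is a variant of it), shown here only to make the collision risk concrete: the two running sums occupy 16 bits each, and for very short inputs they stay small and cluster in a tiny corner of that space, so collisions are far more plausible than with a 64-bit hash such as MurmurHash3:

```javascript
// Standard Adler-32, the style of checksum whose collisions caused the
// "wrong letter" bug: a = 1 + sum of bytes, b = sum of the running values
// of a, both modulo 65521, packed as (b << 16) | a.
function adler32(bytes) {
  const MOD = 65521; // largest prime below 2^16
  let a = 1;
  let b = 0;
  for (let i = 0; i < bytes.length; i++) {
    a = (a + bytes[i]) % MOD;
    b = (b + a) % MOD;
  }
  return ((b << 16) | a) >>> 0; // >>> 0 keeps the result an unsigned 32-bit int
}
```

Note that a checksum like this is designed for error detection on streams, not for keying a cache of short byte arrays; MurmurHash3_64 trades a little speed for a much larger, better-mixed key space.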
I tried to look at http://bugzilla-attachments.gnome.org/attachment.cgi?id=226471. The progress bar moves to 100% and then nothing happens.
I am using Firefox 19 and pdf.js 0.7.99 on Ubuntu 13.04.