Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

PDF.js slow at rendering complex image #2618

Closed
fmms opened this issue Jan 27, 2013 · 16 comments
Closed

PDF.js slow at rendering complex image #2618

fmms opened this issue Jan 27, 2013 · 16 comments

Comments

@fmms
Copy link

fmms commented Jan 27, 2013

I tried to look at http://bugzilla-attachments.gnome.org/attachment.cgi?id=226471. The progress bar moves to 100% and then nothing happens.

I am using Firefox 19 and pdf.js 0.7.99 on Ubuntu 13.04.

@gigaherz
Copy link
Contributor

[22:56:16.345] Error: bad XRef entry @ resource://pdf.js/build/pdf.js:656

@gigaherz
Copy link
Contributor

Same as #2388 ?

@fmms
Copy link
Author

fmms commented May 26, 2013

Still there with todays version:

[11:10:46.743] An error occured while loading the file
PDF.js Version 0.8.180 (build: 3641c22)
Message: bad XRef entry

@timvandermeij
Copy link
Contributor

More console output. Error is the same as above, only in Dutch ;-)

error@resource://pdf.js/build/pdf.js:1184
XRef_fetch@resource://pdf.js/build/pdf.js:5554
XRef_fetchIfRef@resource://pdf.js/build/pdf.js:5521
Dict_get@resource://pdf.js/build/pdf.js:4768
pdfjsWrapper/Catalog.prototype.documentOutline@resource://pdf.js/build/pdf.js:4900
LocalPdfManager_ensure@resource://pdf.js/build/pdf.js:542
BasePdfManager_ensureCatalog@resource://pdf.js/build/pdf.js:497
parseSuccess@resource://pdf.js/build/pdf.js:36266
Promise_then@resource://pdf.js/build/pdf.js:1923
pdfjsWrapper/wphSetup/loadDocument//@resource://pdf.js/build/pdf.js:36297
Promise_then@resource://pdf.js/build/pdf.js:1923
pdfjsWrapper/wphSetup/loadDocument/@resource://pdf.js/build/pdf.js:36298
Promise_then@resource://pdf.js/build/pdf.js:1923
loadDocument@resource://pdf.js/build/pdf.js:36299
pdfManagerReady@resource://pdf.js/build/pdf.js:36441
Promise_then@resource://pdf.js/build/pdf.js:1923
wphSetupDoc@resource://pdf.js/build/pdf.js:36440
messageHandlerComObjOnMessage@resource://pdf.js/build/pdf.js:36220
Er is een fout opgetreden bij het laden van het PDF-bestand. PDF.js versie 0.8.180 (build 3641c22) Bericht: bad XRef entry

@timvandermeij
Copy link
Contributor

Still renders forever, but there are less (or no) errors in the console now:

Warning: Unable to read document outline
PDF c79c336f9cc7a242a4cd36f2317f7b59 [1.4 PDFsharp 1.31.1789-g (www.pdfsharp.com) / PDFsharp 1.31.1789-g (www.pdfsharp.com)] (PDF.js: 0.8.291)

Note that this PDF, http://march12.rsf.org/i/Report_EnemiesoftheInternet_2012.pdf, has the same console warnings, but does render. Maybe that helps to track down the problem.

@Snuffleupagus
Copy link
Collaborator

An update: using PDF.js 0.8.296 the file (http://bugzilla-attachments.gnome.org/attachment.cgi?id=226471) actually renders, but it takes approximately 1.5 minutes before it finishes.
As a comparison, it's even slow in Adobe Reader, taking 15 seconds before finishing rendering.

@Snuffleupagus
Copy link
Collaborator

After #3461 landed, this file renders somewhat quicker than before. In my testing, the rendering time has gone down from approximately 90 to 75 seconds. More importantly than the speed-up, is that with the latest version of PDF.js the text is rendered immediately and you only have to wait for the figure to finish rendering.

@timvandermeij
Copy link
Contributor

Renders in 37 seconds for me using PDF.js 0.8.396. Maybe the difference with your results is a difference in configuration, @Snuffleupagus? Indeed, the text is now available immediately, so that is a big advantage.

@fmms
Copy link
Author

fmms commented Aug 11, 2013

This is indeed just a performance problem now, somebody should either rename or close this bug.

@timvandermeij
Copy link
Contributor

@fmms You can rename it yourself since you opened the issue ;-) The issue title should have an Edit button next to it. Just rename it to 'PDF.js slow at rendering complex image' or something like that.

@fmms
Copy link
Author

fmms commented Dec 14, 2013

Still there with todays version:

"Warning: Unable to read document outline" pdf.js:198
 "PDF c79c336f9cc7a242a4cd36f2317f7b59 [1.4 PDFsharp 1.31.1789-g (www.pdfsharp.com) / PDFsharp 1.31.1789-g (www.pdfsharp.com)] (PDF.js: 0.8.773)"

@nnethercote
Copy link
Contributor

This document is notable for having millions of inline images, which are used to build up the textured parts of the picture.

@nnethercote
Copy link
Contributor

And disabling QueueOptimizer slows it down a lot, because the paintImageMaskXObjectGroup optimization can't be performed.

nnethercote added a commit to nnethercote/pdf.js that referenced this issue Aug 14, 2014
makeInlineImage() has a "are the next five chars ASCII?" check which is
run after an "EI" sequence has been found. This check involves the
creation of a new object because peekBytes() calls subarray().

Unfortunately, the check is currently run on whitespace chars even when
an "EI" sequence has not yet been found, i.e. when it's not needed. For
the PDF in mozilla#2618, there are over 820,000 such checks.

This change reworks the relevant loop so that the check is only done
once an "EI" sequence has been seen. This reduces the number of checks
to 157,000, and speeds up rendering by somewhere between 2% and 7% (the
measurements are noisy).
nnethercote added a commit to nnethercote/pdf.js that referenced this issue Aug 14, 2014
makeInlineImage() has a "are the next five chars ASCII?" check which is
run after an "EI" sequence has been found. This check involves the
creation of a new object because peekBytes() calls subarray().

Unfortunately, the check is currently run on whitespace chars even when
an "EI" sequence has not yet been found, i.e. when it's not needed. For
the PDF in mozilla#2618, there are over 820,000 such checks.

This change reworks the relevant loop so that the check is only done
once an "EI" sequence has been seen. This reduces the number of checks
to 157,000, and speeds up rendering by somewhere between 2% and 7% (the
measurements are noisy).
CodingFabian added a commit to CodingFabian/pdf.js that referenced this issue Oct 26, 2014
As described in mozilla#5444, the evaluator will perform identity checking of
paintImageMaskXObjects to decide if it can use
paintImageMaskXObjectRepeat instead of paintImageMaskXObjectGroup.

This can only ever work if the entry is a cache hit. However the
previous caching implementation was doing a lazy caching, which would
only consider a image cache worthy if it is repeated.
Only then the repeated instance would be cached.
As a result of this the sequence of identical images A B C D would be
seen as A B B B by the evaluator, which prevents using the "repeat"
optimization.

Also the previous cache implementation was only checking the last used
image.
Thus the sequence A1 B1 A2 B2 A3 B3 would be 6 instances of images, even
when there are only two different ones.

The new implementation drops the "lazy" init of the cache. The threshold
for enabling an image to be cached is rather small, so the potential waste
in storage and adler32 calculation is rather low.
Also this implementation will now keep hold of any cachable images.

The two examples from above would now be A A A A and A1 B1 A1 B1 A1 B1,
which not only saves temporary storage, but also prevents computing
identical masks over and over again (which is the main performance impact
of mozilla#2618)
CodingFabian added a commit to CodingFabian/pdf.js that referenced this issue Oct 26, 2014
As described in mozilla#5444, the evaluator will perform identity checking of
paintImageMaskXObjects to decide if it can use
paintImageMaskXObjectRepeat instead of paintImageMaskXObjectGroup.

This can only ever work if the entry is a cache hit. However the
previous caching implementation was doing a lazy caching, which would
only consider a image cache worthy if it is repeated.
Only then the repeated instance would be cached.
As a result of this the sequence of identical images A B C D would be
seen as A B B B by the evaluator, which prevents using the "repeat"
optimization.

The new implementation drops the "lazy" init of the cache. The threshold
for enabling an image to be cached is rather small, so the potential waste
in storage and adler32 calculation is rather low.

The two examples from above would now be A A A A
which not only saves temporary storage, but also prevents computing
identical masks over and over again (which is the main performance impact
of mozilla#2618)
CodingFabian added a commit to CodingFabian/pdf.js that referenced this issue Oct 27, 2014
As described in mozilla#5444, the evaluator will perform identity checking of
paintImageMaskXObjects to decide if it can use
paintImageMaskXObjectRepeat instead of paintImageMaskXObjectGroup.

This can only ever work if the entry is a cache hit. However the
previous caching implementation was doing a lazy caching, which would
only consider a image cache worthy if it is repeated.
Only then the repeated instance would be cached.
As a result of this the sequence of identical images A B C D would be
seen as A B B B by the evaluator, which prevents using the "repeat"
optimization.

The new implementation drops the "lazy" init of the cache. The threshold
for enabling an image to be cached is rather small, so the potential waste
in storage and adler32 calculation is rather low.

The example from above would now be A A A A
which not only saves temporary storage, but also prevents computing
identical masks over and over again (which is the main performance impact
of mozilla#2618)
CodingFabian added a commit to CodingFabian/pdf.js that referenced this issue Oct 27, 2014
As described in mozilla#5444, the evaluator will perform identity checking of
paintImageMaskXObjects to decide if it can use
paintImageMaskXObjectRepeat instead of paintImageMaskXObjectGroup.

This can only ever work if the entry is a cache hit. However the
previous caching implementation was doing a lazy caching, which would
only consider a image cache worthy if it is repeated.
Only then the repeated instance would be cached.
As a result of this the sequence of identical images A1 A2 A3 A4 would
be seen as A1 A2 A2 A2 by the evaluator, which prevents using the
"repeat" optimization. Also only the last encountered image is cached,
so A1 B1 A2 B2, would stay A1 B1 A2 B2.

The new implementation drops the "lazy" init of the cache. The threshold
for enabling an image to be cached is rather small, so the potential waste
in storage and adler32 calculation is rather low. It also caches any
eligible image by its adler32.

The two example from above would now be A1 A1 A1 A1 and A1 B1 A1 B1
which not only saves temporary storage, but also prevents computing
identical masks over and over again (which is the main performance impact
of mozilla#2618)
CodingFabian added a commit to CodingFabian/pdf.js that referenced this issue Oct 28, 2014
As described in mozilla#5444, the evaluator will perform identity checking of
paintImageMaskXObjects to decide if it can use
paintImageMaskXObjectRepeat instead of paintImageMaskXObjectGroup.

This can only ever work if the entry is a cache hit. However the
previous caching implementation was doing a lazy caching, which would
only consider a image cache worthy if it is repeated.
Only then the repeated instance would be cached.
As a result of this the sequence of identical images A1 A2 A3 A4 would
be seen as A1 A2 A2 A2 by the evaluator, which prevents using the
"repeat" optimization. Also only the last encountered image is cached,
so A1 B1 A2 B2, would stay A1 B1 A2 B2.

The new implementation drops the "lazy" init of the cache. The threshold
for enabling an image to be cached is rather small, so the potential waste
in storage and adler32 calculation is rather low. It also caches any
eligible image by its adler32.

The two example from above would now be A1 A1 A1 A1 and A1 B1 A1 B1
which not only saves temporary storage, but also prevents computing
identical masks over and over again (which is the main performance impact
of mozilla#2618)
@CodingFabian
Copy link
Contributor

#5445 has landed, i would go so far that the rendering is now acceptable. what do you guys think?

@yurydelendik
Copy link
Contributor

I think it is good. Other viewers (Preview and Chrome) have performance issues with that as well -- try zooming. Closing as resolved

@nnethercote
Copy link
Contributor

Nice work, Fabian.

Snuffleupagus added a commit to Snuffleupagus/pdf.js that referenced this issue May 12, 2021
…andard classes

This patch was tested using the PDF file from [issue 2618](mozilla#2618), i.e. https://bug570667.bugzilla-attachments.gnome.org/attachment.cgi?id=226471, with the following manifest file:
```
[
    {  "id": "issue2618",
       "file": "../web/pdfs/issue2618.pdf",
       "md5": "",
       "rounds": 50,
       "type": "eq"
    }
]
```

which gave the following results when comparing this patch against the `master` branch:
```
-- Grouped By browser, stat --
browser | stat         | Count | Baseline(ms) | Current(ms) | +/- |   %  | Result(P<.05)
------- | ------------ | ----- | ------------ | ----------- | --- | ---- | -------------
firefox | Overall      |    50 |         3417 |        3426 |   9 | 0.27 |
firefox | Page Request |    50 |            1 |           1 |   0 | 5.41 |
firefox | Rendering    |    50 |         3416 |        3426 |   9 | 0.27 |
```

Based on these results, there's not significant performance regression from using standard classes and this patch should thus be OK.
Snuffleupagus added a commit to Snuffleupagus/pdf.js that referenced this issue Nov 9, 2022
1799927)

*Please note:* This is a tentative patch, which only fixes the "wrong letter" part of bug 1799927.

It appears that the simple `computeAdler32` function, used when caching inline images, generates hash collisions for some (very short) TypedArrays. In this case that leads to some of the "letters", which are actually inline images, being rendered incorrectly.
To avoid that we replace it with the `MurmurHash3_64` class instead, which is already used in various other parts of the code-base. The one disadvantage of doing this is that it's slightly slower, which in some cases will lead to a performance regression.[1] However I believe that we'll have to accept a smaller regression here, since the alternative is much worse (i.e. broken rendering).

One small benefit of these changes is that we can avoid creating lots of `Stream`-instances for already cached inline images.

---
[1] Doing some quick benchmarking in the viewer, using `#pdfBug=Stats`, with the PDF document from issue mozilla#2618 shows at least a 10 percent regression.
Snuffleupagus added a commit to Snuffleupagus/pdf.js that referenced this issue Nov 9, 2022
1799927)

*Please note:* This is a tentative patch, which only fixes the "wrong letter" part of bug 1799927.

It appears that the simple `computeAdler32` function, used when caching inline images, generates hash collisions for some (very short) TypedArrays. In this case that leads to some of the "letters", which are actually inline images, being rendered incorrectly.
To avoid that we replace it with the `MurmurHash3_64` class instead, which is already used in other parts of the code-base. The one disadvantage of doing this is that it's slightly slower, which in some cases will lead to a performance regression.[1] However I believe that we'll have to accept a smaller regression here, since the alternative is much worse (i.e. broken rendering).

One small benefit of these changes is that we can avoid creating lots of `Stream`-instances for already cached inline images.

---
[1] Doing some quick benchmarking in the viewer, using `#pdfBug=Stats`, with the PDF document from issue mozilla#2618 shows at least a 10 percent regression.
# for free to join this conversation on GitHub. Already have an account? # to comment
Projects
None yet
Development

No branches or pull requests

7 participants