
PDF Read/Merge generates broken documents since update 2.10.6 #1344

Closed
@Merinorus

Description

Hello,

Since update 2.10.6, some PDF documents are not merged correctly. Same with version 2.10.7.
Previous versions (2.10.5 and below) behave correctly.

Environment

Which environment were you using when you encountered the problem?

$ python -m platform
Linux-5.10.102.1-microsoft-standard-WSL2-x86_64-with-Ubuntu-20.04-focal
Linux-5.10.102.1-microsoft-standard-WSL2-x86_64-with-debian-11.2

$ python3 -c "import PyPDF2;print(PyPDF2.__version__)"
2.10.6

Code + PDF

This is a minimal, complete example that shows the issue:

# requirements.txt
diff-pdf-visually==1.7.0
PyPDF2==2.10.6
pytest==7.1.3

# test_same_pdf.py
import io
import shutil
import tempfile
from PyPDF2 import PdfMerger, PdfReader
from diff_pdf_visually import pdf_similar
import os
import pytest
import logging


logger = logging.getLogger(__name__)

FILE_INPUT_URI = "arret_maladie.pdf"
FILE_OUTPUT_URI = "output.pdf"


def file_as_bytesio(filepath: str):
    """Open a file as BytesIO, read only."""
    with open(filepath, "rb") as f:
        return io.BytesIO(f.read())


def test_pdf_merger():

    # Open document and merge it into a temporary file
    merger = PdfMerger()

    merger.append(PdfReader(file_as_bytesio(FILE_INPUT_URI)))

    # Write the final merged document
    temp_file = tempfile.NamedTemporaryFile()
    merger.write(temp_file.name)

    temp_file_path = temp_file.name

    # Compare VISUALLY the content of the newly generated file with the expected content
    if not pdf_similar(temp_file_path, FILE_INPUT_URI):
        # If files don't match visually, the test fails.
        # Copy the newly generated file to the current directory to manually check what is wrong
        current_dir = os.path.dirname(__file__)
        new_file_path = os.path.join(current_dir, FILE_OUTPUT_URI)
        shutil.copy2(temp_file.name, new_file_path)
        logger.error("The newly merged file does not match with the intput file.")
        # Fail the test on purpose
        pytest.fail("Merged PDF differs visually from the input PDF.")

Here is the PDF that caused the issue:
input.pdf

Here is the output (simple PdfReader -> PdfMerger):
output.pdf

Traceback

This is the complete Traceback I see:

Converting each page of the PDFs to an image...
  PDFs have same number of pages. Checking each pair of converted images...
Min sig = 13.4442, significant?=True. The PDFs are different. The most different pages are: page 1 (sgf. 13.4442), page 2 (sgf. 13.6684), page 3 (sgf. 13.9239).
ERROR    test_same_pdf:test_same_pdf.py:44 The newly merged file does not match with the intput file.
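
To compare how much each page differs across PyPDF2 versions, the per-page significance values can be pulled out of the log line above. This is only a minimal sketch: it assumes the log keeps the exact "page N (sgf. X)" format shown here, and `parse_page_significance` is a hypothetical helper name, not part of diff-pdf-visually.

```python
import re
from typing import Dict

# Matches fragments such as "page 1 (sgf. 13.4442)" as they appear in the log above.
# Assumption: the log format is exactly the one shown in this issue.
PAGE_SGF = re.compile(r"page (\d+) \(sgf\. ([0-9.]+)\)")


def parse_page_significance(log_line: str) -> Dict[int, float]:
    """Return {page_number: significance} for every page listed in the log line."""
    return {int(page): float(sgf) for page, sgf in PAGE_SGF.findall(log_line)}


log_line = (
    "Min sig = 13.4442, significant?=True. The PDFs are different. "
    "The most different pages are: page 1 (sgf. 13.4442), "
    "page 2 (sgf. 13.6684), page 3 (sgf. 13.9239)."
)
print(parse_page_significance(log_line))
```

Running this against the output of 2.10.5 and 2.10.6 would give one dictionary per version to compare.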

Thank you for taking the time to investigate.
The document is an official French form. I assume it is fine to use it in automated tests, but I am not sure.
