ROB: Rebuild xref table if one entry is invalid #2528

Merged
merged 10 commits into py-pdf:main on Mar 24, 2024

Conversation

pubpub-zz
Collaborator

@pubpub-zz pubpub-zz commented Mar 17, 2024

closes #2516

Cope with cases where the xref entries do not point to valid object headers.

fixes py-pdf#2523
Situations encountered:
* the length field is not correct
* the xref may contain stream data that is not in order
* the xref contains some free entries (i.e. entries that do not contain a stream offset)
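
To illustrate the general idea of rebuilding an xref table when an entry turns out to be invalid, here is a minimal standalone sketch. It is not the code from this PR: `rebuild_xref` and `OBJ_HEADER` are hypothetical names, and the sketch assumes that scanning the raw bytes for `N G obj` headers is acceptable, even though such a scan can also hit look-alike text inside stream data.

```python
import re
from typing import Dict

# Hypothetical pattern: "<idnum> <generation> obj" at the start of the data
# or right after whitespace. Not taken from pypdf.
OBJ_HEADER = re.compile(rb"(?<!\S)(\d+)\s+(\d+)\s+obj\b")


def rebuild_xref(buf: bytes) -> Dict[int, Dict[int, int]]:
    """Map generation -> idnum -> byte offset of the object header.

    Assumption: when the same object appears several times (incremental
    updates), the last occurrence in the file is the one to keep.
    """
    xref: Dict[int, Dict[int, int]] = {}
    for m in OBJ_HEADER.finditer(buf):
        idnum, generation = int(m.group(1)), int(m.group(2))
        # m.start(1) is the offset of the first digit of the object number
        xref.setdefault(generation, {})[idnum] = m.start(1)
    return xref


if __name__ == "__main__":
    data = (
        b"%PDF-1.4\n"
        b"1 0 obj\n<< /Type /Catalog >>\nendobj\n"
        b"2 0 obj\n<< >>\nendobj\n"
    )
    table = rebuild_xref(data)
    for idnum, offset in sorted(table[0].items()):
        print(idnum, offset, data[offset:offset + 7])
```

The result uses the same `generation -> idnum -> offset` nesting as the `self.xref[generation][idnum]` access quoted later in this conversation, which is the only layout detail taken from the source.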

codecov bot commented Mar 17, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 94.48%. Comparing base (c4641d1) to head (553165c).

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2528      +/-   ##
==========================================
- Coverage   94.52%   94.48%   -0.04%     
==========================================
  Files          49       49              
  Lines        8178     8181       +3     
  Branches     1659     1660       +1     
==========================================
  Hits         7730     7730              
- Misses        277      280       +3     
  Partials      171      171              


@stefan6419846
Collaborator

I am going to wait with the review of these changes until #2526 is merged as #2528 already incorporates the changes of #2526.

@pubpub-zz
Collaborator Author

Test file:
iss2516.pdf

@stefan6419846 stefan6419846 changed the title ROB: rebuild xref table if one entry is invalid ROB: Rebuild xref table if one entry is invalid Mar 18, 2024
@stefan6419846
Collaborator

It seems like this small change has quite some impact on the coverage as

pypdf/pypdf/_reader.py

Lines 1277 to 1301 in c4641d1

    except Exception:
        if hasattr(self.stream, "getbuffer"):
            buf = bytes(self.stream.getbuffer())
        else:
            p = self.stream.tell()
            self.stream.seek(0, 0)
            buf = self.stream.read(-1)
            self.stream.seek(p, 0)
        m = re.search(
            rf"\s{indirect_reference.idnum}\s+{indirect_reference.generation}\s+obj".encode(),
            buf,
        )
        if m is not None:
            logger_warning(
                f"Object ID {indirect_reference.idnum},{indirect_reference.generation} ref repaired",
                __name__,
            )
            self.xref[indirect_reference.generation][
                indirect_reference.idnum
            ] = (m.start(0) + 1)
            self.stream.seek(m.start(0) + 1)
            idnum, generation = self.read_object_header(self.stream)
        else:
            idnum = -1  # exception will be raised below
    if idnum != indirect_reference.idnum and self.xref_index:
is not being covered by the tests any more. Is there something we can do about this without ignoring it or excluding the error handling from the coverage?
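
For readers following the coverage question, the fallback quoted above can be reproduced outside pypdf in a few lines. This is only a sketch under stated assumptions: `find_object_offset` is a hypothetical helper, not a pypdf function, and it mirrors just the search step (read the whole stream, look for the object header, skip the leading whitespace matched by `\s`).

```python
import io
import re
from typing import BinaryIO, Optional


def find_object_offset(stream: BinaryIO, idnum: int, generation: int) -> Optional[int]:
    """Return the byte offset of '<idnum> <generation> obj', or None if absent."""
    if hasattr(stream, "getbuffer"):
        # In-memory streams such as io.BytesIO expose their content directly
        buf = bytes(stream.getbuffer())
    else:
        # Read everything, then restore the caller's position
        pos = stream.tell()
        stream.seek(0, 0)
        buf = stream.read(-1)
        stream.seek(pos, 0)
    m = re.search(rf"\s{idnum}\s+{generation}\s+obj".encode(), buf)
    # The match starts on the whitespace before the number, hence the + 1
    return None if m is None else m.start(0) + 1


if __name__ == "__main__":
    data = b"%PDF-1.4\n1 0 obj\n<<>>\nendobj\n7 0 obj\n<<>>\nendobj\n"
    offset = find_object_offset(io.BytesIO(data), 7, 0)
    assert offset is not None and data[offset:offset + 7] == b"7 0 obj"
    assert find_object_offset(io.BytesIO(data), 3, 0) is None
```

A test along these lines exercises both outcomes of the search (header found, header missing), which correspond to the `if m is not None` and `else` branches in the quoted block. Whether an equivalent test fits the pypdf test suite is left open here.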

@pubpub-zz
Collaborator Author

This is the best I can propose

@stefan6419846 stefan6419846 merged commit f8edf3c into py-pdf:main Mar 24, 2024
14 of 15 checks passed
stefan6419846 added a commit that referenced this pull request Apr 7, 2024
REL: 4.2.0

## What's new

### New Features (ENH)
- Allow multiple charsets for NameObject.read_from_stream (#2585) by @pubpub-zz
- Add support for /Kids in page labels (#2562) by @stefan6419846
- Allow to update fields on many pages (#2571) by @pubpub-zz
- Tolerate PDF with invalid xref pointed objects (#2335) by @pubpub-zz
- Add Enforce from PDF2.0 in viewer_preferences (#2511) by @pubpub-zz
- Add += and -= operators to ArrayObject (#2510) by @pubpub-zz

### Bug Fixes (BUG)
- Fix merge_page sometimes generating unknown operator 'QQ' (#2588) by @rfotino
- Fix fields update where annotations are kids of field (#2570) by @pubpub-zz
- Process CMYK images without a filter correctly (#2557) by @pubpub-zz
- Extract text in layout mode without finding resources (#2555) by @pubpub-zz
- Prevent recursive loop in some PDF files (#2505) by @pubpub-zz

### Robustness (ROB)
- Tolerate "truncated" xref (#2580) by @pubpub-zz
- Replace error by warning for EOD in RunLengthDecode/ASCIIHexDecode (#2334) by @pubpub-zz
- Rebuild xref table if one entry is invalid (#2528) by @pubpub-zz
- Robustify stream extraction (#2526) by @pubpub-zz

### Documentation (DOC)
- Update release process for latest changes (#2564) by @stefan6419846
- Encryption/decryption: Clone document instead of copying all pages (#2546) by @redfast00
- Minor improvements (#2542) by @j-t-1
- Update annotation list (#2534) by @j-t-1
- Update references and formatting (#2529) by @j-t-1
- Correct threads reference, plus minor changes (#2521) by @j-t-1
- Minor readability increases (#2515) by @j-t-1
- Simplify PaperSize examples (#2504) by @j-t-1
- Minor improvements (#2501) by @j-t-1

### Developer Experience (DEV)
- Remove unused dependencies (#2572) by @stefan6419846
- Remove page labels PR link from message (#2561) by @stefan6419846
- Fix changelog generator regarding whitespace and handling of "Other" group (#2492) by @stefan6419846
- Add REL to known PR prefixes (#2554) by @stefan6419846
- Release using the REL commit instead of git tag (#2500) by @MartinThoma
- Unify code between PdfReader and PdfWriter (#2497) by @pubpub-zz
- Bump softprops/action-gh-release from 1 to 2 (#2514) by @dependabot[bot]

### Maintenance (MAINT)
- Ressources → Resources (and internal name childs) (#2550) by @pubpub-zz
- Fix typos found by codespell (#2549) by @stefan6419846
- Update Read the Docs configuration (#2538) by @j-t-1
- Add root_object, _info and _ID to PdfReader (#2495) by @pubpub-zz

### Testing (TST)
- Allow loading truncated images if required (#2586) by @stefan6419846
- Fix download issues from #2562 (#2578) by @pubpub-zz
- Improve test_get_contents_from_nullobject to show real use-case (#2524) by @stefan6419846
- Add missing test annotations (#2507) by @stefan6419846

[Full Changelog](4.1.0...4.2.0)
Development

Successfully merging this pull request may close these issues.

"/Pages" might be undefined
2 participants