cache collision #78

Open
patxoca opened this issue Feb 20, 2020 · 1 comment

patxoca commented Feb 20, 2020

Scraping two different PDFs yields the exact same results when using the FileCache.

The problem is that set_hash_key() always computes the same key: by the time the file is hashed, its position is already at the end, so the md5 is computed over empty input (md5("") == "d41d8cd98f00b204e9800998ecf8427e"). As a result, pdfquery ends up using the same cached data for both PDFs.

Adding file.seek(0) before computing the md5 seems to solve the issue.
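
For illustration, here is a minimal sketch of the effect using only the standard library. It does not touch pdfquery; md5_of_stream and the two in-memory "PDFs" are hypothetical stand-ins, assuming the cache hashes whatever is left to read from the file object.

```python
import hashlib
import io


def md5_of_stream(stream, rewind=False):
    """Return the md5 hex digest of whatever remains to be read from a binary stream."""
    if rewind:
        stream.seek(0)  # go back to the start so the whole content is hashed
    return hashlib.md5(stream.read()).hexdigest()


# Two different in-memory files standing in for two PDFs (made-up content).
pdf_a = io.BytesIO(b"%PDF-1.4 first document")
pdf_b = io.BytesIO(b"%PDF-1.4 second document")

# Simulate an earlier full read that leaves both positions at the end.
pdf_a.read()
pdf_b.read()

# Without rewinding, both digests are computed over empty input.
stale_a = md5_of_stream(pdf_a)
stale_b = md5_of_stream(pdf_b)

# With a rewind, each digest reflects the actual content.
fresh_a = md5_of_stream(pdf_a, rewind=True)
fresh_b = md5_of_stream(pdf_b, rewind=True)

print(stale_a == stale_b)
print(fresh_a == fresh_b)
```

Without the rewind both files hash to the md5 of the empty string, which is why they would share a cache entry; with the rewind the digests differ.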

patxoca commented Mar 11, 2020

As a temporary workaround until the issue is fixed, define a custom cache class that rewinds the file before hashing:

from pdfquery.cache import FileCache as _FileCache

class FileCache(_FileCache):

    def set_hash_key(self, file):
        # Rewind so the md5 covers the whole file, not the empty tail.
        file.seek(0)
        return super().set_hash_key(file)
