Allow direct indexing of WACZ files #113

Open
anjackson opened this issue Sep 5, 2023 · 1 comment

Comments

@anjackson
Contributor

Might make more sense to integrate it into py-wacz, which has cdxj-indexer as a dependency.

e.g. follow how py-wacz validation works to go through the indexes (https://specs.webrecorder.net/wacz/1.1.1/#indexes) and grab the zip offsets from the file to work out the whole-file offsets (https://stackoverflow.com/questions/44799018/how-to-get-offset-values-of-all-files-or-given-filename-in-a-zipfile-using-pyt).
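
A rough sketch of the offset calculation, using only the Python standard library. This is not how py-wacz does it; the function names `member_data_offset` and `warc_data_offsets` are made up for illustration, and it assumes the WARCs live under `archive/` and are stored in the ZIP without compression, so that a record offset from the CDXJ index can simply be added to the member's data offset.

```python
import struct
import zipfile

LOCAL_HEADER_SIZE = 30  # fixed-size part of a ZIP local file header


def member_data_offset(fileobj, info):
    """Return the absolute byte offset where a ZIP member's data begins.

    ZipInfo.header_offset points at the local file header; the data follows
    the fixed 30-byte header plus the variable-length name and extra fields.
    """
    fileobj.seek(info.header_offset)
    header = fileobj.read(LOCAL_HEADER_SIZE)
    if header[:4] != b"PK\x03\x04":
        raise ValueError(f"No local file header found for {info.filename}")
    # File name length and extra field length sit at bytes 26-29.
    name_len, extra_len = struct.unpack("<HH", header[26:30])
    return info.header_offset + LOCAL_HEADER_SIZE + name_len + extra_len


def warc_data_offsets(fileobj):
    """Map each stored (uncompressed) archive/ member to its data offset."""
    with zipfile.ZipFile(fileobj) as zf:
        infos = zf.infolist()
    offsets = {}
    for info in infos:
        # Assumption: only STORED members can be addressed by plain offsets.
        if info.filename.startswith("archive/") and info.compress_type == zipfile.ZIP_STORED:
            offsets[info.filename] = member_data_offset(fileobj, info)
    return offsets
```

With that mapping, the whole-file offset of a record would be `warc_data_offsets(f)[name] + record_offset`, where `record_offset` is the offset given in the CDXJ line for that WARC.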

Unit tests can go like: https://github.com/webrecorder/py-wacz/blob/47b3eefbaa8f70d839a048cc3d36d7014de06c2c/tests/test_validate_wacz.py

Validation of the approach should include indexing POST requests in OutbackCDX, see nla/outbackcdx#106 (comment)

@anjackson
Contributor Author

Basic initial implementation now at webrecorder/py-wacz#38
