Description
warcio recompress will silently add a WARC-Payload-Digest field to records that don't already have a payload digest field. This appears to happen only if the record already has a WARC-Block-Digest field.
In my testing, I've seen this happen to both "metadata" records and non-HTTP "request" records. This is strange, since warcio doesn't know what subset of these records' content block constitutes a "payload", so how could it calculate a digest? The created Payload-Digest appears to just be a hash of the entire block. (This issue is in many ways an inverse of #156.)
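To check the "hash of the entire block" observation on a given record, something like the following minimal sketch could be used. It assumes the digest headers use the common `sha1:<base32>` form; the function names and the sample record are hypothetical and are not taken from warcio or from the attached files.

```python
import base64
import hashlib


def warc_sha1_digest(data: bytes) -> str:
    """Return a digest in the 'sha1:<base32>' form (assumed digest format)."""
    return "sha1:" + base64.b32encode(hashlib.sha1(data).digest()).decode("ascii")


def payload_digest_is_block_hash(headers: dict, block: bytes) -> bool:
    """True if WARC-Payload-Digest is present and equals the hash of the whole block."""
    payload_digest = headers.get("WARC-Payload-Digest")
    if payload_digest is None:
        return False
    return payload_digest == warc_sha1_digest(block)


# Illustrative record only: a non-HTTP request whose "payload" digest
# is simply the digest of the full content block.
block = b"gemini://example.org/\r\n"
headers = {
    "WARC-Type": "request",
    "WARC-Block-Digest": warc_sha1_digest(block),
    "WARC-Payload-Digest": warc_sha1_digest(block),
}
print(payload_digest_is_block_hash(headers, block))  # True
```

If this returns True for a record that has no defined payload, the payload digest carries no information beyond the block digest.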
Attached is a ZIP file (example-warcs.zip) with orig.warc and warcio-recompress.warc. The latter was created by:
warcio recompress orig.warc warcio-recompress.warc.gz
gunzip warcio-recompress.warc.gz
The "metadata" record in orig.warc contains an X.509 certificate and uses a Content-Type field of application/x-pem-file. The original has a block digest field and no payload digest, since this metadata does not have a meaningful payload beyond the block digest. However, if you look at warcio-recompress.warc you will see that a WARC-Payload-Digest header has been added to the "metadata" record at the end. Additionally, the "request" record is for the Gemini protocol, and is not HTTP. Gemini requests do not have a meaningful payload, so the request record in orig.warc does not have a WARC-Payload-Digest field. However, warcio-recompress.warc shows one has been added.
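The comparison described above can be automated with a small stdlib-only sketch like the one below. It is not warcio's parser: `read_records`, `added_payload_digests`, and `make_record` are hypothetical helpers. It assumes uncompressed WARC input, no folded header lines, and that WARC-Record-ID values are unchanged between the two files.

```python
import io


def read_records(stream):
    """Yield (headers, block) per record from an uncompressed WARC stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if not line.startswith(b"WARC/"):
            continue  # skip the blank lines that separate records
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # blank line ends the header section
            name, _, value = line.decode("utf-8").partition(":")
            headers[name.strip().lower()] = value.strip()
        # Read exactly Content-Length bytes so block contents are never parsed.
        block = stream.read(int(headers["content-length"]))
        yield headers, block


def added_payload_digests(original, recompressed):
    """Return IDs of records that gained a WARC-Payload-Digest."""
    had_digest = {
        h.get("warc-record-id"): "warc-payload-digest" in h
        for h, _ in read_records(original)
    }
    return [
        h.get("warc-record-id")
        for h, _ in read_records(recompressed)
        if "warc-payload-digest" in h
        and had_digest.get(h.get("warc-record-id")) is False
    ]


def make_record(rec_id, rec_type, block, extra=()):
    """Build a tiny synthetic record for the demo (not real capture data)."""
    lines = [
        "WARC/1.1",
        f"WARC-Type: {rec_type}",
        f"WARC-Record-ID: {rec_id}",
        *extra,
        f"Content-Length: {len(block)}",
    ]
    return "\r\n".join(lines).encode("utf-8") + b"\r\n\r\n" + block + b"\r\n\r\n"


# Synthetic stand-ins for the two files, using a made-up digest value.
orig = make_record("<urn:uuid:1>", "metadata", b"cert")
recompressed = make_record(
    "<urn:uuid:1>", "metadata", b"cert",
    extra=["WARC-Payload-Digest: sha1:EXAMPLEONLY"],
)
print(added_payload_digests(io.BytesIO(orig), io.BytesIO(recompressed)))
```

To run it against the attached files, open orig.warc and warcio-recompress.warc in binary mode and pass the two file objects in place of the `io.BytesIO` streams.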
While similar to #161, I believe this is higher severity. Payload digests have meaning and are used by other toolchains, such as CDX indexes. However, warcio is adding payload digests to records that don't have them, without any concept of what the payload is or what it means for these records. This is in addition to the strangeness documented in #161, such as:
- I would not expect a recompression operation to alter the records in the WARC.
- This behavior isn't documented
- It (very slightly) increases the size of the WARC
My suggestion would be that warcio recompress should not alter the records of the WARC it is operating on.