
warcio recompress adds "WARC-Payload-Digest" to records without understanding them #162

@acidus99


warcio recompress will silently add a WARC-Payload-Digest field to records that don't already have a payload digest field. This appears to only happen if the record already has a WARC-Block-Digest field.

In my testing, I've seen this happen to both "metadata" records and non-HTTP "request" records. This is strange: warcio doesn't know what subset of these records' content blocks constitutes a "payload", so how could it calculate a payload digest? The added WARC-Payload-Digest appears to just be a hash of the entire block. (This issue is in many ways the inverse of #156.)
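One way to check the "hash of the entire block" observation is to recompute the digest of a record's full content block and compare it with the added header value. The following is a minimal sketch, assuming the digests use the `sha1:` label with Base32 encoding; that format is an assumption about the files in question, and the function names are hypothetical helpers, not part of warcio.

```python
import base64
import hashlib


def warc_sha1_digest(data: bytes) -> str:
    """Return a digest in the 'sha1:<base32>' form (assumed format)."""
    return "sha1:" + base64.b32encode(hashlib.sha1(data).digest()).decode("ascii")


def payload_digest_is_block_hash(block: bytes, payload_digest: str) -> bool:
    """True if the recorded payload digest equals the hash of the whole block."""
    return payload_digest == warc_sha1_digest(block)


# Hypothetical Gemini request block: a URL followed by CRLF, with no payload part.
block = b"gemini://example.org/\r\n"
print(payload_digest_is_block_hash(block, warc_sha1_digest(block)))
```

If this returns True for the value found in the recompressed file, the added payload digest carries no information beyond WARC-Block-Digest.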

Attached is a ZIP file (example-warcs.zip) containing orig.warc and warcio-recompress.warc. The latter was created by:

warcio recompress orig.warc warcio-recompress.warc.gz
gunzip warcio-recompress.warc.gz

The "metadata" record in orig.warc contains an X.509 certificate and uses a Content-Type field of application/x-pem-file. The original has a block digest field and no payload digest, since this metadata has no meaningful payload distinct from the block itself. However, if you look at warcio-recompress.warc, you will see that a WARC-Payload-Digest header has been added to the "metadata" record at the end. Additionally, the "request" record is for the Gemini protocol, not HTTP. Gemini requests do not have a meaningful payload, so the request record in orig.warc does not have a WARC-Payload-Digest field. However, warcio-recompress.warc shows that one has been added.
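To compare the two files without reading them by eye, the headers of each record can be listed along with whether a block digest and a payload digest are present. This is a rough stdlib-only sketch for uncompressed WARC files; it ignores folded (continuation) header lines and does not interpret the content block. `iter_warc_headers` and `digest_summary` are hypothetical names.

```python
import io
from typing import BinaryIO, Dict, Iterator, List, Optional, Tuple


def iter_warc_headers(stream: BinaryIO) -> Iterator[Dict[str, str]]:
    """Yield each record's header fields, with field names lower-cased."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of file
        if not line.strip():
            continue  # blank separator lines between records
        if not line.startswith(b"WARC/"):
            raise ValueError(f"expected a WARC version line, got {line!r}")
        headers: Dict[str, str] = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # end of the header section
            name, _, value = line.decode("utf-8").partition(":")
            headers[name.strip().lower()] = value.strip()
        # Skip the content block using its declared length.
        stream.read(int(headers["content-length"]))
        yield headers


def digest_summary(stream: BinaryIO) -> List[Tuple[Optional[str], bool, bool]]:
    """Return (WARC-Type, has block digest, has payload digest) per record."""
    return [
        (h.get("warc-type"), "warc-block-digest" in h, "warc-payload-digest" in h)
        for h in iter_warc_headers(stream)
    ]
```

Running `digest_summary(open("orig.warc", "rb"))` and the same call on the recompressed file, then comparing the two lists, would show which records gained a payload digest.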

While similar to #161, I believe this issue is higher severity. Payload digests have meaning and are used by other toolchains, such as CDX indexes. However, warcio is adding payload digests to records that don't have them, without any concept of what the payload is or what it means for these records. This is in addition to the strangeness documented in #161, such as:

  • I would not expect a recompression operation to alter the records in the WARC.
  • This behavior isn't documented
  • It (very slightly) increases the size of the WARC

My suggestion would be that warcio recompress should not alter the records of the WARC it is operating on.
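To illustrate what a non-altering recompression could look like, here is a minimal sketch that copies each record's bytes verbatim into its own gzip member. It is not how warcio is implemented; it assumes a well-formed, uncompressed WARC with CRLF line endings and a Content-Length field on every record, and both function names are hypothetical.

```python
import gzip
from typing import Iterator


def iter_raw_records(data: bytes) -> Iterator[bytes]:
    """Yield the exact bytes of each record, including its trailing CRLF CRLF."""
    pos = 0
    while pos < len(data):
        # The header section ends at the first empty line.
        header_end = data.index(b"\r\n\r\n", pos) + 4
        length = None
        for line in data[pos:header_end].split(b"\r\n"):
            name, _, value = line.partition(b":")
            if name.strip().lower() == b"content-length":
                length = int(value.strip())
        if length is None:
            raise ValueError("record has no Content-Length field")
        end = header_end + length + 4  # block plus the two CRLFs that close a record
        yield data[pos:end]
        pos = end


def recompress_verbatim(data: bytes) -> bytes:
    """Gzip each record as a separate member without touching its bytes."""
    return b"".join(gzip.compress(record) for record in iter_raw_records(data))
```

Because no header is parsed beyond Content-Length and nothing is rewritten, decompressing the output yields the input byte for byte, so no digest fields can appear or change.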
