
zstd failure for unspecified buffer size #424


Open
martindurant opened this issue Mar 17, 2023 · 8 comments

@martindurant
Member

Reported in kerchunk: fsspec/kerchunk#317 (comment) by @cgohlke

Minimal, reproducible code sample, a copy-pastable example if possible

>>> import numcodecs, cramjam
>>> numcodecs.Zstd().decode(cramjam.zstd.compress(bytes(1)))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "numcodecs/zstd.pyx", line 219, in numcodecs.zstd.Zstd.decode
  File "numcodecs/zstd.pyx", line 153, in numcodecs.zstd.decompress
RuntimeError: Zstd decompression error: invalid input data
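As a quick cross-check (a sketch assuming the python-zstandard package is available; it is not used by the snippet above), the frame headers show that the cramjam-produced frame does not record the decompressed size, while the numcodecs-produced one does:

```python
# Assumes the python-zstandard package; used here only to inspect frame headers.
import cramjam
import numcodecs
import zstandard

frame_from_cramjam = bytes(cramjam.zstd.compress(bytes(1)))
frame_from_numcodecs = numcodecs.Zstd().encode(bytes(1))

# content_size is reported as 0 when the frame header does not record the
# original size (the situation this issue is about).
print(zstandard.get_frame_parameters(frame_from_cramjam).content_size)    # expected: 0 (not recorded)
print(zstandard.get_frame_parameters(frame_from_numcodecs).content_size)  # expected: 1
```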

Problem description

According to zstd.h, note 2: "decompressed size is an optional field, it may not be present, typically in streaming mode."

This means that we can always decompress chunks written by our own Zstd, but not necessarily those written by others. This is another case in which we could simply use cramjam or imagecodecs (or even zstandard) rather than building our own Cython extension. In most cases we will know the size of the decompressed data, because it must fit the output (but other filters in the chain may spoil this).
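For what it's worth, external bindings handle such frames by streaming, so the decompressor does not need the final size up front. A rough sketch (using the python-zstandard package; not part of numcodecs) of decoding a frame whose header omits the content size:

```python
# Sketch only: decoding a zstd frame that does not record its decompressed size,
# using the python-zstandard package (an assumption; numcodecs does not use it).
import cramjam
import zstandard

data = bytes(cramjam.zstd.compress(bytes(1)))  # frame header omits the content size

# The one-shot API needs the size, or at least an upper bound, supplied explicitly...
out_oneshot = zstandard.ZstdDecompressor().decompress(data, max_output_size=1)

# ...whereas the streaming API grows its output buffer as it goes.
out_streaming = zstandard.ZstdDecompressor().decompressobj().decompress(data)

assert out_oneshot == out_streaming == bytes(1)
```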

Version and installation information

Please provide the following:

  • Version of numcodecs: main
  • Version of Python interpreter: any
@ilan-gold
Contributor

ilan-gold commented Feb 25, 2025

FWIW I have now run into this twice with other zstd implementations. Here's a quick reproducer of one:

uv venv venv
source venv/bin/activate
uv pip install zarrs zarr numpy

import zarr
import zarrs
import numpy as np

default_pipeline = zarr.config.get("codec_pipeline.path")
with zarr.config.set({"codec_pipeline.path": "zarrs.ZarrsCodecPipeline"}):

    memory_array = np.array(25)
    z = zarr.create_array("foo.zarr", shape=memory_array.shape, dtype=memory_array.dtype, compressors=[zarr.codecs.ZstdCodec()])
    z[...] = memory_array

with zarr.config.set({"codec_pipeline.path": default_pipeline}):
    z = zarr.open_array("foo.zarr")
    z[...]

The other one is the GPU implementation from nvcomp 3.0, as implemented/tested in https://github.com/ilan-gold/cuda-zarr

Env:

```
asciitree==0.3.3
asttokens==3.0.0
crc32c==2.7.1
decorator==5.2.1
deprecated==1.2.18
donfig==0.8.1.post1
executing==2.2.0
fasteners==0.19
fsspec==2025.2.0
iniconfig==2.0.0
ipython==8.32.0
jedi==0.19.2
matplotlib-inline==0.1.7
msgpack==1.1.0
numcodecs==0.15.1
numpy==2.2.3
packaging==24.2
parso==0.8.4
pexpect==4.9.0
pluggy==1.5.0
prompt-toolkit==3.0.50
ptyprocess==0.7.0
pure-eval==0.2.3
pygments==2.19.1
pytest==8.3.4
pyyaml==6.0.2
stack-data==0.6.3
traitlets==5.14.3
typing-extensions==4.12.2
universal-pathlib==0.2.6
wcwidth==0.2.13
wrapt==1.17.2
zarr==3.0.4
zarrs==0.1.3
zstandard==0.23.0
```

@d-v-b
Contributor

d-v-b commented Feb 25, 2025

I think #707 fixes this, but we really should just depend on an external zstd.

@ilan-gold
Contributor

@d-v-b What would that look like? I am not so familiar with this space. Would numcodecs simply re-export cramjam, as referenced originally?

@d-v-b
Contributor

d-v-b commented Feb 25, 2025

I'm not familiar with this space either, but ideally numcodecs would list "zstd" explicitly as a dependency (required or not, I'm not sure), and that dependency would resolve to a maintained source of zstd. https://github.com/indygreg/python-zstandard looks fine for our purposes. I don't think we need cramjam for this?
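For illustration only, a rough sketch of what such a codec might look like (a hypothetical class, not actual numcodecs code; it assumes the python-zstandard package and the existing numcodecs helpers):

```python
# Hypothetical sketch: a numcodecs-style codec backed by python-zstandard.
import zstandard
from numcodecs.abc import Codec
from numcodecs.compat import ensure_contiguous_ndarray, ndarray_copy


class ZstdExternal(Codec):
    """Zstd codec that delegates to the python-zstandard package."""

    codec_id = "zstd"

    def __init__(self, level=1):
        self.level = level

    def encode(self, buf):
        buf = ensure_contiguous_ndarray(buf)
        return zstandard.ZstdCompressor(level=self.level).compress(buf)

    def decode(self, buf, out=None):
        buf = ensure_contiguous_ndarray(buf)
        # Streaming decompression copes with frames that omit the content size.
        decompressed = zstandard.ZstdDecompressor().decompressobj().decompress(bytes(buf))
        return ndarray_copy(decompressed, out)
```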

@ilan-gold
Contributor

> I don't think we need cramjam for this?

No probably not!

I think that library looks good as well.

@dstansby
Contributor

I'm very heavily in favour of depending on external packages instead of vendoring as a future direction for numcodecs. It will need doing carefully though - for blosc I laid the groundwork here: #619, and have a proof of concept here: #571

If there's more than one of us pro going down this road, then I think we should just do it, but post on a slightly wider forum (open an issue and ping the zarr-python developers?) so others have the opportunity to comment before we commit.

@martindurant
Member Author

> I don't think we need cramjam for this?
>
> No probably not!

No, we don't need cramjam, but it might be more convenient than grabbing the separate python packages for each compression algo (this is why fastparquet chose it). It might also be faster in some cases.
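For example, cramjam exposes the same compress/decompress pair for each algorithm it bundles, so a single dependency would cover several numcodecs codecs (a rough illustration of that point, not a proposal for the actual API):

```python
# Illustration: one dependency, the same compress/decompress API per algorithm.
import cramjam

payload = b"some bytes to compress" * 100

for module in (cramjam.zstd, cramjam.snappy, cramjam.lz4, cramjam.gzip):
    compressed = module.compress(payload)
    assert bytes(module.decompress(compressed)) == payload
```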

@ilan-gold
Contributor

ilan-gold commented Feb 25, 2025

@martindurant That does make sense, I hadn't thought of that. It seems like it'd be all or nothing, since cramjam has so many codecs.
