
zstd failure for unspecified buffer size #424


Open
martindurant opened this issue Mar 17, 2023 · 8 comments

@martindurant
Member

Reported in kerchunk: fsspec/kerchunk#317 (comment) by @cgohlke

Minimal, reproducible code sample, a copy-pastable example if possible

>>> import numcodecs, cramjam
>>> numcodecs.Zstd().decode(cramjam.zstd.compress(bytes(1)))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "numcodecs/zstd.pyx", line 219, in numcodecs.zstd.Zstd.decode
  File "numcodecs/zstd.pyx", line 153, in numcodecs.zstd.decompress
RuntimeError: Zstd decompression error: invalid input data
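As a quick cross-check (a sketch assuming the python-zstandard package is available; it is not used by the snippet above), the frame headers show that the cramjam-produced frame does not record the decompressed size, while the numcodecs-produced one does:

```python
# Assumes the python-zstandard package; used here only to inspect frame headers.
import cramjam
import numcodecs
import zstandard

frame_from_cramjam = bytes(cramjam.zstd.compress(bytes(1)))
frame_from_numcodecs = numcodecs.Zstd().encode(bytes(1))

# content_size is reported as 0 when the frame header does not record the
# original size (the situation this issue is about).
print(zstandard.get_frame_parameters(frame_from_cramjam).content_size)    # expected: 0 (not recorded)
print(zstandard.get_frame_parameters(frame_from_numcodecs).content_size)  # expected: 1
```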

Problem description

According to zstd.h, note 2: "decompressed size is an optional field, it may not be present, typically in streaming mode."

This means that we can always decompress chunks written by our own Zstd, but not necessarily those written by others. This is another case in which we could simply use cramjam or imagecodecs (or even zstandard) rather than building our own Cython extension. In most cases we will know the size of the decompressed data, because it must fit the output (but other filters in the chain may spoil this).
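For what it's worth, external bindings handle such frames by streaming, so the decompressor does not need the final size up front. A rough sketch (using the python-zstandard package; not part of numcodecs) of decoding a frame whose header omits the content size:

```python
# Sketch only: decoding a zstd frame that does not record its decompressed size,
# using the python-zstandard package (an assumption; numcodecs does not use it).
import cramjam
import zstandard

data = bytes(cramjam.zstd.compress(bytes(1)))  # frame header omits the content size

# The one-shot API needs the size, or at least an upper bound, supplied explicitly...
out_oneshot = zstandard.ZstdDecompressor().decompress(data, max_output_size=1)

# ...whereas the streaming API grows its output buffer as it goes.
out_streaming = zstandard.ZstdDecompressor().decompressobj().decompress(data)

assert out_oneshot == out_streaming == bytes(1)
```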

Version and installation information

Please provide the following:

  • Version of numcodecs: main
  • Version of Python interpreter: any
@ilan-gold
Contributor

ilan-gold commented Feb 25, 2025

FWIW I have now run into this twice with other zstd implementations. Here's a quick reproducer of one:

uv venv venv
source venv/bin/activate
uv pip install zarrs zarr numpy

import zarr
import zarrs
import numpy as np

default_pipeline = zarr.config.get("codec_pipeline.path")
with zarr.config.set({"codec_pipeline.path": "zarrs.ZarrsCodecPipeline"}):

    memory_array = np.array(25)
    z = zarr.create_array("foo.zarr", shape=memory_array.shape, dtype=memory_array.dtype, compressors=[zarr.codecs.ZstdCodec()])
    z[...] = memory_array

with zarr.config.set({"codec_pipeline.path": default_pipeline}):
    z = zarr.open_array("foo.zarr")
    z[...]

The other one is the GPU implementation from nvcomp 3.0, as implemented/tested in https://github.com/ilan-gold/cuda-zarr

Env:

```
asciitree==0.3.3
asttokens==3.0.0
crc32c==2.7.1
decorator==5.2.1
deprecated==1.2.18
donfig==0.8.1.post1
executing==2.2.0
fasteners==0.19
fsspec==2025.2.0
iniconfig==2.0.0
ipython==8.32.0
jedi==0.19.2
matplotlib-inline==0.1.7
msgpack==1.1.0
numcodecs==0.15.1
numpy==2.2.3
packaging==24.2
parso==0.8.4
pexpect==4.9.0
pluggy==1.5.0
prompt-toolkit==3.0.50
ptyprocess==0.7.0
pure-eval==0.2.3
pygments==2.19.1
pytest==8.3.4
pyyaml==6.0.2
stack-data==0.6.3
traitlets==5.14.3
typing-extensions==4.12.2
universal-pathlib==0.2.6
wcwidth==0.2.13
wrapt==1.17.2
zarr==3.0.4
zarrs==0.1.3
zstandard==0.23.0
```

@d-v-b
Contributor

d-v-b commented Feb 25, 2025

I think #707 fixes this, but we really should just depend on an external zstd.

@ilan-gold
Contributor

@d-v-b What would that look like? I am not so familiar with this space. Would numcodecs simply re-export cramjam, as referenced originally?

@d-v-b
Contributor

d-v-b commented Feb 25, 2025

I'm not familiar with this space either, but ideally numcodecs would list "zstd" explicitly as a dependency (required or not, I'm not sure), and that dependency would resolve to a maintained source of zstd. https://github.com/indygreg/python-zstandard looks fine for our purposes. I don't think we need cramjam for this?
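For illustration only, a rough sketch of what such a codec might look like (a hypothetical class, not actual numcodecs code; it assumes the python-zstandard package and the existing numcodecs helpers):

```python
# Hypothetical sketch: a numcodecs-style codec backed by python-zstandard.
import zstandard
from numcodecs.abc import Codec
from numcodecs.compat import ensure_contiguous_ndarray, ndarray_copy


class ZstdExternal(Codec):
    """Zstd codec that delegates to the python-zstandard package."""

    codec_id = "zstd"

    def __init__(self, level=1):
        self.level = level

    def encode(self, buf):
        buf = ensure_contiguous_ndarray(buf)
        return zstandard.ZstdCompressor(level=self.level).compress(buf)

    def decode(self, buf, out=None):
        buf = ensure_contiguous_ndarray(buf)
        # Streaming decompression copes with frames that omit the content size.
        decompressed = zstandard.ZstdDecompressor().decompressobj().decompress(bytes(buf))
        return ndarray_copy(decompressed, out)
```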

@ilan-gold
Contributor

> I don't think we need cramjam for this?

No probably not!

I think that library looks good as well.

@dstansby
Contributor

I'm very heavily in favour of depending on external packages instead of vendoring as a future direction for numcodecs. It will need doing carefully though - for blosc I laid the groundwork here: #619, and have a proof of concept here: #571

If there's more than one of us pro going down this road, then I think we should just do it, but post on a slightly wider forum (open an issue and ping the zarr-python developers?) so others have the opportunity to comment before we commit.

@martindurant
Member Author

> I don't think we need cramjam for this?
>
> No probably not!

No, we don't need cramjam, but it might be more convenient than grabbing the separate python packages for each compression algo (this is why fastparquet chose it). It might also be faster in some cases.
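For example, cramjam exposes the same compress/decompress pair for each algorithm it bundles, so a single dependency would cover several numcodecs codecs (a rough illustration of that point, not a proposal for the actual API):

```python
# Illustration: one dependency, the same compress/decompress API per algorithm.
import cramjam

payload = b"some bytes to compress" * 100

for module in (cramjam.zstd, cramjam.snappy, cramjam.lz4, cramjam.gzip):
    compressed = module.compress(payload)
    assert bytes(module.decompress(compressed)) == payload
```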

@ilan-gold
Contributor

ilan-gold commented Feb 25, 2025

@martindurant That does make sense, I hadn't thought of that. It seems like it'd be all or nothing, since cramjam has so many codecs.
