adds codecs that numcodecs defines #2

Draft
wants to merge 6 commits into
base: main

normanrz
Member

@normanrz normanrz commented Feb 24, 2025

  • Blosc
  • LZ4
  • Zstd
  • Zlib
  • GZip
  • BZ2
  • LZMA
  • Shuffle
  • CRC32
  • CRC32C
  • Adler32
  • Fletcher32
  • JenkinsLookup3
  • PCodec
  • ZFPY

@normanrz
Member Author

normanrz commented Mar 1, 2025

I validated the schema.json files against the numcodecs fixtures:

# /// script
# dependencies = [ "jsonschema" ]
# ///

import json
from pathlib import Path

from jsonschema import validate

# Local checkout of numcodecs; its fixture directory holds one folder per codec.
numcodecs_fixture_path = Path.home() / "numcodecs" / "fixture"

for path in Path("codecs").glob("numcodecs.*/schema.json"):
    # Directory names look like "numcodecs.<name>"; keep only the codec name.
    _, name = path.parent.name.split(".", 1)
    print(name)
    # Load each schema once rather than once per fixture.
    schema = json.loads(path.read_text())
    for fixture_path in (numcodecs_fixture_path / name).glob("**/config.json"):
        print("  ", fixture_path)
        config_json = json.loads(fixture_path.read_text())
        # Drop the numcodecs "id" key and wrap the rest as "name" + "configuration".
        config_json.pop("id", None)
        config_json = {"name": f"numcodecs.{name}", "configuration": config_json}

        validate(instance=config_json, schema=schema)
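
The wrapping step inside the loop above can be isolated as a small function. This is a minimal sketch, assuming a numcodecs-style config dict that carries an `id` key; the function name `to_v3_metadata` and the sample zlib config are illustrative and not taken from numcodecs or zarr-python.

```python
def to_v3_metadata(name: str, numcodecs_config: dict) -> dict:
    """Wrap a numcodecs-style config as {"name": ..., "configuration": ...}.

    Hypothetical helper: it mirrors the pop("id") + re-wrap step of the
    validation script, without mutating the input dict.
    """
    configuration = {k: v for k, v in numcodecs_config.items() if k != "id"}
    return {"name": f"numcodecs.{name}", "configuration": configuration}


example = to_v3_metadata("zlib", {"id": "zlib", "level": 1})
print(example)
```

Building a new dict instead of calling `pop` keeps the loaded fixture intact, which helps if the same config is reused for other checks.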

@jbms
Contributor

jbms commented Mar 5, 2025

Is there a reason to duplicate codecs that are already listed elsewhere in this repo, e.g. gzip, zstd, blosc?

Also, many of these leave important details of the encoded format unspecified, meaning the actual specification is the numcodecs source code.

I'm not sure if it is intended that names can be registered without a proper specification other than a reference to the source code. But even if it is allowed, surely it should be discouraged and these initial ones should include a proper specification.

@normanrz
Member Author

normanrz commented Mar 5, 2025

> Is there a reason to duplicate codecs that are already listed elsewhere in this repo, e.g. gzip, zstd, blosc?

Well, right now numcodecs uses the numcodecs. prefix in the codec names. Also, I am not sure the metadata is 100% equal to the ones listed in zarr-specs. That is why they are duplicated.

> Also, many of these leave important details of the encoded format unspecified, meaning the actual specification is the numcodecs source code.
>
> I'm not sure if it is intended that names can be registered without a proper specification other than a reference to the source code. But even if it is allowed, surely it should be discouraged and these initial ones should include a proper specification.

I agree and would welcome contributions. Unfortunately, the numcodecs documentation is also pretty sparse on encoding details. So, for every codec we need to go through the code and write a spec.
Writing a specification is strongly encouraged, but not required. In the interest of time, I wanted to get these specification scaffolds in to reserve the names and leave the spec details for later.

@jbms
Contributor

jbms commented Mar 5, 2025

> Is there a reason to duplicate codecs that are already listed elsewhere in this repo, e.g. gzip, zstd, blosc?

> Well, right now numcodecs uses the numcodecs. prefix in the codec names. Also, I am not sure the metadata is 100% equal to the ones listed in zarr-specs. That is why they are duplicated.

I see --- I did not realize that zarr-python had added all of the numcodecs codecs for zarr v3 as numcodecs.xxx.

I imagine it was done to make it very easy for someone using zarr-python to migrate to using zarr v3 -- which is understandable.

However, from an interoperability perspective this is unfortunate: someone using zarr-python with zarr v3 and a numcodecs.XXX codec may not realize that they are producing a zarr array that is not interoperable with any other zarr implementation, because the codec gets recorded as numcodecs.xxx. That is particularly unfortunate for codecs like gzip, blosc, or zstd, which other implementations do in fact support with both zarr v2 and zarr v3. Had the zarr-python user specified the codec in exactly the same way but used zarr v2, they would have produced an interoperable array; by specifying zarr v3 they produce a non-interoperable one.
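
The name mismatch described above can be illustrated with a small sketch. It assumes a hypothetical reader that recognizes codecs purely by the un-prefixed name in the array metadata; `SUPPORTED_CODECS`, `can_decode`, and the sample configuration are made up for illustration and do not describe any specific implementation.

```python
# Hypothetical reader that only knows the un-prefixed codec names.
SUPPORTED_CODECS = {"gzip", "zstd", "blosc"}


def can_decode(codec_metadata: dict) -> bool:
    """Return True if this (hypothetical) reader recognizes the codec name."""
    return codec_metadata["name"] in SUPPORTED_CODECS


# Same compression settings, recorded under two different names.
plain = {"name": "gzip", "configuration": {"level": 5}}
prefixed = {"name": "numcodecs.gzip", "configuration": {"level": 5}}

print(can_decode(plain))     # True
print(can_decode(prefixed))  # False
```

The underlying byte stream may well be readable in both cases; the reader rejects the second array only because of the recorded name.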

@rabernat

Can we have aliases? Like the same codec has two different names?

Arrow and other projects do that (e.g. arrow utf8 is an alias for string).

@normanrz
Member Author

> Can we have aliases? Like the same codec has two different names?
>
> Arrow and other projects do that (e.g. arrow utf8 is an alias for string).

I think that would just be another extension that needs to be registered, with some co-references in the readme.

@rabernat

To be clear about my position, I think we should have such aliases or double entries (numcodecs.XXX and XXX) in every case where the codec is a general-purpose interoperable codec.
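
One way a reader could honor such aliases is a lookup table that maps the prefixed name to the plain one before dispatching. This is only a sketch under stated assumptions: the `ALIASES` table, `canonical_name`, and `resolve` are hypothetical, and it presumes the two entries share an identical configuration format, which has not been confirmed for the numcodecs.* codecs.

```python
# Hypothetical alias table; assumes both names share one configuration format.
ALIASES = {
    "numcodecs.gzip": "gzip",
    "numcodecs.zstd": "zstd",
    "numcodecs.blosc": "blosc",
}


def canonical_name(name: str) -> str:
    """Map an aliased codec name to its canonical name; pass others through."""
    return ALIASES.get(name, name)


def resolve(codec_metadata: dict) -> dict:
    """Return a copy of the codec metadata with the name canonicalized."""
    return {**codec_metadata, "name": canonical_name(codec_metadata["name"])}


resolved = resolve({"name": "numcodecs.gzip", "configuration": {"level": 5}})
print(resolved)
```

Names without an alias entry, such as numcodecs.fletcher32 in this table, pass through unchanged, so codecs with no general-purpose counterpart keep their prefixed name.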

@normanrz
Member Author

That makes sense. I think that can be a follow-up, though, once we have better specifications for the individual codecs.

@normanrz
Member Author

I think this PR is blocked by zarr-developers/numcodecs#742 (comment). Instead of registering all codecs as numcodecs.*, it would be better to register them individually. However, that would require harmonization in both zarr-python and numcodecs, for example with respect to numcodecs.blosc and blosc.

@normanrz normanrz marked this pull request as draft May 6, 2025 17:12