[Format] Specify VARIABLE_SIZE_LIST Logical type #437

Open
rok opened this issue Jun 24, 2024 · 2 comments

Comments

Member

rok commented Jun 24, 2024

Arrow recently introduced the FixedShapeTensor and VariableShapeTensor canonical extension types, which use FixedSizeList and StructArray(List, FixedSizeList) as storage, respectively. These are targeted at machine learning and scientific applications that deal with large datasets and would benefit from using Parquet as on-disk storage.

If Arrow's List were stored as BYTE_ARRAY, we would likely see reduced overhead, since per-element definition and repetition levels would no longer need to be read and written. See discussion here. It would therefore be beneficial to introduce a VARIABLE_SIZE_LIST logical type to Parquet.
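
To make the overhead concrete, here is a minimal sketch in plain Python (no Parquet library involved) of the level entries a list column produces compared with packing each row into one opaque byte string. It assumes the standard 3-level LIST layout with a required list of required float32 elements, so the maximum repetition and definition levels are both 1. The helper names `list_levels` and `as_byte_array` are hypothetical, and the real on-disk cost also depends on how the levels are RLE/bit-packed.

```python
import struct


def list_levels(rows):
    """Repetition/definition levels for a required list of required floats.

    Assumption: standard 3-level LIST layout, max rep level = max def level = 1.
    """
    rep, defn = [], []
    for row in rows:
        if not row:
            # An empty list still emits one level pair, with no value stored.
            rep.append(0)
            defn.append(0)
            continue
        for i, _ in enumerate(row):
            rep.append(0 if i == 0 else 1)  # 0 marks the start of a new row
            defn.append(1)                  # the element is present
    return rep, defn


def as_byte_array(row):
    """Pack one row of float32 values into a single little-endian bytes value."""
    return struct.pack(f"<{len(row)}f", *row)


rows = [[1.0, 2.0, 3.0], [4.0], []]
rep, defn = list_levels(rows)
blobs = [as_byte_array(r) for r in rows]

print(len(rep), "level pairs for the list column")
print(len(blobs), "values for the byte-array column")
```

With the list layout, the number of level pairs grows with the total element count, while the byte-array layout stores one value per row and, for a required column, needs no levels at all.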

Member

mapleFU commented Jul 17, 2024

I was busy previously. Sorry for the delay.

I find it a bit hard for Parquet to optimize tensors; maybe the problem is the rep/def levels for tensors / fixed-length byte arrays. Maybe I could try adding a fast check of the rep/def levels for this type.

Member Author

rok commented Jul 18, 2024

Yes, I would expect rep/def levels to be significant overhead for tensors currently.
