tritzman changed the title from "[Python] Attempt to encrypt column of type 'list' produces OSError" to "[Python][Parquet] Attempt to encrypt column of type 'list' produces OSError" on Apr 17, 2024
In my application code, when I call write_dataset, I have a file_visitor that collects metadata as Parquet files are created. Looking at the metadata of each pyarrow.dataset.WrittenFile, I found path_in_schema, which shows that lists are stored in Parquet under the name <column_name>.list.element. Adding that suffix to the column name listed under col_b_key_name (see column_keys below) makes the example work, including the assert that compares the input table with the output table. (At the moment I'm not sure how to confirm that all of the data is actually encrypted.)
Similarly, my application data includes structs. There I found a path_in_schema entry for each field of the struct. I believe this would require a key declaration for each struct field (e.g. <column_name>.field_1, <column_name>.field_2, <column_name>.field_3, etc.).
I have not looked into nested structs-of-lists or lists-of-structs to see how those are represented in Parquet.
It seems reasonable to have the developer list the column names to encrypt. But for non-primitive types, I'm not sure how they would know the modified column name used in the file.
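As a possible workaround, the top-level names a developer would naturally write could be expanded into the names used in the file before building the encryption configuration. The sketch below is plain Python and not part of pyarrow: expand_column_keys is a hypothetical helper, the key names col_a_key and col_b_key are illustrative, and it assumes column_keys has the usual shape of key name mapped to a list of column names, with the leaf paths obtained from path_in_schema.

```python
def expand_column_keys(column_keys, leaf_paths):
    """Replace top-level column names with the matching Parquet leaf paths.

    column_keys: dict mapping a key name to a list of top-level column names.
    leaf_paths:  list of path_in_schema strings taken from the file metadata.
    """
    expanded = {}
    for key_name, columns in column_keys.items():
        leaves = []
        for column in columns:
            # A leaf belongs to a column if it is the column itself
            # (primitive type) or lies below it (list, struct, nested).
            matches = [
                path for path in leaf_paths
                if path == column or path.startswith(column + ".")
            ]
            if not matches:
                raise KeyError(f"column {column!r} not found in leaf paths")
            leaves.extend(matches)
        expanded[key_name] = leaves
    return expanded


leaf_paths = ["a", "b.list.element", "c.field_1", "c.field_2"]
column_keys = {"col_a_key": ["a"], "col_b_key": ["b", "c"]}
expanded = expand_column_keys(column_keys, leaf_paths)
print(expanded)
```

This only covers the list and struct layouts I have actually observed. I have not verified that deeper nesting follows the same prefix rule.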
In my application code, when writing encrypted Parquet, Python crashes silently in the previously mentioned file visitor: the application just exits with no messages or exceptions. This happens when calling .metadata.to_dict() on the pyarrow.dataset.WrittenFile. By setting a breakpoint and experimenting in the debugger, I found the same symptom when calling to_dict() on metadata.row_group(0). I won't be collecting and writing the _metadata or _common_metadata files when encrypting the data, so this code is normally disabled. But I figured it was worth noting the crash.
Describe the bug, including details regarding any error messages, version, and platform.
pyarrow 15.0.2
Changing the table definition in the example at python/examples/parquet_encryption/sample_vault_kms_client.py to this:
produces an exception:
OSError: Encrypted column b not in file schema
I expected the encryption to work on a list type. I didn't test other non-fundamental types (like structs) or nested types. I did look for clarification on the capability in Parquet and Arrow, without much luck. Apologies if I missed something.
Thanks for Arrow, it's quite nice!
Component(s)
Python
Parquet