[Python][Parquet] Attempt to encrypt column of type 'list' produces OSError #41246

Open
tritzman opened this issue Apr 17, 2024 · 2 comments · May be fixed by #45462

Comments


tritzman commented Apr 17, 2024

Describe the bug, including details regarding any error messages, version, and platform.

pyarrow 15.0.2

Changing the table definition in the example at python/examples/parquet_encryption/sample_vault_kms_client.py to this:

    table = pa.Table.from_pydict({
        'a': pa.array([1, 2, 3]),
        'b': pa.array([['a', 'b', 'c'], ['a', 'b', 'c'], ['a', 'b', 'c']]),
        'c': pa.array(['x', 'y', 'z'])
    })

produces an exception:
OSError: Encrypted column b not in file schema

I expected the encryption to work on a list type. I didn't test other non-fundamental types (like structs) or nested types. I did look for clarification on the capability in Parquet and Arrow, without much luck. Apologies if I missed something.

Thanks for Arrow, it's quite nice!

Component(s)

Python
Parquet

@tritzman tritzman changed the title [Python] Attempt to encrypt column of type 'list' produces OSError [Python][Parquet] Attempt to encrypt column of type 'list' produces OSError Apr 17, 2024
tritzman (Author) commented

In my application code, when I call write_dataset, I have a file_visitor that collects metadata as Parquet files are created. Looking at the pyarrow.dataset.WrittenFile's metadata, I find path_in_schema, which shows that lists are stored in Parquet under the name <column_name>.list.element. Adding that suffix to the column name listed under col_b_key_name (see column_keys below) makes encryption work, and the assert comparing the input table with the output table passes. (ATM I'm not sure how to confirm that all of the data is actually encrypted.)

    column_keys={
        col_a_key_name: ["a"],
        col_b_key_name: ["b.list.element"],
    }

Similarly, my application data includes structs. There I found path_in_schema entries for each field of the struct. I believe this would require a key declaration for each struct field (e.g. <column_name>.field_1, <column_name>.field_2, <column_name>.field_3, etc.).

I have not looked into nested structs-of-lists or lists-of-structs to see how those are represented in Parquet.

It seems reasonable to have the developer list the column names to encrypt. But for non-primitive types, I'm not sure how they would know the modified column name used in the file.
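As a possible workaround, the developer could keep listing top-level column names and expand them to leaf paths just before building the encryption configuration. The following is a minimal sketch, not part of pyarrow: expand_column_keys, the key names, and the struct column s are hypothetical, and it assumes top-level column names contain no dots.

```python
from collections import defaultdict


def expand_column_keys(column_keys, leaf_paths):
    """Replace top-level column names with the Parquet leaf paths under them.

    column_keys: {key_name: [top-level column names]}
    leaf_paths:  dotted leaf paths, e.g. the path_in_schema values
    """
    # Group leaf paths by top-level column (the text before the first dot).
    leaves_by_column = defaultdict(list)
    for path in leaf_paths:
        leaves_by_column[path.split(".", 1)[0]].append(path)

    expanded = {}
    for key_name, columns in column_keys.items():
        expanded[key_name] = []
        for column in columns:
            if column not in leaves_by_column:
                raise KeyError(f"column {column!r} has no leaf paths")
            expanded[key_name].extend(leaves_by_column[column])
    return expanded


# Hypothetical leaf paths: a primitive, a list, and a two-field struct.
leaf_paths = ["a", "b.list.element", "s.field_1", "s.field_2", "c"]
column_keys = {"key_a": ["a"], "key_b": ["b"], "key_s": ["s"]}
print(expand_column_keys(column_keys, leaf_paths))
```

Splitting on the first dot keeps the sketch short, but it would misgroup columns whose own names contain a dot.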

In my application code, when writing encrypted Parquet, Python silently crashes in the previously mentioned file visitor: the application just exits with no message or exception. This happens when calling .metadata.to_dict() on the pyarrow.dataset.WrittenFile. By setting a breakpoint and experimenting in the debugger, I found the same symptom when calling metadata.row_group(0).to_dict(). I won't be collecting and writing the _metadata or _common_metadata files when encrypting the data, so this code is normally disabled. But I figured it was worth noting the crash.

EnricoMi (Contributor) commented

I could reproduce the issue in C++ and Python: https://github.com/EnricoMi/arrow/pull/8/files.

The issue exists for the list, struct and map data types.

I am now looking into fixing / improving this.
