Enable parquet input for S3FileStore s3 select query #95

Merged
merged 1 commit into from
Sep 28, 2023

Conversation

alexeocto
Contributor

@alexeocto alexeocto commented Sep 25, 2023

This PR enables the S3FileStore.fetch_object_contents_using_s3_select method to also query parquet files by adding a parquet serializer to the s3_select.py file.
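For readers unfamiliar with S3 Select, here is a minimal sketch of what a Parquet-input request might look like at the boto3 level. The helper name `build_parquet_select_kwargs` and its parameters are hypothetical and are not taken from `s3_select.py`; only the keyword names passed to boto3's `select_object_content` are real. The serializer classes in this PR may build these dictionaries quite differently.

```python
def build_parquet_select_kwargs(bucket, key, query, output="json"):
    """Build keyword arguments for boto3's S3 ``select_object_content`` call.

    Hypothetical helper for illustration; not the code added in this PR.
    """
    if output == "json":
        # One JSON document per record, newline-delimited.
        output_serialization = {"JSON": {"RecordDelimiter": "\n"}}
    elif output == "csv":
        output_serialization = {
            "CSV": {"FieldDelimiter": ",", "RecordDelimiter": "\n"}
        }
    else:
        raise ValueError(f"Unsupported output format: {output!r}")

    return {
        "Bucket": bucket,
        "Key": key,
        "ExpressionType": "SQL",
        "Expression": query,
        # Parquet input takes no options, hence the empty dict.
        "InputSerialization": {"Parquet": {}},
        "OutputSerialization": output_serialization,
    }
```

With a boto3 client, the result would be passed as `client.select_object_content(**kwargs)`; the bucket and key are whatever the caller supplies.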

An equivalent LocalFileStore method was also implemented to facilitate local development, in particular for https://github.com/octoenergy/kraken-core/pull/102690 and integrity checks.

The LocalFileStore.fetch_object_contents_using_s3_select method works by first loading the entire file from its local storage location and then creating an in-memory DuckDB database. DuckDB was selected over SQLite and SQLAlchemy because they could not handle more complex data types such as arrays without peculiarities and hashing, whereas DuckDB returned results in the expected format for all test data.

Once the temporary in-memory database is created, the file is loaded into a table and an SQL query is run against it, returning a dataframe with the results. The dataframe is then output as CSV or JSON, using the arguments from the output serializer class, to mimic the results that S3 would return.

A comprehensive suite of unit tests has also been added.

@alexeocto alexeocto force-pushed the ae/enable-fetch-file-contents-read-parquet branch 4 times, most recently from 66443d3 to d879ae5 Compare September 27, 2023 14:57
@alexeocto alexeocto marked this pull request as ready for review September 27, 2023 14:57
@alexeocto alexeocto force-pushed the ae/enable-fetch-file-contents-read-parquet branch 3 times, most recently from 108be3a to 51c9a6e Compare September 28, 2023 11:15
Co-Authored-By: Omer Korner <omerkorner@gmail.com>
@alexeocto alexeocto force-pushed the ae/enable-fetch-file-contents-read-parquet branch from 51c9a6e to 4740d77 Compare September 28, 2023 11:21
@alexeocto alexeocto merged commit e7c6d43 into main Sep 28, 2023
@alexeocto alexeocto deleted the ae/enable-fetch-file-contents-read-parquet branch September 28, 2023 11:37
2 participants