Enable parquet input for S3FileStore s3 select query #95
This PR enables the `S3FileStore.fetch_object_contents_using_s3_select` method to also query parquet files by adding a parquet serializer to the `s3_select.py` file.

An equivalent `LocalFileStore` method was also implemented to facilitate local development, in particular for https://github.com/octoenergy/kraken-core/pull/102690 and Integrity checks.

`LocalFileStore.fetch_object_contents_using_s3_select` works by first loading the entire file from its local storage location and then creating an in-memory DuckDB database. DuckDB was selected over SQLite and SQLAlchemy because they could not handle more complex data types such as arrays without peculiarities and hashing, whereas DuckDB returned results in the expected format for all test data.

Once the temporary database is created in memory, the file is loaded into a table and an SQL query is run against it, returning a dataframe with the results. The dataframe is then output as CSV or JSON using the arguments from the output serializer class, in order to mimic the results that would be returned by S3.
A comprehensive suite of unit tests has also been added.