GPU accelerate Apache Iceberg reads #5941
Conversation
Signed-off-by: Jason Lowe <jlowe@nvidia.com>
This looks good. The new fallback code looks good as well. +1
LGTM
Rebuilding CI to ensure Iceberg tests run after #6020.
build
build
A strange error failed premerge, retry it.
build
The last premerge run failed orc_test. There should be no permission issue in the premerge environment (docker); not sure if there are any side effects.
Hi @jlowe,
I avoided adding it to premerge since the Iceberg tests are currently serial and would slow down premerge. I was going to file a followup to address this, but if you feel it should be part of premerge I'm happy to update the PR.
build
build |
build |
A separate issue is good for me. And we can have more discussion about whether to add it.
Closes #4817, closes #5453, and closes #5510.
Adds basic support for GPU acceleration of Apache Iceberg table reads along with a document detailing the limitations of the support and tests. The tests exercise the usual, generic table reading tests, but also test features more specific to Apache Iceberg like time-travel reads, incremental snapshot reads, partitioning schema evolution, row deletion and updates, etc.
Only the Parquet data format is supported in this initial version, and it only provides a per-file strategy to GPU acceleration. Multi-threaded and coalescing reader strategies are planned for the future.
This supports Apache Iceberg 0.13.x, and leverages the Iceberg `api` and `core` code provided by whatever Iceberg jar is provided by the user, with the assumption that those APIs are relatively stable over time. Related Apache Iceberg code for Parquet and Spark has been adapted for use within the RAPIDS Accelerator, as these interfaces are less likely to remain stable across Apache Iceberg versions. Reflection is used to port over the relevant CPU scan state into an equivalent GPU-accelerated scan.

Metadata queries and processing remain on the CPU, as this involves parsing relatively tiny JSON files intended for CPU consumption. The data is read via the existing Parquet partition reader, after row-group filtering and predicate pushdown have been applied.
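To make the reflection-based porting concrete, here is a minimal sketch of reading private scan state by field name. The `CpuScan` class and its fields are made up for illustration; the actual Iceberg/Spark scan classes, field names, and the full set of state copied in this PR differ.

```java
import java.lang.reflect.Field;

// Hypothetical stand-in for a CPU-side Iceberg scan holding private state
// (illustrative only; the real classes in Iceberg/Spark are different).
class CpuScan {
    private final String tableLocation;
    private final long snapshotId;

    CpuScan(String tableLocation, long snapshotId) {
        this.tableLocation = tableLocation;
        this.snapshotId = snapshotId;
    }
}

public class ReflectionPort {
    // Read a private field by name via reflection, in the spirit of how the
    // PR ports relevant CPU scan state into an equivalent GPU scan.
    static Object readField(Object obj, String name) throws Exception {
        Field f = obj.getClass().getDeclaredField(name);
        f.setAccessible(true);
        return f.get(obj);
    }

    public static void main(String[] args) throws Exception {
        CpuScan cpu = new CpuScan("s3://bucket/table", 42L);
        String location = (String) readField(cpu, "tableLocation");
        long snapshot = (Long) readField(cpu, "snapshotId");
        // The extracted values would be used to construct the GPU scan.
        System.out.println(location + " snapshot=" + snapshot);
    }
}
```

The tradeoff this approach accepts: reflection avoids compiling against unstable internals, but it breaks silently if an Iceberg release renames a field, which is why the PR restricts it to APIs assumed to be stable.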