GPU accelerate Apache Iceberg reads #5941

Merged on Jul 26, 2022 (45 commits; changes shown from 42 commits)

Commits
7cf08b9
GPU accelerated reads for Apache Iceberg
jlowe Apr 26, 2022
2f2435f
Clip Parquet block data to read schema
jlowe May 19, 2022
873443a
Add configs to disable Iceberg
jlowe May 19, 2022
49e2787
DPP filtering working, still not getting reuse
jlowe May 20, 2022
a233274
Merge branch 'branch-22.06' into iceberg-read-wip
jlowe May 23, 2022
69f6df1
Fix lack of exchange reuse
jlowe May 24, 2022
0232dad
Merge branch 'branch-22.06' into iceberg-read-wip
jlowe May 26, 2022
a953bbd
Add support for Iceberg on Spark 3.1
jlowe May 26, 2022
c81cb86
Use metrics from parent SparkScan
jlowe May 26, 2022
340b6c2
Merge branch 'branch-22.08' into iceberg-read-wip
jlowe Jun 13, 2022
d18ac4f
Fix metrics
jlowe Jun 13, 2022
9f63212
Fix DPP test
jlowe Jun 14, 2022
a2086bf
Update NOTICE-binary
jlowe Jun 14, 2022
e006f18
Remove unused code
jlowe Jun 24, 2022
fb2970a
Merge branch 'branch-22.08' into iceberg-read
jlowe Jun 24, 2022
352eb92
Add Iceberg support to ExternalSource
jlowe Jun 27, 2022
0a91f4f
Fix missing bytes read metric from stages reading from Iceberg
jlowe Jun 27, 2022
eaa3c09
Fix handling of list columns, add round trip Parquet read test
jlowe Jun 27, 2022
6c7b0e0
Fix Iceberg read enable config
jlowe Jun 27, 2022
41bdc86
Add more Iceberg tests
jlowe Jun 28, 2022
8f2361d
Remove unused code
jlowe Jun 28, 2022
4964085
More Iceberg tests
jlowe Jun 28, 2022
e06865c
Fix Iceberg metadata queries
jlowe Jun 28, 2022
f959047
Fix reads of Iceberg tables with renamed columns
jlowe Jun 29, 2022
1680427
Fix Iceberg reads for missing columns
jlowe Jun 29, 2022
793f6ad
Add Iceberg partition update and delete tests
jlowe Jun 30, 2022
94ca5d6
Fix Iceberg upcasting during reads
jlowe Jun 30, 2022
4a8e296
Suppress some warnings
jlowe Jun 30, 2022
83e0bb8
Update to Iceberg 0.13.2
jlowe Jul 1, 2022
d65c2a4
Skip tests not supported on Spark 3.1.x
jlowe Jul 1, 2022
97b281d
Add docs detailing Iceberg support
jlowe Jul 1, 2022
62a2bcf
Merge branch 'branch-22.08' into iceberg-read
jlowe Jul 1, 2022
1b3e594
Fix Iceberg doc reference
jlowe Jul 1, 2022
06433ab
Fix spark330 build
jlowe Jul 11, 2022
ed5e5a5
Fix paste error in Iceberg support docs
jlowe Jul 11, 2022
9853721
Merge branch 'branch-22.08' into iceberg-read
jlowe Jul 11, 2022
6590ed3
Merge branch 'branch-22.08' into iceberg-read
jlowe Jul 14, 2022
e7fe43a
Add new 320+/java directory for recently added shims
jlowe Jul 14, 2022
495d3a5
Add protections for errors during conversion of CPU scan
jlowe Jul 18, 2022
bd539c5
Update test for zstd being fully supported in libcudf
jlowe Jul 18, 2022
98daf5b
Address review comments
jlowe Jul 18, 2022
bab539f
Merge branch 'branch-22.08' into iceberg-read
jlowe Jul 18, 2022
abccc6b
Work around classloader issues in distributed setups
jlowe Jul 21, 2022
779a6c4
Merge branch 'branch-22.08' into iceberg-read
jlowe Jul 21, 2022
5d2fb27
Update to ShimLoader convention
jlowe Jul 21, 2022
30 changes: 30 additions & 0 deletions NOTICE
@@ -1,6 +1,8 @@
RAPIDS plugin for Apache Spark
Copyright (c) 2019-2022, NVIDIA CORPORATION

--------------------------------------------------------------------------------

// ------------------------------------------------------------------
// NOTICE file corresponding to the section 4d of The Apache License,
// Version 2.0, in this case for
@@ -12,6 +14,34 @@ Copyright 2014 and onwards The Apache Software Foundation
This product includes software developed at
The Apache Software Foundation (http://www.apache.org/).

--------------------------------------------------------------------------------

Apache Iceberg
Copyright 2017-2022 The Apache Software Foundation

This product includes software developed at
The Apache Software Foundation (http://www.apache.org/).

--------------------------------------------------------------------------------

This project includes code from Kite, developed at Cloudera, Inc. with
the following copyright notice:

| Copyright 2013 Cloudera Inc.
|
| Licensed under the Apache License, Version 2.0 (the "License");
| you may not use this file except in compliance with the License.
| You may obtain a copy of the License at
|
| http://www.apache.org/licenses/LICENSE-2.0
|
| Unless required by applicable law or agreed to in writing, software
| distributed under the License is distributed on an "AS IS" BASIS,
| WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
| See the License for the specific language governing permissions and
| limitations under the License.

--------------------------------------------------------------------------------

This product bundles various third-party components under other open source licenses.

30 changes: 29 additions & 1 deletion NOTICE-binary
@@ -12,7 +12,35 @@ Copyright 2014 and onwards The Apache Software Foundation
This product includes software developed at
The Apache Software Foundation (http://www.apache.org/).

---------------------------------------------------------------------
--------------------------------------------------------------------------------

Apache Iceberg
Copyright 2017-2022 The Apache Software Foundation

This product includes software developed at
The Apache Software Foundation (http://www.apache.org/).

--------------------------------------------------------------------------------

This project includes code from Kite, developed at Cloudera, Inc. with
the following copyright notice:

| Copyright 2013 Cloudera Inc.
|
| Licensed under the Apache License, Version 2.0 (the "License");
| you may not use this file except in compliance with the License.
| You may obtain a copy of the License at
|
| http://www.apache.org/licenses/LICENSE-2.0
|
| Unless required by applicable law or agreed to in writing, software
| distributed under the License is distributed on an "AS IS" BASIS,
| WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
| See the License for the specific language governing permissions and
| limitations under the License.

--------------------------------------------------------------------------------

UCF Consortium - Unified Communication X (UCX)

Copyright (c) 2014-2015 UT-Battelle, LLC. All rights reserved.
62 changes: 62 additions & 0 deletions docs/additional-functionality/iceberg-support.md
@@ -0,0 +1,62 @@
---
layout: page
title: Apache Iceberg Support
parent: Additional Functionality
nav_order: 7
---

# Apache Iceberg Support

The RAPIDS Accelerator for Apache Spark provides limited support for Apache Iceberg tables.
This document details the Apache Iceberg features that are supported.

## Apache Iceberg Versions

The RAPIDS Accelerator supports Apache Iceberg 0.13.x. Earlier versions of Apache Iceberg are
not supported.

## Reading Tables

### Metadata Queries

Reads of Apache Iceberg metadata, e.g. the `history`, `snapshots`, and other metadata tables
associated with a table, are not GPU-accelerated. The CPU continues to process these
metadata-level queries.

### Row-level Delete and Update Support

Apache Iceberg supports row-level deletions and updates. Tables that are using a configuration of
`write.delete.mode=merge-on-read` are not supported.
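
As an illustration of the restriction above, the following minimal sketch checks a table's
properties for the unsupported delete mode. The helper name is hypothetical and is not part of
the RAPIDS Accelerator or Apache Iceberg. It assumes the properties have already been collected
into a plain dictionary (for example from the output of `SHOW TBLPROPERTIES`), and that an unset
`write.delete.mode` means copy-on-write.

```python
# Hypothetical helper for illustration only; not a RAPIDS Accelerator or Iceberg API.

def uses_merge_on_read_deletes(table_properties):
    """Return True when the table's delete mode is merge-on-read.

    `table_properties` is a plain dict of Iceberg table properties.
    Assumption: an unset `write.delete.mode` means copy-on-write.
    """
    mode = table_properties.get("write.delete.mode", "copy-on-write")
    return mode.strip().lower() == "merge-on-read"


props = {"write.delete.mode": "merge-on-read"}
if uses_merge_on_read_deletes(props):
    print("Table uses merge-on-read deletes, which this document lists as unsupported.")
```

A check like this could run before a job is submitted, so that an unsupported table is noticed
early rather than discovered from the query plan.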

### Schema Evolution

Columns that are added or removed at the top level of the table schema are supported. Columns
that are added or removed within struct columns are not supported.

### Data Formats

Apache Iceberg can store data in various formats. Each section below details the levels of support
for each of the underlying data formats.

#### Parquet

Data stored in Parquet is supported with the same limitations that apply to loading data from raw Parquet
files. See the [Input/Output](../supported_ops.md#inputoutput) documentation for details. The
following compression codecs applied to the Parquet data are supported:
- gzip (Apache Iceberg default)
- snappy
- uncompressed
- zstd
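
The codec list above can be expressed as a small lookup. This is a sketch only: the helper and
constant names are hypothetical, and it assumes the codec name comes from the Iceberg table
property `write.parquet.compression-codec`, falling back to gzip, which the list above names as
the Apache Iceberg default.

```python
# Hypothetical helper for illustration only; not a RAPIDS Accelerator or Iceberg API.

# Codecs listed as supported in this document.
SUPPORTED_PARQUET_CODECS = {"gzip", "snappy", "uncompressed", "zstd"}


def parquet_codec_supported(table_properties):
    """Return True when the table's Parquet codec is in the supported list.

    Assumption: the codec is stored under `write.parquet.compression-codec`
    and defaults to gzip when the property is unset.
    """
    codec = table_properties.get("write.parquet.compression-codec", "gzip")
    return codec.strip().lower() in SUPPORTED_PARQUET_CODECS


print(parquet_codec_supported({"write.parquet.compression-codec": "zstd"}))
print(parquet_codec_supported({"write.parquet.compression-codec": "lz4"}))
```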

#### ORC

The RAPIDS Accelerator does not support Apache Iceberg tables using the ORC data format.

#### Avro

The RAPIDS Accelerator does not support Apache Iceberg tables using the Avro data format.

## Writing Tables

The RAPIDS Accelerator for Apache Spark does not accelerate Apache Iceberg writes. Writes
to Iceberg tables will be processed by the CPU.
2 changes: 2 additions & 0 deletions docs/configs.md
@@ -82,6 +82,8 @@ Name | Description | Default Value
<a name="sql.format.avro.reader.type"></a>spark.rapids.sql.format.avro.reader.type|Sets the Avro reader type. We support different types that are optimized for different environments. The original Spark style reader can be selected by setting this to PERFILE which individually reads and copies files to the GPU. Loading many small files individually has high overhead, and using either COALESCING or MULTITHREADED is recommended instead. The COALESCING reader is good when using a local file system where the executors are on the same nodes or close to the nodes the data is being read on. This reader coalesces all the files assigned to a task into a single host buffer before sending it down to the GPU. It copies blocks from a single file into a host buffer in separate threads in parallel, see spark.rapids.sql.multiThreadedRead.numThreads. MULTITHREADED is good for cloud environments where you are reading from a blobstore that is totally separate and likely has a higher I/O read cost. Many times the cloud environments also get better throughput when you have multiple readers in parallel. This reader uses multiple threads to read each file in parallel and each file is sent to the GPU separately. This allows the CPU to keep reading while GPU is also doing work. See spark.rapids.sql.multiThreadedRead.numThreads and spark.rapids.sql.format.avro.multiThreadedRead.maxNumFilesParallel to control the number of threads and amount of memory used. By default this is set to AUTO so we select the reader we think is best. This will either be the COALESCING or the MULTITHREADED based on whether we think the file is in the cloud. See spark.rapids.cloudSchemes.|AUTO
<a name="sql.format.csv.enabled"></a>spark.rapids.sql.format.csv.enabled|When set to false disables all csv input and output acceleration. (only input is currently supported anyways)|true
<a name="sql.format.csv.read.enabled"></a>spark.rapids.sql.format.csv.read.enabled|When set to false disables csv input acceleration|true
<a name="sql.format.iceberg.enabled"></a>spark.rapids.sql.format.iceberg.enabled|When set to false disables all Iceberg acceleration|true
<a name="sql.format.iceberg.read.enabled"></a>spark.rapids.sql.format.iceberg.read.enabled|When set to false disables Iceberg input acceleration|true
<a name="sql.format.json.enabled"></a>spark.rapids.sql.format.json.enabled|When set to true enables all json input and output acceleration. (only input is currently supported anyways)|false
<a name="sql.format.json.read.enabled"></a>spark.rapids.sql.format.json.read.enabled|When set to true enables json input acceleration|false
<a name="sql.format.orc.enabled"></a>spark.rapids.sql.format.orc.enabled|When set to false disables all orc input and output acceleration|true
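
The two Iceberg settings added in this hunk are ordinary Spark configuration keys, so they can be
passed like any other `--conf key=value` pair. The sketch below builds those arguments for a
`spark-submit` command line; the function name is hypothetical, and the defaults mirror the `true`
defaults shown in the table rows above.

```python
# Hypothetical helper for illustration only; not part of the RAPIDS Accelerator.

def iceberg_conf_args(enabled=True, read_enabled=True):
    """Build `--conf key=value` arguments for the two Iceberg config keys."""
    values = {
        "spark.rapids.sql.format.iceberg.enabled": enabled,
        "spark.rapids.sql.format.iceberg.read.enabled": read_enabled,
    }
    args = []
    for key, value in values.items():
        # Spark expects lowercase "true"/"false" strings for boolean configs.
        args.extend(["--conf", f"{key}={str(value).lower()}"])
    return args


# Keep Iceberg acceleration enabled in general but turn off accelerated reads.
print(" ".join(iceberg_conf_args(read_enabled=False)))
```

Disabling only the read key leaves room for other Iceberg acceleration to be toggled separately,
although this change adds read support only.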
47 changes: 47 additions & 0 deletions docs/supported_ops.md
@@ -18307,6 +18307,49 @@ dates or timestamps, or for a lack of type coercion support.
<td> </td>
</tr>
<tr>
<th rowSpan="2">Iceberg</th>
<th>Read</th>
<td>S</td>
<td>S</td>
<td>S</td>
<td>S</td>
<td>S</td>
<td>S</td>
<td>S</td>
<td>S</td>
<td><em>PS<br/>UTC is only supported TZ for TIMESTAMP</em></td>
<td>S</td>
<td>S</td>
<td> </td>
<td><b>NS</b></td>
<td> </td>
<td><em>PS<br/>UTC is only supported TZ for child TIMESTAMP;<br/>unsupported child types BINARY, UDT</em></td>
<td><em>PS<br/>UTC is only supported TZ for child TIMESTAMP;<br/>unsupported child types BINARY, UDT</em></td>
<td><em>PS<br/>UTC is only supported TZ for child TIMESTAMP;<br/>unsupported child types BINARY, UDT</em></td>
<td><b>NS</b></td>
</tr>
<tr>
<th>Write</th>
<td><b>NS</b></td>
<td><b>NS</b></td>
<td><b>NS</b></td>
<td><b>NS</b></td>
<td><b>NS</b></td>
<td><b>NS</b></td>
<td><b>NS</b></td>
<td><b>NS</b></td>
<td><b>NS</b></td>
<td><b>NS</b></td>
<td><b>NS</b></td>
<td> </td>
<td><b>NS</b></td>
<td> </td>
<td><b>NS</b></td>
<td><b>NS</b></td>
<td><b>NS</b></td>
<td><b>NS</b></td>
</tr>
<tr>
<th rowSpan="2">JSON</th>
<th>Read</th>
<td>S</td>
@@ -18436,3 +18479,7 @@ dates or timestamps, or for a lack of type coercion support.
<td><b>NS</b></td>
</tr>
</table>

### Apache Iceberg Support
Support for Apache Iceberg has additional limitations. See the
[Apache Iceberg Support](additional-functionality/iceberg-support.md) document.