diff --git a/docs/FAQ.md b/docs/FAQ.md
index 3edac177fbb..55d0f1de5af 100644
--- a/docs/FAQ.md
+++ b/docs/FAQ.md
@@ -10,7 +10,7 @@ nav_order: 12
 
 ### What versions of Apache Spark does the RAPIDS Accelerator for Apache Spark support?
 
-The RAPIDS Accelerator for Apache Spark requires version 3.1.1, 3.1.2, 3.1.3, 3.2.0, 3.2.1 or 3.3.0 of
+The RAPIDS Accelerator for Apache Spark requires version 3.1.1, 3.1.2, 3.1.3, 3.2.0, 3.2.1, 3.2.2 or 3.3.0 of
 Apache Spark. Because the plugin replaces parts of the physical plan that Apache Spark considers to
 be internal the code for those plans can change even between bug fix releases. As a part of our
 process, we try to stay on top of these changes and release updates as quickly as possible.
diff --git a/docs/compatibility.md b/docs/compatibility.md
index 37bb882fc20..c63bdd47e02 100644
--- a/docs/compatibility.md
+++ b/docs/compatibility.md
@@ -65,7 +65,7 @@ conditions within the computation itself the result may not be the same each tim
 run. This is inherent in how the plugin speeds up the calculations and cannot be "fixed." If a
 query joins on a floating point value, which is not wise to do anyways, and the value is the result
 of a floating point aggregation then the join may fail to work properly with the plugin but would have
-worked with plain Spark. As of 22.06 this is behavior is enabled by default but can be disabled with
+worked with plain Spark. Starting from 22.06 this behavior is enabled by default but can be disabled with
 the config
 [`spark.rapids.sql.variableFloatAgg.enabled`](configs.md#sql.variableFloatAgg.enabled).
@@ -370,6 +370,7 @@ INTERVAL HOUR TO SECOND | INTERVAL '10:30:40.999999' HOUR TO SECOND | 10:30:40.9
 INTERVAL MINUTE | INTERVAL '30' MINUTE | 30|
 INTERVAL MINUTE TO SECOND | INTERVAL '30:40.999999' MINUTE TO SECOND | 30:40.999999|
 INTERVAL SECOND | INTERVAL '40.999999' SECOND | 40.999999|
+Currently, the RAPIDS Accelerator only supports the ANSI style.
 
 ## ORC
@@ -807,7 +808,7 @@ leads to restrictions:
 * Float values cannot be larger than `1e18` or smaller than `-1e18` after conversion.
 * The results produced by GPU slightly differ from the default results of Spark.
 
-As of 22.06 this conf is enabled, to disable this operation on the GPU when using Spark 3.1.0 or
+Starting from 22.06 this conf is enabled by default. To disable this operation on the GPU when using Spark 3.1.0 or
 later, set [`spark.rapids.sql.castFloatToDecimal.enabled`](configs.md#sql.castFloatToDecimal.enabled)
 to `false`
@@ -819,7 +820,7 @@ Spark 3.1.0 the MIN and MAX values were floating-point values such as `Int.MaxVa
 starting with 3.1.0 these are now integral types such as `Int.MaxValue` so this has slightly affected
 the valid range of values and now differs slightly from the behavior on GPU in some cases.
 
-As of 22.06 this conf is enabled, to disable this operation on the GPU when using Spark 3.1.0 or later, set
+Starting from 22.06 this conf is enabled by default. To disable this operation on the GPU when using Spark 3.1.0 or later, set
 [`spark.rapids.sql.castFloatToIntegralTypes.enabled`](configs.md#sql.castFloatToIntegralTypes.enabled)
 to `false`.
@@ -831,7 +832,7 @@ The GPU will use different precision than Java's toString method when converting
 types to strings. The GPU uses a lowercase `e` prefix for an exponent while Spark uses uppercase
 `E`. As a result the computed string can differ from the default behavior in Spark.
 
-As of 22.06 this conf is enabled by default, to disable this operation on the GPU, set
+Starting from 22.06 this conf is enabled by default. To disable this operation on the GPU, set
 [`spark.rapids.sql.castFloatToString.enabled`](configs.md#sql.castFloatToString.enabled) to `false`.
 
 ### String to Float
@@ -845,7 +846,7 @@ default behavior in Apache Spark is to return `+Infinity` and `-Infinity`, respe
 Also, the GPU does not support casting from strings containing hex values.
 
-As of 22.06 this conf is enabled by default, to enable this operation on the GPU, set
+Starting from 22.06 this conf is enabled by default. To disable this operation on the GPU, set
 [`spark.rapids.sql.castStringToFloat.enabled`](configs.md#sql.castStringToFloat.enabled) to `false`.
 
 ### String to Date
diff --git a/docs/configs.md b/docs/configs.md
index 34cde38f785..cc29610128b 100644
--- a/docs/configs.md
+++ b/docs/configs.md
@@ -10,15 +10,15 @@ The following is the list of options that `rapids-plugin-4-spark` supports.
 On startup use: `--conf [conf key]=[conf value]`. For example:
 
 ```
-${SPARK_HOME}/bin/spark --jars rapids-4-spark_2.12-22.08.0-SNAPSHOT-cuda11.jar \
+${SPARK_HOME}/bin/spark-shell --jars rapids-4-spark_2.12-22.08.0-SNAPSHOT-cuda11.jar \
 --conf spark.plugins=com.nvidia.spark.SQLPlugin \
---conf spark.rapids.sql.incompatibleOps.enabled=true
+--conf spark.rapids.sql.concurrentGpuTasks=2
 ```
 
 At runtime use: `spark.conf.set("[conf key]", [conf value])`. For example:
 
 ```
-scala> spark.conf.set("spark.rapids.sql.incompatibleOps.enabled", true)
+scala> spark.conf.set("spark.rapids.sql.concurrentGpuTasks", 2)
 ```
 
 All configs can be set on startup, but some configs, especially for shuffle, will not
diff --git a/docs/demo/Databricks/generate-init-script.ipynb b/docs/demo/Databricks/generate-init-script.ipynb
index 03232f98590..12b845f3d35 100644
--- a/docs/demo/Databricks/generate-init-script.ipynb
+++ b/docs/demo/Databricks/generate-init-script.ipynb
@@ -3,7 +3,7 @@
   {
     "cell_type":"code",
     "source":[
-      "dbutils.fs.mkdirs(\"dbfs:/databricks/init_scripts/\")\n \ndbutils.fs.put(\"/databricks/init_scripts/init.sh\",\"\"\"\n#!/bin/bash\nsudo wget -O /databricks/jars/rapids-4-spark_2.12-22.06.0-cuda11.jar https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark_2.12/22.06.0/rapids-4-spark_2.12-22.06.0-cuda11.jar\n\"\"\", True)"
+      "dbutils.fs.mkdirs(\"dbfs:/databricks/init_scripts/\")\n \ndbutils.fs.put(\"/databricks/init_scripts/init.sh\",\"\"\"\n#!/bin/bash\nsudo wget -O /databricks/jars/rapids-4-spark_2.12-22.08.0.jar https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark_2.12/22.08.0/rapids-4-spark_2.12-22.08.0.jar\n\"\"\", True)"
     ],
     "metadata":{
diff --git a/docs/download.md b/docs/download.md
index fd2abbb7c55..5bcbbbc31af 100644
--- a/docs/download.md
+++ b/docs/download.md
@@ -18,6 +18,46 @@ cuDF jar, that is either preinstalled in the Spark classpath on all nodes or sub
 that uses the RAPIDS Accelerator For Apache Spark. See the [getting-started
 guide](https://nvidia.github.io/spark-rapids/Getting-Started/) for more details.
 
+## Release v22.08.0
+Hardware Requirements:
+
+The plugin is tested on the following architectures:
+
+    GPU Models: NVIDIA V100, T4 and A2/A10/A30/A100 GPUs
+
+Software Requirements:
+
+    OS: Ubuntu 18.04, Ubuntu 20.04 or CentOS 7, Rocky Linux 8
+
+    CUDA & NVIDIA Drivers*: 11.x & v450.80.02+
+
+    Apache Spark 3.1.1, 3.1.2, 3.1.3, 3.2.0, 3.2.1, 3.2.2, 3.3.0, Databricks 9.1 ML LTS or 10.4 ML LTS Runtime and GCP Dataproc 2.0
+
+    Python 3.6+, Scala 2.12, Java 8
+
+*Some hardware may have a minimum driver version greater than v450.80.02+. Check the GPU spec sheet
+for your hardware's minimum driver version.
+
+*For Cloudera and EMR support, please refer to the
+[Distributions](./FAQ.md#which-distributions-are-supported) section of the FAQ.
+
+### Release Notes
+New functionality and performance improvements for this release include:
+* Rocky Linux 8 support
+* Ability to build Spark RAPIDS jars for Java versions 9+
+* Zstandard Parquet and ORC read support
+* Binary read support from Parquet
+* Apache Iceberg 0.13 support
+* Array function support: array_intersect, array_union, array_except and arrays_overlap
+* Support nth_value, first and last as window functions
+* Alluxio auto mount for AWS S3 buckets
+* Qualification tool:
+  * SQL level qualification
+  * Add application details view
+
+For a detailed list of changes, please refer to the
+[CHANGELOG](https://github.com/NVIDIA/spark-rapids/blob/main/CHANGELOG.md).
+
 ## Release v22.06.0
 Hardware Requirements:
diff --git a/docs/get-started/getting-started-databricks.md b/docs/get-started/getting-started-databricks.md
index 7985a596acd..1107202cf62 100644
--- a/docs/get-started/getting-started-databricks.md
+++ b/docs/get-started/getting-started-databricks.md
@@ -162,7 +162,7 @@ cluster.
    ```bash
    spark.rapids.sql.python.gpu.enabled true
    spark.python.daemon.module rapids.daemon_databricks
-   spark.executorEnv.PYTHONPATH /databricks/jars/rapids-4-spark_2.12-22.06.0.jar:/databricks/spark/python
+   spark.executorEnv.PYTHONPATH /databricks/jars/rapids-4-spark_2.12-22.08.0.jar:/databricks/spark/python
    ```
 
 7. Once you’ve added the Spark config, click “Confirm and Restart”.
diff --git a/docs/get-started/gpu_dataproc_packages_ubuntu_sample.sh b/docs/get-started/gpu_dataproc_packages_ubuntu_sample.sh
index 5645a91a12f..6029a6978e5 100644
--- a/docs/get-started/gpu_dataproc_packages_ubuntu_sample.sh
+++ b/docs/get-started/gpu_dataproc_packages_ubuntu_sample.sh
@@ -139,7 +139,7 @@ EOF
   systemctl start dataproc-cgroup-device-permissions
 }
 
-readonly DEFAULT_SPARK_RAPIDS_VERSION="22.06.0"
+readonly DEFAULT_SPARK_RAPIDS_VERSION="22.08.0"
 readonly DEFAULT_CUDA_VERSION="11.0"
 readonly DEFAULT_XGBOOST_VERSION="1.6.1"
 readonly SPARK_VERSION="3.0"
diff --git a/docs/img/spark3cluster.png b/docs/img/spark3cluster.png
index 73050c63451..8f3a3e80734 100644
Binary files a/docs/img/spark3cluster.png and b/docs/img/spark3cluster.png differ
diff --git a/docs/spark-profiling-tool.md b/docs/spark-profiling-tool.md
index e21f1684fc7..7a87b36a6ee 100644
--- a/docs/spark-profiling-tool.md
+++ b/docs/spark-profiling-tool.md
@@ -1,9 +1,9 @@
 ---
 layout: page
-title: Profiling tool
+title: Profiling Tool
 nav_order: 9
 ---
-# Profiling tool
+# Profiling Tool
 The Profiling tool analyzes both CPU or GPU generated event logs and generates information which
 can be used for debugging and profiling Apache Spark applications.
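The v22.08.0 release notes added above call out new array function and window function support. The sketch below (not part of the patch itself) is one illustrative way to exercise those operations; it is meant to be pasted into a `spark-shell` launched with the plugin enabled as shown in the `configs.md` example in this patch, and the inline data, column names and aliases are made up for the example.

```scala
// Paste into a spark-shell started with the RAPIDS plugin on the classpath, e.g.
//   spark-shell --jars <rapids-4-spark jar> --conf spark.plugins=com.nvidia.spark.SQLPlugin
// The `spark` session is the one spark-shell provides.

// Array operations listed in the release notes.
spark.sql(
  """SELECT
    |  array_intersect(array(1, 2, 3), array(2, 3, 4)) AS common,
    |  array_union(array(1, 2), array(2, 3))           AS merged,
    |  array_except(array(1, 2, 3), array(2))          AS only_left,
    |  arrays_overlap(array(1, 2), array(2, 5))        AS overlaps
    |""".stripMargin).show()

// nth_value, first and last as window functions over a small inline table.
spark.sql(
  """SELECT id, v,
    |       first(v)        OVER w AS first_v,
    |       last(v)         OVER w AS last_v,
    |       nth_value(v, 2) OVER w AS second_v
    |FROM VALUES (1, 10), (1, 20), (2, 30), (2, 40) AS t(id, v)
    |WINDOW w AS (PARTITION BY id ORDER BY v
    |             ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
    |""".stripMargin).show()
```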
diff --git a/sql-plugin/src/main/scala/com/nvidia/spark/rapids/RapidsConf.scala b/sql-plugin/src/main/scala/com/nvidia/spark/rapids/RapidsConf.scala
index 09ed723c791..8ddf269a22b 100644
--- a/sql-plugin/src/main/scala/com/nvidia/spark/rapids/RapidsConf.scala
+++ b/sql-plugin/src/main/scala/com/nvidia/spark/rapids/RapidsConf.scala
@@ -1546,15 +1546,15 @@ object RapidsConf {
       |On startup use: `--conf [conf key]=[conf value]`. For example:
       |
       |```
-      |${SPARK_HOME}/bin/spark --jars rapids-4-spark_2.12-22.08.0-SNAPSHOT-cuda11.jar \
+      |${SPARK_HOME}/bin/spark-shell --jars rapids-4-spark_2.12-22.08.0-SNAPSHOT-cuda11.jar \
       |--conf spark.plugins=com.nvidia.spark.SQLPlugin \
-      |--conf spark.rapids.sql.incompatibleOps.enabled=true
+      |--conf spark.rapids.sql.concurrentGpuTasks=2
       |```
       |
       |At runtime use: `spark.conf.set("[conf key]", [conf value])`. For example:
       |
       |```
-      |scala> spark.conf.set("spark.rapids.sql.incompatibleOps.enabled", true)
+      |scala> spark.conf.set("spark.rapids.sql.concurrentGpuTasks", 2)
       |```
       |
       | All configs can be set on startup, but some configs, especially for shuffle, will not
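The help text edited above documents the two supported ways to set plugin configs: at startup with `--conf` and at runtime with `spark.conf.set`. For reference, here is a minimal sketch (not part of the patch) of the same settings applied programmatically from a standalone Scala application. It assumes the plugin jar is already on the driver and executor classpath; the object name, app name and the trivial query are placeholders.

```scala
import org.apache.spark.sql.SparkSession

object RapidsConfigExample {
  def main(args: Array[String]): Unit = {
    // Startup-time configs, mirroring the documented spark-shell flags:
    //   --conf spark.plugins=com.nvidia.spark.SQLPlugin
    //   --conf spark.rapids.sql.concurrentGpuTasks=2
    val spark = SparkSession.builder()
      .appName("rapids-config-example")
      .config("spark.plugins", "com.nvidia.spark.SQLPlugin")
      .config("spark.rapids.sql.concurrentGpuTasks", "2")
      .getOrCreate()

    // Runtime-settable configs can still be changed on the live session,
    // matching the documented spark.conf.set(...) example. Here we turn off
    // the variable floating-point aggregation behavior described in
    // compatibility.md.
    spark.conf.set("spark.rapids.sql.variableFloatAgg.enabled", false)

    spark.range(1000L).selectExpr("sum(id)").show()
    spark.stop()
  }
}
```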