Add shim for Databricks 10.4 [databricks] #4974
Conversation
Signed-off-by: Jason Lowe <jlowe@nvidia.com>
build
build
Generally this looks good to me, but I am not ready to approve it yet. There is just so much code that I don't know which changes were copy/paste and which are specific to the new version of Databricks. If you want to sit down and have us go over it together we can, or I can apply the patch and manually find the files with the same names and diff them. Either way works fine with me.
build
I can give a quick overview of the majority of the changes. They can be summed up in the following high-level changes:
As expected, Databricks 10.4 is not just Apache Spark 3.2.1 plus custom changes; it also has changes from Apache Spark 3.3.0 mixed in. As such, there were files in '*until330-all' directories that no longer applied to all shims once this shim appeared. To reconcile that, I moved the files incompatible with the new Databricks shim into until330-nondb directories, which then triggered a copy of those files into the two existing Databricks shims that do not use the directory. The 301db shim is going away very soon, so it's a net of one extra copy for the 312db shim. Everything that was added to 301db and 31xdb should be the same as the new files added to the corresponding until330-nondb directory. Similarly, there was a 30Xuntil33X shim class that, once this shim was added, no longer applied to all shims. I split out the incompatible methods into a new nondb shim and updated all the original users to use that as well. The existing Databricks shims got copies of these separated methods since they don't use nondb dirs/code.
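The directory reshuffle described above can be pictured roughly like this (directory names are from this PR; the placement of individual files is illustrative, not exhaustive):

```
sql-plugin/src/main/
  301until330-all/    # still shared by every shim through 3.3.0, including 321db
  301until330-nondb/  # files that no longer apply to the Databricks shims
  301db/              # receives its own copy of the -nondb files (going away soon)
  312db/              # likewise receives a copy of the -nondb files
  321db/              # the new Databricks 10.4 shim added in this PR
```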
@NvTimLiu can you help take a look at the CICD requirement for db 10.4 shims? thanks!
sql-plugin/src/main/321db/scala/com/nvidia/spark/rapids/shims/SparkShims.scala
sql-plugin/src/main/321db/scala/org/apache/spark/rapids/shims/GpuShuffleExchangeExec.scala
.../scala/org/apache/spark/sql/rapids/execution/python/shims/GpuFlatMapGroupsInPandasExec.scala
sql-plugin/src/main/scala/com/nvidia/spark/rapids/aggregate.scala
sql-plugin/src/main/321db/scala/com/nvidia/spark/rapids/shims/GpuHashPartitioning.scala
...in/330+/scala/org/apache/spark/sql/catalyst/json/rapids/shims/Spark33XFileOptionsShims.scala
sql-plugin/src/main/321db/scala/com/nvidia/spark/rapids/shims/GpuRunningWindowExec.scala
sql-plugin/src/main/321db/scala/com/nvidia/spark/rapids/shims/GpuWindowInPandasExec.scala
@tgravescs @revans2 I think I have addressed your concerns.
build
Converting this to draft while I manually run the 321db tests |
build
Want to change out of draft?
Still waiting for the manual run of the integration tests on Databricks 10.4 against the latest changes to complete. I expect it to pass as it did before, but I don't want this to be merged until we have confirmation those tests are passing.
Databricks 10.4 test run passed, so just waiting for the clean CI run at this point. |
Closes #4914.
Adds a shim for the Databricks 10.4 runtime. The dist pom has not been updated to reference this version, as we need to set up build pipelines for these new Databricks-based jars. Once those pipelines are set up, we can follow up with a PR to update the dist pom and build scripts.
Some code from 301until330-all has been refactored into a 301until330-nondb directory to allow reuse of files in the original 301until330-all directory with the new 321db shim.
One hiccup with this shim is that `First` and possibly other aggregation expressions appear to have an updated intermediate data format, which led to some aggregation test failures. See #4963. To be on the safe side until this is further investigated, aggregations are not replaced on Databricks 10.4 unless we can replace both sides of the aggregation.
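The "replace both sides or neither" rule above can be sketched as follows. This is a hypothetical illustration of the policy, not the plugin's actual API; `AggReplacementPolicy` and its parameter names are invented for clarity:

```scala
// Hypothetical sketch of the all-or-nothing aggregation replacement policy
// used on Databricks 10.4. In a two-phase hash aggregate, the partial
// (map-side) and final (reduce-side) stages must agree on the intermediate
// buffer format. Because Databricks 10.4 appears to have changed that format
// for First (see #4963), mixing a CPU stage with a GPU stage is unsafe, so
// the GPU must own both stages or neither.
object AggReplacementPolicy {
  def canReplaceAggregate(partialOnGpu: Boolean, finalOnGpu: Boolean): Boolean =
    partialOnGpu == finalOnGpu // never mix CPU and GPU aggregation stages
}
```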