Add shim for Databricks 10.4 [databricks] #4974
Conversation
Signed-off-by: Jason Lowe <jlowe@nvidia.com>
build
build
Generally this looks good to me, but I am not ready to approve it yet. There is just so much code that I don't know which changes were copy/paste and which are specific to the new version of Databricks. If you want to sit down and have us go over it together we can, or I can apply the patch and manually find the files with the same names and diff them. Either way works fine with me.
build
I can give a quick overview of the majority of the changes. They can be summed up in the following high-level changes:
As expected, Databricks 10.4 is not just Apache Spark 3.2.1 plus custom changes; it also has changes from Apache Spark 3.3.0 mixed in. As such, there were files in '*until330-all' directories that no longer applied to all shims once this shim appeared. To reconcile that, I moved the files incompatible with the new Databricks shim into until330-nondb directories, which then triggered a copy of those files into the two existing Databricks shims that do not use the directory. The 301db shim is going away very soon, so it's a net of one extra copy for the 312db shim. Everything that was added to 301db and 31xdb should be the same as the new files added to the corresponding until330-nondb directory. Similarly, there was a 30Xuntil33X shim class that, once this shim was added, no longer applied to all shims. I split out the incompatible methods into a new nondb shim and updated all the original users to use that as well. The existing Databricks shims got copies of these separated methods since they don't use nondb dirs/code.
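The directory reshuffle described above can be pictured roughly like this (directory names are from this PR; the placement of individual files is illustrative, not exhaustive):

```
sql-plugin/src/main/
  301until330-all/    # still shared by every shim through 3.3.0, including 321db
  301until330-nondb/  # files that no longer apply to the Databricks shims
  301db/              # receives its own copy of the -nondb files (going away soon)
  312db/              # likewise receives a copy of the -nondb files
  321db/              # the new Databricks 10.4 shim added in this PR
```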
@NvTimLiu can you help take a look at the CICD requirement for db 10.4 shims? thanks!
sql-plugin/src/main/321db/scala/com/nvidia/spark/rapids/shims/SparkShims.scala
sql-plugin/src/main/321db/scala/org/apache/spark/rapids/shims/GpuShuffleExchangeExec.scala
.../scala/org/apache/spark/sql/rapids/execution/python/shims/GpuFlatMapGroupsInPandasExec.scala
sql-plugin/src/main/scala/com/nvidia/spark/rapids/aggregate.scala
sql-plugin/src/main/321db/scala/com/nvidia/spark/rapids/shims/GpuHashPartitioning.scala
...in/330+/scala/org/apache/spark/sql/catalyst/json/rapids/shims/Spark33XFileOptionsShims.scala
sql-plugin/src/main/321db/scala/com/nvidia/spark/rapids/shims/GpuRunningWindowExec.scala
sql-plugin/src/main/321db/scala/com/nvidia/spark/rapids/shims/GpuWindowInPandasExec.scala
@tgravescs @revans2 I think I have addressed your concerns.
build
Converting this to draft while I manually run the 321db tests |
build
Want to change out of draft?
Still waiting for the manual run of the integration tests on Databricks 10.4 against the latest changes to complete. I expect it to pass as it did before, but I don't want this to be merged until we have confirmation those tests are passing.
Databricks 10.4 test run passed, so just waiting for the clean CI run at this point. |
Closes #4914.
Adds a shim for the Databricks 10.4 runtime. The dist pom has not been updated to reference this version, as we need to set up build pipelines for these new Databricks-based jars. Once those pipelines are set up, we can follow up with a PR to update the dist pom and build scripts.
Some code from 301until330-all has been refactored into a 301until330-nondb directory to allow reuse of files in the original 301until330-all directory with the new 321db shim.
One hiccup with this shim is that `First` and possibly other aggregation expressions appear to have an updated intermediate data format, which led to some aggregation test failures. See #4963. To be on the safe side until this is further investigated, aggregations are not replaced on Databricks 10.4 unless we can replace both sides of the aggregation.
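The "replace both sides or neither" rule above can be sketched as follows. This is a hypothetical illustration of the policy, not the plugin's actual API; `AggReplacementPolicy` and its parameter names are invented for clarity:

```scala
// Hypothetical sketch of the all-or-nothing aggregation replacement policy
// used on Databricks 10.4. In a two-phase hash aggregate, the partial
// (map-side) and final (reduce-side) stages must agree on the intermediate
// buffer format. Because Databricks 10.4 appears to have changed that format
// for First (see #4963), mixing a CPU stage with a GPU stage is unsafe, so
// the GPU must own both stages or neither.
object AggReplacementPolicy {
  def canReplaceAggregate(partialOnGpu: Boolean, finalOnGpu: Boolean): Boolean =
    partialOnGpu == finalOnGpu // never mix CPU and GPU aggregation stages
}
```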