Add an explain only mode to the plugin #4322
…at would have run on GPU Signed-off-by: Thomas Graves <tgraves@apache.org>
…nto explainonlymode
build
Quick high-level comments; I have not reviewed in detail. The PR also seems to be missing the corresponding configs.md doc update for the RapidsConf change.
sql-plugin/src/main/scala/com/nvidia/spark/rapids/RapidsConf.scala
I need to update the config docs.
Signed-off-by: Thomas Graves <tgraves@nvidia.com>
I updated this to use spark.rapids.sql.mode.
I need to regenerate the configs docs; will update shortly.
build
It seems our tests rely on the device manager creating the RAPIDS buffer store (and initializing the GPU and memory) even when the SQL plugin is disabled. I'm looking for a better way to handle this.
want to initialize stuff on startup with the plugin disabled and dynamically enable it afterwards
build
This allows running queries on the CPU and the plugin will evaluate the queries as if it was
going to run on the GPU and tell you what would and wouldn't have been run on the GPU.
There are two ways to run this, one is running with the plugin set to explain only mode and |
Nit: Ideally should be using "RAPIDS Accelerator" rather than "plugin" in user docs. Applies to other places in the PR.
sql-plugin/src/main/scala/org/apache/spark/sql/rapids/GpuShuffleEnv.scala
sql-plugin/src/main/scala/org/apache/spark/sql/rapids/RapidsShuffleInternalManagerBase.scala
Co-authored-by: Jason Lowe <jlowe@nvidia.com>
build
sql-plugin/src/main/scala/com/nvidia/spark/rapids/RapidsConf.scala
…cala Co-authored-by: Jason Lowe <jlowe@nvidia.com>
build
This allows users to run on the CPU and have the plugin evaluate the plan as if it would have run on the GPU, writing the explain output to the driver log file. Note that this isn't perfect, specifically for AQE, where the plan may change as it's executed.
fixes #4238
In explain only mode we don't acquire a GPU and don't enable the spark rapids shuffle (we fall back to the Spark version if it is configured), but we process the plan through GpuOverrides as we would if it were running on the GPU. In the end we still return the CPU plan, so the query runs on the CPU, and we log the explain output.
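To make that flow concrete, here is a toy Python model of it. This is only a sketch of the idea, not the plugin's Scala code: `Op`, `plan_query`, the `gpu_supported` flag, and the report wording are all hypothetical names invented for illustration. Only the two mode names come from this PR.

```python
import logging
from dataclasses import dataclass, replace

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("explain_only_sketch")


@dataclass(frozen=True)
class Op:
    """Hypothetical stand-in for one operator in a query plan."""
    name: str
    gpu_supported: bool   # assumption: whether the overrides could replace this op
    on_gpu: bool = False  # whether this op was actually converted


def plan_query(cpu_plan, mode):
    """Evaluate every operator, then decide which plan to hand back."""
    # The evaluation step happens in both modes.
    report = [
        f"{op.name}: {'would run on GPU' if op.gpu_supported else 'stays on CPU'}"
        for op in cpu_plan
    ]
    if mode == "explainOnly":
        # Log what would have happened, but return the CPU plan unchanged.
        for line in report:
            log.info(line)
        return list(cpu_plan), report
    if mode == "executeOnGPU":
        # Convert only the operators that are supported.
        return [replace(op, on_gpu=op.gpu_supported) for op in cpu_plan], report
    raise ValueError(f"unknown mode: {mode}")


cpu_plan = [Op("Scan", True), Op("Filter", True), Op("PythonUDF", False)]
explained_plan, report = plan_query(cpu_plan, "explainOnly")
```

In this model the explain only branch does the same evaluation work as the GPU branch and differs only in what it returns, which is the property the description above relies on.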
This requires that the cudf and rapids jars be present and that the plugin be enabled with the mode set to explainOnly: spark.rapids.sql.mode=explainOnly and spark.plugins=com.nvidia.spark.SQLPlugin. I called the alternate mode executeOnGPU, thinking that name would work best if we happen to add other modes. Happy to change it if people have better ideas.
This PR updates the startup logging to explain the mode. Previously it always printed how to turn the plugin off, even if it was disabled. Now it prints whether the plugin is enabled, disabled, or in explain only mode, and references the configs to change.
I tested this on a bunch of NDS queries and compared the output against actually running on the GPU. AQE isn't perfect, but I put some docs in there about it. I added one basic integration test; I don't really have a good way to test the output since it just goes to the logger. Manually tested on Databricks.
We should make sure we have a QA test for explain only mode as well.
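As a rough illustration of those settings, the following Python sketch assembles a `spark-submit` command line that passes them. The two config keys and values are the ones quoted above; the jar paths, the application file name, and the `build_submit_command` helper are placeholders assumed for the example.

```python
import shlex

# Settings quoted from the PR description.
EXPLAIN_ONLY_CONF = {
    "spark.plugins": "com.nvidia.spark.SQLPlugin",
    "spark.rapids.sql.mode": "explainOnly",
}


def build_submit_command(app, jars, conf=EXPLAIN_ONLY_CONF):
    """Return a spark-submit argument list (hypothetical helper)."""
    cmd = ["spark-submit", "--jars", ",".join(jars)]
    for key, value in conf.items():
        cmd += ["--conf", f"{key}={value}"]
    cmd.append(app)
    return cmd


command = build_submit_command(
    app="my_query.py",  # placeholder application
    jars=["/path/to/cudf.jar", "/path/to/rapids-4-spark.jar"],  # placeholder paths
)
print(shlex.join(command))
```

The same two `--conf` entries could equally be placed in `spark-defaults.conf`; the helper only exists to keep the example self-contained.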