
Fix regression from 21.12 where udfs defined in repl no longer worked #5030

Merged
merged 4 commits into NVIDIA:branch-22.04 from udf/fix_repl_issue on Mar 25, 2022

Conversation

abellina
Collaborator

Signed-off-by: Alessandro Bellina abellina@nvidia.com

Closes #5019.

Reverts a localized change to the udf-compiler that was introduced in #3726.

With ShimLoader.loadClass, we couldn't resolve a UDF defined in the REPL, as documented in #5019.
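
For context, a rough sketch of the class-loading distinction at play (illustrative only, not the plugin's actual code): the Scala REPL defines classes for each snippet in its own child class loader, so a lookup pinned to some other loader cannot see them, while a lookup through the loader that defined the class can.

class NotUdf extends (Boolean => Boolean) {
  def apply(x: Boolean): Boolean = !x
}
val myudf = new NotUdf

// In spark-shell, NotUdf only exists in the REPL's class loader, so a lookup
// through a fixed loader fails (pluginLoader is a stand-in, not a real identifier):
//   Class.forName(myudf.getClass.getName, true, pluginLoader)  // ClassNotFoundException

// Asking the loader that actually defined the class succeeds:
val udfClass = Class.forName(myudf.getClass.getName, true, myudf.getClass.getClassLoader)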

Signed-off-by: Alessandro Bellina <abellina@nvidia.com>
@abellina
Collaborator Author

build

@wjxiz1992
Collaborator

Should we add some tests for such a case?

@abellina
Collaborator Author

Should we add some tests for such a case?

I am not sure this is worth it, and honestly I do not know how to replicate the REPL in a test. I guess I could define a lambda in a different class loader, but then again I am not sure if it is worthwhile.
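
For what it's worth, a rough sketch of how "a lambda in a different class loader" could be simulated in a test, assuming a hypothetical pre-built jar at /tmp/not-udf.jar containing a hypothetical com.example.NotUdf that implements Boolean => Boolean; this is just an illustration of the idea, not an existing test:

import java.net.URLClassLoader
import org.apache.spark.sql.functions.udf

// Child loader: com.example.NotUdf is visible here but not to the application
// class loader, which is roughly the situation for classes generated by the REPL.
val jarUrl = new java.io.File("/tmp/not-udf.jar").toURI.toURL
val childLoader = new URLClassLoader(Array(jarUrl), getClass.getClassLoader)

val fn = childLoader
  .loadClass("com.example.NotUdf")
  .getDeclaredConstructor()
  .newInstance()
  .asInstanceOf[Boolean => Boolean]

// The udf-compiler must resolve fn's class through fn.getClass.getClassLoader,
// not a loader of its own choosing, for this UDF to translate.
val u = udf(fn)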

gerashegalov previously approved these changes Mar 25, 2022
Collaborator

@gerashegalov left a comment


LGTM
It would be great if we could also manually verify a non-REPL use of a Scala UDF in a Spark job. We don't have Scala integration tests.

…ion.scala

Co-authored-by: Gera Shegalov <gshegalov@nvidia.com>
@abellina
Collaborator Author

abellina commented Mar 25, 2022

@gerashegalov

It would be great if we could also manually verify a non-REPL use of a Scala UDF in a Spark job. We don't have Scala integration tests.

Sorry it took me a little while. We do have the OpcodeSuite unit tests, which cover all of the available translations (a Spark session is created there). I also created a simple app and submitted it with spark-submit against a standalone cluster:

package com.nvidia.spark.udf

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

object SimpleApp {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("Simple Application").getOrCreate()
    import spark.implicits._
    val dataset = List(true, false, true, false).toDS().repartition(1)
    val myudf: Boolean => Boolean = { x => !x }
    val u = udf(myudf)

    // First run: the udf-compiler is not enabled, so the UDF stays opaque.
    var result = dataset.withColumn("new", u('value))
    result.explain(true)
    result.show()

    // Second run: enable the udf-compiler, so the UDF should be translated
    // into Catalyst expressions.
    spark.conf.set("spark.rapids.sql.udfCompiler.enabled", "true")
    result = dataset.withColumn("new", u('value))
    result.explain(true)

    result.show()
    spark.stop()
  }
}

With the UDF compiler disabled:

== Physical Plan ==
*(1) Project [value#1, UDF(value#1) AS new#8]
+- Exchange RoundRobinPartitioning(1), REPARTITION_WITH_NUM, [id=#9]
   +- LocalTableScan [value#1]

With the UDF compiler enabled:

== Physical Plan ==
*(1) Project [value#1, if (NOT value#1) true else false AS new#23]
+- Exchange RoundRobinPartitioning(1), REPARTITION_WITH_NUM, [id=#41]
   +- LocalTableScan [value#1]
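
For reference, a minimal sketch of what an automated check could look like, reusing spark, dataset, and u from the app above; the names and assertion style are illustrative, not the actual OpcodeSuite API:

// Hypothetical assertion: with the udf-compiler enabled, the opaque UDF call
// should be replaced by Catalyst expressions and disappear from the plan.
spark.conf.set("spark.rapids.sql.udfCompiler.enabled", "true")
val compiledPlan = dataset.withColumn("new", u('value))
  .queryExecution.executedPlan.toString
assert(!compiledPlan.contains("UDF("))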

@abellina
Collaborator Author

build

@abellina abellina merged commit 35d777e into NVIDIA:branch-22.04 Mar 25, 2022
@abellina abellina deleted the udf/fix_repl_issue branch March 25, 2022 23:33
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[BUG] udf compiler failed to translate UDF in spark-shell
4 participants