-
Notifications
You must be signed in to change notification settings - Fork 245
New issue
Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? # to your account
Dynamically load hive and avro using reflection to avoid potential class not found exception [databricks] #5723
Dynamically load hive and avro using reflection to avoid potential class not found exception [databricks] #5723
Conversation
build |
…ass not found exception Signed-off-by: Chong Gao <res_life@163.com>
d73e53e
to
c89b407
Compare
build |
Conflict with #5716 |
I have verified this fixes the bug reported by our QA |
object ExternalSource { | ||
val providerClassName = "org.apache.spark.sql.rapids.AvroSourceProvider" | ||
|
||
lazy val hasSparkAvroJar = { | ||
val loader = Utils.getContextOrSparkClassLoader |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should do the same thing here as we are doing in GpuHiveOverrides. i.e. use the ShimLoader
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
val className = "org.apache.spark.sql.hive.rapids.HiveSourceProvider" | ||
ShimLoader.newInstanceOf[HiveProvider](className).getExprs |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit:
Our current pattern is to create dedicated methods in ShimLoader to reduce spreading of the class-for-name sort of code in the code base.
Consider creating in ShimLoader.scala methods:
def newHiveProvider(): HiveProvider= {
newInstanceOf[HiveProvider]("org.apache.spark.sql.hive.rapids.HiveSourceProvider")
}
Another nit about naming. Usual pattern is that when the interface/trait called HiveProvider then the implementing class is HiveProviderImpl
Naming is hard. Given this called from GpuHiveOverrides, should the trait be GpuHiveOverridesProvider and the class GpuHiveOverridesImpl?
Similar considerations for Avro
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
case _: AvroFileFormat => true | ||
case _ => false | ||
} | ||
ShimLoader.newInstanceOf[AvroProvider](providerClassName).isSupportedFormat(format) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can be called more than once, and will lead to creating garbage instances of AvroProvider. We should memoize AvroProvider as a singleton in a lazy val similar to hasSparkAvroJar
Let us create a ShimLoader method def newAvroProvider(): AvroProvider
. See the comment about Hive
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
I will file a dedicated issues for this. It does not look like this PR fixes it yet. To reproduce, invoke: $SPARK_HOME/bin/pyspark \
--driver-class-path dist/target/rapids-4-spark_2.12-22.08.0-SNAPSHOT-cuda11.jar \
--packages org.apache.spark:spark-avro_2.12:3.2.1 \
--conf spark.plugins=com.nvidia.spark.SQLPlugin \
--conf spark.rapids.sql.enabled=true \
--conf spark.rapids.sql.explain=ALL
If we substitute --jars for --driver-class-path it works |
I verified the following test works.
|
build |
@gerashegalov Help review. |
build |
1 similar comment
build |
/** | ||
* Singleton Avro Provider | ||
*/ | ||
lazy val avroProvider: AvroProvider = ShimLoader.newInstanceOf[AvroProvider]( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd rather have only methods in these trait, and not introduce any constraints, and let the caller decide how to use it. Can we change this to:
def newAvroProvider(): AvroProvider = {
...
}
for consistency.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
case _: AvroFileFormat => true | ||
case _ => false | ||
} | ||
ShimLoader.avroProvider.isSupportedFormat(format) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add a lazy val to object ExternalSource
lazy val avroProvider = ShimLoader.newAvroProvider()
and here and elsewhere simply call methods on the lazy val
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
build |
build |
this failed in https://github.com/NVIDIA/spark-rapids/blob/branch-22.08/jenkins/spark-premerge-build.sh#L81-L84
|
build |
I posted the above 2 commits.
The second is to fix the comments. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, modulo optional nit about unsorted unshimmed*.txt
com/nvidia/spark/rapids/HiveProvider.class | ||
com/nvidia/spark/rapids/AvroProvider.class |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let us keep the file sorted
build |
build |
Fixes #5648
Potential class not found exception problem:
source code of ExternalSource
See the above code.
ExternalSource
statically referenceAvroScan
evenAvroScan
does not exist at runtime.ExternalSource bytecode has the compile reference.
Resolve
Use the method mentioned in the issue #5648
Remove all the references to
Avro
classes in theExternalSource
file.Provides a
provider
trait and animplement
subclass.Put all the compile-references into the
implement
subclass.ExternalSource
dynamically loadsprovider
using reflection to avoid potential class not found exception.If has no
Avro
jar detected by theloader.loadClass
,ExternalSource
will not try to load theimplement
subclass.Also apply to
Hive
The following avro and hive classes should move to subclasses of the providers.
avro.{AvroFileFormat, AvroOptions}
avro.AvroScan
hive.{HiveGenericUDF, HiveSimpleUDF}
Signed-off-by: Chong Gao res_life@163.com