
[BUG] Rapids Accelerator (0.2) failing to read csv file on Databricks 7.0 ML GPU Runtime #1322

Closed
Tracked by #2063
krajendrannv opened this issue Dec 8, 2020 · 4 comments
Labels
bug Something isn't working

@krajendrannv (Contributor) commented Dec 8, 2020:
Describe the bug
The customer is replicating the Mortgage ETL query in a Databricks environment. Data is read from S3. The same S3 data reads fine on a CPU cluster, but the GPU scan fails. It is failing on the first read:

acq = read_acq_csv(spark, orig_acq_path)

def read_acq_csv(spark, path):
    # Chained reader calls wrapped in parentheses so the expression spans lines.
    return (spark.read.format('csv')
            .option('nullValue', '')
            .option('header', 'false')
            .option('delimiter', '|')
            .schema(_csv_acq_schema)
            .load(path)
            .withColumn('quarter', _get_quarter_from_csv_file_name()))

Steps/Code to reproduce bug
The cluster uses p3.2xlarge (v100) for driver and executor.
Rapids Accelerator (0.2), Databricks 7.0ML GPU Runtime

Expected behavior
Attached are logs from both the CPU cluster (working-log4j_cpu.log) and the GPU cluster (log4j_gpu_databricks7.0.log).

Environment details (please complete the following information)

  • Environment location: Databricks 7.0ML Runtime on AWS

Additional context
log4j_gpu_databricks7.0.log
working-log4j_cpu.log

@krajendrannv krajendrannv added bug Something isn't working ? - Needs Triage Need team to review and classify labels Dec 8, 2020
@sameerz sameerz removed the ? - Needs Triage Need team to review and classify label Dec 8, 2020
@tgravescs (Collaborator) commented:

This looks like spark.io.compression.codec is set to zstd. If Databricks sets it, I would expect that to just work; if the setup scripts you are using are setting it, we should not.

The code that is erroring isn't even in the plugin:

java.lang.NoSuchMethodError: com.github.luben.zstd.Zstd.setCompressionLevel(JI)I
	at com.github.luben.zstd.ZstdOutputStream.<init>(ZstdOutputStream.java:64)
	at org.apache.spark.io.ZStdCompressionCodec.compressedOutputStream(CompressionCodec.scala:224)
	at org.apache.spark.MapOutputTracker$.serializeMapStatuses(MapOutputTracker.scala:963)
	at org.apache.spark.ShuffleStatus.$anonfun$serializedMapStatus$2(MapOutputTracker.scala:234)
	at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
	at org.apache.spark.ShuffleStatus.withWriteLock(MapOutputTracker.scala:75)
	at org.apache.spark.ShuffleStatus.serializedMapStatus(MapOutputTracker.scala:231)
	at org.apache.spark.MapOutputTrackerMaster$MessageLoop.run(MapOutputTracker.scala:485)

The issue is that the underlying library/jar is either missing or incompatible with what is expected:

java.lang.NoSuchMethodError: com.github.luben.zstd.Zstd.setCompressionLevel(JI)I

When I start a Databricks 7.0ML cluster I don't see spark.io.compression.codec being set, so are you setting that yourself?

@revans2 revans2 mentioned this issue Oct 27, 2022
38 tasks
@firestarman (Collaborator) commented Oct 31, 2022:

Do we still need to fix this?

@revans2 (Collaborator) commented Oct 31, 2022:

I agree that it is probably not an issue any more. We support zstd, and we no longer support Databricks 7.0 ML. I mostly want to be sure that we are testing zstd with CSV on Databricks. Even if we just manually verify that it works once, that is good enough.

@firestarman (Collaborator) commented:

@revans2 FYI

This is more likely a version issue, because setCompressionLevel was introduced in zstd-jni v1.4.0. Anyway, I just verified on Azure Databricks 9.1 and 10.4; both work now with zstd plus the plugin.
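A hedged sketch of that version check: the helper and the jar filenames below are hypothetical examples, not taken from the issue. It simply parses the zstd-jni version out of a jar filename and compares it against v1.4.0, the release the comment above identifies as the one that added `Zstd.setCompressionLevel`.

```python
import re

def zstd_jni_at_least(jar_name, minimum=(1, 4, 0)):
    """Return True if the zstd-jni jar filename reports a version >= minimum."""
    m = re.search(r"zstd-jni-(\d+)\.(\d+)\.(\d+)", jar_name)
    if not m:
        return False
    # Tuple comparison gives lexicographic (major, minor, patch) ordering.
    return tuple(int(g) for g in m.groups()) >= minimum

# Hypothetical jar names, for illustration only:
print(zstd_jni_at_least("zstd-jni-1.4.4-3.jar"))  # True: new enough for setCompressionLevel
print(zstd_jni_at_least("zstd-jni-1.3.2-2.jar"))  # False: the method would be missing
```

Running something like this over the cluster's classpath jars would confirm whether the NoSuchMethodError is a stale-jar problem.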

I am going to close this.

tgravescs pushed a commit to tgravescs/spark-rapids that referenced this issue Nov 30, 2023
…IDIA#1322)

Signed-off-by: spark-rapids automation <70000568+nvauto@users.noreply.github.com>