
Allow ORC conversion from VARCHAR to STRING #6188

Merged (2 commits, Aug 3, 2022)

Conversation

@jlowe (Contributor) commented Aug 1, 2022

Fixes #6160. Relates to #6149. Updates the type checks for compatible conversions in ORC schema evolution to allow conversion from VARCHAR to STRING. libcudf loads VARCHAR columns from ORC as STRING already, so this is a no-op in practice.

Signed-off-by: Jason Lowe <jlowe@nvidia.com>
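The schema-evolution check the PR relaxes can be sketched in plain Python. This is an illustrative stand-in only: the type names and the `can_convert` helper are hypothetical, while the actual plugin code compares ORC `TypeDescription` categories in Scala.

```python
# Illustrative sketch only: the real check walks org.apache.orc
# TypeDescription categories in Scala; names here are hypothetical.

def can_convert(from_type: str, to_type: str) -> bool:
    """Return True if a column written as from_type may be read as to_type."""
    if from_type == to_type:
        return True
    # libcudf already loads ORC VARCHAR columns as STRING, so allowing
    # VARCHAR -> STRING here is a no-op in practice (the point of this PR).
    if from_type == "VARCHAR" and to_type == "STRING":
        return True
    return False

print(can_convert("VARCHAR", "STRING"))  # True
print(can_convert("STRING", "VARCHAR"))  # False: narrowing is not allowed
```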
@jlowe jlowe added this to the July 22 - Aug 5 milestone Aug 1, 2022
@jlowe jlowe self-assigned this Aug 1, 2022
@jlowe (Contributor Author) commented Aug 1, 2022

build

jbrennan333 previously approved these changes Aug 1, 2022

@jbrennan333 (Contributor) left a comment


+1 this looks good to me


def test_orc_read_varchar_as_string(std_input_path):
    assert_gpu_and_cpu_are_equal_collect(
        lambda spark: spark.read.schema("id bigint, name string").orc(std_input_path + "/test_orc_varchar.orc"))
Collaborator

it would be more obvious what is being tested if the file was created on the fly in this test. It would also avoid checking in yet another binary resource.

Contributor Author

AFAICT Spark is unable to create varchar files, as the type is always upgraded to a string with a warning. That's why I checked in a file pre-built with Hive. I don't think there's any way to access Hive directly from pyspark (without hacking through the raw JVM interface), and Hive isn't required to be present with the Spark distribution.

If there's a way to create this file directly in the test, happy to update the test accordingly.

Collaborator

I poked around Spark and can confirm. I did not realize it was difficult.

Comment on lines 240 to 244
case VARCHAR =>
  to.getCategory match {
    case STRING => true
    case _ => false
  }
Collaborator

nit:

Suggested change
-        case VARCHAR =>
-          to.getCategory match {
-            case STRING => true
-            case _ => false
-          }
+        case VARCHAR => to.getCategory == STRING
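The nit above collapses a two-arm match into a direct equality test. The equivalence is easy to check in a small Python sketch (the category values and both functions are illustrative, not the plugin's code):

```python
# Hypothetical stand-ins for the ORC category values compared in the Scala code.
STRING, VARCHAR, INT = "STRING", "VARCHAR", "INT"

def verbose(to_category):
    # Mirrors the original two-arm match: true for STRING, false otherwise.
    if to_category == STRING:
        return True
    return False

def concise(to_category):
    # Mirrors the suggested one-liner: to.getCategory == STRING
    return to_category == STRING

# Both forms agree for every category.
for cat in (STRING, VARCHAR, INT):
    assert verbose(cat) == concise(cat)
```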

@sameerz added the "task" label (Work required that improves the product but is not user facing) Aug 1, 2022
revans2 previously approved these changes Aug 2, 2022
@jlowe jlowe dismissed stale reviews from revans2 and jbrennan333 via 47fab84 August 2, 2022 15:02
@jlowe (Contributor Author) commented Aug 2, 2022

build

@gerashegalov (Collaborator) left a comment

LGTM

Labels: task (Work required that improves the product but is not user facing)
6 participants