Allow ORC conversion from VARCHAR to STRING #6188
Conversation
Signed-off-by: Jason Lowe <jlowe@nvidia.com>
build
+1 this looks good to me
```python
def test_orc_read_varchar_as_string(std_input_path):
    assert_gpu_and_cpu_are_equal_collect(
        lambda spark: spark.read.schema("id bigint, name string").orc(std_input_path + "/test_orc_varchar.orc"))
```
it would be more obvious what is being tested if the file was created on the fly in this test. It would also avoid checking in yet another binary resource.
AFAICT Spark is unable to create VARCHAR ORC files, as the type is always upgraded to a string with a warning. That's why I checked in a pre-built file created with Hive. I don't think there's any way to access Hive directly from pyspark (without hacking through the raw JVM interface), and Hive isn't required to be present with the Spark distribution.
If there's a way to create this file directly in the test, I'm happy to update the test accordingly.
I poked around Spark and can confirm. I did not realize it was this difficult.
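For context, a minimal pyspark sketch of the behavior discussed above (the output path and data are illustrative): even when a VARCHAR type is requested in the schema, Spark replaces it with a string column, so the ORC file it writes does not contain a true VARCHAR column, which is why the test relies on a Hive-built file instead.

```python
# Sketch only: Spark accepts VARCHAR(n) in a DDL schema string but treats the
# column as StringType (per the comment above, it is upgraded with a warning),
# so the ORC file written below contains a STRING column rather than VARCHAR.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([(1, "a"), (2, "bb")], "id bigint, name varchar(10)")
df.printSchema()  # `name` shows up as string, not varchar
df.write.mode("overwrite").orc("/tmp/varchar_attempt_orc")  # illustrative path
# Inspecting the written file (e.g. with the ORC project's `orc-tools meta`)
# shows `name: string`, so a real VARCHAR ORC column has to come from another
# writer such as Hive.
```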
```scala
case VARCHAR =>
  to.getCategory match {
    case STRING => true
    case _ => false
  }
```
nit:
```suggestion
case VARCHAR =>
  to.getCategory == STRING
```
build
LGTM
Fixes #6160. Relates to #6149. Updates the type checks for compatible conversions in ORC schema evolution to allow conversion from VARCHAR to STRING. libcudf loads VARCHAR columns from ORC as STRING already, so this is a no-op in practice.
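For illustration, a hedged pyspark sketch of the read path this change enables, mirroring the integration test above. The file path is illustrative, plugin jar/setup configuration is elided, and `spark.rapids.sql.enabled` is the plugin's enable flag.

```python
# Sketch only: read a Hive-written ORC file whose `name` column is VARCHAR
# using a STRING schema. With this change, the GPU ORC reader's schema
# evolution check accepts the VARCHAR -> STRING conversion instead of
# rejecting it; libcudf already returns the data as STRING.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .config("spark.rapids.sql.enabled", "true")  # RAPIDS plugin on
         .getOrCreate())

df = (spark.read.schema("id bigint, name string")
      .orc("/path/to/test_orc_varchar.orc"))  # illustrative path
df.show()
```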