
Allow ORC conversion from VARCHAR to STRING #6188

Merged (2 commits, Aug 3, 2022)

Conversation

@jlowe (Contributor) commented Aug 1, 2022

Fixes #6160. Relates to #6149. Updates the type checks for compatible conversions in ORC schema evolution to allow conversion from VARCHAR to STRING. libcudf loads VARCHAR columns from ORC as STRING already, so this is a no-op in practice.

Signed-off-by: Jason Lowe <jlowe@nvidia.com>
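The schema-evolution check the PR relaxes can be sketched in plain Python. This is an illustrative stand-in only: the type names and the `can_convert` helper are hypothetical, while the actual plugin code compares ORC `TypeDescription` categories in Scala.

```python
# Illustrative sketch only: the real check walks org.apache.orc
# TypeDescription categories in Scala; names here are hypothetical.

def can_convert(from_type: str, to_type: str) -> bool:
    """Return True if a column written as from_type may be read as to_type."""
    if from_type == to_type:
        return True
    # libcudf already loads ORC VARCHAR columns as STRING, so allowing
    # VARCHAR -> STRING here is a no-op in practice (the point of this PR).
    if from_type == "VARCHAR" and to_type == "STRING":
        return True
    return False

print(can_convert("VARCHAR", "STRING"))  # True
print(can_convert("STRING", "VARCHAR"))  # False: narrowing is not allowed
```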
@jlowe jlowe added this to the July 22 - Aug 5 milestone Aug 1, 2022
@jlowe jlowe self-assigned this Aug 1, 2022
@jlowe (Contributor Author) commented Aug 1, 2022

build

jbrennan333 previously approved these changes Aug 1, 2022

@jbrennan333 (Contributor) left a comment


+1 this looks good to me


def test_orc_read_varchar_as_string(std_input_path):
    assert_gpu_and_cpu_are_equal_collect(
        lambda spark: spark.read.schema("id bigint, name string").orc(std_input_path + "/test_orc_varchar.orc"))
Collaborator

it would be more obvious what is being tested if the file was created on the fly in this test. It would also avoid checking in yet another binary resource.

Contributor Author

AFAICT Spark is unable to create varchar files, as the type is always upgraded to a string with a warning. That's why I checked in a file pre-built with Hive. I don't think there's any way to access Hive directly from pyspark (without hacking through the raw JVM interface), and Hive isn't required to be present with the Spark distribution.

If there's a way to create this file directly in the test, happy to update the test accordingly.

Collaborator

I poked around Spark and can confirm. I did not realize it was difficult.

Comment on lines 240 to 244
case VARCHAR =>
  to.getCategory match {
    case STRING => true
    case _ => false
  }
Collaborator

nit:

Suggested change
-        case VARCHAR =>
-          to.getCategory match {
-            case STRING => true
-            case _ => false
-          }
+        case VARCHAR => to.getCategory == STRING
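The nit above collapses a two-arm match into a direct equality test. The equivalence is easy to check in a small Python sketch (the category values and both functions are illustrative, not the plugin's code):

```python
# Hypothetical stand-ins for the ORC category values compared in the Scala code.
STRING, VARCHAR, INT = "STRING", "VARCHAR", "INT"

def verbose(to_category):
    # Mirrors the original two-arm match: true for STRING, false otherwise.
    if to_category == STRING:
        return True
    return False

def concise(to_category):
    # Mirrors the suggested one-liner: to.getCategory == STRING
    return to_category == STRING

# Both forms agree for every category.
for cat in (STRING, VARCHAR, INT):
    assert verbose(cat) == concise(cat)
```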

@sameerz added the "task" label (Work required that improves the product but is not user facing) Aug 1, 2022
revans2 previously approved these changes Aug 2, 2022
@jlowe jlowe dismissed stale reviews from revans2 and jbrennan333 via 47fab84 August 2, 2022 15:02
@jlowe (Contributor Author) commented Aug 2, 2022

build

@gerashegalov (Collaborator) left a comment

LGTM

Labels: task (Work required that improves the product but is not user facing)
6 participants