Fix handling of duplicate column names in parquet reader #23050
Parquet files may contain duplicate column names when written by case-sensitive tools. In this scenario, we read the first case-insensitive match from the file.
Let's iterate a bit more on this, maybe this will give us a better result.
plugin/trino-hive/src/test/java/io/trino/plugin/hive/TestHiveFileFormats.java
@@ -113,7 +113,9 @@ public static Map<List<String>, ColumnDescriptor> getDescriptors(MessageType fil
                 .stream()
                 .collect(toImmutableMap(
                         columnIO -> Arrays.asList(columnIO.getFieldPath()),
-                        PrimitiveColumnIO::getColumnDescriptor));
+                        PrimitiveColumnIO::getColumnDescriptor,
+                        // Same column name may occur more than once when the file is written by case-sensitive tools
nit: "Same column name" -> "Namesake"
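For readers following the hunk above: a map collector fails on duplicate keys unless it is given a merge function, which is presumably why the change adds a third argument to the `toImmutableMap` call. The following is a minimal sketch of that idea using only the JDK's `Collectors.toMap` as a stand-in for Guava's collector. The class `FirstDuplicateWins`, the method names, and the assumption that field paths are lowercased before being used as keys are all illustrative and are not taken from the PR.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

// Hypothetical illustration; not the actual Trino code.
class FirstDuplicateWins
{
    // Maps each lowercased field path to the ordinal of the first leaf column that has it.
    // Assumption: paths are compared case-insensitively by lowercasing them first.
    static Map<List<String>, Integer> firstOrdinalByPath(List<List<String>> fieldPaths)
    {
        return IntStream.range(0, fieldPaths.size())
                .boxed()
                .collect(Collectors.toMap(
                        i -> lowerCase(fieldPaths.get(i)),
                        i -> i,
                        // Without this merge function, toMap throws IllegalStateException
                        // on the second occurrence of a key; here the first occurrence is kept.
                        (first, second) -> first,
                        LinkedHashMap::new));
    }

    static List<String> lowerCase(List<String> path)
    {
        return path.stream()
                .map(name -> name.toLowerCase(Locale.ENGLISH))
                .collect(Collectors.toList());
    }
}
```

With paths `[Name]` and `[name]` at ordinals 1 and 2, the resulting map holds ordinal 1 for the key `[name]`, so the later duplicate is ignored rather than causing a failure.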
plugin/trino-hive/src/test/java/io/trino/plugin/hive/TestHiveFileFormats.java
Removed |
I checked Apache Hive's behaviour in this case. It always picks the first case-insensitive match.
Setting |
I further confirmed that this PR matches the current AWS Athena behaviour.
Delta Lake seems to explicitly disallow column names that differ only by case.
Description
Parquet files may contain duplicate column names when written by case-sensitive tools.
In this scenario, we read the first case-insensitive match from the file.
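To make "first case-insensitive match" concrete, here is a small sketch of such a lookup over a file's column names in file order. It is an illustration only: the class `CaseInsensitiveLookup` and the method `findFirstMatch` are hypothetical names, and the real reader resolves columns through Parquet schema objects rather than plain strings.

```java
import java.util.List;
import java.util.Optional;

// Hypothetical illustration; not the actual Trino code.
class CaseInsensitiveLookup
{
    // Returns the first column name, in file order, that equals the requested name ignoring case.
    static Optional<String> findFirstMatch(List<String> fileColumnNames, String requestedName)
    {
        return fileColumnNames.stream()
                .filter(name -> name.equalsIgnoreCase(requestedName))
                .findFirst();
    }
}
```

For a file with columns `ID`, `Value`, `value`, a lookup of `value` returns `Value`, the earlier of the two columns that differ only by case.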
Additional context and related issues
Fixes query failures caused by #22538
Release notes
( ) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
(x) Release notes are required, with the following suggested text: