Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

Fix handling of duplicate column names in parquet reader #23050

Merged
merged 1 commit into from
Aug 27, 2024

Conversation

raunaqmorarka
Copy link
Member

@raunaqmorarka raunaqmorarka commented Aug 15, 2024

Description

Parquet files may contain duplicate column names when written by case sensitive tools.
We read the first case insensitive match from the file in this scenario.

Additional context and related issues

Fixes query failures caused by #22538

Release notes

( ) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
(x) Release notes are required, with the following suggested text:

# Hive
* Fixes query failures due to "Multiple entries with same key" when reading parquet files. ({issue}`23050`)

Parquet files may contain duplicate column names when written by
case sensitive tools. We read the first case insensitive match from
the file in this scenario.
@cla-bot cla-bot bot added the cla-signed label Aug 15, 2024
@github-actions github-actions bot added the hive Hive connector label Aug 15, 2024
@raunaqmorarka raunaqmorarka added the bug Something isn't working label Aug 15, 2024
Copy link
Contributor

@findinpath findinpath left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's iterate a bit more on this, maybe this will give us a better result.

@@ -113,7 +113,9 @@ public static Map<List<String>, ColumnDescriptor> getDescriptors(MessageType fil
.stream()
.collect(toImmutableMap(
columnIO -> Arrays.asList(columnIO.getFieldPath()),
PrimitiveColumnIO::getColumnDescriptor));
PrimitiveColumnIO::getColumnDescriptor,
// Same column name may occur more than once when the file is written by case-sensitive tools
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: "Same column name" -> "Namesake"

@ebyhr
Copy link
Member

ebyhr commented Aug 16, 2024

Removed RELEASE-BLOCKER label since version 454 already shipped.

@raunaqmorarka
Copy link
Member Author

I checked Apache Hive behaviour in this case. It always picks first case-insensitive match.
The current PR and Trino behaviour before recent changes matches this behaviour.
I also checked Apache Spark, it throws an error by default

Caused by: org.apache.spark.SparkRuntimeException: Found duplicate field(s) "upper_case_column": [UPPER_CASE_COLUMN, Upper_Case_Column] in case-insensitive mode.
	at org.apache.spark.sql.errors.QueryExecutionErrors$.foundDuplicateFieldInCaseInsensitiveModeError(QueryExecutionErrors.scala:1082)

Setting set spark.sql.hive.convertMetastoreParquet=false; makes it behave the same way as Apache Hive.

@raunaqmorarka
Copy link
Member Author

I further confirmed that this PR is matching the current AWS Athena behaviour.
I also found https://issues.apache.org/jira/browse/HIVE-7554 where case insensitive column matching was implemented a long time ago, but it doesn't explicitly discuss this particular scenario.
With iceberg we continue fail to read a table with such file even after this PR

Caused by: java.lang.IllegalArgumentException: Multiple entries with same key: 1=optional binary upper_case_column (STRING) = 1 and 1=optional binary upper_case_column (STRING) = 1
	at com.google.common.collect.ImmutableMap.conflictException(ImmutableMap.java:382)
	at com.google.common.collect.ImmutableMap.checkNoConflict(ImmutableMap.java:376)
	at com.google.common.collect.RegularImmutableMap.checkNoConflictInKeyBucket(RegularImmutableMap.java:246)
	at com.google.common.collect.RegularImmutableMap.fromEntryArrayCheckingBucketOverflow(RegularImmutableMap.java:133)
	at com.google.common.collect.RegularImmutableMap.fromEntryArray(RegularImmutableMap.java:95)
	at com.google.common.collect.ImmutableMap$Builder.build(ImmutableMap.java:579)
	at com.google.common.collect.ImmutableMap$Builder.buildOrThrow(ImmutableMap.java:607)
	at io.trino.plugin.iceberg.IcebergPageSourceProvider.createParquetIdToFieldMapping(IcebergPageSourceProvider.java:1060)
	at io.trino.plugin.iceberg.IcebergPageSourceProvider.createParquetPageSource(IcebergPageSourceProvider.java:924)

Delta lake seems to explicitly disallow cases where column name differs only by case
https://docs.delta.io/latest/delta-batch.html#schema-validation
So I'm going ahead with landing this to match Trino behaviour with Apache Hive and AWS Athena in this case.

@raunaqmorarka raunaqmorarka merged commit 3b1eb2f into trinodb:master Aug 27, 2024
57 checks passed
@raunaqmorarka raunaqmorarka deleted the pqr-dup branch August 27, 2024 07:41
@github-actions github-actions bot added this to the 455 milestone Aug 27, 2024
# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
bug Something isn't working cla-signed hive Hive connector
Development

Successfully merging this pull request may close these issues.

3 participants