Fix handling of duplicate column names in parquet reader #23050

raunaqmorarka · 2024-08-15T05:22:28Z

Description

Parquet files written by case-sensitive tools may contain column names that are duplicates under case-insensitive matching.
In this scenario, we read the first case-insensitive match from the file.
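
As a rough illustration of the "first case-insensitive match" rule described above, here is a minimal sketch in plain Java. The class and method names (`FirstCaseInsensitiveMatch`, `findFirstMatch`) are hypothetical and are not part of Trino; the column names are borrowed from the Spark error message quoted later in this thread.

```java
import java.util.List;
import java.util.Optional;

public class FirstCaseInsensitiveMatch
{
    // Hypothetical helper, not Trino's API: returns the first name, in file order,
    // that equals the requested name when case is ignored.
    static Optional<String> findFirstMatch(List<String> fileColumnNames, String requestedName)
    {
        return fileColumnNames.stream()
                .filter(name -> name.equalsIgnoreCase(requestedName))
                .findFirst();
    }

    public static void main(String[] args)
    {
        // Two columns that differ only by case, as a case-sensitive writer might produce
        List<String> fileColumns = List.of("id", "UPPER_CASE_COLUMN", "Upper_Case_Column");

        // Prints Optional[UPPER_CASE_COLUMN]: the earlier column in file order wins
        System.out.println(findFirstMatch(fileColumns, "upper_case_column"));
    }
}
```

The second column is never selected under this rule, which is the trade-off accepted in exchange for not failing the query.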

Additional context and related issues

Fixes query failures caused by #22538

Release notes

( ) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
(x) Release notes are required, with the following suggested text:

# Hive
* Fix query failures due to "Multiple entries with same key" when reading Parquet files. ({issue}`23050`)

findinpath

Let's iterate a bit more on this, maybe this will give us a better result.

findinpath · 2024-08-15T06:13:29Z

lib/trino-parquet/src/main/java/io/trino/parquet/ParquetTypeUtils.java

@@ -113,7 +113,9 @@ public static Map<List<String>, ColumnDescriptor> getDescriptors(MessageType fil
                .stream()
                .collect(toImmutableMap(
                        columnIO -> Arrays.asList(columnIO.getFieldPath()),
-                        PrimitiveColumnIO::getColumnDescriptor));
+                        PrimitiveColumnIO::getColumnDescriptor,
+                        // Same column name may occur more than once when the file is written by case-sensitive tools


nit: "Same column name" -> "Namesake"
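
The diff above is cut off before the merge function itself, so the following is only a sketch of the general technique, assuming the added third argument keeps the first value seen for a key. It uses the JDK's `Collectors.toMap` as a stand-in for Guava's `toImmutableMap`; the two-argument JDK collector throws `IllegalStateException` on a duplicate key, whereas Guava reports `IllegalArgumentException` ("Multiple entries with same key"). The class and method names are hypothetical, and keying by lower-cased name is an assumption made for illustration.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.stream.Collectors;

public class KeepFirstOnDuplicateKey
{
    // Hypothetical helper: indexes names by their lower-cased form and
    // keeps the first original name when two names collide.
    static Map<String, String> indexByLowerCase(List<String> names)
    {
        return names.stream()
                .collect(Collectors.toMap(
                        name -> name.toLowerCase(Locale.ENGLISH),
                        name -> name,
                        // Merge function: on a key collision, keep the earlier entry
                        (first, second) -> first,
                        LinkedHashMap::new));
    }

    // Shows the failure mode without a merge function: duplicate keys are rejected.
    static boolean failsWithoutMergeFunction(List<String> names)
    {
        try {
            names.stream().collect(Collectors.toMap(
                    name -> name.toLowerCase(Locale.ENGLISH),
                    name -> name));
            return false;
        }
        catch (IllegalStateException e) {
            return true;
        }
    }

    public static void main(String[] args)
    {
        List<String> names = List.of("UPPER_CASE_COLUMN", "Upper_Case_Column");
        System.out.println(failsWithoutMergeFunction(names));
        System.out.println(indexByLowerCase(names));
    }
}
```

Keeping the first entry rather than the last is what makes the result line up with the "first case-insensitive match" behaviour stated in the description.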

plugin/trino-hive/src/test/java/io/trino/plugin/hive/TestHiveFileFormats.java

ebyhr · 2024-08-16T04:27:40Z

Removed RELEASE-BLOCKER label since version 454 already shipped.

raunaqmorarka · 2024-08-16T06:43:47Z

I checked Apache Hive's behaviour in this case. It always picks the first case-insensitive match.
Both the current PR and Trino's behaviour before the recent changes match this.
I also checked Apache Spark, which throws an error by default:

Caused by: org.apache.spark.SparkRuntimeException: Found duplicate field(s) "upper_case_column": [UPPER_CASE_COLUMN, Upper_Case_Column] in case-insensitive mode.
	at org.apache.spark.sql.errors.QueryExecutionErrors$.foundDuplicateFieldInCaseInsensitiveModeError(QueryExecutionErrors.scala:1082)

Running "set spark.sql.hive.convertMetastoreParquet=false;" makes Spark behave the same way as Apache Hive.

raunaqmorarka · 2024-08-27T07:41:08Z

I further confirmed that this PR matches the current AWS Athena behaviour.
I also found https://issues.apache.org/jira/browse/HIVE-7554, where case-insensitive column matching was implemented a long time ago, but it doesn't explicitly discuss this particular scenario.
With Iceberg, we continue to fail to read a table containing such a file, even after this PR:

Caused by: java.lang.IllegalArgumentException: Multiple entries with same key: 1=optional binary upper_case_column (STRING) = 1 and 1=optional binary upper_case_column (STRING) = 1
	at com.google.common.collect.ImmutableMap.conflictException(ImmutableMap.java:382)
	at com.google.common.collect.ImmutableMap.checkNoConflict(ImmutableMap.java:376)
	at com.google.common.collect.RegularImmutableMap.checkNoConflictInKeyBucket(RegularImmutableMap.java:246)
	at com.google.common.collect.RegularImmutableMap.fromEntryArrayCheckingBucketOverflow(RegularImmutableMap.java:133)
	at com.google.common.collect.RegularImmutableMap.fromEntryArray(RegularImmutableMap.java:95)
	at com.google.common.collect.ImmutableMap$Builder.build(ImmutableMap.java:579)
	at com.google.common.collect.ImmutableMap$Builder.buildOrThrow(ImmutableMap.java:607)
	at io.trino.plugin.iceberg.IcebergPageSourceProvider.createParquetIdToFieldMapping(IcebergPageSourceProvider.java:1060)
	at io.trino.plugin.iceberg.IcebergPageSourceProvider.createParquetPageSource(IcebergPageSourceProvider.java:924)

Delta Lake seems to explicitly disallow column names that differ only by case:
https://docs.delta.io/latest/delta-batch.html#schema-validation
So I'm going ahead with landing this, so that Trino's behaviour matches Apache Hive and AWS Athena in this case.

Fix handling of duplicate column names in parquet reader

3fc32ec

Parquet files may contain duplicate column names when written by case sensitive tools. We read the first case insensitive match from the file in this scenario.

cla-bot bot added the cla-signed label Aug 15, 2024

github-actions bot added the hive Hive connector label Aug 15, 2024

raunaqmorarka requested review from findepi, wendigo, ebyhr and findinpath August 15, 2024 05:22

raunaqmorarka added the bug Something isn't working label Aug 15, 2024

findinpath reviewed Aug 15, 2024

raunaqmorarka requested a review from findinpath August 15, 2024 06:10

findinpath reviewed Aug 15, 2024

findinpath approved these changes Aug 15, 2024

ebyhr reviewed Aug 15, 2024

raunaqmorarka requested a review from ebyhr August 15, 2024 07:34

raunaqmorarka added the RELEASE-BLOCKER label Aug 15, 2024

ebyhr removed the RELEASE-BLOCKER label Aug 16, 2024

ebyhr approved these changes Aug 21, 2024

raunaqmorarka merged commit 3b1eb2f into trinodb:master Aug 27, 2024
57 checks passed

raunaqmorarka deleted the pqr-dup branch August 27, 2024 07:41

github-actions bot added this to the 455 milestone Aug 27, 2024

mosabua mentioned this pull request Aug 28, 2024

Add Trino 455 release notes #23096

Merged
