[BUG] Some deltalake tests failed on ARM64 with DATAGEN_SEED=1702341898 #10025

Closed
revans2 opened this issue Dec 12, 2023 · 2 comments
Labels
bug Something isn't working

Comments

Collaborator

revans2 commented Dec 12, 2023

Describe the bug
I have not been able to reproduce this on my desktop, but an arm64 build with Spark 3.4.2 and deltalake 2.4.0 failed the following tests.

[2023-12-12T01:07:56.056Z] FAILED ../../src/main/python/delta_lake_delete_test.py::test_delta_delete_rows[None-True][DATAGEN_SEED=1702341898, IGNORE_ORDER, ALLOW_NON_GPU(DeserializeToObjectExec,ShuffleExchangeExec,FileSourceScanExec,FilterExec,MapPartitionsExec,MapElementsExec,ObjectHashAggregateExec,ProjectExec,SerializeFromObjectExec,SortExec)] - AssertionError: Delta log 00000000000000000002.json is different at key 'add':
[2023-12-12T01:07:56.056Z] CPU: {'path': 'partsnappy.parquet', 'partitionValues': {}, 'dataChange': True, 'stats': '{"numRecords":575,"minValues":{"a":0,"b":"d","c":"\\u0001ÑKÃä¦ý\\u001BQî\x94³\\u001DB¨¾z¦!Âx`ÖæÜ\x88\x92±}"},"maxValues":{"a":4,"b":"g","c":"ÿú\x82h!ï\\u00051ç>3\xa0\\u0006³\x7fp\\u0017c\\u0003ÜüÔCæT¨\x85èe\x86"},"nullCount":{"a":0,"b":0,"c":22}}'}
[2023-12-12T01:07:56.056Z] GPU: {'path': 'partsnappy.parquet', 'partitionValues': {}, 'dataChange': True, 'stats': '{"numRecords":580,"minValues":{"a":0,"b":"d","c":"\\u0001ÑKÃä¦ý\\u001BQî\x94³\\u001DB¨¾z¦!Âx`ÖæÜ\x88\x92±}"},"maxValues":{"a":4,"b":"g","c":"ÿú\x82h!ï\\u00051ç>3\xa0\\u0006³\x7fp\\u0017c\\u0003ÜüÔCæT¨\x85èe\x86"},"nullCount":{"a":0,"b":0,"c":20}}'}
[2023-12-12T01:07:56.056Z] FAILED ../../src/main/python/delta_lake_delete_test.py::test_delta_delete_rows[None-False][DATAGEN_SEED=1702341898, IGNORE_ORDER, ALLOW_NON_GPU(DeserializeToObjectExec,ShuffleExchangeExec,FileSourceScanExec,FilterExec,MapPartitionsExec,MapElementsExec,ObjectHashAggregateExec,ProjectExec,SerializeFromObjectExec,SortExec)] - AssertionError: Delta log 00000000000000000001.json is different at key 'add':
[2023-12-12T01:07:56.056Z] CPU: {'path': 'partsnappy.parquet', 'partitionValues': {}, 'dataChange': True, 'stats': '{"numRecords":580,"minValues":{"a":0,"b":"d","c":"\\u0001ÑKÃä¦ý\\u001BQî\x94³\\u001DB¨¾z¦!Âx`ÖæÜ\x88\x92±}"},"maxValues":{"a":4,"b":"g","c":"ÿú\x82h!ï\\u00051ç>3\xa0\\u0006³\x7fp\\u0017c\\u0003ÜüÔCæT¨\x85èe\x86"},"nullCount":{"a":0,"b":0,"c":20}}'}
[2023-12-12T01:07:56.057Z] GPU: {'path': 'partsnappy.parquet', 'partitionValues': {}, 'dataChange': True, 'stats': '{"numRecords":575,"minValues":{"a":0,"b":"d","c":"\\u0001ÑKÃä¦ý\\u001BQî\x94³\\u001DB¨¾z¦!Âx`ÖæÜ\x88\x92±}"},"maxValues":{"a":4,"b":"g","c":"ÿú\x82h!ï\\u00051ç>3\xa0\\u0006³\x7fp\\u0017c\\u0003ÜüÔCæT¨\x85èe\x86"},"nullCount":{"a":0,"b":0,"c":22}}'}
[2023-12-12T01:07:56.057Z] FAILED ../../src/main/python/delta_lake_update_test.py::test_delta_update_dataframe_api[None-False][DATAGEN_SEED=1702341898, INJECT_OOM, IGNORE_ORDER, ALLOW_NON_GPU(DeserializeToObjectExec,ShuffleExchangeExec,FileSourceScanExec,FilterExec,MapPartitionsExec,MapElementsExec,ObjectHashAggregateExec,ProjectExec,SerializeFromObjectExec,SortExec)] - AssertionError: Delta log 00000000000000000001.json is different at key 'add':
[2023-12-12T01:07:56.057Z] CPU: {'path': 'partsnappy.parquet', 'partitionValues': {}, 'dataChange': True, 'stats': '{"numRecords":1020,"minValues":{"a":0,"b":"a","c":"\\u0000\xa0`\x94?/n¦¿Ñ\x80©ëí3\\"¶<\x93Í\\\\h\\u0010L\x83ñ\\u0010r/è"},"maxValues":{"a":4,"b":"g","c":"ÿ\x8e^<æC_ÓAϽ£Ì\x98Í\x98r«3W¾äíj%Ëý\x83LÖ"},"nullCount":{"a":0,"b":0,"c":22}}'}
[2023-12-12T01:07:56.057Z] GPU: {'path': 'partsnappy.parquet', 'partitionValues': {}, 'dataChange': True, 'stats': '{"numRecords":1020,"minValues":{"a":0,"b":"a","c":"\\u0000î]Î\x9e´® \\u0012;\x9cÉ\\u000Fy·þ/\x91\xa0÷h\x96\\u00194&M)¡c~"},"maxValues":{"a":4,"b":"g","c":"ÿ\x8e^<æC_ÓAϽ£Ì\x98Í\x98r«3W¾äíj%Ëý\x83LÖ"},"nullCount":{"a":0,"b":0,"c":20}}'}
[2023-12-12T01:07:56.057Z] FAILED ../../src/main/python/delta_lake_update_test.py::test_delta_update_rows_with_dv[True-None-True][DATAGEN_SEED=1702341898, IGNORE_ORDER, ALLOW_NON_GPU(HashAggregateExec,ColumnarToRowExec,RapidsDeltaWriteExec,GenerateExec,DeserializeToObjectExec,ShuffleExchangeExec,FileSourceScanExec,FilterExec,MapPartitionsExec,MapElementsExec,ObjectHashAggregateExec,ProjectExec,SerializeFromObjectExec,SortExec)] - AssertionError: Delta log 00000000000000000002.json is different at key 'add':
[2023-12-12T01:07:56.057Z] CPU: {'path': 'partsnappy.parquet', 'partitionValues': {}, 'dataChange': True, 'stats': '{"numRecords":1020,"minValues":{"a":0,"b":"a","c":"\\u0000\xa0`\x94?/n¦¿Ñ\x80©ëí3\\"¶<\x93Í\\\\h\\u0010L\x83ñ\\u0010r/è"},"maxValues":{"a":4,"b":"g","c":"ÿ\x8e^<æC_ÓAϽ£Ì\x98Í\x98r«3W¾äíj%Ëý\x83LÖ"},"nullCount":{"a":0,"b":0,"c":22},"tightBounds":true}'}
[2023-12-12T01:07:56.057Z] GPU: {'path': 'partsnappy.parquet', 'partitionValues': {}, 'dataChange': True, 'stats': '{"numRecords":1020,"minValues":{"a":0,"b":"a","c":"\\u0000î]Î\x9e´® \\u0012;\x9cÉ\\u000Fy·þ/\x91\xa0÷h\x96\\u00194&M)¡c~"},"maxValues":{"a":4,"b":"g","c":"ÿ\x8e^<æC_ÓAϽ£Ì\x98Í\x98r«3W¾äíj%Ëý\x83LÖ"},"nullCount":{"a":0,"b":0,"c":20},"tightBounds":true}'}
[2023-12-12T01:07:56.057Z] FAILED ../../src/main/python/delta_lake_update_test.py::test_delta_update_rows[None-True][DATAGEN_SEED=1702341898, IGNORE_ORDER, ALLOW_NON_GPU(DeserializeToObjectExec,ShuffleExchangeExec,FileSourceScanExec,FilterExec,MapPartitionsExec,MapElementsExec,ObjectHashAggregateExec,ProjectExec,SerializeFromObjectExec,SortExec)] - AssertionError: Delta log 00000000000000000002.json is different at key 'add':
[2023-12-12T01:07:56.057Z] CPU: {'path': 'partsnappy.parquet', 'partitionValues': {}, 'dataChange': True, 'stats': '{"numRecords":1020,"minValues":{"a":0,"b":"a","c":"\\u0000î]Î\x9e´® \\u0012;\x9cÉ\\u000Fy·þ/\x91\xa0÷h\x96\\u00194&M)¡c~"},"maxValues":{"a":4,"b":"g","c":"ÿ\x8e^<æC_ÓAϽ£Ì\x98Í\x98r«3W¾äíj%Ëý\x83LÖ"},"nullCount":{"a":0,"b":0,"c":20}}'}
[2023-12-12T01:07:56.057Z] GPU: {'path': 'partsnappy.parquet', 'partitionValues': {}, 'dataChange': True, 'stats': '{"numRecords":1020,"minValues":{"a":0,"b":"a","c":"\\u0000\xa0`\x94?/n¦¿Ñ\x80©ëí3\\"¶<\x93Í\\\\h\\u0010L\x83ñ\\u0010r/è"},"maxValues":{"a":4,"b":"g","c":"ÿ\x8e^<æC_ÓAϽ£Ì\x98Í\x98r«3W¾äíj%Ëý\x83LÖ"},"nullCount":{"a":0,"b":0,"c":22}}'}
[2023-12-12T01:07:56.057Z] FAILED ../../src/main/python/delta_lake_update_test.py::test_delta_update_rows_with_dv[True-None-False][DATAGEN_SEED=1702341898, IGNORE_ORDER, ALLOW_NON_GPU(HashAggregateExec,ColumnarToRowExec,RapidsDeltaWriteExec,GenerateExec,DeserializeToObjectExec,ShuffleExchangeExec,FileSourceScanExec,FilterExec,MapPartitionsExec,MapElementsExec,ObjectHashAggregateExec,ProjectExec,SerializeFromObjectExec,SortExec)] - AssertionError: Delta log 00000000000000000002.json is different at key 'add':
[2023-12-12T01:07:56.057Z] CPU: {'path': 'partsnappy.parquet', 'partitionValues': {}, 'dataChange': True, 'stats': '{"numRecords":1020,"minValues":{"a":0,"b":"a","c":"\\u0000î]Î\x9e´® \\u0012;\x9cÉ\\u000Fy·þ/\x91\xa0÷h\x96\\u00194&M)¡c~"},"maxValues":{"a":4,"b":"g","c":"ÿ\x8e^<æC_ÓAϽ£Ì\x98Í\x98r«3W¾äíj%Ëý\x83LÖ"},"nullCount":{"a":0,"b":0,"c":20},"tightBounds":true}'}
[2023-12-12T01:07:56.057Z] GPU: {'path': 'partsnappy.parquet', 'partitionValues': {}, 'dataChange': True, 'stats': '{"numRecords":1020,"minValues":{"a":0,"b":"a","c":"\\u0000\xa0`\x94?/n¦¿Ñ\x80©ëí3\\"¶<\x93Í\\\\h\\u0010L\x83ñ\\u0010r/è"},"maxValues":{"a":4,"b":"g","c":"ÿ\x8e^<æC_ÓAϽ£Ì\x98Í\x98r«3W¾äíj%Ëý\x83LÖ"},"nullCount":{"a":0,"b":0,"c":22},"tightBounds":true}'}
[2023-12-12T01:07:56.057Z] Starting with datagen test seed: 1702341898. Set env variable SPARK_RAPIDS_TEST_DATAGEN_SEED to override.
[2023-12-12T01:07:56.057Z] Starting with OOM injection seed: 1702341898. Set env variable SPARK_RAPIDS_TEST_INJECT_OOM_SEED to override.
[2023-12-12T01:07:56.057Z] 2023-12-12 00:44:58 INFO     Executing global initialization tasks before test launches
[2023-12-12T01:07:56.057Z] 2023-12-12 00:44:58 INFO     Creating directory /home/jenkins/agent/workspace/rapids_it-arm64-dev/jars/integration_tests/target/run_dir-20231212004458-nCyn/hive with permissions 0o777
[2023-12-12T01:07:56.057Z] 2023-12-12 00:44:58 INFO     Skipping findspark init because on xdist master
[2023-12-12T01:07:56.057Z] FAILED ../../src/main/python/delta_lake_update_test.py::test_delta_update_dataframe_api[None-True][DATAGEN_SEED=1702341898, IGNORE_ORDER, ALLOW_NON_GPU(DeserializeToObjectExec,ShuffleExchangeExec,FileSourceScanExec,FilterExec,MapPartitionsExec,MapElementsExec,ObjectHashAggregateExec,ProjectExec,SerializeFromObjectExec,SortExec)] - AssertionError: Delta log 00000000000000000002.json is different at key 'add':
[2023-12-12T01:07:56.057Z] CPU: {'path': 'partsnappy.parquet', 'partitionValues': {}, 'dataChange': True, 'stats': '{"numRecords":1020,"minValues":{"a":0,"b":"a","c":"\\u0000î]Î\x9e´® \\u0012;\x9cÉ\\u000Fy·þ/\x91\xa0÷h\x96\\u00194&M)¡c~"},"maxValues":{"a":4,"b":"g","c":"ÿ\x8e^<æC_ÓAϽ£Ì\x98Í\x98r«3W¾äíj%Ëý\x83LÖ"},"nullCount":{"a":0,"b":0,"c":20}}'}
[2023-12-12T01:07:56.057Z] GPU: {'path': 'partsnappy.parquet', 'partitionValues': {}, 'dataChange': True, 'stats': '{"numRecords":1020,"minValues":{"a":0,"b":"a","c":"\\u0000\xa0`\x94?/n¦¿Ñ\x80©ëí3\\"¶<\x93Í\\\\h\\u0010L\x83ñ\\u0010r/è"},"maxValues":{"a":4,"b":"g","c":"ÿ\x8e^<æC_ÓAϽ£Ì\x98Í\x98r«3W¾äíj%Ëý\x83LÖ"},"nullCount":{"a":0,"b":0,"c":22}}'}

In all of these cases, it appears that some of the data was partitioned slightly differently between the CPU and GPU runs. I know that @jlowe was working on a fix for some deltalake issues where we got unlucky and the order in which the files were read was non-deterministic because their sizes matched exactly. I am not sure whether that is the case here or whether something else is happening, especially for the delete case.
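To make the mismatch easier to see, here is a minimal sketch that parses the `stats` JSON string of two `add` entries and reports only the fields that differ. The values are copied from the first `test_delta_delete_rows` failure above, with `minValues` and `maxValues` omitted for brevity (they are identical in that failure). The helper name `stats_diff` is hypothetical and is not part of the test suite.

```python
import json


def stats_diff(cpu_add, gpu_add):
    """Return {field: (cpu_value, gpu_value)} for each stats field that differs."""
    cpu = json.loads(cpu_add["stats"])
    gpu = json.loads(gpu_add["stats"])
    diffs = {}
    for key in sorted(set(cpu) | set(gpu)):
        if cpu.get(key) != gpu.get(key):
            diffs[key] = (cpu.get(key), gpu.get(key))
    return diffs


# Simplified copies of the logged entries (minValues/maxValues left out).
cpu_add = {
    "path": "partsnappy.parquet",
    "stats": '{"numRecords":575,"nullCount":{"a":0,"b":0,"c":22}}',
}
gpu_add = {
    "path": "partsnappy.parquet",
    "stats": '{"numRecords":580,"nullCount":{"a":0,"b":0,"c":20}}',
}

diff = stats_diff(cpu_add, gpu_add)
print(diff)
```

Only the row count and the null count for column `c` differ, which is consistent with rows landing in different files rather than with different data being written.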

revans2 added the "bug" and "? - Needs Triage" labels on Dec 12, 2023
mattahrens removed the "? - Needs Triage" label on Dec 12, 2023
Contributor

jlowe commented Dec 27, 2023

This has the same root cause as described at #9884 (comment): two files created during setup have exactly the same file size, so ordering them by size is non-deterministic.
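The effect of a size tie can be illustrated with a small sketch. It is not the code path used by Delta Lake or the plugin, and the file names and sizes are hypothetical. It only shows that sorting by size alone preserves whatever order the listing happened to return for equal-size files, while a secondary key (here, the path) gives a stable result.

```python
from typing import List, NamedTuple


class FileEntry(NamedTuple):
    path: str  # hypothetical file name
    size: int  # hypothetical size in bytes


def order_by_size(files: List[FileEntry]) -> List[FileEntry]:
    # sorted() is stable, so equal-size files keep their incoming order.
    return sorted(files, key=lambda f: f.size)


def order_by_size_then_path(files: List[FileEntry]) -> List[FileEntry]:
    # The path breaks ties, so the result no longer depends on listing order.
    return sorted(files, key=lambda f: (f.size, f.path))


a = FileEntry("part-00000.parquet", 4096)
b = FileEntry("part-00001.parquet", 4096)

# Two listings of the same files, as two runs might see them.
listing_1 = [a, b]
listing_2 = [b, a]

size_only_1 = [f.path for f in order_by_size(listing_1)]
size_only_2 = [f.path for f in order_by_size(listing_2)]
tie_broken_1 = [f.path for f in order_by_size_then_path(listing_1)]
tie_broken_2 = [f.path for f in order_by_size_then_path(listing_2)]

print(size_only_1 == size_only_2)    # False: order follows the listing
print(tie_broken_1 == tie_broken_2)  # True: order is deterministic
```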

Contributor

jlowe commented Dec 27, 2023

Note that I can reproduce these issues by forcing two threads, e.g.:

SPARK_SUBMIT_FLAGS="--master local[2]" TEST_PARALLEL=0 SPARK_HOME=/home/jlowe/spark-3.4.1-bin-hadoop3/ DATAGEN_SEED=1702341898 PYSP_TEST_spark_jars_packages=io.delta:delta-core_2.12:2.4.0 PYSP_TEST_spark_sql_extensions=io.delta.sql.DeltaSparkSessionExtension PYSP_TEST_spark_sql_catalog_spark__catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog integration_tests/run_pyspark_from_build.sh -k "test_delta_update_dataframe and None-False" --delta_lake --debug_tmp_path
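The `PYSP_TEST_*` variables in the command above appear to encode Spark configuration keys. The sketch below assumes, based only on the names in that command, that a single underscore stands for a dot and a double underscore stands for a literal underscore. The function `env_to_conf_key` is hypothetical and is not taken from the test framework.

```python
PREFIX = "PYSP_TEST_"


def env_to_conf_key(name: str) -> str:
    """Decode a PYSP_TEST_* variable name into a Spark conf key (assumed convention)."""
    if not name.startswith(PREFIX):
        raise ValueError(f"{name} does not start with {PREFIX}")
    body = name[len(PREFIX):]
    placeholder = "\0"  # cannot appear in an environment variable name
    # Protect double underscores first, then turn single underscores into dots.
    return body.replace("__", placeholder).replace("_", ".").replace(placeholder, "_")


keys = [
    env_to_conf_key(name)
    for name in (
        "PYSP_TEST_spark_jars_packages",
        "PYSP_TEST_spark_sql_extensions",
        "PYSP_TEST_spark_sql_catalog_spark__catalog",
    )
]
for key in keys:
    print(key)
```

Under that assumption the three variables decode to `spark.jars.packages`, `spark.sql.extensions`, and `spark.sql.catalog.spark_catalog`, which are the settings normally used to enable Delta Lake in a Spark session.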

andygrove removed their assignment on Apr 1, 2024
andygrove added the "? - Needs Triage" label on Apr 1, 2024
mattahrens removed the "? - Needs Triage" label on Apr 2, 2024
mattahrens closed this as not planned on Apr 2, 2024