[BUG] `read_json` does not output null list when the input is a list with different data type than the specified schema #17349

ttnghia · 2024-11-16T15:25:36Z

When a list in the input JSON contains elements whose type differs from the specified schema, `read_json` does not output a null for that row of the output lists column. For example:

scala> val df = Seq("""{"a": [{"b": "1", "c": "2"}]}""", """{"a": [123]}""").toDF
df: org.apache.spark.sql.DataFrame = [value: string]

scala> df.show(false)
+-----------------------------+
|value                        |
+-----------------------------+
|{"a": [{"b": "1", "c": "2"}]}|
|{"a": [123]}                 |
+-----------------------------+

scala> df.repartition(1).selectExpr("from_json(value, 'struct<a: array<struct<b: string, c: string>>>')").show()
+----------------+                                                              
|from_json(value)|
+----------------+
|      {[{1, 2}]}|
|        {[null]}|             <========== wrong, should be {null}
+----------------+

In the example above, the list in the second row is `[123]`, while the desired schema is `list<struct<string, string>>`. The output for that row should therefore be a null list, not a list containing a null struct as shown above.

ttnghia · 2024-11-16T15:48:18Z

Another example. Output on GPU:

scala> val df = Seq("""{"a": [{"b": "1", "c": "2"}]}""", """{"a": [123, {"b": "1"}]}""").toDF
df: org.apache.spark.sql.DataFrame = [value: string]

scala> df.show(false)
+-----------------------------+
|value                        |
+-----------------------------+
|{"a": [{"b": "1", "c": "2"}]}|
|{"a": [123, {"b": "1"}]}     |
+-----------------------------+

scala> df.repartition(1).selectExpr("from_json(value, 'struct<a: array<struct<b: string, c: string>>>')").show()
+-------------------+
|   from_json(value)|
+-------------------+
|         {[{1, 2}]}|
|{[null, {1, null}]}|              <================ wrong
+-------------------+

The correct output, as produced by Spark on CPU, is again a null list for the second row:

+----------------+
|from_json(value)|
+----------------+
|      {[{1, 2}]}|
|          {null}|
+----------------+

karthikeyann · 2024-11-19T01:03:07Z

Input, for schema `struct<c2: array<struct<b: string, c: string>>>`:

      {"c2": []}
      {"c2": [{}]}
      {"c2": [[]]}
      {"c2": [[], {}]}
      {"c2": [[123], {"b": "1"}]}
      {"c2": [{"x": "y"}, {"b": "1"}]}
      {}

Spark Output:

[]
[{null, null}]
null
null
null
[{null, null}, {1, null}] 
null
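
The rule these rows suggest can be written down as a small reference model. The sketch below is plain Python, not Spark's or libcudf's implementation. The helper names (`coerce_struct`, `coerce_list`, `expected_column`) are hypothetical, and the rule itself is an assumption inferred only from the inputs and Spark outputs listed above: a list is kept only if every element is a JSON object, and otherwise the whole list becomes null. Cases the thread does not cover, such as a JSON `null` element or non-string field values, are not modeled.

```python
import json

# Field names of the element struct from the thread: struct<b: string, c: string>.
FIELDS = ("b", "c")


def coerce_struct(obj):
    """Project a JSON object onto FIELDS; absent fields become None, extras are dropped."""
    # Scalar type coercion is out of scope here: values are passed through as-is.
    return {name: obj.get(name) for name in FIELDS}


def coerce_list(value):
    """Return a list of structs, or None when the value does not fit array<struct>."""
    if not isinstance(value, list):
        return None  # missing key or non-list value -> null list
    if not all(isinstance(item, dict) for item in value):
        return None  # assumed rule: one mismatched element nulls the whole list
    return [coerce_struct(item) for item in value]


def expected_column(line, key):
    """Model the expected value of column `key` for one JSON line."""
    row = json.loads(line)
    return coerce_list(row.get(key))


if __name__ == "__main__":
    rows = [
        '{"c2": []}',
        '{"c2": [{}]}',
        '{"c2": [[]]}',
        '{"c2": [[], {}]}',
        '{"c2": [[123], {"b": "1"}]}',
        '{"c2": [{"x": "y"}, {"b": "1"}]}',
        '{}',
    ]
    for line in rows:
        print(line, "->", expected_column(line, "c2"))
```

Under this model the `{"a": [123]}` and `{"a": [123, {"b": "1"}]}` rows from the earlier comments also map to a null list, which matches the Spark CPU result reported there.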

ttnghia added the bug Something isn't working label Nov 16, 2024

ttnghia added this to libcudf Nov 16, 2024

ttnghia moved this to Burndown in libcudf Nov 16, 2024

karthikeyann self-assigned this Nov 18, 2024

GregoryKimball removed the status in libcudf Nov 20, 2024

This was referenced Nov 22, 2024

[FEA] enable from_json and json scan by default NVIDIA/spark-rapids#11630

Closed

Fix all null list column with missing child column in JSON reader #17348

Merged

rapids-bot bot closed this as completed in #17348 Dec 6, 2024

karthikeyann added this to the Nested JSON reader milestone Dec 6, 2024
