[BUG] read_json does not output null list when the input is a list with different data type than the specified schema #17349

Closed
Tracked by #11630
ttnghia opened this issue Nov 16, 2024 · 2 comments · Fixed by #17348

ttnghia commented Nov 16, 2024

When the input JSON contains a list whose elements do not match the element type in the specified schema, read_json does not return a null for that row of the output list column. For example:

scala> val df = Seq("""{"a": [{"b": "1", "c": "2"}]}""", """{"a": [123]}""").toDF
df: org.apache.spark.sql.DataFrame = [value: string]

scala> df.show(false)
+-----------------------------+
|value                        |
+-----------------------------+
|{"a": [{"b": "1", "c": "2"}]}|
|{"a": [123]}                 |
+-----------------------------+

scala> df.repartition(1).selectExpr("from_json(value, 'struct<a: array<struct<b: string, c: string>>>')").show()
+----------------+                                                              
|from_json(value)|
+----------------+
|      {[{1, 2}]}|
|        {[null]}|             <========== wrong, should be {null}
+----------------+

In the example above, the list in the second row is [123], while the desired schema is list<struct<string, string>>. The output for that row should therefore be a null list, not a list containing a null struct.

@ttnghia ttnghia added the bug Something isn't working label Nov 16, 2024

ttnghia commented Nov 16, 2024

Another example, with the output from the GPU:

scala> val df = Seq("""{"a": [{"b": "1", "c": "2"}]}""", """{"a": [123, {"b": "1"}]}""").toDF
df: org.apache.spark.sql.DataFrame = [value: string]

scala> df.show(false)
+-----------------------------+
|value                        |
+-----------------------------+
|{"a": [{"b": "1", "c": "2"}]}|
|{"a": [123, {"b": "1"}]}     |
+-----------------------------+

scala> df.repartition(1).selectExpr("from_json(value, 'struct<a: array<struct<b: string, c: string>>>')").show()
+-------------------+
|   from_json(value)|
+-------------------+
|         {[{1, 2}]}|
|{[null, {1, null}]}|              <================ wrong
+-------------------+

The correct output from Spark CPU is also a null list:

+----------------+
|from_json(value)|
+----------------+
|      {[{1, 2}]}|
|          {null}|
+----------------+

@ttnghia ttnghia added this to libcudf Nov 16, 2024
@ttnghia ttnghia moved this to Burndown in libcudf Nov 16, 2024
@karthikeyann karthikeyann self-assigned this Nov 18, 2024

karthikeyann commented

Input, for schema struct<c2: array<struct<b: string, c: string>>>:

      {"c2": []}
      {"c2": [{}]}
      {"c2": [[]]}
      {"c2": [[], {}]}
      {"c2": [[123], {"b": "1"}]}
      {"c2": [{"x": "y"}, {"b": "1"}]}
      {}

Spark Output:

[]
[{null, null}]
null
null
null
[{null, null}, {1, null}] 
null
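
The seven rows above are consistent with a simple rule: the list is null when the field is missing or when any element is not a JSON object; otherwise each object becomes a struct, with absent fields set to null. Below is a minimal Scala sketch of that rule. It is only a model inferred from the listed outputs, not Spark's or libcudf's actual implementation. The `Json` types, `Row`, and `ListNullRule.readList` are hypothetical names introduced for illustration, and the sketch assumes field values are strings (anything else is treated as null here).

```scala
// Tiny JSON value model (hypothetical, for illustration only).
sealed trait Json
final case class JStr(value: String) extends Json
final case class JNum(raw: String) extends Json
final case class JArr(items: List[Json]) extends Json
final case class JObj(fields: Map[String, Json]) extends Json

// One output struct<b: string, c: string>; None models a null field.
final case class Row(b: Option[String], c: Option[String])

object ListNullRule {
  // Assumption: only string values are modeled; any other value becomes null.
  private def stringField(obj: JObj, name: String): Option[String] =
    obj.fields.get(name).collect { case JStr(s) => s }

  // Returns None (a null list) when the field is missing, is not a list,
  // or holds at least one element that is not an object.
  def readList(record: JObj, field: String): Option[List[Row]] =
    record.fields.get(field) match {
      case Some(JArr(items)) if items.forall(_.isInstanceOf[JObj]) =>
        Some(items.collect {
          case o: JObj => Row(stringField(o, "b"), stringField(o, "c"))
        })
      case _ => None
    }
}
```

With this model, `{"c2": [[123], {"b": "1"}]}` maps to `None` because its first element is a list rather than an object, while `{"c2": [{"x": "y"}, {"b": "1"}]}` maps to two structs because every element is an object, even though the first one has no matching fields.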
