
[FEA] Finish the JSON test matrix #10491

Open
6 of 41 tasks
revans2 opened this issue Feb 23, 2024 · 3 comments
Labels
test Only impacts tests

Comments

@revans2
Collaborator

revans2 commented Feb 23, 2024

Is your feature request related to a problem? Please describe.

#10490 adds the beginnings of a test matrix for all JSON processing. It is a good start, but it is not done. We are still missing a number of things.

Confs/Options

  • maxNestingDepth - added in 3.5.0, this sets the maximum nesting depth of JSON before it is considered invalid. The default value is 1000, but in my tests with from_json it started to fail at a depth of 255. I don't know what the limit is for get_json_object. I also don't know if we want to go for the full 1000, or if we are okay with a smaller depth. (A small probe sketch follows this list.)
  • maxNumLen - also added in 3.5.0, this is the maximum length of a number. It looks like we go way beyond the default of 1000, so we are probably okay, but it would still be good to have some kind of test.
  • maxStringLen - also added in 3.5.0. The default here is 20,000,000 (and I think that is chars, not bytes). In my tests we go way beyond this, but it would be good to add some tests here too.
  • locale - locale appears to impact date, timestamp, and decimal parsing. We have tests written and issues filed around decimal. We have not done the same for date/timestamp because there are issues with the date/timestamp format settings, which when fixed should hopefully mask this (it really only impacts formats that we do not support). Tests that we fall back would still be good, though.
  • parseMode - we only support permissive mode, but we should have unified tests for this.
  • columnNameOfCorruptRecord - this is for capturing input that didn't parse properly. It only matters if a column with this name shows up in the read schema. We have some tests, but we should have unified tests.
  • timeZone - we have some tests, but we should look at putting them in the test matrix.
  • dateFormat - again we have some tests, but they are not unified.
  • timestampFormat - again we have some tests, but they are not unified.
  • timestampNTZFormat - again we have some tests, but they are not unified.
  • enableDateTimeParsingFallback - this also needs to be combined with testing of the LEGACY parsing configs.
  • lineSep/encoding - we have very little testing here, but for the most part it is ignored except for ScanJson, so it is probably okay.
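
As a reference point for the maxNestingDepth item above, here is a minimal pyspark sketch (plain Spark, not the plugin's integration-test harness; the depths and the shallow read schema are just illustrative choices) of the kind of probe used to find where from_json stops parsing:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col

spark = SparkSession.builder.getOrCreate()

def nested_json(depth):
    # Builds {"a":{"a": ... {"a":1} ... }} nested `depth` levels deep.
    return '{"a":' * depth + "1" + "}" * depth

# In PERMISSIVE mode an unparseable record comes back as nulls, so a null result
# at a given depth tells us the parser gave up there.
for depth in (10, 100, 255, 1000):
    df = spark.createDataFrame([(nested_json(depth),)], "json_str string")
    parsed = df.select(from_json(col("json_str"), "a string").alias("p")).head().p
    print(depth, parsed)
```

The same shape of probe should work for get_json_object by swapping the select expression.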

Input Data Types/Formats:

  • dictionary of double quoted strings
  • dictionary of single quoted strings
  • dictionary of ints
  • array of double quoted strings i.e. {"data":[...]}
  • array of single quoted strings
  • array of integer format
  • top level array of double quoted strings i.e. [...]
  • top level array of single quoted strings
  • top level array of integer format
  • garbage at the end of the input lines
  • date formatted string for different locales
  • timestamp formatted string for different locales
  • extra white space in between fields
  • leading zeros on numbers
  • repeated columns (Add tests for repeated JSON columns/keys #11362)
  • missing columns
  • \r\n whitespace in strings
  • escaped chars in strings (need to test chars that can be encoded 3 ways: a \uXXXX escape, a backslash escape, and the regular character); a small sketch of generating a few of these inputs follows this list
  • deeply nested/mixed data. These should include things that would map to the various map types
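
To make a few of the formats above concrete, here is a hedged pyspark-only sketch (the file path, column names, and specific lines are illustrative, and this is not the plugin's data generator) that writes a handful of these shapes to a file and reads them back with an explicit schema so the corrupt-record handling is visible:

```python
import os
import tempfile
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

lines = [
    '{"data": "plain double quoted"}',
    "{'data': 'single quoted'}",                  # allowSingleQuotes is on by default
    '{"data": 007}',                              # leading zeros on numbers
    '{"data": "\\u0041 vs \\" vs A"}',            # \uXXXX escape, backslash escape, regular char
    '{"data": "trailing garbage"} extra tokens',  # garbage at the end of the input line
]

path = os.path.join(tempfile.mkdtemp(), "inputs.json")
with open(path, "w") as f:
    f.write("\n".join(lines))

# Explicit read schema including the default corrupt-record column name.
df = spark.read.schema("data string, _corrupt_record string").json(path)
df.show(truncate=False)
```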

Data Types (note that these only apply to from_json and ScanJson because the others don't take a full read schema):

All nested types should be tested both at the top level and as a child (data) column. If we don't support those types yet, then we should verify that we fall back to the CPU as expected. A short sketch of what that looks like follows the list.

  • date
  • timestamp
  • map<string,string>
  • map<string,int>
  • map<string,decimal(38,0)>
  • array<string>
  • array<long>
  • array<decimal(38,0)>
  • map<string,map<string,string>>
  • map<string,array<string>>
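
A minimal sketch, again assuming plain pyspark and illustrative column names, of exercising one of the nested types from the list both as the entire read schema and as a child data column through from_json:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col

spark = SparkSession.builder.getOrCreate()

top_level_json = '{"k1": "v1", "k2": "v2"}'
child_json = '{"data": {"k1": "v1", "k2": "v2"}}'

df = spark.createDataFrame([(top_level_json, child_json)], "top string, child string")

result = df.select(
    # map<string,string> as the entire read schema
    from_json(col("top"), "map<string,string>").alias("top_map"),
    # the same map nested under a `data` column
    from_json(col("child"), "data map<string,string>").alias("nested_map"),
)
result.show(truncate=False)
```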
@jihoonson
Collaborator

@revans2, this is a great list. I want to clarify a couple of things, though. As for the local config, did you mean locale? I don't see a local config in JSONOptions. I don't see corruptedColumnName in either the Spark or the plugin repo. Where do you see this config?

@revans2
Collaborator Author

revans2 commented Feb 6, 2025

@jihoonson yes, I updated local to be locale.

corruptedColumnName was me misremembering the proper config name; the actual option is columnNameOfCorruptRecord: https://github.com/apache/spark/blob/e89b19f0d162ace3cda0fc4d05de0771216b69ad/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JSONOptions.scala#L287

Thanks for catching my errors. I think I have updated this to be correct now.

@jihoonson
Collaborator

@revans2 I see. Thank you for updating those config names!
