Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

Improve compatibility when reading timestamps from JSON and CSV sources #4938

Merged
merged 23 commits into from
Mar 24, 2022

Conversation

andygrove
Copy link
Contributor

@andygrove andygrove commented Mar 11, 2022

Signed-off-by: Andy Grove andygrove@nvidia.com

Closes #4863 and closes #123

Improves timestamp support in JSON and CSV to match Spark, by reading from cuDF as strings and then converting to timestamps in the plugin.

There is one follow-on issues:

Status

  • Basic functionality in place
  • Fix regression in CsvScanSuite
  • Improve code documentation
  • Review/update compatibility guide
  • Review tests

@andygrove andygrove added this to the Feb 28 - Mar 18 milestone Mar 11, 2022
@andygrove andygrove self-assigned this Mar 11, 2022
@andygrove
Copy link
Contributor Author

build

@andygrove
Copy link
Contributor Author

build

@andygrove andygrove changed the title WIP: Improve compatibility when reading timestamps from JSON and CSV sources Improve compatibility when reading timestamps from JSON and CSV sources Mar 14, 2022
@andygrove andygrove marked this pull request as ready for review March 14, 2022 20:04
@andygrove
Copy link
Contributor Author

build

2 similar comments
@andygrove
Copy link
Contributor Author

build

@andygrove
Copy link
Contributor Author

build

Copy link
Collaborator

@revans2 revans2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the code is okay, but it is really complicated and a lot of assumptions that only a very specific set of formats are allowed. I keep thinking that there might be a simpler way to make it more data driven with look up tables instead of transpiling everything. But then I see we convert the format to both regular expressions and to the CUDF format and I just don't know if the lookup table will actually be small/less error prone or not.


// fix timestamps that have milliseconds but no microseconds
// example ".296" => ".296000"
val placeholder = "@@@"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a comment about why @@@ is an okay sequence to use here an will never interfere with a real timestamp.

@andygrove
Copy link
Contributor Author

build

revans2
revans2 previously approved these changes Mar 18, 2022
@andygrove
Copy link
Contributor Author

build

1 similar comment
@andygrove
Copy link
Contributor Author

build

@andygrove
Copy link
Contributor Author

build

@andygrove
Copy link
Contributor Author

@revans2 could you re-approve this one, please. I had to upmerge since your last approval.

@revans2 revans2 merged commit ceb3558 into NVIDIA:branch-22.04 Mar 24, 2022
@sameerz sameerz added the feature request New feature or request label Mar 25, 2022
@andygrove andygrove deleted the json-timestamp branch March 30, 2022 14:12
# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
feature request New feature or request
Projects
None yet
3 participants