[BUG] cudf::io::json::detail::normalize_single_quotes outputs incorrect result when the input has \n character #17261

Closed
Tracked by #11630
ttnghia opened this issue Nov 7, 2024 · 0 comments · Fixed by #17266
Labels: bug, cuIO

ttnghia commented Nov 7, 2024

Reproducible with this input:

{\"a\": \"1\n2\"}
{\'a\': 12}

The tokens generated by cudf::io::json::detail::get_token_stream, after preprocessing the input with cudf::io::json::detail::normalize_single_quotes, are:

Input:
{"a": "1
2"}{'a': 12}
Tokens:
0, 4, 6, 7, 8, 9, 5, 1, 0, 1
Token indices:
0, 1, 1, 3, 6, 10, 11, 11, 0, 0

If the \n character is removed, the output is correct:

Input:
{"a": "12"}{"a": 12}
Tokens:
0, 4, 6, 7, 8, 9, 5, 1, 0, 4, 6, 7, 10, 11, 5, 1
Token indices:
0, 1, 1, 3, 6, 9, 10, 10, 12, 13, 13, 15, 18, 20, 20, 20

Note:

  • Line delimiter between JSON objects is \0, not \n.
  • allow_unquoted_control is set to true.
  • Token indices are the positions of the tokens in the input string.
  • Token numbers are the static_cast values of enum token_t, declared as:
    enum token_t : PdaTokenT {
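To make the token dumps above easier to read, here is a minimal C++ sketch that builds a two-object input with the \0 delimiter and labels a token value with its position. The names in kTokenNames are an assumption about the order of token_t: they are consistent with the correct token sequence shown above, but they should be checked against the actual enum. make_input, token_name and describe_token are hypothetical helpers, not cudf APIs.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

// ASSUMPTION: names for token values 0..11, in the order they are believed to
// appear in token_t. Verify against the real enum before relying on them.
constexpr std::array<std::string_view, 12> kTokenNames{
  "StructBegin",       "StructEnd",       "ListBegin",      "ListEnd",
  "StructMemberBegin", "StructMemberEnd", "FieldNameBegin", "FieldNameEnd",
  "StringBegin",       "StringEnd",       "ValueBegin",     "ValueEnd"};

// Maps a token value to its assumed name; unknown values are reported as such.
std::string_view token_name(int token)
{
  if (token < 0 || static_cast<std::size_t>(token) >= kTokenNames.size()) { return "Unknown"; }
  return kTokenNames[static_cast<std::size_t>(token)];
}

// Joins two JSON objects with a '\0' delimiter, as described in the notes.
std::string make_input(std::string_view first, std::string_view second)
{
  std::string input(first);
  input.push_back('\0');
  input.append(second);
  return input;
}

// Formats one (token, index) pair, for example "ValueBegin@18".
std::string describe_token(int token, std::size_t index)
{
  std::string text(token_name(token));
  text += '@';
  text += std::to_string(index);
  return text;
}
```

With the correct input, index 11 holds the \0 delimiter and the second object starts at index 12, which matches the token indices listed for the working case.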

I suspect this is caused by the leftover \n character in the following symbol groups, but I'm not 100% sure:

std::array<std::vector<SymbolT>, NUM_SYMBOL_GROUPS - 1> const qna_sgs{
  {{'\"'}, {'\''}, {'\\'}, {'\n'}}};
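One way this suspicion could play out is sketched below with a toy quote normalizer. It is not cudf's finite-state transducer; it is a simplified model that assumes the delimiter character resets the state to "outside a string". normalize_quotes_sketch and QuoteState are hypothetical names. When the reset character is \n, the newline inside "1\n2" ends the string state early, the following " opens a new string, and the single quotes in the second object are copied unchanged.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

enum class QuoteState { Outside, InDouble, InSingle };

// Hypothetical, simplified normalizer: rewrites '...' strings as "..." strings.
// ASSUMPTION: `line_delimiter` unconditionally resets the state to Outside,
// which is the behavior the issue suspects for '\n'.
std::string normalize_quotes_sketch(std::string_view in, char line_delimiter)
{
  std::string out;
  out.reserve(in.size());
  QuoteState state = QuoteState::Outside;
  for (std::size_t i = 0; i < in.size(); ++i) {
    char const c = in[i];
    if (c == line_delimiter) {
      // Start of a new line: forget any open string.
      state = QuoteState::Outside;
      out.push_back(c);
      continue;
    }
    bool const in_string = state != QuoteState::Outside;
    if (in_string && c == '\\' && i + 1 < in.size()) {
      // Simplification: escaped pairs are copied through unchanged.
      out.push_back(c);
      out.push_back(in[++i]);
      continue;
    }
    switch (state) {
      case QuoteState::Outside:
        if (c == '"') { state = QuoteState::InDouble; }
        if (c == '\'') { state = QuoteState::InSingle; }
        out.push_back(c == '\'' ? '"' : c);  // opening ' becomes "
        break;
      case QuoteState::InDouble:
        if (c == '"') { state = QuoteState::Outside; }
        out.push_back(c);  // single quotes inside "..." are kept as-is
        break;
      case QuoteState::InSingle:
        if (c == '\'') {
          state = QuoteState::Outside;
          out.push_back('"');  // closing ' becomes "
        } else if (c == '"') {
          out += "\\\"";  // a bare " inside '...' must be escaped
        } else {
          out.push_back(c);
        }
        break;
    }
  }
  return out;
}
```

Under these assumptions, calling the sketch on the failing input with '\n' as the delimiter returns the input unchanged, so the second object keeps its single quotes. Calling it with '\0' converts {'a': 12} to {"a": 12}, which matches the normalized input shown for the working case.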

ttnghia added the bug, cuco, and cuIO labels and removed the cuco label on Nov 7, 2024.
ttnghia linked a pull request on Nov 7, 2024 that will close this issue.
rapids-bot closed this as completed in 5cbdcd0 on Nov 9, 2024.