[FEA] Implement merged 'mega' kernel to parse leaf-level columns in JSON reader #16965

shrshi · 2024-10-01T17:17:46Z

Is your feature request related to a problem? Please describe.
Inferring types and parsing leaf-level columns in the JSON reader currently launches a separate kernel for each column.

We can improve performance by gathering the offsets for all columns contiguously and then parsing them in a single kernel.
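The gather-then-parse idea can be sketched on the host in plain C++, with one loop iteration standing in for one GPU thread. This is only an illustration under stated assumptions, not the reader's actual implementation: `ValueRange`, `FlatOffsets`, `gather_offsets`, and `parse_all` are hypothetical names, and every column is assumed to hold integers so that one parse routine suffices.

```cpp
#include <cstddef>
#include <cstdlib>
#include <string>
#include <vector>

// Hypothetical description of one leaf value: where it starts in the
// JSON buffer and how many characters it spans.
struct ValueRange {
  std::size_t begin;
  std::size_t length;
};

// Hypothetical flattened layout: the ranges of all columns stored back to
// back, with col_starts[c] holding the index of column c's first range.
struct FlatOffsets {
  std::vector<ValueRange> ranges;
  std::vector<std::size_t> col_starts;  // size is num_columns + 1
};

// Gather per-column ranges into one contiguous array (a prefix sum over
// the column sizes gives the column boundaries).
FlatOffsets gather_offsets(const std::vector<std::vector<ValueRange>>& per_column) {
  FlatOffsets flat;
  flat.col_starts.push_back(0);
  for (const auto& col : per_column) {
    flat.ranges.insert(flat.ranges.end(), col.begin(), col.end());
    flat.col_starts.push_back(flat.ranges.size());
  }
  return flat;
}

// Stand-in for the single merged kernel: each iteration plays the role of
// one thread parsing one value. Assumes every column is an integer column.
std::vector<long> parse_all(const std::string& buffer, const FlatOffsets& flat) {
  std::vector<long> out(flat.ranges.size());
  for (std::size_t i = 0; i < flat.ranges.size(); ++i) {
    const ValueRange& r = flat.ranges[i];
    out[i] = std::strtol(buffer.substr(r.begin, r.length).c_str(), nullptr, 10);
  }
  return out;
}
```

In a real kernel the loop body would be the per-thread work, and a per-column type tag would select the parse routine instead of the single integer path assumed here.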

Describe the solution you'd like
Partitioning strategies to consider:

For parsing, 1 thread per offset.
1 warp per column (32 offsets per warp). Consecutive warps will most likely access nearby memory and probably benefit from coalescing.
Fixed number of characters per thread. This requires more careful thought about how to distribute work depending on column type.
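The index arithmetic behind the first two strategies can be sketched in host-side C++ as follows. This is one possible reading of the list above, not a settled design: `column_of`, `rows_for_lane`, and `kWarpSize` are hypothetical names, and `col_starts` is assumed to be a prefix sum of per-column offset counts (one entry per column plus a final total).

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

constexpr std::size_t kWarpSize = 32;  // assumed warp width

// Strategy 1 (one thread per offset): a thread with flat index flat_idx
// finds its column by binary search over the prefix sums in col_starts.
// Requires flat_idx < col_starts.back().
std::size_t column_of(std::size_t flat_idx, const std::vector<std::size_t>& col_starts) {
  auto it = std::upper_bound(col_starts.begin(), col_starts.end(), flat_idx);
  return static_cast<std::size_t>(it - col_starts.begin()) - 1;
}

// Strategy 2 (one warp per column): lane `lane` of the warp assigned to a
// column handles rows lane, lane + 32, lane + 64, ... so that the 32 lanes
// of a warp touch 32 consecutive offsets on each step.
std::vector<std::size_t> rows_for_lane(std::size_t lane, std::size_t column_size) {
  std::vector<std::size_t> rows;
  for (std::size_t r = lane; r < column_size; r += kWarpSize) {
    rows.push_back(r);
  }
  return rows;
}
```

With strategy 1, empty columns are skipped naturally because `std::upper_bound` lands past any repeated prefix-sum entries. With strategy 2, lanes whose index exceeds the column size simply receive no rows, which hints at the load imbalance short columns could cause.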

shrshi added the feature request New feature or request label Oct 1, 2024

shrshi self-assigned this Oct 1, 2024

karthikeyann mentioned this issue Oct 21, 2024

JSON spark reader plan for 24.12 #17138

Open

karthikeyann added this to the Nested JSON reader milestone Nov 12, 2024

GregoryKimball mentioned this issue Jan 10, 2025

[FEA] JSON reader performance projects #17718

Open
