[FEA] Explore ways to not use HadoopFileLinesReader for CSV parsing #6

revans2 · 2020-05-28T17:33:02Z

Is your feature request related to a problem? Please describe.
When parsing CSV, the CPU currently reads through the data using the HadoopFileLinesReader and replaces the line endings. From a performance standpoint, it would be great to do a block copy of most of the data and skip the line-ending translation. This would require the cudf CSV reader to support line endings that are '\r', '\n', or '\r\n'. This is not a simple task, but it could reduce CPU utilization significantly.
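To make the requirement concrete, here is a minimal sketch of a scanner that accepts all three terminators ('\r', '\n', and '\r\n') and only records line boundaries, leaving the buffer untouched so it could still be block-copied. The function name `line_spans` is hypothetical; this is not the cudf API and not how HadoopFileLinesReader is implemented. It also assumes no quoted fields contain embedded line breaks, which a real CSV reader would have to handle.

```python
from typing import List, Tuple


def line_spans(buf: bytes) -> List[Tuple[int, int]]:
    """Return (start, end) byte offsets of each line, excluding terminators.

    Hypothetical helper for illustration only. Treats '\\r', '\\n', and
    '\\r\\n' each as a single line terminator. The input buffer is never
    modified or copied. Assumes no quoted fields with embedded newlines.
    """
    spans = []
    start = 0
    i = 0
    n = len(buf)
    while i < n:
        b = buf[i]
        if b == 0x0A:  # '\n'
            spans.append((start, i))
            i += 1
            start = i
        elif b == 0x0D:  # '\r', optionally followed by '\n'
            spans.append((start, i))
            i += 1
            if i < n and buf[i] == 0x0A:
                i += 1  # consume the '\n' of a '\r\n' pair
            start = i
        else:
            i += 1
    if start < n:
        # Final line with no trailing terminator.
        spans.append((start, n))
    return spans
```

With boundaries expressed as offsets, the raw bytes can be handed off in one piece, and only the parser needs to know about the three terminator styles.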

revans2 · 2020-10-21T12:44:49Z

I filed rapidsai/cudf#6572 in cudf to try and support this.

* A hacky approach for regexpr rewrite
* Use contains instead for that case
* add config to switch
* Rewrite some rlike expression to StartsWith/EndsWith/Contains
* clean up
* wip
* wip
* add tests and config

Signed-off-by: Haoyang Li <haoyangl@nvidia.com>


revans2 added the "feature request", "? - Needs Triage", "SQL", and "performance" labels May 28, 2020

sameerz changed the title ~~[FEA] explore ways not use HadoopFileLinesReader for CSV parseing~~ [FEA] Explore ways to not use HadoopFileLinesReader for CSV parsing Oct 13, 2020

sameerz removed the "? - Needs Triage" label Oct 20, 2020

wjxiz1992 pushed a commit to wjxiz1992/spark-rapids that referenced this issue Oct 29, 2020

Merge pull request NVIDIA#6 from firestarman/rel-ver

9994d52

Update scala app version to 0.2.2

revans2 mentioned this issue Mar 2, 2022

The improvement of GPU execution efficiency encounters a bottleneck ？ #4877

Open

mattahrens added the "P1" (nice to have for release) label Apr 27, 2022

revans2 mentioned this issue Oct 27, 2022

[BUG] Fix CSV Parsing #2063

Open

38 tasks

gerashegalov pushed a commit to gerashegalov/spark-rapids that referenced this issue Nov 18, 2022

Merge pull request NVIDIA#6 from amahussein/rapids-db113-parquetTimes…

8d1d61b

…tampNTZEnabled Fix errors caused by 340+ not working on DB

sperlingxx pushed a commit to sperlingxx/spark-rapids that referenced this issue May 16, 2024

Fix a test error related to SerializedTableColumn (NVIDIA#6)

4e64a7a

Signed-off-by: Firestarman <firestarmanllc@gmail.com>
