Profiling tool can miss datasources when they are GPU reads #4804

tgravescs · 2022-02-16T16:39:19Z

This fixes the profiling tool so that it properly reports GPU-based datasources (csv, parquet, json, orc) when it looks at the event logs from a run with the rapids plugin enabled. Tested with both dsv1 and dsv2 versions. This also changes JDBC to report the other fields like format, location, etc.

It also fixes a bug with the CSV output files where a field could contain commas even though our delimiter is a comma. If you read such a CSV file back into Spark, the field gets truncated. Specifically, this happens with the file schema, whose format is: name:Type,name2:Type2,... For this we took the same logic used by the qualification tool and just replace the comma in any strings.

example output:

Data Source Information:
+--------+-----+---------------+--------------------------------------------------------------------------------------------------------------------+-------------+---------------------------------------------------------------------------------------------+
|appIndex|sqlID|format         |location                                                                                                            |pushedFilters|schema                                                                                       |
+--------+-----+---------------+--------------------------------------------------------------------------------------------------------------------+-------------+---------------------------------------------------------------------------------------------+
|1       |0    |Text           |InMemoryFileIndex[file:/home/tgraves/workspace/spark-rapids-another/integration_tests/src/test/resources/people.csv]|[]           |value:string                                                                                 |
|1       |1    |gpucsv(GPU)    |InMemoryFileIndex[file:/home/tgraves/workspace/spark-rapids-another/integration_tests/src/test/re...                |unknown      |_c0:string,_c1:string,_c2:string                                                             |
|1       |2    |gpujson(GPU)   |InMemoryFileIndex[file:/home/tgraves/workspace/spark-rapids-another/integration_tests/target/test...                |unknown      |number:double                                                                                |
|1       |3    |gpuparquet(GPU)|InMemoryFileIndex[file:/home/tgraves/workspace/spark-rapids-another/integration_tests/target/test...                |[]           |loan_id:bigint,orig_channel:string,seller_name:string,orig_interest_rate:double,orig_upb:i...|
|1       |4    |gpuorc(GPU)    |InMemoryFileIndex[file:/home/tgraves/workspace/spark-rapids-another/tests/src/test/resources/file...                |[]           |loan_id:bigint,orig_channel:int,orig_interest_rate:double,orig_upb:int,orig_loan_term:int,...|
|2       |0    |Text           |InMemoryFileIndex[file:/home/user1/workspace/spark-rapids-another/integration_tests/src/test/resources/people.csv]  |[]           |value:string                                                                                 |
|2       |1    |CSV(GPU)       |InMemoryFileIndex[file:/home/user1/workspace/spark-rapids-another/integration_tests/src/test/re...                  |[]           |_c0:string,_c1:string,_c2:string                                                             |
|2       |2    |JSON(GPU)      |InMemoryFileIndex[file:/home/user1/workspace/spark-rapids-another/integration_tests/target/test...                  |[]           |number:double                                                                                |
|2       |3    |ORC(GPU)       |InMemoryFileIndex[file:/home/user1/workspace/spark-rapids-another/tests/src/test/resources/file...                  |[]           |loan_id:bigint,orig_channel:int,orig_interest_rate:double,orig_upb:int,orig_loan_term:int,...|
|2       |4    |Parquet(GPU)   |InMemoryFileIndex[file:/home/user1/workspace/spark-rapids-another/integration_tests/target/test...                  |[]           |loan_id:bigint,monthly_reporting_period:string,servicer:string,interest_rate:double,curren...|
+--------+-----+---------------+--------------------------------------------------------------------------------------------------------------------+-------------+---------------------------------------------------------------------------------------------+
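The comma handling described above for the CSV output can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the object name `CsvFieldSanitizer`, its method names, and the choice of `;` as the replacement character are all assumptions made for the example.

```scala
// Hypothetical helper; names and the ";" replacement are illustrative
// assumptions, not taken from the PR itself.
object CsvFieldSanitizer {
  // Replace the delimiter inside a single field value so the value
  // stays in one column when the row is written as delimited text.
  def sanitize(
      field: String,
      delimiter: String = ",",
      replacement: String = ";"): String =
    field.replace(delimiter, replacement)

  // Build one delimited row from already-stringified fields,
  // sanitizing each field before joining.
  def toCsvRow(fields: Seq[String], delimiter: String = ","): String =
    fields.map(f => sanitize(f, delimiter)).mkString(delimiter)
}
```

With this sketch, a schema string such as `a:int,b:string` is written as a single field, so a row built from three values still splits into three columns when read back.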

Signed-off-by: Thomas Graves <tgraves@nvidia.com>

tgravescs · 2022-02-16T16:43:30Z

Actually, it looks like it missed removing the Location and pushedFilters tags for some GPU scans. I'll fix that.

tgravescs · 2022-02-16T18:28:30Z

build

tools/src/main/scala/org/apache/spark/sql/rapids/tool/AppBase.scala

nartal1 · 2022-02-16T22:42:30Z

Just a question on method name. Rest all LGTM.

tgravescs · 2022-02-16T23:18:55Z

build

tgravescs added 15 commits February 11, 2022 08:27

Change qualification tool to not report decimal as problematic

ce3ec50

Signed-off-by: Thomas Graves <tgraves@nvidia.com>

Update more test results to remove decimal check

f091b66

Print all spark properties as separate table

1e12410

Signed-off-by: Thomas Graves <tgraves@nvidia.com>

Merge remote-tracking branch 'origin/branch-22.04' into sparkpropsProfile

d34a166

Update copyright and add tests

893e23b

capitalize header

65f19ae

copyright

5b17840

Update profile docs

be4e9fb

Fix the Gpu Scans to show up in profiling tool

24103ac

Signed-off-by: Thomas Graves <tgraves@nvidia.com>

Merge remote-tracking branch 'origin/branch-22.04' into profileFixGpuRead

897c55a

Fix and add dvs1 test

602e3a2

Add dsv2 GPU test

2d97d4c

handle profile CSV file replacing comma if in one of the field strings

d2b204f

fix line length

372a012

Merge remote-tracking branch 'origin/branch-22.04' into profileFixGpuRead

abbbd13

tgravescs added the tools label Feb 16, 2022

tgravescs added this to the Feb 14 - Feb 25 milestone Feb 16, 2022

tgravescs self-assigned this Feb 16, 2022

tgravescs marked this pull request as draft February 16, 2022 16:43

for GPU scans and dsv2 remove the Location and filter tags from output

6e569db

tgravescs marked this pull request as ready for review February 16, 2022 18:28

nartal1 reviewed Feb 16, 2022

tools/src/main/scala/org/apache/spark/sql/rapids/tool/AppBase.scala (outdated, resolved)

Update function name

ad189c4

nartal1 approved these changes Feb 16, 2022

tgravescs merged commit 6eae4c1 into NVIDIA:branch-22.04 Feb 17, 2022

tgravescs deleted the profileFixGpuRead branch February 17, 2022 14:23
