Profiling tool can miss datasources when they are GPU reads #4804

Merged: 17 commits into NVIDIA:branch-22.04 from the profileFixGpuRead branch on Feb 17, 2022

@tgravescs tgravescs commented Feb 16, 2022

fixes #4759

This fixes the profiling tool so that it properly reports GPU-based datasources (CSV, Parquet, JSON, ORC) when it reads the event logs from a run with the RAPIDS plugin enabled. Tested with both the DSv1 and DSv2 versions. This also changes JDBC to report the other fields, such as format, location, etc.

It also fixes a bug in the CSV output files where a field could contain commas even though our delimiter is a comma. As a result, reading the CSV file back into Spark truncated that field. Specifically, this happens with the file schema, whose format is: name:Type,name2:Type2,... To fix this, we took the same logic used by the qualification tool and simply replace the commas in any strings.
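
The comma replacement described above can be sketched roughly as follows. This is a minimal Python illustration, not the tool's actual Scala code; the helper names `sanitize_field` and `format_row`, and the choice of `;` as the replacement character, are assumptions made for the example.

```python
def sanitize_field(value, delimiter=",", replacement=";"):
    """Replace the delimiter inside string values; leave other types untouched.

    The ';' replacement is an assumption for this sketch, not necessarily
    what the profiling or qualification tool uses.
    """
    if isinstance(value, str):
        return value.replace(delimiter, replacement)
    return value


def format_row(fields, delimiter=","):
    """Join already-sanitized fields into one delimited output line."""
    return delimiter.join(str(sanitize_field(f, delimiter)) for f in fields)


# A row shaped like the data source table: appIndex, sqlID, format, schema.
row = [1, 3, "gpuparquet(GPU)", "loan_id:bigint,orig_channel:string"]
line = format_row(row)

# The schema no longer contains the delimiter, so the line splits back
# into exactly as many columns as the row had fields.
columns = line.split(",")
```

Without the replacement, the schema string alone would contribute extra columns, which is why a reader splitting on commas would cut the schema short.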

example output:

Data Source Information:
+--------+-----+---------------+--------------------------------------------------------------------------------------------------------------------+-------------+---------------------------------------------------------------------------------------------+
|appIndex|sqlID|format         |location                                                                                                            |pushedFilters|schema                                                                                       |
+--------+-----+---------------+--------------------------------------------------------------------------------------------------------------------+-------------+---------------------------------------------------------------------------------------------+
|1       |0    |Text           |InMemoryFileIndex[file:/home/tgraves/workspace/spark-rapids-another/integration_tests/src/test/resources/people.csv]|[]           |value:string                                                                                 |
|1       |1    |gpucsv(GPU)    |InMemoryFileIndex[file:/home/tgraves/workspace/spark-rapids-another/integration_tests/src/test/re...                |unknown      |_c0:string,_c1:string,_c2:string                                                             |
|1       |2    |gpujson(GPU)   |InMemoryFileIndex[file:/home/tgraves/workspace/spark-rapids-another/integration_tests/target/test...                |unknown      |number:double                                                                                |
|1       |3    |gpuparquet(GPU)|InMemoryFileIndex[file:/home/tgraves/workspace/spark-rapids-another/integration_tests/target/test...                |[]           |loan_id:bigint,orig_channel:string,seller_name:string,orig_interest_rate:double,orig_upb:i...|
|1       |4    |gpuorc(GPU)    |InMemoryFileIndex[file:/home/tgraves/workspace/spark-rapids-another/tests/src/test/resources/file...                |[]           |loan_id:bigint,orig_channel:int,orig_interest_rate:double,orig_upb:int,orig_loan_term:int,...|
|2       |0    |Text           |InMemoryFileIndex[file:/home/user1/workspace/spark-rapids-another/integration_tests/src/test/resources/people.csv]  |[]           |value:string                                                                                 |
|2       |1    |CSV(GPU)       |InMemoryFileIndex[file:/home/user1/workspace/spark-rapids-another/integration_tests/src/test/re...                  |[]           |_c0:string,_c1:string,_c2:string                                                             |
|2       |2    |JSON(GPU)      |InMemoryFileIndex[file:/home/user1/workspace/spark-rapids-another/integration_tests/target/test...                  |[]           |number:double                                                                                |
|2       |3    |ORC(GPU)       |InMemoryFileIndex[file:/home/user1/workspace/spark-rapids-another/tests/src/test/resources/file...                  |[]           |loan_id:bigint,orig_channel:int,orig_interest_rate:double,orig_upb:int,orig_loan_term:int,...|
|2       |4    |Parquet(GPU)   |InMemoryFileIndex[file:/home/user1/workspace/spark-rapids-another/integration_tests/target/test...                  |[]           |loan_id:bigint,monthly_reporting_period:string,servicer:string,interest_rate:double,curren...|
+--------+-----+---------------+--------------------------------------------------------------------------------------------------------------------+-------------+---------------------------------------------------------------------------------------------+

@tgravescs tgravescs added this to the Feb 14 - Feb 25 milestone Feb 16, 2022
@tgravescs tgravescs self-assigned this Feb 16, 2022
@tgravescs
Collaborator Author

Actually, it looks like this missed removing the Location and pushedFilters tags in some GPU cases. I'll fix that.

@tgravescs tgravescs marked this pull request as draft February 16, 2022 16:43
@tgravescs tgravescs marked this pull request as ready for review February 16, 2022 18:28
@tgravescs
Collaborator Author

build

@nartal1
Collaborator

nartal1 commented Feb 16, 2022

Just a question on the method name. The rest LGTM.

@tgravescs
Collaborator Author

build

@tgravescs tgravescs merged commit 6eae4c1 into NVIDIA:branch-22.04 Feb 17, 2022
@tgravescs tgravescs deleted the profileFixGpuRead branch February 17, 2022 14:23
Successfully merging this pull request may close these issues.

[BUG] Profiling tool can miss datasources when they are GPU reads
2 participants