fix: Address multi-GPU issue in engine deserialize #2325

gs-olive · 2023-09-18T23:26:50Z

Description

- Fix an issue where the GPU ID of the device the engine was compiled on took precedence over the GPU ID of the device in the current context
- Clean up the device detection logic

Fixes #2319
Fixes #2269

Type of change

Bug fix (non-breaking change which fixes an issue)

Checklist:

[ x ] My code follows the style guidelines of this project (You can use the linters)
[ x ] I have performed a self-review of my own code
[ x ] I have commented my code, particularly in hard-to-understand areas and hacks
[ x ] I have made corresponding changes to the documentation
[ - ] I have added tests to verify my fix or my feature
- See test_multi_gpu_serde.cpp
[ x ] New and existing unit tests pass locally with my changes
[ x ] I have added the relevant labels to my PR so that relevant reviewers are notified

github-actions

Code conforms to Python style guidelines

github-actions

Code conforms to C++ style guidelines

core/runtime/execute_engine.cpp

narendasan · 2023-09-25T16:22:28Z

core/runtime/runtime.cpp

@@ -10,6 +10,8 @@ namespace runtime {
 c10::optional<RTDevice> get_most_compatible_device(const RTDevice& target_device) {
  LOG_DEBUG("Target Device: " << target_device);
  auto device_options = find_compatible_devices(target_device);
+  auto current_device = get_current_device();


This probably has performance implications, since the get_current_device call has significant perf overhead.

gs-olive · Sep 25, 2023

My understanding was that this line runs only when the engine is instantiated on a new device, not on every inference call. It would run per call only if the is_switch_required branch is entered, and this PR ensures that does not happen.

gs-olive · Sep 26, 2023

The default constructor of RTDevice produces an invalid device. Add an optional argument to get_most_compatible_device instead; if that RTDevice is missing, fall back to a get_current_device call.

- Fix issue where GPU ID of device compiled on was taking precedence over GPU ID of device in context
- Add performance-considerate implementation of current device-getting
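The optional-argument approach discussed in this thread can be sketched roughly as follows. This is a minimal, self-contained illustration and not the code merged in this PR: `Device`, `DeviceQuery`, and `most_compatible_device` are hypothetical stand-ins for RTDevice, the current-device lookup, and get_most_compatible_device, and the selection rule (prefer the device in context, then the compile-time device ID, then any compatible device) is an assumption made for the example.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <string>
#include <vector>

// Hypothetical stand-in for RTDevice, reduced to the fields used here.
struct Device {
  int64_t id = -1;  // -1 marks an invalid (default-constructed) device
  int cc_major = 0; // compute capability, major
  int cc_minor = 0; // compute capability, minor
  std::string name;
};

// Hypothetical stand-in for the (assumed expensive) current-device lookup.
// It counts calls so the cost of each code path is visible.
struct DeviceQuery {
  Device current;
  int calls = 0;
  Device get_current_device() {
    ++calls;
    return current;
  }
};

// Pick a device compatible with `target`.
// If the caller already knows the current device, it passes it in and the
// query is skipped; otherwise the query runs once as a fallback.
std::optional<Device> most_compatible_device(
    const Device& target,
    const std::vector<Device>& available,
    DeviceQuery& query,
    const std::optional<Device>& curr_device = std::nullopt) {
  const Device current =
      curr_device.has_value() ? *curr_device : query.get_current_device();
  assert(current.id >= 0 && "current device must be valid");

  std::optional<Device> best;
  for (const auto& d : available) {
    // Skip devices whose compute capability differs from the target's.
    if (d.cc_major != target.cc_major || d.cc_minor != target.cc_minor) {
      continue;
    }
    // The device in context takes precedence over the compile-time GPU ID.
    if (d.id == current.id) {
      return d;
    }
    // Otherwise keep the first compatible device, upgrading to the
    // compile-time ID if it shows up later in the list.
    if (!best.has_value() || d.id == target.id) {
      best = d;
    }
  }
  return best;
}
```

On the hot path, a caller that has already cached the current device would pass it as the last argument, so the lookup counter stays unchanged; omitting the argument costs exactly one lookup. Using `std::optional` rather than a default-constructed device as the "not provided" sentinel matches the concern raised above that a default-constructed device is not valid.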

narendasan

LGTM

gs-olive requested review from narendasan and peri044 September 18, 2023 23:26

gs-olive self-assigned this Sep 18, 2023

facebook-github-bot added the cla signed label Sep 18, 2023

github-actions bot added the component: core (Issues re: The core compiler) and component: runtime labels

github-actions bot requested a review from bowang007 September 18, 2023 23:27

gs-olive force-pushed the runtime_multi_gpu_fix branch from 4fe18df to 54af72d Compare September 18, 2023 23:29

github-actions bot approved these changes Sep 18, 2023

gs-olive added the WIP (Work is in progress, pull request should not be merged yet) label Sep 18, 2023

gs-olive force-pushed the runtime_multi_gpu_fix branch from 54af72d to 754fd5a Compare September 19, 2023 00:07

gs-olive commented Sep 19, 2023

core/runtime/execute_engine.cpp (outdated, resolved)

github-actions bot approved these changes Sep 19, 2023

gs-olive mentioned this pull request Sep 19, 2023

❓ [Question] How do I load the torch tensorRT model on multiple gpus #2319

Closed

gs-olive added and removed the WIP label Sep 19, 2023

gs-olive force-pushed the runtime_multi_gpu_fix branch 2 times, most recently from 42c6650 to 5a6d1e6 Compare September 19, 2023 00:46

gs-olive removed the WIP label Sep 19, 2023

github-actions bot approved these changes Sep 19, 2023

narendasan reviewed Sep 25, 2023

fix: Address multi-GPU issue in engine deserialize

3df2b65

- Fix issue where GPU ID of device compiled on was taking precedence over GPU ID of device in context
- Add performance-considerate implementation of current device-getting

gs-olive force-pushed the runtime_multi_gpu_fix branch from 5a6d1e6 to 3df2b65 Compare September 26, 2023 23:49

github-actions bot approved these changes Sep 26, 2023

narendasan approved these changes Sep 27, 2023

gs-olive merged commit 338e542 into main Sep 27, 2023

gs-olive deleted the runtime_multi_gpu_fix branch September 27, 2023 19:55

agunapal mentioned this pull request Sep 29, 2023

Debugging extension for VSCode pytorch/serve#2223

Open

gs-olive mentioned this pull request Oct 24, 2023

🐛 [Bug] Loading Torch-TensorRT models (.ts) on multiple GPUs (in TorchServe) #1888

Open
