🐛 Describe the bug
I am following this example to perform inference on TorchServe with a torch-tensorrt model: https://github.com/pytorch/serve/tree/master/examples/torch_tensorrt
I am using a custom container (adapted from an existing TorchServe container) which has the following:
torchserve (version 0.8.2)
torch-model-archiver (version 0.8.2)
tensorrt (version 8.5.3.1)
torch_tensorrt (version 1.4.0)
cuDNN (version 8.9.3.28)
CUDA 11.7
I am running this example on a g5dn.24xlarge EC2 instance. The model is expected to be loaded on all 4 GPUs (with one worker each). Upon starting TorchServe, the model loads successfully and I get the following inference output:
curl: /opt/conda/lib/libcurl.so.4: no version information available (required by curl)
{
"tabby": 0.2723647356033325,
"tiger_cat": 0.13748960196971893,
"Egyptian_cat": 0.04659610986709595,
"lynx": 0.00318642589263618,
"lens_cap": 0.00224193069152534
}
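For reference, the request that produced this output follows the linked example; the exact image path below is an assumption, not necessarily the file I used:
curl http://127.0.0.1:8080/predictions/res50-trt-fp16 -T kitten.jpg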
When I run curl -X GET http://localhost:8081/models/res50-trt-fp16, I get the following output:
From the above output, it appears that a worker is created on each GPU; however, the memory.used field is 5 MB for all GPUs except the worker with id 9003 (which has memory.used = 2152 MB).
Running nvidia-smi leads to the following output:
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.54.03 Driver Version: 535.54.03 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA A10G On | 00000000:00:1B.0 Off | 0 |
| 0% 28C P0 60W / 300W | 2152MiB / 23028MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 1 NVIDIA A10G On | 00000000:00:1C.0 Off | 0 |
| 0% 23C P8 16W / 300W | 5MiB / 23028MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 2 NVIDIA A10G On | 00000000:00:1D.0 Off | 0 |
| 0% 23C P8 16W / 300W | 5MiB / 23028MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 3 NVIDIA A10G On | 00000000:00:1E.0 Off | 0 |
| 0% 22C P8 16W / 300W | 5MiB / 23028MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
+---------------------------------------------------------------------------------------+
As can be seen above, the memory-usage is 5 MB for all GPUs except GPU 0.
Also, when I send an inference request, I see the following in the model server logs:
2023-09-06T06:10:25,671 [WARN ] W-9000-res50-trt-fp16_1.0-stderr MODEL_LOG - WARNING: [Torch-TensorRT] - Input 0 of engine __torch___torchvision_models_resnet_ResNet_trt_engine_ was found to be on cuda:1 but should be on cuda:0. This tensor is being moved by the runtime but for performance considerations, ensure your inputs are all on GPU and open an issue here (https://github.com/pytorch/TensorRT/issues) if this warning persists.
The warning suggests that although the inference request is dispatched to the worker on GPU 1, the input is moved to GPU 0 at runtime, which further suggests that the model is effectively loaded on only one GPU, not all four. Could you please investigate this issue? Thanks!
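In case it helps with triage, these are the checks I can run to correlate workers with GPUs. They use only the standard TorchServe management API and nvidia-smi query flags; nothing below is specific to my container:

# List the model's workers and the gpu/memory fields reported for each
curl -X GET http://localhost:8081/models/res50-trt-fp16

# Show which processes actually hold memory on each GPU
nvidia-smi --query-compute-apps=gpu_bus_id,pid,process_name,used_gpu_memory --format=csv

# Explicitly request 4 workers, in case default scaling is a factor
curl -X PUT "http://localhost:8081/models/res50-trt-fp16?min_worker=4&synchronous=true"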
Error logs
Pasted relevant logs above.
Installation instructions
Provided relevant information above.
Model Packaging
Followed this example: https://github.com/pytorch/serve/tree/master/examples/torch_tensorrt
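For convenience, the packaging and startup steps were along these lines; the command shapes follow the linked example, but the serialized-file name and handler below are written from memory and may not match the example exactly:

torch-model-archiver --model-name res50-trt-fp16 --version 1.0 --serialized-file res50_trt_fp16.pt --handler image_classifier
mkdir -p model_store && mv res50-trt-fp16.mar model_store/
torchserve --start --ncs --model-store model_store --models res50-trt-fp16.mar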
config.properties
No response
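For context only: the config.properties fields that govern GPU and worker assignment would look something like the sketch below. The values are illustrative assumptions, not a file that was actually used for this report:

cat > config.properties <<'EOF'
# Illustrative values only -- not the configuration used for this report
inference_address=http://0.0.0.0:8080
management_address=http://0.0.0.0:8081
number_of_gpu=4
default_workers_per_model=4
EOF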
Versions
Repro instructions
To reproduce, please follow this example: https://github.com/pytorch/serve/tree/master/examples/torch_tensorrt
Possible Solution
No response