The Torch-TensorRT library provides an easy interface for creating TensorRT engines from PyTorch models and running inference with them.
PyTorch 2.0 introduced features such as TorchDynamo, TorchInductor, and `torch.compile`. Below are brief snippets from the PyTorch documentation on these concepts.
`torch.compile` is a PyTorch function introduced in PyTorch 2.x that aims to solve the problem of accurate graph capturing in PyTorch and ultimately enable software engineers to run their PyTorch programs faster. `torch.compile` leverages the following underlying technologies:
- TorchDynamo (`torch._dynamo` or simply Dynamo) is a Python-level Just-In-Time (JIT) compiler designed to make unmodified PyTorch programs faster. Dynamo hooks into the frame evaluation API in CPython (PEP 523) to dynamically modify Python bytecode right before it is executed. It rewrites Python bytecode to extract sequences of PyTorch operations into an FX Graph which is then compiled with a customizable backend. It creates this FX Graph through bytecode analysis and is designed to mix Python execution with compiled backends to get the best of both worlds — usability and performance.
- TorchInductor is the default `torch.compile` deep learning compiler that generates fast code for multiple accelerators and backends. You need to use a backend compiler to make speedups through `torch.compile` possible. For NVIDIA, AMD and Intel GPUs, it leverages OpenAI Triton as the key building block.
- AOT Autograd captures not only the user-level code, but also backpropagation, which results in capturing the backwards pass "ahead-of-time". This enables acceleration of both forwards and backwards pass using TorchInductor.
Refer to the `torch.compile` guide for more information (https://pytorch.org/docs/main/torch.compiler.html) and the Deep Dive on TorchDynamo video.
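As a quick illustration, here is a minimal `torch.compile` sketch; the model choice and shapes are just for the example:

```python
import torch
import torchvision.models as models

# Any eager-mode model works; resnet18 is used here purely as an example.
model = models.resnet18().eval().cuda()
x = torch.randn(1, 3, 224, 224, device="cuda")

# Dynamo captures the FX graph; TorchInductor (the default backend) compiles it.
compiled_model = torch.compile(model)

with torch.no_grad():
    out = compiled_model(x)  # the first call triggers JIT compilation
print(out.shape)  # torch.Size([1, 1000])
```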
There are several approaches within Torch-TensorRT to build a TensorRT engine and run inference with it. There are two sub-approaches that use Dynamo as the frontend:
- `torch.compile`: It uses just-in-time (JIT) compilation, deferring compilation until the first run for greater runtime flexibility. Under the hood, it creates sub-graphs which are further partitioned into components that will run in PyTorch and ones to be compiled to TensorRT, based on operator support. An example of the `torch.compile` approach: https://pytorch.org/TensorRT/dynamo/torch_compile.html.
- `torch_tensorrt.dynamo.compile`: It performs Ahead-of-Time (AOT) compilation, meaning the model is compiled into a TensorRT engine before execution. Similar to the above, it creates sub-graphs that are partitioned to run either in PyTorch or in TensorRT, based on TensorRT operator support. An example of the `torch_tensorrt.dynamo.compile` approach: https://pytorch.org/TensorRT/dynamo/dynamo_export.html. This approach is what we use for our implementation.
Note: Quantization/INT8 support is slated for a future release; currently, we support FP16 and FP32 precision layers.
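Here is a minimal sketch of the exported-program path described above, based on the linked example; argument names may differ slightly between Torch-TensorRT versions:

```python
import torch
import torch_tensorrt
import torchvision.models as models

model = models.resnet18().eval().half().cuda()
inputs = [torch.randn(1, 3, 224, 224, dtype=torch.half, device="cuda")]

# Ahead-of-time: first capture the model as an ExportedProgram...
exported_program = torch.export.export(model, tuple(inputs))

# ...then lower the supported subgraphs to TensorRT engines.
trt_module = torch_tensorrt.dynamo.compile(
    exported_program,
    inputs=inputs,
    enabled_precisions={torch.half},  # FP16/FP32 only, per the note above
)

with torch.no_grad():
    out = trt_module(*inputs)
```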
TorchScript is an intermediate representation of PyTorch code that allows models to be serialized and optimized for deployment. There are two sub-approaches to converting PyTorch code into TorchScript format: tracing and scripting.
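For reference, the two ways of producing TorchScript look like this (standard PyTorch API, shown on a toy module):

```python
import torch

class Toy(torch.nn.Module):
    def forward(self, x):
        return torch.relu(x) + 1

model = Toy().eval()
example_input = torch.randn(2, 3)

# Tracing: records the operations executed for this example input
# (data-dependent control flow is not captured).
traced = torch.jit.trace(model, example_input)

# Scripting: compiles the Python source itself, preserving control flow.
scripted = torch.jit.script(model)
```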
The TorchScript frontend was the original default frontend for Torch-TensorRT and targets models in the TorchScript format. The graph provided will be partitioned into supported and unsupported blocks. Supported blocks will be lowered to TensorRT and unsupported blocks will remain to run with LibTorch.
Note: The TorchScript frontend (`torch_tensorrt.ts.compile`) approach is a legacy feature of the Torch-TensorRT library.
The FX frontend has almost entirely been replaced by the Dynamo frontend, which is a superset of the features available through the FX frontend. The original FX frontend remains in the codebase for backwards-compatibility reasons.
User guide on this approach: https://pytorch.org/TensorRT/fx/getting_started_with_fx_path.html. I found it a bit hard to digest.
Exploring the Torch-TensorRT documentation, I found that there are three approaches to converting a PyTorch model to TensorRT:
- Dynamo Frontend
- TorchScript Frontend
- FX Frontend
It is not particularly clear to a new user which approach to prefer. For this experiment, I used the Dynamo approach, which uses an FX GraphModule to create a TensorRT module. (Update: after spending more time going through the documentation and writing this down, it became clearer that the TorchScript and FX frontends are legacy frontends kept for backward compatibility; Dynamo is the approach that should be used.)
Additionally, there are two approaches to creating a TensorRT module using Dynamo:
- `torch.compile` approach: This approach did not provide a way to get the underlying TensorRT engine (see the sketch after this list).
- Exported program approach: This is the approach we use, since it provides a way to access the converted TensorRT module.
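For completeness, the `torch.compile` route looks roughly like the sketch below. The `torch_tensorrt` backend name and the `options` keys follow the Torch-TensorRT documentation, but treat the snippet as a sketch rather than a definitive recipe; the key point is that the TensorRT engines stay hidden inside the compiled callable.

```python
import torch
import torch_tensorrt  # noqa: F401  # importing registers the "torch_tensorrt" backend
import torchvision.models as models

model = models.resnet18().eval().half().cuda()
x = torch.randn(1, 3, 224, 224, dtype=torch.half, device="cuda")

# JIT route: compilation happens on the first call; there is no handle to the
# underlying TensorRT engines, which is why we use the exported-program route instead.
optimized = torch.compile(
    model,
    backend="torch_tensorrt",
    options={"enabled_precisions": {torch.half}},
)

with torch.no_grad():
    out = optimized(x)
```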
There are a few notes/limitations regarding the Torch-TensorRT library:
- The Dynamo approach does not provide INT8 precision support at the moment.
- The `use_python_runtime` parameter to the compiler changes which profiler is used (see the sketch after this list).
  - If set to `True`, it uses PythonTorchTensorRTModule as its runtime. This runtime does not provide an option in its enable_profiling function to save the layer-wise latency; it just prints the traces to stdout. In our implementation, we have written utility functions around it to save layer-wise latency and TensorRT engine information to JSON files.
  - If set to `False`, it uses TorchTensorRTModule as its runtime. This runtime does provide an option to store the profiling traces in a directory.
- At present there is no clean way to profile, as noted in one of the issues on the TensorRT repo: pytorch/TensorRT#1467. We use a hacky approach to enable profiling, as suggested in a comment on that issue (see the sketch after this list).
- There are different modes of ProfilingVerbosity that control how detailed the information recorded for a particular layer is. Apparently, for the dynamo approach this is configured implicitly via the `debug` parameter.
- There are various parameters, like `min_block_size` and `optimization_level`, that affect the TensorRT run times.
- For example, to see if all operators are supported, we set the `min_block_size` value to some higher value, e.g. 100. This shows how much graph fragmentation occurs and what the recommended value is. Here is an example output from the logs for the `resnet18` model:

```
Graph Structure:

Inputs: List[Tensor: (1, 3, 224, 224)@float16]
...
TRT Engine #1 - Submodule name: _run_on_acc_0
  Engine Inputs: List[Tensor: (1, 3, 224, 224)@float16]
  Number of Operators in Engine: 89
  Engine Outputs: Tensor: (1, 1000)@float16
...
Outputs: List[Tensor: (1, 1000)@float16]

------------------------- Aggregate Stats -------------------------

Average Number of Operators per TRT Engine: 89.0
Most Operators in a TRT Engine: 89

********** Recommendations **********

- For minimal graph segmentation, select min_block_size=89 which would generate 1 TRT engine(s)
- The current level of graph segmentation is equivalent to selecting min_block_size=89 which generates 1 TRT engine(s)
```
- The `optimization_level` parameter decides how long to spend searching for an optimal path to minimize run time for a build. Its value ranges from 0 to 5 (higher levels imply longer build times).
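Putting the knobs above together, here is a hedged sketch of how we pass them to the compiler and pull layer-wise traces out of the compiled module. The `enable_profiling` loop follows the workaround discussed in pytorch/TensorRT#1467; exact attribute and method names may vary between releases.

```python
import torch
import torch_tensorrt
import torchvision.models as models

model = models.resnet18().eval().half().cuda()
inputs = [torch.randn(1, 3, 224, 224, dtype=torch.half, device="cuda")]
exported_program = torch.export.export(model, tuple(inputs))

trt_module = torch_tensorrt.dynamo.compile(
    exported_program,
    inputs=inputs,
    enabled_precisions={torch.half},
    min_block_size=1,          # minimum number of contiguous supported ops per TRT engine
    optimization_level=3,      # 0-5; higher values mean longer builds, potentially faster engines
    use_python_runtime=False,  # False -> C++ TorchTensorRTModule runtime
    debug=True,                # also controls how detailed the recorded layer information is
)

# Hacky profiling (per the issue above): the compiled graph module holds the
# TensorRT engines as child submodules; enable profiling on each one that supports it.
for name, submodule in trt_module.named_children():
    if hasattr(submodule, "enable_profiling"):
        submodule.enable_profiling()  # the C++ runtime can also write traces to a directory

with torch.no_grad():
    trt_module(*inputs)  # profiling traces are emitted while the engines execute
```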
Some open questions remain:
- How does setting the recommended value of `min_block_size` differ from the default? What is the performance difference in TensorRT inference?
- Same question about the `optimization_level` parameter.
- What approach should be used for integer quantization?