+@dataclass
+class CompilationSettings:
+    """Compilation settings for Torch-TensorRT Dynamo Paths
+
+    Args:
+        precision (torch.dtype): Model Layer precision
+        debug (bool): Whether to print out verbose debugging information
+        workspace_size (int): Workspace TRT is allowed to use for the module (0 is default)
+        min_block_size (int): Minimum number of operators per TRT-Engine Block
+        torch_executed_ops (Sequence[str]): Sequence of operations to run in Torch, regardless of converter coverage
+        pass_through_build_failures (bool): Whether to fail on TRT engine build errors (True) or not (False)
+        max_aux_streams (Optional[int]): Maximum number of allowed auxiliary TRT streams for each engine
+        version_compatible (bool): Provide version forward-compatibility for engine plan files
+        optimization_level (Optional[int]): Builder optimization level 0-5; higher levels imply longer build time,
+            searching for more optimization options. TRT defaults to 3
+        use_python_runtime (Optional[bool]): Whether to strictly use the Python runtime or the C++ runtime. To auto-select a runtime
+            based on C++ dependency presence (preferentially choosing the C++ runtime if available), leave the
+            argument as None
+        truncate_long_and_double (bool): Whether to truncate int64/float64 TRT engine inputs or weights to int32/float32
+        use_fast_partitioner (bool): Whether to use the fast or global graph partitioning system
+        enable_experimental_decompositions (bool): Whether to enable all core aten decompositions
+            or only a selected subset of them
+        device (Device): GPU to compile the model on
+        require_full_compilation (bool): Whether to require that the graph is fully compiled in TensorRT.
+            Only applicable for `ir="dynamo"`; has no effect for the `torch.compile` path
+    """
+
+    precision: torch.dtype = PRECISION
+    debug: bool = DEBUG
+    workspace_size: int = WORKSPACE_SIZE
+    min_block_size: int = MIN_BLOCK_SIZE
+    torch_executed_ops: Set[str] = field(default_factory=set)
+    pass_through_build_failures: bool = PASS_THROUGH_BUILD_FAILURES
+    max_aux_streams: Optional[int] = MAX_AUX_STREAMS
+    version_compatible: bool = VERSION_COMPATIBLE
+    optimization_level: Optional[int] = OPTIMIZATION_LEVEL
+    use_python_runtime: Optional[bool] = USE_PYTHON_RUNTIME
+    truncate_long_and_double: bool = TRUNCATE_LONG_AND_DOUBLE
+    use_fast_partitioner: bool = USE_FAST_PARTITIONER
+    enable_experimental_decompositions: bool = ENABLE_EXPERIMENTAL_DECOMPOSITIONS
+    device: Device = field(default_factory=default_device)
+    require_full_compilation: bool = REQUIRE_FULL_COMPILATION
diff --git a/docs/_sources/index.rst.txt b/docs/_sources/index.rst.txt
index 9e98c7a63d..d2f6c54b9d 100644
--- a/docs/_sources/index.rst.txt
+++ b/docs/_sources/index.rst.txt
@@ -40,6 +40,7 @@ User Guide
------------
* :ref:`creating_a_ts_mod`
* :ref:`getting_started_with_fx`
+* :ref:`torch_compile`
* :ref:`ptq`
* :ref:`runtime`
* :ref:`saving_models`
@@ -54,6 +55,7 @@ User Guide
user_guide/creating_torchscript_module_in_python
user_guide/getting_started_with_fx_path
+ user_guide/torch_compile
user_guide/ptq
user_guide/runtime
user_guide/saving_models
diff --git a/docs/_sources/user_guide/dynamic_shapes.rst.txt b/docs/_sources/user_guide/dynamic_shapes.rst.txt
index 28320956c4..4e1bf69631 100644
--- a/docs/_sources/user_guide/dynamic_shapes.rst.txt
+++ b/docs/_sources/user_guide/dynamic_shapes.rst.txt
@@ -1,4 +1,4 @@
-.. _runtime:
+.. _dynamic_shapes:
Dynamic shapes with Torch-TensorRT
====================================
@@ -206,13 +206,3 @@ In the future, we plan to explore the option of compiling with dynamic shapes in
     # Recompilation happens with modified batch size
     inputs_bs2 = torch.randn((2, 3, 224, 224), dtype=torch.float32)
     trt_gm = torch_tensorrt.compile(model, ir="torch_compile", inputs=inputs_bs2)
-
-
-
-
-
-
-
-
-
-
diff --git a/docs/_sources/user_guide/torch_compile.rst.txt b/docs/_sources/user_guide/torch_compile.rst.txt
new file mode 100644
index 0000000000..a2d83cd52e
--- /dev/null
+++ b/docs/_sources/user_guide/torch_compile.rst.txt
@@ -0,0 +1,110 @@
+.. _torch_compile:
+
+Torch-TensorRT `torch.compile` Backend
+======================================================
+.. currentmodule:: torch_tensorrt.dynamo
+
+.. automodule:: torch_tensorrt.dynamo
+ :members:
+ :undoc-members:
+ :show-inheritance:
+
+This guide presents the Torch-TensorRT `torch.compile` backend: a deep learning compiler which uses TensorRT to accelerate JIT-style workflows across a wide variety of models.
+
+Key Features
+--------------------------------------------
+
+The primary goal of the Torch-TensorRT `torch.compile` backend is to enable Just-In-Time compilation workflows by combining the simplicity of the `torch.compile` API with the performance of TensorRT. Invoking the `torch.compile` backend is as simple as importing the `torch_tensorrt` package and specifying the backend:
+
+.. code-block:: python
+
+    import torch_tensorrt
+    ...
+    optimized_model = torch.compile(model, backend="torch_tensorrt", dynamic=False)
+
+.. note:: Many additional customization options are available to the user. These will be discussed in further depth in this guide.
+
+The backend can handle a variety of challenging model structures and offers a simple-to-use interface for effective acceleration of models. Additionally, it has many customization options to ensure the compilation process is well suited to the specific use case.
+
+Customizable Settings
+----------------------
+.. autoclass:: CompilationSettings
+
+Custom Setting Usage
+^^^^^^^^^^^^^^^^^^^^
+.. code-block:: python
+
+    import torch_tensorrt
+    ...
+    optimized_model = torch.compile(model, backend="torch_tensorrt", dynamic=False,
+                                    options={"truncate_long_and_double": True,
+                                             "precision": torch.half,
+                                             "debug": True,
+                                             "min_block_size": 2,
+                                             "torch_executed_ops": {"torch.ops.aten.sub.Tensor"},
+                                             "optimization_level": 4,
+                                             "use_python_runtime": False,})
+
+.. note:: Quantization/INT8 support is slated for a future release; currently, we support FP16 and FP32 precision layers.
+
+Compilation
+-----------------
+Compilation is triggered by passing inputs to the model, like so:
+
+.. code-block:: python
+
+    import torch_tensorrt
+    ...
+    # Causes model compilation to occur
+    first_outputs = optimized_model(*inputs)
+
+    # Subsequent inference runs with the same or similar inputs will not cause recompilation
+    # For a full discussion of this, see "Recompilation Conditions" below
+    second_outputs = optimized_model(*inputs)
+
+After Compilation
+-----------------
+The compiled model can be used for inference within the Python session, and will recompile according to the recompilation conditions detailed below. In addition to general inference, the compilation process can be a helpful tool in determining model performance, current operator coverage, and feasibility of serialization. Each of these points will be covered in detail below.
+
+Model Performance
+^^^^^^^^^^^^^^^^^
+The optimized model returned from `torch.compile` is useful for model benchmarking since it can automatically handle changes in the compilation context, or differing inputs that could require recompilation. When benchmarking inputs of varying distributions, batch sizes, or other criteria, this can save time.
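+
+As a rough illustration, a simple timing loop can compare a few batch sizes in one pass (a minimal sketch, assuming a CUDA device and an image model taking (N, 3, 224, 224) inputs, with `optimized_model` defined as above):
+
+.. code-block:: python
+
+    import time
+
+    import torch
+
+    for batch_size in (1, 4, 8):
+        inputs = [torch.randn((batch_size, 3, 224, 224), dtype=torch.float32).cuda()]
+
+        # The first call for a new batch size may trigger recompilation
+        optimized_model(*inputs)
+
+        torch.cuda.synchronize()
+        start = time.perf_counter()
+        for _ in range(10):
+            optimized_model(*inputs)
+        torch.cuda.synchronize()
+        print(f"Batch size {batch_size}: {(time.perf_counter() - start) / 10 * 1000:.2f} ms/iter")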
+
+Operator Coverage
+^^^^^^^^^^^^^^^^^
+Compilation is also a useful tool in determining operator coverage for a particular model. For instance, the following compilation command will display the operator coverage for each graph, but will not compile the model, effectively providing a "dryrun" mechanism:
+
+.. code-block:: python
+
+    import torch_tensorrt
+    ...
+    optimized_model = torch.compile(model, backend="torch_tensorrt", dynamic=False,
+                                    options={"debug": True,
+                                             "min_block_size": float("inf"),})
+
+If key operators for your model are unsupported, see :ref:`dynamo_conversion` to contribute your own converters, or file an issue here: https://github.com/pytorch/TensorRT/issues.
+
+Feasibility of Serialization
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+Compilation can also be helpful in demonstrating graph breaks and the feasibility of serialization of a particular model. For instance, if a model has no graph breaks and compiles successfully with the Torch-TensorRT backend, then that model should be compilable and serializable via the `torch_tensorrt` Dynamo IR, as discussed in :ref:`dynamic_shapes`. To determine the number of graph breaks in a model, the `torch._dynamo.explain` function is very useful:
+
+.. code-block:: python
+
+    import torch
+    import torch_tensorrt
+    ...
+    explanation = torch._dynamo.explain(model)(*inputs)
+    print(f"Graph breaks: {explanation.graph_break_count}")
+    optimized_model = torch.compile(model, backend="torch_tensorrt", dynamic=False, options={"truncate_long_and_double": True})
+
+Dynamic Shape Support
+---------------------
+
+The Torch-TensorRT `torch.compile` backend currently requires recompilation for each new batch size encountered, so it is preferred to use the `dynamic=False` argument when compiling with this backend. Full dynamic shape support is planned for a future release.
+
+Recompilation Conditions
+------------------------
+
+Once the model has been compiled, subsequent inference inputs with the same shape and data type, which traverse the graph in the same way, will not require recompilation. Furthermore, each new recompilation will be cached for the duration of the Python session. For instance, if inputs of batch size 4 and 8 are provided to the model, causing two recompilations, no further recompilation would be necessary for future inputs with those batch sizes during inference within the same session. Support for engine cache serialization is planned for a future release.
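+
+This caching behavior can be sketched as follows (an illustrative example, assuming a `model` defined as in the snippets above and, purely for illustration, image inputs of shape (N, 3, 224, 224)):
+
+.. code-block:: python
+
+    import torch
+    import torch_tensorrt
+    ...
+    optimized_model = torch.compile(model, backend="torch_tensorrt", dynamic=False)
+
+    # Batch size 4: triggers the initial compilation
+    optimized_model(torch.randn((4, 3, 224, 224)).cuda())
+
+    # Batch size 8: a new input shape, triggers a second compilation
+    optimized_model(torch.randn((8, 3, 224, 224)).cuda())
+
+    # Batch size 4 again: already compiled this session, so no recompilation occurs
+    optimized_model(torch.randn((4, 3, 224, 224)).cuda())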
+
+Recompilation is generally triggered by one of two events: encountering inputs of different sizes, or inputs which traverse the model code differently. The latter scenario can occur when the model code includes conditional logic, complex loops, or data-dependent shapes. `torch.compile` handles guarding in both of these scenarios and determines when recompilation is necessary.
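+
+The second trigger can be seen in a model with data-dependent control flow. The following is a purely illustrative sketch (the module below is hypothetical, and a CUDA device is assumed):
+
+.. code-block:: python
+
+    import torch
+    import torch_tensorrt
+
+    class ConditionalModel(torch.nn.Module):
+        def forward(self, x):
+            # Data-dependent branch: different input values take different code paths
+            if x.sum() > 0:
+                return torch.relu(x)
+            return torch.sigmoid(x)
+
+    model = ConditionalModel().eval().cuda()
+    optimized_model = torch.compile(model, backend="torch_tensorrt", dynamic=False,
+                                    options={"min_block_size": 1})
+
+    # Compiles the code path taken by this input (the positive branch)
+    optimized_model(torch.ones(4, 8).cuda())
+
+    # This input takes the other branch, so that path is compiled when first encountered
+    optimized_model(-torch.ones(4, 8).cuda())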
diff --git a/docs/_static/documentation_options.js b/docs/_static/documentation_options.js
index b97c8ee26e..52fbf40f3d 100644
--- a/docs/_static/documentation_options.js
+++ b/docs/_static/documentation_options.js
@@ -1,6 +1,6 @@
var DOCUMENTATION_OPTIONS = {
URL_ROOT: document.getElementById("documentation_options").getAttribute('data-url_root'),
- VERSION: 'v2.2.0.dev0+50ab2c1',
+ VERSION: 'v2.2.0.dev0+4a920a1',
LANGUAGE: 'None',
COLLAPSE_INDEX: false,
BUILDER: 'html',
diff --git a/docs/cli/torchtrtc.html b/docs/cli/torchtrtc.html
index 5d8cdfde3f..e63a287040 100644
--- a/docs/cli/torchtrtc.html
+++ b/docs/cli/torchtrtc.html
@@ -10,7 +10,7 @@
- torchtrtc — Torch-TensorRT v2.2.0.dev0+50ab2c1 documentation
+ torchtrtc — Torch-TensorRT v2.2.0.dev0+4a920a1 documentation