
Commit f092bbf

chore: Set default return type to ExportedProgram

chore: Add output_format flag
chore: updates
chore: additional fixes
chore: add break

1 parent 4b608f0 · commit f092bbf

File tree: 8 files changed, +211 -153 lines changed


Diff for: .github/workflows/build-test.yml (+1)

@@ -142,6 +142,7 @@ jobs:
           ${CONDA_RUN} python -m pip install --pre pytest timm transformers parameterized expecttest==0.1.6 --use-deprecated=legacy-resolver
           ${CONDA_RUN} python -m pytest --junitxml=${RUNNER_TEST_RESULTS_DIR}/dynamo_fe_test_results.xml --ir dynamo models/test_models_export.py
           ${CONDA_RUN} python -m pytest --junitxml=${RUNNER_TEST_RESULTS_DIR}/dyn_models_export.xml --ir dynamo models/test_dyn_models.py
+          ${CONDA_RUN} python -m pytest --junitxml=${RUNNER_TEST_RESULTS_DIR}/output_format.xml --ir dynamo models/test_output_format.py
           popd

   tests-py-dynamo-serde:
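Note: the workflow now runs models/test_output_format.py, which is not included in this diff. Below is a minimal sketch of what such a test might contain; the file contents, model, and test names are assumptions, not the actual file, and the real suite drives `ir` via the `--ir dynamo` pytest flag rather than passing it explicitly.

    # Hypothetical sketch of models/test_output_format.py (the real file is not shown in this commit)
    import pytest
    import torch
    import torch_tensorrt


    class SimpleModel(torch.nn.Module):
        def forward(self, x):
            return torch.relu(x)


    @pytest.mark.parametrize(
        "output_format, expected_type",
        [
            ("exported_program", torch.export.ExportedProgram),
            ("ep", torch.export.ExportedProgram),
            ("torchscript", torch.jit.ScriptModule),
            ("ts", torch.jit.ScriptModule),
            ("graph_module", torch.fx.GraphModule),
            ("fx", torch.fx.GraphModule),
        ],
    )
    def test_output_format(output_format, expected_type):
        model = SimpleModel().eval().cuda()
        inputs = [torch.randn((1, 3, 224, 224)).cuda()]
        # Each output_format value (and its short alias) should yield the matching type
        result = torch_tensorrt.compile(
            model, ir="dynamo", inputs=inputs, output_format=output_format
        )
        assert isinstance(result, expected_type)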

Diff for: docsrc/user_guide/saving_models.rst (+31 -18)

@@ -14,14 +14,18 @@ Saving models compiled with Torch-TensorRT varies slightly with the `ir` that ha
 Dynamo IR
 -------------

-Starting with 2.1 release of Torch-TensorRT, we are switching the default compilation to be dynamo based.
-The output of `ir=dynamo` compilation is a `torch.fx.GraphModule` object. There are two ways to save these objects
+The output type of `ir=dynamo` compilation of Torch-TensorRT is a `torch.export.ExportedProgram` object by default.
+In addition, we provide a new parameter `output_format` in the `CompilationSettings` object provided before compilation.
+The `output_format` parameter can take the following options:

-a) Converting to Torchscript
+* `exported_program` (or) `ep` : This is the default. It returns an ExportedProgram.
+* `torchscript` (or) `ts` : This returns a TorchScript module.
+* `graph_module` (or) `fx` : This returns a `torch.fx.GraphModule`, which can be traced into TorchScript to save to disk.
+
+a) Torchscript
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

-`torch.fx.GraphModule` objects cannot be serialized directly. Hence we use `torch.jit.trace` to convert this into a `ScriptModule` object which can be saved to disk.
-The following code illustrates this approach.
+If you set `output_format="torchscript"`, compilation will return a `ScriptModule`, which can be serialized via `torch.jit.save`.

 .. code-block:: python

@@ -30,9 +34,9 @@ The following code illustrates this approach.

     model = MyModel().eval().cuda()
     inputs = [torch.randn((1, 3, 224, 224)).cuda()]
-    trt_gm = torch_tensorrt.compile(model, ir="dynamo", inputs) # Output is a torch.fx.GraphModule
-    trt_traced_model = torch.jit.trace(trt_gm, inputs)
-    torch.jit.save(trt_traced_model, "trt_model.ts")
+    # trt_ts is a torch.jit.ScriptModule object
+    trt_ts = torch_tensorrt.compile(model, ir="dynamo", inputs=inputs, output_format="torchscript")
+    torch.jit.save(trt_ts, "trt_model.ts")

     # Later, you can load it and run inference
     model = torch.jit.load("trt_model.ts").cuda()

@@ -41,8 +45,7 @@ The following code illustrates this approach.
 b) ExportedProgram
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

-`torch.export.ExportedProgram` is a new format introduced in Pytorch 2.1. After we compile a Pytorch module using Torch-TensorRT, the resultant
-`torch.fx.GraphModule` along with additional metadata can be used to create `ExportedProgram` which can be saved and loaded from disk.
+`torch.export.ExportedProgram`, a new format introduced in Pytorch 2.X, is the default return type of Torch-TensorRT compilation.

 .. code-block:: python

@@ -51,26 +54,36 @@ b) ExportedProgram

     model = MyModel().eval().cuda()
     inputs = [torch.randn((1, 3, 224, 224)).cuda()]
-    trt_gm = torch_tensorrt.compile(model, ir="dynamo", inputs) # Output is a torch.fx.GraphModule
-    # Transform and create an exported program
-    trt_exp_program = torch_tensorrt.dynamo.export(trt_gm, inputs)
-    torch.export.save(trt_exp_program, "trt_model.ep")
+    # trt_ep is a torch.export.ExportedProgram object
+    trt_ep = torch_tensorrt.compile(model, ir="dynamo", inputs=inputs)
+    torch.export.save(trt_ep, "trt_model.ep")

     # Later, you can load it and run inference
     model = torch.export.load("trt_model.ep")
     model(*inputs)

-`torch_tensorrt.dynamo.export` inlines the submodules within a GraphModule to their corresponding nodes and stiches all the nodes together.
-This is needed as `torch._export` serialization cannot handle serializing and deserializing of submodules (`call_module` nodes).
+c) GraphModule
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

-.. note:: This way of saving the models using `ExportedProgram` is experimental. Here is a known issue : https://github.com/pytorch/TensorRT/issues/2341
+We can also return a `torch.fx.GraphModule` object as the output of Torch-TensorRT compilation by setting `output_format="graph_module"`.
+Internally, the partitioning, lowering, and conversion phases operate on GraphModule objects. These can be either traced into TorchScript modules or
+exported into `ExportedProgram` objects.

+.. code-block:: python
+
+    import torch
+    import torch_tensorrt
+
+    model = MyModel().eval().cuda()
+    inputs = [torch.randn((1, 3, 224, 224)).cuda()]
+    # trt_gm is a torch.fx.GraphModule object
+    trt_gm = torch_tensorrt.compile(model, ir="dynamo", inputs=inputs, output_format="graph_module")

 Torchscript IR
 -------------

 In Torch-TensorRT 1.X versions, the primary way to compile and run inference with Torch-TensorRT is using Torchscript IR.
-This behavior stays the same in 2.X versions as well.
+For `ir=ts`, this behavior stays the same in 2.X versions as well.

 .. code-block:: python
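Note: the new GraphModule example above stops at compilation without showing serialization. A short sketch of how the returned GraphModule could be persisted, following the torch.jit.trace approach the previous revision of this page documented (illustrative only, not part of this commit; MyModel is the placeholder module from the doc snippets):

    import torch
    import torch_tensorrt

    model = MyModel().eval().cuda()  # MyModel as in the doc snippets above (assumed)
    inputs = [torch.randn((1, 3, 224, 224)).cuda()]
    trt_gm = torch_tensorrt.compile(model, ir="dynamo", inputs=inputs, output_format="graph_module")

    # A GraphModule is not directly serializable; trace it into TorchScript first,
    # mirroring the approach the earlier version of this page used.
    trt_ts = torch.jit.trace(trt_gm, inputs)
    torch.jit.save(trt_ts, "trt_model.ts")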

Diff for: py/torch_tensorrt/dynamo/_compiler.py (+9 -3)

@@ -5,6 +5,7 @@
 from typing import Any, Collection, List, Optional, Sequence, Set, Tuple, Union

 import torch
+import torch_tensorrt
 from torch.export import ExportedProgram
 from torch.fx.node import Target
 from torch_tensorrt import _enums

@@ -29,6 +30,7 @@
     MIN_BLOCK_SIZE,
     NUM_AVG_TIMING_ITERS,
     OPTIMIZATION_LEVEL,
+    OUTPUT_FORMAT,
     PASS_THROUGH_BUILD_FAILURES,
     PRECISION,
     REFIT,

@@ -46,6 +48,7 @@
     dryrun_stats_display,
     parse_non_trt_nodes,
 )
+from torch_tensorrt.dynamo._exporter import export
 from torch_tensorrt.dynamo.conversion import (
     CompilationSettings,
     UnsupportedOperatorException,

@@ -66,8 +69,6 @@
     to_torch_tensorrt_device,
 )

-import torch_tensorrt
-
 logger = logging.getLogger(__name__)


@@ -103,6 +104,7 @@ def compile(
     enable_experimental_decompositions: bool = ENABLE_EXPERIMENTAL_DECOMPOSITIONS,
     dryrun: bool = DRYRUN,
     hardware_compatible: bool = HARDWARE_COMPATIBLE,
+    output_format: str = OUTPUT_FORMAT,
     **kwargs: Any,
 ) -> torch.fx.GraphModule:
     """Compile a TorchScript module for NVIDIA GPUs using TensorRT

@@ -161,6 +163,7 @@ def compile(
         enable_experimental_decompositions (bool): Use the full set of operator decompositions. These decompositions may not be tested but serve to make the graph easier to convert to TensorRT, potentially increasing the amount of graphs run in TensorRT.
         dryrun (bool): Toggle for "Dryrun" mode, running everything except conversion to TRT and logging outputs
         hardware_compatible (bool): Build the TensorRT engines compatible with GPU architectures other than that of the GPU on which the engine was built (currently works for NVIDIA Ampere and newer)
+        output_format (str): Output format of the result of TRT compilation. Options include "exported_program" (or) "ep" | "torchscript" (or) "ts" | "graph_module" (or) "fx". Default is "exported_program"
         **kwargs: Any,
     Returns:
         torch.fx.GraphModule: Compiled FX Module, when run it will execute via TensorRT

@@ -238,11 +241,14 @@ def compile(
         "dla_global_dram_size": dla_global_dram_size,
         "dryrun": dryrun,
         "hardware_compatible": hardware_compatible,
+        "output_format": output_format,
     }

     settings = CompilationSettings(**compilation_options)
     logger.info("Compilation Settings: %s\n", settings)
-    return compile_module(gm, inputs, settings)
+    trt_gm = compile_module(gm, inputs, settings)
+    trt_result = export(trt_gm, torch_inputs, output_format)
+    return trt_result


 def compile_module(
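Note: since compile() now ends by dispatching through the exporter, a GraphModule result can also be converted to the other formats after the fact instead of recompiling. A hedged sketch using the export helper this commit wires in (a private API that may change; MyModel is an assumed user module):

    import torch
    import torch_tensorrt
    from torch_tensorrt.dynamo._exporter import export  # private helper added in this diff

    model = MyModel().eval().cuda()  # assumed user module
    inputs = [torch.randn((1, 3, 224, 224)).cuda()]

    # Compile once to the raw GraphModule (the "fx" alias) ...
    trt_gm = torch_tensorrt.compile(model, ir="dynamo", inputs=inputs, output_format="fx")

    # ... then convert it with the same dispatch compile() now uses internally.
    trt_ep = export(trt_gm, inputs, "exported_program")  # torch.export.ExportedProgram
    trt_ts = export(trt_gm, inputs, "torchscript")       # ScriptModule via torch.jit.trace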

Diff for: py/torch_tensorrt/dynamo/_defaults.py (+1)

@@ -26,6 +26,7 @@
 REQUIRE_FULL_COMPILATION = False
 DRYRUN = False
 HARDWARE_COMPATIBLE = False
+OUTPUT_FORMAT = "exported_program"


 def default_device() -> Device:

Diff for: py/torch_tensorrt/dynamo/_exporter.py (+88 -50)

@@ -1,4 +1,3 @@
-import copy
 import operator
 from typing import Any, Dict, Sequence, Tuple, cast

@@ -19,50 +18,43 @@
 def export(
     gm: torch.fx.GraphModule,
     inputs: Sequence[torch.Tensor],
-    *,
-    ir: str = "torchscript",
+    output_format: str,
 ) -> ExportedProgram:
-    """Export a program (``torch.fx.GraphModule``) for serialization with the TensorRT engines embedded.
-
-    > Note: When ExportedProgram becomes stable, this function will get merged into ``torch_tensorrt.dynamo.compile``
+    """Export the result of TensorRT compilation into the desired output format.

     Arguments:
-        src_gm (torch.fx.GraphModule): Source module, generated by torch.export (The module provided to ``torch_tensorrt.dynamo.compile``)
         gm (torch.fx.GraphModule): Compiled Torch-TensorRT module, generated by ``torch_tensorrt.dynamo.compile``
-
-    Keyword Arguments:
-        inputs (Any): **Required** List of specifications of input shape, dtype and memory layout for inputs to the module. This argument is required. Input Sizes can be specified as torch sizes, tuples or lists. dtypes can be specified using
-            torch datatypes or torch_tensorrt datatypes and you can use either torch devices or the torch_tensorrt device type enum
-            to select device type. ::
-
-                input=[
-                    torch_tensorrt.Input((1, 3, 224, 224)), # Static NCHW input shape for input #1
-                    torch_tensorrt.Input(
-                        min_shape=(1, 224, 224, 3),
-                        opt_shape=(1, 512, 512, 3),
-                        max_shape=(1, 1024, 1024, 3),
-                        dtype=torch.int32
-                        format=torch.channel_last
-                    ), # Dynamic input shape for input #2
-                    torch.randn((1, 3, 224, 244)) # Use an example tensor and let torch_tensorrt infer settings
-        ir (str): torchscript | exported_program. Based on the provided ir, the output type would be a torchscript or exported program.
+        inputs (torch.Tensor): Torch input tensors
+        output_format (str): Output format of the result of TRT compilation. Options include "exported_program" (or) "ep" | "torchscript" (or) "ts" | "graph_module" (or) "fx". Default is "exported_program"
     """
-    if ir == "torchscript":
+    if output_format == "torchscript" or output_format == "ts":
         return torch.jit.trace(gm, inputs)
-    elif ir == "exported_program":
+    elif output_format == "exported_program" or output_format == "ep":
         patched_module = transform(gm, inputs)
         exp_program = create_trt_exp_program(patched_module)
-
         return exp_program
+    elif output_format == "graph_module" or output_format == "fx":
+        return gm
     else:
         raise ValueError(
-            f"Invalid ir : {ir} provided for serialization. Options include torchscript | exported_program"
+            f"Invalid output format {output_format} specified. Supported options include exported_program (or) ep | torchscript (or) ts | graph_module (or) fx"
         )


 def transform(
     gm: torch.fx.GraphModule, inputs: Sequence[torch.Tensor]
 ) -> torch.fx.GraphModule:
+    """
+    Transforms the graphmodule by inlining Pytorch and TensorRT submodules.
+    Inlining collapses submodules into nodes which is necessary for torch.export
+    serialization.
+
+    Arguments:
+        gm (torch.fx.GraphModule): Compiled Torch-TensorRT module, generated by ``torch_tensorrt.dynamo.compile``
+        inputs (torch.Tensor): Torch input tensors
+
+    Returns an inlined torch.fx.GraphModule
+    """
     # Run shape analysis
     _, outputs_map = partitioning.run_shape_analysis(gm, inputs)

@@ -72,10 +64,6 @@ def transform(
     # Inline pytorch submodules
     inline_torch_modules(gm)

-    # Lift constant buffers and parameters in the graph
-    # torch.export serialization expects them to be lifted
-    lift_constant_pass(gm)
-
     # Clean the graph
     gm.delete_all_unused_submodules()
     gm.graph.eliminate_dead_code()

@@ -84,34 +72,80 @@
     return gm


-def lift_constant_pass(trt_gm: torch.fx.GraphModule) -> torch.fx.GraphModule:
+def lift(gm: torch.fx.GraphModule, graph_signature: Any) -> torch.fx.GraphModule:
+    """
+    Given an unlifted fx.GraphModule, lift all parameters, buffers into placeholders.
+    Arguments:
+        gm (torch.fx.GraphModule): Unlifted GraphModule which contains parameters and buffers as get_attr nodes.
+        graph_signature (torch.export.ExportGraphSignature): Instance of ExportGraphSignature class created for the output ExportedProgram.
+            After lifting, this graph_signature will be modified with the parameters and buffers added appropriately.
+    Returns:
+        A lifted fx.GraphModule, modified graph_signature and a new state_dict
+    """
+    # Get the state_dict of graph_module. This is different from exported_program.state_dict
+    # exp_program.state_dict contains parameters and buffers whereas a graph_module's state_dict
+    # has all parameters registered as torch.tensors.
+    state_dict = gm.state_dict()
+
     fake_mode = detect_fake_mode(
-        tuple(
-            node.meta["val"] for node in trt_gm.graph.nodes if node.op == "placeholder"
-        )
+        tuple(node.meta["val"] for node in gm.graph.nodes if node.op == "placeholder")
     )
+    assert fake_mode is not None

+    # Locate the user input to insert new placeholders before them
     first_user_input = None
-    for node in trt_gm.graph.nodes:
-        if node.op == "placeholder":
+    for node in gm.graph.nodes:
+        if node.op == "placeholder" and node.name in graph_signature.user_inputs:
             first_user_input = node
             break

-    for node in trt_gm.graph.nodes:
+    # At first the user_inputs are only present in the graph_signature.input_specs and hence non_user_input_idx=0
+    # The input_specs should be of the form [params, buffers, constant_tensors, user_inputs]
+    non_user_input_idx = 0
+    for node in gm.graph.nodes:
         if node.op == "get_attr":
-            constant_tensor = getattr(trt_gm, node.target)
-            with trt_gm.graph.inserting_before(first_user_input):
-                const_placeholder_node = trt_gm.graph.placeholder(node.target)
-                const_placeholder_node.meta = copy.deepcopy(node.meta)
+            constant_tensor = getattr(gm, node.target)
+            input_kind = InputKind.CONSTANT_TENSOR
+
+            # state_dict has these parameters/buffers as torch.Tensors. We override them as torch.nn.Parameter/torch.Tensors respectively.
+            for name, _ in gm.named_parameters():
+                if node.target == name:
+                    input_kind = InputKind.PARAMETER
+                    state_dict[name] = constant_tensor
+                    break
+            for name, _ in gm.named_buffers():
+                if node.target == name:
+                    input_kind = InputKind.BUFFER
+                    state_dict[name] = constant_tensor
+                    break
+
+            # Replace get_attr nodes with placeholder nodes and copy metadata.
+            with gm.graph.inserting_before(first_user_input):
+                const_placeholder_node = gm.graph.placeholder(node.target)
+                for k, v in node.meta.items():
+                    const_placeholder_node.meta[k] = v
                 const_placeholder_node.meta["val"] = fake_mode.from_tensor(
                     constant_tensor
                 )
                 node.replace_all_uses_with(const_placeholder_node)
-                trt_gm.graph.erase_node(node)
+                gm.graph.erase_node(node)
+
+                # Add these parameters/buffers/constants to the existing graph signature
+                # before user inputs. These specs are looked up in the state_dict during ExportedProgram creation.
+                graph_signature.input_specs.insert(
+                    non_user_input_idx,
+                    InputSpec(
+                        kind=input_kind,
+                        arg=TensorArgument(name=const_placeholder_node.name),
+                        target=node.target,
+                    ),
+                )
+                non_user_input_idx += 1
+
+    gm.graph.eliminate_dead_code()
+    gm.graph.lint()

-    trt_gm.graph.eliminate_dead_code()
-    trt_gm.graph.lint()
-    return trt_gm
+    return gm, graph_signature, state_dict


 def get_duplicate_nodes(

@@ -140,7 +174,7 @@ def get_duplicate_nodes(
 def inline_torch_modules(gm: torch.fx.GraphModule) -> torch.fx.GraphModule:
     """
     Inline a submodule within the parent graph (gm). All `call_module` nodes
-    should be replaced by their submodule nodes.
+    should be replaced by their nodes in the submodule.
     """
     # Clean the graph
     gm.graph.eliminate_dead_code()

@@ -165,7 +199,6 @@ def inline_torch_modules(gm: torch.fx.GraphModule) -> torch.fx.GraphModule:
         # Copy all nodes in the submodule into gm and
         # store the output node of this submodule which is now present in gm
-
         submodule_output = gm.graph.graph_copy(submodule.graph, val_map)

         # Get their references (since we copied) in the parent graph (gm)

@@ -227,6 +260,7 @@
     """Creates a new Exported Program. This function takes a torch.fx.GraphModule which has TRT engines
     and constructs an Exported Program object with the new IO node names and state_dict
     """
+
     input_nodes = [node for node in gm.graph.nodes if node.op == "placeholder"]
     output_nodes = [node for node in gm.graph.nodes if node.op == "output"]
     assert output_nodes

@@ -245,8 +279,12 @@
         input_specs=input_specs, output_specs=output_specs
     )

+    # Lift parameters/buffers/constants in the graph
+    # torch.export serialization expects them to be lifted
+    gm, trt_graph_signature, state_dict = lift(gm, trt_graph_signature)
+
     trt_exp_program = ExportedProgram(
-        gm, gm.graph, trt_graph_signature, gm.state_dict(), {}, [], [], []
+        gm, gm.graph, trt_graph_signature, state_dict, {}, [], [], []
     )

     return trt_exp_program
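Note: to make the new lift() pass concrete, here is a minimal, self-contained FX illustration of the same idea: a get_attr node for a buffer is replaced with a placeholder so the tensor becomes an explicit graph input, which is what torch.export serialization expects. This is a simplified sketch, not the library code (no graph_signature or state_dict bookkeeping), using a toy module of our own.

    import torch
    import torch.fx


    class Scale(torch.nn.Module):
        def __init__(self) -> None:
            super().__init__()
            self.register_buffer("scale", torch.tensor(2.0))

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            return x * self.scale


    gm = torch.fx.symbolic_trace(Scale())
    # Before lifting: the graph reads the "scale" buffer through a get_attr node.

    first_placeholder = next(n for n in gm.graph.nodes if n.op == "placeholder")
    for node in list(gm.graph.nodes):
        if node.op == "get_attr":
            # Insert the new placeholder before the user inputs, mirroring the
            # [params, buffers, constants, user_inputs] ordering lift() maintains.
            with gm.graph.inserting_before(first_placeholder):
                const_placeholder = gm.graph.placeholder(node.target)
            node.replace_all_uses_with(const_placeholder)
            gm.graph.erase_node(node)

    gm.graph.lint()
    gm.recompile()

    # After lifting, the buffer is passed explicitly as the first argument.
    out = gm(torch.tensor(2.0), torch.randn(3))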
