
Commit 828fcf3

yueshen2016 authored and quapham committed

remove all TRT when referenced with TRT Model Optimizer (NVIDIA-NeMo#15147)

Signed-off-by: Yue <yueshen@nvidia.com>
Signed-off-by: quanpham <youngkwan199@gmail.com>

1 parent 9da5f5e commit 828fcf3

File tree

14 files changed: +19 additions, -19 deletions


docs/source/index.rst

Lines changed: 2 additions & 2 deletions
@@ -12,8 +12,8 @@ NVIDIA NeMo Framework is an end-to-end, cloud-native framework designed to build
 - Flash Attention
 - Activation Recomputation
 - Positional Embeddings and Positional Interpolation
-- Post-Training Quantization (PTQ) and Quantization Aware Training (QAT) with `TensorRT Model Optimizer <https://github.com/NVIDIA/TensorRT-Model-Optimizer>`_
-- Knowledge Distillation-based training with `TensorRT Model Optimizer <https://github.com/NVIDIA/TensorRT-Model-Optimizer>`_
+- Post-Training Quantization (PTQ) and Quantization Aware Training (QAT) with `Model Optimizer <https://github.com/NVIDIA/Model-Optimizer>`_
+- Knowledge Distillation-based training with `Model Optimizer <https://github.com/NVIDIA/Model-Optimizer>`_
 - Sequence Packing
 
 `NVIDIA NeMo Framework <https://github.com/NVIDIA/NeMo>`_ has separate collections for:

docs/source/nlp/quantization.rst

Lines changed: 1 addition & 1 deletion
@@ -11,7 +11,7 @@ PTQ enables deploying a model in a low-precision format -- FP8, INT4, or INT8 --
 
 Model quantization has two primary benefits: reduced model memory requirements and increased inference throughput.
 
-In NeMo, quantization is enabled by the `NVIDIA TensorRT Model Optimizer (ModelOpt) <https://github.com/NVIDIA/TensorRT-Model-Optimizer>`_ library -- a library to quantize and compress deep learning models for optimized inference on GPUs.
+In NeMo, quantization is enabled by the `NVIDIA Model Optimizer (ModelOpt) <https://github.com/NVIDIA/Model-Optimizer>`_ library -- a library to quantize and compress deep learning models for optimized inference on GPUs.
 
 The quantization process consists of the following steps:
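The memory benefit this documentation describes can be sketched with a toy symmetric INT8 quantizer. This is an illustration only, written in plain Python: it is not the ModelOpt API, and real PTQ also calibrates activations on sample data, not just weights.

```python
# Toy sketch of the idea behind post-training quantization (PTQ):
# store weights as 8-bit integers plus a single scale factor instead of
# 32-bit floats, cutting weight memory roughly 4x. Illustration only --
# NOT the ModelOpt API.

def quantize_int8(weights):
    """Symmetric per-tensor INT8 quantization: w ~= q * scale."""
    max_abs = max((abs(w) for w in weights), default=0.0)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale

def dequantize(quantized, scale):
    """Recover approximate float weights from int8 values."""
    return [q * scale for q in quantized]

weights = [0.02, -1.5, 0.73, 3.1, -0.004]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# Each int8 value occupies 1 byte vs 4 bytes per float32 weight,
# at the cost of a bounded rounding error (at most scale / 2).
print(q)
print(max(abs(a - b) for a, b in zip(weights, restored)) <= scale / 2)
```

The round-trip error is bounded by half the quantization step, which is why low-precision formats remain accurate enough for inference after calibration.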

nemo/collections/llm/modelopt/__init__.py

Lines changed: 1 addition & 1 deletion
@@ -11,7 +11,7 @@
 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 # See the License for the specific language governing permissions and
 # limitations under the License.
-"""Model optimization utilities for using TensorRT Model Optimizer."""
+"""Model optimization utilities for using Model Optimizer."""
 
 from .distill import *  # noqa: F401
 from .model_utils import *  # noqa: F401

nemo/collections/llm/modelopt/prune/__init__.py

Lines changed: 1 addition & 1 deletion
@@ -12,7 +12,7 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 
-"""Prune utilities for using TensorRT Model Optimizer."""
+"""Prune utilities for using Model Optimizer."""
 
 from .pruner import PruningConfig, prune_language_model, save_pruned_model

nemo/collections/llm/modelopt/quantization/quantizer.py

Lines changed: 1 addition & 1 deletion
@@ -65,7 +65,7 @@ class QuantizationConfig:
     """Quantization parameters.
 
     Available quantization methods are listed in `QUANT_CFG_CHOICES` dictionary above.
-    Please consult Model Optimizer documentation https://nvidia.github.io/TensorRT-Model-Optimizer/ for details.
+    Please consult Model Optimizer documentation https://nvidia.github.io/Model-Optimizer/ for details.
 
     Quantization algorithm can also be conveniently set to None to perform only weights export step
     for TensorRT-LLM deployment. This is useful to getting baseline results for a full-precision model.

nemo/collections/llm/modelopt/speculative/model_transform.py

Lines changed: 1 addition & 1 deletion
@@ -32,7 +32,7 @@ def apply_speculative_decoding(model: nn.Module, algorithm: str = "eagle3") -> n
     Args:
         model: The model to transform.
         algorithm: The algorithm to use for Speculative Decoding.
-            (See https://github.com/NVIDIA/TensorRT-Model-Optimizer/blob/main/modelopt/torch/speculative/config.py)
+            (See https://github.com/NVIDIA/Model-Optimizer/blob/main/modelopt/torch/speculative/config.py)
 
     Returns:
         The transformed model.

nemo/collections/vlm/modelopt/__init__.py

Lines changed: 1 addition & 1 deletion
@@ -11,6 +11,6 @@
 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 # See the License for the specific language governing permissions and
 # limitations under the License.
-"""Model optimization utilities for VLM models using TensorRT Model Optimizer."""
+"""Model optimization utilities for VLM models using Model Optimizer."""
 
 from .model_utils import *  # noqa: F401

nemo/export/quantize/quantizer.py

Lines changed: 1 addition & 1 deletion
@@ -92,7 +92,7 @@ class Quantizer:
     model families is experimental and might not be fully supported.
 
     Available quantization methods are listed in `QUANT_CFG_CHOICES` dictionary above.
-    Please consult Model Optimizer documentation https://nvidia.github.io/TensorRT-Model-Optimizer/ for details.
+    Please consult Model Optimizer documentation https://nvidia.github.io/Model-Optimizer/ for details.
     You can also inspect different choices in examples/nlp/language_modeling/conf/megatron_gpt_ptq.yaml
     for quantization algorithms and calibration data as well as recommended settings.

scripts/llm/gpt_convert_speculative.py

Lines changed: 1 addition & 1 deletion
@@ -30,7 +30,7 @@
 - Eagle 3 (default): Extrapolation Algorithm for Greater Language-model Efficiency
 
 For more details on speculative decoding algorithms, refer to the NVIDIA Model Optimizer documentation:
-https://nvidia.github.io/TensorRT-Model-Optimizer/guides/7_speculative_decoding.html
+https://nvidia.github.io/Model-Optimizer/guides/7_speculative_decoding.html
 """
 
 from argparse import ArgumentParser

scripts/llm/ptq.py

Lines changed: 2 additions & 2 deletions
@@ -32,7 +32,7 @@ def get_args():
     parser.add_argument(
         "--tokenizer", type=str, help="Tokenizer to use. If not provided, model tokenizer will be used"
     )
-    parser.add_argument("--decoder_type", type=str, help="Decoder type for TensorRT-Model-Optimizer")
+    parser.add_argument("--decoder_type", type=str, help="Decoder type for Model-Optimizer")
     parser.add_argument("-ctp", "--calibration_tp", "--calib_tp", type=int, default=1)
     parser.add_argument("-cep", "--calibration_ep", "--calib_ep", type=int, default=1)
     parser.add_argument("-cpp", "--calibration_pp", "--calib_pp", type=int, default=1)
@@ -75,7 +75,7 @@ def get_args():
         "--algorithm",
         type=str,
         default="fp8",
-        help="TensorRT-Model-Optimizer quantization algorithm",
+        help="Model-Optimizer quantization algorithm",
     )
     parser.add_argument(
         "-awq_bs", "--awq_block_size", type=int, default=128, help="Block size for AWQ quantization algorithms"
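For reference, the flags touched by this diff can be reproduced in a small self-contained parser. This sketch includes only the options visible in the hunks above; the real scripts/llm/ptq.py defines many more.

```python
# Minimal reconstruction of the argument-parser flags shown in the diff
# above (not the complete scripts/llm/ptq.py parser). Note that argparse
# derives the dest from the first long option string, so "-ctp",
# "--calibration_tp", "--calib_tp" all populate args.calibration_tp.
from argparse import ArgumentParser

parser = ArgumentParser(description="PTQ flags visible in this commit")
parser.add_argument(
    "--tokenizer", type=str, help="Tokenizer to use. If not provided, model tokenizer will be used"
)
parser.add_argument("--decoder_type", type=str, help="Decoder type for Model-Optimizer")
parser.add_argument("-ctp", "--calibration_tp", "--calib_tp", type=int, default=1)
parser.add_argument("-cep", "--calibration_ep", "--calib_ep", type=int, default=1)
parser.add_argument("-cpp", "--calibration_pp", "--calib_pp", type=int, default=1)
parser.add_argument("--algorithm", type=str, default="fp8", help="Model-Optimizer quantization algorithm")
parser.add_argument(
    "-awq_bs", "--awq_block_size", type=int, default=128, help="Block size for AWQ quantization algorithms"
)

# Example invocation equivalent to: ptq.py --algorithm int8_sq -ctp 2
args = parser.parse_args(["--algorithm", "int8_sq", "-ctp", "2"])
print(args.algorithm, args.calibration_tp, args.awq_block_size)
```

Unspecified options fall back to their defaults, so this invocation still yields `calibration_ep == 1` and `awq_block_size == 128`.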
