
Commit 828fcf3

yueshen2016 authored and quapham committed

remove all TRT when referenced with TRT Model Optimizer (NVIDIA-NeMo#15147)

Signed-off-by: Yue <yueshen@nvidia.com>
Signed-off-by: quanpham <youngkwan199@gmail.com>

1 parent 9da5f5e commit 828fcf3

File tree

14 files changed: +19 additions, -19 deletions


docs/source/index.rst

Lines changed: 2 additions & 2 deletions
@@ -12,8 +12,8 @@ NVIDIA NeMo Framework is an end-to-end, cloud-native framework designed to build
 - Flash Attention
 - Activation Recomputation
 - Positional Embeddings and Positional Interpolation
-- Post-Training Quantization (PTQ) and Quantization Aware Training (QAT) with `TensorRT Model Optimizer <https://github.com/NVIDIA/TensorRT-Model-Optimizer>`_
-- Knowledge Distillation-based training with `TensorRT Model Optimizer <https://github.com/NVIDIA/TensorRT-Model-Optimizer>`_
+- Post-Training Quantization (PTQ) and Quantization Aware Training (QAT) with `Model Optimizer <https://github.com/NVIDIA/Model-Optimizer>`_
+- Knowledge Distillation-based training with `Model Optimizer <https://github.com/NVIDIA/Model-Optimizer>`_
 - Sequence Packing
 
 `NVIDIA NeMo Framework <https://github.com/NVIDIA/NeMo>`_ has separate collections for:

docs/source/nlp/quantization.rst

Lines changed: 1 addition & 1 deletion
@@ -11,7 +11,7 @@ PTQ enables deploying a model in a low-precision format -- FP8, INT4, or INT8 --
 
 Model quantization has two primary benefits: reduced model memory requirements and increased inference throughput.
 
-In NeMo, quantization is enabled by the `NVIDIA TensorRT Model Optimizer (ModelOpt) <https://github.com/NVIDIA/TensorRT-Model-Optimizer>`_ library -- a library to quantize and compress deep learning models for optimized inference on GPUs.
+In NeMo, quantization is enabled by the `NVIDIA Model Optimizer (ModelOpt) <https://github.com/NVIDIA/Model-Optimizer>`_ library -- a library to quantize and compress deep learning models for optimized inference on GPUs.
 
 The quantization process consists of the following steps:
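The memory benefit this documentation describes can be sketched with a toy symmetric INT8 quantizer. This is an illustration only, written in plain Python: it is not the ModelOpt API, and real PTQ also calibrates activations on sample data, not just weights.

```python
# Toy sketch of the idea behind post-training quantization (PTQ):
# store weights as 8-bit integers plus a single scale factor instead of
# 32-bit floats, cutting weight memory roughly 4x. Illustration only --
# NOT the ModelOpt API.

def quantize_int8(weights):
    """Symmetric per-tensor INT8 quantization: w ~= q * scale."""
    max_abs = max((abs(w) for w in weights), default=0.0)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale

def dequantize(quantized, scale):
    """Recover approximate float weights from int8 values."""
    return [q * scale for q in quantized]

weights = [0.02, -1.5, 0.73, 3.1, -0.004]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# Each int8 value occupies 1 byte vs 4 bytes per float32 weight,
# at the cost of a bounded rounding error (at most scale / 2).
print(q)
print(max(abs(a - b) for a, b in zip(weights, restored)) <= scale / 2)
```

The round-trip error is bounded by half the quantization step, which is why low-precision formats remain accurate enough for inference after calibration.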

nemo/collections/llm/modelopt/__init__.py

Lines changed: 1 addition & 1 deletion
@@ -11,7 +11,7 @@
 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 # See the License for the specific language governing permissions and
 # limitations under the License.
-"""Model optimization utilities for using TensorRT Model Optimizer."""
+"""Model optimization utilities for using Model Optimizer."""
 
 from .distill import *  # noqa: F401
 from .model_utils import *  # noqa: F401

nemo/collections/llm/modelopt/prune/__init__.py

Lines changed: 1 addition & 1 deletion
@@ -12,7 +12,7 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 
-"""Prune utilities for using TensorRT Model Optimizer."""
+"""Prune utilities for using Model Optimizer."""
 
 from .pruner import PruningConfig, prune_language_model, save_pruned_model

nemo/collections/llm/modelopt/quantization/quantizer.py

Lines changed: 1 addition & 1 deletion
@@ -65,7 +65,7 @@ class QuantizationConfig:
     """Quantization parameters.
 
     Available quantization methods are listed in `QUANT_CFG_CHOICES` dictionary above.
-    Please consult Model Optimizer documentation https://nvidia.github.io/TensorRT-Model-Optimizer/ for details.
+    Please consult Model Optimizer documentation https://nvidia.github.io/Model-Optimizer/ for details.
 
     Quantization algorithm can also be conveniently set to None to perform only weights export step
     for TensorRT-LLM deployment. This is useful to getting baseline results for a full-precision model.

nemo/collections/llm/modelopt/speculative/model_transform.py

Lines changed: 1 addition & 1 deletion
@@ -32,7 +32,7 @@ def apply_speculative_decoding(model: nn.Module, algorithm: str = "eagle3") -> n
     Args:
         model: The model to transform.
         algorithm: The algorithm to use for Speculative Decoding.
-            (See https://github.com/NVIDIA/TensorRT-Model-Optimizer/blob/main/modelopt/torch/speculative/config.py)
+            (See https://github.com/NVIDIA/Model-Optimizer/blob/main/modelopt/torch/speculative/config.py)
 
     Returns:
         The transformed model.

nemo/collections/vlm/modelopt/__init__.py

Lines changed: 1 addition & 1 deletion
@@ -11,6 +11,6 @@
 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 # See the License for the specific language governing permissions and
 # limitations under the License.
-"""Model optimization utilities for VLM models using TensorRT Model Optimizer."""
+"""Model optimization utilities for VLM models using Model Optimizer."""
 
 from .model_utils import *  # noqa: F401

nemo/export/quantize/quantizer.py

Lines changed: 1 addition & 1 deletion
@@ -92,7 +92,7 @@ class Quantizer:
     model families is experimental and might not be fully supported.
 
     Available quantization methods are listed in `QUANT_CFG_CHOICES` dictionary above.
-    Please consult Model Optimizer documentation https://nvidia.github.io/TensorRT-Model-Optimizer/ for details.
+    Please consult Model Optimizer documentation https://nvidia.github.io/Model-Optimizer/ for details.
     You can also inspect different choices in examples/nlp/language_modeling/conf/megatron_gpt_ptq.yaml
     for quantization algorithms and calibration data as well as recommended settings.

scripts/llm/gpt_convert_speculative.py

Lines changed: 1 addition & 1 deletion
@@ -30,7 +30,7 @@
 - Eagle 3 (default): Extrapolation Algorithm for Greater Language-model Efficiency
 
 For more details on speculative decoding algorithms, refer to the NVIDIA Model Optimizer documentation:
-https://nvidia.github.io/TensorRT-Model-Optimizer/guides/7_speculative_decoding.html
+https://nvidia.github.io/Model-Optimizer/guides/7_speculative_decoding.html
 """
 
 from argparse import ArgumentParser

scripts/llm/ptq.py

Lines changed: 2 additions & 2 deletions
@@ -32,7 +32,7 @@ def get_args():
     parser.add_argument(
         "--tokenizer", type=str, help="Tokenizer to use. If not provided, model tokenizer will be used"
     )
-    parser.add_argument("--decoder_type", type=str, help="Decoder type for TensorRT-Model-Optimizer")
+    parser.add_argument("--decoder_type", type=str, help="Decoder type for Model-Optimizer")
     parser.add_argument("-ctp", "--calibration_tp", "--calib_tp", type=int, default=1)
     parser.add_argument("-cep", "--calibration_ep", "--calib_ep", type=int, default=1)
     parser.add_argument("-cpp", "--calibration_pp", "--calib_pp", type=int, default=1)
@@ -75,7 +75,7 @@ def get_args():
         "--algorithm",
         type=str,
         default="fp8",
-        help="TensorRT-Model-Optimizer quantization algorithm",
+        help="Model-Optimizer quantization algorithm",
     )
     parser.add_argument(
         "-awq_bs", "--awq_block_size", type=int, default=128, help="Block size for AWQ quantization algorithms"
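For reference, the flags touched by this diff can be reproduced in a small self-contained parser. This sketch includes only the options visible in the hunks above; the real scripts/llm/ptq.py defines many more.

```python
# Minimal reconstruction of the argument-parser flags shown in the diff
# above (not the complete scripts/llm/ptq.py parser). Note that argparse
# derives the dest from the first long option string, so "-ctp",
# "--calibration_tp", "--calib_tp" all populate args.calibration_tp.
from argparse import ArgumentParser

parser = ArgumentParser(description="PTQ flags visible in this commit")
parser.add_argument(
    "--tokenizer", type=str, help="Tokenizer to use. If not provided, model tokenizer will be used"
)
parser.add_argument("--decoder_type", type=str, help="Decoder type for Model-Optimizer")
parser.add_argument("-ctp", "--calibration_tp", "--calib_tp", type=int, default=1)
parser.add_argument("-cep", "--calibration_ep", "--calib_ep", type=int, default=1)
parser.add_argument("-cpp", "--calibration_pp", "--calib_pp", type=int, default=1)
parser.add_argument("--algorithm", type=str, default="fp8", help="Model-Optimizer quantization algorithm")
parser.add_argument(
    "-awq_bs", "--awq_block_size", type=int, default=128, help="Block size for AWQ quantization algorithms"
)

# Example invocation equivalent to: ptq.py --algorithm int8_sq -ctp 2
args = parser.parse_args(["--algorithm", "int8_sq", "-ctp", "2"])
print(args.algorithm, args.calibration_tp, args.awq_block_size)
```

Unspecified options fall back to their defaults, so this invocation still yields `calibration_ep == 1` and `awq_block_size == 128`.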
