System Info
Who can help?
@fxmarty
Information
Tasks
Reproduction (minimal, reproducible, runnable)
The method described in the docs for building the TensorRT (TRT) engine is outdated, as first mentioned here. I tested the dynamic shapes method in optimum-benchmark here.
Expected behavior
We can update the docs with this snippet:
import torch
from optimum.onnxruntime import ORTModelForCausalLM

# Cache the built engine and declare the dynamic-shape profile (batch x sequence length).
provider_options = {
    "trt_engine_cache_enable": True,
    "trt_engine_cache_path": "tmp/trt_cache_gpt2_example",
    "trt_profile_min_shapes": "input_ids:1x16,attention_mask:1x16",
    "trt_profile_max_shapes": "input_ids:1x64,attention_mask:1x64",
    "trt_profile_opt_shapes": "input_ids:1x32,attention_mask:1x32",
}
ort_model = ORTModelForCausalLM.from_pretrained(
    "gpt2",
    export=True,
    use_cache=False,
    provider="TensorrtExecutionProvider",
    provider_options=provider_options,
)
# Start from 16 tokens and generate exactly 48 more, so the sequence length
# stays within the 16 to 64 range declared in the profile.
ort_model.generate(
    input_ids=torch.tensor([[1] * 16]).to("cuda"),
    max_new_tokens=64 - 16,
    min_new_tokens=64 - 16,
    pad_token_id=0,
    eos_token_id=0,
)
However, it's still not clear to me what effect trt_profile_opt_shapes has.
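The shape strings in the snippet are easy to mistype by hand, so here is a minimal sketch of how they could be generated. The helper names `format_profile_shapes` and `build_trt_profile_options` are hypothetical and not part of optimum or ONNX Runtime; only the three `trt_profile_*` keys and the `name:AxB,name:AxB` string format come from the snippet above. The `min <= opt <= max` check reflects my understanding of how TensorRT optimization profiles are constrained.

```python
from typing import Dict, Sequence


def format_profile_shapes(shapes: Dict[str, Sequence[int]]) -> str:
    """Hypothetical helper: render {"input_ids": (1, 16)} as "input_ids:1x16"."""
    return ",".join(
        f"{name}:{'x'.join(str(dim) for dim in dims)}"
        for name, dims in shapes.items()
    )


def build_trt_profile_options(
    input_names: Sequence[str],
    batch_size: int,
    min_len: int,
    opt_len: int,
    max_len: int,
) -> Dict[str, str]:
    """Hypothetical helper: build the three trt_profile_* entries for inputs
    that all share the shape (batch_size, sequence_length)."""
    # Assumption: the profile requires min <= opt <= max for every dimension.
    if not (min_len <= opt_len <= max_len):
        raise ValueError("expected min_len <= opt_len <= max_len")

    def shapes_for(seq_len: int) -> str:
        return format_profile_shapes(
            {name: (batch_size, seq_len) for name in input_names}
        )

    return {
        "trt_profile_min_shapes": shapes_for(min_len),
        "trt_profile_opt_shapes": shapes_for(opt_len),
        "trt_profile_max_shapes": shapes_for(max_len),
    }


# Reproduce the profile used in the snippet above.
profile = build_trt_profile_options(
    ["input_ids", "attention_mask"], batch_size=1, min_len=16, opt_len=32, max_len=64
)
print(profile)
```

The resulting dict could be merged into `provider_options` alongside the engine cache settings, e.g. `provider_options.update(profile)`.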