Bug Description
The outputs of TRT compilation do not match the PyTorch outputs for the Llama 2 model. The causes are the following:
- Running in FP16 precision (LayerNorm warns that FP16 precision is not sufficient), so we need to compile in FP32 precision.
- Rotation (the rotary position embedding block): https://github.com/huggingface/transformers/blob/main/src/transformers/models/llama/modeling_llama.py#L152-L156. This block leads to an output mismatch.
- Adding the attention mask: https://github.com/huggingface/transformers/blob/e65502951593a76844e872fee9c56b805598538a/src/transformers/models/llama/modeling_llama.py#L347-L349. These lines also cause an output mismatch.
- Compiling with dynamic shapes and FP32 also leads to high memory usage (see the input-spec sketch after this list).
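For reference, a minimal sketch of the dynamic-shape input spec assumed here (the min/opt/max sequence lengths are illustrative, not the exact values from the failing run):

```python
import torch
import torch_tensorrt

# Illustrative dynamic sequence-length range for the input_ids tensor.
dyn_input_ids = torch_tensorrt.Input(
    min_shape=(1, 1),
    opt_shape=(1, 128),
    max_shape=(1, 2048),
    dtype=torch.int64,
)
```

Passing a spec like this to torch_tensorrt.compile together with enabled_precisions={torch.float32} is the combination that shows the high memory usage.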
To Reproduce
Steps to reproduce the behavior:
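A minimal reproduction sketch, assuming the meta-llama/Llama-2-7b-hf checkpoint and the dynamo frontend; the prompt and the use of return_dict=False / use_cache=False are assumptions for illustration:

```python
import torch
import torch_tensorrt
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-hf"  # assumed checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
# return_dict=False / use_cache=False keep the traced graph simple (tuple outputs, no KV cache).
model = AutoModelForCausalLM.from_pretrained(
    model_id, return_dict=False, use_cache=False
).eval().cuda()

prompt = "Tell me about the planet Mars."  # illustrative prompt
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.cuda()

with torch.no_grad():
    pyt_logits = model(input_ids)[0]

# Compile in FP32: FP16 triggers the LayerNorm precision warning and a larger mismatch.
trt_model = torch_tensorrt.compile(
    model,
    ir="dynamo",
    inputs=[input_ids],
    enabled_precisions={torch.float32},
)

with torch.no_grad():
    trt_logits = trt_model(input_ids)[0]

print("max abs diff:", (pyt_logits - trt_logits).abs().max().item())
```

With FP16 the mismatch is larger; with FP32 and the dynamic-shape spec above, compilation memory usage becomes the issue.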
Expected behavior
Environment
Build information about Torch-TensorRT can be found by turning on debug messages
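For example, a sketch of turning on debug messages via the torch_tensorrt.logging API (assuming a version that exposes set_reportable_log_level):

```python
import torch_tensorrt

# Print detailed build information during compilation.
torch_tensorrt.logging.set_reportable_log_level(torch_tensorrt.logging.Level.Debug)
```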
- Torch-TensorRT Version (e.g. 1.0.0):
- PyTorch Version (e.g. 1.0):
- CPU Architecture:
- OS (e.g., Linux):
- How you installed PyTorch (conda, pip, libtorch, source):
- Build command you used (if compiling from source):
- Are you using local sources or building from archives:
- Python version:
- CUDA version:
- GPU models and configuration:
- Any other relevant information: