
An error occurred when using AWQ Fused modules #28028

@moufuyu

Description

System Info

  • transformers version: 4.36.0
  • autoawq version: 0.1.7
  • Platform: Linux-5.10.192-183.736.amzn2.x86_64-x86_64-with-glibc2.26
  • Python version: 3.10.13
  • Huggingface_hub version: 0.19.4
  • Safetensors version: 0.4.1
  • Accelerate version: 0.25.0
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.0.1 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: Yes
  • Using distributed or parallel set-up in script?: No

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Code snippet (I am using the inference code described in #27411):

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, AwqConfig, TextStreamer

model_id = "TheBloke/Mistral-7B-OpenOrca-AWQ"
torch_device = "cuda" if torch.cuda.is_available() else "cpu"

quantization_config = AwqConfig(
    bits=4,
    do_fuse=True,
    fuse_max_seq_len=512,
)

model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=quantization_config).to(0)
tokenizer = AutoTokenizer.from_pretrained(model_id)

streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

prompt_template = """\
<|im_start|>system
You are MistralOrca, a large language model trained by Alignment Lab AI. Write out your reasoning step-by-step to be sure you get the right answers!<|im_end|>
<|im_start|>user
{prompt}<|im_end|>
<|im_start|>assistant"""

prompt = "You're standing on the surface of the Earth. "\
        "You walk one mile south, one mile west and one mile north. "\
        "You end up exactly where you started. Where are you?"

tokenizer.pad_token = tokenizer.eos_token

inputs = tokenizer([prompt_template.format(prompt=prompt), prompt_template.format(prompt=prompt), prompt_template.format(prompt=prompt)], return_tensors="pt", padding=True).to(0)

outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))

Error messages:

You passed `quantization_config` to `from_pretrained` but the model you're loading already has a `quantization_config` attribute and has already quantized weights. However, loading attributes (e.g. ['fuse_max_seq_len', 'modules_to_fuse', 'do_fuse']) will be overwritten with the one you passed to `from_pretrained`. The rest will be ignored.
You have loaded an AWQ model on CPU and have a CUDA device available, make sure to set your model on a GPU device in order to run your model.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Setting `pad_token_id` to `eos_token_id`:32000 for open-end generation.
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In[5], line 32
     28 tokenizer.pad_token = tokenizer.eos_token
     30 inputs = tokenizer([prompt_template.format(prompt=prompt), prompt_template.format(prompt=prompt), prompt_template.format(prompt=prompt)], return_tensors="pt", padding=True).to(0)
---> 32 outputs = model.generate(**inputs, max_new_tokens=512)
     33 print(tokenizer.batch_decode(outputs, skip_special_tokens=True))

File ~/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/utils/_contextlib.py:115, in context_decorator.<locals>.decorate_context(*args, **kwargs)
    112 @functools.wraps(func)
    113 def decorate_context(*args, **kwargs):
    114     with ctx_factory():
--> 115         return func(*args, **kwargs)

File ~/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/transformers/generation/utils.py:1718, in GenerationMixin.generate(self, inputs, generation_config, logits_processor, stopping_criteria, prefix_allowed_tokens_fn, synced_gpus, assistant_model, streamer, negative_prompt_ids, negative_prompt_attention_mask, **kwargs)
   1701     return self.assisted_decoding(
   1702         input_ids,
   1703         assistant_model=assistant_model,
   (...)
   1714         **model_kwargs,
   1715     )
   1716 if generation_mode == GenerationMode.GREEDY_SEARCH:
   1717     # 11. run greedy search
-> 1718     return self.greedy_search(
   1719         input_ids,
   1720         logits_processor=logits_processor,
   1721         stopping_criteria=stopping_criteria,
   1722         pad_token_id=generation_config.pad_token_id,
   1723         eos_token_id=generation_config.eos_token_id,
   1724         output_scores=generation_config.output_scores,
   1725         return_dict_in_generate=generation_config.return_dict_in_generate,
   1726         synced_gpus=synced_gpus,
   1727         streamer=streamer,
   1728         **model_kwargs,
   1729     )
   1731 elif generation_mode == GenerationMode.CONTRASTIVE_SEARCH:
   1732     if not model_kwargs["use_cache"]:

File ~/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/transformers/generation/utils.py:2579, in GenerationMixin.greedy_search(self, input_ids, logits_processor, stopping_criteria, max_length, pad_token_id, eos_token_id, output_attentions, output_hidden_states, output_scores, return_dict_in_generate, synced_gpus, streamer, **model_kwargs)
   2576 model_inputs = self.prepare_inputs_for_generation(input_ids, **model_kwargs)
   2578 # forward pass to get next token
-> 2579 outputs = self(
   2580     **model_inputs,
   2581     return_dict=True,
   2582     output_attentions=output_attentions,
   2583     output_hidden_states=output_hidden_states,
   2584 )
   2586 if synced_gpus and this_peer_finished:
   2587     continue  # don't waste resources running the code we don't need

File ~/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/nn/modules/module.py:1518, in Module._wrapped_call_impl(self, *args, **kwargs)
   1516     return self._compiled_call_impl(*args, **kwargs)  # type: ignore[misc]
   1517 else:
-> 1518     return self._call_impl(*args, **kwargs)

File ~/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/nn/modules/module.py:1527, in Module._call_impl(self, *args, **kwargs)
   1522 # If we don't have any hooks, we want to skip the rest of the logic in
   1523 # this function, and just call forward.
   1524 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1525         or _global_backward_pre_hooks or _global_backward_hooks
   1526         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1527     return forward_call(*args, **kwargs)
   1529 try:
   1530     result = None

File ~/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/transformers/models/mistral/modeling_mistral.py:1044, in MistralForCausalLM.forward(self, input_ids, attention_mask, position_ids, past_key_values, inputs_embeds, labels, use_cache, output_attentions, output_hidden_states, return_dict)
   1041 return_dict = return_dict if return_dict is not None else self.config.use_return_dict
   1043 # decoder outputs consists of (dec_features, layer_state, dec_hidden, dec_attn)
-> 1044 outputs = self.model(
   1045     input_ids=input_ids,
   1046     attention_mask=attention_mask,
   1047     position_ids=position_ids,
   1048     past_key_values=past_key_values,
   1049     inputs_embeds=inputs_embeds,
   1050     use_cache=use_cache,
   1051     output_attentions=output_attentions,
   1052     output_hidden_states=output_hidden_states,
   1053     return_dict=return_dict,
   1054 )
   1056 hidden_states = outputs[0]
   1057 logits = self.lm_head(hidden_states)

File ~/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/nn/modules/module.py:1518, in Module._wrapped_call_impl(self, *args, **kwargs)
   1516     return self._compiled_call_impl(*args, **kwargs)  # type: ignore[misc]
   1517 else:
-> 1518     return self._call_impl(*args, **kwargs)

File ~/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/nn/modules/module.py:1527, in Module._call_impl(self, *args, **kwargs)
   1522 # If we don't have any hooks, we want to skip the rest of the logic in
   1523 # this function, and just call forward.
   1524 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1525         or _global_backward_pre_hooks or _global_backward_hooks
   1526         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1527     return forward_call(*args, **kwargs)
   1529 try:
   1530     result = None

File ~/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/transformers/models/mistral/modeling_mistral.py:954, in MistralModel.forward(self, input_ids, attention_mask, position_ids, past_key_values, inputs_embeds, use_cache, output_attentions, output_hidden_states, return_dict)
    952 next_cache = None
    953 if use_cache:
--> 954     next_cache = next_decoder_cache.to_legacy_cache() if use_legacy_cache else next_decoder_cache
    956 if not return_dict:
    957     return tuple(v for v in [hidden_states, next_cache, all_hidden_states, all_self_attns] if v is not None)

AttributeError: 'list' object has no attribute 'to_legacy_cache'
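From the traceback, the failure happens in MistralModel.forward when it calls next_decoder_cache.to_legacy_cache(): with do_fuse=True the past key values appear to come back as a plain list instead of the cache object that transformers 4.36 expects. For comparison, below is a minimal check (not part of the snippet above; the prompt text and generation arguments are just placeholders) that loads the same checkpoint without passing an AwqConfig, so the fused modules are not built and generation goes through the regular attention/cache path:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Same checkpoint, but without do_fuse / AwqConfig, so no fused modules are created.
model_id = "TheBloke/Mistral-7B-OpenOrca-AWQ"
model = AutoModelForCausalLM.from_pretrained(model_id).to(0)
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token

inputs = tokenizer("Hello, how are you?", return_tensors="pt").to(0)
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))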

Expected behavior

The fused modules of the AWQ model are expected to work during generation without raising an error.
