System Info
- transformers version: 4.36.0
- autoawq version: 0.1.7
- Platform: Linux-5.10.192-183.736.amzn2.x86_64-x86_64-with-glibc2.26
- Python version: 3.10.13
- Huggingface_hub version: 0.19.4
- Safetensors version: 0.4.1
- Accelerate version: 0.25.0
- Accelerate config: not found
- PyTorch version (GPU?): 2.0.1 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: Yes
- Using distributed or parallel set-up in script?: No
Who can help?
No response
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
Code snippet (I am using the inference code described in #27411):
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, AwqConfig, TextStreamer
model_id = "TheBloke/Mistral-7B-OpenOrca-AWQ"
torch_device = "cuda" if torch.cuda.is_available() else "cpu"
quantization_config = AwqConfig(
bits=4,
do_fuse=True,
fuse_max_seq_len=512,
)
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=quantization_config).to(0)
tokenizer = AutoTokenizer.from_pretrained(model_id)
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
prompt_template = """\
<|im_start|>system
You are MistralOrca, a large language model trained by Alignment Lab AI. Write out your reasoning step-by-step to be sure you get the right answers!<|im_end|>
<|im_start|>user
{prompt}<|im_end|>
<|im_start|>assistant"""
prompt = "You're standing on the surface of the Earth. "\
"You walk one mile south, one mile west and one mile north. "\
"You end up exactly where you started. Where are you?"
tokenizer.pad_token = tokenizer.eos_token
inputs = tokenizer([prompt_template.format(prompt=prompt), prompt_template.format(prompt=prompt), prompt_template.format(prompt=prompt)], return_tensors="pt", padding=True).to(0)
outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))

Error messages:
You passed `quantization_config` to `from_pretrained` but the model you're loading already has a `quantization_config` attribute and has already quantized weights. However, loading attributes (e.g. ['fuse_max_seq_len', 'modules_to_fuse', 'do_fuse']) will be overwritten with the one you passed to `from_pretrained`. The rest will be ignored.
You have loaded an AWQ model on CPU and have a CUDA device available, make sure to set your model on a GPU device in order to run your model.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Setting `pad_token_id` to `eos_token_id`:32000 for open-end generation.
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
Cell In[5], line 32
28 tokenizer.pad_token = tokenizer.eos_token
30 inputs = tokenizer([prompt_template.format(prompt=prompt), prompt_template.format(prompt=prompt), prompt_template.format(prompt=prompt)], return_tensors="pt", padding=True).to(0)
---> 32 outputs = model.generate(**inputs, max_new_tokens=512)
33 print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
File ~/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/utils/_contextlib.py:115, in context_decorator.<locals>.decorate_context(*args, **kwargs)
112 @functools.wraps(func)
113 def decorate_context(*args, **kwargs):
114 with ctx_factory():
--> 115 return func(*args, **kwargs)
File ~/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/transformers/generation/utils.py:1718, in GenerationMixin.generate(self, inputs, generation_config, logits_processor, stopping_criteria, prefix_allowed_tokens_fn, synced_gpus, assistant_model, streamer, negative_prompt_ids, negative_prompt_attention_mask, **kwargs)
1701 return self.assisted_decoding(
1702 input_ids,
1703 assistant_model=assistant_model,
(...)
1714 **model_kwargs,
1715 )
1716 if generation_mode == GenerationMode.GREEDY_SEARCH:
1717 # 11. run greedy search
-> 1718 return self.greedy_search(
1719 input_ids,
1720 logits_processor=logits_processor,
1721 stopping_criteria=stopping_criteria,
1722 pad_token_id=generation_config.pad_token_id,
1723 eos_token_id=generation_config.eos_token_id,
1724 output_scores=generation_config.output_scores,
1725 return_dict_in_generate=generation_config.return_dict_in_generate,
1726 synced_gpus=synced_gpus,
1727 streamer=streamer,
1728 **model_kwargs,
1729 )
1731 elif generation_mode == GenerationMode.CONTRASTIVE_SEARCH:
1732 if not model_kwargs["use_cache"]:
File ~/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/transformers/generation/utils.py:2579, in GenerationMixin.greedy_search(self, input_ids, logits_processor, stopping_criteria, max_length, pad_token_id, eos_token_id, output_attentions, output_hidden_states, output_scores, return_dict_in_generate, synced_gpus, streamer, **model_kwargs)
2576 model_inputs = self.prepare_inputs_for_generation(input_ids, **model_kwargs)
2578 # forward pass to get next token
-> 2579 outputs = self(
2580 **model_inputs,
2581 return_dict=True,
2582 output_attentions=output_attentions,
2583 output_hidden_states=output_hidden_states,
2584 )
2586 if synced_gpus and this_peer_finished:
2587 continue # don't waste resources running the code we don't need
File ~/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/nn/modules/module.py:1518, in Module._wrapped_call_impl(self, *args, **kwargs)
1516 return self._compiled_call_impl(*args, **kwargs) # type: ignore[misc]
1517 else:
-> 1518 return self._call_impl(*args, **kwargs)
File ~/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/nn/modules/module.py:1527, in Module._call_impl(self, *args, **kwargs)
1522 # If we don't have any hooks, we want to skip the rest of the logic in
1523 # this function, and just call forward.
1524 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
1525 or _global_backward_pre_hooks or _global_backward_hooks
1526 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1527 return forward_call(*args, **kwargs)
1529 try:
1530 result = None
File ~/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/transformers/models/mistral/modeling_mistral.py:1044, in MistralForCausalLM.forward(self, input_ids, attention_mask, position_ids, past_key_values, inputs_embeds, labels, use_cache, output_attentions, output_hidden_states, return_dict)
1041 return_dict = return_dict if return_dict is not None else self.config.use_return_dict
1043 # decoder outputs consists of (dec_features, layer_state, dec_hidden, dec_attn)
-> 1044 outputs = self.model(
1045 input_ids=input_ids,
1046 attention_mask=attention_mask,
1047 position_ids=position_ids,
1048 past_key_values=past_key_values,
1049 inputs_embeds=inputs_embeds,
1050 use_cache=use_cache,
1051 output_attentions=output_attentions,
1052 output_hidden_states=output_hidden_states,
1053 return_dict=return_dict,
1054 )
1056 hidden_states = outputs[0]
1057 logits = self.lm_head(hidden_states)
File ~/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/nn/modules/module.py:1518, in Module._wrapped_call_impl(self, *args, **kwargs)
1516 return self._compiled_call_impl(*args, **kwargs) # type: ignore[misc]
1517 else:
-> 1518 return self._call_impl(*args, **kwargs)
File ~/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/nn/modules/module.py:1527, in Module._call_impl(self, *args, **kwargs)
1522 # If we don't have any hooks, we want to skip the rest of the logic in
1523 # this function, and just call forward.
1524 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
1525 or _global_backward_pre_hooks or _global_backward_hooks
1526 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1527 return forward_call(*args, **kwargs)
1529 try:
1530 result = None
File ~/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/transformers/models/mistral/modeling_mistral.py:954, in MistralModel.forward(self, input_ids, attention_mask, position_ids, past_key_values, inputs_embeds, use_cache, output_attentions, output_hidden_states, return_dict)
952 next_cache = None
953 if use_cache:
--> 954 next_cache = next_decoder_cache.to_legacy_cache() if use_legacy_cache else next_decoder_cache
956 if not return_dict:
957 return tuple(v for v in [hidden_states, next_cache, all_hidden_states, all_self_attns] if v is not None)
AttributeError: 'list' object has no attribute 'to_legacy_cache'
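
For reference, the AttributeError shows that next_decoder_cache is a plain Python list rather than a Cache object when the fused AWQ attention modules are active. Below is a minimal sketch of a possible workaround while fused modules are broken; it assumes the non-fused AWQ path is unaffected and simply leaves do_fuse at its default instead of passing an AwqConfig. The model id is the one from the snippet above.

from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "TheBloke/Mistral-7B-OpenOrca-AWQ"

# Assumption: loading without an explicit AwqConfig keeps the checkpoint's own
# quantization_config and leaves do_fuse at its default (disabled), so generation
# goes through the regular, non-fused AWQ path.
model = AutoModelForCausalLM.from_pretrained(model_id).to(0)
tokenizer = AutoTokenizer.from_pretrained(model_id)

inputs = tokenizer("Hello, world!", return_tensors="pt").to(0)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))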
Expected behavior
The expected behavior is for the fused modules of the AWQ model to work without errors during generation.
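
For illustration only, a hypothetical check of this expectation, reusing the model, tokenizer, and inputs names from the reproduction snippet above (so it assumes the fused model loaded successfully):

# Hypothetical check: with do_fuse=True, generate() should return one decoded
# sequence per prompt instead of raising AttributeError in to_legacy_cache().
outputs = model.generate(**inputs, max_new_tokens=512)
decoded = tokenizer.batch_decode(outputs, skip_special_tokens=True)
assert len(decoded) == inputs["input_ids"].shape[0]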