transformers implements the LLaMA model's Rotary Positional Embedding (RoPE) as follows:
transformers/src/transformers/models/llama/modeling_llama.py, lines 173 to 188 in e42587f
This is GPT-NeoX-style RoPE. But in Meta's official model implementation, the model adopts GPT-J-style RoPE, which processes the query and key vectors in an interleaved way instead of splitting them into two halves (as the rotate_half method does).
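For reference, here is a minimal sketch of the half-split rotation along the lines of the linked transformers code. Treat it as an illustration, not a verbatim excerpt; the exact signatures in modeling_llama.py may differ.

    import torch

    def rotate_half(x: torch.Tensor) -> torch.Tensor:
        # Split the head dimension into two halves and swap them with a sign flip:
        # [x1, x2] -> [-x2, x1]
        x1 = x[..., : x.shape[-1] // 2]
        x2 = x[..., x.shape[-1] // 2 :]
        return torch.cat((-x2, x1), dim=-1)

    def apply_rotary_pos_emb(q, k, cos, sin):
        # cos and sin are precomputed per position and broadcast over batch/head dims
        q_embed = (q * cos) + (rotate_half(q) * sin)
        k_embed = (k * cos) + (rotate_half(k) * sin)
        return q_embed, k_embed

Here dimension i is paired with dimension i + head_dim // 2 for the rotation.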
Meta's official repo implements RoPE as follows (full code link):
def apply_rotary_emb(
    xq: torch.Tensor,
    xk: torch.Tensor,
    freqs_cis: torch.Tensor,
) -> Tuple[torch.Tensor, torch.Tensor]:
    xq_ = torch.view_as_complex(xq.float().reshape(*xq.shape[:-1], -1, 2))
    xk_ = torch.view_as_complex(xk.float().reshape(*xk.shape[:-1], -1, 2))
    freqs_cis = reshape_for_broadcast(freqs_cis, xq_)
    xq_out = torch.view_as_real(xq_ * freqs_cis).flatten(3)
    xk_out = torch.view_as_real(xk_ * freqs_cis).flatten(3)
    return xq_out.type_as(xq), xk_out.type_as(xk)
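Written out in real arithmetic, the complex multiplication above rotates adjacent pairs of dimensions (x[0], x[1]), (x[2], x[3]), and so on. A minimal sketch with names of my own choosing, assuming cos and sin are already broadcastable against the even-indexed half of x:

    import torch

    def apply_rotary_interleaved(x: torch.Tensor, cos: torch.Tensor, sin: torch.Tensor) -> torch.Tensor:
        # GPT-J-style rotation: each adjacent pair (a, b) is treated as the complex
        # number a + ib and multiplied by cos + i*sin.
        a = x[..., 0::2]  # even-indexed dimensions
        b = x[..., 1::2]  # odd-indexed dimensions
        out_even = a * cos - b * sin
        out_odd = a * sin + b * cos
        # Re-interleave the rotated pairs back into the original layout.
        return torch.stack((out_even, out_odd), dim=-1).flatten(-2)

So the two formulations differ only in which coordinates are paired for the rotation: adjacent indices (0, 1), (2, 3), ... here, versus indices (i, i + head_dim // 2) in the rotate_half version.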
I'm confused by this difference. Since transformers.LlamaModel can directly load weights converted from the officially released checkpoint, won't this lead to inconsistent inference results? Is this difference expected?