Use HF papers (#38184)
* Use hf papers
* Hugging Face papers
* doi to hf papers
* style
committed by GitHub
parent 1031ed5166
commit de24fb63ed
@@ -53,7 +53,7 @@ class TestAttention(nn.Module):
 Adapted from transformers.models.mistral.modeling_mistral.MistralAttention:
 The input dimension here is attention_hidden_size = 2 * hidden_size, and head_dim = attention_hidden_size // num_heads.
 The extra factor of 2 comes from the input being the concatenation of original_hidden_states with the output of the previous (mamba) layer
-(see fig. 2 in https://arxiv.org/pdf/2405.16712).
+(see fig. 2 in https://huggingface.co/papers/2405.16712).
 Additionally, replaced
 attn_weights = torch.matmul(query_states, key_states.transpose(2, 3)) / math.sqrt(self.head_dim) with
 attn_weights = torch.matmul(query_states, key_states.transpose(2, 3)) / math.sqrt(self.head_dim/2)
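The dimension arithmetic in the docstring above can be checked with a small sketch. It is not code from the repository: the helper name `attention_dims` and the sizes used are hypothetical, and it only reproduces the formulas the docstring states (attention_hidden_size = 2 * hidden_size, head_dim = attention_hidden_size // num_heads, scores divided by sqrt(head_dim / 2)).

```python
import math


def attention_dims(hidden_size: int, num_heads: int):
    """Hypothetical helper mirroring the formulas quoted in the docstring."""
    # The attention input concatenates two hidden_size-wide tensors,
    # so its width is twice hidden_size.
    attention_hidden_size = 2 * hidden_size
    head_dim = attention_hidden_size // num_heads
    # The docstring divides attention scores by sqrt(head_dim / 2)
    # instead of the usual sqrt(head_dim).
    scale_divisor = math.sqrt(head_dim / 2)
    return attention_hidden_size, head_dim, scale_divisor


# Illustrative sizes only; they are not taken from any model config.
attn_size, head_dim, divisor = attention_dims(hidden_size=64, num_heads=8)
print(attn_size, head_dim, divisor)
```

When hidden_size is divisible by num_heads, head_dim / 2 equals hidden_size / num_heads, so the divisor matches what an attention layer over the un-concatenated width would use.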