Remove head mask in generative models (#35786)
* just squash into one commit
* delete print
Committed by GitHub
Parent: 0173a99e73
Commit: 955e61b0da
@@ -46,8 +46,12 @@ The main differences compared to GPT2.
- Merge the key and value caches into one (this changes the format of `layer_past`/`present`; does it risk creating problems?)
- Use the memory layout `(self.num_heads, 3, self.head_dim)` instead of `(3, self.num_heads, self.head_dim)` for the QKV tensor with MHA. (This avoids an overhead with the merged key and values, but makes the checkpoints incompatible with the original `openai-community/gpt2` model.)

You can read more about the optimizations in the [original pull request](https://github.com/huggingface/transformers/pull/22575).
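
The QKV layout point can be illustrated with plain index arithmetic. The following is a minimal sketch, not the library's code: the helper names (`offset_head_major`, `offset_part_major`, `kv_offsets`, `is_contiguous`) and the sizes are hypothetical, and it assumes a row-major flat buffer. It shows that with `(num_heads, 3, head_dim)` the key and value of one head sit next to each other in memory, whereas with `(3, num_heads, head_dim)` they are separated by the other heads.

```python
# Illustrative only: all helper names and sizes below are assumptions.

def offset_head_major(head, part, dim, num_heads, head_dim):
    """Flat offset for the layout (num_heads, 3, head_dim)."""
    return (head * 3 + part) * head_dim + dim


def offset_part_major(head, part, dim, num_heads, head_dim):
    """Flat offset for the layout (3, num_heads, head_dim)."""
    return (part * num_heads + head) * head_dim + dim


def kv_offsets(offset_fn, head, num_heads, head_dim):
    """Offsets of every key (part 1) and value (part 2) element of one head."""
    return [
        offset_fn(head, part, dim, num_heads, head_dim)
        for part in (1, 2)
        for dim in range(head_dim)
    ]


def is_contiguous(offsets):
    """True when the offsets form one unbroken run in memory."""
    return all(b - a == 1 for a, b in zip(offsets, offsets[1:]))


num_heads, head_dim = 4, 8
head_major = kv_offsets(offset_head_major, 1, num_heads, head_dim)
part_major = kv_offsets(offset_part_major, 1, num_heads, head_dim)

print(is_contiguous(head_major))  # True: K and V of head 1 are adjacent
print(is_contiguous(part_major))  # False: K and V are split by other heads
```

Under these assumptions, a merged key/value slice for one head can be taken as a single contiguous range in the first layout, which is one plausible reading of the overhead the bullet mentions.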

> [!NOTE]
> The `head_mask` argument is ignored when using any attention implementation other than `"eager"`. If you have a `head_mask` and want it to take effect, load the model with `XXXModel.from_pretrained(model_id, attn_implementation="eager")`.
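
To make the effect of a head mask concrete, here is a minimal sketch of an eager-style attention computation in pure Python. It is a simplification and not the library's implementation: the function names `softmax` and `eager_attention` are hypothetical, there is no batching and no causal mask, and `head_mask` is assumed to hold one scalar per head. The point is that an eager computation materializes the attention probabilities, so a per-head factor can be applied to them before they are combined with the values.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def eager_attention(q, k, v, head_mask=None):
    """q, k, v are nested lists shaped [num_heads][seq_len][head_dim].

    head_mask (an assumption of this sketch) is one float per head.
    """
    outputs = []
    for h, (qh, kh, vh) in enumerate(zip(q, k, v)):
        scale = 1.0 / math.sqrt(len(qh[0]))
        # Explicit attention probabilities: one row per query position.
        probs = [
            softmax([scale * sum(a * b for a, b in zip(qi, kj)) for kj in kh])
            for qi in qh
        ]
        if head_mask is not None:
            # Scale this head's probabilities; 0.0 silences the head.
            probs = [[p * head_mask[h] for p in row] for row in probs]
        outputs.append([
            [sum(p * vj[d] for p, vj in zip(row, vh)) for d in range(len(vh[0]))]
            for row in probs
        ])
    return outputs


q = k = v = [[[1.0, 0.0], [0.0, 1.0]], [[0.5, 0.5], [1.0, -1.0]]]
full = eager_attention(q, k, v)
masked = eager_attention(q, k, v, head_mask=[1.0, 0.0])

print(masked[0] == full[0])  # True: head 0 is unchanged
print(masked[1])             # head 1 contributes only zeros
```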
## Combining Starcoder and Flash Attention 2
First, make sure to install the latest version of Flash Attention 2 to include the sliding window attention feature.