Update all references to canonical models (#29001)
* Script & Manual edition
* Update
@@ -38,7 +38,7 @@ The main differences compared to GPT2.
 - Use jit to fuse the attention fp32 casting, masking, softmax, and scaling.
 - Combine the attention and causal masks into a single one, pre-computed for the whole model instead of every layer.
 - Merge the key and value caches into one (this changes the format of layer_past/ present, does it risk creating problems?)
-- Use the memory layout (self.num_heads, 3, self.head_dim) instead of `(3, self.num_heads, self.head_dim)` for the QKV tensor with MHA. (prevents an overhead with the merged key and values, but makes the checkpoints incompatible with the original gpt2 model).
+- Use the memory layout (self.num_heads, 3, self.head_dim) instead of `(3, self.num_heads, self.head_dim)` for the QKV tensor with MHA. (prevents an overhead with the merged key and values, but makes the checkpoints incompatible with the original openai-community/gpt2 model).

 You can read more about the optimizations in the [original pull request](https://github.com/huggingface/transformers/pull/22575)
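The memory-layout bullet touched by this hunk can be illustrated with a small index-mapping sketch. It is not taken from the Transformers code: the helper names (`offset_heads_first`, `offset_qkv_first`, `qkv_first_to_heads_first`) are hypothetical, and it assumes the fused QKV features can be treated as a flat, row-major list. Under that assumption, it shows that the `(num_heads, 3, head_dim)` layout keeps each head's key and value next to each other, and that converting from the `(3, num_heads, head_dim)` layout requires reordering, which is one plausible reading of why the checkpoints are incompatible.

```python
def offset_heads_first(h, j, d, num_heads, head_dim):
    """Flat offset in the (num_heads, 3, head_dim) layout; j is 0/1/2 for Q/K/V."""
    return (h * 3 + j) * head_dim + d


def offset_qkv_first(h, j, d, num_heads, head_dim):
    """Flat offset in the (3, num_heads, head_dim) layout; j is 0/1/2 for Q/K/V."""
    return (j * num_heads + h) * head_dim + d


def qkv_first_to_heads_first(flat, num_heads, head_dim):
    """Reorder a flat list from (3, num_heads, head_dim) to (num_heads, 3, head_dim)."""
    out = [None] * len(flat)
    for h in range(num_heads):
        for j in range(3):
            for d in range(head_dim):
                src = offset_qkv_first(h, j, d, num_heads, head_dim)
                dst = offset_heads_first(h, j, d, num_heads, head_dim)
                out[dst] = flat[src]
    return out


# Toy sizes (assumptions for illustration only).
num_heads, head_dim = 2, 2

# Label every element as <Q|K|V><head><dim>, stored in (3, num_heads, head_dim) order.
qkv_first = [
    f"{name}{h}{d}"
    for name in "QKV"
    for h in range(num_heads)
    for d in range(head_dim)
]

heads_first = qkv_first_to_heads_first(qkv_first, num_heads, head_dim)
print(heads_first)
# ['Q00', 'Q01', 'K00', 'K01', 'V00', 'V01', 'Q10', 'Q11', 'K10', 'K11', 'V10', 'V11']
```

In the reordered list, the key and value of head 0 occupy the contiguous slice `heads_first[2:6]`, so a merged key/value block for one head can be taken as a single slice rather than gathered from two distant regions.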