Add Mistral GPT-2 Stability Tweaks (#13573)

* Add layer-wise scaling * Add reorder & upcasting argument * Add OpenAI GPT-2 weight initialization scheme * start `layer_idx` count at zero for consistency * disentangle attn and reordered and upscaled attn function * rename `scale_attn_by_layer` to `scale_attn_by_layer_id` * make autocast from amp compatible with pytorch<1.6 * fix docstring * style fixes * Add fixes from PR feedback, style tweaks * Fix doc whitespace * Reformat * First pass scale_attn_by_layer_idx and reorder_and_upcast_attn tests * Rename scale_attn_by_layer_idx, add tip * Remove extra newline * add test for weight initialization * update code format * add assert check weights are fp32 * remove assert * Fix incorrect merge * Fix shape mismatch in baddbmm * Add generation test for Mistral flags Co-authored-by: leandro <leandro.vonwerra@spoud.io> Co-authored-by: Keshav Santhanam <keshav2@stanford.edu> Co-authored-by: J38 <jebolton@stanford.edu>
2021-10-04 04:37:09 -07:00
parent 955fd4fea9
commit 3a8de58c51
4 changed files with 198 additions and 42 deletions
--- a/docs/source/model_doc/gpt2.rst
+++ b/docs/source/model_doc/gpt2.rst
@@ -41,6 +41,8 @@ Tips:
  pre-computed values in the context of text generation. For PyTorch, see `past_key_values` argument of the
  :meth:`~transformers.GPT2Model.forward` method, or for TF the `past` argument of the
  :meth:`~transformers.TFGPT2Model.call` method for more information on its usage.
+- Enabling the `scale_attn_by_inverse_layer_idx` and `reorder_and_upcast_attn` flags will apply the training stability
+  improvements from `Mistral <https://github.com/stanford-crfm/mistral/>`__ (for PyTorch only).

 `Write With Transformer <https://transformer.huggingface.co/doc/gpt2-large>`__ is a webapp created and hosted by
 Hugging Face showcasing the generative capabilities of several models. GPT-2 is one of them and is available in five