[XLNet] Fix mems behavior (#8567)

* fix mems in xlnet
* fix use_mems
* fix use_mem_len
* fix use mems
* clean docs
* fix tf typo
* make xlnet tf for generation work
* fix tf test
* refactor use cache
* add use cache for missing models
* correct use_cache in generate
* correct use cache in tf generate
* fix tf
* correct getattr typo
* make sylvain happy
* change in docs as well
* do not apply to cookie cutter statements
* fix tf test
* make pytorch model fully backward compatible
2020-11-25 22:54:59 +01:00
parent 369f1d77b4
commit 2a6fbe6a40
47 changed files with 259 additions and 134 deletions
--- a/docs/source/model_doc/deberta.rst
+++ b/docs/source/model_doc/deberta.rst
@@ -20,8 +20,8 @@ disentangled attention mechanism, where each word is represented using two vecto
 position, respectively, and the attention weights among words are computed using disentangled matrices on their
 contents and relative positions. Second, an enhanced mask decoder is used to replace the output softmax layer to
 predict the masked tokens for model pretraining. We show that these two techniques significantly improve the efficiency
-of model pre-training and performance of downstream tasks. Compared to RoBERTa-Large, a DeBERTa model trained on half
-of the training data performs consistently better on a wide range of NLP tasks, achieving improvements on MNLI by +0.9%
+of model pretraining and performance of downstream tasks. Compared to RoBERTa-Large, a DeBERTa model trained on half of
+the training data performs consistently better on a wide range of NLP tasks, achieving improvements on MNLI by +0.9%
 (90.2% vs. 91.1%), on SQuAD v2.0 by +2.3% (88.4% vs. 90.7%) and RACE by +3.6% (83.2% vs. 86.8%). The DeBERTa code and
 pre-trained models will be made publicly available at https://github.com/microsoft/DeBERTa.*