[XLNet] Fix mems behavior (#8567)
* fix mems in xlnet
* fix use_mems
* fix use_mem_len
* fix use mems
* clean docs
* fix tf typo
* make xlnet tf for generation work
* fix tf test
* refactor use cache
* add use cache for missing models
* correct use_cache in generate
* correct use cache in tf generate
* fix tf
* correct getattr typo
* make sylvain happy
* change in docs as well
* do not apply to cookie cutter statements
* fix tf test
* make pytorch model fully backward compatible
committed by GitHub
parent 369f1d77b4
commit 2a6fbe6a40
@@ -20,8 +20,8 @@ disentangled attention mechanism, where each word is represented using two vecto
 position, respectively, and the attention weights among words are computed using disentangled matrices on their
 contents and relative positions. Second, an enhanced mask decoder is used to replace the output softmax layer to
 predict the masked tokens for model pretraining. We show that these two techniques significantly improve the efficiency
-of model pre-training and performance of downstream tasks. Compared to RoBERTa-Large, a DeBERTa model trained on half
-of the training data performs consistently better on a wide range of NLP tasks, achieving improvements on MNLI by +0.9%
+of model pretraining and performance of downstream tasks. Compared to RoBERTa-Large, a DeBERTa model trained on half of
+the training data performs consistently better on a wide range of NLP tasks, achieving improvements on MNLI by +0.9%
 (90.2% vs. 91.1%), on SQuAD v2.0 by +2.3% (88.4% vs. 90.7%) and RACE by +3.6% (83.2% vs. 86.8%). The DeBERTa code and
 pre-trained models will be made publicly available at https://github.com/microsoft/DeBERTa.*
 