[XLNet] Fix mems behavior (#8567)

* fix mems in xlnet * fix use_mems * fix use_mem_len * fix use mems * clean docs * fix tf typo * make xlnet tf for generation work * fix tf test * refactor use cache * add use cache for missing models * correct use_cache in generate * correct use cache in tf generate * fix tf * correct getattr typo * make sylvain happy * change in docs as well * do not apply to cookie cutter statements * fix tf test * make pytorch model fully backward compatible
2020-11-25 22:54:59 +01:00
parent 369f1d77b4
commit 2a6fbe6a40
47 changed files with 259 additions and 134 deletions
--- a/docs/source/model_doc/electra.rst
+++ b/docs/source/model_doc/electra.rst
@@ -12,14 +12,14 @@ identify which tokens were replaced by the generator in the sequence.

 The abstract from the paper is the following:

-*Masked language modeling (MLM) pre-training methods such as BERT corrupt the input by replacing some tokens with
-[MASK] and then train a model to reconstruct the original tokens. While they produce good results when transferred to
+*Masked language modeling (MLM) pretraining methods such as BERT corrupt the input by replacing some tokens with [MASK]
+and then train a model to reconstruct the original tokens. While they produce good results when transferred to
 downstream NLP tasks, they generally require large amounts of compute to be effective. As an alternative, we propose a
-more sample-efficient pre-training task called replaced token detection. Instead of masking the input, our approach
+more sample-efficient pretraining task called replaced token detection. Instead of masking the input, our approach
 corrupts it by replacing some tokens with plausible alternatives sampled from a small generator network. Then, instead
 of training a model that predicts the original identities of the corrupted tokens, we train a discriminative model that
 predicts whether each token in the corrupted input was replaced by a generator sample or not. Thorough experiments
-demonstrate this new pre-training task is more efficient than MLM because the task is defined over all input tokens
+demonstrate this new pretraining task is more efficient than MLM because the task is defined over all input tokens
 rather than just the small subset that was masked out. As a result, the contextual representations learned by our
 approach substantially outperform the ones learned by BERT given the same model size, data, and compute. The gains are
 particularly strong for small models; for example, we train a model on one GPU for 4 days that outperforms GPT (trained