[XLNet] Fix mems behavior (#8567)
* fix mems in xlnet * fix use_mems * fix use_mem_len * fix use mems * clean docs * fix tf typo * make xlnet tf for generation work * fix tf test * refactor use cache * add use cache for missing models * correct use_cache in generate * correct use cache in tf generate * fix tf * correct getattr typo * make sylvain happy * change in docs as well * do not apply to cookie cutter statements * fix tf test * make pytorch model fully backward compatible
This commit is contained in:
committed by
GitHub
parent
369f1d77b4
commit
2a6fbe6a40
@@ -18,9 +18,9 @@ operating these large models in on-the-edge and/or under constrained computation
|
||||
remains challenging. In this work, we propose a method to pre-train a smaller general-purpose language representation
|
||||
model, called DistilBERT, which can then be fine-tuned with good performances on a wide range of tasks like its larger
|
||||
counterparts. While most prior work investigated the use of distillation for building task-specific models, we leverage
|
||||
knowledge distillation during the pre-training phase and show that it is possible to reduce the size of a BERT model by
|
||||
knowledge distillation during the pretraining phase and show that it is possible to reduce the size of a BERT model by
|
||||
40%, while retaining 97% of its language understanding capabilities and being 60% faster. To leverage the inductive
|
||||
biases learned by larger models during pre-training, we introduce a triple loss combining language modeling,
|
||||
biases learned by larger models during pretraining, we introduce a triple loss combining language modeling,
|
||||
distillation and cosine-distance losses. Our smaller, faster and lighter model is cheaper to pre-train and we
|
||||
demonstrate its capabilities for on-device computations in a proof-of-concept experiment and a comparative on-device
|
||||
study.*
|
||||
|
||||
Reference in New Issue
Block a user