[XLNet] Fix mems behavior (#8567)
* fix mems in xlnet
* fix use_mems
* fix use_mem_len
* fix use mems
* clean docs
* fix tf typo
* make xlnet tf for generation work
* fix tf test
* refactor use cache
* add use cache for missing models
* correct use_cache in generate
* correct use cache in tf generate
* fix tf
* correct getattr typo
* make sylvain happy
* change in docs as well
* do not apply to cookie cutter statements
* fix tf test
* make pytorch model fully backward compatible
committed by GitHub
parent 369f1d77b4
commit 2a6fbe6a40
@@ -97,6 +97,6 @@ You should check out our [swift-coreml-transformers](https://github.com/huggingf
 It contains a set of tools to convert PyTorch or TensorFlow 2.0 trained Transformer models (currently contains `GPT-2`,
 `DistilGPT-2`, `BERT`, and `DistilBERT`) to CoreML models that run on iOS devices.
 
-At some point in the future, you'll be able to seamlessly move from pre-training or fine-tuning models in PyTorch or
+At some point in the future, you'll be able to seamlessly move from pretraining or fine-tuning models in PyTorch or
 TensorFlow 2.0 to productizing them in CoreML, or prototype a model or an app in CoreML then research its
 hyperparameters or architecture from PyTorch or TensorFlow 2.0. Super exciting!
@@ -10,7 +10,7 @@ Tasks <https://arxiv.org/abs/1907.12461>`__ by Sascha Rothe, Shashi Narayan, Ali
 
 The abstract from the paper is the following:
 
-*Unsupervised pre-training of large neural models has recently revolutionized Natural Language Processing. By
+*Unsupervised pretraining of large neural models has recently revolutionized Natural Language Processing. By
 warm-starting from the publicly released checkpoints, NLP practitioners have pushed the state-of-the-art on multiple
 benchmarks while saving significant amounts of compute time. So far the focus has been mainly on the Natural Language
 Understanding tasks. In this paper, we demonstrate the efficacy of pre-trained checkpoints for Sequence Generation. We
@@ -20,8 +20,8 @@ disentangled attention mechanism, where each word is represented using two vecto
 position, respectively, and the attention weights among words are computed using disentangled matrices on their
 contents and relative positions. Second, an enhanced mask decoder is used to replace the output softmax layer to
 predict the masked tokens for model pretraining. We show that these two techniques significantly improve the efficiency
-of model pre-training and performance of downstream tasks. Compared to RoBERTa-Large, a DeBERTa model trained on half
-of the training data performs consistently better on a wide range of NLP tasks, achieving improvements on MNLI by +0.9%
+of model pretraining and performance of downstream tasks. Compared to RoBERTa-Large, a DeBERTa model trained on half of
+the training data performs consistently better on a wide range of NLP tasks, achieving improvements on MNLI by +0.9%
 (90.2% vs. 91.1%), on SQuAD v2.0 by +2.3% (88.4% vs. 90.7%) and RACE by +3.6% (83.2% vs. 86.8%). The DeBERTa code and
 pre-trained models will be made publicly available at https://github.com/microsoft/DeBERTa.*
 
@@ -18,9 +18,9 @@ operating these large models in on-the-edge and/or under constrained computation
 remains challenging. In this work, we propose a method to pre-train a smaller general-purpose language representation
 model, called DistilBERT, which can then be fine-tuned with good performances on a wide range of tasks like its larger
 counterparts. While most prior work investigated the use of distillation for building task-specific models, we leverage
-knowledge distillation during the pre-training phase and show that it is possible to reduce the size of a BERT model by
+knowledge distillation during the pretraining phase and show that it is possible to reduce the size of a BERT model by
 40%, while retaining 97% of its language understanding capabilities and being 60% faster. To leverage the inductive
-biases learned by larger models during pre-training, we introduce a triple loss combining language modeling,
+biases learned by larger models during pretraining, we introduce a triple loss combining language modeling,
 distillation and cosine-distance losses. Our smaller, faster and lighter model is cheaper to pre-train and we
 demonstrate its capabilities for on-device computations in a proof-of-concept experiment and a comparative on-device
 study.*
@@ -12,14 +12,14 @@ identify which tokens were replaced by the generator in the sequence.
 
 The abstract from the paper is the following:
 
-*Masked language modeling (MLM) pre-training methods such as BERT corrupt the input by replacing some tokens with
-[MASK] and then train a model to reconstruct the original tokens. While they produce good results when transferred to
+*Masked language modeling (MLM) pretraining methods such as BERT corrupt the input by replacing some tokens with [MASK]
+and then train a model to reconstruct the original tokens. While they produce good results when transferred to
 downstream NLP tasks, they generally require large amounts of compute to be effective. As an alternative, we propose a
-more sample-efficient pre-training task called replaced token detection. Instead of masking the input, our approach
+more sample-efficient pretraining task called replaced token detection. Instead of masking the input, our approach
 corrupts it by replacing some tokens with plausible alternatives sampled from a small generator network. Then, instead
 of training a model that predicts the original identities of the corrupted tokens, we train a discriminative model that
 predicts whether each token in the corrupted input was replaced by a generator sample or not. Thorough experiments
-demonstrate this new pre-training task is more efficient than MLM because the task is defined over all input tokens
+demonstrate this new pretraining task is more efficient than MLM because the task is defined over all input tokens
 rather than just the small subset that was masked out. As a result, the contextual representations learned by our
 approach substantially outperform the ones learned by BERT given the same model size, data, and compute. The gains are
 particularly strong for small models; for example, we train a model on one GPU for 4 days that outperforms GPT (trained
@@ -19,7 +19,7 @@ representations (Dai and Le, 2015; Peters et al., 2018; Howard and Ruder, 2018;
 heterogeneous French corpus. Models of different sizes are trained using the new CNRS (French National Centre for
 Scientific Research) Jean Zay supercomputer. We apply our French language models to diverse NLP tasks (text
 classification, paraphrasing, natural language inference, parsing, word sense disambiguation) and show that most of the
-time they outperform other pre-training approaches. Different versions of FlauBERT as well as a unified evaluation
+time they outperform other pretraining approaches. Different versions of FlauBERT as well as a unified evaluation
 protocol for the downstream tasks, called FLUE (French Language Understanding Evaluation), are shared to the research
 community for further reproducible experiments in French NLP.*
 
@@ -14,7 +14,7 @@ The abstract from the paper is the following:
 *Natural language understanding comprises a wide range of diverse tasks such as textual entailment, question answering,
 semantic similarity assessment, and document classification. Although large unlabeled text corpora are abundant,
 labeled data for learning these specific tasks is scarce, making it challenging for discriminatively trained models to
-perform adequately. We demonstrate that large gains on these tasks can be realized by generative pre-training of a
+perform adequately. We demonstrate that large gains on these tasks can be realized by generative pretraining of a
 language model on a diverse corpus of unlabeled text, followed by discriminative fine-tuning on each specific task. In
 contrast to previous approaches, we make use of task-aware input transformations during fine-tuning to achieve
 effective transfer while requiring minimal changes to the model architecture. We demonstrate the effectiveness of our
@@ -6,19 +6,19 @@ Overview
 
 The LayoutLM model was proposed in the paper `LayoutLM: Pre-training of Text and Layout for Document Image
 Understanding <https://arxiv.org/abs/1912.13318>`__ by Yiheng Xu, Minghao Li, Lei Cui, Shaohan Huang, Furu Wei, and
-Ming Zhou. It's a simple but effective pre-training method of text and layout for document image understanding and
+Ming Zhou. It's a simple but effective pretraining method of text and layout for document image understanding and
 information extraction tasks, such as form understanding and receipt understanding.
 
 The abstract from the paper is the following:
 
 *Pre-training techniques have been verified successfully in a variety of NLP tasks in recent years. Despite the
-widespread use of pre-training models for NLP applications, they almost exclusively focus on text-level manipulation,
+widespread use of pretraining models for NLP applications, they almost exclusively focus on text-level manipulation,
 while neglecting layout and style information that is vital for document image understanding. In this paper, we propose
 the \textbf{LayoutLM} to jointly model interactions between text and layout information across scanned document images,
 which is beneficial for a great number of real-world document image understanding tasks such as information extraction
 from scanned documents. Furthermore, we also leverage image features to incorporate words' visual information into
 LayoutLM. To the best of our knowledge, this is the first time that text and layout are jointly learned in a single
-framework for document-level pre-training. It achieves new state-of-the-art results in several downstream tasks,
+framework for document-level pretraining. It achieves new state-of-the-art results in several downstream tasks,
 including form understanding (from 70.72 to 79.27), receipt understanding (from 94.02 to 95.24) and document image
 classification (from 93.07 to 94.42).*
 
@@ -19,7 +19,7 @@ Encoder Representations from Transformers) framework to learn these vision-and-l
 build a large-scale Transformer model that consists of three encoders: an object relationship encoder, a language
 encoder, and a cross-modality encoder. Next, to endow our model with the capability of connecting vision and language
 semantics, we pre-train the model with large amounts of image-and-sentence pairs, via five diverse representative
-pre-training tasks: masked language modeling, masked object prediction (feature regression and label classification),
+pretraining tasks: masked language modeling, masked object prediction (feature regression and label classification),
 cross-modality matching, and image question answering. These tasks help in learning both intra-modality and
 cross-modality relationships. After fine-tuning from our pretrained parameters, our model achieves the state-of-the-art
 results on two visual question answering datasets (i.e., VQA and GQA). We also show the generalizability of our
@@ -13,7 +13,7 @@ The MBart model was presented in `Multilingual Denoising Pre-training for Neural
 Ghazvininejad, Mike Lewis, Luke Zettlemoyer.
 
 According to the abstract, MBART is a sequence-to-sequence denoising auto-encoder pretrained on large-scale monolingual
-corpora in many languages using the BART objective. mBART is one of the first methods for pre-training a complete
+corpora in many languages using the BART objective. mBART is one of the first methods for pretraining a complete
 sequence-to-sequence model by denoising full texts in multiple languages, while previous approaches have focused only
 on the encoder, decoder, or reconstructing parts of the text.
 
@@ -17,7 +17,7 @@ the next token.
 
 The abstract from the paper is the following:
 
-*In this paper, we present a new sequence-to-sequence pre-training model called ProphetNet, which introduces a novel
+*In this paper, we present a new sequence-to-sequence pretraining model called ProphetNet, which introduces a novel
 self-supervised objective named future n-gram prediction and the proposed n-stream self-attention mechanism. Instead of
 the optimization of one-step ahead prediction in traditional sequence-to-sequence model, the ProphetNet is optimized by
 n-step ahead prediction which predicts the next n tokens simultaneously based on previous context tokens at each time
@@ -25,7 +25,7 @@ step. The future n-gram prediction explicitly encourages the model to plan for t
 overfitting on strong local correlations. We pre-train ProphetNet using a base scale dataset (16GB) and a large scale
 dataset (160GB) respectively. Then we conduct experiments on CNN/DailyMail, Gigaword, and SQuAD 1.1 benchmarks for
 abstractive summarization and question generation tasks. Experimental results show that ProphetNet achieves new
-state-of-the-art results on all these datasets compared to the models using the same scale pre-training corpus.*
+state-of-the-art results on all these datasets compared to the models using the same scale pretraining corpus.*
 
 The Authors' code can be found `here <https://github.com/microsoft/ProphetNet>`__.
 
@@ -17,7 +17,7 @@ The abstract from the paper is the following:
 task, has emerged as a powerful technique in natural language processing (NLP). The effectiveness of transfer learning
 has given rise to a diversity of approaches, methodology, and practice. In this paper, we explore the landscape of
 transfer learning techniques for NLP by introducing a unified framework that converts every language problem into a
-text-to-text format. Our systematic study compares pre-training objectives, architectures, unlabeled datasets, transfer
+text-to-text format. Our systematic study compares pretraining objectives, architectures, unlabeled datasets, transfer
 approaches, and other factors on dozens of language understanding tasks. By combining the insights from our exploration
 with scale and our new "Colossal Clean Crawled Corpus", we achieve state-of-the-art results on many benchmarks covering
 summarization, question answering, text classification, and more. To facilitate future work on transfer learning for
@@ -19,7 +19,7 @@ just the next token. Its architecture is identical to ProhpetNet, but the model
 
 The abstract from the paper is the following:
 
-*In this paper, we present a new sequence-to-sequence pre-training model called ProphetNet, which introduces a novel
+*In this paper, we present a new sequence-to-sequence pretraining model called ProphetNet, which introduces a novel
 self-supervised objective named future n-gram prediction and the proposed n-stream self-attention mechanism. Instead of
 the optimization of one-step ahead prediction in traditional sequence-to-sequence model, the ProphetNet is optimized by
 n-step ahead prediction which predicts the next n tokens simultaneously based on previous context tokens at each time
@@ -27,7 +27,7 @@ step. The future n-gram prediction explicitly encourages the model to plan for t
 overfitting on strong local correlations. We pre-train ProphetNet using a base scale dataset (16GB) and a large scale
 dataset (160GB) respectively. Then we conduct experiments on CNN/DailyMail, Gigaword, and SQuAD 1.1 benchmarks for
 abstractive summarization and question generation tasks. Experimental results show that ProphetNet achieves new
-state-of-the-art results on all these datasets compared to the models using the same scale pre-training corpus.*
+state-of-the-art results on all these datasets compared to the models using the same scale pretraining corpus.*
 
 The Authors' code can be found `here <https://github.com/microsoft/ProphetNet>`__.
 
@@ -527,7 +527,7 @@ Pegasus
 <https://arxiv.org/pdf/1912.08777.pdf>`_, Jingqing Zhang, Yao Zhao, Mohammad Saleh and Peter J. Liu on Dec 18, 2019.
 
 Sequence-to-sequence model with the same encoder-decoder model architecture as BART. Pegasus is pre-trained jointly on
-two self-supervised objective functions: Masked Language Modeling (MLM) and a novel summarization specific pre-training
+two self-supervised objective functions: Masked Language Modeling (MLM) and a novel summarization specific pretraining
 objective, called Gap Sentence Generation (GSG).
 
 * MLM: encoder input tokens are randomly replaced by a mask tokens and have to be predicted by the encoder (like in
@@ -609,7 +609,7 @@ MT5
 `mT5: A massively multilingual pre-trained text-to-text transformer <https://arxiv.org/abs/2010.11934>`_, Linting Xue
 et al.
 
-The model architecture is same as T5. mT5's pre-training objective includes T5's self-supervised training, but not T5's
+The model architecture is same as T5. mT5's pretraining objective includes T5's self-supervised training, but not T5's
 supervised training. mT5 is trained on 101 languages.
 
 The library provides a version of this model for conditional generation.
@@ -630,8 +630,8 @@ MBart
 `Multilingual Denoising Pre-training for Neural Machine Translation <https://arxiv.org/abs/2001.08210>`_ by Yinhan Liu,
 Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov Marjan Ghazvininejad, Mike Lewis, Luke Zettlemoyer.
 
-The model architecture and pre-training objective is same as BART, but MBart is trained on 25 languages and is intended
-for supervised and unsupervised machine translation. MBart is one of the first methods for pre-training a complete
+The model architecture and pretraining objective is same as BART, but MBart is trained on 25 languages and is intended
+for supervised and unsupervised machine translation. MBart is one of the first methods for pretraining a complete
 sequence-to-sequence model by denoising full texts in multiple languages,
 
 The library provides a version of this model for conditional generation.
@@ -658,7 +658,7 @@ ProphetNet
 `ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training, <https://arxiv.org/abs/2001.04063>`__ by
 Yu Yan, Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang, Ming Zhou.
 
-ProphetNet introduces a novel *sequence-to-sequence* pre-training objective, called *future n-gram prediction*. In
+ProphetNet introduces a novel *sequence-to-sequence* pretraining objective, called *future n-gram prediction*. In
 future n-gram prediction, the model predicts the next n tokens simultaneously based on previous context tokens at each
 time step instead instead of just the single next token. The future n-gram prediction explicitly encourages the model
 to plan for the future tokens and prevent overfitting on strong local correlations. The model architecture is based on
@@ -683,8 +683,8 @@ XLM-ProphetNet
 `ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training, <https://arxiv.org/abs/2001.04063>`__ by
 Yu Yan, Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang, Ming Zhou.
 
-XLM-ProphetNet's model architecture and pre-training objective is same as ProphetNet, but XLM-ProphetNet was
-pre-trained on the cross-lingual dataset `XGLUE <https://arxiv.org/abs/2004.01401>`__.
+XLM-ProphetNet's model architecture and pretraining objective is same as ProphetNet, but XLM-ProphetNet was pre-trained
+on the cross-lingual dataset `XGLUE <https://arxiv.org/abs/2004.01401>`__.
 
 The library provides a pre-trained version of this model for multi-lingual conditional generation and fine-tuned
 versions for headline generation and question generation, respectively.
@@ -305,7 +305,7 @@ Language modeling is the task of fitting a model to a corpus, which can be domai
 transformer-based models are trained using a variant of language modeling, e.g. BERT with masked language modeling,
 GPT-2 with causal language modeling.
 
-Language modeling can be useful outside of pre-training as well, for example to shift the model distribution to be
+Language modeling can be useful outside of pretraining as well, for example to shift the model distribution to be
 domain-specific: using a language model trained over a very large corpus, and then fine-tuning it to a news dataset or
 on scientific papers e.g. `LysandreJik/arxiv-nlp <https://huggingface.co/lysandre/arxiv-nlp>`__.
 