[XLNet] Fix mems behavior (#8567)

* fix mems in xlnet

* fix use_mems

* fix use_mem_len

* fix use mems

* clean docs

* fix tf typo

* make xlnet tf for generation work

* fix tf test

* refactor use cache

* add use cache for missing models

* correct use_cache in generate

* correct use cache in tf generate

* fix tf

* correct getattr typo

* make sylvain happy

* change in docs as well

* do not apply to cookie cutter statements

* fix tf test

* make pytorch model fully backward compatible

Author: Patrick von Platen
Date: 2020-11-25 22:54:59 +01:00
Committed by: GitHub
Parent: 369f1d77b4
Commit: 2a6fbe6a40
47 changed files with 259 additions and 134 deletions

View File

@@ -97,6 +97,6 @@ You should check out our [swift-coreml-transformers](https://github.com/huggingf
 It contains a set of tools to convert PyTorch or TensorFlow 2.0 trained Transformer models (currently contains `GPT-2`,
 `DistilGPT-2`, `BERT`, and `DistilBERT`) to CoreML models that run on iOS devices.
-At some point in the future, you'll be able to seamlessly move from pre-training or fine-tuning models in PyTorch or
+At some point in the future, you'll be able to seamlessly move from pretraining or fine-tuning models in PyTorch or
 TensorFlow 2.0 to productizing them in CoreML, or prototype a model or an app in CoreML then research its
 hyperparameters or architecture from PyTorch or TensorFlow 2.0. Super exciting!

View File

@@ -10,7 +10,7 @@ Tasks <https://arxiv.org/abs/1907.12461>`__ by Sascha Rothe, Shashi Narayan, Ali
 The abstract from the paper is the following:
-*Unsupervised pre-training of large neural models has recently revolutionized Natural Language Processing. By
+*Unsupervised pretraining of large neural models has recently revolutionized Natural Language Processing. By
 warm-starting from the publicly released checkpoints, NLP practitioners have pushed the state-of-the-art on multiple
 benchmarks while saving significant amounts of compute time. So far the focus has been mainly on the Natural Language
 Understanding tasks. In this paper, we demonstrate the efficacy of pre-trained checkpoints for Sequence Generation. We

View File

@@ -20,8 +20,8 @@ disentangled attention mechanism, where each word is represented using two vecto
 position, respectively, and the attention weights among words are computed using disentangled matrices on their
 contents and relative positions. Second, an enhanced mask decoder is used to replace the output softmax layer to
 predict the masked tokens for model pretraining. We show that these two techniques significantly improve the efficiency
-of model pre-training and performance of downstream tasks. Compared to RoBERTa-Large, a DeBERTa model trained on half
-of the training data performs consistently better on a wide range of NLP tasks, achieving improvements on MNLI by +0.9%
+of model pretraining and performance of downstream tasks. Compared to RoBERTa-Large, a DeBERTa model trained on half of
+the training data performs consistently better on a wide range of NLP tasks, achieving improvements on MNLI by +0.9%
 (90.2% vs. 91.1%), on SQuAD v2.0 by +2.3% (88.4% vs. 90.7%) and RACE by +3.6% (83.2% vs. 86.8%). The DeBERTa code and
 pre-trained models will be made publicly available at https://github.com/microsoft/DeBERTa.*

View File

@@ -18,9 +18,9 @@ operating these large models in on-the-edge and/or under constrained computation
 remains challenging. In this work, we propose a method to pre-train a smaller general-purpose language representation
 model, called DistilBERT, which can then be fine-tuned with good performances on a wide range of tasks like its larger
 counterparts. While most prior work investigated the use of distillation for building task-specific models, we leverage
-knowledge distillation during the pre-training phase and show that it is possible to reduce the size of a BERT model by
+knowledge distillation during the pretraining phase and show that it is possible to reduce the size of a BERT model by
 40%, while retaining 97% of its language understanding capabilities and being 60% faster. To leverage the inductive
-biases learned by larger models during pre-training, we introduce a triple loss combining language modeling,
+biases learned by larger models during pretraining, we introduce a triple loss combining language modeling,
 distillation and cosine-distance losses. Our smaller, faster and lighter model is cheaper to pre-train and we
 demonstrate its capabilities for on-device computations in a proof-of-concept experiment and a comparative on-device
 study.*

View File

@@ -12,14 +12,14 @@ identify which tokens were replaced by the generator in the sequence.
 The abstract from the paper is the following:
-*Masked language modeling (MLM) pre-training methods such as BERT corrupt the input by replacing some tokens with
-[MASK] and then train a model to reconstruct the original tokens. While they produce good results when transferred to
+*Masked language modeling (MLM) pretraining methods such as BERT corrupt the input by replacing some tokens with [MASK]
+and then train a model to reconstruct the original tokens. While they produce good results when transferred to
 downstream NLP tasks, they generally require large amounts of compute to be effective. As an alternative, we propose a
-more sample-efficient pre-training task called replaced token detection. Instead of masking the input, our approach
+more sample-efficient pretraining task called replaced token detection. Instead of masking the input, our approach
 corrupts it by replacing some tokens with plausible alternatives sampled from a small generator network. Then, instead
 of training a model that predicts the original identities of the corrupted tokens, we train a discriminative model that
 predicts whether each token in the corrupted input was replaced by a generator sample or not. Thorough experiments
-demonstrate this new pre-training task is more efficient than MLM because the task is defined over all input tokens
+demonstrate this new pretraining task is more efficient than MLM because the task is defined over all input tokens
 rather than just the small subset that was masked out. As a result, the contextual representations learned by our
 approach substantially outperform the ones learned by BERT given the same model size, data, and compute. The gains are
 particularly strong for small models; for example, we train a model on one GPU for 4 days that outperforms GPT (trained

View File

@@ -19,7 +19,7 @@ representations (Dai and Le, 2015; Peters et al., 2018; Howard and Ruder, 2018;
 heterogeneous French corpus. Models of different sizes are trained using the new CNRS (French National Centre for
 Scientific Research) Jean Zay supercomputer. We apply our French language models to diverse NLP tasks (text
 classification, paraphrasing, natural language inference, parsing, word sense disambiguation) and show that most of the
-time they outperform other pre-training approaches. Different versions of FlauBERT as well as a unified evaluation
+time they outperform other pretraining approaches. Different versions of FlauBERT as well as a unified evaluation
 protocol for the downstream tasks, called FLUE (French Language Understanding Evaluation), are shared to the research
 community for further reproducible experiments in French NLP.*

View File

@@ -14,7 +14,7 @@ The abstract from the paper is the following:
 *Natural language understanding comprises a wide range of diverse tasks such as textual entailment, question answering,
 semantic similarity assessment, and document classification. Although large unlabeled text corpora are abundant,
 labeled data for learning these specific tasks is scarce, making it challenging for discriminatively trained models to
-perform adequately. We demonstrate that large gains on these tasks can be realized by generative pre-training of a
+perform adequately. We demonstrate that large gains on these tasks can be realized by generative pretraining of a
 language model on a diverse corpus of unlabeled text, followed by discriminative fine-tuning on each specific task. In
 contrast to previous approaches, we make use of task-aware input transformations during fine-tuning to achieve
 effective transfer while requiring minimal changes to the model architecture. We demonstrate the effectiveness of our

View File

@@ -6,19 +6,19 @@ Overview
 The LayoutLM model was proposed in the paper `LayoutLM: Pre-training of Text and Layout for Document Image
 Understanding <https://arxiv.org/abs/1912.13318>`__ by Yiheng Xu, Minghao Li, Lei Cui, Shaohan Huang, Furu Wei, and
-Ming Zhou. It's a simple but effective pre-training method of text and layout for document image understanding and
+Ming Zhou. It's a simple but effective pretraining method of text and layout for document image understanding and
 information extraction tasks, such as form understanding and receipt understanding.
 The abstract from the paper is the following:
 *Pre-training techniques have been verified successfully in a variety of NLP tasks in recent years. Despite the
-widespread use of pre-training models for NLP applications, they almost exclusively focus on text-level manipulation,
+widespread use of pretraining models for NLP applications, they almost exclusively focus on text-level manipulation,
 while neglecting layout and style information that is vital for document image understanding. In this paper, we propose
 the \textbf{LayoutLM} to jointly model interactions between text and layout information across scanned document images,
 which is beneficial for a great number of real-world document image understanding tasks such as information extraction
 from scanned documents. Furthermore, we also leverage image features to incorporate words' visual information into
 LayoutLM. To the best of our knowledge, this is the first time that text and layout are jointly learned in a single
-framework for document-level pre-training. It achieves new state-of-the-art results in several downstream tasks,
+framework for document-level pretraining. It achieves new state-of-the-art results in several downstream tasks,
 including form understanding (from 70.72 to 79.27), receipt understanding (from 94.02 to 95.24) and document image
 classification (from 93.07 to 94.42).*

View File

@@ -19,7 +19,7 @@ Encoder Representations from Transformers) framework to learn these vision-and-l
 build a large-scale Transformer model that consists of three encoders: an object relationship encoder, a language
 encoder, and a cross-modality encoder. Next, to endow our model with the capability of connecting vision and language
 semantics, we pre-train the model with large amounts of image-and-sentence pairs, via five diverse representative
-pre-training tasks: masked language modeling, masked object prediction (feature regression and label classification),
+pretraining tasks: masked language modeling, masked object prediction (feature regression and label classification),
 cross-modality matching, and image question answering. These tasks help in learning both intra-modality and
 cross-modality relationships. After fine-tuning from our pretrained parameters, our model achieves the state-of-the-art
 results on two visual question answering datasets (i.e., VQA and GQA). We also show the generalizability of our

View File

@@ -13,7 +13,7 @@ The MBart model was presented in `Multilingual Denoising Pre-training for Neural
 Ghazvininejad, Mike Lewis, Luke Zettlemoyer.
 According to the abstract, MBART is a sequence-to-sequence denoising auto-encoder pretrained on large-scale monolingual
-corpora in many languages using the BART objective. mBART is one of the first methods for pre-training a complete
+corpora in many languages using the BART objective. mBART is one of the first methods for pretraining a complete
 sequence-to-sequence model by denoising full texts in multiple languages, while previous approaches have focused only
 on the encoder, decoder, or reconstructing parts of the text.

View File

@@ -17,7 +17,7 @@ the next token.
 The abstract from the paper is the following:
-*In this paper, we present a new sequence-to-sequence pre-training model called ProphetNet, which introduces a novel
+*In this paper, we present a new sequence-to-sequence pretraining model called ProphetNet, which introduces a novel
 self-supervised objective named future n-gram prediction and the proposed n-stream self-attention mechanism. Instead of
 the optimization of one-step ahead prediction in traditional sequence-to-sequence model, the ProphetNet is optimized by
 n-step ahead prediction which predicts the next n tokens simultaneously based on previous context tokens at each time
@@ -25,7 +25,7 @@ step. The future n-gram prediction explicitly encourages the model to plan for t
 overfitting on strong local correlations. We pre-train ProphetNet using a base scale dataset (16GB) and a large scale
 dataset (160GB) respectively. Then we conduct experiments on CNN/DailyMail, Gigaword, and SQuAD 1.1 benchmarks for
 abstractive summarization and question generation tasks. Experimental results show that ProphetNet achieves new
-state-of-the-art results on all these datasets compared to the models using the same scale pre-training corpus.*
+state-of-the-art results on all these datasets compared to the models using the same scale pretraining corpus.*
 The Authors' code can be found `here <https://github.com/microsoft/ProphetNet>`__.

View File

@@ -17,7 +17,7 @@ The abstract from the paper is the following:
 task, has emerged as a powerful technique in natural language processing (NLP). The effectiveness of transfer learning
 has given rise to a diversity of approaches, methodology, and practice. In this paper, we explore the landscape of
 transfer learning techniques for NLP by introducing a unified framework that converts every language problem into a
-text-to-text format. Our systematic study compares pre-training objectives, architectures, unlabeled datasets, transfer
+text-to-text format. Our systematic study compares pretraining objectives, architectures, unlabeled datasets, transfer
 approaches, and other factors on dozens of language understanding tasks. By combining the insights from our exploration
 with scale and our new "Colossal Clean Crawled Corpus", we achieve state-of-the-art results on many benchmarks covering
 summarization, question answering, text classification, and more. To facilitate future work on transfer learning for

View File

@@ -19,7 +19,7 @@ just the next token. Its architecture is identical to ProhpetNet, but the model
 The abstract from the paper is the following:
-*In this paper, we present a new sequence-to-sequence pre-training model called ProphetNet, which introduces a novel
+*In this paper, we present a new sequence-to-sequence pretraining model called ProphetNet, which introduces a novel
 self-supervised objective named future n-gram prediction and the proposed n-stream self-attention mechanism. Instead of
 the optimization of one-step ahead prediction in traditional sequence-to-sequence model, the ProphetNet is optimized by
 n-step ahead prediction which predicts the next n tokens simultaneously based on previous context tokens at each time
@@ -27,7 +27,7 @@ step. The future n-gram prediction explicitly encourages the model to plan for t
 overfitting on strong local correlations. We pre-train ProphetNet using a base scale dataset (16GB) and a large scale
 dataset (160GB) respectively. Then we conduct experiments on CNN/DailyMail, Gigaword, and SQuAD 1.1 benchmarks for
 abstractive summarization and question generation tasks. Experimental results show that ProphetNet achieves new
-state-of-the-art results on all these datasets compared to the models using the same scale pre-training corpus.*
+state-of-the-art results on all these datasets compared to the models using the same scale pretraining corpus.*
 The Authors' code can be found `here <https://github.com/microsoft/ProphetNet>`__.

View File

@@ -527,7 +527,7 @@ Pegasus
 <https://arxiv.org/pdf/1912.08777.pdf>`_, Jingqing Zhang, Yao Zhao, Mohammad Saleh and Peter J. Liu on Dec 18, 2019.
 Sequence-to-sequence model with the same encoder-decoder model architecture as BART. Pegasus is pre-trained jointly on
-two self-supervised objective functions: Masked Language Modeling (MLM) and a novel summarization specific pre-training
+two self-supervised objective functions: Masked Language Modeling (MLM) and a novel summarization specific pretraining
 objective, called Gap Sentence Generation (GSG).
 * MLM: encoder input tokens are randomly replaced by a mask tokens and have to be predicted by the encoder (like in
@@ -609,7 +609,7 @@ MT5
 `mT5: A massively multilingual pre-trained text-to-text transformer <https://arxiv.org/abs/2010.11934>`_, Linting Xue
 et al.
-The model architecture is same as T5. mT5's pre-training objective includes T5's self-supervised training, but not T5's
+The model architecture is same as T5. mT5's pretraining objective includes T5's self-supervised training, but not T5's
 supervised training. mT5 is trained on 101 languages.
 The library provides a version of this model for conditional generation.
@@ -630,8 +630,8 @@ MBart
 `Multilingual Denoising Pre-training for Neural Machine Translation <https://arxiv.org/abs/2001.08210>`_ by Yinhan Liu,
 Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov Marjan Ghazvininejad, Mike Lewis, Luke Zettlemoyer.
-The model architecture and pre-training objective is same as BART, but MBart is trained on 25 languages and is intended
-for supervised and unsupervised machine translation. MBart is one of the first methods for pre-training a complete
+The model architecture and pretraining objective is same as BART, but MBart is trained on 25 languages and is intended
+for supervised and unsupervised machine translation. MBart is one of the first methods for pretraining a complete
 sequence-to-sequence model by denoising full texts in multiple languages,
 The library provides a version of this model for conditional generation.
@@ -658,7 +658,7 @@ ProphetNet
 `ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training, <https://arxiv.org/abs/2001.04063>`__ by
 Yu Yan, Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang, Ming Zhou.
-ProphetNet introduces a novel *sequence-to-sequence* pre-training objective, called *future n-gram prediction*. In
+ProphetNet introduces a novel *sequence-to-sequence* pretraining objective, called *future n-gram prediction*. In
 future n-gram prediction, the model predicts the next n tokens simultaneously based on previous context tokens at each
 time step instead instead of just the single next token. The future n-gram prediction explicitly encourages the model
 to plan for the future tokens and prevent overfitting on strong local correlations. The model architecture is based on
@@ -683,8 +683,8 @@ XLM-ProphetNet
 `ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training, <https://arxiv.org/abs/2001.04063>`__ by
 Yu Yan, Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang, Ming Zhou.
-XLM-ProphetNet's model architecture and pre-training objective is same as ProphetNet, but XLM-ProphetNet was
-pre-trained on the cross-lingual dataset `XGLUE <https://arxiv.org/abs/2004.01401>`__.
+XLM-ProphetNet's model architecture and pretraining objective is same as ProphetNet, but XLM-ProphetNet was pre-trained
+on the cross-lingual dataset `XGLUE <https://arxiv.org/abs/2004.01401>`__.
 The library provides a pre-trained version of this model for multi-lingual conditional generation and fine-tuned
 versions for headline generation and question generation, respectively.

View File

@@ -305,7 +305,7 @@ Language modeling is the task of fitting a model to a corpus, which can be domai
 transformer-based models are trained using a variant of language modeling, e.g. BERT with masked language modeling,
 GPT-2 with causal language modeling.
-Language modeling can be useful outside of pre-training as well, for example to shift the model distribution to be
+Language modeling can be useful outside of pretraining as well, for example to shift the model distribution to be
 domain-specific: using a language model trained over a very large corpus, and then fine-tuning it to a news dataset or
 on scientific papers e.g. `LysandreJik/arxiv-nlp <https://huggingface.co/lysandre/arxiv-nlp>`__.

View File

@@ -55,8 +55,6 @@ class PretrainedConfig(object):
 Whether or not the model should return all hidden-states.
 output_attentions (:obj:`bool`, `optional`, defaults to :obj:`False`):
 Whether or not the model should returns all attentions.
-use_cache (:obj:`bool`, `optional`, defaults to :obj:`True`):
-Whether or not the model should return the last key/values attentions (not used by all models).
 return_dict (:obj:`bool`, `optional`, defaults to :obj:`True`):
 Whether or not the model should return a :class:`~transformers.file_utils.ModelOutput` instead of a plain
 tuple.
@@ -168,7 +166,6 @@ class PretrainedConfig(object):
self.return_dict = kwargs.pop("return_dict", True) self.return_dict = kwargs.pop("return_dict", True)
self.output_hidden_states = kwargs.pop("output_hidden_states", False) self.output_hidden_states = kwargs.pop("output_hidden_states", False)
self.output_attentions = kwargs.pop("output_attentions", False) self.output_attentions = kwargs.pop("output_attentions", False)
self.use_cache = kwargs.pop("use_cache", True) # Not used by all models
self.torchscript = kwargs.pop("torchscript", False) # Only used by PyTorch models self.torchscript = kwargs.pop("torchscript", False) # Only used by PyTorch models
self.use_bfloat16 = kwargs.pop("use_bfloat16", False) self.use_bfloat16 = kwargs.pop("use_bfloat16", False)
self.pruned_heads = kwargs.pop("pruned_heads", {}) self.pruned_heads = kwargs.pop("pruned_heads", {})
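
The two hunks above drop `use_cache` from the shared base configuration, and the later hunks add it back per model. A minimal sketch of that pattern follows. The class names `BaseConfigSketch`, `CachingConfigSketch` and `NonCachingConfigSketch` are hypothetical stand-ins, not library classes, and only the attributes visible in this diff are reproduced.

```python
class BaseConfigSketch:
    """Hypothetical stand-in for the base config after this change."""

    def __init__(self, **kwargs):
        self.return_dict = kwargs.pop("return_dict", True)
        self.output_hidden_states = kwargs.pop("output_hidden_states", False)
        self.output_attentions = kwargs.pop("output_attentions", False)
        # `use_cache` is intentionally no longer set here.


class CachingConfigSketch(BaseConfigSketch):
    """Hypothetical model config that supports caching and declares the flag itself."""

    def __init__(self, use_cache=True, **kwargs):
        super().__init__(**kwargs)
        self.use_cache = use_cache


class NonCachingConfigSketch(BaseConfigSketch):
    """Hypothetical model config without any cache support."""


caching = CachingConfigSketch()
plain = NonCachingConfigSketch()
print(caching.use_cache)            # flag declared by the subclass
print(hasattr(plain, "use_cache"))  # attribute is simply absent on other configs
```

Under this layout, code that reads the flag from an arbitrary config has to tolerate its absence, for example with `getattr(config, "use_cache", False)`.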

@@ -229,7 +229,7 @@ class LineByLineWithSOPTextDataset(Dataset):
# to `block_size` anyways, so short sequences are generally wasted # to `block_size` anyways, so short sequences are generally wasted
# computation. However, we *sometimes* # computation. However, we *sometimes*
# (i.e., short_seq_prob == 0.1 == 10% of the time) want to use shorter # (i.e., short_seq_prob == 0.1 == 10% of the time) want to use shorter
# sequences to minimize the mismatch between pre-training and fine-tuning. # sequences to minimize the mismatch between pretraining and fine-tuning.
# The `target_seq_length` is just a rough target however, whereas # The `target_seq_length` is just a rough target however, whereas
# `block_size` is a hard limit. # `block_size` is a hard limit.
target_seq_length = max_num_tokens target_seq_length = max_num_tokens
@@ -425,7 +425,7 @@ class TextDatasetForNextSentencePrediction(Dataset):
# to `block_size` anyways, so short sequences are generally wasted # to `block_size` anyways, so short sequences are generally wasted
# computation. However, we *sometimes* # computation. However, we *sometimes*
# (i.e., short_seq_prob == 0.1 == 10% of the time) want to use shorter # (i.e., short_seq_prob == 0.1 == 10% of the time) want to use shorter
# sequences to minimize the mismatch between pre-training and fine-tuning. # sequences to minimize the mismatch between pretraining and fine-tuning.
# The `target_seq_length` is just a rough target however, whereas # The `target_seq_length` is just a rough target however, whereas
# `block_size` is a hard limit. # `block_size` is a hard limit.
target_seq_length = max_num_tokens target_seq_length = max_num_tokens

@@ -38,6 +38,7 @@ class TFGenerationMixin:
def _use_cache(self, outputs, use_cache): def _use_cache(self, outputs, use_cache):
"""During generation, decide whether to pass the `past` variable to the next forward pass.""" """During generation, decide whether to pass the `past` variable to the next forward pass."""
use_cache = getattr(self.config, "use_cache", False)
if len(outputs) <= 1 or use_cache is False: if len(outputs) <= 1 or use_cache is False:
return False return False
if hasattr(self.config, "mem_len") and self.config.mem_len == 0: if hasattr(self.config, "mem_len") and self.config.mem_len == 0:
@@ -194,7 +195,6 @@ class TFGenerationMixin:
min_length = min_length if min_length is not None else self.config.min_length min_length = min_length if min_length is not None else self.config.min_length
do_sample = do_sample if do_sample is not None else self.config.do_sample do_sample = do_sample if do_sample is not None else self.config.do_sample
early_stopping = early_stopping if early_stopping is not None else self.config.early_stopping early_stopping = early_stopping if early_stopping is not None else self.config.early_stopping
use_cache = use_cache if use_cache is not None else self.config.use_cache
num_beams = num_beams if num_beams is not None else self.config.num_beams num_beams = num_beams if num_beams is not None else self.config.num_beams
temperature = temperature if temperature is not None else self.config.temperature temperature = temperature if temperature is not None else self.config.temperature
top_k = top_k if top_k is not None else self.config.top_k top_k = top_k if top_k is not None else self.config.top_k
@@ -224,7 +224,6 @@ class TFGenerationMixin:
assert isinstance(min_length, int) and min_length >= 0, "`min_length` should be a positive integer." assert isinstance(min_length, int) and min_length >= 0, "`min_length` should be a positive integer."
assert isinstance(do_sample, bool), "`do_sample` should be a boolean." assert isinstance(do_sample, bool), "`do_sample` should be a boolean."
assert isinstance(early_stopping, bool), "`early_stopping` should be a boolean." assert isinstance(early_stopping, bool), "`early_stopping` should be a boolean."
assert isinstance(use_cache, bool), "`use_cache` should be a boolean."
assert isinstance(num_beams, int) and num_beams > 0, "`num_beams` should be a strictly positive integer." assert isinstance(num_beams, int) and num_beams > 0, "`num_beams` should be a strictly positive integer."
assert temperature > 0, "`temperature` should be strictly positive." assert temperature > 0, "`temperature` should be strictly positive."
assert isinstance(top_k, int) and top_k >= 0, "`top_k` should be a positive integer." assert isinstance(top_k, int) and top_k >= 0, "`top_k` should be a positive integer."
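
The `_use_cache` hunk above now reads the flag from the config with `getattr` rather than relying on the argument passed in. The sketch below restates that decision as a free function. The name `should_use_cache` is hypothetical, and the final `return False` / `return True` lines are an assumption, since the hunk is cut off after the `mem_len` check.

```python
from types import SimpleNamespace


def should_use_cache(config, outputs):
    """Decide whether to pass `past` to the next forward pass (sketch only)."""
    # Configs that never declare `use_cache` are treated as non-caching.
    use_cache = getattr(config, "use_cache", False)
    if len(outputs) <= 1 or use_cache is False:
        return False
    # Memory-based models with an empty memory have nothing to reuse.
    if hasattr(config, "mem_len") and config.mem_len == 0:
        return False  # assumed: not visible in the hunk
    return True  # assumed: not visible in the hunk


# SimpleNamespace stands in for a config object in this sketch.
with_cache = SimpleNamespace(use_cache=True)
without_flag = SimpleNamespace()
empty_memory = SimpleNamespace(use_cache=True, mem_len=0)

outputs = ("logits", "past")
print(should_use_cache(with_cache, outputs))
print(should_use_cache(without_flag, outputs))
print(should_use_cache(empty_memory, outputs))
```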

@@ -462,7 +462,6 @@ class GenerationMixin:
pad_token_id = pad_token_id if pad_token_id is not None else self.config.pad_token_id pad_token_id = pad_token_id if pad_token_id is not None else self.config.pad_token_id
bos_token_id = bos_token_id if bos_token_id is not None else self.config.bos_token_id bos_token_id = bos_token_id if bos_token_id is not None else self.config.bos_token_id
eos_token_id = eos_token_id if eos_token_id is not None else self.config.eos_token_id eos_token_id = eos_token_id if eos_token_id is not None else self.config.eos_token_id
use_cache = use_cache if use_cache is not None else self.config.use_cache
if input_ids is None: if input_ids is None:
# init `input_ids` with bos_token_id # init `input_ids` with bos_token_id

@@ -730,7 +730,7 @@ class AlbertModel(AlbertPreTrainedModel):
@add_start_docstrings( @add_start_docstrings(
""" """
Albert Model with two heads on top as done during the pre-training: a `masked language modeling` head and a Albert Model with two heads on top as done during the pretraining: a `masked language modeling` head and a
`sentence order prediction (classification)` head. `sentence order prediction (classification)` head.
""", """,
ALBERT_START_DOCSTRING, ALBERT_START_DOCSTRING,

@@ -809,7 +809,7 @@ class TFAlbertModel(TFAlbertPreTrainedModel):
@add_start_docstrings( @add_start_docstrings(
""" """
Albert Model with two heads on top for pre-training: a `masked language modeling` head and a `sentence order Albert Model with two heads on top for pretraining: a `masked language modeling` head and a `sentence order
prediction` (classification) head. prediction` (classification) head.
""", """,
ALBERT_START_DOCSTRING, ALBERT_START_DOCSTRING,

@@ -108,6 +108,8 @@ class BartConfig(PretrainedConfig):
force_bos_token_to_be_generated (:obj:`bool`, `optional`, defaults to :obj:`False`): force_bos_token_to_be_generated (:obj:`bool`, `optional`, defaults to :obj:`False`):
Whether or not to force BOS token to be generated at step 1 (after ``decoder_start_token_id``), only Whether or not to force BOS token to be generated at step 1 (after ``decoder_start_token_id``), only
:obj:`True` for `bart-large-cnn`. :obj:`True` for `bart-large-cnn`.
use_cache (:obj:`bool`, `optional`, defaults to :obj:`True`):
Whether or not the model should return the last key/values attentions (not used by all models).
""" """
model_type = "bart" model_type = "bart"
keys_to_ignore_at_inference = ["past_key_values"] keys_to_ignore_at_inference = ["past_key_values"]
@@ -134,9 +136,6 @@ class BartConfig(PretrainedConfig):
classifier_dropout=0.0, classifier_dropout=0.0,
num_labels=3, num_labels=3,
is_encoder_decoder=True, is_encoder_decoder=True,
pad_token_id=1,
bos_token_id=0,
eos_token_id=2,
normalize_before=False, normalize_before=False,
add_final_layer_norm=False, add_final_layer_norm=False,
do_blenderbot_90_layernorm=False, do_blenderbot_90_layernorm=False,
@@ -145,6 +144,10 @@ class BartConfig(PretrainedConfig):
static_position_embeddings=False, static_position_embeddings=False,
add_bias_logits=False, add_bias_logits=False,
force_bos_token_to_be_generated=False, force_bos_token_to_be_generated=False,
use_cache=True,
pad_token_id=1,
bos_token_id=0,
eos_token_id=2,
**common_kwargs **common_kwargs
): ):
r""" r"""
@@ -208,6 +211,8 @@ class BartConfig(PretrainedConfig):
self.do_blenderbot_90_layernorm = do_blenderbot_90_layernorm self.do_blenderbot_90_layernorm = do_blenderbot_90_layernorm
self.use_cache = use_cache
@property @property
def num_attention_heads(self) -> int: def num_attention_heads(self) -> int:
return self.encoder_attention_heads return self.encoder_attention_heads

@@ -888,7 +888,7 @@ class BertModel(BertPreTrainedModel):
@add_start_docstrings( @add_start_docstrings(
""" """
Bert Model with two heads on top as done during the pre-training: a `masked language modeling` head and a `next Bert Model with two heads on top as done during the pretraining: a `masked language modeling` head and a `next
sentence prediction (classification)` head. sentence prediction (classification)` head.
""", """,
BERT_START_DOCSTRING, BERT_START_DOCSTRING,

@@ -90,7 +90,7 @@ TF_BERT_PRETRAINED_MODEL_ARCHIVE_LIST = [
class TFBertPreTrainingLoss: class TFBertPreTrainingLoss:
""" """
Loss function suitable for BERT-like pre-training, that is, the task of pretraining a language model by combining Loss function suitable for BERT-like pretraining, that is, the task of pretraining a language model by combining
NSP + MLM. .. note:: Any label of -100 will be ignored (along with the corresponding logits) in the loss NSP + MLM. .. note:: Any label of -100 will be ignored (along with the corresponding logits) in the loss
computation. computation.
""" """
@@ -878,7 +878,7 @@ class TFBertModel(TFBertPreTrainedModel):
@add_start_docstrings( @add_start_docstrings(
""" """
Bert Model with two heads on top as done during the pre-training: Bert Model with two heads on top as done during the pretraining:
a `masked language modeling` head and a `next sentence prediction (classification)` head. a `masked language modeling` head and a `next sentence prediction (classification)` head.
""", """,
BERT_START_DOCSTRING, BERT_START_DOCSTRING,

@@ -80,7 +80,7 @@ class BertweetTokenizer(PreTrainedTokenizer):
normalization (:obj:`bool`, `optional`, defaults to :obj:`False`) normalization (:obj:`bool`, `optional`, defaults to :obj:`False`)
Whether or not to apply a normalization preprocess. Whether or not to apply a normalization preprocess.
bos_token (:obj:`str`, `optional`, defaults to :obj:`"<s>"`): bos_token (:obj:`str`, `optional`, defaults to :obj:`"<s>"`):
The beginning of sequence token that was used during pre-training. Can be used a sequence classifier token. The beginning of sequence token that was used during pretraining. Can be used a sequence classifier token.
.. note:: .. note::

@@ -61,6 +61,9 @@ class CTRLConfig(PretrainedConfig):
The epsilon to use in the layer normalization layers The epsilon to use in the layer normalization layers
initializer_range (:obj:`float`, `optional`, defaults to 0.02): initializer_range (:obj:`float`, `optional`, defaults to 0.02):
The standard deviation of the truncated_normal_initializer for initializing all weight matrices. The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
use_cache (:obj:`bool`, `optional`, defaults to :obj:`True`):
Whether or not the model should return the last key/values attentions (not used by all models).
Examples:: Examples::
@@ -98,6 +101,7 @@ class CTRLConfig(PretrainedConfig):
summary_activation=None, summary_activation=None,
summary_proj_to_labels=True, summary_proj_to_labels=True,
summary_first_dropout=0.1, summary_first_dropout=0.1,
use_cache=True,
**kwargs **kwargs
): ):
super().__init__(**kwargs) super().__init__(**kwargs)
@@ -119,6 +123,7 @@ class CTRLConfig(PretrainedConfig):
self.summary_activation = summary_activation self.summary_activation = summary_activation
self.summary_first_dropout = summary_first_dropout self.summary_first_dropout = summary_first_dropout
self.summary_proj_to_labels = summary_proj_to_labels self.summary_proj_to_labels = summary_proj_to_labels
self.use_cache = use_cache
@property @property
def max_position_embeddings(self): def max_position_embeddings(self):

@@ -772,7 +772,7 @@ DEBERTA_START_DOCSTRING = r"""
The DeBERTa model was proposed in `DeBERTa: Decoding-enhanced BERT with Disentangled Attention The DeBERTa model was proposed in `DeBERTa: Decoding-enhanced BERT with Disentangled Attention
<https://arxiv.org/abs/2006.03654>`_ by Pengcheng He, Xiaodong Liu, Jianfeng Gao, Weizhu Chen. It's build on top of <https://arxiv.org/abs/2006.03654>`_ by Pengcheng He, Xiaodong Liu, Jianfeng Gao, Weizhu Chen. It's build on top of
BERT/RoBERTa with two improvements, i.e. disentangled attention and enhanced mask decoder. With those two BERT/RoBERTa with two improvements, i.e. disentangled attention and enhanced mask decoder. With those two
improvements, it out perform BERT/RoBERTa on a majority of tasks with 80GB pre-training data. improvements, it out perform BERT/RoBERTa on a majority of tasks with 80GB pretraining data.
This model is also a PyTorch `torch.nn.Module <https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`__ This model is also a PyTorch `torch.nn.Module <https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`__
subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to

@@ -891,8 +891,7 @@ class ElectraForSequenceClassification(ElectraPreTrainedModel):
@add_start_docstrings( @add_start_docstrings(
""" """
Electra model with a binary classification head on top as used during pre-training for identifying generated Electra model with a binary classification head on top as used during pretraining for identifying generated tokens.
tokens.
It is recommended to load the discriminator checkpoint into that model. It is recommended to load the discriminator checkpoint into that model.
""", """,

@@ -789,8 +789,7 @@ class TFElectraModel(TFElectraPreTrainedModel):
@add_start_docstrings( @add_start_docstrings(
""" """
Electra model with a binary classification head on top as used during pre-training for identifying generated Electra model with a binary classification head on top as used during pretraining for identifying generated tokens.
tokens.
Even though both the discriminator and generator may be loaded into this model, the discriminator is the only model Even though both the discriminator and generator may be loaded into this model, the discriminator is the only model
of the two to have the correct classification head to be used for this model. of the two to have the correct classification head to be used for this model.

@@ -109,6 +109,8 @@ class FSMTConfig(PretrainedConfig):
early_stopping (:obj:`bool`, `optional`, defaults to :obj:`False`) early_stopping (:obj:`bool`, `optional`, defaults to :obj:`False`)
Flag that will be used by default in the :obj:`generate` method of the model. Whether to stop the beam Flag that will be used by default in the :obj:`generate` method of the model. Whether to stop the beam
search when at least ``num_beams`` sentences are finished per batch or not. search when at least ``num_beams`` sentences are finished per batch or not.
use_cache (:obj:`bool`, `optional`, defaults to :obj:`True`):
Whether or not the model should return the last key/values attentions (not used by all models).
Examples:: Examples::
@@ -142,9 +144,6 @@ class FSMTConfig(PretrainedConfig):
dropout=0.1, dropout=0.1,
activation_dropout=0.0, activation_dropout=0.0,
init_std=0.02, init_std=0.02,
pad_token_id=1,
bos_token_id=0,
eos_token_id=2,
decoder_start_token_id=2, decoder_start_token_id=2,
is_encoder_decoder=True, is_encoder_decoder=True,
scale_embedding=True, scale_embedding=True,
@@ -152,6 +151,10 @@ class FSMTConfig(PretrainedConfig):
num_beams=5, num_beams=5,
length_penalty=1.0, length_penalty=1.0,
early_stopping=False, early_stopping=False,
use_cache=True,
pad_token_id=1,
bos_token_id=0,
eos_token_id=2,
**common_kwargs **common_kwargs
): ):
if "hidden_size" in common_kwargs: if "hidden_size" in common_kwargs:
@@ -196,6 +199,8 @@ class FSMTConfig(PretrainedConfig):
self.activation_dropout = activation_dropout self.activation_dropout = activation_dropout
self.dropout = dropout self.dropout = dropout
self.use_cache = use_cache
@property @property
def num_attention_heads(self) -> int: def num_attention_heads(self) -> int:
return self.encoder_attention_heads return self.encoder_attention_heads

@@ -1241,7 +1241,7 @@ class TFFunnelModel(TFFunnelPreTrainedModel):
@add_start_docstrings( @add_start_docstrings(
""" """
Funnel model with a binary classification head on top as used during pre-training for identifying generated tokens. Funnel model with a binary classification head on top as used during pretraining for identifying generated tokens.
""", """,
FUNNEL_START_DOCSTRING, FUNNEL_START_DOCSTRING,
) )

@@ -104,6 +104,8 @@ class GPT2Config(PretrainedConfig):
The dropout ratio to be used after the projection and activation. The dropout ratio to be used after the projection and activation.
gradient_checkpointing (:obj:`bool`, `optional`, defaults to :obj:`False`): gradient_checkpointing (:obj:`bool`, `optional`, defaults to :obj:`False`):
Whether or not to use gradient checkpointing to save memory at the expense of slower backward pass. Whether or not to use gradient checkpointing to save memory at the expense of slower backward pass.
use_cache (:obj:`bool`, `optional`, defaults to :obj:`True`):
Whether or not the model should return the last key/values attentions (not used by all models).
Example:: Example::
@@ -142,9 +144,10 @@ class GPT2Config(PretrainedConfig):
summary_activation=None, summary_activation=None,
summary_proj_to_labels=True, summary_proj_to_labels=True,
summary_first_dropout=0.1, summary_first_dropout=0.1,
gradient_checkpointing=False,
use_cache=True,
bos_token_id=50256, bos_token_id=50256,
eos_token_id=50256, eos_token_id=50256,
gradient_checkpointing=False,
**kwargs **kwargs
): ):
super().__init__(bos_token_id=bos_token_id, eos_token_id=eos_token_id, **kwargs) super().__init__(bos_token_id=bos_token_id, eos_token_id=eos_token_id, **kwargs)
@@ -168,6 +171,7 @@ class GPT2Config(PretrainedConfig):
self.summary_first_dropout = summary_first_dropout self.summary_first_dropout = summary_first_dropout
self.summary_proj_to_labels = summary_proj_to_labels self.summary_proj_to_labels = summary_proj_to_labels
self.gradient_checkpointing = gradient_checkpointing self.gradient_checkpointing = gradient_checkpointing
self.use_cache = use_cache
self.bos_token_id = bos_token_id self.bos_token_id = bos_token_id
self.eos_token_id = eos_token_id self.eos_token_id = eos_token_id

@@ -1013,7 +1013,7 @@ class LxmertModel(LxmertPreTrainedModel):
@add_start_docstrings( @add_start_docstrings(
"""Lxmert Model with a specified pre-training head on top. """, """Lxmert Model with a specified pretraining head on top. """,
LXMERT_START_DOCSTRING, LXMERT_START_DOCSTRING,
) )
class LxmertForPreTraining(LxmertPreTrainedModel): class LxmertForPreTraining(LxmertPreTrainedModel):
@@ -1024,7 +1024,7 @@ class LxmertForPreTraining(LxmertPreTrainedModel):
self.num_qa_labels = config.num_qa_labels self.num_qa_labels = config.num_qa_labels
self.visual_loss_normalizer = config.visual_loss_normalizer self.visual_loss_normalizer = config.visual_loss_normalizer
# Use of pre-training tasks # Use of pretraining tasks
self.task_mask_lm = config.task_mask_lm self.task_mask_lm = config.task_mask_lm
self.task_obj_predict = config.task_obj_predict self.task_obj_predict = config.task_obj_predict
self.task_matched = config.task_matched self.task_matched = config.task_matched

@@ -1176,7 +1176,7 @@ class TFLxmertForPreTraining(TFLxmertPreTrainedModel):
self.num_qa_labels = config.num_qa_labels self.num_qa_labels = config.num_qa_labels
self.visual_loss_normalizer = config.visual_loss_normalizer self.visual_loss_normalizer = config.visual_loss_normalizer
# Use of pre-training tasks # Use of pretraining tasks
self.task_mask_lm = config.task_mask_lm self.task_mask_lm = config.task_mask_lm
self.task_obj_predict = config.task_obj_predict self.task_obj_predict = config.task_obj_predict
self.task_matched = config.task_matched self.task_matched = config.task_matched

@@ -933,7 +933,7 @@ class MobileBertModel(MobileBertPreTrainedModel):
@add_start_docstrings( @add_start_docstrings(
""" """
MobileBert Model with two heads on top as done during the pre-training: a `masked language modeling` head and a MobileBert Model with two heads on top as done during the pretraining: a `masked language modeling` head and a
`next sentence prediction (classification)` head. `next sentence prediction (classification)` head.
""", """,
MOBILEBERT_START_DOCSTRING, MOBILEBERT_START_DOCSTRING,

@@ -1014,7 +1014,7 @@ class TFMobileBertModel(TFMobileBertPreTrainedModel):
@add_start_docstrings( @add_start_docstrings(
""" """
MobileBert Model with two heads on top as done during the pre-training: a `masked language modeling` head and a MobileBert Model with two heads on top as done during the pretraining: a `masked language modeling` head and a
`next sentence prediction (classification)` head. `next sentence prediction (classification)` head.
""", """,
MOBILEBERT_START_DOCSTRING, MOBILEBERT_START_DOCSTRING,

@@ -96,6 +96,9 @@ class OpenAIGPTConfig(PretrainedConfig):
:class:`~transformers.OpenAIGPTDoubleHeadsModel` and :class:`~transformers.OpenAIGPTDoubleHeadsModel`. :class:`~transformers.OpenAIGPTDoubleHeadsModel` and :class:`~transformers.OpenAIGPTDoubleHeadsModel`.
The dropout ratio to be used after the projection and activation. The dropout ratio to be used after the projection and activation.
use_cache (:obj:`bool`, `optional`, defaults to :obj:`True`):
Whether or not the model should return the last key/values attentions (not used by all models).
Examples:: Examples::
@@ -133,6 +136,7 @@ class OpenAIGPTConfig(PretrainedConfig):
summary_activation=None, summary_activation=None,
summary_proj_to_labels=True, summary_proj_to_labels=True,
summary_first_dropout=0.1, summary_first_dropout=0.1,
use_cache=True,
**kwargs **kwargs
): ):
super().__init__(**kwargs) super().__init__(**kwargs)
@@ -155,6 +159,7 @@ class OpenAIGPTConfig(PretrainedConfig):
self.summary_activation = summary_activation self.summary_activation = summary_activation
self.summary_first_dropout = summary_first_dropout self.summary_first_dropout = summary_first_dropout
self.summary_proj_to_labels = summary_proj_to_labels self.summary_proj_to_labels = summary_proj_to_labels
self.use_cache = use_cache
@property @property
def max_position_embeddings(self): def max_position_embeddings(self):

@@ -90,6 +90,8 @@ class ProphetNetConfig(PretrainedConfig):
eps (:obj:`float`, `optional`, defaults to 0.0): eps (:obj:`float`, `optional`, defaults to 0.0):
Controls the ``epsilon`` parameter value for label smoothing in the loss calculation. If set to 0, no label Controls the ``epsilon`` parameter value for label smoothing in the loss calculation. If set to 0, no label
smoothing is performed. smoothing is performed.
use_cache (:obj:`bool`, `optional`, defaults to :obj:`True`):
Whether or not the model should return the last key/values attentions (not used by all models).
""" """
model_type = "prophetnet" model_type = "prophetnet"
keys_to_ignore_at_inference = ["past_key_values"] keys_to_ignore_at_inference = ["past_key_values"]
@@ -112,15 +114,16 @@ class ProphetNetConfig(PretrainedConfig):
init_std=0.02, init_std=0.02,
is_encoder_decoder=True, is_encoder_decoder=True,
add_cross_attention=True, add_cross_attention=True,
pad_token_id=0,
bos_token_id=1,
eos_token_id=2,
decoder_start_token_id=0, decoder_start_token_id=0,
ngram=2, ngram=2,
num_buckets=32, num_buckets=32,
relative_max_distance=128, relative_max_distance=128,
disable_ngram_loss=False, disable_ngram_loss=False,
eps=0.0, eps=0.0,
use_cache=True,
pad_token_id=0,
bos_token_id=1,
eos_token_id=2,
**kwargs **kwargs
): ):
super().__init__( super().__init__(
@@ -156,6 +159,8 @@ class ProphetNetConfig(PretrainedConfig):
self.activation_dropout = activation_dropout self.activation_dropout = activation_dropout
self.dropout = dropout self.dropout = dropout
self.use_cache = use_cache
@property @property
def num_attention_heads(self) -> int: def num_attention_heads(self) -> int:
return self.num_encoder_attention_heads return self.num_encoder_attention_heads

@@ -72,6 +72,8 @@ RAG_CONFIG_DOC = r"""
output_retrieved(:obj:`bool`, `optional`, defaults to :obj:`False`): output_retrieved(:obj:`bool`, `optional`, defaults to :obj:`False`):
If set to ``True``, :obj:`retrieved_doc_embeds`, :obj:`retrieved_doc_ids`, :obj:`context_input_ids` and If set to ``True``, :obj:`retrieved_doc_embeds`, :obj:`retrieved_doc_ids`, :obj:`context_input_ids` and
:obj:`context_attention_mask` are returned. See returned tensors for more detail. :obj:`context_attention_mask` are returned. See returned tensors for more detail.
use_cache (:obj:`bool`, `optional`, defaults to :obj:`True`):
Whether or not the model should return the last key/values attentions (not used by all models).
""" """
@@ -107,6 +109,7 @@ class RagConfig(PretrainedConfig):
exclude_bos_score=False, exclude_bos_score=False,
do_marginalize=False, do_marginalize=False,
output_retrieved=False, output_retrieved=False,
use_cache=True,
**kwargs **kwargs
): ):
super().__init__( super().__init__(
@@ -156,6 +159,8 @@ class RagConfig(PretrainedConfig):
self.do_deduplication = do_deduplication self.do_deduplication = do_deduplication
self.use_cache = use_cache
@classmethod @classmethod
def from_question_encoder_generator_configs( def from_question_encoder_generator_configs(
cls, question_encoder_config: PretrainedConfig, generator_config: PretrainedConfig, **kwargs cls, question_encoder_config: PretrainedConfig, generator_config: PretrainedConfig, **kwargs

@@ -138,6 +138,8 @@ class ReformerConfig(PretrainedConfig):
:obj:`inputs_ids` passed when calling :class:`~transformers.ReformerModel`. :obj:`inputs_ids` passed when calling :class:`~transformers.ReformerModel`.
tie_word_embeddings (:obj:`bool`, `optional`, defaults to :obj:`False`): tie_word_embeddings (:obj:`bool`, `optional`, defaults to :obj:`False`):
Whether to tie input and output embeddings. Whether to tie input and output embeddings.
use_cache (:obj:`bool`, `optional`, defaults to :obj:`True`):
Whether or not the model should return the last key/values attentions (not used by all models).
Examples:: Examples::
@@ -188,6 +190,7 @@ class ReformerConfig(PretrainedConfig):
pad_token_id=0, pad_token_id=0,
vocab_size=320, vocab_size=320,
tie_word_embeddings=False, tie_word_embeddings=False,
use_cache=True,
**kwargs **kwargs
): ):
super().__init__( super().__init__(
@@ -226,3 +229,4 @@ class ReformerConfig(PretrainedConfig):
self.axial_norm_std = axial_norm_std self.axial_norm_std = axial_norm_std
self.chunk_size_lm_head = chunk_size_lm_head self.chunk_size_lm_head = chunk_size_lm_head
self.attn_layers = attn_layers self.attn_layers = attn_layers
self.use_cache = use_cache

@@ -69,6 +69,8 @@ class T5Config(PretrainedConfig):
feed_forward_proj (:obj:`string`, `optional`, defaults to :obj:`"relu"`): feed_forward_proj (:obj:`string`, `optional`, defaults to :obj:`"relu"`):
Type of feed forward layer to be used. Should be one of :obj:`"relu"` or :obj:`"gated-gelu"`. T5v1.1 uses Type of feed forward layer to be used. Should be one of :obj:`"relu"` or :obj:`"gated-gelu"`. T5v1.1 uses
the :obj:`"gated-gelu"` feed forward projection. Original T5 uses :obj:`"relu"`. the :obj:`"gated-gelu"` feed forward projection. Original T5 uses :obj:`"relu"`.
use_cache (:obj:`bool`, `optional`, defaults to :obj:`True`):
Whether or not the model should return the last key/values attentions (not used by all models).
""" """
model_type = "t5" model_type = "t5"
keys_to_ignore_at_inference = ["past_key_values"] keys_to_ignore_at_inference = ["past_key_values"]
@@ -88,6 +90,7 @@ class T5Config(PretrainedConfig):
initializer_factor=1.0, initializer_factor=1.0,
feed_forward_proj="relu", feed_forward_proj="relu",
is_encoder_decoder=True, is_encoder_decoder=True,
use_cache=True,
pad_token_id=0, pad_token_id=0,
eos_token_id=1, eos_token_id=1,
**kwargs **kwargs
@@ -112,6 +115,7 @@ class T5Config(PretrainedConfig):
self.layer_norm_epsilon = layer_norm_epsilon self.layer_norm_epsilon = layer_norm_epsilon
self.initializer_factor = initializer_factor self.initializer_factor = initializer_factor
self.feed_forward_proj = feed_forward_proj self.feed_forward_proj = feed_forward_proj
self.use_cache = use_cache
@property @property
def hidden_size(self): def hidden_size(self):

@@ -884,7 +884,7 @@ T5_INPUTS_DOCSTRING = r"""
:func:`transformers.PreTrainedTokenizer.__call__` and :func:`transformers.PreTrainedTokenizer.encode` for :func:`transformers.PreTrainedTokenizer.__call__` and :func:`transformers.PreTrainedTokenizer.encode` for
details. details.
To know more on how to prepare :obj:`inputs` for pre-training take a look at `T5 Training To know more on how to prepare :obj:`inputs` for pretraining take a look at `T5 Training
<./t5.html#training>`__. <./t5.html#training>`__.
decoder_input_ids (:obj:`tf.Tensor` of shape :obj:`(batch_size, target_sequence_length)`, `optional`): decoder_input_ids (:obj:`tf.Tensor` of shape :obj:`(batch_size, target_sequence_length)`, `optional`):
Provide for sequence to sequence training. T5 uses the :obj:`pad_token_id` as the starting token for Provide for sequence to sequence training. T5 uses the :obj:`pad_token_id` as the starting token for

@@ -15,6 +15,8 @@
# limitations under the License. # limitations under the License.
""" XLNet configuration """ """ XLNet configuration """
import warnings
from ...configuration_utils import PretrainedConfig from ...configuration_utils import PretrainedConfig
from ...utils import logging from ...utils import logging
@@ -106,12 +108,18 @@ class XLNetConfig(PretrainedConfig):
Used in the SQuAD evaluation script. Used in the SQuAD evaluation script.
end_n_top (:obj:`int`, `optional`, defaults to 5): end_n_top (:obj:`int`, `optional`, defaults to 5):
Used in the SQuAD evaluation script. Used in the SQuAD evaluation script.
use_cache (:obj:`bool`, `optional`, defaults to :obj:`True`): use_mems_eval (:obj:`bool`, `optional`, defaults to :obj:`True`):
Whether or not the model should return the last pre-computed hidden states. Whether or not the model should make use of the recurrent memory mechanism in evaluation mode.
use_mems_train (:obj:`bool`, `optional`, defaults to :obj:`False`):
Whether or not the model should make use of the recurrent memory mechanism in train mode.
.. note:: .. note::
This flag behaves differently from with other models: it just controls the inference behavior, during For pretraining, it is recommended to set ``use_mems_train`` to :obj:`True`. For fine-tuning, it is
training the model always uses ``use_cache=True``. recommended to set ``use_mems_train`` to :obj:`False` as discussed `here
<https://github.com/zihangdai/xlnet/issues/41#issuecomment-505102587>`__. If ``use_mems_train`` is set
to :obj:`True`, one has to make sure that the train batches are correctly pre-processed, `e.g.`
:obj:`batch_1 = [[This line is], [This is the]]` and :obj:`batch_2 = [[ the first line], [ second
line]]` and that all batches are of equal size.
Examples:: Examples::
@@ -145,6 +153,8 @@ class XLNetConfig(PretrainedConfig):
dropout=0.1, dropout=0.1,
mem_len=512, mem_len=512,
reuse_len=None, reuse_len=None,
use_mems_eval=True,
use_mems_train=False,
bi_data=False, bi_data=False,
clamp_len=-1, clamp_len=-1,
same_length=False, same_length=False,
@@ -197,6 +207,16 @@ class XLNetConfig(PretrainedConfig):
self.pad_token_id = pad_token_id self.pad_token_id = pad_token_id
self.eos_token_id = eos_token_id self.eos_token_id = eos_token_id
if "use_cache" in kwargs:
warnings.warn(
"The `use_cache` argument is deprecated and will be removed in a future version, use `use_mems_eval` instead.",
FutureWarning,
)
use_mems_eval = kwargs["use_cache"]
self.use_mems_eval = use_mems_eval
self.use_mems_train = use_mems_train
@property @property
def max_position_embeddings(self): def max_position_embeddings(self):
return -1 return -1
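The backward-compatibility handling in this hunk can be sketched in isolation. This is a minimal sketch, not the library's class: `SketchConfig` is a hypothetical stand-in for `XLNetConfig` that models only the two new flags and maps a legacy `use_cache` keyword onto `use_mems_eval` while emitting a `FutureWarning`.

```python
import warnings


class SketchConfig:
    """Hypothetical stand-in for XLNetConfig; only the memory flags are modeled."""

    def __init__(self, use_mems_eval=True, use_mems_train=False, **kwargs):
        # A legacy `use_cache` keyword overrides `use_mems_eval`, as in the diff.
        if "use_cache" in kwargs:
            warnings.warn(
                "The `use_cache` argument is deprecated and will be removed in a "
                "future version, use `use_mems_eval` instead.",
                FutureWarning,
            )
            use_mems_eval = kwargs["use_cache"]

        self.use_mems_eval = use_mems_eval
        self.use_mems_train = use_mems_train


# Usage: the old keyword is still accepted, but the warning is recorded.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    legacy = SketchConfig(use_cache=False)
```

Here `legacy.use_mems_eval` takes the value passed as `use_cache`, and `use_mems_train` keeps its default, so existing callers only see a warning rather than a behavior change in evaluation mode.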


@@ -440,6 +440,9 @@ class TFXLNetMainLayer(tf.keras.layers.Layer):
self.layer = [TFXLNetLayer(config, name="layer_._{}".format(i)) for i in range(config.n_layer)] self.layer = [TFXLNetLayer(config, name="layer_._{}".format(i)) for i in range(config.n_layer)]
self.dropout = tf.keras.layers.Dropout(config.dropout) self.dropout = tf.keras.layers.Dropout(config.dropout)
self.use_mems_eval = config.use_mems_eval
self.use_mems_train = config.use_mems_train
def get_input_embeddings(self): def get_input_embeddings(self):
return self.word_embedding return self.word_embedding
@@ -489,14 +492,23 @@ class TFXLNetMainLayer(tf.keras.layers.Layer):
return ret return ret
def cache_mem(self, curr_out, prev_mem): def cache_mem(self, curr_out, prev_mem):
"""cache hidden states into memory.""" # cache hidden states into memory.
if self.reuse_len is not None and self.reuse_len > 0: if self.reuse_len is not None and self.reuse_len > 0:
curr_out = curr_out[: self.reuse_len] curr_out = curr_out[: self.reuse_len]
if prev_mem is None: if self.mem_len is None or self.mem_len == 0:
new_mem = curr_out[-self.mem_len :] # If :obj:`use_mems` is active but no `mem_len` is defined, the model behaves like GPT-2 at inference time
# and returns all of the past and current hidden states.
cutoff = 0
else: else:
new_mem = tf.concat([prev_mem, curr_out], 0)[-self.mem_len :] # If :obj:`use_mems` is active and `mem_len` is defined, the model returns the last `mem_len` hidden
# states. This is the preferred setting for training and long-form generation.
cutoff = -self.mem_len
if prev_mem is None:
# if :obj:`use_mems` is active and `mem_len` is defined, the model
new_mem = curr_out[cutoff:]
else:
new_mem = tf.concat([prev_mem, curr_out], 0)[cutoff:]
return tf.stop_gradient(new_mem) return tf.stop_gradient(new_mem)
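The new cutoff logic in `cache_mem` can be illustrated without TensorFlow. The following is a minimal sketch under stated assumptions: `cache_mem_sketch` is a hypothetical helper, plain Python lists stand in for tensors whose first axis is the sequence dimension, and the `tf.stop_gradient` call from the real method is omitted.

```python
def cache_mem_sketch(curr_out, prev_mem, mem_len=None, reuse_len=None):
    """Return the new memory for one layer, mirroring the cutoff logic in the diff.

    `curr_out` and `prev_mem` are plain lists standing in for tensors
    (assumption made for illustration only).
    """
    if reuse_len is not None and reuse_len > 0:
        curr_out = curr_out[:reuse_len]

    if mem_len is None or mem_len == 0:
        # No memory length configured: keep every past and current state.
        cutoff = 0
    else:
        # Keep only the last `mem_len` states.
        cutoff = -mem_len

    if prev_mem is None:
        return curr_out[cutoff:]
    # Concatenate along the sequence axis, then truncate from the left.
    return (prev_mem + curr_out)[cutoff:]
```

With `mem_len` unset the memory grows with every step, which is the GPT-2-like behavior the comment describes; with `mem_len` set, only the most recent `mem_len` states are kept.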
@@ -569,7 +581,7 @@ class TFXLNetMainLayer(tf.keras.layers.Layer):
input_mask=None, input_mask=None,
head_mask=None, head_mask=None,
inputs_embeds=None, inputs_embeds=None,
use_cache=True, use_mems=None,
output_attentions=None, output_attentions=None,
output_hidden_states=None, output_hidden_states=None,
return_dict=None, return_dict=None,
@@ -587,7 +599,7 @@ class TFXLNetMainLayer(tf.keras.layers.Layer):
input_mask=input_mask, input_mask=input_mask,
head_mask=head_mask, head_mask=head_mask,
inputs_embeds=inputs_embeds, inputs_embeds=inputs_embeds,
use_cache=use_cache, use_mems=use_mems,
output_attentions=output_attentions, output_attentions=output_attentions,
output_hidden_states=output_hidden_states, output_hidden_states=output_hidden_states,
return_dict=return_dict, return_dict=return_dict,
@@ -602,6 +614,11 @@ class TFXLNetMainLayer(tf.keras.layers.Layer):
) )
return_dict = inputs["return_dict"] if inputs["return_dict"] is not None else self.return_dict return_dict = inputs["return_dict"] if inputs["return_dict"] is not None else self.return_dict
if training:
use_mems = use_mems if use_mems is not None else self.use_mems_train
else:
use_mems = use_mems if use_mems is not None else self.use_mems_eval
# the original code for XLNet uses shapes [len, bsz] with the batch dimension at the end # the original code for XLNet uses shapes [len, bsz] with the batch dimension at the end
# but we want a unified interface in the library with the batch size on the first dimension # but we want a unified interface in the library with the batch size on the first dimension
# so we move here the first dimension (batch) to the end # so we move here the first dimension (batch) to the end
@@ -737,7 +754,7 @@ class TFXLNetMainLayer(tf.keras.layers.Layer):
hidden_states = [] if output_hidden_states else None hidden_states = [] if output_hidden_states else None
for i, layer_module in enumerate(self.layer): for i, layer_module in enumerate(self.layer):
# cache new mems # cache new mems
if self.mem_len is not None and self.mem_len > 0 and use_cache: if use_mems:
new_mems = new_mems + (self.cache_mem(output_h, inputs["mems"][i]),) new_mems = new_mems + (self.cache_mem(output_h, inputs["mems"][i]),)
if output_hidden_states: if output_hidden_states:
hidden_states.append((output_h, output_g) if output_g is not None else output_h) hidden_states.append((output_h, output_g) if output_g is not None else output_h)
@@ -768,7 +785,7 @@ class TFXLNetMainLayer(tf.keras.layers.Layer):
# Prepare outputs, we transpose back here to shape [bsz, len, hidden_dim] (cf. beginning of forward() method) # Prepare outputs, we transpose back here to shape [bsz, len, hidden_dim] (cf. beginning of forward() method)
output = tf.transpose(output, perm=(1, 0, 2)) output = tf.transpose(output, perm=(1, 0, 2))
if not (self.mem_len is not None and self.mem_len > 0 and use_cache): if not use_mems:
new_mems = None new_mems = None
if output_hidden_states: if output_hidden_states:
if output_g is not None: if output_g is not None:
@@ -1066,7 +1083,7 @@ XLNET_INPUTS_DOCSTRING = r"""
decoding. The token ids which have their past given to this model should not be passed as :obj:`input_ids` decoding. The token ids which have their past given to this model should not be passed as :obj:`input_ids`
as they have already been computed. as they have already been computed.
:obj:`use_cache` has to be set to :obj:`True` to make use of :obj:`mems`. :obj:`use_mems` has to be set to :obj:`True` to make use of :obj:`mems`.
perm_mask (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size, sequence_length, sequence_length)`, `optional`): perm_mask (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size, sequence_length, sequence_length)`, `optional`):
Mask to indicate the attention pattern for each input token with values selected in ``[0, 1]``: Mask to indicate the attention pattern for each input token with values selected in ``[0, 1]``:
@@ -1147,7 +1164,7 @@ class TFXLNetModel(TFXLNetPreTrainedModel):
input_mask=None, input_mask=None,
head_mask=None, head_mask=None,
inputs_embeds=None, inputs_embeds=None,
use_cache=True, use_mems=None,
output_attentions=None, output_attentions=None,
output_hidden_states=None, output_hidden_states=None,
return_dict=None, return_dict=None,
@@ -1165,7 +1182,7 @@ class TFXLNetModel(TFXLNetPreTrainedModel):
input_mask=input_mask, input_mask=input_mask,
head_mask=head_mask, head_mask=head_mask,
inputs_embeds=inputs_embeds, inputs_embeds=inputs_embeds,
use_cache=use_cache, use_mems=use_mems,
output_attentions=output_attentions, output_attentions=output_attentions,
output_hidden_states=output_hidden_states, output_hidden_states=output_hidden_states,
return_dict=return_dict, return_dict=return_dict,
@@ -1182,7 +1199,7 @@ class TFXLNetModel(TFXLNetPreTrainedModel):
input_mask=inputs["input_mask"], input_mask=inputs["input_mask"],
head_mask=inputs["head_mask"], head_mask=inputs["head_mask"],
inputs_embeds=inputs["inputs_embeds"], inputs_embeds=inputs["inputs_embeds"],
use_cache=inputs["use_cache"], use_mems=inputs["use_mems"],
output_attentions=inputs["output_attentions"], output_attentions=inputs["output_attentions"],
output_hidden_states=inputs["output_hidden_states"], output_hidden_states=inputs["output_hidden_states"],
return_dict=inputs["return_dict"], return_dict=inputs["return_dict"],
@@ -1207,7 +1224,7 @@ class TFXLNetLMHeadModel(TFXLNetPreTrainedModel, TFCausalLanguageModelingLoss):
def get_output_embeddings(self): def get_output_embeddings(self):
return self.lm_loss.input_embeddings return self.lm_loss.input_embeddings
def prepare_inputs_for_generation(self, inputs, past, **kwargs): def prepare_inputs_for_generation(self, inputs, past, use_mems=None, **kwargs):
# Add dummy token at the end (no attention on this one) # Add dummy token at the end (no attention on this one)
# At every pass, the attention values for the new token and the two last generated tokens # At every pass, the attention values for the new token and the two last generated tokens
@@ -1238,7 +1255,7 @@ class TFXLNetLMHeadModel(TFXLNetPreTrainedModel, TFCausalLanguageModelingLoss):
"input_ids": inputs, "input_ids": inputs,
"perm_mask": perm_mask, "perm_mask": perm_mask,
"target_mapping": target_mapping, "target_mapping": target_mapping,
"use_cache": kwargs["use_cache"], "use_mems": kwargs.get("use_mems"),
} }
# if past is defined in model kwargs then use it for faster decoding # if past is defined in model kwargs then use it for faster decoding
@@ -1260,7 +1277,7 @@ class TFXLNetLMHeadModel(TFXLNetPreTrainedModel, TFCausalLanguageModelingLoss):
input_mask=None, input_mask=None,
head_mask=None, head_mask=None,
inputs_embeds=None, inputs_embeds=None,
use_cache=True, use_mems=None,
output_attentions=None, output_attentions=None,
output_hidden_states=None, output_hidden_states=None,
return_dict=None, return_dict=None,
@@ -1309,7 +1326,7 @@ class TFXLNetLMHeadModel(TFXLNetPreTrainedModel, TFCausalLanguageModelingLoss):
input_mask=input_mask, input_mask=input_mask,
head_mask=head_mask, head_mask=head_mask,
inputs_embeds=inputs_embeds, inputs_embeds=inputs_embeds,
use_cache=use_cache, use_mems=use_mems,
output_attentions=output_attentions, output_attentions=output_attentions,
output_hidden_states=output_hidden_states, output_hidden_states=output_hidden_states,
return_dict=return_dict, return_dict=return_dict,
@@ -1328,7 +1345,7 @@ class TFXLNetLMHeadModel(TFXLNetPreTrainedModel, TFCausalLanguageModelingLoss):
input_mask=inputs["input_mask"], input_mask=inputs["input_mask"],
head_mask=inputs["head_mask"], head_mask=inputs["head_mask"],
inputs_embeds=inputs["inputs_embeds"], inputs_embeds=inputs["inputs_embeds"],
use_cache=inputs["use_cache"], use_mems=inputs["use_mems"],
output_attentions=inputs["output_attentions"], output_attentions=inputs["output_attentions"],
output_hidden_states=inputs["output_hidden_states"], output_hidden_states=inputs["output_hidden_states"],
return_dict=return_dict, return_dict=return_dict,
@@ -1395,7 +1412,7 @@ class TFXLNetForSequenceClassification(TFXLNetPreTrainedModel, TFSequenceClassif
input_mask=None, input_mask=None,
head_mask=None, head_mask=None,
inputs_embeds=None, inputs_embeds=None,
use_cache=True, use_mems=None,
output_attentions=None, output_attentions=None,
output_hidden_states=None, output_hidden_states=None,
return_dict=None, return_dict=None,
@@ -1420,7 +1437,7 @@ class TFXLNetForSequenceClassification(TFXLNetPreTrainedModel, TFSequenceClassif
input_mask=input_mask, input_mask=input_mask,
head_mask=head_mask, head_mask=head_mask,
inputs_embeds=inputs_embeds, inputs_embeds=inputs_embeds,
use_cache=use_cache, use_mems=use_mems,
output_attentions=output_attentions, output_attentions=output_attentions,
output_hidden_states=output_hidden_states, output_hidden_states=output_hidden_states,
return_dict=return_dict, return_dict=return_dict,
@@ -1439,7 +1456,7 @@ class TFXLNetForSequenceClassification(TFXLNetPreTrainedModel, TFSequenceClassif
input_mask=inputs["input_mask"], input_mask=inputs["input_mask"],
head_mask=inputs["head_mask"], head_mask=inputs["head_mask"],
inputs_embeds=inputs["inputs_embeds"], inputs_embeds=inputs["inputs_embeds"],
use_cache=inputs["use_cache"], use_mems=inputs["use_mems"],
output_attentions=inputs["output_attentions"], output_attentions=inputs["output_attentions"],
output_hidden_states=inputs["output_hidden_states"], output_hidden_states=inputs["output_hidden_states"],
return_dict=return_dict, return_dict=return_dict,
@@ -1512,7 +1529,7 @@ class TFXLNetForMultipleChoice(TFXLNetPreTrainedModel, TFMultipleChoiceLoss):
target_mapping=None, target_mapping=None,
head_mask=None, head_mask=None,
inputs_embeds=None, inputs_embeds=None,
use_cache=True, use_mems=None,
output_attentions=None, output_attentions=None,
output_hidden_states=None, output_hidden_states=None,
return_dict=None, return_dict=None,
@@ -1526,6 +1543,7 @@ class TFXLNetForMultipleChoice(TFXLNetPreTrainedModel, TFMultipleChoiceLoss):
num_choices]`` where :obj:`num_choices` is the size of the second dimension of the input tensors. (See num_choices]`` where :obj:`num_choices` is the size of the second dimension of the input tensors. (See
:obj:`input_ids` above) :obj:`input_ids` above)
""" """
inputs = input_processing( inputs = input_processing(
func=self.call, func=self.call,
input_ids=input_ids, input_ids=input_ids,
@@ -1537,7 +1555,7 @@ class TFXLNetForMultipleChoice(TFXLNetPreTrainedModel, TFMultipleChoiceLoss):
input_mask=input_mask, input_mask=input_mask,
head_mask=head_mask, head_mask=head_mask,
inputs_embeds=inputs_embeds, inputs_embeds=inputs_embeds,
use_cache=use_cache, use_mems=use_mems,
output_attentions=output_attentions, output_attentions=output_attentions,
output_hidden_states=output_hidden_states, output_hidden_states=output_hidden_states,
return_dict=return_dict, return_dict=return_dict,
@@ -1579,7 +1597,7 @@ class TFXLNetForMultipleChoice(TFXLNetPreTrainedModel, TFMultipleChoiceLoss):
flat_input_mask, flat_input_mask,
inputs["head_mask"], inputs["head_mask"],
flat_inputs_embeds, flat_inputs_embeds,
inputs["use_cache"], inputs["use_mems"],
inputs["output_attentions"], inputs["output_attentions"],
inputs["output_hidden_states"], inputs["output_hidden_states"],
return_dict=return_dict, return_dict=return_dict,
@@ -1639,7 +1657,7 @@ class TFXLNetForTokenClassification(TFXLNetPreTrainedModel, TFTokenClassificatio
input_mask=None, input_mask=None,
head_mask=None, head_mask=None,
inputs_embeds=None, inputs_embeds=None,
use_cache=True, use_mems=None,
output_attentions=None, output_attentions=None,
output_hidden_states=None, output_hidden_states=None,
return_dict=None, return_dict=None,
@@ -1663,7 +1681,7 @@ class TFXLNetForTokenClassification(TFXLNetPreTrainedModel, TFTokenClassificatio
input_mask=input_mask, input_mask=input_mask,
head_mask=head_mask, head_mask=head_mask,
inputs_embeds=inputs_embeds, inputs_embeds=inputs_embeds,
use_cache=use_cache, use_mems=use_mems,
output_attentions=output_attentions, output_attentions=output_attentions,
output_hidden_states=output_hidden_states, output_hidden_states=output_hidden_states,
return_dict=return_dict, return_dict=return_dict,
@@ -1682,7 +1700,7 @@ class TFXLNetForTokenClassification(TFXLNetPreTrainedModel, TFTokenClassificatio
input_mask=inputs["input_mask"], input_mask=inputs["input_mask"],
head_mask=inputs["head_mask"], head_mask=inputs["head_mask"],
inputs_embeds=inputs["inputs_embeds"], inputs_embeds=inputs["inputs_embeds"],
use_cache=inputs["use_cache"], use_mems=inputs["use_mems"],
output_attentions=inputs["output_attentions"], output_attentions=inputs["output_attentions"],
output_hidden_states=inputs["output_hidden_states"], output_hidden_states=inputs["output_hidden_states"],
return_dict=return_dict, return_dict=return_dict,
@@ -1739,7 +1757,7 @@ class TFXLNetForQuestionAnsweringSimple(TFXLNetPreTrainedModel, TFQuestionAnswer
input_mask=None, input_mask=None,
head_mask=None, head_mask=None,
inputs_embeds=None, inputs_embeds=None,
use_cache=True, use_mems=None,
output_attentions=None, output_attentions=None,
output_hidden_states=None, output_hidden_states=None,
return_dict=None, return_dict=None,
@@ -1769,7 +1787,7 @@ class TFXLNetForQuestionAnsweringSimple(TFXLNetPreTrainedModel, TFQuestionAnswer
input_mask=input_mask, input_mask=input_mask,
head_mask=head_mask, head_mask=head_mask,
inputs_embeds=inputs_embeds, inputs_embeds=inputs_embeds,
use_cache=use_cache, use_mems=use_mems,
output_attentions=output_attentions, output_attentions=output_attentions,
output_hidden_states=output_hidden_states, output_hidden_states=output_hidden_states,
return_dict=return_dict, return_dict=return_dict,
@@ -1789,7 +1807,7 @@ class TFXLNetForQuestionAnsweringSimple(TFXLNetPreTrainedModel, TFQuestionAnswer
input_mask=inputs["input_mask"], input_mask=inputs["input_mask"],
head_mask=inputs["head_mask"], head_mask=inputs["head_mask"],
inputs_embeds=inputs["inputs_embeds"], inputs_embeds=inputs["inputs_embeds"],
use_cache=inputs["use_cache"], use_mems=inputs["use_mems"],
output_attentions=inputs["output_attentions"], output_attentions=inputs["output_attentions"],
output_hidden_states=inputs["output_hidden_states"], output_hidden_states=inputs["output_hidden_states"],
return_dict=return_dict, return_dict=return_dict,


@@ -16,6 +16,7 @@
""" """
PyTorch XLNet model. PyTorch XLNet model.
""" """
import warnings
from dataclasses import dataclass from dataclasses import dataclass
from typing import List, Optional, Tuple from typing import List, Optional, Tuple
@@ -876,7 +877,7 @@ XLNET_INPUTS_DOCSTRING = r"""
decoding. The token ids which have their past given to this model should not be passed as :obj:`input_ids` decoding. The token ids which have their past given to this model should not be passed as :obj:`input_ids`
as they have already been computed. as they have already been computed.
:obj::obj:`use_cache` has to be set to :obj:`True` to make use of :obj:`mems`. :obj:`use_mems` has to be set to :obj:`True` to make use of :obj:`mems`.
perm_mask (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, sequence_length)`, `optional`): perm_mask (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, sequence_length)`, `optional`):
Mask to indicate the attention pattern for each input token with values selected in ``[0, 1]``: Mask to indicate the attention pattern for each input token with values selected in ``[0, 1]``:
@@ -997,15 +998,15 @@ class XLNetModel(XLNetPreTrainedModel):
curr_out = curr_out[: self.reuse_len] curr_out = curr_out[: self.reuse_len]
if self.mem_len is None or self.mem_len == 0: if self.mem_len is None or self.mem_len == 0:
# If :obj:`use_cache` is active but no `mem_len` is defined, the model behaves like GPT-2 at inference time # If :obj:`use_mems` is active but no `mem_len` is defined, the model behaves like GPT-2 at inference time
# and returns all of the past and current hidden states. # and returns all of the past and current hidden states.
cutoff = 0 cutoff = 0
else: else:
# If :obj:`use_cache` is active and `mem_len` is defined, the model returns the last `mem_len` hidden # If :obj:`use_mems` is active and `mem_len` is defined, the model returns the last `mem_len` hidden
# states. This is the preferred setting for training and long-form generation. # states. This is the preferred setting for training and long-form generation.
cutoff = -self.mem_len cutoff = -self.mem_len
if prev_mem is None: if prev_mem is None:
# if :obj:`use_cache` is active and `mem_len` is defined, the model # if :obj:`use_mems` is active and `mem_len` is defined, the model
new_mem = curr_out[cutoff:] new_mem = curr_out[cutoff:]
else: else:
new_mem = torch.cat([prev_mem, curr_out], dim=0)[cutoff:] new_mem = torch.cat([prev_mem, curr_out], dim=0)[cutoff:]
@@ -1080,10 +1081,11 @@ class XLNetModel(XLNetPreTrainedModel):
input_mask=None, input_mask=None,
head_mask=None, head_mask=None,
inputs_embeds=None, inputs_embeds=None,
use_cache=None, use_mems=None,
output_attentions=None, output_attentions=None,
output_hidden_states=None, output_hidden_states=None,
return_dict=None, return_dict=None,
**kwargs, # delete after deprecation warning is removed
): ):
output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
@@ -1091,7 +1093,18 @@ class XLNetModel(XLNetPreTrainedModel):
output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
) )
return_dict = return_dict if return_dict is not None else self.config.use_return_dict return_dict = return_dict if return_dict is not None else self.config.use_return_dict
use_cache = self.training or (use_cache if use_cache is not None else self.config.use_cache)
if "use_cache" in kwargs:
warnings.warn(
"The `use_cache` argument is deprecated and will be removed in a future version, use `use_mems` instead.",
FutureWarning,
)
use_mems = kwargs["use_cache"]
if self.training:
use_mems = use_mems if use_mems is not None else self.config.use_mems_train
else:
use_mems = use_mems if use_mems is not None else self.config.use_mems_eval
# the original code for XLNet uses shapes [len, bsz] with the batch dimension at the end # the original code for XLNet uses shapes [len, bsz] with the batch dimension at the end
# but we want a unified interface in the library with the batch size on the first dimension # but we want a unified interface in the library with the batch size on the first dimension
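The order in which the PyTorch `forward` resolves the flag can be summarized as follows. This is a minimal sketch, assuming a hypothetical free function `resolve_use_mems` in place of the method body: a legacy `use_cache` keyword wins, then an explicit `use_mems` argument, and only then the mode-specific config default.

```python
import warnings


def resolve_use_mems(use_mems, training, use_mems_train=False, use_mems_eval=True, **kwargs):
    """Hypothetical helper mirroring the flag resolution shown in the diff."""
    if "use_cache" in kwargs:
        warnings.warn(
            "The `use_cache` argument is deprecated and will be removed in a "
            "future version, use `use_mems` instead.",
            FutureWarning,
        )
        # The deprecated keyword replaces whatever was passed as `use_mems`.
        use_mems = kwargs["use_cache"]

    if use_mems is not None:
        # An explicit argument overrides the config defaults in both modes.
        return use_mems
    # Otherwise fall back to the default for the current mode.
    return use_mems_train if training else use_mems_eval
```

Unlike the removed `self.training or ...` expression, training mode no longer forces the memory mechanism on; it is enabled during training only when `use_mems_train` or an explicit argument says so.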
@@ -1222,7 +1235,7 @@ class XLNetModel(XLNetPreTrainedModel):
attentions = [] if output_attentions else None attentions = [] if output_attentions else None
hidden_states = [] if output_hidden_states else None hidden_states = [] if output_hidden_states else None
for i, layer_module in enumerate(self.layer): for i, layer_module in enumerate(self.layer):
if use_cache: if use_mems:
# cache new mems # cache new mems
new_mems = new_mems + (self.cache_mem(output_h, mems[i]),) new_mems = new_mems + (self.cache_mem(output_h, mems[i]),)
if output_hidden_states: if output_hidden_states:
@@ -1253,7 +1266,7 @@ class XLNetModel(XLNetPreTrainedModel):
# Prepare outputs, we transpose back here to shape [bsz, len, hidden_dim] (cf. beginning of forward() method) # Prepare outputs, we transpose back here to shape [bsz, len, hidden_dim] (cf. beginning of forward() method)
output = output.permute(1, 0, 2).contiguous() output = output.permute(1, 0, 2).contiguous()
if not use_cache: if not use_mems:
new_mems = None new_mems = None
if output_hidden_states: if output_hidden_states:
@@ -1299,7 +1312,7 @@ class XLNetLMHeadModel(XLNetPreTrainedModel):
def get_output_embeddings(self): def get_output_embeddings(self):
return self.lm_loss return self.lm_loss
def prepare_inputs_for_generation(self, input_ids, past=None, use_cache=None, **kwargs): def prepare_inputs_for_generation(self, input_ids, past=None, use_mems=None, **kwargs):
# Add dummy token at the end (no attention on this one) # Add dummy token at the end (no attention on this one)
effective_batch_size = input_ids.shape[0] effective_batch_size = input_ids.shape[0]
@@ -1332,7 +1345,7 @@ class XLNetLMHeadModel(XLNetPreTrainedModel):
"input_ids": input_ids, "input_ids": input_ids,
"perm_mask": perm_mask, "perm_mask": perm_mask,
"target_mapping": target_mapping, "target_mapping": target_mapping,
"use_cache": use_cache, "use_mems": use_mems,
} }
# if past is defined in model kwargs then use it for faster decoding # if past is defined in model kwargs then use it for faster decoding
@@ -1355,10 +1368,11 @@ class XLNetLMHeadModel(XLNetPreTrainedModel):
head_mask=None, head_mask=None,
inputs_embeds=None, inputs_embeds=None,
labels=None, labels=None,
use_cache=None, use_mems=None,
output_attentions=None, output_attentions=None,
output_hidden_states=None, output_hidden_states=None,
return_dict=None, return_dict=None,
**kwargs, # delete when `use_cache` is removed in XLNetModel
): ):
r""" r"""
labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, num_predict)`, `optional`): labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, num_predict)`, `optional`):
@@ -1407,7 +1421,6 @@ class XLNetLMHeadModel(XLNetPreTrainedModel):
>>> next_token_logits = outputs.logits # Logits have shape [target_mapping.size(0), target_mapping.size(1), config.vocab_size] >>> next_token_logits = outputs.logits # Logits have shape [target_mapping.size(0), target_mapping.size(1), config.vocab_size]
""" """
return_dict = return_dict if return_dict is not None else self.config.use_return_dict return_dict = return_dict if return_dict is not None else self.config.use_return_dict
use_cache = self.training or (use_cache if use_cache is not None else self.config.use_cache)
transformer_outputs = self.transformer( transformer_outputs = self.transformer(
input_ids, input_ids,
@@ -1419,10 +1432,11 @@ class XLNetLMHeadModel(XLNetPreTrainedModel):
input_mask=input_mask, input_mask=input_mask,
head_mask=head_mask, head_mask=head_mask,
inputs_embeds=inputs_embeds, inputs_embeds=inputs_embeds,
use_cache=use_cache, use_mems=use_mems,
output_attentions=output_attentions, output_attentions=output_attentions,
output_hidden_states=output_hidden_states, output_hidden_states=output_hidden_states,
return_dict=return_dict, return_dict=return_dict,
**kwargs,
) )
logits = self.lm_loss(transformer_outputs[0]) logits = self.lm_loss(transformer_outputs[0])
@@ -1483,10 +1497,11 @@ class XLNetForSequenceClassification(XLNetPreTrainedModel):
head_mask=None, head_mask=None,
inputs_embeds=None, inputs_embeds=None,
labels=None, labels=None,
use_cache=None, use_mems=None,
output_attentions=None, output_attentions=None,
output_hidden_states=None, output_hidden_states=None,
return_dict=None, return_dict=None,
**kwargs, # delete when `use_cache` is removed in XLNetModel
): ):
r""" r"""
labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`): labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`):
@@ -1495,7 +1510,6 @@ class XLNetForSequenceClassification(XLNetPreTrainedModel):
If ``config.num_labels > 1`` a classification loss is computed (Cross-Entropy). If ``config.num_labels > 1`` a classification loss is computed (Cross-Entropy).
""" """
return_dict = return_dict if return_dict is not None else self.config.use_return_dict return_dict = return_dict if return_dict is not None else self.config.use_return_dict
use_cache = self.training or (use_cache if use_cache is not None else self.config.use_cache)
transformer_outputs = self.transformer( transformer_outputs = self.transformer(
input_ids, input_ids,
@@ -1507,10 +1521,11 @@ class XLNetForSequenceClassification(XLNetPreTrainedModel):
input_mask=input_mask, input_mask=input_mask,
head_mask=head_mask, head_mask=head_mask,
inputs_embeds=inputs_embeds, inputs_embeds=inputs_embeds,
use_cache=use_cache, use_mems=use_mems,
output_attentions=output_attentions, output_attentions=output_attentions,
output_hidden_states=output_hidden_states, output_hidden_states=output_hidden_states,
return_dict=return_dict, return_dict=return_dict,
**kwargs,
) )
output = transformer_outputs[0] output = transformer_outputs[0]
@@ -1576,10 +1591,11 @@ class XLNetForTokenClassification(XLNetPreTrainedModel):
head_mask=None, head_mask=None,
inputs_embeds=None, inputs_embeds=None,
labels=None, labels=None,
use_cache=None, use_mems=None,
output_attentions=None, output_attentions=None,
output_hidden_states=None, output_hidden_states=None,
return_dict=None, return_dict=None,
**kwargs, # delete when `use_cache` is removed in XLNetModel
): ):
r""" r"""
labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`): labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`):
@@ -1588,7 +1604,6 @@ class XLNetForTokenClassification(XLNetPreTrainedModel):
`input_ids` above) `input_ids` above)
""" """
return_dict = return_dict if return_dict is not None else self.config.use_return_dict return_dict = return_dict if return_dict is not None else self.config.use_return_dict
use_cache = self.training or (use_cache if use_cache is not None else self.config.use_cache)
outputs = self.transformer( outputs = self.transformer(
input_ids, input_ids,
@@ -1600,7 +1615,7 @@ class XLNetForTokenClassification(XLNetPreTrainedModel):
input_mask=input_mask, input_mask=input_mask,
head_mask=head_mask, head_mask=head_mask,
inputs_embeds=inputs_embeds, inputs_embeds=inputs_embeds,
use_cache=use_cache, use_mems=use_mems,
output_attentions=output_attentions, output_attentions=output_attentions,
output_hidden_states=output_hidden_states, output_hidden_states=output_hidden_states,
return_dict=return_dict, return_dict=return_dict,
@@ -1673,10 +1688,11 @@ class XLNetForMultipleChoice(XLNetPreTrainedModel):
head_mask=None, head_mask=None,
inputs_embeds=None, inputs_embeds=None,
labels=None, labels=None,
use_cache=None, use_mems=None,
output_attentions=None, output_attentions=None,
output_hidden_states=None, output_hidden_states=None,
return_dict=None, return_dict=None,
**kwargs, # delete when `use_cache` is removed in XLNetModel
): ):
r""" r"""
labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`): labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`):
@@ -1685,7 +1701,7 @@ class XLNetForMultipleChoice(XLNetPreTrainedModel):
:obj:`input_ids` above) :obj:`input_ids` above)
""" """
return_dict = return_dict if return_dict is not None else self.config.use_return_dict return_dict = return_dict if return_dict is not None else self.config.use_return_dict
use_cache = self.training or (use_cache if use_cache is not None else self.config.use_cache)
num_choices = input_ids.shape[1] if input_ids is not None else inputs_embeds.shape[1] num_choices = input_ids.shape[1] if input_ids is not None else inputs_embeds.shape[1]
flat_input_ids = input_ids.view(-1, input_ids.size(-1)) if input_ids is not None else None flat_input_ids = input_ids.view(-1, input_ids.size(-1)) if input_ids is not None else None
@@ -1708,10 +1724,11 @@ class XLNetForMultipleChoice(XLNetPreTrainedModel):
target_mapping=target_mapping, target_mapping=target_mapping,
head_mask=head_mask, head_mask=head_mask,
inputs_embeds=flat_inputs_embeds, inputs_embeds=flat_inputs_embeds,
use_cache=use_cache, use_mems=use_mems,
output_attentions=output_attentions, output_attentions=output_attentions,
output_hidden_states=output_hidden_states, output_hidden_states=output_hidden_states,
return_dict=return_dict, return_dict=return_dict,
**kwargs,
) )
output = transformer_outputs[0] output = transformer_outputs[0]
@@ -1775,10 +1792,11 @@ class XLNetForQuestionAnsweringSimple(XLNetPreTrainedModel):
inputs_embeds=None, inputs_embeds=None,
start_positions=None, start_positions=None,
end_positions=None, end_positions=None,
use_cache=None, use_mems=None,
output_attentions=None, output_attentions=None,
output_hidden_states=None, output_hidden_states=None,
return_dict=None, return_dict=None,
**kwargs, # delete when `use_cache` is removed in XLNetModel
): ):
r""" r"""
start_positions (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`): start_positions (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`):
@@ -1791,7 +1809,6 @@ class XLNetForQuestionAnsweringSimple(XLNetPreTrainedModel):
             sequence are not taken into account for computing the loss.
         """
         return_dict = return_dict if return_dict is not None else self.config.use_return_dict
-        use_cache = self.training or (use_cache if use_cache is not None else self.config.use_cache)
         outputs = self.transformer(
             input_ids,
@@ -1803,10 +1820,11 @@ class XLNetForQuestionAnsweringSimple(XLNetPreTrainedModel):
             input_mask=input_mask,
             head_mask=head_mask,
             inputs_embeds=inputs_embeds,
-            use_cache=use_cache,
+            use_mems=use_mems,
             output_attentions=output_attentions,
             output_hidden_states=output_hidden_states,
             return_dict=return_dict,
+            **kwargs,
         )
         sequence_output = outputs[0]
@@ -1885,10 +1903,11 @@ class XLNetForQuestionAnswering(XLNetPreTrainedModel):
         is_impossible=None,
         cls_index=None,
         p_mask=None,
-        use_cache=None,
+        use_mems=None,
         output_attentions=None,
         output_hidden_states=None,
         return_dict=None,
+        **kwargs,  # delete when `use_cache` is removed in XLNetModel
     ):
         r"""
         start_positions (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`):
@@ -1926,7 +1945,6 @@ class XLNetForQuestionAnswering(XLNetPreTrainedModel):
             >>> loss = outputs.loss
         """
         return_dict = return_dict if return_dict is not None else self.config.use_return_dict
-        use_cache = self.training or (use_cache if use_cache is not None else self.config.use_cache)
         transformer_outputs = self.transformer(
             input_ids,
@@ -1938,10 +1956,11 @@ class XLNetForQuestionAnswering(XLNetPreTrainedModel):
             input_mask=input_mask,
             head_mask=head_mask,
             inputs_embeds=inputs_embeds,
-            use_cache=use_cache,
+            use_mems=use_mems,
             output_attentions=output_attentions,
             output_hidden_states=output_hidden_states,
             return_dict=return_dict,
+            **kwargs,
         )
         hidden_states = transformer_outputs[0]
         start_logits = self.start_logits(hidden_states, p_mask=p_mask)
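
The `**kwargs` parameters added in the hunks above carry the comment "delete when `use_cache` is removed in XLNetModel", which suggests they exist so that callers still passing `use_cache` keep working. A minimal sketch of such a shim is shown below. The helper name `resolve_use_mems`, the warning text, and the rule that an explicit `use_mems` takes precedence are all assumptions made for illustration, not the library's actual implementation.

```python
import warnings


def resolve_use_mems(use_mems=None, **kwargs):
    """Hypothetical helper: map a legacy `use_cache` keyword onto `use_mems`."""
    if "use_cache" in kwargs:
        # Tell the caller that the old keyword is on its way out.
        warnings.warn(
            "`use_cache` is deprecated; pass `use_mems` instead.",
            FutureWarning,
        )
        # Assumption: the legacy value is used only if `use_mems` was not set.
        if use_mems is None:
            use_mems = kwargs["use_cache"]
    return use_mems
```

With this sketch, `resolve_use_mems(use_cache=True)` warns and returns `True`, while `resolve_use_mems(use_mems=False)` returns `False` without a warning.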

@@ -153,7 +153,7 @@ class TFXLNetModelTester:
         inputs = [input_ids_1, input_mask]
         result = model(inputs)
-        config.mem_len = 0
+        config.use_mems_eval = False
         model = TFXLNetModel(config)
         no_mems_outputs = model(inputs)
         self.parent.assertEqual(len(no_mems_outputs), 1)
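
The test above turns memories off through `config.use_mems_eval` rather than by zeroing `mem_len`. One way to picture the resulting behavior is a per-call flag that falls back to a mode-specific config default. The sketch below is a rough illustration under stated assumptions: `SketchConfig` and `effective_use_mems` are hypothetical names, and both the `use_mems_train` field and the default values are assumed rather than taken from this diff.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SketchConfig:
    # Hypothetical stand-in for the model config.
    # `use_mems_eval` appears in the test above; `use_mems_train` is assumed.
    use_mems_eval: bool = True
    use_mems_train: bool = False


def effective_use_mems(config: SketchConfig, training: bool, use_mems: Optional[bool] = None) -> bool:
    """Return the explicit per-call flag if given, else the config default for the mode."""
    if use_mems is not None:
        return use_mems
    return config.use_mems_train if training else config.use_mems_eval
```

Under these assumptions, setting `use_mems_eval=False` disables memories at evaluation time unless a call passes `use_mems=True` explicitly.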

@@ -206,7 +206,36 @@ class XLNetModelTester:
             [(self.seq_length, self.batch_size, self.hidden_size)] * self.num_hidden_layers,
         )
-    def create_and_check_xlnet_model_use_cache(
+    def create_and_check_use_mems_train(
+        self,
+        config,
+        input_ids_1,
+        input_ids_2,
+        input_ids_q,
+        perm_mask,
+        input_mask,
+        target_mapping,
+        segment_ids,
+        lm_labels,
+        sequence_labels,
+        is_impossible_labels,
+        token_labels,
+    ):
+        model = XLNetForSequenceClassification(config)
+        model.to(torch_device)
+        model.train()
+        train_size = input_ids_1.shape[0]
+        batch_size = 4
+        for i in range(train_size // batch_size + 1):
+            input_ids = input_ids_1[i : (i + 1) * batch_size]
+            labels = sequence_labels[i : (i + 1) * batch_size]
+            outputs = model(input_ids=input_ids, labels=labels, return_dict=True)
+            self.parent.assertIsNone(outputs.mems)
+            self.parent.assertIsNotNone(outputs.loss)
+    def create_and_check_xlnet_model_use_mems(
         self,
         config,
         input_ids_1,
@@ -234,8 +263,8 @@ class XLNetModelTester:
             device=torch_device,
         )
         causal_mask = torch.triu(causal_mask, diagonal=0)
-        outputs_cache = model(input_ids_1, use_cache=True, perm_mask=causal_mask)
-        outputs_no_cache = model(input_ids_1, use_cache=False, perm_mask=causal_mask)
+        outputs_cache = model(input_ids_1, use_mems=True, perm_mask=causal_mask)
+        outputs_no_cache = model(input_ids_1, use_mems=False, perm_mask=causal_mask)
         outputs_conf = model(input_ids_1)
         self.parent.assertTrue(len(outputs_cache) == len(outputs_conf))
@@ -525,11 +554,15 @@ class XLNetModelTest(ModelTesterMixin, GenerationTesterMixin, unittest.TestCase)
         config_and_inputs = self.model_tester.prepare_config_and_inputs()
         self.model_tester.create_and_check_xlnet_base_model(*config_and_inputs)
-    def test_xlnet_base_model_use_cache(self):
-        # checking that in auto-regressive mode, :obj:`use_cache` gives the same results
+    def test_xlnet_base_model_use_mems(self):
+        # checking that in auto-regressive mode, :obj:`use_mems` gives the same results
         self.model_tester.set_seed()
         config_and_inputs = self.model_tester.prepare_config_and_inputs()
-        self.model_tester.create_and_check_xlnet_model_use_cache(*config_and_inputs)
+        self.model_tester.create_and_check_xlnet_model_use_mems(*config_and_inputs)
+    def test_seq_classification_use_mems_train(self):
+        config_and_inputs = self.model_tester.prepare_config_and_inputs()
+        self.model_tester.create_and_check_use_mems_train(*config_and_inputs)
     def test_xlnet_base_model_with_att_output(self):
         self.model_tester.set_seed()