[XLNet] Fix mems behavior (#8567)
* fix mems in xlnet
* fix use_mems
* fix use_mem_len
* fix use mems
* clean docs
* fix tf typo
* make xlnet tf for generation work
* fix tf test
* refactor use cache
* add use cache for missing models
* correct use_cache in generate
* correct use cache in tf generate
* fix tf
* correct getattr typo
* make sylvain happy
* change in docs as well
* do not apply to cookie cutter statements
* fix tf test
* make pytorch model fully backward compatible
committed by GitHub
parent 369f1d77b4
commit 2a6fbe6a40
@@ -97,6 +97,6 @@ You should check out our [swift-coreml-transformers](https://github.com/huggingf
 It contains a set of tools to convert PyTorch or TensorFlow 2.0 trained Transformer models (currently contains `GPT-2`,
 `DistilGPT-2`, `BERT`, and `DistilBERT`) to CoreML models that run on iOS devices.
 
-At some point in the future, you'll be able to seamlessly move from pre-training or fine-tuning models in PyTorch or
+At some point in the future, you'll be able to seamlessly move from pretraining or fine-tuning models in PyTorch or
 TensorFlow 2.0 to productizing them in CoreML, or prototype a model or an app in CoreML then research its
 hyperparameters or architecture from PyTorch or TensorFlow 2.0. Super exciting!

@@ -10,7 +10,7 @@ Tasks <https://arxiv.org/abs/1907.12461>`__ by Sascha Rothe, Shashi Narayan, Ali
 
 The abstract from the paper is the following:
 
-*Unsupervised pre-training of large neural models has recently revolutionized Natural Language Processing. By
+*Unsupervised pretraining of large neural models has recently revolutionized Natural Language Processing. By
 warm-starting from the publicly released checkpoints, NLP practitioners have pushed the state-of-the-art on multiple
 benchmarks while saving significant amounts of compute time. So far the focus has been mainly on the Natural Language
 Understanding tasks. In this paper, we demonstrate the efficacy of pre-trained checkpoints for Sequence Generation. We

@@ -20,8 +20,8 @@ disentangled attention mechanism, where each word is represented using two vecto
 position, respectively, and the attention weights among words are computed using disentangled matrices on their
 contents and relative positions. Second, an enhanced mask decoder is used to replace the output softmax layer to
 predict the masked tokens for model pretraining. We show that these two techniques significantly improve the efficiency
-of model pre-training and performance of downstream tasks. Compared to RoBERTa-Large, a DeBERTa model trained on half
-of the training data performs consistently better on a wide range of NLP tasks, achieving improvements on MNLI by +0.9%
+of model pretraining and performance of downstream tasks. Compared to RoBERTa-Large, a DeBERTa model trained on half of
+the training data performs consistently better on a wide range of NLP tasks, achieving improvements on MNLI by +0.9%
 (90.2% vs. 91.1%), on SQuAD v2.0 by +2.3% (88.4% vs. 90.7%) and RACE by +3.6% (83.2% vs. 86.8%). The DeBERTa code and
 pre-trained models will be made publicly available at https://github.com/microsoft/DeBERTa.*
 

@@ -18,9 +18,9 @@ operating these large models in on-the-edge and/or under constrained computation
 remains challenging. In this work, we propose a method to pre-train a smaller general-purpose language representation
 model, called DistilBERT, which can then be fine-tuned with good performances on a wide range of tasks like its larger
 counterparts. While most prior work investigated the use of distillation for building task-specific models, we leverage
-knowledge distillation during the pre-training phase and show that it is possible to reduce the size of a BERT model by
+knowledge distillation during the pretraining phase and show that it is possible to reduce the size of a BERT model by
 40%, while retaining 97% of its language understanding capabilities and being 60% faster. To leverage the inductive
-biases learned by larger models during pre-training, we introduce a triple loss combining language modeling,
+biases learned by larger models during pretraining, we introduce a triple loss combining language modeling,
 distillation and cosine-distance losses. Our smaller, faster and lighter model is cheaper to pre-train and we
 demonstrate its capabilities for on-device computations in a proof-of-concept experiment and a comparative on-device
 study.*

@@ -12,14 +12,14 @@ identify which tokens were replaced by the generator in the sequence.
 
 The abstract from the paper is the following:
 
-*Masked language modeling (MLM) pre-training methods such as BERT corrupt the input by replacing some tokens with
-[MASK] and then train a model to reconstruct the original tokens. While they produce good results when transferred to
+*Masked language modeling (MLM) pretraining methods such as BERT corrupt the input by replacing some tokens with [MASK]
+and then train a model to reconstruct the original tokens. While they produce good results when transferred to
 downstream NLP tasks, they generally require large amounts of compute to be effective. As an alternative, we propose a
-more sample-efficient pre-training task called replaced token detection. Instead of masking the input, our approach
+more sample-efficient pretraining task called replaced token detection. Instead of masking the input, our approach
 corrupts it by replacing some tokens with plausible alternatives sampled from a small generator network. Then, instead
 of training a model that predicts the original identities of the corrupted tokens, we train a discriminative model that
 predicts whether each token in the corrupted input was replaced by a generator sample or not. Thorough experiments
-demonstrate this new pre-training task is more efficient than MLM because the task is defined over all input tokens
+demonstrate this new pretraining task is more efficient than MLM because the task is defined over all input tokens
 rather than just the small subset that was masked out. As a result, the contextual representations learned by our
 approach substantially outperform the ones learned by BERT given the same model size, data, and compute. The gains are
 particularly strong for small models; for example, we train a model on one GPU for 4 days that outperforms GPT (trained

@@ -19,7 +19,7 @@ representations (Dai and Le, 2015; Peters et al., 2018; Howard and Ruder, 2018;
 heterogeneous French corpus. Models of different sizes are trained using the new CNRS (French National Centre for
 Scientific Research) Jean Zay supercomputer. We apply our French language models to diverse NLP tasks (text
 classification, paraphrasing, natural language inference, parsing, word sense disambiguation) and show that most of the
-time they outperform other pre-training approaches. Different versions of FlauBERT as well as a unified evaluation
+time they outperform other pretraining approaches. Different versions of FlauBERT as well as a unified evaluation
 protocol for the downstream tasks, called FLUE (French Language Understanding Evaluation), are shared to the research
 community for further reproducible experiments in French NLP.*
 

@@ -14,7 +14,7 @@ The abstract from the paper is the following:
 *Natural language understanding comprises a wide range of diverse tasks such as textual entailment, question answering,
 semantic similarity assessment, and document classification. Although large unlabeled text corpora are abundant,
 labeled data for learning these specific tasks is scarce, making it challenging for discriminatively trained models to
-perform adequately. We demonstrate that large gains on these tasks can be realized by generative pre-training of a
+perform adequately. We demonstrate that large gains on these tasks can be realized by generative pretraining of a
 language model on a diverse corpus of unlabeled text, followed by discriminative fine-tuning on each specific task. In
 contrast to previous approaches, we make use of task-aware input transformations during fine-tuning to achieve
 effective transfer while requiring minimal changes to the model architecture. We demonstrate the effectiveness of our

@@ -6,19 +6,19 @@ Overview
 
 The LayoutLM model was proposed in the paper `LayoutLM: Pre-training of Text and Layout for Document Image
 Understanding <https://arxiv.org/abs/1912.13318>`__ by Yiheng Xu, Minghao Li, Lei Cui, Shaohan Huang, Furu Wei, and
-Ming Zhou. It's a simple but effective pre-training method of text and layout for document image understanding and
+Ming Zhou. It's a simple but effective pretraining method of text and layout for document image understanding and
 information extraction tasks, such as form understanding and receipt understanding.
 
 The abstract from the paper is the following:
 
 *Pre-training techniques have been verified successfully in a variety of NLP tasks in recent years. Despite the
-widespread use of pre-training models for NLP applications, they almost exclusively focus on text-level manipulation,
+widespread use of pretraining models for NLP applications, they almost exclusively focus on text-level manipulation,
 while neglecting layout and style information that is vital for document image understanding. In this paper, we propose
 the \textbf{LayoutLM} to jointly model interactions between text and layout information across scanned document images,
 which is beneficial for a great number of real-world document image understanding tasks such as information extraction
 from scanned documents. Furthermore, we also leverage image features to incorporate words' visual information into
 LayoutLM. To the best of our knowledge, this is the first time that text and layout are jointly learned in a single
-framework for document-level pre-training. It achieves new state-of-the-art results in several downstream tasks,
+framework for document-level pretraining. It achieves new state-of-the-art results in several downstream tasks,
 including form understanding (from 70.72 to 79.27), receipt understanding (from 94.02 to 95.24) and document image
 classification (from 93.07 to 94.42).*
 

@@ -19,7 +19,7 @@ Encoder Representations from Transformers) framework to learn these vision-and-l
 build a large-scale Transformer model that consists of three encoders: an object relationship encoder, a language
 encoder, and a cross-modality encoder. Next, to endow our model with the capability of connecting vision and language
 semantics, we pre-train the model with large amounts of image-and-sentence pairs, via five diverse representative
-pre-training tasks: masked language modeling, masked object prediction (feature regression and label classification),
+pretraining tasks: masked language modeling, masked object prediction (feature regression and label classification),
 cross-modality matching, and image question answering. These tasks help in learning both intra-modality and
 cross-modality relationships. After fine-tuning from our pretrained parameters, our model achieves the state-of-the-art
 results on two visual question answering datasets (i.e., VQA and GQA). We also show the generalizability of our

@@ -13,7 +13,7 @@ The MBart model was presented in `Multilingual Denoising Pre-training for Neural
 Ghazvininejad, Mike Lewis, Luke Zettlemoyer.
 
 According to the abstract, MBART is a sequence-to-sequence denoising auto-encoder pretrained on large-scale monolingual
-corpora in many languages using the BART objective. mBART is one of the first methods for pre-training a complete
+corpora in many languages using the BART objective. mBART is one of the first methods for pretraining a complete
 sequence-to-sequence model by denoising full texts in multiple languages, while previous approaches have focused only
 on the encoder, decoder, or reconstructing parts of the text.
 

@@ -17,7 +17,7 @@ the next token.
 
 The abstract from the paper is the following:
 
-*In this paper, we present a new sequence-to-sequence pre-training model called ProphetNet, which introduces a novel
+*In this paper, we present a new sequence-to-sequence pretraining model called ProphetNet, which introduces a novel
 self-supervised objective named future n-gram prediction and the proposed n-stream self-attention mechanism. Instead of
 the optimization of one-step ahead prediction in traditional sequence-to-sequence model, the ProphetNet is optimized by
 n-step ahead prediction which predicts the next n tokens simultaneously based on previous context tokens at each time
@@ -25,7 +25,7 @@ step. The future n-gram prediction explicitly encourages the model to plan for t
 overfitting on strong local correlations. We pre-train ProphetNet using a base scale dataset (16GB) and a large scale
 dataset (160GB) respectively. Then we conduct experiments on CNN/DailyMail, Gigaword, and SQuAD 1.1 benchmarks for
 abstractive summarization and question generation tasks. Experimental results show that ProphetNet achieves new
-state-of-the-art results on all these datasets compared to the models using the same scale pre-training corpus.*
+state-of-the-art results on all these datasets compared to the models using the same scale pretraining corpus.*
 
 The Authors' code can be found `here <https://github.com/microsoft/ProphetNet>`__.
 

@@ -17,7 +17,7 @@ The abstract from the paper is the following:
 task, has emerged as a powerful technique in natural language processing (NLP). The effectiveness of transfer learning
 has given rise to a diversity of approaches, methodology, and practice. In this paper, we explore the landscape of
 transfer learning techniques for NLP by introducing a unified framework that converts every language problem into a
-text-to-text format. Our systematic study compares pre-training objectives, architectures, unlabeled datasets, transfer
+text-to-text format. Our systematic study compares pretraining objectives, architectures, unlabeled datasets, transfer
 approaches, and other factors on dozens of language understanding tasks. By combining the insights from our exploration
 with scale and our new "Colossal Clean Crawled Corpus", we achieve state-of-the-art results on many benchmarks covering
 summarization, question answering, text classification, and more. To facilitate future work on transfer learning for

@@ -19,7 +19,7 @@ just the next token. Its architecture is identical to ProhpetNet, but the model
 
 The abstract from the paper is the following:
 
-*In this paper, we present a new sequence-to-sequence pre-training model called ProphetNet, which introduces a novel
+*In this paper, we present a new sequence-to-sequence pretraining model called ProphetNet, which introduces a novel
 self-supervised objective named future n-gram prediction and the proposed n-stream self-attention mechanism. Instead of
 the optimization of one-step ahead prediction in traditional sequence-to-sequence model, the ProphetNet is optimized by
 n-step ahead prediction which predicts the next n tokens simultaneously based on previous context tokens at each time
@@ -27,7 +27,7 @@ step. The future n-gram prediction explicitly encourages the model to plan for t
 overfitting on strong local correlations. We pre-train ProphetNet using a base scale dataset (16GB) and a large scale
 dataset (160GB) respectively. Then we conduct experiments on CNN/DailyMail, Gigaword, and SQuAD 1.1 benchmarks for
 abstractive summarization and question generation tasks. Experimental results show that ProphetNet achieves new
-state-of-the-art results on all these datasets compared to the models using the same scale pre-training corpus.*
+state-of-the-art results on all these datasets compared to the models using the same scale pretraining corpus.*
 
 The Authors' code can be found `here <https://github.com/microsoft/ProphetNet>`__.
 

@@ -527,7 +527,7 @@ Pegasus
 <https://arxiv.org/pdf/1912.08777.pdf>`_, Jingqing Zhang, Yao Zhao, Mohammad Saleh and Peter J. Liu on Dec 18, 2019.
 
 Sequence-to-sequence model with the same encoder-decoder model architecture as BART. Pegasus is pre-trained jointly on
-two self-supervised objective functions: Masked Language Modeling (MLM) and a novel summarization specific pre-training
+two self-supervised objective functions: Masked Language Modeling (MLM) and a novel summarization specific pretraining
 objective, called Gap Sentence Generation (GSG).
 
 * MLM: encoder input tokens are randomly replaced by a mask tokens and have to be predicted by the encoder (like in
@@ -609,7 +609,7 @@ MT5
 `mT5: A massively multilingual pre-trained text-to-text transformer <https://arxiv.org/abs/2010.11934>`_, Linting Xue
 et al.
 
-The model architecture is same as T5. mT5's pre-training objective includes T5's self-supervised training, but not T5's
+The model architecture is same as T5. mT5's pretraining objective includes T5's self-supervised training, but not T5's
 supervised training. mT5 is trained on 101 languages.
 
 The library provides a version of this model for conditional generation.
@@ -630,8 +630,8 @@ MBart
 `Multilingual Denoising Pre-training for Neural Machine Translation <https://arxiv.org/abs/2001.08210>`_ by Yinhan Liu,
 Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov Marjan Ghazvininejad, Mike Lewis, Luke Zettlemoyer.
 
-The model architecture and pre-training objective is same as BART, but MBart is trained on 25 languages and is intended
-for supervised and unsupervised machine translation. MBart is one of the first methods for pre-training a complete
+The model architecture and pretraining objective is same as BART, but MBart is trained on 25 languages and is intended
+for supervised and unsupervised machine translation. MBart is one of the first methods for pretraining a complete
 sequence-to-sequence model by denoising full texts in multiple languages,
 
 The library provides a version of this model for conditional generation.
@@ -658,7 +658,7 @@ ProphetNet
 `ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training, <https://arxiv.org/abs/2001.04063>`__ by
 Yu Yan, Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang, Ming Zhou.
 
-ProphetNet introduces a novel *sequence-to-sequence* pre-training objective, called *future n-gram prediction*. In
+ProphetNet introduces a novel *sequence-to-sequence* pretraining objective, called *future n-gram prediction*. In
 future n-gram prediction, the model predicts the next n tokens simultaneously based on previous context tokens at each
 time step instead instead of just the single next token. The future n-gram prediction explicitly encourages the model
 to plan for the future tokens and prevent overfitting on strong local correlations. The model architecture is based on
@@ -683,8 +683,8 @@ XLM-ProphetNet
 `ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training, <https://arxiv.org/abs/2001.04063>`__ by
 Yu Yan, Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang, Ming Zhou.
 
-XLM-ProphetNet's model architecture and pre-training objective is same as ProphetNet, but XLM-ProphetNet was
-pre-trained on the cross-lingual dataset `XGLUE <https://arxiv.org/abs/2004.01401>`__.
+XLM-ProphetNet's model architecture and pretraining objective is same as ProphetNet, but XLM-ProphetNet was pre-trained
+on the cross-lingual dataset `XGLUE <https://arxiv.org/abs/2004.01401>`__.
 
 The library provides a pre-trained version of this model for multi-lingual conditional generation and fine-tuned
 versions for headline generation and question generation, respectively.

@@ -305,7 +305,7 @@ Language modeling is the task of fitting a model to a corpus, which can be domai
transformer-based models are trained using a variant of language modeling, e.g. BERT with masked language modeling,
GPT-2 with causal language modeling.

- Language modeling can be useful outside of pre-training as well, for example to shift the model distribution to be
+ Language modeling can be useful outside of pretraining as well, for example to shift the model distribution to be
domain-specific: using a language model trained over a very large corpus, and then fine-tuning it to a news dataset or
on scientific papers e.g. `LysandreJik/arxiv-nlp <https://huggingface.co/lysandre/arxiv-nlp>`__.

|
|||||||
@@ -55,8 +55,6 @@ class PretrainedConfig(object):
Whether or not the model should return all hidden-states.
output_attentions (:obj:`bool`, `optional`, defaults to :obj:`False`):
Whether or not the model should returns all attentions.
- use_cache (:obj:`bool`, `optional`, defaults to :obj:`True`):
- Whether or not the model should return the last key/values attentions (not used by all models).
return_dict (:obj:`bool`, `optional`, defaults to :obj:`True`):
Whether or not the model should return a :class:`~transformers.file_utils.ModelOutput` instead of a plain
tuple.
@@ -168,7 +166,6 @@ class PretrainedConfig(object):
self.return_dict = kwargs.pop("return_dict", True)
self.output_hidden_states = kwargs.pop("output_hidden_states", False)
self.output_attentions = kwargs.pop("output_attentions", False)
- self.use_cache = kwargs.pop("use_cache", True) # Not used by all models
self.torchscript = kwargs.pop("torchscript", False) # Only used by PyTorch models
self.use_bfloat16 = kwargs.pop("use_bfloat16", False)
self.pruned_heads = kwargs.pop("pruned_heads", {})
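The two hunks above drop `use_cache` from the shared base configuration; elsewhere in this commit, individual model configs declare it themselves. A minimal sketch of what that means for calling code is given below. The classes `BaseConfig`, `CachingConfig`, `PlainConfig` and the helper `wants_cache` are hypothetical stand-ins written for this illustration, not classes from the library.

```python
class BaseConfig:
    """Stand-in for a shared base config that no longer sets `use_cache`."""

    def __init__(self, **kwargs):
        self.return_dict = kwargs.pop("return_dict", True)
        self.output_attentions = kwargs.pop("output_attentions", False)


class CachingConfig(BaseConfig):
    """Stand-in for a model config that supports caching and declares the flag itself."""

    def __init__(self, use_cache=True, **kwargs):
        super().__init__(**kwargs)
        self.use_cache = use_cache


class PlainConfig(BaseConfig):
    """Stand-in for a model config with no notion of a cache."""


def wants_cache(config) -> bool:
    # The attribute may be absent now, so read it defensively.
    return getattr(config, "use_cache", False)
```

Reading the flag through `getattr` with a `False` default keeps generic code working for configs that never define the attribute.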
|
|||||||
@@ -229,7 +229,7 @@ class LineByLineWithSOPTextDataset(Dataset):
# to `block_size` anyways, so short sequences are generally wasted
# computation. However, we *sometimes*
# (i.e., short_seq_prob == 0.1 == 10% of the time) want to use shorter
- # sequences to minimize the mismatch between pre-training and fine-tuning.
+ # sequences to minimize the mismatch between pretraining and fine-tuning.
# The `target_seq_length` is just a rough target however, whereas
# `block_size` is a hard limit.
target_seq_length = max_num_tokens
@@ -425,7 +425,7 @@ class TextDatasetForNextSentencePrediction(Dataset):
# to `block_size` anyways, so short sequences are generally wasted
# computation. However, we *sometimes*
# (i.e., short_seq_prob == 0.1 == 10% of the time) want to use shorter
- # sequences to minimize the mismatch between pre-training and fine-tuning.
+ # sequences to minimize the mismatch between pretraining and fine-tuning.
# The `target_seq_length` is just a rough target however, whereas
# `block_size` is a hard limit.
target_seq_length = max_num_tokens
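The comment in both dataset hunks describes occasionally using a shorter target length, but the hunks stop at `target_seq_length = max_num_tokens`. The sketch below shows one way the "10% of the time" rule could be applied. The function name, the `rng` parameter and the lower bound of 2 tokens are assumptions made for this example, since the sampling code itself is not visible in the diff.

```python
import random
from typing import Optional


def pick_target_seq_length(
    max_num_tokens: int,
    short_seq_prob: float = 0.1,
    rng: Optional[random.Random] = None,
) -> int:
    """Return the full length most of the time, and a random shorter one otherwise."""
    if rng is None:
        rng = random.Random()
    # Default: fill the block completely.
    target_seq_length = max_num_tokens
    # With probability `short_seq_prob`, sample a shorter rough target instead.
    if rng.random() < short_seq_prob:
        target_seq_length = rng.randint(2, max_num_tokens)
    return target_seq_length
```

Passing a seeded `random.Random` makes the choice reproducible; as the comment in the diff notes, the result is only a rough target while `block_size` remains the hard limit.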
|
|||||||
@@ -38,6 +38,7 @@ class TFGenerationMixin:

def _use_cache(self, outputs, use_cache):
"""During generation, decide whether to pass the `past` variable to the next forward pass."""
+ use_cache = getattr(self.config, "use_cache", False)
if len(outputs) <= 1 or use_cache is False:
return False
if hasattr(self.config, "mem_len") and self.config.mem_len == 0:
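The hunk above is cut off after the `mem_len` check, so the rest of `_use_cache` is not visible here. As a rough illustration of the decision it appears to make, the standalone sketch below uses a hypothetical function `should_pass_past` and plain `SimpleNamespace` objects in place of real configs. The two final `return` statements are assumptions about the lines the hunk does not show.

```python
from types import SimpleNamespace


def should_pass_past(config, outputs) -> bool:
    """Decide whether cached state should be fed into the next forward pass."""
    # Read the flag from the config; configs without it default to no caching.
    use_cache = getattr(config, "use_cache", False)
    # A single output means there is nothing besides the logits to reuse.
    if len(outputs) <= 1 or use_cache is False:
        return False
    # Assumption: a memory length of zero means there is no memory to carry over.
    if hasattr(config, "mem_len") and config.mem_len == 0:
        return False
    # Assumption: in every other case the cache is passed along.
    return True


caching = SimpleNamespace(use_cache=True)
no_memory = SimpleNamespace(use_cache=True, mem_len=0)
```

Note that in the hunk the added line overwrites the `use_cache` argument with the config value, so the config becomes the single source of truth inside this helper.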
@@ -194,7 +195,6 @@ class TFGenerationMixin:
min_length = min_length if min_length is not None else self.config.min_length
do_sample = do_sample if do_sample is not None else self.config.do_sample
early_stopping = early_stopping if early_stopping is not None else self.config.early_stopping
- use_cache = use_cache if use_cache is not None else self.config.use_cache
num_beams = num_beams if num_beams is not None else self.config.num_beams
temperature = temperature if temperature is not None else self.config.temperature
top_k = top_k if top_k is not None else self.config.top_k
@@ -224,7 +224,6 @@ class TFGenerationMixin:
assert isinstance(min_length, int) and min_length >= 0, "`min_length` should be a positive integer."
assert isinstance(do_sample, bool), "`do_sample` should be a boolean."
assert isinstance(early_stopping, bool), "`early_stopping` should be a boolean."
- assert isinstance(use_cache, bool), "`use_cache` should be a boolean."
assert isinstance(num_beams, int) and num_beams > 0, "`num_beams` should be a strictly positive integer."
assert temperature > 0, "`temperature` should be strictly positive."
assert isinstance(top_k, int) and top_k >= 0, "`top_k` should be a positive integer."
|
|||||||
@@ -462,7 +462,6 @@ class GenerationMixin:
pad_token_id = pad_token_id if pad_token_id is not None else self.config.pad_token_id
bos_token_id = bos_token_id if bos_token_id is not None else self.config.bos_token_id
eos_token_id = eos_token_id if eos_token_id is not None else self.config.eos_token_id
- use_cache = use_cache if use_cache is not None else self.config.use_cache

if input_ids is None:
# init `input_ids` with bos_token_id
|
|||||||
@@ -730,7 +730,7 @@ class AlbertModel(AlbertPreTrainedModel):
|
|||||||
|
|
||||||
@add_start_docstrings(
|
@add_start_docstrings(
|
||||||
"""
|
"""
|
||||||
Albert Model with two heads on top as done during the pre-training: a `masked language modeling` head and a
|
Albert Model with two heads on top as done during the pretraining: a `masked language modeling` head and a
|
||||||
`sentence order prediction (classification)` head.
|
`sentence order prediction (classification)` head.
|
||||||
""",
|
""",
|
||||||
ALBERT_START_DOCSTRING,
|
ALBERT_START_DOCSTRING,
|
||||||
|
|||||||
@@ -809,7 +809,7 @@ class TFAlbertModel(TFAlbertPreTrainedModel):
|
|||||||
|
|
||||||
@add_start_docstrings(
|
@add_start_docstrings(
|
||||||
"""
|
"""
|
||||||
Albert Model with two heads on top for pre-training: a `masked language modeling` head and a `sentence order
|
Albert Model with two heads on top for pretraining: a `masked language modeling` head and a `sentence order
|
||||||
prediction` (classification) head.
|
prediction` (classification) head.
|
||||||
""",
|
""",
|
||||||
ALBERT_START_DOCSTRING,
|
ALBERT_START_DOCSTRING,
|
||||||
|
|||||||
@@ -108,6 +108,8 @@ class BartConfig(PretrainedConfig):
force_bos_token_to_be_generated (:obj:`bool`, `optional`, defaults to :obj:`False`):
Whether or not to force BOS token to be generated at step 1 (after ``decoder_start_token_id``), only
:obj:`True` for `bart-large-cnn`.
+ use_cache (:obj:`bool`, `optional`, defaults to :obj:`True`):
+ Whether or not the model should return the last key/values attentions (not used by all models).
"""
model_type = "bart"
keys_to_ignore_at_inference = ["past_key_values"]
@@ -134,9 +136,6 @@ class BartConfig(PretrainedConfig):
classifier_dropout=0.0,
num_labels=3,
is_encoder_decoder=True,
- pad_token_id=1,
- bos_token_id=0,
- eos_token_id=2,
normalize_before=False,
add_final_layer_norm=False,
do_blenderbot_90_layernorm=False,
@@ -145,6 +144,10 @@ class BartConfig(PretrainedConfig):
static_position_embeddings=False,
add_bias_logits=False,
force_bos_token_to_be_generated=False,
+ use_cache=True,
+ pad_token_id=1,
+ bos_token_id=0,
+ eos_token_id=2,
**common_kwargs
):
r"""
@@ -208,6 +211,8 @@ class BartConfig(PretrainedConfig):

self.do_blenderbot_90_layernorm = do_blenderbot_90_layernorm

+ self.use_cache = use_cache
+
@property
def num_attention_heads(self) -> int:
return self.encoder_attention_heads
|
|||||||
@@ -888,7 +888,7 @@ class BertModel(BertPreTrainedModel):
|
|||||||
|
|
||||||
@add_start_docstrings(
|
@add_start_docstrings(
|
||||||
"""
|
"""
|
||||||
Bert Model with two heads on top as done during the pre-training: a `masked language modeling` head and a `next
|
Bert Model with two heads on top as done during the pretraining: a `masked language modeling` head and a `next
|
||||||
sentence prediction (classification)` head.
|
sentence prediction (classification)` head.
|
||||||
""",
|
""",
|
||||||
BERT_START_DOCSTRING,
|
BERT_START_DOCSTRING,
|
||||||
|
|||||||
@@ -90,7 +90,7 @@ TF_BERT_PRETRAINED_MODEL_ARCHIVE_LIST = [

class TFBertPreTrainingLoss:
"""
- Loss function suitable for BERT-like pre-training, that is, the task of pretraining a language model by combining
+ Loss function suitable for BERT-like pretraining, that is, the task of pretraining a language model by combining
NSP + MLM. .. note:: Any label of -100 will be ignored (along with the corresponding logits) in the loss
computation.
"""
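The docstring above notes that any label of -100 is ignored, together with its logits, when the loss is computed. The dependency-free sketch below illustrates that convention with a plain cross-entropy over lists of floats. The name `masked_cross_entropy`, the averaging over kept positions, and the `0.0` result when every label is ignored are choices made for this example, not the behavior of the TensorFlow class in the diff.

```python
import math
from typing import List, Sequence

IGNORE_INDEX = -100  # label value that marks a position as "do not score"


def masked_cross_entropy(logits: Sequence[Sequence[float]], labels: List[int]) -> float:
    """Mean cross-entropy over positions whose label is not IGNORE_INDEX."""
    total, count = 0.0, 0
    for row, label in zip(logits, labels):
        if label == IGNORE_INDEX:
            # Skip both the label and its logits.
            continue
        # Numerically stable log-sum-exp of the row.
        row_max = max(row)
        log_z = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_z - row[label]
        count += 1
    return total / count if count else 0.0


# The second position is ignored, so only the first row contributes.
loss = masked_cross_entropy([[0.0, 0.0], [5.0, -5.0]], [1, IGNORE_INDEX])
```

Because ignored rows are skipped entirely, changing their logits has no effect on the returned value.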
@@ -878,7 +878,7 @@ class TFBertModel(TFBertPreTrainedModel):
|
|||||||
|
|
||||||
@add_start_docstrings(
|
@add_start_docstrings(
|
||||||
"""
|
"""
|
||||||
Bert Model with two heads on top as done during the pre-training:
|
Bert Model with two heads on top as done during the pretraining:
|
||||||
a `masked language modeling` head and a `next sentence prediction (classification)` head.
|
a `masked language modeling` head and a `next sentence prediction (classification)` head.
|
||||||
""",
|
""",
|
||||||
BERT_START_DOCSTRING,
|
BERT_START_DOCSTRING,
|
||||||
|
|||||||
@@ -80,7 +80,7 @@ class BertweetTokenizer(PreTrainedTokenizer):
|
|||||||
normalization (:obj:`bool`, `optional`, defaults to :obj:`False`)
|
normalization (:obj:`bool`, `optional`, defaults to :obj:`False`)
|
||||||
Whether or not to apply a normalization preprocess.
|
Whether or not to apply a normalization preprocess.
|
||||||
bos_token (:obj:`str`, `optional`, defaults to :obj:`"<s>"`):
|
bos_token (:obj:`str`, `optional`, defaults to :obj:`"<s>"`):
|
||||||
The beginning of sequence token that was used during pre-training. Can be used a sequence classifier token.
|
The beginning of sequence token that was used during pretraining. Can be used a sequence classifier token.
|
||||||
|
|
||||||
.. note::
|
.. note::
|
||||||
|
|
||||||
|
|||||||
@@ -61,6 +61,9 @@ class CTRLConfig(PretrainedConfig):
|
|||||||
The epsilon to use in the layer normalization layers
|
The epsilon to use in the layer normalization layers
|
||||||
initializer_range (:obj:`float`, `optional`, defaults to 0.02):
|
initializer_range (:obj:`float`, `optional`, defaults to 0.02):
|
||||||
The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
|
The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
|
||||||
|
use_cache (:obj:`bool`, `optional`, defaults to :obj:`True`):
|
||||||
|
Whether or not the model should return the last key/values attentions (not used by all models).
|
||||||
|
|
||||||
|
|
||||||
Examples::
|
Examples::
|
||||||
|
|
||||||
@@ -98,6 +101,7 @@ class CTRLConfig(PretrainedConfig):
|
|||||||
summary_activation=None,
|
summary_activation=None,
|
||||||
summary_proj_to_labels=True,
|
summary_proj_to_labels=True,
|
||||||
summary_first_dropout=0.1,
|
summary_first_dropout=0.1,
|
||||||
|
use_cache=True,
|
||||||
**kwargs
|
**kwargs
|
||||||
):
|
):
|
||||||
super().__init__(**kwargs)
|
super().__init__(**kwargs)
|
||||||
@@ -119,6 +123,7 @@ class CTRLConfig(PretrainedConfig):
|
|||||||
self.summary_activation = summary_activation
|
self.summary_activation = summary_activation
|
||||||
self.summary_first_dropout = summary_first_dropout
|
self.summary_first_dropout = summary_first_dropout
|
||||||
self.summary_proj_to_labels = summary_proj_to_labels
|
self.summary_proj_to_labels = summary_proj_to_labels
|
||||||
|
self.use_cache = use_cache
|
||||||
|
|
||||||
@property
|
@property
|
||||||
def max_position_embeddings(self):
|
def max_position_embeddings(self):
|
||||||
|
|||||||
@@ -772,7 +772,7 @@ DEBERTA_START_DOCSTRING = r"""
|
|||||||
The DeBERTa model was proposed in `DeBERTa: Decoding-enhanced BERT with Disentangled Attention
|
The DeBERTa model was proposed in `DeBERTa: Decoding-enhanced BERT with Disentangled Attention
|
||||||
<https://arxiv.org/abs/2006.03654>`_ by Pengcheng He, Xiaodong Liu, Jianfeng Gao, Weizhu Chen. It's build on top of
|
<https://arxiv.org/abs/2006.03654>`_ by Pengcheng He, Xiaodong Liu, Jianfeng Gao, Weizhu Chen. It's build on top of
|
||||||
BERT/RoBERTa with two improvements, i.e. disentangled attention and enhanced mask decoder. With those two
|
BERT/RoBERTa with two improvements, i.e. disentangled attention and enhanced mask decoder. With those two
|
||||||
improvements, it out perform BERT/RoBERTa on a majority of tasks with 80GB pre-training data.
|
improvements, it out perform BERT/RoBERTa on a majority of tasks with 80GB pretraining data.
|
||||||
|
|
||||||
This model is also a PyTorch `torch.nn.Module <https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`__
|
This model is also a PyTorch `torch.nn.Module <https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`__
|
||||||
subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to
|
subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to
|
||||||
|
|||||||
@@ -891,8 +891,7 @@ class ElectraForSequenceClassification(ElectraPreTrainedModel):
|
|||||||
|
|
||||||
@add_start_docstrings(
|
@add_start_docstrings(
|
||||||
"""
|
"""
|
||||||
Electra model with a binary classification head on top as used during pre-training for identifying generated
|
Electra model with a binary classification head on top as used during pretraining for identifying generated tokens.
|
||||||
tokens.
|
|
||||||
|
|
||||||
It is recommended to load the discriminator checkpoint into that model.
|
It is recommended to load the discriminator checkpoint into that model.
|
||||||
""",
|
""",
|
||||||
|
|||||||
@@ -789,8 +789,7 @@ class TFElectraModel(TFElectraPreTrainedModel):
|
|||||||
|
|
||||||
@add_start_docstrings(
|
@add_start_docstrings(
|
||||||
"""
|
"""
|
||||||
Electra model with a binary classification head on top as used during pre-training for identifying generated
|
Electra model with a binary classification head on top as used during pretraining for identifying generated tokens.
|
||||||
tokens.
|
|
||||||
|
|
||||||
Even though both the discriminator and generator may be loaded into this model, the discriminator is the only model
|
Even though both the discriminator and generator may be loaded into this model, the discriminator is the only model
|
||||||
of the two to have the correct classification head to be used for this model.
|
of the two to have the correct classification head to be used for this model.
|
||||||
|
|||||||
@@ -109,6 +109,8 @@ class FSMTConfig(PretrainedConfig):
|
|||||||
early_stopping (:obj:`bool`, `optional`, defaults to :obj:`False`)
|
early_stopping (:obj:`bool`, `optional`, defaults to :obj:`False`)
|
||||||
Flag that will be used by default in the :obj:`generate` method of the model. Whether to stop the beam
|
Flag that will be used by default in the :obj:`generate` method of the model. Whether to stop the beam
|
||||||
search when at least ``num_beams`` sentences are finished per batch or not.
|
search when at least ``num_beams`` sentences are finished per batch or not.
|
||||||
|
use_cache (:obj:`bool`, `optional`, defaults to :obj:`True`):
|
||||||
|
Whether or not the model should return the last key/values attentions (not used by all models).
|
||||||
|
|
||||||
Examples::
|
Examples::
|
||||||
|
|
||||||
@@ -142,9 +144,6 @@ class FSMTConfig(PretrainedConfig):
|
|||||||
dropout=0.1,
|
dropout=0.1,
|
||||||
activation_dropout=0.0,
|
activation_dropout=0.0,
|
||||||
init_std=0.02,
|
init_std=0.02,
|
||||||
pad_token_id=1,
|
|
||||||
bos_token_id=0,
|
|
||||||
eos_token_id=2,
|
|
||||||
decoder_start_token_id=2,
|
decoder_start_token_id=2,
|
||||||
is_encoder_decoder=True,
|
is_encoder_decoder=True,
|
||||||
scale_embedding=True,
|
scale_embedding=True,
|
||||||
@@ -152,6 +151,10 @@ class FSMTConfig(PretrainedConfig):
|
|||||||
num_beams=5,
|
num_beams=5,
|
||||||
length_penalty=1.0,
|
length_penalty=1.0,
|
||||||
early_stopping=False,
|
early_stopping=False,
|
||||||
|
use_cache=True,
|
||||||
|
pad_token_id=1,
|
||||||
|
bos_token_id=0,
|
||||||
|
eos_token_id=2,
|
||||||
**common_kwargs
|
**common_kwargs
|
||||||
):
|
):
|
||||||
if "hidden_size" in common_kwargs:
|
if "hidden_size" in common_kwargs:
|
||||||
@@ -196,6 +199,8 @@ class FSMTConfig(PretrainedConfig):
|
|||||||
self.activation_dropout = activation_dropout
|
self.activation_dropout = activation_dropout
|
||||||
self.dropout = dropout
|
self.dropout = dropout
|
||||||
|
|
||||||
|
self.use_cache = use_cache
|
||||||
|
|
||||||
@property
|
@property
|
||||||
def num_attention_heads(self) -> int:
|
def num_attention_heads(self) -> int:
|
||||||
return self.encoder_attention_heads
|
return self.encoder_attention_heads
|
||||||
|
|||||||
@@ -1241,7 +1241,7 @@ class TFFunnelModel(TFFunnelPreTrainedModel):
|
|||||||
|
|
||||||
@add_start_docstrings(
|
@add_start_docstrings(
|
||||||
"""
|
"""
|
||||||
Funnel model with a binary classification head on top as used during pre-training for identifying generated tokens.
|
Funnel model with a binary classification head on top as used during pretraining for identifying generated tokens.
|
||||||
""",
|
""",
|
||||||
FUNNEL_START_DOCSTRING,
|
FUNNEL_START_DOCSTRING,
|
||||||
)
|
)
|
||||||
|
|||||||
@@ -104,6 +104,8 @@ class GPT2Config(PretrainedConfig):
|
|||||||
The dropout ratio to be used after the projection and activation.
|
The dropout ratio to be used after the projection and activation.
|
||||||
gradient_checkpointing (:obj:`bool`, `optional`, defaults to :obj:`False`):
|
gradient_checkpointing (:obj:`bool`, `optional`, defaults to :obj:`False`):
|
||||||
Whether or not to use gradient checkpointing to save memory at the expense of slower backward pass.
|
Whether or not to use gradient checkpointing to save memory at the expense of slower backward pass.
|
||||||
|
use_cache (:obj:`bool`, `optional`, defaults to :obj:`True`):
|
||||||
|
Whether or not the model should return the last key/values attentions (not used by all models).
|
||||||
|
|
||||||
Example::
|
Example::
|
||||||
|
|
||||||
@@ -142,9 +144,10 @@ class GPT2Config(PretrainedConfig):
|
|||||||
summary_activation=None,
|
summary_activation=None,
|
||||||
summary_proj_to_labels=True,
|
summary_proj_to_labels=True,
|
||||||
summary_first_dropout=0.1,
|
summary_first_dropout=0.1,
|
||||||
|
gradient_checkpointing=False,
|
||||||
|
use_cache=True,
|
||||||
bos_token_id=50256,
|
bos_token_id=50256,
|
||||||
eos_token_id=50256,
|
eos_token_id=50256,
|
||||||
gradient_checkpointing=False,
|
|
||||||
**kwargs
|
**kwargs
|
||||||
):
|
):
|
||||||
super().__init__(bos_token_id=bos_token_id, eos_token_id=eos_token_id, **kwargs)
|
super().__init__(bos_token_id=bos_token_id, eos_token_id=eos_token_id, **kwargs)
|
||||||
@@ -168,6 +171,7 @@ class GPT2Config(PretrainedConfig):
|
|||||||
self.summary_first_dropout = summary_first_dropout
|
self.summary_first_dropout = summary_first_dropout
|
||||||
self.summary_proj_to_labels = summary_proj_to_labels
|
self.summary_proj_to_labels = summary_proj_to_labels
|
||||||
self.gradient_checkpointing = gradient_checkpointing
|
self.gradient_checkpointing = gradient_checkpointing
|
||||||
|
self.use_cache = use_cache
|
||||||
|
|
||||||
self.bos_token_id = bos_token_id
|
self.bos_token_id = bos_token_id
|
||||||
self.eos_token_id = eos_token_id
|
self.eos_token_id = eos_token_id
|
||||||
|
|||||||
@@ -1013,7 +1013,7 @@ class LxmertModel(LxmertPreTrainedModel):
|
|||||||
|
|
||||||
|
|
||||||
@add_start_docstrings(
|
@add_start_docstrings(
|
||||||
"""Lxmert Model with a specified pre-training head on top. """,
|
"""Lxmert Model with a specified pretraining head on top. """,
|
||||||
LXMERT_START_DOCSTRING,
|
LXMERT_START_DOCSTRING,
|
||||||
)
|
)
|
||||||
class LxmertForPreTraining(LxmertPreTrainedModel):
|
class LxmertForPreTraining(LxmertPreTrainedModel):
|
||||||
@@ -1024,7 +1024,7 @@ class LxmertForPreTraining(LxmertPreTrainedModel):
|
|||||||
self.num_qa_labels = config.num_qa_labels
|
self.num_qa_labels = config.num_qa_labels
|
||||||
self.visual_loss_normalizer = config.visual_loss_normalizer
|
self.visual_loss_normalizer = config.visual_loss_normalizer
|
||||||
|
|
||||||
# Use of pre-training tasks
|
# Use of pretraining tasks
|
||||||
self.task_mask_lm = config.task_mask_lm
|
self.task_mask_lm = config.task_mask_lm
|
||||||
self.task_obj_predict = config.task_obj_predict
|
self.task_obj_predict = config.task_obj_predict
|
||||||
self.task_matched = config.task_matched
|
self.task_matched = config.task_matched
|
||||||
|
|||||||
@@ -1176,7 +1176,7 @@ class TFLxmertForPreTraining(TFLxmertPreTrainedModel):
|
|||||||
self.num_qa_labels = config.num_qa_labels
|
self.num_qa_labels = config.num_qa_labels
|
||||||
self.visual_loss_normalizer = config.visual_loss_normalizer
|
self.visual_loss_normalizer = config.visual_loss_normalizer
|
||||||
|
|
||||||
# Use of pre-training tasks
|
# Use of pretraining tasks
|
||||||
self.task_mask_lm = config.task_mask_lm
|
self.task_mask_lm = config.task_mask_lm
|
||||||
self.task_obj_predict = config.task_obj_predict
|
self.task_obj_predict = config.task_obj_predict
|
||||||
self.task_matched = config.task_matched
|
self.task_matched = config.task_matched
|
||||||
|
|||||||
@@ -933,7 +933,7 @@ class MobileBertModel(MobileBertPreTrainedModel):
|
|||||||
|
|
||||||
@add_start_docstrings(
|
@add_start_docstrings(
|
||||||
"""
|
"""
|
||||||
MobileBert Model with two heads on top as done during the pre-training: a `masked language modeling` head and a
|
MobileBert Model with two heads on top as done during the pretraining: a `masked language modeling` head and a
|
||||||
`next sentence prediction (classification)` head.
|
`next sentence prediction (classification)` head.
|
||||||
""",
|
""",
|
||||||
MOBILEBERT_START_DOCSTRING,
|
MOBILEBERT_START_DOCSTRING,
|
||||||
|
|||||||
@@ -1014,7 +1014,7 @@ class TFMobileBertModel(TFMobileBertPreTrainedModel):
|
|||||||
|
|
||||||
@add_start_docstrings(
|
@add_start_docstrings(
|
||||||
"""
|
"""
|
||||||
MobileBert Model with two heads on top as done during the pre-training: a `masked language modeling` head and a
|
MobileBert Model with two heads on top as done during the pretraining: a `masked language modeling` head and a
|
||||||
`next sentence prediction (classification)` head.
|
`next sentence prediction (classification)` head.
|
||||||
""",
|
""",
|
||||||
MOBILEBERT_START_DOCSTRING,
|
MOBILEBERT_START_DOCSTRING,
|
||||||
|
|||||||
@@ -96,6 +96,9 @@ class OpenAIGPTConfig(PretrainedConfig):
|
|||||||
:class:`~transformers.OpenAIGPTDoubleHeadsModel` and :class:`~transformers.OpenAIGPTDoubleHeadsModel`.
|
:class:`~transformers.OpenAIGPTDoubleHeadsModel` and :class:`~transformers.OpenAIGPTDoubleHeadsModel`.
|
||||||
|
|
||||||
The dropout ratio to be used after the projection and activation.
|
The dropout ratio to be used after the projection and activation.
|
||||||
|
use_cache (:obj:`bool`, `optional`, defaults to :obj:`True`):
|
||||||
|
Whether or not the model should return the last key/values attentions (not used by all models).
|
||||||
|
|
||||||
|
|
||||||
Examples::
|
Examples::
|
||||||
|
|
||||||
@@ -133,6 +136,7 @@ class OpenAIGPTConfig(PretrainedConfig):
|
|||||||
summary_activation=None,
|
summary_activation=None,
|
||||||
summary_proj_to_labels=True,
|
summary_proj_to_labels=True,
|
||||||
summary_first_dropout=0.1,
|
summary_first_dropout=0.1,
|
||||||
|
use_cache=True,
|
||||||
**kwargs
|
**kwargs
|
||||||
):
|
):
|
||||||
super().__init__(**kwargs)
|
super().__init__(**kwargs)
|
||||||
@@ -155,6 +159,7 @@ class OpenAIGPTConfig(PretrainedConfig):
|
|||||||
self.summary_activation = summary_activation
|
self.summary_activation = summary_activation
|
||||||
self.summary_first_dropout = summary_first_dropout
|
self.summary_first_dropout = summary_first_dropout
|
||||||
self.summary_proj_to_labels = summary_proj_to_labels
|
self.summary_proj_to_labels = summary_proj_to_labels
|
||||||
|
self.use_cache = use_cache
|
||||||
|
|
||||||
@property
|
@property
|
||||||
def max_position_embeddings(self):
|
def max_position_embeddings(self):
|
||||||
|
|||||||
@@ -90,6 +90,8 @@ class ProphetNetConfig(PretrainedConfig):
|
|||||||
eps (:obj:`float`, `optional`, defaults to 0.0):
|
eps (:obj:`float`, `optional`, defaults to 0.0):
|
||||||
Controls the ``epsilon`` parameter value for label smoothing in the loss calculation. If set to 0, no label
|
Controls the ``epsilon`` parameter value for label smoothing in the loss calculation. If set to 0, no label
|
||||||
smoothing is performed.
|
smoothing is performed.
|
||||||
|
use_cache (:obj:`bool`, `optional`, defaults to :obj:`True`):
|
||||||
|
Whether or not the model should return the last key/values attentions (not used by all models).
|
||||||
"""
|
"""
|
||||||
model_type = "prophetnet"
|
model_type = "prophetnet"
|
||||||
keys_to_ignore_at_inference = ["past_key_values"]
|
keys_to_ignore_at_inference = ["past_key_values"]
|
||||||
@@ -112,15 +114,16 @@ class ProphetNetConfig(PretrainedConfig):
|
|||||||
init_std=0.02,
|
init_std=0.02,
|
||||||
is_encoder_decoder=True,
|
is_encoder_decoder=True,
|
||||||
add_cross_attention=True,
|
add_cross_attention=True,
|
||||||
pad_token_id=0,
|
|
||||||
bos_token_id=1,
|
|
||||||
eos_token_id=2,
|
|
||||||
decoder_start_token_id=0,
|
decoder_start_token_id=0,
|
||||||
ngram=2,
|
ngram=2,
|
||||||
num_buckets=32,
|
num_buckets=32,
|
||||||
relative_max_distance=128,
|
relative_max_distance=128,
|
||||||
disable_ngram_loss=False,
|
disable_ngram_loss=False,
|
||||||
eps=0.0,
|
eps=0.0,
|
||||||
|
use_cache=True,
|
||||||
|
pad_token_id=0,
|
||||||
|
bos_token_id=1,
|
||||||
|
eos_token_id=2,
|
||||||
**kwargs
|
**kwargs
|
||||||
):
|
):
|
||||||
super().__init__(
|
super().__init__(
|
||||||
@@ -156,6 +159,8 @@ class ProphetNetConfig(PretrainedConfig):
|
|||||||
self.activation_dropout = activation_dropout
|
self.activation_dropout = activation_dropout
|
||||||
self.dropout = dropout
|
self.dropout = dropout
|
||||||
|
|
||||||
|
self.use_cache = use_cache
|
||||||
|
|
||||||
@property
|
@property
|
||||||
def num_attention_heads(self) -> int:
|
def num_attention_heads(self) -> int:
|
||||||
return self.num_encoder_attention_heads
|
return self.num_encoder_attention_heads
|
||||||
|
|||||||
@@ -72,6 +72,8 @@ RAG_CONFIG_DOC = r"""
|
|||||||
output_retrieved(:obj:`bool`, `optional`, defaults to :obj:`False`):
|
output_retrieved(:obj:`bool`, `optional`, defaults to :obj:`False`):
|
||||||
If set to ``True``, :obj:`retrieved_doc_embeds`, :obj:`retrieved_doc_ids`, :obj:`context_input_ids` and
|
If set to ``True``, :obj:`retrieved_doc_embeds`, :obj:`retrieved_doc_ids`, :obj:`context_input_ids` and
|
||||||
:obj:`context_attention_mask` are returned. See returned tensors for more detail.
|
:obj:`context_attention_mask` are returned. See returned tensors for more detail.
|
||||||
|
use_cache (:obj:`bool`, `optional`, defaults to :obj:`True`):
|
||||||
|
Whether or not the model should return the last key/values attentions (not used by all models).
|
||||||
"""
|
"""
|
||||||
|
|
||||||
|
|
||||||
@@ -107,6 +109,7 @@ class RagConfig(PretrainedConfig):
|
|||||||
exclude_bos_score=False,
|
exclude_bos_score=False,
|
||||||
do_marginalize=False,
|
do_marginalize=False,
|
||||||
output_retrieved=False,
|
output_retrieved=False,
|
||||||
|
use_cache=True,
|
||||||
**kwargs
|
**kwargs
|
||||||
):
|
):
|
||||||
super().__init__(
|
super().__init__(
|
||||||
@@ -156,6 +159,8 @@ class RagConfig(PretrainedConfig):
|
|||||||
|
|
||||||
self.do_deduplication = do_deduplication
|
self.do_deduplication = do_deduplication
|
||||||
|
|
||||||
|
self.use_cache = use_cache
|
||||||
|
|
||||||
@classmethod
|
@classmethod
|
||||||
def from_question_encoder_generator_configs(
|
def from_question_encoder_generator_configs(
|
||||||
cls, question_encoder_config: PretrainedConfig, generator_config: PretrainedConfig, **kwargs
|
cls, question_encoder_config: PretrainedConfig, generator_config: PretrainedConfig, **kwargs
|
||||||
|
|||||||
@@ -138,6 +138,8 @@ class ReformerConfig(PretrainedConfig):
|
|||||||
:obj:`inputs_ids` passed when calling :class:`~transformers.ReformerModel`.
|
:obj:`inputs_ids` passed when calling :class:`~transformers.ReformerModel`.
|
||||||
tie_word_embeddings (:obj:`bool`, `optional`, defaults to :obj:`False`):
|
tie_word_embeddings (:obj:`bool`, `optional`, defaults to :obj:`False`):
|
||||||
Whether to tie input and output embeddings.
|
Whether to tie input and output embeddings.
|
||||||
|
use_cache (:obj:`bool`, `optional`, defaults to :obj:`True`):
|
||||||
|
Whether or not the model should return the last key/values attentions (not used by all models).
|
||||||
|
|
||||||
Examples::
|
Examples::
|
||||||
|
|
||||||
@@ -188,6 +190,7 @@ class ReformerConfig(PretrainedConfig):
|
|||||||
pad_token_id=0,
|
pad_token_id=0,
|
||||||
vocab_size=320,
|
vocab_size=320,
|
||||||
tie_word_embeddings=False,
|
tie_word_embeddings=False,
|
||||||
|
use_cache=True,
|
||||||
**kwargs
|
**kwargs
|
||||||
):
|
):
|
||||||
super().__init__(
|
super().__init__(
|
||||||
@@ -226,3 +229,4 @@ class ReformerConfig(PretrainedConfig):
|
|||||||
self.axial_norm_std = axial_norm_std
|
self.axial_norm_std = axial_norm_std
|
||||||
self.chunk_size_lm_head = chunk_size_lm_head
|
self.chunk_size_lm_head = chunk_size_lm_head
|
||||||
self.attn_layers = attn_layers
|
self.attn_layers = attn_layers
|
||||||
|
self.use_cache = use_cache
|
||||||
|
|||||||
@@ -69,6 +69,8 @@ class T5Config(PretrainedConfig):
|
|||||||
feed_forward_proj (:obj:`string`, `optional`, defaults to :obj:`"relu"`):
|
feed_forward_proj (:obj:`string`, `optional`, defaults to :obj:`"relu"`):
|
||||||
Type of feed forward layer to be used. Should be one of :obj:`"relu"` or :obj:`"gated-gelu"`. T5v1.1 uses
|
Type of feed forward layer to be used. Should be one of :obj:`"relu"` or :obj:`"gated-gelu"`. T5v1.1 uses
|
||||||
the :obj:`"gated-gelu"` feed forward projection. Original T5 uses :obj:`"relu"`.
|
the :obj:`"gated-gelu"` feed forward projection. Original T5 uses :obj:`"relu"`.
|
||||||
|
use_cache (:obj:`bool`, `optional`, defaults to :obj:`True`):
|
||||||
|
Whether or not the model should return the last key/values attentions (not used by all models).
|
||||||
"""
|
"""
|
||||||
model_type = "t5"
|
model_type = "t5"
|
||||||
keys_to_ignore_at_inference = ["past_key_values"]
|
keys_to_ignore_at_inference = ["past_key_values"]
|
||||||
@@ -88,6 +90,7 @@ class T5Config(PretrainedConfig):
|
|||||||
initializer_factor=1.0,
|
initializer_factor=1.0,
|
||||||
feed_forward_proj="relu",
|
feed_forward_proj="relu",
|
||||||
is_encoder_decoder=True,
|
is_encoder_decoder=True,
|
||||||
|
use_cache=True,
|
||||||
pad_token_id=0,
|
pad_token_id=0,
|
||||||
eos_token_id=1,
|
eos_token_id=1,
|
||||||
**kwargs
|
**kwargs
|
||||||
@@ -112,6 +115,7 @@ class T5Config(PretrainedConfig):
|
|||||||
self.layer_norm_epsilon = layer_norm_epsilon
|
self.layer_norm_epsilon = layer_norm_epsilon
|
||||||
self.initializer_factor = initializer_factor
|
self.initializer_factor = initializer_factor
|
||||||
self.feed_forward_proj = feed_forward_proj
|
self.feed_forward_proj = feed_forward_proj
|
||||||
|
self.use_cache = use_cache
|
||||||
|
|
||||||
@property
|
@property
|
||||||
def hidden_size(self):
|
def hidden_size(self):
|
||||||
|
|||||||
@@ -884,7 +884,7 @@ T5_INPUTS_DOCSTRING = r"""
|
|||||||
:func:`transformers.PreTrainedTokenizer.__call__` and :func:`transformers.PreTrainedTokenizer.encode` for
|
:func:`transformers.PreTrainedTokenizer.__call__` and :func:`transformers.PreTrainedTokenizer.encode` for
|
||||||
details.
|
details.
|
||||||
|
|
||||||
To know more on how to prepare :obj:`inputs` for pre-training take a look at `T5 Training
|
To know more on how to prepare :obj:`inputs` for pretraining take a look at `T5 Training
|
||||||
<./t5.html#training>`__.
|
<./t5.html#training>`__.
|
||||||
decoder_input_ids (:obj:`tf.Tensor` of shape :obj:`(batch_size, target_sequence_length)`, `optional`):
|
decoder_input_ids (:obj:`tf.Tensor` of shape :obj:`(batch_size, target_sequence_length)`, `optional`):
|
||||||
Provide for sequence to sequence training. T5 uses the :obj:`pad_token_id` as the starting token for
|
Provide for sequence to sequence training. T5 uses the :obj:`pad_token_id` as the starting token for
|
||||||
|
|||||||
@@ -15,6 +15,8 @@
|
|||||||
# limitations under the License.
|
# limitations under the License.
|
||||||
""" XLNet configuration """
|
""" XLNet configuration """
|
||||||
|
|
||||||
|
import warnings
|
||||||
|
|
||||||
from ...configuration_utils import PretrainedConfig
|
from ...configuration_utils import PretrainedConfig
|
||||||
from ...utils import logging
|
from ...utils import logging
|
||||||
|
|
||||||
@@ -106,12 +108,18 @@ class XLNetConfig(PretrainedConfig):
|
|||||||
Used in the SQuAD evaluation script.
|
Used in the SQuAD evaluation script.
|
||||||
end_n_top (:obj:`int`, `optional`, defaults to 5):
|
end_n_top (:obj:`int`, `optional`, defaults to 5):
|
||||||
Used in the SQuAD evaluation script.
|
Used in the SQuAD evaluation script.
|
||||||
use_cache (:obj:`bool`, `optional`, defaults to :obj:`True`):
|
use_mems_eval (:obj:`bool`, `optional`, defaults to :obj:`True`):
|
||||||
Whether or not the model should return the last pre-computed hidden states.
|
Whether or not the model should make use of the recurrent memory mechanism in evaluation mode.
|
||||||
|
use_mems_train (:obj:`bool`, `optional`, defaults to :obj:`False`):
|
||||||
|
Whether or not the model should make use of the recurrent memory mechanism in train mode.
|
||||||
|
|
||||||
.. note::
|
.. note::
|
||||||
This flag behaves differently from with other models: it just controls the inference behavior, during
|
For pretraining, it is recommended to set ``use_mems_train`` to :obj:`True`. For fine-tuning, it is
|
||||||
training the model always uses ``use_cache=True``.
|
recommended to set ``use_mems_train`` to :obj:`False` as discussed `here
|
||||||
|
<https://github.com/zihangdai/xlnet/issues/41#issuecomment-505102587>`__. If ``use_mems_train`` is set
|
||||||
|
to :obj:`True`, one has to make sure that the train batches are correctly pre-processed, `e.g.`
|
||||||
|
:obj:`batch_1 = [[This line is], [This is the]]` and :obj:`batch_2 = [[ the first line], [ second
|
||||||
|
line]]` and that all batches are of equal size.
|
||||||
|
|
||||||
Examples::
|
Examples::
|
||||||
|
|
||||||
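The batch layout required by the ``use_mems_train`` note above can be sketched as follows. This is a minimal illustration, not library code: ``make_aligned_batches`` and ``seg_len`` are hypothetical names, and tokens are plain strings. The point is that row ``i`` of each batch continues row ``i`` of the previous batch, and every batch has the same shape.

```python
from typing import List


def make_aligned_batches(streams: List[List[str]], seg_len: int) -> List[List[List[str]]]:
    """Hypothetical helper: cut each token stream into consecutive segments.

    Row i of batch k+1 continues row i of batch k, so memories cached for
    row i always belong to the same underlying text.
    """
    # Use the shortest stream so every batch has exactly the same shape.
    n_segments = min(len(s) for s in streams) // seg_len
    batches = []
    for k in range(n_segments):
        start = k * seg_len
        batches.append([s[start:start + seg_len] for s in streams])
    return batches


streams = [
    "This line is the first line".split(),
    "This is the second line too".split(),
]
batches = make_aligned_batches(streams, seg_len=3)
```

Leftover tokens that do not fill a whole segment are dropped in this sketch; padding them is another option, under the same equal-size constraint.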
@@ -145,6 +153,8 @@ class XLNetConfig(PretrainedConfig):
|
|||||||
dropout=0.1,
|
dropout=0.1,
|
||||||
mem_len=512,
|
mem_len=512,
|
||||||
reuse_len=None,
|
reuse_len=None,
|
||||||
|
use_mems_eval=True,
|
||||||
|
use_mems_train=False,
|
||||||
bi_data=False,
|
bi_data=False,
|
||||||
clamp_len=-1,
|
clamp_len=-1,
|
||||||
same_length=False,
|
same_length=False,
|
||||||
@@ -197,6 +207,16 @@ class XLNetConfig(PretrainedConfig):
|
|||||||
self.pad_token_id = pad_token_id
|
self.pad_token_id = pad_token_id
|
||||||
self.eos_token_id = eos_token_id
|
self.eos_token_id = eos_token_id
|
||||||
|
|
||||||
|
if "use_cache" in kwargs:
|
||||||
|
warnings.warn(
|
||||||
|
"The `use_cache` argument is deprecated and will be removed in a future version, use `use_mems_eval` instead.",
|
||||||
|
FutureWarning,
|
||||||
|
)
|
||||||
|
use_mems_eval = kwargs["use_cache"]
|
||||||
|
|
||||||
|
self.use_mems_eval = use_mems_eval
|
||||||
|
self.use_mems_train = use_mems_train
|
||||||
|
|
||||||
@property
|
@property
|
||||||
def max_position_embeddings(self):
|
def max_position_embeddings(self):
|
||||||
return -1
|
return -1
|
||||||
|
|||||||
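The backward-compatibility handling in the ``XLNetConfig`` hunk above can be sketched in isolation. This is a simplified stand-in, assuming only the two new flags and the deprecated ``use_cache`` keyword; ``MemsConfigSketch`` is a hypothetical class, not the real ``XLNetConfig``.

```python
import warnings


class MemsConfigSketch:
    """Hypothetical, stripped-down config mirroring the deprecation shim."""

    def __init__(self, use_mems_eval=True, use_mems_train=False, **kwargs):
        # A legacy `use_cache` keyword overrides `use_mems_eval` and warns.
        if "use_cache" in kwargs:
            warnings.warn(
                "The `use_cache` argument is deprecated, use `use_mems_eval` instead.",
                FutureWarning,
            )
            use_mems_eval = kwargs["use_cache"]

        self.use_mems_eval = use_mems_eval
        self.use_mems_train = use_mems_train


# Old-style call: still accepted, but mapped onto the new attribute.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    legacy_cfg = MemsConfigSketch(use_cache=False)
```

Note that the legacy keyword only affects the evaluation flag; ``use_mems_train`` keeps its own default.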
@@ -440,6 +440,9 @@ class TFXLNetMainLayer(tf.keras.layers.Layer):
|
|||||||
self.layer = [TFXLNetLayer(config, name="layer_._{}".format(i)) for i in range(config.n_layer)]
|
self.layer = [TFXLNetLayer(config, name="layer_._{}".format(i)) for i in range(config.n_layer)]
|
||||||
self.dropout = tf.keras.layers.Dropout(config.dropout)
|
self.dropout = tf.keras.layers.Dropout(config.dropout)
|
||||||
|
|
||||||
|
self.use_mems_eval = config.use_mems_eval
|
||||||
|
self.use_mems_train = config.use_mems_train
|
||||||
|
|
||||||
def get_input_embeddings(self):
|
def get_input_embeddings(self):
|
||||||
return self.word_embedding
|
return self.word_embedding
|
||||||
|
|
||||||
@@ -489,14 +492,23 @@ class TFXLNetMainLayer(tf.keras.layers.Layer):
|
|||||||
return ret
|
return ret
|
||||||
|
|
||||||
def cache_mem(self, curr_out, prev_mem):
|
def cache_mem(self, curr_out, prev_mem):
|
||||||
"""cache hidden states into memory."""
|
# cache hidden states into memory.
|
||||||
if self.reuse_len is not None and self.reuse_len > 0:
|
if self.reuse_len is not None and self.reuse_len > 0:
|
||||||
curr_out = curr_out[: self.reuse_len]
|
curr_out = curr_out[: self.reuse_len]
|
||||||
|
|
||||||
if prev_mem is None:
|
if self.mem_len is None or self.mem_len == 0:
|
||||||
new_mem = curr_out[-self.mem_len :]
|
# If :obj:`use_mems` is active but no `mem_len` is defined, the model behaves like GPT-2 at inference time
|
||||||
|
# and returns all of the past and current hidden states.
|
||||||
|
cutoff = 0
|
||||||
else:
|
else:
|
||||||
new_mem = tf.concat([prev_mem, curr_out], 0)[-self.mem_len :]
|
# If :obj:`use_mems` is active and `mem_len` is defined, the model returns the last `mem_len` hidden
|
||||||
|
# states. This is the preferred setting for training and long-form generation.
|
||||||
|
cutoff = -self.mem_len
|
||||||
|
if prev_mem is None:
|
||||||
|
# if :obj:`use_mems` is active and `mem_len` is defined, the model
|
||||||
|
new_mem = curr_out[cutoff:]
|
||||||
|
else:
|
||||||
|
new_mem = tf.concat([prev_mem, curr_out], 0)[cutoff:]
|
||||||
|
|
||||||
return tf.stop_gradient(new_mem)
|
return tf.stop_gradient(new_mem)
|
||||||
|
|
||||||
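The cutoff logic of the new ``cache_mem`` can be followed on plain Python lists. The sketch below is an assumption-laden simplification: ``cache_mem_sketch`` is a hypothetical function, lists stand in for tensors whose first dimension is the sequence, and the ``tf.stop_gradient`` call is omitted.

```python
from typing import List, Optional


def cache_mem_sketch(
    curr_out: List[int],
    prev_mem: Optional[List[int]],
    mem_len: Optional[int],
    reuse_len: Optional[int] = None,
) -> List[int]:
    """Hypothetical list-based version of the memory update."""
    if reuse_len is not None and reuse_len > 0:
        curr_out = curr_out[:reuse_len]

    if mem_len is None or mem_len == 0:
        # No limit: keep every past and current hidden state.
        cutoff = 0
    else:
        # Keep only the last `mem_len` hidden states.
        cutoff = -mem_len

    if prev_mem is None:
        return curr_out[cutoff:]
    return (prev_mem + curr_out)[cutoff:]


unbounded = cache_mem_sketch([4, 5], [1, 2, 3], mem_len=None)
bounded = cache_mem_sketch([4, 5], [2, 3], mem_len=3)
```

With ``cutoff = 0`` the slice ``[0:]`` keeps the whole sequence, which is why an undefined ``mem_len`` yields an ever-growing memory.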
@@ -569,7 +581,7 @@ class TFXLNetMainLayer(tf.keras.layers.Layer):
|
|||||||
input_mask=None,
|
input_mask=None,
|
||||||
head_mask=None,
|
head_mask=None,
|
||||||
inputs_embeds=None,
|
inputs_embeds=None,
|
||||||
use_cache=True,
|
use_mems=None,
|
||||||
output_attentions=None,
|
output_attentions=None,
|
||||||
output_hidden_states=None,
|
output_hidden_states=None,
|
||||||
return_dict=None,
|
return_dict=None,
|
||||||
@@ -587,7 +599,7 @@ class TFXLNetMainLayer(tf.keras.layers.Layer):
|
|||||||
input_mask=input_mask,
|
input_mask=input_mask,
|
||||||
head_mask=head_mask,
|
head_mask=head_mask,
|
||||||
inputs_embeds=inputs_embeds,
|
inputs_embeds=inputs_embeds,
|
||||||
use_cache=use_cache,
|
use_mems=use_mems,
|
||||||
output_attentions=output_attentions,
|
output_attentions=output_attentions,
|
||||||
output_hidden_states=output_hidden_states,
|
output_hidden_states=output_hidden_states,
|
||||||
return_dict=return_dict,
|
return_dict=return_dict,
|
||||||
@@ -602,6 +614,11 @@ class TFXLNetMainLayer(tf.keras.layers.Layer):
|
|||||||
)
|
)
|
||||||
return_dict = inputs["return_dict"] if inputs["return_dict"] is not None else self.return_dict
|
return_dict = inputs["return_dict"] if inputs["return_dict"] is not None else self.return_dict
|
||||||
|
|
||||||
|
if training:
|
||||||
|
use_mems = use_mems if use_mems is not None else self.use_mems_train
|
||||||
|
else:
|
||||||
|
use_mems = use_mems if use_mems is not None else self.use_mems_eval
|
||||||
|
|
||||||
# the original code for XLNet uses shapes [len, bsz] with the batch dimension at the end
|
# the original code for XLNet uses shapes [len, bsz] with the batch dimension at the end
|
||||||
# but we want a unified interface in the library with the batch size on the first dimension
|
# but we want a unified interface in the library with the batch size on the first dimension
|
||||||
# so we move here the first dimension (batch) to the end
|
# so we move here the first dimension (batch) to the end
|
||||||
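The way ``use_mems`` is resolved in the hunk above can be written as a small standalone function. ``resolve_use_mems`` is a hypothetical helper for illustration; in the diff the same logic is inlined and reads the defaults from the layer attributes.

```python
from typing import Optional


def resolve_use_mems(
    use_mems: Optional[bool],
    training: bool,
    use_mems_train: bool = False,
    use_mems_eval: bool = True,
) -> bool:
    """Hypothetical helper: an explicit argument wins, else use the mode default."""
    if use_mems is not None:
        return use_mems
    return use_mems_train if training else use_mems_eval


# With the config defaults, memories are used in eval mode but not in training.
eval_default = resolve_use_mems(None, training=False)
train_default = resolve_use_mems(None, training=True)
```

An explicit ``use_mems=True`` passed by the caller therefore enables memories during training even when ``use_mems_train`` is ``False``.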
@@ -737,7 +754,7 @@ class TFXLNetMainLayer(tf.keras.layers.Layer):
|
|||||||
hidden_states = [] if output_hidden_states else None
|
hidden_states = [] if output_hidden_states else None
|
||||||
for i, layer_module in enumerate(self.layer):
|
for i, layer_module in enumerate(self.layer):
|
||||||
# cache new mems
|
# cache new mems
|
||||||
if self.mem_len is not None and self.mem_len > 0 and use_cache:
|
if use_mems:
|
||||||
new_mems = new_mems + (self.cache_mem(output_h, inputs["mems"][i]),)
|
new_mems = new_mems + (self.cache_mem(output_h, inputs["mems"][i]),)
|
||||||
if output_hidden_states:
|
if output_hidden_states:
|
||||||
hidden_states.append((output_h, output_g) if output_g is not None else output_h)
|
hidden_states.append((output_h, output_g) if output_g is not None else output_h)
|
||||||
@@ -768,7 +785,7 @@ class TFXLNetMainLayer(tf.keras.layers.Layer):
|
|||||||
# Prepare outputs, we transpose back here to shape [bsz, len, hidden_dim] (cf. beginning of forward() method)
|
# Prepare outputs, we transpose back here to shape [bsz, len, hidden_dim] (cf. beginning of forward() method)
|
||||||
output = tf.transpose(output, perm=(1, 0, 2))
|
output = tf.transpose(output, perm=(1, 0, 2))
|
||||||
|
|
||||||
if not (self.mem_len is not None and self.mem_len > 0 and use_cache):
|
if not use_mems:
|
||||||
new_mems = None
|
new_mems = None
|
||||||
if output_hidden_states:
|
if output_hidden_states:
|
||||||
if output_g is not None:
|
if output_g is not None:
|
||||||
@@ -1066,7 +1083,7 @@ XLNET_INPUTS_DOCSTRING = r"""
|
|||||||
decoding. The token ids which have their past given to this model should not be passed as :obj:`input_ids`
|
decoding. The token ids which have their past given to this model should not be passed as :obj:`input_ids`
|
||||||
as they have already been computed.
|
as they have already been computed.
|
||||||
|
|
||||||
:obj::obj:`use_cache` has to be set to :obj:`True` to make use of :obj:`mems`.
|
:obj:`use_mems` has to be set to :obj:`True` to make use of :obj:`mems`.
|
||||||
perm_mask (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size, sequence_length, sequence_length)`, `optional`):
|
perm_mask (:obj:`tf.Tensor` or :obj:`Numpy array` of shape :obj:`(batch_size, sequence_length, sequence_length)`, `optional`):
|
||||||
Mask to indicate the attention pattern for each input token with values selected in ``[0, 1]``:
|
Mask to indicate the attention pattern for each input token with values selected in ``[0, 1]``:
|
||||||
|
|
||||||
@@ -1147,7 +1164,7 @@ class TFXLNetModel(TFXLNetPreTrainedModel):
|
|||||||
input_mask=None,
|
input_mask=None,
|
||||||
head_mask=None,
|
head_mask=None,
|
||||||
inputs_embeds=None,
|
inputs_embeds=None,
|
||||||
use_cache=True,
|
use_mems=None,
|
||||||
output_attentions=None,
|
output_attentions=None,
|
||||||
output_hidden_states=None,
|
output_hidden_states=None,
|
||||||
return_dict=None,
|
return_dict=None,
|
||||||
@@ -1165,7 +1182,7 @@ class TFXLNetModel(TFXLNetPreTrainedModel):
|
|||||||
input_mask=input_mask,
|
input_mask=input_mask,
|
||||||
head_mask=head_mask,
|
head_mask=head_mask,
|
||||||
inputs_embeds=inputs_embeds,
|
inputs_embeds=inputs_embeds,
|
||||||
use_cache=use_cache,
|
use_mems=use_mems,
|
||||||
output_attentions=output_attentions,
|
output_attentions=output_attentions,
|
||||||
output_hidden_states=output_hidden_states,
|
output_hidden_states=output_hidden_states,
|
||||||
return_dict=return_dict,
|
return_dict=return_dict,
|
||||||
@@ -1182,7 +1199,7 @@ class TFXLNetModel(TFXLNetPreTrainedModel):
|
|||||||
input_mask=inputs["input_mask"],
|
input_mask=inputs["input_mask"],
|
||||||
head_mask=inputs["head_mask"],
|
head_mask=inputs["head_mask"],
|
||||||
inputs_embeds=inputs["inputs_embeds"],
|
inputs_embeds=inputs["inputs_embeds"],
|
||||||
use_cache=inputs["use_cache"],
|
use_mems=inputs["use_mems"],
|
||||||
output_attentions=inputs["output_attentions"],
|
output_attentions=inputs["output_attentions"],
|
||||||
output_hidden_states=inputs["output_hidden_states"],
|
output_hidden_states=inputs["output_hidden_states"],
|
||||||
return_dict=inputs["return_dict"],
|
return_dict=inputs["return_dict"],
|
||||||
@@ -1207,7 +1224,7 @@ class TFXLNetLMHeadModel(TFXLNetPreTrainedModel, TFCausalLanguageModelingLoss):
|
|||||||
def get_output_embeddings(self):
|
def get_output_embeddings(self):
|
||||||
return self.lm_loss.input_embeddings
|
return self.lm_loss.input_embeddings
|
||||||
|
|
||||||
def prepare_inputs_for_generation(self, inputs, past, **kwargs):
|
def prepare_inputs_for_generation(self, inputs, past, use_mems=None, **kwargs):
|
||||||
# Add dummy token at the end (no attention on this one)
|
# Add dummy token at the end (no attention on this one)
|
||||||
|
|
||||||
# At every pass, the attention values for the new token and the two last generated tokens
|
# At every pass, the attention values for the new token and the two last generated tokens
|
||||||
@@ -1238,7 +1255,7 @@ class TFXLNetLMHeadModel(TFXLNetPreTrainedModel, TFCausalLanguageModelingLoss):
|
|||||||
"input_ids": inputs,
|
"input_ids": inputs,
|
||||||
"perm_mask": perm_mask,
|
"perm_mask": perm_mask,
|
||||||
"target_mapping": target_mapping,
|
"target_mapping": target_mapping,
|
||||||
"use_cache": kwargs["use_cache"],
|
"use_mems": kwargs.get("use_mems"),
|
||||||
}
|
}
|
||||||
|
|
||||||
# if past is defined in model kwargs then use it for faster decoding
|
# if past is defined in model kwargs then use it for faster decoding
|
||||||
@@ -1260,7 +1277,7 @@ class TFXLNetLMHeadModel(TFXLNetPreTrainedModel, TFCausalLanguageModelingLoss):
|
|||||||
input_mask=None,
|
input_mask=None,
|
||||||
head_mask=None,
|
head_mask=None,
|
||||||
inputs_embeds=None,
|
inputs_embeds=None,
|
||||||
use_cache=True,
|
use_mems=None,
|
||||||
output_attentions=None,
|
output_attentions=None,
|
||||||
output_hidden_states=None,
|
output_hidden_states=None,
|
||||||
return_dict=None,
|
return_dict=None,
|
||||||
@@ -1309,7 +1326,7 @@ class TFXLNetLMHeadModel(TFXLNetPreTrainedModel, TFCausalLanguageModelingLoss):
|
|||||||
input_mask=input_mask,
|
input_mask=input_mask,
|
||||||
head_mask=head_mask,
|
head_mask=head_mask,
|
||||||
inputs_embeds=inputs_embeds,
|
inputs_embeds=inputs_embeds,
|
||||||
use_cache=use_cache,
|
use_mems=use_mems,
|
||||||
output_attentions=output_attentions,
|
output_attentions=output_attentions,
|
||||||
output_hidden_states=output_hidden_states,
|
output_hidden_states=output_hidden_states,
|
||||||
return_dict=return_dict,
|
return_dict=return_dict,
|
||||||
@@ -1328,7 +1345,7 @@ class TFXLNetLMHeadModel(TFXLNetPreTrainedModel, TFCausalLanguageModelingLoss):
|
|||||||
input_mask=inputs["input_mask"],
|
input_mask=inputs["input_mask"],
|
||||||
head_mask=inputs["head_mask"],
|
head_mask=inputs["head_mask"],
|
||||||
inputs_embeds=inputs["inputs_embeds"],
|
inputs_embeds=inputs["inputs_embeds"],
|
||||||
use_cache=inputs["use_cache"],
|
use_mems=inputs["use_mems"],
|
||||||
output_attentions=inputs["output_attentions"],
|
output_attentions=inputs["output_attentions"],
|
||||||
output_hidden_states=inputs["output_hidden_states"],
|
output_hidden_states=inputs["output_hidden_states"],
|
||||||
return_dict=return_dict,
|
return_dict=return_dict,
|
||||||
@@ -1395,7 +1412,7 @@ class TFXLNetForSequenceClassification(TFXLNetPreTrainedModel, TFSequenceClassif
|
|||||||
input_mask=None,
|
input_mask=None,
|
||||||
head_mask=None,
|
head_mask=None,
|
||||||
inputs_embeds=None,
|
inputs_embeds=None,
|
||||||
use_cache=True,
|
use_mems=None,
|
||||||
output_attentions=None,
|
output_attentions=None,
|
||||||
output_hidden_states=None,
|
output_hidden_states=None,
|
||||||
return_dict=None,
|
return_dict=None,
|
||||||
@@ -1420,7 +1437,7 @@ class TFXLNetForSequenceClassification(TFXLNetPreTrainedModel, TFSequenceClassif
|
|||||||
input_mask=input_mask,
|
input_mask=input_mask,
|
||||||
head_mask=head_mask,
|
head_mask=head_mask,
|
||||||
inputs_embeds=inputs_embeds,
|
inputs_embeds=inputs_embeds,
|
||||||
use_cache=use_cache,
|
use_mems=use_mems,
|
||||||
output_attentions=output_attentions,
|
output_attentions=output_attentions,
|
||||||
output_hidden_states=output_hidden_states,
|
output_hidden_states=output_hidden_states,
|
||||||
return_dict=return_dict,
|
return_dict=return_dict,
|
||||||
@@ -1439,7 +1456,7 @@ class TFXLNetForSequenceClassification(TFXLNetPreTrainedModel, TFSequenceClassif
|
|||||||
input_mask=inputs["input_mask"],
|
input_mask=inputs["input_mask"],
|
||||||
head_mask=inputs["head_mask"],
|
head_mask=inputs["head_mask"],
|
||||||
inputs_embeds=inputs["inputs_embeds"],
|
inputs_embeds=inputs["inputs_embeds"],
|
||||||
use_cache=inputs["use_cache"],
|
use_mems=inputs["use_mems"],
|
||||||
output_attentions=inputs["output_attentions"],
|
output_attentions=inputs["output_attentions"],
|
||||||
output_hidden_states=inputs["output_hidden_states"],
|
output_hidden_states=inputs["output_hidden_states"],
|
||||||
return_dict=return_dict,
|
return_dict=return_dict,
|
||||||
@@ -1512,7 +1529,7 @@ class TFXLNetForMultipleChoice(TFXLNetPreTrainedModel, TFMultipleChoiceLoss):
|
|||||||
target_mapping=None,
|
target_mapping=None,
|
||||||
head_mask=None,
|
head_mask=None,
|
||||||
inputs_embeds=None,
|
inputs_embeds=None,
|
||||||
use_cache=True,
|
use_mems=None,
|
||||||
output_attentions=None,
|
output_attentions=None,
|
||||||
output_hidden_states=None,
|
output_hidden_states=None,
|
||||||
return_dict=None,
|
return_dict=None,
|
||||||
@@ -1526,6 +1543,7 @@ class TFXLNetForMultipleChoice(TFXLNetPreTrainedModel, TFMultipleChoiceLoss):
|
|||||||
num_choices]`` where :obj:`num_choices` is the size of the second dimension of the input tensors. (See
|
num_choices]`` where :obj:`num_choices` is the size of the second dimension of the input tensors. (See
|
||||||
:obj:`input_ids` above)
|
:obj:`input_ids` above)
|
||||||
"""
|
"""
|
||||||
|
|
||||||
inputs = input_processing(
|
inputs = input_processing(
|
||||||
func=self.call,
|
func=self.call,
|
||||||
input_ids=input_ids,
|
input_ids=input_ids,
|
||||||
@@ -1537,7 +1555,7 @@ class TFXLNetForMultipleChoice(TFXLNetPreTrainedModel, TFMultipleChoiceLoss):
|
|||||||
input_mask=input_mask,
|
input_mask=input_mask,
|
||||||
head_mask=head_mask,
|
head_mask=head_mask,
|
||||||
inputs_embeds=inputs_embeds,
|
inputs_embeds=inputs_embeds,
|
||||||
use_cache=use_cache,
|
use_mems=use_mems,
|
||||||
output_attentions=output_attentions,
|
output_attentions=output_attentions,
|
||||||
output_hidden_states=output_hidden_states,
|
output_hidden_states=output_hidden_states,
|
||||||
return_dict=return_dict,
|
return_dict=return_dict,
|
||||||
@@ -1579,7 +1597,7 @@ class TFXLNetForMultipleChoice(TFXLNetPreTrainedModel, TFMultipleChoiceLoss):
|
|||||||
flat_input_mask,
|
flat_input_mask,
|
||||||
inputs["head_mask"],
|
inputs["head_mask"],
|
||||||
flat_inputs_embeds,
|
flat_inputs_embeds,
|
||||||
inputs["use_cache"],
|
inputs["use_mems"],
|
||||||
inputs["output_attentions"],
|
inputs["output_attentions"],
|
||||||
inputs["output_hidden_states"],
|
inputs["output_hidden_states"],
|
||||||
return_dict=return_dict,
|
return_dict=return_dict,
|
||||||
@@ -1639,7 +1657,7 @@ class TFXLNetForTokenClassification(TFXLNetPreTrainedModel, TFTokenClassificatio
|
|||||||
input_mask=None,
|
input_mask=None,
|
||||||
head_mask=None,
|
head_mask=None,
|
||||||
inputs_embeds=None,
|
inputs_embeds=None,
|
||||||
use_cache=True,
|
use_mems=None,
|
||||||
output_attentions=None,
|
output_attentions=None,
|
||||||
output_hidden_states=None,
|
output_hidden_states=None,
|
||||||
return_dict=None,
|
return_dict=None,
|
||||||
@@ -1663,7 +1681,7 @@ class TFXLNetForTokenClassification(TFXLNetPreTrainedModel, TFTokenClassificatio
|
|||||||
input_mask=input_mask,
|
input_mask=input_mask,
|
||||||
head_mask=head_mask,
|
head_mask=head_mask,
|
||||||
inputs_embeds=inputs_embeds,
|
inputs_embeds=inputs_embeds,
|
||||||
use_cache=use_cache,
|
use_mems=use_mems,
|
||||||
output_attentions=output_attentions,
|
output_attentions=output_attentions,
|
||||||
output_hidden_states=output_hidden_states,
|
output_hidden_states=output_hidden_states,
|
||||||
return_dict=return_dict,
|
return_dict=return_dict,
|
||||||
@@ -1682,7 +1700,7 @@ class TFXLNetForTokenClassification(TFXLNetPreTrainedModel, TFTokenClassificatio
|
|||||||
input_mask=inputs["input_mask"],
|
input_mask=inputs["input_mask"],
|
||||||
head_mask=inputs["head_mask"],
|
head_mask=inputs["head_mask"],
|
||||||
inputs_embeds=inputs["inputs_embeds"],
|
inputs_embeds=inputs["inputs_embeds"],
|
||||||
use_cache=inputs["use_cache"],
|
use_mems=inputs["use_mems"],
|
||||||
output_attentions=inputs["output_attentions"],
|
output_attentions=inputs["output_attentions"],
|
||||||
output_hidden_states=inputs["output_hidden_states"],
|
output_hidden_states=inputs["output_hidden_states"],
|
||||||
return_dict=return_dict,
|
return_dict=return_dict,
|
||||||
@@ -1739,7 +1757,7 @@ class TFXLNetForQuestionAnsweringSimple(TFXLNetPreTrainedModel, TFQuestionAnswer
|
|||||||
input_mask=None,
|
input_mask=None,
|
||||||
head_mask=None,
|
head_mask=None,
|
||||||
inputs_embeds=None,
|
inputs_embeds=None,
|
||||||
use_cache=True,
|
use_mems=None,
|
||||||
output_attentions=None,
|
output_attentions=None,
|
||||||
output_hidden_states=None,
|
output_hidden_states=None,
|
||||||
return_dict=None,
|
return_dict=None,
|
||||||
@@ -1769,7 +1787,7 @@ class TFXLNetForQuestionAnsweringSimple(TFXLNetPreTrainedModel, TFQuestionAnswer
|
|||||||
input_mask=input_mask,
|
input_mask=input_mask,
|
||||||
head_mask=head_mask,
|
head_mask=head_mask,
|
||||||
inputs_embeds=inputs_embeds,
|
inputs_embeds=inputs_embeds,
|
||||||
use_cache=use_cache,
|
use_mems=use_mems,
|
||||||
output_attentions=output_attentions,
|
output_attentions=output_attentions,
|
||||||
output_hidden_states=output_hidden_states,
|
output_hidden_states=output_hidden_states,
|
||||||
return_dict=return_dict,
|
return_dict=return_dict,
|
||||||
@@ -1789,7 +1807,7 @@ class TFXLNetForQuestionAnsweringSimple(TFXLNetPreTrainedModel, TFQuestionAnswer
|
|||||||
input_mask=inputs["input_mask"],
|
input_mask=inputs["input_mask"],
|
||||||
head_mask=inputs["head_mask"],
|
head_mask=inputs["head_mask"],
|
||||||
inputs_embeds=inputs["inputs_embeds"],
|
inputs_embeds=inputs["inputs_embeds"],
|
||||||
use_cache=inputs["use_cache"],
|
use_mems=inputs["use_mems"],
|
||||||
output_attentions=inputs["output_attentions"],
|
output_attentions=inputs["output_attentions"],
|
||||||
output_hidden_states=inputs["output_hidden_states"],
|
output_hidden_states=inputs["output_hidden_states"],
|
||||||
return_dict=return_dict,
|
return_dict=return_dict,
|
||||||
|
|||||||
@@ -16,6 +16,7 @@
|
|||||||
"""
|
"""
|
||||||
PyTorch XLNet model.
|
PyTorch XLNet model.
|
||||||
"""
|
"""
|
||||||
|
import warnings
|
||||||
from dataclasses import dataclass
|
from dataclasses import dataclass
|
||||||
from typing import List, Optional, Tuple
|
from typing import List, Optional, Tuple
|
||||||
|
|
||||||
@@ -876,7 +877,7 @@ XLNET_INPUTS_DOCSTRING = r"""
|
|||||||
decoding. The token ids which have their past given to this model should not be passed as :obj:`input_ids`
|
decoding. The token ids which have their past given to this model should not be passed as :obj:`input_ids`
|
||||||
as they have already been computed.
|
as they have already been computed.
|
||||||
|
|
||||||
:obj::obj:`use_cache` has to be set to :obj:`True` to make use of :obj:`mems`.
|
:obj:`use_mems` has to be set to :obj:`True` to make use of :obj:`mems`.
|
||||||
perm_mask (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, sequence_length)`, `optional`):
|
perm_mask (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, sequence_length)`, `optional`):
|
||||||
Mask to indicate the attention pattern for each input token with values selected in ``[0, 1]``:
|
Mask to indicate the attention pattern for each input token with values selected in ``[0, 1]``:
|
||||||
|
|
||||||
@@ -997,15 +998,15 @@ class XLNetModel(XLNetPreTrainedModel):
|
|||||||
curr_out = curr_out[: self.reuse_len]
|
curr_out = curr_out[: self.reuse_len]
|
||||||
|
|
||||||
if self.mem_len is None or self.mem_len == 0:
|
if self.mem_len is None or self.mem_len == 0:
|
||||||
# If :obj:`use_cache` is active but no `mem_len` is defined, the model behaves like GPT-2 at inference time
|
# If :obj:`use_mems` is active but no `mem_len` is defined, the model behaves like GPT-2 at inference time
|
||||||
# and returns all of the past and current hidden states.
|
# and returns all of the past and current hidden states.
|
||||||
cutoff = 0
|
cutoff = 0
|
||||||
else:
|
else:
|
||||||
# If :obj:`use_cache` is active and `mem_len` is defined, the model returns the last `mem_len` hidden
|
# If :obj:`use_mems` is active and `mem_len` is defined, the model returns the last `mem_len` hidden
|
||||||
# states. This is the preferred setting for training and long-form generation.
|
# states. This is the preferred setting for training and long-form generation.
|
||||||
cutoff = -self.mem_len
|
cutoff = -self.mem_len
|
||||||
if prev_mem is None:
|
if prev_mem is None:
|
||||||
# if :obj:`use_cache` is active and `mem_len` is defined, the model
|
# if :obj:`use_mems` is active and `mem_len` is defined, the model
|
||||||
new_mem = curr_out[cutoff:]
|
new_mem = curr_out[cutoff:]
|
||||||
else:
|
else:
|
||||||
new_mem = torch.cat([prev_mem, curr_out], dim=0)[cutoff:]
|
new_mem = torch.cat([prev_mem, curr_out], dim=0)[cutoff:]
|
||||||
@@ -1080,10 +1081,11 @@ class XLNetModel(XLNetPreTrainedModel):
|
|||||||
input_mask=None,
|
input_mask=None,
|
||||||
head_mask=None,
|
head_mask=None,
|
||||||
inputs_embeds=None,
|
inputs_embeds=None,
|
||||||
use_cache=None,
|
use_mems=None,
|
||||||
output_attentions=None,
|
output_attentions=None,
|
||||||
output_hidden_states=None,
|
output_hidden_states=None,
|
||||||
return_dict=None,
|
return_dict=None,
|
||||||
|
**kwargs, # delete after deprecation warning is removed
|
||||||
):
|
):
|
||||||
|
|
||||||
output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
|
output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
|
||||||
@@ -1091,7 +1093,18 @@ class XLNetModel(XLNetPreTrainedModel):
|
|||||||
output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
|
output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
|
||||||
)
|
)
|
||||||
return_dict = return_dict if return_dict is not None else self.config.use_return_dict
|
return_dict = return_dict if return_dict is not None else self.config.use_return_dict
|
||||||
use_cache = self.training or (use_cache if use_cache is not None else self.config.use_cache)
|
|
||||||
|
if "use_cache" in kwargs:
|
||||||
|
warnings.warn(
|
||||||
|
"The `use_cache` argument is deprecated and will be removed in a future version, use `use_mems` instead.",
|
||||||
|
FutureWarning,
|
||||||
|
)
|
||||||
|
use_mems = kwargs["use_cache"]
|
||||||
|
|
||||||
|
if self.training:
|
||||||
|
use_mems = use_mems if use_mems is not None else self.config.use_mems_train
|
||||||
|
else:
|
||||||
|
use_mems = use_mems if use_mems is not None else self.config.use_mems_eval
|
||||||
|
|
||||||
# the original code for XLNet uses shapes [len, bsz] with the batch dimension at the end
|
# the original code for XLNet uses shapes [len, bsz] with the batch dimension at the end
|
||||||
# but we want a unified interface in the library with the batch size on the first dimension
|
# but we want a unified interface in the library with the batch size on the first dimension
|
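Note on the hunk above: the new logic resolves `use_mems` in three steps. A deprecated `use_cache` keyword, if present, overrides `use_mems` and triggers a `FutureWarning`. An explicit `use_mems` value is used next. Otherwise the value falls back to `config.use_mems_train` or `config.use_mems_eval` depending on the training mode. The sketch below reproduces that order as a standalone function. The function name `resolve_use_mems` and the default values for `use_mems_train` and `use_mems_eval` are assumptions made for illustration; the hunk does not show the config defaults.

```python
import warnings


def resolve_use_mems(training, use_mems=None, use_mems_train=False, use_mems_eval=True, **kwargs):
    """Hypothetical standalone version of the resolution order in the hunk above."""
    if "use_cache" in kwargs:
        # Backward compatibility: the old keyword still works but warns.
        warnings.warn(
            "The `use_cache` argument is deprecated and will be removed in a future version, "
            "use `use_mems` instead.",
            FutureWarning,
        )
        use_mems = kwargs["use_cache"]

    if use_mems is not None:
        # An explicit argument wins over the config values.
        return use_mems
    # Assumed defaults stand in for config.use_mems_train / config.use_mems_eval.
    return use_mems_train if training else use_mems_eval


print(resolve_use_mems(training=True))
print(resolve_use_mems(training=False))
```

Unlike the removed line `use_cache = self.training or (...)`, this order no longer forces memories on during training, which matches the new test asserting that `outputs.mems` is `None` in train mode.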
||||||
@@ -1222,7 +1235,7 @@ class XLNetModel(XLNetPreTrainedModel):
|
|||||||
attentions = [] if output_attentions else None
|
attentions = [] if output_attentions else None
|
||||||
hidden_states = [] if output_hidden_states else None
|
hidden_states = [] if output_hidden_states else None
|
||||||
for i, layer_module in enumerate(self.layer):
|
for i, layer_module in enumerate(self.layer):
|
||||||
if use_cache:
|
if use_mems:
|
||||||
# cache new mems
|
# cache new mems
|
||||||
new_mems = new_mems + (self.cache_mem(output_h, mems[i]),)
|
new_mems = new_mems + (self.cache_mem(output_h, mems[i]),)
|
||||||
if output_hidden_states:
|
if output_hidden_states:
|
||||||
@@ -1253,7 +1266,7 @@ class XLNetModel(XLNetPreTrainedModel):
|
|||||||
# Prepare outputs, we transpose back here to shape [bsz, len, hidden_dim] (cf. beginning of forward() method)
|
# Prepare outputs, we transpose back here to shape [bsz, len, hidden_dim] (cf. beginning of forward() method)
|
||||||
output = output.permute(1, 0, 2).contiguous()
|
output = output.permute(1, 0, 2).contiguous()
|
||||||
|
|
||||||
if not use_cache:
|
if not use_mems:
|
||||||
new_mems = None
|
new_mems = None
|
||||||
|
|
||||||
if output_hidden_states:
|
if output_hidden_states:
|
||||||
@@ -1299,7 +1312,7 @@ class XLNetLMHeadModel(XLNetPreTrainedModel):
|
|||||||
def get_output_embeddings(self):
|
def get_output_embeddings(self):
|
||||||
return self.lm_loss
|
return self.lm_loss
|
||||||
|
|
||||||
def prepare_inputs_for_generation(self, input_ids, past=None, use_cache=None, **kwargs):
|
def prepare_inputs_for_generation(self, input_ids, past=None, use_mems=None, **kwargs):
|
||||||
# Add dummy token at the end (no attention on this one)
|
# Add dummy token at the end (no attention on this one)
|
||||||
|
|
||||||
effective_batch_size = input_ids.shape[0]
|
effective_batch_size = input_ids.shape[0]
|
||||||
@@ -1332,7 +1345,7 @@ class XLNetLMHeadModel(XLNetPreTrainedModel):
|
|||||||
"input_ids": input_ids,
|
"input_ids": input_ids,
|
||||||
"perm_mask": perm_mask,
|
"perm_mask": perm_mask,
|
||||||
"target_mapping": target_mapping,
|
"target_mapping": target_mapping,
|
||||||
"use_cache": use_cache,
|
"use_mems": use_mems,
|
||||||
}
|
}
|
||||||
|
|
||||||
# if past is defined in model kwargs then use it for faster decoding
|
# if past is defined in model kwargs then use it for faster decoding
|
||||||
@@ -1355,10 +1368,11 @@ class XLNetLMHeadModel(XLNetPreTrainedModel):
|
|||||||
head_mask=None,
|
head_mask=None,
|
||||||
inputs_embeds=None,
|
inputs_embeds=None,
|
||||||
labels=None,
|
labels=None,
|
||||||
use_cache=None,
|
use_mems=None,
|
||||||
output_attentions=None,
|
output_attentions=None,
|
||||||
output_hidden_states=None,
|
output_hidden_states=None,
|
||||||
return_dict=None,
|
return_dict=None,
|
||||||
|
**kwargs, # delete when `use_cache` is removed in XLNetModel
|
||||||
):
|
):
|
||||||
r"""
|
r"""
|
||||||
labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, num_predict)`, `optional`):
|
labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, num_predict)`, `optional`):
|
||||||
@@ -1407,7 +1421,6 @@ class XLNetLMHeadModel(XLNetPreTrainedModel):
|
|||||||
>>> next_token_logits = outputs.logits # Logits have shape [target_mapping.size(0), target_mapping.size(1), config.vocab_size]
|
>>> next_token_logits = outputs.logits # Logits have shape [target_mapping.size(0), target_mapping.size(1), config.vocab_size]
|
||||||
"""
|
"""
|
||||||
return_dict = return_dict if return_dict is not None else self.config.use_return_dict
|
return_dict = return_dict if return_dict is not None else self.config.use_return_dict
|
||||||
use_cache = self.training or (use_cache if use_cache is not None else self.config.use_cache)
|
|
||||||
|
|
||||||
transformer_outputs = self.transformer(
|
transformer_outputs = self.transformer(
|
||||||
input_ids,
|
input_ids,
|
||||||
@@ -1419,10 +1432,11 @@ class XLNetLMHeadModel(XLNetPreTrainedModel):
|
|||||||
input_mask=input_mask,
|
input_mask=input_mask,
|
||||||
head_mask=head_mask,
|
head_mask=head_mask,
|
||||||
inputs_embeds=inputs_embeds,
|
inputs_embeds=inputs_embeds,
|
||||||
use_cache=use_cache,
|
use_mems=use_mems,
|
||||||
output_attentions=output_attentions,
|
output_attentions=output_attentions,
|
||||||
output_hidden_states=output_hidden_states,
|
output_hidden_states=output_hidden_states,
|
||||||
return_dict=return_dict,
|
return_dict=return_dict,
|
||||||
|
**kwargs,
|
||||||
)
|
)
|
||||||
|
|
||||||
logits = self.lm_loss(transformer_outputs[0])
|
logits = self.lm_loss(transformer_outputs[0])
|
||||||
@@ -1483,10 +1497,11 @@ class XLNetForSequenceClassification(XLNetPreTrainedModel):
|
|||||||
head_mask=None,
|
head_mask=None,
|
||||||
inputs_embeds=None,
|
inputs_embeds=None,
|
||||||
labels=None,
|
labels=None,
|
||||||
use_cache=None,
|
use_mems=None,
|
||||||
output_attentions=None,
|
output_attentions=None,
|
||||||
output_hidden_states=None,
|
output_hidden_states=None,
|
||||||
return_dict=None,
|
return_dict=None,
|
||||||
|
**kwargs, # delete when `use_cache` is removed in XLNetModel
|
||||||
):
|
):
|
||||||
r"""
|
r"""
|
||||||
labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`):
|
labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`):
|
||||||
@@ -1495,7 +1510,6 @@ class XLNetForSequenceClassification(XLNetPreTrainedModel):
|
|||||||
If ``config.num_labels > 1`` a classification loss is computed (Cross-Entropy).
|
If ``config.num_labels > 1`` a classification loss is computed (Cross-Entropy).
|
||||||
"""
|
"""
|
||||||
return_dict = return_dict if return_dict is not None else self.config.use_return_dict
|
return_dict = return_dict if return_dict is not None else self.config.use_return_dict
|
||||||
use_cache = self.training or (use_cache if use_cache is not None else self.config.use_cache)
|
|
||||||
|
|
||||||
transformer_outputs = self.transformer(
|
transformer_outputs = self.transformer(
|
||||||
input_ids,
|
input_ids,
|
||||||
@@ -1507,10 +1521,11 @@ class XLNetForSequenceClassification(XLNetPreTrainedModel):
|
|||||||
input_mask=input_mask,
|
input_mask=input_mask,
|
||||||
head_mask=head_mask,
|
head_mask=head_mask,
|
||||||
inputs_embeds=inputs_embeds,
|
inputs_embeds=inputs_embeds,
|
||||||
use_cache=use_cache,
|
use_mems=use_mems,
|
||||||
output_attentions=output_attentions,
|
output_attentions=output_attentions,
|
||||||
output_hidden_states=output_hidden_states,
|
output_hidden_states=output_hidden_states,
|
||||||
return_dict=return_dict,
|
return_dict=return_dict,
|
||||||
|
**kwargs,
|
||||||
)
|
)
|
||||||
output = transformer_outputs[0]
|
output = transformer_outputs[0]
|
||||||
|
|
||||||
@@ -1576,10 +1591,11 @@ class XLNetForTokenClassification(XLNetPreTrainedModel):
|
|||||||
head_mask=None,
|
head_mask=None,
|
||||||
inputs_embeds=None,
|
inputs_embeds=None,
|
||||||
labels=None,
|
labels=None,
|
||||||
use_cache=None,
|
use_mems=None,
|
||||||
output_attentions=None,
|
output_attentions=None,
|
||||||
output_hidden_states=None,
|
output_hidden_states=None,
|
||||||
return_dict=None,
|
return_dict=None,
|
||||||
|
**kwargs, # delete when `use_cache` is removed in XLNetModel
|
||||||
):
|
):
|
||||||
r"""
|
r"""
|
||||||
labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`):
|
labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`):
|
||||||
@@ -1588,7 +1604,6 @@ class XLNetForTokenClassification(XLNetPreTrainedModel):
|
|||||||
`input_ids` above)
|
`input_ids` above)
|
||||||
"""
|
"""
|
||||||
return_dict = return_dict if return_dict is not None else self.config.use_return_dict
|
return_dict = return_dict if return_dict is not None else self.config.use_return_dict
|
||||||
use_cache = self.training or (use_cache if use_cache is not None else self.config.use_cache)
|
|
||||||
|
|
||||||
outputs = self.transformer(
|
outputs = self.transformer(
|
||||||
input_ids,
|
input_ids,
|
||||||
@@ -1600,7 +1615,7 @@ class XLNetForTokenClassification(XLNetPreTrainedModel):
|
|||||||
input_mask=input_mask,
|
input_mask=input_mask,
|
||||||
head_mask=head_mask,
|
head_mask=head_mask,
|
||||||
inputs_embeds=inputs_embeds,
|
inputs_embeds=inputs_embeds,
|
||||||
use_cache=use_cache,
|
use_mems=use_mems,
|
||||||
output_attentions=output_attentions,
|
output_attentions=output_attentions,
|
||||||
output_hidden_states=output_hidden_states,
|
output_hidden_states=output_hidden_states,
|
||||||
return_dict=return_dict,
|
return_dict=return_dict,
|
||||||
@@ -1673,10 +1688,11 @@ class XLNetForMultipleChoice(XLNetPreTrainedModel):
|
|||||||
head_mask=None,
|
head_mask=None,
|
||||||
inputs_embeds=None,
|
inputs_embeds=None,
|
||||||
labels=None,
|
labels=None,
|
||||||
use_cache=None,
|
use_mems=None,
|
||||||
output_attentions=None,
|
output_attentions=None,
|
||||||
output_hidden_states=None,
|
output_hidden_states=None,
|
||||||
return_dict=None,
|
return_dict=None,
|
||||||
|
**kwargs, # delete when `use_cache` is removed in XLNetModel
|
||||||
):
|
):
|
||||||
r"""
|
r"""
|
||||||
labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`):
|
labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`):
|
||||||
@@ -1685,7 +1701,7 @@ class XLNetForMultipleChoice(XLNetPreTrainedModel):
|
|||||||
:obj:`input_ids` above)
|
:obj:`input_ids` above)
|
||||||
"""
|
"""
|
||||||
return_dict = return_dict if return_dict is not None else self.config.use_return_dict
|
return_dict = return_dict if return_dict is not None else self.config.use_return_dict
|
||||||
use_cache = self.training or (use_cache if use_cache is not None else self.config.use_cache)
|
|
||||||
num_choices = input_ids.shape[1] if input_ids is not None else inputs_embeds.shape[1]
|
num_choices = input_ids.shape[1] if input_ids is not None else inputs_embeds.shape[1]
|
||||||
|
|
||||||
flat_input_ids = input_ids.view(-1, input_ids.size(-1)) if input_ids is not None else None
|
flat_input_ids = input_ids.view(-1, input_ids.size(-1)) if input_ids is not None else None
|
||||||
@@ -1708,10 +1724,11 @@ class XLNetForMultipleChoice(XLNetPreTrainedModel):
|
|||||||
target_mapping=target_mapping,
|
target_mapping=target_mapping,
|
||||||
head_mask=head_mask,
|
head_mask=head_mask,
|
||||||
inputs_embeds=flat_inputs_embeds,
|
inputs_embeds=flat_inputs_embeds,
|
||||||
use_cache=use_cache,
|
use_mems=use_mems,
|
||||||
output_attentions=output_attentions,
|
output_attentions=output_attentions,
|
||||||
output_hidden_states=output_hidden_states,
|
output_hidden_states=output_hidden_states,
|
||||||
return_dict=return_dict,
|
return_dict=return_dict,
|
||||||
|
**kwargs,
|
||||||
)
|
)
|
||||||
|
|
||||||
output = transformer_outputs[0]
|
output = transformer_outputs[0]
|
||||||
@@ -1775,10 +1792,11 @@ class XLNetForQuestionAnsweringSimple(XLNetPreTrainedModel):
|
|||||||
inputs_embeds=None,
|
inputs_embeds=None,
|
||||||
start_positions=None,
|
start_positions=None,
|
||||||
end_positions=None,
|
end_positions=None,
|
||||||
use_cache=None,
|
use_mems=None,
|
||||||
output_attentions=None,
|
output_attentions=None,
|
||||||
output_hidden_states=None,
|
output_hidden_states=None,
|
||||||
return_dict=None,
|
return_dict=None,
|
||||||
|
**kwargs, # delete when `use_cache` is removed in XLNetModel
|
||||||
):
|
):
|
||||||
r"""
|
r"""
|
||||||
start_positions (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`):
|
start_positions (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`):
|
||||||
@@ -1791,7 +1809,6 @@ class XLNetForQuestionAnsweringSimple(XLNetPreTrainedModel):
|
|||||||
sequence are not taken into account for computing the loss.
|
sequence are not taken into account for computing the loss.
|
||||||
"""
|
"""
|
||||||
return_dict = return_dict if return_dict is not None else self.config.use_return_dict
|
return_dict = return_dict if return_dict is not None else self.config.use_return_dict
|
||||||
use_cache = self.training or (use_cache if use_cache is not None else self.config.use_cache)
|
|
||||||
|
|
||||||
outputs = self.transformer(
|
outputs = self.transformer(
|
||||||
input_ids,
|
input_ids,
|
||||||
@@ -1803,10 +1820,11 @@ class XLNetForQuestionAnsweringSimple(XLNetPreTrainedModel):
|
|||||||
input_mask=input_mask,
|
input_mask=input_mask,
|
||||||
head_mask=head_mask,
|
head_mask=head_mask,
|
||||||
inputs_embeds=inputs_embeds,
|
inputs_embeds=inputs_embeds,
|
||||||
use_cache=use_cache,
|
use_mems=use_mems,
|
||||||
output_attentions=output_attentions,
|
output_attentions=output_attentions,
|
||||||
output_hidden_states=output_hidden_states,
|
output_hidden_states=output_hidden_states,
|
||||||
return_dict=return_dict,
|
return_dict=return_dict,
|
||||||
|
**kwargs,
|
||||||
)
|
)
|
||||||
|
|
||||||
sequence_output = outputs[0]
|
sequence_output = outputs[0]
|
||||||
@@ -1885,10 +1903,11 @@ class XLNetForQuestionAnswering(XLNetPreTrainedModel):
|
|||||||
is_impossible=None,
|
is_impossible=None,
|
||||||
cls_index=None,
|
cls_index=None,
|
||||||
p_mask=None,
|
p_mask=None,
|
||||||
use_cache=None,
|
use_mems=None,
|
||||||
output_attentions=None,
|
output_attentions=None,
|
||||||
output_hidden_states=None,
|
output_hidden_states=None,
|
||||||
return_dict=None,
|
return_dict=None,
|
||||||
|
**kwargs, # delete when `use_cache` is removed in XLNetModel
|
||||||
):
|
):
|
||||||
r"""
|
r"""
|
||||||
start_positions (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`):
|
start_positions (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`):
|
||||||
@@ -1926,7 +1945,6 @@ class XLNetForQuestionAnswering(XLNetPreTrainedModel):
|
|||||||
>>> loss = outputs.loss
|
>>> loss = outputs.loss
|
||||||
"""
|
"""
|
||||||
return_dict = return_dict if return_dict is not None else self.config.use_return_dict
|
return_dict = return_dict if return_dict is not None else self.config.use_return_dict
|
||||||
use_cache = self.training or (use_cache if use_cache is not None else self.config.use_cache)
|
|
||||||
|
|
||||||
transformer_outputs = self.transformer(
|
transformer_outputs = self.transformer(
|
||||||
input_ids,
|
input_ids,
|
||||||
@@ -1938,10 +1956,11 @@ class XLNetForQuestionAnswering(XLNetPreTrainedModel):
|
|||||||
input_mask=input_mask,
|
input_mask=input_mask,
|
||||||
head_mask=head_mask,
|
head_mask=head_mask,
|
||||||
inputs_embeds=inputs_embeds,
|
inputs_embeds=inputs_embeds,
|
||||||
use_cache=use_cache,
|
use_mems=use_mems,
|
||||||
output_attentions=output_attentions,
|
output_attentions=output_attentions,
|
||||||
output_hidden_states=output_hidden_states,
|
output_hidden_states=output_hidden_states,
|
||||||
return_dict=return_dict,
|
return_dict=return_dict,
|
||||||
|
**kwargs,
|
||||||
)
|
)
|
||||||
hidden_states = transformer_outputs[0]
|
hidden_states = transformer_outputs[0]
|
||||||
start_logits = self.start_logits(hidden_states, p_mask=p_mask)
|
start_logits = self.start_logits(hidden_states, p_mask=p_mask)
|
||||||
|
|||||||
@@ -153,7 +153,7 @@ class TFXLNetModelTester:
|
|||||||
inputs = [input_ids_1, input_mask]
|
inputs = [input_ids_1, input_mask]
|
||||||
result = model(inputs)
|
result = model(inputs)
|
||||||
|
|
||||||
config.mem_len = 0
|
config.use_mems_eval = False
|
||||||
model = TFXLNetModel(config)
|
model = TFXLNetModel(config)
|
||||||
no_mems_outputs = model(inputs)
|
no_mems_outputs = model(inputs)
|
||||||
self.parent.assertEqual(len(no_mems_outputs), 1)
|
self.parent.assertEqual(len(no_mems_outputs), 1)
|
||||||
|
|||||||
@@ -206,7 +206,36 @@ class XLNetModelTester:
|
|||||||
[(self.seq_length, self.batch_size, self.hidden_size)] * self.num_hidden_layers,
|
[(self.seq_length, self.batch_size, self.hidden_size)] * self.num_hidden_layers,
|
||||||
)
|
)
|
||||||
|
|
||||||
def create_and_check_xlnet_model_use_cache(
|
def create_and_check_use_mems_train(
|
||||||
|
self,
|
||||||
|
config,
|
||||||
|
input_ids_1,
|
||||||
|
input_ids_2,
|
||||||
|
input_ids_q,
|
||||||
|
perm_mask,
|
||||||
|
input_mask,
|
||||||
|
target_mapping,
|
||||||
|
segment_ids,
|
||||||
|
lm_labels,
|
||||||
|
sequence_labels,
|
||||||
|
is_impossible_labels,
|
||||||
|
token_labels,
|
||||||
|
):
|
||||||
|
model = XLNetForSequenceClassification(config)
|
||||||
|
model.to(torch_device)
|
||||||
|
model.train()
|
||||||
|
|
||||||
|
train_size = input_ids_1.shape[0]
|
||||||
|
|
||||||
|
batch_size = 4
|
||||||
|
for i in range(train_size // batch_size + 1):
|
||||||
|
input_ids = input_ids_1[i : (i + 1) * batch_size]
|
||||||
|
labels = sequence_labels[i : (i + 1) * batch_size]
|
||||||
|
outputs = model(input_ids=input_ids, labels=labels, return_dict=True)
|
||||||
|
self.parent.assertIsNone(outputs.mems)
|
||||||
|
self.parent.assertIsNotNone(outputs.loss)
|
||||||
|
|
||||||
|
def create_and_check_xlnet_model_use_mems(
|
||||||
self,
|
self,
|
||||||
config,
|
config,
|
||||||
input_ids_1,
|
input_ids_1,
|
||||||
@@ -234,8 +263,8 @@ class XLNetModelTester:
|
|||||||
device=torch_device,
|
device=torch_device,
|
||||||
)
|
)
|
||||||
causal_mask = torch.triu(causal_mask, diagonal=0)
|
causal_mask = torch.triu(causal_mask, diagonal=0)
|
||||||
outputs_cache = model(input_ids_1, use_cache=True, perm_mask=causal_mask)
|
outputs_cache = model(input_ids_1, use_mems=True, perm_mask=causal_mask)
|
||||||
outputs_no_cache = model(input_ids_1, use_cache=False, perm_mask=causal_mask)
|
outputs_no_cache = model(input_ids_1, use_mems=False, perm_mask=causal_mask)
|
||||||
outputs_conf = model(input_ids_1)
|
outputs_conf = model(input_ids_1)
|
||||||
|
|
||||||
self.parent.assertTrue(len(outputs_cache) == len(outputs_conf))
|
self.parent.assertTrue(len(outputs_cache) == len(outputs_conf))
|
||||||
@@ -525,11 +554,15 @@ class XLNetModelTest(ModelTesterMixin, GenerationTesterMixin, unittest.TestCase)
|
|||||||
config_and_inputs = self.model_tester.prepare_config_and_inputs()
|
config_and_inputs = self.model_tester.prepare_config_and_inputs()
|
||||||
self.model_tester.create_and_check_xlnet_base_model(*config_and_inputs)
|
self.model_tester.create_and_check_xlnet_base_model(*config_and_inputs)
|
||||||
|
|
||||||
def test_xlnet_base_model_use_cache(self):
|
def test_xlnet_base_model_use_mems(self):
|
||||||
# checking that in auto-regressive mode, :obj:`use_cache` gives the same results
|
# checking that in auto-regressive mode, :obj:`use_mems` gives the same results
|
||||||
self.model_tester.set_seed()
|
self.model_tester.set_seed()
|
||||||
config_and_inputs = self.model_tester.prepare_config_and_inputs()
|
config_and_inputs = self.model_tester.prepare_config_and_inputs()
|
||||||
self.model_tester.create_and_check_xlnet_model_use_cache(*config_and_inputs)
|
self.model_tester.create_and_check_xlnet_model_use_mems(*config_and_inputs)
|
||||||
|
|
||||||
|
def test_seq_classification_use_mems_train(self):
|
||||||
|
config_and_inputs = self.model_tester.prepare_config_and_inputs()
|
||||||
|
self.model_tester.create_and_check_use_mems_train(*config_and_inputs)
|
||||||
|
|
||||||
def test_xlnet_base_model_with_att_output(self):
|
def test_xlnet_base_model_with_att_output(self):
|
||||||
self.model_tester.set_seed()
|
self.model_tester.set_seed()
|
||||||
|
|||||||