Convert model files from rst to mdx (#14865)
* First pass
* Apply suggestions from code review
* Apply suggestions from code review

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
170
docs/source/model_doc/albert.mdx
Normal file
@@ -0,0 +1,170 @@
|
|||||||
|
<!--Copyright 2020 The HuggingFace Team. All rights reserved.
|
||||||
|
|
||||||
|
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
|
||||||
|
the License. You may obtain a copy of the License at
|
||||||
|
|
||||||
|
http://www.apache.org/licenses/LICENSE-2.0
|
||||||
|
|
||||||
|
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
|
||||||
|
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
|
||||||
|
specific language governing permissions and limitations under the License.
|
||||||
|
-->
|
||||||
|
|
||||||
|
# ALBERT
|
||||||
|
|
||||||
|
## Overview
|
||||||
|
|
||||||
|
The ALBERT model was proposed in [ALBERT: A Lite BERT for Self-supervised Learning of Language Representations](https://arxiv.org/abs/1909.11942) by Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and
Radu Soricut. It presents two parameter-reduction techniques to lower memory consumption and increase the training
speed of BERT:
|
||||||
|
|
||||||
|
- Splitting the embedding matrix into two smaller matrices.
|
||||||
|
- Using repeating layers split among groups.
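
The effect of the two techniques listed above on parameter count can be sketched with simple arithmetic. This is a minimal illustration, not the library's implementation: the function names are hypothetical, and the vocabulary size, embedding size, hidden size, and per-layer parameter count are assumed values chosen only to make the numbers concrete.

```python
def full_embedding_params(vocab_size, hidden_size):
    """Parameters in a single vocab_size x hidden_size embedding matrix."""
    return vocab_size * hidden_size


def factorized_embedding_params(vocab_size, embedding_size, hidden_size):
    """Parameters when the matrix is split into (vocab x E) and (E x hidden)."""
    return vocab_size * embedding_size + embedding_size * hidden_size


def encoder_params(per_layer_params, num_layers, num_groups=None):
    """Unique encoder parameters; layers in the same group share one set of weights."""
    unique_layers = num_layers if num_groups is None else num_groups
    return per_layer_params * unique_layers


# Assumed, illustrative sizes (not read from any checkpoint).
VOCAB, EMBED, HIDDEN = 30_000, 128, 768
PER_LAYER = 7_000_000  # hypothetical parameter count of one transformer layer

full = full_embedding_params(VOCAB, HIDDEN)
factorized = factorized_embedding_params(VOCAB, EMBED, HIDDEN)
unshared = encoder_params(PER_LAYER, num_layers=12)
shared = encoder_params(PER_LAYER, num_layers=12, num_groups=1)

print(f"embedding: {full:,} -> {factorized:,}")
print(f"encoder:   {unshared:,} -> {shared:,}")
```

Note that sharing weights reduces the number of stored parameters only; the forward pass still runs through all 12 layers.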
|
||||||
|
|
||||||
|
The abstract from the paper is the following:
|
||||||
|
|
||||||
|
*Increasing model size when pretraining natural language representations often results in improved performance on
downstream tasks. However, at some point further model increases become harder due to GPU/TPU memory limitations,
longer training times, and unexpected model degradation. To address these problems, we present two parameter-reduction
techniques to lower memory consumption and increase the training speed of BERT. Comprehensive empirical evidence shows
that our proposed methods lead to models that scale much better compared to the original BERT. We also use a
self-supervised loss that focuses on modeling inter-sentence coherence, and show it consistently helps downstream tasks
with multi-sentence inputs. As a result, our best model establishes new state-of-the-art results on the GLUE, RACE, and
SQuAD benchmarks while having fewer parameters compared to BERT-large.*
|
||||||
|
|
||||||
|
Tips:
|
||||||
|
|
||||||
|
- ALBERT is a model with absolute position embeddings, so it's usually advised to pad the inputs on the right rather
  than the left.
- ALBERT uses repeating layers, which results in a small memory footprint. However, the computational cost remains
  similar to a BERT-like architecture with the same number of hidden layers, as it has to iterate through the same
  number of (repeating) layers.
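
The right-padding tip can be illustrated with a small pure-Python sketch. It is not the tokenizer's code: `pad_right` is a hypothetical helper and the token ids are made up. The point is that with right padding, real tokens keep positions 0 to n-1, matching the absolute positions seen during pretraining.

```python
def pad_right(sequences, pad_id=0):
    """Pad every sequence on the right to the length of the longest one.

    Returns the padded ids and an attention mask (1 = real token, 0 = padding).
    """
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(list(seq) + [pad_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return input_ids, attention_mask


# Made-up token ids for two inputs of different lengths.
ids, mask = pad_right([[5, 6, 7], [8]])
print(ids)   # [[5, 6, 7], [8, 0, 0]]
print(mask)  # [[1, 1, 1], [1, 0, 0]]
```

In the library itself, tokenizers expose a `padding_side` attribute that controls this behavior.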
|
||||||
|
|
||||||
|
This model was contributed by [lysandre](https://huggingface.co/lysandre). The JAX version of this model was contributed by
[kamalkraj](https://huggingface.co/kamalkraj). The original code can be found [here](https://github.com/google-research/ALBERT).
|
||||||
|
|
||||||
|
## AlbertConfig
|
||||||
|
|
||||||
|
[[autodoc]] AlbertConfig
|
||||||
|
|
||||||
|
## AlbertTokenizer
|
||||||
|
|
||||||
|
[[autodoc]] AlbertTokenizer
|
||||||
|
- build_inputs_with_special_tokens
|
||||||
|
- get_special_tokens_mask
|
||||||
|
- create_token_type_ids_from_sequences
|
||||||
|
- save_vocabulary
|
||||||
|
|
||||||
|
## AlbertTokenizerFast
|
||||||
|
|
||||||
|
[[autodoc]] AlbertTokenizerFast
|
||||||
|
|
||||||
|
## Albert specific outputs
|
||||||
|
|
||||||
|
[[autodoc]] models.albert.modeling_albert.AlbertForPreTrainingOutput
|
||||||
|
|
||||||
|
[[autodoc]] models.albert.modeling_tf_albert.TFAlbertForPreTrainingOutput
|
||||||
|
|
||||||
|
## AlbertModel
|
||||||
|
|
||||||
|
[[autodoc]] AlbertModel
|
||||||
|
- forward
|
||||||
|
|
||||||
|
## AlbertForPreTraining
|
||||||
|
|
||||||
|
[[autodoc]] AlbertForPreTraining
|
||||||
|
- forward
|
||||||
|
|
||||||
|
## AlbertForMaskedLM
|
||||||
|
|
||||||
|
[[autodoc]] AlbertForMaskedLM
|
||||||
|
- forward
|
||||||
|
|
||||||
|
## AlbertForSequenceClassification
|
||||||
|
|
||||||
|
[[autodoc]] AlbertForSequenceClassification
|
||||||
|
- forward
|
||||||
|
|
||||||
|
## AlbertForMultipleChoice
|
||||||
|
|
||||||
|
[[autodoc]] AlbertForMultipleChoice
|
||||||
|
|
||||||
|
## AlbertForTokenClassification
|
||||||
|
|
||||||
|
[[autodoc]] AlbertForTokenClassification
|
||||||
|
- forward
|
||||||
|
|
||||||
|
## AlbertForQuestionAnswering
|
||||||
|
|
||||||
|
[[autodoc]] AlbertForQuestionAnswering
|
||||||
|
- forward
|
||||||
|
|
||||||
|
## TFAlbertModel
|
||||||
|
|
||||||
|
[[autodoc]] TFAlbertModel
|
||||||
|
- call
|
||||||
|
|
||||||
|
## TFAlbertForPreTraining
|
||||||
|
|
||||||
|
[[autodoc]] TFAlbertForPreTraining
|
||||||
|
- call
|
||||||
|
|
||||||
|
## TFAlbertForMaskedLM
|
||||||
|
|
||||||
|
[[autodoc]] TFAlbertForMaskedLM
|
||||||
|
- call
|
||||||
|
|
||||||
|
## TFAlbertForSequenceClassification
|
||||||
|
|
||||||
|
[[autodoc]] TFAlbertForSequenceClassification
|
||||||
|
- call
|
||||||
|
|
||||||
|
## TFAlbertForMultipleChoice
|
||||||
|
|
||||||
|
[[autodoc]] TFAlbertForMultipleChoice
|
||||||
|
- call
|
||||||
|
|
||||||
|
## TFAlbertForTokenClassification
|
||||||
|
|
||||||
|
[[autodoc]] TFAlbertForTokenClassification
|
||||||
|
- call
|
||||||
|
|
||||||
|
## TFAlbertForQuestionAnswering
|
||||||
|
|
||||||
|
[[autodoc]] TFAlbertForQuestionAnswering
|
||||||
|
- call
|
||||||
|
|
||||||
|
## FlaxAlbertModel
|
||||||
|
|
||||||
|
[[autodoc]] FlaxAlbertModel
|
||||||
|
- __call__
|
||||||
|
|
||||||
|
## FlaxAlbertForPreTraining
|
||||||
|
|
||||||
|
[[autodoc]] FlaxAlbertForPreTraining
|
||||||
|
- __call__
|
||||||
|
|
||||||
|
## FlaxAlbertForMaskedLM
|
||||||
|
|
||||||
|
[[autodoc]] FlaxAlbertForMaskedLM
|
||||||
|
- __call__
|
||||||
|
|
||||||
|
## FlaxAlbertForSequenceClassification
|
||||||
|
|
||||||
|
[[autodoc]] FlaxAlbertForSequenceClassification
|
||||||
|
- __call__
|
||||||
|
|
||||||
|
## FlaxAlbertForMultipleChoice
|
||||||
|
|
||||||
|
[[autodoc]] FlaxAlbertForMultipleChoice
|
||||||
|
- __call__
|
||||||
|
|
||||||
|
## FlaxAlbertForTokenClassification
|
||||||
|
|
||||||
|
[[autodoc]] FlaxAlbertForTokenClassification
|
||||||
|
- __call__
|
||||||
|
|
||||||
|
## FlaxAlbertForQuestionAnswering
|
||||||
|
|
||||||
|
[[autodoc]] FlaxAlbertForQuestionAnswering
|
||||||
|
- __call__
|
||||||
@@ -1,226 +0,0 @@
|
|||||||
..
|
|
||||||
Copyright 2020 The HuggingFace Team. All rights reserved.
|
|
||||||
|
|
||||||
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
|
|
||||||
the License. You may obtain a copy of the License at
|
|
||||||
|
|
||||||
http://www.apache.org/licenses/LICENSE-2.0
|
|
||||||
|
|
||||||
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
|
|
||||||
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
|
|
||||||
specific language governing permissions and limitations under the License.
|
|
||||||
|
|
||||||
ALBERT
|
|
||||||
-----------------------------------------------------------------------------------------------------------------------
|
|
||||||
|
|
||||||
Overview
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
The ALBERT model was proposed in `ALBERT: A Lite BERT for Self-supervised Learning of Language Representations
|
|
||||||
<https://arxiv.org/abs/1909.11942>`__ by Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma,
|
|
||||||
Radu Soricut. It presents two parameter-reduction techniques to lower memory consumption and increase the training
|
|
||||||
speed of BERT:
|
|
||||||
|
|
||||||
- Splitting the embedding matrix into two smaller matrices.
|
|
||||||
- Using repeating layers split among groups.
|
|
||||||
|
|
||||||
The abstract from the paper is the following:
|
|
||||||
|
|
||||||
*Increasing model size when pretraining natural language representations often results in improved performance on
|
|
||||||
downstream tasks. However, at some point further model increases become harder due to GPU/TPU memory limitations,
|
|
||||||
longer training times, and unexpected model degradation. To address these problems, we present two parameter-reduction
|
|
||||||
techniques to lower memory consumption and increase the training speed of BERT. Comprehensive empirical evidence shows
|
|
||||||
that our proposed methods lead to models that scale much better compared to the original BERT. We also use a
|
|
||||||
self-supervised loss that focuses on modeling inter-sentence coherence, and show it consistently helps downstream tasks
|
|
||||||
with multi-sentence inputs. As a result, our best model establishes new state-of-the-art results on the GLUE, RACE, and
|
|
||||||
SQuAD benchmarks while having fewer parameters compared to BERT-large.*
|
|
||||||
|
|
||||||
Tips:
|
|
||||||
|
|
||||||
- ALBERT is a model with absolute position embeddings so it's usually advised to pad the inputs on the right rather
|
|
||||||
than the left.
|
|
||||||
- ALBERT uses repeating layers which results in a small memory footprint, however the computational cost remains
|
|
||||||
similar to a BERT-like architecture with the same number of hidden layers as it has to iterate through the same
|
|
||||||
number of (repeating) layers.
|
|
||||||
|
|
||||||
This model was contributed by `lysandre <https://huggingface.co/lysandre>`__. This model jax version was contributed by
|
|
||||||
`kamalkraj <https://huggingface.co/kamalkraj>`__. The original code can be found `here
|
|
||||||
<https://github.com/google-research/ALBERT>`__.
|
|
||||||
|
|
||||||
AlbertConfig
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.AlbertConfig
|
|
||||||
:members:
|
|
||||||
|
|
||||||
|
|
||||||
AlbertTokenizer
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.AlbertTokenizer
|
|
||||||
:members: build_inputs_with_special_tokens, get_special_tokens_mask,
|
|
||||||
create_token_type_ids_from_sequences, save_vocabulary
|
|
||||||
|
|
||||||
|
|
||||||
AlbertTokenizerFast
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.AlbertTokenizerFast
|
|
||||||
:members:
|
|
||||||
|
|
||||||
|
|
||||||
Albert specific outputs
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.models.albert.modeling_albert.AlbertForPreTrainingOutput
|
|
||||||
:members:
|
|
||||||
|
|
||||||
.. autoclass:: transformers.models.albert.modeling_tf_albert.TFAlbertForPreTrainingOutput
|
|
||||||
:members:
|
|
||||||
|
|
||||||
|
|
||||||
AlbertModel
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.AlbertModel
|
|
||||||
:members: forward
|
|
||||||
|
|
||||||
|
|
||||||
AlbertForPreTraining
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.AlbertForPreTraining
|
|
||||||
:members: forward
|
|
||||||
|
|
||||||
|
|
||||||
AlbertForMaskedLM
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.AlbertForMaskedLM
|
|
||||||
:members: forward
|
|
||||||
|
|
||||||
|
|
||||||
AlbertForSequenceClassification
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.AlbertForSequenceClassification
|
|
||||||
:members: forward
|
|
||||||
|
|
||||||
|
|
||||||
AlbertForMultipleChoice
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.AlbertForMultipleChoice
|
|
||||||
:members:
|
|
||||||
|
|
||||||
|
|
||||||
AlbertForTokenClassification
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.AlbertForTokenClassification
|
|
||||||
:members: forward
|
|
||||||
|
|
||||||
|
|
||||||
AlbertForQuestionAnswering
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.AlbertForQuestionAnswering
|
|
||||||
:members: forward
|
|
||||||
|
|
||||||
|
|
||||||
TFAlbertModel
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.TFAlbertModel
|
|
||||||
:members: call
|
|
||||||
|
|
||||||
|
|
||||||
TFAlbertForPreTraining
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.TFAlbertForPreTraining
|
|
||||||
:members: call
|
|
||||||
|
|
||||||
|
|
||||||
TFAlbertForMaskedLM
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.TFAlbertForMaskedLM
|
|
||||||
:members: call
|
|
||||||
|
|
||||||
|
|
||||||
TFAlbertForSequenceClassification
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.TFAlbertForSequenceClassification
|
|
||||||
:members: call
|
|
||||||
|
|
||||||
|
|
||||||
TFAlbertForMultipleChoice
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.TFAlbertForMultipleChoice
|
|
||||||
:members: call
|
|
||||||
|
|
||||||
|
|
||||||
TFAlbertForTokenClassification
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.TFAlbertForTokenClassification
|
|
||||||
:members: call
|
|
||||||
|
|
||||||
|
|
||||||
TFAlbertForQuestionAnswering
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.TFAlbertForQuestionAnswering
|
|
||||||
:members: call
|
|
||||||
|
|
||||||
|
|
||||||
FlaxAlbertModel
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.FlaxAlbertModel
|
|
||||||
:members: __call__
|
|
||||||
|
|
||||||
|
|
||||||
FlaxAlbertForPreTraining
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.FlaxAlbertForPreTraining
|
|
||||||
:members: __call__
|
|
||||||
|
|
||||||
|
|
||||||
FlaxAlbertForMaskedLM
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.FlaxAlbertForMaskedLM
|
|
||||||
:members: __call__
|
|
||||||
|
|
||||||
|
|
||||||
FlaxAlbertForSequenceClassification
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.FlaxAlbertForSequenceClassification
|
|
||||||
:members: __call__
|
|
||||||
|
|
||||||
|
|
||||||
FlaxAlbertForMultipleChoice
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.FlaxAlbertForMultipleChoice
|
|
||||||
:members: __call__
|
|
||||||
|
|
||||||
|
|
||||||
FlaxAlbertForTokenClassification
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.FlaxAlbertForTokenClassification
|
|
||||||
:members: __call__
|
|
||||||
|
|
||||||
|
|
||||||
FlaxAlbertForQuestionAnswering
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.FlaxAlbertForQuestionAnswering
|
|
||||||
:members: __call__
|
|
||||||
151
docs/source/model_doc/bart.mdx
Normal file
@@ -0,0 +1,151 @@
|
|||||||
|
<!--Copyright 2020 The HuggingFace Team. All rights reserved.
|
||||||
|
|
||||||
|
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
|
||||||
|
the License. You may obtain a copy of the License at
|
||||||
|
|
||||||
|
http://www.apache.org/licenses/LICENSE-2.0
|
||||||
|
|
||||||
|
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
|
||||||
|
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
|
||||||
|
specific language governing permissions and limitations under the License.
|
||||||
|
-->
|
||||||
|
|
||||||
|
# BART
|
||||||
|
|
||||||
|
**DISCLAIMER:** If you see something strange, file a [GitHub Issue](https://github.com/huggingface/transformers/issues/new?assignees=&labels=&template=bug-report.md&title) and assign
@patrickvonplaten
|
||||||
|
|
||||||
|
## Overview
|
||||||
|
|
||||||
|
The Bart model was proposed in [BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation,
Translation, and Comprehension](https://arxiv.org/abs/1910.13461) by Mike Lewis, Yinhan Liu, Naman Goyal, Marjan
Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov and Luke Zettlemoyer on 29 Oct, 2019.
|
||||||
|
|
||||||
|
According to the abstract,
|
||||||
|
|
||||||
|
- Bart uses a standard seq2seq/machine translation architecture with a bidirectional encoder (like BERT) and a
  left-to-right decoder (like GPT).
- The pretraining task involves randomly shuffling the order of the original sentences and a novel in-filling scheme,
  where spans of text are replaced with a single mask token.
- BART is particularly effective when fine-tuned for text generation but also works well for comprehension tasks. It
  matches the performance of RoBERTa with comparable training resources on GLUE and SQuAD, and achieves new
  state-of-the-art results on a range of abstractive dialogue, question answering, and summarization tasks, with gains
  of up to 6 ROUGE.
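
The two corruption steps described in the pretraining bullet above can be sketched in plain Python. This is a toy illustration under stated assumptions, not the authors' preprocessing code: `infill_spans` and `shuffle_sentences` are hypothetical helpers, tokens are whitespace-split words, and the spans are given explicitly rather than sampled.

```python
import random


def infill_spans(tokens, spans, mask_token="<mask>"):
    """Replace each (start, end) span (end exclusive) with a single mask token.

    Spans are assumed to be non-overlapping.
    """
    out, pos = [], 0
    for start, end in sorted(spans):
        out.extend(tokens[pos:start])  # keep the text before the span
        out.append(mask_token)         # the whole span collapses to one mask
        pos = end
    out.extend(tokens[pos:])
    return out


def shuffle_sentences(sentences, seed=0):
    """Return the sentences in a random order (seeded for repeatability)."""
    rng = random.Random(seed)
    shuffled = list(sentences)
    rng.shuffle(shuffled)
    return shuffled


tokens = "UN Chief Says There Is No Plan to Stop Chemical Weapons in Syria".split()
corrupted = " ".join(infill_spans(tokens, [(6, 11)]))
print(corrupted)  # UN Chief Says There Is No <mask> in Syria
```

Because a multi-token span becomes a single mask, the model has to predict how many tokens are missing as well as what they are.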
|
||||||
|
|
||||||
|
This model was contributed by [sshleifer](https://huggingface.co/sshleifer). The authors' code can be found [here](https://github.com/pytorch/fairseq/tree/master/examples/bart).
|
||||||
|
|
||||||
|
|
||||||
|
### Examples
|
||||||
|
|
||||||
|
- Examples and scripts for fine-tuning BART and other models for sequence-to-sequence tasks can be found in
  [examples/pytorch/summarization/](https://github.com/huggingface/transformers/tree/master/examples/pytorch/summarization/README.md).
- An example of how to train [`BartForConditionalGeneration`] with a Hugging Face `datasets`
  object can be found in this [forum discussion](https://discuss.huggingface.co/t/train-bart-for-conditional-generation-e-g-summarization/1904).
- [Distilled checkpoints](https://huggingface.co/models?search=distilbart) are described in this [paper](https://arxiv.org/abs/2010.13002).
|
||||||
|
|
||||||
|
|
||||||
|
## Implementation Notes
|
||||||
|
|
||||||
|
- Bart doesn't use `token_type_ids` for sequence classification. Use [`BartTokenizer`] or
  [`~BartTokenizer.encode`] to get the proper splitting.
- The forward pass of [`BartModel`] will create the `decoder_input_ids` if they are not passed.
  This is different from some other modeling APIs. A typical use case of this feature is mask filling.
- Model predictions are intended to be identical to the original implementation when
  `force_bos_token_to_be_generated=True`. This only works, however, if the string you pass to
  [`fairseq.encode`] starts with a space.
- [`~generation_utils.GenerationMixin.generate`] should be used for conditional generation tasks like
  summarization; see the example in its docstring.
- Models that load the *facebook/bart-large-cnn* weights will not have a `mask_token_id` and cannot perform
  mask-filling tasks.
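
One way to picture how `decoder_input_ids` can be derived when they are not passed is a right shift of the input ids. The sketch below assumes that mechanism for illustration only: `shift_right` is a hypothetical pure-Python helper, the token ids are made up, and the actual model code may handle padding and other details differently.

```python
def shift_right(input_ids, decoder_start_token_id):
    """Shift each sequence one position to the right.

    The start token is prepended and the last token is dropped, so the decoder
    sees token t-1 when it is asked to predict token t.
    """
    return [[decoder_start_token_id] + list(seq[:-1]) for seq in input_ids]


# Made-up ids: 0 and 2 stand in for begin/end-of-sequence markers.
batch = [[0, 11, 12, 13, 2]]
decoder_input_ids = shift_right(batch, decoder_start_token_id=2)
print(decoder_input_ids)  # [[2, 0, 11, 12, 13]]
```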
|
||||||
|
|
||||||
|
## Mask Filling
|
||||||
|
|
||||||
|
The `facebook/bart-base` and `facebook/bart-large` checkpoints can be used to fill multi-token masks.
|
||||||
|
|
||||||
|
```python
from transformers import BartForConditionalGeneration, BartTokenizer

model = BartForConditionalGeneration.from_pretrained("facebook/bart-large", forced_bos_token_id=0)
tok = BartTokenizer.from_pretrained("facebook/bart-large")
example_english_phrase = "UN Chief Says There Is No <mask> in Syria"
batch = tok(example_english_phrase, return_tensors='pt')
generated_ids = model.generate(batch['input_ids'])
assert tok.batch_decode(generated_ids, skip_special_tokens=True) == ['UN Chief Says There Is No Plan to Stop Chemical Weapons in Syria']
```
|
||||||
|
|
||||||
|
## BartConfig
|
||||||
|
|
||||||
|
[[autodoc]] BartConfig
|
||||||
|
- all
|
||||||
|
|
||||||
|
## BartTokenizer
|
||||||
|
|
||||||
|
[[autodoc]] BartTokenizer
|
||||||
|
- all
|
||||||
|
|
||||||
|
## BartTokenizerFast
|
||||||
|
|
||||||
|
[[autodoc]] BartTokenizerFast
|
||||||
|
- all
|
||||||
|
|
||||||
|
## BartModel
|
||||||
|
|
||||||
|
[[autodoc]] BartModel
|
||||||
|
- forward
|
||||||
|
|
||||||
|
## BartForConditionalGeneration
|
||||||
|
|
||||||
|
[[autodoc]] BartForConditionalGeneration
|
||||||
|
- forward
|
||||||
|
|
||||||
|
## BartForSequenceClassification
|
||||||
|
|
||||||
|
[[autodoc]] BartForSequenceClassification
|
||||||
|
- forward
|
||||||
|
|
||||||
|
## BartForQuestionAnswering
|
||||||
|
|
||||||
|
[[autodoc]] BartForQuestionAnswering
|
||||||
|
- forward
|
||||||
|
|
||||||
|
## BartForCausalLM
|
||||||
|
|
||||||
|
[[autodoc]] BartForCausalLM
|
||||||
|
- forward
|
||||||
|
|
||||||
|
## TFBartModel
|
||||||
|
|
||||||
|
[[autodoc]] TFBartModel
|
||||||
|
- call
|
||||||
|
|
||||||
|
## TFBartForConditionalGeneration
|
||||||
|
|
||||||
|
[[autodoc]] TFBartForConditionalGeneration
|
||||||
|
- call
|
||||||
|
|
||||||
|
## FlaxBartModel
|
||||||
|
|
||||||
|
[[autodoc]] FlaxBartModel
|
||||||
|
- __call__
|
||||||
|
- encode
|
||||||
|
- decode
|
||||||
|
|
||||||
|
## FlaxBartForConditionalGeneration
|
||||||
|
|
||||||
|
[[autodoc]] FlaxBartForConditionalGeneration
|
||||||
|
- __call__
|
||||||
|
- encode
|
||||||
|
- decode
|
||||||
|
|
||||||
|
## FlaxBartForSequenceClassification
|
||||||
|
|
||||||
|
[[autodoc]] FlaxBartForSequenceClassification
|
||||||
|
- __call__
|
||||||
|
- encode
|
||||||
|
- decode
|
||||||
|
|
||||||
|
## FlaxBartForQuestionAnswering
|
||||||
|
|
||||||
|
[[autodoc]] FlaxBartForQuestionAnswering
|
||||||
|
- __call__
|
||||||
|
- encode
|
||||||
|
- decode
|
||||||
@@ -1,182 +0,0 @@
|
|||||||
..
|
|
||||||
Copyright 2020 The HuggingFace Team. All rights reserved.
|
|
||||||
|
|
||||||
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
|
|
||||||
the License. You may obtain a copy of the License at
|
|
||||||
|
|
||||||
http://www.apache.org/licenses/LICENSE-2.0
|
|
||||||
|
|
||||||
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
|
|
||||||
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
|
|
||||||
specific language governing permissions and limitations under the License.
|
|
||||||
|
|
||||||
BART
|
|
||||||
-----------------------------------------------------------------------------------------------------------------------
|
|
||||||
|
|
||||||
**DISCLAIMER:** If you see something strange, file a `Github Issue
|
|
||||||
<https://github.com/huggingface/transformers/issues/new?assignees=&labels=&template=bug-report.md&title>`__ and assign
|
|
||||||
@patrickvonplaten
|
|
||||||
|
|
||||||
Overview
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
The Bart model was proposed in `BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation,
|
|
||||||
Translation, and Comprehension <https://arxiv.org/abs/1910.13461>`__ by Mike Lewis, Yinhan Liu, Naman Goyal, Marjan
|
|
||||||
Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov and Luke Zettlemoyer on 29 Oct, 2019.
|
|
||||||
|
|
||||||
According to the abstract,
|
|
||||||
|
|
||||||
- Bart uses a standard seq2seq/machine translation architecture with a bidirectional encoder (like BERT) and a
|
|
||||||
left-to-right decoder (like GPT).
|
|
||||||
- The pretraining task involves randomly shuffling the order of the original sentences and a novel in-filling scheme,
|
|
||||||
where spans of text are replaced with a single mask token.
|
|
||||||
- BART is particularly effective when fine tuned for text generation but also works well for comprehension tasks. It
|
|
||||||
matches the performance of RoBERTa with comparable training resources on GLUE and SQuAD, achieves new
|
|
||||||
state-of-the-art results on a range of abstractive dialogue, question answering, and summarization tasks, with gains
|
|
||||||
of up to 6 ROUGE.
|
|
||||||
|
|
||||||
This model was contributed by `sshleifer <https://huggingface.co/sshleifer>`__. The Authors' code can be found `here
|
|
||||||
<https://github.com/pytorch/fairseq/tree/master/examples/bart>`__.
|
|
||||||
|
|
||||||
|
|
||||||
Examples
|
|
||||||
_______________________________________________________________________________________________________________________
|
|
||||||
|
|
||||||
- Examples and scripts for fine-tuning BART and other models for sequence to sequence tasks can be found in
|
|
||||||
:prefix_link:`examples/pytorch/summarization/ <examples/pytorch/summarization/README.md>`.
|
|
||||||
- An example of how to train :class:`~transformers.BartForConditionalGeneration` with a Hugging Face :obj:`datasets`
|
|
||||||
object can be found in this `forum discussion
|
|
||||||
<https://discuss.huggingface.co/t/train-bart-for-conditional-generation-e-g-summarization/1904>`__.
|
|
||||||
- `Distilled checkpoints <https://huggingface.co/models?search=distilbart>`__ are described in this `paper
|
|
||||||
<https://arxiv.org/abs/2010.13002>`__.
|
|
||||||
|
|
||||||
|
|
||||||
Implementation Notes
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
- Bart doesn't use :obj:`token_type_ids` for sequence classification. Use :class:`~transformers.BartTokenizer` or
|
|
||||||
:meth:`~transformers.BartTokenizer.encode` to get the proper splitting.
|
|
||||||
- The forward pass of :class:`~transformers.BartModel` will create the ``decoder_input_ids`` if they are not passed.
|
|
||||||
This is different than some other modeling APIs. A typical use case of this feature is mask filling.
|
|
||||||
- Model predictions are intended to be identical to the original implementation when
|
|
||||||
:obj:`force_bos_token_to_be_generated=True`. This only works, however, if the string you pass to
|
|
||||||
:func:`fairseq.encode` starts with a space.
|
|
||||||
- :meth:`~transformers.generation_utils.GenerationMixin.generate` should be used for conditional generation tasks like
|
|
||||||
summarization, see the example in that docstrings.
|
|
||||||
- Models that load the `facebook/bart-large-cnn` weights will not have a :obj:`mask_token_id`, or be able to perform
|
|
||||||
mask-filling tasks.
|
|
||||||
|
|
||||||
Mask Filling
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
The :obj:`facebook/bart-base` and :obj:`facebook/bart-large` checkpoints can be used to fill multi-token masks.
|
|
||||||
|
|
||||||
.. code-block::
|
|
||||||
|
|
||||||
    from transformers import BartForConditionalGeneration, BartTokenizer
    model = BartForConditionalGeneration.from_pretrained("facebook/bart-large", forced_bos_token_id=0)
    tok = BartTokenizer.from_pretrained("facebook/bart-large")
    example_english_phrase = "UN Chief Says There Is No <mask> in Syria"
    batch = tok(example_english_phrase, return_tensors='pt')
    generated_ids = model.generate(batch['input_ids'])
    assert tok.batch_decode(generated_ids, skip_special_tokens=True) == ['UN Chief Says There Is No Plan to Stop Chemical Weapons in Syria']


BartConfig
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.BartConfig
    :members:


BartTokenizer
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.BartTokenizer
    :members:


BartTokenizerFast
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.BartTokenizerFast
    :members:


BartModel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.BartModel
    :members: forward


BartForConditionalGeneration
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.BartForConditionalGeneration
    :members: forward


BartForSequenceClassification
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.BartForSequenceClassification
    :members: forward


BartForQuestionAnswering
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.BartForQuestionAnswering
    :members: forward


BartForCausalLM
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.BartForCausalLM
    :members: forward


TFBartModel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.TFBartModel
    :members: call


TFBartForConditionalGeneration
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.TFBartForConditionalGeneration
    :members: call


FlaxBartModel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.FlaxBartModel
    :members: __call__, encode, decode


FlaxBartForConditionalGeneration
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.FlaxBartForConditionalGeneration
    :members: __call__, encode, decode


FlaxBartForSequenceClassification
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.FlaxBartForSequenceClassification
    :members: __call__, encode, decode


FlaxBartForQuestionAnswering
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.FlaxBartForQuestionAnswering
    :members: __call__, encode, decode
|
|
||||||
50
docs/source/model_doc/barthez.mdx
Normal file
@@ -0,0 +1,50 @@
|
|||||||
|
<!--Copyright 2020 The HuggingFace Team. All rights reserved.
|
||||||
|
|
||||||
|
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
|
||||||
|
the License. You may obtain a copy of the License at
|
||||||
|
|
||||||
|
http://www.apache.org/licenses/LICENSE-2.0
|
||||||
|
|
||||||
|
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
|
||||||
|
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
|
||||||
|
specific language governing permissions and limitations under the License.
|
||||||
|
-->
|
||||||
|
|
||||||
|
# BARThez
|
||||||
|
|
||||||
|
## Overview
|
||||||
|
|
||||||
|
The BARThez model was proposed in [BARThez: a Skilled Pretrained French Sequence-to-Sequence Model](https://arxiv.org/abs/2010.12321) by Moussa Kamal Eddine, Antoine J.-P. Tixier, Michalis Vazirgiannis on 23 Oct,
|
||||||
|
2020.
|
||||||
|
|
||||||
|
The abstract of the paper:
|
||||||
|
|
||||||
|
|
||||||
|
*Inductive transfer learning, enabled by self-supervised learning, have taken the entire Natural Language Processing
|
||||||
|
(NLP) field by storm, with models such as BERT and BART setting new state of the art on countless natural language
|
||||||
|
understanding tasks. While there are some notable exceptions, most of the available models and research have been
|
||||||
|
conducted for the English language. In this work, we introduce BARThez, the first BART model for the French language
|
||||||
|
(to the best of our knowledge). BARThez was pretrained on a very large monolingual French corpus from past research
|
||||||
|
that we adapted to suit BART's perturbation schemes. Unlike already existing BERT-based French language models such as
|
||||||
|
CamemBERT and FlauBERT, BARThez is particularly well-suited for generative tasks, since not only its encoder but also
|
||||||
|
its decoder is pretrained. In addition to discriminative tasks from the FLUE benchmark, we evaluate BARThez on a novel
|
||||||
|
summarization dataset, OrangeSum, that we release with this paper. We also continue the pretraining of an already
|
||||||
|
pretrained multilingual BART on BARThez's corpus, and we show that the resulting model, which we call mBARTHez,
|
||||||
|
provides a significant boost over vanilla BARThez, and is on par with or outperforms CamemBERT and FlauBERT.*
|
||||||
|
|
||||||
|
This model was contributed by [moussakam](https://huggingface.co/moussakam). The authors' code can be found [here](https://github.com/moussaKam/BARThez).
|
||||||
|
|
||||||
|
|
||||||
|
### Examples
|
||||||
|
|
||||||
|
- BARThez can be fine-tuned on sequence-to-sequence tasks in a similar way to BART; see:
|
||||||
|
[examples/pytorch/summarization/](https://github.com/huggingface/transformers/tree/master/examples/pytorch/summarization/README.md).
|
||||||
|
|
||||||
|
|
||||||
|
## BarthezTokenizer
|
||||||
|
|
||||||
|
[[autodoc]] BarthezTokenizer
|
||||||
|
|
||||||
|
## BarthezTokenizerFast
|
||||||
|
|
||||||
|
[[autodoc]] BarthezTokenizerFast
|
||||||
@@ -1,60 +0,0 @@
|
|||||||
..
|
|
||||||
Copyright 2020 The HuggingFace Team. All rights reserved.
|
|
||||||
|
|
||||||
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
|
|
||||||
the License. You may obtain a copy of the License at
|
|
||||||
|
|
||||||
http://www.apache.org/licenses/LICENSE-2.0
|
|
||||||
|
|
||||||
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
|
|
||||||
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
|
|
||||||
specific language governing permissions and limitations under the License.
|
|
||||||
|
|
||||||
BARThez
|
|
||||||
-----------------------------------------------------------------------------------------------------------------------
|
|
||||||
|
|
||||||
Overview
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
The BARThez model was proposed in `BARThez: a Skilled Pretrained French Sequence-to-Sequence Model
|
|
||||||
<https://arxiv.org/abs/2010.12321>`__ by Moussa Kamal Eddine, Antoine J.-P. Tixier, Michalis Vazirgiannis on 23 Oct,
|
|
||||||
2020.
|
|
||||||
|
|
||||||
The abstract of the paper:
|
|
||||||
|
|
||||||
|
|
||||||
*Inductive transfer learning, enabled by self-supervised learning, have taken the entire Natural Language Processing
|
|
||||||
(NLP) field by storm, with models such as BERT and BART setting new state of the art on countless natural language
|
|
||||||
understanding tasks. While there are some notable exceptions, most of the available models and research have been
|
|
||||||
conducted for the English language. In this work, we introduce BARThez, the first BART model for the French language
|
|
||||||
(to the best of our knowledge). BARThez was pretrained on a very large monolingual French corpus from past research
|
|
||||||
that we adapted to suit BART's perturbation schemes. Unlike already existing BERT-based French language models such as
|
|
||||||
CamemBERT and FlauBERT, BARThez is particularly well-suited for generative tasks, since not only its encoder but also
|
|
||||||
its decoder is pretrained. In addition to discriminative tasks from the FLUE benchmark, we evaluate BARThez on a novel
|
|
||||||
summarization dataset, OrangeSum, that we release with this paper. We also continue the pretraining of an already
|
|
||||||
pretrained multilingual BART on BARThez's corpus, and we show that the resulting model, which we call mBARTHez,
|
|
||||||
provides a significant boost over vanilla BARThez, and is on par with or outperforms CamemBERT and FlauBERT.*
|
|
||||||
|
|
||||||
This model was contributed by `moussakam <https://huggingface.co/moussakam>`__. The Authors' code can be found `here
|
|
||||||
<https://github.com/moussaKam/BARThez>`__.
|
|
||||||
|
|
||||||
|
|
||||||
Examples
|
|
||||||
_______________________________________________________________________________________________________________________
|
|
||||||
|
|
||||||
- BARThez can be fine-tuned on sequence-to-sequence tasks in a similar way as BART, check:
|
|
||||||
:prefix_link:`examples/pytorch/summarization/ <examples/pytorch/summarization/README.md>`.
|
|
||||||
|
|
||||||
|
|
||||||
BarthezTokenizer
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.BarthezTokenizer
|
|
||||||
:members:
|
|
||||||
|
|
||||||
|
|
||||||
BarthezTokenizerFast
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.BarthezTokenizerFast
|
|
||||||
:members:
|
|
||||||
80
docs/source/model_doc/bartpho.mdx
Normal file
@@ -0,0 +1,80 @@
|
|||||||
|
<!--Copyright 2021 The HuggingFace Team. All rights reserved.
|
||||||
|
|
||||||
|
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
|
||||||
|
the License. You may obtain a copy of the License at
|
||||||
|
|
||||||
|
http://www.apache.org/licenses/LICENSE-2.0
|
||||||
|
|
||||||
|
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
|
||||||
|
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
|
||||||
|
specific language governing permissions and limitations under the License.
|
||||||
|
-->
|
||||||
|
|
||||||
|
# BARTpho
|
||||||
|
|
||||||
|
## Overview
|
||||||
|
|
||||||
|
The BARTpho model was proposed in [BARTpho: Pre-trained Sequence-to-Sequence Models for Vietnamese](https://arxiv.org/abs/2109.09701) by Nguyen Luong Tran, Duong Minh Le and Dat Quoc Nguyen.
|
||||||
|
|
||||||
|
The abstract from the paper is the following:
|
||||||
|
|
||||||
|
*We present BARTpho with two versions -- BARTpho_word and BARTpho_syllable -- the first public large-scale monolingual
|
||||||
|
sequence-to-sequence models pre-trained for Vietnamese. Our BARTpho uses the "large" architecture and pre-training
|
||||||
|
scheme of the sequence-to-sequence denoising model BART, thus especially suitable for generative NLP tasks. Experiments
|
||||||
|
on a downstream task of Vietnamese text summarization show that in both automatic and human evaluations, our BARTpho
|
||||||
|
outperforms the strong baseline mBART and improves the state-of-the-art. We release BARTpho to facilitate future
|
||||||
|
research and applications of generative Vietnamese NLP tasks.*
|
||||||
|
|
||||||
|
Example of use:
|
||||||
|
|
||||||
```python
>>> import torch
>>> from transformers import AutoModel, AutoTokenizer

>>> bartpho = AutoModel.from_pretrained("vinai/bartpho-syllable")

>>> tokenizer = AutoTokenizer.from_pretrained("vinai/bartpho-syllable")

>>> line = "Chúng tôi là những nghiên cứu viên."

>>> input_ids = tokenizer(line, return_tensors="pt")

>>> with torch.no_grad():
...     features = bartpho(**input_ids)  # Model outputs are now tuples

>>> # With TensorFlow 2.0+:
>>> from transformers import TFAutoModel
>>> bartpho = TFAutoModel.from_pretrained("vinai/bartpho-syllable")
>>> input_ids = tokenizer(line, return_tensors="tf")
>>> features = bartpho(**input_ids)
```
|
|
||||||
|
Tips:
|
||||||
|
|
||||||
|
- Following mBART, BARTpho uses the "large" architecture of BART with an additional layer-normalization layer on top of
|
||||||
|
both the encoder and decoder. Thus, usage examples in the [documentation of BART](bart), when adapting to use
|
||||||
|
with BARTpho, should be adjusted by replacing the BART-specialized classes with the mBART-specialized counterparts.
|
||||||
|
For example:
|
||||||
|
|
||||||
```python
>>> from transformers import MBartForConditionalGeneration
>>> bartpho = MBartForConditionalGeneration.from_pretrained("vinai/bartpho-syllable")
>>> TXT = 'Chúng tôi là <mask> nghiên cứu viên.'
>>> input_ids = tokenizer([TXT], return_tensors='pt')['input_ids']
>>> logits = bartpho(input_ids).logits
>>> masked_index = (input_ids[0] == tokenizer.mask_token_id).nonzero().item()
>>> probs = logits[0, masked_index].softmax(dim=0)
>>> values, predictions = probs.topk(5)
>>> print(tokenizer.decode(predictions).split())
```
|
|
||||||
|
- This implementation is only for tokenization: "monolingual_vocab_file" consists of Vietnamese-specialized types
|
||||||
|
extracted from the pre-trained SentencePiece model "vocab_file" that is available from the multilingual XLM-RoBERTa.
|
||||||
|
Other languages, if employing this pre-trained multilingual SentencePiece model "vocab_file" for subword
|
||||||
|
segmentation, can reuse BartphoTokenizer with their own language-specialized "monolingual_vocab_file".
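
The relationship between the two vocabulary files in the last tip can be illustrated with a toy sketch. This is not the `BartphoTokenizer` implementation: the function name, the toy vocabulary and the toy corpus are hypothetical, and the real extraction procedure may differ. The sketch only shows the idea of keeping those types of a multilingual subword vocabulary that actually occur in a tokenized monolingual corpus.

```python
def build_monolingual_vocab(multilingual_vocab, tokenized_corpus):
    """Hypothetical sketch: keep the multilingual types that occur in a monolingual corpus."""
    seen = set()
    for pieces in tokenized_corpus:
        seen.update(pieces)
    # Preserve the ordering of the multilingual vocabulary.
    return [piece for piece in multilingual_vocab if piece in seen]


# Toy data (assumed for illustration only).
multilingual_vocab = ["▁Chúng", "▁tôi", "▁là", "▁the", "▁nous", "▁viên"]
tokenized_corpus = [["▁Chúng", "▁tôi", "▁là"], ["▁tôi", "▁viên"]]

monolingual_vocab = build_monolingual_vocab(multilingual_vocab, tokenized_corpus)
print(monolingual_vocab)
```

Under this reading, another language would keep the same multilingual subword vocabulary and only swap in its own list of retained types.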
|
||||||
|
|
||||||
|
This model was contributed by [dqnguyen](https://huggingface.co/dqnguyen). The original code can be found [here](https://github.com/VinAIResearch/BARTpho).
|
||||||
|
|
||||||
|
## BartphoTokenizer
|
||||||
|
|
||||||
|
[[autodoc]] BartphoTokenizer
|
||||||
@@ -1,86 +0,0 @@
|
|||||||
..
|
|
||||||
Copyright 2021 The HuggingFace Team. All rights reserved.
|
|
||||||
|
|
||||||
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
|
|
||||||
the License. You may obtain a copy of the License at
|
|
||||||
|
|
||||||
http://www.apache.org/licenses/LICENSE-2.0
|
|
||||||
|
|
||||||
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
|
|
||||||
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
|
|
||||||
specific language governing permissions and limitations under the License.
|
|
||||||
|
|
||||||
BARTpho
|
|
||||||
-----------------------------------------------------------------------------------------------------------------------
|
|
||||||
|
|
||||||
Overview
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
The BARTpho model was proposed in `BARTpho: Pre-trained Sequence-to-Sequence Models for Vietnamese
|
|
||||||
<https://arxiv.org/abs/2109.09701>`__ by Nguyen Luong Tran, Duong Minh Le and Dat Quoc Nguyen.
|
|
||||||
|
|
||||||
The abstract from the paper is the following:
|
|
||||||
|
|
||||||
*We present BARTpho with two versions -- BARTpho_word and BARTpho_syllable -- the first public large-scale monolingual
|
|
||||||
sequence-to-sequence models pre-trained for Vietnamese. Our BARTpho uses the "large" architecture and pre-training
|
|
||||||
scheme of the sequence-to-sequence denoising model BART, thus especially suitable for generative NLP tasks. Experiments
|
|
||||||
on a downstream task of Vietnamese text summarization show that in both automatic and human evaluations, our BARTpho
|
|
||||||
outperforms the strong baseline mBART and improves the state-of-the-art. We release BARTpho to facilitate future
|
|
||||||
research and applications of generative Vietnamese NLP tasks.*
|
|
||||||
|
|
||||||
Example of use:
|
|
||||||
|
|
||||||
.. code-block::

    >>> import torch
    >>> from transformers import AutoModel, AutoTokenizer

    >>> bartpho = AutoModel.from_pretrained("vinai/bartpho-syllable")

    >>> tokenizer = AutoTokenizer.from_pretrained("vinai/bartpho-syllable")

    >>> line = "Chúng tôi là những nghiên cứu viên."

    >>> input_ids = tokenizer(line, return_tensors="pt")

    >>> with torch.no_grad():
    ...     features = bartpho(**input_ids)  # Model outputs are now tuples

    >>> # With TensorFlow 2.0+:
    >>> from transformers import TFAutoModel
    >>> bartpho = TFAutoModel.from_pretrained("vinai/bartpho-syllable")
    >>> input_ids = tokenizer(line, return_tensors="tf")
    >>> features = bartpho(**input_ids)
|
|
||||||
Tips:
|
|
||||||
|
|
||||||
- Following mBART, BARTpho uses the "large" architecture of BART with an additional layer-normalization layer on top of
|
|
||||||
both the encoder and decoder. Thus, usage examples in the :doc:`documentation of BART <bart>`, when adapting to use
|
|
||||||
with BARTpho, should be adjusted by replacing the BART-specialized classes with the mBART-specialized counterparts.
|
|
||||||
For example:
|
|
||||||
|
|
||||||
.. code-block::

    >>> from transformers import MBartForConditionalGeneration
    >>> bartpho = MBartForConditionalGeneration.from_pretrained("vinai/bartpho-syllable")
    >>> TXT = 'Chúng tôi là <mask> nghiên cứu viên.'
    >>> input_ids = tokenizer([TXT], return_tensors='pt')['input_ids']
    >>> logits = bartpho(input_ids).logits
    >>> masked_index = (input_ids[0] == tokenizer.mask_token_id).nonzero().item()
    >>> probs = logits[0, masked_index].softmax(dim=0)
    >>> values, predictions = probs.topk(5)
    >>> print(tokenizer.decode(predictions).split())
|
|
||||||
- This implementation is only for tokenization: "monolingual_vocab_file" consists of Vietnamese-specialized types
|
|
||||||
extracted from the pre-trained SentencePiece model "vocab_file" that is available from the multilingual XLM-RoBERTa.
|
|
||||||
Other languages, if employing this pre-trained multilingual SentencePiece model "vocab_file" for subword
|
|
||||||
segmentation, can reuse BartphoTokenizer with their own language-specialized "monolingual_vocab_file".
|
|
||||||
|
|
||||||
This model was contributed by `dqnguyen <https://huggingface.co/dqnguyen>`__. The original code can be found `here
|
|
||||||
<https://github.com/VinAIResearch/BARTpho>`__.
|
|
||||||
|
|
||||||
BartphoTokenizer
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.BartphoTokenizer
|
|
||||||
:members:
|
|
||||||
114
docs/source/model_doc/beit.mdx
Normal file
@@ -0,0 +1,114 @@
|
|||||||
|
<!--Copyright 2021 The HuggingFace Team. All rights reserved.
|
||||||
|
|
||||||
|
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
|
||||||
|
the License. You may obtain a copy of the License at
|
||||||
|
|
||||||
|
http://www.apache.org/licenses/LICENSE-2.0
|
||||||
|
|
||||||
|
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
|
||||||
|
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
|
||||||
|
specific language governing permissions and limitations under the License.
|
||||||
|
-->
|
||||||
|
|
||||||
|
# BEiT
|
||||||
|
|
||||||
|
## Overview
|
||||||
|
|
||||||
|
The BEiT model was proposed in [BEiT: BERT Pre-Training of Image Transformers](https://arxiv.org/abs/2106.08254) by
|
||||||
|
Hangbo Bao, Li Dong and Furu Wei. Inspired by BERT, BEiT is the first paper that makes self-supervised pre-training of
|
||||||
|
Vision Transformers (ViTs) outperform supervised pre-training. Rather than pre-training the model to predict the class
|
||||||
|
of an image (as done in the [original ViT paper](https://arxiv.org/abs/2010.11929)), BEiT models are pre-trained to
|
||||||
|
predict visual tokens from the codebook of OpenAI's [DALL-E model](https://arxiv.org/abs/2102.12092) given masked
|
||||||
|
patches.
|
||||||
|
|
||||||
|
The abstract from the paper is the following:
|
||||||
|
|
||||||
|
*We introduce a self-supervised vision representation model BEiT, which stands for Bidirectional Encoder representation
|
||||||
|
from Image Transformers. Following BERT developed in the natural language processing area, we propose a masked image
|
||||||
|
modeling task to pretrain vision Transformers. Specifically, each image has two views in our pre-training, i.e, image
|
||||||
|
patches (such as 16x16 pixels), and visual tokens (i.e., discrete tokens). We first "tokenize" the original image into
|
||||||
|
visual tokens. Then we randomly mask some image patches and fed them into the backbone Transformer. The pre-training
|
||||||
|
objective is to recover the original visual tokens based on the corrupted image patches. After pre-training BEiT, we
|
||||||
|
directly fine-tune the model parameters on downstream tasks by appending task layers upon the pretrained encoder.
|
||||||
|
Experimental results on image classification and semantic segmentation show that our model achieves competitive results
|
||||||
|
with previous pre-training methods. For example, base-size BEiT achieves 83.2% top-1 accuracy on ImageNet-1K,
|
||||||
|
significantly outperforming from-scratch DeiT training (81.8%) with the same setup. Moreover, large-size BEiT obtains
|
||||||
|
86.3% only using ImageNet-1K, even outperforming ViT-L with supervised pre-training on ImageNet-22K (85.2%).*
|
||||||
|
|
||||||
|
Tips:
|
||||||
|
|
||||||
|
- BEiT models are regular Vision Transformers, but pre-trained in a self-supervised way rather than supervised. They
|
||||||
|
outperform both the [original model (ViT)](vit) as well as [Data-efficient Image Transformers (DeiT)](deit) when fine-tuned on ImageNet-1K and CIFAR-100. You can check out demo notebooks regarding inference as well as
|
||||||
|
fine-tuning on custom data [here](https://github.com/NielsRogge/Transformers-Tutorials/tree/master/VisionTransformer) (you can just replace
|
||||||
|
[`ViTFeatureExtractor`] by [`BeitFeatureExtractor`] and
|
||||||
|
[`ViTForImageClassification`] by [`BeitForImageClassification`]).
|
||||||
|
- There's also a demo notebook available which showcases how to combine DALL-E's image tokenizer with BEiT for
|
||||||
|
performing masked image modeling. You can find it [here](https://github.com/NielsRogge/Transformers-Tutorials/tree/master/BEiT).
|
||||||
|
- As the BEiT models expect each image to be of the same size (resolution), one can use
|
||||||
|
[`BeitFeatureExtractor`] to resize (or rescale) and normalize images for the model.
|
||||||
|
- Both the patch resolution and image resolution used during pre-training or fine-tuning are reflected in the name of
|
||||||
|
each checkpoint. For example, `microsoft/beit-base-patch16-224` refers to a base-sized architecture with patch
|
||||||
|
resolution of 16x16 and fine-tuning resolution of 224x224. All checkpoints can be found on the [hub](https://huggingface.co/models?search=microsoft/beit).
|
||||||
|
- The available checkpoints are either (1) pre-trained on [ImageNet-22k](http://www.image-net.org/) (a collection of
|
||||||
|
14 million images and 22k classes) only, (2) also fine-tuned on ImageNet-22k or (3) also fine-tuned on [ImageNet-1k](http://www.image-net.org/challenges/LSVRC/2012/) (also referred to as ILSVRC 2012, a collection of 1.3 million
|
||||||
|
images and 1,000 classes).
|
||||||
|
- BEiT uses relative position embeddings, inspired by the T5 model. During pre-training, the authors shared the
|
||||||
|
relative position bias among the several self-attention layers. During fine-tuning, each layer's relative position
|
||||||
|
bias is initialized with the shared relative position bias obtained after pre-training. Note that, if one wants to
|
||||||
|
pre-train a model from scratch, one needs to either set the `use_relative_position_bias` or the
|
||||||
|
`use_shared_relative_position_bias` attribute of [`BeitConfig`] to `True` in order to add
|
||||||
|
position embeddings.
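
The checkpoint naming convention described in the tips above can be unpacked mechanically. The following is a minimal sketch, assuming names contain a `patch<P>-<R>` segment as in the example checkpoint; `parse_beit_checkpoint_name` is a hypothetical helper and not part of the library. It also derives how many non-overlapping patches a square image of that resolution is cut into.

```python
import re


def parse_beit_checkpoint_name(name):
    """Hypothetical helper: read patch and image resolution from a checkpoint name.

    Assumes the name contains ``patch<P>-<R>``, as in ``microsoft/beit-base-patch16-224``.
    """
    match = re.search(r"patch(\d+)-(\d+)", name)
    if match is None:
        raise ValueError(f"no patch/resolution segment found in {name!r}")
    patch_size, image_size = int(match.group(1)), int(match.group(2))
    # A square image is split into a grid of non-overlapping square patches.
    num_patches = (image_size // patch_size) ** 2
    return {"patch_size": patch_size, "image_size": image_size, "num_patches": num_patches}


info = parse_beit_checkpoint_name("microsoft/beit-base-patch16-224")
print(info)
```

For the example checkpoint this gives a 14 x 14 grid, which is the number of patch positions a masked image modeling objective would operate over at that resolution.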
|
||||||
|
|
||||||
|
This model was contributed by [nielsr](https://huggingface.co/nielsr). The JAX/FLAX version of this model was
|
||||||
|
contributed by [kamalkraj](https://huggingface.co/kamalkraj). The original code can be found [here](https://github.com/microsoft/unilm/tree/master/beit).
|
||||||
|
|
||||||
|
|
||||||
|
## BEiT specific outputs
|
||||||
|
|
||||||
|
[[autodoc]] models.beit.modeling_beit.BeitModelOutputWithPooling
|
||||||
|
|
||||||
|
[[autodoc]] models.beit.modeling_flax_beit.FlaxBeitModelOutputWithPooling
|
||||||
|
|
||||||
|
## BeitConfig
|
||||||
|
|
||||||
|
[[autodoc]] BeitConfig
|
||||||
|
|
||||||
|
## BeitFeatureExtractor
|
||||||
|
|
||||||
|
[[autodoc]] BeitFeatureExtractor
|
||||||
|
- __call__
|
||||||
|
|
||||||
|
## BeitModel
|
||||||
|
|
||||||
|
[[autodoc]] BeitModel
|
||||||
|
- forward
|
||||||
|
|
||||||
|
## BeitForMaskedImageModeling
|
||||||
|
|
||||||
|
[[autodoc]] BeitForMaskedImageModeling
|
||||||
|
- forward
|
||||||
|
|
||||||
|
## BeitForImageClassification
|
||||||
|
|
||||||
|
[[autodoc]] BeitForImageClassification
|
||||||
|
- forward
|
||||||
|
|
||||||
|
## BeitForSemanticSegmentation
|
||||||
|
|
||||||
|
[[autodoc]] BeitForSemanticSegmentation
|
||||||
|
- forward
|
||||||
|
|
||||||
|
## FlaxBeitModel
|
||||||
|
|
||||||
|
[[autodoc]] FlaxBeitModel
|
||||||
|
- __call__
|
||||||
|
|
||||||
|
## FlaxBeitForMaskedImageModeling
|
||||||
|
|
||||||
|
[[autodoc]] FlaxBeitForMaskedImageModeling
|
||||||
|
- __call__
|
||||||
|
|
||||||
|
## FlaxBeitForImageClassification
|
||||||
|
|
||||||
|
[[autodoc]] FlaxBeitForImageClassification
|
||||||
|
- __call__
|
||||||
@@ -1,144 +0,0 @@
|
|||||||
..
|
|
||||||
Copyright 2021 The HuggingFace Team. All rights reserved.
|
|
||||||
|
|
||||||
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
|
|
||||||
the License. You may obtain a copy of the License at
|
|
||||||
|
|
||||||
http://www.apache.org/licenses/LICENSE-2.0
|
|
||||||
|
|
||||||
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
|
|
||||||
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
|
|
||||||
specific language governing permissions and limitations under the License.
|
|
||||||
|
|
||||||
BEiT
|
|
||||||
-----------------------------------------------------------------------------------------------------------------------
|
|
||||||
|
|
||||||
Overview
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
The BEiT model was proposed in `BEiT: BERT Pre-Training of Image Transformers <https://arxiv.org/abs/2106.08254>`__ by
|
|
||||||
Hangbo Bao, Li Dong and Furu Wei. Inspired by BERT, BEiT is the first paper that makes self-supervised pre-training of
|
|
||||||
Vision Transformers (ViTs) outperform supervised pre-training. Rather than pre-training the model to predict the class
|
|
||||||
of an image (as done in the `original ViT paper <https://arxiv.org/abs/2010.11929>`__), BEiT models are pre-trained to
|
|
||||||
predict visual tokens from the codebook of OpenAI's `DALL-E model <https://arxiv.org/abs/2102.12092>`__ given masked
|
|
||||||
patches.
|
|
||||||
|
|
||||||
The abstract from the paper is the following:
|
|
||||||
|
|
||||||
*We introduce a self-supervised vision representation model BEiT, which stands for Bidirectional Encoder representation
|
|
||||||
from Image Transformers. Following BERT developed in the natural language processing area, we propose a masked image
|
|
||||||
modeling task to pretrain vision Transformers. Specifically, each image has two views in our pre-training, i.e, image
|
|
||||||
patches (such as 16x16 pixels), and visual tokens (i.e., discrete tokens). We first "tokenize" the original image into
|
|
||||||
visual tokens. Then we randomly mask some image patches and fed them into the backbone Transformer. The pre-training
|
|
||||||
objective is to recover the original visual tokens based on the corrupted image patches. After pre-training BEiT, we
|
|
||||||
directly fine-tune the model parameters on downstream tasks by appending task layers upon the pretrained encoder.
|
|
||||||
Experimental results on image classification and semantic segmentation show that our model achieves competitive results
|
|
||||||
with previous pre-training methods. For example, base-size BEiT achieves 83.2% top-1 accuracy on ImageNet-1K,
|
|
||||||
significantly outperforming from-scratch DeiT training (81.8%) with the same setup. Moreover, large-size BEiT obtains
|
|
||||||
86.3% only using ImageNet-1K, even outperforming ViT-L with supervised pre-training on ImageNet-22K (85.2%).*
|
|
||||||
|
|
||||||
Tips:
|
|
||||||
|
|
||||||
- BEiT models are regular Vision Transformers, but pre-trained in a self-supervised way rather than supervised. They
  outperform both the :doc:`original model (ViT) <vit>` and :doc:`Data-efficient Image Transformers (DeiT) <deit>`
  when fine-tuned on ImageNet-1K and CIFAR-100. You can check out demo notebooks regarding inference as well as
  fine-tuning on custom data `here
  <https://github.com/NielsRogge/Transformers-Tutorials/tree/master/VisionTransformer>`__ (you can just replace
  :class:`~transformers.ViTFeatureExtractor` by :class:`~transformers.BeitFeatureExtractor` and
  :class:`~transformers.ViTForImageClassification` by :class:`~transformers.BeitForImageClassification`).
- There's also a demo notebook available which showcases how to combine DALL-E's image tokenizer with BEiT for
  performing masked image modeling. You can find it `here
  <https://github.com/NielsRogge/Transformers-Tutorials/tree/master/BEiT>`__.
- As the BEiT models expect each image to be of the same size (resolution), one can use
  :class:`~transformers.BeitFeatureExtractor` to resize (or rescale) and normalize images for the model.
- Both the patch resolution and image resolution used during pre-training or fine-tuning are reflected in the name of
  each checkpoint. For example, :obj:`microsoft/beit-base-patch16-224` refers to a base-sized architecture with patch
  resolution of 16x16 and fine-tuning resolution of 224x224. All checkpoints can be found on the `hub
  <https://huggingface.co/models?search=microsoft/beit>`__.
- The available checkpoints are either (1) pre-trained on `ImageNet-22k <http://www.image-net.org/>`__ (a collection of
  14 million images and 22k classes) only, (2) also fine-tuned on ImageNet-22k or (3) also fine-tuned on `ImageNet-1k
  <http://www.image-net.org/challenges/LSVRC/2012/>`__ (also referred to as ILSVRC 2012, a collection of 1.3 million
  images and 1,000 classes).
- BEiT uses relative position embeddings, inspired by the T5 model. During pre-training, the authors shared the
  relative position bias among the several self-attention layers. During fine-tuning, each layer's relative position
  bias is initialized with the shared relative position bias obtained after pre-training. Note that, if one wants to
  pre-train a model from scratch, one needs to either set the :obj:`use_relative_position_bias` or the
  :obj:`use_shared_relative_position_bias` attribute of :class:`~transformers.BeitConfig` to :obj:`True` in order to
  add position embeddings.
|
|
||||||
|
|
||||||
This model was contributed by `nielsr <https://huggingface.co/nielsr>`__. The JAX/FLAX version of this model was
|
|
||||||
contributed by `kamalkraj <https://huggingface.co/kamalkraj>`__. The original code can be found `here
|
|
||||||
<https://github.com/microsoft/unilm/tree/master/beit>`__.
|
|
||||||
|
|
||||||
|
|
||||||
BEiT specific outputs
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.models.beit.modeling_beit.BeitModelOutputWithPooling
|
|
||||||
:members:
|
|
||||||
|
|
||||||
.. autoclass:: transformers.models.beit.modeling_flax_beit.FlaxBeitModelOutputWithPooling
|
|
||||||
:members:
|
|
||||||
|
|
||||||
|
|
||||||
BeitConfig
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.BeitConfig
|
|
||||||
:members:
|
|
||||||
|
|
||||||
|
|
||||||
BeitFeatureExtractor
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.BeitFeatureExtractor
|
|
||||||
:members: __call__
|
|
||||||
|
|
||||||
|
|
||||||
BeitModel
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.BeitModel
|
|
||||||
:members: forward
|
|
||||||
|
|
||||||
|
|
||||||
BeitForMaskedImageModeling
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.BeitForMaskedImageModeling
|
|
||||||
:members: forward
|
|
||||||
|
|
||||||
|
|
||||||
BeitForImageClassification
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.BeitForImageClassification
|
|
||||||
:members: forward
|
|
||||||
|
|
||||||
|
|
||||||
BeitForSemanticSegmentation
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.BeitForSemanticSegmentation
|
|
||||||
:members: forward
|
|
||||||
|
|
||||||
|
|
||||||
FlaxBeitModel
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.FlaxBeitModel
|
|
||||||
:members: __call__
|
|
||||||
|
|
||||||
|
|
||||||
FlaxBeitForMaskedImageModeling
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.FlaxBeitForMaskedImageModeling
|
|
||||||
:members: __call__
|
|
||||||
|
|
||||||
|
|
||||||
FlaxBeitForImageClassification
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.FlaxBeitForImageClassification
|
|
||||||
:members: __call__
|
|
||||||
74
docs/source/model_doc/bert_japanese.mdx
Normal file
@@ -0,0 +1,74 @@
|
|||||||
|
<!--Copyright 2020 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->

# BertJapanese

## Overview

These are BERT models trained on Japanese text.

There are models with two different tokenization methods:

- Tokenize with MeCab and WordPiece. This requires an extra dependency, [fugashi](https://github.com/polm/fugashi), which is a wrapper around [MeCab](https://taku910.github.io/mecab/).
- Tokenize into characters.

To use *MecabTokenizer*, you should `pip install transformers["ja"]` (or `pip install -e .["ja"]` if you install
from source) to install dependencies.

See the [details in the cl-tohoku repository](https://github.com/cl-tohoku/bert-japanese).

Example of using a model with MeCab and WordPiece tokenization:
|
||||||
|
|
||||||
|
```python
>>> import torch
>>> from transformers import AutoModel, AutoTokenizer

>>> bertjapanese = AutoModel.from_pretrained("cl-tohoku/bert-base-japanese")
>>> tokenizer = AutoTokenizer.from_pretrained("cl-tohoku/bert-base-japanese")

>>> ## Input Japanese Text
>>> line = "吾輩は猫である。"

>>> inputs = tokenizer(line, return_tensors="pt")

>>> print(tokenizer.decode(inputs['input_ids'][0]))
[CLS] 吾輩 は 猫 で ある 。 [SEP]

>>> outputs = bertjapanese(**inputs)
```
|
||||||
|
|
||||||
|
Example of using a model with character tokenization:

```python
>>> from transformers import AutoModel, AutoTokenizer

>>> bertjapanese = AutoModel.from_pretrained("cl-tohoku/bert-base-japanese-char")
>>> tokenizer = AutoTokenizer.from_pretrained("cl-tohoku/bert-base-japanese-char")

>>> ## Input Japanese Text
>>> line = "吾輩は猫である。"

>>> inputs = tokenizer(line, return_tensors="pt")

>>> print(tokenizer.decode(inputs['input_ids'][0]))
[CLS] 吾 輩 は 猫 で あ る 。 [SEP]

>>> outputs = bertjapanese(**inputs)
```
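
To make the character-level scheme concrete, the sketch below mimics only the splitting step that produces the output shown above. It is a simplified illustration and not the `BertJapaneseTokenizer` implementation: the helper name `char_tokenize` is hypothetical, and the real tokenizer does additional work (such as vocabulary lookup) that is omitted here.

```python
def char_tokenize(text, cls_token="[CLS]", sep_token="[SEP]"):
    """Split text into single characters and wrap it in BERT-style special tokens.

    Hypothetical helper for illustration only; not part of transformers.
    """
    # Every non-whitespace character becomes its own token.
    tokens = [ch for ch in text if not ch.isspace()]
    return [cls_token] + tokens + [sep_token]


tokens = char_tokenize("吾輩は猫である。")
print(" ".join(tokens))
```

Because each character is its own token, the sequence is longer than with MeCab and WordPiece tokenization, where `吾輩` and `ある` stay together as single tokens.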
|
||||||
|
|
||||||
|
Tips:

- This implementation is the same as BERT, except for the tokenization method. Refer to the [documentation of BERT](bert) for more usage examples.

This model was contributed by [cl-tohoku](https://huggingface.co/cl-tohoku).

## BertJapaneseTokenizer

[[autodoc]] BertJapaneseTokenizer
|
||||||
@@ -1,80 +0,0 @@
|
|||||||
..
|
|
||||||
Copyright 2020 The HuggingFace Team. All rights reserved.
|
|
||||||
|
|
||||||
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
|
|
||||||
the License. You may obtain a copy of the License at
|
|
||||||
|
|
||||||
http://www.apache.org/licenses/LICENSE-2.0
|
|
||||||
|
|
||||||
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
|
|
||||||
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
|
|
||||||
specific language governing permissions and limitations under the License.
|
|
||||||
|
|
||||||
BertJapanese
|
|
||||||
-----------------------------------------------------------------------------------------------------------------------
|
|
||||||
|
|
||||||
Overview
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
The BERT models trained on Japanese text.
|
|
||||||
|
|
||||||
There are models with two different tokenization methods:
|
|
||||||
|
|
||||||
- Tokenize with MeCab and WordPiece. This requires some extra dependencies, `fugashi
|
|
||||||
<https://github.com/polm/fugashi>`__ which is a wrapper around `MeCab <https://taku910.github.io/mecab/>`__.
|
|
||||||
- Tokenize into characters.
|
|
||||||
|
|
||||||
To use `MecabTokenizer`, you should ``pip install transformers["ja"]`` (or ``pip install -e .["ja"]`` if you install
|
|
||||||
from source) to install dependencies.
|
|
||||||
|
|
||||||
See `details on cl-tohoku repository <https://github.com/cl-tohoku/bert-japanese>`__.
|
|
||||||
|
|
||||||
Example of using a model with MeCab and WordPiece tokenization:
|
|
||||||
|
|
||||||
.. code-block::
|
|
||||||
|
|
||||||
>>> import torch
|
|
||||||
>>> from transformers import AutoModel, AutoTokenizer
|
|
||||||
|
|
||||||
>>> bertjapanese = AutoModel.from_pretrained("cl-tohoku/bert-base-japanese")
|
|
||||||
>>> tokenizer = AutoTokenizer.from_pretrained("cl-tohoku/bert-base-japanese")
|
|
||||||
|
|
||||||
>>> ## Input Japanese Text
|
|
||||||
>>> line = "吾輩は猫である。"
|
|
||||||
|
|
||||||
>>> inputs = tokenizer(line, return_tensors="pt")
|
|
||||||
|
|
||||||
>>> print(tokenizer.decode(inputs['input_ids'][0]))
|
|
||||||
[CLS] 吾輩 は 猫 で ある 。 [SEP]
|
|
||||||
|
|
||||||
>>> outputs = bertjapanese(**inputs)
|
|
||||||
|
|
||||||
Example of using a model with Character tokenization:
|
|
||||||
|
|
||||||
.. code-block::
|
|
||||||
|
|
||||||
>>> bertjapanese = AutoModel.from_pretrained("cl-tohoku/bert-base-japanese-char")
|
|
||||||
>>> tokenizer = AutoTokenizer.from_pretrained("cl-tohoku/bert-base-japanese-char")
|
|
||||||
|
|
||||||
>>> ## Input Japanese Text
|
|
||||||
>>> line = "吾輩は猫である。"
|
|
||||||
|
|
||||||
>>> inputs = tokenizer(line, return_tensors="pt")
|
|
||||||
|
|
||||||
>>> print(tokenizer.decode(inputs['input_ids'][0]))
|
|
||||||
[CLS] 吾 輩 は 猫 で あ る 。 [SEP]
|
|
||||||
|
|
||||||
>>> outputs = bertjapanese(**inputs)
|
|
||||||
|
|
||||||
Tips:
|
|
||||||
|
|
||||||
- This implementation is the same as BERT, except for tokenization method. Refer to the :doc:`documentation of BERT
|
|
||||||
<bert>` for more usage examples.
|
|
||||||
|
|
||||||
This model was contributed by `cl-tohoku <https://huggingface.co/cl-tohoku>`__.
|
|
||||||
|
|
||||||
BertJapaneseTokenizer
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.BertJapaneseTokenizer
|
|
||||||
:members:
|
|
||||||
98
docs/source/model_doc/bertgeneration.mdx
Normal file
@@ -0,0 +1,98 @@
|
|||||||
|
<!--Copyright 2020 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->

# BertGeneration

## Overview

The BertGeneration model is a BERT model that can be leveraged for sequence-to-sequence tasks using
[`EncoderDecoderModel`] as proposed in [Leveraging Pre-trained Checkpoints for Sequence Generation
Tasks](https://arxiv.org/abs/1907.12461) by Sascha Rothe, Shashi Narayan, Aliaksei Severyn.

The abstract from the paper is the following:

*Unsupervised pretraining of large neural models has recently revolutionized Natural Language Processing. By
warm-starting from the publicly released checkpoints, NLP practitioners have pushed the state-of-the-art on multiple
benchmarks while saving significant amounts of compute time. So far the focus has been mainly on the Natural Language
Understanding tasks. In this paper, we demonstrate the efficacy of pre-trained checkpoints for Sequence Generation. We
developed a Transformer-based sequence-to-sequence model that is compatible with publicly available pre-trained BERT,
GPT-2 and RoBERTa checkpoints and conducted an extensive empirical study on the utility of initializing our model, both
encoder and decoder, with these checkpoints. Our models result in new state-of-the-art results on Machine Translation,
Text Summarization, Sentence Splitting, and Sentence Fusion.*

Usage:

- The model can be used in combination with the [`EncoderDecoderModel`] to leverage two pretrained
  BERT checkpoints for subsequent fine-tuning.
|
||||||
|
|
||||||
|
```python
>>> from transformers import BertGenerationEncoder, BertGenerationDecoder, BertTokenizer, EncoderDecoderModel

>>> # leverage checkpoints for Bert2Bert model...
>>> # use BERT's cls token as BOS token and sep token as EOS token
>>> encoder = BertGenerationEncoder.from_pretrained("bert-large-uncased", bos_token_id=101, eos_token_id=102)
>>> # add cross attention layers and use BERT's cls token as BOS token and sep token as EOS token
>>> decoder = BertGenerationDecoder.from_pretrained("bert-large-uncased", add_cross_attention=True, is_decoder=True, bos_token_id=101, eos_token_id=102)
>>> bert2bert = EncoderDecoderModel(encoder=encoder, decoder=decoder)

>>> # create tokenizer...
>>> tokenizer = BertTokenizer.from_pretrained("bert-large-uncased")

>>> input_ids = tokenizer('This is a long article to summarize', add_special_tokens=False, return_tensors="pt").input_ids
>>> labels = tokenizer('This is a short summary', return_tensors="pt").input_ids

>>> # train...
>>> loss = bert2bert(input_ids=input_ids, decoder_input_ids=labels, labels=labels).loss
>>> loss.backward()
```
|
||||||
|
|
||||||
|
- Pretrained [`EncoderDecoderModel`] checkpoints are also directly available in the model hub, e.g.:

```python
>>> from transformers import AutoTokenizer, EncoderDecoderModel

>>> # instantiate sentence fusion model
>>> sentence_fuser = EncoderDecoderModel.from_pretrained("google/roberta2roberta_L-24_discofuse")
>>> tokenizer = AutoTokenizer.from_pretrained("google/roberta2roberta_L-24_discofuse")

>>> input_ids = tokenizer('This is the first sentence. This is the second sentence.', add_special_tokens=False, return_tensors="pt").input_ids

>>> outputs = sentence_fuser.generate(input_ids)

>>> print(tokenizer.decode(outputs[0]))
```
|
||||||
|
|
||||||
|
Tips:

- [`BertGenerationEncoder`] and [`BertGenerationDecoder`] should be used in
  combination with [`EncoderDecoderModel`].
- For summarization, sentence splitting, sentence fusion and translation, no special tokens are required for the input.
  Therefore, no EOS token should be added to the end of the input.

This model was contributed by [patrickvonplaten](https://huggingface.co/patrickvonplaten). The original code can be
found [here](https://tfhub.dev/s?module-type=text-generation&subtype=module,placeholder).

## BertGenerationConfig

[[autodoc]] BertGenerationConfig

## BertGenerationTokenizer

[[autodoc]] BertGenerationTokenizer
    - save_vocabulary

## BertGenerationEncoder

[[autodoc]] BertGenerationEncoder
    - forward

## BertGenerationDecoder

[[autodoc]] BertGenerationDecoder
    - forward
|
||||||
@@ -1,109 +0,0 @@
|
|||||||
..
|
|
||||||
Copyright 2020 The HuggingFace Team. All rights reserved.
|
|
||||||
|
|
||||||
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
|
|
||||||
the License. You may obtain a copy of the License at
|
|
||||||
|
|
||||||
http://www.apache.org/licenses/LICENSE-2.0
|
|
||||||
|
|
||||||
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
|
|
||||||
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
|
|
||||||
specific language governing permissions and limitations under the License.
|
|
||||||
|
|
||||||
BertGeneration
|
|
||||||
-----------------------------------------------------------------------------------------------------------------------
|
|
||||||
|
|
||||||
Overview
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
The BertGeneration model is a BERT model that can be leveraged for sequence-to-sequence tasks using
|
|
||||||
:class:`~transformers.EncoderDecoderModel` as proposed in `Leveraging Pre-trained Checkpoints for Sequence Generation
|
|
||||||
Tasks <https://arxiv.org/abs/1907.12461>`__ by Sascha Rothe, Shashi Narayan, Aliaksei Severyn.
|
|
||||||
|
|
||||||
The abstract from the paper is the following:
|
|
||||||
|
|
||||||
*Unsupervised pretraining of large neural models has recently revolutionized Natural Language Processing. By
|
|
||||||
warm-starting from the publicly released checkpoints, NLP practitioners have pushed the state-of-the-art on multiple
|
|
||||||
benchmarks while saving significant amounts of compute time. So far the focus has been mainly on the Natural Language
|
|
||||||
Understanding tasks. In this paper, we demonstrate the efficacy of pre-trained checkpoints for Sequence Generation. We
|
|
||||||
developed a Transformer-based sequence-to-sequence model that is compatible with publicly available pre-trained BERT,
|
|
||||||
GPT-2 and RoBERTa checkpoints and conducted an extensive empirical study on the utility of initializing our model, both
|
|
||||||
encoder and decoder, with these checkpoints. Our models result in new state-of-the-art results on Machine Translation,
|
|
||||||
Text Summarization, Sentence Splitting, and Sentence Fusion.*
|
|
||||||
|
|
||||||
Usage:
|
|
||||||
|
|
||||||
- The model can be used in combination with the :class:`~transformers.EncoderDecoderModel` to leverage two pretrained
|
|
||||||
BERT checkpoints for subsequent fine-tuning.
|
|
||||||
|
|
||||||
.. code-block::
|
|
||||||
|
|
||||||
>>> # leverage checkpoints for Bert2Bert model...
|
|
||||||
>>> # use BERT's cls token as BOS token and sep token as EOS token
|
|
||||||
>>> encoder = BertGenerationEncoder.from_pretrained("bert-large-uncased", bos_token_id=101, eos_token_id=102)
|
|
||||||
>>> # add cross attention layers and use BERT's cls token as BOS token and sep token as EOS token
|
|
||||||
>>> decoder = BertGenerationDecoder.from_pretrained("bert-large-uncased", add_cross_attention=True, is_decoder=True, bos_token_id=101, eos_token_id=102)
|
|
||||||
>>> bert2bert = EncoderDecoderModel(encoder=encoder, decoder=decoder)
|
|
||||||
|
|
||||||
>>> # create tokenizer...
|
|
||||||
>>> tokenizer = BertTokenizer.from_pretrained("bert-large-uncased")
|
|
||||||
|
|
||||||
>>> input_ids = tokenizer('This is a long article to summarize', add_special_tokens=False, return_tensors="pt").input_ids
|
|
||||||
>>> labels = tokenizer('This is a short summary', return_tensors="pt").input_ids
|
|
||||||
|
|
||||||
>>> # train...
|
|
||||||
>>> loss = bert2bert(input_ids=input_ids, decoder_input_ids=labels, labels=labels).loss
|
|
||||||
>>> loss.backward()
|
|
||||||
|
|
||||||
|
|
||||||
- Pretrained :class:`~transformers.EncoderDecoderModel` are also directly available in the model hub, e.g.,
|
|
||||||
|
|
||||||
|
|
||||||
.. code-block::
|
|
||||||
|
|
||||||
>>> # instantiate sentence fusion model
|
|
||||||
>>> sentence_fuser = EncoderDecoderModel.from_pretrained("google/roberta2roberta_L-24_discofuse")
|
|
||||||
>>> tokenizer = AutoTokenizer.from_pretrained("google/roberta2roberta_L-24_discofuse")
|
|
||||||
|
|
||||||
>>> input_ids = tokenizer('This is the first sentence. This is the second sentence.', add_special_tokens=False, return_tensors="pt").input_ids
|
|
||||||
|
|
||||||
>>> outputs = sentence_fuser.generate(input_ids)
|
|
||||||
|
|
||||||
>>> print(tokenizer.decode(outputs[0]))
|
|
||||||
|
|
||||||
|
|
||||||
Tips:
|
|
||||||
|
|
||||||
- :class:`~transformers.BertGenerationEncoder` and :class:`~transformers.BertGenerationDecoder` should be used in
|
|
||||||
combination with :class:`~transformers.EncoderDecoderModel`.
|
|
||||||
- For summarization, sentence splitting, sentence fusion and translation, no special tokens are required for the input.
|
|
||||||
Therefore, no EOS token should be added to the end of the input.
|
|
||||||
|
|
||||||
This model was contributed by `patrickvonplaten <https://huggingface.co/patrickvonplaten>`__. The original code can be
|
|
||||||
found `here <https://tfhub.dev/s?module-type=text-generation&subtype=module,placeholder>`__.
|
|
||||||
|
|
||||||
BertGenerationConfig
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.BertGenerationConfig
|
|
||||||
:members:
|
|
||||||
|
|
||||||
|
|
||||||
BertGenerationTokenizer
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.BertGenerationTokenizer
|
|
||||||
:members: save_vocabulary
|
|
||||||
|
|
||||||
BertGenerationEncoder
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.BertGenerationEncoder
|
|
||||||
:members: forward
|
|
||||||
|
|
||||||
|
|
||||||
BertGenerationDecoder
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.BertGenerationDecoder
|
|
||||||
:members: forward
|
|
||||||
58
docs/source/model_doc/bertweet.mdx
Normal file
@@ -0,0 +1,58 @@
|
|||||||
|
<!--Copyright 2020 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->

# BERTweet

## Overview

The BERTweet model was proposed in [BERTweet: A pre-trained language model for English Tweets](https://www.aclweb.org/anthology/2020.emnlp-demos.2.pdf) by Dat Quoc Nguyen, Thanh Vu, Anh Tuan Nguyen.

The abstract from the paper is the following:

*We present BERTweet, the first public large-scale pre-trained language model for English Tweets. Our BERTweet, having
the same architecture as BERT-base (Devlin et al., 2019), is trained using the RoBERTa pre-training procedure (Liu et
al., 2019). Experiments show that BERTweet outperforms strong baselines RoBERTa-base and XLM-R-base (Conneau et al.,
2020), producing better performance results than the previous state-of-the-art models on three Tweet NLP tasks:
Part-of-speech tagging, Named-entity recognition and text classification.*

Example of use:
|
||||||
|
|
||||||
|
```python
>>> import torch
>>> from transformers import AutoModel, AutoTokenizer

>>> bertweet = AutoModel.from_pretrained("vinai/bertweet-base")

>>> # For transformers v4.x+:
>>> tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-base", use_fast=False)

>>> # For transformers v3.x:
>>> # tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-base")

>>> # INPUT TWEET IS ALREADY NORMALIZED!
>>> line = "SC has first two presumptive cases of coronavirus , DHEC confirms HTTPURL via @USER :cry:"

>>> input_ids = torch.tensor([tokenizer.encode(line)])

>>> with torch.no_grad():
...     features = bertweet(input_ids)  # Model outputs are now tuples

>>> # With TensorFlow 2.0+:
>>> # from transformers import TFAutoModel
>>> # bertweet = TFAutoModel.from_pretrained("vinai/bertweet-base")
```
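
The example above assumes the tweet has already been normalized. As a rough illustration of what that involves, the sketch below replaces URLs with `HTTPURL` and user mentions with `@USER`, the two placeholders visible in the input line. The function name `normalize_tweet_sketch` and its regular expressions are assumptions made for illustration. The normalization used by the BERTweet authors also converts emoji to text (note `:cry:` in the example) and handles further tokenization details, so refer to the original repository for the exact procedure.

```python
import re

# Simplified patterns, assumed for illustration; real tweets need more careful handling.
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")
MENTION_PATTERN = re.compile(r"@\w+")


def normalize_tweet_sketch(tweet):
    """Replace URLs and user mentions with the placeholder tokens HTTPURL and @USER."""
    tweet = URL_PATTERN.sub("HTTPURL", tweet)
    tweet = MENTION_PATTERN.sub("@USER", tweet)
    # Collapse runs of whitespace left over after substitution.
    return " ".join(tweet.split())


raw = "DHEC confirms https://example.com/news via @SomeAccount"
print(normalize_tweet_sketch(raw))
```

The normalized string can then be passed to `tokenizer.encode` exactly as `line` is in the example above.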
|
||||||
|
|
||||||
|
This model was contributed by [dqnguyen](https://huggingface.co/dqnguyen). The original code can be found [here](https://github.com/VinAIResearch/BERTweet).

## BertweetTokenizer

[[autodoc]] BertweetTokenizer
|
||||||
@@ -1,64 +0,0 @@
|
|||||||
..
|
|
||||||
Copyright 2020 The HuggingFace Team. All rights reserved.
|
|
||||||
|
|
||||||
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
|
|
||||||
the License. You may obtain a copy of the License at
|
|
||||||
|
|
||||||
http://www.apache.org/licenses/LICENSE-2.0
|
|
||||||
|
|
||||||
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
|
|
||||||
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
|
|
||||||
specific language governing permissions and limitations under the License.
|
|
||||||
|
|
||||||
BERTweet
|
|
||||||
-----------------------------------------------------------------------------------------------------------------------
|
|
||||||
|
|
||||||
Overview
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
The BERTweet model was proposed in `BERTweet: A pre-trained language model for English Tweets
|
|
||||||
<https://www.aclweb.org/anthology/2020.emnlp-demos.2.pdf>`__ by Dat Quoc Nguyen, Thanh Vu, Anh Tuan Nguyen.
|
|
||||||
|
|
||||||
The abstract from the paper is the following:
|
|
||||||
|
|
||||||
*We present BERTweet, the first public large-scale pre-trained language model for English Tweets. Our BERTweet, having
|
|
||||||
the same architecture as BERT-base (Devlin et al., 2019), is trained using the RoBERTa pre-training procedure (Liu et
|
|
||||||
al., 2019). Experiments show that BERTweet outperforms strong baselines RoBERTa-base and XLM-R-base (Conneau et al.,
|
|
||||||
2020), producing better performance results than the previous state-of-the-art models on three Tweet NLP tasks:
|
|
||||||
Part-of-speech tagging, Named-entity recognition and text classification.*
|
|
||||||
|
|
||||||
Example of use:
|
|
||||||
|
|
||||||
.. code-block::
|
|
||||||
|
|
||||||
>>> import torch
|
|
||||||
>>> from transformers import AutoModel, AutoTokenizer
|
|
||||||
|
|
||||||
>>> bertweet = AutoModel.from_pretrained("vinai/bertweet-base")
|
|
||||||
|
|
||||||
>>> # For transformers v4.x+:
|
|
||||||
>>> tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-base", use_fast=False)
|
|
||||||
|
|
||||||
>>> # For transformers v3.x:
|
|
||||||
>>> # tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-base")
|
|
||||||
|
|
||||||
>>> # INPUT TWEET IS ALREADY NORMALIZED!
|
|
||||||
>>> line = "SC has first two presumptive cases of coronavirus , DHEC confirms HTTPURL via @USER :cry:"
|
|
||||||
|
|
||||||
>>> input_ids = torch.tensor([tokenizer.encode(line)])
|
|
||||||
|
|
||||||
>>> with torch.no_grad():
|
|
||||||
... features = bertweet(input_ids) # Models outputs are now tuples
|
|
||||||
|
|
||||||
>>> # With TensorFlow 2.0+:
|
|
||||||
>>> # from transformers import TFAutoModel
|
|
||||||
>>> # bertweet = TFAutoModel.from_pretrained("vinai/bertweet-base")
|
|
||||||
|
|
||||||
This model was contributed by `dqnguyen <https://huggingface.co/dqnguyen>`__. The original code can be found `here
|
|
||||||
<https://github.com/VinAIResearch/BERTweet>`__.
|
|
||||||
|
|
||||||
BertweetTokenizer
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.BertweetTokenizer
|
|
||||||
:members:
|
|
||||||
146
docs/source/model_doc/bigbird.mdx
Normal file
@@ -0,0 +1,146 @@
|
|||||||
|
<!--Copyright 2021 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->

# BigBird

## Overview

The BigBird model was proposed in [Big Bird: Transformers for Longer Sequences](https://arxiv.org/abs/2007.14062) by
Zaheer, Manzil and Guruganesh, Guru and Dubey, Kumar Avinava and Ainslie, Joshua and Alberti, Chris and Ontanon,
Santiago and Pham, Philip and Ravula, Anirudh and Wang, Qifan and Yang, Li and others. BigBird is a
sparse-attention-based transformer that extends Transformer-based models, such as BERT, to much longer sequences. In
addition to sparse attention, BigBird also applies global attention as well as random attention to the input sequence.
Theoretically, it has been shown that applying sparse, global, and random attention approximates full attention, while
being computationally much more efficient for longer sequences. As a consequence of the capability to handle longer
context, BigBird has shown improved performance on various long document NLP tasks, such as question answering and
summarization, compared to BERT or RoBERTa.

The abstract from the paper is the following:

*Transformers-based models, such as BERT, have been one of the most successful deep learning models for NLP.
Unfortunately, one of their core limitations is the quadratic dependency (mainly in terms of memory) on the sequence
length due to their full attention mechanism. To remedy this, we propose, BigBird, a sparse attention mechanism that
reduces this quadratic dependency to linear. We show that BigBird is a universal approximator of sequence functions and
is Turing complete, thereby preserving these properties of the quadratic, full attention model. Along the way, our
theoretical analysis reveals some of the benefits of having O(1) global tokens (such as CLS), that attend to the entire
sequence as part of the sparse attention mechanism. The proposed sparse attention can handle sequences of length up to
8x of what was previously possible using similar hardware. As a consequence of the capability to handle longer context,
BigBird drastically improves performance on various NLP tasks such as question answering and summarization. We also
propose novel applications to genomics data.*
|
||||||
|
|
||||||
|
Tips:
|
||||||
|
|
||||||
|
- For an in-depth explanation of how BigBird's attention works, see [this blog post](https://huggingface.co/blog/big-bird).
- BigBird comes with 2 implementations: **original_full** & **block_sparse**. For sequence lengths < 1024, using
  **original_full** is advised as there is no benefit in using **block_sparse** attention.
- The code currently uses a window size of 3 blocks and 2 global blocks.
- Sequence length must be divisible by block size.
- The current implementation supports only **ITC**.
- The current implementation doesn't support **num_random_blocks = 0**.
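
The two length-related tips above can be sketched in plain Python. This is only an illustration: the helper names
`choose_attention_type` and `pad_to_block_multiple` are hypothetical and not part of the library, and the block size
of 64 is an assumed example value.

```python
def choose_attention_type(seq_len: int, threshold: int = 1024) -> str:
    """Pick an attention implementation following the tip above.

    Hypothetical helper: below the threshold, block-sparse attention
    brings no benefit, so full attention is preferred.
    """
    return "original_full" if seq_len < threshold else "block_sparse"


def pad_to_block_multiple(seq_len: int, block_size: int = 64) -> int:
    """Return the smallest length >= seq_len that is divisible by block_size.

    Hypothetical helper: block_size=64 is an assumed example value.
    """
    remainder = seq_len % block_size
    if remainder == 0:
        return seq_len
    return seq_len + (block_size - remainder)


if __name__ == "__main__":
    for length in (512, 1000, 4096):
        print(length, choose_attention_type(length), pad_to_block_multiple(length))
```

The resulting strings match the two implementation names listed in the tips; how they are passed to the model is
left out here on purpose.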
|
||||||
|
|
||||||
|
This model was contributed by [vasudevgupta](https://huggingface.co/vasudevgupta). The original code can be found
|
||||||
|
[here](https://github.com/google-research/bigbird).
|
||||||
|
|
||||||
|
## BigBirdConfig
|
||||||
|
|
||||||
|
[[autodoc]] BigBirdConfig
|
||||||
|
|
||||||
|
## BigBirdTokenizer
|
||||||
|
|
||||||
|
[[autodoc]] BigBirdTokenizer
|
||||||
|
- build_inputs_with_special_tokens
|
||||||
|
- get_special_tokens_mask
|
||||||
|
- create_token_type_ids_from_sequences
|
||||||
|
- save_vocabulary
|
||||||
|
|
||||||
|
## BigBirdTokenizerFast
|
||||||
|
|
||||||
|
[[autodoc]] BigBirdTokenizerFast
|
||||||
|
|
||||||
|
## BigBird specific outputs
|
||||||
|
|
||||||
|
[[autodoc]] models.big_bird.modeling_big_bird.BigBirdForPreTrainingOutput
|
||||||
|
|
||||||
|
## BigBirdModel
|
||||||
|
|
||||||
|
[[autodoc]] BigBirdModel
|
||||||
|
- forward
|
||||||
|
|
||||||
|
## BigBirdForPreTraining
|
||||||
|
|
||||||
|
[[autodoc]] BigBirdForPreTraining
|
||||||
|
- forward
|
||||||
|
|
||||||
|
## BigBirdForCausalLM
|
||||||
|
|
||||||
|
[[autodoc]] BigBirdForCausalLM
|
||||||
|
- forward
|
||||||
|
|
||||||
|
## BigBirdForMaskedLM
|
||||||
|
|
||||||
|
[[autodoc]] BigBirdForMaskedLM
|
||||||
|
- forward
|
||||||
|
|
||||||
|
## BigBirdForSequenceClassification
|
||||||
|
|
||||||
|
[[autodoc]] BigBirdForSequenceClassification
|
||||||
|
- forward
|
||||||
|
|
||||||
|
## BigBirdForMultipleChoice
|
||||||
|
|
||||||
|
[[autodoc]] BigBirdForMultipleChoice
|
||||||
|
- forward
|
||||||
|
|
||||||
|
## BigBirdForTokenClassification
|
||||||
|
|
||||||
|
[[autodoc]] BigBirdForTokenClassification
|
||||||
|
- forward
|
||||||
|
|
||||||
|
## BigBirdForQuestionAnswering
|
||||||
|
|
||||||
|
[[autodoc]] BigBirdForQuestionAnswering
|
||||||
|
- forward
|
||||||
|
|
||||||
|
## FlaxBigBirdModel
|
||||||
|
|
||||||
|
[[autodoc]] FlaxBigBirdModel
|
||||||
|
- __call__
|
||||||
|
|
||||||
|
## FlaxBigBirdForPreTraining
|
||||||
|
|
||||||
|
[[autodoc]] FlaxBigBirdForPreTraining
|
||||||
|
- __call__
|
||||||
|
|
||||||
|
## FlaxBigBirdForMaskedLM
|
||||||
|
|
||||||
|
[[autodoc]] FlaxBigBirdForMaskedLM
|
||||||
|
- __call__
|
||||||
|
|
||||||
|
## FlaxBigBirdForSequenceClassification
|
||||||
|
|
||||||
|
[[autodoc]] FlaxBigBirdForSequenceClassification
|
||||||
|
- __call__
|
||||||
|
|
||||||
|
## FlaxBigBirdForMultipleChoice
|
||||||
|
|
||||||
|
[[autodoc]] FlaxBigBirdForMultipleChoice
|
||||||
|
- __call__
|
||||||
|
|
||||||
|
## FlaxBigBirdForTokenClassification
|
||||||
|
|
||||||
|
[[autodoc]] FlaxBigBirdForTokenClassification
|
||||||
|
- __call__
|
||||||
|
|
||||||
|
## FlaxBigBirdForQuestionAnswering
|
||||||
|
|
||||||
|
[[autodoc]] FlaxBigBirdForQuestionAnswering
|
||||||
|
- __call__
|
||||||
@@ -1,185 +0,0 @@
|
|||||||
..
|
|
||||||
Copyright 2021 The HuggingFace Team. All rights reserved.
|
|
||||||
|
|
||||||
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
|
|
||||||
the License. You may obtain a copy of the License at
|
|
||||||
|
|
||||||
http://www.apache.org/licenses/LICENSE-2.0
|
|
||||||
|
|
||||||
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
|
|
||||||
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
|
|
||||||
specific language governing permissions and limitations under the License.
|
|
||||||
|
|
||||||
BigBird
|
|
||||||
-----------------------------------------------------------------------------------------------------------------------
|
|
||||||
|
|
||||||
Overview
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
The BigBird model was proposed in `Big Bird: Transformers for Longer Sequences <https://arxiv.org/abs/2007.14062>`__ by
|
|
||||||
Zaheer, Manzil and Guruganesh, Guru and Dubey, Kumar Avinava and Ainslie, Joshua and Alberti, Chris and Ontanon,
|
|
||||||
Santiago and Pham, Philip and Ravula, Anirudh and Wang, Qifan and Yang, Li and others. BigBird, is a sparse-attention
|
|
||||||
based transformer which extends Transformer based models, such as BERT to much longer sequences. In addition to sparse
|
|
||||||
attention, BigBird also applies global attention as well as random attention to the input sequence. Theoretically, it
|
|
||||||
has been shown that applying sparse, global, and random attention approximates full attention, while being
|
|
||||||
computationally much more efficient for longer sequences. As a consequence of the capability to handle longer context,
|
|
||||||
BigBird has shown improved performance on various long document NLP tasks, such as question answering and
|
|
||||||
summarization, compared to BERT or RoBERTa.
|
|
||||||
|
|
||||||
The abstract from the paper is the following:
|
|
||||||
|
|
||||||
*Transformers-based models, such as BERT, have been one of the most successful deep learning models for NLP.
|
|
||||||
Unfortunately, one of their core limitations is the quadratic dependency (mainly in terms of memory) on the sequence
|
|
||||||
length due to their full attention mechanism. To remedy this, we propose, BigBird, a sparse attention mechanism that
|
|
||||||
reduces this quadratic dependency to linear. We show that BigBird is a universal approximator of sequence functions and
|
|
||||||
is Turing complete, thereby preserving these properties of the quadratic, full attention model. Along the way, our
|
|
||||||
theoretical analysis reveals some of the benefits of having O(1) global tokens (such as CLS), that attend to the entire
|
|
||||||
sequence as part of the sparse attention mechanism. The proposed sparse attention can handle sequences of length up to
|
|
||||||
8x of what was previously possible using similar hardware. As a consequence of the capability to handle longer context,
|
|
||||||
BigBird drastically improves performance on various NLP tasks such as question answering and summarization. We also
|
|
||||||
propose novel applications to genomics data.*
|
|
||||||
|
|
||||||
Tips:
|
|
||||||
|
|
||||||
- For an in-detail explanation on how BigBird's attention works, see `this blog post
|
|
||||||
<https://huggingface.co/blog/big-bird>`__.
|
|
||||||
- BigBird comes with 2 implementations: **original_full** & **block_sparse**. For the sequence length < 1024, using
|
|
||||||
**original_full** is advised as there is no benefit in using **block_sparse** attention.
|
|
||||||
- The code currently uses window size of 3 blocks and 2 global blocks.
|
|
||||||
- Sequence length must be divisible by block size.
|
|
||||||
- Current implementation supports only **ITC**.
|
|
||||||
- Current implementation doesn't support **num_random_blocks = 0**
|
|
||||||
|
|
||||||
This model was contributed by `vasudevgupta <https://huggingface.co/vasudevgupta>`__. The original code can be found
|
|
||||||
`here <https://github.com/google-research/bigbird>`__.
|
|
||||||
|
|
||||||
BigBirdConfig
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.BigBirdConfig
|
|
||||||
:members:
|
|
||||||
|
|
||||||
|
|
||||||
BigBirdTokenizer
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.BigBirdTokenizer
|
|
||||||
:members: build_inputs_with_special_tokens, get_special_tokens_mask,
|
|
||||||
create_token_type_ids_from_sequences, save_vocabulary
|
|
||||||
|
|
||||||
BigBirdTokenizerFast
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.BigBirdTokenizerFast
|
|
||||||
:members:
|
|
||||||
|
|
||||||
BigBird specific outputs
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.models.big_bird.modeling_big_bird.BigBirdForPreTrainingOutput
|
|
||||||
:members:
|
|
||||||
|
|
||||||
|
|
||||||
BigBirdModel
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.BigBirdModel
|
|
||||||
:members: forward
|
|
||||||
|
|
||||||
|
|
||||||
BigBirdForPreTraining
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.BigBirdForPreTraining
|
|
||||||
:members: forward
|
|
||||||
|
|
||||||
|
|
||||||
BigBirdForCausalLM
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.BigBirdForCausalLM
|
|
||||||
:members: forward
|
|
||||||
|
|
||||||
|
|
||||||
BigBirdForMaskedLM
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.BigBirdForMaskedLM
|
|
||||||
:members: forward
|
|
||||||
|
|
||||||
|
|
||||||
BigBirdForSequenceClassification
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.BigBirdForSequenceClassification
|
|
||||||
:members: forward
|
|
||||||
|
|
||||||
|
|
||||||
BigBirdForMultipleChoice
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.BigBirdForMultipleChoice
|
|
||||||
:members: forward
|
|
||||||
|
|
||||||
|
|
||||||
BigBirdForTokenClassification
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.BigBirdForTokenClassification
|
|
||||||
:members: forward
|
|
||||||
|
|
||||||
|
|
||||||
BigBirdForQuestionAnswering
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.BigBirdForQuestionAnswering
|
|
||||||
:members: forward
|
|
||||||
|
|
||||||
|
|
||||||
FlaxBigBirdModel
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.FlaxBigBirdModel
|
|
||||||
:members: __call__
|
|
||||||
|
|
||||||
|
|
||||||
FlaxBigBirdForPreTraining
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.FlaxBigBirdForPreTraining
|
|
||||||
:members: __call__
|
|
||||||
|
|
||||||
|
|
||||||
FlaxBigBirdForMaskedLM
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.FlaxBigBirdForMaskedLM
|
|
||||||
:members: __call__
|
|
||||||
|
|
||||||
|
|
||||||
FlaxBigBirdForSequenceClassification
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.FlaxBigBirdForSequenceClassification
|
|
||||||
:members: __call__
|
|
||||||
|
|
||||||
|
|
||||||
FlaxBigBirdForMultipleChoice
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.FlaxBigBirdForMultipleChoice
|
|
||||||
:members: __call__
|
|
||||||
|
|
||||||
|
|
||||||
FlaxBigBirdForTokenClassification
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.FlaxBigBirdForTokenClassification
|
|
||||||
:members: __call__
|
|
||||||
|
|
||||||
|
|
||||||
FlaxBigBirdForQuestionAnswering
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.FlaxBigBirdForQuestionAnswering
|
|
||||||
:members: __call__
|
|
||||||
81
docs/source/model_doc/bigbird_pegasus.mdx
Normal file
@@ -0,0 +1,81 @@
|
|||||||
|
<!--Copyright 2021 The HuggingFace Team. All rights reserved.
|
||||||
|
|
||||||
|
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
|
||||||
|
the License. You may obtain a copy of the License at
|
||||||
|
|
||||||
|
http://www.apache.org/licenses/LICENSE-2.0
|
||||||
|
|
||||||
|
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
|
||||||
|
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
|
||||||
|
specific language governing permissions and limitations under the License.
|
||||||
|
-->
|
||||||
|
|
||||||
|
# BigBirdPegasus
|
||||||
|
|
||||||
|
## Overview
|
||||||
|
|
||||||
|
The BigBird model was proposed in [Big Bird: Transformers for Longer Sequences](https://arxiv.org/abs/2007.14062) by
Manzil Zaheer, Guru Guruganesh, Kumar Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontanon, Philip Pham,
Anirudh Ravula, Qifan Wang, Li Yang, and others. BigBird is a sparse-attention-based transformer that extends
Transformer-based models, such as BERT, to much longer sequences. In addition to sparse
attention, BigBird also applies global attention as well as random attention to the input sequence. Theoretically, it
has been shown that applying sparse, global, and random attention approximates full attention, while being
computationally much more efficient for longer sequences. As a consequence of the capability to handle longer context,
BigBird has shown improved performance on various long document NLP tasks, such as question answering and
summarization, compared to BERT or RoBERTa.
|
||||||
|
|
||||||
|
The abstract from the paper is the following:
|
||||||
|
|
||||||
|
*Transformers-based models, such as BERT, have been one of the most successful deep learning models for NLP.
|
||||||
|
Unfortunately, one of their core limitations is the quadratic dependency (mainly in terms of memory) on the sequence
|
||||||
|
length due to their full attention mechanism. To remedy this, we propose, BigBird, a sparse attention mechanism that
|
||||||
|
reduces this quadratic dependency to linear. We show that BigBird is a universal approximator of sequence functions and
|
||||||
|
is Turing complete, thereby preserving these properties of the quadratic, full attention model. Along the way, our
|
||||||
|
theoretical analysis reveals some of the benefits of having O(1) global tokens (such as CLS), that attend to the entire
|
||||||
|
sequence as part of the sparse attention mechanism. The proposed sparse attention can handle sequences of length up to
|
||||||
|
8x of what was previously possible using similar hardware. As a consequence of the capability to handle longer context,
|
||||||
|
BigBird drastically improves performance on various NLP tasks such as question answering and summarization. We also
|
||||||
|
propose novel applications to genomics data.*
|
||||||
|
|
||||||
|
Tips:
|
||||||
|
|
||||||
|
- For an in-depth explanation of how BigBird's attention works, see [this blog post](https://huggingface.co/blog/big-bird).
- BigBird comes with 2 implementations: **original_full** & **block_sparse**. For sequence lengths < 1024, using
  **original_full** is advised as there is no benefit in using **block_sparse** attention.
- The code currently uses a window size of 3 blocks and 2 global blocks.
- Sequence length must be divisible by block size.
- The current implementation supports only **ITC**.
- The current implementation doesn't support **num_random_blocks = 0**.
- BigBirdPegasus uses the [PegasusTokenizer](https://github.com/huggingface/transformers/blob/master/src/transformers/models/pegasus/tokenization_pegasus.py).
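
To see why **block_sparse** attention only pays off for long inputs, the following rough estimate compares how many
key tokens a query block attends to under block-sparse and full attention. It is a back-of-the-envelope sketch, not
the library's implementation: the function names are hypothetical, the block size of 64 and 3 random blocks are
assumed example values, and edge blocks are treated like interior ones.

```python
def attended_keys_per_query_block(
    block_size: int,
    window_blocks: int = 3,      # window size from the tips above
    global_blocks: int = 2,      # global blocks from the tips above
    num_random_blocks: int = 3,  # assumed example value, must be > 0
) -> int:
    """Rough number of key tokens one query block attends to."""
    return (window_blocks + global_blocks + num_random_blocks) * block_size


def sparse_to_full_ratio(seq_len: int, block_size: int, num_random_blocks: int = 3) -> float:
    """Fraction of the full-attention keys that block-sparse attention touches."""
    if seq_len % block_size != 0:
        raise ValueError("seq_len must be divisible by block_size")
    sparse_keys = attended_keys_per_query_block(
        block_size, num_random_blocks=num_random_blocks
    )
    # Full attention touches all seq_len keys, so the ratio cannot exceed 1.
    return min(1.0, sparse_keys / seq_len)


if __name__ == "__main__":
    print(sparse_to_full_ratio(4096, 64))  # 0.125
    print(sparse_to_full_ratio(512, 64))   # 1.0
```

Under these assumptions a 512-token input already touches every key, which is consistent with the advice to use
**original_full** for short sequences, while a 4096-token input touches about one eighth of them.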
|
||||||
|
|
||||||
|
The original code can be found [here](https://github.com/google-research/bigbird).
|
||||||
|
|
||||||
|
## BigBirdPegasusConfig
|
||||||
|
|
||||||
|
[[autodoc]] BigBirdPegasusConfig
|
||||||
|
- all
|
||||||
|
|
||||||
|
## BigBirdPegasusModel
|
||||||
|
|
||||||
|
[[autodoc]] BigBirdPegasusModel
|
||||||
|
- forward
|
||||||
|
|
||||||
|
## BigBirdPegasusForConditionalGeneration
|
||||||
|
|
||||||
|
[[autodoc]] BigBirdPegasusForConditionalGeneration
|
||||||
|
- forward
|
||||||
|
|
||||||
|
## BigBirdPegasusForSequenceClassification
|
||||||
|
|
||||||
|
[[autodoc]] BigBirdPegasusForSequenceClassification
|
||||||
|
- forward
|
||||||
|
|
||||||
|
## BigBirdPegasusForQuestionAnswering
|
||||||
|
|
||||||
|
[[autodoc]] BigBirdPegasusForQuestionAnswering
|
||||||
|
- forward
|
||||||
|
|
||||||
|
## BigBirdPegasusForCausalLM
|
||||||
|
|
||||||
|
[[autodoc]] BigBirdPegasusForCausalLM
|
||||||
|
- forward
|
||||||
@@ -1,98 +0,0 @@
|
|||||||
..
|
|
||||||
Copyright 2021 The HuggingFace Team. All rights reserved.
|
|
||||||
|
|
||||||
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
|
|
||||||
the License. You may obtain a copy of the License at
|
|
||||||
|
|
||||||
http://www.apache.org/licenses/LICENSE-2.0
|
|
||||||
|
|
||||||
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
|
|
||||||
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
|
|
||||||
specific language governing permissions and limitations under the License.
|
|
||||||
|
|
||||||
BigBirdPegasus
|
|
||||||
-----------------------------------------------------------------------------------------------------------------------
|
|
||||||
|
|
||||||
Overview
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
The BigBird model was proposed in `Big Bird: Transformers for Longer Sequences <https://arxiv.org/abs/2007.14062>`__ by
|
|
||||||
Zaheer, Manzil and Guruganesh, Guru and Dubey, Kumar Avinava and Ainslie, Joshua and Alberti, Chris and Ontanon,
|
|
||||||
Santiago and Pham, Philip and Ravula, Anirudh and Wang, Qifan and Yang, Li and others. BigBird, is a sparse-attention
|
|
||||||
based transformer which extends Transformer based models, such as BERT to much longer sequences. In addition to sparse
|
|
||||||
attention, BigBird also applies global attention as well as random attention to the input sequence. Theoretically, it
|
|
||||||
has been shown that applying sparse, global, and random attention approximates full attention, while being
|
|
||||||
computationally much more efficient for longer sequences. As a consequence of the capability to handle longer context,
|
|
||||||
BigBird has shown improved performance on various long document NLP tasks, such as question answering and
|
|
||||||
summarization, compared to BERT or RoBERTa.
|
|
||||||
|
|
||||||
The abstract from the paper is the following:
|
|
||||||
|
|
||||||
*Transformers-based models, such as BERT, have been one of the most successful deep learning models for NLP.
|
|
||||||
Unfortunately, one of their core limitations is the quadratic dependency (mainly in terms of memory) on the sequence
|
|
||||||
length due to their full attention mechanism. To remedy this, we propose, BigBird, a sparse attention mechanism that
|
|
||||||
reduces this quadratic dependency to linear. We show that BigBird is a universal approximator of sequence functions and
|
|
||||||
is Turing complete, thereby preserving these properties of the quadratic, full attention model. Along the way, our
|
|
||||||
theoretical analysis reveals some of the benefits of having O(1) global tokens (such as CLS), that attend to the entire
|
|
||||||
sequence as part of the sparse attention mechanism. The proposed sparse attention can handle sequences of length up to
|
|
||||||
8x of what was previously possible using similar hardware. As a consequence of the capability to handle longer context,
|
|
||||||
BigBird drastically improves performance on various NLP tasks such as question answering and summarization. We also
|
|
||||||
propose novel applications to genomics data.*
|
|
||||||
|
|
||||||
Tips:
|
|
||||||
|
|
||||||
- For an in-detail explanation on how BigBird's attention works, see `this blog post
|
|
||||||
<https://huggingface.co/blog/big-bird>`__.
|
|
||||||
- BigBird comes with 2 implementations: **original_full** & **block_sparse**. For the sequence length < 1024, using
|
|
||||||
**original_full** is advised as there is no benefit in using **block_sparse** attention.
|
|
||||||
- The code currently uses window size of 3 blocks and 2 global blocks.
|
|
||||||
- Sequence length must be divisible by block size.
|
|
||||||
- Current implementation supports only **ITC**.
|
|
||||||
- Current implementation doesn't support **num_random_blocks = 0**.
|
|
||||||
- BigBirdPegasus uses the `PegasusTokenizer
|
|
||||||
<https://github.com/huggingface/transformers/blob/master/src/transformers/models/pegasus/tokenization_pegasus.py>`__.
|
|
||||||
|
|
||||||
The original code can be found `here <https://github.com/google-research/bigbird>`__.
|
|
||||||
|
|
||||||
BigBirdPegasusConfig
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.BigBirdPegasusConfig
|
|
||||||
:members:
|
|
||||||
|
|
||||||
|
|
||||||
BigBirdPegasusModel
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.BigBirdPegasusModel
|
|
||||||
:members: forward
|
|
||||||
|
|
||||||
|
|
||||||
BigBirdPegasusForConditionalGeneration
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.BigBirdPegasusForConditionalGeneration
|
|
||||||
:members: forward
|
|
||||||
|
|
||||||
|
|
||||||
BigBirdPegasusForSequenceClassification
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.BigBirdPegasusForSequenceClassification
|
|
||||||
:members: forward
|
|
||||||
|
|
||||||
|
|
||||||
BigBirdPegasusForQuestionAnswering
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.BigBirdPegasusForQuestionAnswering
|
|
||||||
:members: forward
|
|
||||||
|
|
||||||
|
|
||||||
BigBirdPegasusForCausalLM
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.BigBirdPegasusForCausalLM
|
|
||||||
:members: forward
|
|
||||||
|
|
||||||
|
|
||||||
118
docs/source/model_doc/blenderbot.mdx
Normal file
@@ -0,0 +1,118 @@
|
|||||||
|
<!--Copyright 2020 The HuggingFace Team. All rights reserved.
|
||||||
|
|
||||||
|
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
|
||||||
|
the License. You may obtain a copy of the License at
|
||||||
|
|
||||||
|
http://www.apache.org/licenses/LICENSE-2.0
|
||||||
|
|
||||||
|
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
|
||||||
|
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
|
||||||
|
specific language governing permissions and limitations under the License.
|
||||||
|
-->
|
||||||
|
|
||||||
|
# Blenderbot
|
||||||
|
|
||||||
|
**DISCLAIMER:** If you see something strange, file a [GitHub Issue](https://github.com/huggingface/transformers/issues/new?assignees=&labels=&template=bug-report.md&title).
|
||||||
|
|
||||||
|
## Overview
|
||||||
|
|
||||||
|
The Blender chatbot model was proposed in [Recipes for building an open-domain chatbot](https://arxiv.org/pdf/2004.13637.pdf) by Stephen Roller, Emily Dinan, Naman Goyal, Da Ju, Mary Williamson, Yinhan Liu,
Jing Xu, Myle Ott, Kurt Shuster, Eric M. Smith, Y-Lan Boureau, and Jason Weston on 30 Apr 2020.
|
||||||
|
|
||||||
|
The abstract of the paper is the following:
|
||||||
|
|
||||||
|
*Building open-domain chatbots is a challenging area for machine learning research. While prior work has shown that
|
||||||
|
scaling neural models in the number of parameters and the size of the data they are trained on gives improved results,
|
||||||
|
we show that other ingredients are important for a high-performing chatbot. Good conversation requires a number of
|
||||||
|
skills that an expert conversationalist blends in a seamless way: providing engaging talking points and listening to
|
||||||
|
their partners, and displaying knowledge, empathy and personality appropriately, while maintaining a consistent
|
||||||
|
persona. We show that large scale models can learn these skills when given appropriate training data and choice of
|
||||||
|
generation strategy. We build variants of these recipes with 90M, 2.7B and 9.4B parameter models, and make our models
|
||||||
|
and code publicly available. Human evaluations show our best models are superior to existing approaches in multi-turn
|
||||||
|
dialogue in terms of engagingness and humanness measurements. We then discuss the limitations of this work by analyzing
|
||||||
|
failure cases of our models.*
|
||||||
|
|
||||||
|
This model was contributed by [sshleifer](https://huggingface.co/sshleifer). The authors' code can be found [here](https://github.com/facebookresearch/ParlAI).
|
||||||
|
|
||||||
|
|
||||||
|
## Implementation Notes
|
||||||
|
|
||||||
|
- Blenderbot uses an architecture based on the standard [seq2seq Transformer](https://arxiv.org/pdf/1706.03762.pdf).
- Available checkpoints can be found in the [model hub](https://huggingface.co/models?search=blenderbot).
- This is the *default* Blenderbot model class. However, some smaller checkpoints, such as
  `facebook/blenderbot_small_90M`, have a different architecture and consequently should be used with
  [BlenderbotSmall](blenderbot_small).
|
||||||
|
|
||||||
|
|
||||||
|
## Usage
|
||||||
|
|
||||||
|
Here is an example of model usage:
|
||||||
|
|
||||||
|
```python
>>> from transformers import BlenderbotTokenizer, BlenderbotForConditionalGeneration
>>> # Load the checkpoint and its matching tokenizer
>>> mname = 'facebook/blenderbot-400M-distill'
>>> model = BlenderbotForConditionalGeneration.from_pretrained(mname)
>>> tokenizer = BlenderbotTokenizer.from_pretrained(mname)
>>> # Tokenize a single utterance as a batch of one and return PyTorch tensors
>>> UTTERANCE = "My friends are cool but they eat too many carbs."
>>> inputs = tokenizer([UTTERANCE], return_tensors='pt')
>>> # Generate the reply and decode it back to text (special tokens are kept)
>>> reply_ids = model.generate(**inputs)
>>> print(tokenizer.batch_decode(reply_ids))
["<s> That's unfortunate. Are they trying to lose weight or are they just trying to be healthier?</s>"]
```
|
||||||
|
|
||||||
|
## BlenderbotConfig
|
||||||
|
|
||||||
|
[[autodoc]] BlenderbotConfig
|
||||||
|
|
||||||
|
## BlenderbotTokenizer
|
||||||
|
|
||||||
|
[[autodoc]] BlenderbotTokenizer
|
||||||
|
- build_inputs_with_special_tokens
|
||||||
|
|
||||||
|
## BlenderbotTokenizerFast
|
||||||
|
|
||||||
|
[[autodoc]] BlenderbotTokenizerFast
|
||||||
|
- build_inputs_with_special_tokens
|
||||||
|
|
||||||
|
## BlenderbotModel
|
||||||
|
|
||||||
|
See `transformers.BartModel` for arguments to *forward* and *generate*.
|
||||||
|
|
||||||
|
[[autodoc]] BlenderbotModel
|
||||||
|
- forward
|
||||||
|
|
||||||
|
## BlenderbotForConditionalGeneration
|
||||||
|
|
||||||
|
See [`~transformers.BartForConditionalGeneration`] for arguments to *forward* and *generate*.
|
||||||
|
|
||||||
|
[[autodoc]] BlenderbotForConditionalGeneration
|
||||||
|
- forward
|
||||||
|
|
||||||
|
## BlenderbotForCausalLM
|
||||||
|
|
||||||
|
[[autodoc]] BlenderbotForCausalLM
|
||||||
|
- forward
|
||||||
|
|
||||||
|
## TFBlenderbotModel
|
||||||
|
|
||||||
|
[[autodoc]] TFBlenderbotModel
|
||||||
|
- call
|
||||||
|
|
||||||
|
## TFBlenderbotForConditionalGeneration
|
||||||
|
|
||||||
|
[[autodoc]] TFBlenderbotForConditionalGeneration
|
||||||
|
- call
|
||||||
|
|
||||||
|
## FlaxBlenderbotModel
|
||||||
|
|
||||||
|
[[autodoc]] FlaxBlenderbotModel
|
||||||
|
- __call__
|
||||||
|
- encode
|
||||||
|
- decode
|
||||||
|
|
||||||
|
## FlaxBlenderbotForConditionalGeneration
|
||||||
|
|
||||||
|
[[autodoc]] FlaxBlenderbotForConditionalGeneration
|
||||||
|
- __call__
|
||||||
|
- encode
|
||||||
|
- decode
|
||||||
@@ -1,141 +0,0 @@
|
|||||||
..
|
|
||||||
Copyright 2020 The HuggingFace Team. All rights reserved.
|
|
||||||
|
|
||||||
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
|
|
||||||
the License. You may obtain a copy of the License at
|
|
||||||
|
|
||||||
http://www.apache.org/licenses/LICENSE-2.0
|
|
||||||
|
|
||||||
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
|
|
||||||
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
|
|
||||||
specific language governing permissions and limitations under the License.
|
|
||||||
|
|
||||||
Blenderbot
|
|
||||||
-----------------------------------------------------------------------------------------------------------------------
|
|
||||||
|
|
||||||
**DISCLAIMER:** If you see something strange, file a `Github Issue
|
|
||||||
<https://github.com/huggingface/transformers/issues/new?assignees=&labels=&template=bug-report.md&title>`__ .
|
|
||||||
|
|
||||||
Overview
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
The Blender chatbot model was proposed in `Recipes for building an open-domain chatbot
|
|
||||||
<https://arxiv.org/pdf/2004.13637.pdf>`__ Stephen Roller, Emily Dinan, Naman Goyal, Da Ju, Mary Williamson, Yinhan Liu,
|
|
||||||
Jing Xu, Myle Ott, Kurt Shuster, Eric M. Smith, Y-Lan Boureau, Jason Weston on 30 Apr 2020.
|
|
||||||
|
|
||||||
The abstract of the paper is the following:
|
|
||||||
|
|
||||||
*Building open-domain chatbots is a challenging area for machine learning research. While prior work has shown that
|
|
||||||
scaling neural models in the number of parameters and the size of the data they are trained on gives improved results,
|
|
||||||
we show that other ingredients are important for a high-performing chatbot. Good conversation requires a number of
|
|
||||||
skills that an expert conversationalist blends in a seamless way: providing engaging talking points and listening to
|
|
||||||
their partners, and displaying knowledge, empathy and personality appropriately, while maintaining a consistent
|
|
||||||
persona. We show that large scale models can learn these skills when given appropriate training data and choice of
|
|
||||||
generation strategy. We build variants of these recipes with 90M, 2.7B and 9.4B parameter models, and make our models
|
|
||||||
and code publicly available. Human evaluations show our best models are superior to existing approaches in multi-turn
|
|
||||||
dialogue in terms of engagingness and humanness measurements. We then discuss the limitations of this work by analyzing
|
|
||||||
failure cases of our models.*
|
|
||||||
|
|
||||||
This model was contributed by `sshleifer <https://huggingface.co/sshleifer>`__. The authors' code can be found `here
|
|
||||||
<https://github.com/facebookresearch/ParlAI>`__ .
|
|
||||||
|
|
||||||
|
|
||||||
Implementation Notes
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
- Blenderbot uses a standard `seq2seq model transformer <https://arxiv.org/pdf/1706.03762.pdf>`__ based architecture.
|
|
||||||
- Available checkpoints can be found in the `model hub <https://huggingface.co/models?search=blenderbot>`__.
|
|
||||||
- This is the `default` Blenderbot model class. However, some smaller checkpoints, such as
|
|
||||||
``facebook/blenderbot_small_90M``, have a different architecture and consequently should be used with
|
|
||||||
`BlenderbotSmall <blenderbot_small>`__.
|
|
||||||
|
|
||||||
|
|
||||||
Usage
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
Here is an example of model usage:
|
|
||||||
|
|
||||||
.. code-block::
|
|
||||||
|
|
||||||
>>> from transformers import BlenderbotTokenizer, BlenderbotForConditionalGeneration
|
|
||||||
>>> mname = 'facebook/blenderbot-400M-distill'
|
|
||||||
>>> model = BlenderbotForConditionalGeneration.from_pretrained(mname)
|
|
||||||
>>> tokenizer = BlenderbotTokenizer.from_pretrained(mname)
|
|
||||||
>>> UTTERANCE = "My friends are cool but they eat too many carbs."
|
|
||||||
>>> inputs = tokenizer([UTTERANCE], return_tensors='pt')
|
|
||||||
>>> reply_ids = model.generate(**inputs)
|
|
||||||
>>> print(tokenizer.batch_decode(reply_ids))
|
|
||||||
["<s> That's unfortunate. Are they trying to lose weight or are they just trying to be healthier?</s>"]
|
|
||||||
|
|
||||||
|
|
||||||
BlenderbotConfig
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.BlenderbotConfig
|
|
||||||
:members:
|
|
||||||
|
|
||||||
BlenderbotTokenizer
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.BlenderbotTokenizer
|
|
||||||
:members: build_inputs_with_special_tokens
|
|
||||||
|
|
||||||
|
|
||||||
BlenderbotTokenizerFast
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.BlenderbotTokenizerFast
|
|
||||||
:members: build_inputs_with_special_tokens
|
|
||||||
|
|
||||||
|
|
||||||
BlenderbotModel
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
See :obj:`transformers.BartModel` for arguments to `forward` and `generate`
|
|
||||||
|
|
||||||
.. autoclass:: transformers.BlenderbotModel
|
|
||||||
:members: forward
|
|
||||||
|
|
||||||
|
|
||||||
BlenderbotForConditionalGeneration
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
See :obj:`transformers.BartForConditionalGeneration` for arguments to `forward` and `generate`
|
|
||||||
|
|
||||||
.. autoclass:: transformers.BlenderbotForConditionalGeneration
|
|
||||||
:members: forward
|
|
||||||
|
|
||||||
|
|
||||||
BlenderbotForCausalLM
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.BlenderbotForCausalLM
|
|
||||||
:members: forward
|
|
||||||
|
|
||||||
|
|
||||||
TFBlenderbotModel
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.TFBlenderbotModel
|
|
||||||
:members: call
|
|
||||||
|
|
||||||
|
|
||||||
TFBlenderbotForConditionalGeneration
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.TFBlenderbotForConditionalGeneration
|
|
||||||
:members: call
|
|
||||||
|
|
||||||
|
|
||||||
FlaxBlenderbotModel
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.FlaxBlenderbotModel
|
|
||||||
:members: __call__, encode, decode
|
|
||||||
|
|
||||||
|
|
||||||
FlaxBlenderbotForConditionalGeneration
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.FlaxBlenderbotForConditionalGeneration
|
|
||||||
:members: __call__, encode, decode
|
|
||||||
95
docs/source/model_doc/blenderbot_small.mdx
Normal file
@@ -0,0 +1,95 @@
|
|||||||
|
<!--Copyright 2020 The HuggingFace Team. All rights reserved.
|
||||||
|
|
||||||
|
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
|
||||||
|
the License. You may obtain a copy of the License at
|
||||||
|
|
||||||
|
http://www.apache.org/licenses/LICENSE-2.0
|
||||||
|
|
||||||
|
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
|
||||||
|
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
|
||||||
|
specific language governing permissions and limitations under the License.
|
||||||
|
-->
|
||||||
|
|
||||||
|
# Blenderbot Small
|
||||||
|
|
||||||
|
Note that [`BlenderbotSmallModel`] and
|
||||||
|
[`BlenderbotSmallForConditionalGeneration`] are only used in combination with the checkpoint
|
||||||
|
[facebook/blenderbot-90M](https://huggingface.co/facebook/blenderbot-90M). Larger Blenderbot checkpoints should
|
||||||
|
instead be used with [`BlenderbotModel`] and
|
||||||
|
[`BlenderbotForConditionalGeneration`].
|
||||||
|
|
||||||
|
## Overview
|
||||||
|
|
||||||
|
The Blender chatbot model was proposed in [Recipes for building an open-domain chatbot](https://arxiv.org/pdf/2004.13637.pdf) by Stephen Roller, Emily Dinan, Naman Goyal, Da Ju, Mary Williamson, Yinhan Liu,
|
||||||
|
Jing Xu, Myle Ott, Kurt Shuster, Eric M. Smith, Y-Lan Boureau, Jason Weston on 30 Apr 2020.
|
||||||
|
|
||||||
|
The abstract of the paper is the following:
|
||||||
|
|
||||||
|
*Building open-domain chatbots is a challenging area for machine learning research. While prior work has shown that
|
||||||
|
scaling neural models in the number of parameters and the size of the data they are trained on gives improved results,
|
||||||
|
we show that other ingredients are important for a high-performing chatbot. Good conversation requires a number of
|
||||||
|
skills that an expert conversationalist blends in a seamless way: providing engaging talking points and listening to
|
||||||
|
their partners, and displaying knowledge, empathy and personality appropriately, while maintaining a consistent
|
||||||
|
persona. We show that large scale models can learn these skills when given appropriate training data and choice of
|
||||||
|
generation strategy. We build variants of these recipes with 90M, 2.7B and 9.4B parameter models, and make our models
|
||||||
|
and code publicly available. Human evaluations show our best models are superior to existing approaches in multi-turn
|
||||||
|
dialogue in terms of engagingness and humanness measurements. We then discuss the limitations of this work by analyzing
|
||||||
|
failure cases of our models.*
|
||||||
|
|
||||||
|
This model was contributed by [patrickvonplaten](https://huggingface.co/patrickvonplaten). The authors' code can be
|
||||||
|
found [here](https://github.com/facebookresearch/ParlAI).
|
||||||
|
|
||||||
|
## BlenderbotSmallConfig
|
||||||
|
|
||||||
|
[[autodoc]] BlenderbotSmallConfig
|
||||||
|
|
||||||
|
## BlenderbotSmallTokenizer
|
||||||
|
|
||||||
|
[[autodoc]] BlenderbotSmallTokenizer
|
||||||
|
- build_inputs_with_special_tokens
|
||||||
|
- get_special_tokens_mask
|
||||||
|
- create_token_type_ids_from_sequences
|
||||||
|
- save_vocabulary
|
||||||
|
|
||||||
|
## BlenderbotSmallTokenizerFast
|
||||||
|
|
||||||
|
[[autodoc]] BlenderbotSmallTokenizerFast
|
||||||
|
|
||||||
|
## BlenderbotSmallModel
|
||||||
|
|
||||||
|
[[autodoc]] BlenderbotSmallModel
|
||||||
|
- forward
|
||||||
|
|
||||||
|
## BlenderbotSmallForConditionalGeneration
|
||||||
|
|
||||||
|
[[autodoc]] BlenderbotSmallForConditionalGeneration
|
||||||
|
- forward
|
||||||
|
|
||||||
|
## BlenderbotSmallForCausalLM
|
||||||
|
|
||||||
|
[[autodoc]] BlenderbotSmallForCausalLM
|
||||||
|
- forward
|
||||||
|
|
||||||
|
## TFBlenderbotSmallModel
|
||||||
|
|
||||||
|
[[autodoc]] TFBlenderbotSmallModel
|
||||||
|
- call
|
||||||
|
|
||||||
|
## TFBlenderbotSmallForConditionalGeneration
|
||||||
|
|
||||||
|
[[autodoc]] TFBlenderbotSmallForConditionalGeneration
|
||||||
|
- call
|
||||||
|
|
||||||
|
## FlaxBlenderbotSmallModel
|
||||||
|
|
||||||
|
[[autodoc]] FlaxBlenderbotSmallModel
|
||||||
|
- __call__
|
||||||
|
- encode
|
||||||
|
- decode
|
||||||
|
|
||||||
|
## FlaxBlenderbotSmallForConditionalGeneration
|
||||||
|
|
||||||
|
[[autodoc]] FlaxBlenderbotSmallForConditionalGeneration
|
||||||
|
- __call__
|
||||||
|
- encode
|
||||||
|
- decode
|
||||||
@@ -1,113 +0,0 @@
|
|||||||
..
|
|
||||||
Copyright 2020 The HuggingFace Team. All rights reserved.
|
|
||||||
|
|
||||||
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
|
|
||||||
the License. You may obtain a copy of the License at
|
|
||||||
|
|
||||||
http://www.apache.org/licenses/LICENSE-2.0
|
|
||||||
|
|
||||||
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
|
|
||||||
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
|
|
||||||
specific language governing permissions and limitations under the License.
|
|
||||||
|
|
||||||
Blenderbot Small
|
|
||||||
-----------------------------------------------------------------------------------------------------------------------
|
|
||||||
|
|
||||||
Note that :class:`~transformers.BlenderbotSmallModel` and
|
|
||||||
:class:`~transformers.BlenderbotSmallForConditionalGeneration` are only used in combination with the checkpoint
|
|
||||||
`facebook/blenderbot-90M <https://huggingface.co/facebook/blenderbot-90M>`__. Larger Blenderbot checkpoints should
|
|
||||||
instead be used with :class:`~transformers.BlenderbotModel` and
|
|
||||||
:class:`~transformers.BlenderbotForConditionalGeneration`
|
|
||||||
|
|
||||||
Overview
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
The Blender chatbot model was proposed in `Recipes for building an open-domain chatbot
|
|
||||||
<https://arxiv.org/pdf/2004.13637.pdf>`__ Stephen Roller, Emily Dinan, Naman Goyal, Da Ju, Mary Williamson, Yinhan Liu,
|
|
||||||
Jing Xu, Myle Ott, Kurt Shuster, Eric M. Smith, Y-Lan Boureau, Jason Weston on 30 Apr 2020.
|
|
||||||
|
|
||||||
The abstract of the paper is the following:
|
|
||||||
|
|
||||||
*Building open-domain chatbots is a challenging area for machine learning research. While prior work has shown that
|
|
||||||
scaling neural models in the number of parameters and the size of the data they are trained on gives improved results,
|
|
||||||
we show that other ingredients are important for a high-performing chatbot. Good conversation requires a number of
|
|
||||||
skills that an expert conversationalist blends in a seamless way: providing engaging talking points and listening to
|
|
||||||
their partners, and displaying knowledge, empathy and personality appropriately, while maintaining a consistent
|
|
||||||
persona. We show that large scale models can learn these skills when given appropriate training data and choice of
|
|
||||||
generation strategy. We build variants of these recipes with 90M, 2.7B and 9.4B parameter models, and make our models
|
|
||||||
and code publicly available. Human evaluations show our best models are superior to existing approaches in multi-turn
|
|
||||||
dialogue in terms of engagingness and humanness measurements. We then discuss the limitations of this work by analyzing
|
|
||||||
failure cases of our models.*
|
|
||||||
|
|
||||||
This model was contributed by `patrickvonplaten <https://huggingface.co/patrickvonplaten>`__. The authors' code can be
|
|
||||||
found `here <https://github.com/facebookresearch/ParlAI>`__ .
|
|
||||||
|
|
||||||
BlenderbotSmallConfig
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.BlenderbotSmallConfig
|
|
||||||
:members:
|
|
||||||
|
|
||||||
|
|
||||||
BlenderbotSmallTokenizer
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.BlenderbotSmallTokenizer
|
|
||||||
:members: build_inputs_with_special_tokens, get_special_tokens_mask,
|
|
||||||
create_token_type_ids_from_sequences, save_vocabulary
|
|
||||||
|
|
||||||
|
|
||||||
BlenderbotSmallTokenizerFast
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.BlenderbotSmallTokenizerFast
|
|
||||||
:members:
|
|
||||||
|
|
||||||
|
|
||||||
BlenderbotSmallModel
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.BlenderbotSmallModel
|
|
||||||
:members: forward
|
|
||||||
|
|
||||||
|
|
||||||
BlenderbotSmallForConditionalGeneration
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.BlenderbotSmallForConditionalGeneration
|
|
||||||
:members: forward
|
|
||||||
|
|
||||||
|
|
||||||
BlenderbotSmallForCausalLM
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.BlenderbotSmallForCausalLM
|
|
||||||
:members: forward
|
|
||||||
|
|
||||||
|
|
||||||
TFBlenderbotSmallModel
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.TFBlenderbotSmallModel
|
|
||||||
:members: call
|
|
||||||
|
|
||||||
|
|
||||||
TFBlenderbotSmallForConditionalGeneration
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.TFBlenderbotSmallForConditionalGeneration
|
|
||||||
:members: call
|
|
||||||
|
|
||||||
|
|
||||||
FlaxBlenderbotSmallModel
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.FlaxBlenderbotSmallModel
|
|
||||||
:members: __call__, encode, decode
|
|
||||||
|
|
||||||
|
|
||||||
FlaxBlenderbotForConditionalGeneration
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.FlaxBlenderbotSmallForConditionalGeneration
|
|
||||||
:members: __call__, encode, decode
|
|
||||||
@@ -1,5 +1,4 @@
|
|||||||
..
|
<!--Copyright 2020 The HuggingFace Team. All rights reserved.
|
||||||
Copyright 2020 The HuggingFace Team. All rights reserved.
|
|
||||||
|
|
||||||
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
|
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
|
||||||
the License. You may obtain a copy of the License at
|
the License. You may obtain a copy of the License at
|
||||||
@@ -9,14 +8,13 @@
|
|||||||
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
|
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
|
||||||
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
|
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
|
||||||
specific language governing permissions and limitations under the License.
|
specific language governing permissions and limitations under the License.
|
||||||
|
-->
|
||||||
|
|
||||||
BORT
|
# BORT
|
||||||
-----------------------------------------------------------------------------------------------------------------------
|
|
||||||
|
|
||||||
Overview
|
## Overview
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
The BORT model was proposed in `Optimal Subarchitecture Extraction for BERT <https://arxiv.org/abs/2010.10499>`__ by
|
The BORT model was proposed in [Optimal Subarchitecture Extraction for BERT](https://arxiv.org/abs/2010.10499) by
|
||||||
Adrian de Wynter and Daniel J. Perry. It is an optimal subset of architectural parameters for the BERT, which the
|
Adrian de Wynter and Daniel J. Perry. It is an optimal subset of architectural parameters for the BERT, which the
|
||||||
authors refer to as "Bort".
|
authors refer to as "Bort".
|
||||||
|
|
||||||
@@ -34,14 +32,11 @@ absolute, with respect to BERT-large, on multiple public natural language unders
|
|||||||
|
|
||||||
Tips:
|
Tips:
|
||||||
|
|
||||||
- BORT's model architecture is based on BERT, so one can refer to :doc:`BERT's documentation page <bert>` for the
|
- BORT's model architecture is based on BERT, so one can refer to [BERT's documentation page](bert) for the
|
||||||
model's API as well as usage examples.
|
model's API as well as usage examples.
|
||||||
- BORT uses the RoBERTa tokenizer instead of the BERT tokenizer, so one can refer to :doc:`RoBERTa's documentation page
|
- BORT uses the RoBERTa tokenizer instead of the BERT tokenizer, so one can refer to [RoBERTa's documentation page](roberta) for the tokenizer's API as well as usage examples.
|
||||||
<roberta>` for the tokenizer's API as well as usage examples.
|
- BORT requires a specific fine-tuning algorithm, called [Agora](https://adewynter.github.io/notes/bort_algorithms_and_applications.html#fine-tuning-with-algebraic-topology),
|
||||||
- BORT requires a specific fine-tuning algorithm, called `Agora
|
|
||||||
<https://adewynter.github.io/notes/bort_algorithms_and_applications.html#fine-tuning-with-algebraic-topology>`__ ,
|
|
||||||
that is sadly not open-sourced yet. It would be very useful for the community, if someone tries to implement the
|
that is sadly not open-sourced yet. It would be very useful for the community, if someone tries to implement the
|
||||||
algorithm to make BORT fine-tuning work.
|
algorithm to make BORT fine-tuning work.
|
||||||
|
|
||||||
This model was contributed by `stefan-it <https://huggingface.co/stefan-it>`__. The original code can be found `here
|
This model was contributed by [stefan-it](https://huggingface.co/stefan-it). The original code can be found [here](https://github.com/alexa/bort/).
|
||||||
<https://github.com/alexa/bort/>`__.
|
|
||||||
80
docs/source/model_doc/byt5.mdx
Normal file
@@ -0,0 +1,80 @@
|
|||||||
|
<!--Copyright 2021 The HuggingFace Team. All rights reserved.
|
||||||
|
|
||||||
|
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
|
||||||
|
the License. You may obtain a copy of the License at
|
||||||
|
|
||||||
|
http://www.apache.org/licenses/LICENSE-2.0
|
||||||
|
|
||||||
|
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
|
||||||
|
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
|
||||||
|
specific language governing permissions and limitations under the License.
|
||||||
|
-->
|
||||||
|
|
||||||
|
# ByT5
|
||||||
|
|
||||||
|
## Overview
|
||||||
|
|
||||||
|
The ByT5 model was presented in [ByT5: Towards a token-free future with pre-trained byte-to-byte models](https://arxiv.org/abs/2105.13626) by Linting Xue, Aditya Barua, Noah Constant, Rami Al-Rfou, Sharan Narang, Mihir
|
||||||
|
Kale, Adam Roberts, Colin Raffel.
|
||||||
|
|
||||||
|
The abstract from the paper is the following:
|
||||||
|
|
||||||
|
*Most widely-used pre-trained language models operate on sequences of tokens corresponding to word or subword units.
|
||||||
|
Encoding text as a sequence of tokens requires a tokenizer, which is typically created as an independent artifact from
|
||||||
|
the model. Token-free models that instead operate directly on raw text (bytes or characters) have many benefits: they
|
||||||
|
can process text in any language out of the box, they are more robust to noise, and they minimize technical debt by
|
||||||
|
removing complex and error-prone text preprocessing pipelines. Since byte or character sequences are longer than token
|
||||||
|
sequences, past work on token-free models has often introduced new model architectures designed to amortize the cost of
|
||||||
|
operating directly on raw text. In this paper, we show that a standard Transformer architecture can be used with
|
||||||
|
minimal modifications to process byte sequences. We carefully characterize the trade-offs in terms of parameter count,
|
||||||
|
training FLOPs, and inference speed, and show that byte-level models are competitive with their token-level
|
||||||
|
counterparts. We also demonstrate that byte-level models are significantly more robust to noise and perform better on
|
||||||
|
tasks that are sensitive to spelling and pronunciation. As part of our contribution, we release a new set of
|
||||||
|
pre-trained byte-level Transformer models based on the T5 architecture, as well as all code and data used in our
|
||||||
|
experiments.*
|
||||||
|
|
||||||
|
This model was contributed by [patrickvonplaten](https://huggingface.co/patrickvonplaten). The original code can be
|
||||||
|
found [here](https://github.com/google-research/byt5).
|
||||||
|
|
||||||
|
ByT5's architecture is based on the T5v1.1 model, so one can refer to [T5v1.1's documentation page](t5v1.1). They
|
||||||
|
only differ in how inputs should be prepared for the model; see the code examples below.
|
||||||
|
|
||||||
|
Since ByT5 was pre-trained without supervision, there's no real advantage to using a task prefix during single-task
|
||||||
|
fine-tuning. If you are doing multi-task fine-tuning, you should use a prefix.
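
As a rough illustration of the multi-task case, the sketch below prepends a task prefix to each input string before it is encoded. The helper name `add_task_prefix` and the exact prefix strings are assumptions made for this example (they follow the convention commonly used with T5-style models), not something prescribed by this page.

```python
from typing import List


def add_task_prefix(prefix: str, texts: List[str]) -> List[str]:
    """Prepend the same task prefix to every input text (hypothetical helper)."""
    return [f"{prefix}{text}" for text in texts]


# Illustrative prefixes; pick whatever strings you use consistently during fine-tuning.
translation_inputs = add_task_prefix(
    "translate English to French: ", ["Life is like a box of chocolates."]
)
summarization_inputs = add_task_prefix("summarize: ", ["Today is Monday."])

print(translation_inputs[0])
print(summarization_inputs[0])
```

The prefixed strings can then be passed to the tokenizer like any other input. For single-task fine-tuning this step can simply be skipped.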
|
||||||
|
|
||||||
|
|
||||||
|
### Example
|
||||||
|
|
||||||
|
ByT5 works on raw UTF-8 bytes, so it can be used without a tokenizer:
|
||||||
|
|
||||||
|
```python
from transformers import T5ForConditionalGeneration
import torch

model = T5ForConditionalGeneration.from_pretrained('google/byt5-small')

input_ids = torch.tensor([list("Life is like a box of chocolates.".encode("utf-8"))]) + 3  # add 3 for special tokens
labels = torch.tensor([list("La vie est comme une boîte de chocolat.".encode("utf-8"))]) + 3  # add 3 for special tokens

loss = model(input_ids, labels=labels).loss # forward pass
```
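
To make the `+ 3` in the example above more concrete, here is a minimal pure-Python sketch of the byte-to-id mapping it implies. It assumes, based only on the "add 3 for special tokens" comment, that the first three ids are reserved and that every UTF-8 byte is shifted by that offset; the helper names are hypothetical and are not part of the library.

```python
from typing import List

# Assumption taken from the "add 3 for special tokens" comment above:
# ids 0..2 are reserved, so byte value b maps to id b + 3.
SPECIAL_TOKEN_OFFSET = 3


def text_to_byte_ids(text: str) -> List[int]:
    """Encode text as UTF-8 and shift every byte by the special-token offset."""
    return [b + SPECIAL_TOKEN_OFFSET for b in text.encode("utf-8")]


def byte_ids_to_text(ids: List[int]) -> str:
    """Invert the mapping, skipping ids that do not correspond to a raw byte."""
    raw = bytes(
        i - SPECIAL_TOKEN_OFFSET
        for i in ids
        if SPECIAL_TOKEN_OFFSET <= i < SPECIAL_TOKEN_OFFSET + 256
    )
    return raw.decode("utf-8", errors="ignore")


ids = text_to_byte_ids("boîte")
print(ids)                    # one id per UTF-8 byte, so "î" contributes two ids
print(byte_ids_to_text(ids))  # round-trips back to the original string
```

Note that non-ASCII characters expand to several ids, which is why byte-level sequences are longer than their token-level counterparts.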
|
||||||
|
|
||||||
|
For batched inference and training, however, it is recommended to use the tokenizer:
|
||||||
|
|
||||||
|
```python
from transformers import T5ForConditionalGeneration, AutoTokenizer

model = T5ForConditionalGeneration.from_pretrained('google/byt5-small')
tokenizer = AutoTokenizer.from_pretrained('google/byt5-small')

model_inputs = tokenizer(["Life is like a box of chocolates.", "Today is Monday."], padding="longest", return_tensors="pt")
labels = tokenizer(["La vie est comme une boîte de chocolat.", "Aujourd'hui c'est lundi."], padding="longest", return_tensors="pt").input_ids

loss = model(**model_inputs, labels=labels).loss # forward pass
```
|
||||||
|
|
||||||
|
## ByT5Tokenizer
|
||||||
|
|
||||||
|
[[autodoc]] ByT5Tokenizer
|
||||||
|
|
||||||
|
See [`ByT5Tokenizer`] for all details.
|
||||||
@@ -1,86 +0,0 @@
|
|||||||
..
|
|
||||||
Copyright 2021 The HuggingFace Team. All rights reserved.
|
|
||||||
|
|
||||||
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
|
|
||||||
the License. You may obtain a copy of the License at
|
|
||||||
|
|
||||||
http://www.apache.org/licenses/LICENSE-2.0
|
|
||||||
|
|
||||||
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
|
|
||||||
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
|
|
||||||
specific language governing permissions and limitations under the License.
|
|
||||||
|
|
||||||
ByT5
|
|
||||||
-----------------------------------------------------------------------------------------------------------------------
|
|
||||||
|
|
||||||
Overview
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
The ByT5 model was presented in `ByT5: Towards a token-free future with pre-trained byte-to-byte models
|
|
||||||
<https://arxiv.org/abs/2105.13626>`_ by Linting Xue, Aditya Barua, Noah Constant, Rami Al-Rfou, Sharan Narang, Mihir
|
|
||||||
Kale, Adam Roberts, Colin Raffel.
|
|
||||||
|
|
||||||
The abstract from the paper is the following:
|
|
||||||
|
|
||||||
*Most widely-used pre-trained language models operate on sequences of tokens corresponding to word or subword units.
|
|
||||||
Encoding text as a sequence of tokens requires a tokenizer, which is typically created as an independent artifact from
|
|
||||||
the model. Token-free models that instead operate directly on raw text (bytes or characters) have many benefits: they
|
|
||||||
can process text in any language out of the box, they are more robust to noise, and they minimize technical debt by
|
|
||||||
removing complex and error-prone text preprocessing pipelines. Since byte or character sequences are longer than token
|
|
||||||
sequences, past work on token-free models has often introduced new model architectures designed to amortize the cost of
|
|
||||||
operating directly on raw text. In this paper, we show that a standard Transformer architecture can be used with
|
|
||||||
minimal modifications to process byte sequences. We carefully characterize the trade-offs in terms of parameter count,
|
|
||||||
training FLOPs, and inference speed, and show that byte-level models are competitive with their token-level
|
|
||||||
counterparts. We also demonstrate that byte-level models are significantly more robust to noise and perform better on
|
|
||||||
tasks that are sensitive to spelling and pronunciation. As part of our contribution, we release a new set of
|
|
||||||
pre-trained byte-level Transformer models based on the T5 architecture, as well as all code and data used in our
|
|
||||||
experiments.*
|
|
||||||
|
|
||||||
This model was contributed by `patrickvonplaten <https://huggingface.co/patrickvonplaten>`__. The original code can be
|
|
||||||
found `here <https://github.com/google-research/byt5>`__.
|
|
||||||
|
|
||||||
ByT5's architecture is based on the T5v1.1 model, so one can refer to :doc:`T5v1.1's documentation page <t5v1.1>`. They
|
|
||||||
only differ in how inputs should be prepared for the model, see the code examples below.
|
|
||||||
|
|
||||||
Since ByT5 was pre-trained unsupervisedly, there's no real advantage to using a task prefix during single-task
|
|
||||||
fine-tuning. If you are doing multi-task fine-tuning, you should use a prefix.
|
|
||||||
|
|
||||||
|
|
||||||
Example
|
|
||||||
_______________________________________________________________________________________________________________________
|
|
||||||
|
|
||||||
ByT5 works on raw UTF-8 bytes, so it can be used without a tokenizer:
|
|
||||||
|
|
||||||
.. code-block::

    from transformers import T5ForConditionalGeneration
    import torch

    model = T5ForConditionalGeneration.from_pretrained('google/byt5-small')

    input_ids = torch.tensor([list("Life is like a box of chocolates.".encode("utf-8"))]) + 3  # add 3 for special tokens
    labels = torch.tensor([list("La vie est comme une boîte de chocolat.".encode("utf-8"))]) + 3  # add 3 for special tokens

    loss = model(input_ids, labels=labels).loss  # forward pass
|
|
||||||
|
|
||||||
|
|
||||||
For batched inference and training, however, it is recommended to use the tokenizer:
|
|
||||||
|
|
||||||
.. code-block::

    from transformers import T5ForConditionalGeneration, AutoTokenizer

    model = T5ForConditionalGeneration.from_pretrained('google/byt5-small')
    tokenizer = AutoTokenizer.from_pretrained('google/byt5-small')

    model_inputs = tokenizer(["Life is like a box of chocolates.", "Today is Monday."], padding="longest", return_tensors="pt")
    labels = tokenizer(["La vie est comme une boîte de chocolat.", "Aujourd'hui c'est lundi."], padding="longest", return_tensors="pt").input_ids

    loss = model(**model_inputs, labels=labels).loss  # forward pass
|
|
||||||
|
|
||||||
ByT5Tokenizer
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.ByT5Tokenizer
|
|
||||||
|
|
||||||
See :class:`~transformers.ByT5Tokenizer` for all details.
|
|
||||||
106
docs/source/model_doc/camembert.mdx
Normal file
@@ -0,0 +1,106 @@
|
|||||||
|
<!--Copyright 2020 The HuggingFace Team. All rights reserved.
|
||||||
|
|
||||||
|
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
|
||||||
|
the License. You may obtain a copy of the License at
|
||||||
|
|
||||||
|
http://www.apache.org/licenses/LICENSE-2.0
|
||||||
|
|
||||||
|
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
|
||||||
|
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
|
||||||
|
specific language governing permissions and limitations under the License.
|
||||||
|
-->
|
||||||
|
|
||||||
|
# CamemBERT
|
||||||
|
|
||||||
|
## Overview
|
||||||
|
|
||||||
|
The CamemBERT model was proposed in [CamemBERT: a Tasty French Language Model](https://arxiv.org/abs/1911.03894) by
|
||||||
|
Louis Martin, Benjamin Muller, Pedro Javier Ortiz Suárez, Yoann Dupont, Laurent Romary, Éric Villemonte de la
|
||||||
|
Clergerie, Djamé Seddah, and Benoît Sagot. It is based on Facebook's RoBERTa model released in 2019. It is a model
|
||||||
|
trained on 138GB of French text.
|
||||||
|
|
||||||
|
The abstract from the paper is the following:
|
||||||
|
|
||||||
|
*Pretrained language models are now ubiquitous in Natural Language Processing. Despite their success, most available
|
||||||
|
models have either been trained on English data or on the concatenation of data in multiple languages. This makes
|
||||||
|
practical use of such models --in all languages except English-- very limited. Aiming to address this issue for French,
|
||||||
|
we release CamemBERT, a French version of the Bi-directional Encoders for Transformers (BERT). We measure the
|
||||||
|
performance of CamemBERT compared to multilingual models in multiple downstream tasks, namely part-of-speech tagging,
|
||||||
|
dependency parsing, named-entity recognition, and natural language inference. CamemBERT improves the state of the art
|
||||||
|
for most of the tasks considered. We release the pretrained model for CamemBERT hoping to foster research and
|
||||||
|
downstream applications for French NLP.*
|
||||||
|
|
||||||
|
Tips:
|
||||||
|
|
||||||
|
- This implementation is the same as RoBERTa. Refer to the [documentation of RoBERTa](roberta) for usage examples
|
||||||
|
as well as the information relative to the inputs and outputs.
|
||||||
|
|
||||||
|
This model was contributed by [camembert](https://huggingface.co/camembert). The original code can be found [here](https://camembert-model.fr/).
|
||||||
|
|
||||||
|
## CamembertConfig
|
||||||
|
|
||||||
|
[[autodoc]] CamembertConfig
|
||||||
|
|
||||||
|
## CamembertTokenizer
|
||||||
|
|
||||||
|
[[autodoc]] CamembertTokenizer
|
||||||
|
- build_inputs_with_special_tokens
|
||||||
|
- get_special_tokens_mask
|
||||||
|
- create_token_type_ids_from_sequences
|
||||||
|
- save_vocabulary
|
||||||
|
|
||||||
|
## CamembertTokenizerFast
|
||||||
|
|
||||||
|
[[autodoc]] CamembertTokenizerFast
|
||||||
|
|
||||||
|
## CamembertModel
|
||||||
|
|
||||||
|
[[autodoc]] CamembertModel
|
||||||
|
|
||||||
|
## CamembertForCausalLM
|
||||||
|
|
||||||
|
[[autodoc]] CamembertForCausalLM
|
||||||
|
|
||||||
|
## CamembertForMaskedLM
|
||||||
|
|
||||||
|
[[autodoc]] CamembertForMaskedLM
|
||||||
|
|
||||||
|
## CamembertForSequenceClassification
|
||||||
|
|
||||||
|
[[autodoc]] CamembertForSequenceClassification
|
||||||
|
|
||||||
|
## CamembertForMultipleChoice
|
||||||
|
|
||||||
|
[[autodoc]] CamembertForMultipleChoice
|
||||||
|
|
||||||
|
## CamembertForTokenClassification
|
||||||
|
|
||||||
|
[[autodoc]] CamembertForTokenClassification
|
||||||
|
|
||||||
|
## CamembertForQuestionAnswering
|
||||||
|
|
||||||
|
[[autodoc]] CamembertForQuestionAnswering
|
||||||
|
|
||||||
|
## TFCamembertModel
|
||||||
|
|
||||||
|
[[autodoc]] TFCamembertModel
|
||||||
|
|
||||||
|
## TFCamembertForMaskedLM
|
||||||
|
|
||||||
|
[[autodoc]] TFCamembertForMaskedLM
|
||||||
|
|
||||||
|
## TFCamembertForSequenceClassification
|
||||||
|
|
||||||
|
[[autodoc]] TFCamembertForSequenceClassification
|
||||||
|
|
||||||
|
## TFCamembertForMultipleChoice
|
||||||
|
|
||||||
|
[[autodoc]] TFCamembertForMultipleChoice
|
||||||
|
|
||||||
|
## TFCamembertForTokenClassification
|
||||||
|
|
||||||
|
[[autodoc]] TFCamembertForTokenClassification
|
||||||
|
|
||||||
|
## TFCamembertForQuestionAnswering
|
||||||
|
|
||||||
|
[[autodoc]] TFCamembertForQuestionAnswering
|
||||||
@@ -1,153 +0,0 @@
|
|||||||
..
|
|
||||||
Copyright 2020 The HuggingFace Team. All rights reserved.
|
|
||||||
|
|
||||||
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
|
|
||||||
the License. You may obtain a copy of the License at
|
|
||||||
|
|
||||||
http://www.apache.org/licenses/LICENSE-2.0
|
|
||||||
|
|
||||||
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
|
|
||||||
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
|
|
||||||
specific language governing permissions and limitations under the License.
|
|
||||||
|
|
||||||
CamemBERT
|
|
||||||
-----------------------------------------------------------------------------------------------------------------------
|
|
||||||
|
|
||||||
Overview
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
The CamemBERT model was proposed in `CamemBERT: a Tasty French Language Model <https://arxiv.org/abs/1911.03894>`__ by
|
|
||||||
Louis Martin, Benjamin Muller, Pedro Javier Ortiz Suárez, Yoann Dupont, Laurent Romary, Éric Villemonte de la
|
|
||||||
Clergerie, Djamé Seddah, and Benoît Sagot. It is based on Facebook's RoBERTa model released in 2019. It is a model
|
|
||||||
trained on 138GB of French text.
|
|
||||||
|
|
||||||
The abstract from the paper is the following:
|
|
||||||
|
|
||||||
*Pretrained language models are now ubiquitous in Natural Language Processing. Despite their success, most available
|
|
||||||
models have either been trained on English data or on the concatenation of data in multiple languages. This makes
|
|
||||||
practical use of such models --in all languages except English-- very limited. Aiming to address this issue for French,
|
|
||||||
we release CamemBERT, a French version of the Bi-directional Encoders for Transformers (BERT). We measure the
|
|
||||||
performance of CamemBERT compared to multilingual models in multiple downstream tasks, namely part-of-speech tagging,
|
|
||||||
dependency parsing, named-entity recognition, and natural language inference. CamemBERT improves the state of the art
|
|
||||||
for most of the tasks considered. We release the pretrained model for CamemBERT hoping to foster research and
|
|
||||||
downstream applications for French NLP.*
|
|
||||||
|
|
||||||
Tips:
|
|
||||||
|
|
||||||
- This implementation is the same as RoBERTa. Refer to the :doc:`documentation of RoBERTa <roberta>` for usage examples
|
|
||||||
as well as the information relative to the inputs and outputs.
|
|
||||||
|
|
||||||
This model was contributed by `camembert <https://huggingface.co/camembert>`__. The original code can be found `here
|
|
||||||
<https://camembert-model.fr/>`__.
|
|
||||||
|
|
||||||
CamembertConfig
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.CamembertConfig
|
|
||||||
:members:
|
|
||||||
|
|
||||||
|
|
||||||
CamembertTokenizer
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.CamembertTokenizer
|
|
||||||
:members: build_inputs_with_special_tokens, get_special_tokens_mask,
|
|
||||||
create_token_type_ids_from_sequences, save_vocabulary
|
|
||||||
|
|
||||||
|
|
||||||
CamembertTokenizerFast
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.CamembertTokenizerFast
|
|
||||||
:members:
|
|
||||||
|
|
||||||
|
|
||||||
CamembertModel
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.CamembertModel
|
|
||||||
:members:
|
|
||||||
|
|
||||||
|
|
||||||
CamembertForCausalLM
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.CamembertForCausalLM
|
|
||||||
:members:
|
|
||||||
|
|
||||||
|
|
||||||
CamembertForMaskedLM
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.CamembertForMaskedLM
|
|
||||||
:members:
|
|
||||||
|
|
||||||
|
|
||||||
CamembertForSequenceClassification
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.CamembertForSequenceClassification
|
|
||||||
:members:
|
|
||||||
|
|
||||||
|
|
||||||
CamembertForMultipleChoice
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.CamembertForMultipleChoice
|
|
||||||
:members:
|
|
||||||
|
|
||||||
|
|
||||||
CamembertForTokenClassification
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.CamembertForTokenClassification
|
|
||||||
:members:
|
|
||||||
|
|
||||||
|
|
||||||
CamembertForQuestionAnswering
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.CamembertForQuestionAnswering
|
|
||||||
:members:
|
|
||||||
|
|
||||||
|
|
||||||
TFCamembertModel
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.TFCamembertModel
|
|
||||||
:members:
|
|
||||||
|
|
||||||
|
|
||||||
TFCamembertForMaskedLM
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.TFCamembertForMaskedLM
|
|
||||||
:members:
|
|
||||||
|
|
||||||
|
|
||||||
TFCamembertForSequenceClassification
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.TFCamembertForSequenceClassification
|
|
||||||
:members:
|
|
||||||
|
|
||||||
|
|
||||||
TFCamembertForMultipleChoice
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.TFCamembertForMultipleChoice
|
|
||||||
:members:
|
|
||||||
|
|
||||||
|
|
||||||
TFCamembertForTokenClassification
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.TFCamembertForTokenClassification
|
|
||||||
:members:
|
|
||||||
|
|
||||||
|
|
||||||
TFCamembertForQuestionAnswering
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.TFCamembertForQuestionAnswering
|
|
||||||
:members:
|
|
||||||
133
docs/source/model_doc/canine.mdx
Normal file
@@ -0,0 +1,133 @@
|
|||||||
|
<!--Copyright 2021 The HuggingFace Team. All rights reserved.
|
||||||
|
|
||||||
|
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
|
||||||
|
the License. You may obtain a copy of the License at
|
||||||
|
|
||||||
|
http://www.apache.org/licenses/LICENSE-2.0
|
||||||
|
|
||||||
|
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
|
||||||
|
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
|
||||||
|
specific language governing permissions and limitations under the License.
|
||||||
|
-->
|
||||||
|
|
||||||
|
# CANINE
|
||||||
|
|
||||||
|
## Overview
|
||||||
|
|
||||||
|
The CANINE model was proposed in [CANINE: Pre-training an Efficient Tokenization-Free Encoder for Language
|
||||||
|
Representation](https://arxiv.org/abs/2103.06874) by Jonathan H. Clark, Dan Garrette, Iulia Turc, John Wieting. It's
|
||||||
|
among the first papers to train a Transformer without using an explicit tokenization step (such as Byte Pair
|
||||||
|
Encoding (BPE), WordPiece or SentencePiece). Instead, the model is trained directly at the Unicode character level.
|
||||||
|
Training at the character level inevitably comes with a longer sequence length, which CANINE solves with an efficient
|
||||||
|
downsampling strategy, before applying a deep Transformer encoder.
|
||||||
|
|
||||||
|
The abstract from the paper is the following:
|
||||||
|
|
||||||
|
*Pipelined NLP systems have largely been superseded by end-to-end neural modeling, yet nearly all commonly-used models
|
||||||
|
still require an explicit tokenization step. While recent tokenization approaches based on data-derived subword
|
||||||
|
lexicons are less brittle than manually engineered tokenizers, these techniques are not equally suited to all
|
||||||
|
languages, and the use of any fixed vocabulary may limit a model's ability to adapt. In this paper, we present CANINE,
|
||||||
|
a neural encoder that operates directly on character sequences, without explicit tokenization or vocabulary, and a
|
||||||
|
pre-training strategy that operates either directly on characters or optionally uses subwords as a soft inductive bias.
|
||||||
|
To use its finer-grained input effectively and efficiently, CANINE combines downsampling, which reduces the input
|
||||||
|
sequence length, with a deep transformer stack, which encodes context. CANINE outperforms a comparable mBERT model by
|
||||||
|
2.8 F1 on TyDi QA, a challenging multilingual benchmark, despite having 28% fewer model parameters.*
|
||||||
|
|
||||||
|
Tips:
|
||||||
|
|
||||||
|
- CANINE uses no fewer than 3 Transformer encoders internally: 2 "shallow" encoders (which only consist of a single
|
||||||
|
layer) and 1 "deep" encoder (which is a regular BERT encoder). First, a "shallow" encoder is used to contextualize
|
||||||
|
the character embeddings, using local attention. Next, after downsampling, a "deep" encoder is applied. Finally,
|
||||||
|
after upsampling, a "shallow" encoder is used to create the final character embeddings. Details regarding up- and
|
||||||
|
downsampling can be found in the paper.
|
||||||
|
- CANINE uses a max sequence length of 2048 characters by default. One can use [`CanineTokenizer`]
|
||||||
|
to prepare text for the model.
|
||||||
|
- Classification can be done by placing a linear layer on top of the final hidden state of the special [CLS] token
|
||||||
|
(which has a predefined Unicode code point). For token classification tasks however, the downsampled sequence of
|
||||||
|
tokens needs to be upsampled again to match the length of the original character sequence (which is 2048). The
|
||||||
|
details for this can be found in the paper.
|
||||||
|
- Models:
|
||||||
|
|
||||||
|
- [google/canine-c](https://huggingface.co/google/canine-c): Pre-trained with autoregressive character loss,
|
||||||
|
12-layer, 768-hidden, 12-heads, 121M parameters (size ~500 MB).
|
||||||
|
- [google/canine-s](https://huggingface.co/google/canine-s): Pre-trained with subword loss, 12-layer,
|
||||||
|
768-hidden, 12-heads, 121M parameters (size ~500 MB).
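
To make the three-encoder layout from the first tip more concrete, here is a minimal sketch of the sequence lengths each stage would operate on. It is plain arithmetic, not library code: the helper name `stage_lengths` is invented for illustration, the downsampling rate of 4 is an assumption of this sketch, and any special handling of the [CLS] position is ignored.

```python
def stage_lengths(num_chars, downsampling_rate=4):
    """Return the sequence length each encoder stage operates on.

    downsampling_rate=4 is an assumption made for this sketch.
    """
    if num_chars % downsampling_rate != 0:
        raise ValueError("num_chars must be divisible by downsampling_rate in this sketch")
    downsampled = num_chars // downsampling_rate
    return [
        ("shallow encoder (characters, local attention)", num_chars),
        ("deep encoder (after downsampling)", downsampled),
        ("shallow encoder (after upsampling)", num_chars),
    ]


for name, length in stage_lengths(2048):
    print(f"{name}: {length} positions")
```

Under these assumptions, only the deep encoder runs on the shortened sequence, which is where most of the compute savings would come from.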
|
||||||
|
|
||||||
|
This model was contributed by [nielsr](https://huggingface.co/nielsr). The original code can be found [here](https://github.com/google-research/language/tree/master/language/canine).
|
||||||
|
|
||||||
|
|
||||||
|
### Example
|
||||||
|
|
||||||
|
CANINE works on raw characters, so it can be used without a tokenizer:
|
||||||
|
|
||||||
|
```python
>>> from transformers import CanineModel
>>> import torch

>>> model = CanineModel.from_pretrained('google/canine-c')  # model pre-trained with autoregressive character loss

>>> text = "hello world"
>>> # use Python's built-in ord() function to turn each character into its Unicode code point id
>>> input_ids = torch.tensor([[ord(char) for char in text]])

>>> outputs = model(input_ids)  # forward pass
>>> pooled_output = outputs.pooler_output
>>> sequence_output = outputs.last_hidden_state
```
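
The character-to-id mapping used above needs nothing beyond Python's built-ins. The following minimal sketch, which does not load the model, shows the round trip between text and Unicode code points. The helper names `text_to_ids` and `ids_to_text` are made up for illustration and are not part of the library.

```python
def text_to_ids(text):
    """Map each character to its Unicode code point."""
    return [ord(char) for char in text]


def ids_to_text(ids):
    """Inverse mapping: turn code points back into characters."""
    return "".join(chr(i) for i in ids)


ids = text_to_ids("hello world")
print(ids)
print(ids_to_text(ids))
```

Because the ids are code points rather than entries in a learned vocabulary, the mapping is lossless for any text Python can represent as a string.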
|
||||||
|
|
||||||
|
For batched inference and training, it is however recommended to make use of the tokenizer (to pad/truncate all
|
||||||
|
sequences to the same length):
|
||||||
|
|
||||||
|
```python
>>> from transformers import CanineTokenizer, CanineModel

>>> model = CanineModel.from_pretrained('google/canine-c')
>>> tokenizer = CanineTokenizer.from_pretrained('google/canine-c')

>>> inputs = ["Life is like a box of chocolates.", "You never know what you gonna get."]
>>> encoding = tokenizer(inputs, padding="longest", truncation=True, return_tensors="pt")

>>> outputs = model(**encoding)  # forward pass
>>> pooled_output = outputs.pooler_output
>>> sequence_output = outputs.last_hidden_state
```
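
For intuition about what `padding="longest"` has to accomplish, the sketch below pads a batch of code-point sequences by hand and builds the matching attention mask. It is a simplified illustration rather than the tokenizer's implementation: the helper `pad_batch` is hypothetical, the padding id of 0 is an assumption, and special tokens such as [CLS] are left out.

```python
PAD_ID = 0  # assumption: padding id chosen for illustration only


def pad_batch(texts, pad_id=PAD_ID):
    """Convert texts to code points and right-pad them to the longest one."""
    batch = [[ord(char) for char in text] for text in texts]
    max_len = max(len(ids) for ids in batch)
    input_ids = [ids + [pad_id] * (max_len - len(ids)) for ids in batch]
    # 1 marks a real character, 0 marks padding
    attention_mask = [[1] * len(ids) + [0] * (max_len - len(ids)) for ids in batch]
    return input_ids, attention_mask


texts = ["Life is like a box of chocolates.", "Today is Monday."]
input_ids, attention_mask = pad_batch(texts)
print(len(input_ids[0]), len(input_ids[1]))
```

In practice the tokenizer is the safer choice, since it also takes care of truncation and special tokens.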
|
||||||
|
|
||||||
|
## CANINE specific outputs
|
||||||
|
|
||||||
|
[[autodoc]] models.canine.modeling_canine.CanineModelOutputWithPooling
|
||||||
|
|
||||||
|
## CanineConfig
|
||||||
|
|
||||||
|
[[autodoc]] CanineConfig
|
||||||
|
|
||||||
|
## CanineTokenizer
|
||||||
|
|
||||||
|
[[autodoc]] CanineTokenizer
|
||||||
|
- build_inputs_with_special_tokens
|
||||||
|
- get_special_tokens_mask
|
||||||
|
- create_token_type_ids_from_sequences
|
||||||
|
|
||||||
|
## CanineModel
|
||||||
|
|
||||||
|
[[autodoc]] CanineModel
|
||||||
|
- forward
|
||||||
|
|
||||||
|
## CanineForSequenceClassification
|
||||||
|
|
||||||
|
[[autodoc]] CanineForSequenceClassification
|
||||||
|
- forward
|
||||||
|
|
||||||
|
## CanineForMultipleChoice
|
||||||
|
|
||||||
|
[[autodoc]] CanineForMultipleChoice
|
||||||
|
- forward
|
||||||
|
|
||||||
|
## CanineForTokenClassification
|
||||||
|
|
||||||
|
[[autodoc]] CanineForTokenClassification
|
||||||
|
- forward
|
||||||
|
|
||||||
|
## CanineForQuestionAnswering
|
||||||
|
|
||||||
|
[[autodoc]] CanineForQuestionAnswering
|
||||||
|
- forward
|
||||||
@@ -1,155 +0,0 @@
|
|||||||
..
|
|
||||||
Copyright 2021 The HuggingFace Team. All rights reserved.
|
|
||||||
|
|
||||||
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
|
|
||||||
the License. You may obtain a copy of the License at
|
|
||||||
|
|
||||||
http://www.apache.org/licenses/LICENSE-2.0
|
|
||||||
|
|
||||||
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
|
|
||||||
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
|
|
||||||
specific language governing permissions and limitations under the License.
|
|
||||||
|
|
||||||
CANINE
|
|
||||||
-----------------------------------------------------------------------------------------------------------------------
|
|
||||||
|
|
||||||
Overview
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
The CANINE model was proposed in `CANINE: Pre-training an Efficient Tokenization-Free Encoder for Language
|
|
||||||
Representation <https://arxiv.org/abs/2103.06874>`__ by Jonathan H. Clark, Dan Garrette, Iulia Turc, John Wieting. It's
|
|
||||||
among the first papers that trains a Transformer without using an explicit tokenization step (such as Byte Pair
|
|
||||||
Encoding (BPE), WordPiece or SentencePiece). Instead, the model is trained directly at a Unicode character-level.
|
|
||||||
Training at a character-level inevitably comes with a longer sequence length, which CANINE solves with an efficient
|
|
||||||
downsampling strategy, before applying a deep Transformer encoder.
|
|
||||||
|
|
||||||
The abstract from the paper is the following:
|
|
||||||
|
|
||||||
*Pipelined NLP systems have largely been superseded by end-to-end neural modeling, yet nearly all commonly-used models
|
|
||||||
still require an explicit tokenization step. While recent tokenization approaches based on data-derived subword
|
|
||||||
lexicons are less brittle than manually engineered tokenizers, these techniques are not equally suited to all
|
|
||||||
languages, and the use of any fixed vocabulary may limit a model's ability to adapt. In this paper, we present CANINE,
|
|
||||||
a neural encoder that operates directly on character sequences, without explicit tokenization or vocabulary, and a
|
|
||||||
pre-training strategy that operates either directly on characters or optionally uses subwords as a soft inductive bias.
|
|
||||||
To use its finer-grained input effectively and efficiently, CANINE combines downsampling, which reduces the input
|
|
||||||
sequence length, with a deep transformer stack, which encodes context. CANINE outperforms a comparable mBERT model by
|
|
||||||
2.8 F1 on TyDi QA, a challenging multilingual benchmark, despite having 28% fewer model parameters.*
|
|
||||||
|
|
||||||
Tips:
|
|
||||||
|
|
||||||
- CANINE uses no less than 3 Transformer encoders internally: 2 "shallow" encoders (which only consist of a single
|
|
||||||
layer) and 1 "deep" encoder (which is a regular BERT encoder). First, a "shallow" encoder is used to contextualize
|
|
||||||
the character embeddings, using local attention. Next, after downsampling, a "deep" encoder is applied. Finally,
|
|
||||||
after upsampling, a "shallow" encoder is used to create the final character embeddings. Details regarding up- and
|
|
||||||
downsampling can be found in the paper.
|
|
||||||
- CANINE uses a max sequence length of 2048 characters by default. One can use :class:`~transformers.CanineTokenizer`
|
|
||||||
to prepare text for the model.
|
|
||||||
- Classification can be done by placing a linear layer on top of the final hidden state of the special [CLS] token
|
|
||||||
(which has a predefined Unicode code point). For token classification tasks however, the downsampled sequence of
|
|
||||||
tokens needs to be upsampled again to match the length of the original character sequence (which is 2048). The
|
|
||||||
details for this can be found in the paper.
|
|
||||||
- Models:
|
|
||||||
|
|
||||||
- `google/canine-c <https://huggingface.co/google/canine-c>`__: Pre-trained with autoregressive character loss,
|
|
||||||
12-layer, 768-hidden, 12-heads, 121M parameters (size ~500 MB).
|
|
||||||
- `google/canine-s <https://huggingface.co/google/canine-s>`__: Pre-trained with subword loss, 12-layer,
|
|
||||||
768-hidden, 12-heads, 121M parameters (size ~500 MB).
|
|
||||||
|
|
||||||
This model was contributed by `nielsr <https://huggingface.co/nielsr>`__. The original code can be found `here
|
|
||||||
<https://github.com/google-research/language/tree/master/language/canine>`__.
|
|
||||||
|
|
||||||
|
|
||||||
Example
|
|
||||||
_______________________________________________________________________________________________________________________
|
|
||||||
|
|
||||||
CANINE works on raw characters, so it can be used without a tokenizer:
|
|
||||||
|
|
||||||
.. code-block::
|
|
||||||
|
|
||||||
from transformers import CanineModel
|
|
||||||
import torch
|
|
||||||
|
|
||||||
model = CanineModel.from_pretrained('google/canine-c') # model pre-trained with autoregressive character loss
|
|
||||||
|
|
||||||
text = "hello world"
|
|
||||||
# use Python's built-in ord() function to turn each character into its unicode code point id
|
|
||||||
input_ids = torch.tensor([[ord(char) for char in text]])
|
|
||||||
|
|
||||||
outputs = model(input_ids) # forward pass
|
|
||||||
pooled_output = outputs.pooler_output
|
|
||||||
sequence_output = outputs.last_hidden_state
|
|
||||||
|
|
||||||
|
|
||||||
For batched inference and training, it is however recommended to make use of the tokenizer (to pad/truncate all
|
|
||||||
sequences to the same length):
|
|
||||||
|
|
||||||
.. code-block::
|
|
||||||
|
|
||||||
from transformers import CanineTokenizer, CanineModel
|
|
||||||
|
|
||||||
model = CanineModel.from_pretrained('google/canine-c')
|
|
||||||
tokenizer = CanineTokenizer.from_pretrained('google/canine-c')
|
|
||||||
|
|
||||||
inputs = ["Life is like a box of chocolates.", "You never know what you gonna get."]
|
|
||||||
encoding = tokenizer(inputs, padding="longest", truncation=True, return_tensors="pt")
|
|
||||||
|
|
||||||
outputs = model(**encoding) # forward pass
|
|
||||||
pooled_output = outputs.pooler_output
|
|
||||||
sequence_output = outputs.last_hidden_state
|
|
||||||
|
|
||||||
|
|
||||||
CANINE specific outputs
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.models.canine.modeling_canine.CanineModelOutputWithPooling
|
|
||||||
:members:
|
|
||||||
|
|
||||||
|
|
||||||
CanineConfig
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.CanineConfig
|
|
||||||
:members:
|
|
||||||
|
|
||||||
|
|
||||||
CanineTokenizer
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.CanineTokenizer
|
|
||||||
:members: build_inputs_with_special_tokens, get_special_tokens_mask,
|
|
||||||
create_token_type_ids_from_sequences
|
|
||||||
|
|
||||||
|
|
||||||
CanineModel
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.CanineModel
|
|
||||||
:members: forward
|
|
||||||
|
|
||||||
|
|
||||||
CanineForSequenceClassification
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.CanineForSequenceClassification
|
|
||||||
:members: forward
|
|
||||||
|
|
||||||
|
|
||||||
CanineForMultipleChoice
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.CanineForMultipleChoice
|
|
||||||
:members: forward
|
|
||||||
|
|
||||||
|
|
||||||
CanineForTokenClassification
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.CanineForTokenClassification
|
|
||||||
:members: forward
|
|
||||||
|
|
||||||
|
|
||||||
CanineForQuestionAnswering
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.CanineForQuestionAnswering
|
|
||||||
:members: forward
|
|
||||||
143
docs/source/model_doc/clip.mdx
Normal file
@@ -0,0 +1,143 @@
|
|||||||
|
<!--Copyright 2021 The HuggingFace Team. All rights reserved.
|
||||||
|
|
||||||
|
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
|
||||||
|
the License. You may obtain a copy of the License at
|
||||||
|
|
||||||
|
http://www.apache.org/licenses/LICENSE-2.0
|
||||||
|
|
||||||
|
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
|
||||||
|
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
|
||||||
|
specific language governing permissions and limitations under the License.
|
||||||
|
-->
|
||||||
|
|
||||||
|
# CLIP
|
||||||
|
|
||||||
|
## Overview
|
||||||
|
|
||||||
|
The CLIP model was proposed in [Learning Transferable Visual Models From Natural Language Supervision](https://arxiv.org/abs/2103.00020) by Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh,
|
||||||
|
Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, Ilya Sutskever. CLIP
|
||||||
|
(Contrastive Language-Image Pre-Training) is a neural network trained on a variety of (image, text) pairs. It can be
|
||||||
|
instructed in natural language to predict the most relevant text snippet, given an image, without directly optimizing
|
||||||
|
for the task, similarly to the zero-shot capabilities of GPT-2 and GPT-3.

The abstract from the paper is the following:

*State-of-the-art computer vision systems are trained to predict a fixed set of predetermined object categories. This
restricted form of supervision limits their generality and usability since additional labeled data is needed to specify
any other visual concept. Learning directly from raw text about images is a promising alternative which leverages a
much broader source of supervision. We demonstrate that the simple pre-training task of predicting which caption goes
with which image is an efficient and scalable way to learn SOTA image representations from scratch on a dataset of 400
million (image, text) pairs collected from the internet. After pre-training, natural language is used to reference
learned visual concepts (or describe new ones) enabling zero-shot transfer of the model to downstream tasks. We study
the performance of this approach by benchmarking on over 30 different existing computer vision datasets, spanning tasks
such as OCR, action recognition in videos, geo-localization, and many types of fine-grained object classification. The
model transfers non-trivially to most tasks and is often competitive with a fully supervised baseline without the need
for any dataset specific training. For instance, we match the accuracy of the original ResNet-50 on ImageNet zero-shot
without needing to use any of the 1.28 million training examples it was trained on. We release our code and pre-trained
model weights at this https URL.*

## Usage

CLIP is a multi-modal vision and language model. It can be used for image-text similarity and for zero-shot image
classification. CLIP uses a ViT-like transformer to get visual features and a causal language model to get the text
features. Both the text and visual features are then projected to a latent space with the same dimension. The dot
product between the projected image and text features is then used as a similarity score.
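
The scoring step described above can be sketched without any model weights. The snippet below is a minimal illustration in plain Python, not the library's implementation. The feature vectors are made-up numbers, the helper names (`normalize`, `similarity_logits`, `softmax`) are hypothetical, and the `logit_scale` value of 100.0 is an assumption standing in for a learned temperature. The sketch also assumes the projected features are normalized to unit length before the dot product, so the score behaves like a cosine similarity.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so the dot product becomes a cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def similarity_logits(image_embeds, text_embeds, logit_scale=100.0):
    """Return one row per image holding its scaled dot product with every text."""
    image_embeds = [normalize(v) for v in image_embeds]
    text_embeds = [normalize(v) for v in text_embeds]
    return [
        [logit_scale * sum(a * b for a, b in zip(img, txt)) for txt in text_embeds]
        for img in image_embeds
    ]


def softmax(row):
    """Turn a row of logits into probabilities that sum to 1."""
    peak = max(row)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up projected features (assumption): one image and two candidate captions.
image_embeds = [[0.9, 0.1, 0.2]]
text_embeds = [[1.0, 0.0, 0.1], [0.0, 1.0, 0.3]]

logits_per_image = similarity_logits(image_embeds, text_embeds)
probs = [softmax(row) for row in logits_per_image]

# Index of the caption with the highest probability for the first image.
best_caption = max(range(len(probs[0])), key=probs[0].__getitem__)
```

Each row of `logits_per_image` corresponds to one image and each column to one text, which is why the softmax is taken along the text axis when picking a label for an image.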

To feed images to the Transformer encoder, each image is split into a sequence of fixed-size non-overlapping patches,
which are then linearly embedded. A [CLS] token is added to serve as a representation of the entire image. The authors
also add absolute position embeddings, and feed the resulting sequence of vectors to a standard Transformer encoder.
The [`CLIPFeatureExtractor`] can be used to resize (or rescale) and normalize images for the model.
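
The patch pipeline described above can be illustrated on a toy input. This is a rough sketch in plain Python under stated assumptions, not how the model is implemented: the "image" is a 4x4 single-channel grid (real inputs have three channels and are much larger), the function names `split_into_patches` and `linear_embed` are hypothetical, and the projection weights, class embedding, and position embeddings are fixed made-up values where a real model would learn them.

```python
def split_into_patches(image, patch_size):
    """Split a 2D grid (list of rows) into flattened, non-overlapping square patches."""
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be divisible by patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = [
                image[row][col]
                for row in range(top, top + patch_size)
                for col in range(left, left + patch_size)
            ]
            patches.append(patch)
    return patches


def linear_embed(patch, weight):
    """Project a flattened patch with a weight matrix of shape (embed_dim, patch_len)."""
    return [sum(w * x for w, x in zip(row, patch)) for row in weight]


# Toy 4x4 single-channel "image" and 2x2 patches, giving 4 patches of 4 values each.
image = [[row * 4 + col for col in range(4)] for row in range(4)]
patches = split_into_patches(image, patch_size=2)

# Hypothetical fixed weights (embed_dim=2); a real model learns these.
weight = [[0.25, 0.25, 0.25, 0.25], [1.0, 0.0, 0.0, -1.0]]
patch_embeds = [linear_embed(p, weight) for p in patches]

# Prepend a class embedding, then add one position embedding per sequence slot.
class_embed = [0.0, 0.0]
sequence = [class_embed] + patch_embeds
position_embeds = [[0.01 * i, 0.0] for i in range(len(sequence))]
encoder_input = [
    [tok + pos for tok, pos in zip(token, position)]
    for token, position in zip(sequence, position_embeds)
]
```

The resulting `encoder_input` has one vector for the class token plus one per patch, which is the sequence a Transformer encoder would consume.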

The [`CLIPTokenizer`] is used to encode the text. The [`CLIPProcessor`] wraps
[`CLIPFeatureExtractor`] and [`CLIPTokenizer`] into a single instance to both
encode the text and prepare the images. The following example shows how to get the image-text similarity scores using
[`CLIPProcessor`] and [`CLIPModel`].


```python
>>> from PIL import Image
>>> import requests

>>> from transformers import CLIPProcessor, CLIPModel

>>> model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
>>> processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)

>>> inputs = processor(text=["a photo of a cat", "a photo of a dog"], images=image, return_tensors="pt", padding=True)

>>> outputs = model(**inputs)
>>> logits_per_image = outputs.logits_per_image # this is the image-text similarity score
>>> probs = logits_per_image.softmax(dim=1) # we can take the softmax to get the label probabilities
```

This model was contributed by [valhalla](https://huggingface.co/valhalla). The original code can be found [here](https://github.com/openai/CLIP).

## CLIPConfig

[[autodoc]] CLIPConfig
- from_text_vision_configs

## CLIPTextConfig

[[autodoc]] CLIPTextConfig

## CLIPVisionConfig

[[autodoc]] CLIPVisionConfig

## CLIPTokenizer

[[autodoc]] CLIPTokenizer
- build_inputs_with_special_tokens
- get_special_tokens_mask
- create_token_type_ids_from_sequences
- save_vocabulary

## CLIPTokenizerFast

[[autodoc]] CLIPTokenizerFast

## CLIPFeatureExtractor

[[autodoc]] CLIPFeatureExtractor

## CLIPProcessor

[[autodoc]] CLIPProcessor

## CLIPModel

[[autodoc]] CLIPModel
- forward
- get_text_features
- get_image_features

## CLIPTextModel

[[autodoc]] CLIPTextModel
- forward

## CLIPVisionModel

[[autodoc]] CLIPVisionModel
- forward

## FlaxCLIPModel

[[autodoc]] FlaxCLIPModel
- __call__
- get_text_features
- get_image_features

## FlaxCLIPTextModel

[[autodoc]] FlaxCLIPTextModel
- __call__

## FlaxCLIPVisionModel

[[autodoc]] FlaxCLIPVisionModel
- __call__
@@ -1,174 +0,0 @@
|
|||||||
..
|
|
||||||
Copyright 2021 The HuggingFace Team. All rights reserved.
|
|
||||||
|
|
||||||
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
|
|
||||||
the License. You may obtain a copy of the License at
|
|
||||||
|
|
||||||
http://www.apache.org/licenses/LICENSE-2.0
|
|
||||||
|
|
||||||
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
|
|
||||||
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
|
|
||||||
specific language governing permissions and limitations under the License.
|
|
||||||
|
|
||||||
CLIP
|
|
||||||
-----------------------------------------------------------------------------------------------------------------------
|
|
||||||
|
|
||||||
Overview
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
The CLIP model was proposed in `Learning Transferable Visual Models From Natural Language Supervision
|
|
||||||
<https://arxiv.org/abs/2103.00020>`__ by Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh,
|
|
||||||
Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, Ilya Sutskever. CLIP
|
|
||||||
(Contrastive Language-Image Pre-Training) is a neural network trained on a variety of (image, text) pairs. It can be
|
|
||||||
instructed in natural language to predict the most relevant text snippet, given an image, without directly optimizing
|
|
||||||
for the task, similarly to the zero-shot capabilities of GPT-2 and 3.
|
|
||||||
|
|
||||||
The abstract from the paper is the following:
|
|
||||||
|
|
||||||
*State-of-the-art computer vision systems are trained to predict a fixed set of predetermined object categories. This
|
|
||||||
restricted form of supervision limits their generality and usability since additional labeled data is needed to specify
|
|
||||||
any other visual concept. Learning directly from raw text about images is a promising alternative which leverages a
|
|
||||||
much broader source of supervision. We demonstrate that the simple pre-training task of predicting which caption goes
|
|
||||||
with which image is an efficient and scalable way to learn SOTA image representations from scratch on a dataset of 400
|
|
||||||
million (image, text) pairs collected from the internet. After pre-training, natural language is used to reference
|
|
||||||
learned visual concepts (or describe new ones) enabling zero-shot transfer of the model to downstream tasks. We study
|
|
||||||
the performance of this approach by benchmarking on over 30 different existing computer vision datasets, spanning tasks
|
|
||||||
such as OCR, action recognition in videos, geo-localization, and many types of fine-grained object classification. The
|
|
||||||
model transfers non-trivially to most tasks and is often competitive with a fully supervised baseline without the need
|
|
||||||
for any dataset specific training. For instance, we match the accuracy of the original ResNet-50 on ImageNet zero-shot
|
|
||||||
without needing to use any of the 1.28 million training examples it was trained on. We release our code and pre-trained
|
|
||||||
model weights at this https URL.*
|
|
||||||
|
|
||||||
Usage
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
CLIP is a multi-modal vision and language model. It can be used for image-text similarity and for zero-shot image
|
|
||||||
classification. CLIP uses a ViT like transformer to get visual features and a causal language model to get the text
|
|
||||||
features. Both the text and visual features are then projected to a latent space with identical dimension. The dot
|
|
||||||
product between the projected image and text features is then used as a similar score.
|
|
||||||
|
|
||||||
To feed images to the Transformer encoder, each image is split into a sequence of fixed-size non-overlapping patches,
|
|
||||||
which are then linearly embedded. A [CLS] token is added to serve as representation of an entire image. The authors
|
|
||||||
also add absolute position embeddings, and feed the resulting sequence of vectors to a standard Transformer encoder.
|
|
||||||
The :class:`~transformers.CLIPFeatureExtractor` can be used to resize (or rescale) and normalize images for the model.
|
|
||||||
|
|
||||||
The :class:`~transformers.CLIPTokenizer` is used to encode the text. The :class:`~transformers.CLIPProcessor` wraps
|
|
||||||
:class:`~transformers.CLIPFeatureExtractor` and :class:`~transformers.CLIPTokenizer` into a single instance to both
|
|
||||||
encode the text and prepare the images. The following example shows how to get the image-text similarity scores using
|
|
||||||
:class:`~transformers.CLIPProcessor` and :class:`~transformers.CLIPModel`.
|
|
||||||
|
|
||||||
|
|
||||||
.. code-block::
|
|
||||||
|
|
||||||
>>> from PIL import Image
|
|
||||||
>>> import requests
|
|
||||||
|
|
||||||
>>> from transformers import CLIPProcessor, CLIPModel
|
|
||||||
|
|
||||||
>>> model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
|
|
||||||
>>> processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
|
|
||||||
|
|
||||||
>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
|
|
||||||
>>> image = Image.open(requests.get(url, stream=True).raw)
|
|
||||||
|
|
||||||
>>> inputs = processor(text=["a photo of a cat", "a photo of a dog"], images=image, return_tensors="pt", padding=True)
|
|
||||||
|
|
||||||
>>> outputs = model(**inputs)
|
|
||||||
>>> logits_per_image = outputs.logits_per_image # this is the image-text similarity score
|
|
||||||
>>> probs = logits_per_image.softmax(dim=1) # we can take the softmax to get the label probabilities
|
|
||||||
|
|
||||||
|
|
||||||
This model was contributed by `valhalla <https://huggingface.co/valhalla>`__. The original code can be found `here
|
|
||||||
<https://github.com/openai/CLIP>`__.
|
|
||||||
|
|
||||||
CLIPConfig
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.CLIPConfig
|
|
||||||
:members: from_text_vision_configs
|
|
||||||
|
|
||||||
|
|
||||||
CLIPTextConfig
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.CLIPTextConfig
|
|
||||||
:members:
|
|
||||||
|
|
||||||
|
|
||||||
CLIPVisionConfig
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.CLIPVisionConfig
|
|
||||||
:members:
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
CLIPTokenizer
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.CLIPTokenizer
|
|
||||||
:members: build_inputs_with_special_tokens, get_special_tokens_mask,
|
|
||||||
create_token_type_ids_from_sequences, save_vocabulary
|
|
||||||
|
|
||||||
CLIPTokenizerFast
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.CLIPTokenizerFast
|
|
||||||
:members:
|
|
||||||
|
|
||||||
|
|
||||||
CLIPFeatureExtractor
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.CLIPFeatureExtractor
|
|
||||||
:members:
|
|
||||||
|
|
||||||
|
|
||||||
CLIPProcessor
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.CLIPProcessor
|
|
||||||
:members:
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
CLIPModel
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.CLIPModel
|
|
||||||
:members: forward, get_text_features, get_image_features
|
|
||||||
|
|
||||||
|
|
||||||
CLIPTextModel
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.CLIPTextModel
|
|
||||||
:members: forward
|
|
||||||
|
|
||||||
|
|
||||||
CLIPVisionModel
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.CLIPVisionModel
|
|
||||||
:members: forward
|
|
||||||
|
|
||||||
|
|
||||||
FlaxCLIPModel
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.FlaxCLIPModel
|
|
||||||
:members: __call__, get_text_features, get_image_features
|
|
||||||
|
|
||||||
|
|
||||||
FlaxCLIPTextModel
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.FlaxCLIPTextModel
|
|
||||||
:members: __call__
|
|
||||||
|
|
||||||
|
|
||||||
FlaxCLIPVisionModel
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.FlaxCLIPVisionModel
|
|
||||||
:members: __call__
|
|
||||||
113
docs/source/model_doc/convbert.mdx
Normal file
@@ -0,0 +1,113 @@
<!--Copyright 2020 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->

# ConvBERT

## Overview

The ConvBERT model was proposed in [ConvBERT: Improving BERT with Span-based Dynamic Convolution](https://arxiv.org/abs/2008.02496) by Zihang Jiang, Weihao Yu, Daquan Zhou, Yunpeng Chen, Jiashi Feng, Shuicheng
Yan.

The abstract from the paper is the following:

*Pre-trained language models like BERT and its variants have recently achieved impressive performance in various
natural language understanding tasks. However, BERT heavily relies on the global self-attention block and thus suffers
large memory footprint and computation cost. Although all its attention heads query on the whole input sequence for
generating the attention map from a global perspective, we observe some heads only need to learn local dependencies,
which means the existence of computation redundancy. We therefore propose a novel span-based dynamic convolution to
replace these self-attention heads to directly model local dependencies. The novel convolution heads, together with the
rest self-attention heads, form a new mixed attention block that is more efficient at both global and local context
learning. We equip BERT with this mixed attention design and build a ConvBERT model. Experiments have shown that
ConvBERT significantly outperforms BERT and its variants in various downstream tasks, with lower training cost and
fewer model parameters. Remarkably, ConvBERTbase model achieves 86.4 GLUE score, 0.7 higher than ELECTRAbase, while
using less than 1/4 training cost. Code and pre-trained models will be released.*

ConvBERT training tips are similar to those of BERT.

This model was contributed by [abhishek](https://huggingface.co/abhishek). The original implementation can be found
here: https://github.com/yitu-opensource/ConvBert

## ConvBertConfig

[[autodoc]] ConvBertConfig

## ConvBertTokenizer

[[autodoc]] ConvBertTokenizer
- build_inputs_with_special_tokens
- get_special_tokens_mask
- create_token_type_ids_from_sequences
- save_vocabulary

## ConvBertTokenizerFast

[[autodoc]] ConvBertTokenizerFast

## ConvBertModel

[[autodoc]] ConvBertModel
- forward

## ConvBertForMaskedLM

[[autodoc]] ConvBertForMaskedLM
- forward

## ConvBertForSequenceClassification

[[autodoc]] ConvBertForSequenceClassification
- forward

## ConvBertForMultipleChoice

[[autodoc]] ConvBertForMultipleChoice
- forward

## ConvBertForTokenClassification

[[autodoc]] ConvBertForTokenClassification
- forward

## ConvBertForQuestionAnswering

[[autodoc]] ConvBertForQuestionAnswering
- forward

## TFConvBertModel

[[autodoc]] TFConvBertModel
- call

## TFConvBertForMaskedLM

[[autodoc]] TFConvBertForMaskedLM
- call

## TFConvBertForSequenceClassification

[[autodoc]] TFConvBertForSequenceClassification
- call

## TFConvBertForMultipleChoice

[[autodoc]] TFConvBertForMultipleChoice
- call

## TFConvBertForTokenClassification

[[autodoc]] TFConvBertForTokenClassification
- call

## TFConvBertForQuestionAnswering

[[autodoc]] TFConvBertForQuestionAnswering
- call
@@ -1,145 +0,0 @@
|
|||||||
..
|
|
||||||
Copyright 2020 The HuggingFace Team. All rights reserved.
|
|
||||||
|
|
||||||
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
|
|
||||||
the License. You may obtain a copy of the License at
|
|
||||||
|
|
||||||
http://www.apache.org/licenses/LICENSE-2.0
|
|
||||||
|
|
||||||
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
|
|
||||||
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
|
|
||||||
specific language governing permissions and limitations under the License.
|
|
||||||
|
|
||||||
ConvBERT
|
|
||||||
-----------------------------------------------------------------------------------------------------------------------
|
|
||||||
|
|
||||||
Overview
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
The ConvBERT model was proposed in `ConvBERT: Improving BERT with Span-based Dynamic Convolution
|
|
||||||
<https://arxiv.org/abs/2008.02496>`__ by Zihang Jiang, Weihao Yu, Daquan Zhou, Yunpeng Chen, Jiashi Feng, Shuicheng
|
|
||||||
Yan.
|
|
||||||
|
|
||||||
The abstract from the paper is the following:
|
|
||||||
|
|
||||||
*Pre-trained language models like BERT and its variants have recently achieved impressive performance in various
|
|
||||||
natural language understanding tasks. However, BERT heavily relies on the global self-attention block and thus suffers
|
|
||||||
large memory footprint and computation cost. Although all its attention heads query on the whole input sequence for
|
|
||||||
generating the attention map from a global perspective, we observe some heads only need to learn local dependencies,
|
|
||||||
which means the existence of computation redundancy. We therefore propose a novel span-based dynamic convolution to
|
|
||||||
replace these self-attention heads to directly model local dependencies. The novel convolution heads, together with the
|
|
||||||
rest self-attention heads, form a new mixed attention block that is more efficient at both global and local context
|
|
||||||
learning. We equip BERT with this mixed attention design and build a ConvBERT model. Experiments have shown that
|
|
||||||
ConvBERT significantly outperforms BERT and its variants in various downstream tasks, with lower training cost and
|
|
||||||
fewer model parameters. Remarkably, ConvBERTbase model achieves 86.4 GLUE score, 0.7 higher than ELECTRAbase, while
|
|
||||||
using less than 1/4 training cost. Code and pre-trained models will be released.*
|
|
||||||
|
|
||||||
ConvBERT training tips are similar to those of BERT.
|
|
||||||
|
|
||||||
This model was contributed by `abhishek <https://huggingface.co/abhishek>`__. The original implementation can be found
|
|
||||||
here: https://github.com/yitu-opensource/ConvBert
|
|
||||||
|
|
||||||
ConvBertConfig
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.ConvBertConfig
|
|
||||||
:members:
|
|
||||||
|
|
||||||
|
|
||||||
ConvBertTokenizer
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.ConvBertTokenizer
|
|
||||||
:members: build_inputs_with_special_tokens, get_special_tokens_mask,
|
|
||||||
create_token_type_ids_from_sequences, save_vocabulary
|
|
||||||
|
|
||||||
|
|
||||||
ConvBertTokenizerFast
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.ConvBertTokenizerFast
|
|
||||||
:members:
|
|
||||||
|
|
||||||
|
|
||||||
ConvBertModel
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.ConvBertModel
|
|
||||||
:members: forward
|
|
||||||
|
|
||||||
|
|
||||||
ConvBertForMaskedLM
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.ConvBertForMaskedLM
|
|
||||||
:members: forward
|
|
||||||
|
|
||||||
|
|
||||||
ConvBertForSequenceClassification
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.ConvBertForSequenceClassification
|
|
||||||
:members: forward
|
|
||||||
|
|
||||||
|
|
||||||
ConvBertForMultipleChoice
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.ConvBertForMultipleChoice
|
|
||||||
:members: forward
|
|
||||||
|
|
||||||
|
|
||||||
ConvBertForTokenClassification
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.ConvBertForTokenClassification
|
|
||||||
:members: forward
|
|
||||||
|
|
||||||
|
|
||||||
ConvBertForQuestionAnswering
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.ConvBertForQuestionAnswering
|
|
||||||
:members: forward
|
|
||||||
|
|
||||||
|
|
||||||
TFConvBertModel
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.TFConvBertModel
|
|
||||||
:members: call
|
|
||||||
|
|
||||||
|
|
||||||
TFConvBertForMaskedLM
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.TFConvBertForMaskedLM
|
|
||||||
:members: call
|
|
||||||
|
|
||||||
|
|
||||||
TFConvBertForSequenceClassification
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.TFConvBertForSequenceClassification
|
|
||||||
:members: call
|
|
||||||
|
|
||||||
|
|
||||||
TFConvBertForMultipleChoice
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.TFConvBertForMultipleChoice
|
|
||||||
:members: call
|
|
||||||
|
|
||||||
|
|
||||||
TFConvBertForTokenClassification
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.TFConvBertForTokenClassification
|
|
||||||
:members: call
|
|
||||||
|
|
||||||
|
|
||||||
TFConvBertForQuestionAnswering
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.TFConvBertForQuestionAnswering
|
|
||||||
:members: call
|
|
||||||
@@ -1,5 +1,4 @@
|
|||||||
..
|
<!--Copyright 2020 The HuggingFace Team. All rights reserved.
|
||||||
Copyright 2020 The HuggingFace Team. All rights reserved.
|
|
||||||
|
|
||||||
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
|
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
|
||||||
the License. You may obtain a copy of the License at
|
the License. You may obtain a copy of the License at
|
||||||
@@ -9,15 +8,13 @@
|
|||||||
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
|
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
|
||||||
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
|
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
|
||||||
specific language governing permissions and limitations under the License.
|
specific language governing permissions and limitations under the License.
|
||||||
|
-->
|
||||||
|
|
||||||
CPM
|
# CPM
|
||||||
-----------------------------------------------------------------------------------------------------------------------
|
|
||||||
|
|
||||||
Overview
|
## Overview
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
The CPM model was proposed in `CPM: A Large-scale Generative Chinese Pre-trained Language Model
|
The CPM model was proposed in [CPM: A Large-scale Generative Chinese Pre-trained Language Model](https://arxiv.org/abs/2012.00413) by Zhengyan Zhang, Xu Han, Hao Zhou, Pei Ke, Yuxian Gu, Deming Ye, Yujia Qin,
|
||||||
<https://arxiv.org/abs/2012.00413>`__ by Zhengyan Zhang, Xu Han, Hao Zhou, Pei Ke, Yuxian Gu, Deming Ye, Yujia Qin,
|
|
||||||
Yusheng Su, Haozhe Ji, Jian Guan, Fanchao Qi, Xiaozhi Wang, Yanan Zheng, Guoyang Zeng, Huanqi Cao, Shengqi Chen,
|
Yusheng Su, Haozhe Ji, Jian Guan, Fanchao Qi, Xiaozhi Wang, Yanan Zheng, Guoyang Zeng, Huanqi Cao, Shengqi Chen,
|
||||||
Daixuan Li, Zhenbo Sun, Zhiyuan Liu, Minlie Huang, Wentao Han, Jie Tang, Juanzi Li, Xiaoyan Zhu, Maosong Sun.
|
Daixuan Li, Zhenbo Sun, Zhiyuan Liu, Minlie Huang, Wentao Han, Jie Tang, Juanzi Li, Xiaoyan Zhu, Maosong Sun.
|
||||||
|
|
||||||
@@ -33,13 +30,11 @@ language model, which could facilitate several downstream Chinese NLP tasks, suc
|
|||||||
cloze test, and language understanding. Extensive experiments demonstrate that CPM achieves strong performance on many
|
cloze test, and language understanding. Extensive experiments demonstrate that CPM achieves strong performance on many
|
||||||
NLP tasks in the settings of few-shot (even zero-shot) learning.*
|
NLP tasks in the settings of few-shot (even zero-shot) learning.*
|
||||||
|
|
||||||
This model was contributed by `canwenxu <https://huggingface.co/canwenxu>`__. The original implementation can be found
|
This model was contributed by [canwenxu](https://huggingface.co/canwenxu). The original implementation can be found
|
||||||
here: https://github.com/TsinghuaAI/CPM-Generate
|
here: https://github.com/TsinghuaAI/CPM-Generate
|
||||||
|
|
||||||
Note: We only have a tokenizer here, since the model architecture is the same as GPT-2.
|
Note: We only have a tokenizer here, since the model architecture is the same as GPT-2.
|
||||||
|
|
||||||
CpmTokenizer
|
## CpmTokenizer
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.CpmTokenizer
|
[[autodoc]] CpmTokenizer
|
||||||
:members:
|
|
||||||
87
docs/source/model_doc/ctrl.mdx
Normal file
@@ -0,0 +1,87 @@
<!--Copyright 2020 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
|
||||||
|
|
||||||
|
# CTRL
|
||||||
|
|
||||||
|
## Overview
|
||||||
|
|
||||||
|
The CTRL model was proposed in [CTRL: A Conditional Transformer Language Model for Controllable Generation](https://arxiv.org/abs/1909.05858) by Nitish Shirish Keskar*, Bryan McCann*, Lav R. Varshney, Caiming Xiong and
|
||||||
|
Richard Socher. It's a causal (unidirectional) transformer pre-trained using language modeling on a very large corpus
|
||||||
|
of ~140 GB of text data with the first token reserved as a control code (such as Links, Books, Wikipedia etc.).
|
||||||
|
|
||||||
|
The abstract from the paper is the following:
|
||||||
|
|
||||||
|
*Large-scale language models show promising text generation capabilities, but users cannot easily control particular
|
||||||
|
aspects of the generated text. We release CTRL, a 1.63 billion-parameter conditional transformer language model,
|
||||||
|
trained to condition on control codes that govern style, content, and task-specific behavior. Control codes were
|
||||||
|
derived from structure that naturally co-occurs with raw text, preserving the advantages of unsupervised learning while
|
||||||
|
providing more explicit control over text generation. These codes also allow CTRL to predict which parts of the
|
||||||
|
training data are most likely given a sequence. This provides a potential method for analyzing large amounts of data
|
||||||
|
via model-based source attribution.*
|
||||||
|
|
||||||
|
Tips:
|
||||||
|
|
||||||
|
- CTRL makes use of control codes to generate text: it requires generations to be started by certain words, sentences
|
||||||
|
or links to generate coherent text. Refer to the [original implementation](https://github.com/salesforce/ctrl) for
|
||||||
|
more information.
|
||||||
|
- CTRL is a model with absolute position embeddings so it's usually advised to pad the inputs on the right rather than
|
||||||
|
the left.
|
||||||
|
- CTRL was trained with a causal language modeling (CLM) objective and is therefore powerful at predicting the next
|
||||||
|
token in a sequence. Leveraging this feature allows CTRL to generate syntactically coherent text, as can be
|
||||||
|
observed in the *run_generation.py* example script.
|
||||||
|
- The PyTorch models can take the *past* as input, which holds the previously computed key/value attention pairs. Using
|
||||||
|
this *past* value prevents the model from re-computing pre-computed values in the context of text generation. See
|
||||||
|
[reusing the past in generative models](../quickstart#using-the-past) for more information on the usage of
|
||||||
|
this argument.
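
The last tip can be illustrated with a small, library-free sketch. It is a toy single-head attention loop, not the CTRL implementation: `project`, `attend`, and `step_outputs` are hypothetical names, and the "projections" are fixed arithmetic stand-ins for learned weights. The point is only that keeping the previously computed keys and values gives the same outputs while projecting each token once instead of once per step.

```python
import math


def project(x, scale):
    """Toy stand-in for a learned linear projection (hypothetical)."""
    return [scale * x[0] + x[1], x[0] - scale * x[1]]


def attend(query, keys, values):
    """Softmax attention of one query over all given keys and values."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]


def step_outputs(tokens, use_past):
    """Run causal attention step by step, with or without a key/value cache.

    Returns the per-step outputs and the number of key/value projections made.
    """
    past_keys, past_values = [], []
    projections = 0
    outputs = []
    for t, x in enumerate(tokens):
        if use_past:
            new_tokens = [x]  # only the newest token needs projecting
        else:
            past_keys, past_values = [], []  # throw the cache away
            new_tokens = tokens[: t + 1]  # re-project the whole prefix
        for tok in new_tokens:
            past_keys.append(project(tok, 2.0))
            past_values.append(project(tok, 3.0))
            projections += 1
        outputs.append(attend(project(x, 1.0), past_keys, past_values))
    return outputs, projections


tokens = [[1.0, 0.0], [0.5, -1.0], [2.0, 1.0], [0.0, 0.3]]
slow_outputs, slow_count = step_outputs(tokens, use_past=False)
fast_outputs, fast_count = step_outputs(tokens, use_past=True)
print(slow_count, fast_count)
```

For a sequence of length `n`, the uncached loop performs `n * (n + 1) / 2` key/value projections while the cached loop performs `n`, which is where the saving during generation comes from.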
|
||||||
|
|
||||||
|
This model was contributed by [keskarnitishr](https://huggingface.co/keskarnitishr). The original code can be found
|
||||||
|
[here](https://github.com/salesforce/ctrl).
|
||||||
|
|
||||||
|
|
||||||
|
## CTRLConfig
|
||||||
|
|
||||||
|
[[autodoc]] CTRLConfig
|
||||||
|
|
||||||
|
## CTRLTokenizer
|
||||||
|
|
||||||
|
[[autodoc]] CTRLTokenizer
|
||||||
|
- save_vocabulary
|
||||||
|
|
||||||
|
## CTRLModel
|
||||||
|
|
||||||
|
[[autodoc]] CTRLModel
|
||||||
|
- forward
|
||||||
|
|
||||||
|
## CTRLLMHeadModel
|
||||||
|
|
||||||
|
[[autodoc]] CTRLLMHeadModel
|
||||||
|
- forward
|
||||||
|
|
||||||
|
## CTRLForSequenceClassification
|
||||||
|
|
||||||
|
[[autodoc]] CTRLForSequenceClassification
|
||||||
|
- forward
|
||||||
|
|
||||||
|
## TFCTRLModel
|
||||||
|
|
||||||
|
[[autodoc]] TFCTRLModel
|
||||||
|
- call
|
||||||
|
|
||||||
|
## TFCTRLLMHeadModel
|
||||||
|
|
||||||
|
[[autodoc]] TFCTRLLMHeadModel
|
||||||
|
- call
|
||||||
|
|
||||||
|
## TFCTRLForSequenceClassification
|
||||||
|
|
||||||
|
[[autodoc]] TFCTRLForSequenceClassification
|
||||||
|
- call
|
||||||
@@ -1,105 +0,0 @@
|
|||||||
..
|
|
||||||
Copyright 2020 The HuggingFace Team. All rights reserved.
|
|
||||||
|
|
||||||
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
|
|
||||||
the License. You may obtain a copy of the License at
|
|
||||||
|
|
||||||
http://www.apache.org/licenses/LICENSE-2.0
|
|
||||||
|
|
||||||
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
|
|
||||||
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
|
|
||||||
specific language governing permissions and limitations under the License.
|
|
||||||
|
|
||||||
CTRL
|
|
||||||
-----------------------------------------------------------------------------------------------------------------------
|
|
||||||
|
|
||||||
Overview
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
CTRL model was proposed in `CTRL: A Conditional Transformer Language Model for Controllable Generation
|
|
||||||
<https://arxiv.org/abs/1909.05858>`_ by Nitish Shirish Keskar*, Bryan McCann*, Lav R. Varshney, Caiming Xiong and
|
|
||||||
Richard Socher. It's a causal (unidirectional) transformer pre-trained using language modeling on a very large corpus
|
|
||||||
of ~140 GB of text data with the first token reserved as a control code (such as Links, Books, Wikipedia etc.).
|
|
||||||
|
|
||||||
The abstract from the paper is the following:
|
|
||||||
|
|
||||||
*Large-scale language models show promising text generation capabilities, but users cannot easily control particular
|
|
||||||
aspects of the generated text. We release CTRL, a 1.63 billion-parameter conditional transformer language model,
|
|
||||||
trained to condition on control codes that govern style, content, and task-specific behavior. Control codes were
|
|
||||||
derived from structure that naturally co-occurs with raw text, preserving the advantages of unsupervised learning while
|
|
||||||
providing more explicit control over text generation. These codes also allow CTRL to predict which parts of the
|
|
||||||
training data are most likely given a sequence. This provides a potential method for analyzing large amounts of data
|
|
||||||
via model-based source attribution.*
|
|
||||||
|
|
||||||
Tips:
|
|
||||||
|
|
||||||
- CTRL makes use of control codes to generate text: it requires generations to be started by certain words, sentences
|
|
||||||
or links to generate coherent text. Refer to the `original implementation <https://github.com/salesforce/ctrl>`__ for
|
|
||||||
more information.
|
|
||||||
- CTRL is a model with absolute position embeddings so it's usually advised to pad the inputs on the right rather than
|
|
||||||
the left.
|
|
||||||
- CTRL was trained with a causal language modeling (CLM) objective and is therefore powerful at predicting the next
|
|
||||||
token in a sequence. Leveraging this feature allows CTRL to generate syntactically coherent text as it can be
|
|
||||||
observed in the `run_generation.py` example script.
|
|
||||||
- The PyTorch models can take the `past` as input, which is the previously computed key/value attention pairs. Using
|
|
||||||
this `past` value prevents the model from re-computing pre-computed values in the context of text generation. See
|
|
||||||
`reusing the past in generative models <../quickstart.html#using-the-past>`__ for more information on the usage of
|
|
||||||
this argument.
|
|
||||||
|
|
||||||
This model was contributed by `keskarnitishr <https://huggingface.co/keskarnitishr>`__. The original code can be found
|
|
||||||
`here <https://github.com/salesforce/ctrl>`__.
|
|
||||||
|
|
||||||
|
|
||||||
CTRLConfig
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.CTRLConfig
|
|
||||||
:members:
|
|
||||||
|
|
||||||
|
|
||||||
CTRLTokenizer
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.CTRLTokenizer
|
|
||||||
:members: save_vocabulary
|
|
||||||
|
|
||||||
|
|
||||||
CTRLModel
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.CTRLModel
|
|
||||||
:members: forward
|
|
||||||
|
|
||||||
|
|
||||||
CTRLLMHeadModel
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.CTRLLMHeadModel
|
|
||||||
:members: forward
|
|
||||||
|
|
||||||
|
|
||||||
CTRLForSequenceClassification
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.CTRLForSequenceClassification
|
|
||||||
:members: forward
|
|
||||||
|
|
||||||
|
|
||||||
TFCTRLModel
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.TFCTRLModel
|
|
||||||
:members: call
|
|
||||||
|
|
||||||
|
|
||||||
TFCTRLLMHeadModel
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.TFCTRLLMHeadModel
|
|
||||||
:members: call
|
|
||||||
|
|
||||||
TFCTRLForSequenceClassification
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.TFCTRLForSequenceClassification
|
|
||||||
:members: call
|
|
||||||
117
docs/source/model_doc/deberta.mdx
Normal file
@@ -0,0 +1,117 @@
|
|||||||
|
<!--Copyright 2020 The HuggingFace Team. All rights reserved.
|
||||||
|
|
||||||
|
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
|
||||||
|
the License. You may obtain a copy of the License at
|
||||||
|
|
||||||
|
http://www.apache.org/licenses/LICENSE-2.0
|
||||||
|
|
||||||
|
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
|
||||||
|
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
|
||||||
|
specific language governing permissions and limitations under the License.
|
||||||
|
-->
|
||||||
|
|
||||||
|
# DeBERTa
|
||||||
|
|
||||||
|
## Overview
|
||||||
|
|
||||||
|
The DeBERTa model was proposed in [DeBERTa: Decoding-enhanced BERT with Disentangled Attention](https://arxiv.org/abs/2006.03654) by Pengcheng He, Xiaodong Liu, Jianfeng Gao, Weizhu Chen. It is based on Google's
|
||||||
|
BERT model released in 2018 and Facebook's RoBERTa model released in 2019.
|
||||||
|
|
||||||
|
It builds on RoBERTa with disentangled attention and enhanced mask decoder training with half of the data used in
|
||||||
|
RoBERTa.
|
||||||
|
|
||||||
|
The abstract from the paper is the following:
|
||||||
|
|
||||||
|
*Recent progress in pre-trained neural language models has significantly improved the performance of many natural
|
||||||
|
language processing (NLP) tasks. In this paper we propose a new model architecture DeBERTa (Decoding-enhanced BERT with
|
||||||
|
disentangled attention) that improves the BERT and RoBERTa models using two novel techniques. The first is the
|
||||||
|
disentangled attention mechanism, where each word is represented using two vectors that encode its content and
|
||||||
|
position, respectively, and the attention weights among words are computed using disentangled matrices on their
|
||||||
|
contents and relative positions. Second, an enhanced mask decoder is used to replace the output softmax layer to
|
||||||
|
predict the masked tokens for model pretraining. We show that these two techniques significantly improve the efficiency
|
||||||
|
of model pretraining and performance of downstream tasks. Compared to RoBERTa-Large, a DeBERTa model trained on half of
|
||||||
|
the training data performs consistently better on a wide range of NLP tasks, achieving improvements on MNLI by +0.9%
|
||||||
|
(90.2% vs. 91.1%), on SQuAD v2.0 by +2.3% (88.4% vs. 90.7%) and RACE by +3.6% (83.2% vs. 86.8%). The DeBERTa code and
|
||||||
|
pre-trained models will be made publicly available at https://github.com/microsoft/DeBERTa.*
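
The disentangled attention described in the abstract can be sketched in a few lines. This is a simplified illustration, not the library's code: it assumes the learned query/key projections are the identity, and the names `relative_index`, `dot`, and `disentangled_score` are hypothetical. Each score is the sum of a content-to-content, a content-to-position, and a position-to-content term, scaled by `sqrt(3 * d)`.

```python
import math


def relative_index(i, j, k):
    """Clip the relative distance i - j into the index range [0, 2k)."""
    d = i - j
    if d <= -k:
        return 0
    if d >= k:
        return 2 * k - 1
    return d + k


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def disentangled_score(content, rel_emb, i, j, k):
    """Unnormalized attention score from token i to token j.

    Assumption: projections are omitted, so content vectors and relative
    position embeddings are used directly as queries and keys.
    """
    d = len(content[i])
    c2c = dot(content[i], content[j])  # content-to-content
    c2p = dot(content[i], rel_emb[relative_index(i, j, k)])  # content-to-position
    p2c = dot(content[j], rel_emb[relative_index(j, i, k)])  # position-to-content
    return (c2c + c2p + p2c) / math.sqrt(3 * d)


content = [[1.0, 2.0], [3.0, 4.0]]
rel_emb = [[0.0, 0.0], [1.0, 0.0], [0.0, 0.0], [0.0, 1.0]]  # 2k rows for k = 2
score = disentangled_score(content, rel_emb, 0, 1, 2)
print(score)
```

The relative position embeddings are indexed by clipped distance rather than by absolute position, so the same `rel_emb` table is shared across all token pairs.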
|
||||||
|
|
||||||
|
|
||||||
|
This model was contributed by [DeBERTa](https://huggingface.co/DeBERTa). The TF 2.0 implementation of this model was
|
||||||
|
contributed by [kamalkraj](https://huggingface.co/kamalkraj). The original code can be found [here](https://github.com/microsoft/DeBERTa).
|
||||||
|
|
||||||
|
|
||||||
|
## DebertaConfig
|
||||||
|
|
||||||
|
[[autodoc]] DebertaConfig
|
||||||
|
|
||||||
|
## DebertaTokenizer
|
||||||
|
|
||||||
|
[[autodoc]] DebertaTokenizer
|
||||||
|
- build_inputs_with_special_tokens
|
||||||
|
- get_special_tokens_mask
|
||||||
|
- create_token_type_ids_from_sequences
|
||||||
|
- save_vocabulary
|
||||||
|
|
||||||
|
## DebertaTokenizerFast
|
||||||
|
|
||||||
|
[[autodoc]] DebertaTokenizerFast
|
||||||
|
- build_inputs_with_special_tokens
|
||||||
|
- create_token_type_ids_from_sequences
|
||||||
|
|
||||||
|
## DebertaModel
|
||||||
|
|
||||||
|
[[autodoc]] DebertaModel
|
||||||
|
- forward
|
||||||
|
|
||||||
|
## DebertaPreTrainedModel
|
||||||
|
|
||||||
|
[[autodoc]] DebertaPreTrainedModel
|
||||||
|
|
||||||
|
## DebertaForMaskedLM
|
||||||
|
|
||||||
|
[[autodoc]] DebertaForMaskedLM
|
||||||
|
- forward
|
||||||
|
|
||||||
|
## DebertaForSequenceClassification
|
||||||
|
|
||||||
|
[[autodoc]] DebertaForSequenceClassification
|
||||||
|
- forward
|
||||||
|
|
||||||
|
## DebertaForTokenClassification
|
||||||
|
|
||||||
|
[[autodoc]] DebertaForTokenClassification
|
||||||
|
- forward
|
||||||
|
|
||||||
|
## DebertaForQuestionAnswering
|
||||||
|
|
||||||
|
[[autodoc]] DebertaForQuestionAnswering
|
||||||
|
- forward
|
||||||
|
|
||||||
|
## TFDebertaModel
|
||||||
|
|
||||||
|
[[autodoc]] TFDebertaModel
|
||||||
|
- call
|
||||||
|
|
||||||
|
## TFDebertaPreTrainedModel
|
||||||
|
|
||||||
|
[[autodoc]] TFDebertaPreTrainedModel
|
||||||
|
- call
|
||||||
|
|
||||||
|
## TFDebertaForMaskedLM
|
||||||
|
|
||||||
|
[[autodoc]] TFDebertaForMaskedLM
|
||||||
|
- call
|
||||||
|
|
||||||
|
## TFDebertaForSequenceClassification
|
||||||
|
|
||||||
|
[[autodoc]] TFDebertaForSequenceClassification
|
||||||
|
- call
|
||||||
|
|
||||||
|
## TFDebertaForTokenClassification
|
||||||
|
|
||||||
|
[[autodoc]] TFDebertaForTokenClassification
|
||||||
|
- call
|
||||||
|
|
||||||
|
## TFDebertaForQuestionAnswering
|
||||||
|
|
||||||
|
[[autodoc]] TFDebertaForQuestionAnswering
|
||||||
|
- call
|
||||||
@@ -1,148 +0,0 @@
|
|||||||
..
|
|
||||||
Copyright 2020 The HuggingFace Team. All rights reserved.
|
|
||||||
|
|
||||||
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
|
|
||||||
the License. You may obtain a copy of the License at
|
|
||||||
|
|
||||||
http://www.apache.org/licenses/LICENSE-2.0
|
|
||||||
|
|
||||||
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
|
|
||||||
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
|
|
||||||
specific language governing permissions and limitations under the License.
|
|
||||||
|
|
||||||
DeBERTa
|
|
||||||
-----------------------------------------------------------------------------------------------------------------------
|
|
||||||
|
|
||||||
Overview
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
The DeBERTa model was proposed in `DeBERTa: Decoding-enhanced BERT with Disentangled Attention
|
|
||||||
<https://arxiv.org/abs/2006.03654>`__ by Pengcheng He, Xiaodong Liu, Jianfeng Gao, Weizhu Chen It is based on Google's
|
|
||||||
BERT model released in 2018 and Facebook's RoBERTa model released in 2019.
|
|
||||||
|
|
||||||
It builds on RoBERTa with disentangled attention and enhanced mask decoder training with half of the data used in
|
|
||||||
RoBERTa.
|
|
||||||
|
|
||||||
The abstract from the paper is the following:
|
|
||||||
|
|
||||||
*Recent progress in pre-trained neural language models has significantly improved the performance of many natural
|
|
||||||
language processing (NLP) tasks. In this paper we propose a new model architecture DeBERTa (Decoding-enhanced BERT with
|
|
||||||
disentangled attention) that improves the BERT and RoBERTa models using two novel techniques. The first is the
|
|
||||||
disentangled attention mechanism, where each word is represented using two vectors that encode its content and
|
|
||||||
position, respectively, and the attention weights among words are computed using disentangled matrices on their
|
|
||||||
contents and relative positions. Second, an enhanced mask decoder is used to replace the output softmax layer to
|
|
||||||
predict the masked tokens for model pretraining. We show that these two techniques significantly improve the efficiency
|
|
||||||
of model pretraining and performance of downstream tasks. Compared to RoBERTa-Large, a DeBERTa model trained on half of
|
|
||||||
the training data performs consistently better on a wide range of NLP tasks, achieving improvements on MNLI by +0.9%
|
|
||||||
(90.2% vs. 91.1%), on SQuAD v2.0 by +2.3% (88.4% vs. 90.7%) and RACE by +3.6% (83.2% vs. 86.8%). The DeBERTa code and
|
|
||||||
pre-trained models will be made publicly available at https://github.com/microsoft/DeBERTa.*
|
|
||||||
|
|
||||||
|
|
||||||
This model was contributed by `DeBERTa <https://huggingface.co/DeBERTa>`__. This model TF 2.0 implementation was
|
|
||||||
contributed by `kamalkraj <https://huggingface.co/kamalkraj>`__ . The original code can be found `here
|
|
||||||
<https://github.com/microsoft/DeBERTa>`__.
|
|
||||||
|
|
||||||
|
|
||||||
DebertaConfig
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.DebertaConfig
|
|
||||||
:members:
|
|
||||||
|
|
||||||
|
|
||||||
DebertaTokenizer
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.DebertaTokenizer
|
|
||||||
:members: build_inputs_with_special_tokens, get_special_tokens_mask,
|
|
||||||
create_token_type_ids_from_sequences, save_vocabulary
|
|
||||||
|
|
||||||
DebertaTokenizerFast
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.DebertaTokenizerFast
|
|
||||||
:members: build_inputs_with_special_tokens, create_token_type_ids_from_sequences
|
|
||||||
|
|
||||||
|
|
||||||
DebertaModel
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.DebertaModel
|
|
||||||
:members: forward
|
|
||||||
|
|
||||||
|
|
||||||
DebertaPreTrainedModel
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.DebertaPreTrainedModel
|
|
||||||
:members:
|
|
||||||
|
|
||||||
|
|
||||||
DebertaForMaskedLM
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.DebertaForMaskedLM
|
|
||||||
:members: forward
|
|
||||||
|
|
||||||
|
|
||||||
DebertaForSequenceClassification
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.DebertaForSequenceClassification
|
|
||||||
:members: forward
|
|
||||||
|
|
||||||
|
|
||||||
DebertaForTokenClassification
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.DebertaForTokenClassification
|
|
||||||
:members: forward
|
|
||||||
|
|
||||||
|
|
||||||
DebertaForQuestionAnswering
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.DebertaForQuestionAnswering
|
|
||||||
:members: forward
|
|
||||||
|
|
||||||
|
|
||||||
TFDebertaModel
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.TFDebertaModel
|
|
||||||
:members: call
|
|
||||||
|
|
||||||
|
|
||||||
TFDebertaPreTrainedModel
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.TFDebertaPreTrainedModel
|
|
||||||
:members: call
|
|
||||||
|
|
||||||
|
|
||||||
TFDebertaForMaskedLM
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.TFDebertaForMaskedLM
|
|
||||||
:members: call
|
|
||||||
|
|
||||||
|
|
||||||
TFDebertaForSequenceClassification
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.TFDebertaForSequenceClassification
|
|
||||||
:members: call
|
|
||||||
|
|
||||||
|
|
||||||
TFDebertaForTokenClassification
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.TFDebertaForTokenClassification
|
|
||||||
:members: call
|
|
||||||
|
|
||||||
|
|
||||||
TFDebertaForQuestionAnswering
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.TFDebertaForQuestionAnswering
|
|
||||||
:members: call
|
|
||||||
132
docs/source/model_doc/deberta_v2.mdx
Normal file
@@ -0,0 +1,132 @@
|
|||||||
|
<!--Copyright 2020 The HuggingFace Team. All rights reserved.
|
||||||
|
|
||||||
|
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
|
||||||
|
the License. You may obtain a copy of the License at
|
||||||
|
|
||||||
|
http://www.apache.org/licenses/LICENSE-2.0
|
||||||
|
|
||||||
|
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
|
||||||
|
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
|
||||||
|
specific language governing permissions and limitations under the License.
|
||||||
|
-->
|
||||||
|
|
||||||
|
# DeBERTa-v2
|
||||||
|
|
||||||
|
## Overview
|
||||||
|
|
||||||
|
The DeBERTa model was proposed in [DeBERTa: Decoding-enhanced BERT with Disentangled Attention](https://arxiv.org/abs/2006.03654) by Pengcheng He, Xiaodong Liu, Jianfeng Gao, Weizhu Chen. It is based on Google's
|
||||||
|
BERT model released in 2018 and Facebook's RoBERTa model released in 2019.
|
||||||
|
|
||||||
|
It builds on RoBERTa with disentangled attention and enhanced mask decoder training with half of the data used in
|
||||||
|
RoBERTa.
|
||||||
|
|
||||||
|
The abstract from the paper is the following:
|
||||||
|
|
||||||
|
*Recent progress in pre-trained neural language models has significantly improved the performance of many natural
|
||||||
|
language processing (NLP) tasks. In this paper we propose a new model architecture DeBERTa (Decoding-enhanced BERT with
|
||||||
|
disentangled attention) that improves the BERT and RoBERTa models using two novel techniques. The first is the
|
||||||
|
disentangled attention mechanism, where each word is represented using two vectors that encode its content and
|
||||||
|
position, respectively, and the attention weights among words are computed using disentangled matrices on their
|
||||||
|
contents and relative positions. Second, an enhanced mask decoder is used to replace the output softmax layer to
|
||||||
|
predict the masked tokens for model pretraining. We show that these two techniques significantly improve the efficiency
|
||||||
|
of model pretraining and performance of downstream tasks. Compared to RoBERTa-Large, a DeBERTa model trained on half of
|
||||||
|
the training data performs consistently better on a wide range of NLP tasks, achieving improvements on MNLI by +0.9%
|
||||||
|
(90.2% vs. 91.1%), on SQuAD v2.0 by +2.3% (88.4% vs. 90.7%) and RACE by +3.6% (83.2% vs. 86.8%). The DeBERTa code and
|
||||||
|
pre-trained models will be made publicly available at https://github.com/microsoft/DeBERTa.*
|
||||||
|
|
||||||
|
|
||||||
|
The following information is visible directly on the [original implementation
|
||||||
|
repository](https://github.com/microsoft/DeBERTa). DeBERTa v2 is the second version of the DeBERTa model. It includes
|
||||||
|
the 1.5B model used for the SuperGLUE single-model submission, which achieved a score of 89.9 versus the human baseline of 89.8. You can
|
||||||
|
find more details about this submission in the authors'
|
||||||
|
[blog](https://www.microsoft.com/en-us/research/blog/microsoft-deberta-surpasses-human-performance-on-the-superglue-benchmark/).
|
||||||
|
|
||||||
|
New in v2:
|
||||||
|
|
||||||
|
- **Vocabulary** In v2 the tokenizer is changed to use a new vocabulary of size 128K built from the training data.
|
||||||
|
Instead of a GPT2-based tokenizer, the tokenizer is now a
|
||||||
|
[sentencepiece-based](https://github.com/google/sentencepiece) tokenizer.
|
||||||
|
- **nGiE (nGram Induced Input Encoding)** The DeBERTa-v2 model uses an additional convolution layer alongside the first
|
||||||
|
transformer layer to better learn the local dependency of input tokens.
|
||||||
|
- **Sharing position projection matrix with content projection matrix in attention layer** Based on previous
|
||||||
|
experiments, this can save parameters without affecting the performance.
|
||||||
|
- **Apply bucket to encode relative positions** The DeBERTa-v2 model uses log buckets to encode relative positions,
|
||||||
|
similar to T5.
|
||||||
|
- **900M model & 1.5B model** Two additional model sizes are available: 900M and 1.5B, which significantly improve the
|
||||||
|
performance of downstream tasks.
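
The log-bucket idea from the list above can be sketched as follows. This is a scalar illustration of one common log-bucketing formula, not necessarily the exact code used by the model: the function name `log_bucket` and the defaults `bucket_size=256` and `max_position=512` are assumptions chosen for the example. Small distances keep their exact value, while larger ones are compressed logarithmically into the remaining buckets.

```python
import math


def log_bucket(relative_pos, bucket_size=256, max_position=512):
    """Map a signed relative distance to a signed bucket index.

    Distances up to bucket_size // 2 are kept exact; larger distances are
    placed on a logarithmic scale (hypothetical defaults, see lead-in).
    """
    mid = bucket_size // 2
    abs_pos = abs(relative_pos)
    if abs_pos <= mid:
        return relative_pos
    sign = 1 if relative_pos > 0 else -1
    ratio = math.log(abs_pos / mid) / math.log((max_position - 1) / mid)
    return sign * (math.ceil(ratio * (mid - 1)) + mid)


for distance in (0, 5, -5, 128, 200, 511, -511):
    print(distance, log_bucket(distance))
```

With these defaults, nearby tokens are distinguished exactly while distant tokens share buckets, which keeps the relative position table small for long sequences.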
|
||||||
|
|
||||||
|
This model was contributed by [DeBERTa](https://huggingface.co/DeBERTa). The TF 2.0 implementation of this model was
|
||||||
|
contributed by [kamalkraj](https://huggingface.co/kamalkraj). The original code can be found [here](https://github.com/microsoft/DeBERTa).
|
||||||
|
|
||||||
|
|
||||||
|
## DebertaV2Config
|
||||||
|
|
||||||
|
[[autodoc]] DebertaV2Config
|
||||||
|
|
||||||
|
## DebertaV2Tokenizer
|
||||||
|
|
||||||
|
[[autodoc]] DebertaV2Tokenizer
|
||||||
|
- build_inputs_with_special_tokens
|
||||||
|
- get_special_tokens_mask
|
||||||
|
- create_token_type_ids_from_sequences
|
||||||
|
- save_vocabulary
|
||||||
|
|
||||||
|
## DebertaV2Model
|
||||||
|
|
||||||
|
[[autodoc]] DebertaV2Model
|
||||||
|
- forward
|
||||||
|
|
||||||
|
## DebertaV2PreTrainedModel
|
||||||
|
|
||||||
|
[[autodoc]] DebertaV2PreTrainedModel
|
||||||
|
- forward
|
||||||
|
|
||||||
|
## DebertaV2ForMaskedLM
|
||||||
|
|
||||||
|
[[autodoc]] DebertaV2ForMaskedLM
|
||||||
|
- forward
|
||||||
|
|
||||||
|
## DebertaV2ForSequenceClassification
|
||||||
|
|
||||||
|
[[autodoc]] DebertaV2ForSequenceClassification
|
||||||
|
- forward
|
||||||
|
|
||||||
|
## DebertaV2ForTokenClassification
|
||||||
|
|
||||||
|
[[autodoc]] DebertaV2ForTokenClassification
|
||||||
|
- forward
|
||||||
|
|
||||||
|
## DebertaV2ForQuestionAnswering
|
||||||
|
|
||||||
|
[[autodoc]] DebertaV2ForQuestionAnswering
|
||||||
|
- forward
|
||||||
|
|
||||||
|
## TFDebertaV2Model
|
||||||
|
|
||||||
|
[[autodoc]] TFDebertaV2Model
|
||||||
|
- call
|
||||||
|
|
||||||
|
## TFDebertaV2PreTrainedModel
|
||||||
|
|
||||||
|
[[autodoc]] TFDebertaV2PreTrainedModel
|
||||||
|
- call
|
||||||
|
|
||||||
|
## TFDebertaV2ForMaskedLM
|
||||||
|
|
||||||
|
[[autodoc]] TFDebertaV2ForMaskedLM
|
||||||
|
- call
|
||||||
|
|
||||||
|
## TFDebertaV2ForSequenceClassification
|
||||||
|
|
||||||
|
[[autodoc]] TFDebertaV2ForSequenceClassification
|
||||||
|
- call
|
||||||
|
|
||||||
|
## TFDebertaV2ForTokenClassification
|
||||||
|
|
||||||
|
[[autodoc]] TFDebertaV2ForTokenClassification
|
||||||
|
- call
|
||||||
|
|
||||||
|
## TFDebertaV2ForQuestionAnswering
|
||||||
|
|
||||||
|
[[autodoc]] TFDebertaV2ForQuestionAnswering
|
||||||
|
- call
|
||||||
@@ -1,162 +0,0 @@
|
|||||||
..
|
|
||||||
Copyright 2020 The HuggingFace Team. All rights reserved.
|
|
||||||
|
|
||||||
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
|
|
||||||
the License. You may obtain a copy of the License at
|
|
||||||
|
|
||||||
http://www.apache.org/licenses/LICENSE-2.0
|
|
||||||
|
|
||||||
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
|
|
||||||
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
|
|
||||||
specific language governing permissions and limitations under the License.
|
|
||||||
|
|
||||||
DeBERTa-v2
|
|
||||||
-----------------------------------------------------------------------------------------------------------------------
|
|
||||||
|
|
||||||
Overview
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
The DeBERTa model was proposed in `DeBERTa: Decoding-enhanced BERT with Disentangled Attention
|
|
||||||
<https://arxiv.org/abs/2006.03654>`__ by Pengcheng He, Xiaodong Liu, Jianfeng Gao, Weizhu Chen It is based on Google's
|
|
||||||
BERT model released in 2018 and Facebook's RoBERTa model released in 2019.
|
|
||||||
|
|
||||||
It builds on RoBERTa with disentangled attention and enhanced mask decoder training with half of the data used in
|
|
||||||
RoBERTa.
|
|
||||||
|
|
||||||
The abstract from the paper is the following:
|
|
||||||
|
|
||||||
*Recent progress in pre-trained neural language models has significantly improved the performance of many natural
|
|
||||||
language processing (NLP) tasks. In this paper we propose a new model architecture DeBERTa (Decoding-enhanced BERT with
|
|
||||||
disentangled attention) that improves the BERT and RoBERTa models using two novel techniques. The first is the
|
|
||||||
disentangled attention mechanism, where each word is represented using two vectors that encode its content and
|
|
||||||
position, respectively, and the attention weights among words are computed using disentangled matrices on their
|
|
||||||
contents and relative positions. Second, an enhanced mask decoder is used to replace the output softmax layer to
|
|
||||||
predict the masked tokens for model pretraining. We show that these two techniques significantly improve the efficiency
|
|
||||||
of model pretraining and performance of downstream tasks. Compared to RoBERTa-Large, a DeBERTa model trained on half of
|
|
||||||
the training data performs consistently better on a wide range of NLP tasks, achieving improvements on MNLI by +0.9%
|
|
||||||
(90.2% vs. 91.1%), on SQuAD v2.0 by +2.3% (88.4% vs. 90.7%) and RACE by +3.6% (83.2% vs. 86.8%). The DeBERTa code and
|
|
||||||
pre-trained models will be made publicly available at https://github.com/microsoft/DeBERTa.*
|
|
||||||
|
|
||||||
|
|
||||||
The following information is visible directly on the [original implementation
|
|
||||||
repository](https://github.com/microsoft/DeBERTa). DeBERTa v2 is the second version of the DeBERTa model. It includes
|
|
||||||
the 1.5B model used for the SuperGLUE single-model submission, which achieved 89.9 versus the human baseline of 89.8. You can
|
|
||||||
find more details about this submission in the authors'
|
|
||||||
[blog](https://www.microsoft.com/en-us/research/blog/microsoft-deberta-surpasses-human-performance-on-the-superglue-benchmark/).
|
|
||||||
|
|
||||||
New in v2:
|
|
||||||
|
|
||||||
- **Vocabulary** In v2 the tokenizer is changed to use a new vocabulary of size 128K built from the training data.
|
|
||||||
Instead of a GPT2-based tokenizer, the tokenizer is now a
|
|
||||||
[sentencepiece-based](https://github.com/google/sentencepiece) tokenizer.
|
|
||||||
- **nGiE (nGram Induced Input Encoding)** The DeBERTa-v2 model uses an additional convolution layer alongside the first
|
|
||||||
transformer layer to better learn the local dependency of input tokens.
|
|
||||||
- **Sharing position projection matrix with content projection matrix in attention layer** Based on previous
|
|
||||||
experiments, this can save parameters without affecting the performance.
|
|
||||||
- **Apply bucket to encode relative positions** The DeBERTa-v2 model uses log buckets to encode relative positions,
|
|
||||||
similar to T5.
|
|
||||||
- **900M model & 1.5B model** Two additional model sizes are available: 900M and 1.5B, which significantly improve the
|
|
||||||
performance of downstream tasks.
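
The log-bucket idea mentioned in the list above can be sketched in a few lines. This is only an illustration of the general technique (exact buckets for short distances, logarithmically spaced buckets for long ones); the function name `log_bucket` and the default values of `bucket_size` and `max_position` are assumptions made for this example and are not taken from the DeBERTa code base.

```python
import math


def log_bucket(relative_pos: int, bucket_size: int = 8, max_position: int = 64) -> int:
    """Map a signed relative distance to a signed bucket index (illustrative only)."""
    mid = bucket_size // 2
    # Short distances keep their exact value, so nearby tokens stay distinguishable.
    if abs(relative_pos) <= mid:
        return relative_pos
    # Longer distances are clamped, then compressed on a logarithmic scale.
    distance = min(abs(relative_pos), max_position - 1)
    scaled = math.log(distance / mid) / math.log((max_position - 1) / mid)
    bucket = mid + math.ceil(scaled * (mid - 1))
    # Restore the sign so that left and right context use different buckets.
    return bucket if relative_pos > 0 else -bucket
```

With these hypothetical defaults, distances up to 4 map to themselves, while every distance of 63 or more shares the outermost bucket, so far fewer position embeddings are needed than with one embedding per distance.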
|
|
||||||
|
|
||||||
This model was contributed by `DeBERTa <https://huggingface.co/DeBERTa>`__. The TF 2.0 implementation of this model was
|
|
||||||
contributed by `kamalkraj <https://huggingface.co/kamalkraj>`__. The original code can be found `here
|
|
||||||
<https://github.com/microsoft/DeBERTa>`__.
|
|
||||||
|
|
||||||
|
|
||||||
DebertaV2Config
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.DebertaV2Config
|
|
||||||
:members:
|
|
||||||
|
|
||||||
|
|
||||||
DebertaV2Tokenizer
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.DebertaV2Tokenizer
|
|
||||||
:members: build_inputs_with_special_tokens, get_special_tokens_mask,
|
|
||||||
create_token_type_ids_from_sequences, save_vocabulary
|
|
||||||
|
|
||||||
|
|
||||||
DebertaV2Model
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.DebertaV2Model
|
|
||||||
:members: forward
|
|
||||||
|
|
||||||
|
|
||||||
DebertaV2PreTrainedModel
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.DebertaV2PreTrainedModel
|
|
||||||
:members: forward
|
|
||||||
|
|
||||||
|
|
||||||
DebertaV2ForMaskedLM
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.DebertaV2ForMaskedLM
|
|
||||||
:members: forward
|
|
||||||
|
|
||||||
|
|
||||||
DebertaV2ForSequenceClassification
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.DebertaV2ForSequenceClassification
|
|
||||||
:members: forward
|
|
||||||
|
|
||||||
|
|
||||||
DebertaV2ForTokenClassification
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.DebertaV2ForTokenClassification
|
|
||||||
:members: forward
|
|
||||||
|
|
||||||
|
|
||||||
DebertaV2ForQuestionAnswering
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.DebertaV2ForQuestionAnswering
|
|
||||||
:members: forward
|
|
||||||
|
|
||||||
|
|
||||||
TFDebertaV2Model
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.TFDebertaV2Model
|
|
||||||
:members: call
|
|
||||||
|
|
||||||
|
|
||||||
TFDebertaV2PreTrainedModel
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.TFDebertaV2PreTrainedModel
|
|
||||||
:members: call
|
|
||||||
|
|
||||||
|
|
||||||
TFDebertaV2ForMaskedLM
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.TFDebertaV2ForMaskedLM
|
|
||||||
:members: call
|
|
||||||
|
|
||||||
|
|
||||||
TFDebertaV2ForSequenceClassification
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.TFDebertaV2ForSequenceClassification
|
|
||||||
:members: call
|
|
||||||
|
|
||||||
|
|
||||||
TFDebertaV2ForTokenClassification
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.TFDebertaV2ForTokenClassification
|
|
||||||
:members: call
|
|
||||||
|
|
||||||
|
|
||||||
TFDebertaV2ForQuestionAnswering
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.TFDebertaV2ForQuestionAnswering
|
|
||||||
:members: call
|
|
||||||
@@ -1,5 +1,4 @@
|
|||||||
..
|
<!--Copyright 2021 The HuggingFace Team. All rights reserved.
|
||||||
Copyright 2021 The HuggingFace Team. All rights reserved.
|
|
||||||
|
|
||||||
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
|
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
|
||||||
the License. You may obtain a copy of the License at
|
the License. You may obtain a copy of the License at
|
||||||
@@ -9,24 +8,21 @@
|
|||||||
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
|
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
|
||||||
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
|
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
|
||||||
specific language governing permissions and limitations under the License.
|
specific language governing permissions and limitations under the License.
|
||||||
|
-->
|
||||||
|
|
||||||
DeiT
|
# DeiT
|
||||||
-----------------------------------------------------------------------------------------------------------------------
|
|
||||||
|
|
||||||
.. note::
|
<Tip>
|
||||||
|
|
||||||
This is a recently introduced model so the API hasn't been tested extensively. There may be some bugs, and fixing
|
This is a recently introduced model so the API hasn't been tested extensively. There may be some bugs, and fixing
|
||||||
them may require slight breaking changes in the future. If you see something strange, file a `GitHub Issue
|
them may require slight breaking changes in the future. If you see something strange, file a [GitHub Issue](https://github.com/huggingface/transformers/issues/new?assignees=&labels=&template=bug-report.md&title).
|
||||||
<https://github.com/huggingface/transformers/issues/new?assignees=&labels=&template=bug-report.md&title>`__.
|
|
||||||
|
|
||||||
|
</Tip>
|
||||||
|
|
||||||
Overview
|
## Overview
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
The DeiT model was proposed in `Training data-efficient image transformers & distillation through attention
|
The DeiT model was proposed in [Training data-efficient image transformers & distillation through attention](https://arxiv.org/abs/2012.12877) by Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre
|
||||||
<https://arxiv.org/abs/2012.12877>`__ by Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre
|
Sablayrolles, Hervé Jégou. The [Vision Transformer (ViT)](vit) introduced in [Dosovitskiy et al., 2020](https://arxiv.org/abs/2010.11929) has shown that one can match or even outperform existing convolutional neural
|
||||||
Sablayrolles, Hervé Jégou. The `Vision Transformer (ViT) <vit>`__ introduced in `Dosovitskiy et al., 2020
|
|
||||||
<https://arxiv.org/abs/2010.11929>`__ has shown that one can match or even outperform existing convolutional neural
|
|
||||||
networks using a Transformer encoder (BERT-like). However, the ViT models introduced in that paper required training on
|
networks using a Transformer encoder (BERT-like). However, the ViT models introduced in that paper required training on
|
||||||
expensive infrastructure for multiple weeks, using external data. DeiT (data-efficient image transformers) are more
|
expensive infrastructure for multiple weeks, using external data. DeiT (data-efficient image transformers) are more
|
||||||
efficiently trained transformers for image classification, requiring far less data and far less computing resources
|
efficiently trained transformers for image classification, requiring far less data and far less computing resources
|
||||||
@@ -58,54 +54,44 @@ Tips:
|
|||||||
distillation head and the label predicted by the teacher). At inference time, one takes the average prediction
|
distillation head and the label predicted by the teacher). At inference time, one takes the average prediction
|
||||||
between both heads as final prediction. (2) is also called "fine-tuning with distillation", because one relies on a
|
between both heads as final prediction. (2) is also called "fine-tuning with distillation", because one relies on a
|
||||||
teacher that has already been fine-tuned on the downstream dataset. In terms of models, (1) corresponds to
|
teacher that has already been fine-tuned on the downstream dataset. In terms of models, (1) corresponds to
|
||||||
:class:`~transformers.DeiTForImageClassification` and (2) corresponds to
|
[`DeiTForImageClassification`] and (2) corresponds to
|
||||||
:class:`~transformers.DeiTForImageClassificationWithTeacher`.
|
[`DeiTForImageClassificationWithTeacher`].
|
||||||
- Note that the authors also did try soft distillation for (2) (in which case the distillation prediction head is
|
- Note that the authors also did try soft distillation for (2) (in which case the distillation prediction head is
|
||||||
trained using KL divergence to match the softmax output of the teacher), but hard distillation gave the best results.
|
trained using KL divergence to match the softmax output of the teacher), but hard distillation gave the best results.
|
||||||
- All released checkpoints were pre-trained and fine-tuned on ImageNet-1k only. No external data was used. This is in
|
- All released checkpoints were pre-trained and fine-tuned on ImageNet-1k only. No external data was used. This is in
|
||||||
contrast with the original ViT model, which used external data like the JFT-300M dataset/Imagenet-21k for
|
contrast with the original ViT model, which used external data like the JFT-300M dataset/Imagenet-21k for
|
||||||
pre-training.
|
pre-training.
|
||||||
- The authors of DeiT also released more efficiently trained ViT models, which you can directly plug into
|
- The authors of DeiT also released more efficiently trained ViT models, which you can directly plug into
|
||||||
:class:`~transformers.ViTModel` or :class:`~transformers.ViTForImageClassification`. Techniques like data
|
[`ViTModel`] or [`ViTForImageClassification`]. Techniques like data
|
||||||
augmentation, optimization, and regularization were used in order to simulate training on a much larger dataset
|
augmentation, optimization, and regularization were used in order to simulate training on a much larger dataset
|
||||||
(while only using ImageNet-1k for pre-training). There are 4 variants available (in 3 different sizes):
|
(while only using ImageNet-1k for pre-training). There are 4 variants available (in 3 different sizes):
|
||||||
`facebook/deit-tiny-patch16-224`, `facebook/deit-small-patch16-224`, `facebook/deit-base-patch16-224` and
|
*facebook/deit-tiny-patch16-224*, *facebook/deit-small-patch16-224*, *facebook/deit-base-patch16-224* and
|
||||||
`facebook/deit-base-patch16-384`. Note that one should use :class:`~transformers.DeiTFeatureExtractor` in order to
|
*facebook/deit-base-patch16-384*. Note that one should use [`DeiTFeatureExtractor`] in order to
|
||||||
prepare images for the model.
|
prepare images for the model.
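
The inference rule from the first tip (averaging the prediction of the class head and the distillation head) can be sketched without any model at all. The helper names `average_heads` and `predict` are hypothetical, and averaging raw logits rather than probabilities is an assumption of this sketch.

```python
def average_heads(cls_logits, distillation_logits):
    """Average the per-class scores of the two heads (illustrative sketch)."""
    if len(cls_logits) != len(distillation_logits):
        raise ValueError("both heads must score the same number of classes")
    return [(c + d) / 2 for c, d in zip(cls_logits, distillation_logits)]


def predict(cls_logits, distillation_logits):
    """Return the index of the class with the highest averaged score."""
    averaged = average_heads(cls_logits, distillation_logits)
    return max(range(len(averaged)), key=averaged.__getitem__)
```

For example, `predict([2.0, 0.0], [0.0, 4.0])` picks class 1, because the averaged scores are `[1.0, 2.0]`.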
|
||||||
|
|
||||||
This model was contributed by `nielsr <https://huggingface.co/nielsr>`__.
|
This model was contributed by [nielsr](https://huggingface.co/nielsr).
|
||||||
|
|
||||||
|
|
||||||
DeiTConfig
|
## DeiTConfig
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.DeiTConfig
|
[[autodoc]] DeiTConfig
|
||||||
:members:
|
|
||||||
|
|
||||||
|
## DeiTFeatureExtractor
|
||||||
|
|
||||||
DeiTFeatureExtractor
|
[[autodoc]] DeiTFeatureExtractor
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
- __call__
|
||||||
|
|
||||||
.. autoclass:: transformers.DeiTFeatureExtractor
|
## DeiTModel
|
||||||
:members: __call__
|
|
||||||
|
|
||||||
|
[[autodoc]] DeiTModel
|
||||||
|
- forward
|
||||||
|
|
||||||
DeiTModel
|
## DeiTForImageClassification
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.DeiTModel
|
[[autodoc]] DeiTForImageClassification
|
||||||
:members: forward
|
- forward
|
||||||
|
|
||||||
|
## DeiTForImageClassificationWithTeacher
|
||||||
|
|
||||||
DeiTForImageClassification
|
[[autodoc]] DeiTForImageClassificationWithTeacher
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
- forward
|
||||||
|
|
||||||
.. autoclass:: transformers.DeiTForImageClassification
|
|
||||||
:members: forward
|
|
||||||
|
|
||||||
|
|
||||||
DeiTForImageClassificationWithTeacher
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.DeiTForImageClassificationWithTeacher
|
|
||||||
:members: forward
|
|
||||||
@@ -1,5 +1,4 @@
|
|||||||
..
|
<!--Copyright 2020 The HuggingFace Team. All rights reserved.
|
||||||
Copyright 2020 The HuggingFace Team. All rights reserved.
|
|
||||||
|
|
||||||
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
|
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
|
||||||
the License. You may obtain a copy of the License at
|
the License. You may obtain a copy of the License at
|
||||||
@@ -9,15 +8,13 @@
|
|||||||
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
|
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
|
||||||
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
|
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
|
||||||
specific language governing permissions and limitations under the License.
|
specific language governing permissions and limitations under the License.
|
||||||
|
-->
|
||||||
|
|
||||||
DialoGPT
|
# DialoGPT
|
||||||
-----------------------------------------------------------------------------------------------------------------------
|
|
||||||
|
|
||||||
Overview
|
## Overview
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
DialoGPT was proposed in `DialoGPT: Large-Scale Generative Pre-training for Conversational Response Generation
|
DialoGPT was proposed in [DialoGPT: Large-Scale Generative Pre-training for Conversational Response Generation](https://arxiv.org/abs/1911.00536) by Yizhe Zhang, Siqi Sun, Michel Galley, Yen-Chun Chen, Chris Brockett, Xiang Gao,
|
||||||
<https://arxiv.org/abs/1911.00536>`_ by Yizhe Zhang, Siqi Sun, Michel Galley, Yen-Chun Chen, Chris Brockett, Xiang Gao,
|
|
||||||
Jianfeng Gao, Jingjing Liu, Bill Dolan. It's a GPT2 Model trained on 147M conversation-like exchanges extracted from
|
Jianfeng Gao, Jingjing Liu, Bill Dolan. It's a GPT2 Model trained on 147M conversation-like exchanges extracted from
|
||||||
Reddit.
|
Reddit.
|
||||||
|
|
||||||
@@ -37,8 +34,7 @@ Tips:
|
|||||||
than the left.
|
than the left.
|
||||||
- DialoGPT was trained with a causal language modeling (CLM) objective on conversational data and is therefore powerful
|
- DialoGPT was trained with a causal language modeling (CLM) objective on conversational data and is therefore powerful
|
||||||
at response generation in open-domain dialogue systems.
|
at response generation in open-domain dialogue systems.
|
||||||
- DialoGPT enables the user to create a chat bot in just 10 lines of code as shown on `DialoGPT's model card
|
- DialoGPT enables the user to create a chat bot in just 10 lines of code as shown on [DialoGPT's model card](https://huggingface.co/microsoft/DialoGPT-medium).
|
||||||
<https://huggingface.co/microsoft/DialoGPT-medium>`_.
|
|
||||||
|
|
||||||
Training:
|
Training:
|
||||||
|
|
||||||
@@ -48,6 +44,6 @@ modeling. We first concatenate all dialog turns within a dialogue session into a
|
|||||||
sequence length), ended by the end-of-text token.* For more information, please refer to the original paper.
|
sequence length), ended by the end-of-text token.* For more information, please refer to the original paper.
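
As a rough, string-level sketch of the concatenation described above: the helper `build_dialogue_text` is hypothetical, and closing every turn (not only the last one) with the end-of-text token is an assumption of this example. `<|endoftext|>` is the GPT2 end-of-text token; with a real tokenizer one would read it from `tokenizer.eos_token` instead of hard-coding it.

```python
EOS_TOKEN = "<|endoftext|>"  # stand-in for tokenizer.eos_token


def build_dialogue_text(turns, eos_token=EOS_TOKEN):
    """Concatenate dialogue turns into one training string (illustrative sketch)."""
    # Each turn is closed by the end-of-text token, so the sequence also ends with it.
    return "".join(turn.strip() + eos_token for turn in turns)
```

For instance, `build_dialogue_text(["Hi", "Hello"])` returns `"Hi<|endoftext|>Hello<|endoftext|>"`.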
|
||||||
|
|
||||||
|
|
||||||
DialoGPT's architecture is based on the GPT2 model, so one can refer to :doc:`GPT2's documentation page <gpt2>`.
|
DialoGPT's architecture is based on the GPT2 model, so one can refer to [GPT2's documentation page](gpt2).
|
||||||
|
|
||||||
The original code can be found `here <https://github.com/microsoft/DialoGPT>`_.
|
The original code can be found [here](https://github.com/microsoft/DialoGPT).
|
||||||
149
docs/source/model_doc/distilbert.mdx
Normal file
@@ -0,0 +1,149 @@
|
|||||||
|
<!--Copyright 2020 The HuggingFace Team. All rights reserved.
|
||||||
|
|
||||||
|
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
|
||||||
|
the License. You may obtain a copy of the License at
|
||||||
|
|
||||||
|
http://www.apache.org/licenses/LICENSE-2.0
|
||||||
|
|
||||||
|
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
|
||||||
|
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
|
||||||
|
specific language governing permissions and limitations under the License.
|
||||||
|
-->
|
||||||
|
|
||||||
|
# DistilBERT
|
||||||
|
|
||||||
|
## Overview
|
||||||
|
|
||||||
|
The DistilBERT model was proposed in the blog post [Smaller, faster, cheaper, lighter: Introducing DistilBERT, a
|
||||||
|
distilled version of BERT](https://medium.com/huggingface/distilbert-8cf3380435b5), and the paper [DistilBERT, a
|
||||||
|
distilled version of BERT: smaller, faster, cheaper and lighter](https://arxiv.org/abs/1910.01108). DistilBERT is a
|
||||||
|
small, fast, cheap and light Transformer model trained by distilling BERT base. It has 40% fewer parameters than
|
||||||
|
*bert-base-uncased* and runs 60% faster while preserving over 95% of BERT's performance as measured on the GLUE language
|
||||||
|
understanding benchmark.
|
||||||
|
|
||||||
|
The abstract from the paper is the following:
|
||||||
|
|
||||||
|
*As Transfer Learning from large-scale pre-trained models becomes more prevalent in Natural Language Processing (NLP),
|
||||||
|
operating these large models in on-the-edge and/or under constrained computational training or inference budgets
|
||||||
|
remains challenging. In this work, we propose a method to pre-train a smaller general-purpose language representation
|
||||||
|
model, called DistilBERT, which can then be fine-tuned with good performances on a wide range of tasks like its larger
|
||||||
|
counterparts. While most prior work investigated the use of distillation for building task-specific models, we leverage
|
||||||
|
knowledge distillation during the pretraining phase and show that it is possible to reduce the size of a BERT model by
|
||||||
|
40%, while retaining 97% of its language understanding capabilities and being 60% faster. To leverage the inductive
|
||||||
|
biases learned by larger models during pretraining, we introduce a triple loss combining language modeling,
|
||||||
|
distillation and cosine-distance losses. Our smaller, faster and lighter model is cheaper to pre-train and we
|
||||||
|
demonstrate its capabilities for on-device computations in a proof-of-concept experiment and a comparative on-device
|
||||||
|
study.*
|
||||||
|
|
||||||
|
Tips:
|
||||||
|
|
||||||
|
- DistilBERT doesn't have `token_type_ids`, so you don't need to indicate which token belongs to which segment. Just
|
||||||
|
separate your segments with the separation token `tokenizer.sep_token` (or `[SEP]`).
|
||||||
|
- DistilBERT doesn't have options to select the input positions (`position_ids` input). This could be added if
|
||||||
|
necessary though, just let us know if you need this option.
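
The first tip can be illustrated at the string level. This is a minimal sketch: `join_segments` is a hypothetical helper, and the hard-coded `[SEP]` stands in for `tokenizer.sep_token`, which is what one would use with a real tokenizer.

```python
SEP_TOKEN = "[SEP]"  # stand-in for tokenizer.sep_token


def join_segments(segments, sep_token=SEP_TOKEN):
    """Join text segments with the separation token (illustrative sketch)."""
    # No segment ids are produced: the separator alone marks the boundary.
    return f" {sep_token} ".join(segment.strip() for segment in segments)
```

For example, `join_segments(["How are you?", "Fine."])` gives `"How are you? [SEP] Fine."`, a single input string with no accompanying `token_type_ids`.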
|
||||||
|
|
||||||
|
This model was contributed by [victorsanh](https://huggingface.co/victorsanh). The JAX version of this model was
|
||||||
|
contributed by [kamalkraj](https://huggingface.co/kamalkraj). The original code can be found [here](https://github.com/huggingface/transformers/tree/master/examples/research_projects/distillation).
|
||||||
|
|
||||||
|
|
||||||
|
## DistilBertConfig
|
||||||
|
|
||||||
|
[[autodoc]] DistilBertConfig
|
||||||
|
|
||||||
|
## DistilBertTokenizer
|
||||||
|
|
||||||
|
[[autodoc]] DistilBertTokenizer
|
||||||
|
|
||||||
|
## DistilBertTokenizerFast
|
||||||
|
|
||||||
|
[[autodoc]] DistilBertTokenizerFast
|
||||||
|
|
||||||
|
## DistilBertModel
|
||||||
|
|
||||||
|
[[autodoc]] DistilBertModel
|
||||||
|
- forward
|
||||||
|
|
||||||
|
## DistilBertForMaskedLM
|
||||||
|
|
||||||
|
[[autodoc]] DistilBertForMaskedLM
|
||||||
|
- forward
|
||||||
|
|
||||||
|
## DistilBertForSequenceClassification
|
||||||
|
|
||||||
|
[[autodoc]] DistilBertForSequenceClassification
|
||||||
|
- forward
|
||||||
|
|
||||||
|
## DistilBertForMultipleChoice
|
||||||
|
|
||||||
|
[[autodoc]] DistilBertForMultipleChoice
|
||||||
|
- forward
|
||||||
|
|
||||||
|
## DistilBertForTokenClassification
|
||||||
|
|
||||||
|
[[autodoc]] DistilBertForTokenClassification
|
||||||
|
- forward
|
||||||
|
|
||||||
|
## DistilBertForQuestionAnswering
|
||||||
|
|
||||||
|
[[autodoc]] DistilBertForQuestionAnswering
|
||||||
|
- forward
|
||||||
|
|
||||||
|
## TFDistilBertModel
|
||||||
|
|
||||||
|
[[autodoc]] TFDistilBertModel
|
||||||
|
- call
|
||||||
|
|
||||||
|
## TFDistilBertForMaskedLM
|
||||||
|
|
||||||
|
[[autodoc]] TFDistilBertForMaskedLM
|
||||||
|
- call
|
||||||
|
|
||||||
|
## TFDistilBertForSequenceClassification
|
||||||
|
|
||||||
|
[[autodoc]] TFDistilBertForSequenceClassification
|
||||||
|
- call
|
||||||
|
|
||||||
|
## TFDistilBertForMultipleChoice
|
||||||
|
|
||||||
|
[[autodoc]] TFDistilBertForMultipleChoice
|
||||||
|
- call
|
||||||
|
|
||||||
|
## TFDistilBertForTokenClassification
|
||||||
|
|
||||||
|
[[autodoc]] TFDistilBertForTokenClassification
|
||||||
|
- call
|
||||||
|
|
||||||
|
## TFDistilBertForQuestionAnswering
|
||||||
|
|
||||||
|
[[autodoc]] TFDistilBertForQuestionAnswering
|
||||||
|
- call
|
||||||
|
|
||||||
|
## FlaxDistilBertModel
|
||||||
|
|
||||||
|
[[autodoc]] FlaxDistilBertModel
|
||||||
|
- __call__
|
||||||
|
|
||||||
|
## FlaxDistilBertForMaskedLM
|
||||||
|
|
||||||
|
[[autodoc]] FlaxDistilBertForMaskedLM
|
||||||
|
- __call__
|
||||||
|
|
||||||
|
## FlaxDistilBertForSequenceClassification
|
||||||
|
|
||||||
|
[[autodoc]] FlaxDistilBertForSequenceClassification
|
||||||
|
- __call__
|
||||||
|
|
||||||
|
## FlaxDistilBertForMultipleChoice
|
||||||
|
|
||||||
|
[[autodoc]] FlaxDistilBertForMultipleChoice
|
||||||
|
- __call__
|
||||||
|
|
||||||
|
## FlaxDistilBertForTokenClassification
|
||||||
|
|
||||||
|
[[autodoc]] FlaxDistilBertForTokenClassification
|
||||||
|
- __call__
|
||||||
|
|
||||||
|
## FlaxDistilBertForQuestionAnswering
|
||||||
|
|
||||||
|
[[autodoc]] FlaxDistilBertForQuestionAnswering
|
||||||
|
- __call__
|
||||||
@@ -1,197 +0,0 @@
|
|||||||
..
|
|
||||||
Copyright 2020 The HuggingFace Team. All rights reserved.
|
|
||||||
|
|
||||||
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
|
|
||||||
the License. You may obtain a copy of the License at
|
|
||||||
|
|
||||||
http://www.apache.org/licenses/LICENSE-2.0
|
|
||||||
|
|
||||||
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
|
|
||||||
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
|
|
||||||
specific language governing permissions and limitations under the License.
|
|
||||||
|
|
||||||
DistilBERT
|
|
||||||
-----------------------------------------------------------------------------------------------------------------------
|
|
||||||
|
|
||||||
Overview
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
The DistilBERT model was proposed in the blog post `Smaller, faster, cheaper, lighter: Introducing DistilBERT, a
|
|
||||||
distilled version of BERT <https://medium.com/huggingface/distilbert-8cf3380435b5>`__, and the paper `DistilBERT, a
|
|
||||||
distilled version of BERT: smaller, faster, cheaper and lighter <https://arxiv.org/abs/1910.01108>`__. DistilBERT is a
|
|
||||||
small, fast, cheap and light Transformer model trained by distilling BERT base. It has 40% fewer parameters than
|
|
||||||
`bert-base-uncased` and runs 60% faster while preserving over 95% of BERT's performance as measured on the GLUE language
|
|
||||||
understanding benchmark.
|
|
||||||
|
|
||||||
The abstract from the paper is the following:
|
|
||||||
|
|
||||||
*As Transfer Learning from large-scale pre-trained models becomes more prevalent in Natural Language Processing (NLP),
|
|
||||||
operating these large models in on-the-edge and/or under constrained computational training or inference budgets
|
|
||||||
remains challenging. In this work, we propose a method to pre-train a smaller general-purpose language representation
|
|
||||||
model, called DistilBERT, which can then be fine-tuned with good performances on a wide range of tasks like its larger
|
|
||||||
counterparts. While most prior work investigated the use of distillation for building task-specific models, we leverage
|
|
||||||
knowledge distillation during the pretraining phase and show that it is possible to reduce the size of a BERT model by
|
|
||||||
40%, while retaining 97% of its language understanding capabilities and being 60% faster. To leverage the inductive
|
|
||||||
biases learned by larger models during pretraining, we introduce a triple loss combining language modeling,
|
|
||||||
distillation and cosine-distance losses. Our smaller, faster and lighter model is cheaper to pre-train and we
|
|
||||||
demonstrate its capabilities for on-device computations in a proof-of-concept experiment and a comparative on-device
|
|
||||||
study.*
|
|
||||||
|
|
||||||
Tips:
|
|
||||||
|
|
||||||
- DistilBERT doesn't have :obj:`token_type_ids`, so you don't need to indicate which token belongs to which segment. Just
|
|
||||||
separate your segments with the separation token :obj:`tokenizer.sep_token` (or :obj:`[SEP]`).
|
|
||||||
- DistilBERT doesn't have options to select the input positions (:obj:`position_ids` input). This could be added if
|
|
||||||
necessary though, just let us know if you need this option.
|
|
||||||
|
|
||||||
This model was contributed by `victorsanh <https://huggingface.co/victorsanh>`__. The JAX version of this model was
|
|
||||||
contributed by `kamalkraj <https://huggingface.co/kamalkraj>`__. The original code can be found :prefix_link:`here
|
|
||||||
<examples/research_projects/distillation>`.
|
|
||||||
|
|
||||||
|
|
||||||
DistilBertConfig
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.DistilBertConfig
|
|
||||||
:members:
|
|
||||||
|
|
||||||
|
|
||||||
DistilBertTokenizer
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.DistilBertTokenizer
|
|
||||||
:members:
|
|
||||||
|
|
||||||
|
|
||||||
DistilBertTokenizerFast
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.DistilBertTokenizerFast
|
|
||||||
:members:
|
|
||||||
|
|
||||||
|
|
||||||
DistilBertModel
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.DistilBertModel
|
|
||||||
:members: forward
|
|
||||||
|
|
||||||
|
|
||||||
DistilBertForMaskedLM
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.DistilBertForMaskedLM
|
|
||||||
:members: forward
|
|
||||||
|
|
||||||
|
|
||||||
DistilBertForSequenceClassification
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.DistilBertForSequenceClassification
|
|
||||||
:members: forward
|
|
||||||
|
|
||||||
|
|
||||||
DistilBertForMultipleChoice
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.DistilBertForMultipleChoice
|
|
||||||
:members: forward
|
|
||||||
|
|
||||||
|
|
||||||
DistilBertForTokenClassification
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.DistilBertForTokenClassification
|
|
||||||
:members: forward
|
|
||||||
|
|
||||||
|
|
||||||
DistilBertForQuestionAnswering
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.DistilBertForQuestionAnswering
|
|
||||||
:members: forward
|
|
||||||
|
|
||||||
TFDistilBertModel
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.TFDistilBertModel
|
|
||||||
:members: call
|
|
||||||
|
|
||||||
|
|
||||||
TFDistilBertForMaskedLM
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.TFDistilBertForMaskedLM
|
|
||||||
:members: call
|
|
||||||
|
|
||||||
|
|
||||||
TFDistilBertForSequenceClassification
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.TFDistilBertForSequenceClassification
|
|
||||||
:members: call
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
TFDistilBertForMultipleChoice
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.TFDistilBertForMultipleChoice
|
|
||||||
:members: call
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
TFDistilBertForTokenClassification
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.TFDistilBertForTokenClassification
|
|
||||||
:members: call
|
|
||||||
|
|
||||||
|
|
||||||
TFDistilBertForQuestionAnswering
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.TFDistilBertForQuestionAnswering
|
|
||||||
:members: call
|
|
||||||
|
|
||||||
|
|
||||||
FlaxDistilBertModel
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.FlaxDistilBertModel
|
|
||||||
:members: __call__
|
|
||||||
|
|
||||||
|
|
||||||
FlaxDistilBertForMaskedLM
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.FlaxDistilBertForMaskedLM
|
|
||||||
:members: __call__
|
|
||||||
|
|
||||||
|
|
||||||
FlaxDistilBertForSequenceClassification
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.FlaxDistilBertForSequenceClassification
|
|
||||||
:members: __call__
|
|
||||||
|
|
||||||
|
|
||||||
FlaxDistilBertForMultipleChoice
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.FlaxDistilBertForMultipleChoice
|
|
||||||
:members: __call__
|
|
||||||
|
|
||||||
|
|
||||||
FlaxDistilBertForTokenClassification
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.FlaxDistilBertForTokenClassification
|
|
||||||
:members: __call__
|
|
||||||
|
|
||||||
|
|
||||||
FlaxDistilBertForQuestionAnswering
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.FlaxDistilBertForQuestionAnswering
|
|
||||||
:members: __call__
|
|
||||||
98
docs/source/model_doc/dpr.mdx
Normal file
@@ -0,0 +1,98 @@
<!--Copyright 2020 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->

# DPR

## Overview

Dense Passage Retrieval (DPR) is a set of tools and models for state-of-the-art open-domain Q&A research. It was
introduced in [Dense Passage Retrieval for Open-Domain Question Answering](https://arxiv.org/abs/2004.04906) by
Vladimir Karpukhin, Barlas Oğuz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, Wen-tau Yih.

The abstract from the paper is the following:

*Open-domain question answering relies on efficient passage retrieval to select candidate contexts, where traditional
sparse vector space models, such as TF-IDF or BM25, are the de facto method. In this work, we show that retrieval can
be practically implemented using dense representations alone, where embeddings are learned from a small number of
questions and passages by a simple dual-encoder framework. When evaluated on a wide range of open-domain QA datasets,
our dense retriever outperforms a strong Lucene-BM25 system largely by 9%-19% absolute in terms of top-20 passage
retrieval accuracy, and helps our end-to-end QA system establish new state-of-the-art on multiple open-domain QA
benchmarks.*
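
The dual-encoder retrieval described in the abstract can be illustrated with a small sketch. It assumes that questions and passages have already been encoded into dense vectors and that relevance is scored by an inner product. The toy 3-dimensional vectors and the helper names `dot` and `retrieve_top_k` are hypothetical and are not part of the library; the real encoders produce much larger embeddings.

```python
from typing import List, Sequence, Tuple


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Inner product of two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    return sum(a * b for a, b in zip(u, v))


def retrieve_top_k(
    question_vec: Sequence[float],
    passage_vecs: List[Sequence[float]],
    k: int,
) -> List[Tuple[int, float]]:
    """Return (passage_index, score) pairs for the k highest-scoring passages."""
    scored = [(i, dot(question_vec, p)) for i, p in enumerate(passage_vecs)]
    # Highest score first; ties keep the lower passage index first.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:k]


# Hypothetical embeddings standing in for question and context encoder outputs.
question = [1.0, 0.0, 2.0]
passages = [
    [0.5, 0.5, 0.0],
    [1.0, 0.0, 1.0],
    [0.0, 1.0, 0.5],
]
top2 = retrieve_top_k(question, passages, k=2)
print(top2)
```

Because passage vectors do not depend on the question, they can be computed once and indexed ahead of time; only the question needs to be encoded at query time.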

This model was contributed by [lhoestq](https://huggingface.co/lhoestq). The original code can be found [here](https://github.com/facebookresearch/DPR).


## DPRConfig

[[autodoc]] DPRConfig

## DPRContextEncoderTokenizer

[[autodoc]] DPRContextEncoderTokenizer

## DPRContextEncoderTokenizerFast

[[autodoc]] DPRContextEncoderTokenizerFast

## DPRQuestionEncoderTokenizer

[[autodoc]] DPRQuestionEncoderTokenizer

## DPRQuestionEncoderTokenizerFast

[[autodoc]] DPRQuestionEncoderTokenizerFast

## DPRReaderTokenizer

[[autodoc]] DPRReaderTokenizer

## DPRReaderTokenizerFast

[[autodoc]] DPRReaderTokenizerFast

## DPR specific outputs

[[autodoc]] models.dpr.modeling_dpr.DPRContextEncoderOutput

[[autodoc]] models.dpr.modeling_dpr.DPRQuestionEncoderOutput

[[autodoc]] models.dpr.modeling_dpr.DPRReaderOutput

## DPRContextEncoder

[[autodoc]] DPRContextEncoder
    - forward

## DPRQuestionEncoder

[[autodoc]] DPRQuestionEncoder
    - forward

## DPRReader

[[autodoc]] DPRReader
    - forward

## TFDPRContextEncoder

[[autodoc]] TFDPRContextEncoder
    - call

## TFDPRQuestionEncoder

[[autodoc]] TFDPRQuestionEncoder
    - call

## TFDPRReader

[[autodoc]] TFDPRReader
    - call
@@ -1,133 +0,0 @@
..
    Copyright 2020 The HuggingFace Team. All rights reserved.

    Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
    the License. You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

    Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
    an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
    specific language governing permissions and limitations under the License.

DPR
-----------------------------------------------------------------------------------------------------------------------

Overview
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Dense Passage Retrieval (DPR) is a set of tools and models for state-of-the-art open-domain Q&A research. It was
introduced in `Dense Passage Retrieval for Open-Domain Question Answering <https://arxiv.org/abs/2004.04906>`__ by
Vladimir Karpukhin, Barlas Oğuz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, Wen-tau Yih.

The abstract from the paper is the following:

*Open-domain question answering relies on efficient passage retrieval to select candidate contexts, where traditional
sparse vector space models, such as TF-IDF or BM25, are the de facto method. In this work, we show that retrieval can
be practically implemented using dense representations alone, where embeddings are learned from a small number of
questions and passages by a simple dual-encoder framework. When evaluated on a wide range of open-domain QA datasets,
our dense retriever outperforms a strong Lucene-BM25 system largely by 9%-19% absolute in terms of top-20 passage
retrieval accuracy, and helps our end-to-end QA system establish new state-of-the-art on multiple open-domain QA
benchmarks.*

This model was contributed by `lhoestq <https://huggingface.co/lhoestq>`__. The original code can be found `here
<https://github.com/facebookresearch/DPR>`__.


DPRConfig
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.DPRConfig
    :members:


DPRContextEncoderTokenizer
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.DPRContextEncoderTokenizer
    :members:


DPRContextEncoderTokenizerFast
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.DPRContextEncoderTokenizerFast
    :members:

DPRQuestionEncoderTokenizer
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.DPRQuestionEncoderTokenizer
    :members:


DPRQuestionEncoderTokenizerFast
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.DPRQuestionEncoderTokenizerFast
    :members:

DPRReaderTokenizer
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.DPRReaderTokenizer
    :members:


DPRReaderTokenizerFast
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.DPRReaderTokenizerFast
    :members:


DPR specific outputs
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.models.dpr.modeling_dpr.DPRContextEncoderOutput
    :members:

.. autoclass:: transformers.models.dpr.modeling_dpr.DPRQuestionEncoderOutput
    :members:

.. autoclass:: transformers.models.dpr.modeling_dpr.DPRReaderOutput
    :members:


DPRContextEncoder
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.DPRContextEncoder
    :members: forward

DPRQuestionEncoder
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.DPRQuestionEncoder
    :members: forward


DPRReader
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.DPRReader
    :members: forward

TFDPRContextEncoder
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.TFDPRContextEncoder
    :members: call

TFDPRQuestionEncoder
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.TFDPRQuestionEncoder
    :members: call


TFDPRReader
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.TFDPRReader
    :members: call
179
docs/source/model_doc/electra.mdx
Normal file
@@ -0,0 +1,179 @@
<!--Copyright 2020 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->

# ELECTRA

## Overview

The ELECTRA model was proposed in the paper [ELECTRA: Pre-training Text Encoders as Discriminators Rather Than
Generators](https://openreview.net/pdf?id=r1xMH1BtvB). ELECTRA is a new pretraining approach which trains two
transformer models: the generator and the discriminator. The generator's role is to replace tokens in a sequence, and
is therefore trained as a masked language model. The discriminator, which is the model we're interested in, tries to
identify which tokens were replaced by the generator in the sequence.

The abstract from the paper is the following:

*Masked language modeling (MLM) pretraining methods such as BERT corrupt the input by replacing some tokens with [MASK]
and then train a model to reconstruct the original tokens. While they produce good results when transferred to
downstream NLP tasks, they generally require large amounts of compute to be effective. As an alternative, we propose a
more sample-efficient pretraining task called replaced token detection. Instead of masking the input, our approach
corrupts it by replacing some tokens with plausible alternatives sampled from a small generator network. Then, instead
of training a model that predicts the original identities of the corrupted tokens, we train a discriminative model that
predicts whether each token in the corrupted input was replaced by a generator sample or not. Thorough experiments
demonstrate this new pretraining task is more efficient than MLM because the task is defined over all input tokens
rather than just the small subset that was masked out. As a result, the contextual representations learned by our
approach substantially outperform the ones learned by BERT given the same model size, data, and compute. The gains are
particularly strong for small models; for example, we train a model on one GPU for 4 days that outperforms GPT (trained
using 30x more compute) on the GLUE natural language understanding benchmark. Our approach also works well at scale,
where it performs comparably to RoBERTa and XLNet while using less than 1/4 of their compute and outperforms them when
using the same amount of compute.*
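
The replaced token detection objective described above can be sketched as a labeling step: each position in the corrupted sequence gets a binary target saying whether it differs from the original. This is only an illustration on word-level tokens; the helper name `replaced_token_labels` and the example sentence are assumptions made here, and the actual models operate on token IDs.

```python
from typing import List


def replaced_token_labels(original: List[str], corrupted: List[str]) -> List[int]:
    """Return 1 where the corrupted token differs from the original, else 0."""
    if len(original) != len(corrupted):
        raise ValueError("sequences must have the same length")
    # A generator sample that happens to equal the original token counts as
    # "not replaced", so a plain comparison is enough for this sketch.
    return [int(o != c) for o, c in zip(original, corrupted)]


original = ["the", "chef", "cooked", "the", "meal"]
corrupted = ["the", "chef", "ate", "the", "meal"]  # generator replaced one token
labels = replaced_token_labels(original, corrupted)
print(labels)
```

Every position receives a label, which is the property the abstract credits for the task being defined over all input tokens rather than only the masked subset.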

Tips:

- ELECTRA is a pretraining approach, so there are nearly no changes to the underlying model, BERT. The
  only change is the separation of the embedding size and the hidden size: the embedding size is generally smaller,
  while the hidden size is larger. An additional projection layer (linear) is used to project the embeddings from their
  embedding size to the hidden size. In the case where the embedding size is the same as the hidden size, no projection
  layer is used.
- The ELECTRA checkpoints saved using [Google Research's implementation](https://github.com/google-research/electra)
  contain both the generator and discriminator. The conversion script requires the user to name which model to export
  into the correct architecture. Once converted to the HuggingFace format, these checkpoints may be loaded into all
  available ELECTRA models, however. This means that the discriminator may be loaded in the
  [`ElectraForMaskedLM`] model, and the generator may be loaded in the
  [`ElectraForPreTraining`] model (the classification head will be randomly initialized as it
  doesn't exist in the generator).
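
The first tip above, a linear projection that is only present when the embedding size differs from the hidden size, can be sketched in plain Python. This is a toy illustration under stated assumptions: `make_projection` and `project` are hypothetical helpers, the weights are fixed instead of learned, and the bias term is omitted. It is not the library's implementation.

```python
from typing import List, Optional

Matrix = List[List[float]]


def make_projection(embedding_size: int, hidden_size: int) -> Optional[Matrix]:
    """Build a toy (embedding_size x hidden_size) weight matrix, or None if sizes match."""
    if embedding_size == hidden_size:
        return None  # no projection layer is needed
    # Deterministic toy weights: column j copies embedding feature j % embedding_size.
    return [
        [1.0 if j % embedding_size == i else 0.0 for j in range(hidden_size)]
        for i in range(embedding_size)
    ]


def project(embeddings: Matrix, weight: Optional[Matrix]) -> Matrix:
    """Map each embedding row to the hidden size; pass through when weight is None."""
    if weight is None:
        return embeddings
    columns = list(zip(*weight))  # each column has embedding_size entries
    return [[sum(x * w for x, w in zip(row, col)) for col in columns] for row in embeddings]


token_embeddings = [[3.0, 5.0]]      # one token, embedding size 2
weight = make_projection(2, 4)       # hidden size 4, so a projection exists
hidden_inputs = project(token_embeddings, weight)
print(hidden_inputs)
```

Keeping the embedding size small reduces the size of the embedding table, while the projection lets the transformer layers run at the larger hidden size.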

This model was contributed by [lysandre](https://huggingface.co/lysandre). The original code can be found [here](https://github.com/google-research/electra).


## ElectraConfig

[[autodoc]] ElectraConfig

## ElectraTokenizer

[[autodoc]] ElectraTokenizer

## ElectraTokenizerFast

[[autodoc]] ElectraTokenizerFast

## Electra specific outputs

[[autodoc]] models.electra.modeling_electra.ElectraForPreTrainingOutput

[[autodoc]] models.electra.modeling_tf_electra.TFElectraForPreTrainingOutput

## ElectraModel

[[autodoc]] ElectraModel
    - forward

## ElectraForPreTraining

[[autodoc]] ElectraForPreTraining
    - forward

## ElectraForMaskedLM

[[autodoc]] ElectraForMaskedLM
    - forward

## ElectraForSequenceClassification

[[autodoc]] ElectraForSequenceClassification
    - forward

## ElectraForMultipleChoice

[[autodoc]] ElectraForMultipleChoice
    - forward

## ElectraForTokenClassification

[[autodoc]] ElectraForTokenClassification
    - forward

## ElectraForQuestionAnswering

[[autodoc]] ElectraForQuestionAnswering
    - forward

## TFElectraModel

[[autodoc]] TFElectraModel
    - call

## TFElectraForPreTraining

[[autodoc]] TFElectraForPreTraining
    - call

## TFElectraForMaskedLM

[[autodoc]] TFElectraForMaskedLM
    - call

## TFElectraForSequenceClassification

[[autodoc]] TFElectraForSequenceClassification
    - call

## TFElectraForMultipleChoice

[[autodoc]] TFElectraForMultipleChoice
    - call

## TFElectraForTokenClassification

[[autodoc]] TFElectraForTokenClassification
    - call

## TFElectraForQuestionAnswering

[[autodoc]] TFElectraForQuestionAnswering
    - call

## FlaxElectraModel

[[autodoc]] FlaxElectraModel
    - __call__

## FlaxElectraForPreTraining

[[autodoc]] FlaxElectraForPreTraining
    - __call__

## FlaxElectraForMaskedLM

[[autodoc]] FlaxElectraForMaskedLM
    - __call__

## FlaxElectraForSequenceClassification

[[autodoc]] FlaxElectraForSequenceClassification
    - __call__

## FlaxElectraForMultipleChoice

[[autodoc]] FlaxElectraForMultipleChoice
    - __call__

## FlaxElectraForTokenClassification

[[autodoc]] FlaxElectraForTokenClassification
    - __call__

## FlaxElectraForQuestionAnswering

[[autodoc]] FlaxElectraForQuestionAnswering
    - __call__
@@ -1,236 +0,0 @@
..
    Copyright 2020 The HuggingFace Team. All rights reserved.

    Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
    the License. You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

    Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
    an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
    specific language governing permissions and limitations under the License.

ELECTRA
-----------------------------------------------------------------------------------------------------------------------

Overview
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The ELECTRA model was proposed in the paper `ELECTRA: Pre-training Text Encoders as Discriminators Rather Than
Generators <https://openreview.net/pdf?id=r1xMH1BtvB>`__. ELECTRA is a new pretraining approach which trains two
transformer models: the generator and the discriminator. The generator's role is to replace tokens in a sequence, and
is therefore trained as a masked language model. The discriminator, which is the model we're interested in, tries to
identify which tokens were replaced by the generator in the sequence.

The abstract from the paper is the following:

*Masked language modeling (MLM) pretraining methods such as BERT corrupt the input by replacing some tokens with [MASK]
and then train a model to reconstruct the original tokens. While they produce good results when transferred to
downstream NLP tasks, they generally require large amounts of compute to be effective. As an alternative, we propose a
more sample-efficient pretraining task called replaced token detection. Instead of masking the input, our approach
corrupts it by replacing some tokens with plausible alternatives sampled from a small generator network. Then, instead
of training a model that predicts the original identities of the corrupted tokens, we train a discriminative model that
predicts whether each token in the corrupted input was replaced by a generator sample or not. Thorough experiments
demonstrate this new pretraining task is more efficient than MLM because the task is defined over all input tokens
rather than just the small subset that was masked out. As a result, the contextual representations learned by our
approach substantially outperform the ones learned by BERT given the same model size, data, and compute. The gains are
particularly strong for small models; for example, we train a model on one GPU for 4 days that outperforms GPT (trained
using 30x more compute) on the GLUE natural language understanding benchmark. Our approach also works well at scale,
where it performs comparably to RoBERTa and XLNet while using less than 1/4 of their compute and outperforms them when
using the same amount of compute.*

Tips:

- ELECTRA is a pretraining approach, so there are nearly no changes to the underlying model, BERT. The
  only change is the separation of the embedding size and the hidden size: the embedding size is generally smaller,
  while the hidden size is larger. An additional projection layer (linear) is used to project the embeddings from their
  embedding size to the hidden size. In the case where the embedding size is the same as the hidden size, no projection
  layer is used.
- The ELECTRA checkpoints saved using `Google Research's implementation <https://github.com/google-research/electra>`__
  contain both the generator and discriminator. The conversion script requires the user to name which model to export
  into the correct architecture. Once converted to the HuggingFace format, these checkpoints may be loaded into all
  available ELECTRA models, however. This means that the discriminator may be loaded in the
  :class:`~transformers.ElectraForMaskedLM` model, and the generator may be loaded in the
  :class:`~transformers.ElectraForPreTraining` model (the classification head will be randomly initialized as it
  doesn't exist in the generator).

This model was contributed by `lysandre <https://huggingface.co/lysandre>`__. The original code can be found `here
<https://github.com/google-research/electra>`__.


ElectraConfig
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.ElectraConfig
    :members:


ElectraTokenizer
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.ElectraTokenizer
    :members:


ElectraTokenizerFast
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.ElectraTokenizerFast
    :members:


Electra specific outputs
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.models.electra.modeling_electra.ElectraForPreTrainingOutput
    :members:

.. autoclass:: transformers.models.electra.modeling_tf_electra.TFElectraForPreTrainingOutput
    :members:


ElectraModel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.ElectraModel
    :members: forward


ElectraForPreTraining
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.ElectraForPreTraining
    :members: forward


ElectraForMaskedLM
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.ElectraForMaskedLM
    :members: forward


ElectraForSequenceClassification
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.ElectraForSequenceClassification
    :members: forward


ElectraForMultipleChoice
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.ElectraForMultipleChoice
    :members: forward


ElectraForTokenClassification
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.ElectraForTokenClassification
    :members: forward


ElectraForQuestionAnswering
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.ElectraForQuestionAnswering
    :members: forward


TFElectraModel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.TFElectraModel
    :members: call


TFElectraForPreTraining
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.TFElectraForPreTraining
    :members: call


TFElectraForMaskedLM
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.TFElectraForMaskedLM
    :members: call


TFElectraForSequenceClassification
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.TFElectraForSequenceClassification
    :members: call


TFElectraForMultipleChoice
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.TFElectraForMultipleChoice
    :members: call


TFElectraForTokenClassification
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.TFElectraForTokenClassification
    :members: call


TFElectraForQuestionAnswering
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.TFElectraForQuestionAnswering
    :members: call


FlaxElectraModel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.FlaxElectraModel
|
|
||||||
:members: __call__
|
|
||||||
|
|
||||||
|
|
||||||
FlaxElectraForPreTraining
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.FlaxElectraForPreTraining
|
|
||||||
:members: __call__
|
|
||||||
|
|
||||||
|
|
||||||
FlaxElectraForMaskedLM
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.FlaxElectraForMaskedLM
|
|
||||||
:members: __call__
|
|
||||||
|
|
||||||
|
|
||||||
FlaxElectraForSequenceClassification
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.FlaxElectraForSequenceClassification
|
|
||||||
:members: __call__
|
|
||||||
|
|
||||||
|
|
||||||
FlaxElectraForMultipleChoice
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.FlaxElectraForMultipleChoice
|
|
||||||
:members: __call__
|
|
||||||
|
|
||||||
|
|
||||||
FlaxElectraForTokenClassification
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.FlaxElectraForTokenClassification
|
|
||||||
:members: __call__
|
|
||||||
|
|
||||||
|
|
||||||
FlaxElectraForQuestionAnswering
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.FlaxElectraForQuestionAnswering
|
|
||||||
:members: __call__
|
|
||||||
68
docs/source/model_doc/encoderdecoder.mdx
Normal file
@@ -0,0 +1,68 @@
<!--Copyright 2020 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->

# Encoder Decoder Models

The [`EncoderDecoderModel`] can be used to initialize a sequence-to-sequence model with any
pretrained autoencoding model as the encoder and any pretrained autoregressive model as the decoder.

The effectiveness of initializing sequence-to-sequence models with pretrained checkpoints for sequence generation tasks
was shown in [Leveraging Pre-trained Checkpoints for Sequence Generation Tasks](https://arxiv.org/abs/1907.12461) by
Sascha Rothe, Shashi Narayan, Aliaksei Severyn.

After such an [`EncoderDecoderModel`] has been trained/fine-tuned, it can be saved/loaded just like
any other model (see the examples for more information).

An application of this architecture could be to leverage two pretrained [`BertModel`] as the encoder
and decoder for a summarization model as was shown in: [Text Summarization with Pretrained Encoders](https://arxiv.org/abs/1908.08345) by Yang Liu and Mirella Lapata.
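
The BERT-to-BERT setup described above can be sketched as follows. This is a minimal illustration, not the method used in the cited paper: the helper name `build_bert2bert` is hypothetical, and the `bert-base-uncased` checkpoint name is an assumption chosen only for the example. The only library call used is `from_encoder_decoder_pretrained`, which this page documents further down.

```python
def build_bert2bert(encoder_name="bert-base-uncased", decoder_name="bert-base-uncased"):
    """Warm-start an encoder-decoder model from two pretrained checkpoints.

    The helper name and the default checkpoint names are illustrative assumptions.
    """
    # Imported lazily so that defining this helper does not require transformers to be installed.
    from transformers import EncoderDecoderModel

    # The encoder and decoder weights are loaded separately. The decoder's cross-attention
    # layers do not exist in a plain BERT checkpoint, so they start untrained and the
    # combined model needs fine-tuning before it is useful.
    return EncoderDecoderModel.from_encoder_decoder_pretrained(encoder_name, decoder_name)
```

Calling `build_bert2bert()` downloads both checkpoints, so it needs network access or a local cache.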

The [`~TFEncoderDecoderModel.from_pretrained`] currently doesn't support initializing the model from a
PyTorch checkpoint. Passing `from_pt=True` to this method will throw an exception. If there are only PyTorch
checkpoints for a particular encoder-decoder model, a workaround is:

```python
>>> # a workaround to load from a PyTorch checkpoint
>>> from transformers import EncoderDecoderModel, TFEncoderDecoderModel

>>> _model = EncoderDecoderModel.from_pretrained("patrickvonplaten/bert2bert-cnn_dailymail-fp16")
>>> _model.encoder.save_pretrained("./encoder")
>>> _model.decoder.save_pretrained("./decoder")
>>> model = TFEncoderDecoderModel.from_encoder_decoder_pretrained(
...     "./encoder", "./decoder", encoder_from_pt=True, decoder_from_pt=True
... )
>>> # This is only for copying some specific attributes of this particular model.
>>> model.config = _model.config
```

This model was contributed by [thomwolf](https://github.com/thomwolf). This model's TensorFlow and Flax versions
were contributed by [ydshieh](https://github.com/ydshieh).


## EncoderDecoderConfig

[[autodoc]] EncoderDecoderConfig

## EncoderDecoderModel

[[autodoc]] EncoderDecoderModel
    - forward
    - from_encoder_decoder_pretrained

## TFEncoderDecoderModel

[[autodoc]] TFEncoderDecoderModel
    - call
    - from_encoder_decoder_pretrained

## FlaxEncoderDecoderModel

[[autodoc]] FlaxEncoderDecoderModel
    - __call__
    - from_encoder_decoder_pretrained
@@ -1,75 +0,0 @@
..
    Copyright 2020 The HuggingFace Team. All rights reserved.

    Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
    the License. You may obtain a copy of the License at

        http://www.apache.org/licenses/LICENSE-2.0

    Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
    an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
    specific language governing permissions and limitations under the License.

Encoder Decoder Models
-----------------------------------------------------------------------------------------------------------------------

The :class:`~transformers.EncoderDecoderModel` can be used to initialize a sequence-to-sequence model with any
pretrained autoencoding model as the encoder and any pretrained autoregressive model as the decoder.

The effectiveness of initializing sequence-to-sequence models with pretrained checkpoints for sequence generation tasks
was shown in `Leveraging Pre-trained Checkpoints for Sequence Generation Tasks <https://arxiv.org/abs/1907.12461>`__ by
Sascha Rothe, Shashi Narayan, Aliaksei Severyn.

After such an :class:`~transformers.EncoderDecoderModel` has been trained/fine-tuned, it can be saved/loaded just like
any other model (see the examples for more information).

An application of this architecture could be to leverage two pretrained :class:`~transformers.BertModel` as the encoder
and decoder for a summarization model as was shown in: `Text Summarization with Pretrained Encoders
<https://arxiv.org/abs/1908.08345>`__ by Yang Liu and Mirella Lapata.

The :meth:`~transformers.TFEncoderDecoderModel.from_pretrained` currently doesn't support initializing the model from a
PyTorch checkpoint. Passing ``from_pt=True`` to this method will throw an exception. If there are only PyTorch
checkpoints for a particular encoder-decoder model, a workaround is:

.. code-block::

    >>> # a workaround to load from a PyTorch checkpoint
    >>> from transformers import EncoderDecoderModel, TFEncoderDecoderModel

    >>> _model = EncoderDecoderModel.from_pretrained("patrickvonplaten/bert2bert-cnn_dailymail-fp16")
    >>> _model.encoder.save_pretrained("./encoder")
    >>> _model.decoder.save_pretrained("./decoder")
    >>> model = TFEncoderDecoderModel.from_encoder_decoder_pretrained(
    ...     "./encoder", "./decoder", encoder_from_pt=True, decoder_from_pt=True
    ... )
    >>> # This is only for copying some specific attributes of this particular model.
    >>> model.config = _model.config

This model was contributed by `thomwolf <https://github.com/thomwolf>`__. This model's TensorFlow and Flax versions
were contributed by `ydshieh <https://github.com/ydshieh>`__.


EncoderDecoderConfig
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.EncoderDecoderConfig
    :members:


EncoderDecoderModel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.EncoderDecoderModel
    :members: forward, from_encoder_decoder_pretrained


TFEncoderDecoderModel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.TFEncoderDecoderModel
    :members: call, from_encoder_decoder_pretrained


FlaxEncoderDecoderModel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.FlaxEncoderDecoderModel
    :members: __call__, from_encoder_decoder_pretrained
109
docs/source/model_doc/flaubert.mdx
Normal file
@@ -0,0 +1,109 @@
<!--Copyright 2020 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->

# FlauBERT

## Overview

The FlauBERT model was proposed in the paper [FlauBERT: Unsupervised Language Model Pre-training for French](https://arxiv.org/abs/1912.05372) by Hang Le et al. It's a transformer model pretrained using a masked language
modeling (MLM) objective (like BERT).

The abstract from the paper is the following:

*Language models have become a key step to achieve state-of-the art results in many different Natural Language
Processing (NLP) tasks. Leveraging the huge amount of unlabeled texts nowadays available, they provide an efficient way
to pre-train continuous word representations that can be fine-tuned for a downstream task, along with their
contextualization at the sentence level. This has been widely demonstrated for English using contextualized
representations (Dai and Le, 2015; Peters et al., 2018; Howard and Ruder, 2018; Radford et al., 2018; Devlin et al.,
2019; Yang et al., 2019b). In this paper, we introduce and share FlauBERT, a model learned on a very large and
heterogeneous French corpus. Models of different sizes are trained using the new CNRS (French National Centre for
Scientific Research) Jean Zay supercomputer. We apply our French language models to diverse NLP tasks (text
classification, paraphrasing, natural language inference, parsing, word sense disambiguation) and show that most of the
time they outperform other pretraining approaches. Different versions of FlauBERT as well as a unified evaluation
protocol for the downstream tasks, called FLUE (French Language Understanding Evaluation), are shared to the research
community for further reproducible experiments in French NLP.*

This model was contributed by [formiel](https://huggingface.co/formiel). The original code can be found [here](https://github.com/getalp/Flaubert).


## FlaubertConfig

[[autodoc]] FlaubertConfig

## FlaubertTokenizer

[[autodoc]] FlaubertTokenizer

## FlaubertModel

[[autodoc]] FlaubertModel
    - forward

## FlaubertWithLMHeadModel

[[autodoc]] FlaubertWithLMHeadModel
    - forward

## FlaubertForSequenceClassification

[[autodoc]] FlaubertForSequenceClassification
    - forward

## FlaubertForMultipleChoice

[[autodoc]] FlaubertForMultipleChoice
    - forward

## FlaubertForTokenClassification

[[autodoc]] FlaubertForTokenClassification
    - forward

## FlaubertForQuestionAnsweringSimple

[[autodoc]] FlaubertForQuestionAnsweringSimple
    - forward

## FlaubertForQuestionAnswering

[[autodoc]] FlaubertForQuestionAnswering
    - forward

## TFFlaubertModel

[[autodoc]] TFFlaubertModel
    - call

## TFFlaubertWithLMHeadModel

[[autodoc]] TFFlaubertWithLMHeadModel
    - call

## TFFlaubertForSequenceClassification

[[autodoc]] TFFlaubertForSequenceClassification
    - call

## TFFlaubertForMultipleChoice

[[autodoc]] TFFlaubertForMultipleChoice
    - call

## TFFlaubertForTokenClassification

[[autodoc]] TFFlaubertForTokenClassification
    - call

## TFFlaubertForQuestionAnsweringSimple

[[autodoc]] TFFlaubertForQuestionAnsweringSimple
    - call
@@ -1,144 +0,0 @@
..
    Copyright 2020 The HuggingFace Team. All rights reserved.

    Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
    the License. You may obtain a copy of the License at

        http://www.apache.org/licenses/LICENSE-2.0

    Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
    an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
    specific language governing permissions and limitations under the License.

FlauBERT
-----------------------------------------------------------------------------------------------------------------------

Overview
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The FlauBERT model was proposed in the paper `FlauBERT: Unsupervised Language Model Pre-training for French
<https://arxiv.org/abs/1912.05372>`__ by Hang Le et al. It's a transformer model pretrained using a masked language
modeling (MLM) objective (like BERT).

The abstract from the paper is the following:

*Language models have become a key step to achieve state-of-the art results in many different Natural Language
Processing (NLP) tasks. Leveraging the huge amount of unlabeled texts nowadays available, they provide an efficient way
to pre-train continuous word representations that can be fine-tuned for a downstream task, along with their
contextualization at the sentence level. This has been widely demonstrated for English using contextualized
representations (Dai and Le, 2015; Peters et al., 2018; Howard and Ruder, 2018; Radford et al., 2018; Devlin et al.,
2019; Yang et al., 2019b). In this paper, we introduce and share FlauBERT, a model learned on a very large and
heterogeneous French corpus. Models of different sizes are trained using the new CNRS (French National Centre for
Scientific Research) Jean Zay supercomputer. We apply our French language models to diverse NLP tasks (text
classification, paraphrasing, natural language inference, parsing, word sense disambiguation) and show that most of the
time they outperform other pretraining approaches. Different versions of FlauBERT as well as a unified evaluation
protocol for the downstream tasks, called FLUE (French Language Understanding Evaluation), are shared to the research
community for further reproducible experiments in French NLP.*

This model was contributed by `formiel <https://huggingface.co/formiel>`__. The original code can be found `here
<https://github.com/getalp/Flaubert>`__.


FlaubertConfig
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.FlaubertConfig
    :members:


FlaubertTokenizer
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.FlaubertTokenizer
    :members:


FlaubertModel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.FlaubertModel
    :members: forward


FlaubertWithLMHeadModel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.FlaubertWithLMHeadModel
    :members: forward


FlaubertForSequenceClassification
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.FlaubertForSequenceClassification
    :members: forward


FlaubertForMultipleChoice
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.FlaubertForMultipleChoice
    :members: forward


FlaubertForTokenClassification
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.FlaubertForTokenClassification
    :members: forward


FlaubertForQuestionAnsweringSimple
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.FlaubertForQuestionAnsweringSimple
    :members: forward


FlaubertForQuestionAnswering
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.FlaubertForQuestionAnswering
    :members: forward


TFFlaubertModel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.TFFlaubertModel
    :members: call


TFFlaubertWithLMHeadModel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.TFFlaubertWithLMHeadModel
    :members: call


TFFlaubertForSequenceClassification
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.TFFlaubertForSequenceClassification
    :members: call


TFFlaubertForMultipleChoice
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.TFFlaubertForMultipleChoice
    :members: call


TFFlaubertForTokenClassification
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.TFFlaubertForTokenClassification
    :members: call


TFFlaubertForQuestionAnsweringSimple
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.TFFlaubertForQuestionAnsweringSimple
    :members: call
98
docs/source/model_doc/fnet.mdx
Normal file
@@ -0,0 +1,98 @@
<!--Copyright 2021 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->

# FNet

## Overview

The FNet model was proposed in [FNet: Mixing Tokens with Fourier Transforms](https://arxiv.org/abs/2105.03824) by
James Lee-Thorp, Joshua Ainslie, Ilya Eckstein, Santiago Ontanon. The model replaces the self-attention layer in a BERT
model with a Fourier transform which returns only the real parts of the transform. The model is significantly faster
than the BERT model because it has fewer parameters and is more memory efficient. The model achieves about 92-97%
accuracy of BERT counterparts on the GLUE benchmark, and trains much faster than the BERT model. The abstract from the
paper is the following:

*We show that Transformer encoder architectures can be sped up, with limited accuracy costs, by replacing the
self-attention sublayers with simple linear transformations that "mix" input tokens. These linear mixers, along with
standard nonlinearities in feed-forward layers, prove competent at modeling semantic relationships in several text
classification tasks. Most surprisingly, we find that replacing the self-attention sublayer in a Transformer encoder
with a standard, unparameterized Fourier Transform achieves 92-97% of the accuracy of BERT counterparts on the GLUE
benchmark, but trains 80% faster on GPUs and 70% faster on TPUs at standard 512 input lengths. At longer input lengths,
our FNet model is significantly faster: when compared to the "efficient" Transformers on the Long Range Arena
benchmark, FNet matches the accuracy of the most accurate models, while outpacing the fastest models across all
sequence lengths on GPUs (and across relatively shorter lengths on TPUs). Finally, FNet has a light memory footprint
and is particularly efficient at smaller model sizes; for a fixed speed and accuracy budget, small FNet models
outperform Transformer counterparts.*
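
The token mixing described in the overview (a Fourier transform in place of self-attention, keeping only the real part) can be sketched in plain Python. This is an illustrative sketch, not the library's implementation: it assumes the transform is applied along the hidden dimension and then along the sequence dimension, and the function names `dft` and `fourier_mix` are hypothetical. A naive O(n^2) transform is used for readability; real implementations rely on an FFT.

```python
import cmath


def dft(values):
    """Naive discrete Fourier transform of a 1D sequence (for illustration only)."""
    n = len(values)
    return [
        sum(values[pos] * cmath.exp(-2j * cmath.pi * freq * pos / n) for pos in range(n))
        for freq in range(n)
    ]


def fourier_mix(hidden_states):
    """Mix tokens with a 2D Fourier transform and keep only the real part.

    hidden_states: list of seq_len rows, each a list of hidden_size floats.
    """
    # Transform each token vector along the hidden dimension.
    rows = [dft(row) for row in hidden_states]
    # Transform each hidden coordinate along the sequence dimension.
    cols = [dft(list(col)) for col in zip(*rows)]
    # Transpose back to (seq_len, hidden_size) and discard the imaginary part.
    return [[value.real for value in row] for row in zip(*cols)]


# Two tokens with hidden size 2.
mixed = fourier_mix([[1.0, 2.0], [3.0, 4.0]])
```

For this 2x2 input, `mixed` is approximately `[[10, -2], [-4, 0]]`. The mixing step has no learned parameters: every output position depends on every input position.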

Tips on usage:

- The model was trained without an attention mask as it is based on Fourier Transform. The model was trained with
maximum sequence length 512 which includes pad tokens. Hence, it is highly recommended to use the same maximum
sequence length for fine-tuning and inference.
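
The tip above amounts to always feeding fixed-length inputs of 512 positions, padding included. A minimal sketch of that padding step follows, in plain Python so it runs without the library. The helper name `pad_to_max_length` is hypothetical and `pad_id=0` is only a placeholder assumption; a real tokenizer defines its own padding id. With the library's tokenizers, the same effect is typically requested through the `padding="max_length"`, `truncation=True` and `max_length=512` arguments.

```python
def pad_to_max_length(token_ids, max_length=512, pad_id=0):
    """Truncate or right-pad a list of token ids to exactly max_length entries."""
    # pad_id=0 is a placeholder; use the padding id of the actual tokenizer.
    clipped = token_ids[:max_length]
    return clipped + [pad_id] * (max_length - len(clipped))


# A short input is padded up to 512 positions.
padded = pad_to_max_length([5, 17, 42])
```

Keeping the length identical between pretraining, fine-tuning and inference matters here because, without an attention mask, pad positions take part in the mixing.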

This model was contributed by [gchhablani](https://huggingface.co/gchhablani). The original code can be found [here](https://github.com/google-research/google-research/tree/master/f_net).

## FNetConfig

[[autodoc]] FNetConfig

## FNetTokenizer

[[autodoc]] FNetTokenizer
    - build_inputs_with_special_tokens
    - get_special_tokens_mask
    - create_token_type_ids_from_sequences
    - save_vocabulary

## FNetTokenizerFast

[[autodoc]] FNetTokenizerFast

## FNetModel

[[autodoc]] FNetModel
    - forward

## FNetForPreTraining

[[autodoc]] FNetForPreTraining
    - forward

## FNetForMaskedLM

[[autodoc]] FNetForMaskedLM
    - forward

## FNetForNextSentencePrediction

[[autodoc]] FNetForNextSentencePrediction
    - forward

## FNetForSequenceClassification

[[autodoc]] FNetForSequenceClassification
    - forward

## FNetForMultipleChoice

[[autodoc]] FNetForMultipleChoice
    - forward

## FNetForTokenClassification

[[autodoc]] FNetForTokenClassification
    - forward

## FNetForQuestionAnswering

[[autodoc]] FNetForQuestionAnswering
    - forward
@@ -1,121 +0,0 @@
|
|||||||
..
|
|
||||||
Copyright 2021 The HuggingFace Team. All rights reserved.
|
|
||||||
|
|
||||||
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
|
|
||||||
the License. You may obtain a copy of the License at
|
|
||||||
|
|
||||||
http://www.apache.org/licenses/LICENSE-2.0
|
|
||||||
|
|
||||||
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
|
|
||||||
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
|
|
||||||
specific language governing permissions and limitations under the License.
|
|
||||||
|
|
||||||
FNet
|
|
||||||
-----------------------------------------------------------------------------------------------------------------------
|
|
||||||
|
|
||||||
Overview
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
The FNet model was proposed in `FNet: Mixing Tokens with Fourier Transforms <https://arxiv.org/abs/2105.03824>`__ by
|
|
||||||
James Lee-Thorp, Joshua Ainslie, Ilya Eckstein, Santiago Ontanon. The model replaces the self-attention layer in a BERT
|
|
||||||
model with a fourier transform which returns only the real parts of the transform. The model is significantly faster
|
|
||||||
than the BERT model because it has fewer parameters and is more memory efficient. The model achieves about 92-97%
|
|
||||||
accuracy of BERT counterparts on GLUE benchmark, and trains much faster than the BERT model. The abstract from the
|
|
||||||
paper is the following:
|
|
||||||
|
|
||||||
*We show that Transformer encoder architectures can be sped up, with limited accuracy costs, by replacing the
|
|
||||||
self-attention sublayers with simple linear transformations that "mix" input tokens. These linear mixers, along with
|
|
||||||
standard nonlinearities in feed-forward layers, prove competent at modeling semantic relationships in several text
|
|
||||||
classification tasks. Most surprisingly, we find that replacing the self-attention sublayer in a Transformer encoder
|
|
||||||
with a standard, unparameterized Fourier Transform achieves 92-97% of the accuracy of BERT counterparts on the GLUE
|
|
||||||
benchmark, but trains 80% faster on GPUs and 70% faster on TPUs at standard 512 input lengths. At longer input lengths,
|
|
||||||
our FNet model is significantly faster: when compared to the "efficient" Transformers on the Long Range Arena
|
|
||||||
benchmark, FNet matches the accuracy of the most accurate models, while outpacing the fastest models across all
|
|
||||||
sequence lengths on GPUs (and across relatively shorter lengths on TPUs). Finally, FNet has a light memory footprint
|
|
||||||
and is particularly efficient at smaller model sizes; for a fixed speed and accuracy budget, small FNet models
|
|
||||||
outperform Transformer counterparts.*
|
|
||||||
|
|
||||||
Tips on usage:
|
|
||||||
|
|
||||||
- The model was trained without an attention mask because it is based on the Fourier Transform. The model was trained with
|
|
||||||
maximum sequence length 512 which includes pad tokens. Hence, it is highly recommended to use the same maximum
|
|
||||||
sequence length for fine-tuning and inference.
|
|
||||||
|
|
||||||
This model was contributed by `gchhablani <https://huggingface.co/gchhablani>`__. The original code can be found `here
|
|
||||||
<https://github.com/google-research/google-research/tree/master/f_net>`__.
|
|
||||||
|
|
||||||
FNetConfig
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.FNetConfig
|
|
||||||
:members:
|
|
||||||
|
|
||||||
|
|
||||||
FNetTokenizer
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.FNetTokenizer
|
|
||||||
:members: build_inputs_with_special_tokens, get_special_tokens_mask,
|
|
||||||
create_token_type_ids_from_sequences, save_vocabulary
|
|
||||||
|
|
||||||
|
|
||||||
FNetTokenizerFast
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.FNetTokenizerFast
|
|
||||||
:members:
|
|
||||||
|
|
||||||
|
|
||||||
FNetModel
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.FNetModel
|
|
||||||
:members: forward
|
|
||||||
|
|
||||||
|
|
||||||
FNetForPreTraining
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.FNetForPreTraining
|
|
||||||
:members: forward
|
|
||||||
|
|
||||||
|
|
||||||
FNetForMaskedLM
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.FNetForMaskedLM
|
|
||||||
:members: forward
|
|
||||||
|
|
||||||
|
|
||||||
FNetForNextSentencePrediction
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.FNetForNextSentencePrediction
|
|
||||||
:members: forward
|
|
||||||
|
|
||||||
FNetForSequenceClassification
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.FNetForSequenceClassification
|
|
||||||
:members: forward
|
|
||||||
|
|
||||||
|
|
||||||
FNetForMultipleChoice
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.FNetForMultipleChoice
|
|
||||||
:members: forward
|
|
||||||
|
|
||||||
|
|
||||||
FNetForTokenClassification
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.FNetForTokenClassification
|
|
||||||
:members: forward
|
|
||||||
|
|
||||||
|
|
||||||
FNetForQuestionAnswering
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.FNetForQuestionAnswering
|
|
||||||
:members: forward
|
|
||||||
63
docs/source/model_doc/fsmt.mdx
Normal file
@@ -0,0 +1,63 @@
|
|||||||
|
<!--Copyright 2020 The HuggingFace Team. All rights reserved.
|
||||||
|
|
||||||
|
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
|
||||||
|
the License. You may obtain a copy of the License at
|
||||||
|
|
||||||
|
http://www.apache.org/licenses/LICENSE-2.0
|
||||||
|
|
||||||
|
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
|
||||||
|
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
|
||||||
|
specific language governing permissions and limitations under the License.
|
||||||
|
-->
|
||||||
|
|
||||||
|
# FSMT
|
||||||
|
|
||||||
|
**DISCLAIMER:** If you see something strange, file a [Github Issue](https://github.com/huggingface/transformers/issues/new?assignees=&labels=&template=bug-report.md&title) and assign
|
||||||
|
@stas00.
|
||||||
|
|
||||||
|
## Overview
|
||||||
|
|
||||||
|
FSMT (FairSeq MachineTranslation) models were introduced in [Facebook FAIR's WMT19 News Translation Task Submission](https://arxiv.org/abs/1907.06616) by Nathan Ng, Kyra Yee, Alexei Baevski, Myle Ott, Michael Auli, Sergey Edunov.
|
||||||
|
|
||||||
|
The abstract of the paper is the following:
|
||||||
|
|
||||||
|
*This paper describes Facebook FAIR's submission to the WMT19 shared news translation task. We participate in two
|
||||||
|
language pairs and four language directions, English <-> German and English <-> Russian. Following our submission from
|
||||||
|
last year, our baseline systems are large BPE-based transformer models trained with the Fairseq sequence modeling
|
||||||
|
toolkit which rely on sampled back-translations. This year we experiment with different bitext data filtering schemes,
|
||||||
|
as well as with adding filtered back-translated data. We also ensemble and fine-tune our models on domain-specific
|
||||||
|
data, then decode using noisy channel model reranking. Our submissions are ranked first in all four directions of the
|
||||||
|
human evaluation campaign. On En->De, our system significantly outperforms other systems as well as human translations.
|
||||||
|
This system improves upon our WMT'18 submission by 4.5 BLEU points.*
|
||||||
|
|
||||||
|
This model was contributed by [stas](https://huggingface.co/stas). The original code can be found
|
||||||
|
[here](https://github.com/pytorch/fairseq/tree/master/examples/wmt19).
|
||||||
|
|
||||||
|
## Implementation Notes
|
||||||
|
|
||||||
|
- FSMT uses source and target vocabulary pairs that aren't combined into one. It doesn't share token embeddings
|
||||||
|
either. Its tokenizer is very similar to [`XLMTokenizer`] and the main model is derived from
|
||||||
|
[`BartModel`].
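
To make the note above concrete, here is a minimal sketch of what separate source and target vocabularies imply. The two toy vocabularies and the `encode` helper are made up for illustration and are not the FSMT tokenizer's actual data or API.

```python
# Toy vocabularies (assumed for illustration): ids are assigned independently per side.
src_vocab = {"<pad>": 0, "hello": 1, "world": 2}
tgt_vocab = {"<pad>": 0, "hallo": 1, "welt": 2, "!": 3}


def encode(tokens, vocab):
    """Map tokens to ids using the vocabulary of one side (hypothetical helper)."""
    return [vocab[token] for token in tokens]


src_ids = encode(["hello", "world"], src_vocab)
tgt_ids = encode(["hallo", "welt", "!"], tgt_vocab)

print(src_ids)  # [1, 2]
print(tgt_ids)  # [1, 2, 3]
# The same id can mean different tokens on each side, so the encoder and the
# decoder each need their own embedding table, sized to their own vocabulary.
print(len(src_vocab), len(tgt_vocab))  # 3 4
```

Because id 1 means "hello" on the source side and "hallo" on the target side, a single shared embedding matrix would not make sense in this setup.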
|
||||||
|
|
||||||
|
|
||||||
|
## FSMTConfig
|
||||||
|
|
||||||
|
[[autodoc]] FSMTConfig
|
||||||
|
|
||||||
|
## FSMTTokenizer
|
||||||
|
|
||||||
|
[[autodoc]] FSMTTokenizer
|
||||||
|
- build_inputs_with_special_tokens
|
||||||
|
- get_special_tokens_mask
|
||||||
|
- create_token_type_ids_from_sequences
|
||||||
|
- save_vocabulary
|
||||||
|
|
||||||
|
## FSMTModel
|
||||||
|
|
||||||
|
[[autodoc]] FSMTModel
|
||||||
|
- forward
|
||||||
|
|
||||||
|
## FSMTForConditionalGeneration
|
||||||
|
|
||||||
|
[[autodoc]] FSMTForConditionalGeneration
|
||||||
|
- forward
|
||||||
@@ -1,74 +0,0 @@
|
|||||||
..
|
|
||||||
Copyright 2020 The HuggingFace Team. All rights reserved.
|
|
||||||
|
|
||||||
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
|
|
||||||
the License. You may obtain a copy of the License at
|
|
||||||
|
|
||||||
http://www.apache.org/licenses/LICENSE-2.0
|
|
||||||
|
|
||||||
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
|
|
||||||
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
|
|
||||||
specific language governing permissions and limitations under the License.
|
|
||||||
|
|
||||||
FSMT
|
|
||||||
-----------------------------------------------------------------------------------------------------------------------
|
|
||||||
|
|
||||||
**DISCLAIMER:** If you see something strange, file a `Github Issue
|
|
||||||
<https://github.com/huggingface/transformers/issues/new?assignees=&labels=&template=bug-report.md&title>`__ and assign
|
|
||||||
@stas00.
|
|
||||||
|
|
||||||
Overview
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
FSMT (FairSeq MachineTranslation) models were introduced in `Facebook FAIR's WMT19 News Translation Task Submission
|
|
||||||
<https://arxiv.org/abs/1907.06616>`__ by Nathan Ng, Kyra Yee, Alexei Baevski, Myle Ott, Michael Auli, Sergey Edunov.
|
|
||||||
|
|
||||||
The abstract of the paper is the following:
|
|
||||||
|
|
||||||
*This paper describes Facebook FAIR's submission to the WMT19 shared news translation task. We participate in two
|
|
||||||
language pairs and four language directions, English <-> German and English <-> Russian. Following our submission from
|
|
||||||
last year, our baseline systems are large BPE-based transformer models trained with the Fairseq sequence modeling
|
|
||||||
toolkit which rely on sampled back-translations. This year we experiment with different bitext data filtering schemes,
|
|
||||||
as well as with adding filtered back-translated data. We also ensemble and fine-tune our models on domain-specific
|
|
||||||
data, then decode using noisy channel model reranking. Our submissions are ranked first in all four directions of the
|
|
||||||
human evaluation campaign. On En->De, our system significantly outperforms other systems as well as human translations.
|
|
||||||
This system improves upon our WMT'18 submission by 4.5 BLEU points.*
|
|
||||||
|
|
||||||
This model was contributed by `stas <https://huggingface.co/stas>`__. The original code can be found `here
<https://github.com/pytorch/fairseq/tree/master/examples/wmt19>`__.
|
|
||||||
|
|
||||||
Implementation Notes
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
- FSMT uses source and target vocabulary pairs that aren't combined into one. It doesn't share token embeddings
|
|
||||||
either. Its tokenizer is very similar to :class:`~transformers.XLMTokenizer` and the main model is derived from
|
|
||||||
:class:`~transformers.BartModel`.
|
|
||||||
|
|
||||||
|
|
||||||
FSMTConfig
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.FSMTConfig
|
|
||||||
:members:
|
|
||||||
|
|
||||||
|
|
||||||
FSMTTokenizer
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.FSMTTokenizer
|
|
||||||
:members: build_inputs_with_special_tokens, get_special_tokens_mask,
|
|
||||||
create_token_type_ids_from_sequences, save_vocabulary
|
|
||||||
|
|
||||||
|
|
||||||
FSMTModel
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.FSMTModel
|
|
||||||
:members: forward
|
|
||||||
|
|
||||||
|
|
||||||
FSMTForConditionalGeneration
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.FSMTForConditionalGeneration
|
|
||||||
:members: forward
|
|
||||||
153
docs/source/model_doc/funnel.mdx
Normal file
@@ -0,0 +1,153 @@
|
|||||||
|
<!--Copyright 2020 The HuggingFace Team. All rights reserved.
|
||||||
|
|
||||||
|
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
|
||||||
|
the License. You may obtain a copy of the License at
|
||||||
|
|
||||||
|
http://www.apache.org/licenses/LICENSE-2.0
|
||||||
|
|
||||||
|
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
|
||||||
|
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
|
||||||
|
specific language governing permissions and limitations under the License.
|
||||||
|
-->
|
||||||
|
|
||||||
|
# Funnel Transformer
|
||||||
|
|
||||||
|
## Overview
|
||||||
|
|
||||||
|
The Funnel Transformer model was proposed in the paper [Funnel-Transformer: Filtering out Sequential Redundancy for
|
||||||
|
Efficient Language Processing](https://arxiv.org/abs/2006.03236). It is a bidirectional transformer model, like
|
||||||
|
BERT, but with a pooling operation after each block of layers, a bit like in traditional convolutional neural networks
|
||||||
|
(CNN) in computer vision.
|
||||||
|
|
||||||
|
The abstract from the paper is the following:
|
||||||
|
|
||||||
|
*With the success of language pretraining, it is highly desirable to develop more efficient architectures of good
|
||||||
|
scalability that can exploit the abundant unlabeled data at a lower cost. To improve the efficiency, we examine the
|
||||||
|
much-overlooked redundancy in maintaining a full-length token-level presentation, especially for tasks that only
|
||||||
|
require a single-vector presentation of the sequence. With this intuition, we propose Funnel-Transformer which
|
||||||
|
gradually compresses the sequence of hidden states to a shorter one and hence reduces the computation cost. More
|
||||||
|
importantly, by re-investing the saved FLOPs from length reduction in constructing a deeper or wider model, we further
|
||||||
|
improve the model capacity. In addition, to perform token-level predictions as required by common pretraining
|
||||||
|
objectives, Funnel-Transformer is able to recover a deep representation for each token from the reduced hidden sequence
|
||||||
|
via a decoder. Empirically, with comparable or fewer FLOPs, Funnel-Transformer outperforms the standard Transformer on
|
||||||
|
a wide variety of sequence-level prediction tasks, including text classification, language understanding, and reading
|
||||||
|
comprehension.*
|
||||||
|
|
||||||
|
Tips:
|
||||||
|
|
||||||
|
- Since Funnel Transformer uses pooling, the sequence length of the hidden states changes after each block of layers.
|
||||||
|
The base model therefore has a final sequence length that is a quarter of the original one. This model can be used
|
||||||
|
directly for tasks that just require a sentence summary (like sequence classification or multiple choice). For other
|
||||||
|
tasks, the full model is used; this full model has a decoder that upsamples the final hidden states to the same
|
||||||
|
sequence length as the input.
|
||||||
|
- The Funnel Transformer checkpoints are all available with a full version and a base version. The first ones should be
|
||||||
|
used for [`FunnelModel`], [`FunnelForPreTraining`],
|
||||||
|
[`FunnelForMaskedLM`], [`FunnelForTokenClassification`] and
|
||||||
|
[`FunnelForQuestionAnswering`]. The second ones should be used for
|
||||||
|
[`FunnelBaseModel`], [`FunnelForSequenceClassification`] and
|
||||||
|
[`FunnelForMultipleChoice`].
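
As a rough illustration of the first tip, the sketch below tracks how the sequence length shrinks from block to block. It assumes three blocks with a pooling stride of 2 between consecutive blocks, which is consistent with the "quarter of the original" figure above. `funnel_lengths` is a hypothetical helper, not part of the library, and it ignores any special handling of the classification token.

```python
import math


def funnel_lengths(seq_len, num_blocks=3, stride=2):
    """Return the sequence length seen by each block (hypothetical helper)."""
    lengths = [seq_len]
    for _ in range(num_blocks - 1):
        # Pooling between two blocks shortens the sequence by a factor of `stride`.
        lengths.append(math.ceil(lengths[-1] / stride))
    return lengths


lengths = funnel_lengths(512)
print(lengths)  # [512, 256, 128]
```

Under these assumptions the base model ends at length 128 for an input of 512. The full model's decoder then upsamples back to 512, which is why token-level heads need the full version.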
|
||||||
|
|
||||||
|
This model was contributed by [sgugger](https://huggingface.co/sgugger). The original code can be found [here](https://github.com/laiguokun/Funnel-Transformer).
|
||||||
|
|
||||||
|
|
||||||
|
## FunnelConfig
|
||||||
|
|
||||||
|
[[autodoc]] FunnelConfig
|
||||||
|
|
||||||
|
## FunnelTokenizer
|
||||||
|
|
||||||
|
[[autodoc]] FunnelTokenizer
|
||||||
|
- build_inputs_with_special_tokens
|
||||||
|
- get_special_tokens_mask
|
||||||
|
- create_token_type_ids_from_sequences
|
||||||
|
- save_vocabulary
|
||||||
|
|
||||||
|
## FunnelTokenizerFast
|
||||||
|
|
||||||
|
[[autodoc]] FunnelTokenizerFast
|
||||||
|
|
||||||
|
## Funnel specific outputs
|
||||||
|
|
||||||
|
[[autodoc]] models.funnel.modeling_funnel.FunnelForPreTrainingOutput
|
||||||
|
|
||||||
|
[[autodoc]] models.funnel.modeling_tf_funnel.TFFunnelForPreTrainingOutput
|
||||||
|
|
||||||
|
## FunnelBaseModel
|
||||||
|
|
||||||
|
[[autodoc]] FunnelBaseModel
|
||||||
|
- forward
|
||||||
|
|
||||||
|
## FunnelModel
|
||||||
|
|
||||||
|
[[autodoc]] FunnelModel
|
||||||
|
- forward
|
||||||
|
|
||||||
|
## FunnelForPreTraining
|
||||||
|
|
||||||
|
[[autodoc]] FunnelForPreTraining
|
||||||
|
- forward
|
||||||
|
|
||||||
|
## FunnelForMaskedLM
|
||||||
|
|
||||||
|
[[autodoc]] FunnelForMaskedLM
|
||||||
|
- forward
|
||||||
|
|
||||||
|
## FunnelForSequenceClassification
|
||||||
|
|
||||||
|
[[autodoc]] FunnelForSequenceClassification
|
||||||
|
- forward
|
||||||
|
|
||||||
|
## FunnelForMultipleChoice
|
||||||
|
|
||||||
|
[[autodoc]] FunnelForMultipleChoice
|
||||||
|
- forward
|
||||||
|
|
||||||
|
## FunnelForTokenClassification
|
||||||
|
|
||||||
|
[[autodoc]] FunnelForTokenClassification
|
||||||
|
- forward
|
||||||
|
|
||||||
|
## FunnelForQuestionAnswering
|
||||||
|
|
||||||
|
[[autodoc]] FunnelForQuestionAnswering
|
||||||
|
- forward
|
||||||
|
|
||||||
|
## TFFunnelBaseModel
|
||||||
|
|
||||||
|
[[autodoc]] TFFunnelBaseModel
|
||||||
|
- call
|
||||||
|
|
||||||
|
## TFFunnelModel
|
||||||
|
|
||||||
|
[[autodoc]] TFFunnelModel
|
||||||
|
- call
|
||||||
|
|
||||||
|
## TFFunnelForPreTraining
|
||||||
|
|
||||||
|
[[autodoc]] TFFunnelForPreTraining
|
||||||
|
- call
|
||||||
|
|
||||||
|
## TFFunnelForMaskedLM
|
||||||
|
|
||||||
|
[[autodoc]] TFFunnelForMaskedLM
|
||||||
|
- call
|
||||||
|
|
||||||
|
## TFFunnelForSequenceClassification
|
||||||
|
|
||||||
|
[[autodoc]] TFFunnelForSequenceClassification
|
||||||
|
- call
|
||||||
|
|
||||||
|
## TFFunnelForMultipleChoice
|
||||||
|
|
||||||
|
[[autodoc]] TFFunnelForMultipleChoice
|
||||||
|
- call
|
||||||
|
|
||||||
|
## TFFunnelForTokenClassification
|
||||||
|
|
||||||
|
[[autodoc]] TFFunnelForTokenClassification
|
||||||
|
- call
|
||||||
|
|
||||||
|
## TFFunnelForQuestionAnswering
|
||||||
|
|
||||||
|
[[autodoc]] TFFunnelForQuestionAnswering
|
||||||
|
- call
|
||||||
@@ -1,197 +0,0 @@
|
|||||||
..
|
|
||||||
Copyright 2020 The HuggingFace Team. All rights reserved.
|
|
||||||
|
|
||||||
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
|
|
||||||
the License. You may obtain a copy of the License at
|
|
||||||
|
|
||||||
http://www.apache.org/licenses/LICENSE-2.0
|
|
||||||
|
|
||||||
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
|
|
||||||
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
|
|
||||||
specific language governing permissions and limitations under the License.
|
|
||||||
|
|
||||||
Funnel Transformer
|
|
||||||
-----------------------------------------------------------------------------------------------------------------------
|
|
||||||
|
|
||||||
Overview
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
The Funnel Transformer model was proposed in the paper `Funnel-Transformer: Filtering out Sequential Redundancy for
|
|
||||||
Efficient Language Processing <https://arxiv.org/abs/2006.03236>`__. It is a bidirectional transformer model, like
|
|
||||||
BERT, but with a pooling operation after each block of layers, a bit like in traditional convolutional neural networks
|
|
||||||
(CNN) in computer vision.
|
|
||||||
|
|
||||||
The abstract from the paper is the following:
|
|
||||||
|
|
||||||
*With the success of language pretraining, it is highly desirable to develop more efficient architectures of good
|
|
||||||
scalability that can exploit the abundant unlabeled data at a lower cost. To improve the efficiency, we examine the
|
|
||||||
much-overlooked redundancy in maintaining a full-length token-level presentation, especially for tasks that only
|
|
||||||
require a single-vector presentation of the sequence. With this intuition, we propose Funnel-Transformer which
|
|
||||||
gradually compresses the sequence of hidden states to a shorter one and hence reduces the computation cost. More
|
|
||||||
importantly, by re-investing the saved FLOPs from length reduction in constructing a deeper or wider model, we further
|
|
||||||
improve the model capacity. In addition, to perform token-level predictions as required by common pretraining
|
|
||||||
objectives, Funnel-Transformer is able to recover a deep representation for each token from the reduced hidden sequence
|
|
||||||
via a decoder. Empirically, with comparable or fewer FLOPs, Funnel-Transformer outperforms the standard Transformer on
|
|
||||||
a wide variety of sequence-level prediction tasks, including text classification, language understanding, and reading
|
|
||||||
comprehension.*
|
|
||||||
|
|
||||||
Tips:
|
|
||||||
|
|
||||||
- Since Funnel Transformer uses pooling, the sequence length of the hidden states changes after each block of layers.
|
|
||||||
The base model therefore has a final sequence length that is a quarter of the original one. This model can be used
|
|
||||||
directly for tasks that just require a sentence summary (like sequence classification or multiple choice). For other
|
|
||||||
tasks, the full model is used; this full model has a decoder that upsamples the final hidden states to the same
|
|
||||||
sequence length as the input.
|
|
||||||
- The Funnel Transformer checkpoints are all available with a full version and a base version. The first ones should be
|
|
||||||
used for :class:`~transformers.FunnelModel`, :class:`~transformers.FunnelForPreTraining`,
|
|
||||||
:class:`~transformers.FunnelForMaskedLM`, :class:`~transformers.FunnelForTokenClassification` and
|
|
||||||
:class:`~transformers.FunnelForQuestionAnswering`. The second ones should be used for
|
|
||||||
:class:`~transformers.FunnelBaseModel`, :class:`~transformers.FunnelForSequenceClassification` and
|
|
||||||
:class:`~transformers.FunnelForMultipleChoice`.
|
|
||||||
|
|
||||||
This model was contributed by `sgugger <https://huggingface.co/sgugger>`__. The original code can be found `here
|
|
||||||
<https://github.com/laiguokun/Funnel-Transformer>`__.
|
|
||||||
|
|
||||||
|
|
||||||
FunnelConfig
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.FunnelConfig
|
|
||||||
:members:
|
|
||||||
|
|
||||||
|
|
||||||
FunnelTokenizer
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.FunnelTokenizer
|
|
||||||
:members: build_inputs_with_special_tokens, get_special_tokens_mask,
|
|
||||||
create_token_type_ids_from_sequences, save_vocabulary
|
|
||||||
|
|
||||||
|
|
||||||
FunnelTokenizerFast
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.FunnelTokenizerFast
|
|
||||||
:members:
|
|
||||||
|
|
||||||
|
|
||||||
Funnel specific outputs
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.models.funnel.modeling_funnel.FunnelForPreTrainingOutput
|
|
||||||
:members:
|
|
||||||
|
|
||||||
.. autoclass:: transformers.models.funnel.modeling_tf_funnel.TFFunnelForPreTrainingOutput
|
|
||||||
:members:
|
|
||||||
|
|
||||||
|
|
||||||
FunnelBaseModel
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.FunnelBaseModel
|
|
||||||
:members: forward
|
|
||||||
|
|
||||||
|
|
||||||
FunnelModel
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.FunnelModel
|
|
||||||
:members: forward
|
|
||||||
|
|
||||||
|
|
||||||
FunnelModelForPreTraining
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.FunnelForPreTraining
|
|
||||||
:members: forward
|
|
||||||
|
|
||||||
|
|
||||||
FunnelForMaskedLM
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.FunnelForMaskedLM
|
|
||||||
:members: forward
|
|
||||||
|
|
||||||
|
|
||||||
FunnelForSequenceClassification
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.FunnelForSequenceClassification
|
|
||||||
:members: forward
|
|
||||||
|
|
||||||
|
|
||||||
FunnelForMultipleChoice
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.FunnelForMultipleChoice
|
|
||||||
:members: forward
|
|
||||||
|
|
||||||
|
|
||||||
FunnelForTokenClassification
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.FunnelForTokenClassification
|
|
||||||
:members: forward
|
|
||||||
|
|
||||||
|
|
||||||
FunnelForQuestionAnswering
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.FunnelForQuestionAnswering
|
|
||||||
:members: forward
|
|
||||||
|
|
||||||
|
|
||||||
TFFunnelBaseModel
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.TFFunnelBaseModel
|
|
||||||
:members: call
|
|
||||||
|
|
||||||
|
|
||||||
TFFunnelModel
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.TFFunnelModel
|
|
||||||
:members: call
|
|
||||||
|
|
||||||
|
|
||||||
TFFunnelModelForPreTraining
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.TFFunnelForPreTraining
|
|
||||||
:members: call
|
|
||||||
|
|
||||||
|
|
||||||
TFFunnelForMaskedLM
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.TFFunnelForMaskedLM
|
|
||||||
:members: call
|
|
||||||
|
|
||||||
|
|
||||||
TFFunnelForSequenceClassification
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.TFFunnelForSequenceClassification
|
|
||||||
:members: call
|
|
||||||
|
|
||||||
|
|
||||||
TFFunnelForMultipleChoice
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.TFFunnelForMultipleChoice
|
|
||||||
:members: call
|
|
||||||
|
|
||||||
|
|
||||||
TFFunnelForTokenClassification
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.TFFunnelForTokenClassification
|
|
||||||
:members: call
|
|
||||||
|
|
||||||
|
|
||||||
TFFunnelForQuestionAnswering
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.TFFunnelForQuestionAnswering
|
|
||||||
:members: call
|
|
||||||
117
docs/source/model_doc/gpt.mdx
Normal file
@@ -0,0 +1,117 @@
|
|||||||
|
<!--Copyright 2020 The HuggingFace Team. All rights reserved.
|
||||||
|
|
||||||
|
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
|
||||||
|
the License. You may obtain a copy of the License at
|
||||||
|
|
||||||
|
http://www.apache.org/licenses/LICENSE-2.0
|
||||||
|
|
||||||
|
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
|
||||||
|
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
|
||||||
|
specific language governing permissions and limitations under the License.
|
||||||
|
-->
|
||||||
|
|
||||||
|
# OpenAI GPT
|
||||||
|
|
||||||
|
## Overview
|
||||||
|
|
||||||
|
The OpenAI GPT model was proposed in [Improving Language Understanding by Generative Pre-Training](https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf)
|
||||||
|
by Alec Radford, Karthik Narasimhan, Tim Salimans and Ilya Sutskever. It's a causal (unidirectional) transformer
|
||||||
|
pre-trained using language modeling on a large corpus with long range dependencies, the Toronto Book Corpus.
|
||||||
|
|
||||||
|
The abstract from the paper is the following:
|
||||||
|
|
||||||
|
*Natural language understanding comprises a wide range of diverse tasks such as textual entailment, question answering,
|
||||||
|
semantic similarity assessment, and document classification. Although large unlabeled text corpora are abundant,
|
||||||
|
labeled data for learning these specific tasks is scarce, making it challenging for discriminatively trained models to
|
||||||
|
perform adequately. We demonstrate that large gains on these tasks can be realized by generative pretraining of a
|
||||||
|
language model on a diverse corpus of unlabeled text, followed by discriminative fine-tuning on each specific task. In
|
||||||
|
contrast to previous approaches, we make use of task-aware input transformations during fine-tuning to achieve
|
||||||
|
effective transfer while requiring minimal changes to the model architecture. We demonstrate the effectiveness of our
|
||||||
|
approach on a wide range of benchmarks for natural language understanding. Our general task-agnostic model outperforms
|
||||||
|
discriminatively trained models that use architectures specifically crafted for each task, significantly improving upon
|
||||||
|
the state of the art in 9 out of the 12 tasks studied.*
|
||||||
|
|
||||||
|
Tips:
|
||||||
|
|
||||||
|
- GPT is a model with absolute position embeddings so it's usually advised to pad the inputs on the right rather than
|
||||||
|
the left.
|
||||||
|
- GPT was trained with a causal language modeling (CLM) objective and is therefore powerful at predicting the next
|
||||||
|
token in a sequence. Leveraging this feature allows GPT to generate syntactically coherent text, as can be
|
||||||
|
observed in the *run_generation.py* example script.
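As a rough illustration of the padding tip above, the sketch below right-pads a batch of token id lists in plain Python. The helper name `pad_right` and the pad id `0` are assumptions made for this example and are not part of the library; in practice the tokenizer's own padding options would handle this.

```python
from typing import List, Tuple


def pad_right(batch: List[List[int]], pad_id: int = 0) -> Tuple[List[List[int]], List[List[int]]]:
    """Right-pad every sequence to the longest length in the batch.

    Returns (input_ids, attention_mask). `pad_id=0` is a placeholder value
    chosen for this sketch only.
    """
    max_len = max(len(seq) for seq in batch)
    input_ids, attention_mask = [], []
    for seq in batch:
        n_pad = max_len - len(seq)
        # Real tokens keep positions 0..len(seq)-1; padding only occupies
        # the trailing positions and is masked out.
        input_ids.append(seq + [pad_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return input_ids, attention_mask


ids, mask = pad_right([[5, 6, 7], [8]])
print(ids)   # [[5, 6, 7], [8, 0, 0]]
print(mask)  # [[1, 1, 1], [1, 0, 0]]
```

With left padding, the real tokens of the shorter sequence would instead sit at positions 2 and later, which is why right padding is the usual advice for models with absolute position embeddings.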
|
||||||
|
|
||||||
|
[Write With Transformer](https://transformer.huggingface.co/doc/gpt) is a webapp created and hosted by Hugging Face
|
||||||
|
showcasing the generative capabilities of several models. GPT is one of them.
|
||||||
|
|
||||||
|
This model was contributed by [thomwolf](https://huggingface.co/thomwolf). The original code can be found [here](https://github.com/openai/finetune-transformer-lm).
|
||||||
|
|
||||||
|
Note:
|
||||||
|
|
||||||
|
If you want to reproduce the original tokenization process of the *OpenAI GPT* paper, you will need to install `ftfy`
|
||||||
|
and `SpaCy`:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
pip install spacy ftfy==4.4.3
|
||||||
|
python -m spacy download en
|
||||||
|
```
|
||||||
|
|
||||||
|
If you don't install `ftfy` and `SpaCy`, the [`OpenAIGPTTokenizer`] will fall back to tokenizing
|
||||||
|
using BERT's `BasicTokenizer` followed by Byte-Pair Encoding (which should be fine for most use cases).
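One way to see which tokenization path is likely to apply in a given environment is to check whether both optional packages are importable. This is a minimal sketch using only the standard library; the helper name `original_tokenization_available` is hypothetical, and the tokenizer performs its own check when it is created.

```python
import importlib.util


def original_tokenization_available() -> bool:
    """Return True if both `ftfy` and `spacy` can be imported here.

    Hypothetical helper for illustration; it only inspects the environment
    and does not touch the tokenizer itself.
    """
    return all(
        importlib.util.find_spec(name) is not None
        for name in ("ftfy", "spacy")
    )


if original_tokenization_available():
    print("ftfy and spaCy found: the original tokenization can be reproduced.")
else:
    print("ftfy and/or spaCy missing: expect the BasicTokenizer fallback.")
```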
|
||||||
|
|
||||||
|
## OpenAIGPTConfig
|
||||||
|
|
||||||
|
[[autodoc]] OpenAIGPTConfig
|
||||||
|
|
||||||
|
## OpenAIGPTTokenizer
|
||||||
|
|
||||||
|
[[autodoc]] OpenAIGPTTokenizer
|
||||||
|
- save_vocabulary
|
||||||
|
|
||||||
|
## OpenAIGPTTokenizerFast
|
||||||
|
|
||||||
|
[[autodoc]] OpenAIGPTTokenizerFast
|
||||||
|
|
||||||
|
## OpenAI specific outputs
|
||||||
|
|
||||||
|
[[autodoc]] models.openai.modeling_openai.OpenAIGPTDoubleHeadsModelOutput
|
||||||
|
|
||||||
|
[[autodoc]] models.openai.modeling_tf_openai.TFOpenAIGPTDoubleHeadsModelOutput
|
||||||
|
|
||||||
|
## OpenAIGPTModel
|
||||||
|
|
||||||
|
[[autodoc]] OpenAIGPTModel
|
||||||
|
- forward
|
||||||
|
|
||||||
|
## OpenAIGPTLMHeadModel
|
||||||
|
|
||||||
|
[[autodoc]] OpenAIGPTLMHeadModel
|
||||||
|
- forward
|
||||||
|
|
||||||
|
## OpenAIGPTDoubleHeadsModel
|
||||||
|
|
||||||
|
[[autodoc]] OpenAIGPTDoubleHeadsModel
|
||||||
|
- forward
|
||||||
|
|
||||||
|
## OpenAIGPTForSequenceClassification
|
||||||
|
|
||||||
|
[[autodoc]] OpenAIGPTForSequenceClassification
|
||||||
|
- forward
|
||||||
|
|
||||||
|
## TFOpenAIGPTModel
|
||||||
|
|
||||||
|
[[autodoc]] TFOpenAIGPTModel
|
||||||
|
- call
|
||||||
|
|
||||||
|
## TFOpenAIGPTLMHeadModel
|
||||||
|
|
||||||
|
[[autodoc]] TFOpenAIGPTLMHeadModel
|
||||||
|
- call
|
||||||
|
|
||||||
|
## TFOpenAIGPTDoubleHeadsModel
|
||||||
|
|
||||||
|
[[autodoc]] TFOpenAIGPTDoubleHeadsModel
|
||||||
|
- call
|
||||||
|
|
||||||
|
## TFOpenAIGPTForSequenceClassification
|
||||||
|
|
||||||
|
[[autodoc]] TFOpenAIGPTForSequenceClassification
|
||||||
|
- call
|
||||||
@@ -1,147 +0,0 @@
|
|||||||
..
|
|
||||||
Copyright 2020 The HuggingFace Team. All rights reserved.
|
|
||||||
|
|
||||||
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
|
|
||||||
the License. You may obtain a copy of the License at
|
|
||||||
|
|
||||||
http://www.apache.org/licenses/LICENSE-2.0
|
|
||||||
|
|
||||||
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
|
|
||||||
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
|
|
||||||
specific language governing permissions and limitations under the License.
|
|
||||||
|
|
||||||
OpenAI GPT
|
|
||||||
-----------------------------------------------------------------------------------------------------------------------
|
|
||||||
|
|
||||||
Overview
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
OpenAI GPT model was proposed in `Improving Language Understanding by Generative Pre-Training
|
|
||||||
<https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf>`__
|
|
||||||
by Alec Radford, Karthik Narasimhan, Tim Salimans and Ilya Sutskever. It's a causal (unidirectional) transformer
|
|
||||||
pre-trained using language modeling on a large corpus will long range dependencies, the Toronto Book Corpus.
|
|
||||||
|
|
||||||
The abstract from the paper is the following:
|
|
||||||
|
|
||||||
*Natural language understanding comprises a wide range of diverse tasks such as textual entailment, question answering,
|
|
||||||
semantic similarity assessment, and document classification. Although large unlabeled text corpora are abundant,
|
|
||||||
labeled data for learning these specific tasks is scarce, making it challenging for discriminatively trained models to
|
|
||||||
perform adequately. We demonstrate that large gains on these tasks can be realized by generative pretraining of a
|
|
||||||
language model on a diverse corpus of unlabeled text, followed by discriminative fine-tuning on each specific task. In
|
|
||||||
contrast to previous approaches, we make use of task-aware input transformations during fine-tuning to achieve
|
|
||||||
effective transfer while requiring minimal changes to the model architecture. We demonstrate the effectiveness of our
|
|
||||||
approach on a wide range of benchmarks for natural language understanding. Our general task-agnostic model outperforms
|
|
||||||
discriminatively trained models that use architectures specifically crafted for each task, significantly improving upon
|
|
||||||
the state of the art in 9 out of the 12 tasks studied.*
|
|
||||||
|
|
||||||
Tips:
|
|
||||||
|
|
||||||
- GPT is a model with absolute position embeddings so it's usually advised to pad the inputs on the right rather than
|
|
||||||
the left.
|
|
||||||
- GPT was trained with a causal language modeling (CLM) objective and is therefore powerful at predicting the next
|
|
||||||
token in a sequence. Leveraging this feature allows GPT-2 to generate syntactically coherent text as it can be
|
|
||||||
observed in the `run_generation.py` example script.
|
|
||||||
|
|
||||||
`Write With Transformer <https://transformer.huggingface.co/doc/gpt>`__ is a webapp created and hosted by Hugging Face
|
|
||||||
showcasing the generative capabilities of several models. GPT is one of them.
|
|
||||||
|
|
||||||
This model was contributed by `thomwolf <https://huggingface.co/thomwolf>`__. The original code can be found `here
|
|
||||||
<https://github.com/openai/finetune-transformer-lm>`__.
|
|
||||||
|
|
||||||
Note:
|
|
||||||
|
|
||||||
If you want to reproduce the original tokenization process of the `OpenAI GPT` paper, you will need to install ``ftfy``
|
|
||||||
and ``SpaCy``:
|
|
||||||
|
|
||||||
.. code-block:: bash
|
|
||||||
|
|
||||||
pip install spacy ftfy==4.4.3
|
|
||||||
python -m spacy download en
|
|
||||||
|
|
||||||
If you don't install ``ftfy`` and ``SpaCy``, the :class:`~transformers.OpenAIGPTTokenizer` will default to tokenize
|
|
||||||
using BERT's :obj:`BasicTokenizer` followed by Byte-Pair Encoding (which should be fine for most usage, don't worry).
|
|
||||||
|
|
||||||
OpenAIGPTConfig
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.OpenAIGPTConfig
|
|
||||||
:members:
|
|
||||||
|
|
||||||
|
|
||||||
OpenAIGPTTokenizer
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.OpenAIGPTTokenizer
|
|
||||||
:members: save_vocabulary
|
|
||||||
|
|
||||||
|
|
||||||
OpenAIGPTTokenizerFast
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.OpenAIGPTTokenizerFast
|
|
||||||
:members:
|
|
||||||
|
|
||||||
|
|
||||||
OpenAI specific outputs
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.models.openai.modeling_openai.OpenAIGPTDoubleHeadsModelOutput
|
|
||||||
:members:
|
|
||||||
|
|
||||||
.. autoclass:: transformers.models.openai.modeling_tf_openai.TFOpenAIGPTDoubleHeadsModelOutput
|
|
||||||
:members:
|
|
||||||
|
|
||||||
|
|
||||||
OpenAIGPTModel
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.OpenAIGPTModel
|
|
||||||
:members: forward
|
|
||||||
|
|
||||||
|
|
||||||
OpenAIGPTLMHeadModel
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.OpenAIGPTLMHeadModel
|
|
||||||
:members: forward
|
|
||||||
|
|
||||||
|
|
||||||
OpenAIGPTDoubleHeadsModel
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.OpenAIGPTDoubleHeadsModel
|
|
||||||
:members: forward
|
|
||||||
|
|
||||||
|
|
||||||
OpenAIGPTForSequenceClassification
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.OpenAIGPTForSequenceClassification
|
|
||||||
:members: forward
|
|
||||||
|
|
||||||
|
|
||||||
TFOpenAIGPTModel
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.TFOpenAIGPTModel
|
|
||||||
:members: call
|
|
||||||
|
|
||||||
|
|
||||||
TFOpenAIGPTLMHeadModel
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.TFOpenAIGPTLMHeadModel
|
|
||||||
:members: call
|
|
||||||
|
|
||||||
|
|
||||||
TFOpenAIGPTDoubleHeadsModel
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.TFOpenAIGPTDoubleHeadsModel
|
|
||||||
:members: call
|
|
||||||
|
|
||||||
TFOpenAIGPTForSequenceClassification
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.TFOpenAIGPTForSequenceClassification
|
|
||||||
:members: call
|
|
||||||
131
docs/source/model_doc/gpt2.mdx
Normal file
@@ -0,0 +1,131 @@
|
|||||||
|
<!--Copyright 2020 The HuggingFace Team. All rights reserved.
|
||||||
|
|
||||||
|
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
|
||||||
|
the License. You may obtain a copy of the License at
|
||||||
|
|
||||||
|
http://www.apache.org/licenses/LICENSE-2.0
|
||||||
|
|
||||||
|
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
|
||||||
|
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
|
||||||
|
specific language governing permissions and limitations under the License.
|
||||||
|
-->
|
||||||
|
|
||||||
|
# OpenAI GPT2
|
||||||
|
|
||||||
|
## Overview
|
||||||
|
|
||||||
|
OpenAI GPT-2 model was proposed in [Language Models are Unsupervised Multitask Learners](https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf) by Alec
|
||||||
|
Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei and Ilya Sutskever. It's a causal (unidirectional)
|
||||||
|
transformer pretrained using language modeling on a very large corpus of ~40 GB of text data.
|
||||||
|
|
||||||
|
The abstract from the paper is the following:
|
||||||
|
|
||||||
|
*GPT-2 is a large transformer-based language model with 1.5 billion parameters, trained on a dataset[1] of 8 million
|
||||||
|
web pages. GPT-2 is trained with a simple objective: predict the next word, given all of the previous words within some
|
||||||
|
text. The diversity of the dataset causes this simple goal to contain naturally occurring demonstrations of many tasks
|
||||||
|
across diverse domains. GPT-2 is a direct scale-up of GPT, with more than 10X the parameters and trained on more than
|
||||||
|
10X the amount of data.*
|
||||||
|
|
||||||
|
Tips:
|
||||||
|
|
||||||
|
- GPT-2 is a model with absolute position embeddings so it's usually advised to pad the inputs on the right rather than
|
||||||
|
the left.
|
||||||
|
- GPT-2 was trained with a causal language modeling (CLM) objective and is therefore powerful at predicting the next
|
||||||
|
token in a sequence. Leveraging this feature allows GPT-2 to generate syntactically coherent text as it can be
|
||||||
|
observed in the *run_generation.py* example script.
|
||||||
|
- The model can take the *past_key_values* (for PyTorch) or *past* (for TF) as input, which is the previously computed
|
||||||
|
key/value attention pairs. Using this (*past_key_values* or *past*) value prevents the model from re-computing
|
||||||
|
pre-computed values in the context of text generation. For PyTorch, see *past_key_values* argument of the
|
||||||
|
[`GPT2Model.forward`] method, or for TF the *past* argument of the
|
||||||
|
[`TFGPT2Model.call`] method for more information on its usage.
|
||||||
|
- Enabling the *scale_attn_by_inverse_layer_idx* and *reorder_and_upcast_attn* flags will apply the training stability
|
||||||
|
improvements from [Mistral](https://github.com/stanford-crfm/mistral/) (for PyTorch only).
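To make the caching tip above more concrete, here is a toy, pure-Python sketch of why reusing previously computed key/value pairs gives the same result as recomputing them at every generation step. Everything in it is an assumption for illustration: `project` stands in for learned linear projections, the cache is a plain dict, and there is a single attention head. It does not reproduce the library's actual `past_key_values` format.

```python
import math
from typing import Dict, List

Vector = List[float]


def project(token_vec: Vector, scale: float) -> Vector:
    # Stand-in for a learned projection (hypothetical, illustration only).
    return [scale * x for x in token_vec]


def attend(query: Vector, keys: List[Vector], values: List[Vector]) -> Vector:
    # Scaled dot-product attention of one query over all visible positions.
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(len(query)) for key in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w * v[i] for w, v in zip(weights, values)) / total for i in range(len(values[0]))]


def step_without_cache(tokens: List[Vector]) -> Vector:
    # Recompute keys and values for the whole prefix at every step.
    keys = [project(t, 0.5) for t in tokens]
    values = [project(t, 2.0) for t in tokens]
    return attend(project(tokens[-1], 1.0), keys, values)


def step_with_cache(new_token: Vector, cache: Dict[str, List[Vector]]) -> Vector:
    # Only the new token is projected; earlier keys/values are reused.
    cache["keys"].append(project(new_token, 0.5))
    cache["values"].append(project(new_token, 2.0))
    return attend(project(new_token, 1.0), cache["keys"], cache["values"])


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
cache: Dict[str, List[Vector]] = {"keys": [], "values": []}
cached_outputs = [step_with_cache(tok, cache) for tok in tokens]
full_outputs = [step_without_cache(tokens[: i + 1]) for i in range(len(tokens))]
print(cached_outputs == full_outputs)  # True
```

The cached variant does a constant amount of projection work per new token, while the uncached variant redoes the projections for the entire prefix each time.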
|
||||||
|
|
||||||
|
[Write With Transformer](https://transformer.huggingface.co/doc/gpt2-large) is a webapp created and hosted by
|
||||||
|
Hugging Face showcasing the generative capabilities of several models. GPT-2 is one of them and is available in five
|
||||||
|
different sizes: small, medium, large, xl and a distilled version of the small checkpoint: *distilgpt-2*.
|
||||||
|
|
||||||
|
This model was contributed by [thomwolf](https://huggingface.co/thomwolf). The original code can be found [here](https://openai.com/blog/better-language-models/).
|
||||||
|
|
||||||
|
|
||||||
|
## GPT2Config
|
||||||
|
|
||||||
|
[[autodoc]] GPT2Config
|
||||||
|
|
||||||
|
## GPT2Tokenizer
|
||||||
|
|
||||||
|
[[autodoc]] GPT2Tokenizer
|
||||||
|
- save_vocabulary
|
||||||
|
|
||||||
|
## GPT2TokenizerFast
|
||||||
|
|
||||||
|
[[autodoc]] GPT2TokenizerFast
|
||||||
|
|
||||||
|
## GPT2 specific outputs
|
||||||
|
|
||||||
|
[[autodoc]] models.gpt2.modeling_gpt2.GPT2DoubleHeadsModelOutput
|
||||||
|
|
||||||
|
[[autodoc]] models.gpt2.modeling_tf_gpt2.TFGPT2DoubleHeadsModelOutput
|
||||||
|
|
||||||
|
## GPT2Model
|
||||||
|
|
||||||
|
[[autodoc]] GPT2Model
|
||||||
|
- forward
|
||||||
|
- parallelize
|
||||||
|
- deparallelize
|
||||||
|
|
||||||
|
## GPT2LMHeadModel
|
||||||
|
|
||||||
|
[[autodoc]] GPT2LMHeadModel
|
||||||
|
- forward
|
||||||
|
- parallelize
|
||||||
|
- deparallelize
|
||||||
|
|
||||||
|
## GPT2DoubleHeadsModel
|
||||||
|
|
||||||
|
[[autodoc]] GPT2DoubleHeadsModel
|
||||||
|
- forward
|
||||||
|
|
||||||
|
## GPT2ForSequenceClassification
|
||||||
|
|
||||||
|
[[autodoc]] GPT2ForSequenceClassification
|
||||||
|
- forward
|
||||||
|
|
||||||
|
## GPT2ForTokenClassification
|
||||||
|
|
||||||
|
[[autodoc]] GPT2ForTokenClassification
|
||||||
|
- forward
|
||||||
|
|
||||||
|
## TFGPT2Model
|
||||||
|
|
||||||
|
[[autodoc]] TFGPT2Model
|
||||||
|
- call
|
||||||
|
|
||||||
|
## TFGPT2LMHeadModel
|
||||||
|
|
||||||
|
[[autodoc]] TFGPT2LMHeadModel
|
||||||
|
- call
|
||||||
|
|
||||||
|
## TFGPT2DoubleHeadsModel
|
||||||
|
|
||||||
|
[[autodoc]] TFGPT2DoubleHeadsModel
|
||||||
|
- call
|
||||||
|
|
||||||
|
## TFGPT2ForSequenceClassification
|
||||||
|
|
||||||
|
[[autodoc]] TFGPT2ForSequenceClassification
|
||||||
|
- call
|
||||||
|
|
||||||
|
## TFSequenceClassifierOutputWithPast
|
||||||
|
|
||||||
|
[[autodoc]] modeling_tf_outputs.TFSequenceClassifierOutputWithPast
|
||||||
|
|
||||||
|
## FlaxGPT2Model
|
||||||
|
|
||||||
|
[[autodoc]] FlaxGPT2Model
|
||||||
|
- __call__
|
||||||
|
|
||||||
|
## FlaxGPT2LMHeadModel
|
||||||
|
|
||||||
|
[[autodoc]] FlaxGPT2LMHeadModel
|
||||||
|
- __call__
|
||||||
@@ -1,165 +0,0 @@
|
|||||||
..
|
|
||||||
Copyright 2020 The HuggingFace Team. All rights reserved.
|
|
||||||
|
|
||||||
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
|
|
||||||
the License. You may obtain a copy of the License at
|
|
||||||
|
|
||||||
http://www.apache.org/licenses/LICENSE-2.0
|
|
||||||
|
|
||||||
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
|
|
||||||
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
|
|
||||||
specific language governing permissions and limitations under the License.
|
|
||||||
|
|
||||||
OpenAI GPT2
|
|
||||||
-----------------------------------------------------------------------------------------------------------------------
|
|
||||||
|
|
||||||
Overview
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
OpenAI GPT-2 model was proposed in `Language Models are Unsupervised Multitask Learners
|
|
||||||
<https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf>`_ by Alec
|
|
||||||
Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei and Ilya Sutskever. It's a causal (unidirectional)
|
|
||||||
transformer pretrained using language modeling on a very large corpus of ~40 GB of text data.
|
|
||||||
|
|
||||||
The abstract from the paper is the following:
|
|
||||||
|
|
||||||
*GPT-2 is a large transformer-based language model with 1.5 billion parameters, trained on a dataset[1] of 8 million
|
|
||||||
web pages. GPT-2 is trained with a simple objective: predict the next word, given all of the previous words within some
|
|
||||||
text. The diversity of the dataset causes this simple goal to contain naturally occurring demonstrations of many tasks
|
|
||||||
across diverse domains. GPT-2 is a direct scale-up of GPT, with more than 10X the parameters and trained on more than
|
|
||||||
10X the amount of data.*
|
|
||||||
|
|
||||||
Tips:
|
|
||||||
|
|
||||||
- GPT-2 is a model with absolute position embeddings so it's usually advised to pad the inputs on the right rather than
|
|
||||||
the left.
|
|
||||||
- GPT-2 was trained with a causal language modeling (CLM) objective and is therefore powerful at predicting the next
|
|
||||||
token in a sequence. Leveraging this feature allows GPT-2 to generate syntactically coherent text as it can be
|
|
||||||
observed in the `run_generation.py` example script.
|
|
||||||
- The model can take the `past_key_values` (for PyTorch) or `past` (for TF) as input, which is the previously computed
|
|
||||||
key/value attention pairs. Using this (`past_key_values` or `past`) value prevents the model from re-computing
|
|
||||||
pre-computed values in the context of text generation. For PyTorch, see `past_key_values` argument of the
|
|
||||||
:meth:`~transformers.GPT2Model.forward` method, or for TF the `past` argument of the
|
|
||||||
:meth:`~transformers.TFGPT2Model.call` method for more information on its usage.
|
|
||||||
- Enabling the `scale_attn_by_inverse_layer_idx` and `reorder_and_upcast_attn` flags will apply the training stability
|
|
||||||
improvements from `Mistral <https://github.com/stanford-crfm/mistral/>`__ (for PyTorch only).
|
|
||||||
|
|
||||||
`Write With Transformer <https://transformer.huggingface.co/doc/gpt2-large>`__ is a webapp created and hosted by
|
|
||||||
Hugging Face showcasing the generative capabilities of several models. GPT-2 is one of them and is available in five
|
|
||||||
different sizes: small, medium, large, xl and a distilled version of the small checkpoint: `distilgpt-2`.
|
|
||||||
|
|
||||||
This model was contributed by `thomwolf <https://huggingface.co/thomwolf>`__. The original code can be found `here
|
|
||||||
<https://openai.com/blog/better-language-models/>`__.
|
|
||||||
|
|
||||||
|
|
||||||
GPT2Config
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.GPT2Config
|
|
||||||
:members:
|
|
||||||
|
|
||||||
|
|
||||||
GPT2Tokenizer
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.GPT2Tokenizer
|
|
||||||
:members: save_vocabulary
|
|
||||||
|
|
||||||
|
|
||||||
GPT2TokenizerFast
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.GPT2TokenizerFast
|
|
||||||
:members:
|
|
||||||
|
|
||||||
|
|
||||||
GPT2 specific outputs
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.models.gpt2.modeling_gpt2.GPT2DoubleHeadsModelOutput
|
|
||||||
:members:
|
|
||||||
|
|
||||||
.. autoclass:: transformers.models.gpt2.modeling_tf_gpt2.TFGPT2DoubleHeadsModelOutput
|
|
||||||
:members:
|
|
||||||
|
|
||||||
|
|
||||||
GPT2Model
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.GPT2Model
|
|
||||||
:members: forward, parallelize, deparallelize
|
|
||||||
|
|
||||||
|
|
||||||
GPT2LMHeadModel
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.GPT2LMHeadModel
|
|
||||||
:members: forward, parallelize, deparallelize
|
|
||||||
|
|
||||||
|
|
||||||
GPT2DoubleHeadsModel
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.GPT2DoubleHeadsModel
|
|
||||||
:members: forward
|
|
||||||
|
|
||||||
|
|
||||||
GPT2ForSequenceClassification
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.GPT2ForSequenceClassification
|
|
||||||
:members: forward
|
|
||||||
|
|
||||||
|
|
||||||
GPT2ForTokenClassification
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.GPT2ForTokenClassification
|
|
||||||
:members: forward
|
|
||||||
|
|
||||||
|
|
||||||
TFGPT2Model
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.TFGPT2Model
|
|
||||||
:members: call
|
|
||||||
|
|
||||||
|
|
||||||
TFGPT2LMHeadModel
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.TFGPT2LMHeadModel
|
|
||||||
:members: call
|
|
||||||
|
|
||||||
|
|
||||||
TFGPT2DoubleHeadsModel
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.TFGPT2DoubleHeadsModel
|
|
||||||
:members: call
|
|
||||||
|
|
||||||
TFGPT2ForSequenceClassification
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.TFGPT2ForSequenceClassification
|
|
||||||
:members: call
|
|
||||||
|
|
||||||
TFSequenceClassifierOutputWithPast
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.modeling_tf_outputs.TFSequenceClassifierOutputWithPast
|
|
||||||
:members:
|
|
||||||
|
|
||||||
|
|
||||||
FlaxGPT2Model
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.FlaxGPT2Model
|
|
||||||
:members: __call__
|
|
||||||
|
|
||||||
|
|
||||||
FlaxGPT2LMHeadModel
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.FlaxGPT2LMHeadModel
|
|
||||||
:members: __call__
|
|
||||||
72
docs/source/model_doc/gpt_neo.mdx
Normal file
@@ -0,0 +1,72 @@
|
|||||||
|
<!--Copyright 2021 The HuggingFace Team. All rights reserved.
|
||||||
|
|
||||||
|
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
|
||||||
|
the License. You may obtain a copy of the License at
|
||||||
|
|
||||||
|
http://www.apache.org/licenses/LICENSE-2.0
|
||||||
|
|
||||||
|
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
|
||||||
|
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
|
||||||
|
specific language governing permissions and limitations under the License.
|
||||||
|
-->
|
||||||
|
|
||||||
|
# GPT Neo
|
||||||
|
|
||||||
|
## Overview
|
||||||
|
|
||||||
|
The GPTNeo model was released in the [EleutherAI/gpt-neo](https://github.com/EleutherAI/gpt-neo) repository by Sid
|
||||||
|
Black, Stella Biderman, Leo Gao, Phil Wang and Connor Leahy. It is a GPT2-like causal language model trained on the
|
||||||
|
[Pile](https://pile.eleuther.ai/) dataset.
|
||||||
|
|
||||||
|
The architecture is similar to GPT2 except that GPT Neo uses local attention in every other layer with a window size of
|
||||||
|
256 tokens.
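The local attention described above can be pictured as a causal mask that is additionally restricted to a sliding window. The following is a minimal sketch in plain Python, not the library's implementation; the function names are hypothetical, and the choice of which layers are global versus local (even-indexed global, odd-indexed local) is an assumption made for illustration.

```python
from typing import List


def local_causal_mask(seq_len: int, window: int) -> List[List[bool]]:
    """mask[i][j] is True when position i may attend to position j.

    A position sees itself and at most `window - 1` earlier positions.
    """
    return [
        [(j <= i) and (i - j < window) for j in range(seq_len)]
        for i in range(seq_len)
    ]


def alternating_attention_types(num_layers: int) -> List[str]:
    # Assumption for this sketch: even-indexed layers are global,
    # odd-indexed layers are local.
    return ["global" if layer % 2 == 0 else "local" for layer in range(num_layers)]


# A tiny window of 2 makes the pattern easy to read; GPT Neo uses 256.
for row in local_causal_mask(seq_len=4, window=2):
    print(["x" if allowed else "." for allowed in row])
print(alternating_attention_types(4))  # ['global', 'local', 'global', 'local']
```

With a window at least as large as the sequence length, the local mask reduces to the ordinary causal mask used by the global layers.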
|
||||||
|
|
||||||
|
This model was contributed by [valhalla](https://huggingface.co/valhalla).
|
||||||
|
|
||||||
|
### Generation
|
||||||
|
|
||||||
|
The `generate()` method can be used to generate text using the GPT Neo model.
|
||||||
|
|
||||||
|
```python
|
||||||
|
>>> from transformers import GPTNeoForCausalLM, GPT2Tokenizer
|
||||||
|
>>> model = GPTNeoForCausalLM.from_pretrained("EleutherAI/gpt-neo-1.3B")
|
||||||
|
>>> tokenizer = GPT2Tokenizer.from_pretrained("EleutherAI/gpt-neo-1.3B")
|
||||||
|
|
||||||
|
>>> prompt = "In a shocking finding, scientists discovered a herd of unicorns living in a remote, " \
|
||||||
|
... "previously unexplored valley, in the Andes Mountains. Even more surprising to the " \
|
||||||
|
... "researchers was the fact that the unicorns spoke perfect English."
|
||||||
|
|
||||||
|
>>> input_ids = tokenizer(prompt, return_tensors="pt").input_ids
|
||||||
|
|
||||||
|
>>> gen_tokens = model.generate(input_ids, do_sample=True, temperature=0.9, max_length=100)
|
||||||
|
>>> gen_text = tokenizer.batch_decode(gen_tokens)[0]
|
||||||
|
```
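For intuition about the `do_sample=True` and `temperature=0.9` arguments used in the example above, here is a simplified, standard-library sketch of temperature sampling over a made-up set of scores. The function names and the three-token vocabulary are assumptions for illustration; the library's generation code applies additional processing that is not shown here.

```python
import math
import random
from typing import List


def softmax_with_temperature(logits: List[float], temperature: float) -> List[float]:
    # Dividing by the temperature sharpens (<1) or flattens (>1) the distribution.
    scaled = [x / temperature for x in logits]
    top = max(scaled)
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_next_token(logits: List[float], temperature: float, rng: random.Random) -> int:
    # Draw one token index according to the temperature-adjusted probabilities.
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


logits = [2.0, 1.0, 0.1]  # made-up scores for a 3-token vocabulary
rng = random.Random(0)    # seeded so repeated runs draw the same token
token_id = sample_next_token(logits, temperature=0.9, rng=rng)
print(token_id)
```

Lower temperatures concentrate probability on the highest-scoring token, while higher temperatures make the draw closer to uniform.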
|
||||||
|
|
||||||
|
## GPTNeoConfig
|
||||||
|
|
||||||
|
[[autodoc]] GPTNeoConfig
|
||||||
|
|
||||||
|
## GPTNeoModel
|
||||||
|
|
||||||
|
[[autodoc]] GPTNeoModel
|
||||||
|
- forward
|
||||||
|
|
||||||
|
## GPTNeoForCausalLM
|
||||||
|
|
||||||
|
[[autodoc]] GPTNeoForCausalLM
|
||||||
|
- forward
|
||||||
|
|
||||||
|
## GPTNeoForSequenceClassification
|
||||||
|
|
||||||
|
[[autodoc]] GPTNeoForSequenceClassification
|
||||||
|
- forward
|
||||||
|
|
||||||
|
## FlaxGPTNeoModel
|
||||||
|
|
||||||
|
[[autodoc]] FlaxGPTNeoModel
|
||||||
|
- __call__
|
||||||
|
|
||||||
|
## FlaxGPTNeoForCausalLM
|
||||||
|
|
||||||
|
[[autodoc]] FlaxGPTNeoForCausalLM
|
||||||
|
- __call__
|
||||||
@@ -1,86 +0,0 @@
|
|||||||
..
|
|
||||||
Copyright 2021 The HuggingFace Team. All rights reserved.
|
|
||||||
|
|
||||||
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
|
|
||||||
the License. You may obtain a copy of the License at
|
|
||||||
|
|
||||||
http://www.apache.org/licenses/LICENSE-2.0
|
|
||||||
|
|
||||||
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
|
|
||||||
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
|
|
||||||
specific language governing permissions and limitations under the License.
|
|
||||||
|
|
||||||
GPT Neo
|
|
||||||
-----------------------------------------------------------------------------------------------------------------------
|
|
||||||
|
|
||||||
Overview
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
The GPTNeo model was released in the `EleutherAI/gpt-neo <https://github.com/EleutherAI/gpt-neo>`__ repository by Sid
|
|
||||||
Black, Stella Biderman, Leo Gao, Phil Wang and Connor Leahy. It is a GPT2 like causal language model trained on the
|
|
||||||
`Pile <https://pile.eleuther.ai/>`__ dataset.
|
|
||||||
|
|
||||||
The architecture is similar to GPT2 except that GPT Neo uses local attention in every other layer with a window size of
|
|
||||||
256 tokens.
|
|
||||||
|
|
||||||
This model was contributed by `valhalla <https://huggingface.co/valhalla>`__.
|
|
||||||
|
|
||||||
Generation
|
|
||||||
_______________________________________________________________________________________________________________________
|
|
||||||
|
|
||||||
The :obj:`generate()` method can be used to generate text using GPT Neo model.
|
|
||||||
|
|
||||||
.. code-block::
|
|
||||||
|
|
||||||
>>> from transformers import GPTNeoForCausalLM, GPT2Tokenizer
|
|
||||||
>>> model = GPTNeoForCausalLM.from_pretrained("EleutherAI/gpt-neo-1.3B")
|
|
||||||
>>> tokenizer = GPT2Tokenizer.from_pretrained("EleutherAI/gpt-neo-1.3B")
|
|
||||||
|
|
||||||
>>> prompt = "In a shocking finding, scientists discovered a herd of unicorns living in a remote, " \
|
|
||||||
... "previously unexplored valley, in the Andes Mountains. Even more surprising to the " \
|
|
||||||
... "researchers was the fact that the unicorns spoke perfect English."
|
|
||||||
|
|
||||||
>>> input_ids = tokenizer(prompt, return_tensors="pt").input_ids
|
|
||||||
|
|
||||||
>>> gen_tokens = model.generate(input_ids, do_sample=True, temperature=0.9, max_length=100,)
|
|
||||||
>>> gen_text = tokenizer.batch_decode(gen_tokens)[0]
|
|
||||||
|
|
||||||
|
|
||||||
GPTNeoConfig
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.GPTNeoConfig
|
|
||||||
:members:
|
|
||||||
|
|
||||||
|
|
||||||
GPTNeoModel
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.GPTNeoModel
|
|
||||||
:members: forward
|
|
||||||
|
|
||||||
|
|
||||||
GPTNeoForCausalLM
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.GPTNeoForCausalLM
|
|
||||||
:members: forward
|
|
||||||
|
|
||||||
GPTNeoForSequenceClassification
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.GPTNeoForSequenceClassification
|
|
||||||
:members: forward
|
|
||||||
|
|
||||||
FlaxGPTNeoModel
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.FlaxGPTNeoModel
|
|
||||||
:members: __call__
|
|
||||||
|
|
||||||
|
|
||||||
FlaxGPTNeoForCausalLM
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.FlaxGPTNeoForCausalLM
|
|
||||||
:members: __call__
|
|
||||||
124
docs/source/model_doc/gptj.mdx
Normal file
@@ -0,0 +1,124 @@
|
|||||||
|
<!--Copyright 2021 The HuggingFace Team. All rights reserved.
|
||||||
|
|
||||||
|
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
|
||||||
|
the License. You may obtain a copy of the License at
|
||||||
|
|
||||||
|
http://www.apache.org/licenses/LICENSE-2.0
|
||||||
|
|
||||||
|
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
|
||||||
|
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
|
||||||
|
specific language governing permissions and limitations under the License.
|
||||||
|
-->
|
||||||
|
|
||||||
|
# GPT-J
|
||||||
|
|
||||||
|
## Overview
|
||||||
|
|
||||||
|
The GPT-J model was released in the [kingoflolz/mesh-transformer-jax](https://github.com/kingoflolz/mesh-transformer-jax) repository by Ben Wang and Aran Komatsuzaki. It is a GPT-2-like
|
||||||
|
causal language model trained on [the Pile](https://pile.eleuther.ai/) dataset.
|
||||||
|
|
||||||
|
This model was contributed by [Stella Biderman](https://huggingface.co/stellaathena).
|
||||||
|
|
||||||
|
Tips:
|
||||||
|
|
||||||
|
- To load [GPT-J](https://huggingface.co/EleutherAI/gpt-j-6B) in float32 one would need at least 2x model size CPU
|
||||||
|
RAM: 1x for initial weights and another 1x to load the checkpoint. So for GPT-J it would take at least 48GB of CPU
|
||||||
|
RAM to just load the model. To reduce the CPU RAM usage there are a few options. The `torch_dtype` argument can be
|
||||||
|
used to initialize the model in half-precision. And the `low_cpu_mem_usage` argument can be used to keep the RAM
|
||||||
|
usage to 1x. There is also a [fp16 branch](https://huggingface.co/EleutherAI/gpt-j-6B/tree/float16) which stores
|
||||||
|
the fp16 weights, which could be used to further minimize the RAM usage. Combining all this it should take roughly
|
||||||
|
12.1GB of CPU RAM to load the model.
|
||||||
|
|
||||||
|
```python
|
||||||
|
>>> from transformers import GPTJForCausalLM
|
||||||
|
>>> import torch
|
||||||
|
|
||||||
|
>>> model = GPTJForCausalLM.from_pretrained("EleutherAI/gpt-j-6B", revision="float16", torch_dtype=torch.float16, low_cpu_mem_usage=True)
|
||||||
|
```
|
||||||
|
|
||||||
|
- The model should fit on a 16GB GPU for inference. For training/fine-tuning it would take much more GPU RAM. The Adam
|
||||||
|
optimizer for example makes four copies of the model: model, gradients, average and squared average of the gradients.
|
||||||
|
So it would need at least 4x model size GPU memory, even with mixed precision as gradient updates are in fp32. This
|
||||||
|
is not including the activations and data batches, which would again require some more GPU RAM. So one should explore
|
||||||
|
solutions such as DeepSpeed, to train/fine-tune the model. Another option is to use the original codebase to
|
||||||
|
train/fine-tune the model on TPU and then convert the model to Transformers format for inference. Instructions for
|
||||||
|
that can be found [here](https://github.com/kingoflolz/mesh-transformer-jax/blob/master/howto_finetune.md).
|
||||||
|
|
||||||
|
- Although the embedding matrix has a size of 50400, only 50257 entries are used by the GPT-2 tokenizer. These extra
|
||||||
|
tokens are added for the sake of efficiency on TPUs. To avoid the mismatch between embedding matrix size and vocab
|
||||||
|
size, the tokenizer for [GPT-J](https://huggingface.co/EleutherAI/gpt-j-6B) contains 143 extra tokens
|
||||||
|
`<|extratoken_1|>... <|extratoken_143|>`, so the `vocab_size` of the tokenizer also becomes 50400.
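
The figures quoted in the tips above can be reproduced with some quick arithmetic. The sketch below is only a back-of-the-envelope estimate: the parameter count of roughly 6.05 billion is an assumption that is not stated in the tips, and the helper name `model_size_gb` is made up for illustration. Actual memory use also depends on activations, batch size, and framework overhead.

```python
# Rough memory estimates for GPT-J (illustrative arithmetic only).
N_PARAMS = 6.05e9  # ASSUMPTION: approximate parameter count, not given in the tips
BYTES_FP32 = 4     # bytes per float32 weight
BYTES_FP16 = 2     # bytes per float16 weight
GB = 1e9


def model_size_gb(n_params, bytes_per_param):
    """Size of the raw weights in gigabytes (hypothetical helper)."""
    return n_params * bytes_per_param / GB


# float32 weights, and the default load path that needs about 2x that in CPU RAM
fp32_size = model_size_gb(N_PARAMS, BYTES_FP32)
default_load = 2 * fp32_size

# float16 weights loaded with low_cpu_mem_usage=True need about 1x the fp16 size
fp16_low_mem = model_size_gb(N_PARAMS, BYTES_FP16)

# Adam keeps weights, gradients, and two moment estimates: a lower bound of 4x
adam_lower_bound = 4 * fp32_size

# Embedding rows that the GPT-2 vocabulary does not cover
EMBEDDING_ROWS = 50400
GPT2_VOCAB_SIZE = 50257
extra_tokens = EMBEDDING_ROWS - GPT2_VOCAB_SIZE

print(f"float32 weights:        {fp32_size:.1f} GB")
print(f"default float32 load:   {default_load:.1f} GB")
print(f"float16 + low_cpu_mem:  {fp16_low_mem:.1f} GB")
print(f"Adam training (>=):     {adam_lower_bound:.1f} GB")
print(f"extra tokenizer tokens: {extra_tokens}")
```

Under these assumptions the numbers line up with the text: about 48GB for the default float32 load, about 12.1GB for the float16 checkpoint, and 143 extra tokens.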
|
||||||
|
|
||||||
|
### Generation
|
||||||
|
|
||||||
|
The [`~generation_utils.GenerationMixin.generate`] method can be used to generate text using the GPT-J
|
||||||
|
model.
|
||||||
|
|
||||||
|
```python
|
||||||
|
>>> from transformers import AutoModelForCausalLM, AutoTokenizer
|
||||||
|
>>> model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-j-6B")
|
||||||
|
>>> tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")
|
||||||
|
|
||||||
|
>>> prompt = "In a shocking finding, scientists discovered a herd of unicorns living in a remote, " \
|
||||||
|
... "previously unexplored valley, in the Andes Mountains. Even more surprising to the " \
|
||||||
|
... "researchers was the fact that the unicorns spoke perfect English."
|
||||||
|
|
||||||
|
>>> input_ids = tokenizer(prompt, return_tensors="pt").input_ids
|
||||||
|
|
||||||
|
>>> gen_tokens = model.generate(input_ids, do_sample=True, temperature=0.9, max_length=100)
|
||||||
|
>>> gen_text = tokenizer.batch_decode(gen_tokens)[0]
|
||||||
|
```
|
||||||
|
|
||||||
|
...or in float16 precision:
|
||||||
|
|
||||||
|
```python
|
||||||
|
>>> from transformers import GPTJForCausalLM, AutoTokenizer
|
||||||
|
>>> import torch
|
||||||
|
|
||||||
|
>>> model = GPTJForCausalLM.from_pretrained("EleutherAI/gpt-j-6B", torch_dtype=torch.float16)
|
||||||
|
>>> tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")
|
||||||
|
|
||||||
|
>>> prompt = "In a shocking finding, scientists discovered a herd of unicorns living in a remote, " \
|
||||||
|
... "previously unexplored valley, in the Andes Mountains. Even more surprising to the " \
|
||||||
|
... "researchers was the fact that the unicorns spoke perfect English."
|
||||||
|
|
||||||
|
>>> input_ids = tokenizer(prompt, return_tensors="pt").input_ids
|
||||||
|
|
||||||
|
>>> gen_tokens = model.generate(input_ids, do_sample=True, temperature=0.9, max_length=100)
|
||||||
|
>>> gen_text = tokenizer.batch_decode(gen_tokens)[0]
|
||||||
|
```
|
||||||
|
|
||||||
|
## GPTJConfig
|
||||||
|
|
||||||
|
[[autodoc]] GPTJConfig
|
||||||
|
- all
|
||||||
|
|
||||||
|
## GPTJModel
|
||||||
|
|
||||||
|
[[autodoc]] GPTJModel
|
||||||
|
- forward
|
||||||
|
|
||||||
|
## GPTJForCausalLM
|
||||||
|
|
||||||
|
[[autodoc]] GPTJForCausalLM
|
||||||
|
- forward
|
||||||
|
|
||||||
|
## GPTJForSequenceClassification
|
||||||
|
|
||||||
|
[[autodoc]] GPTJForSequenceClassification
|
||||||
|
- forward
|
||||||
|
|
||||||
|
## GPTJForQuestionAnswering
|
||||||
|
|
||||||
|
[[autodoc]] GPTJForQuestionAnswering
|
||||||
|
- forward
|
||||||
|
|
||||||
|
## FlaxGPTJModel
|
||||||
|
|
||||||
|
[[autodoc]] FlaxGPTJModel
|
||||||
|
- __call__
|
||||||
|
|
||||||
|
## FlaxGPTJForCausalLM
|
||||||
|
|
||||||
|
[[autodoc]] FlaxGPTJForCausalLM
|
||||||
|
- __call__
|
||||||
@@ -1,142 +0,0 @@
|
|||||||
..
|
|
||||||
Copyright 2021 The HuggingFace Team. All rights reserved.
|
|
||||||
|
|
||||||
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
|
|
||||||
the License. You may obtain a copy of the License at
|
|
||||||
|
|
||||||
http://www.apache.org/licenses/LICENSE-2.0
|
|
||||||
|
|
||||||
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
|
|
||||||
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
|
|
||||||
specific language governing permissions and limitations under the License.
|
|
||||||
|
|
||||||
GPT-J
|
|
||||||
-----------------------------------------------------------------------------------------------------------------------
|
|
||||||
|
|
||||||
Overview
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
The GPT-J model was released in the `kingoflolz/mesh-transformer-jax
|
|
||||||
<https://github.com/kingoflolz/mesh-transformer-jax>`__ repository by Ben Wang and Aran Komatsuzaki. It is a GPT-2-like
|
|
||||||
causal language model trained on `the Pile <https://pile.eleuther.ai/>`__ dataset.
|
|
||||||
|
|
||||||
This model was contributed by `Stella Biderman <https://huggingface.co/stellaathena>`__.
|
|
||||||
|
|
||||||
Tips:
|
|
||||||
|
|
||||||
- To load `GPT-J <https://huggingface.co/EleutherAI/gpt-j-6B>`__ in float32 one would need at least 2x model size CPU
|
|
||||||
RAM: 1x for initial weights and another 1x to load the checkpoint. So for GPT-J it would take at least 48GB of CPU
|
|
||||||
RAM to just load the model. To reduce the CPU RAM usage there are a few options. The ``torch_dtype`` argument can be
|
|
||||||
used to initialize the model in half-precision. And the ``low_cpu_mem_usage`` argument can be used to keep the RAM
|
|
||||||
usage to 1x. There is also a `fp16 branch <https://huggingface.co/EleutherAI/gpt-j-6B/tree/float16>`__ which stores
|
|
||||||
the fp16 weights, which could be used to further minimize the RAM usage. Combining all this it should take roughly
|
|
||||||
12.1GB of CPU RAM to load the model.
|
|
||||||
|
|
||||||
.. code-block::
|
|
||||||
|
|
||||||
>>> from transformers import GPTJForCausalLM
|
|
||||||
>>> import torch
|
|
||||||
|
|
||||||
>>> model = GPTJForCausalLM.from_pretrained("EleutherAI/gpt-j-6B", revision="float16", torch_dtype=torch.float16, low_cpu_mem_usage=True)
|
|
||||||
|
|
||||||
|
|
||||||
- The model should fit on a 16GB GPU for inference. For training/fine-tuning it would take much more GPU RAM. The Adam
|
|
||||||
optimizer for example makes four copies of the model: model, gradients, average and squared average of the gradients.
|
|
||||||
So it would need at least 4x model size GPU memory, even with mixed precision as gradient updates are in fp32. This
|
|
||||||
is not including the activations and data batches, which would again require some more GPU RAM. So one should explore
|
|
||||||
solutions such as DeepSpeed, to train/fine-tune the model. Another option is to use the original codebase to
|
|
||||||
train/fine-tune the model on TPU and then convert the model to Transformers format for inference. Instructions for
|
|
||||||
that can be found `here <https://github.com/kingoflolz/mesh-transformer-jax/blob/master/howto_finetune.md>`__.
|
|
||||||
|
|
||||||
- Although the embedding matrix has a size of 50400, only 50257 entries are used by the GPT-2 tokenizer. These extra
|
|
||||||
tokens are added for the sake of efficiency on TPUs. To avoid the mismatch between embedding matrix size and vocab
|
|
||||||
size, the tokenizer for `GPT-J <https://huggingface.co/EleutherAI/gpt-j-6B>`__ contains 143 extra tokens
|
|
||||||
``<|extratoken_1|>... <|extratoken_143|>``, so the ``vocab_size`` of the tokenizer also becomes 50400.
|
|
||||||
|
|
||||||
Generation
|
|
||||||
_______________________________________________________________________________________________________________________
|
|
||||||
|
|
||||||
The :meth:`~transformers.generation_utils.GenerationMixin.generate` method can be used to generate text using the GPT-J
|
|
||||||
model.
|
|
||||||
|
|
||||||
.. code-block::
|
|
||||||
|
|
||||||
>>> from transformers import AutoModelForCausalLM, AutoTokenizer
|
|
||||||
>>> model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-j-6B")
|
|
||||||
>>> tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")
|
|
||||||
|
|
||||||
>>> prompt = "In a shocking finding, scientists discovered a herd of unicorns living in a remote, " \
|
|
||||||
... "previously unexplored valley, in the Andes Mountains. Even more surprising to the " \
|
|
||||||
... "researchers was the fact that the unicorns spoke perfect English."
|
|
||||||
|
|
||||||
>>> input_ids = tokenizer(prompt, return_tensors="pt").input_ids
|
|
||||||
|
|
||||||
>>> gen_tokens = model.generate(input_ids, do_sample=True, temperature=0.9, max_length=100,)
|
|
||||||
>>> gen_text = tokenizer.batch_decode(gen_tokens)[0]
|
|
||||||
|
|
||||||
...or in float16 precision:
|
|
||||||
|
|
||||||
.. code-block::
|
|
||||||
|
|
||||||
>>> from transformers import GPTJForCausalLM, AutoTokenizer
|
|
||||||
>>> import torch
|
|
||||||
|
|
||||||
>>> model = GPTJForCausalLM.from_pretrained("EleutherAI/gpt-j-6B", torch_dtype=torch.float16)
|
|
||||||
>>> tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")
|
|
||||||
|
|
||||||
>>> prompt = "In a shocking finding, scientists discovered a herd of unicorns living in a remote, " \
|
|
||||||
... "previously unexplored valley, in the Andes Mountains. Even more surprising to the " \
|
|
||||||
... "researchers was the fact that the unicorns spoke perfect English."
|
|
||||||
|
|
||||||
>>> input_ids = tokenizer(prompt, return_tensors="pt").input_ids
|
|
||||||
|
|
||||||
>>> gen_tokens = model.generate(input_ids, do_sample=True, temperature=0.9, max_length=100,)
|
|
||||||
>>> gen_text = tokenizer.batch_decode(gen_tokens)[0]
|
|
||||||
|
|
||||||
|
|
||||||
GPTJConfig
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.GPTJConfig
|
|
||||||
:members:
|
|
||||||
|
|
||||||
GPTJModel
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.GPTJModel
|
|
||||||
:members: forward
|
|
||||||
|
|
||||||
|
|
||||||
GPTJForCausalLM
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.GPTJForCausalLM
|
|
||||||
:members: forward
|
|
||||||
|
|
||||||
|
|
||||||
GPTJForSequenceClassification
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.GPTJForSequenceClassification
|
|
||||||
:members: forward
|
|
||||||
|
|
||||||
|
|
||||||
GPTJForQuestionAnswering
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.GPTJForQuestionAnswering
|
|
||||||
:members: forward
|
|
||||||
|
|
||||||
|
|
||||||
FlaxGPTJModel
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.FlaxGPTJModel
|
|
||||||
:members: __call__
|
|
||||||
|
|
||||||
|
|
||||||
FlaxGPTJForCausalLM
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.FlaxGPTJForCausalLM
|
|
||||||
:members: __call__
|
|
||||||
65
docs/source/model_doc/herbert.mdx
Normal file
@@ -0,0 +1,65 @@
|
|||||||
|
<!--Copyright 2020 The HuggingFace Team. All rights reserved.
|
||||||
|
|
||||||
|
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
|
||||||
|
the License. You may obtain a copy of the License at
|
||||||
|
|
||||||
|
http://www.apache.org/licenses/LICENSE-2.0
|
||||||
|
|
||||||
|
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
|
||||||
|
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
|
||||||
|
specific language governing permissions and limitations under the License.
|
||||||
|
-->
|
||||||
|
|
||||||
|
# HerBERT
|
||||||
|
|
||||||
|
## Overview
|
||||||
|
|
||||||
|
The HerBERT model was proposed in [KLEJ: Comprehensive Benchmark for Polish Language Understanding](https://www.aclweb.org/anthology/2020.acl-main.111.pdf) by Piotr Rybak, Robert Mroczkowski, Janusz Tracz, and
|
||||||
|
Ireneusz Gawlik. It is a BERT-based language model trained on Polish corpora using only the MLM objective with dynamic
|
||||||
|
masking of whole words.
|
||||||
|
|
||||||
|
The abstract from the paper is the following:
|
||||||
|
|
||||||
|
*In recent years, a series of Transformer-based models unlocked major improvements in general natural language
|
||||||
|
understanding (NLU) tasks. Such a fast pace of research would not be possible without general NLU benchmarks, which
|
||||||
|
allow for a fair comparison of the proposed methods. However, such benchmarks are available only for a handful of
|
||||||
|
languages. To alleviate this issue, we introduce a comprehensive multi-task benchmark for the Polish language
|
||||||
|
understanding, accompanied by an online leaderboard. It consists of a diverse set of tasks, adopted from existing
|
||||||
|
datasets for named entity recognition, question-answering, textual entailment, and others. We also introduce a new
|
||||||
|
sentiment analysis task for the e-commerce domain, named Allegro Reviews (AR). To ensure a common evaluation scheme and
|
||||||
|
promote models that generalize to different NLU tasks, the benchmark includes datasets from varying domains and
|
||||||
|
applications. Additionally, we release HerBERT, a Transformer-based model trained specifically for the Polish language,
|
||||||
|
which has the best average performance and obtains the best results for three out of nine tasks. Finally, we provide an
|
||||||
|
extensive evaluation, including several standard baselines and recently proposed, multilingual Transformer-based
|
||||||
|
models.*
|
||||||
|
|
||||||
|
Examples of use:
|
||||||
|
|
||||||
|
```python
|
||||||
|
>>> from transformers import HerbertTokenizer, RobertaModel
|
||||||
|
|
||||||
|
>>> tokenizer = HerbertTokenizer.from_pretrained("allegro/herbert-klej-cased-tokenizer-v1")
|
||||||
|
>>> model = RobertaModel.from_pretrained("allegro/herbert-klej-cased-v1")
|
||||||
|
|
||||||
|
>>> encoded_input = tokenizer.encode("Kto ma lepszą sztukę, ma lepszy rząd – to jasne.", return_tensors='pt')
|
||||||
|
>>> outputs = model(encoded_input)
|
||||||
|
|
||||||
|
>>> # HerBERT can also be loaded using AutoTokenizer and AutoModel:
|
||||||
|
>>> import torch
|
||||||
|
>>> from transformers import AutoModel, AutoTokenizer
|
||||||
|
|
||||||
|
>>> tokenizer = AutoTokenizer.from_pretrained("allegro/herbert-klej-cased-tokenizer-v1")
|
||||||
|
>>> model = AutoModel.from_pretrained("allegro/herbert-klej-cased-v1")
|
||||||
|
```
|
||||||
|
|
||||||
|
This model was contributed by [rmroczkowski](https://huggingface.co/rmroczkowski). The original code can be found
|
||||||
|
[here](https://github.com/allegro/HerBERT).
|
||||||
|
|
||||||
|
|
||||||
|
## HerbertTokenizer
|
||||||
|
|
||||||
|
[[autodoc]] HerbertTokenizer
|
||||||
|
|
||||||
|
## HerbertTokenizerFast
|
||||||
|
|
||||||
|
[[autodoc]] HerbertTokenizerFast
|
||||||
@@ -1,73 +0,0 @@
|
|||||||
..
|
|
||||||
Copyright 2020 The HuggingFace Team. All rights reserved.
|
|
||||||
|
|
||||||
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
|
|
||||||
the License. You may obtain a copy of the License at
|
|
||||||
|
|
||||||
http://www.apache.org/licenses/LICENSE-2.0
|
|
||||||
|
|
||||||
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
|
|
||||||
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
|
|
||||||
specific language governing permissions and limitations under the License.
|
|
||||||
|
|
||||||
HerBERT
|
|
||||||
-----------------------------------------------------------------------------------------------------------------------
|
|
||||||
|
|
||||||
Overview
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
The HerBERT model was proposed in `KLEJ: Comprehensive Benchmark for Polish Language Understanding
|
|
||||||
<https://www.aclweb.org/anthology/2020.acl-main.111.pdf>`__ by Piotr Rybak, Robert Mroczkowski, Janusz Tracz, and
|
|
||||||
Ireneusz Gawlik. It is a BERT-based language model trained on Polish corpora using only the MLM objective with dynamic
|
|
||||||
masking of whole words.
|
|
||||||
|
|
||||||
The abstract from the paper is the following:
|
|
||||||
|
|
||||||
*In recent years, a series of Transformer-based models unlocked major improvements in general natural language
|
|
||||||
understanding (NLU) tasks. Such a fast pace of research would not be possible without general NLU benchmarks, which
|
|
||||||
allow for a fair comparison of the proposed methods. However, such benchmarks are available only for a handful of
|
|
||||||
languages. To alleviate this issue, we introduce a comprehensive multi-task benchmark for the Polish language
|
|
||||||
understanding, accompanied by an online leaderboard. It consists of a diverse set of tasks, adopted from existing
|
|
||||||
datasets for named entity recognition, question-answering, textual entailment, and others. We also introduce a new
|
|
||||||
sentiment analysis task for the e-commerce domain, named Allegro Reviews (AR). To ensure a common evaluation scheme and
|
|
||||||
promote models that generalize to different NLU tasks, the benchmark includes datasets from varying domains and
|
|
||||||
applications. Additionally, we release HerBERT, a Transformer-based model trained specifically for the Polish language,
|
|
||||||
which has the best average performance and obtains the best results for three out of nine tasks. Finally, we provide an
|
|
||||||
extensive evaluation, including several standard baselines and recently proposed, multilingual Transformer-based
|
|
||||||
models.*
|
|
||||||
|
|
||||||
Examples of use:
|
|
||||||
|
|
||||||
.. code-block::
|
|
||||||
|
|
||||||
>>> from transformers import HerbertTokenizer, RobertaModel
|
|
||||||
|
|
||||||
>>> tokenizer = HerbertTokenizer.from_pretrained("allegro/herbert-klej-cased-tokenizer-v1")
|
|
||||||
>>> model = RobertaModel.from_pretrained("allegro/herbert-klej-cased-v1")
|
|
||||||
|
|
||||||
>>> encoded_input = tokenizer.encode("Kto ma lepszą sztukę, ma lepszy rząd – to jasne.", return_tensors='pt')
|
|
||||||
>>> outputs = model(encoded_input)
|
|
||||||
|
|
||||||
>>> # HerBERT can also be loaded using AutoTokenizer and AutoModel:
|
|
||||||
>>> import torch
|
|
||||||
>>> from transformers import AutoModel, AutoTokenizer
|
|
||||||
|
|
||||||
>>> tokenizer = AutoTokenizer.from_pretrained("allegro/herbert-klej-cased-tokenizer-v1")
|
|
||||||
>>> model = AutoModel.from_pretrained("allegro/herbert-klej-cased-v1")
|
|
||||||
|
|
||||||
|
|
||||||
This model was contributed by `rmroczkowski <https://huggingface.co/rmroczkowski>`__. The original code can be found
|
|
||||||
`here <https://github.com/allegro/HerBERT>`__.
|
|
||||||
|
|
||||||
|
|
||||||
HerbertTokenizer
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.HerbertTokenizer
|
|
||||||
:members:
|
|
||||||
|
|
||||||
HerbertTokenizerFast
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.HerbertTokenizerFast
|
|
||||||
:members:
|
|
||||||
71
docs/source/model_doc/hubert.mdx
Normal file
@@ -0,0 +1,71 @@
|
|||||||
|
<!--Copyright 2021 The HuggingFace Team. All rights reserved.
|
||||||
|
|
||||||
|
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
|
||||||
|
the License. You may obtain a copy of the License at
|
||||||
|
|
||||||
|
http://www.apache.org/licenses/LICENSE-2.0
|
||||||
|
|
||||||
|
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
|
||||||
|
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
|
||||||
|
specific language governing permissions and limitations under the License.
|
||||||
|
-->
|
||||||
|
|
||||||
|
# Hubert
|
||||||
|
|
||||||
|
## Overview
|
||||||
|
|
||||||
|
Hubert was proposed in [HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units](https://arxiv.org/abs/2106.07447) by Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan
|
||||||
|
Salakhutdinov, and Abdelrahman Mohamed.
|
||||||
|
|
||||||
|
The abstract from the paper is the following:
|
||||||
|
|
||||||
|
*Self-supervised approaches for speech representation learning are challenged by three unique problems: (1) there are
|
||||||
|
multiple sound units in each input utterance, (2) there is no lexicon of input sound units during the pre-training
|
||||||
|
phase, and (3) sound units have variable lengths with no explicit segmentation. To deal with these three problems, we
|
||||||
|
propose the Hidden-Unit BERT (HuBERT) approach for self-supervised speech representation learning, which utilizes an
|
||||||
|
offline clustering step to provide aligned target labels for a BERT-like prediction loss. A key ingredient of our
|
||||||
|
approach is applying the prediction loss over the masked regions only, which forces the model to learn a combined
|
||||||
|
acoustic and language model over the continuous inputs. HuBERT relies primarily on the consistency of the unsupervised
|
||||||
|
clustering step rather than the intrinsic quality of the assigned cluster labels. Starting with a simple k-means
|
||||||
|
teacher of 100 clusters, and using two iterations of clustering, the HuBERT model either matches or improves upon the
|
||||||
|
state-of-the-art wav2vec 2.0 performance on the Librispeech (960h) and Libri-light (60,000h) benchmarks with 10min, 1h,
|
||||||
|
10h, 100h, and 960h fine-tuning subsets. Using a 1B parameter model, HuBERT shows up to 19% and 13% relative WER
|
||||||
|
reduction on the more challenging dev-other and test-other evaluation subsets.*
|
||||||
|
|
||||||
|
Tips:
|
||||||
|
|
||||||
|
- Hubert is a speech model that accepts a float array corresponding to the raw waveform of the speech signal.
|
||||||
|
- The Hubert model was fine-tuned using connectionist temporal classification (CTC), so the model output has to be decoded
|
||||||
|
using [`Wav2Vec2CTCTokenizer`].
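
To make the decoding step in the tips above concrete, here is a minimal sketch of greedy CTC decoding in plain Python. It is a simplified illustration of the general technique, not the implementation used by `Wav2Vec2CTCTokenizer`: the function name `ctc_greedy_decode`, the toy vocabulary, and the choice of id 0 as the blank token are all assumptions made for this example.

```python
def ctc_greedy_decode(frame_ids, id_to_char, blank_id=0):
    """Collapse a per-frame sequence of predicted ids into text.

    Greedy CTC decoding merges consecutive repeats of the same id and
    then drops the blank token. (Hypothetical helper for illustration.)
    """
    decoded = []
    previous = None
    for idx in frame_ids:
        # Emit a character only when the id changes and is not the blank
        if idx != previous and idx != blank_id:
            decoded.append(id_to_char[idx])
        previous = idx
    return "".join(decoded)


# Toy vocabulary (assumed): id 0 plays the role of the CTC blank token
toy_vocab = {0: "<pad>", 1: "h", 2: "e", 3: "l", 4: "o"}

# Per-frame argmax ids, as they might come out of a CTC head
frames = [1, 1, 0, 2, 2, 3, 0, 3, 3, 4, 0]

text = ctc_greedy_decode(frames, toy_vocab)
print(text)
```

The blank between the two runs of `3` is what allows the doubled letter to survive the merge step; without it the repeats would collapse into a single character.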
|
||||||
|
|
||||||
|
This model was contributed by [patrickvonplaten](https://huggingface.co/patrickvonplaten).
|
||||||
|
|
||||||
|
|
||||||
|
## HubertConfig
|
||||||
|
|
||||||
|
[[autodoc]] HubertConfig
|
||||||
|
|
||||||
|
## HubertModel
|
||||||
|
|
||||||
|
[[autodoc]] HubertModel
|
||||||
|
- forward
|
||||||
|
|
||||||
|
## HubertForCTC
|
||||||
|
|
||||||
|
[[autodoc]] HubertForCTC
|
||||||
|
- forward
|
||||||
|
|
||||||
|
## HubertForSequenceClassification
|
||||||
|
|
||||||
|
[[autodoc]] HubertForSequenceClassification
|
||||||
|
- forward
|
||||||
|
|
||||||
|
## TFHubertModel
|
||||||
|
|
||||||
|
[[autodoc]] TFHubertModel
|
||||||
|
- call
|
||||||
|
|
||||||
|
## TFHubertForCTC
|
||||||
|
|
||||||
|
[[autodoc]] TFHubertForCTC
|
||||||
|
- call
|
||||||
@@ -1,86 +0,0 @@
|
|||||||
..
|
|
||||||
Copyright 2021 The HuggingFace Team. All rights reserved.
|
|
||||||
|
|
||||||
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
|
|
||||||
the License. You may obtain a copy of the License at
|
|
||||||
|
|
||||||
http://www.apache.org/licenses/LICENSE-2.0
|
|
||||||
|
|
||||||
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
|
|
||||||
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
|
|
||||||
specific language governing permissions and limitations under the License.
|
|
||||||
|
|
||||||
Hubert
|
|
||||||
-----------------------------------------------------------------------------------------------------------------------
|
|
||||||
|
|
||||||
Overview
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
Hubert was proposed in `HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units
|
|
||||||
<https://arxiv.org/abs/2106.07447>`__ by Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan
|
|
||||||
Salakhutdinov, and Abdelrahman Mohamed.
|
|
||||||
|
|
||||||
The abstract from the paper is the following:
|
|
||||||
|
|
||||||
*Self-supervised approaches for speech representation learning are challenged by three unique problems: (1) there are
|
|
||||||
multiple sound units in each input utterance, (2) there is no lexicon of input sound units during the pre-training
|
|
||||||
phase, and (3) sound units have variable lengths with no explicit segmentation. To deal with these three problems, we
|
|
||||||
propose the Hidden-Unit BERT (HuBERT) approach for self-supervised speech representation learning, which utilizes an
|
|
||||||
offline clustering step to provide aligned target labels for a BERT-like prediction loss. A key ingredient of our
|
|
||||||
approach is applying the prediction loss over the masked regions only, which forces the model to learn a combined
|
|
||||||
acoustic and language model over the continuous inputs. HuBERT relies primarily on the consistency of the unsupervised
|
|
||||||
clustering step rather than the intrinsic quality of the assigned cluster labels. Starting with a simple k-means
|
|
||||||
teacher of 100 clusters, and using two iterations of clustering, the HuBERT model either matches or improves upon the
|
|
||||||
state-of-the-art wav2vec 2.0 performance on the Librispeech (960h) and Libri-light (60,000h) benchmarks with 10min, 1h,
|
|
||||||
10h, 100h, and 960h fine-tuning subsets. Using a 1B parameter model, HuBERT shows up to 19% and 13% relative WER
|
|
||||||
reduction on the more challenging dev-other and test-other evaluation subsets.*
|
|
||||||
|
|
||||||
Tips:
|
|
||||||
|
|
||||||
- Hubert is a speech model that accepts a float array corresponding to the raw waveform of the speech signal.
|
|
||||||
- The Hubert model was fine-tuned using connectionist temporal classification (CTC), so the model output has to be decoded
|
|
||||||
using :class:`~transformers.Wav2Vec2CTCTokenizer`.
|
|
||||||
|
|
||||||
This model was contributed by `patrickvonplaten <https://huggingface.co/patrickvonplaten>`__.
|
|
||||||
|
|
||||||
|
|
||||||
HubertConfig
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.HubertConfig
|
|
||||||
:members:
|
|
||||||
|
|
||||||
|
|
||||||
HubertModel
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.HubertModel
|
|
||||||
:members: forward
|
|
||||||
|
|
||||||
|
|
||||||
HubertForCTC
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.HubertForCTC
|
|
||||||
:members: forward
|
|
||||||
|
|
||||||
|
|
||||||
HubertForSequenceClassification
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.HubertForSequenceClassification
|
|
||||||
:members: forward
|
|
||||||
|
|
||||||
|
|
||||||
TFHubertModel
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.TFHubertModel
|
|
||||||
:members: call
|
|
||||||
|
|
||||||
|
|
||||||
TFHubertForCTC
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.TFHubertForCTC
|
|
||||||
:members: call
|
|
||||||
72
docs/source/model_doc/ibert.mdx
Normal file
@@ -0,0 +1,72 @@
<!--Copyright 2020 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->

# I-BERT

## Overview

The I-BERT model was proposed in [I-BERT: Integer-only BERT Quantization](https://arxiv.org/abs/2101.01321) by
Sehoon Kim, Amir Gholami, Zhewei Yao, Michael W. Mahoney and Kurt Keutzer. It's a quantized version of RoBERTa running
inference up to four times faster.

The abstract from the paper is the following:

*Transformer based models, like BERT and RoBERTa, have achieved state-of-the-art results in many Natural Language
Processing tasks. However, their memory footprint, inference latency, and power consumption are prohibitive for
efficient inference at the edge, and even at the data center. While quantization can be a viable solution for this,
previous work on quantizing Transformer based models use floating-point arithmetic during inference, which cannot
efficiently utilize integer-only logical units such as the recent Turing Tensor Cores, or traditional integer-only ARM
processors. In this work, we propose I-BERT, a novel quantization scheme for Transformer based models that quantizes
the entire inference with integer-only arithmetic. Based on lightweight integer-only approximation methods for
nonlinear operations, e.g., GELU, Softmax, and Layer Normalization, I-BERT performs an end-to-end integer-only BERT
inference without any floating point calculation. We evaluate our approach on GLUE downstream tasks using
RoBERTa-Base/Large. We show that for both cases, I-BERT achieves similar (and slightly higher) accuracy as compared to
the full-precision baseline. Furthermore, our preliminary implementation of I-BERT shows a speedup of 2.4 - 4.0x for
INT8 inference on a T4 GPU system as compared to FP32 inference. The framework has been developed in PyTorch and has
been open-sourced.*

This model was contributed by [kssteven](https://huggingface.co/kssteven). The original code can be found [here](https://github.com/kssteven418/I-BERT).

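
To give a feel for what "integer-only arithmetic" means in the abstract above, here is a minimal sketch of symmetric 8-bit quantization followed by an integer dot product. It is not I-BERT's implementation: the helper names `quantize` and `int_dot` and the sample values are made up for illustration, and the integer approximations of GELU, Softmax and Layer Normalization mentioned in the abstract are not covered.

```python
def quantize(values, num_bits=8):
    """Symmetric uniform quantization: floats -> signed integers plus one float scale."""
    q_max = 2 ** (num_bits - 1) - 1  # 127 for 8 bits
    max_abs = max(abs(v) for v in values)
    scale = max_abs / q_max if max_abs > 0 else 1.0
    # Round to the nearest integer and clamp to the representable range.
    q_values = [max(-q_max, min(q_max, round(v / scale))) for v in values]
    return q_values, scale


def int_dot(q_a, q_b):
    """Dot product that only multiplies and adds integers."""
    return sum(a * b for a, b in zip(q_a, q_b))


# Hypothetical weights and activations, chosen only for the example.
weights = [0.5, -1.0, 0.25]
activations = [2.0, 1.0, -4.0]

q_w, s_w = quantize(weights)
q_x, s_x = quantize(activations)

# The accumulation is integer-only; the two scales are applied once at the end.
approx = int_dot(q_w, q_x) * s_w * s_x
exact = sum(w * x for w, x in zip(weights, activations))
print(f"exact={exact:.4f} approx={approx:.4f}")
```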

## IBertConfig

[[autodoc]] IBertConfig

## IBertModel

[[autodoc]] IBertModel
    - forward

## IBertForMaskedLM

[[autodoc]] IBertForMaskedLM
    - forward

## IBertForSequenceClassification

[[autodoc]] IBertForSequenceClassification
    - forward

## IBertForMultipleChoice

[[autodoc]] IBertForMultipleChoice
    - forward

## IBertForTokenClassification

[[autodoc]] IBertForTokenClassification
    - forward

## IBertForQuestionAnswering

[[autodoc]] IBertForQuestionAnswering
    - forward
@@ -1,89 +0,0 @@
..
    Copyright 2020 The HuggingFace Team. All rights reserved.

    Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
    the License. You may obtain a copy of the License at

        http://www.apache.org/licenses/LICENSE-2.0

    Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
    an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
    specific language governing permissions and limitations under the License.

I-BERT
-----------------------------------------------------------------------------------------------------------------------

Overview
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The I-BERT model was proposed in `I-BERT: Integer-only BERT Quantization <https://arxiv.org/abs/2101.01321>`__ by
Sehoon Kim, Amir Gholami, Zhewei Yao, Michael W. Mahoney and Kurt Keutzer. It's a quantized version of RoBERTa running
inference up to four times faster.

The abstract from the paper is the following:

*Transformer based models, like BERT and RoBERTa, have achieved state-of-the-art results in many Natural Language
Processing tasks. However, their memory footprint, inference latency, and power consumption are prohibitive for
efficient inference at the edge, and even at the data center. While quantization can be a viable solution for this,
previous work on quantizing Transformer based models use floating-point arithmetic during inference, which cannot
efficiently utilize integer-only logical units such as the recent Turing Tensor Cores, or traditional integer-only ARM
processors. In this work, we propose I-BERT, a novel quantization scheme for Transformer based models that quantizes
the entire inference with integer-only arithmetic. Based on lightweight integer-only approximation methods for
nonlinear operations, e.g., GELU, Softmax, and Layer Normalization, I-BERT performs an end-to-end integer-only BERT
inference without any floating point calculation. We evaluate our approach on GLUE downstream tasks using
RoBERTa-Base/Large. We show that for both cases, I-BERT achieves similar (and slightly higher) accuracy as compared to
the full-precision baseline. Furthermore, our preliminary implementation of I-BERT shows a speedup of 2.4 - 4.0x for
INT8 inference on a T4 GPU system as compared to FP32 inference. The framework has been developed in PyTorch and has
been open-sourced.*

This model was contributed by `kssteven <https://huggingface.co/kssteven>`__. The original code can be found `here
<https://github.com/kssteven418/I-BERT>`__.


IBertConfig
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.IBertConfig
    :members:


IBertModel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.IBertModel
    :members: forward


IBertForMaskedLM
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.IBertForMaskedLM
    :members: forward


IBertForSequenceClassification
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.IBertForSequenceClassification
    :members: forward


IBertForMultipleChoice
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.IBertForMultipleChoice
    :members: forward


IBertForTokenClassification
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.IBertForTokenClassification
    :members: forward


IBertForQuestionAnswering
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.IBertForQuestionAnswering
    :members: forward
124
docs/source/model_doc/layoutlm.mdx
Normal file
@@ -0,0 +1,124 @@
<!--Copyright 2020 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->

# LayoutLM

<a id='Overview'></a>

## Overview

The LayoutLM model was proposed in the paper [LayoutLM: Pre-training of Text and Layout for Document Image
Understanding](https://arxiv.org/abs/1912.13318) by Yiheng Xu, Minghao Li, Lei Cui, Shaohan Huang, Furu Wei, and
Ming Zhou. It's a simple but effective pretraining method of text and layout for document image understanding and
information extraction tasks, such as form understanding and receipt understanding. It obtains state-of-the-art results
on several downstream tasks:

- form understanding: the [FUNSD](https://guillaumejaume.github.io/FUNSD/) dataset (a collection of 199 annotated
  forms comprising more than 30,000 words).
- receipt understanding: the [SROIE](https://rrc.cvc.uab.es/?ch=13) dataset (a collection of 626 receipts for
  training and 347 receipts for testing).
- document image classification: the [RVL-CDIP](https://www.cs.cmu.edu/~aharley/rvl-cdip/) dataset (a collection of
  400,000 images belonging to one of 16 classes).

The abstract from the paper is the following:

*Pre-training techniques have been verified successfully in a variety of NLP tasks in recent years. Despite the
widespread use of pretraining models for NLP applications, they almost exclusively focus on text-level manipulation,
while neglecting layout and style information that is vital for document image understanding. In this paper, we propose
the LayoutLM to jointly model interactions between text and layout information across scanned document images, which is
beneficial for a great number of real-world document image understanding tasks such as information extraction from
scanned documents. Furthermore, we also leverage image features to incorporate words' visual information into LayoutLM.
To the best of our knowledge, this is the first time that text and layout are jointly learned in a single framework for
document-level pretraining. It achieves new state-of-the-art results in several downstream tasks, including form
understanding (from 70.72 to 79.27), receipt understanding (from 94.02 to 95.24) and document image classification
(from 93.07 to 94.42).*

Tips:

- In addition to `input_ids`, [`~transformers.LayoutLMModel.forward`] also expects the input `bbox`, which are
  the bounding boxes (i.e. 2D-positions) of the input tokens. These can be obtained using an external OCR engine such
  as Google's [Tesseract](https://github.com/tesseract-ocr/tesseract) (there's a [Python wrapper](https://pypi.org/project/pytesseract/) available). Each bounding box should be in (x0, y0, x1, y1) format, where
  (x0, y0) corresponds to the position of the upper left corner in the bounding box, and (x1, y1) represents the
  position of the lower right corner. Note that one first needs to normalize the bounding boxes to be on a 0-1000
  scale. To normalize, you can use the following function:

```python
def normalize_bbox(bbox, width, height):
    return [
        int(1000 * (bbox[0] / width)),
        int(1000 * (bbox[1] / height)),
        int(1000 * (bbox[2] / width)),
        int(1000 * (bbox[3] / height)),
    ]
```

Here, `width` and `height` correspond to the width and height of the original document in which the token
occurs. Those can be obtained using the Python Imaging Library (PIL), for example, as follows:

```python
from PIL import Image

# Any raster format that PIL can read works (PNG, JPEG, ...); PDF pages must be converted to images first.
image = Image.open("name_of_your_document.png")

width, height = image.size
```

- For a demo which shows how to fine-tune [`LayoutLMForTokenClassification`] on the [FUNSD dataset](https://guillaumejaume.github.io/FUNSD/) (a collection of annotated forms), see [this notebook](https://github.com/NielsRogge/Transformers-Tutorials/blob/master/LayoutLM/Fine_tuning_LayoutLMForTokenClassification_on_FUNSD.ipynb).
  It includes an inference part, which shows how to use Google's Tesseract on a new document.

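
As a rough illustration of the bounding-box tip, the sketch below normalizes word-level OCR boxes and repeats each box for every sub-word token of its word, so that `bbox` lines up with `input_ids`. The page size, words, boxes and sub-word counts are invented for the example, and the boxes given to the special tokens are an assumed convention, not something stated in the tips.

```python
def normalize_bbox(bbox, width, height):
    # Same scaling as in the tip above: pixel coordinates -> 0-1000 range.
    return [
        int(1000 * (bbox[0] / width)),
        int(1000 * (bbox[1] / height)),
        int(1000 * (bbox[2] / width)),
        int(1000 * (bbox[3] / height)),
    ]


# Hypothetical OCR output for an 800x1000 pixel page: one pixel-space box per word.
page_width, page_height = 800, 1000
words = ["Invoice", "total"]
word_boxes = [[200, 125, 400, 250], [400, 500, 600, 750]]

# Hypothetical tokenization: "Invoice" splits into two sub-word tokens, "total" stays whole.
tokens_per_word = [2, 1]

token_boxes = []
for box, n_tokens in zip(word_boxes, tokens_per_word):
    normalized = normalize_bbox(box, page_width, page_height)
    # Every sub-word token inherits the normalized box of the word it came from.
    token_boxes.extend(list(normalized) for _ in range(n_tokens))

# Assumed convention: an all-zero box for [CLS] and the bottom-right corner for [SEP].
token_boxes = [[0, 0, 0, 0]] + token_boxes + [[1000, 1000, 1000, 1000]]

print(token_boxes)
```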
This model was contributed by [liminghao1630](https://huggingface.co/liminghao1630). The original code can be found
[here](https://github.com/microsoft/unilm/tree/master/layoutlm).


## LayoutLMConfig

[[autodoc]] LayoutLMConfig

## LayoutLMTokenizer

[[autodoc]] LayoutLMTokenizer

## LayoutLMTokenizerFast

[[autodoc]] LayoutLMTokenizerFast

## LayoutLMModel

[[autodoc]] LayoutLMModel

## LayoutLMForMaskedLM

[[autodoc]] LayoutLMForMaskedLM

## LayoutLMForSequenceClassification

[[autodoc]] LayoutLMForSequenceClassification

## LayoutLMForTokenClassification

[[autodoc]] LayoutLMForTokenClassification

## TFLayoutLMModel

[[autodoc]] TFLayoutLMModel

## TFLayoutLMForMaskedLM

[[autodoc]] TFLayoutLMForMaskedLM

## TFLayoutLMForSequenceClassification

[[autodoc]] TFLayoutLMForSequenceClassification

## TFLayoutLMForTokenClassification

[[autodoc]] TFLayoutLMForTokenClassification
@@ -1,161 +0,0 @@
..
    Copyright 2020 The HuggingFace Team. All rights reserved.

    Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
    the License. You may obtain a copy of the License at

        http://www.apache.org/licenses/LICENSE-2.0

    Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
    an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
    specific language governing permissions and limitations under the License.

LayoutLM
-----------------------------------------------------------------------------------------------------------------------

.. _Overview:

Overview
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The LayoutLM model was proposed in the paper `LayoutLM: Pre-training of Text and Layout for Document Image
Understanding <https://arxiv.org/abs/1912.13318>`__ by Yiheng Xu, Minghao Li, Lei Cui, Shaohan Huang, Furu Wei, and
Ming Zhou. It's a simple but effective pretraining method of text and layout for document image understanding and
information extraction tasks, such as form understanding and receipt understanding. It obtains state-of-the-art results
on several downstream tasks:

- form understanding: the `FUNSD <https://guillaumejaume.github.io/FUNSD/>`__ dataset (a collection of 199 annotated
  forms comprising more than 30,000 words).
- receipt understanding: the `SROIE <https://rrc.cvc.uab.es/?ch=13>`__ dataset (a collection of 626 receipts for
  training and 347 receipts for testing).
- document image classification: the `RVL-CDIP <https://www.cs.cmu.edu/~aharley/rvl-cdip/>`__ dataset (a collection of
  400,000 images belonging to one of 16 classes).

The abstract from the paper is the following:

*Pre-training techniques have been verified successfully in a variety of NLP tasks in recent years. Despite the
widespread use of pretraining models for NLP applications, they almost exclusively focus on text-level manipulation,
while neglecting layout and style information that is vital for document image understanding. In this paper, we propose
the LayoutLM to jointly model interactions between text and layout information across scanned document images, which is
beneficial for a great number of real-world document image understanding tasks such as information extraction from
scanned documents. Furthermore, we also leverage image features to incorporate words' visual information into LayoutLM.
To the best of our knowledge, this is the first time that text and layout are jointly learned in a single framework for
document-level pretraining. It achieves new state-of-the-art results in several downstream tasks, including form
understanding (from 70.72 to 79.27), receipt understanding (from 94.02 to 95.24) and document image classification
(from 93.07 to 94.42).*

Tips:

- In addition to `input_ids`, :meth:`~transformer.LayoutLMModel.forward` also expects the input :obj:`bbox`, which are
  the bounding boxes (i.e. 2D-positions) of the input tokens. These can be obtained using an external OCR engine such
  as Google's `Tesseract <https://github.com/tesseract-ocr/tesseract>`__ (there's a `Python wrapper
  <https://pypi.org/project/pytesseract/>`__ available). Each bounding box should be in (x0, y0, x1, y1) format, where
  (x0, y0) corresponds to the position of the upper left corner in the bounding box, and (x1, y1) represents the
  position of the lower right corner. Note that one first needs to normalize the bounding boxes to be on a 0-1000
  scale. To normalize, you can use the following function:

.. code-block::

    def normalize_bbox(bbox, width, height):
        return [
            int(1000 * (bbox[0] / width)),
            int(1000 * (bbox[1] / height)),
            int(1000 * (bbox[2] / width)),
            int(1000 * (bbox[3] / height)),
        ]

Here, :obj:`width` and :obj:`height` correspond to the width and height of the original document in which the token
occurs. Those can be obtained using the Python Image Library (PIL) library for example, as follows:

.. code-block::

    from PIL import Image

    image = Image.open("name_of_your_document - can be a png file, pdf, etc.")

    width, height = image.size

- For a demo which shows how to fine-tune :class:`LayoutLMForTokenClassification` on the `FUNSD dataset
  <https://guillaumejaume.github.io/FUNSD/>`__ (a collection of annotated forms), see `this notebook
  <https://github.com/NielsRogge/Transformers-Tutorials/blob/master/LayoutLM/Fine_tuning_LayoutLMForTokenClassification_on_FUNSD.ipynb>`__.
  It includes an inference part, which shows how to use Google's Tesseract on a new document.

This model was contributed by `liminghao1630 <https://huggingface.co/liminghao1630>`__. The original code can be found
`here <https://github.com/microsoft/unilm/tree/master/layoutlm>`_.


LayoutLMConfig
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.LayoutLMConfig
    :members:


LayoutLMTokenizer
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.LayoutLMTokenizer
    :members:


LayoutLMTokenizerFast
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.LayoutLMTokenizerFast
    :members:


LayoutLMModel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.LayoutLMModel
    :members:


LayoutLMForMaskedLM
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.LayoutLMForMaskedLM
    :members:


LayoutLMForSequenceClassification
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.LayoutLMForSequenceClassification
    :members:


LayoutLMForTokenClassification
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.LayoutLMForTokenClassification
    :members:


TFLayoutLMModel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.TFLayoutLMModel
    :members:


TFLayoutLMForMaskedLM
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.TFLayoutLMForMaskedLM
    :members:


TFLayoutLMForSequenceClassification
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.TFLayoutLMForSequenceClassification
    :members:


TFLayoutLMForTokenClassification
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.TFLayoutLMForTokenClassification
    :members:
287
docs/source/model_doc/layoutlmv2.mdx
Normal file
@@ -0,0 +1,287 @@
<!--Copyright 2021 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->

# LayoutLMV2

## Overview

The LayoutLMV2 model was proposed in [LayoutLMv2: Multi-modal Pre-training for Visually-Rich Document Understanding](https://arxiv.org/abs/2012.14740) by Yang Xu, Yiheng Xu, Tengchao Lv, Lei Cui, Furu Wei, Guoxin Wang, Yijuan Lu,
Dinei Florencio, Cha Zhang, Wanxiang Che, Min Zhang, Lidong Zhou. LayoutLMV2 improves [LayoutLM](layoutlm) to obtain
state-of-the-art results across several document image understanding benchmarks:

- information extraction from scanned documents: the [FUNSD](https://guillaumejaume.github.io/FUNSD/) dataset (a
  collection of 199 annotated forms comprising more than 30,000 words), the [CORD](https://github.com/clovaai/cord)
  dataset (a collection of 800 receipts for training, 100 for validation and 100 for testing), the [SROIE](https://rrc.cvc.uab.es/?ch=13) dataset (a collection of 626 receipts for training and 347 receipts for testing)
  and the [Kleister-NDA](https://github.com/applicaai/kleister-nda) dataset (a collection of non-disclosure
  agreements from the EDGAR database, including 254 documents for training, 83 documents for validation, and 203
  documents for testing).
- document image classification: the [RVL-CDIP](https://www.cs.cmu.edu/~aharley/rvl-cdip/) dataset (a collection of
  400,000 images belonging to one of 16 classes).
- document visual question answering: the [DocVQA](https://arxiv.org/abs/2007.00398) dataset (a collection of 50,000
  questions defined on 12,000+ document images).

The abstract from the paper is the following:

*Pre-training of text and layout has proved effective in a variety of visually-rich document understanding tasks due to
its effective model architecture and the advantage of large-scale unlabeled scanned/digital-born documents. In this
paper, we present LayoutLMv2 by pre-training text, layout and image in a multi-modal framework, where new model
architectures and pre-training tasks are leveraged. Specifically, LayoutLMv2 not only uses the existing masked
visual-language modeling task but also the new text-image alignment and text-image matching tasks in the pre-training
stage, where cross-modality interaction is better learned. Meanwhile, it also integrates a spatial-aware self-attention
mechanism into the Transformer architecture, so that the model can fully understand the relative positional
relationship among different text blocks. Experiment results show that LayoutLMv2 outperforms strong baselines and
achieves new state-of-the-art results on a wide variety of downstream visually-rich document understanding tasks,
including FUNSD (0.7895 -> 0.8420), CORD (0.9493 -> 0.9601), SROIE (0.9524 -> 0.9781), Kleister-NDA (0.834 -> 0.852),
RVL-CDIP (0.9443 -> 0.9564), and DocVQA (0.7295 -> 0.8672). The pre-trained LayoutLMv2 model is publicly available at
this https URL.*

Tips:

- The main difference between LayoutLMv1 and LayoutLMv2 is that the latter incorporates visual embeddings during
  pre-training (while LayoutLMv1 only adds visual embeddings during fine-tuning).
- LayoutLMv2 adds both a relative 1D attention bias as well as a spatial 2D attention bias to the attention scores in
  the self-attention layers. Details can be found on page 5 of the [paper](https://arxiv.org/abs/2012.14740).
- Demo notebooks on how to use the LayoutLMv2 model on RVL-CDIP, FUNSD, DocVQA, CORD can be found [here](https://github.com/NielsRogge/Transformers-Tutorials).
- LayoutLMv2 uses Facebook AI's [Detectron2](https://github.com/facebookresearch/detectron2/) package for its visual
  backbone. See [this link](https://detectron2.readthedocs.io/en/latest/tutorials/install.html) for installation
  instructions.
|
- In addition to `input_ids`, [`~LayoutLMv2Model.forward`] expects 2 additional inputs, namely
|
||||||
|
`image` and `bbox`. The `image` input corresponds to the original document image in which the text
|
||||||
|
tokens occur. The model expects each document image to be of size 224x224. This means that if you have a batch of
|
||||||
|
document images, `image` should be a tensor of shape (batch_size, 3, 224, 224). This can be either a
|
||||||
|
`torch.Tensor` or a `Detectron2.structures.ImageList`. You don't need to normalize the channels, as this is
|
||||||
|
done by the model. Important to note is that the visual backbone expects BGR channels instead of RGB, as all models
|
||||||
|
in Detectron2 are pre-trained using the BGR format. The `bbox` input are the bounding boxes (i.e. 2D-positions)
|
||||||
|
of the input text tokens. This is identical to [`LayoutLMModel`]. These can be obtained using an
|
||||||
|
external OCR engine such as Google's [Tesseract](https://github.com/tesseract-ocr/tesseract) (there's a [Python
|
||||||
|
wrapper](https://pypi.org/project/pytesseract/) available). Each bounding box should be in (x0, y0, x1, y1)
|
||||||
|
format, where (x0, y0) corresponds to the position of the upper left corner in the bounding box, and (x1, y1)
|
||||||
|
represents the position of the lower right corner. Note that one first needs to normalize the bounding boxes to be on
|
||||||
|
a 0-1000 scale. To normalize, you can use the following function:
|
||||||
|
|
||||||
|
```python
def normalize_bbox(bbox, width, height):
    # Rescale pixel coordinates (x0, y0, x1, y1) to the 0-1000 range expected by the model.
    return [
        int(1000 * (bbox[0] / width)),
        int(1000 * (bbox[1] / height)),
        int(1000 * (bbox[2] / width)),
        int(1000 * (bbox[3] / height)),
    ]
```
|
||||||
|
|
||||||
|
Here, `width` and `height` correspond to the width and height of the original document in which the token
|
||||||
|
occurs (before resizing the image). These can be obtained using the Python Imaging Library (PIL), for example, as
|
||||||
|
follows:
|
||||||
|
|
||||||
|
```python
from PIL import Image

image = Image.open("name_of_your_document - can be a png, jpg, etc.")

width, height = image.size
```
|
||||||
|
|
||||||
|
However, this model includes a brand new [`~transformers.LayoutLMv2Processor`] which can be used to directly
|
||||||
|
prepare data for the model (including applying OCR under the hood). More information can be found in the "Usage"
|
||||||
|
section below.
|
||||||
|
|
||||||
|
- Internally, [`~transformers.LayoutLMv2Model`] will send the `image` input through its visual backbone to
|
||||||
|
obtain a lower-resolution feature map, whose shape is equal to the `image_feature_pool_shape` attribute of
|
||||||
|
[`~transformers.LayoutLMv2Config`]. This feature map is then flattened to obtain a sequence of image tokens. As
|
||||||
|
the size of the feature map is 7x7 by default, one obtains 49 image tokens. These are then concatenated with the text
|
||||||
|
tokens, and sent through the Transformer encoder. This means that the last hidden states of the model will have a
|
||||||
|
length of 512 + 49 = 561, if you pad the text tokens up to the max length. More generally, the last hidden states
|
||||||
|
will have a length of `seq_length` + `config.image_feature_pool_shape[0]` * `config.image_feature_pool_shape[1]`.
|
||||||
|
- When calling [`~transformers.LayoutLMv2Model.from_pretrained`], a warning will be printed with a long list of
|
||||||
|
parameter names that are not initialized. This is not a problem, as these parameters are batch normalization
|
||||||
|
statistics, which will be populated when fine-tuning on a custom dataset.
|
||||||
|
- If you want to train the model in a distributed environment, make sure to call [`synchronize_batch_norm`] on the
|
||||||
|
model in order to properly synchronize the batch normalization layers of the visual backbone.
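The image-token arithmetic from the tips above can be checked with a few lines of plain Python. This is only a sketch: `expected_sequence_length` is a hypothetical helper written for illustration, not part of the library, and the `(7, 7)` pool shape is the default mentioned above.

```python
def expected_sequence_length(text_seq_length, image_feature_pool_shape):
    """Length of the last hidden states: text tokens plus flattened image tokens."""
    pooled_height, pooled_width = image_feature_pool_shape[0], image_feature_pool_shape[1]
    return text_seq_length + pooled_height * pooled_width


# Text padded to 512 tokens and a 7x7 feature map, as described in the tips.
total = expected_sequence_length(512, (7, 7))
print(total)  # 561
```

If you pad or truncate the text to a different length, only the first term changes; the number of image tokens stays fixed for a given configuration.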
|
||||||
|
|
||||||
|
In addition, there's LayoutXLM, which is a multilingual version of LayoutLMv2. More information can be found on
|
||||||
|
[LayoutXLM's documentation page](layoutxlm).
|
||||||
|
|
||||||
|
## Usage: LayoutLMv2Processor
|
||||||
|
|
||||||
|
The easiest way to prepare data for the model is to use [`LayoutLMv2Processor`], which internally
|
||||||
|
combines a feature extractor ([`LayoutLMv2FeatureExtractor`]) and a tokenizer
|
||||||
|
([`LayoutLMv2Tokenizer`] or [`LayoutLMv2TokenizerFast`]). The feature extractor
|
||||||
|
handles the image modality, while the tokenizer handles the text modality. A processor combines both, which is ideal
|
||||||
|
for a multi-modal model like LayoutLMv2. Note that you can still use both separately, if you only want to handle one
|
||||||
|
modality.
|
||||||
|
|
||||||
|
```python
from transformers import LayoutLMv2FeatureExtractor, LayoutLMv2TokenizerFast, LayoutLMv2Processor

feature_extractor = LayoutLMv2FeatureExtractor()  # apply_ocr is set to True by default
tokenizer = LayoutLMv2TokenizerFast.from_pretrained("microsoft/layoutlmv2-base-uncased")
processor = LayoutLMv2Processor(feature_extractor, tokenizer)
```
|
||||||
|
|
||||||
|
In short, one can provide a document image (and possibly additional data) to [`LayoutLMv2Processor`],
|
||||||
|
and it will create the inputs expected by the model. Internally, the processor first uses
|
||||||
|
[`LayoutLMv2FeatureExtractor`] to apply OCR on the image to get a list of words and normalized
|
||||||
|
bounding boxes, as well as to resize the image to a given size in order to get the `image` input. The words and
|
||||||
|
normalized bounding boxes are then provided to [`LayoutLMv2Tokenizer`] or
|
||||||
|
[`LayoutLMv2TokenizerFast`], which converts them to token-level `input_ids`,
|
||||||
|
`attention_mask`, `token_type_ids`, `bbox`. Optionally, one can provide word labels to the processor,
|
||||||
|
which are turned into token-level `labels`.
|
||||||
|
|
||||||
|
[`LayoutLMv2Processor`] uses [PyTesseract](https://pypi.org/project/pytesseract/), a Python
|
||||||
|
wrapper around Google's Tesseract OCR engine, under the hood. Note that you can still use your own OCR engine of
|
||||||
|
choice, and provide the words and normalized boxes yourself. This requires initializing
|
||||||
|
[`LayoutLMv2FeatureExtractor`] with `apply_ocr` set to `False`.
|
||||||
|
|
||||||
|
In total, there are 5 use cases that are supported by the processor. Below, we list them all. Note that each of these
|
||||||
|
use cases works for both batched and non-batched inputs (we illustrate them for non-batched inputs).
|
||||||
|
|
||||||
|
**Use case 1: document image classification (training, inference) + token classification (inference), apply_ocr=True**
|
||||||
|
|
||||||
|
This is the simplest case, in which the processor (actually the feature extractor) will perform OCR on the image to get
|
||||||
|
the words and normalized bounding boxes.
|
||||||
|
|
||||||
|
```python
from transformers import LayoutLMv2Processor
from PIL import Image

processor = LayoutLMv2Processor.from_pretrained("microsoft/layoutlmv2-base-uncased")

image = Image.open("name_of_your_document - can be a png, jpg, etc.").convert("RGB")
encoding = processor(image, return_tensors="pt")  # you can also add all tokenizer parameters here such as padding, truncation
print(encoding.keys())
# dict_keys(['input_ids', 'token_type_ids', 'attention_mask', 'bbox', 'image'])
```
|
||||||
|
|
||||||
|
**Use case 2: document image classification (training, inference) + token classification (inference), apply_ocr=False**
|
||||||
|
|
||||||
|
In case one wants to do OCR themselves, one can initialize the feature extractor with `apply_ocr` set to
|
||||||
|
`False`. In that case, one should provide the words and corresponding (normalized) bounding boxes themselves to
|
||||||
|
the processor.
|
||||||
|
|
||||||
|
```python
from transformers import LayoutLMv2Processor
from PIL import Image

processor = LayoutLMv2Processor.from_pretrained("microsoft/layoutlmv2-base-uncased", revision="no_ocr")

image = Image.open("name_of_your_document - can be a png, jpg, etc.").convert("RGB")
words = ["hello", "world"]
boxes = [[1, 2, 3, 4], [5, 6, 7, 8]]  # make sure to normalize your bounding boxes
encoding = processor(image, words, boxes=boxes, return_tensors="pt")
print(encoding.keys())
# dict_keys(['input_ids', 'token_type_ids', 'attention_mask', 'bbox', 'image'])
```
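When you run OCR yourself, the boxes you pass to the processor must already be on the 0-1000 scale. The sketch below repeats the `normalize_bbox` formula from the tips and applies it to pixel-space boxes. The page size and box coordinates are made-up values standing in for the output of your own OCR engine.

```python
def normalize_bbox(bbox, width, height):
    # Same formula as in the tips: rescale pixel coordinates to a 0-1000 grid.
    return [
        int(1000 * (bbox[0] / width)),
        int(1000 * (bbox[1] / height)),
        int(1000 * (bbox[2] / width)),
        int(1000 * (bbox[3] / height)),
    ]


# Hypothetical OCR output for an 800x600 page: one pixel-space box per word.
page_width, page_height = 800, 600
pixel_boxes = [[200, 150, 400, 300], [400, 300, 800, 600]]

boxes = [normalize_bbox(box, page_width, page_height) for box in pixel_boxes]
print(boxes)  # [[250, 250, 500, 500], [500, 500, 1000, 1000]]
```

The resulting `boxes` list has one entry per word, which is the layout the `boxes` argument in the example above expects.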
|
||||||
|
|
||||||
|
**Use case 3: token classification (training), apply_ocr=False**
|
||||||
|
|
||||||
|
For token classification tasks (such as FUNSD, CORD, SROIE, Kleister-NDA), one can also provide the corresponding word
|
||||||
|
labels in order to train a model. The processor will then convert these into token-level `labels`. By default, it
|
||||||
|
will only label the first wordpiece of a word, and label the remaining wordpieces with -100, which is the
|
||||||
|
`ignore_index` of PyTorch's CrossEntropyLoss. In case you want all wordpieces of a word to be labeled, you can
|
||||||
|
initialize the tokenizer with `only_label_first_subword` set to `False`.
|
||||||
|
|
||||||
|
```python
from transformers import LayoutLMv2Processor
from PIL import Image

processor = LayoutLMv2Processor.from_pretrained("microsoft/layoutlmv2-base-uncased", revision="no_ocr")

image = Image.open("name_of_your_document - can be a png, jpg, etc.").convert("RGB")
words = ["hello", "world"]
boxes = [[1, 2, 3, 4], [5, 6, 7, 8]]  # make sure to normalize your bounding boxes
word_labels = [1, 2]
encoding = processor(image, words, boxes=boxes, word_labels=word_labels, return_tensors="pt")
print(encoding.keys())
# dict_keys(['input_ids', 'token_type_ids', 'attention_mask', 'bbox', 'labels', 'image'])
```
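To make the first-wordpiece labeling rule concrete, here is a minimal pure-Python sketch of how word-level labels expand to token-level labels. It does not call the tokenizer: `align_labels` is a hypothetical helper, the wordpiece split is invented for illustration, and special tokens such as `[CLS]` and `[SEP]` are left out.

```python
IGNORE_INDEX = -100  # ignored by PyTorch's CrossEntropyLoss by default


def align_labels(word_pieces, word_labels, only_label_first_subword=True):
    """Expand word-level labels to one label per wordpiece."""
    token_labels = []
    for pieces, label in zip(word_pieces, word_labels):
        for position, _ in enumerate(pieces):
            if position == 0 or not only_label_first_subword:
                token_labels.append(label)
            else:
                token_labels.append(IGNORE_INDEX)
    return token_labels


# Hypothetical wordpiece split; a real tokenizer may split these words differently.
word_pieces = [["hello"], ["world", "##wide"]]
word_labels = [1, 2]

print(align_labels(word_pieces, word_labels))  # [1, 2, -100]
print(align_labels(word_pieces, word_labels, only_label_first_subword=False))  # [1, 2, 2]
```

With the default setting, only one prediction per word contributes to the loss, which keeps long words from being weighted more heavily than short ones.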
|
||||||
|
|
||||||
|
**Use case 4: visual question answering (inference), apply_ocr=True**
|
||||||
|
|
||||||
|
For visual question answering tasks (such as DocVQA), you can provide a question to the processor. By default, the
|
||||||
|
processor will apply OCR on the image, and create a sequence of the form `[CLS] question tokens [SEP] word tokens [SEP]`.
|
||||||
|
|
||||||
|
```python
from transformers import LayoutLMv2Processor
from PIL import Image

processor = LayoutLMv2Processor.from_pretrained("microsoft/layoutlmv2-base-uncased")

image = Image.open("name_of_your_document - can be a png, jpg, etc.").convert("RGB")
question = "What's his name?"
encoding = processor(image, question, return_tensors="pt")
print(encoding.keys())
# dict_keys(['input_ids', 'token_type_ids', 'attention_mask', 'bbox', 'image'])
```
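The layout of the question-answering input can be sketched without the library. The helper below, `build_qa_sequence`, is hypothetical and works on already-split tokens. The segment ids follow the usual BERT convention for sentence pairs, which is an assumption here rather than something stated above.

```python
def build_qa_sequence(question_tokens, word_tokens):
    """Lay out tokens as [CLS] question [SEP] words [SEP]."""
    first_segment = ["[CLS]"] + question_tokens + ["[SEP]"]
    second_segment = word_tokens + ["[SEP]"]
    tokens = first_segment + second_segment
    # BERT-style segment ids (assumed): 0 for the question part, 1 for the document words.
    token_type_ids = [0] * len(first_segment) + [1] * len(second_segment)
    return tokens, token_type_ids


tokens, token_type_ids = build_qa_sequence(["what", "is", "the", "date", "?"], ["invoice", "2021"])
print(tokens)
print(token_type_ids)
```

Each document token produced from the OCR words carries a bounding box, while the question tokens have no position on the page.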
|
||||||
|
|
||||||
|
**Use case 5: visual question answering (inference), apply_ocr=False**
|
||||||
|
|
||||||
|
For visual question answering tasks (such as DocVQA), you can provide a question to the processor. If you want to
|
||||||
|
perform OCR yourself, you can provide your own words and (normalized) bounding boxes to the processor.
|
||||||
|
|
||||||
|
```python
from transformers import LayoutLMv2Processor
from PIL import Image

processor = LayoutLMv2Processor.from_pretrained("microsoft/layoutlmv2-base-uncased", revision="no_ocr")

image = Image.open("name_of_your_document - can be a png, jpg, etc.").convert("RGB")
question = "What's his name?"
words = ["hello", "world"]
boxes = [[1, 2, 3, 4], [5, 6, 7, 8]]  # make sure to normalize your bounding boxes
encoding = processor(image, question, words, boxes=boxes, return_tensors="pt")
print(encoding.keys())
# dict_keys(['input_ids', 'token_type_ids', 'attention_mask', 'bbox', 'image'])
```
|
||||||
|
|
||||||
|
## LayoutLMv2Config
|
||||||
|
|
||||||
|
[[autodoc]] LayoutLMv2Config
|
||||||
|
|
||||||
|
## LayoutLMv2FeatureExtractor
|
||||||
|
|
||||||
|
[[autodoc]] LayoutLMv2FeatureExtractor
    - __call__
|
||||||
|
|
||||||
|
## LayoutLMv2Tokenizer
|
||||||
|
|
||||||
|
[[autodoc]] LayoutLMv2Tokenizer
    - __call__
    - save_vocabulary
|
||||||
|
|
||||||
|
## LayoutLMv2TokenizerFast
|
||||||
|
|
||||||
|
[[autodoc]] LayoutLMv2TokenizerFast
    - __call__
|
||||||
|
|
||||||
|
## LayoutLMv2Processor
|
||||||
|
|
||||||
|
[[autodoc]] LayoutLMv2Processor
    - __call__
|
||||||
|
|
||||||
|
## LayoutLMv2Model
|
||||||
|
|
||||||
|
[[autodoc]] LayoutLMv2Model
    - forward
|
||||||
|
|
||||||
|
## LayoutLMv2ForSequenceClassification
|
||||||
|
|
||||||
|
[[autodoc]] LayoutLMv2ForSequenceClassification
|
||||||
|
|
||||||
|
## LayoutLMv2ForTokenClassification
|
||||||
|
|
||||||
|
[[autodoc]] LayoutLMv2ForTokenClassification
|
||||||
|
|
||||||
|
## LayoutLMv2ForQuestionAnswering
|
||||||
|
|
||||||
|
[[autodoc]] LayoutLMv2ForQuestionAnswering
|
||||||
@@ -1,313 +0,0 @@
|
|||||||
..
|
|
||||||
Copyright 2021 The HuggingFace Team. All rights reserved.
|
|
||||||
|
|
||||||
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
|
|
||||||
the License. You may obtain a copy of the License at
|
|
||||||
|
|
||||||
http://www.apache.org/licenses/LICENSE-2.0
|
|
||||||
|
|
||||||
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
|
|
||||||
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
|
|
||||||
specific language governing permissions and limitations under the License.
|
|
||||||
|
|
||||||
LayoutLMv2
|
|
||||||
-----------------------------------------------------------------------------------------------------------------------
|
|
||||||
|
|
||||||
Overview
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
The LayoutLMv2 model was proposed in `LayoutLMv2: Multi-modal Pre-training for Visually-Rich Document Understanding
|
|
||||||
<https://arxiv.org/abs/2012.14740>`__ by Yang Xu, Yiheng Xu, Tengchao Lv, Lei Cui, Furu Wei, Guoxin Wang, Yijuan Lu,
|
|
||||||
Dinei Florencio, Cha Zhang, Wanxiang Che, Min Zhang, Lidong Zhou. LayoutLMv2 improves `LayoutLM <layoutlm>`__ to obtain
|
|
||||||
state-of-the-art results across several document image understanding benchmarks:
|
|
||||||
|
|
||||||
- information extraction from scanned documents: the `FUNSD <https://guillaumejaume.github.io/FUNSD/>`__ dataset (a
|
|
||||||
collection of 199 annotated forms comprising more than 30,000 words), the `CORD <https://github.com/clovaai/cord>`__
|
|
||||||
dataset (a collection of 800 receipts for training, 100 for validation and 100 for testing), the `SROIE
|
|
||||||
<https://rrc.cvc.uab.es/?ch=13>`__ dataset (a collection of 626 receipts for training and 347 receipts for testing)
|
|
||||||
and the `Kleister-NDA <https://github.com/applicaai/kleister-nda>`__ dataset (a collection of non-disclosure
|
|
||||||
agreements from the EDGAR database, including 254 documents for training, 83 documents for validation, and 203
|
|
||||||
documents for testing).
|
|
||||||
- document image classification: the `RVL-CDIP <https://www.cs.cmu.edu/~aharley/rvl-cdip/>`__ dataset (a collection of
|
|
||||||
400,000 images belonging to one of 16 classes).
|
|
||||||
- document visual question answering: the `DocVQA <https://arxiv.org/abs/2007.00398>`__ dataset (a collection of 50,000
|
|
||||||
questions defined on 12,000+ document images).
|
|
||||||
|
|
||||||
The abstract from the paper is the following:
|
|
||||||
|
|
||||||
*Pre-training of text and layout has proved effective in a variety of visually-rich document understanding tasks due to
|
|
||||||
its effective model architecture and the advantage of large-scale unlabeled scanned/digital-born documents. In this
|
|
||||||
paper, we present LayoutLMv2 by pre-training text, layout and image in a multi-modal framework, where new model
|
|
||||||
architectures and pre-training tasks are leveraged. Specifically, LayoutLMv2 not only uses the existing masked
|
|
||||||
visual-language modeling task but also the new text-image alignment and text-image matching tasks in the pre-training
|
|
||||||
stage, where cross-modality interaction is better learned. Meanwhile, it also integrates a spatial-aware self-attention
|
|
||||||
mechanism into the Transformer architecture, so that the model can fully understand the relative positional
|
|
||||||
relationship among different text blocks. Experiment results show that LayoutLMv2 outperforms strong baselines and
|
|
||||||
achieves new state-of-the-art results on a wide variety of downstream visually-rich document understanding tasks,
|
|
||||||
including FUNSD (0.7895 -> 0.8420), CORD (0.9493 -> 0.9601), SROIE (0.9524 -> 0.9781), Kleister-NDA (0.834 -> 0.852),
|
|
||||||
RVL-CDIP (0.9443 -> 0.9564), and DocVQA (0.7295 -> 0.8672). The pre-trained LayoutLMv2 model is publicly available at
|
|
||||||
this https URL.*
|
|
||||||
|
|
||||||
Tips:
|
|
||||||
|
|
||||||
- The main difference between LayoutLMv1 and LayoutLMv2 is that the latter incorporates visual embeddings during
|
|
||||||
pre-training (while LayoutLMv1 only adds visual embeddings during fine-tuning).
|
|
||||||
- LayoutLMv2 adds both a relative 1D attention bias as well as a spatial 2D attention bias to the attention scores in
|
|
||||||
the self-attention layers. Details can be found on page 5 of the `paper <https://arxiv.org/abs/2012.14740>`__.
|
|
||||||
- Demo notebooks on how to use the LayoutLMv2 model on RVL-CDIP, FUNSD, DocVQA, CORD can be found `here
|
|
||||||
<https://github.com/NielsRogge/Transformers-Tutorials>`__.
|
|
||||||
- LayoutLMv2 uses Facebook AI's `Detectron2 <https://github.com/facebookresearch/detectron2/>`__ package for its visual
|
|
||||||
backbone. See `this link <https://detectron2.readthedocs.io/en/latest/tutorials/install.html>`__ for installation
|
|
||||||
instructions.
|
|
||||||
- In addition to :obj:`input_ids`, :meth:`~transformers.LayoutLMv2Model.forward` expects 2 additional inputs, namely
|
|
||||||
:obj:`image` and :obj:`bbox`. The :obj:`image` input corresponds to the original document image in which the text
|
|
||||||
tokens occur. The model expects each document image to be of size 224x224. This means that if you have a batch of
|
|
||||||
document images, :obj:`image` should be a tensor of shape (batch_size, 3, 224, 224). This can be either a
|
|
||||||
:obj:`torch.Tensor` or a :obj:`detectron2.structures.ImageList`. You don't need to normalize the channels, as this is
|
|
||||||
done by the model. Note that the visual backbone expects BGR channels instead of RGB, as all models
|
|
||||||
in Detectron2 are pre-trained using the BGR format. The :obj:`bbox` input contains the bounding boxes (i.e. 2D-positions)
|
|
||||||
of the input text tokens. This is identical to :class:`~transformers.LayoutLMModel`. These can be obtained using an
|
|
||||||
external OCR engine such as Google's `Tesseract <https://github.com/tesseract-ocr/tesseract>`__ (there's a `Python
|
|
||||||
wrapper <https://pypi.org/project/pytesseract/>`__ available). Each bounding box should be in (x0, y0, x1, y1)
|
|
||||||
format, where (x0, y0) corresponds to the position of the upper left corner in the bounding box, and (x1, y1)
|
|
||||||
represents the position of the lower right corner. Note that one first needs to normalize the bounding boxes to be on
|
|
||||||
a 0-1000 scale. To normalize, you can use the following function:
|
|
||||||
|
|
||||||
.. code-block::

    def normalize_bbox(bbox, width, height):
        # Rescale pixel coordinates (x0, y0, x1, y1) to the 0-1000 range expected by the model.
        return [
            int(1000 * (bbox[0] / width)),
            int(1000 * (bbox[1] / height)),
            int(1000 * (bbox[2] / width)),
            int(1000 * (bbox[3] / height)),
        ]
|
|
||||||
|
|
||||||
Here, :obj:`width` and :obj:`height` correspond to the width and height of the original document in which the token
|
|
||||||
occurs (before resizing the image). These can be obtained using the Python Imaging Library (PIL), for example, as
|
|
||||||
follows:
|
|
||||||
|
|
||||||
.. code-block::

    from PIL import Image

    image = Image.open("name_of_your_document - can be a png, jpg, etc.")

    width, height = image.size
|
|
||||||
|
|
||||||
However, this model includes a brand new :class:`~transformers.LayoutLMv2Processor` which can be used to directly
|
|
||||||
prepare data for the model (including applying OCR under the hood). More information can be found in the "Usage"
|
|
||||||
section below.
|
|
||||||
|
|
||||||
- Internally, :class:`~transformers.LayoutLMv2Model` will send the :obj:`image` input through its visual backbone to
|
|
||||||
obtain a lower-resolution feature map, whose shape is equal to the :obj:`image_feature_pool_shape` attribute of
|
|
||||||
:class:`~transformers.LayoutLMv2Config`. This feature map is then flattened to obtain a sequence of image tokens. As
|
|
||||||
the size of the feature map is 7x7 by default, one obtains 49 image tokens. These are then concatenated with the text
|
|
||||||
tokens, and sent through the Transformer encoder. This means that the last hidden states of the model will have a
|
|
||||||
length of 512 + 49 = 561, if you pad the text tokens up to the max length. More generally, the last hidden states
|
|
||||||
will have a length of :obj:`seq_length` + :obj:`config.image_feature_pool_shape[0]` * :obj:`config.image_feature_pool_shape[1]`.
|
|
||||||
- When calling :meth:`~transformers.LayoutLMv2Model.from_pretrained`, a warning will be printed with a long list of
|
|
||||||
parameter names that are not initialized. This is not a problem, as these parameters are batch normalization
|
|
||||||
statistics, which will be populated when fine-tuning on a custom dataset.
|
|
||||||
- If you want to train the model in a distributed environment, make sure to call :meth:`synchronize_batch_norm` on the
|
|
||||||
model in order to properly synchronize the batch normalization layers of the visual backbone.
|
|
||||||
|
|
||||||
In addition, there's LayoutXLM, which is a multilingual version of LayoutLMv2. More information can be found on
|
|
||||||
:doc:`LayoutXLM's documentation page <layoutxlm>`.
|
|
||||||
|
|
||||||
Usage: LayoutLMv2Processor
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
The easiest way to prepare data for the model is to use :class:`~transformers.LayoutLMv2Processor`, which internally
|
|
||||||
combines a feature extractor (:class:`~transformers.LayoutLMv2FeatureExtractor`) and a tokenizer
|
|
||||||
(:class:`~transformers.LayoutLMv2Tokenizer` or :class:`~transformers.LayoutLMv2TokenizerFast`). The feature extractor
|
|
||||||
handles the image modality, while the tokenizer handles the text modality. A processor combines both, which is ideal
|
|
||||||
for a multi-modal model like LayoutLMv2. Note that you can still use both separately, if you only want to handle one
|
|
||||||
modality.
|
|
||||||
|
|
||||||
.. code-block::

    from transformers import LayoutLMv2FeatureExtractor, LayoutLMv2TokenizerFast, LayoutLMv2Processor

    feature_extractor = LayoutLMv2FeatureExtractor()  # apply_ocr is set to True by default
    tokenizer = LayoutLMv2TokenizerFast.from_pretrained("microsoft/layoutlmv2-base-uncased")
    processor = LayoutLMv2Processor(feature_extractor, tokenizer)
|
|
||||||
|
|
||||||
In short, one can provide a document image (and possibly additional data) to :class:`~transformers.LayoutLMv2Processor`,
|
|
||||||
and it will create the inputs expected by the model. Internally, the processor first uses
|
|
||||||
:class:`~transformers.LayoutLMv2FeatureExtractor` to apply OCR on the image to get a list of words and normalized
|
|
||||||
bounding boxes, as well as to resize the image to a given size in order to get the :obj:`image` input. The words and
|
|
||||||
normalized bounding boxes are then provided to :class:`~transformers.LayoutLMv2Tokenizer` or
|
|
||||||
:class:`~transformers.LayoutLMv2TokenizerFast`, which converts them to token-level :obj:`input_ids`,
|
|
||||||
:obj:`attention_mask`, :obj:`token_type_ids`, :obj:`bbox`. Optionally, one can provide word labels to the processor,
|
|
||||||
which are turned into token-level :obj:`labels`.
|
|
||||||
|
|
||||||
:class:`~transformers.LayoutLMv2Processor` uses `PyTesseract <https://pypi.org/project/pytesseract/>`__, a Python
|
|
||||||
wrapper around Google's Tesseract OCR engine, under the hood. Note that you can still use your own OCR engine of
|
|
||||||
choice, and provide the words and normalized boxes yourself. This requires initializing
|
|
||||||
:class:`~transformers.LayoutLMv2FeatureExtractor` with :obj:`apply_ocr` set to :obj:`False`.
|
|
||||||
|
|
||||||
In total, there are 5 use cases that are supported by the processor. Below, we list them all. Note that each of these
|
|
||||||
use cases works for both batched and non-batched inputs (we illustrate them for non-batched inputs).
|
|
||||||
|
|
||||||
**Use case 1: document image classification (training, inference) + token classification (inference), apply_ocr=True**
|
|
||||||
|
|
||||||
This is the simplest case, in which the processor (actually the feature extractor) will perform OCR on the image to get
|
|
||||||
the words and normalized bounding boxes.
|
|
||||||
|
|
||||||
.. code-block::

    from transformers import LayoutLMv2Processor
    from PIL import Image

    processor = LayoutLMv2Processor.from_pretrained("microsoft/layoutlmv2-base-uncased")

    image = Image.open("name_of_your_document - can be a png, jpg, etc.").convert("RGB")
    encoding = processor(image, return_tensors="pt")  # you can also add all tokenizer parameters here such as padding, truncation
    print(encoding.keys())
    # dict_keys(['input_ids', 'token_type_ids', 'attention_mask', 'bbox', 'image'])
|
|
||||||
|
|
||||||
**Use case 2: document image classification (training, inference) + token classification (inference), apply_ocr=False**
|
|
||||||
|
|
||||||
In case one wants to do OCR themselves, one can initialize the feature extractor with :obj:`apply_ocr` set to
|
|
||||||
:obj:`False`. In that case, one should provide the words and corresponding (normalized) bounding boxes themselves to
|
|
||||||
the processor.
|
|
||||||
|
|
||||||
.. code-block::

    from transformers import LayoutLMv2Processor
    from PIL import Image

    processor = LayoutLMv2Processor.from_pretrained("microsoft/layoutlmv2-base-uncased", revision="no_ocr")

    image = Image.open("name_of_your_document - can be a png, jpg, etc.").convert("RGB")
    words = ["hello", "world"]
    boxes = [[1, 2, 3, 4], [5, 6, 7, 8]]  # make sure to normalize your bounding boxes
    encoding = processor(image, words, boxes=boxes, return_tensors="pt")
    print(encoding.keys())
    # dict_keys(['input_ids', 'token_type_ids', 'attention_mask', 'bbox', 'image'])
|
|
||||||
|
|
||||||
**Use case 3: token classification (training), apply_ocr=False**
|
|
||||||
|
|
||||||
For token classification tasks (such as FUNSD, CORD, SROIE, Kleister-NDA), one can also provide the corresponding word
|
|
||||||
labels in order to train a model. The processor will then convert these into token-level :obj:`labels`. By default, it
|
|
||||||
will only label the first wordpiece of a word, and label the remaining wordpieces with -100, which is the
|
|
||||||
:obj:`ignore_index` of PyTorch's CrossEntropyLoss. In case you want all wordpieces of a word to be labeled, you can
|
|
||||||
initialize the tokenizer with :obj:`only_label_first_subword` set to :obj:`False`.
|
|
||||||
|
|
||||||
.. code-block::

    from transformers import LayoutLMv2Processor
    from PIL import Image

    processor = LayoutLMv2Processor.from_pretrained("microsoft/layoutlmv2-base-uncased", revision="no_ocr")

    image = Image.open("name_of_your_document - can be a png, jpg, etc.").convert("RGB")
    words = ["hello", "world"]
    boxes = [[1, 2, 3, 4], [5, 6, 7, 8]]  # make sure to normalize your bounding boxes
    word_labels = [1, 2]
    encoding = processor(image, words, boxes=boxes, word_labels=word_labels, return_tensors="pt")
    print(encoding.keys())
    # dict_keys(['input_ids', 'token_type_ids', 'attention_mask', 'bbox', 'labels', 'image'])
|
|
||||||
|
|
||||||
**Use case 4: visual question answering (inference), apply_ocr=True**
|
|
||||||
|
|
||||||
For visual question answering tasks (such as DocVQA), you can provide a question to the processor. By default, the
|
|
||||||
processor will apply OCR on the image, and create a sequence of the form ``[CLS] question tokens [SEP] word tokens [SEP]``.
|
|
||||||
|
|
||||||
.. code-block::

    from transformers import LayoutLMv2Processor
    from PIL import Image

    processor = LayoutLMv2Processor.from_pretrained("microsoft/layoutlmv2-base-uncased")

    image = Image.open("name_of_your_document - can be a png, jpg, etc.").convert("RGB")
    question = "What's his name?"
    encoding = processor(image, question, return_tensors="pt")
    print(encoding.keys())
    # dict_keys(['input_ids', 'token_type_ids', 'attention_mask', 'bbox', 'image'])
|
|
||||||
|
|
||||||
**Use case 5: visual question answering (inference), apply_ocr=False**
|
|
||||||
|
|
||||||
For visual question answering tasks (such as DocVQA), you can provide a question to the processor. If you want to
|
|
||||||
perform OCR yourself, you can provide your own words and (normalized) bounding boxes to the processor.
|
|
||||||
|
|
||||||
.. code-block::

    from transformers import LayoutLMv2Processor
    from PIL import Image

    processor = LayoutLMv2Processor.from_pretrained("microsoft/layoutlmv2-base-uncased", revision="no_ocr")

    image = Image.open("name_of_your_document - can be a png, jpg, etc.").convert("RGB")
    question = "What's his name?"
    words = ["hello", "world"]
    boxes = [[1, 2, 3, 4], [5, 6, 7, 8]]  # make sure to normalize your bounding boxes
    encoding = processor(image, question, words, boxes=boxes, return_tensors="pt")
    print(encoding.keys())
    # dict_keys(['input_ids', 'token_type_ids', 'attention_mask', 'bbox', 'image'])
|
|
||||||
|
|
||||||
LayoutLMv2Config
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.LayoutLMv2Config
|
|
||||||
:members:
|
|
||||||
|
|
||||||
|
|
||||||
LayoutLMv2FeatureExtractor
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.LayoutLMv2FeatureExtractor
|
|
||||||
:members: __call__
|
|
||||||
|
|
||||||
|
|
||||||
LayoutLMv2Tokenizer
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.LayoutLMv2Tokenizer
|
|
||||||
:members: __call__, save_vocabulary
|
|
||||||
|
|
||||||
|
|
||||||
LayoutLMv2TokenizerFast
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.LayoutLMv2TokenizerFast
|
|
||||||
:members: __call__
|
|
||||||
|
|
||||||
|
|
||||||
LayoutLMv2Processor
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.LayoutLMv2Processor
|
|
||||||
:members: __call__
|
|
||||||
|
|
||||||
|
|
||||||
LayoutLMv2Model
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.LayoutLMv2Model
|
|
||||||
:members: forward
|
|
||||||
|
|
||||||
|
|
||||||
LayoutLMv2ForSequenceClassification
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.LayoutLMv2ForSequenceClassification
|
|
||||||
:members:
|
|
||||||
|
|
||||||
|
|
||||||
LayoutLMv2ForTokenClassification
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.LayoutLMv2ForTokenClassification
|
|
||||||
:members:
|
|
||||||
|
|
||||||
|
|
||||||
LayoutLMv2ForQuestionAnswering
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.LayoutLMv2ForQuestionAnswering
|
|
||||||
:members:
|
|
||||||
77
docs/source/model_doc/layoutxlm.mdx
Normal file
@@ -0,0 +1,77 @@
|
|||||||
|
<!--Copyright 2021 The HuggingFace Team. All rights reserved.
|
||||||
|
|
||||||
|
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
|
||||||
|
the License. You may obtain a copy of the License at
|
||||||
|
|
||||||
|
http://www.apache.org/licenses/LICENSE-2.0
|
||||||
|
|
||||||
|
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
|
||||||
|
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
|
||||||
|
specific language governing permissions and limitations under the License.
|
||||||
|
-->
|
||||||
|
|
||||||
|
# LayoutXLM
|
||||||
|
|
||||||
|
## Overview
|
||||||
|
|
||||||
|
LayoutXLM was proposed in [LayoutXLM: Multimodal Pre-training for Multilingual Visually-rich Document Understanding](https://arxiv.org/abs/2104.08836) by Yiheng Xu, Tengchao Lv, Lei Cui, Guoxin Wang, Yijuan Lu, Dinei Florencio, Cha
|
||||||
|
Zhang, Furu Wei. It's a multilingual extension of the [LayoutLMv2 model](https://arxiv.org/abs/2012.14740) trained
|
||||||
|
on 53 languages.
|
||||||
|
|
||||||
|
The abstract from the paper is the following:
|
||||||
|
|
||||||
|
*Multimodal pre-training with text, layout, and image has achieved SOTA performance for visually-rich document
|
||||||
|
understanding tasks recently, which demonstrates the great potential for joint learning across different modalities. In
|
||||||
|
this paper, we present LayoutXLM, a multimodal pre-trained model for multilingual document understanding, which aims to
|
||||||
|
bridge the language barriers for visually-rich document understanding. To accurately evaluate LayoutXLM, we also
|
||||||
|
introduce a multilingual form understanding benchmark dataset named XFUN, which includes form understanding samples in
|
||||||
|
7 languages (Chinese, Japanese, Spanish, French, Italian, German, Portuguese), and key-value pairs are manually labeled
|
||||||
|
for each language. Experiment results show that the LayoutXLM model has significantly outperformed the existing SOTA
|
||||||
|
cross-lingual pre-trained models on the XFUN dataset.*
|
||||||
|
|
||||||
|
One can directly plug in the weights of LayoutXLM into a LayoutLMv2 model, like so:
|
||||||
|
|
||||||
|
```python
from transformers import LayoutLMv2Model

model = LayoutLMv2Model.from_pretrained('microsoft/layoutxlm-base')
```
|
||||||
|
|
||||||
|
Note that LayoutXLM has its own tokenizer, based on
|
||||||
|
[`LayoutXLMTokenizer`]/[`LayoutXLMTokenizerFast`]. You can initialize it as
|
||||||
|
follows:
|
||||||
|
|
||||||
|
```python
from transformers import LayoutXLMTokenizer

tokenizer = LayoutXLMTokenizer.from_pretrained('microsoft/layoutxlm-base')
```
|
||||||
|
|
||||||
|
Similar to LayoutLMv2, you can use [`LayoutXLMProcessor`] (which internally applies
|
||||||
|
[`LayoutLMv2FeatureExtractor`] and
|
||||||
|
[`LayoutXLMTokenizer`]/[`LayoutXLMTokenizerFast`] in sequence) to prepare all
|
||||||
|
data for the model.
|
||||||
|
|
||||||
|
As LayoutXLM's architecture is equivalent to that of LayoutLMv2, one can refer to [LayoutLMv2's documentation page](layoutlmv2) for all tips, code examples and notebooks.
|
||||||
|
|
||||||
|
This model was contributed by [nielsr](https://huggingface.co/nielsr). The original code can be found [here](https://github.com/microsoft/unilm).
|
||||||
|
|
||||||
|
|
||||||
|
## LayoutXLMTokenizer
|
||||||
|
|
||||||
|
[[autodoc]] LayoutXLMTokenizer
|
||||||
|
- __call__
|
||||||
|
- build_inputs_with_special_tokens
|
||||||
|
- get_special_tokens_mask
|
||||||
|
- create_token_type_ids_from_sequences
|
||||||
|
- save_vocabulary
|
||||||
|
|
||||||
|
## LayoutXLMTokenizerFast
|
||||||
|
|
||||||
|
[[autodoc]] LayoutXLMTokenizerFast
|
||||||
|
- __call__
|
||||||
|
|
||||||
|
## LayoutXLMProcessor
|
||||||
|
|
||||||
|
[[autodoc]] LayoutXLMProcessor
|
||||||
|
- __call__
|
||||||
@@ -1,84 +0,0 @@
|
|||||||
..
|
|
||||||
Copyright 2021 The HuggingFace Team. All rights reserved.
|
|
||||||
|
|
||||||
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
|
|
||||||
the License. You may obtain a copy of the License at
|
|
||||||
|
|
||||||
http://www.apache.org/licenses/LICENSE-2.0
|
|
||||||
|
|
||||||
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
|
|
||||||
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
|
|
||||||
specific language governing permissions and limitations under the License.
|
|
||||||
|
|
||||||
LayoutXLM
|
|
||||||
-----------------------------------------------------------------------------------------------------------------------
|
|
||||||
|
|
||||||
Overview
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
LayoutXLM was proposed in `LayoutXLM: Multimodal Pre-training for Multilingual Visually-rich Document Understanding
|
|
||||||
<https://arxiv.org/abs/2104.08836>`__ by Yiheng Xu, Tengchao Lv, Lei Cui, Guoxin Wang, Yijuan Lu, Dinei Florencio, Cha
|
|
||||||
Zhang, Furu Wei. It's a multilingual extension of the `LayoutLMv2 model <https://arxiv.org/abs/2012.14740>`__ trained
|
|
||||||
on 53 languages.
|
|
||||||
|
|
||||||
The abstract from the paper is the following:
|
|
||||||
|
|
||||||
*Multimodal pre-training with text, layout, and image has achieved SOTA performance for visually-rich document
|
|
||||||
understanding tasks recently, which demonstrates the great potential for joint learning across different modalities. In
|
|
||||||
this paper, we present LayoutXLM, a multimodal pre-trained model for multilingual document understanding, which aims to
|
|
||||||
bridge the language barriers for visually-rich document understanding. To accurately evaluate LayoutXLM, we also
|
|
||||||
introduce a multilingual form understanding benchmark dataset named XFUN, which includes form understanding samples in
|
|
||||||
7 languages (Chinese, Japanese, Spanish, French, Italian, German, Portuguese), and key-value pairs are manually labeled
|
|
||||||
for each language. Experiment results show that the LayoutXLM model has significantly outperformed the existing SOTA
|
|
||||||
cross-lingual pre-trained models on the XFUN dataset.*
|
|
||||||
|
|
||||||
One can directly plug in the weights of LayoutXLM into a LayoutLMv2 model, like so:
|
|
||||||
|
|
||||||
.. code-block::
|
|
||||||
|
|
||||||
from transformers import LayoutLMv2Model
|
|
||||||
|
|
||||||
model = LayoutLMv2Model.from_pretrained('microsoft/layoutxlm-base')
|
|
||||||
|
|
||||||
Note that LayoutXLM has its own tokenizer, based on
|
|
||||||
:class:`~transformers.LayoutXLMTokenizer`/:class:`~transformers.LayoutXLMTokenizerFast`. You can initialize it as
|
|
||||||
follows:
|
|
||||||
|
|
||||||
.. code-block::
|
|
||||||
|
|
||||||
from transformers import LayoutXLMTokenizer
|
|
||||||
|
|
||||||
tokenizer = LayoutXLMTokenizer.from_pretrained('microsoft/layoutxlm-base')
|
|
||||||
|
|
||||||
Similar to LayoutLMv2, you can use :class:`~transformers.LayoutXLMProcessor` (which internally applies
|
|
||||||
:class:`~transformers.LayoutLMv2FeatureExtractor` and
|
|
||||||
:class:`~transformers.LayoutXLMTokenizer`/:class:`~transformers.LayoutXLMTokenizerFast` in sequence) to prepare all
|
|
||||||
data for the model.
|
|
||||||
|
|
||||||
As LayoutXLM's architecture is equivalent to that of LayoutLMv2, one can refer to :doc:`LayoutLMv2's documentation page
|
|
||||||
<layoutlmv2>` for all tips, code examples and notebooks.
|
|
||||||
|
|
||||||
This model was contributed by `nielsr <https://huggingface.co/nielsr>`__. The original code can be found `here
|
|
||||||
<https://github.com/microsoft/unilm>`__.
|
|
||||||
|
|
||||||
|
|
||||||
LayoutXLMTokenizer
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.LayoutXLMTokenizer
|
|
||||||
:members: __call__, build_inputs_with_special_tokens, get_special_tokens_mask,
|
|
||||||
create_token_type_ids_from_sequences, save_vocabulary
|
|
||||||
|
|
||||||
|
|
||||||
LayoutXLMTokenizerFast
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.LayoutXLMTokenizerFast
|
|
||||||
:members: __call__
|
|
||||||
|
|
||||||
|
|
||||||
LayoutXLMProcessor
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.LayoutXLMProcessor
|
|
||||||
:members: __call__
|
|
||||||
117
docs/source/model_doc/led.mdx
Normal file
@@ -0,0 +1,117 @@
|
|||||||
|
<!--Copyright 2020 The HuggingFace Team. All rights reserved.
|
||||||
|
|
||||||
|
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
|
||||||
|
the License. You may obtain a copy of the License at
|
||||||
|
|
||||||
|
http://www.apache.org/licenses/LICENSE-2.0
|
||||||
|
|
||||||
|
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
|
||||||
|
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
|
||||||
|
specific language governing permissions and limitations under the License.
|
||||||
|
-->
|
||||||
|
|
||||||
|
# LED
|
||||||
|
|
||||||
|
## Overview
|
||||||
|
|
||||||
|
The LED model was proposed in [Longformer: The Long-Document Transformer](https://arxiv.org/abs/2004.05150) by Iz
|
||||||
|
Beltagy, Matthew E. Peters, Arman Cohan.
|
||||||
|
|
||||||
|
The abstract from the paper is the following:
|
||||||
|
|
||||||
|
*Transformer-based models are unable to process long sequences due to their self-attention operation, which scales
|
||||||
|
quadratically with the sequence length. To address this limitation, we introduce the Longformer with an attention
|
||||||
|
mechanism that scales linearly with sequence length, making it easy to process documents of thousands of tokens or
|
||||||
|
longer. Longformer's attention mechanism is a drop-in replacement for the standard self-attention and combines a local
|
||||||
|
windowed attention with a task motivated global attention. Following prior work on long-sequence transformers, we
|
||||||
|
evaluate Longformer on character-level language modeling and achieve state-of-the-art results on text8 and enwik8. In
|
||||||
|
contrast to most prior work, we also pretrain Longformer and finetune it on a variety of downstream tasks. Our
|
||||||
|
pretrained Longformer consistently outperforms RoBERTa on long document tasks and sets new state-of-the-art results on
|
||||||
|
WikiHop and TriviaQA. We finally introduce the Longformer-Encoder-Decoder (LED), a Longformer variant for supporting
|
||||||
|
long document generative sequence-to-sequence tasks, and demonstrate its effectiveness on the arXiv summarization
|
||||||
|
dataset.*
|
||||||
|
|
||||||
|
Tips:
|
||||||
|
|
||||||
|
- [`LEDForConditionalGeneration`] is an extension of
|
||||||
|
[`BartForConditionalGeneration`] exchanging the traditional *self-attention* layer with
|
||||||
|
*Longformer*'s *chunked self-attention* layer. [`LEDTokenizer`] is an alias of
|
||||||
|
[`BartTokenizer`].
|
||||||
|
- LED works very well on long-range *sequence-to-sequence* tasks where the `input_ids` largely exceed a length of
|
||||||
|
1024 tokens.
|
||||||
|
- LED pads the `input_ids` to be a multiple of `config.attention_window` if required. Therefore a small speed-up is
|
||||||
|
gained when [`LEDTokenizer`] is used with the `pad_to_multiple_of` argument.
|
||||||
|
- LED makes use of *global attention* by means of the `global_attention_mask` (see
|
||||||
|
[`LongformerModel`]). For summarization, it is advised to put *global attention* only on the first
|
||||||
|
`<s>` token. For question answering, it is advised to put *global attention* on all tokens of the question.
|
||||||
|
- To fine-tune LED on all 16384 tokens, it is necessary to enable *gradient checkpointing* by executing
|
||||||
|
`model.gradient_checkpointing_enable()`.
|
||||||
|
- A notebook showing how to evaluate LED can be accessed [here](https://colab.research.google.com/drive/12INTTR6n64TzS4RrXZxMSXfrOd9Xzamo?usp=sharing).
|
||||||
|
- A notebook showing how to fine-tune LED can be accessed [here](https://colab.research.google.com/drive/12LjJazBl7Gam0XBPy_y0CTOJZeZ34c2v?usp=sharing).
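
The padding and global-attention tips above can be sketched in plain Python. This is only an illustration, not the library's implementation: `padded_length` and `summarization_global_attention_mask` are hypothetical helper names, and the window size of 1024 is an assumed value standing in for `config.attention_window`.

```python
from typing import List


def padded_length(seq_len: int, attention_window: int) -> int:
    """Smallest multiple of `attention_window` that is >= `seq_len`."""
    remainder = seq_len % attention_window
    return seq_len if remainder == 0 else seq_len + attention_window - remainder


def summarization_global_attention_mask(seq_len: int) -> List[int]:
    """Global attention (1) on the first token only, local attention (0) elsewhere."""
    mask = [0] * seq_len
    if seq_len > 0:
        mask[0] = 1
    return mask


attention_window = 1024  # assumed value; read it from `config.attention_window` in practice
print(padded_length(3000, attention_window))   # 3072
print(summarization_global_attention_mask(5))  # [1, 0, 0, 0, 0]
```

Inputs whose length is already a multiple of the window, for example because the tokenizer was called with `pad_to_multiple_of`, do not need to be padded again by the model.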
|
||||||
|
|
||||||
|
This model was contributed by [patrickvonplaten](https://huggingface.co/patrickvonplaten).
|
||||||
|
|
||||||
|
|
||||||
|
## LEDConfig
|
||||||
|
|
||||||
|
[[autodoc]] LEDConfig
|
||||||
|
|
||||||
|
## LEDTokenizer
|
||||||
|
|
||||||
|
[[autodoc]] LEDTokenizer
|
||||||
|
- build_inputs_with_special_tokens
|
||||||
|
- get_special_tokens_mask
|
||||||
|
- create_token_type_ids_from_sequences
|
||||||
|
- save_vocabulary
|
||||||
|
|
||||||
|
## LEDTokenizerFast
|
||||||
|
|
||||||
|
[[autodoc]] LEDTokenizerFast
|
||||||
|
|
||||||
|
## LED specific outputs
|
||||||
|
|
||||||
|
[[autodoc]] models.led.modeling_led.LEDEncoderBaseModelOutput
|
||||||
|
|
||||||
|
[[autodoc]] models.led.modeling_led.LEDSeq2SeqModelOutput
|
||||||
|
|
||||||
|
[[autodoc]] models.led.modeling_led.LEDSeq2SeqLMOutput
|
||||||
|
|
||||||
|
[[autodoc]] models.led.modeling_led.LEDSeq2SeqSequenceClassifierOutput
|
||||||
|
|
||||||
|
[[autodoc]] models.led.modeling_led.LEDSeq2SeqQuestionAnsweringModelOutput
|
||||||
|
|
||||||
|
[[autodoc]] models.led.modeling_tf_led.TFLEDEncoderBaseModelOutput
|
||||||
|
|
||||||
|
[[autodoc]] models.led.modeling_tf_led.TFLEDSeq2SeqModelOutput
|
||||||
|
|
||||||
|
[[autodoc]] models.led.modeling_tf_led.TFLEDSeq2SeqLMOutput
|
||||||
|
|
||||||
|
## LEDModel
|
||||||
|
|
||||||
|
[[autodoc]] LEDModel
|
||||||
|
- forward
|
||||||
|
|
||||||
|
## LEDForConditionalGeneration
|
||||||
|
|
||||||
|
[[autodoc]] LEDForConditionalGeneration
|
||||||
|
- forward
|
||||||
|
|
||||||
|
## LEDForSequenceClassification
|
||||||
|
|
||||||
|
[[autodoc]] LEDForSequenceClassification
|
||||||
|
- forward
|
||||||
|
|
||||||
|
## LEDForQuestionAnswering
|
||||||
|
|
||||||
|
[[autodoc]] LEDForQuestionAnswering
|
||||||
|
- forward
|
||||||
|
|
||||||
|
## TFLEDModel
|
||||||
|
|
||||||
|
[[autodoc]] TFLEDModel
|
||||||
|
- call
|
||||||
|
|
||||||
|
## TFLEDForConditionalGeneration
|
||||||
|
|
||||||
|
[[autodoc]] TFLEDForConditionalGeneration
|
||||||
|
- call
|
||||||
@@ -1,150 +0,0 @@
|
|||||||
..
|
|
||||||
Copyright 2020 The HuggingFace Team. All rights reserved.
|
|
||||||
|
|
||||||
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
|
|
||||||
the License. You may obtain a copy of the License at
|
|
||||||
|
|
||||||
http://www.apache.org/licenses/LICENSE-2.0
|
|
||||||
|
|
||||||
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
|
|
||||||
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
|
|
||||||
specific language governing permissions and limitations under the License.
|
|
||||||
|
|
||||||
LED
|
|
||||||
-----------------------------------------------------------------------------------------------------------------------
|
|
||||||
|
|
||||||
Overview
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
The LED model was proposed in `Longformer: The Long-Document Transformer <https://arxiv.org/abs/2004.05150>`__ by Iz
|
|
||||||
Beltagy, Matthew E. Peters, Arman Cohan.
|
|
||||||
|
|
||||||
The abstract from the paper is the following:
|
|
||||||
|
|
||||||
*Transformer-based models are unable to process long sequences due to their self-attention operation, which scales
|
|
||||||
quadratically with the sequence length. To address this limitation, we introduce the Longformer with an attention
|
|
||||||
mechanism that scales linearly with sequence length, making it easy to process documents of thousands of tokens or
|
|
||||||
longer. Longformer's attention mechanism is a drop-in replacement for the standard self-attention and combines a local
|
|
||||||
windowed attention with a task motivated global attention. Following prior work on long-sequence transformers, we
|
|
||||||
evaluate Longformer on character-level language modeling and achieve state-of-the-art results on text8 and enwik8. In
|
|
||||||
contrast to most prior work, we also pretrain Longformer and finetune it on a variety of downstream tasks. Our
|
|
||||||
pretrained Longformer consistently outperforms RoBERTa on long document tasks and sets new state-of-the-art results on
|
|
||||||
WikiHop and TriviaQA. We finally introduce the Longformer-Encoder-Decoder (LED), a Longformer variant for supporting
|
|
||||||
long document generative sequence-to-sequence tasks, and demonstrate its effectiveness on the arXiv summarization
|
|
||||||
dataset.*
|
|
||||||
|
|
||||||
Tips:
|
|
||||||
|
|
||||||
- :class:`~transformers.LEDForConditionalGeneration` is an extension of
|
|
||||||
:class:`~transformers.BartForConditionalGeneration` exchanging the traditional *self-attention* layer with
|
|
||||||
*Longformer*'s *chunked self-attention* layer. :class:`~transformers.LEDTokenizer` is an alias of
|
|
||||||
:class:`~transformers.BartTokenizer`.
|
|
||||||
- LED works very well on long-range *sequence-to-sequence* tasks where the ``input_ids`` largely exceed a length of
|
|
||||||
1024 tokens.
|
|
||||||
- LED pads the ``input_ids`` to be a multiple of ``config.attention_window`` if required. Therefore a small speed-up is
|
|
||||||
gained, when :class:`~transformers.LEDTokenizer` is used with the ``pad_to_multiple_of`` argument.
|
|
||||||
- LED makes use of *global attention* by means of the ``global_attention_mask`` (see
|
|
||||||
:class:`~transformers.LongformerModel`). For summarization, it is advised to put *global attention* only on the first
|
|
||||||
``<s>`` token. For question answering, it is advised to put *global attention* on all tokens of the question.
|
|
||||||
- To fine-tune LED on all 16384, it is necessary to enable *gradient checkpointing* by executing
|
|
||||||
``model.gradient_checkpointing_enable()``.
|
|
||||||
- A notebook showing how to evaluate LED, can be accessed `here
|
|
||||||
<https://colab.research.google.com/drive/12INTTR6n64TzS4RrXZxMSXfrOd9Xzamo?usp=sharing>`__.
|
|
||||||
- A notebook showing how to fine-tune LED, can be accessed `here
|
|
||||||
<https://colab.research.google.com/drive/12LjJazBl7Gam0XBPy_y0CTOJZeZ34c2v?usp=sharing>`__.
|
|
||||||
|
|
||||||
This model was contributed by `patrickvonplaten <https://huggingface.co/patrickvonplaten>`__.
|
|
||||||
|
|
||||||
|
|
||||||
LEDConfig
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.LEDConfig
|
|
||||||
:members:
|
|
||||||
|
|
||||||
|
|
||||||
LEDTokenizer
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.LEDTokenizer
|
|
||||||
:members: build_inputs_with_special_tokens, get_special_tokens_mask,
|
|
||||||
create_token_type_ids_from_sequences, save_vocabulary
|
|
||||||
|
|
||||||
|
|
||||||
LEDTokenizerFast
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.LEDTokenizerFast
|
|
||||||
:members:
|
|
||||||
|
|
||||||
|
|
||||||
LED specific outputs
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.models.led.modeling_led.LEDEncoderBaseModelOutput
|
|
||||||
:members:
|
|
||||||
|
|
||||||
.. autoclass:: transformers.models.led.modeling_led.LEDSeq2SeqModelOutput
|
|
||||||
:members:
|
|
||||||
|
|
||||||
.. autoclass:: transformers.models.led.modeling_led.LEDSeq2SeqLMOutput
|
|
||||||
:members:
|
|
||||||
|
|
||||||
.. autoclass:: transformers.models.led.modeling_led.LEDSeq2SeqSequenceClassifierOutput
|
|
||||||
:members:
|
|
||||||
|
|
||||||
.. autoclass:: transformers.models.led.modeling_led.LEDSeq2SeqQuestionAnsweringModelOutput
|
|
||||||
:members:
|
|
||||||
|
|
||||||
.. autoclass:: transformers.models.led.modeling_tf_led.TFLEDEncoderBaseModelOutput
|
|
||||||
:members:
|
|
||||||
|
|
||||||
.. autoclass:: transformers.models.led.modeling_tf_led.TFLEDSeq2SeqModelOutput
|
|
||||||
:members:
|
|
||||||
|
|
||||||
.. autoclass:: transformers.models.led.modeling_tf_led.TFLEDSeq2SeqLMOutput
|
|
||||||
:members:
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
LEDModel
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.LEDModel
|
|
||||||
:members: forward
|
|
||||||
|
|
||||||
|
|
||||||
LEDForConditionalGeneration
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.LEDForConditionalGeneration
|
|
||||||
:members: forward
|
|
||||||
|
|
||||||
|
|
||||||
LEDForSequenceClassification
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.LEDForSequenceClassification
|
|
||||||
:members: forward
|
|
||||||
|
|
||||||
|
|
||||||
LEDForQuestionAnswering
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.LEDForQuestionAnswering
|
|
||||||
:members: forward
|
|
||||||
|
|
||||||
|
|
||||||
TFLEDModel
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.TFLEDModel
|
|
||||||
:members: call
|
|
||||||
|
|
||||||
|
|
||||||
TFLEDForConditionalGeneration
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.TFLEDForConditionalGeneration
|
|
||||||
:members: call
|
|
||||||
184
docs/source/model_doc/longformer.mdx
Normal file
@@ -0,0 +1,184 @@
|
|||||||
|
<!--Copyright 2020 The HuggingFace Team. All rights reserved.
|
||||||
|
|
||||||
|
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
|
||||||
|
the License. You may obtain a copy of the License at
|
||||||
|
|
||||||
|
http://www.apache.org/licenses/LICENSE-2.0
|
||||||
|
|
||||||
|
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
|
||||||
|
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
|
||||||
|
specific language governing permissions and limitations under the License.
|
||||||
|
-->
|
||||||
|
|
||||||
|
# Longformer
|
||||||
|
|
||||||
|
**DISCLAIMER:** This model is still a work in progress. If you see something strange, file a [GitHub Issue](https://github.com/huggingface/transformers/issues/new?assignees=&labels=&template=bug-report.md&title).
|
||||||
|
|
||||||
|
## Overview
|
||||||
|
|
||||||
|
The Longformer model was presented in [Longformer: The Long-Document Transformer](https://arxiv.org/pdf/2004.05150.pdf) by Iz Beltagy, Matthew E. Peters, Arman Cohan.
|
||||||
|
|
||||||
|
The abstract from the paper is the following:
|
||||||
|
|
||||||
|
*Transformer-based models are unable to process long sequences due to their self-attention operation, which scales
|
||||||
|
quadratically with the sequence length. To address this limitation, we introduce the Longformer with an attention
|
||||||
|
mechanism that scales linearly with sequence length, making it easy to process documents of thousands of tokens or
|
||||||
|
longer. Longformer's attention mechanism is a drop-in replacement for the standard self-attention and combines a local
|
||||||
|
windowed attention with a task motivated global attention. Following prior work on long-sequence transformers, we
|
||||||
|
evaluate Longformer on character-level language modeling and achieve state-of-the-art results on text8 and enwik8. In
|
||||||
|
contrast to most prior work, we also pretrain Longformer and finetune it on a variety of downstream tasks. Our
|
||||||
|
pretrained Longformer consistently outperforms RoBERTa on long document tasks and sets new state-of-the-art results on
|
||||||
|
WikiHop and TriviaQA.*
|
||||||
|
|
||||||
|
Tips:
|
||||||
|
|
||||||
|
- Since the Longformer is based on RoBERTa, it doesn't have `token_type_ids`. You don't need to indicate which
|
||||||
|
token belongs to which segment. Just separate your segments with the separation token `tokenizer.sep_token` (or
|
||||||
|
`</s>`).
|
||||||
|
|
||||||
|
This model was contributed by [beltagy](https://huggingface.co/beltagy). The authors' code can be found [here](https://github.com/allenai/longformer).
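
As a rough illustration of the tip above, several segments can be joined with the separator token before tokenization. The sketch hard-codes `</s>` as the separator and uses a hypothetical helper, `join_segments`; with a real tokenizer, the value would come from `tokenizer.sep_token`, and the tokenizer adds its own special tokens at the start and end of the sequence.

```python
from typing import List

SEP_TOKEN = "</s>"  # assumed value of `tokenizer.sep_token`


def join_segments(segments: List[str], sep_token: str = SEP_TOKEN) -> str:
    """Join text segments with the separator token; no `token_type_ids` are involved."""
    return f" {sep_token} ".join(segment.strip() for segment in segments)


text = join_segments(["Who wrote Longformer?", "Longformer was written by Beltagy et al."])
print(text)
# Who wrote Longformer? </s> Longformer was written by Beltagy et al.
```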
|
||||||
|
|
||||||
|
## Longformer Self Attention
|
||||||
|
|
||||||
|
Longformer self attention employs self attention on both a "local" context and a "global" context. Most tokens only
|
||||||
|
attend "locally" to each other meaning that each token attends to its \\(\frac{1}{2} w\\) previous tokens and
|
||||||
|
\\(\frac{1}{2} w\\) succeeding tokens with \\(w\\) being the window length as defined in
|
||||||
|
`config.attention_window`. Note that `config.attention_window` can be of type `List` to define a
|
||||||
|
different \\(w\\) for each layer. A selected few tokens attend "globally" to all other tokens, as it is
|
||||||
|
conventionally done for all tokens in `BertSelfAttention`.
|
||||||
|
|
||||||
|
Note that "locally" and "globally" attending tokens are projected by different query, key and value matrices. Also note
|
||||||
|
that every "locally" attending token not only attends to tokens within its window \\(w\\), but also to all "globally"
|
||||||
|
attending tokens so that global attention is *symmetric*.
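
The resulting attention pattern can be sketched in plain Python. This is an illustrative model of which positions each token sees, under the assumptions stated here, and not the library's implementation: `attended_positions` is a hypothetical helper, a single window size is used for all tokens, and the separate projection matrices are ignored.

```python
from typing import List, Set


def attended_positions(global_attention_mask: List[int], window: int) -> List[Set[int]]:
    """For each token, return the set of positions it attends to.

    A "local" token sees `window // 2` tokens on each side plus every "global" token;
    a "global" token sees the whole sequence.
    """
    n = len(global_attention_mask)
    global_positions = {i for i, flag in enumerate(global_attention_mask) if flag == 1}
    half = window // 2
    pattern = []
    for i in range(n):
        if i in global_positions:
            pattern.append(set(range(n)))
        else:
            local = set(range(max(0, i - half), min(n, i + half + 1)))
            pattern.append(local | global_positions)
    return pattern


mask = [1, 0, 0, 0, 0, 0, 0, 0]  # only the first token attends globally
pattern = attended_positions(mask, window=2)
print(sorted(pattern[0]))  # [0, 1, 2, 3, 4, 5, 6, 7]
print(sorted(pattern[5]))  # [0, 4, 5, 6]
```

In this sketch, token `i` attends to token `j` exactly when token `j` attends to token `i`, which is the symmetry described above.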
|
||||||
|
|
||||||
|
The user can define which tokens attend "locally" and which tokens attend "globally" by setting the tensor
|
||||||
|
`global_attention_mask` at run-time appropriately. All Longformer models employ the following logic for
|
||||||
|
`global_attention_mask`:
|
||||||
|
|
||||||
|
- 0: the token attends "locally",
|
||||||
|
- 1: the token attends "globally".
|
||||||
|
|
||||||
|
For more information, please also refer to the [`~LongformerModel.forward`] method.
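
A minimal sketch of building such a mask for one sequence is shown below. `build_global_attention_mask` is a hypothetical helper and the chosen positions are arbitrary. In practice the mask would be converted to a tensor with the same shape as `input_ids` and passed as `global_attention_mask` to the model.

```python
from typing import Iterable, List


def build_global_attention_mask(seq_len: int, global_positions: Iterable[int]) -> List[int]:
    """Return a 0/1 mask: 1 where the token attends "globally", 0 where it attends "locally"."""
    mask = [0] * seq_len
    for position in global_positions:
        if not 0 <= position < seq_len:
            raise ValueError(f"position {position} is outside the sequence of length {seq_len}")
        mask[position] = 1
    return mask


mask = build_global_attention_mask(8, global_positions=[0, 3])
print(mask)  # [1, 0, 0, 1, 0, 0, 0, 0]
```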
|
||||||
|
|
||||||
|
Using Longformer self attention, the memory and time complexity of the query-key matmul operation, which usually
|
||||||
|
represents the memory and time bottleneck, can be reduced from \\(\mathcal{O}(n_s \times n_s)\\) to
|
||||||
|
\\(\mathcal{O}(n_s \times w)\\), with \\(n_s\\) being the sequence length and \\(w\\) being the average window
|
||||||
|
size. It is assumed that the number of "globally" attending tokens is insignificant as compared to the number of
|
||||||
|
"locally" attending tokens.
|
||||||
|
|
||||||
|
For more information, please refer to the official [paper](https://arxiv.org/pdf/2004.05150.pdf).
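
To get a feel for the reduction in the query-key matmul, the two sizes can be compared for concrete numbers. The sequence length of 4096 and window of 512 below are illustrative assumptions, `attention_score_entries` is a hypothetical helper, and globally attending tokens are ignored, as in the estimate above.

```python
from typing import Tuple


def attention_score_entries(seq_len: int, window: int) -> Tuple[int, int]:
    """Number of query-key scores for full attention and for windowed attention."""
    full = seq_len * seq_len    # O(n_s x n_s)
    windowed = seq_len * window  # O(n_s x w)
    return full, windowed


full, windowed = attention_score_entries(seq_len=4096, window=512)
print(full)              # 16777216
print(windowed)          # 2097152
print(full // windowed)  # 8
```

The ratio between the two is simply `seq_len / window`, so the saving grows linearly with the sequence length for a fixed window.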
|
||||||
|
|
||||||
|
|
||||||
|
## Training
|
||||||
|
|
||||||
|
[`LongformerForMaskedLM`] is trained the exact same way [`RobertaForMaskedLM`] is
|
||||||
|
trained and should be used as follows:
|
||||||
|
|
||||||
|
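```python
# `tokenizer` is a Longformer tokenizer and `model` a `LongformerForMaskedLM` instance.
# Longformer reuses the RoBERTa vocabulary, so the mask token is `tokenizer.mask_token` (`<mask>`), not `[MASK]`.
input_ids = tokenizer.encode(f'This is a sentence from {tokenizer.mask_token} training data', return_tensors='pt')
mlm_labels = tokenizer.encode('This is a sentence from the training data', return_tensors='pt')

# The targets are passed through `labels`.
loss = model(input_ids, labels=mlm_labels)[0]
```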
|
||||||
|
|
||||||
|
## LongformerConfig
|
||||||
|
|
||||||
|
[[autodoc]] LongformerConfig
|
||||||
|
|
||||||
|
## LongformerTokenizer
|
||||||
|
|
[[autodoc]] LongformerTokenizer

## LongformerTokenizerFast

[[autodoc]] LongformerTokenizerFast

## Longformer specific outputs

[[autodoc]] models.longformer.modeling_longformer.LongformerBaseModelOutput

[[autodoc]] models.longformer.modeling_longformer.LongformerBaseModelOutputWithPooling

[[autodoc]] models.longformer.modeling_longformer.LongformerMaskedLMOutput

[[autodoc]] models.longformer.modeling_longformer.LongformerQuestionAnsweringModelOutput

[[autodoc]] models.longformer.modeling_longformer.LongformerSequenceClassifierOutput

[[autodoc]] models.longformer.modeling_longformer.LongformerMultipleChoiceModelOutput

[[autodoc]] models.longformer.modeling_longformer.LongformerTokenClassifierOutput

[[autodoc]] models.longformer.modeling_tf_longformer.TFLongformerBaseModelOutput

[[autodoc]] models.longformer.modeling_tf_longformer.TFLongformerBaseModelOutputWithPooling

[[autodoc]] models.longformer.modeling_tf_longformer.TFLongformerMaskedLMOutput

[[autodoc]] models.longformer.modeling_tf_longformer.TFLongformerQuestionAnsweringModelOutput

[[autodoc]] models.longformer.modeling_tf_longformer.TFLongformerSequenceClassifierOutput

[[autodoc]] models.longformer.modeling_tf_longformer.TFLongformerMultipleChoiceModelOutput

[[autodoc]] models.longformer.modeling_tf_longformer.TFLongformerTokenClassifierOutput

## LongformerModel

[[autodoc]] LongformerModel
    - forward

## LongformerForMaskedLM

[[autodoc]] LongformerForMaskedLM
    - forward

## LongformerForSequenceClassification

[[autodoc]] LongformerForSequenceClassification
    - forward

## LongformerForMultipleChoice

[[autodoc]] LongformerForMultipleChoice
    - forward

## LongformerForTokenClassification

[[autodoc]] LongformerForTokenClassification
    - forward

## LongformerForQuestionAnswering

[[autodoc]] LongformerForQuestionAnswering
    - forward

## TFLongformerModel

[[autodoc]] TFLongformerModel
    - call

## TFLongformerForMaskedLM

[[autodoc]] TFLongformerForMaskedLM
    - call

## TFLongformerForQuestionAnswering

[[autodoc]] TFLongformerForQuestionAnswering
    - call

## TFLongformerForSequenceClassification

[[autodoc]] TFLongformerForSequenceClassification
    - call

## TFLongformerForTokenClassification

[[autodoc]] TFLongformerForTokenClassification
    - call

## TFLongformerForMultipleChoice

[[autodoc]] TFLongformerForMultipleChoice
    - call
@@ -1,239 +0,0 @@
..
    Copyright 2020 The HuggingFace Team. All rights reserved.

    Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
    the License. You may obtain a copy of the License at

        http://www.apache.org/licenses/LICENSE-2.0

    Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
    an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
    specific language governing permissions and limitations under the License.

Longformer
-----------------------------------------------------------------------------------------------------------------------

**DISCLAIMER:** This model is still a work in progress. If you see something strange, file a `GitHub Issue
<https://github.com/huggingface/transformers/issues/new?assignees=&labels=&template=bug-report.md&title>`__.

Overview
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The Longformer model was presented in `Longformer: The Long-Document Transformer
<https://arxiv.org/pdf/2004.05150.pdf>`__ by Iz Beltagy, Matthew E. Peters, Arman Cohan.

The abstract from the paper is the following:

*Transformer-based models are unable to process long sequences due to their self-attention operation, which scales
quadratically with the sequence length. To address this limitation, we introduce the Longformer with an attention
mechanism that scales linearly with sequence length, making it easy to process documents of thousands of tokens or
longer. Longformer's attention mechanism is a drop-in replacement for the standard self-attention and combines a local
windowed attention with a task motivated global attention. Following prior work on long-sequence transformers, we
evaluate Longformer on character-level language modeling and achieve state-of-the-art results on text8 and enwik8. In
contrast to most prior work, we also pretrain Longformer and finetune it on a variety of downstream tasks. Our
pretrained Longformer consistently outperforms RoBERTa on long document tasks and sets new state-of-the-art results on
WikiHop and TriviaQA.*

Tips:

- Since the Longformer is based on RoBERTa, it doesn't have :obj:`token_type_ids`. You don't need to indicate which
  token belongs to which segment. Just separate your segments with the separation token :obj:`tokenizer.sep_token` (or
  :obj:`</s>`).

This model was contributed by `beltagy <https://huggingface.co/beltagy>`__. The Authors' code can be found `here
<https://github.com/allenai/longformer>`__.

Longformer Self Attention
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Longformer self attention employs self attention on both a "local" context and a "global" context. Most tokens only
attend "locally" to each other, meaning that each token attends to its :math:`\frac{1}{2} w` previous tokens and
:math:`\frac{1}{2} w` succeeding tokens, with :math:`w` being the window length as defined in
:obj:`config.attention_window`. Note that :obj:`config.attention_window` can be of type :obj:`List` to define a
different :math:`w` for each layer. A selected few tokens attend "globally" to all other tokens, as is
conventionally done for all tokens in :obj:`BertSelfAttention`.

Note that "locally" and "globally" attending tokens are projected by different query, key and value matrices. Also note
that every "locally" attending token not only attends to tokens within its window :math:`w`, but also to all "globally"
attending tokens so that global attention is *symmetric*.

The user can define which tokens attend "locally" and which tokens attend "globally" by setting the tensor
:obj:`global_attention_mask` at run-time appropriately. All Longformer models employ the following logic for
:obj:`global_attention_mask`:

- 0: the token attends "locally",
- 1: the token attends "globally".

For more information, please also refer to the :meth:`~transformers.LongformerModel.forward` method.
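
As a rough illustration of the rules above, the following pure-Python sketch lists which positions each token would
attend to for a single window size and a given :obj:`global_attention_mask`. The helper ``attended_positions`` is
hypothetical and not part of the library. It also assumes that a token attends to itself and that the window is simply
truncated at the sequence boundaries.

```python
def attended_positions(seq_len, window, global_mask):
    """Return, for each token, the sorted list of positions it attends to.

    Illustrative only: `window` stands in for config.attention_window (one value
    for all layers) and `global_mask` for global_attention_mask (0 = local, 1 = global).
    """
    half = window // 2
    global_idx = [j for j, g in enumerate(global_mask) if g == 1]
    pattern = []
    for i in range(seq_len):
        if global_mask[i] == 1:
            # A "globally" attending token sees the whole sequence.
            attended = set(range(seq_len))
        else:
            # A "locally" attending token sees w/2 tokens on each side, plus itself (assumption).
            lo, hi = max(0, i - half), min(seq_len, i + half + 1)
            attended = set(range(lo, hi))
            # It also sees every global token, which keeps global attention symmetric.
            attended.update(global_idx)
        pattern.append(sorted(attended))
    return pattern


# 8 tokens, window of 2 (one neighbour on each side), first token marked as global.
mask = [1, 0, 0, 0, 0, 0, 0, 0]
pattern = attended_positions(seq_len=8, window=2, global_mask=mask)
print(pattern[0])  # the global token attends to all 8 positions
print(pattern[5])  # a local token attends to its window and to the global token
```

With the first token marked as global, position 5 attends to its immediate neighbours, itself, and position 0.
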

Using Longformer self attention, the memory and time complexity of the query-key matmul operation, which usually
represents the memory and time bottleneck, can be reduced from :math:`\mathcal{O}(n_s \times n_s)` to
:math:`\mathcal{O}(n_s \times w)`, with :math:`n_s` being the sequence length and :math:`w` being the average window
size. It is assumed that the number of "globally" attending tokens is insignificant as compared to the number of
"locally" attending tokens.
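
To make the saving concrete, the sketch below counts query-key scores for full and windowed attention. The values of
``n_s`` and ``w`` are illustrative assumptions rather than values read from a checkpoint, and the count ignores global
tokens and boundary effects.

```python
def attention_scores_count(seq_len, window=None):
    """Number of query-key scores: n_s * n_s for full attention, n_s * w for windowed attention."""
    return seq_len * (seq_len if window is None else window)


n_s, w = 4096, 512  # illustrative sequence length and window size
full = attention_scores_count(n_s)
local = attention_scores_count(n_s, w)
print(full, local, full // local)  # 16777216 2097152 8
```

Under these assumptions the windowed variant computes one eighth of the scores, and the ratio grows linearly with
the sequence length for a fixed window.
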

For more information, please refer to the official `paper <https://arxiv.org/pdf/2004.05150.pdf>`__.


Training
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

:class:`~transformers.LongformerForMaskedLM` is trained in exactly the same way as
:class:`~transformers.RobertaForMaskedLM` and should be used as follows:

.. code-block::

    # Assumes `tokenizer` and `model` are an already loaded LongformerTokenizer and LongformerForMaskedLM.
    # RoBERTa-style tokenizers use `tokenizer.mask_token` ("<mask>") rather than "[MASK]".
    input_ids = tokenizer.encode(f'This is a sentence from {tokenizer.mask_token} training data', return_tensors='pt')
    mlm_labels = tokenizer.encode('This is a sentence from the training data', return_tensors='pt')

    # The MLM targets are passed as `labels`; `masked_lm_labels` is a deprecated name for the same argument.
    loss = model(input_ids, labels=mlm_labels)[0]


LongformerConfig
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.LongformerConfig
    :members:


LongformerTokenizer
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.LongformerTokenizer
    :members:


LongformerTokenizerFast
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.LongformerTokenizerFast
    :members:

Longformer specific outputs
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.models.longformer.modeling_longformer.LongformerBaseModelOutput
    :members:

.. autoclass:: transformers.models.longformer.modeling_longformer.LongformerBaseModelOutputWithPooling
    :members:

.. autoclass:: transformers.models.longformer.modeling_longformer.LongformerMaskedLMOutput
    :members:

.. autoclass:: transformers.models.longformer.modeling_longformer.LongformerQuestionAnsweringModelOutput
    :members:

.. autoclass:: transformers.models.longformer.modeling_longformer.LongformerSequenceClassifierOutput
    :members:

.. autoclass:: transformers.models.longformer.modeling_longformer.LongformerMultipleChoiceModelOutput
    :members:

.. autoclass:: transformers.models.longformer.modeling_longformer.LongformerTokenClassifierOutput
    :members:

.. autoclass:: transformers.models.longformer.modeling_tf_longformer.TFLongformerBaseModelOutput
    :members:

.. autoclass:: transformers.models.longformer.modeling_tf_longformer.TFLongformerBaseModelOutputWithPooling
    :members:

.. autoclass:: transformers.models.longformer.modeling_tf_longformer.TFLongformerMaskedLMOutput
    :members:

.. autoclass:: transformers.models.longformer.modeling_tf_longformer.TFLongformerQuestionAnsweringModelOutput
    :members:

.. autoclass:: transformers.models.longformer.modeling_tf_longformer.TFLongformerSequenceClassifierOutput
    :members:

.. autoclass:: transformers.models.longformer.modeling_tf_longformer.TFLongformerMultipleChoiceModelOutput
    :members:

.. autoclass:: transformers.models.longformer.modeling_tf_longformer.TFLongformerTokenClassifierOutput
    :members:

LongformerModel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.LongformerModel
    :members: forward


LongformerForMaskedLM
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.LongformerForMaskedLM
    :members: forward


LongformerForSequenceClassification
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.LongformerForSequenceClassification
    :members: forward


LongformerForMultipleChoice
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.LongformerForMultipleChoice
    :members: forward


LongformerForTokenClassification
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.LongformerForTokenClassification
    :members: forward


LongformerForQuestionAnswering
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.LongformerForQuestionAnswering
    :members: forward


TFLongformerModel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.TFLongformerModel
    :members: call


TFLongformerForMaskedLM
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.TFLongformerForMaskedLM
    :members: call


TFLongformerForQuestionAnswering
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.TFLongformerForQuestionAnswering
    :members: call


TFLongformerForSequenceClassification
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.TFLongformerForSequenceClassification
    :members: call


TFLongformerForTokenClassification
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.TFLongformerForTokenClassification
    :members: call


TFLongformerForMultipleChoice
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.TFLongformerForMultipleChoice
    :members: call
151
docs/source/model_doc/luke.mdx
Normal file
@@ -0,0 +1,151 @@
<!--Copyright 2021 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->

# LUKE

## Overview

The LUKE model was proposed in [LUKE: Deep Contextualized Entity Representations with Entity-aware Self-attention](https://arxiv.org/abs/2010.01057) by Ikuya Yamada, Akari Asai, Hiroyuki Shindo, Hideaki Takeda and Yuji Matsumoto.
It is based on RoBERTa and adds entity embeddings as well as an entity-aware self-attention mechanism, which helps
improve performance on various downstream tasks involving reasoning about entities such as named entity recognition,
extractive and cloze-style question answering, entity typing, and relation classification.

The abstract from the paper is the following:

*Entity representations are useful in natural language tasks involving entities. In this paper, we propose new
pretrained contextualized representations of words and entities based on the bidirectional transformer. The proposed
model treats words and entities in a given text as independent tokens, and outputs contextualized representations of
them. Our model is trained using a new pretraining task based on the masked language model of BERT. The task involves
predicting randomly masked words and entities in a large entity-annotated corpus retrieved from Wikipedia. We also
propose an entity-aware self-attention mechanism that is an extension of the self-attention mechanism of the
transformer, and considers the types of tokens (words or entities) when computing attention scores. The proposed model
achieves impressive empirical performance on a wide range of entity-related tasks. In particular, it obtains
state-of-the-art results on five well-known datasets: Open Entity (entity typing), TACRED (relation classification),
CoNLL-2003 (named entity recognition), ReCoRD (cloze-style question answering), and SQuAD 1.1 (extractive question
answering).*

Tips:

- This implementation is the same as [`RobertaModel`] with the addition of entity embeddings as well
  as an entity-aware self-attention mechanism, which improves performance on tasks involving reasoning about entities.
- LUKE treats entities as input tokens; therefore, it takes `entity_ids`, `entity_attention_mask`,
  `entity_token_type_ids` and `entity_position_ids` as extra input. You can obtain those using
  [`LukeTokenizer`].
- [`LukeTokenizer`] takes `entities` and `entity_spans` (character-based start and end
  positions of the entities in the input text) as extra input. `entities` typically consist of [MASK] entities or
  Wikipedia entities. Brief descriptions of these two ways of inputting entities are as follows:

  - *Inputting [MASK] entities to compute entity representations*: The [MASK] entity is used to mask entities to be
    predicted during pretraining. When LUKE receives the [MASK] entity, it tries to predict the original entity by
    gathering the information about the entity from the input text. Therefore, the [MASK] entity can be used to address
    downstream tasks requiring the information of entities in text such as entity typing, relation classification, and
    named entity recognition.
  - *Inputting Wikipedia entities to compute knowledge-enhanced token representations*: LUKE learns rich information
    (or knowledge) about Wikipedia entities during pretraining and stores the information in its entity embedding. By
    using Wikipedia entities as input tokens, LUKE outputs token representations enriched by the information stored in
    the embeddings of these entities. This is particularly effective for tasks requiring real-world knowledge, such as
    question answering.

- There are three head models for the former use case:

  - [`LukeForEntityClassification`], for tasks to classify a single entity in an input text such as
    entity typing, e.g. the [Open Entity dataset](https://www.cs.utexas.edu/~eunsol/html_pages/open_entity.html).
    This model places a linear head on top of the output entity representation.
  - [`LukeForEntityPairClassification`], for tasks to classify the relationship between two entities
    such as relation classification, e.g. the [TACRED dataset](https://nlp.stanford.edu/projects/tacred/). This
    model places a linear head on top of the concatenated output representation of the pair of given entities.
  - [`LukeForEntitySpanClassification`], for tasks to classify the sequence of entity spans, such as
    named entity recognition (NER). This model places a linear head on top of the output entity representations. You
    can address NER using this model by inputting all possible entity spans in the text to the model.

[`LukeTokenizer`] has a `task` argument, which enables you to easily create an input to these
head models by specifying `task="entity_classification"`, `task="entity_pair_classification"`, or
`task="entity_span_classification"`. Please refer to the example code of each head model.
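
The NER tip above mentions inputting all possible entity spans. As a minimal sketch of what that could look like, the helper below enumerates character-based spans covering up to `max_words` consecutive words. `candidate_entity_spans` and `max_words` are hypothetical names introduced for illustration, not part of the library, and words are found with a simple `\w+` pattern rather than the model's own tokenization.

```python
import re


def candidate_entity_spans(text, max_words=3):
    """Enumerate character-based (start, end) spans over 1..max_words consecutive words."""
    # Character offsets of every word, ignoring punctuation and whitespace.
    word_bounds = [m.span() for m in re.finditer(r"\w+", text)]
    spans = []
    for i in range(len(word_bounds)):
        for j in range(i, min(i + max_words, len(word_bounds))):
            # A span runs from the start of word i to the end of word j.
            spans.append((word_bounds[i][0], word_bounds[j][1]))
    return spans


text = "Beyoncé lives in Los Angeles."
spans = candidate_entity_spans(text, max_words=2)
print([text[s:e] for s, e in spans])
```

The resulting list could then be passed as `entity_spans` to a [`LukeTokenizer`] set up for span classification. The number of candidates grows with both the text length and `max_words`, so a cap on the span length is usually needed.
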

A demo notebook on how to fine-tune [`LukeForEntityPairClassification`] for relation
classification can be found [here](https://github.com/NielsRogge/Transformers-Tutorials/tree/master/LUKE).

There are also 3 notebooks available, which showcase how you can reproduce the results as reported in the paper with
the HuggingFace implementation of LUKE. They can be found [here](https://github.com/studio-ousia/luke/tree/master/notebooks).

Example:

```python
>>> from transformers import LukeTokenizer, LukeModel, LukeForEntityPairClassification

>>> model = LukeModel.from_pretrained("studio-ousia/luke-base")
>>> tokenizer = LukeTokenizer.from_pretrained("studio-ousia/luke-base")

# Example 1: Computing the contextualized entity representation corresponding to the entity mention "Beyoncé"
>>> text = "Beyoncé lives in Los Angeles."
>>> entity_spans = [(0, 7)]  # character-based entity span corresponding to "Beyoncé"
>>> inputs = tokenizer(text, entity_spans=entity_spans, add_prefix_space=True, return_tensors="pt")
>>> outputs = model(**inputs)
>>> word_last_hidden_state = outputs.last_hidden_state
>>> entity_last_hidden_state = outputs.entity_last_hidden_state

# Example 2: Inputting Wikipedia entities to obtain enriched contextualized representations
>>> entities = ["Beyoncé", "Los Angeles"]  # Wikipedia entity titles corresponding to the entity mentions "Beyoncé" and "Los Angeles"
>>> entity_spans = [(0, 7), (17, 28)]  # character-based entity spans corresponding to "Beyoncé" and "Los Angeles"
>>> inputs = tokenizer(text, entities=entities, entity_spans=entity_spans, add_prefix_space=True, return_tensors="pt")
>>> outputs = model(**inputs)
>>> word_last_hidden_state = outputs.last_hidden_state
>>> entity_last_hidden_state = outputs.entity_last_hidden_state

# Example 3: Classifying the relationship between two entities using LukeForEntityPairClassification head model
>>> model = LukeForEntityPairClassification.from_pretrained("studio-ousia/luke-large-finetuned-tacred")
>>> tokenizer = LukeTokenizer.from_pretrained("studio-ousia/luke-large-finetuned-tacred")
>>> entity_spans = [(0, 7), (17, 28)]  # character-based entity spans corresponding to "Beyoncé" and "Los Angeles"
>>> inputs = tokenizer(text, entity_spans=entity_spans, return_tensors="pt")
>>> outputs = model(**inputs)
>>> logits = outputs.logits
>>> predicted_class_idx = int(logits[0].argmax())
>>> print("Predicted class:", model.config.id2label[predicted_class_idx])
```
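
The character-based spans in the example above are written by hand. A small sketch of how they might be computed from the mention strings instead is shown below. `mention_span` is a hypothetical helper, not a library function, and it only finds the first occurrence of each mention.

```python
def mention_span(text, mention, start=0):
    """Return the character-based (start, end) span of the first occurrence of `mention`."""
    begin = text.find(mention, start)
    if begin == -1:
        raise ValueError(f"{mention!r} not found in text")
    return (begin, begin + len(mention))


text = "Beyoncé lives in Los Angeles."
entity_spans = [mention_span(text, m) for m in ("Beyoncé", "Los Angeles")]
print(entity_spans)
```

For the sentence above this should reproduce the hand-written spans `(0, 7)` and `(17, 28)`, provided "é" is stored as a single code point. Repeated mentions would need the `start` argument or a different matching strategy.
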

This model was contributed by [ikuyamada](https://huggingface.co/ikuyamada) and [nielsr](https://huggingface.co/nielsr). The original code can be found [here](https://github.com/studio-ousia/luke).


## LukeConfig

[[autodoc]] LukeConfig

## LukeTokenizer

[[autodoc]] LukeTokenizer
    - __call__
    - save_vocabulary

## LukeModel

[[autodoc]] LukeModel
    - forward

## LukeForMaskedLM

[[autodoc]] LukeForMaskedLM
    - forward

## LukeForEntityClassification

[[autodoc]] LukeForEntityClassification
    - forward

## LukeForEntityPairClassification

[[autodoc]] LukeForEntityPairClassification
    - forward

## LukeForEntitySpanClassification

[[autodoc]] LukeForEntitySpanClassification
    - forward
@@ -1,168 +0,0 @@
..
    Copyright 2021 The HuggingFace Team. All rights reserved.

    Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
    the License. You may obtain a copy of the License at

        http://www.apache.org/licenses/LICENSE-2.0

    Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
    an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
    specific language governing permissions and limitations under the License.

LUKE
-----------------------------------------------------------------------------------------------------------------------

Overview
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The LUKE model was proposed in `LUKE: Deep Contextualized Entity Representations with Entity-aware Self-attention
<https://arxiv.org/abs/2010.01057>`_ by Ikuya Yamada, Akari Asai, Hiroyuki Shindo, Hideaki Takeda and Yuji Matsumoto.
It is based on RoBERTa and adds entity embeddings as well as an entity-aware self-attention mechanism, which helps
improve performance on various downstream tasks involving reasoning about entities such as named entity recognition,
extractive and cloze-style question answering, entity typing, and relation classification.

The abstract from the paper is the following:

*Entity representations are useful in natural language tasks involving entities. In this paper, we propose new
pretrained contextualized representations of words and entities based on the bidirectional transformer. The proposed
model treats words and entities in a given text as independent tokens, and outputs contextualized representations of
them. Our model is trained using a new pretraining task based on the masked language model of BERT. The task involves
predicting randomly masked words and entities in a large entity-annotated corpus retrieved from Wikipedia. We also
propose an entity-aware self-attention mechanism that is an extension of the self-attention mechanism of the
transformer, and considers the types of tokens (words or entities) when computing attention scores. The proposed model
achieves impressive empirical performance on a wide range of entity-related tasks. In particular, it obtains
state-of-the-art results on five well-known datasets: Open Entity (entity typing), TACRED (relation classification),
CoNLL-2003 (named entity recognition), ReCoRD (cloze-style question answering), and SQuAD 1.1 (extractive question
answering).*

Tips:

- This implementation is the same as :class:`~transformers.RobertaModel` with the addition of entity embeddings as well
  as an entity-aware self-attention mechanism, which improves performance on tasks involving reasoning about entities.
- LUKE treats entities as input tokens; therefore, it takes :obj:`entity_ids`, :obj:`entity_attention_mask`,
  :obj:`entity_token_type_ids` and :obj:`entity_position_ids` as extra input. You can obtain those using
  :class:`~transformers.LukeTokenizer`.
- :class:`~transformers.LukeTokenizer` takes :obj:`entities` and :obj:`entity_spans` (character-based start and end
  positions of the entities in the input text) as extra input. :obj:`entities` typically consist of [MASK] entities or
  Wikipedia entities. Brief descriptions of these two ways of inputting entities are as follows:

  - *Inputting [MASK] entities to compute entity representations*: The [MASK] entity is used to mask entities to be
    predicted during pretraining. When LUKE receives the [MASK] entity, it tries to predict the original entity by
    gathering the information about the entity from the input text. Therefore, the [MASK] entity can be used to address
    downstream tasks requiring the information of entities in text such as entity typing, relation classification, and
    named entity recognition.
  - *Inputting Wikipedia entities to compute knowledge-enhanced token representations*: LUKE learns rich information
    (or knowledge) about Wikipedia entities during pretraining and stores the information in its entity embedding. By
    using Wikipedia entities as input tokens, LUKE outputs token representations enriched by the information stored in
    the embeddings of these entities. This is particularly effective for tasks requiring real-world knowledge, such as
    question answering.

- There are three head models for the former use case:

  - :class:`~transformers.LukeForEntityClassification`, for tasks to classify a single entity in an input text such as
    entity typing, e.g. the `Open Entity dataset <https://www.cs.utexas.edu/~eunsol/html_pages/open_entity.html>`__.
    This model places a linear head on top of the output entity representation.
  - :class:`~transformers.LukeForEntityPairClassification`, for tasks to classify the relationship between two entities
    such as relation classification, e.g. the `TACRED dataset <https://nlp.stanford.edu/projects/tacred/>`__. This
    model places a linear head on top of the concatenated output representation of the pair of given entities.
  - :class:`~transformers.LukeForEntitySpanClassification`, for tasks to classify the sequence of entity spans, such as
    named entity recognition (NER). This model places a linear head on top of the output entity representations. You
    can address NER using this model by inputting all possible entity spans in the text to the model.

:class:`~transformers.LukeTokenizer` has a ``task`` argument, which enables you to easily create an input to these
head models by specifying ``task="entity_classification"``, ``task="entity_pair_classification"``, or
``task="entity_span_classification"``. Please refer to the example code of each head model.

A demo notebook on how to fine-tune :class:`~transformers.LukeForEntityPairClassification` for relation
classification can be found `here <https://github.com/NielsRogge/Transformers-Tutorials/tree/master/LUKE>`__.

There are also 3 notebooks available, which showcase how you can reproduce the results as reported in the paper with
the HuggingFace implementation of LUKE. They can be found `here
<https://github.com/studio-ousia/luke/tree/master/notebooks>`__.

Example:

.. code-block::

    >>> from transformers import LukeTokenizer, LukeModel, LukeForEntityPairClassification

    >>> model = LukeModel.from_pretrained("studio-ousia/luke-base")
    >>> tokenizer = LukeTokenizer.from_pretrained("studio-ousia/luke-base")

    # Example 1: Computing the contextualized entity representation corresponding to the entity mention "Beyoncé"
    >>> text = "Beyoncé lives in Los Angeles."
    >>> entity_spans = [(0, 7)]  # character-based entity span corresponding to "Beyoncé"
    >>> inputs = tokenizer(text, entity_spans=entity_spans, add_prefix_space=True, return_tensors="pt")
    >>> outputs = model(**inputs)
    >>> word_last_hidden_state = outputs.last_hidden_state
    >>> entity_last_hidden_state = outputs.entity_last_hidden_state

    # Example 2: Inputting Wikipedia entities to obtain enriched contextualized representations
    >>> entities = ["Beyoncé", "Los Angeles"]  # Wikipedia entity titles corresponding to the entity mentions "Beyoncé" and "Los Angeles"
    >>> entity_spans = [(0, 7), (17, 28)]  # character-based entity spans corresponding to "Beyoncé" and "Los Angeles"
    >>> inputs = tokenizer(text, entities=entities, entity_spans=entity_spans, add_prefix_space=True, return_tensors="pt")
    >>> outputs = model(**inputs)
    >>> word_last_hidden_state = outputs.last_hidden_state
    >>> entity_last_hidden_state = outputs.entity_last_hidden_state

    # Example 3: Classifying the relationship between two entities using LukeForEntityPairClassification head model
    >>> model = LukeForEntityPairClassification.from_pretrained("studio-ousia/luke-large-finetuned-tacred")
    >>> tokenizer = LukeTokenizer.from_pretrained("studio-ousia/luke-large-finetuned-tacred")
    >>> entity_spans = [(0, 7), (17, 28)]  # character-based entity spans corresponding to "Beyoncé" and "Los Angeles"
    >>> inputs = tokenizer(text, entity_spans=entity_spans, return_tensors="pt")
    >>> outputs = model(**inputs)
    >>> logits = outputs.logits
    >>> predicted_class_idx = int(logits[0].argmax())
    >>> print("Predicted class:", model.config.id2label[predicted_class_idx])

This model was contributed by `ikuyamada <https://huggingface.co/ikuyamada>`__ and `nielsr
<https://huggingface.co/nielsr>`__. The original code can be found `here <https://github.com/studio-ousia/luke>`__.
|
|
||||||
|
|
||||||
|
|
||||||
LukeConfig
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.LukeConfig
|
|
||||||
:members:
|
|
||||||
|
|
||||||
|
|
||||||
LukeTokenizer
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.LukeTokenizer
|
|
||||||
:members: __call__, save_vocabulary
|
|
||||||
|
|
||||||
|
|
||||||
LukeModel
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.LukeModel
|
|
||||||
:members: forward
|
|
||||||
|
|
||||||
LukeForMaskedLM
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.LukeForMaskedLM
|
|
||||||
:members: forward
|
|
||||||
|
|
||||||
|
|
||||||
LukeForEntityClassification
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.LukeForEntityClassification
|
|
||||||
:members: forward
|
|
||||||
|
|
||||||
|
|
||||||
LukeForEntityPairClassification
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.LukeForEntityPairClassification
|
|
||||||
:members: forward
|
|
||||||
|
|
||||||
|
|
||||||
LukeForEntitySpanClassification
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.LukeForEntitySpanClassification
|
|
||||||
:members: forward
|
|
||||||
102
docs/source/model_doc/lxmert.mdx
Normal file
@@ -0,0 +1,102 @@
|
|||||||
|
<!--Copyright 2020 The HuggingFace Team. All rights reserved.
|
||||||
|
|
||||||
|
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
|
||||||
|
the License. You may obtain a copy of the License at
|
||||||
|
|
||||||
|
http://www.apache.org/licenses/LICENSE-2.0
|
||||||
|
|
||||||
|
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
|
||||||
|
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
|
||||||
|
specific language governing permissions and limitations under the License.
|
||||||
|
-->
|
||||||
|
|
||||||
|
# LXMERT
|
||||||
|
|
||||||
|
## Overview
|
||||||
|
|
||||||
|
The LXMERT model was proposed in [LXMERT: Learning Cross-Modality Encoder Representations from Transformers](https://arxiv.org/abs/1908.07490) by Hao Tan & Mohit Bansal. It is a series of bidirectional transformer encoders
|
||||||
|
(one for the vision modality, one for the language modality, and then one to fuse both modalities) pretrained using a
|
||||||
|
combination of masked language modeling, visual-language text alignment, ROI-feature regression, masked
|
||||||
|
visual-attribute modeling, masked visual-object modeling, and visual-question answering objectives. The pretraining
|
||||||
|
data consists of multiple multi-modal datasets: MSCOCO, Visual-Genome + Visual-Genome Question Answering, VQA 2.0, and GQA.
|
||||||
|
|
||||||
|
The abstract from the paper is the following:
|
||||||
|
|
||||||
|
*Vision-and-language reasoning requires an understanding of visual concepts, language semantics, and, most importantly,
|
||||||
|
the alignment and relationships between these two modalities. We thus propose the LXMERT (Learning Cross-Modality
|
||||||
|
Encoder Representations from Transformers) framework to learn these vision-and-language connections. In LXMERT, we
|
||||||
|
build a large-scale Transformer model that consists of three encoders: an object relationship encoder, a language
|
||||||
|
encoder, and a cross-modality encoder. Next, to endow our model with the capability of connecting vision and language
|
||||||
|
semantics, we pre-train the model with large amounts of image-and-sentence pairs, via five diverse representative
|
||||||
|
pretraining tasks: masked language modeling, masked object prediction (feature regression and label classification),
|
||||||
|
cross-modality matching, and image question answering. These tasks help in learning both intra-modality and
|
||||||
|
cross-modality relationships. After fine-tuning from our pretrained parameters, our model achieves the state-of-the-art
|
||||||
|
results on two visual question answering datasets (i.e., VQA and GQA). We also show the generalizability of our
|
||||||
|
pretrained cross-modality model by adapting it to a challenging visual-reasoning task, NLVR, and improve the previous
|
||||||
|
best result by 22% absolute (54% to 76%). Lastly, we demonstrate detailed ablation studies to prove that both our novel
|
||||||
|
model components and pretraining strategies significantly contribute to our strong results; and also present several
|
||||||
|
attention visualizations for the different encoders.*
|
||||||
|
|
||||||
|
Tips:
|
||||||
|
|
||||||
|
- Bounding boxes are not required in the visual feature embeddings; any kind of visual-spatial features
|
||||||
|
will work.
|
||||||
|
- Both the language hidden states and the visual hidden states that LXMERT outputs are passed through the
|
||||||
|
cross-modality layer, so they contain information from both modalities. To access a modality that only attends to
|
||||||
|
itself, select the vision/language hidden states from the first input in the tuple.
|
||||||
|
- The bidirectional cross-modality encoder attention only returns attention values when the language modality is used
|
||||||
|
as the input and the vision modality is used as the context vector. Further, while the cross-modality encoder
|
||||||
|
contains self-attention for each respective modality and cross-attention, only the cross attention is returned and
|
||||||
|
both self attention outputs are disregarded.
|
||||||
|
|
||||||
|
This model was contributed by [eltoto1219](https://huggingface.co/eltoto1219). The original code can be found [here](https://github.com/airsplay/lxmert).
|
||||||
|
|
||||||
|
|
||||||
|
## LxmertConfig
|
||||||
|
|
||||||
|
[[autodoc]] LxmertConfig
|
||||||
|
|
||||||
|
## LxmertTokenizer
|
||||||
|
|
||||||
|
[[autodoc]] LxmertTokenizer
|
||||||
|
|
||||||
|
## LxmertTokenizerFast
|
||||||
|
|
||||||
|
[[autodoc]] LxmertTokenizerFast
|
||||||
|
|
||||||
|
## Lxmert specific outputs
|
||||||
|
|
||||||
|
[[autodoc]] models.lxmert.modeling_lxmert.LxmertModelOutput
|
||||||
|
|
||||||
|
[[autodoc]] models.lxmert.modeling_lxmert.LxmertForPreTrainingOutput
|
||||||
|
|
||||||
|
[[autodoc]] models.lxmert.modeling_lxmert.LxmertForQuestionAnsweringOutput
|
||||||
|
|
||||||
|
[[autodoc]] models.lxmert.modeling_tf_lxmert.TFLxmertModelOutput
|
||||||
|
|
||||||
|
[[autodoc]] models.lxmert.modeling_tf_lxmert.TFLxmertForPreTrainingOutput
|
||||||
|
|
||||||
|
## LxmertModel
|
||||||
|
|
||||||
|
[[autodoc]] LxmertModel
|
||||||
|
- forward
|
||||||
|
|
||||||
|
## LxmertForPreTraining
|
||||||
|
|
||||||
|
[[autodoc]] LxmertForPreTraining
|
||||||
|
- forward
|
||||||
|
|
||||||
|
## LxmertForQuestionAnswering
|
||||||
|
|
||||||
|
[[autodoc]] LxmertForQuestionAnswering
|
||||||
|
- forward
|
||||||
|
|
||||||
|
## TFLxmertModel
|
||||||
|
|
||||||
|
[[autodoc]] TFLxmertModel
|
||||||
|
- call
|
||||||
|
|
||||||
|
## TFLxmertForPreTraining
|
||||||
|
|
||||||
|
[[autodoc]] TFLxmertForPreTraining
|
||||||
|
- call
|
||||||
@@ -1,128 +0,0 @@
|
|||||||
..
|
|
||||||
Copyright 2020 The HuggingFace Team. All rights reserved.
|
|
||||||
|
|
||||||
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
|
|
||||||
the License. You may obtain a copy of the License at
|
|
||||||
|
|
||||||
http://www.apache.org/licenses/LICENSE-2.0
|
|
||||||
|
|
||||||
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
|
|
||||||
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
|
|
||||||
specific language governing permissions and limitations under the License.
|
|
||||||
|
|
||||||
LXMERT
|
|
||||||
-----------------------------------------------------------------------------------------------------------------------
|
|
||||||
|
|
||||||
Overview
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
The LXMERT model was proposed in `LXMERT: Learning Cross-Modality Encoder Representations from Transformers
|
|
||||||
<https://arxiv.org/abs/1908.07490>`__ by Hao Tan & Mohit Bansal. It is a series of bidirectional transformer encoders
|
|
||||||
(one for the vision modality, one for the language modality, and then one to fuse both modalities) pretrained using a
|
|
||||||
combination of masked language modeling, visual-language text alignment, ROI-feature regression, masked
|
|
||||||
visual-attribute modeling, masked visual-object modeling, and visual-question answering objectives. The pretraining
|
|
||||||
data consists of multiple multi-modal datasets: MSCOCO, Visual-Genome + Visual-Genome Question Answering, VQA 2.0, and GQA.
|
|
||||||
|
|
||||||
The abstract from the paper is the following:
|
|
||||||
|
|
||||||
*Vision-and-language reasoning requires an understanding of visual concepts, language semantics, and, most importantly,
|
|
||||||
the alignment and relationships between these two modalities. We thus propose the LXMERT (Learning Cross-Modality
|
|
||||||
Encoder Representations from Transformers) framework to learn these vision-and-language connections. In LXMERT, we
|
|
||||||
build a large-scale Transformer model that consists of three encoders: an object relationship encoder, a language
|
|
||||||
encoder, and a cross-modality encoder. Next, to endow our model with the capability of connecting vision and language
|
|
||||||
semantics, we pre-train the model with large amounts of image-and-sentence pairs, via five diverse representative
|
|
||||||
pretraining tasks: masked language modeling, masked object prediction (feature regression and label classification),
|
|
||||||
cross-modality matching, and image question answering. These tasks help in learning both intra-modality and
|
|
||||||
cross-modality relationships. After fine-tuning from our pretrained parameters, our model achieves the state-of-the-art
|
|
||||||
results on two visual question answering datasets (i.e., VQA and GQA). We also show the generalizability of our
|
|
||||||
pretrained cross-modality model by adapting it to a challenging visual-reasoning task, NLVR, and improve the previous
|
|
||||||
best result by 22% absolute (54% to 76%). Lastly, we demonstrate detailed ablation studies to prove that both our novel
|
|
||||||
model components and pretraining strategies significantly contribute to our strong results; and also present several
|
|
||||||
attention visualizations for the different encoders*
|
|
||||||
|
|
||||||
Tips:
|
|
||||||
|
|
||||||
- Bounding boxes are not required in the visual feature embeddings; any kind of visual-spatial features
|
|
||||||
will work.
|
|
||||||
- Both the language hidden states and the visual hidden states that LXMERT outputs are passed through the
|
|
||||||
cross-modality layer, so they contain information from both modalities. To access a modality that only attends to
|
|
||||||
itself, select the vision/language hidden states from the first input in the tuple.
|
|
||||||
- The bidirectional cross-modality encoder attention only returns attention values when the language modality is used
|
|
||||||
as the input and the vision modality is used as the context vector. Further, while the cross-modality encoder
|
|
||||||
contains self-attention for each respective modality and cross-attention, only the cross attention is returned and
|
|
||||||
both self attention outputs are disregarded.
|
|
||||||
|
|
||||||
This model was contributed by `eltoto1219 <https://huggingface.co/eltoto1219>`__. The original code can be found `here
|
|
||||||
<https://github.com/airsplay/lxmert>`__.
|
|
||||||
|
|
||||||
|
|
||||||
LxmertConfig
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.LxmertConfig
|
|
||||||
:members:
|
|
||||||
|
|
||||||
|
|
||||||
LxmertTokenizer
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.LxmertTokenizer
|
|
||||||
:members:
|
|
||||||
|
|
||||||
|
|
||||||
LxmertTokenizerFast
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.LxmertTokenizerFast
|
|
||||||
:members:
|
|
||||||
|
|
||||||
|
|
||||||
Lxmert specific outputs
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.models.lxmert.modeling_lxmert.LxmertModelOutput
|
|
||||||
:members:
|
|
||||||
|
|
||||||
.. autoclass:: transformers.models.lxmert.modeling_lxmert.LxmertForPreTrainingOutput
|
|
||||||
:members:
|
|
||||||
|
|
||||||
.. autoclass:: transformers.models.lxmert.modeling_lxmert.LxmertForQuestionAnsweringOutput
|
|
||||||
:members:
|
|
||||||
|
|
||||||
.. autoclass:: transformers.models.lxmert.modeling_tf_lxmert.TFLxmertModelOutput
|
|
||||||
:members:
|
|
||||||
|
|
||||||
.. autoclass:: transformers.models.lxmert.modeling_tf_lxmert.TFLxmertForPreTrainingOutput
|
|
||||||
:members:
|
|
||||||
|
|
||||||
|
|
||||||
LxmertModel
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.LxmertModel
|
|
||||||
:members: forward
|
|
||||||
|
|
||||||
LxmertForPreTraining
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.LxmertForPreTraining
|
|
||||||
:members: forward
|
|
||||||
|
|
||||||
LxmertForQuestionAnswering
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.LxmertForQuestionAnswering
|
|
||||||
:members: forward
|
|
||||||
|
|
||||||
|
|
||||||
TFLxmertModel
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.TFLxmertModel
|
|
||||||
:members: call
|
|
||||||
|
|
||||||
TFLxmertForPreTraining
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.TFLxmertForPreTraining
|
|
||||||
:members: call
|
|
||||||
116
docs/source/model_doc/m2m_100.mdx
Normal file
@@ -0,0 +1,116 @@
|
|||||||
|
<!--Copyright 2020 The HuggingFace Team. All rights reserved.
|
||||||
|
|
||||||
|
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
|
||||||
|
the License. You may obtain a copy of the License at
|
||||||
|
|
||||||
|
http://www.apache.org/licenses/LICENSE-2.0
|
||||||
|
|
||||||
|
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
|
||||||
|
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
|
||||||
|
specific language governing permissions and limitations under the License.
|
||||||
|
-->
|
||||||
|
|
||||||
|
# M2M100
|
||||||
|
|
||||||
|
## Overview
|
||||||
|
|
||||||
|
The M2M100 model was proposed in [Beyond English-Centric Multilingual Machine Translation](https://arxiv.org/abs/2010.11125) by Angela Fan, Shruti Bhosale, Holger Schwenk, Zhiyi Ma, Ahmed El-Kishky,
|
||||||
|
Siddharth Goyal, Mandeep Baines, Onur Celebi, Guillaume Wenzek, Vishrav Chaudhary, Naman Goyal, Tom Birch, Vitaliy
|
||||||
|
Liptchinsky, Sergey Edunov, Edouard Grave, Michael Auli, Armand Joulin.
|
||||||
|
|
||||||
|
The abstract from the paper is the following:
|
||||||
|
|
||||||
|
*Existing work in translation demonstrated the potential of massively multilingual machine translation by training a
|
||||||
|
single model able to translate between any pair of languages. However, much of this work is English-Centric by training
|
||||||
|
only on data which was translated from or to English. While this is supported by large sources of training data, it
|
||||||
|
does not reflect translation needs worldwide. In this work, we create a true Many-to-Many multilingual translation
|
||||||
|
model that can translate directly between any pair of 100 languages. We build and open source a training dataset that
|
||||||
|
covers thousands of language directions with supervised data, created through large-scale mining. Then, we explore how
|
||||||
|
to effectively increase model capacity through a combination of dense scaling and language-specific sparse parameters
|
||||||
|
to create high quality models. Our focus on non-English-Centric models brings gains of more than 10 BLEU when directly
|
||||||
|
translating between non-English directions while performing competitively to the best single systems of WMT. We
|
||||||
|
open-source our scripts so that others may reproduce the data, evaluation, and final M2M-100 model.*
|
||||||
|
|
||||||
|
This model was contributed by [valhalla](https://huggingface.co/valhalla).
|
||||||
|
|
||||||
|
|
||||||
|
### Training and Generation
|
||||||
|
|
||||||
|
M2M100 is a multilingual encoder-decoder (seq-to-seq) model primarily intended for translation tasks. As the model is
|
||||||
|
multilingual it expects the sequences in a certain format: A special language id token is used as prefix in both the
|
||||||
|
source and target text. The text format is `[lang_code] X [eos]`, where `lang_code` is the source language
|
||||||
|
id for source text and target language id for target text, with `X` being the source or target text.
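As a rough illustration of the format described above, the following sketch builds `[lang_code] X [eos]` sequences from already-split text pieces. It does not call the real tokenizer: the `__en__`-style spelling of the language token, the `</s>` end-of-sequence token, and the helper names `lang_token` and `format_sequence` are assumptions made for this example only.

```python
from typing import List

# Assumed spelling, for illustration only; the real tokenizer owns this value.
EOS_TOKEN = "</s>"


def lang_token(lang_code: str) -> str:
    """Return the (assumed) special token for a language code such as "en"."""
    return f"__{lang_code}__"


def format_sequence(lang_code: str, pieces: List[str]) -> List[str]:
    """Wrap text pieces as [lang_code] X [eos]."""
    return [lang_token(lang_code)] + list(pieces) + [EOS_TOKEN]


# The source side is prefixed with the source language id ...
source = format_sequence("en", ["Life", "is", "good"])
# ... and the target side with the target language id.
target = format_sequence("fr", ["La", "vie", "est", "belle"])

print(source)
print(target)
```

In practice the tokenizer adds these special tokens itself once `src_lang` and `tgt_lang` are set, so this sketch is only meant to make the layout of the sequences concrete.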
|
||||||
|
|
||||||
|
The [`M2M100Tokenizer`] depends on `sentencepiece` so be sure to install it before running the
|
||||||
|
examples. To install `sentencepiece` run `pip install sentencepiece`.
|
||||||
|
|
||||||
|
- Supervised Training
|
||||||
|
|
||||||
|
```python
|
||||||
|
from transformers import M2M100Config, M2M100ForConditionalGeneration, M2M100Tokenizer
|
||||||
|
|
||||||
|
model = M2M100ForConditionalGeneration.from_pretrained('facebook/m2m100_418M')
|
||||||
|
tokenizer = M2M100Tokenizer.from_pretrained('facebook/m2m100_418M', src_lang="en", tgt_lang="fr")
|
||||||
|
|
||||||
|
src_text = "Life is like a box of chocolates."
|
||||||
|
tgt_text = "La vie est comme une boîte de chocolat."
|
||||||
|
|
||||||
|
model_inputs = tokenizer(src_text, return_tensors="pt")
|
||||||
|
with tokenizer.as_target_tokenizer():
|
||||||
|
    labels = tokenizer(tgt_text, return_tensors="pt").input_ids
|
||||||
|
|
||||||
|
loss = model(**model_inputs, labels=labels).loss  # forward pass
|
||||||
|
```
|
||||||
|
|
||||||
|
- Generation
|
||||||
|
|
||||||
|
M2M100 uses the `eos_token_id` as the `decoder_start_token_id` for generation with the target language id
|
||||||
|
being forced as the first generated token. To force the target language id as the first generated token, pass the
|
||||||
|
`forced_bos_token_id` parameter to the `generate` method. The following example shows how to translate from
|
||||||
|
Hindi to French and Chinese to English using the *facebook/m2m100_418M* checkpoint.
|
||||||
|
|
||||||
|
```python
|
||||||
|
>>> from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer
|
||||||
|
|
||||||
|
>>> hi_text = "जीवन एक चॉकलेट बॉक्स की तरह है।"
|
||||||
|
>>> chinese_text = "生活就像一盒巧克力。"
|
||||||
|
|
||||||
|
>>> model = M2M100ForConditionalGeneration.from_pretrained("facebook/m2m100_418M")
|
||||||
|
>>> tokenizer = M2M100Tokenizer.from_pretrained("facebook/m2m100_418M")
|
||||||
|
|
||||||
|
>>> # translate Hindi to French
|
||||||
|
>>> tokenizer.src_lang = "hi"
|
||||||
|
>>> encoded_hi = tokenizer(hi_text, return_tensors="pt")
|
||||||
|
>>> generated_tokens = model.generate(**encoded_hi, forced_bos_token_id=tokenizer.get_lang_id("fr"))
|
||||||
|
>>> tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
|
||||||
|
"La vie est comme une boîte de chocolat."
|
||||||
|
|
||||||
|
>>> # translate Chinese to English
|
||||||
|
>>> tokenizer.src_lang = "zh"
|
||||||
|
>>> encoded_zh = tokenizer(chinese_text, return_tensors="pt")
|
||||||
|
>>> generated_tokens = model.generate(**encoded_zh, forced_bos_token_id=tokenizer.get_lang_id("en"))
|
||||||
|
>>> tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
|
||||||
|
"Life is like a box of chocolate."
|
||||||
|
```
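The role of `decoder_start_token_id` and `forced_bos_token_id` can be pictured with a toy greedy decoder. This is only a sketch under stated assumptions, not the library's `generate` implementation: the function `greedy_generate`, the scoring callback `toy_scores`, and the token ids (2 for `eos`, 5 for the target language) are all hypothetical.

```python
from typing import Callable, List

EOS_ID = 2       # hypothetical id of the end-of-sequence token
FR_LANG_ID = 5   # hypothetical id of the target language token


def greedy_generate(
    next_token_scores: Callable[[List[int]], List[float]],
    decoder_start_token_id: int,
    forced_bos_token_id: int,
    eos_token_id: int,
    max_new_tokens: int = 10,
) -> List[int]:
    """Greedy decoding where the first generated token is forced."""
    tokens = [decoder_start_token_id]
    for step in range(max_new_tokens):
        if step == 0:
            # The first generated token is forced to the target language id.
            next_id = forced_bos_token_id
        else:
            scores = next_token_scores(tokens)
            # Pick the index of the highest score.
            next_id = max(range(len(scores)), key=scores.__getitem__)
        tokens.append(next_id)
        if next_id == eos_token_id:
            break
    return tokens


def toy_scores(tokens: List[int]) -> List[float]:
    """Stand-in for a model: a fixed chain 5 -> 3 -> 4 -> eos."""
    chain = {FR_LANG_ID: 3, 3: 4, 4: EOS_ID}
    scores = [0.0] * 6
    scores[chain.get(tokens[-1], EOS_ID)] = 1.0
    return scores


# Decoding starts from eos, mirroring the use of eos as decoder start token.
generated = greedy_generate(toy_scores, EOS_ID, FR_LANG_ID, EOS_ID)
print(generated)
```

The point of the sketch is the ordering: the sequence starts with the `eos` id, the language id comes next because it is forced, and only then does the model's own scoring decide the remaining tokens.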
|
||||||
|
|
||||||
|
## M2M100Config
|
||||||
|
|
||||||
|
[[autodoc]] M2M100Config
|
||||||
|
|
||||||
|
## M2M100Tokenizer
|
||||||
|
|
||||||
|
[[autodoc]] M2M100Tokenizer
|
||||||
|
- build_inputs_with_special_tokens
|
||||||
|
- get_special_tokens_mask
|
||||||
|
- create_token_type_ids_from_sequences
|
||||||
|
- save_vocabulary
|
||||||
|
|
||||||
|
## M2M100Model
|
||||||
|
|
||||||
|
[[autodoc]] M2M100Model
|
||||||
|
- forward
|
||||||
|
|
||||||
|
## M2M100ForConditionalGeneration
|
||||||
|
|
||||||
|
[[autodoc]] M2M100ForConditionalGeneration
|
||||||
|
- forward
|
||||||
@@ -1,130 +0,0 @@
|
|||||||
..
|
|
||||||
Copyright 2020 The HuggingFace Team. All rights reserved.
|
|
||||||
|
|
||||||
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
|
|
||||||
the License. You may obtain a copy of the License at
|
|
||||||
|
|
||||||
http://www.apache.org/licenses/LICENSE-2.0
|
|
||||||
|
|
||||||
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
|
|
||||||
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
|
|
||||||
specific language governing permissions and limitations under the License.
|
|
||||||
|
|
||||||
M2M100
|
|
||||||
-----------------------------------------------------------------------------------------------------------------------
|
|
||||||
|
|
||||||
Overview
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
The M2M100 model was proposed in `Beyond English-Centric Multilingual Machine Translation
|
|
||||||
<https://arxiv.org/abs/2010.11125>`__ by Angela Fan, Shruti Bhosale, Holger Schwenk, Zhiyi Ma, Ahmed El-Kishky,
|
|
||||||
Siddharth Goyal, Mandeep Baines, Onur Celebi, Guillaume Wenzek, Vishrav Chaudhary, Naman Goyal, Tom Birch, Vitaliy
|
|
||||||
Liptchinsky, Sergey Edunov, Edouard Grave, Michael Auli, Armand Joulin.
|
|
||||||
|
|
||||||
The abstract from the paper is the following:
|
|
||||||
|
|
||||||
*Existing work in translation demonstrated the potential of massively multilingual machine translation by training a
|
|
||||||
single model able to translate between any pair of languages. However, much of this work is English-Centric by training
|
|
||||||
only on data which was translated from or to English. While this is supported by large sources of training data, it
|
|
||||||
does not reflect translation needs worldwide. In this work, we create a true Many-to-Many multilingual translation
|
|
||||||
model that can translate directly between any pair of 100 languages. We build and open source a training dataset that
|
|
||||||
covers thousands of language directions with supervised data, created through large-scale mining. Then, we explore how
|
|
||||||
to effectively increase model capacity through a combination of dense scaling and language-specific sparse parameters
|
|
||||||
to create high quality models. Our focus on non-English-Centric models brings gains of more than 10 BLEU when directly
|
|
||||||
translating between non-English directions while performing competitively to the best single systems of WMT. We
|
|
||||||
open-source our scripts so that others may reproduce the data, evaluation, and final M2M-100 model.*
|
|
||||||
|
|
||||||
This model was contributed by `valhalla <https://huggingface.co/valhalla>`__.
|
|
||||||
|
|
||||||
|
|
||||||
Training and Generation
|
|
||||||
_______________________________________________________________________________________________________________________
|
|
||||||
|
|
||||||
M2M100 is a multilingual encoder-decoder (seq-to-seq) model primarily intended for translation tasks. As the model is
|
|
||||||
multilingual it expects the sequences in a certain format: A special language id token is used as prefix in both the
|
|
||||||
source and target text. The text format is :obj:`[lang_code] X [eos]`, where :obj:`lang_code` is the source language
|
|
||||||
id for source text and target language id for target text, with :obj:`X` being the source or target text.
|
|
||||||
|
|
||||||
The :class:`~transformers.M2M100Tokenizer` depends on :obj:`sentencepiece` so be sure to install it before running the
|
|
||||||
examples. To install :obj:`sentencepiece` run ``pip install sentencepiece``.
|
|
||||||
|
|
||||||
- Supervised Training
|
|
||||||
|
|
||||||
.. code-block::
|
|
||||||
|
|
||||||
from transformers import M2M100Config, M2M100ForConditionalGeneration, M2M100Tokenizer
|
|
||||||
|
|
||||||
model = M2M100ForConditionalGeneration.from_pretrained('facebook/m2m100_418M')
|
|
||||||
tokenizer = M2M100Tokenizer.from_pretrained('facebook/m2m100_418M', src_lang="en", tgt_lang="fr")
|
|
||||||
|
|
||||||
src_text = "Life is like a box of chocolates."
|
|
||||||
tgt_text = "La vie est comme une boîte de chocolat."
|
|
||||||
|
|
||||||
model_inputs = tokenizer(src_text, return_tensors="pt")
|
|
||||||
with tokenizer.as_target_tokenizer():
|
|
||||||
labels = tokenizer(tgt_text, return_tensors="pt").input_ids
|
|
||||||
|
|
||||||
loss = model(**model_inputs, labels=labels) # forward pass
|
|
||||||
|
|
||||||
|
|
||||||
- Generation
|
|
||||||
|
|
||||||
M2M100 uses the :obj:`eos_token_id` as the :obj:`decoder_start_token_id` for generation with the target language id
|
|
||||||
being forced as the first generated token. To force the target language id as the first generated token, pass the
|
|
||||||
`forced_bos_token_id` parameter to the `generate` method. The following example shows how to translate from
|
|
||||||
Hindi to French and Chinese to English using the `facebook/m2m100_418M` checkpoint.
|
|
||||||
|
|
||||||
.. code-block::
|
|
||||||
|
|
||||||
>>> from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer
|
|
||||||
|
|
||||||
>>> hi_text = "जीवन एक चॉकलेट बॉक्स की तरह है।"
|
|
||||||
>>> chinese_text = "生活就像一盒巧克力。"
|
|
||||||
|
|
||||||
>>> model = M2M100ForConditionalGeneration.from_pretrained("facebook/m2m100_418M")
|
|
||||||
>>> tokenizer = M2M100Tokenizer.from_pretrained("facebook/m2m100_418M")
|
|
||||||
|
|
||||||
>>> # translate Hindi to French
|
|
||||||
>>> tokenizer.src_lang = "hi"
|
|
||||||
>>> encoded_hi = tokenizer(hi_text, return_tensors="pt")
|
|
||||||
>>> generated_tokens = model.generate(**encoded_hi, forced_bos_token_id=tokenizer.get_lang_id("fr"))
|
|
||||||
>>> tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
|
|
||||||
"La vie est comme une boîte de chocolat."
|
|
||||||
|
|
||||||
>>> # translate Chinese to English
|
|
||||||
>>> tokenizer.src_lang = "zh"
|
|
||||||
>>> encoded_zh = tokenizer(chinese_text, return_tensors="pt")
|
|
||||||
>>> generated_tokens = model.generate(**encoded_zh, forced_bos_token_id=tokenizer.get_lang_id("en"))
|
|
||||||
>>> tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
|
|
||||||
"Life is like a box of chocolate."
|
|
||||||
|
|
||||||
|
|
||||||
M2M100Config
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.M2M100Config
|
|
||||||
:members:
|
|
||||||
|
|
||||||
|
|
||||||
M2M100Tokenizer
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.M2M100Tokenizer
|
|
||||||
:members: build_inputs_with_special_tokens, get_special_tokens_mask,
|
|
||||||
create_token_type_ids_from_sequences, save_vocabulary
|
|
||||||
|
|
||||||
|
|
||||||
M2M100Model
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.M2M100Model
|
|
||||||
:members: forward
|
|
||||||
|
|
||||||
|
|
||||||
M2M100ForConditionalGeneration
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.M2M100ForConditionalGeneration
|
|
||||||
:members: forward
|
|
||||||
|
|
||||||
|
|
||||||
191
docs/source/model_doc/marian.mdx
Normal file
@@ -0,0 +1,191 @@
|
|||||||
|
<!--Copyright 2020 The HuggingFace Team. All rights reserved.
|
||||||
|
|
||||||
|
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
|
||||||
|
the License. You may obtain a copy of the License at
|
||||||
|
|
||||||
|
http://www.apache.org/licenses/LICENSE-2.0
|
||||||
|
|
||||||
|
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
|
||||||
|
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
|
||||||
|
specific language governing permissions and limitations under the License.
|
||||||
|
-->
|
||||||
|
|
||||||
|
# MarianMT
|
||||||
|
|
||||||
|
**Bugs:** If you see something strange, file a [GitHub Issue](https://github.com/huggingface/transformers/issues/new?assignees=sshleifer&labels=&template=bug-report.md&title)
|
||||||
|
and assign @patrickvonplaten.
|
||||||
|
|
||||||
|
Translations should be similar, but not identical to output in the test set linked to in each model card.
|
||||||
|
|
||||||
|
## Implementation Notes
|
||||||
|
|
||||||
|
- Each model is about 298 MB on disk, and there are more than 1,000 models.
|
||||||
|
- The list of supported language pairs can be found [here](https://huggingface.co/Helsinki-NLP).
|
||||||
|
- Models were originally trained by [Jörg Tiedemann](https://researchportal.helsinki.fi/en/persons/j%C3%B6rg-tiedemann) using the [Marian](https://marian-nmt.github.io/) C++ library, which supports fast training and translation.
|
||||||
|
- All models are transformer encoder-decoders with 6 layers in each component. Each model's performance is documented
|
||||||
|
in a model card.
|
||||||
|
- The 80 opus models that require BPE preprocessing are not supported.
|
||||||
|
- The modeling code is the same as [`BartForConditionalGeneration`] with a few minor modifications:
|
||||||
|
|
||||||
|
- static (sinusoid) positional embeddings (`MarianConfig.static_position_embeddings=True`)
|
||||||
|
- no layernorm_embedding (`MarianConfig.normalize_embedding=False`)
|
||||||
|
- the model starts generating with `pad_token_id` (which has 0 as a token_embedding) as the prefix (Bart uses
|
||||||
|
`</s>`).
|
||||||
|
- Code to bulk convert models can be found in `convert_marian_to_pytorch.py`.
|
||||||
|
- This model was contributed by [sshleifer](https://huggingface.co/sshleifer).
|
||||||
|
|
||||||
|
## Naming
|
||||||
|
|
||||||
|
- All model names use the following format: `Helsinki-NLP/opus-mt-{src}-{tgt}`
|
||||||
|
- The language codes used to name models are inconsistent. Two-letter codes can usually be found [here](https://developers.google.com/admin-sdk/directory/v1/languages); three-letter codes require searching for "language
code {code}".
|
||||||
|
- Codes formatted like `es_AR` are usually `code_{region}`. That one is Spanish from Argentina.
|
||||||
|
- The models were converted in two stages. The first 1000 models use ISO-639-2 codes to identify languages; the second
group uses a combination of ISO-639-5 codes and ISO-639-2 codes.
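
The naming convention above can be sketched in a few lines of plain Python. The helper names `marian_model_name` and `split_region_code` are hypothetical and are not part of the library; building a name this way does not guarantee that the corresponding checkpoint exists on the hub, and not every code containing an underscore follows the `code_{region}` pattern.

```python
def marian_model_name(src: str, tgt: str) -> str:
    """Build a hub model id following the Helsinki-NLP/opus-mt-{src}-{tgt} pattern."""
    return f"Helsinki-NLP/opus-mt-{src}-{tgt}"


def split_region_code(code: str):
    """Split a code such as 'es_AR' into (language, region); region is None if absent."""
    language, _, region = code.partition("_")
    return language, (region or None)


print(marian_model_name("en", "roa"))  # Helsinki-NLP/opus-mt-en-roa
print(split_region_code("es_AR"))      # ('es', 'AR')
print(split_region_code("fr"))         # ('fr', None)
```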
|
||||||
|
|
||||||
|
|
||||||
|
## Examples
|
||||||
|
|
||||||
|
- Since Marian models are smaller than many other translation models available in the library, they can be useful for
|
||||||
|
fine-tuning experiments and integration tests.
|
||||||
|
- [Fine-tune on GPU](https://github.com/huggingface/transformers/blob/master/examples/research_projects/seq2seq-distillation/train_distil_marian_enro_teacher.sh)
|
||||||
|
- [Fine-tune on GPU with pytorch-lightning](https://github.com/huggingface/transformers/blob/master/examples/research_projects/seq2seq-distillation/train_distil_marian_no_teacher.sh)
|
||||||
|
|
||||||
|
## Multilingual Models
|
||||||
|
|
||||||
|
- All model names use the following format: `Helsinki-NLP/opus-mt-{src}-{tgt}`
|
||||||
|
- If a model can output multiple languages, you should specify a language code by prepending the desired output
|
||||||
|
language to the `src_text`.
|
||||||
|
- You can see a model's supported language codes in its model card, under target constituents, like in [opus-mt-en-roa](https://huggingface.co/Helsinki-NLP/opus-mt-en-roa).
|
||||||
|
- Note that if a model is only multilingual on the source side, like `Helsinki-NLP/opus-mt-roa-en`, no language
|
||||||
|
codes are required.
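
As a minimal sketch of the prepending step described above, the following hypothetical helper (`add_target_language` is not a library function) adds a `>>code<<` token to each source sentence. It only builds the strings; whether a given code is valid still depends on the model's supported language codes.

```python
def add_target_language(texts, lang_code):
    """Prefix each source sentence with a >>code<< token selecting the output language."""
    token = f">>{lang_code}<<"
    return [f"{token} {text}" for text in texts]


src_text = add_target_language(["This should go to portuguese"], "por")
print(src_text)  # ['>>por<< This should go to portuguese']
```

The resulting list can be passed to the tokenizer in the same way as a hand-written `src_text`.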
|
||||||
|
|
||||||
|
New multi-lingual models from the [Tatoeba-Challenge repo](https://github.com/Helsinki-NLP/Tatoeba-Challenge)
|
||||||
|
require 3 character language codes:
|
||||||
|
|
||||||
|
```python
|
||||||
|
>>> from transformers import MarianMTModel, MarianTokenizer
|
||||||
|
>>> src_text = [
|
||||||
|
... '>>fra<< this is a sentence in english that we want to translate to french',
|
||||||
|
... '>>por<< This should go to portuguese',
|
||||||
|
... '>>spa<< And this to Spanish'
... ]
|
||||||
|
|
||||||
|
>>> model_name = 'Helsinki-NLP/opus-mt-en-roa'
|
||||||
|
>>> tokenizer = MarianTokenizer.from_pretrained(model_name)
|
||||||
|
>>> print(tokenizer.supported_language_codes)
|
||||||
|
['>>zlm_Latn<<', '>>mfe<<', '>>hat<<', '>>pap<<', '>>ast<<', '>>cat<<', '>>ind<<', '>>glg<<', '>>wln<<', '>>spa<<', '>>fra<<', '>>ron<<', '>>por<<', '>>ita<<', '>>oci<<', '>>arg<<', '>>min<<']
|
||||||
|
|
||||||
|
>>> model = MarianMTModel.from_pretrained(model_name)
|
||||||
|
>>> translated = model.generate(**tokenizer(src_text, return_tensors="pt", padding=True))
|
||||||
|
>>> [tokenizer.decode(t, skip_special_tokens=True) for t in translated]
|
||||||
|
["c'est une phrase en anglais que nous voulons traduire en français",
|
||||||
|
'Isto deve ir para o português.',
|
||||||
|
'Y esto al español']
|
||||||
|
```
|
||||||
|
|
||||||
|
Here is the code to see all available pretrained models on the hub:
|
||||||
|
|
||||||
|
```python
|
||||||
|
from huggingface_hub import list_models
|
||||||
|
model_list = list_models()
|
||||||
|
org = "Helsinki-NLP"
|
||||||
|
model_ids = [x.modelId for x in model_list if x.modelId.startswith(org)]
|
||||||
|
suffix = [x.split('/')[1] for x in model_ids]
|
||||||
|
old_style_multi_models = [f'{org}/{s}' for s in suffix if s != s.lower()]
|
||||||
|
```
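
The last line of the snippet above relies on the old-style multilingual models having upper-case group names in their suffix. A small offline sketch of that filter, run on a hand-picked sample of model ids (the sample list is an assumption for illustration, not the result of a hub query):

```python
org = "Helsinki-NLP"

# Hand-picked sample ids standing in for the result of a hub listing.
model_ids = [
    "Helsinki-NLP/opus-mt-en-ROMANCE",
    "Helsinki-NLP/opus-mt-en-roa",
    "Helsinki-NLP/opus-mt-fi-ZH",
    "Helsinki-NLP/opus-mt-en-de",
]

suffix = [x.split("/")[1] for x in model_ids]
# A suffix that differs from its lower-cased form contains an upper-case group name.
old_style_multi_models = [f"{org}/{s}" for s in suffix if s != s.lower()]
print(old_style_multi_models)
# ['Helsinki-NLP/opus-mt-en-ROMANCE', 'Helsinki-NLP/opus-mt-fi-ZH']
```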
|
||||||
|
|
||||||
|
## Old Style Multi-Lingual Models
|
||||||
|
|
||||||
|
These are the old-style multi-lingual models ported from the OPUS-MT-Train repo, followed by the members of each language
group:
|
||||||
|
|
||||||
|
```python
|
||||||
|
['Helsinki-NLP/opus-mt-NORTH_EU-NORTH_EU',
|
||||||
|
'Helsinki-NLP/opus-mt-ROMANCE-en',
|
||||||
|
'Helsinki-NLP/opus-mt-SCANDINAVIA-SCANDINAVIA',
|
||||||
|
'Helsinki-NLP/opus-mt-de-ZH',
|
||||||
|
'Helsinki-NLP/opus-mt-en-CELTIC',
|
||||||
|
'Helsinki-NLP/opus-mt-en-ROMANCE',
|
||||||
|
'Helsinki-NLP/opus-mt-es-NORWAY',
|
||||||
|
'Helsinki-NLP/opus-mt-fi-NORWAY',
|
||||||
|
'Helsinki-NLP/opus-mt-fi-ZH',
|
||||||
|
'Helsinki-NLP/opus-mt-fi_nb_no_nn_ru_sv_en-SAMI',
|
||||||
|
'Helsinki-NLP/opus-mt-sv-NORWAY',
|
||||||
|
'Helsinki-NLP/opus-mt-sv-ZH']
|
||||||
|
GROUP_MEMBERS = {
|
||||||
|
'ZH': ['cmn', 'cn', 'yue', 'ze_zh', 'zh_cn', 'zh_CN', 'zh_HK', 'zh_tw', 'zh_TW', 'zh_yue', 'zhs', 'zht', 'zh'],
|
||||||
|
'ROMANCE': ['fr', 'fr_BE', 'fr_CA', 'fr_FR', 'wa', 'frp', 'oc', 'ca', 'rm', 'lld', 'fur', 'lij', 'lmo', 'es', 'es_AR', 'es_CL', 'es_CO', 'es_CR', 'es_DO', 'es_EC', 'es_ES', 'es_GT', 'es_HN', 'es_MX', 'es_NI', 'es_PA', 'es_PE', 'es_PR', 'es_SV', 'es_UY', 'es_VE', 'pt', 'pt_br', 'pt_BR', 'pt_PT', 'gl', 'lad', 'an', 'mwl', 'it', 'it_IT', 'co', 'nap', 'scn', 'vec', 'sc', 'ro', 'la'],
|
||||||
|
'NORTH_EU': ['de', 'nl', 'fy', 'af', 'da', 'fo', 'is', 'no', 'nb', 'nn', 'sv'],
|
||||||
|
'SCANDINAVIA': ['da', 'fo', 'is', 'no', 'nb', 'nn', 'sv'],
|
||||||
|
'SAMI': ['se', 'sma', 'smj', 'smn', 'sms'],
|
||||||
|
'NORWAY': ['nb_NO', 'nb', 'nn_NO', 'nn', 'nog', 'no_nb', 'no'],
|
||||||
|
'CELTIC': ['ga', 'cy', 'br', 'gd', 'kw', 'gv']
|
||||||
|
}
|
||||||
|
```
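
One way to use such a mapping is to look up which groups contain a given language code. The sketch below copies three of the groups listed above; `groups_containing` is a hypothetical helper, not part of the library.

```python
# Subset of the group table above, copied for a self-contained example.
GROUP_MEMBERS = {
    "SCANDINAVIA": ["da", "fo", "is", "no", "nb", "nn", "sv"],
    "SAMI": ["se", "sma", "smj", "smn", "sms"],
    "CELTIC": ["ga", "cy", "br", "gd", "kw", "gv"],
}


def groups_containing(lang_code):
    """Return the sorted names of all groups whose member list includes lang_code."""
    return sorted(group for group, members in GROUP_MEMBERS.items() if lang_code in members)


print(groups_containing("sv"))  # ['SCANDINAVIA']
print(groups_containing("cy"))  # ['CELTIC']
print(groups_containing("xx"))  # []
```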
|
||||||
|
|
||||||
|
Example of translating English to many Romance languages, using old-style 2-character language codes:
|
||||||
|
|
||||||
|
|
||||||
|
```python
|
||||||
|
>>> from transformers import MarianMTModel, MarianTokenizer
|
||||||
|
>>> src_text = [
|
||||||
|
... '>>fr<< this is a sentence in english that we want to translate to french',
|
||||||
|
... '>>pt<< This should go to portuguese',
|
||||||
|
... '>>es<< And this to Spanish'
|
||||||
|
... ]
|
||||||
|
|
||||||
|
>>> model_name = 'Helsinki-NLP/opus-mt-en-ROMANCE'
|
||||||
|
>>> tokenizer = MarianTokenizer.from_pretrained(model_name)
|
||||||
|
|
||||||
|
>>> model = MarianMTModel.from_pretrained(model_name)
|
||||||
|
>>> translated = model.generate(**tokenizer(src_text, return_tensors="pt", padding=True))
|
||||||
|
>>> [tokenizer.decode(t, skip_special_tokens=True) for t in translated]
|
||||||
|
["c'est une phrase en anglais que nous voulons traduire en français",
|
||||||
|
'Isto deve ir para o português.',
|
||||||
|
'Y esto al español']
|
||||||
|
```
|
||||||
|
|
||||||
|
## MarianConfig
|
||||||
|
|
||||||
|
[[autodoc]] MarianConfig
|
||||||
|
|
||||||
|
## MarianTokenizer
|
||||||
|
|
||||||
|
[[autodoc]] MarianTokenizer
|
||||||
|
- as_target_tokenizer
|
||||||
|
|
||||||
|
## MarianModel
|
||||||
|
|
||||||
|
[[autodoc]] MarianModel
|
||||||
|
- forward
|
||||||
|
|
||||||
|
## MarianMTModel
|
||||||
|
|
||||||
|
[[autodoc]] MarianMTModel
|
||||||
|
- forward
|
||||||
|
|
||||||
|
## MarianForCausalLM
|
||||||
|
|
||||||
|
[[autodoc]] MarianForCausalLM
|
||||||
|
- forward
|
||||||
|
|
||||||
|
## TFMarianModel
|
||||||
|
|
||||||
|
[[autodoc]] TFMarianModel
|
||||||
|
- call
|
||||||
|
|
||||||
|
## TFMarianMTModel
|
||||||
|
|
||||||
|
[[autodoc]] TFMarianMTModel
|
||||||
|
- call
|
||||||
|
|
||||||
|
## FlaxMarianModel
|
||||||
|
|
||||||
|
[[autodoc]] FlaxMarianModel
|
||||||
|
- __call__
|
||||||
|
|
||||||
|
## FlaxMarianMTModel
|
||||||
|
|
||||||
|
[[autodoc]] FlaxMarianMTModel
|
||||||
|
- __call__
|
||||||
@@ -1,232 +0,0 @@
|
|||||||
..
|
|
||||||
Copyright 2020 The HuggingFace Team. All rights reserved.
|
|
||||||
|
|
||||||
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
|
|
||||||
the License. You may obtain a copy of the License at
|
|
||||||
|
|
||||||
http://www.apache.org/licenses/LICENSE-2.0
|
|
||||||
|
|
||||||
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
|
|
||||||
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
|
|
||||||
specific language governing permissions and limitations under the License.
|
|
||||||
|
|
||||||
MarianMT
|
|
||||||
-----------------------------------------------------------------------------------------------------------------------
|
|
||||||
|
|
||||||
**Bugs:** If you see something strange, file a `Github Issue
|
|
||||||
<https://github.com/huggingface/transformers/issues/new?assignees=sshleifer&labels=&template=bug-report.md&title>`__
|
|
||||||
and assign @patrickvonplaten.
|
|
||||||
|
|
||||||
Translations should be similar, but not identical to output in the test set linked to in each model card.
|
|
||||||
|
|
||||||
Implementation Notes
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
- Each model is about 298 MB on disk, there are more than 1,000 models.
|
|
||||||
- The list of supported language pairs can be found `here <https://huggingface.co/Helsinki-NLP>`__.
|
|
||||||
- Models were originally trained by `Jörg Tiedemann
|
|
||||||
<https://researchportal.helsinki.fi/en/persons/j%C3%B6rg-tiedemann>`__ using the `Marian
|
|
||||||
<https://marian-nmt.github.io/>`__ C++ library, which supports fast training and translation.
|
|
||||||
- All models are transformer encoder-decoders with 6 layers in each component. Each model's performance is documented
|
|
||||||
in a model card.
|
|
||||||
- The 80 opus models that require BPE preprocessing are not supported.
|
|
||||||
- The modeling code is the same as :class:`~transformers.BartForConditionalGeneration` with a few minor modifications:
|
|
||||||
|
|
||||||
- static (sinusoid) positional embeddings (:obj:`MarianConfig.static_position_embeddings=True`)
|
|
||||||
- no layernorm_embedding (:obj:`MarianConfig.normalize_embedding=False`)
|
|
||||||
- the model starts generating with :obj:`pad_token_id` (which has 0 as a token_embedding) as the prefix (Bart uses
|
|
||||||
:obj:`<s/>`),
|
|
||||||
- Code to bulk convert models can be found in ``convert_marian_to_pytorch.py``.
|
|
||||||
- This model was contributed by `sshleifer <https://huggingface.co/sshleifer>`__.
|
|
||||||
|
|
||||||
Naming
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
- All model names use the following format: :obj:`Helsinki-NLP/opus-mt-{src}-{tgt}`
|
|
||||||
- The language codes used to name models are inconsistent. Two digit codes can usually be found `here
|
|
||||||
<https://developers.google.com/admin-sdk/directory/v1/languages>`__, three digit codes require googling "language
|
|
||||||
code {code}".
|
|
||||||
- Codes formatted like :obj:`es_AR` are usually :obj:`code_{region}`. That one is Spanish from Argentina.
|
|
||||||
- The models were converted in two stages. The first 1000 models use ISO-639-2 codes to identify languages, the second
|
|
||||||
group use a combination of ISO-639-5 codes and ISO-639-2 codes.
|
|
||||||
|
|
||||||
|
|
||||||
Examples
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
- Since Marian models are smaller than many other translation models available in the library, they can be useful for
|
|
||||||
fine-tuning experiments and integration tests.
|
|
||||||
- `Fine-tune on GPU
|
|
||||||
<https://github.com/huggingface/transformers/blob/master/examples/research_projects/seq2seq-distillation/train_distil_marian_enro_teacher.sh>`__
|
|
||||||
- `Fine-tune on GPU with pytorch-lightning
|
|
||||||
<https://github.com/huggingface/transformers/blob/master/examples/research_projects/seq2seq-distillation/train_distil_marian_no_teacher.sh>`__
|
|
||||||
|
|
||||||
Multilingual Models
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
- All model names use the following format: :obj:`Helsinki-NLP/opus-mt-{src}-{tgt}`:
|
|
||||||
- If a model can output multiple languages, and you should specify a language code by prepending the desired output
|
|
||||||
language to the :obj:`src_text`.
|
|
||||||
- You can see a models's supported language codes in its model card, under target constituents, like in `opus-mt-en-roa
|
|
||||||
<https://huggingface.co/Helsinki-NLP/opus-mt-en-roa>`__.
|
|
||||||
- Note that if a model is only multilingual on the source side, like :obj:`Helsinki-NLP/opus-mt-roa-en`, no language
|
|
||||||
codes are required.
|
|
||||||
|
|
||||||
New multi-lingual models from the `Tatoeba-Challenge repo <https://github.com/Helsinki-NLP/Tatoeba-Challenge>`__
|
|
||||||
require 3 character language codes:
|
|
||||||
|
|
||||||
.. code-block:: python
|
|
||||||
|
|
||||||
>>> from transformers import MarianMTModel, MarianTokenizer
|
|
||||||
>>> src_text = [
|
|
||||||
... '>>fra<< this is a sentence in english that we want to translate to french',
|
|
||||||
... '>>por<< This should go to portuguese',
|
|
||||||
... '>>esp<< And this to Spanish'
|
|
||||||
>>> ]
|
|
||||||
|
|
||||||
>>> model_name = 'Helsinki-NLP/opus-mt-en-roa'
|
|
||||||
>>> tokenizer = MarianTokenizer.from_pretrained(model_name)
|
|
||||||
>>> print(tokenizer.supported_language_codes)
|
|
||||||
['>>zlm_Latn<<', '>>mfe<<', '>>hat<<', '>>pap<<', '>>ast<<', '>>cat<<', '>>ind<<', '>>glg<<', '>>wln<<', '>>spa<<', '>>fra<<', '>>ron<<', '>>por<<', '>>ita<<', '>>oci<<', '>>arg<<', '>>min<<']
|
|
||||||
|
|
||||||
>>> model = MarianMTModel.from_pretrained(model_name)
|
|
||||||
>>> translated = model.generate(**tokenizer(src_text, return_tensors="pt", padding=True))
|
|
||||||
>>> [tokenizer.decode(t, skip_special_tokens=True) for t in translated]
|
|
||||||
["c'est une phrase en anglais que nous voulons traduire en français",
|
|
||||||
'Isto deve ir para o português.',
|
|
||||||
'Y esto al español']
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
Here is the code to see all available pretrained models on the hub:
|
|
||||||
|
|
||||||
.. code-block:: python
|
|
||||||
|
|
||||||
from huggingface_hub import list_models
|
|
||||||
model_list = list_models()
|
|
||||||
org = "Helsinki-NLP"
|
|
||||||
model_ids = [x.modelId for x in model_list if x.modelId.startswith(org)]
|
|
||||||
suffix = [x.split('/')[1] for x in model_ids]
|
|
||||||
old_style_multi_models = [f'{org}/{s}' for s in suffix if s != s.lower()]
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
Old Style Multi-Lingual Models
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
These are the old style multi-lingual models ported from the OPUS-MT-Train repo: and the members of each language
|
|
||||||
group:
|
|
||||||
|
|
||||||
.. code-block:: python
|
|
||||||
|
|
||||||
['Helsinki-NLP/opus-mt-NORTH_EU-NORTH_EU',
|
|
||||||
'Helsinki-NLP/opus-mt-ROMANCE-en',
|
|
||||||
'Helsinki-NLP/opus-mt-SCANDINAVIA-SCANDINAVIA',
|
|
||||||
'Helsinki-NLP/opus-mt-de-ZH',
|
|
||||||
'Helsinki-NLP/opus-mt-en-CELTIC',
|
|
||||||
'Helsinki-NLP/opus-mt-en-ROMANCE',
|
|
||||||
'Helsinki-NLP/opus-mt-es-NORWAY',
|
|
||||||
'Helsinki-NLP/opus-mt-fi-NORWAY',
|
|
||||||
'Helsinki-NLP/opus-mt-fi-ZH',
|
|
||||||
'Helsinki-NLP/opus-mt-fi_nb_no_nn_ru_sv_en-SAMI',
|
|
||||||
'Helsinki-NLP/opus-mt-sv-NORWAY',
|
|
||||||
'Helsinki-NLP/opus-mt-sv-ZH']
|
|
||||||
GROUP_MEMBERS = {
|
|
||||||
'ZH': ['cmn', 'cn', 'yue', 'ze_zh', 'zh_cn', 'zh_CN', 'zh_HK', 'zh_tw', 'zh_TW', 'zh_yue', 'zhs', 'zht', 'zh'],
|
|
||||||
'ROMANCE': ['fr', 'fr_BE', 'fr_CA', 'fr_FR', 'wa', 'frp', 'oc', 'ca', 'rm', 'lld', 'fur', 'lij', 'lmo', 'es', 'es_AR', 'es_CL', 'es_CO', 'es_CR', 'es_DO', 'es_EC', 'es_ES', 'es_GT', 'es_HN', 'es_MX', 'es_NI', 'es_PA', 'es_PE', 'es_PR', 'es_SV', 'es_UY', 'es_VE', 'pt', 'pt_br', 'pt_BR', 'pt_PT', 'gl', 'lad', 'an', 'mwl', 'it', 'it_IT', 'co', 'nap', 'scn', 'vec', 'sc', 'ro', 'la'],
|
|
||||||
'NORTH_EU': ['de', 'nl', 'fy', 'af', 'da', 'fo', 'is', 'no', 'nb', 'nn', 'sv'],
|
|
||||||
'SCANDINAVIA': ['da', 'fo', 'is', 'no', 'nb', 'nn', 'sv'],
|
|
||||||
'SAMI': ['se', 'sma', 'smj', 'smn', 'sms'],
|
|
||||||
'NORWAY': ['nb_NO', 'nb', 'nn_NO', 'nn', 'nog', 'no_nb', 'no'],
|
|
||||||
'CELTIC': ['ga', 'cy', 'br', 'gd', 'kw', 'gv']
|
|
||||||
}
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
Example of translating english to many romance languages, using old-style 2 character language codes
|
|
||||||
|
|
||||||
|
|
||||||
.. code-block::python
|
|
||||||
|
|
||||||
>>> from transformers import MarianMTModel, MarianTokenizer
|
|
||||||
>>> src_text = [
|
|
||||||
... '>>fr<< this is a sentence in english that we want to translate to french',
|
|
||||||
... '>>pt<< This should go to portuguese',
|
|
||||||
... '>>es<< And this to Spanish'
|
|
||||||
>>> ]
|
|
||||||
|
|
||||||
>>> model_name = 'Helsinki-NLP/opus-mt-en-ROMANCE'
|
|
||||||
>>> tokenizer = MarianTokenizer.from_pretrained(model_name)
|
|
||||||
|
|
||||||
>>> model = MarianMTModel.from_pretrained(model_name)
|
|
||||||
>>> translated = model.generate(**tokenizer(src_text, return_tensors="pt", padding=True))
|
|
||||||
>>> tgt_text = [tokenizer.decode(t, skip_special_tokens=True) for t in translated]
|
|
||||||
["c'est une phrase en anglais que nous voulons traduire en français",
|
|
||||||
'Isto deve ir para o português.',
|
|
||||||
'Y esto al español']
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
MarianConfig
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.MarianConfig
|
|
||||||
:members:
|
|
||||||
|
|
||||||
|
|
||||||
MarianTokenizer
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.MarianTokenizer
|
|
||||||
:members: as_target_tokenizer
|
|
||||||
|
|
||||||
|
|
||||||
MarianModel
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.MarianModel
|
|
||||||
:members: forward
|
|
||||||
|
|
||||||
|
|
||||||
MarianMTModel
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.MarianMTModel
|
|
||||||
:members: forward
|
|
||||||
|
|
||||||
|
|
||||||
MarianForCausalLM
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.MarianForCausalLM
|
|
||||||
:members: forward
|
|
||||||
|
|
||||||
|
|
||||||
TFMarianModel
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.TFMarianModel
|
|
||||||
:members: call
|
|
||||||
|
|
||||||
|
|
||||||
TFMarianMTModel
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.TFMarianMTModel
|
|
||||||
:members: call
|
|
||||||
|
|
||||||
|
|
||||||
FlaxMarianModel
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.FlaxMarianModel
|
|
||||||
:members: __call__
|
|
||||||
|
|
||||||
|
|
||||||
FlaxMarianMTModel
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.FlaxMarianMTModel
|
|
||||||
:members: __call__
|
|
||||||
230
docs/source/model_doc/mbart.mdx
Normal file
@@ -0,0 +1,230 @@
|
|||||||
|
<!--Copyright 2020 The HuggingFace Team. All rights reserved.
|
||||||
|
|
||||||
|
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
|
||||||
|
the License. You may obtain a copy of the License at
|
||||||
|
|
||||||
|
http://www.apache.org/licenses/LICENSE-2.0
|
||||||
|
|
||||||
|
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
|
||||||
|
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
|
||||||
|
specific language governing permissions and limitations under the License.
|
||||||
|
-->
|
||||||
|
|
||||||
|
# MBart and MBart-50
|
||||||
|
|
||||||
|
**DISCLAIMER:** If you see something strange, file a [Github Issue](https://github.com/huggingface/transformers/issues/new?assignees=&labels=&template=bug-report.md&title) and assign
|
||||||
|
@patrickvonplaten
|
||||||
|
|
||||||
|
## Overview of MBart
|
||||||
|
|
||||||
|
The MBart model was presented in [Multilingual Denoising Pre-training for Neural Machine Translation](https://arxiv.org/abs/2001.08210) by Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov, Marjan
Ghazvininejad, Mike Lewis, Luke Zettlemoyer.
|
||||||
|
|
||||||
|
According to the abstract, MBART is a sequence-to-sequence denoising auto-encoder pretrained on large-scale monolingual
|
||||||
|
corpora in many languages using the BART objective. mBART is one of the first methods for pretraining a complete
|
||||||
|
sequence-to-sequence model by denoising full texts in multiple languages, while previous approaches have focused only
|
||||||
|
on the encoder, decoder, or reconstructing parts of the text.
|
||||||
|
|
||||||
|
This model was contributed by [valhalla](https://huggingface.co/valhalla). The authors' code can be found [here](https://github.com/pytorch/fairseq/tree/master/examples/mbart).
|
||||||
|
|
||||||
|
### Training of MBart
|
||||||
|
|
||||||
|
MBart is a multilingual encoder-decoder (sequence-to-sequence) model primarily intended for translation tasks. As the
|
||||||
|
model is multilingual it expects the sequences in a different format. A special language id token is added in both the
|
||||||
|
source and target text. The source text format is `X [eos, src_lang_code]` where `X` is the source text. The
|
||||||
|
target text format is `[tgt_lang_code] X [eos]`. `bos` is never used.
|
||||||
|
|
||||||
|
The regular [`~MBartTokenizer.__call__`] will encode source text format, and it should be wrapped
|
||||||
|
inside the context manager [`~MBartTokenizer.as_target_tokenizer`] to encode target text format.
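
The two layouts can be illustrated with plain string tokens. This is only a sketch of the ordering: the whitespace split stands in for the real SentencePiece tokenization, the `"</s>"` string is used as a placeholder for the `eos` token, and the helper names are hypothetical rather than tokenizer methods.

```python
EOS = "</s>"  # placeholder string for the eos token


def mbart_source_format(tokens, src_lang_code):
    """Source layout: X [eos, src_lang_code]."""
    return tokens + [EOS, src_lang_code]


def mbart_target_format(tokens, tgt_lang_code):
    """Target layout: [tgt_lang_code] X [eos]."""
    return [tgt_lang_code] + tokens + [EOS]


src = mbart_source_format("UN Chief Says".split(), "en_XX")
tgt = mbart_target_format("Şeful ONU declară".split(), "ro_RO")
print(src)  # ['UN', 'Chief', 'Says', '</s>', 'en_XX']
print(tgt)  # ['ro_RO', 'Şeful', 'ONU', 'declară', '</s>']
```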
|
||||||
|
|
||||||
|
- Supervised training
|
||||||
|
|
||||||
|
```python
|
||||||
|
>>> from transformers import MBartForConditionalGeneration, MBartTokenizer
|
||||||
|
|
||||||
|
>>> tokenizer = MBartTokenizer.from_pretrained("facebook/mbart-large-en-ro", src_lang="en_XX", tgt_lang="ro_RO")
|
||||||
|
>>> example_english_phrase = "UN Chief Says There Is No Military Solution in Syria"
|
||||||
|
>>> expected_translation_romanian = "Şeful ONU declară că nu există o soluţie militară în Siria"
|
||||||
|
|
||||||
|
>>> inputs = tokenizer(example_english_phrase, return_tensors="pt")
|
||||||
|
>>> with tokenizer.as_target_tokenizer():
|
||||||
|
... labels = tokenizer(expected_translation_romanian, return_tensors="pt")
|
||||||
|
|
||||||
|
>>> model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-en-ro")
|
||||||
|
>>> # forward pass
|
||||||
|
>>> model(**inputs, labels=labels["input_ids"])
|
||||||
|
```
|
||||||
|
|
||||||
|
- Generation
|
||||||
|
|
||||||
|
While generating the target text, set the `decoder_start_token_id` to the target language id. The following
|
||||||
|
example shows how to translate English to Romanian using the *facebook/mbart-large-en-ro* model.
|
||||||
|
|
||||||
|
```python
|
||||||
|
>>> from transformers import MBartForConditionalGeneration, MBartTokenizer
|
||||||
|
|
||||||
|
>>> model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-en-ro")
>>> tokenizer = MBartTokenizer.from_pretrained("facebook/mbart-large-en-ro", src_lang="en_XX")
|
||||||
|
>>> article = "UN Chief Says There Is No Military Solution in Syria"
|
||||||
|
>>> inputs = tokenizer(article, return_tensors="pt")
|
||||||
|
>>> translated_tokens = model.generate(**inputs, decoder_start_token_id=tokenizer.lang_code_to_id["ro_RO"])
|
||||||
|
>>> tokenizer.batch_decode(translated_tokens, skip_special_tokens=True)[0]
|
||||||
|
"Şeful ONU declară că nu există o soluţie militară în Siria"
|
||||||
|
```
|
||||||
|
|
||||||
|
## Overview of MBart-50
|
||||||
|
|
||||||
|
MBart-50 was introduced in the [Multilingual Translation with Extensible Multilingual Pretraining and Finetuning](https://arxiv.org/abs/2008.00401) paper by Yuqing Tang, Chau Tran, Xian Li, Peng-Jen Chen, Naman Goyal, Vishrav
Chaudhary, Jiatao Gu, Angela Fan. MBart-50 is created using the original *mbart-large-cc25* checkpoint by extending
|
||||||
|
its embedding layers with randomly initialized vectors for an extra set of 25 language tokens and then pretrained on 50
|
||||||
|
languages.
|
||||||
|
|
||||||
|
According to the abstract:
|
||||||
|
|
||||||
|
*Multilingual translation models can be created through multilingual finetuning. Instead of finetuning on one
|
||||||
|
direction, a pretrained model is finetuned on many directions at the same time. It demonstrates that pretrained models
|
||||||
|
can be extended to incorporate additional languages without loss of performance. Multilingual finetuning improves on
|
||||||
|
average 1 BLEU over the strongest baselines (being either multilingual from scratch or bilingual finetuning) while
|
||||||
|
improving 9.3 BLEU on average over bilingual baselines from scratch.*
|
||||||
|
|
||||||
|
|
||||||
|
### Training of MBart-50
|
||||||
|
|
||||||
|
The text format for MBart-50 is slightly different from mBART. For MBart-50 the language id token is used as a prefix
|
||||||
|
for both source and target text, i.e. the text format is `[lang_code] X [eos]`, where `lang_code` is source
|
||||||
|
language id for source text and target language id for target text, with `X` being the source or target text
|
||||||
|
respectively.
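
A minimal sketch of this layout, again with plain string tokens: the whitespace split is a stand-in for the real tokenization, `"</s>"` is a placeholder for the `eos` token, and `mbart50_format` is a hypothetical helper rather than a tokenizer method.

```python
EOS = "</s>"  # placeholder string for the eos token


def mbart50_format(tokens, lang_code):
    """MBart-50 layout for both source and target: [lang_code] X [eos]."""
    return [lang_code] + tokens + [EOS]


src = mbart50_format("UN Chief Says".split(), "en_XX")
tgt = mbart50_format("Şeful ONU declară".split(), "ro_RO")
print(src)  # ['en_XX', 'UN', 'Chief', 'Says', '</s>']
print(tgt)  # ['ro_RO', 'Şeful', 'ONU', 'declară', '</s>']
```

Unlike the original MBart source format, the language code comes first on both sides, so a single function covers source and target.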
|
||||||
|
|
||||||
|
|
||||||
|
MBart-50 has its own tokenizer [`MBart50Tokenizer`].
|
||||||
|
|
||||||
|
- Supervised training
|
||||||
|
|
||||||
|
```python
|
||||||
|
from transformers import MBartForConditionalGeneration, MBart50TokenizerFast
|
||||||
|
|
||||||
|
model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-50")
|
||||||
|
tokenizer = MBart50TokenizerFast.from_pretrained("facebook/mbart-large-50", src_lang="en_XX", tgt_lang="ro_RO")
|
||||||
|
|
||||||
|
src_text = " UN Chief Says There Is No Military Solution in Syria"
|
||||||
|
tgt_text = "Şeful ONU declară că nu există o soluţie militară în Siria"
|
||||||
|
|
||||||
|
model_inputs = tokenizer(src_text, return_tensors="pt")
|
||||||
|
with tokenizer.as_target_tokenizer():
|
||||||
|
labels = tokenizer(tgt_text, return_tensors="pt").input_ids
|
||||||
|
|
||||||
|
model(**model_inputs, labels=labels) # forward pass
|
||||||
|
```
|
||||||
|
|
||||||
|
- Generation
|
||||||
|
|
||||||
|
To generate using the mBART-50 multilingual translation models, `eos_token_id` is used as the
|
||||||
|
`decoder_start_token_id` and the target language id is forced as the first generated token. To force the
|
||||||
|
target language id as the first generated token, pass the *forced_bos_token_id* parameter to the *generate* method.
|
||||||
|
The following example shows how to translate from Hindi to French and from Arabic to English using the
*facebook/mbart-large-50-many-to-many-mmt* checkpoint.
|
||||||
|
|
||||||
|
```python
|
||||||
|
from transformers import MBartForConditionalGeneration, MBart50TokenizerFast
|
||||||
|
|
||||||
|
article_hi = "संयुक्त राष्ट्र के प्रमुख का कहना है कि सीरिया में कोई सैन्य समाधान नहीं है"
|
||||||
|
article_ar = "الأمين العام للأمم المتحدة يقول إنه لا يوجد حل عسكري في سوريا."
|
||||||
|
|
||||||
|
model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-50-many-to-many-mmt")
|
||||||
|
tokenizer = MBart50TokenizerFast.from_pretrained("facebook/mbart-large-50-many-to-many-mmt")
|
||||||
|
|
||||||
|
# translate Hindi to French
|
||||||
|
tokenizer.src_lang = "hi_IN"
|
||||||
|
encoded_hi = tokenizer(article_hi, return_tensors="pt")
|
||||||
|
generated_tokens = model.generate(**encoded_hi, forced_bos_token_id=tokenizer.lang_code_to_id["fr_XX"])
|
||||||
|
tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
|
||||||
|
# => "Le chef de l 'ONU affirme qu 'il n 'y a pas de solution militaire en Syria."
|
||||||
|
|
||||||
|
# translate Arabic to English
|
||||||
|
tokenizer.src_lang = "ar_AR"
|
||||||
|
encoded_ar = tokenizer(article_ar, return_tensors="pt")
|
||||||
|
generated_tokens = model.generate(**encoded_ar, forced_bos_token_id=tokenizer.lang_code_to_id["en_XX"])
|
||||||
|
tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
|
||||||
|
# => "The Secretary-General of the United Nations says there is no military solution in Syria."
|
||||||
|
```
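
To make the role of `forced_bos_token_id` more concrete, here is a minimal, self-contained sketch of greedy decoding in which the first generated token is forced. It is not how `generate` is implemented; `greedy_decode`, `toy_scores`, and all token ids below are hypothetical stand-ins chosen for the illustration.

```python
# Toy illustration, assuming made-up token ids; not the transformers implementation.
EOS, FR_XX, EN_XX = 2, 100, 101


def toy_scores(prefix):
    """Hypothetical scorer: returns {token_id: score} for the next position."""
    if len(prefix) == 1:
        return {EN_XX: 0.9, FR_XX: 0.1}  # left alone, the toy model prefers EN_XX
    if len(prefix) == 2:
        return {10: 1.0, EOS: 0.0}
    if len(prefix) == 3:
        return {11: 1.0, EOS: 0.5}
    return {EOS: 1.0, 10: 0.0}


def greedy_decode(next_token_scores, decoder_start_token_id, eos_token_id,
                  forced_bos_token_id=None, max_length=10):
    """Greedy decoding that optionally forces the first generated token."""
    sequence = [decoder_start_token_id]
    while len(sequence) < max_length:
        scores = next_token_scores(sequence)
        if forced_bos_token_id is not None and len(sequence) == 1:
            # First generated position: ignore the scores and emit the forced id.
            next_id = forced_bos_token_id
        else:
            next_id = max(scores, key=scores.get)
        sequence.append(next_id)
        if next_id == eos_token_id:
            break
    return sequence


# eos is used as the decoder start token, and the target language id is forced.
forced = greedy_decode(toy_scores, EOS, EOS, forced_bos_token_id=FR_XX)
unforced = greedy_decode(toy_scores, EOS, EOS)
print(forced)
print(unforced)
```

In this toy setup the unforced run starts with the language id the scorer happens to prefer, whereas the forced run always starts with the requested target language id.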
|
||||||
|
|
||||||
|
## MBartConfig
|
||||||
|
|
||||||
|
[[autodoc]] MBartConfig
|
||||||
|
|
||||||
|
## MBartTokenizer
|
||||||
|
|
||||||
|
[[autodoc]] MBartTokenizer
|
||||||
|
- as_target_tokenizer
|
||||||
|
- build_inputs_with_special_tokens
|
||||||
|
|
||||||
|
## MBartTokenizerFast
|
||||||
|
|
||||||
|
[[autodoc]] MBartTokenizerFast
|
||||||
|
|
||||||
|
## MBart50Tokenizer
|
||||||
|
|
||||||
|
[[autodoc]] MBart50Tokenizer
|
||||||
|
|
||||||
|
## MBart50TokenizerFast
|
||||||
|
|
||||||
|
[[autodoc]] MBart50TokenizerFast
|
||||||
|
|
||||||
|
## MBartModel
|
||||||
|
|
||||||
|
[[autodoc]] MBartModel
|
||||||
|
|
||||||
|
## MBartForConditionalGeneration
|
||||||
|
|
||||||
|
[[autodoc]] MBartForConditionalGeneration
|
||||||
|
|
||||||
|
## MBartForQuestionAnswering
|
||||||
|
|
||||||
|
[[autodoc]] MBartForQuestionAnswering
|
||||||
|
|
||||||
|
## MBartForSequenceClassification
|
||||||
|
|
||||||
|
[[autodoc]] MBartForSequenceClassification
|
||||||
|
|
||||||
|
## MBartForCausalLM
|
||||||
|
|
||||||
|
[[autodoc]] MBartForCausalLM
|
||||||
|
- forward
|
||||||
|
|
||||||
|
## TFMBartModel
|
||||||
|
|
||||||
|
[[autodoc]] TFMBartModel
|
||||||
|
- call
|
||||||
|
|
||||||
|
## TFMBartForConditionalGeneration
|
||||||
|
|
||||||
|
[[autodoc]] TFMBartForConditionalGeneration
|
||||||
|
- call
|
||||||
|
|
||||||
|
## FlaxMBartModel
|
||||||
|
|
||||||
|
[[autodoc]] FlaxMBartModel
|
||||||
|
- __call__
|
||||||
|
- encode
|
||||||
|
- decode
|
||||||
|
|
||||||
|
## FlaxMBartForConditionalGeneration
|
||||||
|
|
||||||
|
[[autodoc]] FlaxMBartForConditionalGeneration
|
||||||
|
- __call__
|
||||||
|
- encode
|
||||||
|
- decode
|
||||||
|
|
||||||
|
## FlaxMBartForSequenceClassification
|
||||||
|
|
||||||
|
[[autodoc]] FlaxMBartForSequenceClassification
|
||||||
|
- __call__
|
||||||
|
- encode
|
||||||
|
- decode
|
||||||
|
|
||||||
|
## FlaxMBartForQuestionAnswering
|
||||||
|
|
||||||
|
[[autodoc]] FlaxMBartForQuestionAnswering
|
||||||
|
- __call__
|
||||||
|
- encode
|
||||||
|
- decode
|
||||||
@@ -1,270 +0,0 @@
|
|||||||
..
|
|
||||||
Copyright 2020 The HuggingFace Team. All rights reserved.
|
|
||||||
|
|
||||||
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
|
|
||||||
the License. You may obtain a copy of the License at
|
|
||||||
|
|
||||||
http://www.apache.org/licenses/LICENSE-2.0
|
|
||||||
|
|
||||||
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
|
|
||||||
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
|
|
||||||
specific language governing permissions and limitations under the License.
|
|
||||||
|
|
||||||
MBart and MBart-50
|
|
||||||
-----------------------------------------------------------------------------------------------------------------------
|
|
||||||
|
|
||||||
**DISCLAIMER:** If you see something strange, file a `GitHub Issue
|
|
||||||
<https://github.com/huggingface/transformers/issues/new?assignees=&labels=&template=bug-report.md&title>`__ and assign
|
|
||||||
@patrickvonplaten
|
|
||||||
|
|
||||||
Overview of MBart
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
The MBart model was presented in `Multilingual Denoising Pre-training for Neural Machine Translation
|
|
||||||
<https://arxiv.org/abs/2001.08210>`_ by Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov, Marjan
|
|
||||||
Ghazvininejad, Mike Lewis, Luke Zettlemoyer.
|
|
||||||
|
|
||||||
According to the abstract, mBART is a sequence-to-sequence denoising auto-encoder pretrained on large-scale monolingual
|
|
||||||
corpora in many languages using the BART objective. mBART is one of the first methods for pretraining a complete
|
|
||||||
sequence-to-sequence model by denoising full texts in multiple languages, while previous approaches have focused only
|
|
||||||
on the encoder, decoder, or reconstructing parts of the text.
|
|
||||||
|
|
||||||
This model was contributed by `valhalla <https://huggingface.co/valhalla>`__. The authors' code can be found `here
|
|
||||||
<https://github.com/pytorch/fairseq/tree/master/examples/mbart>`__.
|
|
||||||
|
|
||||||
Training of MBart
|
|
||||||
_______________________________________________________________________________________________________________________
|
|
||||||
|
|
||||||
MBart is a multilingual encoder-decoder (sequence-to-sequence) model primarily intended for translation tasks. As the
|
|
||||||
model is multilingual, it expects the sequences in a different format. A special language id token is added in both the
|
|
||||||
source and target text. The source text format is :obj:`X [eos, src_lang_code]` where :obj:`X` is the source text. The
|
|
||||||
target text format is :obj:`[tgt_lang_code] X [eos]`. :obj:`bos` is never used.
|
|
||||||
|
|
||||||
The regular :meth:`~transformers.MBartTokenizer.__call__` will encode source text format, and it should be wrapped
|
|
||||||
inside the context manager :meth:`~transformers.MBartTokenizer.as_target_tokenizer` to encode target text format.
|
|
||||||
|
|
||||||
- Supervised training
|
|
||||||
|
|
||||||
.. code-block::
|
|
||||||
|
|
||||||
>>> from transformers import MBartForConditionalGeneration, MBartTokenizer
|
|
||||||
|
|
||||||
>>> tokenizer = MBartTokenizer.from_pretrained("facebook/mbart-large-en-ro", src_lang="en_XX", tgt_lang="ro_RO")
|
|
||||||
>>> example_english_phrase = "UN Chief Says There Is No Military Solution in Syria"
|
|
||||||
>>> expected_translation_romanian = "Şeful ONU declară că nu există o soluţie militară în Siria"
|
|
||||||
|
|
||||||
>>> inputs = tokenizer(example_english_phrase, return_tensors="pt")
|
|
||||||
>>> with tokenizer.as_target_tokenizer():
|
|
||||||
...     labels = tokenizer(expected_translation_romanian, return_tensors="pt")
|
|
||||||
|
|
||||||
>>> model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-en-ro")
|
|
||||||
>>> # forward pass
|
|
||||||
>>> model(**inputs, labels=labels["input_ids"])
|
|
||||||
|
|
||||||
- Generation
|
|
||||||
|
|
||||||
When generating the target text, set the :obj:`decoder_start_token_id` to the target language id. The following
|
|
||||||
example shows how to translate English to Romanian using the `facebook/mbart-large-en-ro` model.
|
|
||||||
|
|
||||||
.. code-block::
|
|
||||||
|
|
||||||
>>> from transformers import MBartForConditionalGeneration, MBartTokenizer
|
|
||||||
|
|
||||||
>>> model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-en-ro")
>>> tokenizer = MBartTokenizer.from_pretrained("facebook/mbart-large-en-ro", src_lang="en_XX")
|
|
||||||
>>> article = "UN Chief Says There Is No Military Solution in Syria"
|
|
||||||
>>> inputs = tokenizer(article, return_tensors="pt")
|
|
||||||
>>> translated_tokens = model.generate(**inputs, decoder_start_token_id=tokenizer.lang_code_to_id["ro_RO"])
|
|
||||||
>>> tokenizer.batch_decode(translated_tokens, skip_special_tokens=True)[0]
|
|
||||||
"Şeful ONU declară că nu există o soluţie militară în Siria"
|
|
||||||
|
|
||||||
|
|
||||||
Overview of MBart-50
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
MBart-50 was introduced in the `Multilingual Translation with Extensible Multilingual Pretraining and Finetuning
|
|
||||||
<https://arxiv.org/abs/2008.00401>`_ paper by Yuqing Tang, Chau Tran, Xian Li, Peng-Jen Chen, Naman Goyal, Vishrav
|
|
||||||
Chaudhary, Jiatao Gu, Angela Fan. MBart-50 is created using the original `mbart-large-cc25` checkpoint by extending
|
|
||||||
its embedding layers with randomly initialized vectors for an extra set of 25 language tokens and then pretrained on 50
|
|
||||||
languages.
|
|
||||||
|
|
||||||
According to the abstract:
|
|
||||||
|
|
||||||
*Multilingual translation models can be created through multilingual finetuning. Instead of finetuning on one
|
|
||||||
direction, a pretrained model is finetuned on many directions at the same time. It demonstrates that pretrained models
|
|
||||||
can be extended to incorporate additional languages without loss of performance. Multilingual finetuning improves on
|
|
||||||
average 1 BLEU over the strongest baselines (being either multilingual from scratch or bilingual finetuning) while
|
|
||||||
improving 9.3 BLEU on average over bilingual baselines from scratch.*
|
|
||||||
|
|
||||||
|
|
||||||
Training of MBart-50
|
|
||||||
_______________________________________________________________________________________________________________________
|
|
||||||
|
|
||||||
The text format for MBart-50 is slightly different from mBART. For MBart-50 the language id token is used as a prefix
|
|
||||||
for both source and target text, i.e. the text format is :obj:`[lang_code] X [eos]`, where :obj:`lang_code` is source
|
|
||||||
language id for source text and target language id for target text, with :obj:`X` being the source or target text
|
|
||||||
respectively.
|
|
||||||
|
|
||||||
|
|
||||||
MBart-50 has its own tokenizer :class:`~transformers.MBart50Tokenizer`.
|
|
||||||
|
|
||||||
- Supervised training
|
|
||||||
|
|
||||||
.. code-block::
|
|
||||||
|
|
||||||
from transformers import MBartForConditionalGeneration, MBart50TokenizerFast
|
|
||||||
|
|
||||||
model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-50")
|
|
||||||
tokenizer = MBart50TokenizerFast.from_pretrained("facebook/mbart-large-50", src_lang="en_XX", tgt_lang="ro_RO")
|
|
||||||
|
|
||||||
src_text = " UN Chief Says There Is No Military Solution in Syria"
|
|
||||||
tgt_text = "Şeful ONU declară că nu există o soluţie militară în Siria"
|
|
||||||
|
|
||||||
model_inputs = tokenizer(src_text, return_tensors="pt")
|
|
||||||
with tokenizer.as_target_tokenizer():
|
|
||||||
    labels = tokenizer(tgt_text, return_tensors="pt").input_ids
|
|
||||||
|
|
||||||
model(**model_inputs, labels=labels) # forward pass
|
|
||||||
|
|
||||||
|
|
||||||
- Generation
|
|
||||||
|
|
||||||
To generate using the mBART-50 multilingual translation models, :obj:`eos_token_id` is used as the
|
|
||||||
:obj:`decoder_start_token_id` and the target language id is forced as the first generated token. To force the
|
|
||||||
target language id as the first generated token, pass the `forced_bos_token_id` parameter to the `generate` method.
|
|
||||||
The following example shows how to translate from Hindi to French and from Arabic to English using the
|
|
||||||
`facebook/mbart-large-50-many-to-many-mmt` checkpoint.
|
|
||||||
|
|
||||||
.. code-block::
|
|
||||||
|
|
||||||
from transformers import MBartForConditionalGeneration, MBart50TokenizerFast
|
|
||||||
|
|
||||||
article_hi = "संयुक्त राष्ट्र के प्रमुख का कहना है कि सीरिया में कोई सैन्य समाधान नहीं है"
|
|
||||||
article_ar = "الأمين العام للأمم المتحدة يقول إنه لا يوجد حل عسكري في سوريا."
|
|
||||||
|
|
||||||
model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-50-many-to-many-mmt")
|
|
||||||
tokenizer = MBart50TokenizerFast.from_pretrained("facebook/mbart-large-50-many-to-many-mmt")
|
|
||||||
|
|
||||||
# translate Hindi to French
|
|
||||||
tokenizer.src_lang = "hi_IN"
|
|
||||||
encoded_hi = tokenizer(article_hi, return_tensors="pt")
|
|
||||||
generated_tokens = model.generate(**encoded_hi, forced_bos_token_id=tokenizer.lang_code_to_id["fr_XX"])
|
|
||||||
tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
|
|
||||||
# => "Le chef de l 'ONU affirme qu 'il n 'y a pas de solution militaire en Syria."
|
|
||||||
|
|
||||||
# translate Arabic to English
|
|
||||||
tokenizer.src_lang = "ar_AR"
|
|
||||||
encoded_ar = tokenizer(article_ar, return_tensors="pt")
|
|
||||||
generated_tokens = model.generate(**encoded_ar, forced_bos_token_id=tokenizer.lang_code_to_id["en_XX"])
|
|
||||||
tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
|
|
||||||
# => "The Secretary-General of the United Nations says there is no military solution in Syria."
|
|
||||||
|
|
||||||
|
|
||||||
MBartConfig
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.MBartConfig
|
|
||||||
:members:
|
|
||||||
|
|
||||||
|
|
||||||
MBartTokenizer
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.MBartTokenizer
|
|
||||||
:members: as_target_tokenizer, build_inputs_with_special_tokens
|
|
||||||
|
|
||||||
|
|
||||||
MBartTokenizerFast
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.MBartTokenizerFast
|
|
||||||
:members:
|
|
||||||
|
|
||||||
|
|
||||||
MBart50Tokenizer
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.MBart50Tokenizer
|
|
||||||
:members:
|
|
||||||
|
|
||||||
|
|
||||||
MBart50TokenizerFast
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.MBart50TokenizerFast
|
|
||||||
:members:
|
|
||||||
|
|
||||||
|
|
||||||
MBartModel
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.MBartModel
|
|
||||||
:members:
|
|
||||||
|
|
||||||
|
|
||||||
MBartForConditionalGeneration
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.MBartForConditionalGeneration
|
|
||||||
:members:
|
|
||||||
|
|
||||||
|
|
||||||
MBartForQuestionAnswering
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.MBartForQuestionAnswering
|
|
||||||
:members:
|
|
||||||
|
|
||||||
|
|
||||||
MBartForSequenceClassification
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.MBartForSequenceClassification
|
|
||||||
|
|
||||||
|
|
||||||
MBartForCausalLM
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.MBartForCausalLM
|
|
||||||
:members: forward
|
|
||||||
|
|
||||||
|
|
||||||
TFMBartModel
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.TFMBartModel
|
|
||||||
:members: call
|
|
||||||
|
|
||||||
|
|
||||||
TFMBartForConditionalGeneration
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.TFMBartForConditionalGeneration
|
|
||||||
:members: call
|
|
||||||
|
|
||||||
|
|
||||||
FlaxMBartModel
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.FlaxMBartModel
|
|
||||||
:members: __call__, encode, decode
|
|
||||||
|
|
||||||
|
|
||||||
FlaxMBartForConditionalGeneration
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.FlaxMBartForConditionalGeneration
|
|
||||||
:members: __call__, encode, decode
|
|
||||||
|
|
||||||
|
|
||||||
FlaxMBartForSequenceClassification
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.FlaxMBartForSequenceClassification
|
|
||||||
:members: __call__, encode, decode
|
|
||||||
|
|
||||||
|
|
||||||
FlaxMBartForQuestionAnswering
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. autoclass:: transformers.FlaxMBartForQuestionAnswering
|
|
||||||
:members: __call__, encode, decode
|
|
||||||