[Doctests] Fix all T5 doc tests (#16646)
* [Doctests] Fix all T5 doc tests * make style * Update docs/source/en/model_doc/t5.mdx Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com> * Apply Sylvains comments * Apply suggestions from code review Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
This commit is contained in:
committed by
GitHub
parent
f7196f2e63
commit
b24201fa44
@@ -32,9 +32,9 @@ NLP, we release our dataset, pre-trained models, and code.*
|
||||
Tips:
|
||||
|
||||
- T5 is an encoder-decoder model pre-trained on a multi-task mixture of unsupervised and supervised tasks and for which
|
||||
each task is converted into a text-to-text format. T5 works well on a variety of tasks out-of-the-box by prepending a
|
||||
different prefix to the input corresponding to each task, e.g., for translation: *translate English to German: ...*,
|
||||
for summarization: *summarize: ...*.
|
||||
each task is converted into a text-to-text format. T5 works well on a variety of tasks out-of-the-box by prepending a
|
||||
different prefix to the input corresponding to each task, e.g., for translation: *translate English to German: ...*,
|
||||
for summarization: *summarize: ...*.
|
||||
|
||||
- T5 uses relative scalar embeddings. Encoder input padding can be done on the left and on the right.
|
||||
|
||||
@@ -83,130 +83,140 @@ language modeling head on top of the decoder.
|
||||
|
||||
- Unsupervised denoising training
|
||||
|
||||
In this setup, spans of the input sequence are masked by so-called sentinel tokens (*a.k.a* unique mask tokens) and
|
||||
the output sequence is formed as a concatenation of the same sentinel tokens and the *real* masked tokens. Each
|
||||
sentinel token represents a unique mask token for this sentence and should start with `<extra_id_0>`,
|
||||
`<extra_id_1>`, ... up to `<extra_id_99>`. As a default, 100 sentinel tokens are available in
|
||||
[`T5Tokenizer`].
|
||||
In this setup, spans of the input sequence are masked by so-called sentinel tokens (*a.k.a* unique mask tokens) and
|
||||
the output sequence is formed as a concatenation of the same sentinel tokens and the *real* masked tokens. Each
|
||||
sentinel token represents a unique mask token for this sentence and should start with `<extra_id_0>`,
|
||||
`<extra_id_1>`, ... up to `<extra_id_99>`. As a default, 100 sentinel tokens are available in
|
||||
[`T5Tokenizer`].
|
||||
|
||||
For instance, the sentence "The cute dog walks in the park" with the masks put on "cute dog" and "the" should be
|
||||
processed as follows:
|
||||
For instance, the sentence "The cute dog walks in the park" with the masks put on "cute dog" and "the" should be
|
||||
processed as follows:
|
||||
|
||||
```python
|
||||
from transformers import T5Tokenizer, T5ForConditionalGeneration
|
||||
```python
|
||||
>>> from transformers import T5Tokenizer, T5ForConditionalGeneration
|
||||
|
||||
tokenizer = T5Tokenizer.from_pretrained("t5-small")
|
||||
model = T5ForConditionalGeneration.from_pretrained("t5-small")
|
||||
>>> tokenizer = T5Tokenizer.from_pretrained("t5-small")
|
||||
>>> model = T5ForConditionalGeneration.from_pretrained("t5-small")
|
||||
|
||||
input_ids = tokenizer("The <extra_id_0> walks in <extra_id_1> park", return_tensors="pt").input_ids
|
||||
labels = tokenizer("<extra_id_0> cute dog <extra_id_1> the <extra_id_2>", return_tensors="pt").input_ids
|
||||
# the forward function automatically creates the correct decoder_input_ids
|
||||
loss = model(input_ids=input_ids, labels=labels).loss
|
||||
```
|
||||
>>> input_ids = tokenizer("The <extra_id_0> walks in <extra_id_1> park", return_tensors="pt").input_ids
|
||||
>>> labels = tokenizer("<extra_id_0> cute dog <extra_id_1> the <extra_id_2>", return_tensors="pt").input_ids
|
||||
|
||||
If you're interested in pre-training T5 on a new corpus, check out the [run_t5_mlm_flax.py](https://github.com/huggingface/transformers/tree/main/examples/flax/language-modeling) script in the Examples
|
||||
directory.
|
||||
>>> # the forward function automatically creates the correct decoder_input_ids
|
||||
>>> loss = model(input_ids=input_ids, labels=labels).loss
|
||||
>>> loss.item()
|
||||
3.7837
|
||||
```
|
||||
|
||||
If you're interested in pre-training T5 on a new corpus, check out the [run_t5_mlm_flax.py](https://github.com/huggingface/transformers/tree/main/examples/flax/language-modeling) script in the Examples
|
||||
directory.
|
||||
|
||||
- Supervised training
|
||||
|
||||
In this setup, the input sequence and output sequence are a standard sequence-to-sequence input-output mapping.
|
||||
Suppose that we want to fine-tune the model for translation for example, and we have a training example: the input
|
||||
sequence "The house is wonderful." and output sequence "Das Haus ist wunderbar.", then they should be prepared for
|
||||
the model as follows:
|
||||
In this setup, the input sequence and output sequence are a standard sequence-to-sequence input-output mapping.
|
||||
Suppose that we want to fine-tune the model for translation for example, and we have a training example: the input
|
||||
sequence "The house is wonderful." and output sequence "Das Haus ist wunderbar.", then they should be prepared for
|
||||
the model as follows:
|
||||
|
||||
```python
|
||||
from transformers import T5Tokenizer, T5ForConditionalGeneration
|
||||
```python
|
||||
>>> from transformers import T5Tokenizer, T5ForConditionalGeneration
|
||||
|
||||
tokenizer = T5Tokenizer.from_pretrained("t5-small")
|
||||
model = T5ForConditionalGeneration.from_pretrained("t5-small")
|
||||
>>> tokenizer = T5Tokenizer.from_pretrained("t5-small")
|
||||
>>> model = T5ForConditionalGeneration.from_pretrained("t5-small")
|
||||
|
||||
input_ids = tokenizer("translate English to German: The house is wonderful.", return_tensors="pt").input_ids
|
||||
labels = tokenizer("Das Haus ist wunderbar.", return_tensors="pt").input_ids
|
||||
# the forward function automatically creates the correct decoder_input_ids
|
||||
loss = model(input_ids=input_ids, labels=labels).loss
|
||||
```
|
||||
>>> input_ids = tokenizer("translate English to German: The house is wonderful.", return_tensors="pt").input_ids
|
||||
>>> labels = tokenizer("Das Haus ist wunderbar.", return_tensors="pt").input_ids
|
||||
|
||||
As you can see, only 2 inputs are required for the model in order to compute a loss: `input_ids` (which are the
|
||||
`input_ids` of the encoded input sequence) and `labels` (which are the `input_ids` of the encoded
|
||||
target sequence). The model will automatically create the `decoder_input_ids` based on the `labels`, by
|
||||
shifting them one position to the right and prepending the `config.decoder_start_token_id`, which for T5 is
|
||||
equal to 0 (i.e. the id of the pad token). Also note the task prefix: we prepend the input sequence with 'translate
|
||||
English to German: ' before encoding it. This will help in improving the performance, as this task prefix was used
|
||||
during T5's pre-training.
|
||||
>>> # the forward function automatically creates the correct decoder_input_ids
|
||||
>>> loss = model(input_ids=input_ids, labels=labels).loss
|
||||
>>> loss.item()
|
||||
0.2542
|
||||
```
|
||||
|
||||
However, the example above only shows a single training example. In practice, one trains deep learning models in
|
||||
batches. This entails that we must pad/truncate examples to the same length. For encoder-decoder models, one
|
||||
typically defines a `max_source_length` and `max_target_length`, which determine the maximum length of the
|
||||
input and output sequences respectively (otherwise they are truncated). These should be carefully set depending on
|
||||
the task.
|
||||
As you can see, only 2 inputs are required for the model in order to compute a loss: `input_ids` (which are the
|
||||
`input_ids` of the encoded input sequence) and `labels` (which are the `input_ids` of the encoded
|
||||
target sequence). The model will automatically create the `decoder_input_ids` based on the `labels`, by
|
||||
shifting them one position to the right and prepending the `config.decoder_start_token_id`, which for T5 is
|
||||
equal to 0 (i.e. the id of the pad token). Also note the task prefix: we prepend the input sequence with 'translate
|
||||
English to German: ' before encoding it. This will help in improving the performance, as this task prefix was used
|
||||
during T5's pre-training.
|
||||
|
||||
In addition, we must make sure that padding token id's of the `labels` are not taken into account by the loss
|
||||
function. In PyTorch and Tensorflow, this can be done by replacing them with -100, which is the `ignore_index`
|
||||
of the `CrossEntropyLoss`. In Flax, one can use the `decoder_attention_mask` to ignore padded tokens from
|
||||
the loss (see the [Flax summarization script](https://github.com/huggingface/transformers/tree/main/examples/flax/summarization) for details). We also pass
|
||||
`attention_mask` as additional input to the model, which makes sure that padding tokens of the inputs are
|
||||
ignored. The code example below illustrates all of this.
|
||||
However, the example above only shows a single training example. In practice, one trains deep learning models in
|
||||
batches. This entails that we must pad/truncate examples to the same length. For encoder-decoder models, one
|
||||
typically defines a `max_source_length` and `max_target_length`, which determine the maximum length of the
|
||||
input and output sequences respectively (otherwise they are truncated). These should be carefully set depending on
|
||||
the task.
|
||||
|
||||
```python
|
||||
from transformers import T5Tokenizer, T5ForConditionalGeneration
|
||||
import torch
|
||||
In addition, we must make sure that padding token id's of the `labels` are not taken into account by the loss
|
||||
function. In PyTorch and Tensorflow, this can be done by replacing them with -100, which is the `ignore_index`
|
||||
of the `CrossEntropyLoss`. In Flax, one can use the `decoder_attention_mask` to ignore padded tokens from
|
||||
the loss (see the [Flax summarization script](https://github.com/huggingface/transformers/tree/main/examples/flax/summarization) for details). We also pass
|
||||
`attention_mask` as additional input to the model, which makes sure that padding tokens of the inputs are
|
||||
ignored. The code example below illustrates all of this.
|
||||
|
||||
tokenizer = T5Tokenizer.from_pretrained("t5-small")
|
||||
model = T5ForConditionalGeneration.from_pretrained("t5-small")
|
||||
```python
|
||||
>>> from transformers import T5Tokenizer, T5ForConditionalGeneration
|
||||
>>> import torch
|
||||
|
||||
# the following 2 hyperparameters are task-specific
|
||||
max_source_length = 512
|
||||
max_target_length = 128
|
||||
>>> tokenizer = T5Tokenizer.from_pretrained("t5-small")
|
||||
>>> model = T5ForConditionalGeneration.from_pretrained("t5-small")
|
||||
|
||||
# Suppose we have the following 2 training examples:
|
||||
input_sequence_1 = "Welcome to NYC"
|
||||
output_sequence_1 = "Bienvenue à NYC"
|
||||
>>> # the following 2 hyperparameters are task-specific
|
||||
>>> max_source_length = 512
|
||||
>>> max_target_length = 128
|
||||
|
||||
input_sequence_2 = "HuggingFace is a company"
|
||||
output_sequence_2 = "HuggingFace est une entreprise"
|
||||
>>> # Suppose we have the following 2 training examples:
|
||||
>>> input_sequence_1 = "Welcome to NYC"
|
||||
>>> output_sequence_1 = "Bienvenue à NYC"
|
||||
|
||||
# encode the inputs
|
||||
task_prefix = "translate English to French: "
|
||||
input_sequences = [input_sequence_1, input_sequence_2]
|
||||
encoding = tokenizer(
|
||||
[task_prefix + sequence for sequence in input_sequences],
|
||||
padding="longest",
|
||||
max_length=max_source_length,
|
||||
truncation=True,
|
||||
return_tensors="pt",
|
||||
)
|
||||
input_ids, attention_mask = encoding.input_ids, encoding.attention_mask
|
||||
>>> input_sequence_2 = "HuggingFace is a company"
|
||||
>>> output_sequence_2 = "HuggingFace est une entreprise"
|
||||
|
||||
# encode the targets
|
||||
target_encoding = tokenizer(
|
||||
[output_sequence_1, output_sequence_2], padding="longest", max_length=max_target_length, truncation=True
|
||||
)
|
||||
labels = target_encoding.input_ids
|
||||
>>> # encode the inputs
|
||||
>>> task_prefix = "translate English to French: "
|
||||
>>> input_sequences = [input_sequence_1, input_sequence_2]
|
||||
|
||||
# replace padding token id's of the labels by -100
|
||||
labels = torch.tensor(labels)
|
||||
labels[labels == tokenizer.pad_token_id] = -100
|
||||
>>> encoding = tokenizer(
|
||||
... [task_prefix + sequence for sequence in input_sequences],
|
||||
... padding="longest",
|
||||
... max_length=max_source_length,
|
||||
... truncation=True,
|
||||
... return_tensors="pt",
|
||||
... )
|
||||
|
||||
# forward pass
|
||||
loss = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels).loss
|
||||
```
|
||||
>>> input_ids, attention_mask = encoding.input_ids, encoding.attention_mask
|
||||
|
||||
>>> # encode the targets
|
||||
>>> target_encoding = tokenizer(
|
||||
... [output_sequence_1, output_sequence_2], padding="longest", max_length=max_target_length, truncation=True
|
||||
... )
|
||||
>>> labels = target_encoding.input_ids
|
||||
|
||||
>>> # replace padding token id's of the labels by -100 so it's ignored by the loss
|
||||
>>> labels = torch.tensor(labels)
|
||||
>>> labels[labels == tokenizer.pad_token_id] = -100
|
||||
|
||||
>>> # forward pass
|
||||
>>> loss = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels).loss
|
||||
>>> loss.item()
|
||||
0.188
|
||||
```
|
||||
|
||||
Additional training tips:
|
||||
|
||||
- T5 models need a slightly higher learning rate than the default one set in the `Trainer` when using the AdamW
|
||||
optimizer. Typically, 1e-4 and 3e-4 work well for most problems (classification, summarization, translation, question
|
||||
answering, question generation). Note that T5 was pre-trained using the AdaFactor optimizer.
|
||||
optimizer. Typically, 1e-4 and 3e-4 work well for most problems (classification, summarization, translation, question
|
||||
answering, question generation). Note that T5 was pre-trained using the AdaFactor optimizer.
|
||||
|
||||
- According to [this forum post](https://discuss.huggingface.co/t/t5-finetuning-tips/684), task prefixes matter when
|
||||
(1) doing multi-task training (2) your task is similar or related to one of the supervised tasks used in T5's
|
||||
pre-training mixture (see Appendix D of the [paper](https://arxiv.org/pdf/1910.10683.pdf) for the task prefixes
|
||||
used).
|
||||
According to [this forum post](https://discuss.huggingface.co/t/t5-finetuning-tips/684), task prefixes matter when
|
||||
(1) doing multi-task training (2) your task is similar or related to one of the supervised tasks used in T5's
|
||||
pre-training mixture (see Appendix D of the [paper](https://arxiv.org/pdf/1910.10683.pdf) for the task prefixes
|
||||
used).
|
||||
|
||||
- If training on TPU, it is recommended to pad all examples of the dataset to the same length or make use of
|
||||
*pad_to_multiple_of* to have a small number of predefined bucket sizes to fit all examples in. Dynamically padding
|
||||
batches to the longest example is not recommended on TPU as it triggers a recompilation for every batch shape that is
|
||||
encountered during training thus significantly slowing down the training. only padding up to the longest example in a
|
||||
batch) leads to very slow training on TPU.
|
||||
If training on TPU, it is recommended to pad all examples of the dataset to the same length or make use of
|
||||
*pad_to_multiple_of* to have a small number of predefined bucket sizes to fit all examples in. Dynamically padding
|
||||
batches to the longest example is not recommended on TPU as it triggers a recompilation for every batch shape that is
|
||||
encountered during training thus significantly slowing down the training. only padding up to the longest example in a
|
||||
batch) leads to very slow training on TPU.
|
||||
|
||||
<a id='inference'></a>
|
||||
|
||||
@@ -219,15 +229,15 @@ There's also [this blog post](https://huggingface.co/blog/encoder-decoder#encode
|
||||
generation works in general in encoder-decoder models.
|
||||
|
||||
```python
|
||||
from transformers import T5Tokenizer, T5ForConditionalGeneration
|
||||
>>> from transformers import T5Tokenizer, T5ForConditionalGeneration
|
||||
|
||||
tokenizer = T5Tokenizer.from_pretrained("t5-small")
|
||||
model = T5ForConditionalGeneration.from_pretrained("t5-small")
|
||||
>>> tokenizer = T5Tokenizer.from_pretrained("t5-small")
|
||||
>>> model = T5ForConditionalGeneration.from_pretrained("t5-small")
|
||||
|
||||
input_ids = tokenizer("translate English to German: The house is wonderful.", return_tensors="pt").input_ids
|
||||
outputs = model.generate(input_ids)
|
||||
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
|
||||
# Das Haus ist wunderbar.
|
||||
>>> input_ids = tokenizer("translate English to German: The house is wonderful.", return_tensors="pt").input_ids
|
||||
>>> outputs = model.generate(input_ids)
|
||||
>>> print(tokenizer.decode(outputs[0], skip_special_tokens=True))
|
||||
Das Haus ist wunderbar.
|
||||
```
|
||||
|
||||
Note that T5 uses the `pad_token_id` as the `decoder_start_token_id`, so when doing generation without using
|
||||
@@ -236,31 +246,47 @@ Note that T5 uses the `pad_token_id` as the `decoder_start_token_id`, so when do
|
||||
The example above only shows a single example. You can also do batched inference, like so:
|
||||
|
||||
```python
|
||||
from transformers import T5Tokenizer, T5ForConditionalGeneration
|
||||
>>> from transformers import T5Tokenizer, T5ForConditionalGeneration
|
||||
|
||||
tokenizer = T5Tokenizer.from_pretrained("t5-small")
|
||||
model = T5ForConditionalGeneration.from_pretrained("t5-small")
|
||||
>>> tokenizer = T5Tokenizer.from_pretrained("t5-small")
|
||||
>>> model = T5ForConditionalGeneration.from_pretrained("t5-small")
|
||||
|
||||
# when generating, we will use the logits of right-most token to predict the next token
|
||||
# so the padding should be on the left
|
||||
tokenizer.padding_side = "left"
|
||||
tokenizer.pad_token = tokenizer.eos_token # to avoid an error
|
||||
>>> task_prefix = "translate English to German: "
|
||||
>>> sentences = [
|
||||
... "The house is wonderful.",
|
||||
... "I like to work in NYC.",
|
||||
>>> ] # use different length sentences to test batching
|
||||
>>> inputs = tokenizer([task_prefix + sentence for sentence in sentences], return_tensors="pt", padding=True)
|
||||
|
||||
task_prefix = "translate English to German: "
|
||||
sentences = ["The house is wonderful.", "I like to work in NYC."] # use different length sentences to test batching
|
||||
inputs = tokenizer([task_prefix + sentence for sentence in sentences], return_tensors="pt", padding=True)
|
||||
>>> output_sequences = model.generate(
|
||||
... input_ids=inputs["input_ids"],
|
||||
... attention_mask=inputs["attention_mask"],
|
||||
... do_sample=False, # disable sampling to test if batching affects output
|
||||
... )
|
||||
|
||||
output_sequences = model.generate(
|
||||
input_ids=inputs["input_ids"],
|
||||
attention_mask=inputs["attention_mask"],
|
||||
do_sample=False, # disable sampling to test if batching affects output
|
||||
)
|
||||
|
||||
print(tokenizer.batch_decode(output_sequences, skip_special_tokens=True))
|
||||
|
||||
# ['Das Haus ist wunderbar.', 'Ich arbeite gerne in NYC.']
|
||||
>>> print(tokenizer.batch_decode(output_sequences, skip_special_tokens=True))
|
||||
['Das Haus ist wunderbar.', 'Ich arbeite gerne in NYC.']
|
||||
```
|
||||
|
||||
Because T5 has been trained with the span-mask denoising objective,
|
||||
it can be used to predict the sentinel (masked-out) tokens during inference.
|
||||
The predicted tokens will then be placed between the sentinel tokens.
|
||||
|
||||
```python
|
||||
>>> from transformers import T5Tokenizer, T5ForConditionalGeneration
|
||||
|
||||
>>> tokenizer = T5Tokenizer.from_pretrained("t5-small")
|
||||
>>> model = T5ForConditionalGeneration.from_pretrained("t5-small")
|
||||
|
||||
>>> input_ids = tokenizer("The <extra_id_0> walks in <extra_id_1> park", return_tensors="pt").input_ids
|
||||
|
||||
>>> sequence_ids = model.generate(input_ids)
|
||||
>>> sequences = tokenizer.batch_decode(sequence_ids)
|
||||
>>> sequences
|
||||
['<pad> <extra_id_0> park offers<extra_id_1> the<extra_id_2> park.</s>']
|
||||
```
|
||||
|
||||
|
||||
<a id='scripts'></a>
|
||||
|
||||
## Performance
|
||||
|
||||
Reference in New Issue
Block a user