[Doctests] Fix all T5 doc tests (#16646)

* [Doctests] Fix all T5 doc tests
* make style
* Update docs/source/en/model_doc/t5.mdx
  Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
* Apply Sylvains comments
* Apply suggestions from code review
  Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
2022-04-13 11:36:54 +02:00
parent f7196f2e63
commit b24201fa44
4 changed files with 234 additions and 144 deletions
--- a/docs/source/en/model_doc/byt5.mdx
+++ b/docs/source/en/model_doc/byt5.mdx
@@ -48,37 +48,98 @@ fine-tuning. If you are doing multi-task fine-tuning, you should use a prefix.
 ByT5 works on raw UTF-8 bytes, so it can be used without a tokenizer:
 ```python
-from transformers import T5ForConditionalGeneration
+>>> from transformers import T5ForConditionalGeneration
-import torch
+>>> import torch
-model = T5ForConditionalGeneration.from_pretrained("google/byt5-small")
+>>> model = T5ForConditionalGeneration.from_pretrained("google/byt5-small")
-input_ids = torch.tensor([list("Life is like a box of chocolates.".encode("utf-8"))]) + 3  # add 3 for special tokens
+>>> num_special_tokens = 3
-labels = (
+>>> # Model has 3 special tokens which take up the input ids 0,1,2 of ByT5.
-    torch.tensor([list("La vie est comme une boîte de chocolat.".encode("utf-8"))]) + 3
+>>> # => Need to shift utf-8 character encodings by 3 before passing ids to model.
-)  # add 3 for special tokens
-loss = model(input_ids, labels=labels).loss  # forward pass
+>>> input_ids = torch.tensor([list("Life is like a box of chocolates.".encode("utf-8"))]) + num_special_tokens
 >>> labels = torch.tensor([list("La vie est comme une boîte de chocolat.".encode("utf-8"))]) + num_special_tokens
 >>> loss = model(input_ids, labels=labels).loss
 >>> loss.item()
 2.66
 ```
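The byte-to-id mapping used above can be sketched without loading a model. This is a minimal illustration that assumes only what the example states (three reserved ids, so every UTF-8 byte is shifted by 3); the helper names `text_to_byt5_ids` and `byt5_ids_to_text` are hypothetical and not part of `transformers`.

```python
NUM_SPECIAL_TOKENS = 3  # ids 0, 1, 2 are reserved, as stated in the example above
NUM_BYTE_VALUES = 2**8  # a byte can take 256 values


def text_to_byt5_ids(text):
    """Hypothetical helper: shift each UTF-8 byte by the number of special tokens."""
    return [byte + NUM_SPECIAL_TOKENS for byte in text.encode("utf-8")]


def byt5_ids_to_text(ids):
    """Hypothetical helper: undo the shift, skipping ids outside the byte range."""
    upper = NUM_SPECIAL_TOKENS + NUM_BYTE_VALUES
    raw = bytes(i - NUM_SPECIAL_TOKENS for i in ids if NUM_SPECIAL_TOKENS <= i < upper)
    return raw.decode("utf-8", errors="ignore")


ids = text_to_byt5_ids("The dog")
print(ids)
print(byt5_ids_to_text(ids))
```

Non-ASCII characters such as `î` occupy more than one byte, so they produce more than one id.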
 For batched inference and training, however, it is recommended to use the tokenizer:
 ```python
-from transformers import T5ForConditionalGeneration, AutoTokenizer
+>>> from transformers import T5ForConditionalGeneration, AutoTokenizer
-model = T5ForConditionalGeneration.from_pretrained("google/byt5-small")
+>>> model = T5ForConditionalGeneration.from_pretrained("google/byt5-small")
-tokenizer = AutoTokenizer.from_pretrained("google/byt5-small")
+>>> tokenizer = AutoTokenizer.from_pretrained("google/byt5-small")
-model_inputs = tokenizer(
+>>> model_inputs = tokenizer(
-    ["Life is like a box of chocolates.", "Today is Monday."], padding="longest", return_tensors="pt"
+...     ["Life is like a box of chocolates.", "Today is Monday."], padding="longest", return_tensors="pt"
-)
+... )
-labels = tokenizer(
+>>> labels_dict = tokenizer(
-    ["La vie est comme une boîte de chocolat.", "Aujourd'hui c'est lundi."], padding="longest", return_tensors="pt"
+...     ["La vie est comme une boîte de chocolat.", "Aujourd'hui c'est lundi."], padding="longest", return_tensors="pt"
-).input_ids
+... )
 >>> labels = labels_dict.input_ids
-loss = model(**model_inputs, labels=labels).loss  # forward pass
+>>> loss = model(**model_inputs, labels=labels).loss
 >>> loss.item()
 17.9
 ```
 Similar to [T5](t5), ByT5 was trained on the span-mask denoising task. However, 
 since the model works directly on characters, the pretraining task is a bit 
 different. Let's corrupt some characters of the 
 input sentence `"The dog chases a ball in the park."` and ask ByT5 to predict them 
 for us.
 ```python
 >>> from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
 >>> import torch
 >>> tokenizer = AutoTokenizer.from_pretrained("google/byt5-base")
 >>> model = AutoModelForSeq2SeqLM.from_pretrained("google/byt5-base")
 >>> input_ids_prompt = "The dog chases a ball in the park."
 >>> input_ids = tokenizer(input_ids_prompt).input_ids
 >>> # Note that we cannot add "{extra_id_...}" to the string directly
 >>> # as the Byte tokenizer would incorrectly merge the tokens
 >>> # For ByT5, we need to work directly on the character level
 >>> # Contrary to T5, ByT5 does not use sentinel tokens for masking, but instead
 >>> # uses the final byte ids.
 >>> # A UTF-8 byte has 8 bits and ByT5 has 3 special tokens.
 >>> # => There are 2**8+3 = 259 input ids and mask tokens count down from index 258.
 >>> # => mask to "The dog [258]a ball [257]park."
 >>> input_ids = torch.tensor([input_ids[:8] + [258] + input_ids[14:21] + [257] + input_ids[28:]])
 >>> input_ids
 tensor([[ 87, 107, 104,  35, 103, 114, 106,  35, 258,  35, 100,  35, 101, 100, 111, 111, 257,  35, 115, 100, 117, 110,  49,   1]])
 >>> # ByT5 produces only one char at a time so we need to produce many more output characters here -> set `max_length=100`.
 >>> output_ids = model.generate(input_ids, max_length=100)[0].tolist()
 >>> output_ids
 [0, 258, 108, 118,  35, 119, 107, 104,  35, 114, 113, 104,  35, 122, 107, 114,  35, 103, 114, 104, 118, 257,  35, 108, 113,  35, 119, 107, 104,  35, 103, 108, 118, 102, 114, 256, 108, 113,  35, 119, 107, 104, 35, 115, 100, 117, 110,  49,  35,  87, 107, 104,  35, 103, 114, 106, 35, 108, 118,  35, 119, 107, 104,  35, 114, 113, 104,  35, 122, 107, 114,  35, 103, 114, 104, 118,  35, 100,  35, 101, 100, 111, 111,  35, 108, 113, 255,  35, 108, 113,  35, 119, 107, 104,  35, 115, 100, 117, 110,  49]
 >>> # ^- Note how 258 descends to 257, 256, 255
 >>> # Now we need to split on the sentinel tokens, let's write a short loop for this
 >>> output_ids_list = []
 >>> start_token = 0
 >>> sentinel_token = 258
 >>> while sentinel_token in output_ids:
 ...     split_idx = output_ids.index(sentinel_token)
 ...     output_ids_list.append(output_ids[start_token:split_idx])
 ...     start_token = split_idx
 ...     sentinel_token -= 1
 >>> output_ids_list.append(output_ids[start_token:])
 >>> output_string = tokenizer.batch_decode(output_ids_list)
 >>> output_string
 ['<pad>', 'is the one who does', ' in the disco', 'in the park. The dog is the one who does a ball in', ' in the park.']
 ```
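The splitting loop above can also be wrapped in a small reusable function. The sketch below is a hypothetical helper (`split_on_sentinels` is not a `transformers` API) and differs from the loop in one labeled way: it drops each sentinel id instead of keeping it at the start of the next chunk. The ids in the usage line are made up for illustration.

```python
def split_on_sentinels(output_ids, first_sentinel=258):
    """Split a flat list of ids on descending sentinel ids (258, 257, ...).

    Hypothetical helper: unlike the loop in the example above, the sentinel id
    itself is not kept in any chunk.
    """
    chunks = []
    start = 0
    sentinel = first_sentinel
    # Only look for the next sentinel after the previous split point.
    while sentinel in output_ids[start:]:
        split_idx = output_ids.index(sentinel, start)
        chunks.append(output_ids[start:split_idx])
        start = split_idx + 1  # skip the sentinel itself
        sentinel -= 1
    chunks.append(output_ids[start:])
    return chunks


# Toy ids, not real model output.
print(split_on_sentinels([0, 258, 10, 11, 257, 12, 256, 13, 14]))
```

Each returned chunk can then be passed to `tokenizer.batch_decode` in the same way as `output_ids_list` in the example.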
 ## ByT5Tokenizer
 [[autodoc]] ByT5Tokenizer
--- a/docs/source/en/model_doc/t5.mdx
+++ b/docs/source/en/model_doc/t5.mdx
@@ -93,15 +93,18 @@ language modeling head on top of the decoder.
 processed as follows:
 ```python
-  from transformers import T5Tokenizer, T5ForConditionalGeneration
+>>> from transformers import T5Tokenizer, T5ForConditionalGeneration
-  tokenizer = T5Tokenizer.from_pretrained("t5-small")
+>>> tokenizer = T5Tokenizer.from_pretrained("t5-small")
-  model = T5ForConditionalGeneration.from_pretrained("t5-small")
+>>> model = T5ForConditionalGeneration.from_pretrained("t5-small")
-  input_ids = tokenizer("The <extra_id_0> walks in <extra_id_1> park", return_tensors="pt").input_ids
+>>> input_ids = tokenizer("The <extra_id_0> walks in <extra_id_1> park", return_tensors="pt").input_ids
-  labels = tokenizer("<extra_id_0> cute dog <extra_id_1> the <extra_id_2>", return_tensors="pt").input_ids
+>>> labels = tokenizer("<extra_id_0> cute dog <extra_id_1> the <extra_id_2>", return_tensors="pt").input_ids
-  # the forward function automatically creates the correct decoder_input_ids
+
-  loss = model(input_ids=input_ids, labels=labels).loss
+>>> # the forward function automatically creates the correct decoder_input_ids
 >>> loss = model(input_ids=input_ids, labels=labels).loss
 >>> loss.item()
 3.7837
 ```
 If you're interested in pre-training T5 on a new corpus, check out the [run_t5_mlm_flax.py](https://github.com/huggingface/transformers/tree/main/examples/flax/language-modeling) script in the Examples
@@ -115,15 +118,18 @@ language modeling head on top of the decoder.
 the model as follows:
 ```python
-  from transformers import T5Tokenizer, T5ForConditionalGeneration
+>>> from transformers import T5Tokenizer, T5ForConditionalGeneration
-  tokenizer = T5Tokenizer.from_pretrained("t5-small")
+>>> tokenizer = T5Tokenizer.from_pretrained("t5-small")
-  model = T5ForConditionalGeneration.from_pretrained("t5-small")
+>>> model = T5ForConditionalGeneration.from_pretrained("t5-small")
-  input_ids = tokenizer("translate English to German: The house is wonderful.", return_tensors="pt").input_ids
+>>> input_ids = tokenizer("translate English to German: The house is wonderful.", return_tensors="pt").input_ids
-  labels = tokenizer("Das Haus ist wunderbar.", return_tensors="pt").input_ids
+>>> labels = tokenizer("Das Haus ist wunderbar.", return_tensors="pt").input_ids
-  # the forward function automatically creates the correct decoder_input_ids
+
-  loss = model(input_ids=input_ids, labels=labels).loss
+>>> # the forward function automatically creates the correct decoder_input_ids
 >>> loss = model(input_ids=input_ids, labels=labels).loss
 >>> loss.item()
 0.2542
 ```
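The comment "the forward function automatically creates the correct decoder_input_ids" refers to shifting the labels one position to the right. A minimal pure-Python sketch of that idea follows, assuming the decoder start token and the pad token both have id 0 (the text notes that T5 uses the `pad_token_id` as the `decoder_start_token_id`); the function name and the toy ids are hypothetical.

```python
def shift_right(labels, decoder_start_token_id=0, pad_token_id=0):
    """Sketch of building decoder inputs from labels for teacher forcing.

    Assumption: start token id and pad token id are both 0.
    """
    # Prepend the start token and drop the last label.
    shifted = [decoder_start_token_id] + labels[:-1]
    # Positions masked with -100 in the labels become padding in the decoder input.
    return [pad_token_id if token == -100 else token for token in shifted]


# Toy ids, chosen only for illustration.
print(shift_right([644, 4598, 229, 1]))
print(shift_right([5, 1, -100, -100]))
```

At step `t` the decoder therefore sees the target tokens before `t` and is trained to predict token `t`.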
 As you can see, only 2 inputs are required for the model in order to compute a loss: `input_ids` (which are the
@@ -148,47 +154,51 @@ language modeling head on top of the decoder.
 ignored. The code example below illustrates all of this.
 ```python
-  from transformers import T5Tokenizer, T5ForConditionalGeneration
+>>> from transformers import T5Tokenizer, T5ForConditionalGeneration
-  import torch
+>>> import torch
-  tokenizer = T5Tokenizer.from_pretrained("t5-small")
+>>> tokenizer = T5Tokenizer.from_pretrained("t5-small")
-  model = T5ForConditionalGeneration.from_pretrained("t5-small")
+>>> model = T5ForConditionalGeneration.from_pretrained("t5-small")
-  # the following 2 hyperparameters are task-specific
+>>> # the following 2 hyperparameters are task-specific
-  max_source_length = 512
+>>> max_source_length = 512
-  max_target_length = 128
+>>> max_target_length = 128
-  # Suppose we have the following 2 training examples:
+>>> # Suppose we have the following 2 training examples:
-  input_sequence_1 = "Welcome to NYC"
+>>> input_sequence_1 = "Welcome to NYC"
-  output_sequence_1 = "Bienvenue à NYC"
+>>> output_sequence_1 = "Bienvenue à NYC"
-  input_sequence_2 = "HuggingFace is a company"
+>>> input_sequence_2 = "HuggingFace is a company"
-  output_sequence_2 = "HuggingFace est une entreprise"
+>>> output_sequence_2 = "HuggingFace est une entreprise"
-  # encode the inputs
+>>> # encode the inputs
-  task_prefix = "translate English to French: "
+>>> task_prefix = "translate English to French: "
-  input_sequences = [input_sequence_1, input_sequence_2]
+>>> input_sequences = [input_sequence_1, input_sequence_2]
-  encoding = tokenizer(
-      [task_prefix + sequence for sequence in input_sequences],
-      padding="longest",
-      max_length=max_source_length,
-      truncation=True,
-      return_tensors="pt",
-  )
-  input_ids, attention_mask = encoding.input_ids, encoding.attention_mask
-  # encode the targets
+>>> encoding = tokenizer(
-  target_encoding = tokenizer(
+...     [task_prefix + sequence for sequence in input_sequences],
-      [output_sequence_1, output_sequence_2], padding="longest", max_length=max_target_length, truncation=True
+...     padding="longest",
-  )
+...     max_length=max_source_length,
-  labels = target_encoding.input_ids
+...     truncation=True,
 ...     return_tensors="pt",
 ... )
-  # replace padding token id's of the labels by -100
+>>> input_ids, attention_mask = encoding.input_ids, encoding.attention_mask
-  labels = torch.tensor(labels)
-  labels[labels == tokenizer.pad_token_id] = -100
-  # forward pass
+>>> # encode the targets
-  loss = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels).loss
+>>> target_encoding = tokenizer(
 ...     [output_sequence_1, output_sequence_2], padding="longest", max_length=max_target_length, truncation=True
 ... )
 >>> labels = target_encoding.input_ids
 >>> # replace padding token ids of the labels by -100 so they are ignored by the loss
 >>> labels = torch.tensor(labels)
 >>> labels[labels == tokenizer.pad_token_id] = -100
 >>> # forward pass
 >>> loss = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels).loss
 >>> loss.item()
 0.188
 ```
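The label-masking step can be illustrated on plain Python lists. This is a sketch with made-up token ids and an assumed pad id of 0; the helper names are hypothetical. The value -100 matches the default `ignore_index` of PyTorch's cross-entropy loss, which is why those positions do not contribute to the loss.

```python
IGNORE_INDEX = -100  # positions with this label are skipped by the loss


def mask_padding(labels, pad_token_id):
    """Hypothetical helper: replace pad ids in a batch of label rows by IGNORE_INDEX."""
    return [
        [IGNORE_INDEX if token == pad_token_id else token for token in row]
        for row in labels
    ]


def count_loss_tokens(labels):
    """Count the label positions that still contribute to the loss."""
    return sum(token != IGNORE_INDEX for row in labels for token in row)


# Toy batch padded to the longest row; pad id 0 is an assumption for this sketch.
padded = [[11, 12, 1, 0, 0], [21, 22, 23, 24, 1]]
masked = mask_padding(padded, pad_token_id=0)
print(masked)
print(count_loss_tokens(masked))
```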
 Additional training tips:
@@ -197,12 +207,12 @@ Additional training tips:
 optimizer. Typically, 1e-4 and 3e-4 work well for most problems (classification, summarization, translation, question
 answering, question generation). Note that T5 was pre-trained using the AdaFactor optimizer.
- According to [this forum post](https://discuss.huggingface.co/t/t5-finetuning-tips/684), task prefixes matter when
+According to [this forum post](https://discuss.huggingface.co/t/t5-finetuning-tips/684), task prefixes matter when
 (1) doing multi-task training (2) your task is similar or related to one of the supervised tasks used in T5's
 pre-training mixture (see Appendix D of the [paper](https://arxiv.org/pdf/1910.10683.pdf) for the task prefixes
 used).
- If training on TPU, it is recommended to pad all examples of the dataset to the same length or make use of
+If training on TPU, it is recommended to pad all examples of the dataset to the same length or make use of
 *pad_to_multiple_of* to have a small number of predefined bucket sizes to fit all examples in. Dynamically padding
 batches to the longest example is not recommended on TPU as it triggers a recompilation for every batch shape that is
 encountered during training, thus significantly slowing down the training. Only padding up to the longest example in a
@@ -219,15 +229,15 @@ There's also [this blog post](https://huggingface.co/blog/encoder-decoder#encode
 generation works in general in encoder-decoder models.
 ```python
-from transformers import T5Tokenizer, T5ForConditionalGeneration
+>>> from transformers import T5Tokenizer, T5ForConditionalGeneration
-tokenizer = T5Tokenizer.from_pretrained("t5-small")
+>>> tokenizer = T5Tokenizer.from_pretrained("t5-small")
-model = T5ForConditionalGeneration.from_pretrained("t5-small")
+>>> model = T5ForConditionalGeneration.from_pretrained("t5-small")
-input_ids = tokenizer("translate English to German: The house is wonderful.", return_tensors="pt").input_ids
+>>> input_ids = tokenizer("translate English to German: The house is wonderful.", return_tensors="pt").input_ids
-outputs = model.generate(input_ids)
+>>> outputs = model.generate(input_ids)
-print(tokenizer.decode(outputs[0], skip_special_tokens=True))
+>>> print(tokenizer.decode(outputs[0], skip_special_tokens=True))
-# Das Haus ist wunderbar.
+Das Haus ist wunderbar.
 ```
 Note that T5 uses the `pad_token_id` as the `decoder_start_token_id`, so when doing generation without using
@@ -236,31 +246,47 @@ Note that T5 uses the `pad_token_id` as the `decoder_start_token_id`, so when do
 The example above translates a single sentence. You can also do batched inference, like so:
 ```python
-from transformers import T5Tokenizer, T5ForConditionalGeneration
+>>> from transformers import T5Tokenizer, T5ForConditionalGeneration
-tokenizer = T5Tokenizer.from_pretrained("t5-small")
+>>> tokenizer = T5Tokenizer.from_pretrained("t5-small")
-model = T5ForConditionalGeneration.from_pretrained("t5-small")
+>>> model = T5ForConditionalGeneration.from_pretrained("t5-small")
-# when generating, we will use the logits of right-most token to predict the next token
+>>> task_prefix = "translate English to German: "
-# so the padding should be on the left
+>>> sentences = [
-tokenizer.padding_side = "left"
+...     "The house is wonderful.",
-tokenizer.pad_token = tokenizer.eos_token  # to avoid an error
+...     "I like to work in NYC.",
 >>> ]  # use different length sentences to test batching
 >>> inputs = tokenizer([task_prefix + sentence for sentence in sentences], return_tensors="pt", padding=True)
-task_prefix = "translate English to German: "
+>>> output_sequences = model.generate(
-sentences = ["The house is wonderful.", "I like to work in NYC."]  # use different length sentences to test batching
+...     input_ids=inputs["input_ids"],
-inputs = tokenizer([task_prefix + sentence for sentence in sentences], return_tensors="pt", padding=True)
+...     attention_mask=inputs["attention_mask"],
 ...     do_sample=False,  # disable sampling to test if batching affects output
 ... )
-output_sequences = model.generate(
+>>> print(tokenizer.batch_decode(output_sequences, skip_special_tokens=True))
-    input_ids=inputs["input_ids"],
+['Das Haus ist wunderbar.', 'Ich arbeite gerne in NYC.']
-    attention_mask=inputs["attention_mask"],
-    do_sample=False,  # disable sampling to test if batching affects output
-)
-print(tokenizer.batch_decode(output_sequences, skip_special_tokens=True))
-# ['Das Haus ist wunderbar.', 'Ich arbeite gerne in NYC.']
 ```
 Because T5 has been trained with the span-mask denoising objective,
 it can be used to predict the sentinel (masked-out) tokens during inference.
 The predicted tokens will then be placed between the sentinel tokens.
 ```python
 >>> from transformers import T5Tokenizer, T5ForConditionalGeneration
 >>> tokenizer = T5Tokenizer.from_pretrained("t5-small")
 >>> model = T5ForConditionalGeneration.from_pretrained("t5-small")
 >>> input_ids = tokenizer("The <extra_id_0> walks in <extra_id_1> park", return_tensors="pt").input_ids
 >>> sequence_ids = model.generate(input_ids)
 >>> sequences = tokenizer.batch_decode(sequence_ids)
 >>> sequences
 ['<pad> <extra_id_0> park offers<extra_id_1> the<extra_id_2> park.</s>']
 ```
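To turn the decoded string above back into a filled-in sentence, the predicted spans can be read off between the sentinel tokens. The following is a minimal sketch using only the standard library; `extract_spans` and `fill_spans` are hypothetical helpers, and the input strings are copied from the example.

```python
import re

SENTINEL = re.compile(r"<extra_id_(\d+)>")


def extract_spans(decoded):
    """Hypothetical helper: map each sentinel index to the text that follows it."""
    text = decoded.replace("<pad>", "").replace("</s>", "")
    # re.split with one capture group yields [prefix, id0, span0, id1, span1, ...]
    parts = SENTINEL.split(text)
    return {int(parts[i]): parts[i + 1].strip() for i in range(1, len(parts) - 1, 2)}


def fill_spans(masked, spans):
    """Hypothetical helper: substitute each sentinel in the input by its span."""
    return SENTINEL.sub(lambda m: spans.get(int(m.group(1)), m.group(0)), masked)


decoded = "<pad> <extra_id_0> park offers<extra_id_1> the<extra_id_2> park.</s>"
spans = extract_spans(decoded)
print(spans)
print(fill_spans("The <extra_id_0> walks in <extra_id_1> park", spans))
```

The span after the last sentinel (`<extra_id_2>` here) has no counterpart in the input, so `fill_spans` simply never uses it.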
 <a id='scripts'></a>
 ## Performance
--- a/docs/source/en/model_doc/t5v1.1.mdx
+++ b/docs/source/en/model_doc/t5v1.1.mdx
@@ -20,9 +20,9 @@ repository by Colin Raffel et al. It's an improved version of the original T5 mo
 One can directly plug in the weights of T5v1.1 into a T5 model, like so:
 ```python
-from transformers import T5ForConditionalGeneration
+>>> from transformers import T5ForConditionalGeneration
-model = T5ForConditionalGeneration.from_pretrained("google/t5-v1_1-base")
+>>> model = T5ForConditionalGeneration.from_pretrained("google/t5-v1_1-base")
 ```
 T5 Version 1.1 includes the following improvements compared to the original T5 model:
--- a/utils/documentation_tests.txt
+++ b/utils/documentation_tests.txt
@@ -1,6 +1,9 @@
 docs/source/en/quicktour.mdx
 docs/source/en/task_summary.mdx
 docs/source/en/model_doc/speech_to_text.mdx
 docs/source/en/model_doc/t5.mdx
 docs/source/en/model_doc/t5v1.1.mdx
 docs/source/en/model_doc/byt5.mdx
 docs/source/en/model_doc/tapex.mdx
 src/transformers/generation_utils.py
 src/transformers/models/bart/modeling_bart.py