[T5] Add training documenation (#3507)
* Add clear description of how to train T5 * correct docstring in T5 * correct typo * correct docstring format * update t5 model docs * implement collins feedback * fix typo and add more explanation for sentinal tokens * delete unnecessary todos
This commit is contained in:
committed by
GitHub
parent
33ef7002e1
commit
5b44e0a31b
@@ -16,13 +16,45 @@ To facilitate future work on transfer learning for NLP, we release our dataset,
|
||||
|
||||
The Authors' code can be found `here <https://github.com/google-research/text-to-text-transfer-transformer>`_ .
|
||||
|
||||
Training
|
||||
~~~~~~~~~~~~~~~~~~~~
|
||||
T5 is an encoder-decoder model and converts all NLP problems into a text-to-text format. It is trained using teacher forcing.
|
||||
This means that for training we always need an input sequence and a target sequence.
|
||||
The input sequence is fed to the model using ``input_ids``. The target sequence is shifted to the right, *i.e.* perprended by a start-sequence token and fed to the decoder using the `decoder_input_ids`. In teacher-forcing style, the target sequence is then appended by the EOS token and corresponds to the ``lm_labels``. The PAD token is hereby used as the start-sequence token.
|
||||
T5 can be trained / fine-tuned both in a supervised and unsupervised fashion.
|
||||
|
||||
- Unsupervised denoising training
|
||||
In this setup spans of the input sequence are masked by so-called sentinel tokens (*a.k.a* unique mask tokens)
|
||||
and the output sequence is formed as a concatenation of the same sentinel tokens and the *real* masked tokens.
|
||||
Each sentinel tokens represents a unique mask token for this sentence and should start with ``<extra_id_1>``, ``<extrac_id_2>``, ... up to ``<extra_id_100>``. As a default 100 sentinel tokens are available in ``T5Tokenizer``.
|
||||
*E.g.* the sentence "The cute dog walks in the park" with the masks put on "cute dog" and "the" should be processed as follows:
|
||||
|
||||
::
|
||||
|
||||
input_ids = tokenizer.encode('The <extra_id_1> walks in <extra_id_2> park', return_tensors='pt')
|
||||
lm_labels = tokenizer.encode('<extra_id_1> cute dog <extra_id_2> the <extra_id_3> </s>', return_tensors='pt')
|
||||
# the forward function automatically creates the correct decoder_input_ids
|
||||
model(input_ids=input_ids, lm_labels=lm_labels)
|
||||
|
||||
- Supervised training
|
||||
In this setup the input sequence and output sequence are standard sequence to sequence input output mapping.
|
||||
In translation, *e.g.* the input sequence "The house is wonderful." and output sequence "Das Haus ist wunderbar." should
|
||||
be processed as follows:
|
||||
|
||||
::
|
||||
|
||||
input_ids = tokenizer.encode('translate English to German: The house is wonderful. </s>', return_tensors='pt')
|
||||
lm_labels = tokenizer.encode('Das Haus ist wunderbar. </s>', return_tensors='pt')
|
||||
# the forward function automatically creates the correct decoder_input_ids
|
||||
model(input_ids=input_ids, lm_labels=lm_labels)
|
||||
|
||||
Tips
|
||||
~~~~~~~~~~~~~~~~~~~~
|
||||
- T5 is an encoder-decoder model pre-trained on a multi-task mixture of unsupervised
|
||||
and supervised tasks and which each task is cast as a sequence to sequence task.
|
||||
Therefore T5 works well on a variety of tasks out-of-the-box by prepending a different prefix to the input corresponding to each task, e.g.: for translation: *translate English to German: ..., summarize: ...*.
|
||||
For more information about the which prefix to use, it is easiest to look into Appendix D of the `paper <https://arxiv.org/pdf/1910.10683.pdf>`_ .
|
||||
- For sequence to sequence generation, it is recommended to use ``T5ForConditionalGeneration.generate()``. The method takes care of feeding the encoded input via cross-attention layers to the decoder and auto-regressively generating the decoder output.
|
||||
and supervised tasks and for which each task is converted into a text-to-text format.
|
||||
T5 works well on a variety of tasks out-of-the-box by prepending a different prefix to the input corresponding to each task, e.g.: for translation: *translate English to German: ..., summarize: ...*.
|
||||
For more information about which prefix to use, it is easiest to look into Appendix D of the `paper <https://arxiv.org/pdf/1910.10683.pdf>`_ .
|
||||
- For sequence to sequence generation, it is recommended to use ``T5ForConditionalGeneration.generate()``. The method takes care of feeding the encoded input via cross-attention layers to the decoder and auto-regressively generates the decoder output.
|
||||
- T5 uses relative scalar embeddings. Encoder input padding can be done on the left and on the right.
|
||||
|
||||
|
||||
|
||||
Reference in New Issue
Block a user