Doc styling (#8067)
* Important files
* Styling them all
* Revert "Styling them all"
  This reverts commit 7d029395fdae8513b8281cbc2a6c239f8093503e.
* Syling them for realsies
* Fix syntax error
* Fix benchmark_utils
* More fixes
* Fix modeling auto and script
* Remove new line
* Fixes
* More fixes
* Fix more files
* Style
* Add FSMT
* More fixes
* More fixes
* More fixes
* More fixes
* Fixes
* More fixes
* More fixes
* Last fixes
* Make sphinx happy
@@ -57,8 +57,8 @@ The tokenizer takes care of splitting the sequence into tokens available in the
 >>> tokenized_sequence = tokenizer.tokenize(sequence)

 The tokens are either words or subwords. Here for instance, "VRAM" wasn't in the model vocabulary, so it's been split
-in "V", "RA" and "M". To indicate those tokens are not separate words but parts of the same word, a double-hash prefix is
-added for "RA" and "M":
+in "V", "RA" and "M". To indicate those tokens are not separate words but parts of the same word, a double-hash prefix
+is added for "RA" and "M":

 .. code-block::

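The subword split described in this hunk can be illustrated with a small sketch. It assumes a greedy longest-match-first strategy, which is how WordPiece-style tokenizers are commonly described. The function name ``wordpiece_split`` and the toy vocabulary are hypothetical and are not part of the library.

```python
def wordpiece_split(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of a single word into subword tokens.

    Pieces that do not start the word carry the "##" continuation prefix.
    """
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # No known piece covers this position: give up on the whole word.
            return [unk_token]
        tokens.append(piece)
        start = end
    return tokens


# Toy vocabulary (an assumption for illustration, not the real BERT vocabulary).
toy_vocab = {"V", "##RA", "##M", "has", "24", "##GB"}

print(wordpiece_split("VRAM", toy_vocab))  # ['V', '##RA', '##M']
print(wordpiece_split("24GB", toy_vocab))  # ['24', '##GB']
```

The real tokenizer also normalizes text and splits on whitespace and punctuation first; the sketch covers only the per-word step.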
@@ -66,8 +66,8 @@ added for "RA" and "M":
 ['A', 'Titan', 'R', '##T', '##X', 'has', '24', '##GB', 'of', 'V', '##RA', '##M']

 These tokens can then be converted into IDs which are understandable by the model. This can be done by directly feeding
-the sentence to the tokenizer, which leverages the Rust implementation of
-`huggingface/tokenizers <https://github.com/huggingface/tokenizers>`__ for peak performance.
+the sentence to the tokenizer, which leverages the Rust implementation of `huggingface/tokenizers
+<https://github.com/huggingface/tokenizers>`__ for peak performance.

 .. code-block::

@@ -105,8 +105,8 @@ because this is the way a :class:`~transformers.BertModel` is going to expect it
 Attention mask
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

-The attention mask is an optional argument used when batching sequences together. This argument indicates to the
-model which tokens should be attended to, and which should not.
+The attention mask is an optional argument used when batching sequences together. This argument indicates to the model
+which tokens should be attended to, and which should not.

 For example, consider these two sequences:

@@ -145,10 +145,10 @@ We can see that 0s have been added on the right of the first sentence to make it
 >>> padded_sequences["input_ids"]
 [[101, 1188, 1110, 170, 1603, 4954, 119, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [101, 1188, 1110, 170, 1897, 1263, 4954, 119, 1135, 1110, 1120, 1655, 2039, 1190, 1103, 4954, 138, 119, 102]]

-This can then be converted into a tensor in PyTorch or TensorFlow. The attention mask is a binary tensor indicating
-the position of the padded indices so that the model does not attend to them. For the
-:class:`~transformers.BertTokenizer`, :obj:`1` indicates a value that should be attended to, while :obj:`0` indicates
-a padded value. This attention mask is in the dictionary returned by the tokenizer under the key "attention_mask":
+This can then be converted into a tensor in PyTorch or TensorFlow. The attention mask is a binary tensor indicating the
+position of the padded indices so that the model does not attend to them. For the :class:`~transformers.BertTokenizer`,
+:obj:`1` indicates a value that should be attended to, while :obj:`0` indicates a padded value. This attention mask is
+in the dictionary returned by the tokenizer under the key "attention_mask":

 .. code-block::

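As a rough illustration of how padding and the attention mask relate, the sketch below pads a batch on the right and builds the matching mask by hand. ``pad_batch`` is a hypothetical helper, not a library function, and the second sequence's IDs are arbitrary. A pad ID of 0 matches the output shown in this hunk but is otherwise an assumption.

```python
def pad_batch(sequences, pad_id=0):
    """Right-pad every sequence to the longest one and build the attention mask.

    1 marks a real token the model should attend to, 0 marks padding.
    """
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(list(seq) + [pad_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = pad_batch([
    [101, 1188, 1110, 102],            # short sequence
    [101, 1188, 1110, 170, 119, 102],  # longer sequence (illustrative IDs)
])
print(batch["input_ids"])       # [[101, 1188, 1110, 102, 0, 0], [101, 1188, 1110, 170, 119, 102]]
print(batch["attention_mask"])  # [[1, 1, 1, 1, 0, 0], [1, 1, 1, 1, 1, 1]]
```

In practice the tokenizer produces both entries itself when padding is requested, so the mask does not need to be built manually.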
@@ -161,15 +161,16 @@ Token Type IDs
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

 Some models' purpose is to do sequence classification or question answering. These require two different sequences to
-be joined in a single "input_ids" entry, which usually is performed with the help of special tokens, such as the classifier (``[CLS]``) and separator (``[SEP]``)
-tokens. For example, the BERT model builds its two sequence input as such:
+be joined in a single "input_ids" entry, which usually is performed with the help of special tokens, such as the
+classifier (``[CLS]``) and separator (``[SEP]``) tokens. For example, the BERT model builds its two sequence input as
+such:

 .. code-block::

 >>> # [CLS] SEQUENCE_A [SEP] SEQUENCE_B [SEP]

-We can use our tokenizer to automatically generate such a sentence by passing the two sequences to ``tokenizer`` as two arguments (and
-not a list, like before) like this:
+We can use our tokenizer to automatically generate such a sentence by passing the two sequences to ``tokenizer`` as two
+arguments (and not a list, like before) like this:

 .. code-block::

@@ -189,8 +190,8 @@ which will return:
 [CLS] HuggingFace is based in NYC [SEP] Where is HuggingFace based? [SEP]

 This is enough for some models to understand where one sequence ends and where another begins. However, other models,
-such as BERT, also deploy token type IDs (also called segment IDs). They are represented as a binary
-mask identifying the two types of sequence in the model.
+such as BERT, also deploy token type IDs (also called segment IDs). They are represented as a binary mask identifying
+the two types of sequence in the model.

 The tokenizer returns this mask as the "token_type_ids" entry:

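The pairing scheme and the binary segment mask can be sketched in a few lines. This assumes the BERT convention in which ``[CLS] SEQUENCE_A [SEP]`` is segment 0 and ``SEQUENCE_B [SEP]`` is segment 1. ``build_pair`` is a hypothetical helper that works on token strings rather than IDs.

```python
def build_pair(tokens_a, tokens_b, cls_token="[CLS]", sep_token="[SEP]"):
    """Join two token lists BERT-style and return (tokens, token_type_ids)."""
    first = [cls_token] + list(tokens_a) + [sep_token]   # segment 0
    second = list(tokens_b) + [sep_token]                # segment 1
    tokens = first + second
    token_type_ids = [0] * len(first) + [1] * len(second)
    return tokens, token_type_ids


tokens, token_type_ids = build_pair(["Hello", "there"], ["Hi", "!"])
print(tokens)          # ['[CLS]', 'Hello', 'there', '[SEP]', 'Hi', '!', '[SEP]']
print(token_type_ids)  # [0, 0, 0, 0, 1, 1, 1]
```

Each separator is counted with the sequence it closes, which is why the first ``[SEP]`` is 0 and the last one is 1.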
@@ -209,14 +210,15 @@ Some models, like :class:`~transformers.XLNetModel` use an additional token repr
 Position IDs
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

-Contrary to RNNs that have the position of each token embedded within them,
-transformers are unaware of the position of each token. Therefore, the position IDs (``position_ids``) are used by the model to identify each token's position in the list of tokens.
+Contrary to RNNs that have the position of each token embedded within them, transformers are unaware of the position of
+each token. Therefore, the position IDs (``position_ids``) are used by the model to identify each token's position in
+the list of tokens.

-They are an optional parameter. If no ``position_ids`` is passed to the model, the IDs are automatically created as absolute
-positional embeddings.
+They are an optional parameter. If no ``position_ids`` is passed to the model, the IDs are automatically created as
+absolute positional embeddings.

-Absolute positional embeddings are selected in the range ``[0, config.max_position_embeddings - 1]``. Some models
-use other types of positional embeddings, such as sinusoidal position embeddings or relative position embeddings.
+Absolute positional embeddings are selected in the range ``[0, config.max_position_embeddings - 1]``. Some models use
+other types of positional embeddings, such as sinusoidal position embeddings or relative position embeddings.

 .. _labels:

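A minimal sketch of what "automatically created" absolute position IDs could look like: one index per token, counted from 0, repeated for every row in the batch. ``default_position_ids`` is a hypothetical helper, and the default of 512 for ``max_position_embeddings`` is only an assumed, typical value.

```python
def default_position_ids(batch_size, seq_length, max_position_embeddings=512):
    """Return absolute position IDs 0..seq_length-1 for each sequence in the batch."""
    if seq_length > max_position_embeddings:
        raise ValueError(
            f"seq_length={seq_length} exceeds max_position_embeddings={max_position_embeddings}"
        )
    return [list(range(seq_length)) for _ in range(batch_size)]


print(default_position_ids(batch_size=2, seq_length=5))
# [[0, 1, 2, 3, 4], [0, 1, 2, 3, 4]]
```

The length check mirrors the stated range ``[0, config.max_position_embeddings - 1]``: a longer sequence would need a position index the embedding table does not contain.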
@@ -224,43 +226,41 @@ Labels
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

 The labels are an optional argument which can be passed in order for the model to compute the loss itself. These labels
-should be the expected prediction of the model: it will use the standard loss in order to compute the loss between
-its predictions and the expected value (the label).
+should be the expected prediction of the model: it will use the standard loss in order to compute the loss between its
+predictions and the expected value (the label).

 These labels are different according to the model head, for example:

-- For sequence classification models (e.g., :class:`~transformers.BertForSequenceClassification`), the model expects
-a tensor of dimension :obj:`(batch_size)` with each value of the batch corresponding to the expected label of the
+- For sequence classification models (e.g., :class:`~transformers.BertForSequenceClassification`), the model expects a
+tensor of dimension :obj:`(batch_size)` with each value of the batch corresponding to the expected label of the
 entire sequence.
-- For token classification models (e.g., :class:`~transformers.BertForTokenClassification`), the model expects
-a tensor of dimension :obj:`(batch_size, seq_length)` with each value corresponding to the expected label of each
-individual token.
-- For masked language modeling (e.g., :class:`~transformers.BertForMaskedLM`), the model expects
-a tensor of dimension :obj:`(batch_size, seq_length)` with each value corresponding to the expected label of each
-individual token: the labels being the token ID for the masked token, and values to be ignored for the rest (usually
--100).
+- For token classification models (e.g., :class:`~transformers.BertForTokenClassification`), the model expects a tensor
+of dimension :obj:`(batch_size, seq_length)` with each value corresponding to the expected label of each individual
+token.
+- For masked language modeling (e.g., :class:`~transformers.BertForMaskedLM`), the model expects a tensor of dimension
+:obj:`(batch_size, seq_length)` with each value corresponding to the expected label of each individual token: the
+labels being the token ID for the masked token, and values to be ignored for the rest (usually -100).
 - For sequence to sequence tasks,(e.g., :class:`~transformers.BartForConditionalGeneration`,
-:class:`~transformers.MBartForConditionalGeneration`), the model expects a tensor of dimension
-:obj:`(batch_size, tgt_seq_length)` with each value corresponding to the target sequences associated with each
-input sequence. During training, both `BART` and `T5` will make the appropriate `decoder_input_ids` and decoder
-attention masks internally. They usually do not need to be supplied. This does not apply to models leveraging the
-Encoder-Decoder framework.
-See the documentation of each model for more information on each specific model's labels.
+:class:`~transformers.MBartForConditionalGeneration`), the model expects a tensor of dimension :obj:`(batch_size,
+tgt_seq_length)` with each value corresponding to the target sequences associated with each input sequence. During
+training, both `BART` and `T5` will make the appropriate `decoder_input_ids` and decoder attention masks internally.
+They usually do not need to be supplied. This does not apply to models leveraging the Encoder-Decoder framework. See
+the documentation of each model for more information on each specific model's labels.

-The base models (e.g., :class:`~transformers.BertModel`) do not accept labels, as these are the base transformer models,
-simply outputting features.
+The base models (e.g., :class:`~transformers.BertModel`) do not accept labels, as these are the base transformer
+models, simply outputting features.

 .. _decoder-input-ids:

 Decoder input IDs
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

-This input is specific to encoder-decoder models, and contains the input IDs that will be fed to the decoder.
-These inputs should be used for sequence to sequence tasks, such as translation or summarization, and are usually
-built in a way specific to each model.
+This input is specific to encoder-decoder models, and contains the input IDs that will be fed to the decoder. These
+inputs should be used for sequence to sequence tasks, such as translation or summarization, and are usually built in a
+way specific to each model.

-Most encoder-decoder models (BART, T5) create their :obj:`decoder_input_ids` on their own from the :obj:`labels`.
-In such models, passing the :obj:`labels` is the preferred way to handle training.
+Most encoder-decoder models (BART, T5) create their :obj:`decoder_input_ids` on their own from the :obj:`labels`. In
+such models, passing the :obj:`labels` is the preferred way to handle training.

 Please check each model's docs to see how they handle these input IDs for sequence to sequence training.

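One common way to derive ``decoder_input_ids`` from ``labels`` is to shift the labels one position to the right and prepend a start token. The sketch below shows that idea only. It is not the exact code of any model, and the exact shifting rule differs between models and library versions. The helper name and the token IDs (start token 2, pad token 0) are assumptions chosen for illustration.

```python
def shift_right(labels, decoder_start_token_id, pad_token_id, ignore_index=-100):
    """Build decoder inputs by shifting each label row one step to the right.

    Positions marked with ignore_index (skipped by the loss) are replaced by the
    pad token, since ignore_index is not a valid token ID for the decoder.
    """
    shifted = []
    for row in labels:
        new_row = [decoder_start_token_id] + list(row[:-1])
        shifted.append([pad_token_id if tok == ignore_index else tok for tok in new_row])
    return shifted


labels = [
    [42, 43, 44, 1],      # full-length target
    [50, 1, -100, -100],  # shorter target, padded positions ignored by the loss
]
print(shift_right(labels, decoder_start_token_id=2, pad_token_id=0))
# [[2, 42, 43, 44], [2, 50, 1, 0]]
```

With this layout the decoder sees token ``t-1`` when it is asked to predict label ``t``, which is why passing only the labels is enough for models that build their decoder inputs internally.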
@@ -270,18 +270,18 @@ Feed Forward Chunking
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

 In each residual attention block in transformers the self-attention layer is usually followed by 2 feed forward layers.
-The intermediate embedding size of the feed forward layers is often bigger than the hidden size of the model (e.g.,
-for ``bert-base-uncased``).
+The intermediate embedding size of the feed forward layers is often bigger than the hidden size of the model (e.g., for
+``bert-base-uncased``).

 For an input of size ``[batch_size, sequence_length]``, the memory required to store the intermediate feed forward
 embeddings ``[batch_size, sequence_length, config.intermediate_size]`` can account for a large fraction of the memory
 use. The authors of `Reformer: The Efficient Transformer <https://arxiv.org/abs/2001.04451>`_ noticed that since the
 computation is independent of the ``sequence_length`` dimension, it is mathematically equivalent to compute the output
 embeddings of both feed forward layers ``[batch_size, config.hidden_size]_0, ..., [batch_size, config.hidden_size]_n``
-individually and concat them afterward to ``[batch_size, sequence_length, config.hidden_size]`` with
-``n = sequence_length``, which trades increased computation time against reduced memory use, but yields a
-mathematically **equivalent** result.
+individually and concat them afterward to ``[batch_size, sequence_length, config.hidden_size]`` with ``n =
+sequence_length``, which trades increased computation time against reduced memory use, but yields a mathematically
+**equivalent** result.

 For models employing the function :func:`~.transformers.apply_chunking_to_forward`, the ``chunk_size`` defines the
 number of output embeddings that are computed in parallel and thus defines the trade-off between memory and time
-complexity. If ``chunk_size`` is set to 0, no feed forward chunking is done.
+complexity. If ``chunk_size`` is set to 0, no feed forward chunking is done.
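The equivalence claimed above can be checked on a toy example. The sketch below is a plain-Python stand-in for a position-wise feed forward block (linear, ReLU, linear) applied to one sequence, either all at once or in chunks along the sequence dimension. It does not use or reproduce ``apply_chunking_to_forward``. The helper names, weights and sizes are all hypothetical.

```python
def feed_forward(positions, w_in, w_out):
    """Position-wise feed forward: linear -> ReLU -> linear for each position vector."""
    outputs = []
    for vec in positions:
        # Intermediate activation of size intermediate_size for this position.
        hidden = [max(0.0, sum(x * w for x, w in zip(vec, row))) for row in w_in]
        # Project back down to hidden_size.
        outputs.append([sum(h * w for h, w in zip(hidden, row)) for row in w_out])
    return outputs


def chunked_feed_forward(positions, w_in, w_out, chunk_size):
    """Apply feed_forward to chunk_size positions at a time and concatenate the results."""
    if chunk_size == 0:
        # Mirrors the convention in the text: 0 means no chunking.
        return feed_forward(positions, w_in, w_out)
    outputs = []
    for start in range(0, len(positions), chunk_size):
        outputs.extend(feed_forward(positions[start:start + chunk_size], w_in, w_out))
    return outputs


# Toy sizes: hidden_size = 2, intermediate_size = 3, sequence_length = 4.
w_in = [[1.0, -1.0], [0.5, 0.5], [-2.0, 1.0]]  # intermediate_size x hidden_size
w_out = [[1.0, 0.0, -1.0], [0.5, 2.0, 0.0]]    # hidden_size x intermediate_size
sequence = [[1.0, 2.0], [0.0, -1.0], [3.0, 0.5], [-1.0, -1.0]]

full = feed_forward(sequence, w_in, w_out)
for size in (0, 1, 2, 3):
    assert chunked_feed_forward(sequence, w_in, w_out, size) == full
```

The sketch demonstrates only that the outputs match. In a tensor implementation the memory saving comes from materializing the intermediate activations for ``chunk_size`` positions at a time instead of for the whole sequence.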