reorganize glossary (#20010)
This commit is contained in:
@@ -12,108 +12,10 @@ specific language governing permissions and limitations under the License.
|
||||
|
||||
# Glossary
|
||||
|
||||
## General terms
|
||||
|
||||
- autoencoding models: see MLM
|
||||
- autoregressive models: see CLM
|
||||
- CLM: causal language modeling, a pretraining task where the model reads the texts in order and has to predict the
|
||||
next word. It's usually done by reading the whole sentence but using a mask inside the model to hide the future
|
||||
tokens at a certain timestep.
|
||||
- deep learning: machine learning algorithms which uses neural networks with several layers.
|
||||
- MLM: masked language modeling, a pretraining task where the model sees a corrupted version of the texts, usually done
|
||||
by masking some tokens randomly, and has to predict the original text.
|
||||
- multimodal: a task that combines texts with another kind of inputs (for instance images).
|
||||
- NLG: natural language generation, all tasks related to generating text (for instance talk with transformers,
|
||||
translation).
|
||||
- NLP: natural language processing, a generic way to say "deal with texts".
|
||||
- NLU: natural language understanding, all tasks related to understanding what is in a text (for instance classifying
|
||||
the whole text, individual words).
|
||||
- pretrained model: a model that has been pretrained on some data (for instance all of Wikipedia). Pretraining methods
|
||||
involve a self-supervised objective, which can be reading the text and trying to predict the next word (see CLM) or
|
||||
masking some words and trying to predict them (see MLM).
|
||||
- RNN: recurrent neural network, a type of model that uses a loop over a layer to process texts.
|
||||
- self-attention: each element of the input finds out which other elements of the input they should attend to.
|
||||
- seq2seq or sequence-to-sequence: models that generate a new sequence from an input, like translation models, or
|
||||
summarization models (such as [Bart](model_doc/bart) or [T5](model_doc/t5)).
|
||||
- token: a part of a sentence, usually a word, but can also be a subword (non-common words are often split in subwords)
|
||||
or a punctuation symbol.
|
||||
- transformer: self-attention based deep learning model architecture.
|
||||
|
||||
## Model inputs
|
||||
|
||||
Every model is different yet bears similarities with the others. Therefore most models use the same inputs, which are
|
||||
detailed here alongside usage examples.
|
||||
|
||||
|
||||
|
||||
### Input IDs
|
||||
|
||||
The input ids are often the only required parameters to be passed to the model as input. *They are token indices,
|
||||
numerical representations of tokens building the sequences that will be used as input by the model*.
|
||||
|
||||
<Youtube id="VFp38yj8h3A"/>
|
||||
|
||||
Each tokenizer works differently but the underlying mechanism remains the same. Here's an example using the BERT
|
||||
tokenizer, which is a [WordPiece](https://arxiv.org/pdf/1609.08144.pdf) tokenizer:
|
||||
|
||||
```python
|
||||
>>> from transformers import BertTokenizer
|
||||
|
||||
>>> tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
|
||||
|
||||
>>> sequence = "A Titan RTX has 24GB of VRAM"
|
||||
```
|
||||
|
||||
The tokenizer takes care of splitting the sequence into tokens available in the tokenizer vocabulary.
|
||||
|
||||
```python
|
||||
>>> tokenized_sequence = tokenizer.tokenize(sequence)
|
||||
```
|
||||
|
||||
The tokens are either words or subwords. Here for instance, "VRAM" wasn't in the model vocabulary, so it's been split
|
||||
in "V", "RA" and "M". To indicate those tokens are not separate words but parts of the same word, a double-hash prefix
|
||||
is added for "RA" and "M":
|
||||
|
||||
```python
|
||||
>>> print(tokenized_sequence)
|
||||
['A', 'Titan', 'R', '##T', '##X', 'has', '24', '##GB', 'of', 'V', '##RA', '##M']
|
||||
```
|
||||
|
||||
These tokens can then be converted into IDs which are understandable by the model. This can be done by directly feeding
|
||||
the sentence to the tokenizer, which leverages the Rust implementation of [🤗 Tokenizers](https://github.com/huggingface/tokenizers) for peak performance.
|
||||
|
||||
```python
|
||||
>>> inputs = tokenizer(sequence)
|
||||
```
|
||||
|
||||
The tokenizer returns a dictionary with all the arguments necessary for its corresponding model to work properly. The
|
||||
token indices are under the key "input_ids":
|
||||
|
||||
```python
|
||||
>>> encoded_sequence = inputs["input_ids"]
|
||||
>>> print(encoded_sequence)
|
||||
[101, 138, 18696, 155, 1942, 3190, 1144, 1572, 13745, 1104, 159, 9664, 2107, 102]
|
||||
```
|
||||
|
||||
Note that the tokenizer automatically adds "special tokens" (if the associated model relies on them) which are special
|
||||
IDs the model sometimes uses.
|
||||
|
||||
If we decode the previous sequence of ids,
|
||||
|
||||
```python
|
||||
>>> decoded_sequence = tokenizer.decode(encoded_sequence)
|
||||
```
|
||||
|
||||
we will see
|
||||
|
||||
```python
|
||||
>>> print(decoded_sequence)
|
||||
[CLS] A Titan RTX has 24GB of VRAM [SEP]
|
||||
```
|
||||
|
||||
because this is the way a [`BertModel`] is going to expect its inputs.
|
||||
|
||||
This glossary defines general machine learning and 🤗 Transformers terms to help you better understand the
|
||||
documentation.
|
||||
|
||||
## A
|
||||
|
||||
### Attention mask
|
||||
|
||||
@@ -162,16 +64,236 @@ We can see that 0s have been added on the right of the first sentence to make it
|
||||
```
|
||||
|
||||
This can then be converted into a tensor in PyTorch or TensorFlow. The attention mask is a binary tensor indicating the
|
||||
position of the padded indices so that the model does not attend to them. For the [`BertTokenizer`],
|
||||
`1` indicates a value that should be attended to, while `0` indicates a padded value. This attention mask is
|
||||
in the dictionary returned by the tokenizer under the key "attention_mask":
|
||||
position of the padded indices so that the model does not attend to them. For the [`BertTokenizer`], `1` indicates a
|
||||
value that should be attended to, while `0` indicates a padded value. This attention mask is in the dictionary returned
|
||||
by the tokenizer under the key "attention_mask":
|
||||
|
||||
```python
|
||||
>>> padded_sequences["attention_mask"]
|
||||
[[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]
|
||||
```
|
||||
|
||||
### autoencoding models
|
||||
|
||||
see [MLM](#mlm)
|
||||
|
||||
### autoregressive models
|
||||
|
||||
see [CLM](#clm)
|
||||
|
||||
## C
|
||||
|
||||
### CLM
|
||||
|
||||
Causal language modeling, a pretraining task where the model reads the texts in order and has to predict the next word.
|
||||
It's usually done by reading the whole sentence but using a mask inside the model to hide the future tokens at a
|
||||
certain timestep.
|
||||
|
||||
## D
|
||||
|
||||
### Decoder input IDs
|
||||
|
||||
This input is specific to encoder-decoder models, and contains the input IDs that will be fed to the decoder. These
|
||||
inputs should be used for sequence to sequence tasks, such as translation or summarization, and are usually built in a
|
||||
way specific to each model.
|
||||
|
||||
Most encoder-decoder models (BART, T5) create their `decoder_input_ids` on their own from the `labels`. In such models,
|
||||
passing the `labels` is the preferred way to handle training.
|
||||
|
||||
Please check each model's docs to see how they handle these input IDs for sequence to sequence training.
|
||||
|
||||
### deep learning
|
||||
|
||||
Machine learning algorithms which uses neural networks with several layers.
|
||||
|
||||
## F
|
||||
|
||||
### Feed Forward Chunking
|
||||
|
||||
In each residual attention block in transformers the self-attention layer is usually followed by 2 feed forward layers.
|
||||
The intermediate embedding size of the feed forward layers is often bigger than the hidden size of the model (e.g., for
|
||||
`bert-base-uncased`).
|
||||
|
||||
For an input of size `[batch_size, sequence_length]`, the memory required to store the intermediate feed forward
|
||||
embeddings `[batch_size, sequence_length, config.intermediate_size]` can account for a large fraction of the memory
|
||||
use. The authors of [Reformer: The Efficient Transformer](https://arxiv.org/abs/2001.04451) noticed that since the
|
||||
computation is independent of the `sequence_length` dimension, it is mathematically equivalent to compute the output
|
||||
embeddings of both feed forward layers `[batch_size, config.hidden_size]_0, ..., [batch_size, config.hidden_size]_n`
|
||||
individually and concat them afterward to `[batch_size, sequence_length, config.hidden_size]` with `n =
|
||||
sequence_length`, which trades increased computation time against reduced memory use, but yields a mathematically
|
||||
**equivalent** result.
|
||||
|
||||
For models employing the function [`apply_chunking_to_forward`], the `chunk_size` defines the number of output
|
||||
embeddings that are computed in parallel and thus defines the trade-off between memory and time complexity. If
|
||||
`chunk_size` is set to 0, no feed forward chunking is done.
|
||||
|
||||
## I
|
||||
|
||||
### Input IDs
|
||||
|
||||
The input ids are often the only required parameters to be passed to the model as input. *They are token indices,
|
||||
numerical representations of tokens building the sequences that will be used as input by the model*.
|
||||
|
||||
<Youtube id="VFp38yj8h3A"/>
|
||||
|
||||
Each tokenizer works differently but the underlying mechanism remains the same. Here's an example using the BERT
|
||||
tokenizer, which is a [WordPiece](https://arxiv.org/pdf/1609.08144.pdf) tokenizer:
|
||||
|
||||
```python
|
||||
>>> from transformers import BertTokenizer
|
||||
|
||||
>>> tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
|
||||
|
||||
>>> sequence = "A Titan RTX has 24GB of VRAM"
|
||||
```
|
||||
|
||||
The tokenizer takes care of splitting the sequence into tokens available in the tokenizer vocabulary.
|
||||
|
||||
```python
|
||||
>>> tokenized_sequence = tokenizer.tokenize(sequence)
|
||||
```
|
||||
|
||||
The tokens are either words or subwords. Here for instance, "VRAM" wasn't in the model vocabulary, so it's been split
|
||||
in "V", "RA" and "M". To indicate those tokens are not separate words but parts of the same word, a double-hash prefix
|
||||
is added for "RA" and "M":
|
||||
|
||||
```python
|
||||
>>> print(tokenized_sequence)
|
||||
['A', 'Titan', 'R', '##T', '##X', 'has', '24', '##GB', 'of', 'V', '##RA', '##M']
|
||||
```
|
||||
|
||||
These tokens can then be converted into IDs which are understandable by the model. This can be done by directly feeding
|
||||
the sentence to the tokenizer, which leverages the Rust implementation of [🤗
|
||||
Tokenizers](https://github.com/huggingface/tokenizers) for peak performance.
|
||||
|
||||
```python
|
||||
>>> inputs = tokenizer(sequence)
|
||||
```
|
||||
|
||||
The tokenizer returns a dictionary with all the arguments necessary for its corresponding model to work properly. The
|
||||
token indices are under the key "input_ids":
|
||||
|
||||
```python
|
||||
>>> encoded_sequence = inputs["input_ids"]
|
||||
>>> print(encoded_sequence)
|
||||
[101, 138, 18696, 155, 1942, 3190, 1144, 1572, 13745, 1104, 159, 9664, 2107, 102]
|
||||
```
|
||||
|
||||
Note that the tokenizer automatically adds "special tokens" (if the associated model relies on them) which are special
|
||||
IDs the model sometimes uses.
|
||||
|
||||
If we decode the previous sequence of ids,
|
||||
|
||||
```python
|
||||
>>> decoded_sequence = tokenizer.decode(encoded_sequence)
|
||||
```
|
||||
|
||||
we will see
|
||||
|
||||
```python
|
||||
>>> print(decoded_sequence)
|
||||
[CLS] A Titan RTX has 24GB of VRAM [SEP]
|
||||
```
|
||||
|
||||
because this is the way a [`BertModel`] is going to expect its inputs.
|
||||
|
||||
## L
|
||||
|
||||
### Labels
|
||||
|
||||
The labels are an optional argument which can be passed in order for the model to compute the loss itself. These labels
|
||||
should be the expected prediction of the model: it will use the standard loss in order to compute the loss between its
|
||||
predictions and the expected value (the label).
|
||||
|
||||
These labels are different according to the model head, for example:
|
||||
|
||||
- For sequence classification models (e.g., [`BertForSequenceClassification`]), the model expects a tensor of dimension
|
||||
`(batch_size)` with each value of the batch corresponding to the expected label of the entire sequence.
|
||||
- For token classification models (e.g., [`BertForTokenClassification`]), the model expects a tensor of dimension
|
||||
`(batch_size, seq_length)` with each value corresponding to the expected label of each individual token.
|
||||
- For masked language modeling (e.g., [`BertForMaskedLM`]), the model expects a tensor of dimension `(batch_size,
|
||||
seq_length)` with each value corresponding to the expected label of each individual token: the labels being the token
|
||||
ID for the masked token, and values to be ignored for the rest (usually -100).
|
||||
- For sequence to sequence tasks,(e.g., [`BartForConditionalGeneration`], [`MBartForConditionalGeneration`]), the model
|
||||
expects a tensor of dimension `(batch_size, tgt_seq_length)` with each value corresponding to the target sequences
|
||||
associated with each input sequence. During training, both *BART* and *T5* will make the appropriate
|
||||
*decoder_input_ids* and decoder attention masks internally. They usually do not need to be supplied. This does not
|
||||
apply to models leveraging the Encoder-Decoder framework. See the documentation of each model for more information on
|
||||
each specific model's labels.
|
||||
|
||||
The base models (e.g., [`BertModel`]) do not accept labels, as these are the base transformer models, simply outputting
|
||||
features.
|
||||
|
||||
## M
|
||||
|
||||
### MLM
|
||||
|
||||
Masked language modeling, a pretraining task where the model sees a corrupted version of the texts, usually done by
|
||||
masking some tokens randomly, and has to predict the original text.
|
||||
|
||||
### multimodal
|
||||
|
||||
A task that combines texts with another kind of inputs (for instance images).
|
||||
|
||||
## N
|
||||
|
||||
### NLG
|
||||
|
||||
Natural language generation, all tasks related to generating text (for instance talk with transformers, translation).
|
||||
|
||||
### NLP
|
||||
|
||||
Natural language processing, a generic way to say "deal with texts".
|
||||
|
||||
### NLU
|
||||
|
||||
Natural language understanding, all tasks related to understanding what is in a text (for instance classifying the
|
||||
whole text, individual words).
|
||||
|
||||
## P
|
||||
|
||||
### Position IDs
|
||||
|
||||
Contrary to RNNs that have the position of each token embedded within them, transformers are unaware of the position of
|
||||
each token. Therefore, the position IDs (`position_ids`) are used by the model to identify each token's position in the
|
||||
list of tokens.
|
||||
|
||||
They are an optional parameter. If no `position_ids` are passed to the model, the IDs are automatically created as
|
||||
absolute positional embeddings.
|
||||
|
||||
Absolute positional embeddings are selected in the range `[0, config.max_position_embeddings - 1]`. Some models use
|
||||
other types of positional embeddings, such as sinusoidal position embeddings or relative position embeddings.
|
||||
|
||||
|
||||
### pretrained model
|
||||
|
||||
A model that has been pretrained on some data (for instance all of Wikipedia). Pretraining methods involve a
|
||||
self-supervised objective, which can be reading the text and trying to predict the next word (see CLM) or masking some
|
||||
words and trying to predict them (see MLM).
|
||||
|
||||
## R
|
||||
|
||||
### RNN
|
||||
|
||||
Recurrent neural network, a type of model that uses a loop over a layer to process texts.
|
||||
|
||||
## S
|
||||
|
||||
### self-attention
|
||||
|
||||
Each element of the input finds out which other elements of the input they should attend to.
|
||||
|
||||
### seq2seq or sequence-to-sequence
|
||||
|
||||
Models that generate a new sequence from an input, like translation models, or summarization models (such as
|
||||
[Bart](model_doc/bart) or [T5](model_doc/t5)).
|
||||
|
||||
## T
|
||||
|
||||
### token
|
||||
|
||||
A part of a sentence, usually a word, but can also be a subword (non-common words are often split in subwords) or a
|
||||
punctuation symbol.
|
||||
|
||||
### Token Type IDs
|
||||
|
||||
@@ -180,8 +302,8 @@ Some models' purpose is to do classification on pairs of sentences or question a
|
||||
<Youtube id="0u3ioSwev3s"/>
|
||||
|
||||
These require two different sequences to be joined in a single "input_ids" entry, which usually is performed with the
|
||||
help of special tokens, such as the classifier (`[CLS]`) and separator (`[SEP]`) tokens. For example, the BERT
|
||||
model builds its two sequence input as such:
|
||||
help of special tokens, such as the classifier (`[CLS]`) and separator (`[SEP]`) tokens. For example, the BERT model
|
||||
builds its two sequence input as such:
|
||||
|
||||
```python
|
||||
>>> # [CLS] SEQUENCE_A [SEP] SEQUENCE_B [SEP]
|
||||
@@ -219,81 +341,11 @@ The tokenizer returns this mask as the "token_type_ids" entry:
|
||||
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1]
|
||||
```
|
||||
|
||||
The first sequence, the "context" used for the question, has all its tokens represented by a `0`, whereas the
|
||||
second sequence, corresponding to the "question", has all its tokens represented by a `1`.
|
||||
The first sequence, the "context" used for the question, has all its tokens represented by a `0`, whereas the second
|
||||
sequence, corresponding to the "question", has all its tokens represented by a `1`.
|
||||
|
||||
Some models, like [`XLNetModel`] use an additional token represented by a `2`.
|
||||
|
||||
### transformer
|
||||
|
||||
|
||||
### Position IDs
|
||||
|
||||
Contrary to RNNs that have the position of each token embedded within them, transformers are unaware of the position of
|
||||
each token. Therefore, the position IDs (`position_ids`) are used by the model to identify each token's position in
|
||||
the list of tokens.
|
||||
|
||||
They are an optional parameter. If no `position_ids` are passed to the model, the IDs are automatically created as
|
||||
absolute positional embeddings.
|
||||
|
||||
Absolute positional embeddings are selected in the range `[0, config.max_position_embeddings - 1]`. Some models use
|
||||
other types of positional embeddings, such as sinusoidal position embeddings or relative position embeddings.
|
||||
|
||||
|
||||
|
||||
### Labels
|
||||
|
||||
The labels are an optional argument which can be passed in order for the model to compute the loss itself. These labels
|
||||
should be the expected prediction of the model: it will use the standard loss in order to compute the loss between its
|
||||
predictions and the expected value (the label).
|
||||
|
||||
These labels are different according to the model head, for example:
|
||||
|
||||
- For sequence classification models (e.g., [`BertForSequenceClassification`]), the model expects a
|
||||
tensor of dimension `(batch_size)` with each value of the batch corresponding to the expected label of the
|
||||
entire sequence.
|
||||
- For token classification models (e.g., [`BertForTokenClassification`]), the model expects a tensor
|
||||
of dimension `(batch_size, seq_length)` with each value corresponding to the expected label of each individual
|
||||
token.
|
||||
- For masked language modeling (e.g., [`BertForMaskedLM`]), the model expects a tensor of dimension
|
||||
`(batch_size, seq_length)` with each value corresponding to the expected label of each individual token: the
|
||||
labels being the token ID for the masked token, and values to be ignored for the rest (usually -100).
|
||||
- For sequence to sequence tasks,(e.g., [`BartForConditionalGeneration`],
|
||||
[`MBartForConditionalGeneration`]), the model expects a tensor of dimension `(batch_size, tgt_seq_length)` with each value corresponding to the target sequences associated with each input sequence. During
|
||||
training, both *BART* and *T5* will make the appropriate *decoder_input_ids* and decoder attention masks internally.
|
||||
They usually do not need to be supplied. This does not apply to models leveraging the Encoder-Decoder framework. See
|
||||
the documentation of each model for more information on each specific model's labels.
|
||||
|
||||
The base models (e.g., [`BertModel`]) do not accept labels, as these are the base transformer
|
||||
models, simply outputting features.
|
||||
|
||||
|
||||
|
||||
### Decoder input IDs
|
||||
|
||||
This input is specific to encoder-decoder models, and contains the input IDs that will be fed to the decoder. These
|
||||
inputs should be used for sequence to sequence tasks, such as translation or summarization, and are usually built in a
|
||||
way specific to each model.
|
||||
|
||||
Most encoder-decoder models (BART, T5) create their `decoder_input_ids` on their own from the `labels`. In
|
||||
such models, passing the `labels` is the preferred way to handle training.
|
||||
|
||||
Please check each model's docs to see how they handle these input IDs for sequence to sequence training.
|
||||
|
||||
|
||||
### Feed Forward Chunking
|
||||
|
||||
In each residual attention block in transformers the self-attention layer is usually followed by 2 feed forward layers.
|
||||
The intermediate embedding size of the feed forward layers is often bigger than the hidden size of the model (e.g., for
|
||||
`bert-base-uncased`).
|
||||
|
||||
For an input of size `[batch_size, sequence_length]`, the memory required to store the intermediate feed forward
|
||||
embeddings `[batch_size, sequence_length, config.intermediate_size]` can account for a large fraction of the memory
|
||||
use. The authors of [Reformer: The Efficient Transformer](https://arxiv.org/abs/2001.04451) noticed that since the
|
||||
computation is independent of the `sequence_length` dimension, it is mathematically equivalent to compute the output
|
||||
embeddings of both feed forward layers `[batch_size, config.hidden_size]_0, ..., [batch_size, config.hidden_size]_n`
|
||||
individually and concat them afterward to `[batch_size, sequence_length, config.hidden_size]` with `n = sequence_length`, which trades increased computation time against reduced memory use, but yields a mathematically
|
||||
**equivalent** result.
|
||||
|
||||
For models employing the function [`apply_chunking_to_forward`], the `chunk_size` defines the
|
||||
number of output embeddings that are computed in parallel and thus defines the trade-off between memory and time
|
||||
complexity. If `chunk_size` is set to 0, no feed forward chunking is done.
|
||||
Self-attention based deep learning model architecture.
|
||||
Reference in New Issue
Block a user