From aa39967b2898d1056d51ec3b710468ca95773074 Mon Sep 17 00:00:00 2001 From: Steven Liu <59462357+stevhliu@users.noreply.github.com> Date: Wed, 2 Nov 2022 16:58:17 -0700 Subject: [PATCH] reorganize glossary (#20010) --- docs/source/en/glossary.mdx | 412 ++++++++++++++++++++---------------- 1 file changed, 232 insertions(+), 180 deletions(-) diff --git a/docs/source/en/glossary.mdx b/docs/source/en/glossary.mdx index a61eb86eaa..f362a013c8 100644 --- a/docs/source/en/glossary.mdx +++ b/docs/source/en/glossary.mdx @@ -12,108 +12,10 @@ specific language governing permissions and limitations under the License. # Glossary -## General terms - -- autoencoding models: see MLM -- autoregressive models: see CLM -- CLM: causal language modeling, a pretraining task where the model reads the texts in order and has to predict the - next word. It's usually done by reading the whole sentence but using a mask inside the model to hide the future - tokens at a certain timestep. -- deep learning: machine learning algorithms which uses neural networks with several layers. -- MLM: masked language modeling, a pretraining task where the model sees a corrupted version of the texts, usually done - by masking some tokens randomly, and has to predict the original text. -- multimodal: a task that combines texts with another kind of inputs (for instance images). -- NLG: natural language generation, all tasks related to generating text (for instance talk with transformers, - translation). -- NLP: natural language processing, a generic way to say "deal with texts". -- NLU: natural language understanding, all tasks related to understanding what is in a text (for instance classifying - the whole text, individual words). -- pretrained model: a model that has been pretrained on some data (for instance all of Wikipedia). 
Pretraining methods - involve a self-supervised objective, which can be reading the text and trying to predict the next word (see CLM) or - masking some words and trying to predict them (see MLM). -- RNN: recurrent neural network, a type of model that uses a loop over a layer to process texts. -- self-attention: each element of the input finds out which other elements of the input they should attend to. -- seq2seq or sequence-to-sequence: models that generate a new sequence from an input, like translation models, or - summarization models (such as [Bart](model_doc/bart) or [T5](model_doc/t5)). -- token: a part of a sentence, usually a word, but can also be a subword (non-common words are often split in subwords) - or a punctuation symbol. -- transformer: self-attention based deep learning model architecture. - -## Model inputs - -Every model is different yet bears similarities with the others. Therefore most models use the same inputs, which are -detailed here alongside usage examples. - - - -### Input IDs - -The input ids are often the only required parameters to be passed to the model as input. *They are token indices, -numerical representations of tokens building the sequences that will be used as input by the model*. - - - -Each tokenizer works differently but the underlying mechanism remains the same. Here's an example using the BERT -tokenizer, which is a [WordPiece](https://arxiv.org/pdf/1609.08144.pdf) tokenizer: - -```python ->>> from transformers import BertTokenizer - ->>> tokenizer = BertTokenizer.from_pretrained("bert-base-cased") - ->>> sequence = "A Titan RTX has 24GB of VRAM" -``` - -The tokenizer takes care of splitting the sequence into tokens available in the tokenizer vocabulary. - -```python ->>> tokenized_sequence = tokenizer.tokenize(sequence) -``` - -The tokens are either words or subwords. Here for instance, "VRAM" wasn't in the model vocabulary, so it's been split -in "V", "RA" and "M". 
To indicate those tokens are not separate words but parts of the same word, a double-hash prefix -is added for "RA" and "M": - -```python ->>> print(tokenized_sequence) -['A', 'Titan', 'R', '##T', '##X', 'has', '24', '##GB', 'of', 'V', '##RA', '##M'] -``` - -These tokens can then be converted into IDs which are understandable by the model. This can be done by directly feeding -the sentence to the tokenizer, which leverages the Rust implementation of [🤗 Tokenizers](https://github.com/huggingface/tokenizers) for peak performance. - -```python ->>> inputs = tokenizer(sequence) -``` - -The tokenizer returns a dictionary with all the arguments necessary for its corresponding model to work properly. The -token indices are under the key "input_ids": - -```python ->>> encoded_sequence = inputs["input_ids"] ->>> print(encoded_sequence) -[101, 138, 18696, 155, 1942, 3190, 1144, 1572, 13745, 1104, 159, 9664, 2107, 102] -``` - -Note that the tokenizer automatically adds "special tokens" (if the associated model relies on them) which are special -IDs the model sometimes uses. - -If we decode the previous sequence of ids, - -```python ->>> decoded_sequence = tokenizer.decode(encoded_sequence) -``` - -we will see - -```python ->>> print(decoded_sequence) -[CLS] A Titan RTX has 24GB of VRAM [SEP] -``` - -because this is the way a [`BertModel`] is going to expect its inputs. - +This glossary defines general machine learning and 🤗 Transformers terms to help you better understand the +documentation. +## A ### Attention mask @@ -162,16 +64,236 @@ We can see that 0s have been added on the right of the first sentence to make it ``` This can then be converted into a tensor in PyTorch or TensorFlow. The attention mask is a binary tensor indicating the -position of the padded indices so that the model does not attend to them. For the [`BertTokenizer`], -`1` indicates a value that should be attended to, while `0` indicates a padded value. 
This attention mask is -in the dictionary returned by the tokenizer under the key "attention_mask": +position of the padded indices so that the model does not attend to them. For the [`BertTokenizer`], `1` indicates a +value that should be attended to, while `0` indicates a padded value. This attention mask is in the dictionary returned +by the tokenizer under the key "attention_mask": ```python >>> padded_sequences["attention_mask"] [[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]] ``` +### autoencoding models +see [MLM](#mlm) + +### autoregressive models + +see [CLM](#clm) + +## C + +### CLM + +Causal language modeling, a pretraining task where the model reads the texts in order and has to predict the next word. +It's usually done by reading the whole sentence but using a mask inside the model to hide the future tokens at a +certain timestep. + +## D + +### Decoder input IDs + +This input is specific to encoder-decoder models, and contains the input IDs that will be fed to the decoder. These +inputs should be used for sequence to sequence tasks, such as translation or summarization, and are usually built in a +way specific to each model. + +Most encoder-decoder models (BART, T5) create their `decoder_input_ids` on their own from the `labels`. In such models, +passing the `labels` is the preferred way to handle training. + +Please check each model's docs to see how they handle these input IDs for sequence to sequence training. + +### deep learning + +Machine learning algorithms which uses neural networks with several layers. + +## F + +### Feed Forward Chunking + +In each residual attention block in transformers the self-attention layer is usually followed by 2 feed forward layers. +The intermediate embedding size of the feed forward layers is often bigger than the hidden size of the model (e.g., for +`bert-base-uncased`). 
+ +For an input of size `[batch_size, sequence_length]`, the memory required to store the intermediate feed forward +embeddings `[batch_size, sequence_length, config.intermediate_size]` can account for a large fraction of the memory +use. The authors of [Reformer: The Efficient Transformer](https://arxiv.org/abs/2001.04451) noticed that since the +computation is independent of the `sequence_length` dimension, it is mathematically equivalent to compute the output +embeddings of both feed forward layers `[batch_size, config.hidden_size]_0, ..., [batch_size, config.hidden_size]_n` +individually and concat them afterward to `[batch_size, sequence_length, config.hidden_size]` with `n = +sequence_length`, which trades increased computation time against reduced memory use, but yields a mathematically +**equivalent** result. + +For models employing the function [`apply_chunking_to_forward`], the `chunk_size` defines the number of output +embeddings that are computed in parallel and thus defines the trade-off between memory and time complexity. If +`chunk_size` is set to 0, no feed forward chunking is done. + +## I + +### Input IDs + +The input ids are often the only required parameters to be passed to the model as input. *They are token indices, +numerical representations of tokens building the sequences that will be used as input by the model*. + + + +Each tokenizer works differently but the underlying mechanism remains the same. Here's an example using the BERT +tokenizer, which is a [WordPiece](https://arxiv.org/pdf/1609.08144.pdf) tokenizer: + +```python +>>> from transformers import BertTokenizer + +>>> tokenizer = BertTokenizer.from_pretrained("bert-base-cased") + +>>> sequence = "A Titan RTX has 24GB of VRAM" +``` + +The tokenizer takes care of splitting the sequence into tokens available in the tokenizer vocabulary. + +```python +>>> tokenized_sequence = tokenizer.tokenize(sequence) +``` + +The tokens are either words or subwords. 
Here for instance, "VRAM" wasn't in the model vocabulary, so it's been split +in "V", "RA" and "M". To indicate those tokens are not separate words but parts of the same word, a double-hash prefix +is added for "RA" and "M": + +```python +>>> print(tokenized_sequence) +['A', 'Titan', 'R', '##T', '##X', 'has', '24', '##GB', 'of', 'V', '##RA', '##M'] +``` + +These tokens can then be converted into IDs which are understandable by the model. This can be done by directly feeding +the sentence to the tokenizer, which leverages the Rust implementation of [🤗 +Tokenizers](https://github.com/huggingface/tokenizers) for peak performance. + +```python +>>> inputs = tokenizer(sequence) +``` + +The tokenizer returns a dictionary with all the arguments necessary for its corresponding model to work properly. The +token indices are under the key "input_ids": + +```python +>>> encoded_sequence = inputs["input_ids"] +>>> print(encoded_sequence) +[101, 138, 18696, 155, 1942, 3190, 1144, 1572, 13745, 1104, 159, 9664, 2107, 102] +``` + +Note that the tokenizer automatically adds "special tokens" (if the associated model relies on them) which are special +IDs the model sometimes uses. + +If we decode the previous sequence of ids, + +```python +>>> decoded_sequence = tokenizer.decode(encoded_sequence) +``` + +we will see + +```python +>>> print(decoded_sequence) +[CLS] A Titan RTX has 24GB of VRAM [SEP] +``` + +because this is the way a [`BertModel`] is going to expect its inputs. + +## L + +### Labels + +The labels are an optional argument which can be passed in order for the model to compute the loss itself. These labels +should be the expected prediction of the model: it will use the standard loss in order to compute the loss between its +predictions and the expected value (the label). 
+ +These labels are different according to the model head, for example: + +- For sequence classification models (e.g., [`BertForSequenceClassification`]), the model expects a tensor of dimension + `(batch_size)` with each value of the batch corresponding to the expected label of the entire sequence. +- For token classification models (e.g., [`BertForTokenClassification`]), the model expects a tensor of dimension + `(batch_size, seq_length)` with each value corresponding to the expected label of each individual token. +- For masked language modeling (e.g., [`BertForMaskedLM`]), the model expects a tensor of dimension `(batch_size, + seq_length)` with each value corresponding to the expected label of each individual token: the labels being the token + ID for the masked token, and values to be ignored for the rest (usually -100). +- For sequence to sequence tasks,(e.g., [`BartForConditionalGeneration`], [`MBartForConditionalGeneration`]), the model + expects a tensor of dimension `(batch_size, tgt_seq_length)` with each value corresponding to the target sequences + associated with each input sequence. During training, both *BART* and *T5* will make the appropriate + *decoder_input_ids* and decoder attention masks internally. They usually do not need to be supplied. This does not + apply to models leveraging the Encoder-Decoder framework. See the documentation of each model for more information on + each specific model's labels. + +The base models (e.g., [`BertModel`]) do not accept labels, as these are the base transformer models, simply outputting +features. + +## M + +### MLM + +Masked language modeling, a pretraining task where the model sees a corrupted version of the texts, usually done by +masking some tokens randomly, and has to predict the original text. + +### multimodal + +A task that combines texts with another kind of inputs (for instance images). 
+ +## N + +### NLG + +Natural language generation, all tasks related to generating text (for instance talk with transformers, translation). + +### NLP + +Natural language processing, a generic way to say "deal with texts". + +### NLU + +Natural language understanding, all tasks related to understanding what is in a text (for instance classifying the +whole text, individual words). + +## P + +### Position IDs + +Contrary to RNNs that have the position of each token embedded within them, transformers are unaware of the position of +each token. Therefore, the position IDs (`position_ids`) are used by the model to identify each token's position in the +list of tokens. + +They are an optional parameter. If no `position_ids` are passed to the model, the IDs are automatically created as +absolute positional embeddings. + +Absolute positional embeddings are selected in the range `[0, config.max_position_embeddings - 1]`. Some models use +other types of positional embeddings, such as sinusoidal position embeddings or relative position embeddings. + + +### pretrained model + +A model that has been pretrained on some data (for instance all of Wikipedia). Pretraining methods involve a +self-supervised objective, which can be reading the text and trying to predict the next word (see CLM) or masking some +words and trying to predict them (see MLM). + +## R + +### RNN + +Recurrent neural network, a type of model that uses a loop over a layer to process texts. + +## S + +### self-attention + +Each element of the input finds out which other elements of the input they should attend to. + +### seq2seq or sequence-to-sequence + +Models that generate a new sequence from an input, like translation models, or summarization models (such as +[Bart](model_doc/bart) or [T5](model_doc/t5)). + +## T + +### token + +A part of a sentence, usually a word, but can also be a subword (non-common words are often split in subwords) or a +punctuation symbol. 
### Token Type IDs @@ -180,8 +302,8 @@ Some models' purpose is to do classification on pairs of sentences or question a These require two different sequences to be joined in a single "input_ids" entry, which usually is performed with the -help of special tokens, such as the classifier (`[CLS]`) and separator (`[SEP]`) tokens. For example, the BERT -model builds its two sequence input as such: +help of special tokens, such as the classifier (`[CLS]`) and separator (`[SEP]`) tokens. For example, the BERT model +builds its two sequence input as such: ```python >>> # [CLS] SEQUENCE_A [SEP] SEQUENCE_B [SEP] @@ -219,81 +341,11 @@ The tokenizer returns this mask as the "token_type_ids" entry: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1] ``` -The first sequence, the "context" used for the question, has all its tokens represented by a `0`, whereas the -second sequence, corresponding to the "question", has all its tokens represented by a `1`. +The first sequence, the "context" used for the question, has all its tokens represented by a `0`, whereas the second +sequence, corresponding to the "question", has all its tokens represented by a `1`. Some models, like [`XLNetModel`] use an additional token represented by a `2`. +### transformer - -### Position IDs - -Contrary to RNNs that have the position of each token embedded within them, transformers are unaware of the position of -each token. Therefore, the position IDs (`position_ids`) are used by the model to identify each token's position in -the list of tokens. - -They are an optional parameter. If no `position_ids` are passed to the model, the IDs are automatically created as -absolute positional embeddings. - -Absolute positional embeddings are selected in the range `[0, config.max_position_embeddings - 1]`. Some models use -other types of positional embeddings, such as sinusoidal position embeddings or relative position embeddings. 
- - - -### Labels - -The labels are an optional argument which can be passed in order for the model to compute the loss itself. These labels -should be the expected prediction of the model: it will use the standard loss in order to compute the loss between its -predictions and the expected value (the label). - -These labels are different according to the model head, for example: - -- For sequence classification models (e.g., [`BertForSequenceClassification`]), the model expects a - tensor of dimension `(batch_size)` with each value of the batch corresponding to the expected label of the - entire sequence. -- For token classification models (e.g., [`BertForTokenClassification`]), the model expects a tensor - of dimension `(batch_size, seq_length)` with each value corresponding to the expected label of each individual - token. -- For masked language modeling (e.g., [`BertForMaskedLM`]), the model expects a tensor of dimension - `(batch_size, seq_length)` with each value corresponding to the expected label of each individual token: the - labels being the token ID for the masked token, and values to be ignored for the rest (usually -100). -- For sequence to sequence tasks,(e.g., [`BartForConditionalGeneration`], - [`MBartForConditionalGeneration`]), the model expects a tensor of dimension `(batch_size, tgt_seq_length)` with each value corresponding to the target sequences associated with each input sequence. During - training, both *BART* and *T5* will make the appropriate *decoder_input_ids* and decoder attention masks internally. - They usually do not need to be supplied. This does not apply to models leveraging the Encoder-Decoder framework. See - the documentation of each model for more information on each specific model's labels. - -The base models (e.g., [`BertModel`]) do not accept labels, as these are the base transformer -models, simply outputting features. 
- - - -### Decoder input IDs - -This input is specific to encoder-decoder models, and contains the input IDs that will be fed to the decoder. These -inputs should be used for sequence to sequence tasks, such as translation or summarization, and are usually built in a -way specific to each model. - -Most encoder-decoder models (BART, T5) create their `decoder_input_ids` on their own from the `labels`. In -such models, passing the `labels` is the preferred way to handle training. - -Please check each model's docs to see how they handle these input IDs for sequence to sequence training. - - -### Feed Forward Chunking - -In each residual attention block in transformers the self-attention layer is usually followed by 2 feed forward layers. -The intermediate embedding size of the feed forward layers is often bigger than the hidden size of the model (e.g., for -`bert-base-uncased`). - -For an input of size `[batch_size, sequence_length]`, the memory required to store the intermediate feed forward -embeddings `[batch_size, sequence_length, config.intermediate_size]` can account for a large fraction of the memory -use. The authors of [Reformer: The Efficient Transformer](https://arxiv.org/abs/2001.04451) noticed that since the -computation is independent of the `sequence_length` dimension, it is mathematically equivalent to compute the output -embeddings of both feed forward layers `[batch_size, config.hidden_size]_0, ..., [batch_size, config.hidden_size]_n` -individually and concat them afterward to `[batch_size, sequence_length, config.hidden_size]` with `n = sequence_length`, which trades increased computation time against reduced memory use, but yields a mathematically -**equivalent** result. - -For models employing the function [`apply_chunking_to_forward`], the `chunk_size` defines the -number of output embeddings that are computed in parallel and thus defines the trade-off between memory and time -complexity. 
If `chunk_size` is set to 0, no feed forward chunking is done. +Self-attention based deep learning model architecture. \ No newline at end of file
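
The Feed Forward Chunking entry moved by this patch can be illustrated with a minimal pure-Python sketch. It is not the library's `apply_chunking_to_forward`: the names `feed_forward`, `chunked_feed_forward`, and `TOY_WEIGHTS` are hypothetical, and the layer is a toy stand-in for the two feed forward layers. It only shows that a position-wise computation gives the same result whether the sequence is processed at once or in chunks along the sequence dimension.

```python
from typing import List

Vector = List[float]

# Hypothetical fixed weights; they widen each feature 4x, mimicking an
# intermediate size larger than the hidden size.
TOY_WEIGHTS = (1.0, -1.0, 0.5, 2.0)


def feed_forward(tokens: List[Vector]) -> List[Vector]:
    """Toy position-wise layer: every token is transformed independently."""
    outputs = []
    for token in tokens:
        # "Intermediate" embedding: 4 ReLU activations per input feature.
        hidden = [max(0.0, x * w) for x in token for w in TOY_WEIGHTS]
        # Project back to the original size by summing each group of 4.
        width = len(TOY_WEIGHTS)
        outputs.append(
            [sum(hidden[i * width:(i + 1) * width]) for i in range(len(token))]
        )
    return outputs


def chunked_feed_forward(tokens: List[Vector], chunk_size: int) -> List[Vector]:
    """Apply feed_forward to chunk_size tokens at a time, then concatenate."""
    if chunk_size == 0:
        # Mirrors the glossary text: a chunk size of 0 means no chunking.
        return feed_forward(tokens)
    outputs: List[Vector] = []
    for start in range(0, len(tokens), chunk_size):
        outputs.extend(feed_forward(tokens[start:start + chunk_size]))
    return outputs


sequence = [[1.0, -2.0], [0.5, 3.0], [-1.0, 0.0]]
full_result = feed_forward(sequence)
chunked_result = chunked_feed_forward(sequence, chunk_size=1)
print(full_result == chunked_result)
```

Smaller chunks mean only a few intermediate embeddings exist at any time, at the cost of more sequential calls. That is the memory/time trade-off the entry describes.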