Add video links to the documentation (#12162)

2021-06-15 06:37:37 -04:00
parent 040283170c
commit a55dc157e3
7 changed files with 167 additions and 26 deletions
--- a/docs/source/glossary.rst
+++ b/docs/source/glossary.rst
@@ -55,6 +55,12 @@ Input IDs
 The input ids are often the only required parameters to be passed to the model as input. *They are token indices,
 numerical representations of tokens building the sequences that will be used as input by the model*.
 .. raw:: html
   <iframe width="560" height="315" src="https://www.youtube.com/embed/VFp38yj8h3A" title="YouTube video player"
   frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope;
   picture-in-picture" allowfullscreen></iframe>
 Each tokenizer works differently but the underlying mechanism remains the same. Here's an example using the BERT
 tokenizer, which is a `WordPiece <https://arxiv.org/pdf/1609.08144.pdf>`__ tokenizer:
@@ -120,8 +126,15 @@ because this is the way a :class:`~transformers.BertModel` is going to expect it
 Attention mask
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-The attention mask is an optional argument used when batching sequences together. This argument indicates to the model
-which tokens should be attended to, and which should not.
+The attention mask is an optional argument used when batching sequences together.
+
 .. raw:: html
   <iframe width="560" height="315" src="https://www.youtube.com/embed/M6adb1j2jPI" title="YouTube video player"
   frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope;
   picture-in-picture" allowfullscreen></iframe>
 This argument indicates to the model which tokens should be attended to, and which should not.
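As a rough illustration of the idea (a library-free sketch, not the tokenizer's actual implementation), the helper below pads a batch of token-id lists to a common length and builds the matching mask. The name ``pad_batch``, the pad id ``0``, and the sample ids are assumptions made for this example.

```python
from typing import Dict, List


def pad_batch(sequences: List[List[int]], pad_id: int = 0) -> Dict[str, List[List[int]]]:
    """Right-pad token-id lists and build the matching attention mask.

    Hypothetical helper for illustration only; pad_id=0 is an assumption.
    """
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks a real token to attend to, 0 marks padding to ignore.
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Made-up ids: a short sequence batched with a longer one.
batch = pad_batch([[101, 7592, 102], [101, 7592, 2088, 999, 102]])
print(batch["attention_mask"])  # [[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]]
```

In practice the tokenizer returns this mask itself under the key "attention_mask" when padding is requested, so a helper like this is only useful for seeing what the values mean.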
 For example, consider these two sequences:
@@ -175,10 +188,17 @@ in the dictionary returned by the tokenizer under the key "attention_mask":
 Token Type IDs
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-Some models' purpose is to do sequence classification or question answering. These require two different sequences to
-be joined in a single "input_ids" entry, which usually is performed with the help of special tokens, such as the
-classifier (``[CLS]``) and separator (``[SEP]``) tokens. For example, the BERT model builds its two sequence input as
-such:
+Some models' purpose is to do classification on pairs of sentences or question answering.
+
+.. raw:: html
+
   <iframe width="560" height="315" src="https://www.youtube.com/embed/0u3ioSwev3s" title="YouTube video player"
   frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope;
   picture-in-picture" allowfullscreen></iframe>
 These require two different sequences to be joined in a single "input_ids" entry, which usually is performed with the
 help of special tokens, such as the classifier (``[CLS]``) and separator (``[SEP]``) tokens. For example, the BERT
 model builds its two sequence input as such:
 .. code-block::
--- a/docs/source/model_sharing.rst
+++ b/docs/source/model_sharing.rst
@@ -16,6 +16,12 @@ Model sharing and uploading
 In this page, we will show you how to share a model you have trained or fine-tuned on new data with the community on
 the `model hub <https://huggingface.co/models>`__.
 .. raw:: html
   <iframe width="560" height="315" src="https://www.youtube.com/embed/XvSGPZFEjDY" title="YouTube video player"
   frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope;
   picture-in-picture" allowfullscreen></iframe>
 .. note::
    You will need to create an account on `huggingface.co <https://huggingface.co/join>`__ for this.
@@ -77,6 +83,12 @@ token that you can just copy.
 Directly push your model to the hub
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 .. raw:: html
   <iframe width="560" height="315" src="https://www.youtube.com/embed/Z1-XMy-GNLQ" title="YouTube video player"
   frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope;
   picture-in-picture" allowfullscreen></iframe>
 Once you have an API token (either stored in the cache or copied and pasted in your notebook), you can directly push a
 finetuned model you saved in :obj:`save_directory` by calling:
@@ -152,6 +164,12 @@ or
 Use your terminal and git
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. raw:: html
   <iframe width="560" height="315" src="https://www.youtube.com/embed/rkCly_cbMBk" title="YouTube video player"
   frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope;
   picture-in-picture" allowfullscreen></iframe>
 Basic steps
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
--- a/docs/source/model_summary.rst
+++ b/docs/source/model_summary.rst
@@ -28,6 +28,12 @@ Each one of the models in the library falls into one of the following categories
  * :ref:`multimodal-models`
  * :ref:`retrieval-based-models`
 .. raw:: html
   <iframe width="560" height="315" src="https://www.youtube.com/embed/H39Z_720T5s" title="YouTube video player"
   frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope;
   picture-in-picture" allowfullscreen></iframe>
 Autoregressive models are pretrained on the classic language modeling task: guess the next token having read all the
 previous ones. They correspond to the decoder of the original transformer model, and a mask is used on top of the full
 sentence so that the attention heads can only see what was before in the text, and not what’s after. Although those
@@ -54,12 +60,18 @@ Multimodal models mix text inputs with other kinds (e.g. images) and are more sp
 .. _autoregressive-models:
-Autoregressive models
+Decoders or autoregressive models
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 As mentioned before, these models rely on the decoder part of the original transformer and use an attention mask so
 that, at each position, the attention heads can only look at the tokens that come before.
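A minimal sketch of such a mask, written in plain Python rather than with any framework: row ``i`` lists which positions token ``i`` may attend to. The function name ``causal_mask`` is hypothetical, and letting each position see itself as well as the earlier ones is the usual convention but an assumption here, since the text only mentions the tokens before.

```python
from typing import List


def causal_mask(seq_len: int) -> List[List[int]]:
    """Return a lower-triangular mask: row i has 1s for positions 0..i, 0s after."""
    return [[1 if j <= i else 0 for j in range(seq_len)] for i in range(seq_len)]


# Each printed row is one query position; 0 means "hidden from this position".
for row in causal_mask(4):
    print(row)
```

Real implementations apply the same pattern to the attention scores, typically by setting the hidden entries to a large negative value before the softmax.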
 .. raw:: html
   <iframe width="560" height="315" src="https://www.youtube.com/embed/d_ixlCubqQw" title="YouTube video player"
   frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope;
   picture-in-picture" allowfullscreen></iframe>
 Original GPT
 -----------------------------------------------------------------------------------------------------------------------
@@ -215,13 +227,19 @@ multiple choice classification and question answering.
 .. _autoencoding-models:
-Autoencoding models
+Encoders or autoencoding models
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 As mentioned before, these models rely on the encoder part of the original transformer and use no mask so the model can
 look at all the tokens in the attention heads. For pretraining, targets are the original sentences and inputs are their
 corrupted versions.
 .. raw:: html
   <iframe width="560" height="315" src="https://www.youtube.com/embed/MUqNwgPjJvQ" title="YouTube video player"
   frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope;
   picture-in-picture" allowfullscreen></iframe>
 BERT
 -----------------------------------------------------------------------------------------------------------------------
@@ -526,6 +544,12 @@ Sequence-to-sequence models
 As mentioned before, these models keep both the encoder and the decoder of the original transformer.
 .. raw:: html
   <iframe width="560" height="315" src="https://www.youtube.com/embed/0_4KEb08xrE" title="YouTube video player"
   frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope;
   picture-in-picture" allowfullscreen></iframe>
 BART
 -----------------------------------------------------------------------------------------------------------------------
--- a/docs/source/preprocessing.rst
+++ b/docs/source/preprocessing.rst
@@ -39,6 +39,12 @@ To automatically download the vocab used during pretraining or fine-tuning a giv
 Base use
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. raw:: html
   <iframe width="560" height="315" src="https://www.youtube.com/embed/Yffk5aydLzg" title="YouTube video player"
   frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope;
   picture-in-picture" allowfullscreen></iframe>
 A :class:`~transformers.PreTrainedTokenizer` has many methods, but the only one you need to remember for preprocessing
 is its ``__call__``: you just need to feed your sentence to your tokenizer object.
@@ -138,6 +144,12 @@ can safely ignore it. You can also pass ``verbose=False`` to stop the tokenizer
 Preprocessing pairs of sentences
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. raw:: html
   <iframe width="560" height="315" src="https://www.youtube.com/embed/0u3ioSwev3s" title="YouTube video player"
   frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope;
   picture-in-picture" allowfullscreen></iframe>
 Sometimes you need to feed a pair of sentences to your model, for instance to classify whether two sentences in a pair
 are similar, or for question-answering models, which take a context and a question. For BERT models, the input
 is then represented like this: :obj:`[CLS] Sequence A [SEP] Sequence B [SEP]`
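The layout above can be sketched without the library, as a hedged illustration of what the tokenizer assembles for a pair. The helper name ``build_pair`` and the sample tokens are assumptions; the segment ids follow the BERT convention of ``0`` for the first sequence (with ``[CLS]`` and its ``[SEP]``) and ``1`` for the second.

```python
from typing import Dict, List


def build_pair(tokens_a: List[str], tokens_b: List[str]) -> Dict[str, list]:
    """Join two already-tokenized sequences as [CLS] A [SEP] B [SEP].

    Hypothetical helper for illustration; a real tokenizer also maps tokens to ids.
    """
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # Segment 0: [CLS] + sequence A + first [SEP]; segment 1: sequence B + final [SEP].
    token_type_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    return {"tokens": tokens, "token_type_ids": token_type_ids}


pair = build_pair(["how", "old", "are", "you", "?"], ["i", "am", "6"])
print(pair["tokens"])
print(pair["token_type_ids"])
```

With the library itself, passing the two sentences as two arguments to the tokenizer's ``__call__`` produces the equivalent encoded input.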
--- a/docs/source/quicktour.rst
+++ b/docs/source/quicktour.rst
@@ -28,8 +28,15 @@ will dig a little bit more and see how the library gives you access to those mod
 Getting started on a task with a pipeline
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-The easiest way to use a pretrained model on a given task is to use :func:`~transformers.pipeline`. 🤗 Transformers
-provides the following tasks out of the box:
+The easiest way to use a pretrained model on a given task is to use :func:`~transformers.pipeline`.
+
 .. raw:: html
   <iframe width="560" height="315" src="https://www.youtube.com/embed/tiZFewofSLM" title="YouTube video player"
   frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope;
   picture-in-picture" allowfullscreen></iframe>
 🤗 Transformers provides the following tasks out of the box:
 - Sentiment analysis: is a text positive or negative?
 - Text generation (in English): provide a prompt and the model will generate what follows.
@@ -137,8 +144,15 @@ to share your fine-tuned model on the hub with the community, using :doc:`this t
 Under the hood: pretrained models
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-Let's now see what happens beneath the hood when using those pipelines. As we saw, the model and tokenizer are created
-using the :obj:`from_pretrained` method:
+Let's now see what happens beneath the hood when using those pipelines.
+
 .. raw:: html
   <iframe width="560" height="315" src="https://www.youtube.com/embed/AhChOFRegn4" title="YouTube video player"
   frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope;
   picture-in-picture" allowfullscreen></iframe>
 As we saw, the model and tokenizer are created using the :obj:`from_pretrained` method:
 .. code-block::
--- a/docs/source/tokenizer_summary.rst
+++ b/docs/source/tokenizer_summary.rst
@@ -13,12 +13,20 @@
 Summary of the tokenizers
 -----------------------------------------------------------------------------------------------------------------------
-On this page, we will have a closer look at tokenization. As we saw in :doc:`the preprocessing tutorial
-<preprocessing>`, tokenizing a text is splitting it into words or subwords, which then are converted to ids through a
-look-up table. Converting words or subwords to ids is straightforward, so in this summary, we will focus on splitting a
-text into words or subwords (i.e. tokenizing a text). More specifically, we will look at the three main types of
-tokenizers used in 🤗 Transformers: :ref:`Byte-Pair Encoding (BPE) <byte-pair-encoding>`, :ref:`WordPiece <wordpiece>`,
-and :ref:`SentencePiece <sentencepiece>`, and show examples of which tokenizer type is used by which model.
+On this page, we will have a closer look at tokenization.
+
+.. raw:: html
+
+   <iframe width="560" height="315" src="https://www.youtube.com/embed/VFp38yj8h3A" title="YouTube video player"
+   frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope;
   picture-in-picture" allowfullscreen></iframe>
 As we saw in :doc:`the preprocessing tutorial <preprocessing>`, tokenizing a text is splitting it into words or
 subwords, which then are converted to ids through a look-up table. Converting words or subwords to ids is
 straightforward, so in this summary, we will focus on splitting a text into words or subwords (i.e. tokenizing a text).
 More specifically, we will look at the three main types of tokenizers used in 🤗 Transformers: :ref:`Byte-Pair Encoding
 (BPE) <byte-pair-encoding>`, :ref:`WordPiece <wordpiece>`, and :ref:`SentencePiece <sentencepiece>`, and show examples
 of which tokenizer type is used by which model.
 Note that on each model page, you can look at the documentation of the associated tokenizer to know which tokenizer
 type was used by the pretrained model. For instance, if we look at :class:`~transformers.BertTokenizer`, we can see
@@ -28,8 +36,15 @@ Introduction
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 Splitting a text into smaller chunks is a task that is harder than it looks, and there are multiple ways of doing so.
-For instance, let's look at the sentence ``"Don't you love 🤗 Transformers? We sure do."`` A simple way of tokenizing
-this text is to split it by spaces, which would give:
+For instance, let's look at the sentence ``"Don't you love 🤗 Transformers? We sure do."``
+
 .. raw:: html
   <iframe width="560" height="315" src="https://www.youtube.com/embed/nhJxYji1aho" title="YouTube video player"
   frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope;
   picture-in-picture" allowfullscreen></iframe>
 A simple way of tokenizing this text is to split it by spaces, which would give:
 .. code-block::
@@ -69,16 +84,30 @@ Such a big vocabulary size forces the model to have an enormous embedding matrix
 causes both an increased memory and time complexity. In general, transformers models rarely have a vocabulary size
 greater than 50,000, especially if they are pretrained only on a single language.
-So if simple space and punctuation tokenization is unsatisfactory, why not simply tokenize on characters? While
-character tokenization is very simple and would greatly reduce memory and time complexity it makes it much harder for
-the model to learn meaningful input representations. *E.g.* learning a meaningful context-independent representation
-for the letter ``"t"`` is much harder than learning a context-independent representation for the word ``"today"``.
-Therefore, character tokenization is often accompanied by a loss of performance. So to get the best of both worlds,
-transformers models use a hybrid between word-level and character-level tokenization called **subword** tokenization.
+So if simple space and punctuation tokenization is unsatisfactory, why not simply tokenize on characters?
+
+.. raw:: html
+
+   <iframe width="560" height="315" src="https://www.youtube.com/embed/ssLq_EK2jLE" title="YouTube video player"
+   frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope;
   picture-in-picture" allowfullscreen></iframe>
 While character tokenization is very simple and would greatly reduce memory and time complexity, it makes it much harder
 for the model to learn meaningful input representations. *E.g.* learning a meaningful context-independent
 representation for the letter ``"t"`` is much harder than learning a context-independent representation for the word
 ``"today"``. Therefore, character tokenization is often accompanied by a loss of performance. So to get the best of
 both worlds, transformers models use a hybrid between word-level and character-level tokenization called **subword**
 tokenization.
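To make the trade-off concrete, here is a small sketch that tokenizes the sample sentence used on this page in the two extreme ways, by spaces and by characters, and compares sequence length with the number of distinct symbols. It uses only plain Python string operations; the variable names are illustrative.

```python
text = "Don't you love 🤗 Transformers? We sure do."

# Word-level: split on spaces only, so punctuation stays attached to words.
word_tokens = text.split(" ")

# Character-level: every character (spaces excluded here) is its own token.
char_tokens = [ch for ch in text if ch != " "]

# Character tokenization yields a longer sequence of less informative symbols.
print(len(word_tokens), "word tokens:", word_tokens)
print(len(char_tokens), "character tokens, e.g.:", char_tokens[:5])
print(len(set(char_tokens)), "distinct characters in this sentence")
```

On a real corpus the set of distinct characters stays small while the set of distinct words keeps growing, which is the memory argument made above; subword tokenization sits between the two.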
 Subword tokenization
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 .. raw:: html
   <iframe width="560" height="315" src="https://www.youtube.com/embed/zHvTiHr506c" title="YouTube video player"
   frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope;
   picture-in-picture" allowfullscreen></iframe>
 Subword tokenization algorithms rely on the principle that frequently used words should not be split into smaller
 subwords, but rare words should be decomposed into meaningful subwords. For instance ``"annoyingly"`` might be
 considered a rare word and could be decomposed into ``"annoying"`` and ``"ly"``. Both ``"annoying"`` and ``"ly"`` as
--- a/docs/source/training.rst
+++ b/docs/source/training.rst
@@ -27,6 +27,12 @@ negative. For examples of other tasks, refer to the :ref:`additional-resources`
 Preparing the datasets
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 .. raw:: html
   <iframe width="560" height="315" src="https://www.youtube.com/embed/_BZearw7f0w" title="YouTube video player"
   frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope;
   picture-in-picture" allowfullscreen></iframe>
 We will use the `🤗 Datasets <https://github.com/huggingface/datasets/>`__ library to download and preprocess the IMDB
 datasets. We will go over this part pretty quickly. Since the focus of this tutorial is on training, you should refer
 to the 🤗 Datasets `documentation <https://huggingface.co/docs/datasets/>`__ or the :doc:`preprocessing` tutorial for
@@ -95,6 +101,12 @@ them by their `full` equivalent to train or evaluate on the full dataset.
 Fine-tuning in PyTorch with the Trainer API
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 .. raw:: html
   <iframe width="560" height="315" src="https://www.youtube.com/embed/nvBXf7s7vTI" title="YouTube video player"
   frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope;
   picture-in-picture" allowfullscreen></iframe>
 Since PyTorch does not provide a training loop, the 🤗 Transformers library provides a :class:`~transformers.Trainer`
 API that is optimized for 🤗 Transformers models, with a wide range of training options and with built-in features like
 logging, gradient accumulation, and mixed precision.
@@ -200,6 +212,12 @@ See the documentation of :class:`~transformers.TrainingArguments` for more optio
 Fine-tuning with Keras
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 .. raw:: html
   <iframe width="560" height="315" src="https://www.youtube.com/embed/rnTGBy2ax1c" title="YouTube video player"
   frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope;
   picture-in-picture" allowfullscreen></iframe>
 Models can also be trained natively in TensorFlow using the Keras API. First, let's define our model:
 .. code-block:: python
@@ -257,6 +275,12 @@ as a PyTorch model (or vice-versa):
 Fine-tuning in native PyTorch
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 .. raw:: html
   <iframe width="560" height="315" src="https://www.youtube.com/embed/Dh9CL8fyG80" title="YouTube video player"
   frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope;
   picture-in-picture" allowfullscreen></iframe>
 You might need to restart your notebook at this stage to free some memory, or execute the following code:
 .. code-block:: python