From a55dc157e3667b7c1e8696fbbb5d94df46e7a354 Mon Sep 17 00:00:00 2001
From: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Date: Tue, 15 Jun 2021 06:37:37 -0400
Subject: [PATCH] Add video links to the documentation (#12162)
---
docs/source/glossary.rst | 32 +++++++++++++----
docs/source/model_sharing.rst | 18 ++++++++++
docs/source/model_summary.rst | 28 +++++++++++++--
docs/source/preprocessing.rst | 12 +++++++
docs/source/quicktour.rst | 22 +++++++++---
docs/source/tokenizer_summary.rst | 57 +++++++++++++++++++++++--------
docs/source/training.rst | 24 +++++++++++++
7 files changed, 167 insertions(+), 26 deletions(-)
diff --git a/docs/source/glossary.rst b/docs/source/glossary.rst
index 8080e5916e..d95ed105cf 100644
--- a/docs/source/glossary.rst
+++ b/docs/source/glossary.rst
@@ -55,6 +55,12 @@ Input IDs
The input ids are often the only required parameters to be passed to the model as input. *They are token indices,
numerical representations of tokens building the sequences that will be used as input by the model*.
+.. raw:: html
+
+
+
Each tokenizer works differently but the underlying mechanism remains the same. Here's an example using the BERT
tokenizer, which is a `WordPiece `__ tokenizer:
@@ -120,8 +126,15 @@ because this is the way a :class:`~transformers.BertModel` is going to expect it
Attention mask
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-The attention mask is an optional argument used when batching sequences together. This argument indicates to the model
-which tokens should be attended to, and which should not.
+The attention mask is an optional argument used when batching sequences together.
+
+.. raw:: html
+
+
+
+This argument indicates to the model which tokens should be attended to, and which should not.
For example, consider these two sequences:
@@ -175,10 +188,17 @@ in the dictionary returned by the tokenizer under the key "attention_mask":
Token Type IDs
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-Some models' purpose is to do sequence classification or question answering. These require two different sequences to
-be joined in a single "input_ids" entry, which usually is performed with the help of special tokens, such as the
-classifier (``[CLS]``) and separator (``[SEP]``) tokens. For example, the BERT model builds its two sequence input as
-such:
+Some models' purpose is to do classification on pairs of sentences or question answering.
+
+.. raw:: html
+
+
+
+These require two different sequences to be joined in a single "input_ids" entry, which usually is performed with the
+help of special tokens, such as the classifier (``[CLS]``) and separator (``[SEP]``) tokens. For example, the BERT
+model builds its two sequence input as such:
.. code-block::
diff --git a/docs/source/model_sharing.rst b/docs/source/model_sharing.rst
index 06bd09f613..1f24e590f8 100644
--- a/docs/source/model_sharing.rst
+++ b/docs/source/model_sharing.rst
@@ -16,6 +16,12 @@ Model sharing and uploading
In this page, we will show you how to share a model you have trained or fine-tuned on new data with the community on
the `model hub `__.
+.. raw:: html
+
+
+
.. note::
You will need to create an account on `huggingface.co `__ for this.
@@ -77,6 +83,12 @@ token that you can just copy.
Directly push your model to the hub
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+.. raw:: html
+
+
+
Once you have an API token (either stored in the cache or copied and pasted in your notebook), you can directly push a
finetuned model you saved in :obj:`save_directory` by calling:
@@ -152,6 +164,12 @@ or
Use your terminal and git
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+.. raw:: html
+
+
+
Basic steps
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
diff --git a/docs/source/model_summary.rst b/docs/source/model_summary.rst
index af0c190d3f..d76c871fc0 100644
--- a/docs/source/model_summary.rst
+++ b/docs/source/model_summary.rst
@@ -28,6 +28,12 @@ Each one of the models in the library falls into one of the following categories
* :ref:`multimodal-models`
* :ref:`retrieval-based-models`
+.. raw:: html
+
+
+
Autoregressive models are pretrained on the classic language modeling task: guess the next token having read all the
previous ones. They correspond to the decoder of the original transformer model, and a mask is used on top of the full
sentence so that the attention heads can only see what was before in the text, and not what's after. Although those
@@ -54,12 +60,18 @@ Multimodal models mix text inputs with other kinds (e.g. images) and are more sp
.. _autoregressive-models:
-Autoregressive models
+Decoders or autoregressive models
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
As mentioned before, these models rely on the decoder part of the original transformer and use an attention mask so
that at each position, the attention heads can only look at the preceding tokens.
+.. raw:: html
+
+
+
Original GPT
-----------------------------------------------------------------------------------------------------------------------
@@ -215,13 +227,19 @@ multiple choice classification and question answering.
.. _autoencoding-models:
-Autoencoding models
+Encoders or autoencoding models
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
As mentioned before, these models rely on the encoder part of the original transformer and use no mask so the model can
look at all the tokens in the attention heads. For pretraining, targets are the original sentences and inputs are their
corrupted versions.
+.. raw:: html
+
+
+
BERT
-----------------------------------------------------------------------------------------------------------------------
@@ -526,6 +544,12 @@ Sequence-to-sequence models
As mentioned before, these models keep both the encoder and the decoder of the original transformer.
+.. raw:: html
+
+
+
BART
-----------------------------------------------------------------------------------------------------------------------
diff --git a/docs/source/preprocessing.rst b/docs/source/preprocessing.rst
index 773f84783d..d0af7d2506 100644
--- a/docs/source/preprocessing.rst
+++ b/docs/source/preprocessing.rst
@@ -39,6 +39,12 @@ To automatically download the vocab used during pretraining or fine-tuning a giv
Base use
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+.. raw:: html
+
+
+
A :class:`~transformers.PreTrainedTokenizer` has many methods, but the only one you need to remember for preprocessing
is its ``__call__``: you just need to feed your sentence to your tokenizer object.
@@ -138,6 +144,12 @@ can safely ignore it. You can also pass ``verbose=False`` to stop the tokenizer
Preprocessing pairs of sentences
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+.. raw:: html
+
+
+
Sometimes you need to feed a pair of sentences to your model. For instance, if you want to classify if two sentences in
a pair are similar, or for question-answering models, which take a context and a question. For BERT models, the input
is then represented like this: :obj:`[CLS] Sequence A [SEP] Sequence B [SEP]`
diff --git a/docs/source/quicktour.rst b/docs/source/quicktour.rst
index 0e649b4c58..b8d6889b87 100644
--- a/docs/source/quicktour.rst
+++ b/docs/source/quicktour.rst
@@ -28,8 +28,15 @@ will dig a little bit more and see how the library gives you access to those mod
Getting started on a task with a pipeline
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-The easiest way to use a pretrained model on a given task is to use :func:`~transformers.pipeline`. 🤗 Transformers
-provides the following tasks out of the box:
+The easiest way to use a pretrained model on a given task is to use :func:`~transformers.pipeline`.
+
+.. raw:: html
+
+
+
+🤗 Transformers provides the following tasks out of the box:
- Sentiment analysis: is a text positive or negative?
- Text generation (in English): provide a prompt and the model will generate what follows.
@@ -137,8 +144,15 @@ to share your fine-tuned model on the hub with the community, using :doc:`this t
Under the hood: pretrained models
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-Let's now see what happens beneath the hood when using those pipelines. As we saw, the model and tokenizer are created
-using the :obj:`from_pretrained` method:
+Let's now see what happens beneath the hood when using those pipelines.
+
+.. raw:: html
+
+
+
+As we saw, the model and tokenizer are created using the :obj:`from_pretrained` method:
.. code-block::
diff --git a/docs/source/tokenizer_summary.rst b/docs/source/tokenizer_summary.rst
index 44f0d86e6c..31982383b1 100644
--- a/docs/source/tokenizer_summary.rst
+++ b/docs/source/tokenizer_summary.rst
@@ -13,12 +13,20 @@
Summary of the tokenizers
-----------------------------------------------------------------------------------------------------------------------
-On this page, we will have a closer look at tokenization. As we saw in :doc:`the preprocessing tutorial
-`, tokenizing a text is splitting it into words or subwords, which then are converted to ids through a
-look-up table. Converting words or subwords to ids is straightforward, so in this summary, we will focus on splitting a
-text into words or subwords (i.e. tokenizing a text). More specifically, we will look at the three main types of
-tokenizers used in 🤗 Transformers: :ref:`Byte-Pair Encoding (BPE) `, :ref:`WordPiece `,
-and :ref:`SentencePiece `, and show examples of which tokenizer type is used by which model.
+On this page, we will have a closer look at tokenization.
+
+.. raw:: html
+
+
+
+As we saw in :doc:`the preprocessing tutorial `, tokenizing a text is splitting it into words or
+subwords, which then are converted to ids through a look-up table. Converting words or subwords to ids is
+straightforward, so in this summary, we will focus on splitting a text into words or subwords (i.e. tokenizing a text).
+More specifically, we will look at the three main types of tokenizers used in 🤗 Transformers: :ref:`Byte-Pair Encoding
+(BPE) `, :ref:`WordPiece `, and :ref:`SentencePiece `, and show examples
+of which tokenizer type is used by which model.
Note that on each model page, you can look at the documentation of the associated tokenizer to know which tokenizer
type was used by the pretrained model. For instance, if we look at :class:`~transformers.BertTokenizer`, we can see
@@ -28,8 +36,15 @@ Introduction
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Splitting a text into smaller chunks is a task that is harder than it looks, and there are multiple ways of doing so.
-For instance, let's look at the sentence ``"Don't you love 🤗 Transformers? We sure do."`` A simple way of tokenizing
-this text is to split it by spaces, which would give:
+For instance, let's look at the sentence ``"Don't you love 🤗 Transformers? We sure do."``
+
+.. raw:: html
+
+
+
+A simple way of tokenizing this text is to split it by spaces, which would give:
.. code-block::
@@ -69,16 +84,30 @@ Such a big vocabulary size forces the model to have an enormous embedding matrix
causes both an increased memory and time complexity. In general, transformers models rarely have a vocabulary size
greater than 50,000, especially if they are pretrained only on a single language.
-So if simple space and punctuation tokenization is unsatisfactory, why not simply tokenize on characters? While
-character tokenization is very simple and would greatly reduce memory and time complexity it makes it much harder for
-the model to learn meaningful input representations. *E.g.* learning a meaningful context-independent representation
-for the letter ``"t"`` is much harder than learning a context-independent representation for the word ``"today"``.
-Therefore, character tokenization is often accompanied by a loss of performance. So to get the best of both worlds,
-transformers models use a hybrid between word-level and character-level tokenization called **subword** tokenization.
+So if simple space and punctuation tokenization is unsatisfactory, why not simply tokenize on characters?
+
+.. raw:: html
+
+
+
+While character tokenization is very simple and would greatly reduce memory and time complexity, it makes it much harder
+for the model to learn meaningful input representations. *E.g.* learning a meaningful context-independent
+representation for the letter ``"t"`` is much harder than learning a context-independent representation for the word
+``"today"``. Therefore, character tokenization is often accompanied by a loss of performance. So to get the best of
+both worlds, transformers models use a hybrid between word-level and character-level tokenization called **subword**
+tokenization.
Subword tokenization
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+.. raw:: html
+
+
+
Subword tokenization algorithms rely on the principle that frequently used words should not be split into smaller
subwords, but rare words should be decomposed into meaningful subwords. For instance ``"annoyingly"`` might be
considered a rare word and could be decomposed into ``"annoying"`` and ``"ly"``. Both ``"annoying"`` and ``"ly"`` as
diff --git a/docs/source/training.rst b/docs/source/training.rst
index 7da4062b71..18863f2a47 100644
--- a/docs/source/training.rst
+++ b/docs/source/training.rst
@@ -27,6 +27,12 @@ negative. For examples of other tasks, refer to the :ref:`additional-resources`
Preparing the datasets
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+.. raw:: html
+
+
+
We will use the `🤗 Datasets `__ library to download and preprocess the IMDB
datasets. We will go over this part pretty quickly. Since the focus of this tutorial is on training, you should refer
to the 🤗 Datasets `documentation `__ or the :doc:`preprocessing` tutorial for
@@ -95,6 +101,12 @@ them by their `full` equivalent to train or evaluate on the full dataset.
Fine-tuning in PyTorch with the Trainer API
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+.. raw:: html
+
+
+
Since PyTorch does not provide a training loop, the 🤗 Transformers library provides a :class:`~transformers.Trainer`
API that is optimized for 🤗 Transformers models, with a wide range of training options and with built-in features like
logging, gradient accumulation, and mixed precision.
@@ -200,6 +212,12 @@ See the documentation of :class:`~transformers.TrainingArguments` for more optio
Fine-tuning with Keras
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+.. raw:: html
+
+
+
Models can also be trained natively in TensorFlow using the Keras API. First, let's define our model:
.. code-block:: python
@@ -257,6 +275,12 @@ as a PyTorch model (or vice-versa):
Fine-tuning in native PyTorch
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+.. raw:: html
+
+
+
You might need to restart your notebook at this stage to free some memory, or execute the following code:
.. code-block:: python