Add video links to the documentation (#12162)
@@ -55,6 +55,12 @@ Input IDs
|
|||||||
The input ids are often the only required parameters to be passed to the model as input. *They are token indices,
|
The input ids are often the only required parameters to be passed to the model as input. *They are token indices,
|
||||||
numerical representations of tokens building the sequences that will be used as input by the model*.
|
numerical representations of tokens building the sequences that will be used as input by the model*.
|
||||||
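To make "token indices" concrete, here is a minimal sketch of the look-up step, assuming a tiny hand-written vocabulary. The names ``toy_vocab`` and ``tokens_to_ids`` are hypothetical and are not part of the library; a real tokenizer ships its own, much larger vocabulary.

```python
# Hypothetical toy vocabulary; real vocabularies hold tens of thousands of entries.
toy_vocab = {"[UNK]": 0, "a": 1, "titan": 2, "rtx": 3, "has": 4, "24gb": 5, "of": 6, "memory": 7}


def tokens_to_ids(tokens, vocab):
    """Map each token to its index, falling back to the unknown-token index."""
    unk_id = vocab["[UNK]"]
    return [vocab.get(token, unk_id) for token in tokens]


tokens = "a titan rtx has 24gb of vram".split()
input_ids = tokens_to_ids(tokens, toy_vocab)
print(input_ids)  # "vram" is not in the toy vocabulary, so it maps to [UNK]
```

The resulting list of integers is what the text above calls the input ids.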
|
|
||||||
|
.. raw:: html
|
||||||
|
|
||||||
|
<iframe width="560" height="315" src="https://www.youtube.com/embed/VFp38yj8h3A" title="YouTube video player"
|
||||||
|
frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope;
|
||||||
|
picture-in-picture" allowfullscreen></iframe>
|
||||||
|
|
||||||
Each tokenizer works differently but the underlying mechanism remains the same. Here's an example using the BERT
|
Each tokenizer works differently but the underlying mechanism remains the same. Here's an example using the BERT
|
||||||
tokenizer, which is a `WordPiece <https://arxiv.org/pdf/1609.08144.pdf>`__ tokenizer:
|
tokenizer, which is a `WordPiece <https://arxiv.org/pdf/1609.08144.pdf>`__ tokenizer:
|
||||||
|
|
||||||
@@ -120,8 +126,15 @@ because this is the way a :class:`~transformers.BertModel` is going to expect it
|
|||||||
Attention mask
|
Attention mask
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||||
|
|
||||||
The attention mask is an optional argument used when batching sequences together. This argument indicates to the model
|
The attention mask is an optional argument used when batching sequences together.
|
||||||
which tokens should be attended to, and which should not.
|
|
||||||
|
.. raw:: html
|
||||||
|
|
||||||
|
<iframe width="560" height="315" src="https://www.youtube.com/embed/M6adb1j2jPI" title="YouTube video player"
|
||||||
|
frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope;
|
||||||
|
picture-in-picture" allowfullscreen></iframe>
|
||||||
|
|
||||||
|
This argument indicates to the model which tokens should be attended to, and which should not.
|
||||||
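As a rough sketch of how such a mask can be derived when sequences of different lengths are batched, the helper below pads every sequence on the right and marks real tokens with 1 and padding with 0. The function name ``pad_batch``, the padding id 0, and the example ids are assumptions made for illustration only.

```python
def pad_batch(sequences, pad_id=0):
    """Right-pad sequences to a common length and build the matching mask."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(list(seq) + [pad_id] * n_pad)
        # 1 = attend to this position, 0 = padding that should be ignored
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = pad_batch([[101, 7, 8, 102], [101, 7, 8, 9, 10, 11, 102]])
print(batch["attention_mask"])
```

In practice the tokenizer produces this mask itself when asked to pad; the sketch only shows what the values mean.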
|
|
||||||
For example, consider these two sequences:
|
For example, consider these two sequences:
|
||||||
|
|
||||||
@@ -175,10 +188,17 @@ in the dictionary returned by the tokenizer under the key "attention_mask":
|
|||||||
Token Type IDs
|
Token Type IDs
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||||
|
|
||||||
Some models' purpose is to do sequence classification or question answering. These require two different sequences to
|
Some models' purpose is to do classification on pairs of sentences or question answering.
|
||||||
be joined in a single "input_ids" entry, which usually is performed with the help of special tokens, such as the
|
|
||||||
classifier (``[CLS]``) and separator (``[SEP]``) tokens. For example, the BERT model builds its two sequence input as
|
.. raw:: html
|
||||||
such:
|
|
||||||
|
<iframe width="560" height="315" src="https://www.youtube.com/embed/0u3ioSwev3s" title="YouTube video player"
|
||||||
|
frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope;
|
||||||
|
picture-in-picture" allowfullscreen></iframe>
|
||||||
|
|
||||||
|
These require two different sequences to be joined in a single "input_ids" entry, which usually is performed with the
|
||||||
|
help of special tokens, such as the classifier (``[CLS]``) and separator (``[SEP]``) tokens. For example, the BERT
|
||||||
|
model builds its two sequence input as such:
|
||||||
|
|
||||||
.. code-block::
|
.. code-block::
|
||||||
|
|
||||||
|
|||||||
@@ -16,6 +16,12 @@ Model sharing and uploading
|
|||||||
In this page, we will show you how to share a model you have trained or fine-tuned on new data with the community on
|
In this page, we will show you how to share a model you have trained or fine-tuned on new data with the community on
|
||||||
the `model hub <https://huggingface.co/models>`__.
|
the `model hub <https://huggingface.co/models>`__.
|
||||||
|
|
||||||
|
.. raw:: html
|
||||||
|
|
||||||
|
<iframe width="560" height="315" src="https://www.youtube.com/embed/XvSGPZFEjDY" title="YouTube video player"
|
||||||
|
frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope;
|
||||||
|
picture-in-picture" allowfullscreen></iframe>
|
||||||
|
|
||||||
.. note::
|
.. note::
|
||||||
|
|
||||||
You will need to create an account on `huggingface.co <https://huggingface.co/join>`__ for this.
|
You will need to create an account on `huggingface.co <https://huggingface.co/join>`__ for this.
|
||||||
@@ -77,6 +83,12 @@ token that you can just copy.
|
|||||||
Directly push your model to the hub
|
Directly push your model to the hub
|
||||||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||||
|
|
||||||
|
.. raw:: html
|
||||||
|
|
||||||
|
<iframe width="560" height="315" src="https://www.youtube.com/embed/Z1-XMy-GNLQ" title="YouTube video player"
|
||||||
|
frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope;
|
||||||
|
picture-in-picture" allowfullscreen></iframe>
|
||||||
|
|
||||||
Once you have an API token (either stored in the cache or copied and pasted in your notebook), you can directly push a
|
Once you have an API token (either stored in the cache or copied and pasted in your notebook), you can directly push a
|
||||||
fine-tuned model you saved in :obj:`save_directory` by calling:
|
fine-tuned model you saved in :obj:`save_directory` by calling:
|
||||||
|
|
||||||
@@ -152,6 +164,12 @@ or
|
|||||||
Use your terminal and git
|
Use your terminal and git
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||||
|
|
||||||
|
.. raw:: html
|
||||||
|
|
||||||
|
<iframe width="560" height="315" src="https://www.youtube.com/embed/rkCly_cbMBk" title="YouTube video player"
|
||||||
|
frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope;
|
||||||
|
picture-in-picture" allowfullscreen></iframe>
|
||||||
|
|
||||||
Basic steps
|
Basic steps
|
||||||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||||
|
|
||||||
|
|||||||
@@ -28,6 +28,12 @@ Each one of the models in the library falls into one of the following categories
|
|||||||
* :ref:`multimodal-models`
|
* :ref:`multimodal-models`
|
||||||
* :ref:`retrieval-based-models`
|
* :ref:`retrieval-based-models`
|
||||||
|
|
||||||
|
.. raw:: html
|
||||||
|
|
||||||
|
<iframe width="560" height="315" src="https://www.youtube.com/embed/H39Z_720T5s" title="YouTube video player"
|
||||||
|
frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope;
|
||||||
|
picture-in-picture" allowfullscreen></iframe>
|
||||||
|
|
||||||
Autoregressive models are pretrained on the classic language modeling task: guess the next token having read all the
|
Autoregressive models are pretrained on the classic language modeling task: guess the next token having read all the
|
||||||
previous ones. They correspond to the decoder of the original transformer model, and a mask is used on top of the full
|
previous ones. They correspond to the decoder of the original transformer model, and a mask is used on top of the full
|
||||||
sentence so that the attention heads can only see what was before in the text, and not what’s after. Although those
|
sentence so that the attention heads can only see what was before in the text, and not what’s after. Although those
|
||||||
@@ -54,12 +60,18 @@ Multimodal models mix text inputs with other kinds (e.g. images) and are more sp
|
|||||||
|
|
||||||
.. _autoregressive-models:
|
.. _autoregressive-models:
|
||||||
|
|
||||||
Autoregressive models
|
Decoders or autoregressive models
|
||||||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||||
|
|
||||||
As mentioned before, these models rely on the decoder part of the original transformer and use an attention mask so
|
As mentioned before, these models rely on the decoder part of the original transformer and use an attention mask so
|
||||||
that at each position, the attention heads can only look at the preceding tokens.
|
that at each position, the attention heads can only look at the preceding tokens.
|
||||||
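The masking described above can be pictured as a lower-triangular matrix. The following is a minimal pure-Python sketch, not the library's implementation; ``causal_mask`` is a hypothetical helper, and it assumes that each position may also attend to itself, as is common in decoder self-attention.

```python
def causal_mask(seq_len):
    """Row i marks the positions that position i is allowed to attend to."""
    return [[1 if j <= i else 0 for j in range(seq_len)] for i in range(seq_len)]


mask = causal_mask(4)
for row in mask:
    print(row)
```

Each row gains one more visible position than the previous one, so no position can see tokens that come later in the text.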
|
|
||||||
|
.. raw:: html
|
||||||
|
|
||||||
|
<iframe width="560" height="315" src="https://www.youtube.com/embed/d_ixlCubqQw" title="YouTube video player"
|
||||||
|
frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope;
|
||||||
|
picture-in-picture" allowfullscreen></iframe>
|
||||||
|
|
||||||
Original GPT
|
Original GPT
|
||||||
-----------------------------------------------------------------------------------------------------------------------
|
-----------------------------------------------------------------------------------------------------------------------
|
||||||
|
|
||||||
@@ -215,13 +227,19 @@ multiple choice classification and question answering.
|
|||||||
|
|
||||||
.. _autoencoding-models:
|
.. _autoencoding-models:
|
||||||
|
|
||||||
Autoencoding models
|
Encoders or autoencoding models
|
||||||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||||
|
|
||||||
As mentioned before, these models rely on the encoder part of the original transformer and use no mask so the model can
|
As mentioned before, these models rely on the encoder part of the original transformer and use no mask so the model can
|
||||||
look at all the tokens in the attention heads. For pretraining, targets are the original sentences and inputs are their
|
look at all the tokens in the attention heads. For pretraining, targets are the original sentences and inputs are their
|
||||||
corrupted versions.
|
corrupted versions.
|
||||||
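One common way to corrupt the inputs is to replace some tokens with a mask token. The sketch below assumes that scheme; the helper name ``corrupt``, the ``[MASK]`` string, and the 15% default rate are illustrative choices rather than a description of any particular model's exact recipe.

```python
import random


def corrupt(tokens, mask_token="[MASK]", mask_prob=0.15, seed=0):
    """Return (corrupted inputs, original targets) for one sentence."""
    rng = random.Random(seed)  # seeded so that runs are repeatable
    corrupted = [mask_token if rng.random() < mask_prob else token for token in tokens]
    return corrupted, list(tokens)


sentence = "the quick brown fox jumps over the lazy dog".split()
inputs, targets = corrupt(sentence)
print(inputs)
print(targets)
```

The model sees ``inputs`` and is trained to recover ``targets``, which is the pairing of corrupted input and original sentence described above.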
|
|
||||||
|
.. raw:: html
|
||||||
|
|
||||||
|
<iframe width="560" height="315" src="https://www.youtube.com/embed/MUqNwgPjJvQ" title="YouTube video player"
|
||||||
|
frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope;
|
||||||
|
picture-in-picture" allowfullscreen></iframe>
|
||||||
|
|
||||||
BERT
|
BERT
|
||||||
-----------------------------------------------------------------------------------------------------------------------
|
-----------------------------------------------------------------------------------------------------------------------
|
||||||
|
|
||||||
@@ -526,6 +544,12 @@ Sequence-to-sequence models
|
|||||||
|
|
||||||
As mentioned before, these models keep both the encoder and the decoder of the original transformer.
|
As mentioned before, these models keep both the encoder and the decoder of the original transformer.
|
||||||
|
|
||||||
|
.. raw:: html
|
||||||
|
|
||||||
|
<iframe width="560" height="315" src="https://www.youtube.com/embed/0_4KEb08xrE" title="YouTube video player"
|
||||||
|
frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope;
|
||||||
|
picture-in-picture" allowfullscreen></iframe>
|
||||||
|
|
||||||
BART
|
BART
|
||||||
-----------------------------------------------------------------------------------------------------------------------
|
-----------------------------------------------------------------------------------------------------------------------
|
||||||
|
|
||||||
|
|||||||
@@ -39,6 +39,12 @@ To automatically download the vocab used during pretraining or fine-tuning a giv
|
|||||||
Base use
|
Base use
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||||
|
|
||||||
|
.. raw:: html
|
||||||
|
|
||||||
|
<iframe width="560" height="315" src="https://www.youtube.com/embed/Yffk5aydLzg" title="YouTube video player"
|
||||||
|
frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope;
|
||||||
|
picture-in-picture" allowfullscreen></iframe>
|
||||||
|
|
||||||
A :class:`~transformers.PreTrainedTokenizer` has many methods, but the only one you need to remember for preprocessing
|
A :class:`~transformers.PreTrainedTokenizer` has many methods, but the only one you need to remember for preprocessing
|
||||||
is its ``__call__``: you just need to feed your sentence to your tokenizer object.
|
is its ``__call__``: you just need to feed your sentence to your tokenizer object.
|
||||||
|
|
||||||
@@ -138,6 +144,12 @@ can safely ignore it. You can also pass ``verbose=False`` to stop the tokenizer
|
|||||||
Preprocessing pairs of sentences
|
Preprocessing pairs of sentences
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||||
|
|
||||||
|
.. raw:: html
|
||||||
|
|
||||||
|
<iframe width="560" height="315" src="https://www.youtube.com/embed/0u3ioSwev3s" title="YouTube video player"
|
||||||
|
frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope;
|
||||||
|
picture-in-picture" allowfullscreen></iframe>
|
||||||
|
|
||||||
Sometimes you need to feed a pair of sentences to your model. For instance, if you want to classify if two sentences in
|
Sometimes you need to feed a pair of sentences to your model. For instance, if you want to classify if two sentences in
|
||||||
a pair are similar, or for question-answering models, which take a context and a question. For BERT models, the input
|
a pair are similar, or for question-answering models, which take a context and a question. For BERT models, the input
|
||||||
is then represented like this: :obj:`[CLS] Sequence A [SEP] Sequence B [SEP]`
|
is then represented like this: :obj:`[CLS] Sequence A [SEP] Sequence B [SEP]`
|
||||||
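The layout above can be sketched in a few lines of plain Python. ``build_pair`` is a hypothetical helper, not a library function; the 0/1 segment labels follow the convention BERT uses for token type IDs, where the first sequence (with ``[CLS]`` and the first ``[SEP]``) is labeled 0 and the second sequence (with the final ``[SEP]``) is labeled 1.

```python
def build_pair(tokens_a, tokens_b, cls="[CLS]", sep="[SEP]"):
    """Join two token lists as [CLS] A [SEP] B [SEP] and label each segment."""
    tokens = [cls] + tokens_a + [sep] + tokens_b + [sep]
    token_type_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    return tokens, token_type_ids


pair_tokens, pair_types = build_pair(["how", "old", "are", "you"], ["i", "am", "six"])
print(pair_tokens)
print(pair_types)
```

In practice the tokenizer builds this structure for you when it is called with two sentences; the sketch only illustrates the result.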
|
|||||||
@@ -28,8 +28,15 @@ will dig a little bit more and see how the library gives you access to those mod
|
|||||||
Getting started on a task with a pipeline
|
Getting started on a task with a pipeline
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||||
|
|
||||||
The easiest way to use a pretrained model on a given task is to use :func:`~transformers.pipeline`. 🤗 Transformers
|
The easiest way to use a pretrained model on a given task is to use :func:`~transformers.pipeline`.
|
||||||
provides the following tasks out of the box:
|
|
||||||
|
.. raw:: html
|
||||||
|
|
||||||
|
<iframe width="560" height="315" src="https://www.youtube.com/embed/tiZFewofSLM" title="YouTube video player"
|
||||||
|
frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope;
|
||||||
|
picture-in-picture" allowfullscreen></iframe>
|
||||||
|
|
||||||
|
🤗 Transformers provides the following tasks out of the box:
|
||||||
|
|
||||||
- Sentiment analysis: is a text positive or negative?
|
- Sentiment analysis: is a text positive or negative?
|
||||||
- Text generation (in English): provide a prompt and the model will generate what follows.
|
- Text generation (in English): provide a prompt and the model will generate what follows.
|
||||||
@@ -137,8 +144,15 @@ to share your fine-tuned model on the hub with the community, using :doc:`this t
|
|||||||
Under the hood: pretrained models
|
Under the hood: pretrained models
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||||
|
|
||||||
Let's now see what happens beneath the hood when using those pipelines. As we saw, the model and tokenizer are created
|
Let's now see what happens beneath the hood when using those pipelines.
|
||||||
using the :obj:`from_pretrained` method:
|
|
||||||
|
.. raw:: html
|
||||||
|
|
||||||
|
<iframe width="560" height="315" src="https://www.youtube.com/embed/AhChOFRegn4" title="YouTube video player"
|
||||||
|
frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope;
|
||||||
|
picture-in-picture" allowfullscreen></iframe>
|
||||||
|
|
||||||
|
As we saw, the model and tokenizer are created using the :obj:`from_pretrained` method:
|
||||||
|
|
||||||
.. code-block::
|
.. code-block::
|
||||||
|
|
||||||
|
|||||||
@@ -13,12 +13,20 @@
|
|||||||
Summary of the tokenizers
|
Summary of the tokenizers
|
||||||
-----------------------------------------------------------------------------------------------------------------------
|
-----------------------------------------------------------------------------------------------------------------------
|
||||||
|
|
||||||
On this page, we will have a closer look at tokenization. As we saw in :doc:`the preprocessing tutorial
|
On this page, we will have a closer look at tokenization.
|
||||||
<preprocessing>`, tokenizing a text is splitting it into words or subwords, which then are converted to ids through a
|
|
||||||
look-up table. Converting words or subwords to ids is straightforward, so in this summary, we will focus on splitting a
|
.. raw:: html
|
||||||
text into words or subwords (i.e. tokenizing a text). More specifically, we will look at the three main types of
|
|
||||||
tokenizers used in 🤗 Transformers: :ref:`Byte-Pair Encoding (BPE) <byte-pair-encoding>`, :ref:`WordPiece <wordpiece>`,
|
<iframe width="560" height="315" src="https://www.youtube.com/embed/VFp38yj8h3A" title="YouTube video player"
|
||||||
and :ref:`SentencePiece <sentencepiece>`, and show examples of which tokenizer type is used by which model.
|
frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope;
|
||||||
|
picture-in-picture" allowfullscreen></iframe>
|
||||||
|
|
||||||
|
As we saw in :doc:`the preprocessing tutorial <preprocessing>`, tokenizing a text is splitting it into words or
|
||||||
|
subwords, which then are converted to ids through a look-up table. Converting words or subwords to ids is
|
||||||
|
straightforward, so in this summary, we will focus on splitting a text into words or subwords (i.e. tokenizing a text).
|
||||||
|
More specifically, we will look at the three main types of tokenizers used in 🤗 Transformers: :ref:`Byte-Pair Encoding
|
||||||
|
(BPE) <byte-pair-encoding>`, :ref:`WordPiece <wordpiece>`, and :ref:`SentencePiece <sentencepiece>`, and show examples
|
||||||
|
of which tokenizer type is used by which model.
|
||||||
|
|
||||||
Note that on each model page, you can look at the documentation of the associated tokenizer to know which tokenizer
|
Note that on each model page, you can look at the documentation of the associated tokenizer to know which tokenizer
|
||||||
type was used by the pretrained model. For instance, if we look at :class:`~transformers.BertTokenizer`, we can see
|
type was used by the pretrained model. For instance, if we look at :class:`~transformers.BertTokenizer`, we can see
|
||||||
@@ -28,8 +36,15 @@ Introduction
|
|||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||||
|
|
||||||
Splitting a text into smaller chunks is a task that is harder than it looks, and there are multiple ways of doing so.
|
Splitting a text into smaller chunks is a task that is harder than it looks, and there are multiple ways of doing so.
|
||||||
For instance, let's look at the sentence ``"Don't you love 🤗 Transformers? We sure do."`` A simple way of tokenizing
|
For instance, let's look at the sentence ``"Don't you love 🤗 Transformers? We sure do."``
|
||||||
this text is to split it by spaces, which would give:
|
|
||||||
|
.. raw:: html
|
||||||
|
|
||||||
|
<iframe width="560" height="315" src="https://www.youtube.com/embed/nhJxYji1aho" title="YouTube video player"
|
||||||
|
frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope;
|
||||||
|
picture-in-picture" allowfullscreen></iframe>
|
||||||
|
|
||||||
|
A simple way of tokenizing this text is to split it by spaces, which would give:
|
||||||
|
|
||||||
.. code-block::
|
.. code-block::
|
||||||
|
|
||||||
@@ -69,16 +84,30 @@ Such a big vocabulary size forces the model to have an enormous embedding matrix
|
|||||||
causes both an increased memory and time complexity. In general, transformers models rarely have a vocabulary size
|
causes both an increased memory and time complexity. In general, transformers models rarely have a vocabulary size
|
||||||
greater than 50,000, especially if they are pretrained only on a single language.
|
greater than 50,000, especially if they are pretrained only on a single language.
|
||||||
|
|
||||||
So if simple space and punctuation tokenization is unsatisfactory, why not simply tokenize on characters? While
|
So if simple space and punctuation tokenization is unsatisfactory, why not simply tokenize on characters?
|
||||||
character tokenization is very simple and would greatly reduce memory and time complexity, it makes it much harder for
|
|
||||||
the model to learn meaningful input representations. *E.g.* learning a meaningful context-independent representation
|
.. raw:: html
|
||||||
for the letter ``"t"`` is much harder than learning a context-independent representation for the word ``"today"``.
|
|
||||||
Therefore, character tokenization is often accompanied by a loss of performance. So to get the best of both worlds,
|
<iframe width="560" height="315" src="https://www.youtube.com/embed/ssLq_EK2jLE" title="YouTube video player"
|
||||||
transformers models use a hybrid between word-level and character-level tokenization called **subword** tokenization.
|
frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope;
|
||||||
|
picture-in-picture" allowfullscreen></iframe>
|
||||||
|
|
||||||
|
While character tokenization is very simple and would greatly reduce memory and time complexity, it makes it much harder
|
||||||
|
for the model to learn meaningful input representations. *E.g.* learning a meaningful context-independent
|
||||||
|
representation for the letter ``"t"`` is much harder than learning a context-independent representation for the word
|
||||||
|
``"today"``. Therefore, character tokenization is often accompanied by a loss of performance. So to get the best of
|
||||||
|
both worlds, transformers models use a hybrid between word-level and character-level tokenization called **subword**
|
||||||
|
tokenization.
|
||||||
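The trade-off between the two extremes can be seen on a single sentence. This is a small illustrative sketch that uses a plain-ASCII variant of the example sentence used earlier (the emoji is left out) and only Python built-ins, not a real tokenizer.

```python
text = "Don't you love Transformers? We sure do."

# Word-level: split on spaces, so punctuation stays attached to the words.
word_tokens = text.split()

# Character-level: every character, including spaces, becomes a token.
char_tokens = list(text)

print(len(word_tokens), "word-level tokens:", word_tokens)
print(len(char_tokens), "character-level tokens")
print(len(set(char_tokens)), "distinct characters")
```

The character-level sequence is much longer but draws on a small set of symbols, while the word-level sequence is short but would need a far larger vocabulary over a whole corpus. Subword tokenization sits between these two.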
|
|
||||||
Subword tokenization
|
Subword tokenization
|
||||||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||||
|
|
||||||
|
.. raw:: html
|
||||||
|
|
||||||
|
<iframe width="560" height="315" src="https://www.youtube.com/embed/zHvTiHr506c" title="YouTube video player"
|
||||||
|
frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope;
|
||||||
|
picture-in-picture" allowfullscreen></iframe>
|
||||||
|
|
||||||
Subword tokenization algorithms rely on the principle that frequently used words should not be split into smaller
|
Subword tokenization algorithms rely on the principle that frequently used words should not be split into smaller
|
||||||
subwords, but rare words should be decomposed into meaningful subwords. For instance ``"annoyingly"`` might be
|
subwords, but rare words should be decomposed into meaningful subwords. For instance ``"annoyingly"`` might be
|
||||||
considered a rare word and could be decomposed into ``"annoying"`` and ``"ly"``. Both ``"annoying"`` and ``"ly"`` as
|
considered a rare word and could be decomposed into ``"annoying"`` and ``"ly"``. Both ``"annoying"`` and ``"ly"`` as
|
||||||
|
|||||||
@@ -27,6 +27,12 @@ negative. For examples of other tasks, refer to the :ref:`additional-resources`
|
|||||||
Preparing the datasets
|
Preparing the datasets
|
||||||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||||
|
|
||||||
|
.. raw:: html
|
||||||
|
|
||||||
|
<iframe width="560" height="315" src="https://www.youtube.com/embed/_BZearw7f0w" title="YouTube video player"
|
||||||
|
frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope;
|
||||||
|
picture-in-picture" allowfullscreen></iframe>
|
||||||
|
|
||||||
We will use the `🤗 Datasets <https://github.com/huggingface/datasets/>`__ library to download and preprocess the IMDB
|
We will use the `🤗 Datasets <https://github.com/huggingface/datasets/>`__ library to download and preprocess the IMDB
|
||||||
datasets. We will go over this part pretty quickly. Since the focus of this tutorial is on training, you should refer
|
datasets. We will go over this part pretty quickly. Since the focus of this tutorial is on training, you should refer
|
||||||
to the 🤗 Datasets `documentation <https://huggingface.co/docs/datasets/>`__ or the :doc:`preprocessing` tutorial for
|
to the 🤗 Datasets `documentation <https://huggingface.co/docs/datasets/>`__ or the :doc:`preprocessing` tutorial for
|
||||||
@@ -95,6 +101,12 @@ them by their `full` equivalent to train or evaluate on the full dataset.
|
|||||||
Fine-tuning in PyTorch with the Trainer API
|
Fine-tuning in PyTorch with the Trainer API
|
||||||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||||
|
|
||||||
|
.. raw:: html
|
||||||
|
|
||||||
|
<iframe width="560" height="315" src="https://www.youtube.com/embed/nvBXf7s7vTI" title="YouTube video player"
|
||||||
|
frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope;
|
||||||
|
picture-in-picture" allowfullscreen></iframe>
|
||||||
|
|
||||||
Since PyTorch does not provide a training loop, the 🤗 Transformers library provides a :class:`~transformers.Trainer`
|
Since PyTorch does not provide a training loop, the 🤗 Transformers library provides a :class:`~transformers.Trainer`
|
||||||
API that is optimized for 🤗 Transformers models, with a wide range of training options and with built-in features like
|
API that is optimized for 🤗 Transformers models, with a wide range of training options and with built-in features like
|
||||||
logging, gradient accumulation, and mixed precision.
|
logging, gradient accumulation, and mixed precision.
|
||||||
@@ -200,6 +212,12 @@ See the documentation of :class:`~transformers.TrainingArguments` for more optio
|
|||||||
Fine-tuning with Keras
|
Fine-tuning with Keras
|
||||||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||||
|
|
||||||
|
.. raw:: html
|
||||||
|
|
||||||
|
<iframe width="560" height="315" src="https://www.youtube.com/embed/rnTGBy2ax1c" title="YouTube video player"
|
||||||
|
frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope;
|
||||||
|
picture-in-picture" allowfullscreen></iframe>
|
||||||
|
|
||||||
Models can also be trained natively in TensorFlow using the Keras API. First, let's define our model:
|
Models can also be trained natively in TensorFlow using the Keras API. First, let's define our model:
|
||||||
|
|
||||||
.. code-block:: python
|
.. code-block:: python
|
||||||
@@ -257,6 +275,12 @@ as a PyTorch model (or vice-versa):
|
|||||||
Fine-tuning in native PyTorch
|
Fine-tuning in native PyTorch
|
||||||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||||
|
|
||||||
|
.. raw:: html
|
||||||
|
|
||||||
|
<iframe width="560" height="315" src="https://www.youtube.com/embed/Dh9CL8fyG80" title="YouTube video player"
|
||||||
|
frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope;
|
||||||
|
picture-in-picture" allowfullscreen></iframe>
|
||||||
|
|
||||||
You might need to restart your notebook at this stage to free some memory, or execute the following code:
|
You might need to restart your notebook at this stage to free some memory, or execute the following code:
|
||||||
|
|
||||||
.. code-block:: python
|
.. code-block:: python
|
||||||
|
|||||||