Quick tour (#5145)
* Quicktour part 1
* Update
* All done
* Typos

Co-authored-by: Thomas Wolf <thomwolf@users.noreply.github.com>

* Address comments in quick tour
* Update docs/source/quicktour.rst

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>

* Update from feedback

Co-authored-by: Thomas Wolf <thomwolf@users.noreply.github.com>
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
@@ -38,6 +38,16 @@ Choose the right framework for every part of a model's lifetime:
|
|||||||
Contents
|
Contents
|
||||||
---------------------------------
|
---------------------------------
|
||||||
|
|
||||||
|
The documentation is organized in five parts:
|
||||||
|
|
||||||
|
- **GET STARTED** contains a quick tour, the installation instructions and some useful information about our philosophy
|
||||||
|
and a glossary.
|
||||||
|
- **USING TRANSFORMERS** contains general tutorials on how to use the library.
|
||||||
|
- **ADVANCED GUIDES** contains more advanced guides that are more specific to a given script or part of the library.
|
||||||
|
- **RESEARCH** focuses on tutorials that have less to do with how to use the library but more about general research in
|
||||||
|
transformers models
|
||||||
|
- **PACKAGE REFERENCE** contains the documentation of each public class and function.
|
||||||
|
|
||||||
The library currently contains PyTorch and Tensorflow implementations, pre-trained model weights, usage scripts and
|
The library currently contains PyTorch and Tensorflow implementations, pre-trained model weights, usage scripts and
|
||||||
conversion utilities for the following models:
|
conversion utilities for the following models:
|
||||||
|
|
||||||
@@ -118,16 +128,17 @@ conversion utilities for the following models:
|
|||||||
:maxdepth: 2
|
:maxdepth: 2
|
||||||
:caption: Get started
|
:caption: Get started
|
||||||
|
|
||||||
|
quicktour
|
||||||
installation
|
installation
|
||||||
quickstart
|
philosophy
|
||||||
glossary
|
glossary
|
||||||
|
|
||||||
.. toctree::
|
.. toctree::
|
||||||
:maxdepth: 2
|
:maxdepth: 2
|
||||||
:caption: Using Transformers
|
:caption: Using Transformers
|
||||||
|
|
||||||
usage
|
task_summary
|
||||||
summary
|
model_summary
|
||||||
serialization
|
serialization
|
||||||
model_sharing
|
model_sharing
|
||||||
multilingual
|
multilingual
|
||||||
|
|||||||
@@ -17,7 +17,7 @@ The pipeline abstraction
|
|||||||
The `pipeline` abstraction is a wrapper around all the other available pipelines. It is instantiated as any
|
The `pipeline` abstraction is a wrapper around all the other available pipelines. It is instantiated as any
|
||||||
other pipeline but requires an additional argument which is the `task`.
|
other pipeline but requires an additional argument which is the `task`.
|
||||||
|
|
||||||
... autofunction:: transformers.pipeline
|
.. autofunction:: transformers.pipeline
|
||||||
|
|
||||||
|
|
||||||
The task specific pipelines
|
The task specific pipelines
|
||||||
|
|||||||
73
docs/source/philosophy.rst
Normal file
@@ -0,0 +1,73 @@
|
|||||||
|
Philosophy
|
||||||
|
==========
|
||||||
|
|
||||||
|
Transformers is an opinionated library built for:
|
||||||
|
|
||||||
|
- NLP researchers and educators seeking to use/study/extend large-scale transformers models
|
||||||
|
- hands-on practitioners who want to fine-tune those models and/or serve them in production
|
||||||
|
- engineers who just want to download a pretrained model and use it to solve a given NLP task.
|
||||||
|
|
||||||
|
The library was designed with two strong goals in mind:
|
||||||
|
|
||||||
|
- Be as easy and fast to use as possible:
|
||||||
|
|
||||||
|
- We strongly limited the number of user-facing abstractions to learn. In fact, there are almost no abstractions,
|
||||||
|
just three standard classes required to use each model: :doc:`configuration <main_classes/configuration>`,
|
||||||
|
:doc:`models <main_classes/model>` and :doc:`tokenizer <main_classes/tokenizer>`.
|
||||||
|
- All of these classes can be initialized in a simple and unified way from pretrained instances by using a common
|
||||||
|
:obj:`from_pretrained()` instantiation method which will take care of downloading (if needed), caching and
|
||||||
|
loading the related class instance and associated data (configurations' hyper-parameters, tokenizers' vocabulary,
|
||||||
|
and models' weights) from a pretrained checkpoint provided on
|
||||||
|
`Hugging Face Hub <https://huggingface.co/models>`__ or your own saved checkpoint.
|
||||||
|
- On top of those three base classes, the library provides two APIs: :func:`~transformers.pipeline` for quickly
|
||||||
|
using a model (plus its associated tokenizer and configuration) on a given task and
|
||||||
|
:func:`~transformers.Trainer`/:func:`~transformers.TFTrainer` to quickly train or fine-tune a given model.
|
||||||
|
- As a consequence, this library is NOT a modular toolbox of building blocks for neural nets. If you want to
|
||||||
|
extend/build-upon the library, just use regular Python/PyTorch/TensorFlow/Keras modules and inherit from the base
|
||||||
|
classes of the library to reuse functionalities like model loading/saving.
|
||||||
|
|
||||||
|
- Provide state-of-the-art models with performances as close as possible to the original models:
|
||||||
|
|
||||||
|
- We provide at least one example for each architecture which reproduces a result provided by the official authors
|
||||||
|
of said architecture.
|
||||||
|
- The code is usually as close to the original code base as possible which means some PyTorch code may be not as
|
||||||
|
*pytorchic* as it could be as a result of being converted TensorFlow code and vice versa.
|
||||||
|
|
||||||
|
A few other goals:
|
||||||
|
|
||||||
|
- Expose the models' internals as consistently as possible:
|
||||||
|
|
||||||
|
- We give access, using a single API, to the full hidden-states and attention weights.
|
||||||
|
- Tokenizer and base model's API are standardized to easily switch between models.
|
||||||
|
|
||||||
|
- Incorporate a subjective selection of promising tools for fine-tuning/investigating these models:
|
||||||
|
|
||||||
|
- A simple/consistent way to add new tokens to the vocabulary and embeddings for fine-tuning.
|
||||||
|
- Simple ways to mask and prune transformer heads.
|
||||||
|
|
||||||
|
- Switch easily between PyTorch and TensorFlow 2.0, allowing training using one framework and inference using another.
|
||||||
|
|
||||||
|
Main concepts
|
||||||
|
~~~~~~~~~~~~~
|
||||||
|
|
||||||
|
The library is built around three types of classes for each model:
|
||||||
|
|
||||||
|
- **Model classes** such as :class:`~transformers.BertModel`, which are 30+ PyTorch models
|
||||||
|
(`torch.nn.Module <https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`__) or Keras models
|
||||||
|
(`tf.keras.Model <https://www.tensorflow.org/api_docs/python/tf/keras/Model>`__) that work with the pretrained
|
||||||
|
weights provided in the library.
|
||||||
|
- **Configuration classes** such as :class:`~transformers.BertConfig`, which store all the parameters required to build
|
||||||
|
a model. You don't always need to instantiate these yourself. In particular, if you are using a pretrained model
|
||||||
|
without any modification, creating the model will automatically take care of instantiating the configuration (which
|
||||||
|
is part of the model).
|
||||||
|
- **Tokenizer classes** such as :class:`~transformers.BertTokenizer`, which store the vocabulary for each model and
|
||||||
|
provide methods for encoding strings into (and decoding them from) a list of token embedding indices to be fed to a model.
|
||||||
|
|
||||||
|
All these classes can be instantiated from pretrained instances and saved locally using two methods:
|
||||||
|
|
||||||
|
- :obj:`from_pretrained()` lets you instantiate a model/configuration/tokenizer from a pretrained version either
|
||||||
|
provided by the library itself (the supported models are provided in the list :doc:`here <pretrained_models>`)
|
||||||
|
or stored locally (or on a server) by the user,
|
||||||
|
- :obj:`save_pretrained()` lets you save a model/configuration/tokenizer locally so that it can be reloaded using
|
||||||
|
:obj:`from_pretrained()`.
|
||||||
|
|
||||||
@@ -1,222 +0,0 @@
|
|||||||
# Quickstart
|
|
||||||
|
|
||||||
## Philosophy
|
|
||||||
|
|
||||||
Transformers is an opinionated library built for NLP researchers seeking to use/study/extend large-scale transformers models.
|
|
||||||
|
|
||||||
The library was designed with two strong goals in mind:
|
|
||||||
|
|
||||||
- be as easy and fast to use as possible:
|
|
||||||
|
|
||||||
- we strongly limited the number of user-facing abstractions to learn, in fact, there are almost no abstractions, just three standard classes required to use each model: configuration, models and tokenizer,
|
|
||||||
- all of these classes can be initialized in a simple and unified way from pretrained instances by using a common `from_pretrained()` instantiation method which will take care of downloading (if needed), caching and loading the related class from a pretrained instance supplied in the library or your own saved instance.
|
|
||||||
- as a consequence, this library is NOT a modular toolbox of building blocks for neural nets. If you want to extend/build-upon the library, just use regular Python/PyTorch modules and inherit from the base classes of the library to reuse functionalities like model loading/saving.
|
|
||||||
|
|
||||||
- provide state-of-the-art models with performances as close as possible to the original models:
|
|
||||||
|
|
||||||
- we provide at least one example for each architecture which reproduces a result provided by the official authors of said architecture,
|
|
||||||
- the code is usually as close to the original code base as possible which means some PyTorch code may be not as *pytorchic* as it could be as a result of being converted TensorFlow code.
|
|
||||||
|
|
||||||
A few other goals:
|
|
||||||
|
|
||||||
- expose the models' internals as consistently as possible:
|
|
||||||
|
|
||||||
- we give access, using a single API to the full hidden-states and attention weights,
|
|
||||||
- tokenizer and base model's API are standardized to easily switch between models.
|
|
||||||
|
|
||||||
- incorporate a subjective selection of promising tools for fine-tuning/investigating these models:
|
|
||||||
|
|
||||||
- a simple/consistent way to add new tokens to the vocabulary and embeddings for fine-tuning,
|
|
||||||
- simple ways to mask and prune transformer heads.
|
|
||||||
|
|
||||||
## Main concepts
|
|
||||||
|
|
||||||
The library is build around three types of classes for each model:
|
|
||||||
|
|
||||||
- **model classes** e.g., `BertModel` which are 20+ PyTorch models (`torch.nn.Modules`) that work with the pretrained weights provided in the library. In TF2, these are `tf.keras.Model`.
|
|
||||||
- **configuration classes** which store all the parameters required to build a model, e.g., `BertConfig`. You don't always need to instantiate these your-self. In particular, if you are using a pretrained model without any modification, creating the model will automatically take care of instantiating the configuration (which is part of the model)
|
|
||||||
- **tokenizer classes** which store the vocabulary for each model and provide methods for encoding/decoding strings in a list of token embeddings indices to be fed to a model, e.g., `BertTokenizer`
|
|
||||||
|
|
||||||
All these classes can be instantiated from pretrained instances and saved locally using two methods:
|
|
||||||
|
|
||||||
- `from_pretrained()` let you instantiate a model/configuration/tokenizer from a pretrained version either provided by the library itself (currently 27 models are provided as listed [here](https://huggingface.co/transformers/pretrained_models.html)) or stored locally (or on a server) by the user,
|
|
||||||
- `save_pretrained()` let you save a model/configuration/tokenizer locally so that it can be reloaded using `from_pretrained()`.
|
|
||||||
|
|
||||||
We'll finish this quickstart tour by going through a few simple quick-start examples to see how we can instantiate and use these classes. The rest of the documentation is organized into two parts:
|
|
||||||
|
|
||||||
- the **MAIN CLASSES** section details the common functionalities/method/attributes of the three main type of classes (configuration, model, tokenizer) plus some optimization related classes provided as utilities for training,
|
|
||||||
- the **PACKAGE REFERENCE** section details all the variants of each class for each model architectures and, in particular, the input/output that you should expect when calling each of them.
|
|
||||||
|
|
||||||
## Quick tour: Usage
|
|
||||||
|
|
||||||
Here are two examples showcasing a few `Bert` and `GPT2` classes and pre-trained models.
|
|
||||||
|
|
||||||
See the full API reference for examples of each model class.
|
|
||||||
|
|
||||||
### BERT example
|
|
||||||
|
|
||||||
Let's start by preparing a tokenized input (a list of token embeddings indices to be fed to Bert) from a text string using `BertTokenizer`
|
|
||||||
|
|
||||||
```python
|
|
||||||
import torch
|
|
||||||
from transformers import BertTokenizer, BertModel, BertForMaskedLM
|
|
||||||
|
|
||||||
# OPTIONAL: if you want to have more information on what's happening under the hood, activate the logger as follows
|
|
||||||
import logging
|
|
||||||
logging.basicConfig(level=logging.INFO)
|
|
||||||
|
|
||||||
# Load pre-trained model tokenizer (vocabulary)
|
|
||||||
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
|
|
||||||
|
|
||||||
# Tokenize input
|
|
||||||
text = "[CLS] Who was Jim Henson ? [SEP] Jim Henson was a puppeteer [SEP]"
|
|
||||||
tokenized_text = tokenizer.tokenize(text)
|
|
||||||
|
|
||||||
# Mask a token that we will try to predict back with `BertForMaskedLM`
|
|
||||||
masked_index = 8
|
|
||||||
tokenized_text[masked_index] = '[MASK]'
|
|
||||||
assert tokenized_text == ['[CLS]', 'who', 'was', 'jim', 'henson', '?', '[SEP]', 'jim', '[MASK]', 'was', 'a', 'puppet', '##eer', '[SEP]']
|
|
||||||
|
|
||||||
# Convert token to vocabulary indices
|
|
||||||
indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)
|
|
||||||
# Define sentence A and B indices associated to 1st and 2nd sentences (see paper)
|
|
||||||
segments_ids = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1]
|
|
||||||
|
|
||||||
# Convert inputs to PyTorch tensors
|
|
||||||
tokens_tensor = torch.tensor([indexed_tokens])
|
|
||||||
segments_tensors = torch.tensor([segments_ids])
|
|
||||||
```
|
|
||||||
|
|
||||||
Let's see how we can use `BertModel` to encode our inputs in hidden-states:
|
|
||||||
|
|
||||||
```python
|
|
||||||
# Load pre-trained model (weights)
|
|
||||||
model = BertModel.from_pretrained('bert-base-uncased')
|
|
||||||
|
|
||||||
# Set the model in evaluation mode to deactivate the DropOut modules
|
|
||||||
# This is IMPORTANT to have reproducible results during evaluation!
|
|
||||||
model.eval()
|
|
||||||
|
|
||||||
# If you have a GPU, put everything on cuda
|
|
||||||
tokens_tensor = tokens_tensor.to('cuda')
|
|
||||||
segments_tensors = segments_tensors.to('cuda')
|
|
||||||
model.to('cuda')
|
|
||||||
|
|
||||||
# Predict hidden states features for each layer
|
|
||||||
with torch.no_grad():
|
|
||||||
# See the models docstrings for the detail of the inputs
|
|
||||||
    outputs = model(tokens_tensor, token_type_ids=segments_tensors)
|
|
||||||
# Transformers models always output tuples.
|
|
||||||
# See the models docstrings for the detail of all the outputs
|
|
||||||
# In our case, the first element is the hidden state of the last layer of the Bert model
|
|
||||||
encoded_layers = outputs[0]
|
|
||||||
# We have encoded our input sequence in a FloatTensor of shape (batch size, sequence length, model hidden dimension)
|
|
||||||
assert tuple(encoded_layers.shape) == (1, len(indexed_tokens), model.config.hidden_size)
|
|
||||||
```
|
|
||||||
|
|
||||||
And how to use `BertForMaskedLM` to predict a masked token:
|
|
||||||
|
|
||||||
```python
|
|
||||||
# Load pre-trained model (weights)
|
|
||||||
model = BertForMaskedLM.from_pretrained('bert-base-uncased')
|
|
||||||
model.eval()
|
|
||||||
|
|
||||||
# If you have a GPU, put everything on cuda
|
|
||||||
tokens_tensor = tokens_tensor.to('cuda')
|
|
||||||
segments_tensors = segments_tensors.to('cuda')
|
|
||||||
model.to('cuda')
|
|
||||||
|
|
||||||
# Predict all tokens
|
|
||||||
with torch.no_grad():
|
|
||||||
    outputs = model(tokens_tensor, token_type_ids=segments_tensors)
|
|
||||||
predictions = outputs[0]
|
|
||||||
|
|
||||||
# confirm we were able to predict 'henson'
|
|
||||||
predicted_index = torch.argmax(predictions[0, masked_index]).item()
|
|
||||||
predicted_token = tokenizer.convert_ids_to_tokens([predicted_index])[0]
|
|
||||||
assert predicted_token == 'henson'
|
|
||||||
```
|
|
||||||
|
|
||||||
### OpenAI GPT-2
|
|
||||||
|
|
||||||
Here is a quick-start example using `GPT2Tokenizer` and `GPT2LMHeadModel` class with OpenAI's pre-trained model to predict the next token from a text prompt.
|
|
||||||
|
|
||||||
First let's prepare a tokenized input from our text string using `GPT2Tokenizer`
|
|
||||||
|
|
||||||
```python
|
|
||||||
import torch
|
|
||||||
from transformers import GPT2Tokenizer, GPT2LMHeadModel
|
|
||||||
|
|
||||||
# OPTIONAL: if you want to have more information on what's happening, activate the logger as follows
|
|
||||||
import logging
|
|
||||||
logging.basicConfig(level=logging.INFO)
|
|
||||||
|
|
||||||
# Load pre-trained model tokenizer (vocabulary)
|
|
||||||
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
|
|
||||||
|
|
||||||
# Encode a text inputs
|
|
||||||
text = "Who was Jim Henson ? Jim Henson was a"
|
|
||||||
indexed_tokens = tokenizer.encode(text)
|
|
||||||
|
|
||||||
# Convert indexed tokens in a PyTorch tensor
|
|
||||||
tokens_tensor = torch.tensor([indexed_tokens])
|
|
||||||
```
|
|
||||||
|
|
||||||
Let's see how to use `GPT2LMHeadModel` to generate the next token following our text:
|
|
||||||
|
|
||||||
```python
|
|
||||||
# Load pre-trained model (weights)
|
|
||||||
model = GPT2LMHeadModel.from_pretrained('gpt2')
|
|
||||||
|
|
||||||
# Set the model in evaluation mode to deactivate the DropOut modules
|
|
||||||
# This is IMPORTANT to have reproducible results during evaluation!
|
|
||||||
model.eval()
|
|
||||||
|
|
||||||
# If you have a GPU, put everything on cuda
|
|
||||||
tokens_tensor = tokens_tensor.to('cuda')
|
|
||||||
model.to('cuda')
|
|
||||||
|
|
||||||
# Predict all tokens
|
|
||||||
with torch.no_grad():
|
|
||||||
    outputs = model(tokens_tensor)
|
|
||||||
predictions = outputs[0]
|
|
||||||
|
|
||||||
# get the predicted next sub-word (in our case, the word 'man')
|
|
||||||
predicted_index = torch.argmax(predictions[0, -1, :]).item()
|
|
||||||
predicted_text = tokenizer.decode(indexed_tokens + [predicted_index])
|
|
||||||
assert predicted_text == 'Who was Jim Henson? Jim Henson was a man'
|
|
||||||
```
|
|
||||||
|
|
||||||
Examples for each model class of each model architecture (Bert, GPT, GPT-2, Transformer-XL, XLNet and XLM) can be found in the [documentation](#documentation).
|
|
||||||
|
|
||||||
#### Using the past
|
|
||||||
|
|
||||||
GPT-2, as well as some other models (GPT, XLNet, Transfo-XL, CTRL), make use of a `past` or `mems` attribute which can be used to prevent re-computing the key/value pairs when using sequential decoding. It is useful when generating sequences as a big part of the attention mechanism benefits from previous computations.
|
|
||||||
|
|
||||||
Here is a fully-working example using the `past` with `GPT2LMHeadModel` and argmax decoding (which should only be used as an example, as argmax decoding introduces a lot of repetition):
|
|
||||||
|
|
||||||
```python
|
|
||||||
from transformers import GPT2LMHeadModel, GPT2Tokenizer
|
|
||||||
import torch
|
|
||||||
|
|
||||||
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
|
|
||||||
model = GPT2LMHeadModel.from_pretrained('gpt2')
|
|
||||||
|
|
||||||
generated = tokenizer.encode("The Manhattan bridge")
|
|
||||||
context = torch.tensor([generated])
|
|
||||||
past = None
|
|
||||||
|
|
||||||
for i in range(100):
|
|
||||||
    print(i)
|
|
||||||
    output, past = model(context, past=past)
|
|
||||||
    token = torch.argmax(output[..., -1, :])
|
|
||||||
|
|
||||||
    generated += [token.tolist()]
|
|
||||||
    context = token.unsqueeze(0)
|
|
||||||
|
|
||||||
sequence = tokenizer.decode(generated)
|
|
||||||
|
|
||||||
print(sequence)
|
|
||||||
```
|
|
||||||
|
|
||||||
The model only requires a single token as input as all the previous tokens' key/value pairs are contained in the `past`.
|
|
||||||
379
docs/source/quicktour.rst
Normal file
@@ -0,0 +1,379 @@
|
|||||||
|
Quick tour
|
||||||
|
==========
|
||||||
|
|
||||||
|
Let's have a quick look at the 🤗 Transformers library features. The library downloads pretrained models for
|
||||||
|
Natural Language Understanding (NLU) tasks, such as analyzing the sentiment of a text, and Natural Language Generation (NLG),
|
||||||
|
such as completing a prompt with new text or translating into another language.
|
||||||
|
|
||||||
|
First we will see how to easily leverage the pipeline API to quickly use those pretrained models at inference. Then, we
|
||||||
|
will dig a little bit more and see how the library gives you access to those models and helps you preprocess your data.
|
||||||
|
|
||||||
|
.. note::
|
||||||
|
|
||||||
|
All code examples presented in the documentation have a switch on the top left for PyTorch versus TensorFlow. If
|
||||||
|
not, the code is expected to work for both backends without any change needed.
|
||||||
|
|
||||||
|
Getting started on a task with a pipeline
|
||||||
|
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||||
|
|
||||||
|
The easiest way to use a pretrained model on a given task is to use :func:`~transformers.pipeline`. 🤗 Transformers
|
||||||
|
provides the following tasks out of the box:
|
||||||
|
|
||||||
|
- Sentiment analysis: is a text positive or negative?
|
||||||
|
- Text generation (in English): provide a prompt and the model will generate what follows.
|
||||||
|
- Named entity recognition (NER): in an input sentence, label each word with the entity it represents (person, place,
|
||||||
|
etc.)
|
||||||
|
- Question answering: provide the model with some context and a question, extract the answer from the context.
|
||||||
|
- Filling masked text: given a text with masked words (e.g., replaced by ``[MASK]``), fill the blanks.
|
||||||
|
- Summarization: generate a summary of a long text.
|
||||||
|
- Translation: translate a text into another language.
|
||||||
|
- Feature extraction: return a tensor representation of the text.
|
||||||
|
|
||||||
|
Let's see how this works for sentiment analysis (the other tasks are all covered in the
|
||||||
|
:doc:`task summary </task_summary>`):
|
||||||
|
|
||||||
|
::
|
||||||
|
|
||||||
|
from transformers import pipeline
|
||||||
|
classifier = pipeline('sentiment-analysis')
|
||||||
|
|
||||||
|
When typing this command for the first time, a pretrained model and its tokenizer are downloaded and cached. We will
|
||||||
|
look at both later on, but as an introduction the tokenizer's job is to preprocess the text for the model, which is
|
||||||
|
then responsible for making predictions. The pipeline groups all of that together, and post-processes the predictions to
|
||||||
|
make them readable. For instance:
|
||||||
|
|
||||||
|
::
|
||||||
|
|
||||||
|
classifier('We are very happy to show you the Transformers library.')
|
||||||
|
|
||||||
|
will return something like this:
|
||||||
|
|
||||||
|
::
|
||||||
|
|
||||||
|
[{'label': 'POSITIVE', 'score': 0.999799370765686}]
|
||||||
|
|
||||||
|
That's encouraging! You can use it on a list of sentences, which will be preprocessed then fed to the model as a
|
||||||
|
`batch`:
|
||||||
|
|
||||||
|
::
|
||||||
|
|
||||||
|
classifier(["We are very happy to show you the Transformers library.",
|
||||||
|
"We hope you don't hate it."])
|
||||||
|
|
||||||
|
returning a list of dictionaries like this one:
|
||||||
|
|
||||||
|
::
|
||||||
|
|
||||||
|
[{'label': 'POSITIVE', 'score': 0.999799370765686},
|
||||||
|
{'label': 'NEGATIVE', 'score': 0.5308589935302734}]
|
||||||
|
|
||||||
|
You can see the second sentence has been classified as negative (the model has to pick either positive or negative) but its score is
|
||||||
|
fairly neutral.
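Since the pipeline returns plain Python dictionaries, it is easy to act on the scores. Here is a minimal sketch that separates confident predictions from borderline ones. The ``split_by_confidence`` helper and its ``threshold`` of 0.9 are hypothetical choices made for illustration, not part of the library; the ``results`` list is copied from the output above.

```python
# Results copied from the pipeline output shown above.
results = [
    {"label": "POSITIVE", "score": 0.999799370765686},
    {"label": "NEGATIVE", "score": 0.5308589935302734},
]


def split_by_confidence(results, threshold=0.9):
    """Hypothetical helper: split predictions by score.

    Returns two lists of (sentence index, label) pairs: predictions whose
    score is at least `threshold`, and the remaining borderline ones.
    """
    confident, uncertain = [], []
    for index, result in enumerate(results):
        target = confident if result["score"] >= threshold else uncertain
        target.append((index, result["label"]))
    return confident, uncertain


confident, uncertain = split_by_confidence(results)
print("confident:", confident)
print("uncertain:", uncertain)
```

With the scores above, the first sentence lands in the confident group and the second in the uncertain one, which matches the reading that its score is close to neutral.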
|
||||||
|
|
||||||
|
By default, the model downloaded for this pipeline is called "distilbert-base-uncased-finetuned-sst-2-english". We can
|
||||||
|
look at its `model page <https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english>`__ to get more
|
||||||
|
information about it. It uses the :doc:`DistilBERT architecture </model_doc/distilbert>` and has been fine-tuned on a
|
||||||
|
dataset called SST-2 for the sentiment analysis task.
|
||||||
|
|
||||||
|
Let's say we want to use another model; for instance, one that has been trained on French data. We can search through
|
||||||
|
the `model hub <https://huggingface.co/models>`__ that gathers models pretrained on a lot of data by research labs, but
|
||||||
|
also community models (usually fine-tuned versions of those big models on a specific dataset). Applying the tags
|
||||||
|
"French" and "text-classification" gives back a suggestion "nlptown/bert-base-multilingual-uncased-sentiment". Let's
|
||||||
|
see how we can use it.
|
||||||
|
|
||||||
|
You can directly pass the name of the model to use to :func:`~transformers.pipeline`:
|
||||||
|
|
||||||
|
::
|
||||||
|
|
||||||
|
classifier = pipeline('sentiment-analysis', model="nlptown/bert-base-multilingual-uncased-sentiment")
|
||||||
|
|
||||||
|
This classifier can now deal with texts in English, French, but also Dutch, German, Italian and Spanish! You can also
|
||||||
|
replace that name by a local folder where you have saved a pretrained model (see below). You can also pass a model
|
||||||
|
object and its associated tokenizer.
|
||||||
|
|
||||||
|
We will need two classes for this. The first is :class:`~transformers.AutoTokenizer`, which we will use to download the
|
||||||
|
tokenizer associated with the model we picked and instantiate it. The second is
|
||||||
|
:class:`~transformers.AutoModelForSequenceClassification` (or
|
||||||
|
:class:`~transformers.TFAutoModelForSequenceClassification` if you are using TensorFlow), which we will use to download
|
||||||
|
the model itself. Note that if we were using the library on another task, the class of the model would change. The
|
||||||
|
:doc:`task summary </task_summary>` tutorial summarizes which class is used for which task.
|
||||||
|
|
||||||
|
::
|
||||||
|
|
||||||
|
## PYTORCH CODE
|
||||||
|
from transformers import AutoTokenizer, AutoModelForSequenceClassification
|
||||||
|
## TENSORFLOW CODE
|
||||||
|
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification
|
||||||
|
|
||||||
|
Now, to download the model and tokenizer we found previously, we just have to use the
|
||||||
|
:func:`~transformers.AutoModelForSequenceClassification.from_pretrained` method (feel free to replace ``model_name`` by
|
||||||
|
any other model from the model hub):
|
||||||
|
|
||||||
|
::
|
||||||
|
|
||||||
|
## PYTORCH CODE
|
||||||
|
model_name = "nlptown/bert-base-multilingual-uncased-sentiment"
|
||||||
|
model = AutoModelForSequenceClassification.from_pretrained(model_name)
|
||||||
|
tokenizer = AutoTokenizer.from_pretrained(model_name)
|
||||||
|
classifier = pipeline('sentiment-analysis', model=model, tokenizer=tokenizer)
|
||||||
|
## TENSORFLOW CODE
|
||||||
|
model_name = "nlptown/bert-base-multilingual-uncased-sentiment"
|
||||||
|
model = TFAutoModelForSequenceClassification.from_pretrained(model_name)
|
||||||
|
tokenizer = AutoTokenizer.from_pretrained(model_name)
|
||||||
|
classifier = pipeline('sentiment-analysis', model=model, tokenizer=tokenizer)
|
||||||
|
|
||||||
|
If you don't find a model that has been pretrained on some data similar to yours, you will need to fine-tune a
|
||||||
|
pretrained model on your data. We provide :doc:`example scripts </examples>` to do so. Once you're done, don't forget
|
||||||
|
to share your fine-tuned model on the hub with the community, using :doc:`this tutorial </model_sharing>`.
|
||||||
|
|
||||||
|
.. _pretrained-model:
|
||||||
|
|
||||||
|
Under the hood: pretrained models
|
||||||
|
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||||
|
|
||||||
|
Let's now see what happens beneath the hood when using those pipelines. As we saw, the model and tokenizer are created
|
||||||
|
using the :obj:`from_pretrained` method:
|
||||||
|
|
||||||
|
::
|
||||||
|
|
||||||
|
## PYTORCH CODE
|
||||||
|
from transformers import AutoTokenizer, AutoModelForSequenceClassification
|
||||||
|
model_name = "distilbert-base-uncased-finetuned-sst-2-english"
|
||||||
|
model = AutoModelForSequenceClassification.from_pretrained(model_name)
|
||||||
|
tokenizer = AutoTokenizer.from_pretrained(model_name)
|
||||||
|
## TENSORFLOW CODE
|
||||||
|
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification
|
||||||
|
model_name = "distilbert-base-uncased-finetuned-sst-2-english"
|
||||||
|
model = TFAutoModelForSequenceClassification.from_pretrained(model_name)
|
||||||
|
tokenizer = AutoTokenizer.from_pretrained(model_name)
|
||||||
|
|
||||||
|
Using the tokenizer
|
||||||
|
^^^^^^^^^^^^^^^^^^^
|
||||||
|
|
||||||
|
We mentioned the tokenizer is responsible for the preprocessing of your texts. First, it will split a given text into
|
||||||
|
words (or part of words, punctuation symbols, etc.) usually called `tokens`. There are multiple rules that can govern
|
||||||
|
that process, which is why we need to instantiate the tokenizer using the name of the model, to make sure we use the
|
||||||
|
same rules as when the model was pretrained.
|
||||||
|
|
||||||
|
The second step is to convert those `tokens` into numbers, to be able to build a tensor out of them and feed them to
|
||||||
|
the model. To do this, the tokenizer has a `vocab`, which is the part we download when we instantiate it with the
|
||||||
|
:obj:`from_pretrained` method, since we need to use the same `vocab` as when the model was pretrained.
|
||||||
|
|
||||||
|
To apply these steps on a given text, we can just feed it to our tokenizer:
|
||||||
|
|
||||||
|
::
|
||||||
|
|
||||||
|
input = tokenizer("We are very happy to show you the Transformers library.")
|
||||||
|
print(input)
|
||||||
|
|
||||||
|
This returns a dictionary mapping strings to lists of ints. It contains the `ids of the tokens <glossary.html#input-ids>`__,
|
||||||
|
as mentioned before, but also additional arguments that will be useful to the model. Here for instance, we also have an
|
||||||
|
`attention mask <glossary.html#attention-mask>`__ that the model will use to have a better understanding of the sequence:
|
||||||
|
|
||||||
|
|
||||||
|
::
|
||||||
|
|
||||||
|
{'input_ids': [101, 2057, 2024, 2200, 3407, 2000, 2265, 2017, 1996, 19081, 3075, 1012, 102],
|
||||||
|
'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
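To make the two steps (splitting into tokens, then looking up ids) concrete, here is a toy sketch in plain Python. It is not how the library's tokenizers are implemented: the real tokenizer for this model also splits rare words into sub-word pieces. ``TOY_VOCAB``, ``toy_tokenize`` and ``toy_encode`` are hypothetical names, and the ids are simply read off the output above, assuming each lowercased word of the sentence maps to one id in order.

```python
import re

# Toy vocabulary: ids read off the tokenizer output shown above, assuming
# one id per lowercased word, with 101 and 102 as the special tokens added
# at the start and end of the sequence.
TOY_VOCAB = {
    "[CLS]": 101, "[SEP]": 102,
    "we": 2057, "are": 2024, "very": 2200, "happy": 3407, "to": 2000,
    "show": 2265, "you": 2017, "the": 1996, "transformers": 19081,
    "library": 3075, ".": 1012,
}


def toy_tokenize(text):
    """Step 1: lowercase the text and split it into words and punctuation."""
    return re.findall(r"\w+|[^\w\s]", text.lower())


def toy_encode(text):
    """Step 2: add the special tokens and map every token to its id."""
    tokens = ["[CLS]"] + toy_tokenize(text) + ["[SEP]"]
    input_ids = [TOY_VOCAB[token] for token in tokens]
    # No padding here, so every position is a real token (mask value 1).
    return {"input_ids": input_ids, "attention_mask": [1] * len(input_ids)}


encoded = toy_encode("We are very happy to show you the Transformers library.")
print(encoded)
```

The toy version only works for words present in its small vocabulary, which is one reason the real vocabulary has to be downloaded with :obj:`from_pretrained`.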
|
||||||
|
|
||||||
|
You can pass a list of sentences directly to your tokenizer. If your goal is to send them through your model as a
|
||||||
|
batch, you probably want to pad them all to the same length, truncate them to the maximum length the model can accept
|
||||||
|
and get tensors back. You can specify all of that to the tokenizer:
|
||||||
|
|
||||||
|
::
|
||||||
|
|
||||||
|
## PYTORCH CODE
|
||||||
|
batch = tokenizer(
|
||||||
|
["We are very happy to show you the Transformers library.",
|
||||||
|
"We hope you don't hate it."],
|
||||||
|
padding=True, truncation=True, return_tensors="pt")
|
||||||
|
print(batch)
|
||||||
|
## TENSORFLOW CODE
|
||||||
|
batch = tokenizer(
|
||||||
|
["We are very happy to show you the Transformers library.",
|
||||||
|
"We hope you don't hate it."],
|
||||||
|
padding=True, truncation=True, return_tensors="tf")
|
||||||
|
print(batch)
|
||||||
|
|
||||||
|
The padding is automatically applied on the side the model expects it (in this case, on the right), with the
|
||||||
|
padding token the model was pretrained with. The attention mask is also adapted to take the padding into account:
|
||||||
|
|
||||||
|
::
|
||||||
|
|
||||||
|
{'input_ids': tensor([[ 101, 2057, 2024, 2200, 3407, 2000, 2265, 2017, 1996, 19081, 3075, 1012, 102],
|
||||||
|
[ 101, 2057, 3246, 2017, 2123, 1005, 1056, 5223, 2009, 1012, 102, 0, 0]]),
|
||||||
|
'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
|
||||||
|
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0]])}
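
To make the relationship between padding and the attention mask concrete, here is a minimal pure-Python sketch of right-padding. The helper ``pad_batch`` is hypothetical (it is not part of the library), and it assumes a padding id of 0, as in the output above.

```python
def pad_batch(sequences, pad_id=0):
    # Pad every sequence on the right up to the length of the longest one.
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks a padding position to be ignored.
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


first = [101, 2057, 2024, 2200, 3407, 2000, 2265, 2017, 1996, 19081, 3075, 1012, 102]
second = [101, 2057, 3246, 2017, 2123, 1005, 1056, 5223, 2009, 1012, 102]
padded = pad_batch([first, second])
print(padded["input_ids"][1])
print(padded["attention_mask"][1])
```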
|
||||||
|
|
||||||
|
You can learn more about tokenizers on their :doc:`doc page <main_classes/tokenizer>` (tutorial coming soon).
|
||||||
|
|
||||||
|
Using the model
|
||||||
|
^^^^^^^^^^^^^^^
|
||||||
|
|
||||||
|
Once your input has been preprocessed by the tokenizer, you can directly send it to the model. As we mentioned, it will
|
||||||
|
contain all the relevant information the model needs. If you're using a TensorFlow model, you can directly pass the
|
||||||
|
dictionary mapping keys to tensors; for a PyTorch model, you need to unpack the dictionary by adding :obj:`**`.
|
||||||
|
|
||||||
|
::
|
||||||
|
|
||||||
|
## PYTORCH CODE
|
||||||
|
outputs = model(**batch)
|
||||||
|
## TENSORFLOW CODE
|
||||||
|
outputs = model(batch)
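
If the :obj:`**` syntax is unfamiliar, the sketch below shows what it does in plain Python. ``toy_model`` and ``toy_batch`` are stand-ins invented for this example, not library objects: unpacking simply turns each dictionary key into a keyword argument.

```python
def toy_model(input_ids=None, attention_mask=None):
    # Stand-in for a model's call signature; it only reports what it received.
    return {"input_ids": input_ids, "attention_mask": attention_mask}


toy_batch = {"input_ids": [[101, 102]], "attention_mask": [[1, 1]]}

# These two calls are equivalent: ** spreads the keys into keyword arguments.
unpacked = toy_model(**toy_batch)
explicit = toy_model(input_ids=toy_batch["input_ids"],
                     attention_mask=toy_batch["attention_mask"])
print(unpacked == explicit)
```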
|
||||||
|
|
||||||
|
In 🤗 Transformers, all outputs are tuples (potentially with only one element). Here, we get a tuple with just the
|
||||||
|
final activations of the model.
|
||||||
|
|
||||||
|
::
|
||||||
|
|
||||||
|
(tensor([[-4.1329, 4.3811],
|
||||||
|
[ 0.0818, -0.0418]]),)
|
||||||
|
|
||||||
|
.. note::
|
||||||
|
|
||||||
|
All 🤗 Transformers models (PyTorch or TensorFlow) return the activations of the model *before* the final
|
||||||
|
activation function (like SoftMax) since this final activation function is often fused with the loss.
|
||||||
|
|
||||||
|
Let's apply the SoftMax activation to get predictions.
|
||||||
|
|
||||||
|
::
|
||||||
|
|
||||||
|
## PYTORCH CODE
|
||||||
|
import torch.nn.functional as F
|
||||||
|
predictions = F.softmax(outputs[0], dim=-1)
|
||||||
|
print(predictions)
|
||||||
|
## TENSORFLOW CODE
|
||||||
|
import tensorflow as tf
predictions = tf.nn.softmax(outputs[0], axis=-1)
|
||||||
|
print(predictions)
|
||||||
|
|
||||||
|
We can see we get the numbers from before:
|
||||||
|
|
||||||
|
::
|
||||||
|
|
||||||
|
tensor([[2.0060e-04, 9.9980e-01],
|
||||||
|
[5.3086e-01, 4.6914e-01]])
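
For readers who want to check these numbers by hand, here is a framework-free sketch of SoftMax applied to the activations printed earlier. The ``softmax`` helper is written for this example only; in practice you would use the PyTorch or TensorFlow function.

```python
import math


def softmax(row):
    # Subtract the maximum first for numerical stability; the result is unchanged.
    shifted = [x - max(row) for x in row]
    exps = [math.exp(x) for x in shifted]
    total = sum(exps)
    return [e / total for e in exps]


# Final activations copied from the model output above.
logits = [[-4.1329, 4.3811], [0.0818, -0.0418]]
probs = [softmax(row) for row in logits]
for row in probs:
    print(row)
```

Each row sums to 1, so the values can be read as the probability of each label for the corresponding sentence.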
|
||||||
|
|
||||||
|
If you have labels, you can provide them to the model, and it will return a tuple with the loss and the final activations.
|
||||||
|
|
||||||
|
::
|
||||||
|
|
||||||
|
## PYTORCH CODE
|
||||||
|
import torch
|
||||||
|
outputs = model(**batch, labels=torch.tensor([1, 0]))
|
||||||
|
## TENSORFLOW CODE
|
||||||
|
import tensorflow as tf
|
||||||
|
outputs = model(batch, labels=tf.constant([1, 0]))
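
As a rough idea of what the returned loss represents, the sketch below computes a mean cross-entropy from the activations shown earlier and the labels ``[1, 0]``. It assumes the classification head uses a standard cross-entropy loss averaged over the batch, which is the usual choice but is not stated on this page; ``cross_entropy`` is a helper written for this example.

```python
import math


def cross_entropy(logits, labels):
    # Mean over the batch of -log(softmax(row)[label]), computed stably
    # as log-sum-exp(row) minus the activation of the correct label.
    losses = []
    for row, label in zip(logits, labels):
        m = max(row)
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        losses.append(log_sum_exp - row[label])
    return sum(losses) / len(losses)


# Final activations copied from the model output above.
logits = [[-4.1329, 4.3811], [0.0818, -0.0418]]
loss = cross_entropy(logits, [1, 0])
print(loss)
```

The first sentence contributes almost nothing to the loss because the model is confident in the correct label, while the second contributes most of it.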
|
||||||
|
|
||||||
|
Models are standard `torch.nn.Module <https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`__ or
|
||||||
|
`tf.keras.Model <https://www.tensorflow.org/api_docs/python/tf/keras/Model>`__ so you can use them in your usual
|
||||||
|
training loop. 🤗 Transformers also provides a :class:`~transformers.Trainer` (or :class:`~transformers.TFTrainer` if
|
||||||
|
you are using TensorFlow) class to help with your training (taking care of things such as distributed training, mixed
|
||||||
|
precision, etc.). See the training tutorial (coming soon) for more details.
|
||||||
|
|
||||||
|
Once your model is fine-tuned, you can save it with its tokenizer in the following way:
|
||||||
|
|
||||||
|
::
|
||||||
|
|
||||||
|
tokenizer.save_pretrained(save_directory)
|
||||||
|
model.save_pretrained(save_directory)
|
||||||
|
|
||||||
|
You can then load this model back using the :func:`~transformers.AutoModel.from_pretrained` method by passing the
|
||||||
|
directory name instead of the model name. One cool feature of 🤗 Transformers is that you can easily switch between
|
||||||
|
PyTorch and TensorFlow: any model saved as before can be loaded back either in PyTorch or TensorFlow. If you are
|
||||||
|
loading a saved PyTorch model in a TensorFlow model, use :func:`~transformers.TFAutoModel.from_pretrained` like this:
|
||||||
|
|
||||||
|
::
|
||||||
|
|
||||||
|
tokenizer = AutoTokenizer.from_pretrained(save_directory)
|
||||||
|
model = TFAutoModel.from_pretrained(save_directory, from_pt=True)
|
||||||
|
|
||||||
|
and if you are loading a saved TensorFlow model in a PyTorch model, you should use the following code:
|
||||||
|
|
||||||
|
::
|
||||||
|
|
||||||
|
tokenizer = AutoTokenizer.from_pretrained(save_directory)
|
||||||
|
model = AutoModel.from_pretrained(save_directory, from_tf=True)
|
||||||
|
|
||||||
|
Lastly, you can also ask the model to return all hidden states and all attention weights if you need them:
|
||||||
|
|
||||||
|
|
||||||
|
::
|
||||||
|
|
||||||
|
## PYTORCH CODE
|
||||||
|
outputs = model(**batch, output_hidden_states=True, output_attentions=True)
|
||||||
|
all_hidden_states, all_attentions = outputs[-2:]
|
||||||
|
## TENSORFLOW CODE
|
||||||
|
outputs = model(batch, output_hidden_states=True, output_attentions=True)
|
||||||
|
all_hidden_states, all_attentions = outputs[-2:]
|
||||||
|
|
||||||
|
Accessing the code
|
||||||
|
^^^^^^^^^^^^^^^^^^
|
||||||
|
|
||||||
|
The :obj:`AutoModel` and :obj:`AutoTokenizer` classes are just shortcuts that will automatically work with any
|
||||||
|
pretrained model. Behind the scenes, the library has one model class per combination of architecture plus class, so the
|
||||||
|
code is easy to access and tweak if you need to.
|
||||||
|
|
||||||
|
In our previous example, the model was called "distilbert-base-uncased-finetuned-sst-2-english", which means it's
|
||||||
|
using the :doc:`DistilBERT </model_doc/distilbert>` architecture. The model automatically created is then a
|
||||||
|
:class:`~transformers.DistilBertForSequenceClassification`. You can look at its documentation for all details relevant
|
||||||
|
to that specific model, or browse the source code. This is how you would directly instantiate the model and tokenizer
|
||||||
|
without the auto magic:
|
||||||
|
|
||||||
|
::
|
||||||
|
|
||||||
|
## PYTORCH CODE
|
||||||
|
from transformers import DistilBertTokenizer, DistilBertForSequenceClassification
|
||||||
|
model_name = "distilbert-base-uncased-finetuned-sst-2-english"
|
||||||
|
model = DistilBertForSequenceClassification.from_pretrained(model_name)
|
||||||
|
tokenizer = DistilBertTokenizer.from_pretrained(model_name)
|
||||||
|
## TENSORFLOW CODE
|
||||||
|
from transformers import DistilBertTokenizer, TFDistilBertForSequenceClassification
|
||||||
|
model_name = "distilbert-base-uncased-finetuned-sst-2-english"
|
||||||
|
model = TFDistilBertForSequenceClassification.from_pretrained(model_name)
|
||||||
|
tokenizer = DistilBertTokenizer.from_pretrained(model_name)
|
||||||
|
|
||||||
|
Customizing the model
|
||||||
|
^^^^^^^^^^^^^^^^^^^^^
|
||||||
|
|
||||||
|
If you want to change how the model itself is built, you can define your custom configuration class. Each architecture
|
||||||
|
comes with its own relevant configuration (in the case of DistilBERT, :class:`~transformers.DistilBertConfig`) which
|
||||||
|
allows you to specify settings such as the hidden dimension, dropout rate, etc. If you do core modifications, like changing the
|
||||||
|
hidden size, you won't be able to use a pretrained model anymore and will need to train from scratch. You would then
|
||||||
|
instantiate the model directly from this configuration.
|
||||||
|
|
||||||
|
Here we use the predefined vocabulary of DistilBERT (hence load the tokenizer with the
|
||||||
|
:func:`~transformers.DistilBertTokenizer.from_pretrained` method) and initialize the model from scratch (hence
|
||||||
|
instantiate the model from the configuration instead of using the
|
||||||
|
:func:`~transformers.DistilBertForSequenceClassification.from_pretrained` method).
|
||||||
|
|
||||||
|
::
|
||||||
|
|
||||||
|
## PYTORCH CODE
|
||||||
|
from transformers import DistilBertConfig, DistilBertTokenizer, DistilBertForSequenceClassification
|
||||||
|
config = DistilBertConfig(n_heads=8, dim=512, hidden_dim=4*512)
|
||||||
|
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
|
||||||
|
model = DistilBertForSequenceClassification(config)
|
||||||
|
## TENSORFLOW CODE
|
||||||
|
from transformers import DistilBertConfig, DistilBertTokenizer, TFDistilBertForSequenceClassification
|
||||||
|
config = DistilBertConfig(n_heads=8, dim=512, hidden_dim=4*512)
|
||||||
|
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
|
||||||
|
model = TFDistilBertForSequenceClassification(config)
|
||||||
|
|
||||||
|
For something that only changes the head of the model (for instance, the number of labels), you can still use a
|
||||||
|
pretrained model for the body. For instance, let's define a classifier for 10 different labels using a pretrained body.
|
||||||
|
We could create a configuration with all the default values and just change the number of labels, but more easily, you
|
||||||
|
can directly pass any argument a configuration would take to the :func:`from_pretrained` method and it will update the
|
||||||
|
default configuration with it:
|
||||||
|
|
||||||
|
::
|
||||||
|
|
||||||
|
## PYTORCH CODE
|
||||||
|
from transformers import DistilBertConfig, DistilBertTokenizer, DistilBertForSequenceClassification
|
||||||
|
model_name = "distilbert-base-uncased"
|
||||||
|
model = DistilBertForSequenceClassification.from_pretrained(model_name, num_labels=10)
|
||||||
|
tokenizer = DistilBertTokenizer.from_pretrained(model_name)
|
||||||
|
## TENSORFLOW CODE
|
||||||
|
from transformers import DistilBertConfig, DistilBertTokenizer, TFDistilBertForSequenceClassification
|
||||||
|
model_name = "distilbert-base-uncased"
|
||||||
|
model = TFDistilBertForSequenceClassification.from_pretrained(model_name, num_labels=10)
|
||||||
|
tokenizer = DistilBertTokenizer.from_pretrained(model_name)
|
||||||
@@ -1,4 +1,4 @@
|
|||||||
Usage
|
Summary of the tasks
|
||||||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||||
|
|
||||||
This page shows the most frequent use-cases when using the library. The models available allow for many different
|
This page shows the most frequent use-cases when using the library. The models available allow for many different
|
||||||