From 417e492f1e832c0b93512600d3385aa4c8a887c9 Mon Sep 17 00:00:00 2001
From: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Date: Mon, 22 Jun 2020 16:08:09 -0400
Subject: [PATCH] Quick tour (#5145)

* Quicktour part 1

* Update

* All done

* Typos

Co-authored-by: Thomas Wolf <thomwolf@users.noreply.github.com>

* Address comments in quick tour

* Update docs/source/quicktour.rst

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>

* Update from feedback

Co-authored-by: Thomas Wolf <thomwolf@users.noreply.github.com>
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
---
 docs/source/index.rst                         |  17 +-
 docs/source/main_classes/pipelines.rst        |   2 +-
 .../source/{summary.rst => model_summary.rst} |   0
 docs/source/philosophy.rst                    |  73 ++++
 docs/source/quickstart.md                     | 222 ----------
 docs/source/quicktour.rst                     | 379 ++++++++++++++++++
 docs/source/{usage.rst => task_summary.rst}   |   2 +-
 7 files changed, 468 insertions(+), 227 deletions(-)
 rename docs/source/{summary.rst => model_summary.rst} (100%)
 create mode 100644 docs/source/philosophy.rst
 delete mode 100644 docs/source/quickstart.md
 create mode 100644 docs/source/quicktour.rst
 rename docs/source/{usage.rst => task_summary.rst} (99%)

diff --git a/docs/source/index.rst b/docs/source/index.rst
index b84276ec05..07670a97c9 100644
--- a/docs/source/index.rst
+++ b/docs/source/index.rst
@@ -38,6 +38,16 @@ Choose the right framework for every part of a model's lifetime:
 Contents
 ---------------------------------
 
+The documentation is organized in five parts:
+
+- **GET STARTED** contains a quick tour, the installation instructions and some useful information about our philosophy
+  and a glossary.
+- **USING TRANSFORMERS** contains general tutorials on how to use the library.
+- **ADVANCED GUIDES** contains more advanced guides that are more specific to a given script or part of the library.
+- **RESEARCH** focuses on tutorials that have less to do with how to use the library but more about general research
+  in transformers models.
+- **PACKAGE REFERENCE** contains the documentation of each public class and function.
+
 The library currently contains PyTorch and Tensorflow implementations, pre-trained model weights, usage scripts and
 conversion utilities for the following models:
 
@@ -118,16 +128,17 @@ conversion utilities for the following models:
     :maxdepth: 2
     :caption: Get started
 
+    quicktour
     installation
-    quickstart
+    philosophy
     glossary
 
 .. toctree::
     :maxdepth: 2
     :caption: Using Transformers
 
-    usage
-    summary
+    task_summary
+    model_summary
     serialization
     model_sharing
     multilingual
diff --git a/docs/source/main_classes/pipelines.rst b/docs/source/main_classes/pipelines.rst
index 04f918b362..ea51feb7ca 100644
--- a/docs/source/main_classes/pipelines.rst
+++ b/docs/source/main_classes/pipelines.rst
@@ -17,7 +17,7 @@ The pipeline abstraction
 The `pipeline` abstraction is a wrapper around all the other available pipelines. It is instantiated as any
 other pipeline but requires an additional argument which is the `task`.
 
-... autofunction:: transformers.pipeline
+.. autofunction:: transformers.pipeline
 
 
 The task specific pipelines
diff --git a/docs/source/summary.rst b/docs/source/model_summary.rst
similarity index 100%
rename from docs/source/summary.rst
rename to docs/source/model_summary.rst
diff --git a/docs/source/philosophy.rst b/docs/source/philosophy.rst
new file mode 100644
index 0000000000..be6182d19f
--- /dev/null
+++ b/docs/source/philosophy.rst
@@ -0,0 +1,73 @@
+Philosophy
+==========
+
+Transformers is an opinionated library built for:
+
+- NLP researchers and educators seeking to use/study/extend large-scale transformers models
+- hands-on practitioners who want to fine-tune those models and/or serve them in production
+- engineers who just want to download a pretrained model and use it to solve a given NLP task.
+
+The library was designed with two strong goals in mind:
+
+- Be as easy and fast to use as possible:
+
+    - We strongly limited the number of user-facing abstractions to learn. In fact, there are almost no abstractions,
+      just three standard classes required to use each model: :doc:`configuration <main_classes/configuration>`, 
+      :doc:`models <main_classes/model>` and :doc:`tokenizer <main_classes/tokenizer>`.
+    - All of these classes can be initialized in a simple and unified way from pretrained instances by using a common
+      :obj:`from_pretrained()` instantiation method which will take care of downloading (if needed), caching and
+      loading the related class instance and associated data (configurations' hyper-parameters, tokenizers' vocabulary, 
+      and models' weights) from a pretrained checkpoint provided on 
+      `Hugging Face Hub <https://huggingface.co/models>`__ or your own saved checkpoint.
+    - On top of those three base classes, the library provides two APIs: :func:`~transformers.pipeline` for quickly
+      using a model (plus its associated tokenizer and configuration) on a given task and 
+      :class:`~transformers.Trainer`/:class:`~transformers.TFTrainer` to quickly train or fine-tune a given model.
+    - As a consequence, this library is NOT a modular toolbox of building blocks for neural nets. If you want to
+      extend/build-upon the library, just use regular Python/PyTorch/TensorFlow/Keras modules and inherit from the base
+      classes of the library to reuse functionalities like model loading/saving.
+
+- Provide state-of-the-art models with performances as close as possible to the original models:
+
+    - We provide at least one example for each architecture which reproduces a result provided by the official authors
+      of said architecture.
+    - The code is usually as close to the original code base as possible which means some PyTorch code may not be as
+      *pytorchic* as it could be as a result of being converted TensorFlow code and vice versa.
+
+A few other goals:
+
+- Expose the models' internals as consistently as possible:
+
+    - We give access, using a single API, to the full hidden-states and attention weights.
+    - Tokenizer and base model's API are standardized to easily switch between models.
+
+- Incorporate a subjective selection of promising tools for fine-tuning/investigating these models:
+
+    - A simple/consistent way to add new tokens to the vocabulary and embeddings for fine-tuning.
+    - Simple ways to mask and prune transformer heads.
+
+- Switch easily between PyTorch and TensorFlow 2.0, allowing training using one framework and inference using another.
+
+Main concepts
+~~~~~~~~~~~~~
+
+The library is built around three types of classes for each model:
+
+- **Model classes**  such as :class:`~transformers.BertModel`, which are 30+ PyTorch models 
+  (`torch.nn.Module <https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`__) or Keras models 
+  (`tf.keras.Model <https://www.tensorflow.org/api_docs/python/tf/keras/Model>`__) that work with the pretrained
+  weights provided in the library.
+- **Configuration classes** such as :class:`~transformers.BertConfig`, which store all the parameters required to build
+  a model. You don't always need to instantiate these yourself. In particular, if you are using a pretrained model
+  without any modification, creating the model will automatically take care of instantiating the configuration (which
+  is part of the model).
+- **Tokenizer classes** such as :class:`~transformers.BertTokenizer`, which store the vocabulary for each model and
+  provide methods for encoding/decoding strings into a list of token embedding indices to be fed to a model.
+
+All these classes can be instantiated from pretrained instances and saved locally using two methods:
+
+- :obj:`from_pretrained()` lets you instantiate a model/configuration/tokenizer from a pretrained version either
+  provided by the library itself (the supported models are provided in the list :doc:`here <pretrained_models>`)
+  or stored locally (or on a server) by the user,
+- :obj:`save_pretrained()` lets you save a model/configuration/tokenizer locally so that it can be reloaded using
+  :obj:`from_pretrained()`.
+
diff --git a/docs/source/quickstart.md b/docs/source/quickstart.md
deleted file mode 100644
index e327679458..0000000000
--- a/docs/source/quickstart.md
+++ /dev/null
@@ -1,222 +0,0 @@
-# Quickstart
-
-## Philosophy
-
-Transformers is an opinionated library built for NLP researchers seeking to use/study/extend large-scale transformers models.
-
-The library was designed with two strong goals in mind:
-
-- be as easy and fast to use as possible:
-
-  - we strongly limited the number of user-facing abstractions to learn, in fact, there are almost no abstractions, just three standard classes required to use each model: configuration, models and tokenizer,
-  - all of these classes can be initialized in a simple and unified way from pretrained instances by using a common `from_pretrained()` instantiation method which will take care of downloading (if needed), caching and loading the related class from a pretrained instance supplied in the library or your own saved instance.
-  - as a consequence, this library is NOT a modular toolbox of building blocks for neural nets. If you want to extend/build-upon the library, just use regular Python/PyTorch modules and inherit from the base classes of the library to reuse functionalities like model loading/saving.
-
-- provide state-of-the-art models with performances as close as possible to the original models:
-
-  - we provide at least one example for each architecture which reproduces a result provided by the official authors of said architecture,
-  - the code is usually as close to the original code base as possible which means some PyTorch code may be not as *pytorchic* as it could be as a result of being converted TensorFlow code.
-
-A few other goals:
-
-- expose the models' internals as consistently as possible:
-
-  - we give access, using a single API to the full hidden-states and attention weights,
-  - tokenizer and base model's API are standardized to easily switch between models.
-
-- incorporate a subjective selection of promising tools for fine-tuning/investigating these models:
-
-  - a simple/consistent way to add new tokens to the vocabulary and embeddings for fine-tuning,
-  - simple ways to mask and prune transformer heads.
-
-## Main concepts
-
-The library is build around three types of classes for each model:
-
-- **model classes**  e.g., `BertModel` which are 20+ PyTorch models (`torch.nn.Modules`) that work with the pretrained weights provided in the library. In TF2, these are `tf.keras.Model`.
-- **configuration classes** which store all the parameters required to build a model, e.g., `BertConfig`. You don't always need to instantiate these your-self. In particular, if you are using a pretrained model without any modification, creating the model will automatically take care of instantiating the configuration (which is part of the model)
-- **tokenizer classes** which store the vocabulary for each model and provide methods for encoding/decoding strings in a list of token embeddings indices to be fed to a model, e.g., `BertTokenizer`
-
-All these classes can be instantiated from pretrained instances and saved locally using two methods:
-
-- `from_pretrained()` let you instantiate a model/configuration/tokenizer from a pretrained version either provided by the library itself (currently 27 models are provided as listed [here](https://huggingface.co/transformers/pretrained_models.html)) or stored locally (or on a server) by the user,
-- `save_pretrained()` let you save a model/configuration/tokenizer locally so that it can be reloaded using `from_pretrained()`.
-
-We'll finish this quickstart tour by going through a few simple quick-start examples to see how we can instantiate and use these classes. The rest of the documentation is organized into two parts:
-
-- the **MAIN CLASSES** section details the common functionalities/method/attributes of the three main type of classes (configuration, model, tokenizer) plus some optimization related classes provided as utilities for training,
-- the **PACKAGE REFERENCE** section details all the variants of each class for each model architectures and, in particular, the input/output that you should expect when calling each of them.
-
-## Quick tour: Usage
-
-Here are two examples showcasing a few `Bert` and `GPT2` classes and pre-trained models.
-
-See the full API reference for examples of each model class.
-
-### BERT example
-
-Let's start by preparing a tokenized input (a list of token embeddings indices to be fed to Bert) from a text string using `BertTokenizer`
-
-```python
-import torch
-from transformers import BertTokenizer, BertModel, BertForMaskedLM
-
-# OPTIONAL: if you want to have more information on what's happening under the hood, activate the logger as follows
-import logging
-logging.basicConfig(level=logging.INFO)
-
-# Load pre-trained model tokenizer (vocabulary)
-tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
-
-# Tokenize input
-text = "[CLS] Who was Jim Henson ? [SEP] Jim Henson was a puppeteer [SEP]"
-tokenized_text = tokenizer.tokenize(text)
-
-# Mask a token that we will try to predict back with `BertForMaskedLM`
-masked_index = 8
-tokenized_text[masked_index] = '[MASK]'
-assert tokenized_text == ['[CLS]', 'who', 'was', 'jim', 'henson', '?', '[SEP]', 'jim', '[MASK]', 'was', 'a', 'puppet', '##eer', '[SEP]']
-
-# Convert token to vocabulary indices
-indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)
-# Define sentence A and B indices associated to 1st and 2nd sentences (see paper)
-segments_ids = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1]
-
-# Convert inputs to PyTorch tensors
-tokens_tensor = torch.tensor([indexed_tokens])
-segments_tensors = torch.tensor([segments_ids])
-```
-
-Let's see how we can use `BertModel` to encode our inputs in hidden-states:
-
-```python
-# Load pre-trained model (weights)
-model = BertModel.from_pretrained('bert-base-uncased')
-
-# Set the model in evaluation mode to deactivate the DropOut modules
-# This is IMPORTANT to have reproducible results during evaluation!
-model.eval()
-
-# If you have a GPU, put everything on cuda
-tokens_tensor = tokens_tensor.to('cuda')
-segments_tensors = segments_tensors.to('cuda')
-model.to('cuda')
-
-# Predict hidden states features for each layer
-with torch.no_grad():
-    # See the models docstrings for the detail of the inputs
-    outputs = model(tokens_tensor, token_type_ids=segments_tensors)
-    # Transformers models always output tuples.
-    # See the models docstrings for the detail of all the outputs
-    # In our case, the first element is the hidden state of the last layer of the Bert model
-    encoded_layers = outputs[0]
-# We have encoded our input sequence in a FloatTensor of shape (batch size, sequence length, model hidden dimension)
-assert tuple(encoded_layers.shape) == (1, len(indexed_tokens), model.config.hidden_size)
-```
-
-And how to use `BertForMaskedLM` to predict a masked token:
-
-```python
-# Load pre-trained model (weights)
-model = BertForMaskedLM.from_pretrained('bert-base-uncased')
-model.eval()
-
-# If you have a GPU, put everything on cuda
-tokens_tensor = tokens_tensor.to('cuda')
-segments_tensors = segments_tensors.to('cuda')
-model.to('cuda')
-
-# Predict all tokens
-with torch.no_grad():
-    outputs = model(tokens_tensor, token_type_ids=segments_tensors)
-    predictions = outputs[0]
-
-# confirm we were able to predict 'henson'
-predicted_index = torch.argmax(predictions[0, masked_index]).item()
-predicted_token = tokenizer.convert_ids_to_tokens([predicted_index])[0]
-assert predicted_token == 'henson'
-```
-
-### OpenAI GPT-2
-
-Here is a quick-start example using `GPT2Tokenizer` and `GPT2LMHeadModel` class with OpenAI's pre-trained model to predict the next token from a text prompt.
-
-First let's prepare a tokenized input from our text string using `GPT2Tokenizer`
-
-```python
-import torch
-from transformers import GPT2Tokenizer, GPT2LMHeadModel
-
-# OPTIONAL: if you want to have more information on what's happening, activate the logger as follows
-import logging
-logging.basicConfig(level=logging.INFO)
-
-# Load pre-trained model tokenizer (vocabulary)
-tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
-
-# Encode a text inputs
-text = "Who was Jim Henson ? Jim Henson was a"
-indexed_tokens = tokenizer.encode(text)
-
-# Convert indexed tokens in a PyTorch tensor
-tokens_tensor = torch.tensor([indexed_tokens])
-```
-
-Let's see how to use `GPT2LMHeadModel` to generate the next token following our text:
-
-```python
-# Load pre-trained model (weights)
-model = GPT2LMHeadModel.from_pretrained('gpt2')
-
-# Set the model in evaluation mode to deactivate the DropOut modules
-# This is IMPORTANT to have reproducible results during evaluation!
-model.eval()
-
-# If you have a GPU, put everything on cuda
-tokens_tensor = tokens_tensor.to('cuda')
-model.to('cuda')
-
-# Predict all tokens
-with torch.no_grad():
-    outputs = model(tokens_tensor)
-    predictions = outputs[0]
-
-# get the predicted next sub-word (in our case, the word 'man')
-predicted_index = torch.argmax(predictions[0, -1, :]).item()
-predicted_text = tokenizer.decode(indexed_tokens + [predicted_index])
-assert predicted_text == 'Who was Jim Henson? Jim Henson was a man'
-```
-
-Examples for each model class of each model architecture (Bert, GPT, GPT-2, Transformer-XL, XLNet and XLM) can be found in the [documentation](#documentation).
-
-#### Using the past
-
-GPT-2, as well as some other models (GPT, XLNet, Transfo-XL, CTRL), make use of a `past` or `mems` attribute which can be used to prevent re-computing the key/value pairs when using sequential decoding. It is useful when generating sequences as a big part of the attention mechanism benefits from previous computations.
-
-Here is a fully-working example using the `past` with `GPT2LMHeadModel` and argmax decoding (which should only be used as an example, as argmax decoding introduces a lot of repetition):
-
-```python
-from transformers import GPT2LMHeadModel, GPT2Tokenizer
-import torch
-
-tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
-model = GPT2LMHeadModel.from_pretrained('gpt2')
-
-generated = tokenizer.encode("The Manhattan bridge")
-context = torch.tensor([generated])
-past = None
-
-for i in range(100):
-    print(i)
-    output, past = model(context, past=past)
-    token = torch.argmax(output[..., -1, :])
-
-    generated += [token.tolist()]
-    context = token.unsqueeze(0)
-
-sequence = tokenizer.decode(generated)
-
-print(sequence)
-```
-
-The model only requires a single token as input as all the previous tokens' key/value pairs are contained in the `past`.
diff --git a/docs/source/quicktour.rst b/docs/source/quicktour.rst
new file mode 100644
index 0000000000..c154265314
--- /dev/null
+++ b/docs/source/quicktour.rst
@@ -0,0 +1,379 @@
+Quick tour
+==========
+
+Let's have a quick look at the 🤗 Transformers library features. The library downloads pretrained models for
+Natural Language Understanding (NLU) tasks, such as analyzing the sentiment of a text, and Natural Language Generation (NLG),
+such as completing a prompt with new text or translating into another language.
+
+First we will see how to easily leverage the pipeline API to quickly use those pretrained models at inference. Then, we
+will dig a little bit more and see how the library gives you access to those models and helps you preprocess your data.
+
+.. note::
+
+    All code examples presented in the documentation have a switch on the top left for PyTorch versus TensorFlow. If
+    not, the code is expected to work for both backends without any change needed.
+
+Getting started on a task with a pipeline
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The easiest way to use a pretrained model on a given task is to use :func:`~transformers.pipeline`. 🤗 Transformers
+provides the following tasks out of the box:
+
+- Sentiment analysis: is a text positive or negative?
+- Text generation (in English): provide a prompt and the model will generate what follows.
+- Named entity recognition (NER): in an input sentence, label each word with the entity it represents (person, place,
+  etc.)
+- Question answering: provide the model with some context and a question, extract the answer from the context.
+- Filling masked text: given a text with masked words (e.g., replaced by ``[MASK]``), fill the blanks.
+- Summarization: generate a summary of a long text.
+- Translation: translate a text into another language.
+- Feature extraction: return a tensor representation of the text.
+
+Let's see how this works for sentiment analysis (the other tasks are all covered in the
+:doc:`task summary </task_summary>`):
+
+::
+
+    from transformers import pipeline
+    classifier = pipeline('sentiment-analysis')
+
+When typing this command for the first time, a pretrained model and its tokenizer are downloaded and cached. We will
+look at both later on, but as an introduction the tokenizer's job is to preprocess the text for the model, which is
+then responsible for making predictions. The pipeline groups all of that together, and post-processes the predictions to
+make them readable. For instance
+
+::
+
+    classifier('We are very happy to show you the Transformers library.')
+
+will return something like this:
+
+::
+
+    [{'label': 'POSITIVE', 'score': 0.999799370765686}]
+
+That's encouraging! You can use it on a list of sentences, which will be preprocessed then fed to the model as a
+`batch`:
+
+::
+
+    classifier(["We are very happy to show you the Transformers library.",
+                "We hope you don't hate it."])
+
+returning a list of dictionaries like this one:
+
+::
+
+    [{'label': 'POSITIVE', 'score': 0.999799370765686},
+     {'label': 'NEGATIVE', 'score': 0.5308589935302734}]
+
+You can see the second sentence has been classified as negative (it needs to be positive or negative) but its score is
+fairly neutral.
+
+By default, the model downloaded for this pipeline is called "distilbert-base-uncased-finetuned-sst-2-english". We can
+look at its `model page <https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english>`__ to get more
+information about it. It uses the :doc:`DistilBERT architecture </model_doc/distilbert>` and has been fine-tuned on a
+dataset called SST-2 for the sentiment analysis task.
+
+Let's say we want to use another model; for instance, one that has been trained on French data. We can search through
+the `model hub <https://huggingface.co/models>`__ that gathers models pretrained on a lot of data by research labs, but
+also community models (usually fine-tuned versions of those big models on a specific dataset). Applying the tags
+"French" and "text-classification" gives back a suggestion "nlptown/bert-base-multilingual-uncased-sentiment". Let's
+see how we can use it. 
+
+You can directly pass the name of the model to use to :func:`~transformers.pipeline`:
+
+::
+
+    classifier = pipeline('sentiment-analysis', model="nlptown/bert-base-multilingual-uncased-sentiment")
+
+This classifier can now deal with texts in English, French, but also Dutch, German, Italian and Spanish! You can also
+replace that name by a local folder where you have saved a pretrained model (see below). You can also pass a model
+object and its associated tokenizer.
+
+We will need two classes for this. The first is :class:`~transformers.AutoTokenizer`, which we will use to download the
+tokenizer associated with the model we picked and instantiate it. The second is
+:class:`~transformers.AutoModelForSequenceClassification` (or
+:class:`~transformers.TFAutoModelForSequenceClassification` if you are using TensorFlow), which we will use to download
+the model itself. Note that if we were using the library on another task, the class of the model would change. The
+:doc:`task summary </task_summary>` tutorial summarizes which class is used for which task.
+
+::
+
+    ## PYTORCH CODE
+    from transformers import AutoTokenizer, AutoModelForSequenceClassification
+    ## TENSORFLOW CODE
+    from transformers import AutoTokenizer, TFAutoModelForSequenceClassification
+
+Now, to download the models and tokenizer we found previously, we just have to use the 
+:func:`~transformers.AutoModelForSequenceClassification.from_pretrained` method (feel free to replace ``model_name`` by
+any other model from the model hub):
+
+::
+
+    ## PYTORCH CODE
+    model_name = "nlptown/bert-base-multilingual-uncased-sentiment"
+    model = AutoModelForSequenceClassification.from_pretrained(model_name)
+    tokenizer = AutoTokenizer.from_pretrained(model_name)
+    classifier = pipeline('sentiment-analysis', model=model, tokenizer=tokenizer)
+    ## TENSORFLOW CODE
+    model_name = "nlptown/bert-base-multilingual-uncased-sentiment"
+    model = TFAutoModelForSequenceClassification.from_pretrained(model_name)
+    tokenizer = AutoTokenizer.from_pretrained(model_name)
+    classifier = pipeline('sentiment-analysis', model=model, tokenizer=tokenizer)
+
+If you don't find a model that has been pretrained on some data similar to yours, you will need to fine-tune a
+pretrained model on your data. We provide :doc:`example scripts </examples>` to do so. Once you're done, don't forget
+to share your fine-tuned model on the hub with the community, using :doc:`this tutorial </model_sharing>`.
+
+.. _pretrained-model:
+
+Under the hood: pretrained models
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Let's now see what happens beneath the hood when using those pipelines. As we saw, the model and tokenizer are created
+using the :obj:`from_pretrained` method:
+
+::
+
+    ## PYTORCH CODE
+    from transformers import AutoTokenizer, AutoModelForSequenceClassification
+    model_name = "distilbert-base-uncased-finetuned-sst-2-english"
+    model = AutoModelForSequenceClassification.from_pretrained(model_name)
+    tokenizer = AutoTokenizer.from_pretrained(model_name)
+    ## TENSORFLOW CODE
+    from transformers import AutoTokenizer, TFAutoModelForSequenceClassification
+    model_name = "distilbert-base-uncased-finetuned-sst-2-english"
+    model = TFAutoModelForSequenceClassification.from_pretrained(model_name)
+    tokenizer = AutoTokenizer.from_pretrained(model_name)
+
+Using the tokenizer
+^^^^^^^^^^^^^^^^^^^
+
+We mentioned the tokenizer is responsible for the preprocessing of your texts. First, it will split a given text into
+words (or parts of words, punctuation symbols, etc.) usually called `tokens`. There are multiple rules that can govern
+that process, which is why we need to instantiate the tokenizer using the name of the model, to make sure we use the
+same rules as when the model was pretrained.
+
+The second step is to convert those `tokens` into numbers, to be able to build a tensor out of them and feed them to
+the model. To do this, the tokenizer has a `vocab`, which is the part we download when we instantiate it with the
+:obj:`from_pretrained` method, since we need to use the same `vocab` as when the model was pretrained.
+
+To apply these steps on a given text, we can just feed it to our tokenizer:
+
+::
+
+    inputs = tokenizer("We are very happy to show you the Transformers library.")
+    print(inputs)
+
+This returns a dictionary mapping strings to lists of ints. It contains the
+`ids of the tokens <glossary.html#input-ids>`__, as mentioned before, but also additional arguments that will be useful
+to the model. Here for instance, we also have an `attention mask <glossary.html#attention-mask>`__ that the model will
+use to have a better understanding of the sequence:
+
+::
+
+    {'input_ids': [101, 2057, 2024, 2200, 3407, 2000, 2265, 2017, 1996, 19081, 3075, 1012, 102],
+     'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
+
+You can pass a list of sentences directly to your tokenizer. If your goal is to send them through your model as a
+batch, you probably want to pad them all to the same length, truncate them to the maximum length the model can accept
+and get tensors back. You can specify all of that to the tokenizer:
+
+::
+
+    ## PYTORCH CODE
+    batch = tokenizer(
+        ["We are very happy to show you the Transformers library.",
+         "We hope you don't hate it."],
+        padding=True, truncation=True, return_tensors="pt")
+    print(batch)
+    ## TENSORFLOW CODE
+    batch = tokenizer(
+        ["We are very happy to show you the Transformers library.",
+         "We hope you don't hate it."],
+        padding=True, truncation=True, return_tensors="tf")
+    print(batch)
+
+The padding is automatically applied on the side the model expects it (in this case, on the right), with the
+padding token the model was pretrained with. The attention mask is also adapted to take the padding into account:
+
+::
+
+    {'input_ids': tensor([[  101,  2057,  2024,  2200,  3407,  2000,  2265,  2017,  1996, 19081, 3075,  1012,   102],
+                          [  101,  2057,  3246,  2017,  2123,  1005,  1056,  5223,  2009,  1012,  102,     0,     0]]), 
+     'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
+                               [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0]])}
+
+You can learn more about tokenizers on their :doc:`doc page <main_classes/tokenizer>` (tutorial coming soon).
+
+Using the model
+^^^^^^^^^^^^^^^
+
+Once your input has been preprocessed by the tokenizer, you can directly send it to the model. As we mentioned, it will
+contain all the relevant information the model needs. If you're using a TensorFlow model, you can directly pass the
+dictionary mapping keys to tensors; for a PyTorch model, you need to unpack the dictionary by adding :obj:`**`.
+
+::
+
+    ## PYTORCH CODE
+    outputs = model(**batch)
+    ## TENSORFLOW CODE
+    outputs = model(batch)
+
+In 🤗 Transformers, all outputs are tuples (with only one element potentially). Here, we get a tuple with just the
+final activations of the model.
+
+::
+
+    (tensor([[-4.1329,  4.3811],
+             [ 0.0818, -0.0418]]),)
+
+.. note::
+
+    All 🤗 Transformers models (PyTorch or TensorFlow) return the activations of the model *before* the final
+    activation function (like SoftMax) since this final activation function is often fused with the loss.
+
+Let's apply the SoftMax activation to get predictions.
+
+::
+
+    ## PYTORCH CODE
+    import torch.nn.functional as F
+    predictions = F.softmax(outputs[0], dim=-1)
+    print(predictions)
+    ## TENSORFLOW CODE
+    predictions = tf.nn.softmax(outputs[0], axis=-1)
+    print(predictions)
+
+We can see we get the numbers from before:
+
+::
+
+    tensor([[2.0060e-04, 9.9980e-01],
+            [5.3086e-01, 4.6914e-01]])
+
+If you have labels, you can provide them to the model; it will return a tuple with the loss and the final activations.
+
+::
+
+    ## PYTORCH CODE
+    import torch
+    outputs = model(**batch, labels=torch.tensor([1, 0]))
+    ## TENSORFLOW CODE
+    import tensorflow as tf
+    outputs = model(batch, labels=tf.constant([1, 0]))
+
+Models are standard `torch.nn.Module <https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`__ or
+`tf.keras.Model <https://www.tensorflow.org/api_docs/python/tf/keras/Model>`__ so you can use them in your usual
+training loop. 🤗 Transformers also provides a :class:`~transformers.Trainer` (or :class:`~transformers.TFTrainer` if
+you are using TensorFlow) class to help with your training (taking care of things such as distributed training, mixed
+precision, etc.). See the training tutorial (coming soon) for more details.
+
+Once your model is fine-tuned, you can save it together with its tokenizer as follows:
+
+::
+
+    tokenizer.save_pretrained(save_directory)
+    model.save_pretrained(save_directory)
+
+You can then load this model back using the :func:`~transformers.AutoModel.from_pretrained` method by passing the
+directory name instead of the model name. One cool feature of 🤗 Transformers is that you can easily switch between
+PyTorch and TensorFlow: any model saved as before can be loaded back either in PyTorch or TensorFlow. If you are
+loading a saved PyTorch model into a TensorFlow model, use :func:`~transformers.TFAutoModel.from_pretrained` like this:
+
+::
+
+    tokenizer = AutoTokenizer.from_pretrained(save_directory)
+    model = TFAutoModel.from_pretrained(save_directory, from_pt=True)
+
+and if you are loading a saved TensorFlow model into a PyTorch model, you should use the following code:
+
+::
+
+    tokenizer = AutoTokenizer.from_pretrained(save_directory)
+    model = AutoModel.from_pretrained(save_directory, from_tf=True)
+
+Lastly, you can also ask the model to return all hidden states and all attention weights if you need them:
+
+
+::
+
+    ## PYTORCH CODE
+    outputs = model(**batch, output_hidden_states=True, output_attentions=True)
+    all_hidden_states, all_attentions = outputs[-2:]
+    ## TENSORFLOW CODE
+    outputs = model(batch, output_hidden_states=True, output_attentions=True)
+    all_hidden_states, all_attentions = outputs[-2:]
+
+Accessing the code
+^^^^^^^^^^^^^^^^^^
+
+The :obj:`AutoModel` and :obj:`AutoTokenizer` classes are just shortcuts that will automatically work with any
+pretrained model. Behind the scenes, the library has one model class per combination of architecture and task (for
+instance, DistilBERT for sequence classification), so the code is easy to access and tweak if you need to.
+
+In our previous example, the model was called "distilbert-base-uncased-finetuned-sst-2-english", which means it's
+using the :doc:`DistilBERT </model_doc/distilbert>` architecture. The model automatically created is then a
+:class:`~transformers.DistilBertForSequenceClassification`. You can look at its documentation for all details relevant
+to that specific model, or browse the source code. This is how you would directly instantiate model and tokenizer
+without the auto magic:
+
+::
+
+    ## PYTORCH CODE
+    from transformers import DistilBertTokenizer, DistilBertForSequenceClassification
+    model_name = "distilbert-base-uncased-finetuned-sst-2-english"
+    model = DistilBertForSequenceClassification.from_pretrained(model_name)
+    tokenizer = DistilBertTokenizer.from_pretrained(model_name)
+    ## TENSORFLOW CODE
+    from transformers import DistilBertTokenizer, TFDistilBertForSequenceClassification
+    model_name = "distilbert-base-uncased-finetuned-sst-2-english"
+    model = TFDistilBertForSequenceClassification.from_pretrained(model_name)
+    tokenizer = DistilBertTokenizer.from_pretrained(model_name)
+
+Customizing the model
+^^^^^^^^^^^^^^^^^^^^^
+
+If you want to change how the model itself is built, you can define a custom configuration. Each architecture
+comes with its own relevant configuration (in the case of DistilBERT, :class:`~transformers.DistilBertConfig`) which
+allows you to specify the hidden dimension, the dropout rate, etc. If you make core modifications, like changing the
+hidden size, you won't be able to use a pretrained model anymore and will need to train from scratch. You would then
+instantiate the model directly from this configuration.
+
+Here we use the predefined vocabulary of DistilBERT (hence load the tokenizer with the
+:func:`~transformers.DistilBertTokenizer.from_pretrained` method) and initialize the model from scratch (hence
+instantiate the model from the configuration instead of using the
+:func:`~transformers.DistilBertForSequenceClassification.from_pretrained` method).
+
+::
+
+    ## PYTORCH CODE
+    from transformers import DistilBertConfig, DistilBertTokenizer, DistilBertForSequenceClassification
+    config = DistilBertConfig(n_heads=8, dim=512, hidden_dim=4*512)
+    tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
+    model = DistilBertForSequenceClassification(config)
+    ## TENSORFLOW CODE
+    from transformers import DistilBertConfig, DistilBertTokenizer, TFDistilBertForSequenceClassification
+    config = DistilBertConfig(n_heads=8, dim=512, hidden_dim=4*512)
+    tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
+    model = TFDistilBertForSequenceClassification(config)
+
+For something that only changes the head of the model (for instance, the number of labels), you can still use a
+pretrained model for the body. For instance, let's define a classifier for 10 different labels using a pretrained body.
+We could create a configuration with all the default values and just change the number of labels, but more easily, you
+can directly pass any argument a configuration would take to the :func:`from_pretrained` method and it will update the
+default configuration with it:
+
+::
+
+    ## PYTORCH CODE
+    from transformers import DistilBertTokenizer, DistilBertForSequenceClassification
+    model_name = "distilbert-base-uncased"
+    model = DistilBertForSequenceClassification.from_pretrained(model_name, num_labels=10)
+    tokenizer = DistilBertTokenizer.from_pretrained(model_name)
+    ## TENSORFLOW CODE
+    from transformers import DistilBertTokenizer, TFDistilBertForSequenceClassification
+    model_name = "distilbert-base-uncased"
+    model = TFDistilBertForSequenceClassification.from_pretrained(model_name, num_labels=10)
+    tokenizer = DistilBertTokenizer.from_pretrained(model_name)
diff --git a/docs/source/usage.rst b/docs/source/task_summary.rst
similarity index 99%
rename from docs/source/usage.rst
rename to docs/source/task_summary.rst
index 5d035c4ab7..a7ef4d4572 100644
--- a/docs/source/usage.rst
+++ b/docs/source/task_summary.rst
@@ -1,4 +1,4 @@
-Usage
+Summary of the tasks
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
 This page shows the most frequent use-cases when using the library. The models available allow for many different