Convert tutorials (#14665)

* Convert a few docs * And another * Last tutorials * New syntax for colab links
2021-12-08 13:19:46 -05:00
parent 0f4e39c559
commit cf36f4d7a8
18 changed files with 3608 additions and 3836 deletions
--- a/docs/source/training.mdx
+++ b/docs/source/training.mdx
@@ -0,0 +1,399 @@
+<!--Copyright 2020 The HuggingFace Team. All rights reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
+an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
+specific language governing permissions and limitations under the License.
+-->
+
+# Fine-tuning a pretrained model
+
+[[open-in-colab]]
+
+In this tutorial, we will show you how to fine-tune a pretrained model from the 🤗 Transformers library. In TensorFlow,
+models can be trained directly using Keras and the `fit` method. In PyTorch, there is no generic training loop, so
+the 🤗 Transformers library provides an API with the class [`Trainer`] to let you fine-tune or train
+a model from scratch easily. We will then show you how to write the whole training loop yourself in PyTorch.
+
+Before we can fine-tune a model, we need a dataset. In this tutorial, we will show you how to fine-tune BERT on the
+[IMDB dataset](https://www.imdb.com/interfaces/): the task is to classify whether movie reviews are positive or
+negative. For examples of other tasks, refer to the [Additional resources](#additional-resources) section.
+
+<a id='data-processing'></a>
+
+## Preparing the datasets
+
+<Youtube id="_BZearw7f0w"/>
+
+We will use the [🤗 Datasets](https://github.com/huggingface/datasets/) library to download and preprocess the IMDB
+datasets. We will go over this part pretty quickly. Since the focus of this tutorial is on training, you should refer
+to the 🤗 Datasets [documentation](https://huggingface.co/docs/datasets/) or the [preprocessing](preprocessing) tutorial for
+more information.
+
+First, we can use the `load_dataset` function to download and cache the dataset:
+
+```python
+from datasets import load_dataset
+
+raw_datasets = load_dataset("imdb")
+```
+
+This works like the `from_pretrained` method we saw for the models and tokenizers (except the cache directory is
+_~/.cache/huggingface/datasets_ by default).
+
+The `raw_datasets` object is a dictionary with three keys: `"train"`, `"test"` and `"unsupervised"`
+(which correspond to the three splits of that dataset). We will use the `"train"` split for training and the
+`"test"` split for validation.
+
+To preprocess our data, we will need a tokenizer:
+
+```python
+from transformers import AutoTokenizer
+
+tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
+```
+
+As we saw in [preprocessing](preprocessing), we can prepare the text inputs for the model with the following command (this is an
+example, not a command you can execute):
+
+```python
+inputs = tokenizer(sentences, padding="max_length", truncation=True)
+```
+
+This will make all the samples have the maximum length the model can accept (here 512), either by padding or truncating
+them.
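
To make the effect of `padding="max_length"` and `truncation=True` concrete, here is a minimal pure-Python sketch of the idea. It is not the tokenizer's actual implementation: the helper name `pad_or_truncate`, the pad id `0`, and the tiny `max_length` are assumptions chosen for illustration, and a real tokenizer also preserves special tokens such as `[SEP]` when truncating, which this sketch ignores.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    ids = list(token_ids)[:max_length]  # cut sequences that are too long
    mask = [1] * len(ids)               # 1 marks real tokens
    n_pad = max_length - len(ids)       # how many padding slots are needed
    return ids + [pad_id] * n_pad, mask + [0] * n_pad


short_ids, short_mask = pad_or_truncate([101, 2023, 102], max_length=5)
long_ids, long_mask = pad_or_truncate([101, 1, 2, 3, 4, 5, 102], max_length=5)
print(short_ids, short_mask)
print(long_ids, long_mask)
```

Every sample ends up with the same length, which is what lets the samples be stacked into fixed-shape tensors later on.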
+
+However, we can instead apply these preprocessing steps to all the splits of our dataset at once by using the
+`map` method:
+
+```python
+def tokenize_function(examples):
+    return tokenizer(examples["text"], padding="max_length", truncation=True)
+
+tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
+```
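
With `batched=True`, the function passed to `map` receives a dictionary of columns (lists) rather than a single example. The following is a rough pure-Python sketch of that calling convention, not how 🤗 Datasets implements it: `batched_map` and `toy_tokenize` are hypothetical helpers, and a whitespace split stands in for a real tokenizer.

```python
def batched_map(rows, function, batch_size=1000):
    """Apply `function` to column-oriented batches and merge new columns into each row."""
    out = []
    for start in range(0, len(rows), batch_size):
        chunk = rows[start:start + batch_size]
        # Turn a list of row dicts into a dict of columns, as a batched function expects.
        batch = {key: [row[key] for row in chunk] for key in chunk[0]}
        new_columns = function(batch)
        for i, row in enumerate(chunk):
            out.append({**row, **{k: v[i] for k, v in new_columns.items()}})
    return out


def toy_tokenize(examples):
    # Whitespace split stands in for a real tokenizer.
    return {"tokens": [text.split() for text in examples["text"]]}


rows = [
    {"text": "a great movie", "label": 1},
    {"text": "boring", "label": 0},
    {"text": "not bad", "label": 1},
]
mapped = batched_map(rows, toy_tokenize, batch_size=2)
print(mapped[0])
```

The original columns are kept and the new ones are added alongside them, which is why the `"text"` column is still present after tokenization.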
+
+You can learn more about the `map` method or the other ways to preprocess the data in the 🤗 Datasets [documentation](https://huggingface.co/docs/datasets/).
+
+Next, we will create small subsets of the training and validation sets to enable faster training:
+
+```python
+small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(1000))
+small_eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(1000))
+full_train_dataset = tokenized_datasets["train"]
+full_eval_dataset = tokenized_datasets["test"]
+```
+
+In all the examples below, we will always use `small_train_dataset` and `small_eval_dataset`. Just replace
+them with their _full_ equivalents to train or evaluate on the full dataset.
+
+<a id='trainer'></a>
+
+## Fine-tuning in PyTorch with the Trainer API
+
+<Youtube id="nvBXf7s7vTI"/>
+
+Since PyTorch does not provide a training loop, the 🤗 Transformers library provides a [`Trainer`]
+API that is optimized for 🤗 Transformers models, with a wide range of training options and with built-in features like
+logging, gradient accumulation, and mixed precision.
+
+First, let's define our model:
+
+```python
+from transformers import AutoModelForSequenceClassification
+
+model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=2)
+```
+
+This will issue a warning about some of the pretrained weights not being used and some weights being randomly
+initialized. That's because we are throwing away the pretraining head of the BERT model to replace it with a
+classification head which is randomly initialized. We will fine-tune this model on our task, transferring the knowledge
+of the pretrained model to it (which is why doing this is called transfer learning).
+
+Then, to define our [`Trainer`], we will need to instantiate a
+[`TrainingArguments`]. This class contains all the hyperparameters we can tune for the
+[`Trainer`] or the flags to activate the different training options it supports. Let's begin by
+using all the defaults. The only thing we then have to provide is a directory in which the checkpoints will be saved:
+
+```python
+from transformers import TrainingArguments
+
+training_args = TrainingArguments("test_trainer")
+```
+
+Then we can instantiate a [`Trainer`] like this:
+
+```python
+from transformers import Trainer
+
+trainer = Trainer(
+    model=model, args=training_args, train_dataset=small_train_dataset, eval_dataset=small_eval_dataset
+)
+```
+
+To fine-tune our model, we just need to call
+
+```python
+trainer.train()
+```
+
+which will start a training run that you can follow with a progress bar. It should take a couple of minutes to complete
+(as long as you have access to a GPU). However, it won't tell you anything useful about how well (or badly) your model
+is performing: by default, there is no evaluation during training, and we didn't tell the
+[`Trainer`] to compute any metrics. Let's have a look at how to do that now!
+
+To have the [`Trainer`] compute and report metrics, we need to give it a `compute_metrics`
+function that takes predictions and labels (grouped in a namedtuple called [`EvalPrediction`]) and
+returns a dictionary with string keys (the metric names) and float values (the metric values).
+
+The 🤗 Datasets library provides an easy way to get the common metrics used in NLP with the `load_metric` function.
+Here we simply use accuracy. Then we define the `compute_metrics` function, which just converts logits to predictions
+(remember that all 🤗 Transformers models return the logits) and feeds them to the `compute` method of this metric.
+
+```python
+import numpy as np
+from datasets import load_metric
+
+metric = load_metric("accuracy")
+
+def compute_metrics(eval_pred):
+    logits, labels = eval_pred
+    predictions = np.argmax(logits, axis=-1)
+    return metric.compute(predictions=predictions, references=labels)
+```
+
+The `compute_metrics` function needs to receive a tuple (with logits and labels) and has to return a dictionary with string keys
+(the name of the metric) and float values. It will be called at the end of each evaluation phase on the whole arrays of
+predictions/labels.
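
As a sanity check on that contract, here is a dependency-free sketch of the same computation on toy values. The helper names `argmax` and `accuracy_from_logits` and the numbers are made up for illustration; the tutorial's own version relies on NumPy and the loaded metric instead.

```python
def argmax(row):
    """Index of the largest value in a list of scores."""
    return max(range(len(row)), key=lambda i: row[i])


def accuracy_from_logits(logits, labels):
    """Convert per-class logits to predicted classes, then score them against labels."""
    predictions = [argmax(row) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return {"accuracy": correct / len(labels)}


toy_logits = [[2.0, -1.0], [0.1, 0.3], [-0.5, 1.5], [1.2, 0.9]]
toy_labels = [0, 1, 0, 0]
result = accuracy_from_logits(toy_logits, toy_labels)
print(result)
```

The return value has the shape the [`Trainer`] expects: a dictionary mapping a metric name to a float.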
+
+To check if this works in practice, let's create a new [`Trainer`] with our fine-tuned model:
+
+```python
+trainer = Trainer(
+    model=model,
+    args=training_args,
+    train_dataset=small_train_dataset,
+    eval_dataset=small_eval_dataset,
+    compute_metrics=compute_metrics,
+)
+trainer.evaluate()
+```
+
+which showed an accuracy of 87.5% in our case.
+
+If you want to fine-tune your model and regularly report the evaluation metrics (for instance at the end of each
+epoch), here is how you should define your training arguments:
+
+```python
+from transformers import TrainingArguments
+
+training_args = TrainingArguments("test_trainer", evaluation_strategy="epoch")
+```
+
+See the documentation of [`TrainingArguments`] for more options.
+
+
+<a id='keras'></a>
+
+## Fine-tuning with Keras
+
+<Youtube id="rnTGBy2ax1c"/>
+
+Models can also be trained natively in TensorFlow using the Keras API. First, let's define our model:
+
+```python
+import tensorflow as tf
+from transformers import TFAutoModelForSequenceClassification
+
+model = TFAutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=2)
+```
+
+Then we will need to convert our datasets from before into standard `tf.data.Dataset` objects. Since we have fixed shapes,
+this can easily be done. First we remove the _"text"_ column from our datasets and set them in TensorFlow
+format:
+
+```python
+tf_train_dataset = small_train_dataset.remove_columns(["text"]).with_format("tensorflow")
+tf_eval_dataset = small_eval_dataset.remove_columns(["text"]).with_format("tensorflow")
+```
+
+Then we convert everything into big tensors and use the `tf.data.Dataset.from_tensor_slices` method:
+
+```python
+train_features = {x: tf_train_dataset[x] for x in tokenizer.model_input_names}
+train_tf_dataset = tf.data.Dataset.from_tensor_slices((train_features, tf_train_dataset["label"]))
+train_tf_dataset = train_tf_dataset.shuffle(len(tf_train_dataset)).batch(8)
+
+eval_features = {x: tf_eval_dataset[x] for x in tokenizer.model_input_names}
+eval_tf_dataset = tf.data.Dataset.from_tensor_slices((eval_features, tf_eval_dataset["label"]))
+eval_tf_dataset = eval_tf_dataset.batch(8)
+```
+
+With this done, the model can then be compiled and trained like any Keras model:
+
+```python
+model.compile(
+    optimizer=tf.keras.optimizers.Adam(learning_rate=5e-5),
+    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
+    metrics=tf.metrics.SparseCategoricalAccuracy(),
+)
+
+model.fit(train_tf_dataset, validation_data=eval_tf_dataset, epochs=3)
+```
+
+With the tight interoperability between TensorFlow and PyTorch models, you can even save the model and then reload it
+as a PyTorch model (or vice-versa):
+
+```python
+from transformers import AutoModelForSequenceClassification
+
+model.save_pretrained("my_imdb_model")
+pytorch_model = AutoModelForSequenceClassification.from_pretrained("my_imdb_model", from_tf=True)
+```
+
+<a id='pytorch_native'></a>
+
+## Fine-tuning in native PyTorch
+
+<Youtube id="Dh9CL8fyG80"/>
+
+You might need to restart your notebook at this stage to free some memory, or execute the following code:
+
+```python
+import torch
+
+del model
+del pytorch_model  # only defined if you ran the Keras section
+del trainer
+torch.cuda.empty_cache()
+```
+
+Let's now see how to achieve the same results as in the [Trainer section](#trainer) in native PyTorch. First we need to
+define the dataloaders, which we will use to iterate over batches. Before doing that, we need to apply a bit of
+post-processing to our `tokenized_datasets` to:
+
+- remove the columns corresponding to values the model does not expect (here the `"text"` column)
+- rename the column `"label"` to `"labels"` (because the model expects the argument to be named `labels`)
+- set the format of the datasets so they return PyTorch Tensors instead of lists.
+
+Our `tokenized_datasets` object has one method for each of those steps:
+
+```python
+tokenized_datasets = tokenized_datasets.remove_columns(["text"])
+tokenized_datasets = tokenized_datasets.rename_column("label", "labels")
+tokenized_datasets.set_format("torch")
+
+small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(1000))
+small_eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(1000))
+```
+
+Now that this is done, we can easily define our dataloaders:
+
+```python
+from torch.utils.data import DataLoader
+
+train_dataloader = DataLoader(small_train_dataset, shuffle=True, batch_size=8)
+eval_dataloader = DataLoader(small_eval_dataset, batch_size=8)
+```
+
+Next, we define our model:
+
+```python
+from transformers import AutoModelForSequenceClassification
+
+model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=2)
+```
+
+We are almost ready to write our training loop; the only two things missing are an optimizer and a learning rate
+scheduler. The default optimizer used by the [`Trainer`] is [`AdamW`]:
+
+```python
+from transformers import AdamW
+
+optimizer = AdamW(model.parameters(), lr=5e-5)
+```
+
+Finally, the learning rate scheduler used by default is just a linear decay from the maximum value (5e-5 here) to 0:
+
+```python
+from transformers import get_scheduler
+
+num_epochs = 3
+num_training_steps = num_epochs * len(train_dataloader)
+lr_scheduler = get_scheduler(
+    "linear",
+    optimizer=optimizer,
+    num_warmup_steps=0,
+    num_training_steps=num_training_steps
+)
+```
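
To see what such a linear decay looks like numerically, here is a small standalone sketch. It assumes the usual linear warmup followed by linear decay formula; `linear_lr` is a hypothetical helper written for illustration, not a library function. The step count follows from the numbers used in this tutorial (1000 training examples, batch size 8, 3 epochs).

```python
def linear_lr(step, base_lr, num_training_steps, num_warmup_steps=0):
    """Learning rate after `step` optimizer updates: linear warmup, then linear decay to 0."""
    if step < num_warmup_steps:
        return base_lr * step / max(1, num_warmup_steps)
    remaining = num_training_steps - step
    return base_lr * max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))


num_epochs = 3
batches_per_epoch = 1000 // 8  # 1000 examples with batch size 8 give 125 batches
num_training_steps = num_epochs * batches_per_epoch

# Print the learning rate at the start, the midpoint, and the end of training.
for step in (0, num_training_steps // 2, num_training_steps):
    print(step, linear_lr(step, 5e-5, num_training_steps))
```

With no warmup, the rate starts at its maximum and reaches exactly 0 on the last step, so the scheduler must be stepped once per optimizer update for the decay to span the whole run.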
+
+One last thing: we will want to use the GPU if we have access to one (otherwise training might take several hours
+instead of a couple of minutes). To do this, we define a `device` we will put our model and our batches on.
+
+```python
+import torch
+
+device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
+model.to(device)
+```
+
+We are now ready to train! To get some sense of when it will be finished, we add a progress bar over our number of
+training steps, using the _tqdm_ library.
+
+```python
+from tqdm.auto import tqdm
+
+progress_bar = tqdm(range(num_training_steps))
+
+model.train()
+for epoch in range(num_epochs):
+    for batch in train_dataloader:
+        batch = {k: v.to(device) for k, v in batch.items()}
+        outputs = model(**batch)
+        loss = outputs.loss
+        loss.backward()
+
+        optimizer.step()
+        lr_scheduler.step()
+        optimizer.zero_grad()
+        progress_bar.update(1)
+```
+
+Note that if you are used to freezing the body of your pretrained model (like in computer vision) the above may seem a
+bit strange, as we are directly fine-tuning the whole model without taking any precaution. It actually works better
+this way for Transformers models (so this is not an oversight on our side). If you're not familiar with what "freezing
+the body" of the model means, forget you read this paragraph.
+
+Now to check the results, we need to write the evaluation loop. As in the [Trainer section](#trainer), we will
+use a metric from the 🤗 Datasets library. Here we accumulate the predictions at each batch before computing the final
+result when the loop is finished.
+
+```python
+from datasets import load_metric
+
+metric = load_metric("accuracy")
+model.eval()
+for batch in eval_dataloader:
+    batch = {k: v.to(device) for k, v in batch.items()}
+    with torch.no_grad():
+        outputs = model(**batch)
+
+    logits = outputs.logits
+    predictions = torch.argmax(logits, dim=-1)
+    metric.add_batch(predictions=predictions, references=batch["labels"])
+
+metric.compute()
+```
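
The accumulate-then-compute pattern used in this loop can be illustrated without any library. `RunningAccuracy` below is a hypothetical stand-in that mimics only the `add_batch`/`compute` interface; it is a sketch of the idea, not the 🤗 Datasets metric implementation.

```python
class RunningAccuracy:
    """Hypothetical stand-in for a metric object exposing add_batch and compute."""

    def __init__(self):
        self.correct = 0
        self.total = 0

    def add_batch(self, predictions, references):
        # Only running counts are kept, so memory use stays constant across batches.
        self.correct += sum(int(p == r) for p, r in zip(predictions, references))
        self.total += len(references)

    def compute(self):
        return {"accuracy": self.correct / self.total}


running = RunningAccuracy()
running.add_batch(predictions=[1, 0, 1], references=[1, 0, 0])
running.add_batch(predictions=[0, 0], references=[0, 1])
final = running.compute()
print(final)
```

Because the final score is computed over all accumulated examples, batches of different sizes (such as a smaller last batch) are weighted correctly, unlike an average of per-batch accuracies.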
+
+<a id='additional-resources'></a>
+
+## Additional resources
+
+To look at more fine-tuning examples you can refer to:
+
+- [🤗 Transformers Examples](https://github.com/huggingface/transformers/tree/master/examples) which includes scripts
+  to train on all common NLP tasks in PyTorch and TensorFlow.
+
+- [🤗 Transformers Notebooks](notebooks) which contains various notebooks and in particular one per task (look for
+  the _how to finetune a model on xxx_).