From 2ffef0d0c7a6cfa5a59c4b883849321f66c79d62 Mon Sep 17 00:00:00 2001 From: Joe Davison Date: Thu, 25 Jun 2020 15:11:11 -0600 Subject: [PATCH] Training & fine-tuning quickstart (#5034) * add initial fine-tuning guide * split code blocks to smaller segments * fix up trianer section of fine-tune doc * a few last typos * Update usage -> task summary link Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com> Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com> --- docs/source/index.rst | 1 + docs/source/training.rst | 323 +++++++++++++++++++++++++++++++++++++++ 2 files changed, 324 insertions(+) create mode 100644 docs/source/training.rst diff --git a/docs/source/index.rst b/docs/source/index.rst index bb25474e11..269bc33052 100644 --- a/docs/source/index.rst +++ b/docs/source/index.rst @@ -139,6 +139,7 @@ conversion utilities for the following models: task_summary model_summary + training preprocessing serialization model_sharing diff --git a/docs/source/training.rst b/docs/source/training.rst new file mode 100644 index 0000000000..5d465f8c37 --- /dev/null +++ b/docs/source/training.rst @@ -0,0 +1,323 @@ +Training and fine-tuning +======================== + +Model classes in 🤗 Transformers are designed to be compatible with native +PyTorch and TensorFlow 2 and can be used seemlessly with either. In this +quickstart, we will show how to fine-tune (or train from scratch) a model +using the standard training tools available in either framework. We will also +show how to use our included :func:`~transformers.Trainer` class which +handles much of the complexity of training for you. + +This guide assume that you are already familiar with loading and use our +models for inference; otherwise, see the :doc:`task summary `. We also assume +that you are familiar with training deep neural networks in either PyTorch or +TF2, and focus specifically on the nuances and tools for training models in +🤗 Transformers. + +Sections: + + * :ref:`pytorch` + * :ref:`tensorflow` + * :ref:`trainer` + * :ref:`additional-resources` + +.. _pytorch: + +Fine-tuning in native PyTorch +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +Model classes in 🤗 Transformers that don't begin with ``TF`` are +`PyTorch Modules `_, +meaning that you can use them just as you would any model in PyTorch for +both inference and optimization. + +Let's consider the common task of fine-tuning a masked language model like +BERT on a sequence classification dataset. When we instantiate a model with +:func:`~transformers.PreTrainedModel.from_pretrained`, the model +configuration and pre-trained weights +of the specified model are used to initialize the model. The +library also includes a number of task-specific final layers or 'heads' whose +weights are instantiated randomly when not present in the specified +pre-trained model. For example, instantiating a model with +``BertForSequenceClassification.from_pretrained('bert-base-uncased', num_classes=2)`` +will create a BERT model instance with encoder weights copied from the +``bert-base-uncased`` model and a randomly initialized sequence +classification head on top of the encoder with an output size of 2. Models +are initialized in ``eval`` mode by default. We can call ``model.train()`` to +put it in train mode. + +.. code-block:: python + + from transformers import BertForSequenceClassification + model = BertForSequenceClassification.from_pretrained('bert-base-uncased') + model.train() + +This is useful because it allows us to make use of the pre-trained BERT +encoder and easily train it on whatever sequence classification dataset we +choose. We can use any PyTorch optimizer, but our library also provides the +:func:`~transformers.AdamW` optimizer which implements gradient bias +correction as well as weight decay. + +.. code-block:: python + + from transformers import AdamW + optimizer = AdamW(model.parameters(), lr=1e-5) + +The optimizer allows us to apply different hyperpameters for specific +parameter groups. For example, we can apply weight decay to all parameters +other than bias and layer normalization terms: + +.. code-block:: python + + no_decay = ['bias', 'LayerNorm.weight'] + optimizer_grouped_parameters = [ + {'params': [p for n, p in model.named_parameters() if not any(nd in n for nd in no_decay)], 'weight_decay': 0.01}, + {'params': [p for n, p in model.named_parameters() if any(nd in n for nd in no_decay)], 'weight_decay': 0.0} + ] + optimizer = AdamW(optimizer_grouped_parameters, lr=1e-5) + +Now we can set up a simple dummy training batch using +:func:`~transformers.PreTrainedTokenizer.batch_encode_plus`. This returns a +:func:`~transformers.BatchEncoding` instance which +prepares everything we might need to pass to the model. + +.. code-block:: python + + from transformers import BertTokenizer + tokenizer = BertTokenizer.from_pretrained('bert-base-uncased') + text_batch = ["I love Pixar.", "I don't care for Pixar."] + encoding = tokenizer(text_batch, return_tensors='pt', padding=True, truncation=True) + input_ids = encoding['input_ids'] + attention_mask = encoding['attention_mask'] + +When we call a classification model with the ``labels`` argument, the first +returned element is the Cross Entropy loss between the predictions and the +passed labels. Having already set up our optimizer, we can then do a +backwards pass and update the weights: + +.. code-block:: python + + labels = torch.tensor([1,0]).unsqueeze(0) + outputs = model(input_ids, attention_mask=attention_mask, labels=labels) + loss = outputs[0] + loss.backward() + optimizer.step() + +Alternatively, you can just get the logits and calculate the loss yourself. +The following is equivalent to the previous example: + +.. code-block:: python + + from torch.nn import functional as F + labels = torch.tensor([1,0]).unsqueeze(0) + outputs = model(input_ids, attention_mask=attention_mask) + loss = F.cross_entropy(labels, outputs[0]) + loss.backward() + optimizer.step() + +Of course, you can train on GPU by calling ``to('cuda')`` on the model and +inputs as usual. + +We also provide a few learning rate scheduling tools. With the following, we +can set up a scheduler which warms up for ``num_warmup_steps`` and then +linearly decays to 0 by the end of training. + +.. code-block:: python + + from transformers import get_linear_schedule_with_warmup + scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps, num_train_steps) + +Then all we have to do is call ``scheduler.step()`` after ``optimizer.step()``. + +.. code-block:: python + + ... + loss.backward() + optimizer.step() + scheduler.step() + +We highly recommend using :func:`~transformers.Trainer`, discussed below, +which conveniently handles the moving parts of training 🤗 Transformers models +with features like mixed precision and easy tensorboard logging. + + +Freezing the encoder +-------------------- + +In some cases, you might be interested in keeping the weights of the +pre-trained encoder frozen and optimizing only the weights of the head +layers. To do so, simply set the ``requires_grad`` attribute to ``False`` on +the encoder parameters, which can be accessed with the ``base_model`` +submodule on any task-specific model in the library: + +.. code-block:: python + + for param in model.base_model.parameters(): + param.requires_grad = False + + +.. _tensorflow: + +Fine-tuning in native TensorFlow 2 +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +Models can also be trained natively in TensorFlow 2. Just as with PyTorch, +TensorFlow models can be instantiated with +:func:`~transformers.PreTrainedModel.from_pretrained` to load the weights of +the encoder from a pretrained model. + +.. code-block:: python + + from transformers import TFBertForSequenceClassification + model = TFBertForSequenceClassification.from_pretrained('bert-base-uncased') + +Let's use ``tensorflow_datasets`` to load in the `MRPC dataset +`_ from GLUE. We +can then use our built-in +:func:`~transformers.data.processors.glue.glue_convert_examples_to_features` +to tokenize MRPC and convert it to a TensorFlow ``Dataset`` object. Note that +tokenizers are framework-agnostic, so there is no need to prepend ``TF`` to +the pretrained tokenizer name. + +.. code-block:: python + + from transformers import BertTokenizer, glue_convert_examples_to_features + import tensorflow_datasets as tfds + tokenizer = BertTokenizer.from_pretrained('bert-base-uncased') + data = tfds.load('glue/mrpc') + train_dataset = glue_convert_examples_to_features(data['train'], tokenizer, max_length=128, task='mrpc') + train_dataset = train_dataset.shuffle(100).batch(32).repeat(2) + +The model can then be compiled and trained as any Keras model: + +.. code-block:: python + + optimizer = tf.keras.optimizers.Adam(learning_rate=3e-5) + loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True) + model.compile(optimizer=optimizer, loss=loss) + model.fit(train_dataset, epochs=2, steps_per_epoch=115) + +With the tight interoperability between TensorFlow and PyTorch models, you +can even save the model and then reload it as a PyTorch model (or vice-versa): + +.. code-block:: python + + from transformers import BertForSequenceClassification + model.save_pretrained('./my_mrpc_model/') + pytorch_model = BertForSequenceClassification.from_pretrained('./my_mrpc_model/', from_tf=True) + + +.. _trainer: + +Trainer +^^^^^^^ + +We also provide a simple but feature-complete training and evaluation +interface through :func:`~transformers.Trainer` and +:func:`~transformers.TFTrainer`. You can train, fine-tune, +and evaluate any 🤗 Transformers model with a wide range of training options and +with built-in features like logging, gradient accumulation, and mixed +precision. + +.. code-block:: python + + ## PYTORCH CODE + from transformers import BertForSequenceClassification, Trainer, TrainingArguments + + model = BertForSequenceClassification.from_pretrained("bert-large-uncased") + + training_args = TrainingArguments( + output_dir='./results', # output directory + num_train_epochs=3, # total # of training epochs + per_device_train_batch_size=16, # batch size per device during training + per_device_eval_batch_size=64, # batch size for evaluation + warmup_steps=500, # number of warmup steps for learning rate scheduler + weight_decay=0.01, # strength of weight decay + logging_dir='./logs', # directory for storing logs + ) + + trainer = Trainer( + model=model, # the instantiated 🤗 Transformers model to be trained + args=training_args, # training arguments, defined above + train_dataset=train_dataset, # training dataset + eval_dataset=test_dataset # evaluation dataset + ) + ## TENSORFLOW CODE + from transformers import TFBertForSequenceClassification, TFTrainer, TFTrainingArguments + + model = TFBertForSequenceClassification.from_pretrained("bert-large-uncased") + + training_args = TFTrainingArguments( + output_dir='./results', # output directory + num_train_epochs=3, # total # of training epochs + per_device_train_batch_size=16, # batch size per device during training + per_device_eval_batch_size=64, # batch size for evaluation + warmup_steps=500, # number of warmup steps for learning rate scheduler + weight_decay=0.01, # strength of weight decay + logging_dir='./logs', # directory for storing logs + ) + + trainer = TFTrainer( + model=model, # the instantiated 🤗 Transformers model to be trained + args=training_args, # training arguments, defined above + train_dataset=tfds_train_dataset, # tensorflow_datasets training dataset + eval_dataset=tfds_test_dataset # tensorflow_datasets evaluation dataset + ) + +Now simply call ``trainer.train()`` to train and ``trainer.evaluate()`` to +evaluate. You can use your own module as well, but the first +argument returned from ``forward`` must be the loss which you wish to +optimize. + +:func:`~transformers.Trainer` uses a built-in default function to collate +batches and prepare them to be fed into the model. If needed, you can also +use the ``data_collator`` argument to pass your own collator function which +takes in the data in the format provides by your dataset and returns a +batch ready to be fed into the model. Note that +:func:`~transformers.TFTrainer` expects the passed datasets to be dataset +objects from ``tensorflow_datasets``. + +To calculate additional metrics in addition to the loss, you can also define +your own ``compute_metrics`` function and pass it to the trainer. + +.. code-block:: python + + from sklearn.metrics import precision_recall_fscore_support + + def compute_metrics(pred): + labels = pred.label_ids + preds = pred.predictions.argmax(-1) + precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average='binary') + acc = accuracy_score(labels, preds) + return { + 'accuracy': acc, + 'f1': f1, + 'precision': precision, + 'recall': recall + } + +Finally, you can view the results, including any calculated metrics, by +launching tensorboard in your specified ``logging_dir`` directory. + + +.. _additional-resources: + +Additional resources +^^^^^^^^^^^^^^^^^^^^ + + * `A lightweight colab demo + `_ + which uses ``Trainer`` for IMDb sentiment classification. + + * `🤗 Transformers Examples `_ + including scripts for training and fine-tuning on GLUE, SQuAD, and + several other tasks. + + * `How to train a language model + `_, + a detailed colab notebook which uses ``Trainer`` to train a masked + language model from scratch on Esperanto. + + * `🤗 Transformers Notebooks <./notebooks.html>`_ which contain dozens + of example notebooks from the community for training and using + 🤗 Transformers on a variety of tasks.