From 2b9513fdabbcfd3ca5d7003a955be633a2f365fc Mon Sep 17 00:00:00 2001 From: Matt Date: Wed, 7 Sep 2022 13:30:07 +0100 Subject: [PATCH] Update TF fine-tuning docs (#18654) * Update TF fine-tuning docs * Fix formatting * Add some section headers so the right sidebar works better * Squiggly it * Update docs/source/en/training.mdx Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com> * Update docs/source/en/training.mdx Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com> * Update docs/source/en/training.mdx Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com> * Update docs/source/en/training.mdx Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com> * Update docs/source/en/training.mdx Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com> * Update docs/source/en/training.mdx Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com> * Update docs/source/en/training.mdx Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com> * Update docs/source/en/training.mdx Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com> * Update docs/source/en/training.mdx Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com> * Update docs/source/en/training.mdx Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com> * Update docs/source/en/training.mdx Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com> * Update docs/source/en/training.mdx Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com> * Update docs/source/en/training.mdx Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com> * Explain things in the text, not the comments * Make the two dataset creation methods into a list * Move the advice about collation out of a * Edits for clarity * Edits for clarity * Edits for clarity * Replace `to_tf_dataset` with `prepare_tf_dataset` in the fine-tuning pages * 
Restructure the page a little bit * Restructure the page a little bit * Restructure the page a little bit Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com> Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com> --- docs/source/en/tasks/language_modeling.mdx | 24 ++-- docs/source/en/tasks/multiple_choice.mdx | 17 +-- docs/source/en/tasks/question_answering.mdx | 12 +- .../en/tasks/sequence_classification.mdx | 11 +- docs/source/en/tasks/summarization.mdx | 10 +- docs/source/en/tasks/token_classification.mdx | 10 +- docs/source/en/tasks/translation.mdx | 12 +- docs/source/en/training.mdx | 121 +++++++++++++----- 8 files changed, 130 insertions(+), 87 deletions(-) diff --git a/docs/source/en/tasks/language_modeling.mdx b/docs/source/en/tasks/language_modeling.mdx index f410bd5a55..82708f2f89 100644 --- a/docs/source/en/tasks/language_modeling.mdx +++ b/docs/source/en/tasks/language_modeling.mdx @@ -245,20 +245,18 @@ At this point, only three steps remain: ``` -To fine-tune a model in TensorFlow, start by converting your datasets to the `tf.data.Dataset` format with [`~datasets.Dataset.to_tf_dataset`]. Specify inputs and labels in `columns`, whether to shuffle the dataset order, batch size, and the data collator: +To fine-tune a model in TensorFlow, start by converting your datasets to the `tf.data.Dataset` format with [`~TFPreTrainedModel.prepare_tf_dataset`]. ```py ->>> tf_train_set = lm_dataset["train"].to_tf_dataset( -... columns=["attention_mask", "input_ids", "labels"], -... dummy_labels=True, +>>> tf_train_set = model.prepare_tf_dataset( +... lm_dataset["train"], ... shuffle=True, ... batch_size=16, ... collate_fn=data_collator, ... ) ->>> tf_test_set = lm_dataset["test"].to_tf_dataset( -... columns=["attention_mask", "input_ids", "labels"], -... dummy_labels=True, +>>> tf_test_set = model.prepare_tf_dataset( +... lm_dataset["test"], ... shuffle=False, ... batch_size=16, ... 
collate_fn=data_collator, @@ -352,20 +350,18 @@ At this point, only three steps remain: ``` -To fine-tune a model in TensorFlow, start by converting your datasets to the `tf.data.Dataset` format with [`~datasets.Dataset.to_tf_dataset`]. Specify inputs and labels in `columns`, whether to shuffle the dataset order, batch size, and the data collator: +To fine-tune a model in TensorFlow, start by converting your datasets to the `tf.data.Dataset` format with [`~TFPreTrainedModel.prepare_tf_dataset`]. ```py ->>> tf_train_set = lm_dataset["train"].to_tf_dataset( -... columns=["attention_mask", "input_ids", "labels"], -... dummy_labels=True, +>>> tf_train_set = model.prepare_tf_dataset( +... lm_dataset["train"], ... shuffle=True, ... batch_size=16, ... collate_fn=data_collator, ... ) ->>> tf_test_set = lm_dataset["test"].to_tf_dataset( -... columns=["attention_mask", "input_ids", "labels"], -... dummy_labels=True, +>>> tf_test_set = model.prepare_tf_dataset( +... lm_dataset["test"], ... shuffle=False, ... batch_size=16, ... collate_fn=data_collator, diff --git a/docs/source/en/tasks/multiple_choice.mdx b/docs/source/en/tasks/multiple_choice.mdx index b8eb528497..6ee0d7137f 100644 --- a/docs/source/en/tasks/multiple_choice.mdx +++ b/docs/source/en/tasks/multiple_choice.mdx @@ -224,21 +224,19 @@ At this point, only three steps remain: ``` -To fine-tune a model in TensorFlow, start by converting your datasets to the `tf.data.Dataset` format with [`~datasets.Dataset.to_tf_dataset`]. Specify inputs in `columns`, targets in `label_cols`, whether to shuffle the dataset order, batch size, and the data collator: +To fine-tune a model in TensorFlow, start by converting your datasets to the `tf.data.Dataset` format with [`~TFPreTrainedModel.prepare_tf_dataset`]. ```py >>> data_collator = DataCollatorForMultipleChoice(tokenizer=tokenizer) ->>> tf_train_set = tokenized_swag["train"].to_tf_dataset( -... columns=["attention_mask", "input_ids"], -... 
label_cols=["labels"], +>>> tf_train_set = model.prepare_tf_dataset( +... tokenized_swag["train"], ... shuffle=True, ... batch_size=batch_size, ... collate_fn=data_collator, ... ) ->>> tf_validation_set = tokenized_swag["validation"].to_tf_dataset( -... columns=["attention_mask", "input_ids"], -... label_cols=["labels"], +>>> tf_validation_set = model.prepare_tf_dataset( +... tokenized_swag["validation"], ... shuffle=False, ... batch_size=batch_size, ... collate_fn=data_collator, @@ -273,10 +271,7 @@ Load BERT with [`TFAutoModelForMultipleChoice`]: Configure the model for training with [`compile`](https://keras.io/api/models/model_training_apis/#compile-method): ```py ->>> model.compile( -... optimizer=optimizer, -... loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True), -... ) +>>> model.compile(optimizer=optimizer) ``` Call [`fit`](https://keras.io/api/models/model_training_apis/#fit-method) to fine-tune the model: diff --git a/docs/source/en/tasks/question_answering.mdx b/docs/source/en/tasks/question_answering.mdx index 2cb54760e8..218fa7bb55 100644 --- a/docs/source/en/tasks/question_answering.mdx +++ b/docs/source/en/tasks/question_answering.mdx @@ -199,20 +199,18 @@ At this point, only three steps remain: ``` -To fine-tune a model in TensorFlow, start by converting your datasets to the `tf.data.Dataset` format with [`~datasets.Dataset.to_tf_dataset`]. Specify inputs and the start and end positions of an answer in `columns`, whether to shuffle the dataset order, batch size, and the data collator: +To fine-tune a model in TensorFlow, start by converting your datasets to the `tf.data.Dataset` format with [`~TFPreTrainedModel.prepare_tf_dataset`]. ```py ->>> tf_train_set = tokenized_squad["train"].to_tf_dataset( -... columns=["attention_mask", "input_ids", "start_positions", "end_positions"], -... dummy_labels=True, +>>> tf_train_set = model.prepare_tf_dataset( +... tokenized_squad["train"], ... shuffle=True, ... batch_size=16, ... 
collate_fn=data_collator, ... ) ->>> tf_validation_set = tokenized_squad["validation"].to_tf_dataset( -... columns=["attention_mask", "input_ids", "start_positions", "end_positions"], -... dummy_labels=True, +>>> tf_validation_set = model.prepare_tf_dataset( +... tokenized_squad["validation"], ... shuffle=False, ... batch_size=16, ... collate_fn=data_collator, diff --git a/docs/source/en/tasks/sequence_classification.mdx b/docs/source/en/tasks/sequence_classification.mdx index 44729dc28f..2ef8a9619c 100644 --- a/docs/source/en/tasks/sequence_classification.mdx +++ b/docs/source/en/tasks/sequence_classification.mdx @@ -144,18 +144,19 @@ At this point, only three steps remain: -To fine-tune a model in TensorFlow, start by converting your datasets to the `tf.data.Dataset` format with [`~datasets.Dataset.to_tf_dataset`]. Specify inputs and labels in `columns`, whether to shuffle the dataset order, batch size, and the data collator: +To fine-tune a model in TensorFlow, start by converting your datasets to the `tf.data.Dataset` format with [`~TFPreTrainedModel.prepare_tf_dataset`]. + ```py ->>> tf_train_set = tokenized_imdb["train"].to_tf_dataset( -... columns=["attention_mask", "input_ids", "label"], +>>> tf_train_set = model.prepare_tf_dataset( +... tokenized_imdb["train"], ... shuffle=True, ... batch_size=16, ... collate_fn=data_collator, ... ) ->>> tf_validation_set = tokenized_imdb["test"].to_tf_dataset( -... columns=["attention_mask", "input_ids", "label"], +>>> tf_validation_set = model.prepare_tf_dataset( +... tokenized_imdb["test"], ... shuffle=False, ... batch_size=16, ... 
collate_fn=data_collator, diff --git a/docs/source/en/tasks/summarization.mdx b/docs/source/en/tasks/summarization.mdx index f636141a15..1b2eafcb5f 100644 --- a/docs/source/en/tasks/summarization.mdx +++ b/docs/source/en/tasks/summarization.mdx @@ -159,18 +159,18 @@ At this point, only three steps remain: ``` -To fine-tune a model in TensorFlow, start by converting your datasets to the `tf.data.Dataset` format with [`~datasets.Dataset.to_tf_dataset`]. Specify inputs and labels in `columns`, whether to shuffle the dataset order, batch size, and the data collator: +To fine-tune a model in TensorFlow, start by converting your datasets to the `tf.data.Dataset` format with [`~TFPreTrainedModel.prepare_tf_dataset`]. ```py ->>> tf_train_set = tokenized_billsum["train"].to_tf_dataset( -... columns=["attention_mask", "input_ids", "labels"], +>>> tf_train_set = model.prepare_tf_dataset( +... tokenized_billsum["train"], ... shuffle=True, ... batch_size=16, ... collate_fn=data_collator, ... ) ->>> tf_test_set = tokenized_billsum["test"].to_tf_dataset( -... columns=["attention_mask", "input_ids", "labels"], +>>> tf_test_set = model.prepare_tf_dataset( +... tokenized_billsum["test"], ... shuffle=False, ... batch_size=16, ... collate_fn=data_collator, diff --git a/docs/source/en/tasks/token_classification.mdx b/docs/source/en/tasks/token_classification.mdx index aa5739534f..3d2a3ccb05 100644 --- a/docs/source/en/tasks/token_classification.mdx +++ b/docs/source/en/tasks/token_classification.mdx @@ -199,18 +199,18 @@ At this point, only three steps remain: ``` -To fine-tune a model in TensorFlow, start by converting your datasets to the `tf.data.Dataset` format with [`~datasets.Dataset.to_tf_dataset`]. Specify inputs and labels in `columns`, whether to shuffle the dataset order, batch size, and the data collator: +To fine-tune a model in TensorFlow, start by converting your datasets to the `tf.data.Dataset` format with [`~TFPreTrainedModel.prepare_tf_dataset`]. 
```py ->>> tf_train_set = tokenized_wnut["train"].to_tf_dataset( -... columns=["attention_mask", "input_ids", "labels"], +>>> tf_train_set = model.prepare_tf_dataset( +... tokenized_wnut["train"], ... shuffle=True, ... batch_size=16, ... collate_fn=data_collator, ... ) ->>> tf_validation_set = tokenized_wnut["validation"].to_tf_dataset( -... columns=["attention_mask", "input_ids", "labels"], +>>> tf_validation_set = model.prepare_tf_dataset( +... tokenized_wnut["validation"], ... shuffle=False, ... batch_size=16, ... collate_fn=data_collator, diff --git a/docs/source/en/tasks/translation.mdx b/docs/source/en/tasks/translation.mdx index d17b870414..7439bc7b61 100644 --- a/docs/source/en/tasks/translation.mdx +++ b/docs/source/en/tasks/translation.mdx @@ -175,18 +175,18 @@ At this point, only three steps remain: ``` -To fine-tune a model in TensorFlow, start by converting your datasets to the `tf.data.Dataset` format with [`~datasets.Dataset.to_tf_dataset`]. Specify inputs and labels in `columns`, whether to shuffle the dataset order, batch size, and the data collator: +To fine-tune a model in TensorFlow, start by converting your datasets to the `tf.data.Dataset` format with [`~TFPreTrainedModel.prepare_tf_dataset`]. ```py ->>> tf_train_set = tokenized_books["train"].to_tf_dataset( -... columns=["attention_mask", "input_ids", "labels"], +>>> tf_train_set = model.prepare_tf_dataset( +... tokenized_books["train"], ... shuffle=True, ... batch_size=16, ... collate_fn=data_collator, ... ) ->>> tf_test_set = tokenized_books["test"].to_tf_dataset( -... columns=["attention_mask", "input_ids", "labels"], +>>> tf_test_set = model.prepare_tf_dataset( +... tokenized_books["test"], ... shuffle=False, ... batch_size=16, ... 
collate_fn=data_collator, @@ -216,7 +216,7 @@ Configure the model for training with [`compile`](https://keras.io/api/models/mo Call [`fit`](https://keras.io/api/models/model_training_apis/#fit-method) to fine-tune the model: ```py ->>> model.fit(x=tf_train_set, validation_data=tf_test_set, epochs=3) +>>> model.fit(tf_train_set, validation_data=tf_test_set, epochs=3) ``` diff --git a/docs/source/en/training.mdx b/docs/source/en/training.mdx index 9222d27ac8..89f5c3148b 100644 --- a/docs/source/en/training.mdx +++ b/docs/source/en/training.mdx @@ -65,10 +65,16 @@ If you like, you can create a smaller subset of the full dataset to fine-tune on ## Train +At this point, you should follow the section corresponding to the framework you want to use. You can use the links +in the right sidebar to jump to the one you want - and if you want to hide all of the content for a given framework, +just use the button at the top-right of that framework's block! + +## Train with PyTorch Trainer + 🤗 Transformers provides a [`Trainer`] class optimized for training 🤗 Transformers models, making it easier to start training without manually writing your own training loop. The [`Trainer`] API supports a wide range of training options and features such as logging, gradient accumulation, and mixed precision. Start by loading your model and specify the number of expected labels. From the Yelp Review [dataset card](https://huggingface.co/datasets/yelp_review_full#data-fields), you know there are five labels: @@ -151,66 +157,113 @@ Then fine-tune your model by calling [`~transformers.Trainer.train`]: -🤗 Transformers models also supports training in TensorFlow with the Keras API. +## Train a TensorFlow model with Keras -### Convert dataset to TensorFlow format +You can also train 🤗 Transformers models in TensorFlow with the Keras API! -The [`DefaultDataCollator`] assembles tensors into a batch for the model to train on. 
Make sure you specify `return_tensors` to return TensorFlow tensors: +### Loading data for Keras + +When you want to train a 🤗 Transformers model with the Keras API, you need to convert your dataset to a format that +Keras understands. If your dataset is small, you can just convert the whole thing to NumPy arrays and pass it to Keras. +Let's try that first before we do anything more complicated. + +First, load a dataset. We'll use the CoLA dataset from the [GLUE benchmark](https://huggingface.co/datasets/glue), +since it's a simple binary text classification task, and just take the training split for now. ```py ->>> from transformers import DefaultDataCollator +from datasets import load_dataset ->>> data_collator = DefaultDataCollator(return_tensors="tf") +dataset = load_dataset("glue", "cola") +dataset = dataset["train"] # Just take the training split for now +``` + +Next, load a tokenizer and tokenize the data as NumPy arrays. Note that the labels are already a list of 0 and 1s, +so we can just convert that directly to a NumPy array without tokenization! 
+ +```py +from transformers import AutoTokenizer + +tokenizer = AutoTokenizer.from_pretrained("bert-base-cased") +tokenized_data = tokenizer(dataset["text"], return_tensors="np", padding=True) + +labels = np.array(dataset["label"]) # Label is already an array of 0 and 1 +``` + +Finally, load, [`compile`](https://keras.io/api/models/model_training_apis/#compile-method), and [`fit`](https://keras.io/api/models/model_training_apis/#fit-method) the model: + +```py +from transformers import TFAutoModelForSequenceClassification +from tensorflow.keras.optimizers import Adam + +# Load and compile our model +model = TFAutoModelForSequenceClassification.from_pretrained("bert-base-cased") +# Lower learning rates are often better for fine-tuning transformers +model.compile(optimizer=Adam(3e-5)) + +model.fit(tokenized_data, labels) ``` -[`Trainer`] uses [`DataCollatorWithPadding`] by default so you don't need to explicitly specify a data collator. +You don't have to pass a loss argument to your models when you `compile()` them! Hugging Face models automatically +choose a loss that is appropriate for their task and model architecture if this argument is left blank. You can always +override this by specifying a loss yourself if you want to! -Next, convert the tokenized datasets to TensorFlow datasets with the [`~datasets.Dataset.to_tf_dataset`] method. Specify your inputs in `columns`, and your label in `label_cols`: +This approach works great for smaller datasets, but for larger datasets, you might find it starts to become a problem. Why? +Because the tokenized array and labels would have to be fully loaded into memory, and because NumPy doesn’t handle +“jagged” arrays, so every tokenized sample would have to be padded to the length of the longest sample in the whole +dataset. That’s going to make your array even bigger, and all those padding tokens will slow down training too! 
+ +### Loading data as a tf.data.Dataset + +If you want to avoid slowing down training, you can load your data as a `tf.data.Dataset` instead. Although you can write your own +`tf.data` pipeline if you want, we have two convenience methods for doing this: + +- [`~TFPreTrainedModel.prepare_tf_dataset`]: This is the method we recommend in most cases. Because it is a method +on your model, it can inspect the model to automatically figure out which columns are usable as model inputs, and +discard the others to make a simpler, more performant dataset. +- [`~datasets.Dataset.to_tf_dataset`]: This method is more low-level, and is useful when you want to exactly control how +your dataset is created, by specifying exactly which `columns` and `label_cols` to include. + +Before you can use [`~TFPreTrainedModel.prepare_tf_dataset`], you will need to add the tokenizer outputs to your dataset as columns, as shown in +the following code sample: ```py ->>> tf_train_dataset = small_train_dataset.to_tf_dataset( -... columns=["attention_mask", "input_ids", "token_type_ids"], -... label_cols=["labels"], -... shuffle=True, -... collate_fn=data_collator, -... batch_size=8, -... ) +def tokenize_dataset(data): + # Keys of the returned dictionary will be added to the dataset as columns + return tokenizer(data["text"]) ->>> tf_validation_dataset = small_eval_dataset.to_tf_dataset( -... columns=["attention_mask", "input_ids", "token_type_ids"], -... label_cols=["labels"], -... shuffle=False, -... collate_fn=data_collator, -... batch_size=8, -... ) + +dataset = dataset.map(tokenize_dataset) ``` -### Compile and fit +Remember that Hugging Face datasets are stored on disk by default, so this will not inflate your memory usage! Once the +columns have been added, you can stream batches from the dataset and add padding to each batch, which greatly +reduces the number of padding tokens compared to padding the entire dataset. 
-Load a TensorFlow model with the expected number of labels: ```py ->>> import tensorflow as tf ->>> from transformers import TFAutoModelForSequenceClassification - ->>> model = TFAutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=5) +>>> tf_dataset = model.prepare_tf_dataset(dataset, batch_size=16, shuffle=True, tokenizer=tokenizer) ``` -Then compile and fine-tune your model with [`fit`](https://keras.io/api/models/model_training_apis/) as you would with any other Keras model: +Note that in the code sample above, you need to pass the tokenizer to `prepare_tf_dataset` so it can correctly pad batches as they're loaded. +If all the samples in your dataset are the same length and no padding is necessary, you can skip this argument. +If you need to do something more complex than just padding samples (e.g. corrupting tokens for masked language +modelling), you can use the `collate_fn` argument instead to pass a function that will be called to transform the +list of samples into a batch and apply any preprocessing you want. See our +[examples](https://github.com/huggingface/transformers/tree/main/examples) or +[notebooks](https://huggingface.co/docs/transformers/notebooks) to see this approach in action. + +Once you've created a `tf.data.Dataset`, you can compile and fit the model as before: ```py ->>> model.compile( -... optimizer=tf.keras.optimizers.Adam(learning_rate=5e-5), -... loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True), -... metrics=tf.metrics.SparseCategoricalAccuracy(), -... ) +model.compile(optimizer=Adam(3e-5)) ->>> model.fit(tf_train_dataset, validation_data=tf_validation_dataset, epochs=3) +model.fit(tf_dataset) ``` +
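The `training.mdx` text added in this patch argues that padding each batch separately needs far fewer padding tokens than padding every sample to the longest sample in the whole dataset. The sketch below works through that arithmetic in plain Python. The token lengths and the helper names (`padding_whole_dataset`, `padding_per_batch`) are assumptions made up for illustration; this is not how `prepare_tf_dataset` is implemented.

```python
def padding_whole_dataset(lengths):
    """Padding tokens needed when every sample is padded to the global maximum."""
    longest = max(lengths)
    return sum(longest - n for n in lengths)


def padding_per_batch(lengths, batch_size):
    """Padding tokens needed when each batch is padded to its own maximum."""
    total = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start : start + batch_size]
        longest = max(batch)
        total += sum(longest - n for n in batch)
    return total


# Hypothetical token counts for eight tokenized samples (not from a real dataset).
lengths = [5, 7, 6, 8, 120, 9, 7, 6]

whole = padding_whole_dataset(lengths)
batched = padding_per_batch(lengths, batch_size=4)
print(f"pad to dataset max: {whole} padding tokens")
print(f"pad per batch of 4: {batched} padding tokens")
```

Shuffling changes which samples share a batch, so the exact saving varies from epoch to epoch. Per-batch padding can never need more padding tokens than padding to the dataset-wide maximum, because no batch maximum exceeds the global one.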