Update TF fine-tuning docs (#18654)

* Update TF fine-tuning docs
* Fix formatting
* Add some section headers so the right sidebar works better
* Squiggly it
* Update docs/source/en/training.mdx
  Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
* Update docs/source/en/training.mdx
  Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
* Update docs/source/en/training.mdx
  Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
* Update docs/source/en/training.mdx
  Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
* Update docs/source/en/training.mdx
  Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
* Update docs/source/en/training.mdx
  Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
* Update docs/source/en/training.mdx
  Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
* Update docs/source/en/training.mdx
  Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
* Update docs/source/en/training.mdx
  Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
* Update docs/source/en/training.mdx
  Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
* Update docs/source/en/training.mdx
  Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
* Update docs/source/en/training.mdx
  Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
* Update docs/source/en/training.mdx
  Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
* Explain things in the text, not the comments
* Make the two dataset creation methods into a list
* Move the advice about collation out of a <Tip>
* Edits for clarity
* Edits for clarity
* Edits for clarity
* Replace `to_tf_dataset` with `prepare_tf_dataset` in the fine-tuning pages
* Restructure the page a little bit
* Restructure the page a little bit
* Restructure the page a little bit

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
2022-09-07 13:30:07 +01:00
parent d842f2d5b9
commit 2b9513fdab
8 changed files with 130 additions and 87 deletions
--- a/docs/source/en/tasks/language_modeling.mdx
+++ b/docs/source/en/tasks/language_modeling.mdx
@@ -245,20 +245,18 @@ At this point, only three steps remain:
 ```
 </pt>
 <tf>
-To fine-tune a model in TensorFlow, start by converting your datasets to the `tf.data.Dataset` format with [`~datasets.Dataset.to_tf_dataset`]. Specify inputs and labels in `columns`, whether to shuffle the dataset order, batch size, and the data collator:
+To fine-tune a model in TensorFlow, start by converting your datasets to the `tf.data.Dataset` format with [`~TFPreTrainedModel.prepare_tf_dataset`].
 ```py
->>> tf_train_set = lm_dataset["train"].to_tf_dataset(
+>>> tf_train_set = model.prepare_tf_dataset(
-...     columns=["attention_mask", "input_ids", "labels"],
+...     lm_dataset["train"],
 ...     dummy_labels=True,
 ...     shuffle=True,
 ...     batch_size=16,
 ...     collate_fn=data_collator,
 ... )
->>> tf_test_set = lm_dataset["test"].to_tf_dataset(
+>>> tf_test_set = model.prepare_tf_dataset(
-...     columns=["attention_mask", "input_ids", "labels"],
+...     lm_dataset["test"],
 ...     dummy_labels=True,
 ...     shuffle=False,
 ...     batch_size=16,
 ...     collate_fn=data_collator,
@@ -352,20 +350,18 @@ At this point, only three steps remain:
 ```
 </pt>
 <tf>
-To fine-tune a model in TensorFlow, start by converting your datasets to the `tf.data.Dataset` format with [`~datasets.Dataset.to_tf_dataset`]. Specify inputs and labels in `columns`, whether to shuffle the dataset order, batch size, and the data collator:
+To fine-tune a model in TensorFlow, start by converting your datasets to the `tf.data.Dataset` format with [`~TFPreTrainedModel.prepare_tf_dataset`].
 ```py
->>> tf_train_set = lm_dataset["train"].to_tf_dataset(
+>>> tf_train_set = model.prepare_tf_dataset(
-...     columns=["attention_mask", "input_ids", "labels"],
+...     lm_dataset["train"],
 ...     dummy_labels=True,
 ...     shuffle=True,
 ...     batch_size=16,
 ...     collate_fn=data_collator,
 ... )
->>> tf_test_set = lm_dataset["test"].to_tf_dataset(
+>>> tf_test_set = model.prepare_tf_dataset(
-...     columns=["attention_mask", "input_ids", "labels"],
+...     lm_dataset["test"],
 ...     dummy_labels=True,
 ...     shuffle=False,
 ...     batch_size=16,
 ...     collate_fn=data_collator,
--- a/docs/source/en/tasks/multiple_choice.mdx
+++ b/docs/source/en/tasks/multiple_choice.mdx
@@ -224,21 +224,19 @@ At this point, only three steps remain:
 ```
 </pt>
 <tf>
-To fine-tune a model in TensorFlow, start by converting your datasets to the `tf.data.Dataset` format with [`~datasets.Dataset.to_tf_dataset`]. Specify inputs in `columns`, targets in `label_cols`, whether to shuffle the dataset order, batch size, and the data collator:
+To fine-tune a model in TensorFlow, start by converting your datasets to the `tf.data.Dataset` format with [`~TFPreTrainedModel.prepare_tf_dataset`].
 ```py
 >>> data_collator = DataCollatorForMultipleChoice(tokenizer=tokenizer)
->>> tf_train_set = tokenized_swag["train"].to_tf_dataset(
+>>> tf_train_set = model.prepare_tf_dataset(
-...     columns=["attention_mask", "input_ids"],
+...     tokenized_swag["train"],
 ...     label_cols=["labels"],
 ...     shuffle=True,
 ...     batch_size=batch_size,
 ...     collate_fn=data_collator,
 ... )
->>> tf_validation_set = tokenized_swag["validation"].to_tf_dataset(
+>>> tf_validation_set = model.prepare_tf_dataset(
-...     columns=["attention_mask", "input_ids"],
+...     tokenized_swag["validation"],
 ...     label_cols=["labels"],
 ...     shuffle=False,
 ...     batch_size=batch_size,
 ...     collate_fn=data_collator,
@@ -273,10 +271,7 @@ Load BERT with [`TFAutoModelForMultipleChoice`]:
 Configure the model for training with [`compile`](https://keras.io/api/models/model_training_apis/#compile-method):
 ```py
->>> model.compile(
-...     optimizer=optimizer,
-...     loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
-... )
+>>> model.compile(optimizer=optimizer)
 ```
 Call [`fit`](https://keras.io/api/models/model_training_apis/#fit-method) to fine-tune the model:
--- a/docs/source/en/tasks/question_answering.mdx
+++ b/docs/source/en/tasks/question_answering.mdx
@@ -199,20 +199,18 @@ At this point, only three steps remain:
 ```
 </pt>
 <tf>
-To fine-tune a model in TensorFlow, start by converting your datasets to the `tf.data.Dataset` format with [`~datasets.Dataset.to_tf_dataset`]. Specify inputs and the start and end positions of an answer in `columns`, whether to shuffle the dataset order, batch size, and the data collator:
+To fine-tune a model in TensorFlow, start by converting your datasets to the `tf.data.Dataset` format with [`~TFPreTrainedModel.prepare_tf_dataset`].
 ```py
->>> tf_train_set = tokenized_squad["train"].to_tf_dataset(
+>>> tf_train_set = model.prepare_tf_dataset(
-...     columns=["attention_mask", "input_ids", "start_positions", "end_positions"],
+...     tokenized_squad["train"],
 ...     dummy_labels=True,
 ...     shuffle=True,
 ...     batch_size=16,
 ...     collate_fn=data_collator,
 ... )
->>> tf_validation_set = tokenized_squad["validation"].to_tf_dataset(
+>>> tf_validation_set = model.prepare_tf_dataset(
-...     columns=["attention_mask", "input_ids", "start_positions", "end_positions"],
+...     tokenized_squad["validation"],
 ...     dummy_labels=True,
 ...     shuffle=False,
 ...     batch_size=16,
 ...     collate_fn=data_collator,
--- a/docs/source/en/tasks/sequence_classification.mdx
+++ b/docs/source/en/tasks/sequence_classification.mdx
@@ -144,18 +144,19 @@ At this point, only three steps remain:
 </Tip>
 </pt>
 <tf>
-To fine-tune a model in TensorFlow, start by converting your datasets to the `tf.data.Dataset` format with [`~datasets.Dataset.to_tf_dataset`]. Specify inputs and labels in `columns`, whether to shuffle the dataset order, batch size, and the data collator:
+To fine-tune a model in TensorFlow, start by converting your datasets to the `tf.data.Dataset` format with [`~TFPreTrainedModel.prepare_tf_dataset`].
 ```py
->>> tf_train_set = tokenized_imdb["train"].to_tf_dataset(
+>>> tf_train_set = model.prepare_tf_dataset(
-...     columns=["attention_mask", "input_ids", "label"],
+...     tokenized_imdb["train"],
 ...     shuffle=True,
 ...     batch_size=16,
 ...     collate_fn=data_collator,
 ... )
->>> tf_validation_set = tokenized_imdb["test"].to_tf_dataset(
+>>> tf_validation_set = model.prepare_tf_dataset(
-...     columns=["attention_mask", "input_ids", "label"],
+...     tokenized_imdb["test"],
 ...     shuffle=False,
 ...     batch_size=16,
 ...     collate_fn=data_collator,
--- a/docs/source/en/tasks/summarization.mdx
+++ b/docs/source/en/tasks/summarization.mdx
@@ -159,18 +159,18 @@ At this point, only three steps remain:
 ```
 </pt>
 <tf>
-To fine-tune a model in TensorFlow, start by converting your datasets to the `tf.data.Dataset` format with [`~datasets.Dataset.to_tf_dataset`]. Specify inputs and labels in `columns`, whether to shuffle the dataset order, batch size, and the data collator:
+To fine-tune a model in TensorFlow, start by converting your datasets to the `tf.data.Dataset` format with [`~TFPreTrainedModel.prepare_tf_dataset`].
 ```py
->>> tf_train_set = tokenized_billsum["train"].to_tf_dataset(
+>>> tf_train_set = model.prepare_tf_dataset(
-...     columns=["attention_mask", "input_ids", "labels"],
+...     tokenized_billsum["train"],
 ...     shuffle=True,
 ...     batch_size=16,
 ...     collate_fn=data_collator,
 ... )
->>> tf_test_set = tokenized_billsum["test"].to_tf_dataset(
+>>> tf_test_set = model.prepare_tf_dataset(
-...     columns=["attention_mask", "input_ids", "labels"],
+...     tokenized_billsum["test"],
 ...     shuffle=False,
 ...     batch_size=16,
 ...     collate_fn=data_collator,
--- a/docs/source/en/tasks/token_classification.mdx
+++ b/docs/source/en/tasks/token_classification.mdx
@@ -199,18 +199,18 @@ At this point, only three steps remain:
 ```
 </pt>
 <tf>
-To fine-tune a model in TensorFlow, start by converting your datasets to the `tf.data.Dataset` format with [`~datasets.Dataset.to_tf_dataset`]. Specify inputs and labels in `columns`, whether to shuffle the dataset order, batch size, and the data collator:
+To fine-tune a model in TensorFlow, start by converting your datasets to the `tf.data.Dataset` format with [`~TFPreTrainedModel.prepare_tf_dataset`].
 ```py
->>> tf_train_set = tokenized_wnut["train"].to_tf_dataset(
+>>> tf_train_set = model.prepare_tf_dataset(
-...     columns=["attention_mask", "input_ids", "labels"],
+...     tokenized_wnut["train"],
 ...     shuffle=True,
 ...     batch_size=16,
 ...     collate_fn=data_collator,
 ... )
->>> tf_validation_set = tokenized_wnut["validation"].to_tf_dataset(
+>>> tf_validation_set = model.prepare_tf_dataset(
-...     columns=["attention_mask", "input_ids", "labels"],
+...     tokenized_wnut["validation"],
 ...     shuffle=False,
 ...     batch_size=16,
 ...     collate_fn=data_collator,
--- a/docs/source/en/tasks/translation.mdx
+++ b/docs/source/en/tasks/translation.mdx
@@ -175,18 +175,18 @@ At this point, only three steps remain:
 ```
 </pt>
 <tf>
-To fine-tune a model in TensorFlow, start by converting your datasets to the `tf.data.Dataset` format with [`~datasets.Dataset.to_tf_dataset`]. Specify inputs and labels in `columns`, whether to shuffle the dataset order, batch size, and the data collator:
+To fine-tune a model in TensorFlow, start by converting your datasets to the `tf.data.Dataset` format with [`~TFPreTrainedModel.prepare_tf_dataset`].
 ```py
->>> tf_train_set = tokenized_books["train"].to_tf_dataset(
+>>> tf_train_set = model.prepare_tf_dataset(
-...     columns=["attention_mask", "input_ids", "labels"],
+...     tokenized_books["train"],
 ...     shuffle=True,
 ...     batch_size=16,
 ...     collate_fn=data_collator,
 ... )
->>> tf_test_set = tokenized_books["test"].to_tf_dataset(
+>>> tf_test_set = model.prepare_tf_dataset(
-...     columns=["attention_mask", "input_ids", "labels"],
+...     tokenized_books["test"],
 ...     shuffle=False,
 ...     batch_size=16,
 ...     collate_fn=data_collator,
@@ -216,7 +216,7 @@ Configure the model for training with [`compile`](https://keras.io/api/models/mo
 Call [`fit`](https://keras.io/api/models/model_training_apis/#fit-method) to fine-tune the model:
 ```py
->>> model.fit(x=tf_train_set, validation_data=tf_test_set, epochs=3)
+>>> model.fit(tf_train_set, validation_data=tf_test_set, epochs=3)
 ```
 </tf>
 </frameworkcontent>
--- a/docs/source/en/training.mdx
+++ b/docs/source/en/training.mdx
@@ -65,10 +65,16 @@ If you like, you can create a smaller subset of the full dataset to fine-tune on
 ## Train
 At this point, you should follow the section corresponding to the framework you want to use. You can use the links
 in the right sidebar to jump to the one you want - and if you want to hide all of the content for a given framework,
 just use the button at the top-right of that framework's block!
 <frameworkcontent>
 <pt>
 <Youtube id="nvBXf7s7vTI"/>
 ## Train with PyTorch Trainer
 🤗 Transformers provides a [`Trainer`] class optimized for training 🤗 Transformers models, making it easier to start training without manually writing your own training loop. The [`Trainer`] API supports a wide range of training options and features such as logging, gradient accumulation, and mixed precision.
 Start by loading your model and specify the number of expected labels. From the Yelp Review [dataset card](https://huggingface.co/datasets/yelp_review_full#data-fields), you know there are five labels:
@@ -151,66 +157,113 @@ Then fine-tune your model by calling [`~transformers.Trainer.train`]:
 <Youtube id="rnTGBy2ax1c"/>
-🤗 Transformers models also supports training in TensorFlow with the Keras API.
+## Train a TensorFlow model with Keras
-### Convert dataset to TensorFlow format
+You can also train 🤗 Transformers models in TensorFlow with the Keras API!
-The [`DefaultDataCollator`] assembles tensors into a batch for the model to train on. Make sure you specify `return_tensors` to return TensorFlow tensors:
+### Loading data for Keras
 When you want to train a 🤗 Transformers model with the Keras API, you need to convert your dataset to a format that
 Keras understands. If your dataset is small, you can just convert the whole thing to NumPy arrays and pass it to Keras.
 Let's try that first before we do anything more complicated.
 First, load a dataset. We'll use the CoLA dataset from the [GLUE benchmark](https://huggingface.co/datasets/glue),
 since it's a simple binary text classification task, and just take the training split for now.
 ```py
->>> from transformers import DefaultDataCollator
+from datasets import load_dataset
->>> data_collator = DefaultDataCollator(return_tensors="tf")
+dataset = load_dataset("glue", "cola")
 dataset = dataset["train"]  # Just take the training split for now
 ```
 Next, load a tokenizer and tokenize the data as NumPy arrays. Note that the labels are already a list of 0s and 1s,
 so we can convert them directly to a NumPy array without tokenization!
 ```py
 import numpy as np
 from transformers import AutoTokenizer
 tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
 # The text column of GLUE CoLA is named "sentence"
 tokenized_data = tokenizer(dataset["sentence"], return_tensors="np", padding=True)
 labels = np.array(dataset["label"])  # Labels are already a list of 0s and 1s
 ```
 Finally, load, [`compile`](https://keras.io/api/models/model_training_apis/#compile-method), and [`fit`](https://keras.io/api/models/model_training_apis/#fit-method) the model:
 ```py
 from transformers import TFAutoModelForSequenceClassification
 from tensorflow.keras.optimizers import Adam
 # Load and compile our model
 model = TFAutoModelForSequenceClassification.from_pretrained("bert-base-cased")
 # Lower learning rates are often better for fine-tuning transformers
 model.compile(optimizer=Adam(3e-5))
 model.fit(tokenized_data, labels)
 ```
 <Tip>
-[`Trainer`] uses [`DataCollatorWithPadding`] by default so you don't need to explicitly specify a data collator.
+You don't have to pass a loss argument to your models when you `compile()` them! Hugging Face models automatically
 choose a loss that is appropriate for their task and model architecture if this argument is left blank. You can always
 override this by specifying a loss yourself if you want to!
 </Tip>
-Next, convert the tokenized datasets to TensorFlow datasets with the [`~datasets.Dataset.to_tf_dataset`] method. Specify your inputs in `columns`, and your label in `label_cols`:
+This approach works great for smaller datasets, but for larger datasets, you might find it starts to become a problem. Why?
 Because the tokenized array and labels would have to be fully loaded into memory, and because NumPy doesn’t handle
 “jagged” arrays, so every tokenized sample would have to be padded to the length of the longest sample in the whole
 dataset. That’s going to make your array even bigger, and all those padding tokens will slow down training too!
 ### Loading data as a tf.data.Dataset
 If you want to avoid slowing down training, you can load your data as a `tf.data.Dataset` instead. Although you can write your own
 `tf.data` pipeline if you want, we have two convenience methods for doing this:
 - [`~TFPreTrainedModel.prepare_tf_dataset`]: This is the method we recommend in most cases. Because it is a method
 on your model, it can inspect the model to automatically figure out which columns are usable as model inputs, and
 discard the others to make a simpler, more performant dataset.
 - [`~datasets.Dataset.to_tf_dataset`]: This method is more low-level, and is useful when you want to exactly control how
 your dataset is created, by specifying exactly which `columns` and `label_cols` to include.
 Before you can use [`~TFPreTrainedModel.prepare_tf_dataset`], you will need to add the tokenizer outputs to your dataset as columns, as shown in
 the following code sample:
 ```py
->>> tf_train_dataset = small_train_dataset.to_tf_dataset(
-...     columns=["attention_mask", "input_ids", "token_type_ids"],
-...     label_cols=["labels"],
-...     shuffle=True,
-...     collate_fn=data_collator,
-...     batch_size=8,
-... )
->>> tf_validation_dataset = small_eval_dataset.to_tf_dataset(
-...     columns=["attention_mask", "input_ids", "token_type_ids"],
-...     label_cols=["labels"],
-...     shuffle=False,
-...     collate_fn=data_collator,
-...     batch_size=8,
-... )
+def tokenize_dataset(data):
+    # Keys of the returned dictionary will be added to the dataset as columns
+    # (the text column of GLUE CoLA is named "sentence")
+    return tokenizer(data["sentence"])
+
+
+dataset = dataset.map(tokenize_dataset)
 ```
-### Compile and fit
+Remember that Hugging Face datasets are stored on disk by default, so this will not inflate your memory usage! Once the
 columns have been added, you can stream batches from the dataset and add padding to each batch, which greatly
 reduces the number of padding tokens compared to padding the entire dataset.
-Load a TensorFlow model with the expected number of labels:
 ```py
->>> import tensorflow as tf
->>> from transformers import TFAutoModelForSequenceClassification
->>> model = TFAutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=5)
+>>> tf_dataset = model.prepare_tf_dataset(dataset, batch_size=16, shuffle=True, tokenizer=tokenizer)
 ```
-Then compile and fine-tune your model with [`fit`](https://keras.io/api/models/model_training_apis/) as you would with any other Keras model:
+Note that in the code sample above, you need to pass the tokenizer to `prepare_tf_dataset` so it can correctly pad batches as they're loaded.
 If all the samples in your dataset are the same length and no padding is necessary, you can skip this argument.
 If you need to do something more complex than just padding samples (e.g. corrupting tokens for masked language
 modelling), you can use the `collate_fn` argument instead to pass a function that will be called to transform the
 list of samples into a batch and apply any preprocessing you want. See our
 [examples](https://github.com/huggingface/transformers/tree/main/examples) or
 [notebooks](https://huggingface.co/docs/transformers/notebooks) to see this approach in action.
 Once you've created a `tf.data.Dataset`, you can compile and fit the model as before:
 ```py
->>> model.compile(
-...     optimizer=tf.keras.optimizers.Adam(learning_rate=5e-5),
-...     loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
-...     metrics=tf.metrics.SparseCategoricalAccuracy(),
-... )
->>> model.fit(tf_train_dataset, validation_data=tf_validation_dataset, epochs=3)
+model.compile(optimizer=Adam(3e-5))
+model.fit(tf_dataset)
 ```
 </tf>
 </frameworkcontent>
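
Note on the padding argument in `training.mdx`: the new text says that padding each batch separately "greatly reduces the number of padding tokens compared to padding the entire dataset". A rough way to see why is to count the tokens stored under each strategy. The sketch below is illustrative only. The helper names and the sample lengths are hypothetical, and it does not call `prepare_tf_dataset` or any other library function.

```python
def padded_tokens_whole_dataset(lengths):
    """Tokens stored when every sample is padded to the longest sample overall."""
    return len(lengths) * max(lengths)


def padded_tokens_per_batch(lengths, batch_size):
    """Tokens stored when each batch is padded only to its own longest sample."""
    total = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        total += len(batch) * max(batch)
    return total


# Hypothetical token counts for eight tokenized samples (one long outlier)
lengths = [5, 7, 6, 40, 8, 6, 5, 9]

real_tokens = sum(lengths)
whole = padded_tokens_whole_dataset(lengths)
per_batch = padded_tokens_per_batch(lengths, batch_size=4)

print(f"real tokens:            {real_tokens}")
print(f"padded (whole dataset): {whole}")
print(f"padded (per batch):     {per_batch}")
```

With these made-up lengths, the single long sample forces every row to length 40 when the whole dataset is padded at once, while per-batch padding confines that cost to the one batch that contains it. Actual savings depend on the length distribution and on how samples are shuffled into batches.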