Update TF fine-tuning docs (#18654)
* Update TF fine-tuning docs
* Fix formatting
* Add some section headers so the right sidebar works better
* Squiggly it
* Update docs/source/en/training.mdx
* Explain things in the text, not the comments
* Make the two dataset creation methods into a list
* Move the advice about collation out of a <Tip>
* Edits for clarity
* Replace `to_tf_dataset` with `prepare_tf_dataset` in the fine-tuning pages
* Restructure the page a little bit

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
@@ -245,20 +245,18 @@ At this point, only three steps remain:
|
|||||||
```
|
```
|
||||||
</pt>
|
</pt>
|
||||||
<tf>
|
<tf>
|
||||||
To fine-tune a model in TensorFlow, start by converting your datasets to the `tf.data.Dataset` format with [`~datasets.Dataset.to_tf_dataset`]. Specify inputs and labels in `columns`, whether to shuffle the dataset order, batch size, and the data collator:
|
To fine-tune a model in TensorFlow, start by converting your datasets to the `tf.data.Dataset` format with [`~TFPreTrainedModel.prepare_tf_dataset`].
|
||||||
|
|
||||||
```py
|
```py
|
||||||
>>> tf_train_set = lm_dataset["train"].to_tf_dataset(
|
>>> tf_train_set = model.prepare_tf_dataset(
|
||||||
... columns=["attention_mask", "input_ids", "labels"],
|
... lm_dataset["train"],
|
||||||
... dummy_labels=True,
|
|
||||||
... shuffle=True,
|
... shuffle=True,
|
||||||
... batch_size=16,
|
... batch_size=16,
|
||||||
... collate_fn=data_collator,
|
... collate_fn=data_collator,
|
||||||
... )
|
... )
|
||||||
|
|
||||||
>>> tf_test_set = lm_dataset["test"].to_tf_dataset(
|
>>> tf_test_set = model.prepare_tf_dataset(
|
||||||
... columns=["attention_mask", "input_ids", "labels"],
|
... lm_dataset["test"],
|
||||||
... dummy_labels=True,
|
|
||||||
... shuffle=False,
|
... shuffle=False,
|
||||||
... batch_size=16,
|
... batch_size=16,
|
||||||
... collate_fn=data_collator,
|
... collate_fn=data_collator,
|
||||||
@@ -352,20 +350,18 @@ At this point, only three steps remain:
|
|||||||
```
|
```
|
||||||
</pt>
|
</pt>
|
||||||
<tf>
|
<tf>
|
||||||
To fine-tune a model in TensorFlow, start by converting your datasets to the `tf.data.Dataset` format with [`~datasets.Dataset.to_tf_dataset`]. Specify inputs and labels in `columns`, whether to shuffle the dataset order, batch size, and the data collator:
|
To fine-tune a model in TensorFlow, start by converting your datasets to the `tf.data.Dataset` format with [`~TFPreTrainedModel.prepare_tf_dataset`].
|
||||||
|
|
||||||
```py
|
```py
|
||||||
>>> tf_train_set = lm_dataset["train"].to_tf_dataset(
|
>>> tf_train_set = model.prepare_tf_dataset(
|
||||||
... columns=["attention_mask", "input_ids", "labels"],
|
... lm_dataset["train"],
|
||||||
... dummy_labels=True,
|
|
||||||
... shuffle=True,
|
... shuffle=True,
|
||||||
... batch_size=16,
|
... batch_size=16,
|
||||||
... collate_fn=data_collator,
|
... collate_fn=data_collator,
|
||||||
... )
|
... )
|
||||||
|
|
||||||
>>> tf_test_set = lm_dataset["test"].to_tf_dataset(
|
>>> tf_test_set = model.prepare_tf_dataset(
|
||||||
... columns=["attention_mask", "input_ids", "labels"],
|
... lm_dataset["test"],
|
||||||
... dummy_labels=True,
|
|
||||||
... shuffle=False,
|
... shuffle=False,
|
||||||
... batch_size=16,
|
... batch_size=16,
|
||||||
... collate_fn=data_collator,
|
... collate_fn=data_collator,
|
||||||
|
|||||||
@@ -224,21 +224,19 @@ At this point, only three steps remain:
|
|||||||
```
|
```
|
||||||
</pt>
|
</pt>
|
||||||
<tf>
|
<tf>
|
||||||
To fine-tune a model in TensorFlow, start by converting your datasets to the `tf.data.Dataset` format with [`~datasets.Dataset.to_tf_dataset`]. Specify inputs in `columns`, targets in `label_cols`, whether to shuffle the dataset order, batch size, and the data collator:
|
To fine-tune a model in TensorFlow, start by converting your datasets to the `tf.data.Dataset` format with [`~TFPreTrainedModel.prepare_tf_dataset`].
|
||||||
|
|
||||||
```py
|
```py
|
||||||
>>> data_collator = DataCollatorForMultipleChoice(tokenizer=tokenizer)
|
>>> data_collator = DataCollatorForMultipleChoice(tokenizer=tokenizer)
|
||||||
>>> tf_train_set = tokenized_swag["train"].to_tf_dataset(
|
>>> tf_train_set = model.prepare_tf_dataset(
|
||||||
... columns=["attention_mask", "input_ids"],
|
... tokenized_swag["train"],
|
||||||
... label_cols=["labels"],
|
|
||||||
... shuffle=True,
|
... shuffle=True,
|
||||||
... batch_size=batch_size,
|
... batch_size=batch_size,
|
||||||
... collate_fn=data_collator,
|
... collate_fn=data_collator,
|
||||||
... )
|
... )
|
||||||
|
|
||||||
>>> tf_validation_set = tokenized_swag["validation"].to_tf_dataset(
|
>>> tf_validation_set = model.prepare_tf_dataset(
|
||||||
... columns=["attention_mask", "input_ids"],
|
... tokenized_swag["validation"],
|
||||||
... label_cols=["labels"],
|
|
||||||
... shuffle=False,
|
... shuffle=False,
|
||||||
... batch_size=batch_size,
|
... batch_size=batch_size,
|
||||||
... collate_fn=data_collator,
|
... collate_fn=data_collator,
|
||||||
@@ -273,10 +271,7 @@ Load BERT with [`TFAutoModelForMultipleChoice`]:
|
|||||||
Configure the model for training with [`compile`](https://keras.io/api/models/model_training_apis/#compile-method):
|
Configure the model for training with [`compile`](https://keras.io/api/models/model_training_apis/#compile-method):
|
||||||
|
|
||||||
```py
|
```py
|
||||||
>>> model.compile(
|
>>> model.compile(optimizer=optimizer)
|
||||||
... optimizer=optimizer,
|
|
||||||
... loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
|
|
||||||
... )
|
|
||||||
```
|
```
|
||||||
|
|
||||||
Call [`fit`](https://keras.io/api/models/model_training_apis/#fit-method) to fine-tune the model:
|
Call [`fit`](https://keras.io/api/models/model_training_apis/#fit-method) to fine-tune the model:
|
||||||
|
|||||||
@@ -199,20 +199,18 @@ At this point, only three steps remain:
|
|||||||
```
|
```
|
||||||
</pt>
|
</pt>
|
||||||
<tf>
|
<tf>
|
||||||
To fine-tune a model in TensorFlow, start by converting your datasets to the `tf.data.Dataset` format with [`~datasets.Dataset.to_tf_dataset`]. Specify inputs and the start and end positions of an answer in `columns`, whether to shuffle the dataset order, batch size, and the data collator:
|
To fine-tune a model in TensorFlow, start by converting your datasets to the `tf.data.Dataset` format with [`~TFPreTrainedModel.prepare_tf_dataset`].
|
||||||
|
|
||||||
```py
|
```py
|
||||||
>>> tf_train_set = tokenized_squad["train"].to_tf_dataset(
|
>>> tf_train_set = model.prepare_tf_dataset(
|
||||||
... columns=["attention_mask", "input_ids", "start_positions", "end_positions"],
|
... tokenized_squad["train"],
|
||||||
... dummy_labels=True,
|
|
||||||
... shuffle=True,
|
... shuffle=True,
|
||||||
... batch_size=16,
|
... batch_size=16,
|
||||||
... collate_fn=data_collator,
|
... collate_fn=data_collator,
|
||||||
... )
|
... )
|
||||||
|
|
||||||
>>> tf_validation_set = tokenized_squad["validation"].to_tf_dataset(
|
>>> tf_validation_set = model.prepare_tf_dataset(
|
||||||
... columns=["attention_mask", "input_ids", "start_positions", "end_positions"],
|
... tokenized_squad["validation"],
|
||||||
... dummy_labels=True,
|
|
||||||
... shuffle=False,
|
... shuffle=False,
|
||||||
... batch_size=16,
|
... batch_size=16,
|
||||||
... collate_fn=data_collator,
|
... collate_fn=data_collator,
|
||||||
|
|||||||
@@ -144,18 +144,19 @@ At this point, only three steps remain:
|
|||||||
</Tip>
|
</Tip>
|
||||||
</pt>
|
</pt>
|
||||||
<tf>
|
<tf>
|
||||||
To fine-tune a model in TensorFlow, start by converting your datasets to the `tf.data.Dataset` format with [`~datasets.Dataset.to_tf_dataset`]. Specify inputs and labels in `columns`, whether to shuffle the dataset order, batch size, and the data collator:
|
To fine-tune a model in TensorFlow, start by converting your datasets to the `tf.data.Dataset` format with [`~TFPreTrainedModel.prepare_tf_dataset`].
|
||||||
|
|
||||||
|
|
||||||
```py
|
```py
|
||||||
>>> tf_train_set = tokenized_imdb["train"].to_tf_dataset(
|
>>> tf_train_set = model.prepare_tf_dataset(
|
||||||
... columns=["attention_mask", "input_ids", "label"],
|
... tokenized_imdb["train"],
|
||||||
... shuffle=True,
|
... shuffle=True,
|
||||||
... batch_size=16,
|
... batch_size=16,
|
||||||
... collate_fn=data_collator,
|
... collate_fn=data_collator,
|
||||||
... )
|
... )
|
||||||
|
|
||||||
>>> tf_validation_set = tokenized_imdb["test"].to_tf_dataset(
|
>>> tf_validation_set = model.prepare_tf_dataset(
|
||||||
... columns=["attention_mask", "input_ids", "label"],
|
... tokenized_imdb["test"],
|
||||||
... shuffle=False,
|
... shuffle=False,
|
||||||
... batch_size=16,
|
... batch_size=16,
|
||||||
... collate_fn=data_collator,
|
... collate_fn=data_collator,
|
||||||
|
|||||||
@@ -159,18 +159,18 @@ At this point, only three steps remain:
|
|||||||
```
|
```
|
||||||
</pt>
|
</pt>
|
||||||
<tf>
|
<tf>
|
||||||
To fine-tune a model in TensorFlow, start by converting your datasets to the `tf.data.Dataset` format with [`~datasets.Dataset.to_tf_dataset`]. Specify inputs and labels in `columns`, whether to shuffle the dataset order, batch size, and the data collator:
|
To fine-tune a model in TensorFlow, start by converting your datasets to the `tf.data.Dataset` format with [`~TFPreTrainedModel.prepare_tf_dataset`].
|
||||||
|
|
||||||
```py
|
```py
|
||||||
>>> tf_train_set = tokenized_billsum["train"].to_tf_dataset(
|
>>> tf_train_set = model.prepare_tf_dataset(
|
||||||
... columns=["attention_mask", "input_ids", "labels"],
|
... tokenized_billsum["train"],
|
||||||
... shuffle=True,
|
... shuffle=True,
|
||||||
... batch_size=16,
|
... batch_size=16,
|
||||||
... collate_fn=data_collator,
|
... collate_fn=data_collator,
|
||||||
... )
|
... )
|
||||||
|
|
||||||
>>> tf_test_set = tokenized_billsum["test"].to_tf_dataset(
|
>>> tf_test_set = model.prepare_tf_dataset(
|
||||||
... columns=["attention_mask", "input_ids", "labels"],
|
... tokenized_billsum["test"],
|
||||||
... shuffle=False,
|
... shuffle=False,
|
||||||
... batch_size=16,
|
... batch_size=16,
|
||||||
... collate_fn=data_collator,
|
... collate_fn=data_collator,
|
||||||
|
|||||||
@@ -199,18 +199,18 @@ At this point, only three steps remain:
|
|||||||
```
|
```
|
||||||
</pt>
|
</pt>
|
||||||
<tf>
|
<tf>
|
||||||
To fine-tune a model in TensorFlow, start by converting your datasets to the `tf.data.Dataset` format with [`~datasets.Dataset.to_tf_dataset`]. Specify inputs and labels in `columns`, whether to shuffle the dataset order, batch size, and the data collator:
|
To fine-tune a model in TensorFlow, start by converting your datasets to the `tf.data.Dataset` format with [`~TFPreTrainedModel.prepare_tf_dataset`].
|
||||||
|
|
||||||
```py
|
```py
|
||||||
>>> tf_train_set = tokenized_wnut["train"].to_tf_dataset(
|
>>> tf_train_set = model.prepare_tf_dataset(
|
||||||
... columns=["attention_mask", "input_ids", "labels"],
|
... tokenized_wnut["train"],
|
||||||
... shuffle=True,
|
... shuffle=True,
|
||||||
... batch_size=16,
|
... batch_size=16,
|
||||||
... collate_fn=data_collator,
|
... collate_fn=data_collator,
|
||||||
... )
|
... )
|
||||||
|
|
||||||
>>> tf_validation_set = tokenized_wnut["validation"].to_tf_dataset(
|
>>> tf_validation_set = model.prepare_tf_dataset(
|
||||||
... columns=["attention_mask", "input_ids", "labels"],
|
... tokenized_wnut["validation"],
|
||||||
... shuffle=False,
|
... shuffle=False,
|
||||||
... batch_size=16,
|
... batch_size=16,
|
||||||
... collate_fn=data_collator,
|
... collate_fn=data_collator,
|
||||||
|
|||||||
@@ -175,18 +175,18 @@ At this point, only three steps remain:
|
|||||||
```
|
```
|
||||||
</pt>
|
</pt>
|
||||||
<tf>
|
<tf>
|
||||||
To fine-tune a model in TensorFlow, start by converting your datasets to the `tf.data.Dataset` format with [`~datasets.Dataset.to_tf_dataset`]. Specify inputs and labels in `columns`, whether to shuffle the dataset order, batch size, and the data collator:
|
To fine-tune a model in TensorFlow, start by converting your datasets to the `tf.data.Dataset` format with [`~TFPreTrainedModel.prepare_tf_dataset`].
|
||||||
|
|
||||||
```py
|
```py
|
||||||
>>> tf_train_set = tokenized_books["train"].to_tf_dataset(
|
>>> tf_train_set = model.prepare_tf_dataset(
|
||||||
... columns=["attention_mask", "input_ids", "labels"],
|
... tokenized_books["train"],
|
||||||
... shuffle=True,
|
... shuffle=True,
|
||||||
... batch_size=16,
|
... batch_size=16,
|
||||||
... collate_fn=data_collator,
|
... collate_fn=data_collator,
|
||||||
... )
|
... )
|
||||||
|
|
||||||
>>> tf_test_set = tokenized_books["test"].to_tf_dataset(
|
>>> tf_test_set = model.prepare_tf_dataset(
|
||||||
... columns=["attention_mask", "input_ids", "labels"],
|
... tokenized_books["test"],
|
||||||
... shuffle=False,
|
... shuffle=False,
|
||||||
... batch_size=16,
|
... batch_size=16,
|
||||||
... collate_fn=data_collator,
|
... collate_fn=data_collator,
|
||||||
@@ -216,7 +216,7 @@ Configure the model for training with [`compile`](https://keras.io/api/models/mo
|
|||||||
Call [`fit`](https://keras.io/api/models/model_training_apis/#fit-method) to fine-tune the model:
|
Call [`fit`](https://keras.io/api/models/model_training_apis/#fit-method) to fine-tune the model:
|
||||||
|
|
||||||
```py
|
```py
|
||||||
>>> model.fit(x=tf_train_set, validation_data=tf_test_set, epochs=3)
|
>>> model.fit(tf_train_set, validation_data=tf_test_set, epochs=3)
|
||||||
```
|
```
|
||||||
</tf>
|
</tf>
|
||||||
</frameworkcontent>
|
</frameworkcontent>
|
||||||
|
|||||||
@@ -65,10 +65,16 @@ If you like, you can create a smaller subset of the full dataset to fine-tune on
|
|||||||
|
|
||||||
## Train
|
## Train
|
||||||
|
|
||||||
|
At this point, you should follow the section corresponding to the framework you want to use. You can use the links
|
||||||
|
in the right sidebar to jump to the one you want - and if you want to hide all of the content for a given framework,
|
||||||
|
just use the button at the top-right of that framework's block!
|
||||||
|
|
||||||
<frameworkcontent>
|
<frameworkcontent>
|
||||||
<pt>
|
<pt>
|
||||||
<Youtube id="nvBXf7s7vTI"/>
|
<Youtube id="nvBXf7s7vTI"/>
|
||||||
|
|
||||||
|
## Train with PyTorch Trainer
|
||||||
|
|
||||||
🤗 Transformers provides a [`Trainer`] class optimized for training 🤗 Transformers models, making it easier to start training without manually writing your own training loop. The [`Trainer`] API supports a wide range of training options and features such as logging, gradient accumulation, and mixed precision.
|
🤗 Transformers provides a [`Trainer`] class optimized for training 🤗 Transformers models, making it easier to start training without manually writing your own training loop. The [`Trainer`] API supports a wide range of training options and features such as logging, gradient accumulation, and mixed precision.
|
||||||
|
|
||||||
Start by loading your model and specify the number of expected labels. From the Yelp Review [dataset card](https://huggingface.co/datasets/yelp_review_full#data-fields), you know there are five labels:
|
Start by loading your model and specify the number of expected labels. From the Yelp Review [dataset card](https://huggingface.co/datasets/yelp_review_full#data-fields), you know there are five labels:
|
||||||
@@ -151,66 +157,113 @@ Then fine-tune your model by calling [`~transformers.Trainer.train`]:
|
|||||||
|
|
||||||
<Youtube id="rnTGBy2ax1c"/>
|
<Youtube id="rnTGBy2ax1c"/>
|
||||||
|
|
||||||
🤗 Transformers models also supports training in TensorFlow with the Keras API.
|
## Train a TensorFlow model with Keras
|
||||||
|
|
||||||
### Convert dataset to TensorFlow format
|
You can also train 🤗 Transformers models in TensorFlow with the Keras API!
|
||||||
|
|
||||||
The [`DefaultDataCollator`] assembles tensors into a batch for the model to train on. Make sure you specify `return_tensors` to return TensorFlow tensors:
|
### Loading data for Keras
|
||||||
|
|
||||||
|
When you want to train a 🤗 Transformers model with the Keras API, you need to convert your dataset to a format that
|
||||||
|
Keras understands. If your dataset is small, you can just convert the whole thing to NumPy arrays and pass it to Keras.
|
||||||
|
Let's try that first before we do anything more complicated.
|
||||||
|
|
||||||
|
First, load a dataset. We'll use the CoLA dataset from the [GLUE benchmark](https://huggingface.co/datasets/glue),
|
||||||
|
since it's a simple binary text classification task, and just take the training split for now.
|
||||||
|
|
||||||
```py
|
```py
|
||||||
>>> from transformers import DefaultDataCollator
|
from datasets import load_dataset
|
||||||
|
|
||||||
>>> data_collator = DefaultDataCollator(return_tensors="tf")
|
dataset = load_dataset("glue", "cola")
|
||||||
|
dataset = dataset["train"] # Just take the training split for now
|
||||||
|
```
|
||||||
|
|
||||||
|
Next, load a tokenizer and tokenize the data as NumPy arrays. Note that the labels are already a list of 0s and 1s,
|
||||||
|
so we can just convert that directly to a NumPy array without tokenization!
|
||||||
|
|
||||||
|
```py
|
||||||
|
import numpy as np
from transformers import AutoTokenizer
|
||||||
|
|
||||||
|
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
|
||||||
|
tokenized_data = tokenizer(dataset["sentence"], return_tensors="np", padding=True)
|
||||||
|
|
||||||
|
labels = np.array(dataset["label"]) # Label is already an array of 0 and 1
|
||||||
|
```
|
||||||
|
|
||||||
|
Finally, load, [`compile`](https://keras.io/api/models/model_training_apis/#compile-method), and [`fit`](https://keras.io/api/models/model_training_apis/#fit-method) the model:
|
||||||
|
|
||||||
|
```py
|
||||||
|
from transformers import TFAutoModelForSequenceClassification
|
||||||
|
from tensorflow.keras.optimizers import Adam
|
||||||
|
|
||||||
|
# Load and compile our model
|
||||||
|
model = TFAutoModelForSequenceClassification.from_pretrained("bert-base-cased")
|
||||||
|
# Lower learning rates are often better for fine-tuning transformers
|
||||||
|
model.compile(optimizer=Adam(3e-5))
|
||||||
|
|
||||||
|
model.fit(tokenized_data, labels)
|
||||||
```
|
```
|
||||||
|
|
||||||
<Tip>
|
<Tip>
|
||||||
|
|
||||||
[`Trainer`] uses [`DataCollatorWithPadding`] by default so you don't need to explicitly specify a data collator.
|
You don't have to pass a loss argument to your models when you `compile()` them! Hugging Face models automatically
|
||||||
|
choose a loss that is appropriate for their task and model architecture if this argument is left blank. You can always
|
||||||
|
override this by specifying a loss yourself if you want to!
|
||||||
|
|
||||||
</Tip>
|
</Tip>
|
||||||
|
|
||||||
Next, convert the tokenized datasets to TensorFlow datasets with the [`~datasets.Dataset.to_tf_dataset`] method. Specify your inputs in `columns`, and your label in `label_cols`:
|
This approach works great for smaller datasets, but for larger datasets, you might find it starts to become a problem. Why?
|
||||||
|
Because the tokenized array and labels would have to be fully loaded into memory, and because NumPy doesn’t handle
|
||||||
|
“jagged” arrays, so every tokenized sample would have to be padded to the length of the longest sample in the whole
|
||||||
|
dataset. That’s going to make your array even bigger, and all those padding tokens will slow down training too!
|
||||||
|
|
||||||
|
### Loading data as a tf.data.Dataset
|
||||||
|
|
||||||
|
If you want to avoid slowing down training, you can load your data as a `tf.data.Dataset` instead. Although you can write your own
|
||||||
|
`tf.data` pipeline if you want, we have two convenience methods for doing this:
|
||||||
|
|
||||||
|
- [`~TFPreTrainedModel.prepare_tf_dataset`]: This is the method we recommend in most cases. Because it is a method
|
||||||
|
on your model, it can inspect the model to automatically figure out which columns are usable as model inputs, and
|
||||||
|
discard the others to make a simpler, more performant dataset.
|
||||||
|
- [`~datasets.Dataset.to_tf_dataset`]: This method is more low-level, and is useful when you want to exactly control how
|
||||||
|
your dataset is created, by specifying exactly which `columns` and `label_cols` to include.
|
||||||
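The list above says `prepare_tf_dataset` can inspect the model to decide which columns are usable as inputs. The sketch below only illustrates that idea in plain Python; it is not the library's implementation. The `call` function and the column names are hypothetical stand-ins chosen for the example.

```python
import inspect


def call(input_ids, attention_mask=None, token_type_ids=None, labels=None):
    """Hypothetical stand-in for a model's forward signature."""


def usable_columns(dataset_columns, model_call):
    """Keep only the dataset columns whose names match a parameter of `model_call`."""
    accepted = set(inspect.signature(model_call).parameters)
    return [name for name in dataset_columns if name in accepted]


# Column names a tokenized dataset might have (assumed for illustration).
columns = ["sentence", "idx", "input_ids", "token_type_ids", "attention_mask", "labels"]
kept = usable_columns(columns, call)
print(kept)  # ['input_ids', 'token_type_ids', 'attention_mask', 'labels']
```

Raw text and index columns are dropped because the model has no argument with those names, which is roughly why the resulting dataset is simpler than one built by listing `columns` by hand.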
|
|
||||||
|
Before you can use [`~TFPreTrainedModel.prepare_tf_dataset`], you will need to add the tokenizer outputs to your dataset as columns, as shown in
|
||||||
|
the following code sample:
|
||||||
|
|
||||||
```py
|
```py
|
||||||
>>> tf_train_dataset = small_train_dataset.to_tf_dataset(
|
def tokenize_dataset(data):
|
||||||
... columns=["attention_mask", "input_ids", "token_type_ids"],
|
    # Keys of the returned dictionary will be added to the dataset as columns
|
||||||
... label_cols=["labels"],
|
    return tokenizer(data["sentence"])
|
||||||
... shuffle=True,
|
|
||||||
... collate_fn=data_collator,
|
|
||||||
... batch_size=8,
|
|
||||||
... )
|
|
||||||
|
|
||||||
>>> tf_validation_dataset = small_eval_dataset.to_tf_dataset(
|
|
||||||
... columns=["attention_mask", "input_ids", "token_type_ids"],
|
dataset = dataset.map(tokenize_dataset)
|
||||||
... label_cols=["labels"],
|
|
||||||
... shuffle=False,
|
|
||||||
... collate_fn=data_collator,
|
|
||||||
... batch_size=8,
|
|
||||||
... )
|
|
||||||
```
|
```
|
||||||
|
|
||||||
### Compile and fit
|
Remember that Hugging Face datasets are stored on disk by default, so this will not inflate your memory usage! Once the
|
||||||
|
columns have been added, you can stream batches from the dataset and add padding to each batch, which greatly
|
||||||
|
reduces the number of padding tokens compared to padding the entire dataset.
|
||||||
|
|
||||||
Load a TensorFlow model with the expected number of labels:
|
|
||||||
|
|
||||||
```py
|
```py
|
||||||
>>> import tensorflow as tf
|
>>> tf_dataset = model.prepare_tf_dataset(dataset, batch_size=16, shuffle=True, tokenizer=tokenizer)
|
||||||
>>> from transformers import TFAutoModelForSequenceClassification
|
|
||||||
|
|
||||||
>>> model = TFAutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=5)
|
|
||||||
```
|
```
|
||||||
|
|
||||||
Then compile and fine-tune your model with [`fit`](https://keras.io/api/models/model_training_apis/) as you would with any other Keras model:
|
Note that in the code sample above, you need to pass the tokenizer to `prepare_tf_dataset` so it can correctly pad batches as they're loaded.
|
||||||
|
If all the samples in your dataset are the same length and no padding is necessary, you can skip this argument.
|
||||||
|
If you need to do something more complex than just padding samples (e.g. corrupting tokens for masked language
|
||||||
|
modelling), you can use the `collate_fn` argument instead to pass a function that will be called to transform the
|
||||||
|
list of samples into a batch and apply any preprocessing you want. See our
|
||||||
|
[examples](https://github.com/huggingface/transformers/tree/main/examples) or
|
||||||
|
[notebooks](https://huggingface.co/docs/transformers/notebooks) to see this approach in action.
|
||||||
|
|
||||||
|
Once you've created a `tf.data.Dataset`, you can compile and fit the model as before:
|
||||||
|
|
||||||
```py
|
```py
|
||||||
>>> model.compile(
|
model.compile(optimizer=Adam(3e-5))
|
||||||
... optimizer=tf.keras.optimizers.Adam(learning_rate=5e-5),
|
|
||||||
... loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
|
|
||||||
... metrics=tf.metrics.SparseCategoricalAccuracy(),
|
|
||||||
... )
|
|
||||||
|
|
||||||
>>> model.fit(tf_train_dataset, validation_data=tf_validation_dataset, epochs=3)
|
model.fit(tf_dataset)
|
||||||
```
|
```
|
||||||
|
|
||||||
</tf>
|
</tf>
|
||||||
</frameworkcontent>
|
</frameworkcontent>
|
||||||
|
|
||||||
|
|||||||
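The new `training.mdx` text argues that padding each batch separately needs far fewer padding tokens than padding the whole dataset to its longest sample. A minimal sketch of that arithmetic in plain Python follows. The token ids, the padding id `0`, and the helper names are assumptions made up for illustration, and no shuffling is applied.

```python
# Hypothetical token-id sequences of varying length (stand-ins for tokenizer output).
samples = [[101, 7, 102], [101, 7, 8, 9, 10, 11, 12, 102], [101, 102], [101, 5, 6, 102]]
PAD_ID = 0  # assumed padding id; none of the samples above contain it


def pad_batch(batch, pad_id=PAD_ID):
    """Pad every sequence in `batch` to the length of the longest one in that batch."""
    longest = max(len(seq) for seq in batch)
    return [seq + [pad_id] * (longest - len(seq)) for seq in batch]


def count_padding(padded, pad_id=PAD_ID):
    """Count how many padding tokens a padded batch contains."""
    return sum(seq.count(pad_id) for seq in padded)


# Strategy 1: pad the whole dataset at once, as a single NumPy array would require.
whole = pad_batch(samples)
whole_padding = count_padding(whole)

# Strategy 2: split into batches of 2 first, then pad each batch on its own.
batch_size = 2
batches = [samples[i:i + batch_size] for i in range(0, len(samples), batch_size)]
per_batch_padding = sum(count_padding(pad_batch(b)) for b in batches)

print(whole_padding)      # 15
print(per_batch_padding)  # 7
```

With sample lengths 3, 8, 2, and 4, whole-dataset padding adds 15 padding tokens while per-batch padding adds 7. The exact saving depends on how samples are grouped, so shuffled batches will give different numbers.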