Enable doc in Spanish (#16518)
* Reorganize doc for multilingual support * Fix style * Style * Toc trees * Adapt templates
This commit is contained in:
218
docs/source/en/tasks/asr.mdx
Normal file
218
docs/source/en/tasks/asr.mdx
Normal file
@@ -0,0 +1,218 @@
|
||||
<!--Copyright 2022 The HuggingFace Team. All rights reserved.
|
||||
|
||||
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
|
||||
the License. You may obtain a copy of the License at
|
||||
|
||||
http://www.apache.org/licenses/LICENSE-2.0
|
||||
|
||||
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
|
||||
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
|
||||
specific language governing permissions and limitations under the License.
|
||||
-->
|
||||
|
||||
# Automatic speech recognition
|
||||
|
||||
<Youtube id="TksaY_FDgnk"/>
|
||||
|
||||
Automatic speech recognition (ASR) converts a speech signal to text. It is an example of a sequence-to-sequence task, going from a sequence of audio inputs to textual outputs. Voice assistants like Siri and Alexa utilize ASR models to assist users.
|
||||
|
||||
This guide will show you how to fine-tune [Wav2Vec2](https://huggingface.co/facebook/wav2vec2-base) on the [TIMIT](https://huggingface.co/datasets/timit_asr) dataset to transcribe audio to text.
|
||||
|
||||
<Tip>
|
||||
|
||||
See the automatic speech recognition [task page](https://huggingface.co/tasks/automatic-speech-recognition) for more information about its associated models, datasets, and metrics.
|
||||
|
||||
</Tip>
|
||||
|
||||
## Load TIMIT dataset
|
||||
|
||||
Load the TIMIT dataset from the 🤗 Datasets library:
|
||||
|
||||
```py
|
||||
>>> from datasets import load_dataset
|
||||
|
||||
>>> timit = load_dataset("timit_asr")
|
||||
```
|
||||
|
||||
Then take a look at an example:
|
||||
|
||||
```py
|
||||
>>> timit
|
||||
DatasetDict({
|
||||
train: Dataset({
|
||||
features: ['file', 'audio', 'text', 'phonetic_detail', 'word_detail', 'dialect_region', 'sentence_type', 'speaker_id', 'id'],
|
||||
num_rows: 4620
|
||||
})
|
||||
test: Dataset({
|
||||
features: ['file', 'audio', 'text', 'phonetic_detail', 'word_detail', 'dialect_region', 'sentence_type', 'speaker_id', 'id'],
|
||||
num_rows: 1680
|
||||
})
|
||||
})
|
||||
```
|
||||
|
||||
While the dataset contains a lot of helpful information, like `dialect_region` and `sentence_type`, you will focus on the `audio` and `text` fields in this guide. Remove the other columns:
|
||||
|
||||
```py
|
||||
>>> timit = timit.remove_columns(
|
||||
... ["phonetic_detail", "word_detail", "dialect_region", "id", "sentence_type", "speaker_id"]
|
||||
... )
|
||||
```
|
||||
|
||||
Take a look at the example again:
|
||||
|
||||
```py
|
||||
>>> timit["train"][0]
|
||||
{'audio': {'array': array([-2.1362305e-04, 6.1035156e-05, 3.0517578e-05, ...,
|
||||
-3.0517578e-05, -9.1552734e-05, -6.1035156e-05], dtype=float32),
|
||||
'path': '/root/.cache/huggingface/datasets/downloads/extracted/404950a46da14eac65eb4e2a8317b1372fb3971d980d91d5d5b221275b1fd7e0/data/TRAIN/DR4/MMDM0/SI681.WAV',
|
||||
'sampling_rate': 16000},
|
||||
'file': '/root/.cache/huggingface/datasets/downloads/extracted/404950a46da14eac65eb4e2a8317b1372fb3971d980d91d5d5b221275b1fd7e0/data/TRAIN/DR4/MMDM0/SI681.WAV',
|
||||
'text': 'Would such an act of refusal be useful?'}
|
||||
```
|
||||
|
||||
The `audio` column contains a 1-dimensional `array` of the speech signal that must be called to load and resample the audio file.
|
||||
|
||||
## Preprocess
|
||||
|
||||
Load the Wav2Vec2 processor to process the audio signal and transcribed text:
|
||||
|
||||
```py
|
||||
>>> from transformers import AutoProcessor
|
||||
|
||||
>>> processor = AutoProcessor.from_pretrained("facebook/wav2vec2-base")
|
||||
```
|
||||
|
||||
The preprocessing function needs to:
|
||||
|
||||
1. Call the `audio` column to load and resample the audio file.
|
||||
2. Extract the `input_values` from the audio file.
|
||||
3. Typically, when you call the processor, you call the feature extractor. Since you also want to tokenize text, instruct the processor to call the tokenizer instead with a context manager.
|
||||
|
||||
```py
|
||||
>>> def prepare_dataset(batch):
|
||||
... audio = batch["audio"]
|
||||
|
||||
... batch["input_values"] = processor(audio["array"], sampling_rate=audio["sampling_rate"]).input_values[0]
|
||||
... batch["input_length"] = len(batch["input_values"])
|
||||
|
||||
... with processor.as_target_processor():
|
||||
... batch["labels"] = processor(batch["text"]).input_ids
|
||||
... return batch
|
||||
```
|
||||
|
||||
Use 🤗 Datasets [`map`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.map) function to apply the preprocessing function over the entire dataset. You can speed up the map function by increasing the number of processes with `num_proc`. Remove the columns you don't need:
|
||||
|
||||
```py
|
||||
>>> timit = timit.map(prepare_dataset, remove_columns=timit.column_names["train"], num_proc=4)
|
||||
```
|
||||
|
||||
🤗 Transformers doesn't have a data collator for automatic speech recognition, so you will need to create one. You can adapt the [`DataCollatorWithPadding`] to create a batch of examples for automatic speech recognition. It will also dynamically pad your text and labels to the length of the longest element in its batch, so they are a uniform length. While it is possible to pad your text in the `tokenizer` function by setting `padding=True`, dynamic padding is more efficient.
|
||||
|
||||
Unlike other data collators, this specific data collator needs to apply a different padding method to `input_values` and `labels`. You can apply a different padding method with a context manager:
|
||||
|
||||
```py
|
||||
>>> import torch
|
||||
|
||||
>>> from dataclasses import dataclass, field
|
||||
>>> from typing import Any, Dict, List, Optional, Union
|
||||
|
||||
|
||||
>>> @dataclass
|
||||
... class DataCollatorCTCWithPadding:
|
||||
|
||||
... processor: AutoProcessor
|
||||
... padding: Union[bool, str] = True
|
||||
|
||||
... def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]:
|
||||
... # split inputs and labels since they have to be of different lengths and need
|
||||
... # different padding methods
|
||||
... input_features = [{"input_values": feature["input_values"]} for feature in features]
|
||||
... label_features = [{"input_ids": feature["labels"]} for feature in features]
|
||||
|
||||
... batch = self.processor.pad(
|
||||
... input_features,
|
||||
... padding=self.padding,
|
||||
... return_tensors="pt",
|
||||
... )
|
||||
... with self.processor.as_target_processor():
|
||||
... labels_batch = self.processor.pad(
|
||||
... label_features,
|
||||
... padding=self.padding,
|
||||
... return_tensors="pt",
|
||||
... )
|
||||
|
||||
... # replace padding with -100 to ignore loss correctly
|
||||
... labels = labels_batch["input_ids"].masked_fill(labels_batch.attention_mask.ne(1), -100)
|
||||
|
||||
... batch["labels"] = labels
|
||||
|
||||
... return batch
|
||||
```
|
||||
|
||||
Create a batch of examples and dynamically pad them with `DataCollatorForCTCWithPadding`:
|
||||
|
||||
```py
|
||||
>>> data_collator = DataCollatorCTCWithPadding(processor=processor, padding=True)
|
||||
```
|
||||
|
||||
## Train
|
||||
|
||||
<frameworkcontent>
|
||||
<pt>
|
||||
Load Wav2Vec2 with [`AutoModelForCTC`]. For `ctc_loss_reduction`, it is often better to use the average instead of the default summation:
|
||||
|
||||
```py
|
||||
>>> from transformers import AutoModelForCTC, TrainingArguments, Trainer
|
||||
|
||||
>>> model = AutoModelForCTC.from_pretrained(
|
||||
... "facebook/wav2vec-base",
|
||||
... ctc_loss_reduction="mean",
|
||||
... pad_token_id=processor.tokenizer.pad_token_id,
|
||||
... )
|
||||
```
|
||||
|
||||
<Tip>
|
||||
|
||||
If you aren't familiar with fine-tuning a model with the [`Trainer`], take a look at the basic tutorial [here](../training#finetune-with-trainer)!
|
||||
|
||||
</Tip>
|
||||
|
||||
At this point, only three steps remain:
|
||||
|
||||
1. Define your training hyperparameters in [`TrainingArguments`].
|
||||
2. Pass the training arguments to [`Trainer`] along with the model, datasets, tokenizer, and data collator.
|
||||
3. Call [`~Trainer.train`] to fine-tune your model.
|
||||
|
||||
```py
|
||||
>>> training_args = TrainingArguments(
|
||||
... output_dir="./results",
|
||||
... group_by_length=True,
|
||||
... per_device_train_batch_size=16,
|
||||
... evaluation_strategy="steps",
|
||||
... num_train_epochs=3,
|
||||
... fp16=True,
|
||||
... gradient_checkpointing=True,
|
||||
... learning_rate=1e-4,
|
||||
... weight_decay=0.005,
|
||||
... save_total_limit=2,
|
||||
... )
|
||||
|
||||
>>> trainer = Trainer(
|
||||
... model=model,
|
||||
... args=training_args,
|
||||
... train_dataset=timit["train"],
|
||||
... eval_dataset=timit["test"],
|
||||
... tokenizer=processor.feature_extractor,
|
||||
... data_collator=data_collator,
|
||||
... )
|
||||
|
||||
>>> trainer.train()
|
||||
```
|
||||
</pt>
|
||||
</frameworkcontent>
|
||||
|
||||
<Tip>
|
||||
|
||||
For a more in-depth example of how to fine-tune a model for automatic speech recognition, take a look at this blog [post](https://huggingface.co/blog/fine-tune-wav2vec2-english) for English ASR and this [post](https://huggingface.co/blog/fine-tune-xlsr-wav2vec2) for multilingual ASR.
|
||||
|
||||
</Tip>
|
||||
147
docs/source/en/tasks/audio_classification.mdx
Normal file
147
docs/source/en/tasks/audio_classification.mdx
Normal file
@@ -0,0 +1,147 @@
|
||||
<!--Copyright 2022 The HuggingFace Team. All rights reserved.
|
||||
|
||||
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
|
||||
the License. You may obtain a copy of the License at
|
||||
|
||||
http://www.apache.org/licenses/LICENSE-2.0
|
||||
|
||||
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
|
||||
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
|
||||
specific language governing permissions and limitations under the License.
|
||||
-->
|
||||
|
||||
# Audio classification
|
||||
|
||||
<Youtube id="KWwzcmG98Ds"/>
|
||||
|
||||
Audio classification assigns a label or class to audio data. It is similar to text classification, except an audio input is continuous and must be discretized, whereas text can be split into tokens. Some practical applications of audio classification include identifying intent, speakers, and even animal species by their sounds.
|
||||
|
||||
This guide will show you how to fine-tune [Wav2Vec2](https://huggingface.co/facebook/wav2vec2-base) on the Keyword Spotting subset of the [SUPERB](https://huggingface.co/datasets/superb) benchmark to classify utterances.
|
||||
|
||||
<Tip>
|
||||
|
||||
See the audio classification [task page](https://huggingface.co/tasks/audio-classification) for more information about its associated models, datasets, and metrics.
|
||||
|
||||
</Tip>
|
||||
|
||||
## Load SUPERB dataset
|
||||
|
||||
Load the SUPERB dataset from the 🤗 Datasets library:
|
||||
|
||||
```py
|
||||
>>> from datasets import load_dataset
|
||||
|
||||
>>> ks = load_dataset("superb", "ks")
|
||||
```
|
||||
|
||||
Then take a look at an example:
|
||||
|
||||
```py
|
||||
>>> ks["train"][0]
|
||||
{'audio': {'array': array([ 0. , 0. , 0. , ..., -0.00592041, -0.00405884, -0.00253296], dtype=float32), 'path': '/root/.cache/huggingface/datasets/downloads/extracted/05734a36d88019a09725c20cc024e1c4e7982e37d7d55c0c1ca1742ea1cdd47f/_background_noise_/doing_the_dishes.wav', 'sampling_rate': 16000}, 'file': '/root/.cache/huggingface/datasets/downloads/extracted/05734a36d88019a09725c20cc024e1c4e7982e37d7d55c0c1ca1742ea1cdd47f/_background_noise_/doing_the_dishes.wav', 'label': 10}
|
||||
```
|
||||
|
||||
The `audio` column contains a 1-dimensional `array` of the speech signal that must be called to load and resample the audio file. The `label` column is an integer that represents the utterance class. Create a dictionary that maps a label name to an integer and vice versa. The mapping will help the model recover the label name from the label number:
|
||||
|
||||
```py
|
||||
>>> labels = ks["train"].features["label"].names
|
||||
>>> label2id, id2label = dict(), dict()
|
||||
>>> for i, label in enumerate(labels):
|
||||
... label2id[label] = str(i)
|
||||
... id2label[str(i)] = label
|
||||
```
|
||||
|
||||
Now you can convert the label number to a label name for more information:
|
||||
|
||||
```py
|
||||
>>> id2label[str(10)]
|
||||
'_silence_'
|
||||
```
|
||||
|
||||
Each keyword - or label - corresponds to a number; `10` indicates `silence` in the example above.
|
||||
|
||||
## Preprocess
|
||||
|
||||
Load the Wav2Vec2 feature extractor to process the audio signal:
|
||||
|
||||
```py
|
||||
>>> from transformers import AutoFeatureExtractor
|
||||
|
||||
>>> feature_extractor = AutoFeatureExtractor.from_pretrained("facebook/wav2vec2-base")
|
||||
```
|
||||
|
||||
The preprocessing function needs to:
|
||||
|
||||
1. Call the `audio` column to load and if necessary resample the audio file.
|
||||
2. Check the sampling rate of the audio file matches the sampling rate of the audio data a model was pretrained with. You can find this information on the Wav2Vec2 [model card]((https://huggingface.co/facebook/wav2vec2-base)).
|
||||
3. Set a maximum input length so longer inputs are batched without being truncated.
|
||||
|
||||
```py
|
||||
>>> def preprocess_function(examples):
|
||||
... audio_arrays = [x["array"] for x in examples["audio"]]
|
||||
... inputs = feature_extractor(
|
||||
... audio_arrays, sampling_rate=feature_extractor.sampling_rate, max_length=16000, truncation=True
|
||||
... )
|
||||
... return inputs
|
||||
```
|
||||
|
||||
Use 🤗 Datasets [`map`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.map) function to apply the preprocessing function over the entire dataset. You can speed up the `map` function by setting `batched=True` to process multiple elements of the dataset at once. Remove the columns you don't need:
|
||||
|
||||
```py
|
||||
>>> encoded_ks = ks.map(preprocess_function, remove_columns=["audio", "file"], batched=True)
|
||||
```
|
||||
|
||||
## Train
|
||||
|
||||
<frameworkcontent>
|
||||
<pt>
|
||||
Load Wav2Vec2 with [`AutoModelForAudioClassification`]. Specify the number of labels, and pass the model the mapping between label number and label class:
|
||||
|
||||
```py
|
||||
>>> from transformers import AutoModelForAudioClassification, TrainingArguments, Trainer
|
||||
|
||||
>>> num_labels = len(id2label)
|
||||
>>> model = AutoModelForAudioClassification.from_pretrained(
|
||||
... "facebook/wav2vec2-base", num_labels=num_labels, label2id=label2id, id2label=id2label
|
||||
... )
|
||||
```
|
||||
|
||||
<Tip>
|
||||
|
||||
If you aren't familiar with fine-tuning a model with the [`Trainer`], take a look at the basic tutorial [here](../training#finetune-with-trainer)!
|
||||
|
||||
</Tip>
|
||||
|
||||
At this point, only three steps remain:
|
||||
|
||||
1. Define your training hyperparameters in [`TrainingArguments`].
|
||||
2. Pass the training arguments to [`Trainer`] along with the model, datasets, and feature extractor.
|
||||
3. Call [`~Trainer.train`] to fine-tune your model.
|
||||
|
||||
```py
|
||||
>>> training_args = TrainingArguments(
|
||||
... output_dir="./results",
|
||||
... evaluation_strategy="epoch",
|
||||
... save_strategy="epoch",
|
||||
... learning_rate=3e-5,
|
||||
... num_train_epochs=5,
|
||||
... )
|
||||
|
||||
>>> trainer = Trainer(
|
||||
... model=model,
|
||||
... args=training_args,
|
||||
... train_dataset=encoded_ks["train"],
|
||||
... eval_dataset=encoded_ks["validation"],
|
||||
... tokenizer=feature_extractor,
|
||||
... )
|
||||
|
||||
>>> trainer.train()
|
||||
```
|
||||
</pt>
|
||||
</frameworkcontent>
|
||||
|
||||
<Tip>
|
||||
|
||||
For a more in-depth example of how to fine-tune a model for audio classification, take a look at the corresponding [PyTorch notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/audio_classification.ipynb).
|
||||
|
||||
</Tip>
|
||||
174
docs/source/en/tasks/image_classification.mdx
Normal file
174
docs/source/en/tasks/image_classification.mdx
Normal file
@@ -0,0 +1,174 @@
|
||||
<!--Copyright 2022 The HuggingFace Team. All rights reserved.
|
||||
|
||||
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
|
||||
the License. You may obtain a copy of the License at
|
||||
|
||||
http://www.apache.org/licenses/LICENSE-2.0
|
||||
|
||||
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
|
||||
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
|
||||
specific language governing permissions and limitations under the License.
|
||||
-->
|
||||
|
||||
# Image classification
|
||||
|
||||
<Youtube id="tjAIM7BOYhw"/>
|
||||
|
||||
Image classification assigns a label or class to an image. Unlike text or audio classification, the inputs are the pixel values that represent an image. There are many uses for image classification, like detecting damage after a disaster, monitoring crop health, or helping screen medical images for signs of disease.
|
||||
|
||||
This guide will show you how to fine-tune [ViT](https://huggingface.co/docs/transformers/v4.16.2/en/model_doc/vit) on the [Food-101](https://huggingface.co/datasets/food101) dataset to classify a food item in an image.
|
||||
|
||||
<Tip>
|
||||
|
||||
See the image classification [task page](https://huggingface.co/tasks/audio-classification) for more information about its associated models, datasets, and metrics.
|
||||
|
||||
</Tip>
|
||||
|
||||
## Load Food-101 dataset
|
||||
|
||||
Load only the first 5000 images of the Food-101 dataset from the 🤗 Datasets library since it is pretty large:
|
||||
|
||||
```py
|
||||
>>> from datasets import load_dataset
|
||||
|
||||
>>> food = load_dataset("food101", split="train[:5000]")
|
||||
```
|
||||
|
||||
Split this dataset into a train and test set:
|
||||
|
||||
```py
|
||||
>>> food = food.train_test_split(test_size=0.2)
|
||||
```
|
||||
|
||||
Then take a look at an example:
|
||||
|
||||
```py
|
||||
>>> food["train"][0]
|
||||
{'image': <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=512x512 at 0x7F52AFC8AC50>,
|
||||
'label': 79}
|
||||
```
|
||||
|
||||
The `image` field contains a PIL image, and each `label` is an integer that represents a class. Create a dictionary that maps a label name to an integer and vice versa. The mapping will help the model recover the label name from the label number:
|
||||
|
||||
```py
|
||||
>>> labels = food["train"].features["label"].names
|
||||
>>> label2id, id2label = dict(), dict()
|
||||
>>> for i, label in enumerate(labels):
|
||||
... label2id[label] = str(i)
|
||||
... id2label[str(i)] = label
|
||||
```
|
||||
|
||||
Now you can convert the label number to a label name for more information:
|
||||
|
||||
```py
|
||||
>>> id2label[str(79)]
|
||||
'prime_rib'
|
||||
```
|
||||
|
||||
Each food class - or label - corresponds to a number; `79` indicates a prime rib in the example above.
|
||||
|
||||
## Preprocess
|
||||
|
||||
Load the ViT feature extractor to process the image into a tensor:
|
||||
|
||||
```py
|
||||
>>> from transformers import AutoFeatureExtractor
|
||||
|
||||
>>> feature_extractor = AutoFeatureExtractor.from_pretrained("google/vit-base-patch16-224-in21k")
|
||||
```
|
||||
|
||||
Apply several image transformations to the dataset to make the model more robust against overfitting. Here you'll use torchvision's [`transforms`](https://pytorch.org/vision/stable/transforms.html) module. Crop a random part of the image, resize it, and normalize it with the image mean and standard deviation:
|
||||
|
||||
```py
|
||||
>>> from torchvision.transforms import RandomResizedCrop, Compose, Normalize, ToTensor
|
||||
|
||||
>>> normalize = Normalize(mean=feature_extractor.image_mean, std=feature_extractor.image_std)
|
||||
>>> _transforms = Compose([RandomResizedCrop(feature_extractor.size), ToTensor(), normalize])
|
||||
```
|
||||
|
||||
Create a preprocessing function that will apply the transforms and return the `pixel_values` - the inputs to the model - of the image:
|
||||
|
||||
```py
|
||||
>>> def transforms(examples):
|
||||
... examples["pixel_values"] = [_transforms(img.convert("RGB")) for img in examples["image"]]
|
||||
... del examples["image"]
|
||||
... return examples
|
||||
```
|
||||
|
||||
Use 🤗 Dataset's [`with_transform`](https://huggingface.co/docs/datasets/package_reference/main_classes.html?#datasets.Dataset.with_transform) method to apply the transforms over the entire dataset. The transforms are applied on-the-fly when you load an element of the dataset:
|
||||
|
||||
```py
|
||||
>>> food = food.with_transform(transforms)
|
||||
```
|
||||
|
||||
Use [`DefaultDataCollator`] to create a batch of examples. Unlike other data collators in 🤗 Transformers, the DefaultDataCollator does not apply additional preprocessing such as padding.
|
||||
|
||||
```py
|
||||
>>> from transformers import DefaultDataCollator
|
||||
|
||||
>>> data_collator = DefaultDataCollator()
|
||||
```
|
||||
|
||||
## Train
|
||||
|
||||
<frameworkcontent>
|
||||
<pt>
|
||||
Load ViT with [`AutoModelForImageClassification`]. Specify the number of labels, and pass the model the mapping between label number and label class:
|
||||
|
||||
```py
|
||||
>>> from transformers import AutoModelForImageClassification, TrainingArguments, Trainer
|
||||
|
||||
>>> model = AutoModelForImageClassification.from_pretrained(
|
||||
... "google/vit-base-patch16-224-in21k",
|
||||
... num_labels=len(labels),
|
||||
... id2label=id2label,
|
||||
... label2id=label2id,
|
||||
... )
|
||||
```
|
||||
|
||||
<Tip>
|
||||
|
||||
If you aren't familiar with fine-tuning a model with the [`Trainer`], take a look at the basic tutorial [here](../training#finetune-with-trainer)!
|
||||
|
||||
</Tip>
|
||||
|
||||
At this point, only three steps remain:
|
||||
|
||||
1. Define your training hyperparameters in [`TrainingArguments`]. It is important you don't remove unused columns because this will drop the `image` column. Without the `image` column, you can't create `pixel_values`. Set `remove_unused_columns=False` to prevent this behavior!
|
||||
2. Pass the training arguments to [`Trainer`] along with the model, datasets, tokenizer, and data collator.
|
||||
3. Call [`~Trainer.train`] to fine-tune your model.
|
||||
|
||||
```py
|
||||
>>> training_args = TrainingArguments(
|
||||
... output_dir="./results",
|
||||
... per_device_train_batch_size=16,
|
||||
... evaluation_strategy="steps",
|
||||
... num_train_epochs=4,
|
||||
... fp16=True,
|
||||
... save_steps=100,
|
||||
... eval_steps=100,
|
||||
... logging_steps=10,
|
||||
... learning_rate=2e-4,
|
||||
... save_total_limit=2,
|
||||
... remove_unused_columns=False,
|
||||
... )
|
||||
|
||||
>>> trainer = Trainer(
|
||||
... model=model,
|
||||
... args=training_args,
|
||||
... data_collator=data_collator,
|
||||
... train_dataset=food["train"],
|
||||
... eval_dataset=food["test"],
|
||||
... tokenizer=feature_extractor,
|
||||
... )
|
||||
|
||||
>>> trainer.train()
|
||||
```
|
||||
</pt>
|
||||
</frameworkcontent>
|
||||
|
||||
<Tip>
|
||||
|
||||
For a more in-depth example of how to fine-tune a model for image classification, take a look at the corresponding [PyTorch notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/image_classification.ipynb).
|
||||
|
||||
</Tip>
|
||||
418
docs/source/en/tasks/language_modeling.mdx
Normal file
418
docs/source/en/tasks/language_modeling.mdx
Normal file
@@ -0,0 +1,418 @@
|
||||
<!--Copyright 2022 The HuggingFace Team. All rights reserved.
|
||||
|
||||
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
|
||||
the License. You may obtain a copy of the License at
|
||||
|
||||
http://www.apache.org/licenses/LICENSE-2.0
|
||||
|
||||
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
|
||||
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
|
||||
specific language governing permissions and limitations under the License.
|
||||
-->
|
||||
|
||||
# Language modeling
|
||||
|
||||
Language modeling predicts words in a sentence. There are two forms of language modeling.
|
||||
|
||||
<Youtube id="Vpjb1lu0MDk"/>
|
||||
|
||||
Causal language modeling predicts the next token in a sequence of tokens, and the model can only attend to tokens on the left.
|
||||
|
||||
<Youtube id="mqElG5QJWUg"/>
|
||||
|
||||
Masked language modeling predicts a masked token in a sequence, and the model can attend to tokens bidirectionally.
|
||||
|
||||
This guide will show you how to fine-tune [DistilGPT2](https://huggingface.co/distilgpt2) for causal language modeling and [DistilRoBERTa](https://huggingface.co/distilroberta-base) for masked language modeling on the [r/askscience](https://www.reddit.com/r/askscience/) subset of the [ELI5](https://huggingface.co/datasets/eli5) dataset.
|
||||
|
||||
<Tip>
|
||||
|
||||
You can fine-tune other architectures for language modeling such as [GPT-Neo](https://huggingface.co/EleutherAI/gpt-neo-125M), [GPT-J](https://huggingface.co/EleutherAI/gpt-j-6B), and [BERT](https://huggingface.co/bert-base-uncased), following the same steps presented in this guide!
|
||||
|
||||
See the text generation [task page](https://huggingface.co/tasks/text-generation) and fill mask [task page](https://huggingface.co/tasks/fill-mask) for more information about their associated models, datasets, and metrics.
|
||||
|
||||
</Tip>
|
||||
|
||||
## Load ELI5 dataset
|
||||
|
||||
Load only the first 5000 rows of the ELI5 dataset from the 🤗 Datasets library since it is pretty large:
|
||||
|
||||
```py
|
||||
>>> from datasets import load_dataset
|
||||
|
||||
>>> eli5 = load_dataset("eli5", split="train_asks[:5000]")
|
||||
```
|
||||
|
||||
Split this dataset into a train and test set:
|
||||
|
||||
```py
|
||||
eli5 = eli5.train_test_split(test_size=0.2)
|
||||
```
|
||||
|
||||
Then take a look at an example:
|
||||
|
||||
```py
|
||||
>>> eli5["train"][0]
|
||||
{'answers': {'a_id': ['c3d1aib', 'c3d4lya'],
|
||||
'score': [6, 3],
|
||||
'text': ["The velocity needed to remain in orbit is equal to the square root of Newton's constant times the mass of earth divided by the distance from the center of the earth. I don't know the altitude of that specific mission, but they're usually around 300 km. That means he's going 7-8 km/s.\n\nIn space there are no other forces acting on either the shuttle or the guy, so they stay in the same position relative to each other. If he were to become unable to return to the ship, he would presumably run out of oxygen, or slowly fall into the atmosphere and burn up.",
|
||||
"Hope you don't mind me asking another question, but why aren't there any stars visible in this photo?"]},
|
||||
'answers_urls': {'url': []},
|
||||
'document': '',
|
||||
'q_id': 'nyxfp',
|
||||
'selftext': '_URL_0_\n\nThis was on the front page earlier and I have a few questions about it. Is it possible to calculate how fast the astronaut would be orbiting the earth? Also how does he stay close to the shuttle so that he can return safely, i.e is he orbiting at the same speed and can therefore stay next to it? And finally if his propulsion system failed, would he eventually re-enter the atmosphere and presumably die?',
|
||||
'selftext_urls': {'url': ['http://apod.nasa.gov/apod/image/1201/freeflyer_nasa_3000.jpg']},
|
||||
'subreddit': 'askscience',
|
||||
'title': 'Few questions about this space walk photograph.',
|
||||
'title_urls': {'url': []}}
|
||||
```
|
||||
|
||||
Notice `text` is a subfield nested inside the `answers` dictionary. When you preprocess the dataset, you will need to extract the `text` subfield into a separate column.
|
||||
|
||||
## Preprocess
|
||||
|
||||
<Youtube id="ma1TrR7gE7I"/>
|
||||
|
||||
For causal language modeling, load the DistilGPT2 tokenizer to process the `text` subfield:
|
||||
|
||||
```py
|
||||
>>> from transformers import AutoTokenizer
|
||||
|
||||
>>> tokenizer = AutoTokenizer.from_pretrained("distilgpt2")
|
||||
```
|
||||
|
||||
<Youtube id="8PmhEIXhBvI"/>
|
||||
|
||||
For masked language modeling, load the DistilRoBERTa tokenizer instead:
|
||||
|
||||
```py
|
||||
>>> from transformers import AutoTokenizer
|
||||
|
||||
>>> tokenizer = AutoTokenizer.from_pretrained("distilroberta-base")
|
||||
```
|
||||
|
||||
Extract the `text` subfield from its nested structure with the [`flatten`](https://huggingface.co/docs/datasets/process.html#flatten) method:
|
||||
|
||||
```py
|
||||
>>> eli5 = eli5.flatten()
|
||||
>>> eli5["train"][0]
|
||||
{'answers.a_id': ['c3d1aib', 'c3d4lya'],
|
||||
'answers.score': [6, 3],
|
||||
'answers.text': ["The velocity needed to remain in orbit is equal to the square root of Newton's constant times the mass of earth divided by the distance from the center of the earth. I don't know the altitude of that specific mission, but they're usually around 300 km. That means he's going 7-8 km/s.\n\nIn space there are no other forces acting on either the shuttle or the guy, so they stay in the same position relative to each other. If he were to become unable to return to the ship, he would presumably run out of oxygen, or slowly fall into the atmosphere and burn up.",
|
||||
"Hope you don't mind me asking another question, but why aren't there any stars visible in this photo?"],
|
||||
'answers_urls.url': [],
|
||||
'document': '',
|
||||
'q_id': 'nyxfp',
|
||||
'selftext': '_URL_0_\n\nThis was on the front page earlier and I have a few questions about it. Is it possible to calculate how fast the astronaut would be orbiting the earth? Also how does he stay close to the shuttle so that he can return safely, i.e is he orbiting at the same speed and can therefore stay next to it? And finally if his propulsion system failed, would he eventually re-enter the atmosphere and presumably die?',
|
||||
'selftext_urls.url': ['http://apod.nasa.gov/apod/image/1201/freeflyer_nasa_3000.jpg'],
|
||||
'subreddit': 'askscience',
|
||||
'title': 'Few questions about this space walk photograph.',
|
||||
'title_urls.url': []}
|
||||
```
|
||||
|
||||
Each subfield is now a separate column as indicated by the `answers` prefix. Notice that `answers.text` is a list. Instead of tokenizing each sentence separately, convert the list to a string to jointly tokenize them.
|
||||
|
||||
Here is how you can create a preprocessing function to convert the list to a string and truncate sequences to be no longer than DistilGPT2's maximum input length:
|
||||
|
||||
```py
|
||||
>>> def preprocess_function(examples):
|
||||
... return tokenizer([" ".join(x) for x in examples["answers.text"]], truncation=True)
|
||||
```
|
||||
|
||||
Use 🤗 Datasets [`map`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.map) function to apply the preprocessing function over the entire dataset. You can speed up the `map` function by setting `batched=True` to process multiple elements of the dataset at once and increasing the number of processes with `num_proc`. Remove the columns you don't need:
|
||||
|
||||
```py
|
||||
>>> tokenized_eli5 = eli5.map(
|
||||
... preprocess_function,
|
||||
... batched=True,
|
||||
... num_proc=4,
|
||||
... remove_columns=eli5["train"].column_names,
|
||||
... )
|
||||
```
|
||||
|
||||
Now you need a second preprocessing function to capture text truncated from any lengthy examples to prevent loss of information. This preprocessing function should:
|
||||
|
||||
- Concatenate all the text.
|
||||
- Split the concatenated text into smaller chunks defined by `block_size`.
|
||||
|
||||
```py
|
||||
>>> block_size = 128
|
||||
|
||||
|
||||
>>> def group_texts(examples):
|
||||
... concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
|
||||
... total_length = len(concatenated_examples[list(examples.keys())[0]])
|
||||
... result = {
|
||||
... k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
|
||||
... for k, t in concatenated_examples.items()
|
||||
... }
|
||||
... result["labels"] = result["input_ids"].copy()
|
||||
... return result
|
||||
```
|
||||
|
||||
Apply the `group_texts` function over the entire dataset:
|
||||
|
||||
```py
|
||||
>>> lm_dataset = tokenized_eli5.map(group_texts, batched=True, num_proc=4)
|
||||
```
|
||||
|
||||
For causal language modeling, use [`DataCollatorForLanguageModeling`] to create a batch of examples. It will also *dynamically pad* your text to the length of the longest element in its batch, so they are a uniform length. While it is possible to pad your text in the `tokenizer` function by setting `padding=True`, dynamic padding is more efficient.
|
||||
|
||||
<frameworkcontent>
|
||||
<pt>
|
||||
You can use the end of sequence token as the padding token, and set `mlm=False`. This will use the inputs as labels shifted to the right by one element:
|
||||
|
||||
```py
|
||||
>>> from transformers import DataCollatorForLanguageModeling
|
||||
|
||||
>>> tokenizer.pad_token = tokenizer.eos_token
|
||||
>>> data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
|
||||
```
|
||||
|
||||
For masked language modeling, use the same [`DataCollatorForLanguageModeling`] except you should specify `mlm_probability` to randomly mask tokens each time you iterate over the data.
|
||||
|
||||
```py
|
||||
>>> from transformers import DataCollatorForLanguageModeling
|
||||
|
||||
>>> tokenizer.pad_token = tokenizer.eos_token
|
||||
>>> data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)
|
||||
```
|
||||
</pt>
|
||||
<tf>
|
||||
You can use the end of sequence token as the padding token, and set `mlm=False`. This will use the inputs as labels shifted to the right by one element:
|
||||
|
||||
```py
|
||||
>>> from transformers import DataCollatorForLanguageModeling
|
||||
|
||||
>>> data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False, return_tensors="tf")
|
||||
```
|
||||
|
||||
For masked language modeling, use the same [`DataCollatorForLanguageModeling`] except you should specify `mlm_probability` to randomly mask tokens each time you iterate over the data.
|
||||
|
||||
```py
|
||||
>>> from transformers import DataCollatorForLanguageModeling
|
||||
|
||||
>>> data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False, return_tensors="tf")
|
||||
```
|
||||
</tf>
|
||||
</frameworkcontent>
|
||||
|
||||
## Causal language modeling
|
||||
|
||||
Causal language modeling is frequently used for text generation. This section shows you how to fine-tune [DistilGPT2](https://huggingface.co/distilgpt2) to generate new text.
|
||||
|
||||
### Train
|
||||
|
||||
<frameworkcontent>
|
||||
<pt>
|
||||
Load DistilGPT2 with [`AutoModelForCausalLM`]:
|
||||
|
||||
```py
|
||||
>>> from transformers import AutoModelForCausalLM, TrainingArguments, Trainer
|
||||
|
||||
>>> model = AutoModelForCausalLM.from_pretrained("distilgpt2")
|
||||
```
|
||||
|
||||
<Tip>
|
||||
|
||||
If you aren't familiar with fine-tuning a model with the [`Trainer`], take a look at the basic tutorial [here](../training#finetune-with-trainer)!
|
||||
|
||||
</Tip>
|
||||
|
||||
At this point, only three steps remain:
|
||||
|
||||
1. Define your training hyperparameters in [`TrainingArguments`].
|
||||
2. Pass the training arguments to [`Trainer`] along with the model, datasets, and data collator.
|
||||
3. Call [`~Trainer.train`] to fine-tune your model.
|
||||
|
||||
```py
|
||||
>>> training_args = TrainingArguments(
|
||||
... output_dir="./results",
|
||||
... evaluation_strategy="epoch",
|
||||
... learning_rate=2e-5,
|
||||
... weight_decay=0.01,
|
||||
... )
|
||||
|
||||
>>> trainer = Trainer(
|
||||
... model=model,
|
||||
... args=training_args,
|
||||
... train_dataset=lm_dataset["train"],
|
||||
... eval_dataset=lm_dataset["test"],
|
||||
... data_collator=data_collator,
|
||||
... )
|
||||
|
||||
>>> trainer.train()
|
||||
```
|
||||
</pt>
|
||||
<tf>
|
||||
To fine-tune a model in TensorFlow, start by converting your datasets to the `tf.data.Dataset` format with [`to_tf_dataset`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.to_tf_dataset). Specify inputs and labels in `columns`, whether to shuffle the dataset order, batch size, and the data collator:
|
||||
|
||||
```py
|
||||
>>> tf_train_set = lm_dataset["train"].to_tf_dataset(
|
||||
... columns=["attention_mask", "input_ids", "labels"],
|
||||
... dummy_labels=True,
|
||||
... shuffle=True,
|
||||
... batch_size=16,
|
||||
... collate_fn=data_collator,
|
||||
... )
|
||||
|
||||
>>> tf_test_set = lm_dataset["test"].to_tf_dataset(
|
||||
... columns=["attention_mask", "input_ids", "labels"],
|
||||
... dummy_labels=True,
|
||||
... shuffle=False,
|
||||
... batch_size=16,
|
||||
... collate_fn=data_collator,
|
||||
... )
|
||||
```
|
||||
|
||||
<Tip>
|
||||
|
||||
If you aren't familiar with fine-tuning a model with Keras, take a look at the basic tutorial [here](training#finetune-with-keras)!
|
||||
|
||||
</Tip>
|
||||
|
||||
Set up an optimizer function, learning rate, and some training hyperparameters:
|
||||
|
||||
```py
|
||||
>>> from transformers import create_optimizer, AdamWeightDecay
|
||||
|
||||
>>> optimizer = AdamWeightDecay(learning_rate=2e-5, weight_decay_rate=0.01)
|
||||
```
|
||||
|
||||
Load DistilGPT2 with [`TFAutoModelForCausalLM`]:
|
||||
|
||||
```py
|
||||
>>> from transformers import TFAutoModelForCausalLM
|
||||
|
||||
>>> model = TFAutoModelForCausalLM.from_pretrained("distilgpt2")
|
||||
```
|
||||
|
||||
Configure the model for training with [`compile`](https://keras.io/api/models/model_training_apis/#compile-method):
|
||||
|
||||
```py
|
||||
>>> import tensorflow as tf
|
||||
|
||||
>>> model.compile(optimizer=optimizer)
|
||||
```
|
||||
|
||||
Call [`fit`](https://keras.io/api/models/model_training_apis/#fit-method) to fine-tune the model:
|
||||
|
||||
```py
|
||||
>>> model.fit(x=tf_train_set, validation_data=tf_test_set, epochs=3)
|
||||
```
|
||||
</tf>
|
||||
</frameworkcontent>
|
||||
|
||||
## Masked language modeling
|
||||
|
||||
Masked language modeling is also known as a fill-mask task because it predicts a masked token in a sequence. Models for masked language modeling require a good contextual understanding of an entire sequence instead of only the left context. This section shows you how to fine-tune [DistilRoBERTa](https://huggingface.co/distilroberta-base) to predict a masked word.
|
||||
|
||||
### Train
|
||||
|
||||
<frameworkcontent>
|
||||
<pt>
|
||||
Load DistilRoBERTa with [`AutoModelForMaskedlM`]:
|
||||
|
||||
```py
|
||||
>>> from transformers import AutoModelForMaskedLM
|
||||
|
||||
>>> model = AutoModelForMaskedLM.from_pretrained("distilroberta-base")
|
||||
```
|
||||
|
||||
<Tip>
|
||||
|
||||
If you aren't familiar with fine-tuning a model with the [`Trainer`], take a look at the basic tutorial [here](../training#finetune-with-trainer)!
|
||||
|
||||
</Tip>
|
||||
|
||||
At this point, only three steps remain:
|
||||
|
||||
1. Define your training hyperparameters in [`TrainingArguments`].
|
||||
2. Pass the training arguments to [`Trainer`] along with the model, datasets, and data collator.
|
||||
3. Call [`~Trainer.train`] to fine-tune your model.
|
||||
|
||||
```py
|
||||
>>> training_args = TrainingArguments(
|
||||
... output_dir="./results",
|
||||
... evaluation_strategy="epoch",
|
||||
... learning_rate=2e-5,
|
||||
... num_train_epochs=3,
|
||||
... weight_decay=0.01,
|
||||
... )
|
||||
|
||||
>>> trainer = Trainer(
|
||||
... model=model,
|
||||
... args=training_args,
|
||||
... train_dataset=lm_dataset["train"],
|
||||
... eval_dataset=lm_dataset["test"],
|
||||
... data_collator=data_collator,
|
||||
... )
|
||||
|
||||
>>> trainer.train()
|
||||
```
|
||||
</pt>
|
||||
<tf>
|
||||
To fine-tune a model in TensorFlow, start by converting your datasets to the `tf.data.Dataset` format with [`to_tf_dataset`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.to_tf_dataset). Specify inputs and labels in `columns`, whether to shuffle the dataset order, batch size, and the data collator:
|
||||
|
||||
```py
|
||||
>>> tf_train_set = lm_dataset["train"].to_tf_dataset(
|
||||
... columns=["attention_mask", "input_ids", "labels"],
|
||||
... dummy_labels=True,
|
||||
... shuffle=True,
|
||||
... batch_size=16,
|
||||
... collate_fn=data_collator,
|
||||
... )
|
||||
|
||||
>>> tf_test_set = lm_dataset["test"].to_tf_dataset(
|
||||
... columns=["attention_mask", "input_ids", "labels"],
|
||||
... dummy_labels=True,
|
||||
... shuffle=False,
|
||||
... batch_size=16,
|
||||
... collate_fn=data_collator,
|
||||
... )
|
||||
```
|
||||
|
||||
<Tip>
|
||||
|
||||
If you aren't familiar with fine-tuning a model with Keras, take a look at the basic tutorial [here](training#finetune-with-keras)!
|
||||
|
||||
</Tip>
|
||||
|
||||
Set up an optimizer function, learning rate, and some training hyperparameters:
|
||||
|
||||
```py
|
||||
>>> from transformers import create_optimizer, AdamWeightDecay
|
||||
|
||||
>>> optimizer = AdamWeightDecay(learning_rate=2e-5, weight_decay_rate=0.01)
|
||||
```
|
||||
|
||||
Load DistilRoBERTa with [`TFAutoModelForMaskedLM`]:
|
||||
|
||||
```py
|
||||
>>> from transformers import TFAutoModelForMaskedLM
|
||||
|
||||
>>> model = TFAutoModelForCausalLM.from_pretrained("distilroberta-base")
|
||||
```
|
||||
|
||||
Configure the model for training with [`compile`](https://keras.io/api/models/model_training_apis/#compile-method):
|
||||
|
||||
```py
|
||||
>>> import tensorflow as tf
|
||||
|
||||
>>> model.compile(optimizer=optimizer)
|
||||
```
|
||||
|
||||
Call [`fit`](https://keras.io/api/models/model_training_apis/#fit-method) to fine-tune the model:
|
||||
|
||||
```py
|
||||
>>> model.fit(x=tf_train_set, validation_data=tf_test_set, epochs=3)
|
||||
```
|
||||
</tf>
|
||||
</frameworkcontent>
|
||||
|
||||
<Tip>
|
||||
|
||||
For a more in-depth example of how to fine-tune a model for causal language modeling, take a look at the corresponding
|
||||
[PyTorch notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/language_modeling.ipynb)
|
||||
or [TensorFlow notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/language_modeling-tf.ipynb).
|
||||
|
||||
</Tip>
|
||||
288
docs/source/en/tasks/multiple_choice.mdx
Normal file
288
docs/source/en/tasks/multiple_choice.mdx
Normal file
@@ -0,0 +1,288 @@
|
||||
<!--Copyright 2022 The HuggingFace Team. All rights reserved.
|
||||
|
||||
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
|
||||
the License. You may obtain a copy of the License at
|
||||
|
||||
http://www.apache.org/licenses/LICENSE-2.0
|
||||
|
||||
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
|
||||
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
|
||||
specific language governing permissions and limitations under the License.
|
||||
-->
|
||||
|
||||
# Multiple choice
|
||||
|
||||
A multiple choice task is similar to question answering, except several candidate answers are provided along with a context. The model is trained to select the correct answer from multiple inputs given a context.
|
||||
|
||||
This guide will show you how to fine-tune [BERT](https://huggingface.co/bert-base-uncased) on the `regular` configuration of the [SWAG](https://huggingface.co/datasets/swag) dataset to select the best answer given multiple options and some context.
|
||||
|
||||
## Load SWAG dataset
|
||||
|
||||
Load the SWAG dataset from the 🤗 Datasets library:
|
||||
|
||||
```py
|
||||
>>> from datasets import load_dataset
|
||||
|
||||
>>> swag = load_dataset("swag", "regular")
|
||||
```
|
||||
|
||||
Then take a look at an example:
|
||||
|
||||
```py
|
||||
>>> swag["train"][0]
|
||||
{'ending0': 'passes by walking down the street playing their instruments.',
|
||||
'ending1': 'has heard approaching them.',
|
||||
'ending2': "arrives and they're outside dancing and asleep.",
|
||||
'ending3': 'turns the lead singer watches the performance.',
|
||||
'fold-ind': '3416',
|
||||
'gold-source': 'gold',
|
||||
'label': 0,
|
||||
'sent1': 'Members of the procession walk down the street holding small horn brass instruments.',
|
||||
'sent2': 'A drum line',
|
||||
'startphrase': 'Members of the procession walk down the street holding small horn brass instruments. A drum line',
|
||||
'video-id': 'anetv_jkn6uvmqwh4'}
|
||||
```
|
||||
|
||||
The `sent1` and `sent2` fields show how a sentence begins, and each `ending` field shows how a sentence could end. Given the sentence beginning, the model must pick the correct sentence ending as indicated by the `label` field.
|
||||
|
||||
## Preprocess
|
||||
|
||||
Load the BERT tokenizer to process the start of each sentence and the four possible endings:
|
||||
|
||||
```py
|
||||
>>> from transformers import AutoTokenizer
|
||||
|
||||
>>> tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
|
||||
```
|
||||
|
||||
The preprocessing function needs to do:
|
||||
|
||||
1. Make four copies of the `sent1` field so you can combine each of them with `sent2` to recreate how a sentence starts.
|
||||
2. Combine `sent2` with each of the four possible sentence endings.
|
||||
3. Flatten these two lists so you can tokenize them, and then unflatten them afterward so each example has a corresponding `input_ids`, `attention_mask`, and `labels` field.
|
||||
|
||||
```py
|
||||
>>> ending_names = ["ending0", "ending1", "ending2", "ending3"]
|
||||
|
||||
|
||||
>>> def preprocess_function(examples):
|
||||
... first_sentences = [[context] * 4 for context in examples["sent1"]]
|
||||
... question_headers = examples["sent2"]
|
||||
... second_sentences = [
|
||||
... [f"{header} {examples[end][i]}" for end in ending_names] for i, header in enumerate(question_headers)
|
||||
... ]
|
||||
|
||||
... first_sentences = sum(first_sentences, [])
|
||||
... second_sentences = sum(second_sentences, [])
|
||||
|
||||
... tokenized_examples = tokenizer(first_sentences, second_sentences, truncation=True)
|
||||
... return {k: [v[i : i + 4] for i in range(0, len(v), 4)] for k, v in tokenized_examples.items()}
|
||||
```
|
||||
|
||||
Use 🤗 Datasets [`map`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.map) function to apply the preprocessing function over the entire dataset. You can speed up the `map` function by setting `batched=True` to process multiple elements of the dataset at once:
|
||||
|
||||
```py
|
||||
tokenized_swag = swag.map(preprocess_function, batched=True)
|
||||
```
|
||||
|
||||
🤗 Transformers doesn't have a data collator for multiple choice, so you will need to create one. You can adapt the [`DataCollatorWithPadding`] to create a batch of examples for multiple choice. It will also *dynamically pad* your text and labels to the length of the longest element in its batch, so they are a uniform length. While it is possible to pad your text in the `tokenizer` function by setting `padding=True`, dynamic padding is more efficient.
|
||||
|
||||
`DataCollatorForMultipleChoice` will flatten all the model inputs, apply padding, and then unflatten the results:
|
||||
|
||||
<frameworkcontent>
|
||||
<pt>
|
||||
```py
|
||||
>>> from dataclasses import dataclass
|
||||
>>> from transformers.tokenization_utils_base import PreTrainedTokenizerBase, PaddingStrategy
|
||||
>>> from typing import Optional, Union
|
||||
>>> import torch
|
||||
|
||||
|
||||
>>> @dataclass
|
||||
... class DataCollatorForMultipleChoice:
|
||||
... """
|
||||
... Data collator that will dynamically pad the inputs for multiple choice received.
|
||||
... """
|
||||
|
||||
... tokenizer: PreTrainedTokenizerBase
|
||||
... padding: Union[bool, str, PaddingStrategy] = True
|
||||
... max_length: Optional[int] = None
|
||||
... pad_to_multiple_of: Optional[int] = None
|
||||
|
||||
... def __call__(self, features):
|
||||
... label_name = "label" if "label" in features[0].keys() else "labels"
|
||||
... labels = [feature.pop(label_name) for feature in features]
|
||||
... batch_size = len(features)
|
||||
... num_choices = len(features[0]["input_ids"])
|
||||
... flattened_features = [
|
||||
... [{k: v[i] for k, v in feature.items()} for i in range(num_choices)] for feature in features
|
||||
... ]
|
||||
... flattened_features = sum(flattened_features, [])
|
||||
|
||||
... batch = self.tokenizer.pad(
|
||||
... flattened_features,
|
||||
... padding=self.padding,
|
||||
... max_length=self.max_length,
|
||||
... pad_to_multiple_of=self.pad_to_multiple_of,
|
||||
... return_tensors="pt",
|
||||
... )
|
||||
|
||||
... batch = {k: v.view(batch_size, num_choices, -1) for k, v in batch.items()}
|
||||
... batch["labels"] = torch.tensor(labels, dtype=torch.int64)
|
||||
... return batch
|
||||
```
|
||||
</pt>
|
||||
<tf>
|
||||
```py
|
||||
>>> from dataclasses import dataclass
|
||||
>>> from transformers.tokenization_utils_base import PreTrainedTokenizerBase, PaddingStrategy
|
||||
>>> from typing import Optional, Union
|
||||
>>> import tensorflow as tf
|
||||
|
||||
|
||||
>>> @dataclass
|
||||
... class DataCollatorForMultipleChoice:
|
||||
... """
|
||||
... Data collator that will dynamically pad the inputs for multiple choice received.
|
||||
... """
|
||||
|
||||
... tokenizer: PreTrainedTokenizerBase
|
||||
... padding: Union[bool, str, PaddingStrategy] = True
|
||||
... max_length: Optional[int] = None
|
||||
... pad_to_multiple_of: Optional[int] = None
|
||||
|
||||
... def __call__(self, features):
|
||||
... label_name = "label" if "label" in features[0].keys() else "labels"
|
||||
... labels = [feature.pop(label_name) for feature in features]
|
||||
... batch_size = len(features)
|
||||
... num_choices = len(features[0]["input_ids"])
|
||||
... flattened_features = [
|
||||
... [{k: v[i] for k, v in feature.items()} for i in range(num_choices)] for feature in features
|
||||
... ]
|
||||
... flattened_features = sum(flattened_features, [])
|
||||
|
||||
... batch = self.tokenizer.pad(
|
||||
... flattened_features,
|
||||
... padding=self.padding,
|
||||
... max_length=self.max_length,
|
||||
... pad_to_multiple_of=self.pad_to_multiple_of,
|
||||
... return_tensors="tf",
|
||||
... )
|
||||
|
||||
... batch = {k: tf.reshape(v, (batch_size, num_choices, -1)) for k, v in batch.items()}
|
||||
... batch["labels"] = tf.convert_to_tensor(labels, dtype=tf.int64)
|
||||
... return batch
|
||||
```
|
||||
</tf>
|
||||
</frameworkcontent>
|
||||
|
||||
## Train
|
||||
|
||||
<frameworkcontent>
|
||||
<pt>
|
||||
Load BERT with [`AutoModelForMultipleChoice`]:
|
||||
|
||||
```py
|
||||
>>> from transformers import AutoModelForMultipleChoice, TrainingArguments, Trainer
|
||||
|
||||
>>> model = AutoModelForMultipleChoice.from_pretrained("bert-base-uncased")
|
||||
```
|
||||
|
||||
<Tip>
|
||||
|
||||
If you aren't familiar with fine-tuning a model with Trainer, take a look at the basic tutorial [here](../training#finetune-with-trainer)!
|
||||
|
||||
</Tip>
|
||||
|
||||
At this point, only three steps remain:
|
||||
|
||||
1. Define your training hyperparameters in [`TrainingArguments`].
|
||||
2. Pass the training arguments to [`Trainer`] along with the model, dataset, tokenizer, and data collator.
|
||||
3. Call [`~Trainer.train`] to fine-tune your model.
|
||||
|
||||
```py
|
||||
>>> training_args = TrainingArguments(
|
||||
... output_dir="./results",
|
||||
... evaluation_strategy="epoch",
|
||||
... learning_rate=5e-5,
|
||||
... per_device_train_batch_size=16,
|
||||
... per_device_eval_batch_size=16,
|
||||
... num_train_epochs=3,
|
||||
... weight_decay=0.01,
|
||||
... )
|
||||
|
||||
>>> trainer = Trainer(
|
||||
... model=model,
|
||||
... args=training_args,
|
||||
... train_dataset=tokenized_swag["train"],
|
||||
... eval_dataset=tokenized_swag["validation"],
|
||||
... tokenizer=tokenizer,
|
||||
... data_collator=DataCollatorForMultipleChoice(tokenizer=tokenizer),
|
||||
... )
|
||||
|
||||
>>> trainer.train()
|
||||
```
|
||||
</pt>
|
||||
<tf>
|
||||
To fine-tune a model in TensorFlow, start by converting your datasets to the `tf.data.Dataset` format with [`to_tf_dataset`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.to_tf_dataset). Specify inputs in `columns`, targets in `label_cols`, whether to shuffle the dataset order, batch size, and the data collator:
|
||||
|
||||
```py
|
||||
>>> data_collator = DataCollatorForMultipleChoice(tokenizer=tokenizer)
|
||||
>>> tf_train_set = tokenized_swag["train"].to_tf_dataset(
|
||||
... columns=["attention_mask", "input_ids"],
|
||||
... label_cols=["labels"],
|
||||
... shuffle=True,
|
||||
... batch_size=batch_size,
|
||||
... collate_fn=data_collator,
|
||||
... )
|
||||
|
||||
>>> tf_validation_set = tokenized_swag["validation"].to_tf_dataset(
|
||||
... columns=["attention_mask", "input_ids"],
|
||||
... label_cols=["labels"],
|
||||
... shuffle=False,
|
||||
... batch_size=batch_size,
|
||||
... collate_fn=data_collator,
|
||||
... )
|
||||
```
|
||||
|
||||
<Tip>
|
||||
|
||||
If you aren't familiar with fine-tuning a model with Keras, take a look at the basic tutorial [here](training#finetune-with-keras)!
|
||||
|
||||
</Tip>
|
||||
|
||||
Set up an optimizer function, learning rate schedule, and some training hyperparameters:
|
||||
|
||||
```py
|
||||
>>> from transformers import create_optimizer
|
||||
|
||||
>>> batch_size = 16
|
||||
>>> num_train_epochs = 2
|
||||
>>> total_train_steps = (len(tokenized_swag["train"]) // batch_size) * num_train_epochs
|
||||
>>> optimizer, schedule = create_optimizer(init_lr=5e-5, num_warmup_steps=0, num_train_steps=total_train_steps)
|
||||
```
|
||||
|
||||
Load BERT with [`TFAutoModelForMultipleChoice`]:
|
||||
|
||||
```py
|
||||
>>> from transformers import TFAutoModelForMultipleChoice
|
||||
|
||||
>>> model = TFAutoModelForMultipleChoice.from_pretrained("bert-base-uncased")
|
||||
```
|
||||
|
||||
Configure the model for training with [`compile`](https://keras.io/api/models/model_training_apis/#compile-method):
|
||||
|
||||
```py
|
||||
>>> model.compile(
|
||||
... optimizer=optimizer,
|
||||
... loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
|
||||
... )
|
||||
```
|
||||
|
||||
Call [`fit`](https://keras.io/api/models/model_training_apis/#fit-method) to fine-tune the model:
|
||||
|
||||
```py
|
||||
>>> model.fit(x=tf_train_set, validation_data=tf_validation_set, epochs=2)
|
||||
```
|
||||
</tf>
|
||||
</frameworkcontent>
|
||||
273
docs/source/en/tasks/question_answering.mdx
Normal file
273
docs/source/en/tasks/question_answering.mdx
Normal file
@@ -0,0 +1,273 @@
|
||||
<!--Copyright 2022 The HuggingFace Team. All rights reserved.
|
||||
|
||||
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
|
||||
the License. You may obtain a copy of the License at
|
||||
|
||||
http://www.apache.org/licenses/LICENSE-2.0
|
||||
|
||||
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
|
||||
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
|
||||
specific language governing permissions and limitations under the License.
|
||||
-->
|
||||
|
||||
# Question answering
|
||||
|
||||
<Youtube id="ajPx5LwJD-I"/>
|
||||
|
||||
Question answering tasks return an answer given a question. There are two common forms of question answering:
|
||||
|
||||
- Extractive: extract the answer from the given context.
|
||||
- Abstractive: generate an answer from the context that correctly answers the question.
|
||||
|
||||
This guide will show you how to fine-tune [DistilBERT](https://huggingface.co/distilbert-base-uncased) on the [SQuAD](https://huggingface.co/datasets/squad) dataset for extractive question answering.
|
||||
|
||||
<Tip>
|
||||
|
||||
See the question answering [task page](https://huggingface.co/tasks/question-answering) for more information about other forms of question answering and their associated models, datasets, and metrics.
|
||||
|
||||
</Tip>
|
||||
|
||||
## Load SQuAD dataset
|
||||
|
||||
Load the SQuAD dataset from the 🤗 Datasets library:
|
||||
|
||||
```py
|
||||
>>> from datasets import load_dataset
|
||||
|
||||
>>> squad = load_dataset("squad")
|
||||
```
|
||||
|
||||
Then take a look at an example:
|
||||
|
||||
```py
|
||||
>>> squad["train"][0]
|
||||
{'answers': {'answer_start': [515], 'text': ['Saint Bernadette Soubirous']},
|
||||
'context': 'Architecturally, the school has a Catholic character. Atop the Main Building\'s gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.',
|
||||
'id': '5733be284776f41900661182',
|
||||
'question': 'To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?',
|
||||
'title': 'University_of_Notre_Dame'
|
||||
}
|
||||
```
|
||||
|
||||
The `answers` field is a dictionary containing the starting position of the answer and the `text` of the answer.
|
||||
|
||||
## Preprocess
|
||||
|
||||
<Youtube id="qgaM0weJHpA"/>
|
||||
|
||||
Load the DistilBERT tokenizer to process the `question` and `context` fields:
|
||||
|
||||
```py
|
||||
>>> from transformers import AutoTokenizer
|
||||
|
||||
>>> tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
|
||||
```
|
||||
|
||||
There are a few preprocessing steps particular to question answering that you should be aware of:
|
||||
|
||||
1. Some examples in a dataset may have a very long `context` that exceeds the maximum input length of the model. Truncate only the `context` by setting `truncation="only_second"`.
|
||||
2. Next, map the start and end positions of the answer to the original `context` by setting
|
||||
`return_offset_mapping=True`.
|
||||
3. With the mapping in hand, you can find the start and end tokens of the answer. Use the [`sequence_ids`](https://huggingface.co/docs/tokenizers/python/latest/api/reference.html#tokenizers.Encoding.sequence_ids) method to
|
||||
find which part of the offset corresponds to the `question` and which corresponds to the `context`.
|
||||
|
||||
Here is how you can create a function to truncate and map the start and end tokens of the answer to the `context`:
|
||||
|
||||
```py
|
||||
>>> def preprocess_function(examples):
|
||||
... questions = [q.strip() for q in examples["question"]]
|
||||
... inputs = tokenizer(
|
||||
... questions,
|
||||
... examples["context"],
|
||||
... max_length=384,
|
||||
... truncation="only_second",
|
||||
... return_offsets_mapping=True,
|
||||
... padding="max_length",
|
||||
... )
|
||||
|
||||
... offset_mapping = inputs.pop("offset_mapping")
|
||||
... answers = examples["answers"]
|
||||
... start_positions = []
|
||||
... end_positions = []
|
||||
|
||||
... for i, offset in enumerate(offset_mapping):
|
||||
... answer = answers[i]
|
||||
... start_char = answer["answer_start"][0]
|
||||
... end_char = answer["answer_start"][0] + len(answer["text"][0])
|
||||
... sequence_ids = inputs.sequence_ids(i)
|
||||
|
||||
... # Find the start and end of the context
|
||||
... idx = 0
|
||||
... while sequence_ids[idx] != 1:
|
||||
... idx += 1
|
||||
... context_start = idx
|
||||
... while sequence_ids[idx] == 1:
|
||||
... idx += 1
|
||||
... context_end = idx - 1
|
||||
|
||||
... # If the answer is not fully inside the context, label it (0, 0)
|
||||
... if offset[context_start][0] > end_char or offset[context_end][1] < start_char:
|
||||
... start_positions.append(0)
|
||||
... end_positions.append(0)
|
||||
... else:
|
||||
... # Otherwise it's the start and end token positions
|
||||
... idx = context_start
|
||||
... while idx <= context_end and offset[idx][0] <= start_char:
|
||||
... idx += 1
|
||||
... start_positions.append(idx - 1)
|
||||
|
||||
... idx = context_end
|
||||
... while idx >= context_start and offset[idx][1] >= end_char:
|
||||
... idx -= 1
|
||||
... end_positions.append(idx + 1)
|
||||
|
||||
... inputs["start_positions"] = start_positions
|
||||
... inputs["end_positions"] = end_positions
|
||||
... return inputs
|
||||
```
|
||||
|
||||
Use 🤗 Datasets [`map`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.map) function to apply the preprocessing function over the entire dataset. You can speed up the `map` function by setting `batched=True` to process multiple elements of the dataset at once. Remove the columns you don't need:
|
||||
|
||||
```py
|
||||
>>> tokenized_squad = squad.map(preprocess_function, batched=True, remove_columns=squad["train"].column_names)
|
||||
```
|
||||
|
||||
Use [`DefaultDataCollator`] to create a batch of examples. Unlike other data collators in 🤗 Transformers, the `DefaultDataCollator` does not apply additional preprocessing such as padding.
|
||||
|
||||
<frameworkcontent>
|
||||
<pt>
|
||||
```py
|
||||
>>> from transformers import DefaultDataCollator
|
||||
|
||||
>>> data_collator = DefaultDataCollator()
|
||||
```
|
||||
</pt>
|
||||
<tf>
|
||||
```py
|
||||
>>> from transformers import DefaultDataCollator
|
||||
|
||||
>>> data_collator = DefaultDataCollator(return_tensors="tf")
|
||||
```
|
||||
</tf>
|
||||
</frameworkcontent>
|
||||
|
||||
## Train
|
||||
|
||||
<frameworkcontent>
|
||||
<pt>
|
||||
Load DistilBERT with [`AutoModelForQuestionAnswering`]:
|
||||
|
||||
```py
|
||||
>>> from transformers import AutoModelForQuestionAnswering, TrainingArguments, Trainer
|
||||
|
||||
>>> model = AutoModelForQuestionAnswering.from_pretrained("distilbert-base-uncased")
|
||||
```
|
||||
|
||||
<Tip>
|
||||
|
||||
If you aren't familiar with fine-tuning a model with the [`Trainer`], take a look at the basic tutorial [here](../training#finetune-with-trainer)!
|
||||
|
||||
</Tip>
|
||||
|
||||
At this point, only three steps remain:
|
||||
|
||||
1. Define your training hyperparameters in [`TrainingArguments`].
|
||||
2. Pass the training arguments to [`Trainer`] along with the model, dataset, tokenizer, and data collator.
|
||||
3. Call [`~Trainer.train`] to fine-tune your model.
|
||||
|
||||
```py
|
||||
>>> training_args = TrainingArguments(
|
||||
... output_dir="./results",
|
||||
... evaluation_strategy="epoch",
|
||||
... learning_rate=2e-5,
|
||||
... per_device_train_batch_size=16,
|
||||
... per_device_eval_batch_size=16,
|
||||
... num_train_epochs=3,
|
||||
... weight_decay=0.01,
|
||||
... )
|
||||
|
||||
>>> trainer = Trainer(
|
||||
... model=model,
|
||||
... args=training_args,
|
||||
... train_dataset=tokenized_squad["train"],
|
||||
... eval_dataset=tokenized_squad["validation"],
|
||||
... tokenizer=tokenizer,
|
||||
... data_collator=data_collator,
|
||||
... )
|
||||
|
||||
>>> trainer.train()
|
||||
```
|
||||
</pt>
|
||||
<tf>
|
||||
To fine-tune a model in TensorFlow, start by converting your datasets to the `tf.data.Dataset` format with [`to_tf_dataset`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.to_tf_dataset). Specify inputs and the start and end positions of an answer in `columns`, whether to shuffle the dataset order, batch size, and the data collator:
|
||||
|
||||
```py
|
||||
>>> tf_train_set = tokenized_squad["train"].to_tf_dataset(
|
||||
... columns=["attention_mask", "input_ids", "start_positions", "end_positions"],
|
||||
... dummy_labels=True,
|
||||
... shuffle=True,
|
||||
... batch_size=16,
|
||||
... collate_fn=data_collator,
|
||||
... )
|
||||
|
||||
>>> tf_validation_set = tokenized_squad["validation"].to_tf_dataset(
|
||||
... columns=["attention_mask", "input_ids", "start_positions", "end_positions"],
|
||||
... dummy_labels=True,
|
||||
... shuffle=False,
|
||||
... batch_size=16,
|
||||
... collate_fn=data_collator,
|
||||
... )
|
||||
```
|
||||
|
||||
<Tip>
|
||||
|
||||
If you aren't familiar with fine-tuning a model with Keras, take a look at the basic tutorial [here](training#finetune-with-keras)!
|
||||
|
||||
</Tip>
|
||||
|
||||
Set up an optimizer function, learning rate schedule, and some training hyperparameters:
|
||||
|
||||
```py
|
||||
>>> from transformers import create_optimizer
|
||||
|
||||
>>> batch_size = 16
|
||||
>>> num_epochs = 2
|
||||
>>> total_train_steps = (len(tokenized_squad["train"]) // batch_size) * num_epochs
|
||||
>>> optimizer, schedule = create_optimizer(
|
||||
... init_lr=2e-5,
|
||||
... num_warmup_steps=0,
|
||||
... num_train_steps=total_train_steps,
|
||||
... )
|
||||
```
|
||||
|
||||
Load DistilBERT with [`TFAutoModelForQuestionAnswering`]:
|
||||
|
||||
```py
|
||||
>>> from transformers import TFAutoModelForQuestionAnswering
|
||||
|
||||
>>> model = TFAutoModelForQuestionAnswering("distilbert-base-uncased")
|
||||
```
|
||||
|
||||
Configure the model for training with [`compile`](https://keras.io/api/models/model_training_apis/#compile-method):
|
||||
|
||||
```py
|
||||
>>> import tensorflow as tf
|
||||
|
||||
>>> model.compile(optimizer=optimizer)
|
||||
```
|
||||
|
||||
Call [`fit`](https://keras.io/api/models/model_training_apis/#fit-method) to fine-tune the model:
|
||||
|
||||
```py
|
||||
>>> model.fit(x=tf_train_set, validation_data=tf_validation_set, epochs=3)
|
||||
```
|
||||
</tf>
|
||||
</frameworkcontent>
|
||||
|
||||
<Tip>
|
||||
|
||||
For a more in-depth example of how to fine-tune a model for question answering, take a look at the corresponding
|
||||
[PyTorch notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/question_answering.ipynb)
|
||||
or [TensorFlow notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/question_answering-tf.ipynb).
|
||||
|
||||
</Tip>
|
||||
214
docs/source/en/tasks/sequence_classification.mdx
Normal file
214
docs/source/en/tasks/sequence_classification.mdx
Normal file
@@ -0,0 +1,214 @@
|
||||
<!--Copyright 2022 The HuggingFace Team. All rights reserved.
|
||||
|
||||
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
|
||||
the License. You may obtain a copy of the License at
|
||||
|
||||
http://www.apache.org/licenses/LICENSE-2.0
|
||||
|
||||
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
|
||||
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
|
||||
specific language governing permissions and limitations under the License.
|
||||
-->
|
||||
|
||||
# Text classification
|
||||
|
||||
<Youtube id="leNG9fN9FQU"/>
|
||||
|
||||
Text classification is a common NLP task that assigns a label or class to text. There are many practical applications of text classification widely used in production by some of today's largest companies. One of the most popular forms of text classification is sentiment analysis, which assigns a label like positive, negative, or neutral to a sequence of text.
|
||||
|
||||
This guide will show you how to fine-tune [DistilBERT](https://huggingface.co/distilbert-base-uncased) on the [IMDb](https://huggingface.co/datasets/imdb) dataset to determine whether a movie review is positive or negative.
|
||||
|
||||
<Tip>
|
||||
|
||||
See the text classification [task page](https://huggingface.co/tasks/text-classification) for more information about other forms of text classification and their associated models, datasets, and metrics.
|
||||
|
||||
</Tip>
|
||||
|
||||
## Load IMDb dataset
|
||||
|
||||
Load the IMDb dataset from the 🤗 Datasets library:
|
||||
|
||||
```py
|
||||
>>> from datasets import load_dataset
|
||||
|
||||
>>> imdb = load_dataset("imdb")
|
||||
```
|
||||
|
||||
Then take a look at an example:
|
||||
|
||||
```py
|
||||
>>> imdb["test"][0]
|
||||
{
|
||||
"label": 0,
|
||||
"text": "I love sci-fi and am willing to put up with a lot. Sci-fi movies/TV are usually underfunded, under-appreciated and misunderstood. I tried to like this, I really did, but it is to good TV sci-fi as Babylon 5 is to Star Trek (the original). Silly prosthetics, cheap cardboard sets, stilted dialogues, CG that doesn't match the background, and painfully one-dimensional characters cannot be overcome with a 'sci-fi' setting. (I'm sure there are those of you out there who think Babylon 5 is good sci-fi TV. It's not. It's clichéd and uninspiring.) While US viewers might like emotion and character development, sci-fi is a genre that does not take itself seriously (cf. Star Trek). It may treat important issues, yet not as a serious philosophy. It's really difficult to care about the characters here as they are not simply foolish, just missing a spark of life. Their actions and reactions are wooden and predictable, often painful to watch. The makers of Earth KNOW it's rubbish as they have to always say \"Gene Roddenberry's Earth...\" otherwise people would not continue watching. Roddenberry's ashes must be turning in their orbit as this dull, cheap, poorly edited (watching it without advert breaks really brings this home) trudging Trabant of a show lumbers into space. Spoiler. So, kill off a main character. And then bring him back as another actor. Jeeez! Dallas all over again.",
|
||||
}
|
||||
```
|
||||
|
||||
There are two fields in this dataset:
|
||||
|
||||
- `text`: a string containing the text of the movie review.
|
||||
- `label`: a value that can either be `0` for a negative review or `1` for a positive review.
|
||||
|
||||
## Preprocess
|
||||
|
||||
Load the DistilBERT tokenizer to process the `text` field:
|
||||
|
||||
```py
|
||||
>>> from transformers import AutoTokenizer
|
||||
|
||||
>>> tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
|
||||
```
|
||||
|
||||
Create a preprocessing function to tokenize `text` and truncate sequences to be no longer than DistilBERT's maximum input length:
|
||||
|
||||
```py
|
||||
>>> def preprocess_function(examples):
|
||||
... return tokenizer(examples["text"], truncation=True)
|
||||
```
|
||||
|
||||
Use 🤗 Datasets [`map`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.map) function to apply the preprocessing function over the entire dataset. You can speed up the `map` function by setting `batched=True` to process multiple elements of the dataset at once:
|
||||
|
||||
```py
|
||||
tokenized_imdb = imdb.map(preprocess_function, batched=True)
|
||||
```
|
||||
|
||||
Use [`DataCollatorWithPadding`] to create a batch of examples. It will also *dynamically pad* your text to the length of the longest element in its batch, so they are a uniform length. While it is possible to pad your text in the `tokenizer` function by setting `padding=True`, dynamic padding is more efficient.
|
||||
|
||||
<frameworkcontent>
|
||||
<pt>
|
||||
```py
|
||||
>>> from transformers import DataCollatorWithPadding
|
||||
|
||||
>>> data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
|
||||
```
|
||||
</pt>
|
||||
<tf>
|
||||
```py
|
||||
>>> from transformers import DataCollatorWithPadding
|
||||
|
||||
>>> data_collator = DataCollatorWithPadding(tokenizer=tokenizer, return_tensors="tf")
|
||||
```
|
||||
</tf>
|
||||
</frameworkcontent>
|
||||
|
||||
## Train
|
||||
|
||||
<frameworkcontent>
|
||||
<pt>
|
||||
Load DistilBERT with [`AutoModelForSequenceClassification`] along with the number of expected labels:
|
||||
|
||||
```py
|
||||
>>> from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer
|
||||
|
||||
>>> model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2)
|
||||
```
|
||||
|
||||
<Tip>
|
||||
|
||||
If you aren't familiar with fine-tuning a model with the [`Trainer`], take a look at the basic tutorial [here](../training#finetune-with-trainer)!
|
||||
|
||||
</Tip>
|
||||
|
||||
At this point, only three steps remain:
|
||||
|
||||
1. Define your training hyperparameters in [`TrainingArguments`].
|
||||
2. Pass the training arguments to [`Trainer`] along with the model, dataset, tokenizer, and data collator.
|
||||
3. Call [`~Trainer.train`] to fine-tune your model.
|
||||
|
||||
```py
|
||||
>>> training_args = TrainingArguments(
|
||||
... output_dir="./results",
|
||||
... learning_rate=2e-5,
|
||||
... per_device_train_batch_size=16,
|
||||
... per_device_eval_batch_size=16,
|
||||
... num_train_epochs=5,
|
||||
... weight_decay=0.01,
|
||||
... )
|
||||
|
||||
>>> trainer = Trainer(
|
||||
... model=model,
|
||||
... args=training_args,
|
||||
... train_dataset=tokenized_imdb["train"],
|
||||
... eval_dataset=tokenized_imdb["test"],
|
||||
... tokenizer=tokenizer,
|
||||
... data_collator=data_collator,
|
||||
... )
|
||||
|
||||
>>> trainer.train()
|
||||
```
|
||||
|
||||
<Tip>
|
||||
|
||||
[`Trainer`] will apply dynamic padding by default when you pass `tokenizer` to it. In this case, you don't need to specify a data collator explicitly.
|
||||
|
||||
</Tip>
|
||||
</pt>
|
||||
<tf>
|
||||
To fine-tune a model in TensorFlow, start by converting your datasets to the `tf.data.Dataset` format with [`to_tf_dataset`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.to_tf_dataset). Specify inputs and labels in `columns`, whether to shuffle the dataset order, batch size, and the data collator:
|
||||
|
||||
```py
|
||||
>>> tf_train_set = tokenized_imdb["train"].to_tf_dataset(
|
||||
... columns=["attention_mask", "input_ids", "label"],
|
||||
... shuffle=True,
|
||||
... batch_size=16,
|
||||
... collate_fn=data_collator,
|
||||
... )
|
||||
|
||||
>>> tf_validation_set = tokenized_imdb["test"].to_tf_dataset(
|
||||
... columns=["attention_mask", "input_ids", "label"],
|
||||
... shuffle=False,
|
||||
... batch_size=16,
|
||||
... collate_fn=data_collator,
|
||||
... )
|
||||
```
|
||||
|
||||
<Tip>
|
||||
|
||||
If you aren't familiar with fine-tuning a model with Keras, take a look at the basic tutorial [here](training#finetune-with-keras)!
|
||||
|
||||
</Tip>
|
||||
|
||||
Set up an optimizer function, learning rate schedule, and some training hyperparameters:
|
||||
|
||||
```py
|
||||
>>> from transformers import create_optimizer
|
||||
>>> import tensorflow as tf
|
||||
|
||||
>>> batch_size = 16
|
||||
>>> num_epochs = 5
|
||||
>>> batches_per_epoch = len(tokenized_imdb["train"]) // batch_size
|
||||
>>> total_train_steps = int(batches_per_epoch * num_epochs)
|
||||
>>> optimizer, schedule = create_optimizer(init_lr=2e-5, num_warmup_steps=0, num_train_steps=total_train_steps)
|
||||
```
|
||||
|
||||
Load DistilBERT with [`TFAutoModelForSequenceClassification`] along with the number of expected labels:
|
||||
|
||||
```py
|
||||
>>> from transformers import TFAutoModelForSequenceClassification
|
||||
|
||||
>>> model = TFAutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2)
|
||||
```
|
||||
|
||||
Configure the model for training with [`compile`](https://keras.io/api/models/model_training_apis/#compile-method):
|
||||
|
||||
```py
|
||||
>>> import tensorflow as tf
|
||||
|
||||
>>> model.compile(optimizer=optimizer)
|
||||
```
|
||||
|
||||
Call [`fit`](https://keras.io/api/models/model_training_apis/#fit-method) to fine-tune the model:
|
||||
|
||||
```py
|
||||
>>> model.fit(x=tf_train_set, validation_data=tf_validation_set, epochs=3)
|
||||
```
|
||||
</tf>
|
||||
</frameworkcontent>
|
||||
|
||||
<Tip>
|
||||
|
||||
For a more in-depth example of how to fine-tune a model for text classification, take a look at the corresponding
|
||||
[PyTorch notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/text_classification.ipynb)
|
||||
or [TensorFlow notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/text_classification-tf.ipynb).
|
||||
|
||||
</Tip>
|
||||
223
docs/source/en/tasks/summarization.mdx
Normal file
223
docs/source/en/tasks/summarization.mdx
Normal file
File diff suppressed because one or more lines are too long
272
docs/source/en/tasks/token_classification.mdx
Normal file
272
docs/source/en/tasks/token_classification.mdx
Normal file
@@ -0,0 +1,272 @@
|
||||
<!--Copyright 2022 The HuggingFace Team. All rights reserved.
|
||||
|
||||
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
|
||||
the License. You may obtain a copy of the License at
|
||||
|
||||
http://www.apache.org/licenses/LICENSE-2.0
|
||||
|
||||
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
|
||||
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
|
||||
specific language governing permissions and limitations under the License.
|
||||
-->
|
||||
|
||||
# Token classification
|
||||
|
||||
<Youtube id="wVHdVlPScxA"/>
|
||||
|
||||
Token classification assigns a label to individual tokens in a sentence. One of the most common token classification tasks is Named Entity Recognition (NER). NER attempts to find a label for each entity in a sentence, such as a person, location, or organization.
|
||||
|
||||
This guide will show you how to fine-tune [DistilBERT](https://huggingface.co/distilbert-base-uncased) on the [WNUT 17](https://huggingface.co/datasets/wnut_17) dataset to detect new entities.
|
||||
|
||||
<Tip>
|
||||
|
||||
See the token classification [task page](https://huggingface.co/tasks/token-classification) for more information about other forms of token classification and their associated models, datasets, and metrics.
|
||||
|
||||
</Tip>
|
||||
|
||||
## Load WNUT 17 dataset
|
||||
|
||||
Load the WNUT 17 dataset from the 🤗 Datasets library:
|
||||
|
||||
```py
|
||||
>>> from datasets import load_dataset
|
||||
|
||||
>>> wnut = load_dataset("wnut_17")
|
||||
```
|
||||
|
||||
Then take a look at an example:
|
||||
|
||||
```py
|
||||
>>> wnut["train"][0]
|
||||
{'id': '0',
|
||||
'ner_tags': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 7, 8, 8, 0, 7, 0, 0, 0, 0, 0, 0, 0, 0],
|
||||
'tokens': ['@paulwalk', 'It', "'s", 'the', 'view', 'from', 'where', 'I', "'m", 'living', 'for', 'two', 'weeks', '.', 'Empire', 'State', 'Building', '=', 'ESB', '.', 'Pretty', 'bad', 'storm', 'here', 'last', 'evening', '.']
|
||||
}
|
||||
```
|
||||
|
||||
Each number in `ner_tags` represents an entity. Convert the number to a label name for more information:
|
||||
|
||||
```py
|
||||
>>> label_list = wnut["train"].features[f"ner_tags"].feature.names
|
||||
>>> label_list
|
||||
[
|
||||
"O",
|
||||
"B-corporation",
|
||||
"I-corporation",
|
||||
"B-creative-work",
|
||||
"I-creative-work",
|
||||
"B-group",
|
||||
"I-group",
|
||||
"B-location",
|
||||
"I-location",
|
||||
"B-person",
|
||||
"I-person",
|
||||
"B-product",
|
||||
"I-product",
|
||||
]
|
||||
```
|
||||
|
||||
The `ner_tag` describes an entity, such as a corporation, location, or person. The letter that prefixes each `ner_tag` indicates the token position of the entity:
|
||||
|
||||
- `B-` indicates the beginning of an entity.
|
||||
- `I-` indicates a token is contained inside the same entity (e.g., the `State` token is a part of an entity like
|
||||
`Empire State Building`).
|
||||
- `0` indicates the token doesn't correspond to any entity.
|
||||
|
||||
## Preprocess
|
||||
|
||||
<Youtube id="iY2AZYdZAr0"/>
|
||||
|
||||
Load the DistilBERT tokenizer to process the `tokens`:
|
||||
|
||||
```py
|
||||
>>> from transformers import AutoTokenizer
|
||||
|
||||
>>> tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
|
||||
```
|
||||
|
||||
Since the input has already been split into words, set `is_split_into_words=True` to tokenize the words into subwords:
|
||||
|
||||
```py
|
||||
>>> tokenized_input = tokenizer(example["tokens"], is_split_into_words=True)
|
||||
>>> tokens = tokenizer.convert_ids_to_tokens(tokenized_input["input_ids"])
|
||||
>>> tokens
|
||||
['[CLS]', '@', 'paul', '##walk', 'it', "'", 's', 'the', 'view', 'from', 'where', 'i', "'", 'm', 'living', 'for', 'two', 'weeks', '.', 'empire', 'state', 'building', '=', 'es', '##b', '.', 'pretty', 'bad', 'storm', 'here', 'last', 'evening', '.', '[SEP]']
|
||||
```
|
||||
|
||||
Adding the special tokens `[CLS]` and `[SEP]` and subword tokenization creates a mismatch between the input and labels. A single word corresponding to a single label may be split into two subwords. You will need to realign the tokens and labels by:
|
||||
|
||||
1. Mapping all tokens to their corresponding word with the [`word_ids`](https://huggingface.co/docs/tokenizers/python/latest/api/reference.html#tokenizers.Encoding.word_ids) method.
|
||||
2. Assigning the label `-100` to the special tokens `[CLS]` and `[SEP]` so the PyTorch loss function ignores
|
||||
them.
|
||||
3. Only labeling the first token of a given word. Assign `-100` to other subtokens from the same word.
|
||||
|
||||
Here is how you can create a function to realign the tokens and labels, and truncate sequences to be no longer than DistilBERT's maximum input length::
|
||||
|
||||
```py
|
||||
>>> def tokenize_and_align_labels(examples):
|
||||
... tokenized_inputs = tokenizer(examples["tokens"], truncation=True, is_split_into_words=True)
|
||||
|
||||
... labels = []
|
||||
... for i, label in enumerate(examples[f"ner_tags"]):
|
||||
... word_ids = tokenized_inputs.word_ids(batch_index=i) # Map tokens to their respective word.
|
||||
... previous_word_idx = None
|
||||
... label_ids = []
|
||||
... for word_idx in word_ids: # Set the special tokens to -100.
|
||||
... if word_idx is None:
|
||||
... label_ids.append(-100)
|
||||
... elif word_idx != previous_word_idx: # Only label the first token of a given word.
|
||||
... label_ids.append(label[word_idx])
|
||||
... else:
|
||||
... label_ids.append(-100)
|
||||
... previous_word_idx = word_idx
|
||||
... labels.append(label_ids)
|
||||
|
||||
... tokenized_inputs["labels"] = labels
|
||||
... return tokenized_inputs
|
||||
```
|
||||
|
||||
Use 🤗 Datasets [`map`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.map) function to tokenize and align the labels over the entire dataset. You can speed up the `map` function by setting `batched=True` to process multiple elements of the dataset at once:
|
||||
|
||||
```py
|
||||
>>> tokenized_wnut = wnut.map(tokenize_and_align_labels, batched=True)
|
||||
```
|
||||
|
||||
Use [`DataCollatorForTokenClassification`] to create a batch of examples. It will also *dynamically pad* your text and labels to the length of the longest element in its batch, so they are a uniform length. While it is possible to pad your text in the `tokenizer` function by setting `padding=True`, dynamic padding is more efficient.
|
||||
|
||||
<frameworkcontent>
|
||||
<pt>
|
||||
```py
|
||||
>>> from transformers import DataCollatorForTokenClassification
|
||||
|
||||
>>> data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer)
|
||||
```
|
||||
</pt>
|
||||
<tf>
|
||||
```py
|
||||
>>> from transformers import DataCollatorForTokenClassification
|
||||
|
||||
>>> data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer, return_tensors="tf")
|
||||
```
|
||||
</tf>
|
||||
</frameworkcontent>
|
||||
|
||||
## Train
|
||||
|
||||
<frameworkcontent>
|
||||
<pt>
|
||||
Load DistilBERT with [`AutoModelForTokenClassification`] along with the number of expected labels:
|
||||
|
||||
```py
|
||||
>>> from transformers import AutoModelForTokenClassification, TrainingArguments, Trainer
|
||||
|
||||
>>> model = AutoModelForTokenClassification.from_pretrained("distilbert-base-uncased", num_labels=14)
|
||||
```
|
||||
|
||||
<Tip>
|
||||
|
||||
If you aren't familiar with fine-tuning a model with the [`Trainer`], take a look at the basic tutorial [here](../training#finetune-with-trainer)!
|
||||
|
||||
</Tip>
|
||||
|
||||
At this point, only three steps remain:
|
||||
|
||||
1. Define your training hyperparameters in [`TrainingArguments`].
|
||||
2. Pass the training arguments to [`Trainer`] along with the model, dataset, tokenizer, and data collator.
|
||||
3. Call [`~Trainer.train`] to fine-tune your model.
|
||||
|
||||
```py
|
||||
>>> training_args = TrainingArguments(
|
||||
... output_dir="./results",
|
||||
... evaluation_strategy="epoch",
|
||||
... learning_rate=2e-5,
|
||||
... per_device_train_batch_size=16,
|
||||
... per_device_eval_batch_size=16,
|
||||
... num_train_epochs=3,
|
||||
... weight_decay=0.01,
|
||||
... )
|
||||
|
||||
>>> trainer = Trainer(
|
||||
... model=model,
|
||||
... args=training_args,
|
||||
... train_dataset=tokenized_wnut["train"],
|
||||
... eval_dataset=tokenized_wnut["test"],
|
||||
... tokenizer=tokenizer,
|
||||
... data_collator=data_collator,
|
||||
... )
|
||||
|
||||
>>> trainer.train()
|
||||
```
|
||||
</pt>
|
||||
<tf>
|
||||
To fine-tune a model in TensorFlow, start by converting your datasets to the `tf.data.Dataset` format with [`to_tf_dataset`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.to_tf_dataset). Specify inputs and labels in `columns`, whether to shuffle the dataset order, batch size, and the data collator:
|
||||
|
||||
```py
|
||||
>>> tf_train_set = tokenized_wnut["train"].to_tf_dataset(
|
||||
... columns=["attention_mask", "input_ids", "labels"],
|
||||
... shuffle=True,
|
||||
... batch_size=16,
|
||||
... collate_fn=data_collator,
|
||||
... )
|
||||
|
||||
>>> tf_validation_set = tokenized_wnut["validation"].to_tf_dataset(
|
||||
... columns=["attention_mask", "input_ids", "labels"],
|
||||
... shuffle=False,
|
||||
... batch_size=16,
|
||||
... collate_fn=data_collator,
|
||||
... )
|
||||
```
|
||||
|
||||
<Tip>
|
||||
|
||||
If you aren't familiar with fine-tuning a model with Keras, take a look at the basic tutorial [here](training#finetune-with-keras)!
|
||||
|
||||
</Tip>
|
||||
|
||||
Set up an optimizer function, learning rate schedule, and some training hyperparameters:
|
||||
|
||||
```py
|
||||
>>> from transformers import create_optimizer
|
||||
|
||||
>>> batch_size = 16
|
||||
>>> num_train_epochs = 3
|
||||
>>> num_train_steps = (len(tokenized_wnut["train"]) // batch_size) * num_train_epochs
|
||||
>>> optimizer, lr_schedule = create_optimizer(
|
||||
... init_lr=2e-5,
|
||||
... num_train_steps=num_train_steps,
|
||||
... weight_decay_rate=0.01,
|
||||
... num_warmup_steps=0,
|
||||
... )
|
||||
```
|
||||
|
||||
Load DistilBERT with [`TFAutoModelForTokenClassification`] along with the number of expected labels:
|
||||
|
||||
```py
|
||||
>>> from transformers import TFAutoModelForTokenClassification
|
||||
|
||||
>>> model = TFAutoModelForTokenClassification.from_pretrained("distilbert-base-uncased", num_labels=2)
|
||||
```
|
||||
|
||||
Configure the model for training with [`compile`](https://keras.io/api/models/model_training_apis/#compile-method):
|
||||
|
||||
```py
|
||||
>>> import tensorflow as tf
|
||||
|
||||
>>> model.compile(optimizer=optimizer)
|
||||
```
|
||||
|
||||
Call [`fit`](https://keras.io/api/models/model_training_apis/#fit-method) to fine-tune the model:
|
||||
|
||||
```py
|
||||
>>> model.fit(x=tf_train_set, validation_data=tf_validation_set, epochs=3)
|
||||
```
|
||||
</tf>
|
||||
</frameworkcontent>
|
||||
|
||||
<Tip>
|
||||
|
||||
For a more in-depth example of how to fine-tune a model for token classification, take a look at the corresponding
|
||||
[PyTorch notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/token_classification.ipynb)
|
||||
or [TensorFlow notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/token_classification-tf.ipynb).
|
||||
|
||||
</Tip>
|
||||
225
docs/source/en/tasks/translation.mdx
Normal file
225
docs/source/en/tasks/translation.mdx
Normal file
@@ -0,0 +1,225 @@
|
||||
<!--Copyright 2022 The HuggingFace Team. All rights reserved.
|
||||
|
||||
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
|
||||
the License. You may obtain a copy of the License at
|
||||
|
||||
http://www.apache.org/licenses/LICENSE-2.0
|
||||
|
||||
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
|
||||
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
|
||||
specific language governing permissions and limitations under the License.
|
||||
-->
|
||||
|
||||
# Translation
|
||||
|
||||
<Youtube id="1JvfrvZgi6c"/>
|
||||
|
||||
Translation converts a sequence of text from one language to another. It is one of several tasks you can formulate as a sequence-to-sequence problem, a powerful framework that extends to vision and audio tasks.
|
||||
|
||||
This guide will show you how to fine-tune [T5](https://huggingface.co/t5-small) on the English-French subset of the [OPUS Books](https://huggingface.co/datasets/opus_books) dataset to translate English text to French.
|
||||
|
||||
<Tip>
|
||||
|
||||
See the translation [task page](https://huggingface.co/tasks/translation) for more information about its associated models, datasets, and metrics.
|
||||
|
||||
</Tip>
|
||||
|
||||
## Load OPUS Books dataset
|
||||
|
||||
Load the OPUS Books dataset from the 🤗 Datasets library:
|
||||
|
||||
```py
|
||||
>>> from datasets import load_dataset
|
||||
|
||||
>>> books = load_dataset("opus_books", "en-fr")
|
||||
```
|
||||
|
||||
Split this dataset into a train and test set:
|
||||
|
||||
```py
|
||||
books = books["train"].train_test_split(test_size=0.2)
|
||||
```
|
||||
|
||||
Then take a look at an example:
|
||||
|
||||
```py
|
||||
>>> books["train"][0]
|
||||
{'id': '90560',
|
||||
'translation': {'en': 'But this lofty plateau measured only a few fathoms, and soon we reentered Our Element.',
|
||||
'fr': 'Mais ce plateau élevé ne mesurait que quelques toises, et bientôt nous fûmes rentrés dans notre élément.'}}
|
||||
```
|
||||
|
||||
The `translation` field is a dictionary containing the English and French translations of the text.
|
||||
|
||||
## Preprocess
|
||||
|
||||
<Youtube id="XAR8jnZZuUs"/>
|
||||
|
||||
Load the T5 tokenizer to process the language pairs:
|
||||
|
||||
```py
|
||||
>>> from transformers import AutoTokenizer
|
||||
|
||||
>>> tokenizer = AutoTokenizer.from_pretrained("t5-small")
|
||||
```
|
||||
|
||||
The preprocessing function needs to:
|
||||
|
||||
1. Prefix the input with a prompt so T5 knows this is a translation task. Some models capable of multiple NLP tasks require prompting for specific tasks.
|
||||
2. Tokenize the input (English) and target (French) separately. You can't tokenize French text with a tokenizer pretrained on an English vocabulary. A context manager will help set the tokenizer to French first before tokenizing it.
|
||||
3. Truncate sequences to be no longer than the maximum length set by the `max_length` parameter.
|
||||
|
||||
```py
|
||||
>>> source_lang = "en"
|
||||
>>> target_lang = "fr"
|
||||
>>> prefix = "translate English to French: "
|
||||
|
||||
|
||||
>>> def preprocess_function(examples):
|
||||
... inputs = [prefix + example[source_lang] for example in examples["translation"]]
|
||||
... targets = [example[target_lang] for example in examples["translation"]]
|
||||
... model_inputs = tokenizer(inputs, max_length=128, truncation=True)
|
||||
|
||||
... with tokenizer.as_target_tokenizer():
|
||||
... labels = tokenizer(targets, max_length=128, truncation=True)
|
||||
|
||||
... model_inputs["labels"] = labels["input_ids"]
|
||||
... return model_inputs
|
||||
```
|
||||
|
||||
Use 🤗 Datasets [`map`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.map) function to apply the preprocessing function over the entire dataset. You can speed up the `map` function by setting `batched=True` to process multiple elements of the dataset at once:
|
||||
|
||||
```py
|
||||
>>> tokenized_books = books.map(preprocess_function, batched=True)
|
||||
```
|
||||
|
||||
Use [`DataCollatorForSeq2Seq`] to create a batch of examples. It will also *dynamically pad* your text and labels to the length of the longest element in its batch, so they are a uniform length. While it is possible to pad your text in the `tokenizer` function by setting `padding=True`, dynamic padding is more efficient.
|
||||
|
||||
<frameworkcontent>
|
||||
<pt>
|
||||
```py
|
||||
>>> from transformers import DataCollatorForSeq2Seq
|
||||
|
||||
>>> data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model)
|
||||
```
|
||||
</pt>
|
||||
<tf>
|
||||
```py
|
||||
>>> from transformers import DataCollatorForSeq2Seq
|
||||
|
||||
>>> data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model, return_tensors="tf")
|
||||
```
|
||||
</tf>
|
||||
</frameworkcontent>
|
||||
|
||||
## Train
|
||||
|
||||
<frameworkcontent>
|
||||
<pt>
|
||||
Load T5 with [`AutoModelForSeq2SeqLM`]:
|
||||
|
||||
```py
|
||||
>>> from transformers import AutoModelForSeq2SeqLM, Seq2SeqTrainingArguments, Seq2SeqTrainer
|
||||
|
||||
>>> model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")
|
||||
```
|
||||
|
||||
<Tip>
|
||||
|
||||
If you aren't familiar with fine-tuning a model with the [`Trainer`], take a look at the basic tutorial [here](../training#finetune-with-trainer)!
|
||||
|
||||
</Tip>
|
||||
|
||||
At this point, only three steps remain:
|
||||
|
||||
1. Define your training hyperparameters in [`Seq2SeqTrainingArguments`].
|
||||
2. Pass the training arguments to [`Seq2SeqTrainer`] along with the model, dataset, tokenizer, and data collator.
|
||||
3. Call [`~Trainer.train`] to fine-tune your model.
|
||||
|
||||
```py
|
||||
>>> training_args = Seq2SeqTrainingArguments(
|
||||
... output_dir="./results",
|
||||
... evaluation_strategy="epoch",
|
||||
... learning_rate=2e-5,
|
||||
... per_device_train_batch_size=16,
|
||||
... per_device_eval_batch_size=16,
|
||||
... weight_decay=0.01,
|
||||
... save_total_limit=3,
|
||||
... num_train_epochs=1,
|
||||
... fp16=True,
|
||||
... )
|
||||
|
||||
>>> trainer = Seq2SeqTrainer(
|
||||
... model=model,
|
||||
... args=training_args,
|
||||
... train_dataset=tokenized_books["train"],
|
||||
... eval_dataset=tokenized_books["test"],
|
||||
... tokenizer=tokenizer,
|
||||
... data_collator=data_collator,
|
||||
... )
|
||||
|
||||
>>> trainer.train()
|
||||
```
|
||||
</pt>
|
||||
<tf>
|
||||
To fine-tune a model in TensorFlow, start by converting your datasets to the `tf.data.Dataset` format with [`to_tf_dataset`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.to_tf_dataset). Specify inputs and labels in `columns`, whether to shuffle the dataset order, batch size, and the data collator:
|
||||
|
||||
```py
|
||||
>>> tf_train_set = tokenized_books["train"].to_tf_dataset(
|
||||
... columns=["attention_mask", "input_ids", "labels"],
|
||||
... shuffle=True,
|
||||
... batch_size=16,
|
||||
... collate_fn=data_collator,
|
||||
... )
|
||||
|
||||
>>> tf_test_set = tokenized_books["test"].to_tf_dataset(
|
||||
... columns=["attention_mask", "input_ids", "labels"],
|
||||
... shuffle=False,
|
||||
... batch_size=16,
|
||||
... collate_fn=data_collator,
|
||||
... )
|
||||
```
|
||||
|
||||
<Tip>
|
||||
|
||||
If you aren't familiar with fine-tuning a model with Keras, take a look at the basic tutorial [here](training#finetune-with-keras)!
|
||||
|
||||
</Tip>
|
||||
|
||||
Set up an optimizer function, learning rate schedule, and some training hyperparameters:
|
||||
|
||||
```py
|
||||
>>> from transformers import create_optimizer, AdamWeightDecay
|
||||
|
||||
>>> optimizer = AdamWeightDecay(learning_rate=2e-5, weight_decay_rate=0.01)
|
||||
```
|
||||
|
||||
Load T5 with [`TFAutoModelForSeq2SeqLM`]:
|
||||
|
||||
```py
|
||||
>>> from transformers import TFAutoModelForSeq2SeqLM
|
||||
|
||||
>>> model = TFAutoModelForSeq2SeqLM.from_pretrained("t5-small")
|
||||
```
|
||||
|
||||
Configure the model for training with [`compile`](https://keras.io/api/models/model_training_apis/#compile-method):
|
||||
|
||||
```py
|
||||
>>> model.compile(optimizer=optimizer)
|
||||
```
|
||||
|
||||
Call [`fit`](https://keras.io/api/models/model_training_apis/#fit-method) to fine-tune the model:
|
||||
|
||||
```py
|
||||
>>> model.fit(x=tf_train_set, validation_data=tf_test_set, epochs=3)
|
||||
```
|
||||
</tf>
|
||||
</frameworkcontent>
|
||||
|
||||
<Tip>
|
||||
|
||||
For a more in-depth example of how to fine-tune a model for translation, take a look at the corresponding
|
||||
[PyTorch notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/translation.ipynb)
|
||||
or [TensorFlow notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/translation-tf.ipynb).
|
||||
|
||||
</Tip>
|
||||
Reference in New Issue
Block a user