From b9418a1d97d33dac0e7ec1df7fc1178f361104c5 Mon Sep 17 00:00:00 2001
From: Steven Liu <59462357+stevhliu@users.noreply.github.com>
Date: Tue, 1 Feb 2022 18:31:35 -0600
Subject: [PATCH] Update tutorial docs (#15165)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
* first draft of pipeline, autoclass, preprocess tutorials
* apply review feedback
* 🖍 apply feedback from patrick/niels
* 📝add output image to preprocessed image
* 🖍 apply feedback from patrick
---
docs/source/_toctree.yml | 10 +-
docs/source/autoclass_tutorial.mdx | 104 +++++
docs/source/pipeline_tutorial.mdx | 139 ++++++
docs/source/preprocessing.mdx | 664 +++++++++++++++++++----------
4 files changed, 683 insertions(+), 234 deletions(-)
create mode 100644 docs/source/autoclass_tutorial.mdx
create mode 100644 docs/source/pipeline_tutorial.mdx
diff --git a/docs/source/_toctree.yml b/docs/source/_toctree.yml
index dfd442ef13..5ed48d88f7 100644
--- a/docs/source/_toctree.yml
+++ b/docs/source/_toctree.yml
@@ -11,12 +11,16 @@
title: Glossary
title: Get started
- sections:
+ - local: pipeline_tutorial
+ title: Pipelines for inference
+ - local: autoclass_tutorial
+ title: Load pretrained instances with an AutoClass
+ - local: preprocessing
+ title: Preprocess
- local: task_summary
title: Summary of the tasks
- local: model_summary
title: Summary of the models
- - local: preprocessing
- title: Preprocessing data
- local: training
title: Fine-tuning a pretrained model
- local: accelerate
@@ -27,7 +31,7 @@
title: Summary of the tokenizers
- local: multilingual
title: Multi-lingual models
- title: "Using 🤗 Transformers"
+ title: Tutorials
- sections:
- local: examples
title: Examples
diff --git a/docs/source/autoclass_tutorial.mdx b/docs/source/autoclass_tutorial.mdx
new file mode 100644
index 0000000000..ea791b1845
--- /dev/null
+++ b/docs/source/autoclass_tutorial.mdx
@@ -0,0 +1,104 @@
+
+
+# Load pretrained instances with an AutoClass
+
+With so many different Transformer architectures, it can be challenging to create one for your checkpoint. As part of the 🤗 Transformers core philosophy to make the library easy, simple and flexible to use, an `AutoClass` automatically infers and loads the correct architecture from a given checkpoint. The `from_pretrained` method lets you quickly load a pretrained model for any architecture so you don't have to devote time and resources to training a model from scratch. Producing this type of checkpoint-agnostic code means that if your code works for one checkpoint, it will work with another checkpoint - as long as it was trained for a similar task - even if the architecture is different.
+
+
+
+Remember, architecture refers to the skeleton of the model and checkpoints are the weights for a given architecture. For example, [BERT](https://huggingface.co/bert-base-uncased) is an architecture, while `bert-base-uncased` is a checkpoint. Model is a general term that can mean either architecture or checkpoint.
+
+
+
+In this tutorial, learn to:
+
+* Load a pretrained tokenizer.
+* Load a pretrained feature extractor.
+* Load a pretrained processor.
+* Load a pretrained model.
+
+## AutoTokenizer
+
+Nearly every NLP task begins with a tokenizer. A tokenizer converts your input into a format that can be processed by the model.
+
+Load a tokenizer with [`AutoTokenizer.from_pretrained`]:
+
+```py
+>>> from transformers import AutoTokenizer
+
+>>> tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
+```
+
+Then tokenize your input as shown below:
+
+```py
+>>> sequence = "In a hole in the ground there lived a hobbit."
+>>> print(tokenizer(sequence))
+{'input_ids': [101, 1999, 1037, 4920, 1999, 1996, 2598, 2045, 2973, 1037, 7570, 10322, 4183, 1012, 102],
+ 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
+ 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
+```
+
+## AutoFeatureExtractor
+
+For audio and vision tasks, a feature extractor processes the audio signal or image into the correct input format.
+
+Load a feature extractor with [`AutoFeatureExtractor.from_pretrained`]:
+
+```py
+>>> from transformers import AutoFeatureExtractor
+
+>>> feature_extractor = AutoFeatureExtractor.from_pretrained(
+... "ehcalabres/wav2vec2-lg-xlsr-en-speech-emotion-recognition"
+... )
+```
+
+## AutoProcessor
+
+Multimodal tasks require a processor that combines two types of preprocessing tools. For example, the [LayoutLMV2](model_doc/layoutlmv2) model requires a feature extractor to handle images and a tokenizer to handle text; a processor combines both of them.
+
+Load a processor with [`AutoProcessor.from_pretrained`]:
+
+```py
+>>> from transformers import AutoProcessor
+
+>>> processor = AutoProcessor.from_pretrained("microsoft/layoutlmv2-base-uncased")
+```
+
+## AutoModel
+
+Finally, the `AutoModelFor` classes let you load a pretrained model for a given task (see [here](model_doc/auto) for a complete list of available tasks). For example, load a model for sequence classification with [`AutoModelForSequenceClassification.from_pretrained`]:
+
+```py
+>>> from transformers import AutoModelForSequenceClassification
+
+>>> model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased")
+===PT-TF-SPLIT===
+>>> from transformers import TFAutoModelForSequenceClassification
+
+>>> model = TFAutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased")
+```
+
+Easily reuse the same checkpoint to load an architecture for a different task:
+
+```py
+>>> from transformers import AutoModelForTokenClassification
+
+>>> model = AutoModelForTokenClassification.from_pretrained("distilbert-base-uncased")
+===PT-TF-SPLIT===
+>>> from transformers import TFAutoModelForTokenClassification
+
+>>> model = TFAutoModelForTokenClassification.from_pretrained("distilbert-base-uncased")
+```
+
+Generally, we recommend using the `AutoTokenizer` class and the `AutoModelFor` class to load pretrained instances of models. This will ensure you load the correct architecture every time. In the next [tutorial](preprocessing), learn how to use your newly loaded tokenizer, feature extractor and processor to preprocess a dataset for fine-tuning.
\ No newline at end of file
diff --git a/docs/source/pipeline_tutorial.mdx b/docs/source/pipeline_tutorial.mdx
new file mode 100644
index 0000000000..0d815a61b7
--- /dev/null
+++ b/docs/source/pipeline_tutorial.mdx
@@ -0,0 +1,139 @@
+
+
+# Pipelines for inference
+
+The [`pipeline`] makes it simple to use any model from the [Model Hub](https://huggingface.co/models) for inference on a variety of tasks such as text generation, image segmentation and audio classification. Even if you don't have experience with a specific modality or understand the code powering the models, you can still use them with the [`pipeline`]! This tutorial will teach you to:
+
+* Use a [`pipeline`] for inference.
+* Use a specific tokenizer or model.
+* Use a [`pipeline`] for audio and vision tasks.
+
+
+
+Take a look at the [`pipeline`] documentation for a complete list of supported tasks.
+
+
+
+## Pipeline usage
+
+While each task has an associated [`pipeline`], it is simpler to use the general [`pipeline`] abstraction which contains all the specific task pipelines. The [`pipeline`] automatically loads a default model and tokenizer capable of inference for your task.
+
+1. Start by creating a [`pipeline`] and specify an inference task:
+
+```py
+>>> from transformers import pipeline
+
+>>> generator = pipeline(task="text-generation")
+```
+
+2. Pass your input text to the [`pipeline`]:
+
+```py
+>>> generator("Three Rings for the Elven-kings under the sky, Seven for the Dwarf-lords in their halls of stone")
+[{'generated_text': 'Three Rings for the Elven-kings under the sky, Seven for the Dwarf-lords in their halls of stone, Seven for the Iron-priests at the door to the east, and thirteen for the Lord Kings at the end of the mountain'}]
+```
+
+If you have more than one input, pass your input as a list:
+
+```py
+>>> generator(
+... [
+... "Three Rings for the Elven-kings under the sky, Seven for the Dwarf-lords in their halls of stone",
+... "Nine for Mortal Men, doomed to die, One for the Dark Lord on his dark throne",
+... ]
+... )
+```
+
+Any additional parameters for your task can also be included in the [`pipeline`]. The `text-generation` task has a [`~generation_utils.GenerationMixin.generate`] method with several parameters for controlling the output. For example, if you want to generate more than one output, set the `num_return_sequences` parameter:
+
+```py
+>>> generator(
+... "Three Rings for the Elven-kings under the sky, Seven for the Dwarf-lords in their halls of stone",
+... num_return_sequences=2,
+... )
+```
+
+### Choose a model and tokenizer
+
+The [`pipeline`] accepts any model from the [Model Hub](https://huggingface.co/models). There are tags on the Model Hub that allow you to filter for a model you'd like to use for your task. Once you've picked an appropriate model, load it with the corresponding `AutoModelFor` and [`AutoTokenizer`] class. For example, load the [`AutoModelForCausalLM`] class for a causal language modeling task:
+
+```py
+>>> from transformers import AutoTokenizer, AutoModelForCausalLM
+
+>>> tokenizer = AutoTokenizer.from_pretrained("distilgpt2")
+>>> model = AutoModelForCausalLM.from_pretrained("distilgpt2")
+```
+
+Create a [`pipeline`] for your task, and specify the model and tokenizer you've loaded:
+
+```py
+>>> from transformers import pipeline
+
+>>> generator = pipeline(task="text-generation", model=model, tokenizer=tokenizer)
+```
+
+Pass your input text to the [`pipeline`] to generate some text:
+
+```py
+>>> generator("Three Rings for the Elven-kings under the sky, Seven for the Dwarf-lords in their halls of stone")
+[{'generated_text': 'Three Rings for the Elven-kings under the sky, Seven for the Dwarf-lords in their halls of stone, Seven for the Dragon-lords (for them to rule in a world ruled by their rulers, and all who live within the realm'}]
+```
+
+## Audio pipeline
+
+The flexibility of the [`pipeline`] means it can also be extended to audio tasks.
+
+For example, let's classify the emotion from a short clip of John F. Kennedy's famous ["We choose to go to the Moon"](https://en.wikipedia.org/wiki/We_choose_to_go_to_the_Moon) speech. Find an [audio classification](https://huggingface.co/models?pipeline_tag=audio-classification) model on the Model Hub for emotion recognition and load it in the [`pipeline`]:
+
+```py
+>>> from transformers import pipeline
+
+>>> audio_classifier = pipeline(
+... task="audio-classification", model="ehcalabres/wav2vec2-lg-xlsr-en-speech-emotion-recognition"
+... )
+```
+
+Pass the audio file to the [`pipeline`]:
+
+```py
+>>> audio_classifier("jfk_moon_speech.wav")
+[{'label': 'calm', 'score': 0.13856211304664612},
+ {'label': 'disgust', 'score': 0.13148026168346405},
+ {'label': 'happy', 'score': 0.12635163962841034},
+ {'label': 'angry', 'score': 0.12439591437578201},
+ {'label': 'fearful', 'score': 0.12404385954141617}]
+```
+
+## Vision pipeline
+
+Finally, using a [`pipeline`] for vision tasks is practically identical.
+
+Specify your vision task and pass your image to the classifier. The image can be a link or a local path to the image. For example, what species of cat is shown below?
+
+
+
+```py
+>>> from transformers import pipeline
+
+>>> vision_classifier = pipeline(task="image-classification")
+>>> vision_classifier(
+... images="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"
+... )
+[{'label': 'lynx, catamount', 'score': 0.4403027892112732},
+ {'label': 'cougar, puma, catamount, mountain lion, painter, panther, Felis concolor',
+ 'score': 0.03433405980467796},
+ {'label': 'snow leopard, ounce, Panthera uncia',
+ 'score': 0.032148055732250214},
+ {'label': 'Egyptian cat', 'score': 0.02353910356760025},
+ {'label': 'tiger cat', 'score': 0.023034192621707916}]
+```
diff --git a/docs/source/preprocessing.mdx b/docs/source/preprocessing.mdx
index 331d1566ed..e3629fb19d 100644
--- a/docs/source/preprocessing.mdx
+++ b/docs/source/preprocessing.mdx
@@ -1,4 +1,4 @@
-
-# Preprocessing data
+# Preprocess
[[open-in-colab]]
-In this tutorial, we'll explore how to preprocess your data using 🤗 Transformers. The main tool for this is what we
-call a [tokenizer](main_classes/tokenizer). You can build one using the tokenizer class associated to the model
-you would like to use, or directly with the [`AutoTokenizer`] class.
+Before you can use your data in a model, the data needs to be processed into an acceptable format for the model. A model does not understand raw text, images or audio. These inputs need to be converted into numbers and assembled into tensors. In this tutorial, you will:
-As we saw in the [quick tour](quicktour), the tokenizer will first split a given text in words (or part of
-words, punctuation symbols, etc.) usually called _tokens_. Then it will convert those _tokens_ into numbers, to be able
-to build a tensor out of them and feed them to the model. It will also add any additional inputs the model might expect
-to work properly.
-
-
-
-If you plan on using a pretrained model, it's important to use the associated pretrained tokenizer: it will split
-the text you give it in tokens the same way for the pretraining corpus, and it will use the same correspondence
-token to index (that we usually call a _vocab_) as during pretraining.
-
-
-
-To automatically download the vocab used during pretraining or fine-tuning a given model, you can use the
-[`AutoTokenizer.from_pretrained`] method:
-
-```py
-from transformers import AutoTokenizer
-
-tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
-```
-
-## Base use
+* Preprocess textual data with a tokenizer.
+* Preprocess image or audio data with a feature extractor.
+* Preprocess data for a multimodal task with a processor.
+## NLP
-A [`PreTrainedTokenizer`] has many methods, but the only one you need to remember for preprocessing
-is its `__call__`: you just need to feed your sentence to your tokenizer object.
+The main tool for processing textual data is a [tokenizer](main_classes/tokenizer). A tokenizer starts by splitting text into *tokens* according to a set of rules. The tokens are converted into numbers, which are used to build tensors as input to a model. Any additional inputs required by a model are also added by the tokenizer.
+
+
+
+If you plan on using a pretrained model, it's important to use the associated pretrained tokenizer. This ensures the text is split the same way as the pretraining corpus, and uses the same token-to-index mapping (usually referred to as the *vocab*) as during pretraining.
+
+
+
+Get started quickly by loading a pretrained tokenizer with the [`AutoTokenizer`] class. This downloads the *vocab* used when a model is pretrained.
+
+### Tokenize
+
+Load a pretrained tokenizer with [`AutoTokenizer.from_pretrained`]:
```py
->>> encoded_input = tokenizer("Hello, I'm a single sentence!")
->>> print(encoded_input)
-{'input_ids': [101, 138, 18696, 155, 1942, 3190, 1144, 1572, 13745, 1104, 159, 9664, 2107, 102],
- 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
- 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
+>>> from transformers import AutoTokenizer
+
+>>> tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
```
-This returns a dictionary string to list of ints. The [input_ids](glossary#input-ids) are the indices corresponding
-to each token in our sentence. We will see below what the [attention_mask](glossary#attention-mask) is used for and
-in [the next section](#preprocessing-pairs-of-sentences) the goal of [token_type_ids](glossary#token-type-ids).
+Then pass your sentence to the tokenizer:
-The tokenizer can decode a list of token ids in a proper sentence:
+```py
+>>> encoded_input = tokenizer("Do not meddle in the affairs of wizards, for they are subtle and quick to anger.")
+>>> print(encoded_input)
+{'input_ids': [101, 2079, 2025, 19960, 10362, 1999, 1996, 3821, 1997, 16657, 1010, 2005, 2027, 2024, 11259, 1998, 4248, 2000, 4963, 1012, 102],
+ 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
+ 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
+```
+
+The tokenizer returns a dictionary with three important items:
+
+* [input_ids](glossary#input-ids) are the indices corresponding to each token in the sentence.
+* [attention_mask](glossary#attention-mask) indicates whether a token should be attended to or not.
+* [token_type_ids](glossary#token-type-ids) identifies which sequence a token belongs to when there is more than one sequence.
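
How the three fields relate can be illustrated without loading a tokenizer. The sketch below is a hypothetical helper, not part of 🤗 Transformers: it assumes BERT's `[CLS] A [SEP] B [SEP]` layout for a pair of sentences and takes `101` and `102` as the ids of `[CLS]` and `[SEP]`. The ids passed in for the two sentences are placeholders.

```python
def encode_pair(ids_a, ids_b, cls_id=101, sep_id=102):
    """Assemble tokenizer-style fields for two already-tokenized sentences (illustrative only)."""
    input_ids = [cls_id] + ids_a + [sep_id] + ids_b + [sep_id]
    # Segment 0 covers [CLS], sentence A and its [SEP];
    # segment 1 covers sentence B and the final [SEP].
    token_type_ids = [0] * (len(ids_a) + 2) + [1] * (len(ids_b) + 1)
    # Nothing is padded here, so every position should be attended to.
    attention_mask = [1] * len(input_ids)
    return {
        "input_ids": input_ids,
        "token_type_ids": token_type_ids,
        "attention_mask": attention_mask,
    }


pair = encode_pair([1731, 1385, 1132], [146, 112, 182, 127])
print(pair["token_type_ids"])  # [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
```

With a single sentence there is only one segment, which is why `token_type_ids` is all zeros in the output above.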
+
+You can decode the `input_ids` to return the original input:
```py
>>> tokenizer.decode(encoded_input["input_ids"])
-"[CLS] Hello, I'm a single sentence! [SEP]"
+'[CLS] Do not meddle in the affairs of wizards, for they are subtle and quick to anger. [SEP]'
```
-As you can see, the tokenizer automatically added some special tokens that the model expects. Not all models need
-special tokens; for instance, if we had used _gpt2-medium_ instead of _bert-base-cased_ to create our tokenizer, we
-would have seen the same sentence as the original one here. You can disable this behavior (which is only advised if you
-have added those special tokens yourself) by passing `add_special_tokens=False`.
+As you can see, the tokenizer added two special tokens - `CLS` and `SEP` (classifier and separator) - to the sentence. Not all models need
+special tokens, but if they do, the tokenizer will automatically add them for you.
-If you have several sentences you want to process, you can do this efficiently by sending them as a list to the
-tokenizer:
+If there are several sentences you want to process, pass the sentences as a list to the tokenizer:
```py
->>> batch_sentences = ["Hello I'm a single sentence", "And another sentence", "And the very very last one"]
+>>> batch_sentences = [
+... "But what about second breakfast?",
+... "Don't think he knows about second breakfast, Pip.",
+... "What about elevensies?",
+... ]
>>> encoded_inputs = tokenizer(batch_sentences)
>>> print(encoded_inputs)
-{'input_ids': [[101, 8667, 146, 112, 182, 170, 1423, 5650, 102],
- [101, 1262, 1330, 5650, 102],
- [101, 1262, 1103, 1304, 1304, 1314, 1141, 102]],
- 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0],
- [0, 0, 0, 0, 0],
- [0, 0, 0, 0, 0, 0, 0, 0]],
- 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1],
- [1, 1, 1, 1, 1],
- [1, 1, 1, 1, 1, 1, 1, 1]]}
+{'input_ids': [[101, 1252, 1184, 1164, 1248, 6462, 136, 102],
+ [101, 1790, 112, 189, 1341, 1119, 3520, 1164, 1248, 6462, 117, 21902, 1643, 119, 102],
+ [101, 1327, 1164, 5450, 23434, 136, 102]],
+ 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0],
+ [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
+ [0, 0, 0, 0, 0, 0, 0]],
+ 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1],
+ [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
+ [1, 1, 1, 1, 1, 1, 1]]}
```
-We get back a dictionary once again, this time with values being lists of lists of ints.
+### Pad
-If the purpose of sending several sentences at a time to the tokenizer is to build a batch to feed the model, you will
-probably want:
+This brings us to an important topic. When you process a batch of sentences, they aren't always the same length. This is a problem because tensors, the input to the model, need to have a uniform shape. Padding is a strategy for ensuring tensors are rectangular by adding a special *padding token* to sentences with fewer tokens.
-- To pad each sentence to the maximum length there is in your batch.
-- To truncate each sentence to the maximum length the model can accept (if applicable).
-- To return tensors.
-
-You can do all of this by using the following options when feeding your list of sentences to the tokenizer:
+Set the `padding` parameter to `True` to pad the shorter sequences in the batch to match the longest sequence:
```py
->>> batch = tokenizer(batch_sentences, padding=True, truncation=True, return_tensors="pt")
->>> print(batch)
-{'input_ids': tensor([[ 101, 8667, 146, 112, 182, 170, 1423, 5650, 102],
- [ 101, 1262, 1330, 5650, 102, 0, 0, 0, 0],
- [ 101, 1262, 1103, 1304, 1304, 1314, 1141, 102, 0]]),
+>>> batch_sentences = [
+... "But what about second breakfast?",
+... "Don't think he knows about second breakfast, Pip.",
+... "What about elevensies?",
+... ]
+>>> encoded_input = tokenizer(batch_sentences, padding=True)
+>>> print(encoded_input)
+{'input_ids': [[101, 1252, 1184, 1164, 1248, 6462, 136, 102, 0, 0, 0, 0, 0, 0, 0],
+ [101, 1790, 112, 189, 1341, 1119, 3520, 1164, 1248, 6462, 117, 21902, 1643, 119, 102],
+ [101, 1327, 1164, 5450, 23434, 136, 102, 0, 0, 0, 0, 0, 0, 0, 0]],
+ 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
+ [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
+ [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]],
+ 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0],
+ [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
+ [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]]}
+```
+
+Notice the tokenizer padded the first and third sentences with a `0` because they are shorter!
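
The padding step itself is simple enough to sketch in plain Python. The helper below is hypothetical (the name `pad_batch` and its `pad_id` parameter are assumptions, not the library's API); it pads every sequence to the length of the longest one and builds the matching attention mask, using the unpadded `input_ids` from the batch above.

```python
def pad_batch(batch_ids, pad_id=0):
    """Pad every sequence to the longest length in the batch (illustrative only)."""
    longest = max(len(ids) for ids in batch_ids)
    input_ids, attention_mask = [], []
    for ids in batch_ids:
        n_pad = longest - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = [
    [101, 1252, 1184, 1164, 1248, 6462, 136, 102],
    [101, 1790, 112, 189, 1341, 1119, 3520, 1164, 1248, 6462, 117, 21902, 1643, 119, 102],
    [101, 1327, 1164, 5450, 23434, 136, 102],
]
padded = pad_batch(batch)
print([len(ids) for ids in padded["input_ids"]])  # [15, 15, 15]
```

The attention mask is what lets the model tell padding apart from real tokens once all rows have the same length.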
+
+### Truncation
+
+On the other end of the spectrum, sometimes a sequence may be too long for a model to handle. In this case, you will need to truncate the sequence to a shorter length.
+
+Set the `truncation` parameter to `True` to truncate a sequence to the maximum length accepted by the model:
+
+```py
+>>> batch_sentences = [
+... "But what about second breakfast?",
+... "Don't think he knows about second breakfast, Pip.",
+... "What about elevensies?",
+... ]
+>>> encoded_input = tokenizer(batch_sentences, padding=True, truncation=True)
+>>> print(encoded_input)
+{'input_ids': [[101, 1252, 1184, 1164, 1248, 6462, 136, 102, 0, 0, 0, 0, 0, 0, 0],
+ [101, 1790, 112, 189, 1341, 1119, 3520, 1164, 1248, 6462, 117, 21902, 1643, 119, 102],
+ [101, 1327, 1164, 5450, 23434, 136, 102, 0, 0, 0, 0, 0, 0, 0, 0]],
+ 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
+ [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
+ [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]],
+ 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0],
+ [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
+ [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]]}
+```
+
+### Build tensors
+
+Finally, you want the tokenizer to return the actual tensors that are fed to the model.
+
+Set the `return_tensors` parameter to either `pt` for PyTorch, or `tf` for TensorFlow:
+
+```py
+>>> batch_sentences = [
+... "But what about second breakfast?",
+... "Don't think he knows about second breakfast, Pip.",
+... "What about elevensies?",
+... ]
+>>> encoded_input = tokenizer(batch_sentences, padding=True, truncation=True, return_tensors="pt")
+>>> print(encoded_input)
+{'input_ids': tensor([[ 101, 153, 7719, 21490, 1122, 1114, 9582, 1623, 102],
+ [ 101, 5226, 1122, 9649, 1199, 2610, 1236, 102, 0]]),
'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0],
- [0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0]]),
'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1],
- [1, 1, 1, 1, 1, 0, 0, 0, 0],
[1, 1, 1, 1, 1, 1, 1, 1, 0]])}
===PT-TF-SPLIT===
->>> batch = tokenizer(batch_sentences, padding=True, truncation=True, return_tensors="tf")
->>> print(batch)
-{'input_ids': tf.Tensor([[ 101, 8667, 146, 112, 182, 170, 1423, 5650, 102],
- [ 101, 1262, 1330, 5650, 102, 0, 0, 0, 0],
- [ 101, 1262, 1103, 1304, 1304, 1314, 1141, 102, 0]]),
- 'token_type_ids': tf.Tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0],
- [0, 0, 0, 0, 0, 0, 0, 0, 0],
- [0, 0, 0, 0, 0, 0, 0, 0, 0]]),
- 'attention_mask': tf.Tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1],
- [1, 1, 1, 1, 1, 0, 0, 0, 0],
- [1, 1, 1, 1, 1, 1, 1, 1, 0]])}
-```
-
-It returns a dictionary with string keys and tensor values. We can now see what the [attention_mask](glossary#attention-mask) is all about: it points out which tokens the model should pay attention to and which ones
-it should not (because they represent padding in this case).
-
-
-Note that if your model does not have a maximum length associated to it, the command above will throw a warning. You
-can safely ignore it. You can also pass `verbose=False` to stop the tokenizer from throwing those kinds of warnings.
-
-
-
-## Preprocessing pairs of sentences
-
-
-
-Sometimes you need to feed a pair of sentences to your model. For instance, if you want to classify if two sentences in
-a pair are similar, or for question-answering models, which take a context and a question. For BERT models, the input
-is then represented like this: `[CLS] Sequence A [SEP] Sequence B [SEP]`
-
-You can encode a pair of sentences in the format expected by your model by supplying the two sentences as two arguments
-(not a list since a list of two sentences will be interpreted as a batch of two single sentences, as we saw before).
-This will once again return a dict string to list of ints:
-
-```py
->>> encoded_input = tokenizer("How old are you?", "I'm 6 years old")
->>> print(encoded_input)
-{'input_ids': [101, 1731, 1385, 1132, 1128, 136, 102, 146, 112, 182, 127, 1201, 1385, 102],
- 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1],
- 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
-```
-
-This shows us what the [token_type_ids](glossary#token-type-ids) are for: they indicate to the model which part of
-the inputs correspond to the first sentence and which part corresponds to the second sentence. Note that
-_token_type_ids_ are not required or handled by all models. By default, a tokenizer will only return the inputs that
-its associated model expects. You can force the return (or the non-return) of any of those special arguments by using
-`return_input_ids` or `return_token_type_ids`.
-
-If we decode the token ids we obtained, we will see that the special tokens have been properly added.
-
-```py
->>> tokenizer.decode(encoded_input["input_ids"])
-"[CLS] How old are you? [SEP] I'm 6 years old [SEP]"
-```
-
-If you have a list of pairs of sequences you want to process, you should feed them as two lists to your tokenizer: the
-list of first sentences and the list of second sentences:
-
-```py
->>> batch_sentences = ["Hello I'm a single sentence", "And another sentence", "And the very very last one"]
->>> batch_of_second_sentences = [
-... "I'm a sentence that goes with the first sentence",
-... "And I should be encoded with the second sentence",
-... "And I go with the very last one",
+>>> batch_sentences = [
+... "But what about second breakfast?",
+... "Don't think he knows about second breakfast, Pip.",
+... "What about elevensies?",
... ]
->>> encoded_inputs = tokenizer(batch_sentences, batch_of_second_sentences)
->>> print(encoded_inputs)
-{'input_ids': [[101, 8667, 146, 112, 182, 170, 1423, 5650, 102, 146, 112, 182, 170, 5650, 1115, 2947, 1114, 1103, 1148, 5650, 102],
- [101, 1262, 1330, 5650, 102, 1262, 146, 1431, 1129, 12544, 1114, 1103, 1248, 5650, 102],
- [101, 1262, 1103, 1304, 1304, 1314, 1141, 102, 1262, 146, 1301, 1114, 1103, 1304, 1314, 1141, 102]],
-'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
- [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
- [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1]],
-'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
- [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
- [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]}
+>>> encoded_input = tokenizer(batch_sentences, padding=True, truncation=True, return_tensors="tf")
+>>> print(encoded_input)
+{'input_ids': ,
+ 'token_type_ids': ,
+ 'attention_mask': }
```
-As we can see, it returns a dictionary where each value is a list of lists of ints.
+## Audio
-To double-check what is fed to the model, we can decode each list in _input_ids_ one by one:
+Audio inputs are preprocessed differently than textual inputs, but the end goal remains the same: create numerical sequences the model can understand. A [feature extractor](main_classes/feature_extractor) is designed for the express purpose of extracting features from raw image or audio data and converting them into tensors. Before you begin, install 🤗 Datasets to load an audio dataset to experiment with:
+
+```bash
+pip install datasets
+```
+
+Load the keyword spotting task from the [SUPERB](https://huggingface.co/datasets/superb) benchmark (see the 🤗 [Datasets tutorial](https://huggingface.co/docs/datasets/load_hub.html) for more details on how to load a dataset):
```py
->>> for ids in encoded_inputs["input_ids"]:
-... print(tokenizer.decode(ids))
-[CLS] Hello I'm a single sentence [SEP] I'm a sentence that goes with the first sentence [SEP]
-[CLS] And another sentence [SEP] And I should be encoded with the second sentence [SEP]
-[CLS] And the very very last one [SEP] And I go with the very last one [SEP]
+>>> from datasets import load_dataset, Audio
+
+>>> dataset = load_dataset("superb", "ks")
```
-Once again, you can automatically pad your inputs to the maximum sentence length in the batch, truncate to the maximum
-length the model can accept and return tensors directly with the following:
+Access the first element of the `audio` column to take a look at the input. Calling the `audio` column will automatically load and resample the audio file:
```py
-batch = tokenizer(batch_sentences, batch_of_second_sentences, padding=True, truncation=True, return_tensors="pt")
-===PT-TF-SPLIT===
-batch = tokenizer(batch_sentences, batch_of_second_sentences, padding=True, truncation=True, return_tensors="tf")
+>>> dataset["train"][0]["audio"]
+{'array': array([ 0. , 0. , 0. , ..., -0.00592041,
+ -0.00405884, -0.00253296], dtype=float32),
+ 'path': '/root/.cache/huggingface/datasets/downloads/extracted/05734a36d88019a09725c20cc024e1c4e7982e37d7d55c0c1ca1742ea1cdd47f/_background_noise_/doing_the_dishes.wav',
+ 'sampling_rate': 16000}
```
+This returns three items:
+
+* `array` is the speech signal loaded - and potentially resampled - as a 1D array.
+* `path` points to the location of the audio file.
+* `sampling_rate` refers to how many data points in the speech signal are measured per second.
+
+### Resample
+
+For this tutorial, you will use the [Wav2Vec2](https://huggingface.co/facebook/wav2vec2-base) model. As you can see from the model card, the Wav2Vec2 model is pretrained on 16kHz sampled speech audio. It is important your audio data's sampling rate matches the sampling rate of the dataset used to pretrain the model. If your data's sampling rate isn't the same, then you need to resample your audio data.
+
+For example, load the [LJ Speech](https://huggingface.co/datasets/lj_speech) dataset which has a sampling rate of 22050Hz. In order to use the Wav2Vec2 model with this dataset, downsample the audio to 16kHz:
+
+```py
+>>> lj_speech = load_dataset("lj_speech", split="train")
+>>> lj_speech[0]["audio"]
+{'array': array([-7.3242188e-04, -7.6293945e-04, -6.4086914e-04, ...,
+ 7.3242188e-04, 2.1362305e-04, 6.1035156e-05], dtype=float32),
+ 'path': '/root/.cache/huggingface/datasets/downloads/extracted/917ece08c95cf0c4115e45294e3cd0dee724a1165b7fc11798369308a465bd26/LJSpeech-1.1/wavs/LJ001-0001.wav',
+ 'sampling_rate': 22050}
+```
+
+1. Use 🤗 Datasets' [`cast_column`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.cast_column) method to resample the audio to 16kHz:
+
+```py
+>>> lj_speech = lj_speech.cast_column("audio", Audio(sampling_rate=16_000))
+```
+
+2. Load the audio file:
+
+```py
+>>> lj_speech[0]["audio"]
+{'array': array([-0.00064146, -0.00074657, -0.00068768, ..., 0.00068341,
+ 0.00014045, 0. ], dtype=float32),
+ 'path': '/root/.cache/huggingface/datasets/downloads/extracted/917ece08c95cf0c4115e45294e3cd0dee724a1165b7fc11798369308a465bd26/LJSpeech-1.1/wavs/LJ001-0001.wav',
+ 'sampling_rate': 16000}
+```
+
+As you can see, the `sampling_rate` is now 16kHz. Now that you know how resampling works, let's return to our previous example with the SUPERB dataset!
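
To make the idea of resampling concrete, here is a deliberately naive sketch in plain Python. It is not what 🤗 Datasets does internally (real resamplers filter the signal to avoid aliasing); the function name `resample_linear` is hypothetical, and linear interpolation is swapped in only because it is easy to read. It shows the essential effect: one second of audio at 22050 samples per second becomes one second at 16000 samples per second.

```python
def resample_linear(signal, source_rate, target_rate):
    """Naive resampling by linear interpolation (illustrative only, no anti-aliasing)."""
    if source_rate == target_rate:
        return list(signal)
    # The duration stays the same, so the number of samples scales with the rate.
    n_out = len(signal) * target_rate // source_rate
    step = source_rate / target_rate
    resampled = []
    for i in range(n_out):
        pos = i * step  # position of output sample i on the input time axis
        left = int(pos)
        right = min(left + 1, len(signal) - 1)
        frac = pos - left
        resampled.append(signal[left] * (1 - frac) + signal[right] * frac)
    return resampled


# One second of a simple ramp sampled at 22050 samples per second.
one_second = [i / 22050 for i in range(22050)]
downsampled = resample_linear(one_second, 22050, 16000)
print(len(downsampled))  # 16000
```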
+
+### Feature extractor
+
+The next step is to load a feature extractor to normalize and pad the input. When padding textual data, a `0` is added for shorter sequences. The same idea applies to audio data, and the audio feature extractor will add a `0` - interpreted as silence - to `array`.
+
+Load the feature extractor with [`AutoFeatureExtractor.from_pretrained`]:
+
+```py
+>>> from transformers import AutoFeatureExtractor
+
+>>> feature_extractor = AutoFeatureExtractor.from_pretrained("facebook/wav2vec2-base")
+```
+
+Pass the audio `array` to the feature extractor. We also recommend passing the `sampling_rate` argument to the feature extractor in order to better debug any silent errors that may occur.
+
+```py
+>>> audio_input = [dataset["train"][0]["audio"]["array"]]
+>>> feature_extractor(audio_input, sampling_rate=16000)
+{'input_values': [array([ 0.00045439, 0.00045439, 0.00045439, ..., -0.1578519 , -0.10807519, -0.06727459], dtype=float32)]}
+```
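+
+The values in `input_values` differ from the raw `array` because the input has been normalized. As a rough sketch of the idea, assuming the normalization is a zero-mean, unit-variance scaling of each waveform, it could look like the following. The helper name `zero_mean_unit_variance` and the `eps` value are assumptions made for this example, not the library's API.
+
+```py
+import math
+
+
+def zero_mean_unit_variance(samples, eps=1e-7):
+    """Scale a waveform to zero mean and (approximately) unit variance."""
+    mean = sum(samples) / len(samples)
+    variance = sum((x - mean) ** 2 for x in samples) / len(samples)
+    # eps (an assumed value) guards against division by zero on silent clips.
+    scale = math.sqrt(variance + eps)
+    return [(x - mean) / scale for x in samples]
+
+
+waveform = [0.5, 1.0, 1.5, 1.0]
+normalized = zero_mean_unit_variance(waveform)
+new_mean = sum(normalized) / len(normalized)
+new_variance = sum((x - new_mean) ** 2 for x in normalized) / len(normalized)
+print(abs(new_mean) < 1e-6, abs(new_variance - 1) < 1e-5)  # True True
+```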
+
+### Pad and truncate
+
+Just like the tokenizer, you can apply padding or truncation to handle variable-length sequences in a batch. Take a look at the sequence length of these two audio samples:
+
+```py
+>>> dataset["train"][0]["audio"]["array"].shape
+(1522930,)
+
+>>> dataset["train"][1]["audio"]["array"].shape
+(988891,)
+```
+
+As you can see, the first sample has a longer sequence than the second sample. Let's create a function that will preprocess the dataset. Specify a maximum sample length, and the feature extractor will either pad or truncate the sequences to match it:
+
+```py
+>>> def preprocess_function(examples):
+... audio_arrays = [x["array"] for x in examples["audio"]]
+... inputs = feature_extractor(
+... audio_arrays,
+... sampling_rate=16000,
+... padding=True,
+... max_length=1000000,
+... truncation=True,
+... )
+... return inputs
+```
+
+Apply the function to the first few examples in the dataset:
+
+```py
+>>> processed_dataset = preprocess_function(dataset["train"][:5])
+```
+
+Now take another look at the processed sample lengths:
+
+```py
+>>> processed_dataset["input_values"][0].shape
+(1000000,)
+
+>>> processed_dataset["input_values"][1].shape
+(1000000,)
+```
+
+The lengths of the first two samples now match the maximum length you specified.
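+
+Conceptually, bringing every sequence to the same length comes down to cutting long sequences and filling short ones with zeros (silence). The sketch below illustrates that idea on plain Python lists. It is not the feature extractor's implementation; `pad_or_truncate` is a hypothetical helper, and padding on the right is an assumption made for this example.
+
+```py
+def pad_or_truncate(samples, max_length, padding_value=0.0):
+    """Return a copy of `samples` that is exactly `max_length` long."""
+    if len(samples) >= max_length:
+        # Too long: keep only the first `max_length` values.
+        return samples[:max_length]
+    # Too short: append zeros, which are interpreted as silence.
+    return samples + [padding_value] * (max_length - len(samples))
+
+
+batch = [[0.1] * 7, [0.2] * 3]
+fixed = [pad_or_truncate(x, max_length=5) for x in batch]
+print([len(x) for x in fixed])  # [5, 5]
+print(fixed[1])  # [0.2, 0.2, 0.2, 0.0, 0.0]
+```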
+
+## Vision
+
+A feature extractor is also used to process images for vision tasks. Once again, the goal is to convert the raw image into a batch of tensors as input.
+
+Let's load the [food101](https://huggingface.co/datasets/food101) dataset for this tutorial. Use 🤗 Datasets `split` parameter to only load a small sample from the training split since the dataset is quite large:
+
+```py
+>>> from datasets import load_dataset
+
+>>> dataset = load_dataset("food101", split="train[:100]")
+```
+
+Next, take a look at the image with 🤗 Datasets [`Image`](https://huggingface.co/docs/datasets/package_reference/main_classes.html?highlight=image#datasets.Image) feature:
+
+```py
+>>> dataset[0]["image"]
+```
+
+
+
+### Feature extractor
+
+Load the feature extractor with [`AutoFeatureExtractor.from_pretrained`]:
+
+```py
+>>> from transformers import AutoFeatureExtractor
+
+>>> feature_extractor = AutoFeatureExtractor.from_pretrained("google/vit-base-patch16-224")
+```
+
+### Data augmentation
+
+For vision tasks, it is common to add some type of data augmentation to the images as a part of preprocessing. You can add augmentations with any library you'd like, but in this tutorial, you will use torchvision's [`transforms`](https://pytorch.org/vision/stable/transforms.html) module.
+
+1. Normalize the image and use [`Compose`](https://pytorch.org/vision/master/generated/torchvision.transforms.Compose.html) to chain some transforms - [`RandomResizedCrop`](https://pytorch.org/vision/main/generated/torchvision.transforms.RandomResizedCrop.html) and [`ColorJitter`](https://pytorch.org/vision/main/generated/torchvision.transforms.ColorJitter.html) - together:
+
+```py
+>>> from torchvision.transforms import Compose, Normalize, RandomResizedCrop, ColorJitter, ToTensor
+
+>>> normalize = Normalize(mean=feature_extractor.image_mean, std=feature_extractor.image_std)
+>>> _transforms = Compose(
+... [RandomResizedCrop(feature_extractor.size), ColorJitter(brightness=0.5, hue=0.5), ToTensor(), normalize]
+... )
+```
+
+2. The model accepts [`pixel_values`](model_doc/visionencoderdecoder#transformers.VisionEncoderDecoderModel.forward.pixel_values) as its input. This value is generated by the feature extractor. Create a function that generates `pixel_values` from the transforms:
+
+```py
+>>> def transforms(examples):
+... examples["pixel_values"] = [_transforms(image.convert("RGB")) for image in examples["image"]]
+... return examples
+```
+
+3. Then use 🤗 Datasets [`set_transform`](https://huggingface.co/docs/datasets/process.html#format-transform) to apply the transforms on-the-fly:
+
+```py
+>>> dataset.set_transform(transforms)
+```
+
+4. Now when you access an example from the dataset, you will notice the transforms have added the model input `pixel_values`:
+
+```py
+>>> dataset[0]
+{'image': ,
+ 'label': 6,
+ 'pixel_values': tensor([[[ 0.0353, 0.0745, 0.1216, ..., -0.9922, -0.9922, -0.9922],
+ [-0.0196, 0.0667, 0.1294, ..., -0.9765, -0.9843, -0.9922],
+ [ 0.0196, 0.0824, 0.1137, ..., -0.9765, -0.9686, -0.8667],
+ ...,
+ [ 0.0275, 0.0745, 0.0510, ..., -0.1137, -0.1216, -0.0824],
+ [ 0.0667, 0.0824, 0.0667, ..., -0.0588, -0.0745, -0.0980],
+ [ 0.0353, 0.0353, 0.0431, ..., -0.0039, -0.0039, -0.0588]],
+
+ [[ 0.2078, 0.2471, 0.2863, ..., -0.9451, -0.9373, -0.9451],
+ [ 0.1608, 0.2471, 0.3098, ..., -0.9373, -0.9451, -0.9373],
+ [ 0.2078, 0.2706, 0.3020, ..., -0.9608, -0.9373, -0.8275],
+ ...,
+ [-0.0353, 0.0118, -0.0039, ..., -0.2392, -0.2471, -0.2078],
+ [ 0.0196, 0.0353, 0.0196, ..., -0.1843, -0.2000, -0.2235],
+ [-0.0118, -0.0039, -0.0039, ..., -0.0980, -0.0980, -0.1529]],
+
+ [[ 0.3961, 0.4431, 0.4980, ..., -0.9216, -0.9137, -0.9216],
+ [ 0.3569, 0.4510, 0.5216, ..., -0.9059, -0.9137, -0.9137],
+ [ 0.4118, 0.4745, 0.5216, ..., -0.9137, -0.8902, -0.7804],
+ ...,
+ [-0.2314, -0.1922, -0.2078, ..., -0.4196, -0.4275, -0.3882],
+ [-0.1843, -0.1686, -0.2000, ..., -0.3647, -0.3804, -0.4039],
+ [-0.1922, -0.1922, -0.1922, ..., -0.2941, -0.2863, -0.3412]]])}
+```
+
+Here is what the image looks like after you preprocess it. Just as you'd expect from the applied transforms, the image has been randomly cropped and its color properties are different.
+
+```py
+>>> import numpy as np
+>>> import matplotlib.pyplot as plt
+
+>>> img = dataset[0]["pixel_values"]
+>>> plt.imshow(img.permute(1, 2, 0))
+```
+
+
+
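+The negative numbers in `pixel_values` come from the normalization step, which shifts and scales each color channel. Since matplotlib clips floating point RGB values to the range `[0, 1]` when plotting, it can help to undo the normalization before displaying an image. Here is a minimal sketch on a single RGB pixel. The mean and standard deviation of `0.5` per channel are assumed values for illustration (in practice, read them from `feature_extractor.image_mean` and `feature_extractor.image_std`), and both helpers are hypothetical.
+
+```py
+def normalize(pixel, mean, std):
+    """Apply per-channel normalization: (value - mean) / std."""
+    return [(v - m) / s for v, m, s in zip(pixel, mean, std)]
+
+
+def denormalize(pixel, mean, std):
+    """Invert the normalization: value * std + mean."""
+    return [v * s + m for v, m, s in zip(pixel, mean, std)]
+
+
+mean = [0.5, 0.5, 0.5]  # assumed per-channel mean
+std = [0.5, 0.5, 0.5]  # assumed per-channel standard deviation
+pixel = [0.0, 0.5, 1.0]  # one RGB pixel as produced by ToTensor, in [0, 1]
+
+normalized = normalize(pixel, mean, std)
+print(normalized)  # [-1.0, 0.0, 1.0]
+print(denormalize(normalized, mean, std))  # [0.0, 0.5, 1.0]
+```
+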
+## Multimodal
+
+For multimodal tasks, you will use a combination of everything you've learned so far and apply your skills to an automatic speech recognition (ASR) task. This means you will need a:
+
+* Feature extractor to preprocess the audio data.
+* Tokenizer to process the text.
+
+Let's return to the [LJ Speech](https://huggingface.co/datasets/lj_speech) dataset:
+
+```py
+>>> from datasets import load_dataset, Audio
+
+>>> lj_speech = load_dataset("lj_speech", split="train")
+```
+
+Since you are mainly interested in the `audio` and `text` columns, remove the other columns:
+
+```py
+>>> lj_speech = lj_speech.map(remove_columns=["file", "id", "normalized_text"])
+```
+
+Now take a look at the `audio` and `text` columns:
+
+```py
+>>> lj_speech[0]["audio"]
+{'array': array([-7.3242188e-04, -7.6293945e-04, -6.4086914e-04, ...,
+ 7.3242188e-04, 2.1362305e-04, 6.1035156e-05], dtype=float32),
+ 'path': '/root/.cache/huggingface/datasets/downloads/extracted/917ece08c95cf0c4115e45294e3cd0dee724a1165b7fc11798369308a465bd26/LJSpeech-1.1/wavs/LJ001-0001.wav',
+ 'sampling_rate': 22050}
+
+>>> lj_speech[0]["text"]
+'Printing, in the only sense with which we are at present concerned, differs from most if not from all the arts and crafts represented in the Exhibition'
+```
+
+Remember from the earlier section on processing audio data that you should always [resample](preprocessing#audio) your audio data so its sampling rate matches the sampling rate of the dataset used to pretrain the model:
+
+```py
+>>> lj_speech = lj_speech.cast_column("audio", Audio(sampling_rate=16_000))
+```
+
+### Processor
+
+A processor combines a feature extractor and tokenizer. Load a processor with [`AutoProcessor.from_pretrained`]:
+
+```py
+>>> from transformers import AutoProcessor
+
+>>> processor = AutoProcessor.from_pretrained("facebook/wav2vec2-base-960h")
+```
+
+1. Create a function that processes the audio data into `input_values` and tokenizes the text into `labels`. These are your inputs to the model:
+
+```py
+>>> def prepare_dataset(example):
+... audio = example["audio"]
+
+... example["input_values"] = processor(audio["array"], sampling_rate=16000)
+
+... with processor.as_target_processor():
+... example["labels"] = processor(example["text"]).input_ids
+... return example
+```
+
+2. Apply the `prepare_dataset` function to a sample:
+
+```py
+>>> prepare_dataset(lj_speech[0])
+```
+
+Notice the processor has added `input_values` and `labels`. The audio has also been correctly downsampled to 16kHz.
+
+Awesome, you should now be able to preprocess data for any modality and even combine different modalities! In the next tutorial, learn how to fine-tune a model on your newly preprocessed data.
+
## Everything you always wanted to know about padding and truncation
We have seen the commands that will work for most cases (pad your batch to the length of the maximum sentence and
@@ -273,76 +548,3 @@ any of the following examples, you can replace `truncation=True` by a `STRATEGY`
| | padding to max model input length | Not possible |
| | padding to specific length | `tokenizer(batch_sentences, padding='max_length', truncation=True, max_length=42)` or |
| | | `tokenizer(batch_sentences, padding='max_length', truncation=STRATEGY, max_length=42)` |
-
-## Pre-tokenized inputs
-
-The tokenizer also accept pre-tokenized inputs. This is particularly useful when you want to compute labels and extract
-predictions in [named entity recognition (NER)](https://en.wikipedia.org/wiki/Named-entity_recognition) or
-[part-of-speech tagging (POS tagging)](https://en.wikipedia.org/wiki/Part-of-speech_tagging).
-
-
-
-Pre-tokenized does not mean your inputs are already tokenized (you wouldn't need to pass them through the tokenizer
-if that was the case) but just split into words (which is often the first step in subword tokenization algorithms
-like BPE).
-
-
-
-If you want to use pre-tokenized inputs, just set `is_split_into_words=True` when passing your inputs to the
-tokenizer. For instance, we have:
-
-```py
->>> encoded_input = tokenizer(["Hello", "I'm", "a", "single", "sentence"], is_split_into_words=True)
->>> print(encoded_input)
-{'input_ids': [101, 8667, 146, 112, 182, 170, 1423, 5650, 102],
- 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0],
- 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]}
-```
-
-Note that the tokenizer still adds the ids of special tokens (if applicable) unless you pass
-`add_special_tokens=False`.
-
-This works exactly as before for batch of sentences or batch of pairs of sentences. You can encode a batch of sentences
-like this:
-
-```py
-batch_sentences = [
- ["Hello", "I'm", "a", "single", "sentence"],
- ["And", "another", "sentence"],
- ["And", "the", "very", "very", "last", "one"],
-]
-encoded_inputs = tokenizer(batch_sentences, is_split_into_words=True)
-```
-
-or a batch of pair sentences like this:
-
-```py
-batch_of_second_sentences = [
- ["I'm", "a", "sentence", "that", "goes", "with", "the", "first", "sentence"],
- ["And", "I", "should", "be", "encoded", "with", "the", "second", "sentence"],
- ["And", "I", "go", "with", "the", "very", "last", "one"],
-]
-encoded_inputs = tokenizer(batch_sentences, batch_of_second_sentences, is_split_into_words=True)
-```
-
-And you can add padding, truncation as well as directly return tensors like before:
-
-```py
-batch = tokenizer(
- batch_sentences,
- batch_of_second_sentences,
- is_split_into_words=True,
- padding=True,
- truncation=True,
- return_tensors="pt",
-)
-===PT-TF-SPLIT===
-batch = tokenizer(
- batch_sentences,
- batch_of_second_sentences,
- is_split_into_words=True,
- padding=True,
- truncation=True,
- return_tensors="tf",
-)
-```