Doc styler examples (#14953)

* Fix bad examples
* Add black formatting to style_doc
* Use first nonempty line
* Put it at the right place
* Don't add spaces to empty lines
* Better templates
* Deal with triple quotes in docstrings
* Result of style_doc
* Enable mdx treatment and fix code examples in MDXs
* Result of doc styler on doc source files
* Last fixes
* Break copy from
2021-12-27 19:07:46 -05:00
parent e13f72fbff
commit b5e2b183af
211 changed files with 2738 additions and 1711 deletions
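Many hunks below rewrap long doctest-style examples: the first line of a statement keeps the `>>> ` prompt, wrapped lines get `... `, and blank lines between statements stay empty. A minimal sketch of that prompt step follows, assuming the code has already been formatted. The helper name `add_doctest_prompts` is hypothetical and this is not the actual `style_doc` implementation. Emitting a bare `...` for blank lines inside a multi-line statement is likewise an assumption, chosen so the result stays valid doctest input without trailing spaces.

```python
import ast


def add_doctest_prompts(code: str) -> str:
    """Prefix statements with '>>> ' and their continuation lines with '... '.

    Hypothetical helper for illustration only; requires Python 3.8+ (end_lineno).
    """
    # Collect the 1-based numbers of every line that continues a top-level statement.
    continuation = set()
    for node in ast.parse(code).body:
        continuation.update(range(node.lineno + 1, node.end_lineno + 1))

    styled = []
    for number, line in enumerate(code.splitlines(), start=1):
        if not line.strip():
            # Never emit trailing spaces: a bare "..." inside a statement, nothing between statements.
            styled.append("..." if number in continuation else "")
        elif number in continuation:
            styled.append("... " + line)
        else:
            styled.append(">>> " + line)
    return "\n".join(styled)
```

Applied to a call wrapped over three lines and followed by a blank line and a second statement, this yields one `>>> ` line, two `... ` lines, an empty line, and another `>>> ` line, which is the shape of the rewrapped `PyTorchBenchmarkArguments(...)` examples in `benchmarks.mdx`.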
--- a/docs/source/add_new_model.mdx
+++ b/docs/source/add_new_model.mdx
@@ -267,7 +267,7 @@ single forward pass using a dummy integer vector of input IDs as an input. Such
 pseudocode):

 ```python
-model = BrandNewBertModel.load_pretrained_checkpoint(/path/to/checkpoint/)
+model = BrandNewBertModel.load_pretrained_checkpoint("/path/to/checkpoint/")
 input_ids = [0, 4, 5, 2, 3, 7, 9]  # vector of input ids
 original_output = model.predict(input_ids)
 ```
@@ -476,6 +476,7 @@ following command should work:

 ```python
 from transformers import BrandNewBertModel, BrandNewBertConfig
+
 model = BrandNewBertModel(BrandNewBertConfig())
 ```

@@ -502,12 +503,13 @@ PyTorch, called `SimpleModel` as follows:
 ```python
 from torch import nn

+
 class SimpleModel(nn.Module):
    def __init__(self):
-            super().__init__()
-            self.dense = nn.Linear(10, 10)
-            self.intermediate = nn.Linear(10, 10)
-            self.layer_norm = nn.LayerNorm(10)
+        super().__init__()
+        self.dense = nn.Linear(10, 10)
+        self.intermediate = nn.Linear(10, 10)
+        self.layer_norm = nn.LayerNorm(10)
 ```

 Now we can create an instance of this model definition which will fill all weights: `dense`, `intermediate`,
@@ -565,7 +567,7 @@ In the conversion script, you should fill those randomly initialized weights wit
 corresponding layer in the checkpoint. *E.g.*

 ```python
-# retrieve matching layer weights, e.g. by 
+# retrieve matching layer weights, e.g. by
 # recursive algorithm
 layer_name = "dense"
 pretrained_weight = array_of_dense_layer
@@ -622,7 +624,7 @@ pass of the model using the original repository. Now you should write an analogo
 implementation instead of the original one. It should look as follows:

 ```python
-model = BrandNewBertModel.from_pretrained(/path/to/converted/checkpoint/folder)
+model = BrandNewBertModel.from_pretrained("/path/to/converted/checkpoint/folder")
 input_ids = [0, 4, 4, 3, 2, 4, 1, 7, 19]
 output = model(input_ids).last_hidden_states
 ```
@@ -668,7 +670,7 @@ fully comply with the required design. To make sure, the implementation is fully
 common tests should pass. The Cookiecutter should have automatically added a test file for your model, probably under
 the same `tests/test_modeling_brand_new_bert.py`. Run this test file to verify that all common tests pass:

-```python
+```bash
 pytest tests/test_modeling_brand_new_bert.py
 ```

@@ -714,7 +716,7 @@ that inputs a string and returns the `input_ids``. It could look similar to this

 ```python
 input_str = "This is a long example input string containing special characters .$?-, numbers 2872 234 12 and words."
-model = BrandNewBertModel.load_pretrained_checkpoint(/path/to/checkpoint/)
+model = BrandNewBertModel.load_pretrained_checkpoint("/path/to/checkpoint/")
 input_ids = model.tokenize(input_str)
 ```

@@ -725,9 +727,10 @@ created. It should look similar to this:

 ```python
 from transformers import BrandNewBertTokenizer
+
 input_str = "This is a long example input string containing special characters .$?-, numbers 2872 234 12 and words."

-tokenizer = BrandNewBertTokenizer.from_pretrained(/path/to/tokenizer/folder/)
+tokenizer = BrandNewBertTokenizer.from_pretrained("/path/to/tokenizer/folder/")

 input_ids = tokenizer(input_str).input_ids
 ```
--- a/docs/source/add_new_pipeline.mdx
+++ b/docs/source/add_new_pipeline.mdx
@@ -26,6 +26,7 @@ Start by inheriting the base class `Pipeline`. with the 4 methods needed to impl
 ```python
 from transformers import Pipeline

+
 class MyPipeline(Pipeline):
    def _sanitize_parameters(self, **kwargs):
        preprocess_kwargs = {}
@@ -34,7 +35,7 @@ class MyPipeline(Pipeline):
        return preprocess_kwargs, {}, {}

    def preprocess(self, inputs, maybe_arg=2):
-        model_input = Tensor(....)
+        model_input = Tensor(inputs["input_ids"])
        return {"model_input": model_input}

    def _forward(self, model_inputs):
@@ -90,6 +91,7 @@ def postprocess(self, model_outputs, top_k=5):
    # Add logic to handle top_k
    return best_class

+
 def _sanitize_parameters(self, **kwargs):
    preprocess_kwargs = {}
    if "maybe_arg" in kwargs:
--- a/docs/source/benchmarks.mdx
+++ b/docs/source/benchmarks.mdx
@@ -37,11 +37,12 @@ The benchmark classes [`PyTorchBenchmark`] and [`TensorFlowBenchmark`] expect an

 >>> args = PyTorchBenchmarkArguments(models=["bert-base-uncased"], batch_sizes=[8], sequence_lengths=[8, 32, 128, 512])
 >>> benchmark = PyTorchBenchmark(args)
-
 ===PT-TF-SPLIT===
 >>> from transformers import TensorFlowBenchmark, TensorFlowBenchmarkArguments

->>> args = TensorFlowBenchmarkArguments(models=["bert-base-uncased"], batch_sizes=[8], sequence_lengths=[8, 32, 128, 512])
+>>> args = TensorFlowBenchmarkArguments(
+...     models=["bert-base-uncased"], batch_sizes=[8], sequence_lengths=[8, 32, 128, 512]
+... )
 >>> benchmark = TensorFlowBenchmark(args)
 ```

@@ -174,7 +175,9 @@ configurations must be inserted with the benchmark args as follows.
 ```py
 >>> from transformers import PyTorchBenchmark, PyTorchBenchmarkArguments, BertConfig

->>> args = PyTorchBenchmarkArguments(models=["bert-base", "bert-384-hid", "bert-6-lay"], batch_sizes=[8], sequence_lengths=[8, 32, 128, 512])
+>>> args = PyTorchBenchmarkArguments(
+...     models=["bert-base", "bert-384-hid", "bert-6-lay"], batch_sizes=[8], sequence_lengths=[8, 32, 128, 512]
+... )
 >>> config_base = BertConfig()
 >>> config_384_hid = BertConfig(hidden_size=384)
 >>> config_6_lay = BertConfig(num_hidden_layers=6)
@@ -244,7 +247,9 @@ bert-6-lay                 8              512            1359
 ===PT-TF-SPLIT===
 >>> from transformers import TensorFlowBenchmark, TensorFlowBenchmarkArguments, BertConfig

->>> args = TensorFlowBenchmarkArguments(models=["bert-base", "bert-384-hid", "bert-6-lay"], batch_sizes=[8], sequence_lengths=[8, 32, 128, 512])
+>>> args = TensorFlowBenchmarkArguments(
+...     models=["bert-base", "bert-384-hid", "bert-6-lay"], batch_sizes=[8], sequence_lengths=[8, 32, 128, 512]
+... )
 >>> config_base = BertConfig()
 >>> config_384_hid = BertConfig(hidden_size=384)
 >>> config_6_lay = BertConfig(num_hidden_layers=6)
--- a/docs/source/custom_datasets.mdx
+++ b/docs/source/custom_datasets.mdx
@@ -54,6 +54,7 @@ The 🤗 Datasets library makes it simple to load a dataset:

 ```python
 from datasets import load_dataset
+
 imdb = load_dataset("imdb")
 ```

@@ -61,8 +62,9 @@ This loads a `DatasetDict` object which you can index into to view an example:

 ```python
 imdb["train"][0]
-{'label': 1,
- 'text': 'Bromwell High is a cartoon comedy. It ran at the same time as some other programs about school life, such as "Teachers". My 35 years in the teaching profession lead me to believe that Bromwell High\'s satire is much closer to reality than is "Teachers". The scramble to survive financially, the insightful students who can see right through their pathetic teachers\' pomp, the pettiness of the whole situation, all remind me of the schools I knew and their students. When I saw the episode in which a student repeatedly tried to burn down the school, I immediately recalled ......... at .......... High. A classic line: INSPECTOR: I\'m here to sack one of your teachers. STUDENT: Welcome to Bromwell High. I expect that many adults of my age think that Bromwell High is far fetched. What a pity that it isn\'t!'
+{
+    "label": 1,
+    "text": "Bromwell High is a cartoon comedy. It ran at the same time as some other programs about school life, such as \"Teachers\". My 35 years in the teaching profession lead me to believe that Bromwell High's satire is much closer to reality than is \"Teachers\". The scramble to survive financially, the insightful students who can see right through their pathetic teachers' pomp, the pettiness of the whole situation, all remind me of the schools I knew and their students. When I saw the episode in which a student repeatedly tried to burn down the school, I immediately recalled ......... at .......... High. A classic line: INSPECTOR: I'm here to sack one of your teachers. STUDENT: Welcome to Bromwell High. I expect that many adults of my age think that Bromwell High is far fetched. What a pity that it isn't!",
 }
 ```

@@ -74,6 +76,7 @@ model was trained with to ensure appropriately tokenized words. Load the DistilB

 ```python
 from transformers import AutoTokenizer
+
 tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
 ```

@@ -99,6 +102,7 @@ batch. This is known as **dynamic padding**. You can do this with the `DataColla

 ```python
 from transformers import DataCollatorWithPadding
+
 data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
 ```

@@ -108,6 +112,7 @@ Now load your model with the [`AutoModelForSequenceClassification`] class along

 ```python
 from transformers import AutoModelForSequenceClassification
+
 model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2)
 ```

@@ -121,7 +126,7 @@ At this point, only three steps remain:
 from transformers import TrainingArguments, Trainer

 training_args = TrainingArguments(
-    output_dir='./results',
+    output_dir="./results",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
@@ -150,6 +155,7 @@ Make sure you set `return_tensors="tf"` to return `tf.Tensor` outputs instead of

 ```python
 from transformers import DataCollatorWithPadding
+
 data_collator = DataCollatorWithPadding(tokenizer, return_tensors="tf")
 ```

@@ -158,14 +164,14 @@ Next, convert your datasets to the `tf.data.Dataset` format with `to_tf_dataset`

 ```python
 tf_train_dataset = tokenized_imdb["train"].to_tf_dataset(
-    columns=['attention_mask', 'input_ids', 'label'],
+    columns=["attention_mask", "input_ids", "label"],
    shuffle=True,
    batch_size=16,
    collate_fn=data_collator,
 )

 tf_validation_dataset = tokenized_imdb["train"].to_tf_dataset(
-    columns=['attention_mask', 'input_ids', 'label'],
+    columns=["attention_mask", "input_ids", "label"],
    shuffle=False,
    batch_size=16,
    collate_fn=data_collator,
@@ -182,17 +188,14 @@ batch_size = 16
 num_epochs = 5
 batches_per_epoch = len(tokenized_imdb["train"]) // batch_size
 total_train_steps = int(batches_per_epoch * num_epochs)
-optimizer, schedule = create_optimizer(
-    init_lr=2e-5, 
-    num_warmup_steps=0, 
-    num_train_steps=total_train_steps
-)
+optimizer, schedule = create_optimizer(init_lr=2e-5, num_warmup_steps=0, num_train_steps=total_train_steps)
 ```

 Load your model with the [`TFAutoModelForSequenceClassification`] class along with the number of expected labels:

 ```python
 from transformers import TFAutoModelForSequenceClassification
+
 model = TFAutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2)
 ```

@@ -200,6 +203,7 @@ Compile the model:

 ```python
 import tensorflow as tf
+
 model.compile(optimizer=optimizer)
 ```

@@ -234,14 +238,15 @@ or [TensorFlow notebook](https://colab.research.google.com/github/huggingface/no
 Load the WNUT 17 dataset from the 🤗 Datasets library:

 ```python
-from datasets import load_dataset
-wnut = load_dataset("wnut_17")
+>>> from datasets import load_dataset
+
+>>> wnut = load_dataset("wnut_17")
 ```

 A quick look at the dataset shows the labels associated with each word in the sentence:

 ```python
-wnut["train"][0]
+>>> wnut["train"][0]
 {'id': '0',
 'ner_tags': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 7, 8, 8, 0, 7, 0, 0, 0, 0, 0, 0, 0, 0],
 'tokens': ['@paulwalk', 'It', "'s", 'the', 'view', 'from', 'where', 'I', "'m", 'living', 'for', 'two', 'weeks', '.', 'Empire', 'State', 'Building', '=', 'ESB', '.', 'Pretty', 'bad', 'storm', 'here', 'last', 'evening', '.']
@@ -251,21 +256,22 @@ wnut["train"][0]
 View the specific NER tags by:

 ```python
-label_list = wnut["train"].features[f"ner_tags"].feature.names
-label_list
-['O',
- 'B-corporation',
- 'I-corporation',
- 'B-creative-work',
- 'I-creative-work',
- 'B-group',
- 'I-group',
- 'B-location',
- 'I-location',
- 'B-person',
- 'I-person',
- 'B-product',
- 'I-product'
+>>> label_list = wnut["train"].features[f"ner_tags"].feature.names
+>>> label_list
+[
+    "O",
+    "B-corporation",
+    "I-corporation",
+    "B-creative-work",
+    "I-creative-work",
+    "B-group",
+    "I-group",
+    "B-location",
+    "I-location",
+    "B-person",
+    "I-person",
+    "B-product",
+    "I-product",
 ]
 ```

@@ -282,6 +288,7 @@ Now you need to tokenize the text. Load the DistilBERT tokenizer with an [`AutoT

 ```python
 from transformers import AutoTokenizer
+
 tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
 ```

@@ -289,9 +296,9 @@ Since the input has already been split into words, set `is_split_into_words=True
 subwords:

 ```python
-tokenized_input = tokenizer(example["tokens"], is_split_into_words=True)
-tokens = tokenizer.convert_ids_to_tokens(tokenized_input["input_ids"])
-tokens
+>>> tokenized_input = tokenizer(example["tokens"], is_split_into_words=True)
+>>> tokens = tokenizer.convert_ids_to_tokens(tokenized_input["input_ids"])
+>>> tokens
 ['[CLS]', '@', 'paul', '##walk', 'it', "'", 's', 'the', 'view', 'from', 'where', 'i', "'", 'm', 'living', 'for', 'two', 'weeks', '.', 'empire', 'state', 'building', '=', 'es', '##b', '.', 'pretty', 'bad', 'storm', 'here', 'last', 'evening', '.', '[SEP]']
 ```

@@ -314,10 +321,10 @@ def tokenize_and_align_labels(examples):
        word_ids = tokenized_inputs.word_ids(batch_index=i)  # Map tokens to their respective word.
        previous_word_idx = None
        label_ids = []
-        for word_idx in word_ids:                            # Set the special tokens to -100.
+        for word_idx in word_ids:  # Set the special tokens to -100.
            if word_idx is None:
                label_ids.append(-100)
-            elif word_idx != previous_word_idx:              # Only label the first token of a given word.
+            elif word_idx != previous_word_idx:  # Only label the first token of a given word.
                label_ids.append(label[word_idx])

        labels.append(label_ids)
@@ -336,6 +343,7 @@ Finally, pad your text and labels, so they are a uniform length:

 ```python
 from transformers import DataCollatorForTokenClassification
+
 data_collator = DataCollatorForTokenClassification(tokenizer)
 ```

@@ -345,6 +353,7 @@ Load your model with the [`AutoModelForTokenClassification`] class along with th

 ```python
 from transformers import AutoModelForTokenClassification, TrainingArguments, Trainer
+
 model = AutoModelForTokenClassification.from_pretrained("distilbert-base-uncased", num_labels=len(label_list))
 ```

@@ -352,7 +361,7 @@ Gather your training arguments in [`TrainingArguments`]:

 ```python
 training_args = TrainingArguments(
-    output_dir='./results',
+    output_dir="./results",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
@@ -387,6 +396,7 @@ Batch your examples together and pad your text and labels, so they are a uniform

 ```python
 from transformers import DataCollatorForTokenClassification
+
 data_collator = DataCollatorForTokenClassification(tokenizer, return_tensors="tf")
 ```

@@ -412,6 +422,7 @@ Load the model with the [`TFAutoModelForTokenClassification`] class along with t

 ```python
 from transformers import TFAutoModelForTokenClassification
+
 model = TFAutoModelForTokenClassification.from_pretrained("distilbert-base-uncased", num_labels=len(label_list))
 ```

@@ -435,6 +446,7 @@ Compile the model:

 ```python
 import tensorflow as tf
+
 model.compile(optimizer=optimizer)
 ```

@@ -469,13 +481,14 @@ Load the SQuAD dataset from the 🤗 Datasets library:

 ```python
 from datasets import load_dataset
+
 squad = load_dataset("squad")
 ```

 Take a look at an example from the dataset:

 ```python
-squad["train"][0]
+>>> squad["train"][0]
 {'answers': {'answer_start': [515], 'text': ['Saint Bernadette Soubirous']},
 'context': 'Architecturally, the school has a Catholic character. Atop the Main Building\'s gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.',
 'id': '5733be284776f41900661182',
@@ -490,6 +503,7 @@ Load the DistilBERT tokenizer with an [`AutoTokenizer`]:

 ```python
 from transformers import AutoTokenizer
+
 tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
 ```

@@ -567,6 +581,7 @@ Batch the processed examples together:

 ```python
 from transformers import default_data_collator
+
 data_collator = default_data_collator
 ```

@@ -576,6 +591,7 @@ Load your model with the [`AutoModelForQuestionAnswering`] class:

 ```python
 from transformers import AutoModelForQuestionAnswering, TrainingArguments, Trainer
+
 model = AutoModelForQuestionAnswering.from_pretrained("distilbert-base-uncased")
 ```

@@ -583,7 +599,7 @@ Gather your training arguments in [`TrainingArguments`]:

 ```python
 training_args = TrainingArguments(
-    output_dir='./results',
+    output_dir="./results",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
@@ -618,6 +634,7 @@ Batch the processed examples together with a TensorFlow default data collator:

 ```python
 from transformers.data.data_collator import tf_default_collator
+
 data_collator = tf_default_collator
 ```

@@ -650,8 +667,8 @@ batch_size = 16
 num_epochs = 2
 total_train_steps = (len(tokenized_squad["train"]) // batch_size) * num_epochs
 optimizer, schedule = create_optimizer(
-    init_lr=2e-5, 
-    num_warmup_steps=0, 
+    init_lr=2e-5,
+    num_warmup_steps=0,
    num_train_steps=total_train_steps,
 )
 ```
@@ -660,6 +677,7 @@ Load your model with the [`TFAutoModelForQuestionAnswering`] class:

 ```python
 from transformers import TFAutoModelForQuestionAnswering
+
 model = TFAutoModelForQuestionAnswering("distilbert-base-uncased")
 ```

@@ -667,6 +685,7 @@ Compile the model:

 ```python
 import tensorflow as tf
+
 model.compile(optimizer=optimizer)
 ```

--- a/docs/source/debugging.mdx
+++ b/docs/source/debugging.mdx
@@ -49,6 +49,7 @@ If you're using your own training loop or another Trainer you can accomplish the

 ```python
 from .debug_utils import DebugUnderflowOverflow
+
 debug_overflow = DebugUnderflowOverflow(model)
 ```

@@ -200,13 +201,16 @@ def _forward(self, hidden_states):
    hidden_states = self.wo(hidden_states)
    return hidden_states

+
 import torch
+
+
 def forward(self, hidden_states):
    if torch.is_autocast_enabled():
-         with torch.cuda.amp.autocast(enabled=False):
-             return self._forward(hidden_states)
-     else:
-         return self._forward(hidden_states)
+        with torch.cuda.amp.autocast(enabled=False):
+            return self._forward(hidden_states)
+    else:
+        return self._forward(hidden_states)
 ```

 Since the automatic detector only reports on inputs and outputs of full frames, once you know where to look, you may
@@ -216,8 +220,10 @@ want to analyse the intermediary stages of any specific `forward` function as we
 ```python
 from debug_utils import detect_overflow

+
 class T5LayerFF(nn.Module):
    [...]
+
    def forward(self, hidden_states):
        forwarded_states = self.layer_norm(hidden_states)
        detect_overflow(forwarded_states, "after layer_norm")
@@ -237,6 +243,7 @@ its default, e.g.:

 ```python
 from .debug_utils import DebugUnderflowOverflow
+
 debug_overflow = DebugUnderflowOverflow(model, max_frames_to_save=100)
 ```

@@ -248,7 +255,7 @@ Let's say you want to watch the absolute min and max values for all the ingredie
 batch, and only do that for batches 1 and 3. Then you instantiate this class as:

 ```python
-debug_overflow = DebugUnderflowOverflow(model, trace_batch_nums=[1,3])
+debug_overflow = DebugUnderflowOverflow(model, trace_batch_nums=[1, 3])
 ```

 And now full batches 1 and 3 will be traced using the same format as the underflow/overflow detector does.
@@ -295,5 +302,5 @@ numbers started to diverge.
 You can also specify the batch number after which to stop the training, with:

 ```python
-debug_overflow = DebugUnderflowOverflow(model, trace_batch_nums=[1,3], abort_after_batch_num=3)
+debug_overflow = DebugUnderflowOverflow(model, trace_batch_nums=[1, 3], abort_after_batch_num=3)
 ```
--- a/docs/source/glossary.mdx
+++ b/docs/source/glossary.mdx
@@ -58,6 +58,7 @@ tokenizer, which is a [WordPiece](https://arxiv.org/pdf/1609.08144.pdf) tokenize

 ```python
 >>> from transformers import BertTokenizer
+
 >>> tokenizer = BertTokenizer.from_pretrained("bert-base-cased")

 >>> sequence = "A Titan RTX has 24GB of VRAM"
@@ -126,6 +127,7 @@ For example, consider these two sequences:

 ```python
 >>> from transformers import BertTokenizer
+
 >>> tokenizer = BertTokenizer.from_pretrained("bert-base-cased")

 >>> sequence_a = "This is a short sequence."
@@ -190,6 +192,7 @@ arguments (and not a list, like before) like this:

 ```python
 >>> from transformers import BertTokenizer
+
 >>> tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
 >>> sequence_a = "HuggingFace is based in NYC"
 >>> sequence_b = "Where is HuggingFace based?"
@@ -212,7 +215,7 @@ the two types of sequence in the model.
 The tokenizer returns this mask as the "token_type_ids" entry:

 ```python
->>> encoded_dict['token_type_ids']
+>>> encoded_dict["token_type_ids"]
 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1]
 ```

--- a/docs/source/internal/generation_utils.mdx
+++ b/docs/source/internal/generation_utils.mdx
@@ -32,8 +32,8 @@ Here's an example:
 ```python
 from transformers import GPT2Tokenizer, GPT2LMHeadModel

-tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
-model = GPT2LMHeadModel.from_pretrained('gpt2')
+tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
+model = GPT2LMHeadModel.from_pretrained("gpt2")

 inputs = tokenizer("Hello, my dog is cute and ", return_tensors="pt")
 generation_output = model.generate(**inputs, return_dict_in_generate=True, output_scores=True)
--- a/docs/source/main_classes/callback.mdx
+++ b/docs/source/main_classes/callback.mdx
@@ -79,12 +79,13 @@ class MyCallback(TrainerCallback):
    def on_train_begin(self, args, state, control, **kwargs):
        print("Starting training")

+
 trainer = Trainer(
    model,
    args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
-    callbacks=[MyCallback]  # We can either pass the callback class this way or an instance of it (MyCallback())
+    callbacks=[MyCallback],  # We can either pass the callback class this way or an instance of it (MyCallback())
 )
 ```

--- a/docs/source/main_classes/deepspeed.mdx
+++ b/docs/source/main_classes/deepspeed.mdx
@@ -295,11 +295,12 @@ If you're using only 1 GPU, here is how you'd have to adjust your training code
 # DeepSpeed requires a distributed environment even when only one process is used.
 # This emulates a launcher in the notebook
 import os
-os.environ['MASTER_ADDR'] = 'localhost'
-os.environ['MASTER_PORT'] = '9994' # modify if RuntimeError: Address already in use
-os.environ['RANK'] = "0"
-os.environ['LOCAL_RANK'] = "0"
-os.environ['WORLD_SIZE'] = "1"
+
+os.environ["MASTER_ADDR"] = "localhost"
+os.environ["MASTER_PORT"] = "9994"  # modify if RuntimeError: Address already in use
+os.environ["RANK"] = "0"
+os.environ["LOCAL_RANK"] = "0"
+os.environ["WORLD_SIZE"] = "1"

 # Now proceed as normal, plus pass the deepspeed config file
 training_args = TrainingArguments(..., deepspeed="ds_config_zero3.json")
@@ -316,7 +317,7 @@ at the beginning of this section.
 If you want to create the config file on the fly in the notebook in the current directory, you could have a dedicated
 cell with:

-```python
+```python no-style
 %%bash
 cat <<'EOT' > ds_config_zero3.json
 {
@@ -382,14 +383,14 @@ EOT
 If the training script is in a normal file and not in the notebook cells, you can launch `deepspeed` normally via
 shell from a cell. For example, to use `run_translation.py` you would launch it with:

-```python
+```python no-style
 !git clone https://github.com/huggingface/transformers
 !cd transformers; deepspeed examples/pytorch/translation/run_translation.py ...
 ```

 or with `%%bash` magic, where you can write a multi-line code for the shell program to run:

-```python
+```python no-style
 %%bash

 git clone https://github.com/huggingface/transformers
@@ -512,7 +513,7 @@ TrainingArguments(..., deepspeed="/path/to/ds_config.json")
 or:

 ```python
-ds_config_dict=dict(scheduler=scheduler_params, optimizer=optimizer_params)
+ds_config_dict = dict(scheduler=scheduler_params, optimizer=optimizer_params)
 TrainingArguments(..., deepspeed=ds_config_dict)
 ```

@@ -1430,6 +1431,7 @@ If you have saved at least one checkpoint, and you want to use the latest one, y
 ```python
 from transformers.trainer_utils import get_last_checkpoint
 from deepspeed.utils.zero_to_fp32 import load_state_dict_from_zero_checkpoint
+
 checkpoint_dir = get_last_checkpoint(trainer.args.output_dir)
 fp32_model = load_state_dict_from_zero_checkpoint(trainer.model, checkpoint_dir)
 ```
@@ -1439,6 +1441,7 @@ checkpoint), then you can finish the training by first saving the final model ex

 ```python
 from deepspeed.utils.zero_to_fp32 import load_state_dict_from_zero_checkpoint
+
 checkpoint_dir = os.path.join(trainer.args.output_dir, "checkpoint-final")
 trainer.deepspeed.save_checkpoint(checkpoint_dir)
 fp32_model = load_state_dict_from_zero_checkpoint(trainer.model, checkpoint_dir)
@@ -1461,7 +1464,8 @@ these yourself as is shown in the following example:

 ```python
 from deepspeed.utils.zero_to_fp32 import get_fp32_state_dict_from_zero_checkpoint
-state_dict = get_fp32_state_dict_from_zero_checkpoint(checkpoint_dir) # already on cpu
+
+state_dict = get_fp32_state_dict_from_zero_checkpoint(checkpoint_dir)  # already on cpu
 model = model.cpu()
 model.load_state_dict(state_dict)
 ```
@@ -1529,9 +1533,10 @@ context manager (which is also a function decorator), like so:
 ```python
 from transformers import T5ForConditionalGeneration, T5Config
 import deepspeed
+
 with deepspeed.zero.Init():
-   config = T5Config.from_pretrained("t5-small")
-   model = T5ForConditionalGeneration(config)
+    config = T5Config.from_pretrained("t5-small")
+    model = T5ForConditionalGeneration(config)
 ```

 As you can see this gives you a randomly initialized model.
@@ -1544,6 +1549,7 @@ section. Thus you must create the [`TrainingArguments`] object **before** callin

 ```python
 from transformers import AutoModel, Trainer, TrainingArguments
+
 training_args = TrainingArguments(..., deepspeed=ds_config)
 model = AutoModel.from_pretrained("t5-small")
 trainer = Trainer(model=model, args=training_args, ...)
@@ -1574,7 +1580,7 @@ limitations.
 Also under ZeRO-3, if you write your own code and run into a model parameter weight that looks like:

 ```python
-tensor([1.], device='cuda:0', dtype=torch.float16, requires_grad=True)
+tensor([1.0], device="cuda:0", dtype=torch.float16, requires_grad=True)
 ```

 stress on `tensor([1.])`, or if you get an error where it says the parameter is of size `1`, instead of some much
@@ -1715,9 +1721,9 @@ For example for a pretrained model:
 from transformers.deepspeed import HfDeepSpeedConfig
 from transformers import AutoModel, deepspeed

-ds_config = { ... } # deepspeed config object or path to the file
+ds_config = {...}  # deepspeed config object or path to the file
 # must run before instantiating the model
-dschf = HfDeepSpeedConfig(ds_config) # keep this object alive
+dschf = HfDeepSpeedConfig(ds_config)  # keep this object alive
 model = AutoModel.from_pretrained("gpt2")
 engine = deepspeed.initialize(model=model, config_params=ds_config, ...)
 ```
@@ -1728,9 +1734,9 @@ or for non-pretrained model:
 from transformers.deepspeed import HfDeepSpeedConfig
 from transformers import AutoModel, AutoConfig, deepspeed

-ds_config = { ... } # deepspeed config object or path to the file
+ds_config = {...}  # deepspeed config object or path to the file
 # must run before instantiating the model
-dschf = HfDeepSpeedConfig(ds_config) # keep this object alive
+dschf = HfDeepSpeedConfig(ds_config)  # keep this object alive
 config = AutoConfig.from_pretrained("gpt2")
 model = AutoModel.from_config(config)
 engine = deepspeed.initialize(model=model, config_params=ds_config, ...)
--- a/docs/source/main_classes/logging.mdx
+++ b/docs/source/main_classes/logging.mdx
@@ -21,6 +21,7 @@ to the INFO level.

 ```python
 import transformers
+
 transformers.logging.set_verbosity_info()
 ```

--- a/docs/source/main_classes/output.mdx
+++ b/docs/source/main_classes/output.mdx
@@ -22,8 +22,8 @@ Let's see of this looks on an example:
 from transformers import BertTokenizer, BertForSequenceClassification
 import torch

-tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
-model = BertForSequenceClassification.from_pretrained('bert-base-uncased')
+tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
+model = BertForSequenceClassification.from_pretrained("bert-base-uncased")

 inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
 labels = torch.tensor([1]).unsqueeze(0)  # Batch size 1
--- a/docs/source/main_classes/pipelines.mdx
+++ b/docs/source/main_classes/pipelines.mdx
@@ -101,6 +101,7 @@ from transformers import pipeline

 pipe = pipeline("text-classification")

+
 def data():
    while True:
        # This could come from a dataset, a database, a queue or HTTP request
@@ -110,6 +111,7 @@ def data():
        # does the preprocessing while the main runs the big inference
        yield "This is a test"

+
 for out in pipe(data()):
    print(out)
    # {"text": "NUMBER TEN FRESH NELLY IS WAITING ON YOU GOOD NIGHT HUSBAND"}
@@ -125,10 +127,10 @@ All pipelines can use batching. This will work
 whenever the pipeline uses its streaming ability (so when passing lists or `Dataset` or `generator`).

 ```python
-from transformers import pipeline                                                   
+from transformers import pipeline
 from transformers.pipelines.base import KeyDataset
 import datasets
-import tqdm                                                                         
+import tqdm

 dataset = datasets.load_dataset("imdb", name="plain_text", split="unsupervised")
 pipe = pipeline("text-classification", device=0)
@@ -149,28 +151,28 @@ Example where it's mostly a speedup:
 </Tip>

 ```python
-from transformers import pipeline                                                   
-from torch.utils.data import Dataset                                                
-import tqdm                                                                         
+from transformers import pipeline
+from torch.utils.data import Dataset
+import tqdm


-pipe = pipeline("text-classification", device=0)                                    
+pipe = pipeline("text-classification", device=0)


-class MyDataset(Dataset):                                                           
-    def __len__(self):                                                              
-        return 5000                                                                 
+class MyDataset(Dataset):
+    def __len__(self):
+        return 5000

-    def __getitem__(self, i):                                                       
-        return "This is a test"                                                     
+    def __getitem__(self, i):
+        return "This is a test"


-dataset = MyDataset()   
+dataset = MyDataset()

 for batch_size in [1, 8, 64, 256]:
-    print("-" * 30)                                                                     
-    print(f"Streaming batch_size={batch_size}")    
-    for out in tqdm.tqdm(pipe(dataset, batch_size=batch_size), total=len(dataset)):              
+    print("-" * 30)
+    print(f"Streaming batch_size={batch_size}")
+    for out in tqdm.tqdm(pipe(dataset, batch_size=batch_size), total=len(dataset)):
        pass
 ```

@@ -194,15 +196,15 @@ Streaming batch_size=256
 Example where it's mostly a slowdown:

 ```python
-class MyDataset(Dataset):                                                           
-    def __len__(self):                                                              
-        return 5000                                                                 
+class MyDataset(Dataset):
+    def __len__(self):
+        return 5000

-    def __getitem__(self, i):                                                       
-        if i % 64 == 0:                                                          
-            n = 100                                                              
-        else:                                                                    
-            n = 1                                                                
+    def __getitem__(self, i):
+        if i % 64 == 0:
+            n = 100
+        else:
+            n = 1
        return "This is a test" * n
 ```

@@ -298,10 +300,11 @@ If you want to try simply you can:

 ```python
 class MyPipeline(TextClassificationPipeline):
-    def postprocess(...):
-        ...
+    def postprocess(self, model_outputs):
+        # Your code goes here
        scores = scores * 100
-        ...
+        # And here
+

 my_pipeline = MyPipeline(model=model, tokenizer=tokenizer, ...)
 # or if you use *pipeline* function, then:
--- a/docs/source/main_classes/processors.mdx
+++ b/docs/source/main_classes/processors.mdx
@@ -122,7 +122,7 @@ examples = processor.get_dev_examples(squad_v2_data_dir)
 processor = SquadV1Processor()
 examples = processor.get_dev_examples(squad_v1_data_dir)

-features = squad_convert_examples_to_features( 
+features = squad_convert_examples_to_features(
    examples=examples,
    tokenizer=tokenizer,
    max_seq_length=max_seq_length,
@@ -139,7 +139,7 @@ Using *tensorflow_datasets* is as easy as using a data file:
 tfds_examples = tfds.load("squad")
 examples = SquadV1Processor().get_examples_from_dataset(tfds_examples, evaluate=evaluate)

-features = squad_convert_examples_to_features( 
+features = squad_convert_examples_to_features(
    examples=examples,
    tokenizer=tokenizer,
    max_seq_length=max_seq_length,
--- a/docs/source/main_classes/trainer.mdx
+++ b/docs/source/main_classes/trainer.mdx
@@ -53,14 +53,16 @@ Here is an example of how to customize [`Trainer`] using a custom loss function
 from torch import nn
 from transformers import Trainer

+
 class MultilabelTrainer(Trainer):
    def compute_loss(self, model, inputs, return_outputs=False):
        labels = inputs.get("labels")
        outputs = model(**inputs)
-        logits = outputs.get('logits')
+        logits = outputs.get("logits")
        loss_fct = nn.BCEWithLogitsLoss()
-        loss = loss_fct(logits.view(-1, self.model.config.num_labels),
-                        labels.float().view(-1, self.model.config.num_labels))
+        loss = loss_fct(
+            logits.view(-1, self.model.config.num_labels), labels.float().view(-1, self.model.config.num_labels)
+        )
        return (loss, outputs) if return_outputs else loss
 ```

--- a/docs/source/migration.mdx
+++ b/docs/source/migration.mdx
@@ -209,7 +209,7 @@ Here is a `pytorch-pretrained-bert` to 🤗 Transformers conversion example for

 ```python
 # Let's load our model
-model = BertForSequenceClassification.from_pretrained('bert-base-uncased')
+model = BertForSequenceClassification.from_pretrained("bert-base-uncased")

 # If you used to have this line in pytorch-pretrained-bert:
 loss = model(input_ids, labels=labels)
@@ -222,7 +222,7 @@ loss = outputs[0]
 loss, logits = outputs[:2]

 # And even the attention weights if you configure the model to output them (and other outputs too, see the docstrings and documentation)
-model = BertForSequenceClassification.from_pretrained('bert-base-uncased', output_attentions=True)
+model = BertForSequenceClassification.from_pretrained("bert-base-uncased", output_attentions=True)
 outputs = model(input_ids, labels=labels)
 loss, logits, attentions = outputs
 ```
@@ -241,23 +241,23 @@ Here is an example:

 ```python
 ### Let's load a model and tokenizer
-model = BertForSequenceClassification.from_pretrained('bert-base-uncased')
-tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
+model = BertForSequenceClassification.from_pretrained("bert-base-uncased")
+tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

 ### Do some stuff to our model and tokenizer
 # Ex: add new tokens to the vocabulary and embeddings of our model
-tokenizer.add_tokens(['[SPECIAL_TOKEN_1]', '[SPECIAL_TOKEN_2]'])
+tokenizer.add_tokens(["[SPECIAL_TOKEN_1]", "[SPECIAL_TOKEN_2]"])
 model.resize_token_embeddings(len(tokenizer))
 # Train our model
 train(model)

 ### Now let's save our model and tokenizer to a directory
-model.save_pretrained('./my_saved_model_directory/')
-tokenizer.save_pretrained('./my_saved_model_directory/')
+model.save_pretrained("./my_saved_model_directory/")
+tokenizer.save_pretrained("./my_saved_model_directory/")

 ### Reload the model and the tokenizer
-model = BertForSequenceClassification.from_pretrained('./my_saved_model_directory/')
-tokenizer = BertTokenizer.from_pretrained('./my_saved_model_directory/')
+model = BertForSequenceClassification.from_pretrained("./my_saved_model_directory/")
+tokenizer = BertTokenizer.from_pretrained("./my_saved_model_directory/")
 ```

 ### Optimizers: BertAdam & OpenAIAdam are now AdamW, schedules are standard PyTorch schedules
@@ -283,7 +283,13 @@ num_warmup_steps = 100
 warmup_proportion = float(num_warmup_steps) / float(num_training_steps)  # 0.1

 ### Previously BertAdam optimizer was instantiated like this:
-optimizer = BertAdam(model.parameters(), lr=lr, schedule='warmup_linear', warmup=warmup_proportion, num_training_steps=num_training_steps)
+optimizer = BertAdam(
+    model.parameters(),
+    lr=lr,
+    schedule="warmup_linear",
+    warmup=warmup_proportion,
+    num_training_steps=num_training_steps,
+)
 ### and used like this:
 for batch in train_data:
    loss = model(batch)
@@ -291,13 +297,19 @@ for batch in train_data:
    optimizer.step()

 ### In 🤗 Transformers, optimizer and schedules are split and instantiated like this:
-optimizer = AdamW(model.parameters(), lr=lr, correct_bias=False)  # To reproduce BertAdam specific behavior set correct_bias=False
-scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=num_warmup_steps, num_training_steps=num_training_steps)  # PyTorch scheduler
+optimizer = AdamW(
+    model.parameters(), lr=lr, correct_bias=False
+)  # To reproduce BertAdam specific behavior set correct_bias=False
+scheduler = get_linear_schedule_with_warmup(
+    optimizer, num_warmup_steps=num_warmup_steps, num_training_steps=num_training_steps
+)  # PyTorch scheduler
 ### and used like this:
 for batch in train_data:
    loss = model(batch)
    loss.backward()
-    torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)  # Gradient clipping is not in AdamW anymore (so you can use amp without issue)
+    torch.nn.utils.clip_grad_norm_(
+        model.parameters(), max_grad_norm
+    )  # Gradient clipping is not in AdamW anymore (so you can use amp without issue)
    optimizer.step()
    scheduler.step()
 ```
--- a/docs/source/model_doc/bart.mdx
+++ b/docs/source/model_doc/bart.mdx
@@ -64,12 +64,15 @@ The `facebook/bart-base` and `facebook/bart-large` checkpoints can be used to fi

 ```python
 from transformers import BartForConditionalGeneration, BartTokenizer
+
 model = BartForConditionalGeneration.from_pretrained("facebook/bart-large", forced_bos_token_id=0)
 tok = BartTokenizer.from_pretrained("facebook/bart-large")
 example_english_phrase = "UN Chief Says There Is No <mask> in Syria"
-batch = tok(example_english_phrase, return_tensors='pt')
-generated_ids = model.generate(batch['input_ids'])
-assert tok.batch_decode(generated_ids, skip_special_tokens=True) == ['UN Chief Says There Is No Plan to Stop Chemical Weapons in Syria']
+batch = tok(example_english_phrase, return_tensors="pt")
+generated_ids = model.generate(batch["input_ids"])
+assert tok.batch_decode(generated_ids, skip_special_tokens=True) == [
+    "UN Chief Says There Is No Plan to Stop Chemical Weapons in Syria"
+]
 ```

 ## BartConfig
--- a/docs/source/model_doc/bartpho.mdx
+++ b/docs/source/model_doc/bartpho.mdx
@@ -44,6 +44,7 @@ Example of use:

 >>> # With TensorFlow 2.0+:
 >>> from transformers import TFAutoModel
+
 >>> bartpho = TFAutoModel.from_pretrained("vinai/bartpho-syllable")
 >>> input_ids = tokenizer(line, return_tensors="tf")
 >>> features = bartpho(**input_ids)
@@ -58,9 +59,10 @@ Tips:

 ```python
 >>> from transformers import MBartForConditionalGeneration
+
 >>> bartpho = MBartForConditionalGeneration.from_pretrained("vinai/bartpho-syllable")
->>> TXT = 'Chúng tôi là <mask> nghiên cứu viên.'
->>> input_ids = tokenizer([TXT], return_tensors='pt')['input_ids']
+>>> TXT = "Chúng tôi là <mask> nghiên cứu viên."
+>>> input_ids = tokenizer([TXT], return_tensors="pt")["input_ids"]
 >>> logits = bartpho(input_ids).logits
 >>> masked_index = (input_ids[0] == tokenizer.mask_token_id).nonzero().item()
 >>> probs = logits[0, masked_index].softmax(dim=0)
--- a/docs/source/model_doc/bert_japanese.mdx
+++ b/docs/source/model_doc/bert_japanese.mdx
@@ -30,7 +30,7 @@ Example of using a model with MeCab and WordPiece tokenization:

 ```python
 >>> import torch
->>> from transformers import AutoModel, AutoTokenizer 
+>>> from transformers import AutoModel, AutoTokenizer

 >>> bertjapanese = AutoModel.from_pretrained("cl-tohoku/bert-base-japanese")
 >>> tokenizer = AutoTokenizer.from_pretrained("cl-tohoku/bert-base-japanese")
@@ -40,7 +40,7 @@ Example of using a model with MeCab and WordPiece tokenization:

 >>> inputs = tokenizer(line, return_tensors="pt")

->>> print(tokenizer.decode(inputs['input_ids'][0]))
+>>> print(tokenizer.decode(inputs["input_ids"][0]))
 [CLS] 吾輩 は 猫 で ある 。 [SEP]

 >>> outputs = bertjapanese(**inputs)
@@ -57,7 +57,7 @@ Example of using a model with Character tokenization:

 >>> inputs = tokenizer(line, return_tensors="pt")

->>> print(tokenizer.decode(inputs['input_ids'][0]))
+>>> print(tokenizer.decode(inputs["input_ids"][0]))
 [CLS] 吾 輩 は 猫 で あ る 。 [SEP]

 >>> outputs = bertjapanese(**inputs)
--- a/docs/source/model_doc/bertgeneration.mdx
+++ b/docs/source/model_doc/bertgeneration.mdx
@@ -39,14 +39,18 @@ Usage:
 >>> # use BERT's cls token as BOS token and sep token as EOS token
 >>> encoder = BertGenerationEncoder.from_pretrained("bert-large-uncased", bos_token_id=101, eos_token_id=102)
 >>> # add cross attention layers and use BERT's cls token as BOS token and sep token as EOS token
->>> decoder = BertGenerationDecoder.from_pretrained("bert-large-uncased", add_cross_attention=True, is_decoder=True, bos_token_id=101, eos_token_id=102)
+>>> decoder = BertGenerationDecoder.from_pretrained(
+...     "bert-large-uncased", add_cross_attention=True, is_decoder=True, bos_token_id=101, eos_token_id=102
+... )
 >>> bert2bert = EncoderDecoderModel(encoder=encoder, decoder=decoder)

 >>> # create tokenizer...
 >>> tokenizer = BertTokenizer.from_pretrained("bert-large-uncased")

->>> input_ids = tokenizer('This is a long article to summarize', add_special_tokens=False, return_tensors="pt").input_ids
->>> labels = tokenizer('This is a short summary', return_tensors="pt").input_ids
+>>> input_ids = tokenizer(
+...     "This is a long article to summarize", add_special_tokens=False, return_tensors="pt"
+... ).input_ids
+>>> labels = tokenizer("This is a short summary", return_tensors="pt").input_ids

 >>> # train...
 >>> loss = bert2bert(input_ids=input_ids, decoder_input_ids=labels, labels=labels).loss
@@ -61,7 +65,9 @@ Usage:
 >>> sentence_fuser = EncoderDecoderModel.from_pretrained("google/roberta2roberta_L-24_discofuse")
 >>> tokenizer = AutoTokenizer.from_pretrained("google/roberta2roberta_L-24_discofuse")

->>> input_ids = tokenizer('This is the first sentence. This is the second sentence.', add_special_tokens=False, return_tensors="pt").input_ids
+>>> input_ids = tokenizer(
+...     "This is the first sentence. This is the second sentence.", add_special_tokens=False, return_tensors="pt"
+... ).input_ids

 >>> outputs = sentence_fuser.generate(input_ids)

--- a/docs/source/model_doc/bertweet.mdx
+++ b/docs/source/model_doc/bertweet.mdx
@@ -28,14 +28,14 @@ Example of use:

 ```python
 >>> import torch
->>> from transformers import AutoModel, AutoTokenizer 
+>>> from transformers import AutoModel, AutoTokenizer

 >>> bertweet = AutoModel.from_pretrained("vinai/bertweet-base")

->>> # For transformers v4.x+: 
+>>> # For transformers v4.x+:
 >>> tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-base", use_fast=False)

->>> # For transformers v3.x: 
+>>> # For transformers v3.x:
 >>> # tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-base")

 >>> # INPUT TWEET IS ALREADY NORMALIZED!
--- a/docs/source/model_doc/blenderbot.mdx
+++ b/docs/source/model_doc/blenderbot.mdx
@@ -50,11 +50,12 @@ Here is an example of model usage:

 ```python
 >>> from transformers import BlenderbotTokenizer, BlenderbotForConditionalGeneration
->>> mname = 'facebook/blenderbot-400M-distill'
+
+>>> mname = "facebook/blenderbot-400M-distill"
 >>> model = BlenderbotForConditionalGeneration.from_pretrained(mname)
 >>> tokenizer = BlenderbotTokenizer.from_pretrained(mname)
 >>> UTTERANCE = "My friends are cool but they eat too many carbs."
->>> inputs = tokenizer([UTTERANCE], return_tensors='pt')
+>>> inputs = tokenizer([UTTERANCE], return_tensors="pt")
 >>> reply_ids = model.generate(**inputs)
 >>> print(tokenizer.batch_decode(reply_ids))
 ["<s> That's unfortunate. Are they trying to lose weight or are they just trying to be healthier?</s>"]
--- a/docs/source/model_doc/byt5.mdx
+++ b/docs/source/model_doc/byt5.mdx
@@ -51,12 +51,14 @@ ByT5 works on raw UTF-8 bytes, so it can be used without a tokenizer:
 from transformers import T5ForConditionalGeneration
 import torch

-model = T5ForConditionalGeneration.from_pretrained('google/byt5-small')
+model = T5ForConditionalGeneration.from_pretrained("google/byt5-small")

 input_ids = torch.tensor([list("Life is like a box of chocolates.".encode("utf-8"))]) + 3  # add 3 for special tokens
-labels = torch.tensor([list("La vie est comme une boîte de chocolat.".encode("utf-8"))]) + 3  # add 3 for special tokens
+labels = (
+    torch.tensor([list("La vie est comme une boîte de chocolat.".encode("utf-8"))]) + 3
+)  # add 3 for special tokens

-loss = model(input_ids, labels=labels).loss # forward pass
+loss = model(input_ids, labels=labels).loss  # forward pass
 ```

 For batched inference and training it is however recommended to make use of the tokenizer:
@@ -64,13 +66,17 @@ For batched inference and training it is however recommended to make use of the
 ```python
 from transformers import T5ForConditionalGeneration, AutoTokenizer

-model = T5ForConditionalGeneration.from_pretrained('google/byt5-small')
-tokenizer = AutoTokenizer.from_pretrained('google/byt5-small')
+model = T5ForConditionalGeneration.from_pretrained("google/byt5-small")
+tokenizer = AutoTokenizer.from_pretrained("google/byt5-small")

-model_inputs = tokenizer(["Life is like a box of chocolates.", "Today is Monday."], padding="longest", return_tensors="pt")
-labels = tokenizer(["La vie est comme une boîte de chocolat.", "Aujourd'hui c'est lundi."], padding="longest", return_tensors="pt").input_ids
+model_inputs = tokenizer(
+    ["Life is like a box of chocolates.", "Today is Monday."], padding="longest", return_tensors="pt"
+)
+labels = tokenizer(
+    ["La vie est comme une boîte de chocolat.", "Aujourd'hui c'est lundi."], padding="longest", return_tensors="pt"
+).input_ids

-loss = model(**model_inputs, labels=labels).loss # forward pass
+loss = model(**model_inputs, labels=labels).loss  # forward pass
 ```

 ## ByT5Tokenizer
--- a/docs/source/model_doc/canine.mdx
+++ b/docs/source/model_doc/canine.mdx
@@ -64,13 +64,13 @@ CANINE works on raw characters, so it can be used without a tokenizer:
 >>> from transformers import CanineModel
 >>> import torch

->>> model = CanineModel.from_pretrained('google/canine-c') # model pre-trained with autoregressive character loss
+>>> model = CanineModel.from_pretrained("google/canine-c")  # model pre-trained with autoregressive character loss

 >>> text = "hello world"
 >>> # use Python's built-in ord() function to turn each character into its unicode code point id
 >>> input_ids = torch.tensor([[ord(char) for char in text]])

->>> outputs = model(input_ids) # forward pass
+>>> outputs = model(input_ids)  # forward pass
 >>> pooled_output = outputs.pooler_output
 >>> sequence_output = outputs.last_hidden_state
 ```
@@ -81,13 +81,13 @@ sequences to the same length):
 ```python
 >>> from transformers import CanineTokenizer, CanineModel

->>> model = CanineModel.from_pretrained('google/canine-c')
->>> tokenizer = CanineTokenizer.from_pretrained('google/canine-c')
+>>> model = CanineModel.from_pretrained("google/canine-c")
+>>> tokenizer = CanineTokenizer.from_pretrained("google/canine-c")

 >>> inputs = ["Life is like a box of chocolates.", "You never know what you gonna get."]
 >>> encoding = tokenizer(inputs, padding="longest", truncation=True, return_tensors="pt")

->>> outputs = model(**encoding) # forward pass
+>>> outputs = model(**encoding)  # forward pass
 >>> pooled_output = outputs.pooler_output
 >>> sequence_output = outputs.last_hidden_state
 ```
--- a/docs/source/model_doc/clip.mdx
+++ b/docs/source/model_doc/clip.mdx
@@ -69,8 +69,8 @@ encode the text and prepare the images. The following example shows how to get t
 >>> inputs = processor(text=["a photo of a cat", "a photo of a dog"], images=image, return_tensors="pt", padding=True)

 >>> outputs = model(**inputs)
->>> logits_per_image = outputs.logits_per_image # this is the image-text similarity score
->>> probs = logits_per_image.softmax(dim=1) # we can take the softmax to get the label probabilities
+>>> logits_per_image = outputs.logits_per_image  # this is the image-text similarity score
+>>> probs = logits_per_image.softmax(dim=1)  # we can take the softmax to get the label probabilities
 ```

 This model was contributed by [valhalla](https://huggingface.co/valhalla). The original code can be found [here](https://github.com/openai/CLIP).
--- a/docs/source/model_doc/gpt_neo.mdx
+++ b/docs/source/model_doc/gpt_neo.mdx
@@ -29,16 +29,24 @@ The `generate()` method can be used to generate text using GPT Neo model.

 ```python
 >>> from transformers import GPTNeoForCausalLM, GPT2Tokenizer
+
 >>> model = GPTNeoForCausalLM.from_pretrained("EleutherAI/gpt-neo-1.3B")
 >>> tokenizer = GPT2Tokenizer.from_pretrained("EleutherAI/gpt-neo-1.3B")

->>> prompt = "In a shocking finding, scientists discovered a herd of unicorns living in a remote, " \
-...          "previously unexplored valley, in the Andes Mountains. Even more surprising to the " \
-...          "researchers was the fact that the unicorns spoke perfect English."
+>>> prompt = (
+...     "In a shocking finding, scientists discovered a herd of unicorns living in a remote, "
+...     "previously unexplored valley, in the Andes Mountains. Even more surprising to the "
+...     "researchers was the fact that the unicorns spoke perfect English."
+... )

 >>> input_ids = tokenizer(prompt, return_tensors="pt").input_ids

->>> gen_tokens = model.generate(input_ids, do_sample=True, temperature=0.9, max_length=100,)
+>>> gen_tokens = model.generate(
+...     input_ids,
+...     do_sample=True,
+...     temperature=0.9,
+...     max_length=100,
+... )
 >>> gen_text = tokenizer.batch_decode(gen_tokens)[0]
 ```

--- a/docs/source/model_doc/gptj.mdx
+++ b/docs/source/model_doc/gptj.mdx
@@ -33,7 +33,9 @@ Tips:
 >>> from transformers import GPTJForCausalLM
 >>> import torch

->>> model = GPTJForCausalLM.from_pretrained("EleutherAI/gpt-j-6B", revision="float16", torch_dtype=torch.float16, low_cpu_mem_usage=True)
+>>> model = GPTJForCausalLM.from_pretrained(
+...     "EleutherAI/gpt-j-6B", revision="float16", torch_dtype=torch.float16, low_cpu_mem_usage=True
+... )
 ```

 - The model should fit on 16GB GPU for inference. For training/fine-tuning it would take much more GPU RAM. Adam
@@ -56,16 +58,24 @@ model.

 ```python
 >>> from transformers import AutoModelForCausalLM, AutoTokenizer
+
 >>> model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-j-6B")
 >>> tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")

->>> prompt = "In a shocking finding, scientists discovered a herd of unicorns living in a remote, " \
-...          "previously unexplored valley, in the Andes Mountains. Even more surprising to the " \
-...          "researchers was the fact that the unicorns spoke perfect English."
+>>> prompt = (
+...     "In a shocking finding, scientists discovered a herd of unicorns living in a remote, "
+...     "previously unexplored valley, in the Andes Mountains. Even more surprising to the "
+...     "researchers was the fact that the unicorns spoke perfect English."
+... )

 >>> input_ids = tokenizer(prompt, return_tensors="pt").input_ids

->>> gen_tokens = model.generate(input_ids, do_sample=True, temperature=0.9, max_length=100,)
+>>> gen_tokens = model.generate(
+...     input_ids,
+...     do_sample=True,
+...     temperature=0.9,
+...     max_length=100,
+... )
 >>> gen_text = tokenizer.batch_decode(gen_tokens)[0]
 ```

@@ -78,13 +88,20 @@ model.
 >>> model = GPTJForCausalLM.from_pretrained("EleutherAI/gpt-j-6B", torch_dtype=torch.float16)
 >>> tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")

->>> prompt = "In a shocking finding, scientists discovered a herd of unicorns living in a remote, " \
-...          "previously unexplored valley, in the Andes Mountains. Even more surprising to the " \
-...          "researchers was the fact that the unicorns spoke perfect English."
+>>> prompt = (
+...     "In a shocking finding, scientists discovered a herd of unicorns living in a remote, "
+...     "previously unexplored valley, in the Andes Mountains. Even more surprising to the "
+...     "researchers was the fact that the unicorns spoke perfect English."
+... )

 >>> input_ids = tokenizer(prompt, return_tensors="pt").input_ids

->>> gen_tokens = model.generate(input_ids, do_sample=True, temperature=0.9, max_length=100,)
+>>> gen_tokens = model.generate(
+...     input_ids,
+...     do_sample=True,
+...     temperature=0.9,
+...     max_length=100,
+... )
 >>> gen_text = tokenizer.batch_decode(gen_tokens)[0]
 ```

--- a/docs/source/model_doc/herbert.mdx
+++ b/docs/source/model_doc/herbert.mdx
@@ -41,7 +41,7 @@ Examples of use:
 >>> tokenizer = HerbertTokenizer.from_pretrained("allegro/herbert-klej-cased-tokenizer-v1")
 >>> model = RobertaModel.from_pretrained("allegro/herbert-klej-cased-v1")

->>> encoded_input = tokenizer.encode("Kto ma lepszą sztukę, ma lepszy rząd – to jasne.", return_tensors='pt')
+>>> encoded_input = tokenizer.encode("Kto ma lepszą sztukę, ma lepszy rząd – to jasne.", return_tensors="pt")
 >>> outputs = model(encoded_input)

 >>> # HerBERT can also be loaded using AutoTokenizer and AutoModel:
--- a/docs/source/model_doc/layoutlm.mdx
+++ b/docs/source/model_doc/layoutlm.mdx
@@ -53,12 +53,12 @@ Tips:

 ```python
 def normalize_bbox(bbox, width, height):
-     return [
-         int(1000 * (bbox[0] / width)),
-         int(1000 * (bbox[1] / height)),
-         int(1000 * (bbox[2] / width)),
-         int(1000 * (bbox[3] / height)),
-     ]
+    return [
+        int(1000 * (bbox[0] / width)),
+        int(1000 * (bbox[1] / height)),
+        int(1000 * (bbox[2] / width)),
+        int(1000 * (bbox[3] / height)),
+    ]
 ```

 Here, `width` and `height` correspond to the width and height of the original document in which the token
--- a/docs/source/model_doc/layoutlmv2.mdx
+++ b/docs/source/model_doc/layoutlmv2.mdx
@@ -70,12 +70,12 @@ Tips:

 ```python
 def normalize_bbox(bbox, width, height):
-     return [
-         int(1000 * (bbox[0] / width)),
-         int(1000 * (bbox[1] / height)),
-         int(1000 * (bbox[2] / width)),
-         int(1000 * (bbox[3] / height)),
-     ]
+    return [
+        int(1000 * (bbox[0] / width)),
+        int(1000 * (bbox[1] / height)),
+        int(1000 * (bbox[2] / width)),
+        int(1000 * (bbox[3] / height)),
+    ]
 ```

 Here, `width` and `height` correspond to the width and height of the original document in which the token
@@ -123,7 +123,7 @@ modality.
 ```python
 from transformers import LayoutLMv2FeatureExtractor, LayoutLMv2TokenizerFast, LayoutLMv2Processor

-feature_extractor = LayoutLMv2FeatureExtractor() # apply_ocr is set to True by default
+feature_extractor = LayoutLMv2FeatureExtractor()  # apply_ocr is set to True by default
 tokenizer = LayoutLMv2TokenizerFast.from_pretrained("microsoft/layoutlmv2-base-uncased")
 processor = LayoutLMv2Processor(feature_extractor, tokenizer)
 ```
@@ -158,7 +158,9 @@ from PIL import Image
 processor = LayoutLMv2Processor.from_pretrained("microsoft/layoutlmv2-base-uncased")

 image = Image.open("name_of_your_document - can be a png file, pdf, etc.").convert("RGB")
-encoding = processor(image, return_tensors="pt") # you can also add all tokenizer parameters here such as padding, truncation
+encoding = processor(
+    image, return_tensors="pt"
+)  # you can also add all tokenizer parameters here such as padding, truncation
 print(encoding.keys())
 # dict_keys(['input_ids', 'token_type_ids', 'attention_mask', 'bbox', 'image'])
 ```
@@ -177,7 +179,7 @@ processor = LayoutLMv2Processor.from_pretrained("microsoft/layoutlmv2-base-uncas

 image = Image.open("name_of_your_document - can be a png file, pdf, etc.").convert("RGB")
 words = ["hello", "world"]
-boxes = [[1, 2, 3, 4], [5, 6, 7, 8]] # make sure to normalize your bounding boxes
+boxes = [[1, 2, 3, 4], [5, 6, 7, 8]]  # make sure to normalize your bounding boxes
 encoding = processor(image, words, boxes=boxes, return_tensors="pt")
 print(encoding.keys())
 # dict_keys(['input_ids', 'token_type_ids', 'attention_mask', 'bbox', 'image'])
@@ -199,7 +201,7 @@ processor = LayoutLMv2Processor.from_pretrained("microsoft/layoutlmv2-base-uncas

 image = Image.open("name_of_your_document - can be a png file, pdf, etc.").convert("RGB")
 words = ["hello", "world"]
-boxes = [[1, 2, 3, 4], [5, 6, 7, 8]] # make sure to normalize your bounding boxes
+boxes = [[1, 2, 3, 4], [5, 6, 7, 8]]  # make sure to normalize your bounding boxes
 word_labels = [1, 2]
 encoding = processor(image, words, boxes=boxes, word_labels=word_labels, return_tensors="pt")
 print(encoding.keys())
@@ -219,7 +221,7 @@ processor = LayoutLMv2Processor.from_pretrained("microsoft/layoutlmv2-base-uncas

 image = Image.open("name_of_your_document - can be a png file, pdf, etc.").convert("RGB")
 question = "What's his name?"
-encoding = processor(image, question, return_tensors="pt") 
+encoding = processor(image, question, return_tensors="pt")
 print(encoding.keys())
 # dict_keys(['input_ids', 'token_type_ids', 'attention_mask', 'bbox', 'image'])
 ```
@@ -238,8 +240,8 @@ processor = LayoutLMv2Processor.from_pretrained("microsoft/layoutlmv2-base-uncas
 image = Image.open("name_of_your_document - can be a png file, pdf, etc.").convert("RGB")
 question = "What's his name?"
 words = ["hello", "world"]
-boxes = [[1, 2, 3, 4], [5, 6, 7, 8]] # make sure to normalize your bounding boxes
-encoding = processor(image, question, words, boxes=boxes, return_tensors="pt")  
+boxes = [[1, 2, 3, 4], [5, 6, 7, 8]]  # make sure to normalize your bounding boxes
+encoding = processor(image, question, words, boxes=boxes, return_tensors="pt")
 print(encoding.keys())
 # dict_keys(['input_ids', 'token_type_ids', 'attention_mask', 'bbox', 'image'])
 ```
--- a/docs/source/model_doc/layoutxlm.mdx
+++ b/docs/source/model_doc/layoutxlm.mdx
@@ -34,7 +34,7 @@ One can directly plug in the weights of LayoutXLM into a LayoutLMv2 model, like
 ```python
 from transformers import LayoutLMv2Model

-model = LayoutLMv2Model.from_pretrained('microsoft/layoutxlm-base')
+model = LayoutLMv2Model.from_pretrained("microsoft/layoutxlm-base")
 ```

 Note that LayoutXLM has its own tokenizer, based on
@@ -44,7 +44,7 @@ follows:
 ```python
 from transformers import LayoutXLMTokenizer

-tokenizer = LayoutXLMTokenizer.from_pretrained('microsoft/layoutxlm-base')
+tokenizer = LayoutXLMTokenizer.from_pretrained("microsoft/layoutxlm-base")
 ```

 Similar to LayoutLMv2, you can use [`LayoutXLMProcessor`] (which internally applies
--- a/docs/source/model_doc/longformer.mdx
+++ b/docs/source/model_doc/longformer.mdx
@@ -75,8 +75,8 @@ For more information, please refer to the official [paper](https://arxiv.org/pdf
 trained and should be used as follows:

 ```python
-input_ids = tokenizer.encode('This is a sentence from [MASK] training data', return_tensors='pt')
-mlm_labels = tokenizer.encode('This is a sentence from the training data', return_tensors='pt')
+input_ids = tokenizer.encode("This is a sentence from [MASK] training data", return_tensors="pt")
+mlm_labels = tokenizer.encode("This is a sentence from the training data", return_tensors="pt")

 loss = model(input_ids, labels=input_ids, masked_lm_labels=mlm_labels)[0]
 ```
--- a/docs/source/model_doc/luke.mdx
+++ b/docs/source/model_doc/luke.mdx
@@ -84,24 +84,27 @@ Example:

 >>> model = LukeModel.from_pretrained("studio-ousia/luke-base")
 >>> tokenizer = LukeTokenizer.from_pretrained("studio-ousia/luke-base")
-
 # Example 1: Computing the contextualized entity representation corresponding to the entity mention "Beyoncé"
+
 >>> text = "Beyoncé lives in Los Angeles."
 >>> entity_spans = [(0, 7)]  # character-based entity span corresponding to "Beyoncé"
 >>> inputs = tokenizer(text, entity_spans=entity_spans, add_prefix_space=True, return_tensors="pt")
 >>> outputs = model(**inputs)
 >>> word_last_hidden_state = outputs.last_hidden_state
 >>> entity_last_hidden_state = outputs.entity_last_hidden_state
-
 # Example 2: Inputting Wikipedia entities to obtain enriched contextualized representations
->>> entities = ["Beyoncé", "Los Angeles"]  # Wikipedia entity titles corresponding to the entity mentions "Beyoncé" and "Los Angeles"
+
+>>> entities = [
+...     "Beyoncé",
+...     "Los Angeles",
+... ]  # Wikipedia entity titles corresponding to the entity mentions "Beyoncé" and "Los Angeles"
 >>> entity_spans = [(0, 7), (17, 28)]  # character-based entity spans corresponding to "Beyoncé" and "Los Angeles"
 >>> inputs = tokenizer(text, entities=entities, entity_spans=entity_spans, add_prefix_space=True, return_tensors="pt")
 >>> outputs = model(**inputs)
 >>> word_last_hidden_state = outputs.last_hidden_state
 >>> entity_last_hidden_state = outputs.entity_last_hidden_state
-
 # Example 3: Classifying the relationship between two entities using LukeForEntityPairClassification head model
+
 >>> model = LukeForEntityPairClassification.from_pretrained("studio-ousia/luke-large-finetuned-tacred")
 >>> tokenizer = LukeTokenizer.from_pretrained("studio-ousia/luke-large-finetuned-tacred")
 >>> entity_spans = [(0, 7), (17, 28)]  # character-based entity spans corresponding to "Beyoncé" and "Los Angeles"
--- a/docs/source/model_doc/m2m_100.mdx
+++ b/docs/source/model_doc/m2m_100.mdx
@@ -49,8 +49,8 @@ examples. To install `sentencepiece` run `pip install sentencepiece`.
 ```python
 from transformers import M2M100Config, M2M100ForConditionalGeneration, M2M100Tokenizer

-model = M2M100ForConditionalGeneration.from_pretrained('facebook/m2m100_418M')
-tokenizer = M2M100Tokenizer.from_pretrained('facebook/m2m100_418M', src_lang="en", tgt_lang="fr")
+model = M2M100ForConditionalGeneration.from_pretrained("facebook/m2m100_418M")
+tokenizer = M2M100Tokenizer.from_pretrained("facebook/m2m100_418M", src_lang="en", tgt_lang="fr")

 src_text = "Life is like a box of chocolates."
 tgt_text = "La vie est comme une boîte de chocolat."
@@ -59,7 +59,7 @@ model_inputs = tokenizer(src_text, return_tensors="pt")
 with tokenizer.as_target_tokenizer():
    labels = tokenizer(tgt_text, return_tensors="pt").input_ids

-loss = model(**model_inputs, labels=labels) # forward pass
+loss = model(**model_inputs, labels=labels)  # forward pass
 ```

 - Generation
--- a/docs/source/model_doc/marian.mdx
+++ b/docs/source/model_doc/marian.mdx
@@ -65,13 +65,14 @@ require 3 character language codes:

 ```python
 >>> from transformers import MarianMTModel, MarianTokenizer
->>> src_text = [
-...     '>>fra<< this is a sentence in english that we want to translate to french',
-...     '>>por<< This should go to portuguese',
-...     '>>esp<< And this to Spanish'
->>> ]

->>> model_name = 'Helsinki-NLP/opus-mt-en-roa'
+>>> src_text = [
+...     ">>fra<< this is a sentence in english that we want to translate to french",
+...     ">>por<< This should go to portuguese",
+...     ">>esp<< And this to Spanish",
+... ]
+
+>>> model_name = "Helsinki-NLP/opus-mt-en-roa"
 >>> tokenizer = MarianTokenizer.from_pretrained(model_name)
 >>> print(tokenizer.supported_language_codes)
 ['>>zlm_Latn<<', '>>mfe<<', '>>hat<<', '>>pap<<', '>>ast<<', '>>cat<<', '>>ind<<', '>>glg<<', '>>wln<<', '>>spa<<', '>>fra<<', '>>ron<<', '>>por<<', '>>ita<<', '>>oci<<', '>>arg<<', '>>min<<']
@@ -88,11 +89,12 @@ Here is the code to see all available pretrained models on the hub:

 ```python
 from huggingface_hub import list_models
+
 model_list = list_models()
 org = "Helsinki-NLP"
 model_ids = [x.modelId for x in model_list if x.modelId.startswith(org)]
-suffix = [x.split('/')[1] for x in model_ids]
-old_style_multi_models = [f'{org}/{s}' for s in suffix if s != s.lower()]
+suffix = [x.split("/")[1] for x in model_ids]
+old_style_multi_models = [f"{org}/{s}" for s in suffix if s != s.lower()]
 ```

 ## Old Style Multi-Lingual Models
@@ -100,7 +102,7 @@ old_style_multi_models = [f'{org}/{s}' for s in suffix if s != s.lower()]
 These are the old style multi-lingual models ported from the OPUS-MT-Train repo: and the members of each language
 group:

-```python
+```python no-style
 ['Helsinki-NLP/opus-mt-NORTH_EU-NORTH_EU',
 'Helsinki-NLP/opus-mt-ROMANCE-en',
 'Helsinki-NLP/opus-mt-SCANDINAVIA-SCANDINAVIA',
@@ -129,13 +131,14 @@ Example of translating english to many romance languages, using old-style 2 char

 ```python
 >>> from transformers import MarianMTModel, MarianTokenizer
->>> src_text = [
-...     '>>fr<< this is a sentence in english that we want to translate to french',
-...     '>>pt<< This should go to portuguese',
-...     '>>es<< And this to Spanish'
->>> ]

->>> model_name = 'Helsinki-NLP/opus-mt-en-ROMANCE'
+>>> src_text = [
+...     ">>fr<< this is a sentence in english that we want to translate to french",
+...     ">>pt<< This should go to portuguese",
+...     ">>es<< And this to Spanish",
+... ]
+
+>>> model_name = "Helsinki-NLP/opus-mt-en-ROMANCE"
 >>> tokenizer = MarianTokenizer.from_pretrained(model_name)

 >>> model = MarianMTModel.from_pretrained(model_name)
--- a/docs/source/model_doc/mbart.mdx
+++ b/docs/source/model_doc/mbart.mdx
@@ -52,7 +52,7 @@ inside the context manager [`~MBartTokenizer.as_target_tokenizer`] to encode tar

 >>> model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-en-ro")
 >>> # forward pass
->>> model(**inputs, labels=batch['labels'])
+>>> model(**inputs, labels=batch["labels"])
 ```

 - Generation
@@ -106,13 +106,13 @@ model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-50")
 tokenizer = MBart50TokenizerFast.from_pretrained("facebook/mbart-large-50", src_lang="en_XX", tgt_lang="ro_RO")

 src_text = " UN Chief Says There Is No Military Solution in Syria"
-tgt_text =  "Şeful ONU declară că nu există o soluţie militară în Siria"
+tgt_text = "Şeful ONU declară că nu există o soluţie militară în Siria"

 model_inputs = tokenizer(src_text, return_tensors="pt")
 with tokenizer.as_target_tokenizer():
    labels = tokenizer(tgt_text, return_tensors="pt").input_ids

-model(**model_inputs, labels=labels) # forward pass
+model(**model_inputs, labels=labels)  # forward pass
 ```

 - Generation
--- a/docs/source/model_doc/mluke.mdx
+++ b/docs/source/model_doc/mluke.mdx
@@ -38,7 +38,7 @@ One can directly plug in the weights of mLUKE into a LUKE model, like so:
 ```python
 from transformers import LukeModel

-model = LukeModel.from_pretrained('studio-ousia/mluke-base')
+model = LukeModel.from_pretrained("studio-ousia/mluke-base")
 ```

 Note that mLUKE has its own tokenizer, [`MLukeTokenizer`]. You can initialize it as follows:
@@ -46,7 +46,7 @@ Note that mLUKE has its own tokenizer, [`MLukeTokenizer`]. You can initialize it
 ```python
 from transformers import MLukeTokenizer

-tokenizer = MLukeTokenizer.from_pretrained('studio-ousia/mluke-base')
+tokenizer = MLukeTokenizer.from_pretrained("studio-ousia/mluke-base")
 ```

 As mLUKE's architecture is equivalent to that of LUKE, one can refer to [LUKE's documentation page](luke) for all
--- a/docs/source/model_doc/pegasus.mdx
+++ b/docs/source/model_doc/pegasus.mdx
@@ -69,18 +69,22 @@ All the [checkpoints](https://huggingface.co/models?search=pegasus) are fine-tun
 ```python
 >>> from transformers import PegasusForConditionalGeneration, PegasusTokenizer
 >>> import torch
+
 >>> src_text = [
 ...     """ PG&E stated it scheduled the blackouts in response to forecasts for high winds amid dry conditions. The aim is to reduce the risk of wildfires. Nearly 800 thousand customers were scheduled to be affected by the shutoffs which were expected to last through at least midday tomorrow."""
->>> ]
+... ]

->>> model_name = 'google/pegasus-xsum'
->>> device = 'cuda' if torch.cuda.is_available() else 'cpu'
->>> tokenizer = PegasusTokenizer.from_pretrained(model_name)
->>> model = PegasusForConditionalGeneration.from_pretrained(model_name).to(device)
->>> batch = tokenizer(src_text, truncation=True, padding='longest', return_tensors="pt").to(device)
->>> translated = model.generate(**batch)
->>> tgt_text = tokenizer.batch_decode(translated, skip_special_tokens=True)
->>> assert tgt_text[0] == "California's largest electricity provider has turned off power to hundreds of thousands of customers."
+>>> model_name = "google/pegasus-xsum"
+>>> device = "cuda" if torch.cuda.is_available() else "cpu"
+>>> tokenizer = PegasusTokenizer.from_pretrained(model_name)
+>>> model = PegasusForConditionalGeneration.from_pretrained(model_name).to(device)
+>>> batch = tokenizer(src_text, truncation=True, padding="longest", return_tensors="pt").to(device)
+>>> translated = model.generate(**batch)
+>>> tgt_text = tokenizer.batch_decode(translated, skip_special_tokens=True)
+>>> assert (
+...     tgt_text[0]
+...     == "California's largest electricity provider has turned off power to hundreds of thousands of customers."
+... )
 ```

 ## PegasusConfig
--- a/docs/source/model_doc/qdqbert.mdx
+++ b/docs/source/model_doc/qdqbert.mdx
@@ -75,9 +75,9 @@ tensors. After setting up the tensor quantizers, one can use the following examp
 ```python
 >>> # Find the TensorQuantizer and enable calibration
 >>> for name, module in model.named_modules():
->>>     if name.endswith('_input_quantizer'):
->>>         module.enable_calib()
->>>         module.disable_quant()  # Use full precision data to calibrate
+...     if name.endswith("_input_quantizer"):
+...         module.enable_calib()
+...         module.disable_quant()  # Use full precision data to calibrate

 >>> # Feeding data samples
 >>> model(x)
@@ -85,9 +85,9 @@ tensors. After setting up the tensor quantizers, one can use the following examp

 >>> # Finalize calibration
 >>> for name, module in model.named_modules():
->>>     if name.endswith('_input_quantizer'):
->>>         module.load_calib_amax()
->>>         module.enable_quant()
+...     if name.endswith("_input_quantizer"):
+...         module.load_calib_amax()
+...         module.enable_quant()

 >>> # If running on GPU, it needs to call .cuda() again because new tensors will be created by calibration process
 >>> model.cuda()
@@ -105,6 +105,7 @@ the instructions in [torch.onnx](https://pytorch.org/docs/stable/onnx.html). Exa

 ```python
 >>> from pytorch_quantization.nn import TensorQuantizer
+
 >>> TensorQuantizer.use_fb_fake_quant = True

 >>> # Load the calibrated model
--- a/docs/source/model_doc/reformer.mdx
+++ b/docs/source/model_doc/reformer.mdx
@@ -134,7 +134,7 @@ easily be trained on sequences as long as 64000 tokens.
 For training, the [`ReformerModelWithLMHead`] should be used as follows:

 ```python
-input_ids = tokenizer.encode('This is a sentence from the training data', return_tensors='pt')
+input_ids = tokenizer.encode("This is a sentence from the training data", return_tensors="pt")
 loss = model(input_ids, labels=input_ids)[0]
 ```

--- a/docs/source/model_doc/speech_to_text.mdx
+++ b/docs/source/model_doc/speech_to_text.mdx
@@ -52,11 +52,13 @@ be installed as follows: `apt install libsndfile1-dev`
 >>> model = Speech2TextForConditionalGeneration.from_pretrained("facebook/s2t-small-librispeech-asr")
 >>> processor = Speech2TextProcessor.from_pretrained("facebook/s2t-small-librispeech-asr")

+
 >>> def map_to_array(batch):
 ...     speech, _ = sf.read(batch["file"])
 ...     batch["speech"] = speech
 ...     return batch

+
 >>> ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
 >>> ds = ds.map(map_to_array)

@@ -83,16 +85,22 @@ be installed as follows: `apt install libsndfile1-dev`
 >>> model = Speech2TextForConditionalGeneration.from_pretrained("facebook/s2t-medium-mustc-multilingual-st")
 >>> processor = Speech2TextProcessor.from_pretrained("facebook/s2t-medium-mustc-multilingual-st")

+
 >>> def map_to_array(batch):
 ...     speech, _ = sf.read(batch["file"])
 ...     batch["speech"] = speech
 ...     return batch

+
 >>> ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
 >>> ds = ds.map(map_to_array)

 >>> inputs = processor(ds["speech"][0], sampling_rate=16_000, return_tensors="pt")
->>> generated_ids = model.generate(input_ids=inputs["input_features"], attention_mask=inputs["attention_mask], forced_bos_token_id=processor.tokenizer.lang_code_to_id["fr"])
+>>> generated_ids = model.generate(
+...     input_ids=inputs["input_features"],
+...     attention_mask=inputs["attention_mask"],
+...     forced_bos_token_id=processor.tokenizer.lang_code_to_id["fr"],
+... )

 >>> translation = processor.batch_decode(generated_ids)
 ```
--- a/docs/source/model_doc/speech_to_text_2.mdx
+++ b/docs/source/model_doc/speech_to_text_2.mdx
@@ -58,11 +58,13 @@ predicted token ids.
 >>> model = SpeechEncoderDecoderModel.from_pretrained("facebook/s2t-wav2vec2-large-en-de")
 >>> processor = Speech2Text2Processor.from_pretrained("facebook/s2t-wav2vec2-large-en-de")

+
 >>> def map_to_array(batch):
 ...     speech, _ = sf.read(batch["file"])
 ...     batch["speech"] = speech
 ...     return batch

+
 >>> ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
 >>> ds = ds.map(map_to_array)

@@ -81,7 +83,11 @@ predicted token ids.
 >>> from transformers import pipeline

 >>> librispeech_en = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
->>> asr = pipeline("automatic-speech-recognition", model="facebook/s2t-wav2vec2-large-en-de", feature_extractor="facebook/s2t-wav2vec2-large-en-de")
+>>> asr = pipeline(
+...     "automatic-speech-recognition",
+...     model="facebook/s2t-wav2vec2-large-en-de",
+...     feature_extractor="facebook/s2t-wav2vec2-large-en-de",
+... )

 >>> translation_de = asr(librispeech_en[0]["file"])
 ```
--- a/docs/source/model_doc/t5.mdx
+++ b/docs/source/model_doc/t5.mdx
@@ -98,8 +98,8 @@ language modeling head on top of the decoder.
  tokenizer = T5Tokenizer.from_pretrained("t5-small")
  model = T5ForConditionalGeneration.from_pretrained("t5-small")

-  input_ids = tokenizer('The <extra_id_0> walks in <extra_id_1> park', return_tensors='pt').input_ids
-  labels = tokenizer('<extra_id_0> cute dog <extra_id_1> the <extra_id_2>', return_tensors='pt').input_ids
+  input_ids = tokenizer("The <extra_id_0> walks in <extra_id_1> park", return_tensors="pt").input_ids
+  labels = tokenizer("<extra_id_0> cute dog <extra_id_1> the <extra_id_2>", return_tensors="pt").input_ids
  # the forward function automatically creates the correct decoder_input_ids
  loss = model(input_ids=input_ids, labels=labels).loss
  ```
@@ -120,8 +120,8 @@ language modeling head on top of the decoder.
  tokenizer = T5Tokenizer.from_pretrained("t5-small")
  model = T5ForConditionalGeneration.from_pretrained("t5-small")

-  input_ids = tokenizer('translate English to German: The house is wonderful.', return_tensors='pt').input_ids
-  labels = tokenizer('Das Haus ist wunderbar.', return_tensors='pt').input_ids
+  input_ids = tokenizer("translate English to German: The house is wonderful.", return_tensors="pt").input_ids
+  labels = tokenizer("Das Haus ist wunderbar.", return_tensors="pt").input_ids
  # the forward function automatically creates the correct decoder_input_ids
  loss = model(input_ids=input_ids, labels=labels).loss
  ```
@@ -148,7 +148,7 @@ language modeling head on top of the decoder.
  ignored. The code example below illustrates all of this.

  ```python
-  from transformers import T5Tokenizer, T5ForConditionalGeneration 
+  from transformers import T5Tokenizer, T5ForConditionalGeneration
  import torch

  tokenizer = T5Tokenizer.from_pretrained("t5-small")
@@ -168,18 +168,19 @@ language modeling head on top of the decoder.
  # encode the inputs
  task_prefix = "translate English to French: "
  input_sequences = [input_sequence_1, input_sequence_2]
-  encoding = tokenizer([task_prefix + sequence for sequence in input_sequences], 
-                      padding='longest', 
-                      max_length=max_source_length, 
-                      truncation=True, 
-                      return_tensors="pt")
+  encoding = tokenizer(
+      [task_prefix + sequence for sequence in input_sequences],
+      padding="longest",
+      max_length=max_source_length,
+      truncation=True,
+      return_tensors="pt",
+  )
  input_ids, attention_mask = encoding.input_ids, encoding.attention_mask

  # encode the targets
-  target_encoding = tokenizer([output_sequence_1, output_sequence_2], 
-                              padding='longest', 
-                              max_length=max_target_length, 
-                              truncation=True)
+  target_encoding = tokenizer(
+      [output_sequence_1, output_sequence_2], padding="longest", max_length=max_target_length, truncation=True
+  )
  labels = target_encoding.input_ids

  # replace padding token id's of the labels by -100
@@ -218,12 +219,12 @@ There's also [this blog post](https://huggingface.co/blog/encoder-decoder#encode
 generation works in general in encoder-decoder models.

 ```python
-from transformers import T5Tokenizer, T5ForConditionalGeneration 
+from transformers import T5Tokenizer, T5ForConditionalGeneration

 tokenizer = T5Tokenizer.from_pretrained("t5-small")
 model = T5ForConditionalGeneration.from_pretrained("t5-small")

-input_ids = tokenizer('translate English to German: The house is wonderful.', return_tensors='pt').input_ids
+input_ids = tokenizer("translate English to German: The house is wonderful.", return_tensors="pt").input_ids
 outputs = model.generate(input_ids)
 print(tokenizer.decode(outputs[0], skip_special_tokens=True))
 # Das Haus ist wunderbar.
@@ -242,17 +243,17 @@ model = T5ForConditionalGeneration.from_pretrained("t5-small")

 # when generating, we will use the logits of right-most token to predict the next token
 # so the padding should be on the left
-tokenizer.padding_side = "left" 
-tokenizer.pad_token = tokenizer.eos_token # to avoid an error
+tokenizer.padding_side = "left"
+tokenizer.pad_token = tokenizer.eos_token  # to avoid an error

-task_prefix = 'translate English to German: '
-sentences = ['The house is wonderful.', 'I like to work in NYC.'] # use different length sentences to test batching
+task_prefix = "translate English to German: "
+sentences = ["The house is wonderful.", "I like to work in NYC."]  # use different length sentences to test batching
 inputs = tokenizer([task_prefix + sentence for sentence in sentences], return_tensors="pt", padding=True)

 output_sequences = model.generate(
-    input_ids=inputs['input_ids'],
-    attention_mask=inputs['attention_mask'],
-    do_sample=False, # disable sampling to test if batching affects output
+    input_ids=inputs["input_ids"],
+    attention_mask=inputs["attention_mask"],
+    do_sample=False,  # disable sampling to test if batching affects output
 )

 print(tokenizer.batch_decode(output_sequences, skip_special_tokens=True))
--- a/docs/source/model_doc/t5v1.1.mdx
+++ b/docs/source/model_doc/t5v1.1.mdx
@@ -22,7 +22,7 @@ One can directly plug in the weights of T5v1.1 into a T5 model, like so:
 ```python
 from transformers import T5ForConditionalGeneration

-model = T5ForConditionalGeneration.from_pretrained('google/t5-v1_1-base')
+model = T5ForConditionalGeneration.from_pretrained("google/t5-v1_1-base")
 ```

 T5 Version 1.1 includes the following improvements compared to the original T5 model:
--- a/docs/source/model_doc/tapas.mdx
+++ b/docs/source/model_doc/tapas.mdx
@@ -75,28 +75,28 @@ dependency in case you're using Tensorflow:
 >>> from transformers import TapasConfig, TapasForQuestionAnswering

 >>> # for example, the base sized model with default SQA configuration
->>> model = TapasForQuestionAnswering.from_pretrained('google/tapas-base')
+>>> model = TapasForQuestionAnswering.from_pretrained("google/tapas-base")

 >>> # or, the base sized model with WTQ configuration
->>> config = TapasConfig.from_pretrained('google/tapas-base-finetuned-wtq')
->>> model = TapasForQuestionAnswering.from_pretrained('google/tapas-base', config=config)
+>>> config = TapasConfig.from_pretrained("google/tapas-base-finetuned-wtq")
+>>> model = TapasForQuestionAnswering.from_pretrained("google/tapas-base", config=config)

 >>> # or, the base sized model with WikiSQL configuration
->>> config = TapasConfig('google-base-finetuned-wikisql-supervised')
->>> model = TapasForQuestionAnswering.from_pretrained('google/tapas-base', config=config)
+>>> config = TapasConfig.from_pretrained("google/tapas-base-finetuned-wikisql-supervised")
+>>> model = TapasForQuestionAnswering.from_pretrained("google/tapas-base", config=config)
 ===PT-TF-SPLIT===
 >>> from transformers import TapasConfig, TFTapasForQuestionAnswering

 >>> # for example, the base sized model with default SQA configuration
->>> model = TFTapasForQuestionAnswering.from_pretrained('google/tapas-base')
+>>> model = TFTapasForQuestionAnswering.from_pretrained("google/tapas-base")

 >>> # or, the base sized model with WTQ configuration
->>> config = TapasConfig.from_pretrained('google/tapas-base-finetuned-wtq')
->>> model = TFTapasForQuestionAnswering.from_pretrained('google/tapas-base', config=config)
+>>> config = TapasConfig.from_pretrained("google/tapas-base-finetuned-wtq")
+>>> model = TFTapasForQuestionAnswering.from_pretrained("google/tapas-base", config=config)

 >>> # or, the base sized model with WikiSQL configuration
->>> config = TapasConfig('google-base-finetuned-wikisql-supervised')
->>> model = TFTapasForQuestionAnswering.from_pretrained('google/tapas-base', config=config)
+>>> config = TapasConfig.from_pretrained("google/tapas-base-finetuned-wikisql-supervised")
+>>> model = TFTapasForQuestionAnswering.from_pretrained("google/tapas-base", config=config)
 ```

 Of course, you don't necessarily have to follow one of these three ways in which TAPAS was fine-tuned. You can also experiment by defining any hyperparameters you want when initializing [`TapasConfig`], and then create a [`TapasForQuestionAnswering`] based on that configuration. For example, if you have a dataset that has both conversational questions and questions that might involve aggregation, then you can do it this way. Here's an example:
@@ -107,14 +107,14 @@ Of course, you don't necessarily have to follow one of these three ways in which
 >>> # you can initialize the classification heads any way you want (see docs of TapasConfig)
 >>> config = TapasConfig(num_aggregation_labels=3, average_logits_per_cell=True)
 >>> # initializing the pre-trained base sized model with our custom classification heads
->>> model = TapasForQuestionAnswering.from_pretrained('google/tapas-base', config=config)
+>>> model = TapasForQuestionAnswering.from_pretrained("google/tapas-base", config=config)
 ===PT-TF-SPLIT===
 >>> from transformers import TapasConfig, TFTapasForQuestionAnswering

 >>> # you can initialize the classification heads any way you want (see docs of TapasConfig)
 >>> config = TapasConfig(num_aggregation_labels=3, average_logits_per_cell=True)
 >>> # initializing the pre-trained base sized model with our custom classification heads
->>> model = TFTapasForQuestionAnswering.from_pretrained('google/tapas-base', config=config)
+>>> model = TFTapasForQuestionAnswering.from_pretrained("google/tapas-base", config=config)
 ```

 What you can also do is start from an already fine-tuned checkpoint. A note here is that the already fine-tuned checkpoint on WTQ has some issues due to the L2-loss which is somewhat brittle. See [here](https://github.com/google-research/tapas/issues/91#issuecomment-735719340) for more info.
@@ -154,15 +154,26 @@ inputs to be fine-tuned:
 >>> from transformers import TapasTokenizer
 >>> import pandas as pd

->>> model_name = 'google/tapas-base'
+>>> model_name = "google/tapas-base"
 >>> tokenizer = TapasTokenizer.from_pretrained(model_name)

->>> data = {'Actors': ["Brad Pitt", "Leonardo Di Caprio", "George Clooney"], 'Number of movies': ["87", "53", "69"]}
->>> queries = ["What is the name of the first actor?", "How many movies has George Clooney played in?", "What is the total number of movies?"]
+>>> data = {"Actors": ["Brad Pitt", "Leonardo Di Caprio", "George Clooney"], "Number of movies": ["87", "53", "69"]}
+>>> queries = [
+...     "What is the name of the first actor?",
+...     "How many movies has George Clooney played in?",
+...     "What is the total number of movies?",
+... ]
 >>> answer_coordinates = [[(0, 0)], [(2, 1)], [(0, 1), (1, 1), (2, 1)]]
 >>> answer_text = [["Brad Pitt"], ["69"], ["209"]]
 >>> table = pd.DataFrame.from_dict(data)
->>> inputs = tokenizer(table=table, queries=queries, answer_coordinates=answer_coordinates, answer_text=answer_text, padding='max_length', return_tensors='pt')
+>>> inputs = tokenizer(
+...     table=table,
+...     queries=queries,
+...     answer_coordinates=answer_coordinates,
+...     answer_text=answer_text,
+...     padding="max_length",
+...     return_tensors="pt",
+... )
 >>> inputs
 {'input_ids': tensor([[ ... ]]), 'attention_mask': tensor([[...]]), 'token_type_ids': tensor([[[...]]]),
 'numeric_values': tensor([[ ... ]]), 'numeric_values_scale: tensor([[ ... ]]), labels: tensor([[ ... ]])}
@@ -170,15 +181,26 @@ inputs to be fine-tuned:
 >>> from transformers import TapasTokenizer
 >>> import pandas as pd

->>> model_name = 'google/tapas-base'
+>>> model_name = "google/tapas-base"
 >>> tokenizer = TapasTokenizer.from_pretrained(model_name)

->>> data = {'Actors': ["Brad Pitt", "Leonardo Di Caprio", "George Clooney"], 'Number of movies': ["87", "53", "69"]}
->>> queries = ["What is the name of the first actor?", "How many movies has George Clooney played in?", "What is the total number of movies?"]
+>>> data = {"Actors": ["Brad Pitt", "Leonardo Di Caprio", "George Clooney"], "Number of movies": ["87", "53", "69"]}
+>>> queries = [
+...     "What is the name of the first actor?",
+...     "How many movies has George Clooney played in?",
+...     "What is the total number of movies?",
+... ]
 >>> answer_coordinates = [[(0, 0)], [(2, 1)], [(0, 1), (1, 1), (2, 1)]]
 >>> answer_text = [["Brad Pitt"], ["69"], ["209"]]
 >>> table = pd.DataFrame.from_dict(data)
->>> inputs = tokenizer(table=table, queries=queries, answer_coordinates=answer_coordinates, answer_text=answer_text, padding='max_length', return_tensors='tf')
+>>> inputs = tokenizer(
+...     table=table,
+...     queries=queries,
+...     answer_coordinates=answer_coordinates,
+...     answer_text=answer_text,
+...     padding="max_length",
+...     return_tensors="tf",
+... )
 >>> inputs
 {'input_ids': tensor([[ ... ]]), 'attention_mask': tensor([[...]]), 'token_type_ids': tensor([[[...]]]),
 'numeric_values': tensor([[ ... ]]), 'numeric_values_scale: tensor([[ ... ]]), labels: tensor([[ ... ]])}
@@ -194,32 +216,37 @@ Of course, this only shows how to encode a single training example. It is advise
 >>> tsv_path = "your_path_to_the_tsv_file"
 >>> table_csv_path = "your_path_to_a_directory_containing_all_csv_files"

+
 >>> class TableDataset(torch.utils.data.Dataset):
 ...     def __init__(self, data, tokenizer):
 ...         self.data = data
 ...         self.tokenizer = tokenizer
-...
+
 ...     def __getitem__(self, idx):
 ...         item = data.iloc[idx]
-...         table = pd.read_csv(table_csv_path + item.table_file).astype(str) # be sure to make your table data text only
-...         encoding = self.tokenizer(table=table, 
-...                                   queries=item.question, 
-...                                   answer_coordinates=item.answer_coordinates, 
-...                                   answer_text=item.answer_text,
-...                                   truncation=True,
-...                                   padding="max_length",
-...                                   return_tensors="pt"
+...         table = pd.read_csv(table_csv_path + item.table_file).astype(
+...             str
+...         )  # be sure to make your table data text only
+...         encoding = self.tokenizer(
+...             table=table,
+...             queries=item.question,
+...             answer_coordinates=item.answer_coordinates,
+...             answer_text=item.answer_text,
+...             truncation=True,
+...             padding="max_length",
+...             return_tensors="pt",
 ...         )
 ...         # remove the batch dimension which the tokenizer adds by default
 ...         encoding = {key: val.squeeze(0) for key, val in encoding.items()}
 ...         # add the float_answer which is also required (weak supervision for aggregation case)
-...         encoding["float_answer"] = torch.tensor(item.float_answer) 
+...         encoding["float_answer"] = torch.tensor(item.float_answer)
 ...         return encoding
-...
-...     def __len__(self):
-...        return len(self.data)

->>> data = pd.read_csv(tsv_path, sep='\t')
+...     def __len__(self):
+...         return len(self.data)
+
+
+>>> data = pd.read_csv(tsv_path, sep="\t")
 >>> train_dataset = TableDataset(data, tokenizer)
 >>> train_dataloader = torch.utils.data.DataLoader(train_dataset, batch_size=32)
 ===PT-TF-SPLIT===
@@ -229,44 +256,50 @@ Of course, this only shows how to encode a single training example. It is advise
 >>> tsv_path = "your_path_to_the_tsv_file"
 >>> table_csv_path = "your_path_to_a_directory_containing_all_csv_files"

+
 >>> class TableDataset:
 ...     def __init__(self, data, tokenizer):
 ...         self.data = data
 ...         self.tokenizer = tokenizer
-...
+
 ...     def __iter__(self):
 ...         for idx in range(self.__len__()):
 ...             item = self.data.iloc[idx]
-...             table = pd.read_csv(table_csv_path + item.table_file).astype(str) # be sure to make your table data text only
-...             encoding = self.tokenizer(table=table, 
-...                                   queries=item.question, 
-...                                   answer_coordinates=item.answer_coordinates, 
-...                                   answer_text=item.answer_text,
-...                                   truncation=True,
-...                                   padding="max_length",
-...                                   return_tensors="tf"
+...             table = pd.read_csv(table_csv_path + item.table_file).astype(
+...                 str
+...             )  # be sure to make your table data text only
+...             encoding = self.tokenizer(
+...                 table=table,
+...                 queries=item.question,
+...                 answer_coordinates=item.answer_coordinates,
+...                 answer_text=item.answer_text,
+...                 truncation=True,
+...                 padding="max_length",
+...                 return_tensors="tf",
 ...             )
 ...             # remove the batch dimension which the tokenizer adds by default
-...             encoding = {key: tf.squeeze(val,0) for key, val in encoding.items()}
+...             encoding = {key: tf.squeeze(val, 0) for key, val in encoding.items()}
 ...             # add the float_answer which is also required (weak supervision for aggregation case)
-...             encoding["float_answer"] = tf.convert_to_tensor(item.float_answer,dtype=tf.float32)
-...             yield encoding['input_ids'], encoding['attention_mask'], encoding['numeric_values'], \
-...                   encoding['numeric_values_scale'], encoding['token_type_ids'], encoding['labels'], \
-...                   encoding['float_answer']
-...
-...     def __len__(self):
-...        return len(self.data)
+...             encoding["float_answer"] = tf.convert_to_tensor(item.float_answer, dtype=tf.float32)
+...             yield encoding["input_ids"], encoding["attention_mask"], encoding["numeric_values"], encoding[
+...                 "numeric_values_scale"
+...             ], encoding["token_type_ids"], encoding["labels"], encoding["float_answer"]

->>> data = pd.read_csv(tsv_path, sep='\t')
+...     def __len__(self):
+...         return len(self.data)
+
+
+>>> data = pd.read_csv(tsv_path, sep="\t")
 >>> train_dataset = TableDataset(data, tokenizer)
 >>> output_signature = (
-... tf.TensorSpec(shape=(512,), dtype=tf.int32),
-... tf.TensorSpec(shape=(512,), dtype=tf.int32),
-... tf.TensorSpec(shape=(512,), dtype=tf.float32),
-... tf.TensorSpec(shape=(512,), dtype=tf.float32),
-... tf.TensorSpec(shape=(512,7), dtype=tf.int32),
-... tf.TensorSpec(shape=(512,), dtype=tf.int32),
-... tf.TensorSpec(shape=(512,), dtype=tf.float32))
+...     tf.TensorSpec(shape=(512,), dtype=tf.int32),
+...     tf.TensorSpec(shape=(512,), dtype=tf.int32),
+...     tf.TensorSpec(shape=(512,), dtype=tf.float32),
+...     tf.TensorSpec(shape=(512,), dtype=tf.float32),
+...     tf.TensorSpec(shape=(512, 7), dtype=tf.int32),
+...     tf.TensorSpec(shape=(512,), dtype=tf.int32),
+...     tf.TensorSpec(shape=(512,), dtype=tf.float32),
+... )
 >>> train_dataloader = tf.data.Dataset.from_generator(train_dataset, output_signature=output_signature).batch(32)
 ```

@@ -282,15 +315,15 @@ You can then fine-tune [`TapasForQuestionAnswering`] or [`TFTapasForQuestionAnsw

 >>> # this is the default WTQ configuration
 >>> config = TapasConfig(
-...            num_aggregation_labels = 4,
-...            use_answer_as_supervision = True,
-...            answer_loss_cutoff = 0.664694,
-...            cell_selection_preference = 0.207951,
-...            huber_loss_delta = 0.121194,
-...            init_cell_selection_weights_to_zero = True,
-...            select_one_column = True,
-...            allow_empty_column_selection = False,
-...            temperature = 0.0352513,
+...     num_aggregation_labels=4,
+...     use_answer_as_supervision=True,
+...     answer_loss_cutoff=0.664694,
+...     cell_selection_preference=0.207951,
+...     huber_loss_delta=0.121194,
+...     init_cell_selection_weights_to_zero=True,
+...     select_one_column=True,
+...     allow_empty_column_selection=False,
+...     temperature=0.0352513,
 ... )
 >>> model = TapasForQuestionAnswering.from_pretrained("google/tapas-base", config=config)

@@ -298,8 +331,8 @@ You can then fine-tune [`TapasForQuestionAnswering`] or [`TFTapasForQuestionAnsw

 >>> model.train()
 >>> for epoch in range(2):  # loop over the dataset multiple times
-...    for batch in train_dataloader:
-...         # get the inputs; 
+...     for batch in train_dataloader:
+...         # get the inputs;
 ...         input_ids = batch["input_ids"]
 ...         attention_mask = batch["attention_mask"]
 ...         token_type_ids = batch["token_type_ids"]
@@ -312,9 +345,15 @@ You can then fine-tune [`TapasForQuestionAnswering`] or [`TFTapasForQuestionAnsw
 ...         optimizer.zero_grad()

 ...         # forward + backward + optimize
-...         outputs = model(input_ids=input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids, 
-...                        labels=labels, numeric_values=numeric_values, numeric_values_scale=numeric_values_scale, 
-...                        float_answer=float_answer)
+...         outputs = model(
+...             input_ids=input_ids,
+...             attention_mask=attention_mask,
+...             token_type_ids=token_type_ids,
+...             labels=labels,
+...             numeric_values=numeric_values,
+...             numeric_values_scale=numeric_values_scale,
+...             float_answer=float_answer,
+...         )
 ...         loss = outputs.loss
 ...         loss.backward()
 ...         optimizer.step()
@@ -324,23 +363,23 @@ You can then fine-tune [`TapasForQuestionAnswering`] or [`TFTapasForQuestionAnsw

 >>> # this is the default WTQ configuration
 >>> config = TapasConfig(
-...            num_aggregation_labels = 4,
-...            use_answer_as_supervision = True,
-...            answer_loss_cutoff = 0.664694,
-...            cell_selection_preference = 0.207951,
-...            huber_loss_delta = 0.121194,
-...            init_cell_selection_weights_to_zero = True,
-...            select_one_column = True,
-...            allow_empty_column_selection = False,
-...            temperature = 0.0352513,
+...     num_aggregation_labels=4,
+...     use_answer_as_supervision=True,
+...     answer_loss_cutoff=0.664694,
+...     cell_selection_preference=0.207951,
+...     huber_loss_delta=0.121194,
+...     init_cell_selection_weights_to_zero=True,
+...     select_one_column=True,
+...     allow_empty_column_selection=False,
+...     temperature=0.0352513,
 ... )
 >>> model = TFTapasForQuestionAnswering.from_pretrained("google/tapas-base", config=config)

 >>> optimizer = tf.keras.optimizers.Adam(learning_rate=5e-5)

 >>> for epoch in range(2):  # loop over the dataset multiple times
-...    for batch in train_dataloader:
-...         # get the inputs; 
+...     for batch in train_dataloader:
+...         # get the inputs;
 ...         input_ids = batch[0]
 ...         attention_mask = batch[1]
 ...         token_type_ids = batch[4]
@@ -351,9 +390,15 @@ You can then fine-tune [`TapasForQuestionAnswering`] or [`TFTapasForQuestionAnsw

 ...         # forward + backward + optimize
 ...         with tf.GradientTape() as tape:
-...              outputs = model(input_ids=input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids, 
-...                        labels=labels, numeric_values=numeric_values, numeric_values_scale=numeric_values_scale, 
-...                        float_answer=float_answer )
+...             outputs = model(
+...                 input_ids=input_ids,
+...                 attention_mask=attention_mask,
+...                 token_type_ids=token_type_ids,
+...                 labels=labels,
+...                 numeric_values=numeric_values,
+...                 numeric_values_scale=numeric_values_scale,
+...                 float_answer=float_answer,
+...             )
 ...         grads = tape.gradient(outputs.loss, model.trainable_weights)
 ...         optimizer.apply_gradients(zip(grads, model.trainable_weights))
 ```
@@ -366,47 +411,49 @@ However, note that inference is **different** depending on whether or not the se

 ```py
 >>> from transformers import TapasTokenizer, TapasForQuestionAnswering
->>> import pandas as pd 
+>>> import pandas as pd

->>> model_name = 'google/tapas-base-finetuned-wtq'
+>>> model_name = "google/tapas-base-finetuned-wtq"
 >>> model = TapasForQuestionAnswering.from_pretrained(model_name)
 >>> tokenizer = TapasTokenizer.from_pretrained(model_name)

->>> data = {'Actors': ["Brad Pitt", "Leonardo Di Caprio", "George Clooney"], 'Number of movies': ["87", "53", "69"]}
->>> queries = ["What is the name of the first actor?", "How many movies has George Clooney played in?", "What is the total number of movies?"]
+>>> data = {"Actors": ["Brad Pitt", "Leonardo Di Caprio", "George Clooney"], "Number of movies": ["87", "53", "69"]}
+>>> queries = [
+...     "What is the name of the first actor?",
+...     "How many movies has George Clooney played in?",
+...     "What is the total number of movies?",
+... ]
 >>> table = pd.DataFrame.from_dict(data)
->>> inputs = tokenizer(table=table, queries=queries, padding='max_length', return_tensors="pt") 
+>>> inputs = tokenizer(table=table, queries=queries, padding="max_length", return_tensors="pt")
 >>> outputs = model(**inputs)
 >>> predicted_answer_coordinates, predicted_aggregation_indices = tokenizer.convert_logits_to_predictions(
-...         inputs, 
-...         outputs.logits.detach(), 
-...         outputs.logits_aggregation.detach()
+...     inputs, outputs.logits.detach(), outputs.logits_aggregation.detach()
 ... )

 >>> # let's print out the results:
->>> id2aggregation = {0: "NONE", 1: "SUM", 2: "AVERAGE", 3:"COUNT"}
+>>> id2aggregation = {0: "NONE", 1: "SUM", 2: "AVERAGE", 3: "COUNT"}
 >>> aggregation_predictions_string = [id2aggregation[x] for x in predicted_aggregation_indices]

 >>> answers = []
 >>> for coordinates in predicted_answer_coordinates:
-...   if len(coordinates) == 1:
-...     # only a single cell:
-...     answers.append(table.iat[coordinates[0]])
-...   else:
-...     # multiple cells
-...     cell_values = []
-...     for coordinate in coordinates:
-...        cell_values.append(table.iat[coordinate])
-...     answers.append(", ".join(cell_values))
+...     if len(coordinates) == 1:
+...         # only a single cell:
+...         answers.append(table.iat[coordinates[0]])
+...     else:
+...         # multiple cells
+...         cell_values = []
+...         for coordinate in coordinates:
+...             cell_values.append(table.iat[coordinate])
+...         answers.append(", ".join(cell_values))

 >>> display(table)
 >>> print("")
 >>> for query, answer, predicted_agg in zip(queries, answers, aggregation_predictions_string):
-...   print(query)
-...   if predicted_agg == "NONE":
-...     print("Predicted answer: " + answer)
-...   else:
-...     print("Predicted answer: " + predicted_agg + " > " + answer)    
+...     print(query)
+...     if predicted_agg == "NONE":
+...         print("Predicted answer: " + answer)
+...     else:
+...         print("Predicted answer: " + predicted_agg + " > " + answer)
 What is the name of the first actor?
 Predicted answer: Brad Pitt
 How many movies has George Clooney played in?
@@ -415,47 +462,49 @@ What is the total number of movies?
 Predicted answer: SUM > 87, 53, 69
 ===PT-TF-SPLIT===
 >>> from transformers import TapasTokenizer, TFTapasForQuestionAnswering
->>> import pandas as pd 
+>>> import pandas as pd

->>> model_name = 'google/tapas-base-finetuned-wtq'
+>>> model_name = "google/tapas-base-finetuned-wtq"
 >>> model = TFTapasForQuestionAnswering.from_pretrained(model_name)
 >>> tokenizer = TapasTokenizer.from_pretrained(model_name)

->>> data = {'Actors': ["Brad Pitt", "Leonardo Di Caprio", "George Clooney"], 'Number of movies': ["87", "53", "69"]}
->>> queries = ["What is the name of the first actor?", "How many movies has George Clooney played in?", "What is the total number of movies?"]
+>>> data = {"Actors": ["Brad Pitt", "Leonardo Di Caprio", "George Clooney"], "Number of movies": ["87", "53", "69"]}
+>>> queries = [
+...     "What is the name of the first actor?",
+...     "How many movies has George Clooney played in?",
+...     "What is the total number of movies?",
+... ]
 >>> table = pd.DataFrame.from_dict(data)
->>> inputs = tokenizer(table=table, queries=queries, padding='max_length', return_tensors="tf") 
+>>> inputs = tokenizer(table=table, queries=queries, padding="max_length", return_tensors="tf")
 >>> outputs = model(**inputs)
 >>> predicted_answer_coordinates, predicted_aggregation_indices = tokenizer.convert_logits_to_predictions(
-...         inputs, 
-...         outputs.logits, 
-...         outputs.logits_aggregation
+...     inputs, outputs.logits, outputs.logits_aggregation
 ... )

 >>> # let's print out the results:
->>> id2aggregation = {0: "NONE", 1: "SUM", 2: "AVERAGE", 3:"COUNT"}
+>>> id2aggregation = {0: "NONE", 1: "SUM", 2: "AVERAGE", 3: "COUNT"}
 >>> aggregation_predictions_string = [id2aggregation[x] for x in predicted_aggregation_indices]

 >>> answers = []
 >>> for coordinates in predicted_answer_coordinates:
-...   if len(coordinates) == 1:
-...     # only a single cell:
-...     answers.append(table.iat[coordinates[0]])
-...   else:
-...     # multiple cells
-...     cell_values = []
-...     for coordinate in coordinates:
-...        cell_values.append(table.iat[coordinate])
-...     answers.append(", ".join(cell_values))
+...     if len(coordinates) == 1:
+...         # only a single cell:
+...         answers.append(table.iat[coordinates[0]])
+...     else:
+...         # multiple cells
+...         cell_values = []
+...         for coordinate in coordinates:
+...             cell_values.append(table.iat[coordinate])
+...         answers.append(", ".join(cell_values))

 >>> display(table)
 >>> print("")
 >>> for query, answer, predicted_agg in zip(queries, answers, aggregation_predictions_string):
-...   print(query)
-...   if predicted_agg == "NONE":
-...     print("Predicted answer: " + answer)
-...   else:
-...     print("Predicted answer: " + predicted_agg + " > " + answer)    
+...     print(query)
+...     if predicted_agg == "NONE":
+...         print("Predicted answer: " + answer)
+...     else:
+...         print("Predicted answer: " + predicted_agg + " > " + answer)
 What is the name of the first actor?
 Predicted answer: Brad Pitt
 How many movies has George Clooney played in?
--- a/docs/source/model_doc/visual_bert.mdx
+++ b/docs/source/model_doc/visual_bert.mdx
@@ -77,11 +77,13 @@ The following example shows how to get the last hidden state using [`VisualBertM

 >>> visual_token_type_ids = torch.ones(visual_embeds.shape[:-1], dtype=torch.long)
 >>> visual_attention_mask = torch.ones(visual_embeds.shape[:-1], dtype=torch.float)
->>> inputs.update({
-...     "visual_embeds": visual_embeds,
-...     "visual_token_type_ids": visual_token_type_ids,
-...     "visual_attention_mask": visual_attention_mask
-... })
+>>> inputs.update(
+...     {
+...         "visual_embeds": visual_embeds,
+...         "visual_token_type_ids": visual_token_type_ids,
+...         "visual_attention_mask": visual_attention_mask,
+...     }
+... )
 >>> outputs = model(**inputs)
 >>> last_hidden_state = outputs.last_hidden_state
 ```
--- a/docs/source/model_sharing.mdx
+++ b/docs/source/model_sharing.mdx
@@ -50,9 +50,8 @@ For instance:

 ```python
 >>> model = AutoModel.from_pretrained(
->>>     "julien-c/EsperBERTo-small",
->>>     revision="v2.0.1" # tag name, or branch name, or commit hash
->>> )
+...     "julien-c/EsperBERTo-small", revision="v2.0.1"  # tag name, or branch name, or commit hash
+... )
 ```

 ## Push your model from Python
@@ -344,9 +343,8 @@ You may specify a revision by using the `revision` flag in the `from_pretrained`

 ```python
 >>> tokenizer = AutoTokenizer.from_pretrained(
->>>   "julien-c/EsperBERTo-small",
->>>   revision="v2.0.1" # tag name, or branch name, or commit hash
->>> )
+...     "julien-c/EsperBERTo-small", revision="v2.0.1"  # tag name, or branch name, or commit hash
+... )
 ```

 ## Workflow in a Colab notebook
--- a/docs/source/multilingual.mdx
+++ b/docs/source/multilingual.mdx
@@ -62,18 +62,18 @@ The different languages this model/tokenizer handles, as well as the ids of thes
 These ids should be used when passing a language parameter during a model pass. Let's define our inputs:

 ```py
->>> input_ids = torch.tensor([tokenizer.encode("Wikipedia was used to")]) # batch size of 1
+>>> input_ids = torch.tensor([tokenizer.encode("Wikipedia was used to")])  # batch size of 1
 ```

 We should now define the language embedding by using the previously defined language id. We want to create a tensor
 filled with the appropriate language ids, of the same size as input_ids. For English, the id is 0:

 ```py
->>> language_id = tokenizer.lang2id['en']  # 0
+>>> language_id = tokenizer.lang2id["en"]  # 0
 >>> langs = torch.tensor([language_id] * input_ids.shape[1])  # torch.tensor([0, 0, 0, ..., 0])

 >>> # We reshape it to be of size (batch_size, sequence_length)
->>> langs = langs.view(1, -1) # is now of shape [1, sequence_length] (we have a batch size of 1)
+>>> langs = langs.view(1, -1)  # is now of shape [1, sequence_length] (we have a batch size of 1)
 ```

 You can then feed it all as input to your model:
--- a/docs/source/perplexity.mdx
+++ b/docs/source/perplexity.mdx
@@ -69,8 +69,9 @@ Let's demonstrate this process with GPT-2.

 ```python
 from transformers import GPT2LMHeadModel, GPT2TokenizerFast
-device = 'cuda'
-model_id = 'gpt2-large'
+
+device = "cuda"
+model_id = "gpt2-large"
 model = GPT2LMHeadModel.from_pretrained(model_id).to(device)
 tokenizer = GPT2TokenizerFast.from_pretrained(model_id)
 ```
@@ -81,8 +82,9 @@ dataset in memory.

 ```python
 from datasets import load_dataset
-test = load_dataset('wikitext', 'wikitext-2-raw-v1', split='test')
-encodings = tokenizer('\n\n'.join(test['text']), return_tensors='pt')
+
+test = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")
+encodings = tokenizer("\n\n".join(test["text"]), return_tensors="pt")
 ```

 With 🤗 Transformers, we can simply pass the `input_ids` as the `labels` to our model, and the average negative
@@ -104,10 +106,10 @@ nlls = []
 for i in tqdm(range(0, encodings.input_ids.size(1), stride)):
    begin_loc = max(i + stride - max_length, 0)
    end_loc = min(i + stride, encodings.input_ids.size(1))
-    trg_len = end_loc - i    # may be different from stride on last loop
-    input_ids = encodings.input_ids[:,begin_loc:end_loc].to(device)
+    trg_len = end_loc - i  # may be different from stride on last loop
+    input_ids = encodings.input_ids[:, begin_loc:end_loc].to(device)
    target_ids = input_ids.clone()
-    target_ids[:,:-trg_len] = -100
+    target_ids[:, :-trg_len] = -100

    with torch.no_grad():
        outputs = model(input_ids, labels=target_ids)
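
The index arithmetic in the strided loop above can be checked on its own, without loading a model. The following is a minimal sketch: the helper name `sliding_windows` and the small sizes (`seq_len=10`, `max_length=6`, `stride=3`) are assumptions chosen only for illustration and are not part of the documented example. It shows that each window scores at most `stride` new tokens, and that every token is scored exactly once.

```python
def sliding_windows(seq_len, max_length, stride):
    """Yield (begin_loc, end_loc, trg_len) for each evaluation window.

    Hypothetical helper mirroring the loop body shown in the hunk above.
    """
    for i in range(0, seq_len, stride):
        begin_loc = max(i + stride - max_length, 0)  # start of the context window
        end_loc = min(i + stride, seq_len)  # end of the context window (exclusive)
        trg_len = end_loc - i  # may be shorter than stride on the last window
        yield begin_loc, end_loc, trg_len


# Hypothetical sizes, chosen only to keep the printed output short.
windows = list(sliding_windows(seq_len=10, max_length=6, stride=3))
for begin_loc, end_loc, trg_len in windows:
    print(f"context [{begin_loc}:{end_loc}] scores the last {trg_len} token(s)")
```

In the loop above, the positions before the last `trg_len` tokens of each window have their labels set to `-100`, so they serve as context only and do not contribute to the loss.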
--- a/docs/source/preprocessing.mdx
+++ b/docs/source/preprocessing.mdx
@@ -36,7 +36,8 @@ To automatically download the vocab used during pretraining or fine-tuning a giv

 ```py
 from transformers import AutoTokenizer
-tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')
+
+tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
 ```

 ## Base use
@@ -75,9 +76,7 @@ If you have several sentences you want to process, you can do this efficiently b
 tokenizer:

 ```py
->>> batch_sentences = ["Hello I'm a single sentence",
-...                    "And another sentence",
-...                    "And the very very last one"]
+>>> batch_sentences = ["Hello I'm a single sentence", "And another sentence", "And the very very last one"]
 >>> encoded_inputs = tokenizer(batch_sentences)
 >>> print(encoded_inputs)
 {'input_ids': [[101, 8667, 146, 112, 182, 170, 1423, 5650, 102],
@@ -174,12 +173,12 @@ If you have a list of pairs of sequences you want to process, you should feed th
 list of first sentences and the list of second sentences:

 ```py
->>> batch_sentences = ["Hello I'm a single sentence",
-...                    "And another sentence",
-...                    "And the very very last one"]
->>> batch_of_second_sentences = ["I'm a sentence that goes with the first sentence",
-...                              "And I should be encoded with the second sentence",
-...                              "And I go with the very last one"]
+>>> batch_sentences = ["Hello I'm a single sentence", "And another sentence", "And the very very last one"]
+>>> batch_of_second_sentences = [
+...     "I'm a sentence that goes with the first sentence",
+...     "And I should be encoded with the second sentence",
+...     "And I go with the very last one",
+... ]
 >>> encoded_inputs = tokenizer(batch_sentences, batch_of_second_sentences)
 >>> print(encoded_inputs)
 {'input_ids': [[101, 8667, 146, 112, 182, 170, 1423, 5650, 102, 146, 112, 182, 170, 5650, 1115, 2947, 1114, 1103, 1148, 5650, 102], 
@@ -199,7 +198,7 @@ To double-check what is fed to the model, we can decode each list in _input_ids_

 ```py
 >>> for ids in encoded_inputs["input_ids"]:
->>>     print(tokenizer.decode(ids))
+...     print(tokenizer.decode(ids))
 [CLS] Hello I'm a single sentence [SEP] I'm a sentence that goes with the first sentence [SEP]
 [CLS] And another sentence [SEP] And I should be encoded with the second sentence [SEP]
 [CLS] And the very very last one [SEP] And I go with the very last one [SEP]
@@ -307,35 +306,43 @@ This works exactly as before for batch of sentences or batch of pairs of sentenc
 like this:

 ```py
-batch_sentences = [["Hello", "I'm", "a", "single", "sentence"],
-                   ["And", "another", "sentence"],
-                   ["And", "the", "very", "very", "last", "one"]]
+batch_sentences = [
+    ["Hello", "I'm", "a", "single", "sentence"],
+    ["And", "another", "sentence"],
+    ["And", "the", "very", "very", "last", "one"],
+]
 encoded_inputs = tokenizer(batch_sentences, is_split_into_words=True)
 ```

 or a batch of pair sentences like this:

 ```py
-batch_of_second_sentences = [["I'm", "a", "sentence", "that", "goes", "with", "the", "first", "sentence"],
-                             ["And", "I", "should", "be", "encoded", "with", "the", "second", "sentence"],
-                             ["And", "I", "go", "with", "the", "very", "last", "one"]]
+batch_of_second_sentences = [
+    ["I'm", "a", "sentence", "that", "goes", "with", "the", "first", "sentence"],
+    ["And", "I", "should", "be", "encoded", "with", "the", "second", "sentence"],
+    ["And", "I", "go", "with", "the", "very", "last", "one"],
+]
 encoded_inputs = tokenizer(batch_sentences, batch_of_second_sentences, is_split_into_words=True)
 ```

 And you can add padding, truncation as well as directly return tensors like before:

 ```py
-batch = tokenizer(batch_sentences,
-                  batch_of_second_sentences,
-                  is_split_into_words=True,
-                  padding=True,
-                  truncation=True,
-                  return_tensors="pt")
+batch = tokenizer(
+    batch_sentences,
+    batch_of_second_sentences,
+    is_split_into_words=True,
+    padding=True,
+    truncation=True,
+    return_tensors="pt",
+)
 ===PT-TF-SPLIT===
-batch = tokenizer(batch_sentences,
-                  batch_of_second_sentences,
-                  is_split_into_words=True,
-                  padding=True,
-                  truncation=True,
-                  return_tensors="tf")
+batch = tokenizer(
+    batch_sentences,
+    batch_of_second_sentences,
+    is_split_into_words=True,
+    padding=True,
+    truncation=True,
+    return_tensors="tf",
+)
 ```
--- a/docs/source/quicktour.mdx
+++ b/docs/source/quicktour.mdx
@@ -57,7 +57,8 @@ pip install tensorflow

 ```py
 >>> from transformers import pipeline
->>> classifier = pipeline('sentiment-analysis')
+
+>>> classifier = pipeline("sentiment-analysis")
 ```

 When typing this command for the first time, a pretrained model and its tokenizer are downloaded and cached. We will
@@ -67,7 +68,7 @@ make them readable. For instance:


 ```py
->>> classifier('We are very happy to show you the 🤗 Transformers library.')
+>>> classifier("We are very happy to show you the 🤗 Transformers library.")
 [{'label': 'POSITIVE', 'score': 0.9998}]
 ```

@@ -75,8 +76,7 @@ That's encouraging! You can use it on a list of sentences, which will be preproc
 a list of dictionaries like this one:

 ```py
->>> results = classifier(["We are very happy to show you the 🤗 Transformers library.",
-...            "We hope you don't hate it."])
+>>> results = classifier(["We are very happy to show you the 🤗 Transformers library.", "We hope you don't hate it."])
 >>> for result in results:
 ...     print(f"label: {result['label']}, with score: {round(result['score'], 4)}")
 label: POSITIVE, with score: 0.9998
@@ -102,7 +102,7 @@ see how we can use it.
 You can directly pass the name of the model to use to [`pipeline`]:

 ```py
->>> classifier = pipeline('sentiment-analysis', model="nlptown/bert-base-multilingual-uncased-sentiment")
+>>> classifier = pipeline("sentiment-analysis", model="nlptown/bert-base-multilingual-uncased-sentiment")
 ```

 This classifier can now deal with texts in English, French, but also Dutch, German, Italian and Spanish! You can also
@@ -125,13 +125,13 @@ any other model from the model hub):
 >>> model_name = "nlptown/bert-base-multilingual-uncased-sentiment"
 >>> model = AutoModelForSequenceClassification.from_pretrained(model_name)
 >>> tokenizer = AutoTokenizer.from_pretrained(model_name)
->>> classifier = pipeline('sentiment-analysis', model=model, tokenizer=tokenizer)
+>>> classifier = pipeline("sentiment-analysis", model=model, tokenizer=tokenizer)
 ===PT-TF-SPLIT===
 >>> model_name = "nlptown/bert-base-multilingual-uncased-sentiment"
 >>> # This model only exists in PyTorch, so we use the _from_pt_ flag to import that model in TensorFlow.
 >>> model = TFAutoModelForSequenceClassification.from_pretrained(model_name, from_pt=True)
 >>> tokenizer = AutoTokenizer.from_pretrained(model_name)
->>> classifier = pipeline('sentiment-analysis', model=model, tokenizer=tokenizer)
+>>> classifier = pipeline("sentiment-analysis", model=model, tokenizer=tokenizer)
 ```

 If you don't find a model that has been pretrained on some data similar to yours, you will need to fine-tune a
@@ -150,11 +150,13 @@ As we saw, the model and tokenizer are created using the `from_pretrained` metho

 ```py
 >>> from transformers import AutoTokenizer, AutoModelForSequenceClassification
+
 >>> model_name = "distilbert-base-uncased-finetuned-sst-2-english"
 >>> pt_model = AutoModelForSequenceClassification.from_pretrained(model_name)
 >>> tokenizer = AutoTokenizer.from_pretrained(model_name)
 ===PT-TF-SPLIT===
 >>> from transformers import AutoTokenizer, TFAutoModelForSequenceClassification
+
 >>> model_name = "distilbert-base-uncased-finetuned-sst-2-english"
 >>> tf_model = TFAutoModelForSequenceClassification.from_pretrained(model_name)
 >>> tokenizer = AutoTokenizer.from_pretrained(model_name)
@@ -199,7 +201,7 @@ and get tensors back. You can specify all of that to the tokenizer:
 ...     padding=True,
 ...     truncation=True,
 ...     max_length=512,
-...     return_tensors="pt"
+...     return_tensors="pt",
 ... )
 ===PT-TF-SPLIT===
 >>> tf_batch = tokenizer(
@@ -207,7 +209,7 @@ and get tensors back. You can specify all of that to the tokenizer:
 ...     padding=True,
 ...     truncation=True,
 ...     max_length=512,
-...     return_tensors="tf"
+...     return_tensors="tf",
 ... )
 ```

@@ -267,9 +269,11 @@ Let's apply the SoftMax activation to get predictions.

 ```py
 >>> from torch import nn
+
 >>> pt_predictions = nn.functional.softmax(pt_outputs.logits, dim=-1)
 ===PT-TF-SPLIT===
 >>> import tensorflow as tf
+
 >>> tf_predictions = tf.nn.softmax(tf_outputs.logits, axis=-1)
 ```

@@ -291,13 +295,15 @@ attribute:

 ```py
 >>> import torch
->>> pt_outputs = pt_model(**pt_batch, labels = torch.tensor([1, 0]))
+
+>>> pt_outputs = pt_model(**pt_batch, labels=torch.tensor([1, 0]))
 >>> print(pt_outputs)
 SequenceClassifierOutput(loss=tensor(0.3167, grad_fn=<NllLossBackward>), logits=tensor([[-4.0833,  4.3364],
        [ 0.0818, -0.0418]], grad_fn=<AddmmBackward>), hidden_states=None, attentions=None)
 ===PT-TF-SPLIT===
 >>> import tensorflow as tf
->>> tf_outputs = tf_model(tf_batch, labels = tf.constant([1, 0]))
+
+>>> tf_outputs = tf_model(tf_batch, labels=tf.constant([1, 0]))
 >>> print(tf_outputs)
 TFSequenceClassifierOutput(loss=<tf.Tensor: shape=(2,), dtype=float32, numpy=array([2.2051e-04, 6.3326e-01], dtype=float32)>, logits=<tf.Tensor: shape=(2, 2), dtype=float32, numpy=
 array([[-4.0833 ,  4.3364  ],
@@ -317,11 +323,11 @@ case the attributes not set (that have `None` values) are ignored.
 Once your model is fine-tuned, you can save it with its tokenizer in the following way:

 ```py
->>> pt_save_directory = './pt_save_pretrained'
+>>> pt_save_directory = "./pt_save_pretrained"
 >>> tokenizer.save_pretrained(pt_save_directory)
 >>> pt_model.save_pretrained(pt_save_directory)
 ===PT-TF-SPLIT===
->>> tf_save_directory = './tf_save_pretrained'
+>>> tf_save_directory = "./tf_save_pretrained"
 >>> tokenizer.save_pretrained(tf_save_directory)
 >>> tf_model.save_pretrained(tf_save_directory)
 ```
@@ -343,10 +349,12 @@ Then, use the corresponding Auto class to load it like this:

 ```py
 >>> from transformers import AutoModel
+
 >>> tokenizer = AutoTokenizer.from_pretrained(tf_save_directory)
 >>> pt_model = AutoModel.from_pretrained(tf_save_directory, from_tf=True)
 ===PT-TF-SPLIT===
 >>> from transformers import TFAutoModel
+
 >>> tokenizer = AutoTokenizer.from_pretrained(pt_save_directory)
 >>> tf_model = TFAutoModel.from_pretrained(pt_save_directory, from_pt=True)
 ```
@@ -356,11 +364,11 @@ Lastly, you can also ask the model to return all hidden states and all attention

 ```py
 >>> pt_outputs = pt_model(**pt_batch, output_hidden_states=True, output_attentions=True)
->>> all_hidden_states  = pt_outputs.hidden_states 
+>>> all_hidden_states = pt_outputs.hidden_states
 >>> all_attentions = pt_outputs.attentions
 ===PT-TF-SPLIT===
 >>> tf_outputs = tf_model(tf_batch, output_hidden_states=True, output_attentions=True)
->>> all_hidden_states =  tf_outputs.hidden_states
+>>> all_hidden_states = tf_outputs.hidden_states
 >>> all_attentions = tf_outputs.attentions
 ```

@@ -376,11 +384,13 @@ directly instantiate model and tokenizer without the auto magic:

 ```py
 >>> from transformers import DistilBertTokenizer, DistilBertForSequenceClassification
+
 >>> model_name = "distilbert-base-uncased-finetuned-sst-2-english"
 >>> model = DistilBertForSequenceClassification.from_pretrained(model_name)
 >>> tokenizer = DistilBertTokenizer.from_pretrained(model_name)
 ===PT-TF-SPLIT===
 >>> from transformers import DistilBertTokenizer, TFDistilBertForSequenceClassification
+
 >>> model_name = "distilbert-base-uncased-finetuned-sst-2-english"
 >>> model = TFDistilBertForSequenceClassification.from_pretrained(model_name)
 >>> tokenizer = DistilBertTokenizer.from_pretrained(model_name)
@@ -401,13 +411,15 @@ the model from scratch. Therefore, we instantiate the model from a configuration

 ```py
 >>> from transformers import DistilBertConfig, DistilBertTokenizer, DistilBertForSequenceClassification
->>> config = DistilBertConfig(n_heads=8, dim=512, hidden_dim=4*512)
->>> tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
+
+>>> config = DistilBertConfig(n_heads=8, dim=512, hidden_dim=4 * 512)
+>>> tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")
 >>> model = DistilBertForSequenceClassification(config)
 ===PT-TF-SPLIT===
 >>> from transformers import DistilBertConfig, DistilBertTokenizer, TFDistilBertForSequenceClassification
->>> config = DistilBertConfig(n_heads=8, dim=512, hidden_dim=4*512)
->>> tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
+
+>>> config = DistilBertConfig(n_heads=8, dim=512, hidden_dim=4 * 512)
+>>> tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")
 >>> model = TFDistilBertForSequenceClassification(config)
 ```

@@ -419,11 +431,13 @@ configuration appropriately:

 ```py
 >>> from transformers import DistilBertConfig, DistilBertTokenizer, DistilBertForSequenceClassification
+
 >>> model_name = "distilbert-base-uncased"
 >>> model = DistilBertForSequenceClassification.from_pretrained(model_name, num_labels=10)
 >>> tokenizer = DistilBertTokenizer.from_pretrained(model_name)
 ===PT-TF-SPLIT===
 >>> from transformers import DistilBertConfig, DistilBertTokenizer, TFDistilBertForSequenceClassification
+
 >>> model_name = "distilbert-base-uncased"
 >>> model = TFDistilBertForSequenceClassification.from_pretrained(model_name, num_labels=10)
 >>> tokenizer = DistilBertTokenizer.from_pretrained(model_name)
--- a/docs/source/serialization.mdx
+++ b/docs/source/serialization.mdx
@@ -109,6 +109,7 @@ This export can now be used in the ONNX inference runtime:
 import onnxruntime as ort

 from transformers import BertTokenizerFast
+
 tokenizer = BertTokenizerFast.from_pretrained("bert-base-cased")

 ort_session = ort.InferenceSession("onnx/bert-base-cased/model.onnx")
@@ -382,7 +383,7 @@ tokenized_text = enc.tokenize(text)

 # Masking one of the input tokens
 masked_index = 8
-tokenized_text[masked_index] = '[MASK]'
+tokenized_text[masked_index] = "[MASK]"
 indexed_tokens = enc.convert_tokens_to_ids(tokenized_text)
 segments_ids = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1]

@@ -393,8 +394,14 @@ dummy_input = [tokens_tensor, segments_tensors]

 # Initializing the model with the torchscript flag
 # Flag set to True even though it is not necessary as this model does not have an LM Head.
-config = BertConfig(vocab_size_or_config_json_file=32000, hidden_size=768,
-    num_hidden_layers=12, num_attention_heads=12, intermediate_size=3072, torchscript=True)
+config = BertConfig(
+    vocab_size_or_config_json_file=32000,
+    hidden_size=768,
+    num_hidden_layers=12,
+    num_attention_heads=12,
+    intermediate_size=3072,
+    torchscript=True,
+)

 # Instantiating the model
 model = BertModel(config)
--- a/docs/source/task_summary.mdx
+++ b/docs/source/task_summary.mdx
@@ -188,11 +188,15 @@ positions of the extracted answer in the text.

 ```py
 >>> result = question_answerer(question="What is extractive question answering?", context=context)
->>> print(f"Answer: '{result['answer']}', score: {round(result['score'], 4)}, start: {result['start']}, end: {result['end']}")
+>>> print(
+...     f"Answer: '{result['answer']}', score: {round(result['score'], 4)}, start: {result['start']}, end: {result['end']}"
+... )
 Answer: 'the task of extracting an answer from a text given a question', score: 0.6177, start: 34, end: 95

 >>> result = question_answerer(question="What is a good example of a question answering dataset?", context=context)
->>> print(f"Answer: '{result['answer']}', score: {round(result['score'], 4)}, start: {result['start']}, end: {result['end']}")
+>>> print(
+...     f"Answer: '{result['answer']}', score: {round(result['score'], 4)}, start: {result['start']}, end: {result['end']}"
+... )
 Answer: 'SQuAD dataset', score: 0.5152, start: 147, end: 160
 ```

@@ -232,18 +236,20 @@ Here is an example of question answering using a model and a tokenizer. The proc
 >>> for question in questions:
 ...     inputs = tokenizer(question, text, add_special_tokens=True, return_tensors="pt")
 ...     input_ids = inputs["input_ids"].tolist()[0]
-...
+
 ...     outputs = model(**inputs)
 ...     answer_start_scores = outputs.start_logits
 ...     answer_end_scores = outputs.end_logits
-...
+
 ...     # Get the most likely beginning of answer with the argmax of the score
 ...     answer_start = torch.argmax(answer_start_scores)
-...     # Get the most likely end of answer with the argmax of the score 
+...     # Get the most likely end of answer with the argmax of the score
 ...     answer_end = torch.argmax(answer_end_scores) + 1
-...
-...     answer = tokenizer.convert_tokens_to_string(tokenizer.convert_ids_to_tokens(input_ids[answer_start:answer_end]))
-...
+
+...     answer = tokenizer.convert_tokens_to_string(
+...         tokenizer.convert_ids_to_tokens(input_ids[answer_start:answer_end])
+...     )
+
 ...     print(f"Question: {question}")
 ...     print(f"Answer: {answer}")
 Question: How many pretrained models are available in 🤗 Transformers?
@@ -275,18 +281,20 @@ Answer: tensorflow 2. 0 and pytorch
 >>> for question in questions:
 ...     inputs = tokenizer(question, text, add_special_tokens=True, return_tensors="tf")
 ...     input_ids = inputs["input_ids"].numpy()[0]
-...
+
 ...     outputs = model(inputs)
 ...     answer_start_scores = outputs.start_logits
 ...     answer_end_scores = outputs.end_logits
-...
+
 ...     # Get the most likely beginning of answer with the argmax of the score
 ...     answer_start = tf.argmax(answer_start_scores, axis=1).numpy()[0]
 ...     # Get the most likely end of answer with the argmax of the score
 ...     answer_end = tf.argmax(answer_end_scores, axis=1).numpy()[0] + 1
-...
-...     answer = tokenizer.convert_tokens_to_string(tokenizer.convert_ids_to_tokens(input_ids[answer_start:answer_end]))
-...
+
+...     answer = tokenizer.convert_tokens_to_string(
+...         tokenizer.convert_ids_to_tokens(input_ids[answer_start:answer_end])
+...     )
+
 ...     print(f"Question: {question}")
 ...     print(f"Answer: {answer}")
 Question: How many pretrained models are available in 🤗 Transformers?
@@ -327,7 +335,12 @@ This outputs the sequences with the mask filled, the confidence score, and the t

 ```py
 >>> from pprint import pprint
->>> pprint(unmasker(f"HuggingFace is creating a {unmasker.tokenizer.mask_token} that the community uses to solve NLP tasks."))
+
+>>> pprint(
+...     unmasker(
+...         f"HuggingFace is creating a {unmasker.tokenizer.mask_token} that the community uses to solve NLP tasks."
+...     )
+... )
 [{'score': 0.1793,
  'sequence': 'HuggingFace is creating a tool that the community uses to solve '
              'NLP tasks.',
@@ -374,8 +387,10 @@ Here is an example of doing masked language modeling using a model and a tokeniz
 >>> tokenizer = AutoTokenizer.from_pretrained("distilbert-base-cased")
 >>> model = AutoModelForMaskedLM.from_pretrained("distilbert-base-cased")

->>> sequence = "Distilled models are smaller than the models they mimic. Using them instead of the large " \
+>>> sequence = (
+...     "Distilled models are smaller than the models they mimic. Using them instead of the large "
 ...     f"versions would help {tokenizer.mask_token} our carbon footprint."
+... )

 >>> inputs = tokenizer(sequence, return_tensors="pt")
 >>> mask_token_index = torch.where(inputs["input_ids"] == tokenizer.mask_token_id)[1]
@@ -399,8 +414,10 @@ Distilled models are smaller than the models they mimic. Using them instead of t
 >>> tokenizer = AutoTokenizer.from_pretrained("distilbert-base-cased")
 >>> model = TFAutoModelForMaskedLM.from_pretrained("distilbert-base-cased")

->>> sequence = "Distilled models are smaller than the models they mimic. Using them instead of the large " \
+>>> sequence = (
+...     "Distilled models are smaller than the models they mimic. Using them instead of the large "
 ...     f"versions would help {tokenizer.mask_token} our carbon footprint."
+... )

 >>> inputs = tokenizer(sequence, return_tensors="tf")
 >>> mask_token_index = tf.where(inputs["input_ids"] == tokenizer.mask_token_id)[0, 1]
@@ -544,7 +561,7 @@ Below is an example of text generation using `XLNet` and its tokenizer, which in

 >>> prompt_length = len(tokenizer.decode(inputs[0]))
 >>> outputs = model.generate(inputs, max_length=250, do_sample=True, top_p=0.95, top_k=60)
->>> generated = prompt + tokenizer.decode(outputs[0])[prompt_length+1:]
+>>> generated = prompt + tokenizer.decode(outputs[0])[prompt_length + 1 :]

 >>> print(generated)
 Today the weather is really nice and I am planning ...
@@ -571,7 +588,7 @@ Today the weather is really nice and I am planning ...

 >>> prompt_length = len(tokenizer.decode(inputs[0]))
 >>> outputs = model.generate(inputs, max_length=250, do_sample=True, top_p=0.95, top_k=60)
->>> generated = prompt + tokenizer.decode(outputs[0])[prompt_length+1:]
+>>> generated = prompt + tokenizer.decode(outputs[0])[prompt_length + 1 :]

 >>> print(generated)
 Today the weather is really nice and I am planning ...
@@ -660,8 +677,10 @@ Here is an example of doing named entity recognition, using a model and a tokeni
 >>> model = AutoModelForTokenClassification.from_pretrained("dbmdz/bert-large-cased-finetuned-conll03-english")
 >>> tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

->>> sequence = "Hugging Face Inc. is a company based in New York City. Its headquarters are in DUMBO, " \
-...            "therefore very close to the Manhattan Bridge."
+>>> sequence = (
+...     "Hugging Face Inc. is a company based in New York City. Its headquarters are in DUMBO, "
+...     "therefore very close to the Manhattan Bridge."
+... )

 >>> inputs = tokenizer(sequence, return_tensors="pt")
 >>> tokens = inputs.tokens()
@@ -675,8 +694,10 @@ Here is an example of doing named entity recognition, using a model and a tokeni
 >>> model = TFAutoModelForTokenClassification.from_pretrained("dbmdz/bert-large-cased-finetuned-conll03-english")
 >>> tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

->>> sequence = "Hugging Face Inc. is a company based in New York City. Its headquarters are in DUMBO, " \
-...            "therefore very close to the Manhattan Bridge."
+>>> sequence = (
+...     "Hugging Face Inc. is a company based in New York City. Its headquarters are in DUMBO, "
+...     "therefore very close to the Manhattan Bridge."
+... )

 >>> inputs = tokenizer(sequence, return_tensors="tf")
 >>> tokens = inputs.tokens()
@@ -863,7 +884,7 @@ Here is an example of doing translation using a model and a tokenizer. The proce

 >>> inputs = tokenizer(
 ...     "translate English to German: Hugging Face is a technology company based in New York and Paris",
-...     return_tensors="pt"
+...     return_tensors="pt",
 ... )
 >>> outputs = model.generate(inputs["input_ids"], max_length=40, num_beams=4, early_stopping=True)

@@ -877,7 +898,7 @@ Here is an example of doing translation using a model and a tokenizer. The proce

 >>> inputs = tokenizer(
 ...     "translate English to German: Hugging Face is a technology company based in New York and Paris",
-...     return_tensors="tf"
+...     return_tensors="tf",
 ... )
 >>> outputs = model.generate(inputs["input_ids"], max_length=40, num_beams=4, early_stopping=True)

--- a/docs/source/testing.mdx
+++ b/docs/source/testing.mdx
@@ -422,14 +422,14 @@ Let's depict the GPU requirements in the following table:

 For example, here is a test that must be run only when there are 2 or more GPUs available and pytorch is installed:

-```python
+```python no-style
@require_torch_multi_gpu
 def test_example_with_multi_gpu():
 ```

 If a test requires `tensorflow` use the `require_tf` decorator. For example:

-```python
+```python no-style
@require_tf
 def test_tf_thing_with_tensorflow():
 ```
@@ -437,7 +437,7 @@ def test_tf_thing_with_tensorflow():
 These decorators can be stacked. For example, if a test is slow and requires at least one GPU under pytorch, here is
 how to set it up:

-```python
+```python no-style
@require_torch_gpu
@slow
 def test_example_slow_on_gpu():
@@ -446,7 +446,7 @@ def test_example_slow_on_gpu():
 Some decorators like `@parameterized` rewrite test names, therefore `@require_*` skip decorators have to be listed
 last for them to work correctly. Here is an example of the correct usage:

-```python
+```python no-style
@parameterized.expand(...)
@require_torch_multi_gpu
 def test_integration_foo():
@@ -461,7 +461,8 @@ Inside tests:

 ```python
 from transformers.testing_utils import get_gpu_count
-n_gpu = get_gpu_count() # works with torch and tf
+
+n_gpu = get_gpu_count()  # works with torch and tf
 ```

 ### Distributed training
@@ -544,12 +545,16 @@ the test, but then there is no way of running that test for just one set of argu
 # test_this1.py
 import unittest
 from parameterized import parameterized
+
+
 class TestMathUnitTest(unittest.TestCase):
-    @parameterized.expand([
-        ("negative", -1.5, -2.0),
-        ("integer", 1, 1.0),
-        ("large fraction", 1.6, 1),
-    ])
+    @parameterized.expand(
+        [
+            ("negative", -1.5, -2.0),
+            ("integer", 1, 1.0),
+            ("large fraction", 1.6, 1),
+        ]
+    )
    def test_floor(self, name, input, expected):
        assert_equal(math.floor(input), expected)
 ```
@@ -601,6 +606,8 @@ Here is the same example, this time using `pytest`'s `parametrize` marker:
 ```python
 # test_this2.py
 import pytest
+
+
@pytest.mark.parametrize(
    "name, input, expected",
    [
@@ -669,6 +676,8 @@ To start using those all you need is to make sure that the test resides in a sub

 ```python
 from transformers.testing_utils import TestCasePlus
+
+
 class PathExampleTest(TestCasePlus):
    def test_something_involving_local_locations(self):
        data_dir = self.tests_dir / "fixtures/tests_samples/wmt_en_ro"
@@ -679,6 +688,8 @@ If you don't need to manipulate paths via `pathlib` or you just need a path as a

 ```python
 from transformers.testing_utils import TestCasePlus
+
+
 class PathExampleTest(TestCasePlus):
    def test_something_involving_stringified_locations(self):
        examples_dir = self.examples_dir_str
@@ -700,6 +711,8 @@ Here is an example of its usage:

 ```python
 from transformers.testing_utils import TestCasePlus
+
+
 class ExamplesTests(TestCasePlus):
    def test_whatever(self):
        tmp_dir = self.get_auto_remove_tmp_dir()
@@ -759,6 +772,7 @@ If you need to temporary override `sys.path` to import from another test for exa
 ```python
 import os
 from transformers.testing_utils import ExtendSysPath
+
 bindir = os.path.abspath(os.path.dirname(__file__))
 with ExtendSysPath(f"{bindir}/.."):
    from test_trainer import TrainerIntegrationCommon  # noqa
@@ -786,20 +800,20 @@ code that's buggy causes some bad state that will affect other tests, do not use

 - Here is how to skip a whole test unconditionally:

-```python
+```python no-style
@unittest.skip("this bug needs to be fixed")
 def test_feature_x():
 ```

 or via pytest:

-```python
+```python no-style
@pytest.mark.skip(reason="this bug needs to be fixed")
 ```

 or the `xfail` way:

-```python
+```python no-style
@pytest.mark.xfail
 def test_feature_x():
 ```
@@ -816,6 +830,7 @@ or the whole module:

 ```python
 import pytest
+
 if not pytest.config.getoption("--custom-flag"):
    pytest.skip("--custom-flag is missing, skipping tests", allow_module_level=True)
 ```
@@ -835,21 +850,21 @@ docutils = pytest.importorskip("docutils", minversion="0.3")

 -  Skip a test based on a condition:

-```python
+```python no-style
@pytest.mark.skipif(sys.version_info < (3,6), reason="requires python3.6 or higher")
 def test_feature_x():
 ```

 or:

-```python
+```python no-style
@unittest.skipIf(torch_device == "cpu", "Can't do half precision")
 def test_feature_x():
 ```

 or skip the whole module:

-```python
+```python no-style
@pytest.mark.skipif(sys.platform == 'win32', reason="does not run on windows")
 class TestClass():
    def test_feature_x(self):
@@ -863,7 +878,7 @@ The library of tests is ever-growing, and some of the tests take minutes to run,
 an hour for the test suite to complete on CI. Therefore, with some exceptions for essential tests, slow tests should be
 marked as in the example below:

-```python
+```python no-style
 from transformers.testing_utils import slow
@slow
 def test_integration_foo():
@@ -878,8 +893,8 @@ RUN_SLOW=1 pytest tests
 Some decorators like `@parameterized` rewrite test names, therefore `@slow` and the rest of the skip decorators
 `@require_*` have to be listed last for them to work correctly. Here is an example of the correct usage:

-```python
-@parameterized.expand(...)
+```python no-style
+@parameterized.expand(...)
@slow
 def test_integration_foo():
 ```
@@ -935,13 +950,21 @@ In order to test functions that write to `stdout` and/or `stderr`, the test can

 ```python
 import sys
-def print_to_stdout(s): print(s)
-def print_to_stderr(s): sys.stderr.write(s)
+
+
+def print_to_stdout(s):
+    print(s)
+
+
+def print_to_stderr(s):
+    sys.stderr.write(s)
+
+
 def test_result_and_stdout(capsys):
    msg = "Hello"
    print_to_stdout(msg)
    print_to_stderr(msg)
-    out, err = capsys.readouterr() # consume the captured output streams
+    out, err = capsys.readouterr()  # consume the captured output streams
    # optional: if you want to replay the consumed streams:
    sys.stdout.write(out)
    sys.stderr.write(err)
@@ -954,10 +977,13 @@ And, of course, most of the time, `stderr` will come as a part of an exception,
 a case:

 ```python
-def raise_exception(msg): raise ValueError(msg)
+def raise_exception(msg):
+    raise ValueError(msg)
+
+
 def test_something_exception():
    msg = "Not a good value"
-    error = ''
+    error = ""
    try:
        raise_exception(msg)
    except Exception as e:
@@ -970,7 +996,12 @@ Another approach to capturing stdout is via `contextlib.redirect_stdout`:
 ```python
 from io import StringIO
 from contextlib import redirect_stdout
-def print_to_stdout(s): print(s)
+
+
+def print_to_stdout(s):
+    print(s)
+
+
 def test_result_and_stdout():
    msg = "Hello"
    buffer = StringIO()
@@ -993,6 +1024,7 @@ some `\r`'s in it or not, so it's a simple:

 ```python
 from transformers.testing_utils import CaptureStdout
+
 with CaptureStdout() as cs:
    function_that_writes_to_stdout()
 print(cs.out)
@@ -1002,17 +1034,19 @@ Here is a full test example:

 ```python
 from transformers.testing_utils import CaptureStdout
+
 msg = "Secret message\r"
 final = "Hello World"
 with CaptureStdout() as cs:
    print(msg + final)
-assert cs.out == final+"\n", f"captured: {cs.out}, expecting {final}"
+assert cs.out == final + "\n", f"captured: {cs.out}, expecting {final}"
 ```

 If you'd like to capture `stderr` use the `CaptureStderr` class instead:

 ```python
 from transformers.testing_utils import CaptureStderr
+
 with CaptureStderr() as cs:
    function_that_writes_to_stderr()
 print(cs.err)
@@ -1022,6 +1056,7 @@ If you need to capture both streams at once, use the parent `CaptureStd` class:

 ```python
 from transformers.testing_utils import CaptureStd
+
 with CaptureStd() as cs:
    function_that_writes_to_stdout_and_stderr()
 print(cs.err, cs.out)
@@ -1044,7 +1079,7 @@ logging.set_verbosity_info()
 logger = logging.get_logger("transformers.models.bart.tokenization_bart")
 with CaptureLogger(logger) as cl:
    logger.info(msg)
-assert cl.out, msg+"\n"
+assert cl.out, msg + "\n"
 ```

 ### Testing with environment variables
@@ -1054,6 +1089,8 @@ If you want to test the impact of environment variables for a specific test you

 ```python
 from transformers.testing_utils import mockenv
+
+
 class HfArgumentParserTest(unittest.TestCase):
    @mockenv(TRANSFORMERS_VERBOSITY="error")
    def test_env_override(self):
@@ -1065,6 +1102,8 @@ multiple local paths. A helper class `transformers.test_utils.TestCasePlus` come

 ```python
 from transformers.testing_utils import TestCasePlus
+
+
 class EnvExampleTest(TestCasePlus):
    def test_external_prog(self):
        env = self.get_env()
@@ -1089,16 +1128,20 @@ seed = 42

 # python RNG
 import random
+
 random.seed(seed)

 # pytorch RNGs
 import torch
+
 torch.manual_seed(seed)
 torch.backends.cudnn.deterministic = True
-if torch.cuda.is_available(): torch.cuda.manual_seed_all(seed)
+if torch.cuda.is_available():
+    torch.cuda.manual_seed_all(seed)

 # numpy RNG
 import numpy as np
+
 np.random.seed(seed)

 # tf RNG
--- a/docs/source/tokenizer_summary.mdx
+++ b/docs/source/tokenizer_summary.mdx
@@ -104,6 +104,7 @@ seen before, by decomposing them into known subwords. For instance, the [`~trans

 ```py
 >>> from transformers import BertTokenizer
+
 >>> tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
 >>> tokenizer.tokenize("I have a new GPU!")
 ["i", "have", "a", "new", "gp", "##u", "!"]
@@ -117,6 +118,7 @@ As another example, [`~transformers.XLNetTokenizer`] tokenizes our previously ex

 ```py
 >>> from transformers import XLNetTokenizer
+
 >>> tokenizer = XLNetTokenizer.from_pretrained("xlnet-base-cased")
 >>> tokenizer.tokenize("Don't you love 🤗 Transformers? We sure do.")
 ["▁Don", "'", "t", "▁you", "▁love", "▁", "🤗", "▁", "Transform", "ers", "?", "▁We", "▁sure", "▁do", "."]
--- a/docs/source/training.mdx
+++ b/docs/source/training.mdx
@@ -74,6 +74,7 @@ However, we can instead apply these preprocessing steps to all the splits of our
 def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)

+
 tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
 ```

@@ -82,8 +83,8 @@ You can learn more about the map method or the other ways to preprocess the data
 Next we will generate a small subset of the training and validation set, to enable faster training:

 ```python
-small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(1000)) 
-small_eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(1000)) 
+small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(1000))
+small_eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(1000))
 full_train_dataset = tokenized_datasets["train"]
 full_eval_dataset = tokenized_datasets["test"]
 ```
@@ -130,9 +131,7 @@ Then we can instantiate a [`Trainer`] like this:
 ```python
 from transformers import Trainer

-trainer = Trainer(
-    model=model, args=training_args, train_dataset=small_train_dataset, eval_dataset=small_eval_dataset
-)
+trainer = Trainer(model=model, args=training_args, train_dataset=small_train_dataset, eval_dataset=small_eval_dataset)
 ```

 To fine-tune our model, we just need to call
@@ -160,6 +159,7 @@ from datasets import load_metric

 metric = load_metric("accuracy")

+
 def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
@@ -322,12 +322,7 @@ from transformers import get_scheduler

 num_epochs = 3
 num_training_steps = num_epochs * len(train_dataloader)
-lr_scheduler = get_scheduler(
-    "linear",
-    optimizer=optimizer,
-    num_warmup_steps=0,
-    num_training_steps=num_training_steps
-)
+lr_scheduler = get_scheduler("linear", optimizer=optimizer, num_warmup_steps=0, num_training_steps=num_training_steps)
 ```

 One last thing, we will want to use the GPU if we have access to one (otherwise training might take several hours
@@ -372,7 +367,7 @@ use a metric from the datasets library. Here we accumulate the predictions at ea
 result when the loop is finished.

 ```python
-metric= load_metric("accuracy")
+metric = load_metric("accuracy")
 model.eval()
 for batch in eval_dataloader:
    batch = {k: v.to(device) for k, v in batch.items()}