Doc styler examples (#14953)

* Fix bad examples

* Add black formatting to style_doc

* Use first nonempty line

* Put it at the right place

* Don't add spaces to empty lines

* Better templates

* Deal with triple quotes in docstrings

* Result of style_doc

* Enable mdx treatment and fix code examples in MDXs

* Result of doc styler on doc source files

* Last fixes

* Break copy from
This commit is contained in:
Sylvain Gugger
2021-12-27 19:07:46 -05:00
committed by GitHub
parent e13f72fbff
commit b5e2b183af
211 changed files with 2738 additions and 1711 deletions

View File

@@ -54,6 +54,7 @@ The 🤗 Datasets library makes it simple to load a dataset:
```python
from datasets import load_dataset
imdb = load_dataset("imdb")
```
@@ -61,8 +62,9 @@ This loads a `DatasetDict` object which you can index into to view an example:
```python
imdb["train"][0]
{'label': 1,
'text': 'Bromwell High is a cartoon comedy. It ran at the same time as some other programs about school life, such as "Teachers". My 35 years in the teaching profession lead me to believe that Bromwell High\'s satire is much closer to reality than is "Teachers". The scramble to survive financially, the insightful students who can see right through their pathetic teachers\' pomp, the pettiness of the whole situation, all remind me of the schools I knew and their students. When I saw the episode in which a student repeatedly tried to burn down the school, I immediately recalled ......... at .......... High. A classic line: INSPECTOR: I\'m here to sack one of your teachers. STUDENT: Welcome to Bromwell High. I expect that many adults of my age think that Bromwell High is far fetched. What a pity that it isn\'t!'
{
"label": 1,
"text": "Bromwell High is a cartoon comedy. It ran at the same time as some other programs about school life, such as \"Teachers\". My 35 years in the teaching profession lead me to believe that Bromwell High's satire is much closer to reality than is \"Teachers\". The scramble to survive financially, the insightful students who can see right through their pathetic teachers' pomp, the pettiness of the whole situation, all remind me of the schools I knew and their students. When I saw the episode in which a student repeatedly tried to burn down the school, I immediately recalled ......... at .......... High. A classic line: INSPECTOR: I'm here to sack one of your teachers. STUDENT: Welcome to Bromwell High. I expect that many adults of my age think that Bromwell High is far fetched. What a pity that it isn't!",
}
```
@@ -74,6 +76,7 @@ model was trained with to ensure appropriately tokenized words. Load the DistilB
```python
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
```
@@ -99,6 +102,7 @@ batch. This is known as **dynamic padding**. You can do this with the `DataColla
```python
from transformers import DataCollatorWithPadding
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
```
@@ -108,6 +112,7 @@ Now load your model with the [`AutoModelForSequenceClassification`] class along
```python
from transformers import AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2)
```
@@ -121,7 +126,7 @@ At this point, only three steps remain:
from transformers import TrainingArguments, Trainer
training_args = TrainingArguments(
output_dir='./results',
output_dir="./results",
learning_rate=2e-5,
per_device_train_batch_size=16,
per_device_eval_batch_size=16,
@@ -150,6 +155,7 @@ Make sure you set `return_tensors="tf"` to return `tf.Tensor` outputs instead of
```python
from transformers import DataCollatorWithPadding
data_collator = DataCollatorWithPadding(tokenizer, return_tensors="tf")
```
@@ -158,14 +164,14 @@ Next, convert your datasets to the `tf.data.Dataset` format with `to_tf_dataset`
```python
tf_train_dataset = tokenized_imdb["train"].to_tf_dataset(
columns=['attention_mask', 'input_ids', 'label'],
columns=["attention_mask", "input_ids", "label"],
shuffle=True,
batch_size=16,
collate_fn=data_collator,
)
tf_validation_dataset = tokenized_imdb["train"].to_tf_dataset(
columns=['attention_mask', 'input_ids', 'label'],
columns=["attention_mask", "input_ids", "label"],
shuffle=False,
batch_size=16,
collate_fn=data_collator,
@@ -182,17 +188,14 @@ batch_size = 16
num_epochs = 5
batches_per_epoch = len(tokenized_imdb["train"]) // batch_size
total_train_steps = int(batches_per_epoch * num_epochs)
optimizer, schedule = create_optimizer(
init_lr=2e-5,
num_warmup_steps=0,
num_train_steps=total_train_steps
)
optimizer, schedule = create_optimizer(init_lr=2e-5, num_warmup_steps=0, num_train_steps=total_train_steps)
```
Load your model with the [`TFAutoModelForSequenceClassification`] class along with the number of expected labels:
```python
from transformers import TFAutoModelForSequenceClassification
model = TFAutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2)
```
@@ -200,6 +203,7 @@ Compile the model:
```python
import tensorflow as tf
model.compile(optimizer=optimizer)
```
@@ -234,14 +238,15 @@ or [TensorFlow notebook](https://colab.research.google.com/github/huggingface/no
Load the WNUT 17 dataset from the 🤗 Datasets library:
```python
from datasets import load_dataset
wnut = load_dataset("wnut_17")
>>> from datasets import load_dataset
>>> wnut = load_dataset("wnut_17")
```
A quick look at the dataset shows the labels associated with each word in the sentence:
```python
wnut["train"][0]
>>> wnut["train"][0]
{'id': '0',
'ner_tags': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 7, 8, 8, 0, 7, 0, 0, 0, 0, 0, 0, 0, 0],
'tokens': ['@paulwalk', 'It', "'s", 'the', 'view', 'from', 'where', 'I', "'m", 'living', 'for', 'two', 'weeks', '.', 'Empire', 'State', 'Building', '=', 'ESB', '.', 'Pretty', 'bad', 'storm', 'here', 'last', 'evening', '.']
@@ -251,21 +256,22 @@ wnut["train"][0]
View the specific NER tags by:
```python
label_list = wnut["train"].features[f"ner_tags"].feature.names
label_list
['O',
'B-corporation',
'I-corporation',
'B-creative-work',
'I-creative-work',
'B-group',
'I-group',
'B-location',
'I-location',
'B-person',
'I-person',
'B-product',
'I-product'
>>> label_list = wnut["train"].features[f"ner_tags"].feature.names
>>> label_list
[
"O",
"B-corporation",
"I-corporation",
"B-creative-work",
"I-creative-work",
"B-group",
"I-group",
"B-location",
"I-location",
"B-person",
"I-person",
"B-product",
"I-product",
]
```
@@ -282,6 +288,7 @@ Now you need to tokenize the text. Load the DistilBERT tokenizer with an [`AutoT
```python
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
```
@@ -289,9 +296,9 @@ Since the input has already been split into words, set `is_split_into_words=True
subwords:
```python
tokenized_input = tokenizer(example["tokens"], is_split_into_words=True)
tokens = tokenizer.convert_ids_to_tokens(tokenized_input["input_ids"])
tokens
>>> tokenized_input = tokenizer(example["tokens"], is_split_into_words=True)
>>> tokens = tokenizer.convert_ids_to_tokens(tokenized_input["input_ids"])
>>> tokens
['[CLS]', '@', 'paul', '##walk', 'it', "'", 's', 'the', 'view', 'from', 'where', 'i', "'", 'm', 'living', 'for', 'two', 'weeks', '.', 'empire', 'state', 'building', '=', 'es', '##b', '.', 'pretty', 'bad', 'storm', 'here', 'last', 'evening', '.', '[SEP]']
```
@@ -314,10 +321,10 @@ def tokenize_and_align_labels(examples):
word_ids = tokenized_inputs.word_ids(batch_index=i) # Map tokens to their respective word.
previous_word_idx = None
label_ids = []
for word_idx in word_ids: # Set the special tokens to -100.
for word_idx in word_ids: # Set the special tokens to -100.
if word_idx is None:
label_ids.append(-100)
elif word_idx != previous_word_idx: # Only label the first token of a given word.
elif word_idx != previous_word_idx: # Only label the first token of a given word.
label_ids.append(label[word_idx])
labels.append(label_ids)
@@ -336,6 +343,7 @@ Finally, pad your text and labels, so they are a uniform length:
```python
from transformers import DataCollatorForTokenClassification
data_collator = DataCollatorForTokenClassification(tokenizer)
```
@@ -345,6 +353,7 @@ Load your model with the [`AutoModelForTokenClassification`] class along with th
```python
from transformers import AutoModelForTokenClassification, TrainingArguments, Trainer
model = AutoModelForTokenClassification.from_pretrained("distilbert-base-uncased", num_labels=len(label_list))
```
@@ -352,7 +361,7 @@ Gather your training arguments in [`TrainingArguments`]:
```python
training_args = TrainingArguments(
output_dir='./results',
output_dir="./results",
evaluation_strategy="epoch",
learning_rate=2e-5,
per_device_train_batch_size=16,
@@ -387,6 +396,7 @@ Batch your examples together and pad your text and labels, so they are a uniform
```python
from transformers import DataCollatorForTokenClassification
data_collator = DataCollatorForTokenClassification(tokenizer, return_tensors="tf")
```
@@ -412,6 +422,7 @@ Load the model with the [`TFAutoModelForTokenClassification`] class along with t
```python
from transformers import TFAutoModelForTokenClassification
model = TFAutoModelForTokenClassification.from_pretrained("distilbert-base-uncased", num_labels=len(label_list))
```
@@ -435,6 +446,7 @@ Compile the model:
```python
import tensorflow as tf
model.compile(optimizer=optimizer)
```
@@ -469,13 +481,14 @@ Load the SQuAD dataset from the 🤗 Datasets library:
```python
from datasets import load_dataset
squad = load_dataset("squad")
```
Take a look at an example from the dataset:
```python
squad["train"][0]
>>> squad["train"][0]
{'answers': {'answer_start': [515], 'text': ['Saint Bernadette Soubirous']},
'context': 'Architecturally, the school has a Catholic character. Atop the Main Building\'s gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.',
'id': '5733be284776f41900661182',
@@ -490,6 +503,7 @@ Load the DistilBERT tokenizer with an [`AutoTokenizer`]:
```python
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
```
@@ -567,6 +581,7 @@ Batch the processed examples together:
```python
from transformers import default_data_collator
data_collator = default_data_collator
```
@@ -576,6 +591,7 @@ Load your model with the [`AutoModelForQuestionAnswering`] class:
```python
from transformers import AutoModelForQuestionAnswering, TrainingArguments, Trainer
model = AutoModelForQuestionAnswering.from_pretrained("distilbert-base-uncased")
```
@@ -583,7 +599,7 @@ Gather your training arguments in [`TrainingArguments`]:
```python
training_args = TrainingArguments(
output_dir='./results',
output_dir="./results",
evaluation_strategy="epoch",
learning_rate=2e-5,
per_device_train_batch_size=16,
@@ -618,6 +634,7 @@ Batch the processed examples together with a TensorFlow default data collator:
```python
from transformers.data.data_collator import tf_default_collator
data_collator = tf_default_collator
```
@@ -650,8 +667,8 @@ batch_size = 16
num_epochs = 2
total_train_steps = (len(tokenized_squad["train"]) // batch_size) * num_epochs
optimizer, schedule = create_optimizer(
init_lr=2e-5,
num_warmup_steps=0,
init_lr=2e-5,
num_warmup_steps=0,
num_train_steps=total_train_steps,
)
```
@@ -660,6 +677,7 @@ Load your model with the [`TFAutoModelForQuestionAnswering`] class:
```python
from transformers import TFAutoModelForQuestionAnswering
model = TFAutoModelForQuestionAnswering("distilbert-base-uncased")
```
@@ -667,6 +685,7 @@ Compile the model:
```python
import tensorflow as tf
model.compile(optimizer=optimizer)
```