Doc styler examples (#14953)
* Fix bad examples * Add black formatting to style_doc * Use first nonempty line * Put it at the right place * Don't add spaces to empty lines * Better templates * Deal with triple quotes in docstrings * Result of style_doc * Enable mdx treatment and fix code examples in MDXs * Result of doc styler on doc source files * Last fixes * Break copy from
This commit is contained in:
@@ -54,6 +54,7 @@ The 🤗 Datasets library makes it simple to load a dataset:
|
||||
|
||||
```python
|
||||
from datasets import load_dataset
|
||||
|
||||
imdb = load_dataset("imdb")
|
||||
```
|
||||
|
||||
@@ -61,8 +62,9 @@ This loads a `DatasetDict` object which you can index into to view an example:
|
||||
|
||||
```python
|
||||
imdb["train"][0]
|
||||
{'label': 1,
|
||||
'text': 'Bromwell High is a cartoon comedy. It ran at the same time as some other programs about school life, such as "Teachers". My 35 years in the teaching profession lead me to believe that Bromwell High\'s satire is much closer to reality than is "Teachers". The scramble to survive financially, the insightful students who can see right through their pathetic teachers\' pomp, the pettiness of the whole situation, all remind me of the schools I knew and their students. When I saw the episode in which a student repeatedly tried to burn down the school, I immediately recalled ......... at .......... High. A classic line: INSPECTOR: I\'m here to sack one of your teachers. STUDENT: Welcome to Bromwell High. I expect that many adults of my age think that Bromwell High is far fetched. What a pity that it isn\'t!'
|
||||
{
|
||||
"label": 1,
|
||||
"text": "Bromwell High is a cartoon comedy. It ran at the same time as some other programs about school life, such as \"Teachers\". My 35 years in the teaching profession lead me to believe that Bromwell High's satire is much closer to reality than is \"Teachers\". The scramble to survive financially, the insightful students who can see right through their pathetic teachers' pomp, the pettiness of the whole situation, all remind me of the schools I knew and their students. When I saw the episode in which a student repeatedly tried to burn down the school, I immediately recalled ......... at .......... High. A classic line: INSPECTOR: I'm here to sack one of your teachers. STUDENT: Welcome to Bromwell High. I expect that many adults of my age think that Bromwell High is far fetched. What a pity that it isn't!",
|
||||
}
|
||||
```
|
||||
|
||||
@@ -74,6 +76,7 @@ model was trained with to ensure appropriately tokenized words. Load the DistilB
|
||||
|
||||
```python
|
||||
from transformers import AutoTokenizer
|
||||
|
||||
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
|
||||
```
|
||||
|
||||
@@ -99,6 +102,7 @@ batch. This is known as **dynamic padding**. You can do this with the `DataColla
|
||||
|
||||
```python
|
||||
from transformers import DataCollatorWithPadding
|
||||
|
||||
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
|
||||
```
|
||||
|
||||
@@ -108,6 +112,7 @@ Now load your model with the [`AutoModelForSequenceClassification`] class along
|
||||
|
||||
```python
|
||||
from transformers import AutoModelForSequenceClassification
|
||||
|
||||
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2)
|
||||
```
|
||||
|
||||
@@ -121,7 +126,7 @@ At this point, only three steps remain:
|
||||
from transformers import TrainingArguments, Trainer
|
||||
|
||||
training_args = TrainingArguments(
|
||||
output_dir='./results',
|
||||
output_dir="./results",
|
||||
learning_rate=2e-5,
|
||||
per_device_train_batch_size=16,
|
||||
per_device_eval_batch_size=16,
|
||||
@@ -150,6 +155,7 @@ Make sure you set `return_tensors="tf"` to return `tf.Tensor` outputs instead of
|
||||
|
||||
```python
|
||||
from transformers import DataCollatorWithPadding
|
||||
|
||||
data_collator = DataCollatorWithPadding(tokenizer, return_tensors="tf")
|
||||
```
|
||||
|
||||
@@ -158,14 +164,14 @@ Next, convert your datasets to the `tf.data.Dataset` format with `to_tf_dataset`
|
||||
|
||||
```python
|
||||
tf_train_dataset = tokenized_imdb["train"].to_tf_dataset(
|
||||
columns=['attention_mask', 'input_ids', 'label'],
|
||||
columns=["attention_mask", "input_ids", "label"],
|
||||
shuffle=True,
|
||||
batch_size=16,
|
||||
collate_fn=data_collator,
|
||||
)
|
||||
|
||||
tf_validation_dataset = tokenized_imdb["train"].to_tf_dataset(
|
||||
columns=['attention_mask', 'input_ids', 'label'],
|
||||
columns=["attention_mask", "input_ids", "label"],
|
||||
shuffle=False,
|
||||
batch_size=16,
|
||||
collate_fn=data_collator,
|
||||
@@ -182,17 +188,14 @@ batch_size = 16
|
||||
num_epochs = 5
|
||||
batches_per_epoch = len(tokenized_imdb["train"]) // batch_size
|
||||
total_train_steps = int(batches_per_epoch * num_epochs)
|
||||
optimizer, schedule = create_optimizer(
|
||||
init_lr=2e-5,
|
||||
num_warmup_steps=0,
|
||||
num_train_steps=total_train_steps
|
||||
)
|
||||
optimizer, schedule = create_optimizer(init_lr=2e-5, num_warmup_steps=0, num_train_steps=total_train_steps)
|
||||
```
|
||||
|
||||
Load your model with the [`TFAutoModelForSequenceClassification`] class along with the number of expected labels:
|
||||
|
||||
```python
|
||||
from transformers import TFAutoModelForSequenceClassification
|
||||
|
||||
model = TFAutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2)
|
||||
```
|
||||
|
||||
@@ -200,6 +203,7 @@ Compile the model:
|
||||
|
||||
```python
|
||||
import tensorflow as tf
|
||||
|
||||
model.compile(optimizer=optimizer)
|
||||
```
|
||||
|
||||
@@ -234,14 +238,15 @@ or [TensorFlow notebook](https://colab.research.google.com/github/huggingface/no
|
||||
Load the WNUT 17 dataset from the 🤗 Datasets library:
|
||||
|
||||
```python
|
||||
from datasets import load_dataset
|
||||
wnut = load_dataset("wnut_17")
|
||||
>>> from datasets import load_dataset
|
||||
|
||||
>>> wnut = load_dataset("wnut_17")
|
||||
```
|
||||
|
||||
A quick look at the dataset shows the labels associated with each word in the sentence:
|
||||
|
||||
```python
|
||||
wnut["train"][0]
|
||||
>>> wnut["train"][0]
|
||||
{'id': '0',
|
||||
'ner_tags': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 7, 8, 8, 0, 7, 0, 0, 0, 0, 0, 0, 0, 0],
|
||||
'tokens': ['@paulwalk', 'It', "'s", 'the', 'view', 'from', 'where', 'I', "'m", 'living', 'for', 'two', 'weeks', '.', 'Empire', 'State', 'Building', '=', 'ESB', '.', 'Pretty', 'bad', 'storm', 'here', 'last', 'evening', '.']
|
||||
@@ -251,21 +256,22 @@ wnut["train"][0]
|
||||
View the specific NER tags by:
|
||||
|
||||
```python
|
||||
label_list = wnut["train"].features[f"ner_tags"].feature.names
|
||||
label_list
|
||||
['O',
|
||||
'B-corporation',
|
||||
'I-corporation',
|
||||
'B-creative-work',
|
||||
'I-creative-work',
|
||||
'B-group',
|
||||
'I-group',
|
||||
'B-location',
|
||||
'I-location',
|
||||
'B-person',
|
||||
'I-person',
|
||||
'B-product',
|
||||
'I-product'
|
||||
>>> label_list = wnut["train"].features[f"ner_tags"].feature.names
|
||||
>>> label_list
|
||||
[
|
||||
"O",
|
||||
"B-corporation",
|
||||
"I-corporation",
|
||||
"B-creative-work",
|
||||
"I-creative-work",
|
||||
"B-group",
|
||||
"I-group",
|
||||
"B-location",
|
||||
"I-location",
|
||||
"B-person",
|
||||
"I-person",
|
||||
"B-product",
|
||||
"I-product",
|
||||
]
|
||||
```
|
||||
|
||||
@@ -282,6 +288,7 @@ Now you need to tokenize the text. Load the DistilBERT tokenizer with an [`AutoT
|
||||
|
||||
```python
|
||||
from transformers import AutoTokenizer
|
||||
|
||||
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
|
||||
```
|
||||
|
||||
@@ -289,9 +296,9 @@ Since the input has already been split into words, set `is_split_into_words=True
|
||||
subwords:
|
||||
|
||||
```python
|
||||
tokenized_input = tokenizer(example["tokens"], is_split_into_words=True)
|
||||
tokens = tokenizer.convert_ids_to_tokens(tokenized_input["input_ids"])
|
||||
tokens
|
||||
>>> tokenized_input = tokenizer(example["tokens"], is_split_into_words=True)
|
||||
>>> tokens = tokenizer.convert_ids_to_tokens(tokenized_input["input_ids"])
|
||||
>>> tokens
|
||||
['[CLS]', '@', 'paul', '##walk', 'it', "'", 's', 'the', 'view', 'from', 'where', 'i', "'", 'm', 'living', 'for', 'two', 'weeks', '.', 'empire', 'state', 'building', '=', 'es', '##b', '.', 'pretty', 'bad', 'storm', 'here', 'last', 'evening', '.', '[SEP]']
|
||||
```
|
||||
|
||||
@@ -314,10 +321,10 @@ def tokenize_and_align_labels(examples):
|
||||
word_ids = tokenized_inputs.word_ids(batch_index=i) # Map tokens to their respective word.
|
||||
previous_word_idx = None
|
||||
label_ids = []
|
||||
for word_idx in word_ids: # Set the special tokens to -100.
|
||||
for word_idx in word_ids: # Set the special tokens to -100.
|
||||
if word_idx is None:
|
||||
label_ids.append(-100)
|
||||
elif word_idx != previous_word_idx: # Only label the first token of a given word.
|
||||
elif word_idx != previous_word_idx: # Only label the first token of a given word.
|
||||
label_ids.append(label[word_idx])
|
||||
|
||||
labels.append(label_ids)
|
||||
@@ -336,6 +343,7 @@ Finally, pad your text and labels, so they are a uniform length:
|
||||
|
||||
```python
|
||||
from transformers import DataCollatorForTokenClassification
|
||||
|
||||
data_collator = DataCollatorForTokenClassification(tokenizer)
|
||||
```
|
||||
|
||||
@@ -345,6 +353,7 @@ Load your model with the [`AutoModelForTokenClassification`] class along with th
|
||||
|
||||
```python
|
||||
from transformers import AutoModelForTokenClassification, TrainingArguments, Trainer
|
||||
|
||||
model = AutoModelForTokenClassification.from_pretrained("distilbert-base-uncased", num_labels=len(label_list))
|
||||
```
|
||||
|
||||
@@ -352,7 +361,7 @@ Gather your training arguments in [`TrainingArguments`]:
|
||||
|
||||
```python
|
||||
training_args = TrainingArguments(
|
||||
output_dir='./results',
|
||||
output_dir="./results",
|
||||
evaluation_strategy="epoch",
|
||||
learning_rate=2e-5,
|
||||
per_device_train_batch_size=16,
|
||||
@@ -387,6 +396,7 @@ Batch your examples together and pad your text and labels, so they are a uniform
|
||||
|
||||
```python
|
||||
from transformers import DataCollatorForTokenClassification
|
||||
|
||||
data_collator = DataCollatorForTokenClassification(tokenizer, return_tensors="tf")
|
||||
```
|
||||
|
||||
@@ -412,6 +422,7 @@ Load the model with the [`TFAutoModelForTokenClassification`] class along with t
|
||||
|
||||
```python
|
||||
from transformers import TFAutoModelForTokenClassification
|
||||
|
||||
model = TFAutoModelForTokenClassification.from_pretrained("distilbert-base-uncased", num_labels=len(label_list))
|
||||
```
|
||||
|
||||
@@ -435,6 +446,7 @@ Compile the model:
|
||||
|
||||
```python
|
||||
import tensorflow as tf
|
||||
|
||||
model.compile(optimizer=optimizer)
|
||||
```
|
||||
|
||||
@@ -469,13 +481,14 @@ Load the SQuAD dataset from the 🤗 Datasets library:
|
||||
|
||||
```python
|
||||
from datasets import load_dataset
|
||||
|
||||
squad = load_dataset("squad")
|
||||
```
|
||||
|
||||
Take a look at an example from the dataset:
|
||||
|
||||
```python
|
||||
squad["train"][0]
|
||||
>>> squad["train"][0]
|
||||
{'answers': {'answer_start': [515], 'text': ['Saint Bernadette Soubirous']},
|
||||
'context': 'Architecturally, the school has a Catholic character. Atop the Main Building\'s gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.',
|
||||
'id': '5733be284776f41900661182',
|
||||
@@ -490,6 +503,7 @@ Load the DistilBERT tokenizer with an [`AutoTokenizer`]:
|
||||
|
||||
```python
|
||||
from transformers import AutoTokenizer
|
||||
|
||||
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
|
||||
```
|
||||
|
||||
@@ -567,6 +581,7 @@ Batch the processed examples together:
|
||||
|
||||
```python
|
||||
from transformers import default_data_collator
|
||||
|
||||
data_collator = default_data_collator
|
||||
```
|
||||
|
||||
@@ -576,6 +591,7 @@ Load your model with the [`AutoModelForQuestionAnswering`] class:
|
||||
|
||||
```python
|
||||
from transformers import AutoModelForQuestionAnswering, TrainingArguments, Trainer
|
||||
|
||||
model = AutoModelForQuestionAnswering.from_pretrained("distilbert-base-uncased")
|
||||
```
|
||||
|
||||
@@ -583,7 +599,7 @@ Gather your training arguments in [`TrainingArguments`]:
|
||||
|
||||
```python
|
||||
training_args = TrainingArguments(
|
||||
output_dir='./results',
|
||||
output_dir="./results",
|
||||
evaluation_strategy="epoch",
|
||||
learning_rate=2e-5,
|
||||
per_device_train_batch_size=16,
|
||||
@@ -618,6 +634,7 @@ Batch the processed examples together with a TensorFlow default data collator:
|
||||
|
||||
```python
|
||||
from transformers.data.data_collator import tf_default_collator
|
||||
|
||||
data_collator = tf_default_collator
|
||||
```
|
||||
|
||||
@@ -650,8 +667,8 @@ batch_size = 16
|
||||
num_epochs = 2
|
||||
total_train_steps = (len(tokenized_squad["train"]) // batch_size) * num_epochs
|
||||
optimizer, schedule = create_optimizer(
|
||||
init_lr=2e-5,
|
||||
num_warmup_steps=0,
|
||||
init_lr=2e-5,
|
||||
num_warmup_steps=0,
|
||||
num_train_steps=total_train_steps,
|
||||
)
|
||||
```
|
||||
@@ -660,6 +677,7 @@ Load your model with the [`TFAutoModelForQuestionAnswering`] class:
|
||||
|
||||
```python
|
||||
from transformers import TFAutoModelForQuestionAnswering
|
||||
|
||||
model = TFAutoModelForQuestionAnswering("distilbert-base-uncased")
|
||||
```
|
||||
|
||||
@@ -667,6 +685,7 @@ Compile the model:
|
||||
|
||||
```python
|
||||
import tensorflow as tf
|
||||
|
||||
model.compile(optimizer=optimizer)
|
||||
```
|
||||
|
||||
|
||||
Reference in New Issue
Block a user