Re-enable doctests for the quicktour (#15828)

* Re-enable doctests for the quicktour

* Re-enable doctests for task_summary (#15830)

* Remove &
This commit is contained in:
Sylvain Gugger
2022-02-25 17:46:38 +01:00
committed by GitHub
parent fd5b05eb81
commit 0118c4f6a8
5 changed files with 98 additions and 37 deletions

View File

@@ -80,7 +80,7 @@ The pipeline downloads and caches a default [pretrained model](https://huggingfa
```py
>>> classifier("We are very happy to show you the 🤗 Transformers library.")
[{"label": "POSITIVE", "score": 0.9998}]
[{'label': 'POSITIVE', 'score': 0.9998}]
```
For more than one sentence, pass a list of sentences to the [`pipeline`] which returns a list of dictionaries:
@@ -112,20 +112,22 @@ Next, load a dataset (see the 🤗 Datasets [Quick Start](https://huggingface.co
```py
>>> import datasets
>>> dataset = datasets.load_dataset("superb", name="asr", split="test")
>>> dataset = datasets.load_dataset("superb", name="asr", split="test") # doctest: +IGNORE_RESULT
```
Now you can iterate over the dataset with the pipeline. `KeyDataset` retrieves the item in the dictionary returned by the dataset:
You can pass a whole dataset pipeline:
```py
>>> from transformers.pipelines.pt_utils import KeyDataset
>>> from tqdm.auto import tqdm
>>> for out in tqdm(speech_recognizer(KeyDataset(dataset, "file"))):
... print(out)
{"text": "HE HOPED THERE WOULD BE STEW FOR DINNER TURNIPS AND CARROTS AND BRUISED POTATOES AND FAT MUTTON PIECES TO BE LADLED OUT IN THICK PEPPERED FLOWER FAT AND SAUCE"}
>>> files = dataset["file"]
>>> speech_recognizer(files[:4])
[{'text': 'HE HOPED THERE WOULD BE STEW FOR DINNER TURNIPS AND CARROTS AND BRUISED POTATOES AND FAT MUTTON PIECES TO BE LADLED OUT IN THICK PEPPERED FLOWER FAT AND SAUCE'},
{'text': 'STUFFERED INTO YOU HIS BELLY COUNSELLED HIM'},
{'text': 'AFTER EARLY NIGHTFALL THE YELLOW LAMPS WOULD LIGHT UP HERE AND THERE THE SQUALID QUARTER OF THE BROTHELS'},
{'text': 'HO BERTIE ANY GOOD IN YOUR MIND'}]
```
For a larger dataset where the inputs are big (like in speech or vision), you will want to pass along a generator instead of a list that loads all the inputs in memory. See the [pipeline documentation](main_classes/pipeline) for more information.
### Use another model and tokenizer in the pipeline
The [`pipeline`] can accommodate any model from the [Model Hub](https://huggingface.co/models), making it easy to adapt the [`pipeline`] for other use-cases. For example, if you'd like a model capable of handling French text, use the tags on the Model Hub to filter for an appropriate model. The top filtered result returns a multilingual [BERT model](https://huggingface.co/nlptown/bert-base-multilingual-uncased-sentiment) fine-tuned for sentiment analysis. Great, let's use this model!
@@ -141,7 +143,7 @@ Use the [`AutoModelForSequenceClassification`] and ['AutoTokenizer'] to load the
>>> model = AutoModelForSequenceClassification.from_pretrained(model_name)
>>> tokenizer = AutoTokenizer.from_pretrained(model_name)
===PT-TF-SPLIT===
>>> # ===PT-TF-SPLIT===
>>> from transformers import AutoTokenizer, TFAutoModelForSequenceClassification
>>> model = TFAutoModelForSequenceClassification.from_pretrained(model_name)
@@ -153,7 +155,7 @@ Then you can specify the model and tokenizer in the [`pipeline`], and apply the
```py
>>> classifier = pipeline("sentiment-analysis", model=model, tokenizer=tokenizer)
>>> classifier("Nous sommes très heureux de vous présenter la bibliothèque 🤗 Transformers.")
[{"label": "5 stars", "score": 0.7272651791572571}]
[{'label': '5 stars', 'score': 0.7273}]
```
If you can't find a model for your use-case, you will need to fine-tune a pretrained model on your data. Take a look at our [fine-tuning tutorial](./training) to learn how. Finally, after you've fine-tuned your pretrained model, please consider sharing it (see tutorial [here](./model_sharing)) with the community on the Model Hub to democratize NLP for everyone! 🤗
@@ -186,8 +188,9 @@ Pass your text to the tokenizer:
```py
>>> encoding = tokenizer("We are very happy to show you the 🤗 Transformers library.")
>>> print(encoding)
{"input_ids": [101, 2057, 2024, 2200, 3407, 2000, 2265, 2017, 1996, 100, 19081, 3075, 1012, 102],
"attention_mask": [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
{'input_ids': [101, 11312, 10320, 12495, 19308, 10114, 11391, 10855, 10103, 100, 58263, 13299, 119, 102],
'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
```
The tokenizer will return a dictionary containing:
@@ -205,7 +208,7 @@ Just like the [`pipeline`], the tokenizer will accept a list of inputs. In addit
... max_length=512,
... return_tensors="pt",
... )
===PT-TF-SPLIT===
>>> # ===PT-TF-SPLIT===
>>> tf_batch = tokenizer(
... ["We are very happy to show you the 🤗 Transformers library.", "We hope you don't hate it."],
... padding=True,
@@ -226,7 +229,7 @@ Read the [preprocessing](./preprocessing) tutorial for more details about tokeni
>>> model_name = "nlptown/bert-base-multilingual-uncased-sentiment"
>>> pt_model = AutoModelForSequenceClassification.from_pretrained(model_name)
===PT-TF-SPLIT===
>>> # ===PT-TF-SPLIT===
>>> from transformers import TFAutoModelForSequenceClassification
>>> model_name = "nlptown/bert-base-multilingual-uncased-sentiment"
@@ -243,7 +246,7 @@ Now you can pass your preprocessed batch of inputs directly to the model. If you
```py
>>> pt_outputs = pt_model(**pt_batch)
===PT-TF-SPLIT===
>>> # ===PT-TF-SPLIT===
>>> tf_outputs = tf_model(tf_batch)
```
@@ -254,16 +257,17 @@ The model outputs the final activations in the `logits` attribute. Apply the sof
>>> pt_predictions = nn.functional.softmax(pt_outputs.logits, dim=-1)
>>> print(pt_predictions)
tensor([[2.2043e-04, 9.9978e-01],
[5.3086e-01, 4.6914e-01]], grad_fn=<SoftmaxBackward>)
===PT-TF-SPLIT===
tensor([[0.0021, 0.0018, 0.0115, 0.2121, 0.7725],
[0.2084, 0.1826, 0.1969, 0.1755, 0.2365]], grad_fn=<SoftmaxBackward0>)
>>> # ===PT-TF-SPLIT===
>>> import tensorflow as tf
>>> tf_predictions = tf.nn.softmax(tf_outputs.logits, axis=-1)
>>> print(tf_predictions)
tf.Tensor(
[[2.2043e-04 9.9978e-01]
[5.3086e-01 4.6914e-01]], shape=(2, 2), dtype=float32)
[[0.00206 0.00177 0.01155 0.21209 0.77253]
[0.20842 0.18262 0.19693 0.1755 0.23652]], shape=(2, 5), dtype=float32)
```
<Tip>
@@ -288,11 +292,11 @@ Once your model is fine-tuned, you can save it with its tokenizer using [`PreTra
```py
>>> pt_save_directory = "./pt_save_pretrained"
>>> tokenizer.save_pretrained(pt_save_directory)
>>> tokenizer.save_pretrained(pt_save_directory) # doctest: +IGNORE_RESULT
>>> pt_model.save_pretrained(pt_save_directory)
===PT-TF-SPLIT===
>>> # ===PT-TF-SPLIT===
>>> tf_save_directory = "./tf_save_pretrained"
>>> tokenizer.save_pretrained(tf_save_directory)
>>> tokenizer.save_pretrained(tf_save_directory) # doctest: +IGNORE_RESULT
>>> tf_model.save_pretrained(tf_save_directory)
```
@@ -300,7 +304,7 @@ When you are ready to use the model again, reload it with [`PreTrainedModel.from
```py
>>> pt_model = AutoModelForSequenceClassification.from_pretrained("./pt_save_pretrained")
===PT-TF-SPLIT===
>>> # ===PT-TF-SPLIT===
>>> tf_model = TFAutoModelForSequenceClassification.from_pretrained("./tf_save_pretrained")
```
@@ -311,7 +315,7 @@ One particularly cool 🤗 Transformers feature is the ability to save a model a
>>> tokenizer = AutoTokenizer.from_pretrained(tf_save_directory)
>>> pt_model = AutoModelForSequenceClassification.from_pretrained(tf_save_directory, from_tf=True)
===PT-TF-SPLIT===
>>> # ===PT-TF-SPLIT===
>>> from transformers import TFAutoModel
>>> tokenizer = AutoTokenizer.from_pretrained(pt_save_directory)

View File

@@ -122,7 +122,8 @@ is paraphrase: 90%
... print(f"{classes[i]}: {int(round(not_paraphrase_results[i] * 100))}%")
not paraphrase: 94%
is paraphrase: 6%
===PT-TF-SPLIT===
>>> # ===PT-TF-SPLIT===
>>> from transformers import AutoTokenizer, TFAutoModelForSequenceClassification
>>> import tensorflow as tf
@@ -258,7 +259,8 @@ Question: What does 🤗 Transformers provide?
Answer: general - purpose architectures
Question: 🤗 Transformers provides interoperability between which frameworks?
Answer: tensorflow 2. 0 and pytorch
===PT-TF-SPLIT===
>>> # ===PT-TF-SPLIT===
>>> from transformers import AutoTokenizer, TFAutoModelForQuestionAnswering
>>> import tensorflow as tf
@@ -407,7 +409,8 @@ Distilled models are smaller than the models they mimic. Using them instead of t
Distilled models are smaller than the models they mimic. Using them instead of the large versions would help decrease our carbon footprint.
Distilled models are smaller than the models they mimic. Using them instead of the large versions would help offset our carbon footprint.
Distilled models are smaller than the models they mimic. Using them instead of the large versions would help improve our carbon footprint.
===PT-TF-SPLIT===
>>> # ===PT-TF-SPLIT===
>>> from transformers import TFAutoModelForMaskedLM, AutoTokenizer
>>> import tensorflow as tf
@@ -481,7 +484,8 @@ of tokens.
>>> resulting_string = tokenizer.decode(generated.tolist()[0])
>>> print(resulting_string)
Hugging Face is based in DUMBO, New York City, and ...
===PT-TF-SPLIT===
>>> # ===PT-TF-SPLIT===
>>> from transformers import TFAutoModelForCausalLM, AutoTokenizer, tf_top_k_top_p_filtering
>>> import tensorflow as tf
@@ -565,7 +569,8 @@ Below is an example of text generation using `XLNet` and its tokenizer, which in
>>> print(generated)
Today the weather is really nice and I am planning ...
===PT-TF-SPLIT===
>>> # ===PT-TF-SPLIT===
>>> from transformers import TFAutoModelForCausalLM, AutoTokenizer
>>> model = TFAutoModelForCausalLM.from_pretrained("xlnet-base-cased")
@@ -687,7 +692,7 @@ Here is an example of doing named entity recognition, using a model and a tokeni
>>> outputs = model(**inputs).logits
>>> predictions = torch.argmax(outputs, dim=2)
===PT-TF-SPLIT===
>>> # ===PT-TF-SPLIT===
>>> from transformers import TFAutoModelForTokenClassification, AutoTokenizer
>>> import tensorflow as tf
@@ -827,7 +832,8 @@ CNN / Daily Mail), it yields very good results.
<pad> prosecutors say the marriages were part of an immigration scam. if convicted, barrientos faces two criminal
counts of "offering a false instrument for filing in the first degree" she has been married 10 times, nine of them
between 1999 and 2002.</s>
===PT-TF-SPLIT===
>>> # ===PT-TF-SPLIT===
>>> from transformers import TFAutoModelForSeq2SeqLM, AutoTokenizer
>>> model = TFAutoModelForSeq2SeqLM.from_pretrained("t5-base")
@@ -890,7 +896,8 @@ Here is an example of doing translation using a model and a tokenizer. The proce
>>> print(tokenizer.decode(outputs[0]))
<pad> Hugging Face ist ein Technologieunternehmen mit Sitz in New York und Paris.</s>
===PT-TF-SPLIT===
>>> # ===PT-TF-SPLIT===
>>> from transformers import TFAutoModelForSeq2SeqLM, AutoTokenizer
>>> model = TFAutoModelForSeq2SeqLM.from_pretrained("t5-base")