Improve task summary docs (#11513)
* fix task summary docs * refactor to use model.config.id2label instead of list * fix nit * Update docs/source/task_summary.rst Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com> Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
This commit is contained in:
@@ -85,9 +85,8 @@ each other. The process is the following:
|
|||||||
|
|
||||||
1. Instantiate a tokenizer and a model from the checkpoint name. The model is identified as a BERT model and loads it
|
1. Instantiate a tokenizer and a model from the checkpoint name. The model is identified as a BERT model and loads it
|
||||||
with the weights stored in the checkpoint.
|
with the weights stored in the checkpoint.
|
||||||
2. Build a sequence from the two sentences, with the correct model-specific separators token type ids and attention
|
2. Build a sequence from the two sentences, with the correct model-specific separators, token type ids and attention
|
||||||
masks (:func:`~transformers.PreTrainedTokenizer.encode` and :func:`~transformers.PreTrainedTokenizer.__call__` take
|
masks (which will be created automatically by the tokenizer).
|
||||||
care of this).
|
|
||||||
3. Pass this sequence through the model so that it is classified in one of the two available classes: 0 (not a
|
3. Pass this sequence through the model so that it is classified in one of the two available classes: 0 (not a
|
||||||
paraphrase) and 1 (is a paraphrase).
|
paraphrase) and 1 (is a paraphrase).
|
||||||
4. Compute the softmax of the result to get probabilities over the classes.
|
4. Compute the softmax of the result to get probabilities over the classes.
|
||||||
@@ -108,6 +107,7 @@ each other. The process is the following:
|
|||||||
>>> sequence_1 = "Apples are especially bad for your health"
|
>>> sequence_1 = "Apples are especially bad for your health"
|
||||||
>>> sequence_2 = "HuggingFace's headquarters are situated in Manhattan"
|
>>> sequence_2 = "HuggingFace's headquarters are situated in Manhattan"
|
||||||
|
|
||||||
|
>>> # The tokekenizer will automatically add any model specific separators (i.e. <CLS> and <SEP>) and tokens to the sequence, as well as compute the attention masks.
|
||||||
>>> paraphrase = tokenizer(sequence_0, sequence_2, return_tensors="pt")
|
>>> paraphrase = tokenizer(sequence_0, sequence_2, return_tensors="pt")
|
||||||
>>> not_paraphrase = tokenizer(sequence_0, sequence_1, return_tensors="pt")
|
>>> not_paraphrase = tokenizer(sequence_0, sequence_1, return_tensors="pt")
|
||||||
|
|
||||||
@@ -141,6 +141,7 @@ each other. The process is the following:
|
|||||||
>>> sequence_1 = "Apples are especially bad for your health"
|
>>> sequence_1 = "Apples are especially bad for your health"
|
||||||
>>> sequence_2 = "HuggingFace's headquarters are situated in Manhattan"
|
>>> sequence_2 = "HuggingFace's headquarters are situated in Manhattan"
|
||||||
|
|
||||||
|
>>> # The tokekenizer will automatically add any model specific separators (i.e. <CLS> and <SEP>) and tokens to the sequence, as well as compute the attention masks.
|
||||||
>>> paraphrase = tokenizer(sequence_0, sequence_2, return_tensors="tf")
|
>>> paraphrase = tokenizer(sequence_0, sequence_2, return_tensors="tf")
|
||||||
>>> not_paraphrase = tokenizer(sequence_0, sequence_1, return_tensors="tf")
|
>>> not_paraphrase = tokenizer(sequence_0, sequence_1, return_tensors="tf")
|
||||||
|
|
||||||
@@ -504,8 +505,8 @@ This outputs a (hopefully) coherent next token following the original sequence,
|
|||||||
>>> print(resulting_string)
|
>>> print(resulting_string)
|
||||||
Hugging Face is based in DUMBO, New York City, and has
|
Hugging Face is based in DUMBO, New York City, and has
|
||||||
|
|
||||||
In the next section, we show how this functionality is leveraged in :func:`~transformers.PreTrainedModel.generate` to
|
In the next section, we show how :func:`~transformers.PreTrainedModel.generate` can be used to generate multiple tokens
|
||||||
generate multiple tokens up to a user-defined length.
|
up to a specified length instead of one token at a time.
|
||||||
|
|
||||||
Text Generation
|
Text Generation
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||||
@@ -526,10 +527,11 @@ As a default all models apply *Top-K* sampling when used in pipelines, as config
|
|||||||
|
|
||||||
|
|
||||||
Here, the model generates a random text with a total maximal length of *50* tokens from context *"As far as I am
|
Here, the model generates a random text with a total maximal length of *50* tokens from context *"As far as I am
|
||||||
concerned, I will"*. The default arguments of ``PreTrainedModel.generate()`` can be directly overridden in the
|
concerned, I will"*. Behind the scenes, the pipeline object calls the method
|
||||||
pipeline, as is shown above for the argument ``max_length``.
|
:func:`~transformers.PreTrainedModel.generate` to generate text. The default arguments for this method can be
|
||||||
|
overridden in the pipeline, as is shown above for the arguments ``max_length`` and ``do_sample``.
|
||||||
|
|
||||||
Here is an example of text generation using ``XLNet`` and its tokenizer.
|
Below is an example of text generation using ``XLNet`` and its tokenizer, which includes calling ``generate`` directly:
|
||||||
|
|
||||||
.. code-block::
|
.. code-block::
|
||||||
|
|
||||||
@@ -627,8 +629,8 @@ It leverages a fine-tuned model on CoNLL-2003, fine-tuned by `@stefan-it <https:
|
|||||||
|
|
||||||
>>> nlp = pipeline("ner")
|
>>> nlp = pipeline("ner")
|
||||||
|
|
||||||
>>> sequence = "Hugging Face Inc. is a company based in New York City. Its headquarters are in DUMBO, therefore very"
|
>>> sequence = """Hugging Face Inc. is a company based in New York City. Its headquarters are in DUMBO,
|
||||||
... "close to the Manhattan Bridge which is visible from the window."
|
... therefore very close to the Manhattan Bridge which is visible from the window."""
|
||||||
|
|
||||||
|
|
||||||
This outputs a list of all words that have been identified as one of the entities from the 9 classes defined above.
|
This outputs a list of all words that have been identified as one of the entities from the 9 classes defined above.
|
||||||
@@ -659,15 +661,14 @@ Here is an example of doing named entity recognition, using a model and a tokeni
|
|||||||
|
|
||||||
1. Instantiate a tokenizer and a model from the checkpoint name. The model is identified as a BERT model and loads it
|
1. Instantiate a tokenizer and a model from the checkpoint name. The model is identified as a BERT model and loads it
|
||||||
with the weights stored in the checkpoint.
|
with the weights stored in the checkpoint.
|
||||||
2. Define the label list with which the model was trained on.
|
2. Define a sequence with known entities, such as "Hugging Face" as an organisation and "New York City" as a location.
|
||||||
3. Define a sequence with known entities, such as "Hugging Face" as an organisation and "New York City" as a location.
|
3. Split words into tokens so that they can be mapped to predictions. We use a small hack by, first, completely
|
||||||
4. Split words into tokens so that they can be mapped to predictions. We use a small hack by, first, completely
|
|
||||||
encoding and decoding the sequence, so that we're left with a string that contains the special tokens.
|
encoding and decoding the sequence, so that we're left with a string that contains the special tokens.
|
||||||
5. Encode that sequence into IDs (special tokens are added automatically).
|
4. Encode that sequence into IDs (special tokens are added automatically).
|
||||||
6. Retrieve the predictions by passing the input to the model and getting the first output. This results in a
|
5. Retrieve the predictions by passing the input to the model and getting the first output. This results in a
|
||||||
distribution over the 9 possible classes for each token. We take the argmax to retrieve the most likely class for
|
distribution over the 9 possible classes for each token. We take the argmax to retrieve the most likely class for
|
||||||
each token.
|
each token.
|
||||||
7. Zip together each token with its prediction and print it.
|
6. Zip together each token with its prediction and print it.
|
||||||
|
|
||||||
.. code-block::
|
.. code-block::
|
||||||
|
|
||||||
@@ -706,18 +707,6 @@ Here is an example of doing named entity recognition, using a model and a tokeni
|
|||||||
>>> model = TFAutoModelForTokenClassification.from_pretrained("dbmdz/bert-large-cased-finetuned-conll03-english")
|
>>> model = TFAutoModelForTokenClassification.from_pretrained("dbmdz/bert-large-cased-finetuned-conll03-english")
|
||||||
>>> tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
|
>>> tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
|
||||||
|
|
||||||
>>> label_list = [
|
|
||||||
... "O", # Outside of a named entity
|
|
||||||
... "B-MISC", # Beginning of a miscellaneous entity right after another miscellaneous entity
|
|
||||||
... "I-MISC", # Miscellaneous entity
|
|
||||||
... "B-PER", # Beginning of a person's name right after another person's name
|
|
||||||
... "I-PER", # Person's name
|
|
||||||
... "B-ORG", # Beginning of an organisation right after another organisation
|
|
||||||
... "I-ORG", # Organisation
|
|
||||||
... "B-LOC", # Beginning of a location right after another location
|
|
||||||
... "I-LOC" # Location
|
|
||||||
... ]
|
|
||||||
|
|
||||||
>>> sequence = "Hugging Face Inc. is a company based in New York City. Its headquarters are in DUMBO, therefore very" \
|
>>> sequence = "Hugging Face Inc. is a company based in New York City. Its headquarters are in DUMBO, therefore very" \
|
||||||
... "close to the Manhattan Bridge."
|
... "close to the Manhattan Bridge."
|
||||||
|
|
||||||
@@ -731,12 +720,49 @@ Here is an example of doing named entity recognition, using a model and a tokeni
|
|||||||
|
|
||||||
This outputs a list of each token mapped to its corresponding prediction. Differently from the pipeline, here every
|
This outputs a list of each token mapped to its corresponding prediction. Differently from the pipeline, here every
|
||||||
token has a prediction as we didn't remove the "0"th class, which means that no particular entity was found on that
|
token has a prediction as we didn't remove the "0"th class, which means that no particular entity was found on that
|
||||||
token. The following array should be the output:
|
token.
|
||||||
|
|
||||||
|
In the above example, ``predictions`` is an integer that corresponds to the predicted class. We can use the
|
||||||
|
``model.config.id2label`` property in order to recover the class name corresponding to the class number, which is
|
||||||
|
illustrated below:
|
||||||
|
|
||||||
.. code-block::
|
.. code-block::
|
||||||
|
|
||||||
>>> print([(token, label_list[prediction]) for token, prediction in zip(tokens, predictions[0].numpy())])
|
>>> for token, prediction in zip(tokens, predictions[0].numpy()):
|
||||||
[('[CLS]', 'O'), ('Hu', 'I-ORG'), ('##gging', 'I-ORG'), ('Face', 'I-ORG'), ('Inc', 'I-ORG'), ('.', 'O'), ('is', 'O'), ('a', 'O'), ('company', 'O'), ('based', 'O'), ('in', 'O'), ('New', 'I-LOC'), ('York', 'I-LOC'), ('City', 'I-LOC'), ('.', 'O'), ('Its', 'O'), ('headquarters', 'O'), ('are', 'O'), ('in', 'O'), ('D', 'I-LOC'), ('##UM', 'I-LOC'), ('##BO', 'I-LOC'), (',', 'O'), ('therefore', 'O'), ('very', 'O'), ('##c', 'O'), ('##lose', 'O'), ('to', 'O'), ('the', 'O'), ('Manhattan', 'I-LOC'), ('Bridge', 'I-LOC'), ('.', 'O'), ('[SEP]', 'O')]
|
... print((token, model.config.id2label[prediction]))
|
||||||
|
('[CLS]', 'O')
|
||||||
|
('Hu', 'I-ORG')
|
||||||
|
('##gging', 'I-ORG')
|
||||||
|
('Face', 'I-ORG')
|
||||||
|
('Inc', 'I-ORG')
|
||||||
|
('.', 'O')
|
||||||
|
('is', 'O')
|
||||||
|
('a', 'O')
|
||||||
|
('company', 'O')
|
||||||
|
('based', 'O')
|
||||||
|
('in', 'O')
|
||||||
|
('New', 'I-LOC')
|
||||||
|
('York', 'I-LOC')
|
||||||
|
('City', 'I-LOC')
|
||||||
|
('.', 'O')
|
||||||
|
('Its', 'O')
|
||||||
|
('headquarters', 'O')
|
||||||
|
('are', 'O')
|
||||||
|
('in', 'O')
|
||||||
|
('D', 'I-LOC')
|
||||||
|
('##UM', 'I-LOC')
|
||||||
|
('##BO', 'I-LOC')
|
||||||
|
(',', 'O')
|
||||||
|
('therefore', 'O')
|
||||||
|
('very', 'O')
|
||||||
|
('##c', 'O')
|
||||||
|
('##lose', 'O')
|
||||||
|
('to', 'O')
|
||||||
|
('the', 'O')
|
||||||
|
('Manhattan', 'I-LOC')
|
||||||
|
('Bridge', 'I-LOC')
|
||||||
|
('.', 'O')
|
||||||
|
('[SEP]', 'O')
|
||||||
|
|
||||||
Summarization
|
Summarization
|
||||||
-----------------------------------------------------------------------------------------------------------------------
|
-----------------------------------------------------------------------------------------------------------------------
|
||||||
@@ -819,6 +845,12 @@ CNN / Daily Mail), it yields very good results.
|
|||||||
>>> inputs = tokenizer.encode("summarize: " + ARTICLE, return_tensors="tf", max_length=512)
|
>>> inputs = tokenizer.encode("summarize: " + ARTICLE, return_tensors="tf", max_length=512)
|
||||||
>>> outputs = model.generate(inputs, max_length=150, min_length=40, length_penalty=2.0, num_beams=4, early_stopping=True)
|
>>> outputs = model.generate(inputs, max_length=150, min_length=40, length_penalty=2.0, num_beams=4, early_stopping=True)
|
||||||
|
|
||||||
|
.. code-block::
|
||||||
|
|
||||||
|
>>> print(tokenizer.decode(outputs[0]))
|
||||||
|
<pad> prosecutors say the marriages were part of an immigration scam. if convicted, barrientos faces two criminal counts of "offering a false instrument for filing in the first degree" she has been married 10 times, nine of them between 1999 and 2002.</s>
|
||||||
|
|
||||||
|
|
||||||
Translation
|
Translation
|
||||||
-----------------------------------------------------------------------------------------------------------------------
|
-----------------------------------------------------------------------------------------------------------------------
|
||||||
|
|
||||||
|
|||||||
Reference in New Issue
Block a user