Refactor Code samples; Test code samples (#5036)
* Refactor code samples * Test docstrings * Style * Tokenization examples * Run rust of tests * First step to testing source docs * Style and BART comment * Test the remainder of the code samples * Style * let to const * Formatting fixes * Ready for merge * Fix fixture + Style * Fix last tests * Update docs/source/quicktour.rst Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com> * Addressing @sgugger's comments + Fix MobileBERT in TF Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
This commit is contained in:
@@ -36,10 +36,11 @@ Here is an example using the ``xlm-clm-enfr-1024`` checkpoint (Causal language m
|
||||
|
||||
.. code-block::
|
||||
|
||||
import torch
|
||||
from transformers import XLMTokenizer, XLMWithLMHeadModel
|
||||
>>> import torch
|
||||
>>> from transformers import XLMTokenizer, XLMWithLMHeadModel
|
||||
|
||||
tokenizer = XLMTokenizer.from_pretrained("xlm-clm-1024-enfr")
|
||||
>>> tokenizer = XLMTokenizer.from_pretrained("xlm-clm-enfr-1024")
|
||||
>>> model = XLMWithLMHeadModel.from_pretrained("xlm-clm-enfr-1024")
|
||||
|
||||
|
||||
The different languages this model/tokenizer handles, as well as the ids of these languages are visible using the
|
||||
@@ -47,16 +48,15 @@ The different languages this model/tokenizer handles, as well as the ids of thes
|
||||
|
||||
.. code-block::
|
||||
|
||||
# Continuation of the previous script
|
||||
print(tokenizer.lang2id) # {'en': 0, 'fr': 1}
|
||||
>>> print(tokenizer.lang2id)
|
||||
{'en': 0, 'fr': 1}
|
||||
|
||||
|
||||
These ids should be used when passing a language parameter during a model pass. Let's define our inputs:
|
||||
|
||||
.. code-block::
|
||||
|
||||
# Continuation of the previous script
|
||||
input_ids = torch.tensor([tokenizer.encode("Wikipedia was used to")]) # batch size of 1
|
||||
>>> input_ids = torch.tensor([tokenizer.encode("Wikipedia was used to")]) # batch size of 1
|
||||
|
||||
|
||||
We should now define the language embedding by using the previously defined language id. We want to create a tensor
|
||||
@@ -64,20 +64,18 @@ filled with the appropriate language ids, of the same size as input_ids. For eng
|
||||
|
||||
.. code-block::
|
||||
|
||||
# Continuation of the previous script
|
||||
language_id = tokenizer.lang2id['en'] # 0
|
||||
langs = torch.tensor([language_id] * input_ids.shape[1]) # torch.tensor([0, 0, 0, ..., 0])
|
||||
>>> language_id = tokenizer.lang2id['en'] # 0
|
||||
>>> langs = torch.tensor([language_id] * input_ids.shape[1]) # torch.tensor([0, 0, 0, ..., 0])
|
||||
|
||||
# We reshape it to be of size (batch_size, sequence_length)
|
||||
langs = langs.view(1, -1) # is now of shape [1, sequence_length] (we have a batch size of 1)
|
||||
>>> # We reshape it to be of size (batch_size, sequence_length)
|
||||
>>> langs = langs.view(1, -1) # is now of shape [1, sequence_length] (we have a batch size of 1)
|
||||
|
||||
|
||||
You can then feed it all as input to your model:
|
||||
|
||||
.. code-block::
|
||||
|
||||
# Continuation of the previous script
|
||||
outputs = model(input_ids, langs=langs)
|
||||
>>> outputs = model(input_ids, langs=langs)
|
||||
|
||||
|
||||
The example `run_generation.py <https://github.com/huggingface/transformers/blob/master/examples/text-generation/run_generation.py>`__
|
||||
|
||||
Reference in New Issue
Block a user