Convert tutorials (#14665)
* Convert a few docs * And another * Last tutorials * New syntax for colab links * Convert a few docs * And another * Last tutorials * New syntax for colab links
This commit is contained in:
118
docs/source/multilingual.mdx
Normal file
118
docs/source/multilingual.mdx
Normal file
@@ -0,0 +1,118 @@
|
||||
<!--Copyright 2020 The HuggingFace Team. All rights reserved.
|
||||
|
||||
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
|
||||
the License. You may obtain a copy of the License at
|
||||
|
||||
http://www.apache.org/licenses/LICENSE-2.0
|
||||
|
||||
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
|
||||
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
|
||||
specific language governing permissions and limitations under the License.
|
||||
-->
|
||||
|
||||
# Multi-lingual models
|
||||
|
||||
[[open-in-colab]]
|
||||
|
||||
Most of the models available in this library are mono-lingual models (English, Chinese and German). A few multi-lingual
|
||||
models are available and have a different mechanisms than mono-lingual models. This page details the usage of these
|
||||
models.
|
||||
|
||||
## XLM
|
||||
|
||||
XLM has a total of 10 different checkpoints, only one of which is mono-lingual. The 9 remaining model checkpoints can
|
||||
be split in two categories: the checkpoints that make use of language embeddings, and those that don't
|
||||
|
||||
### XLM & Language Embeddings
|
||||
|
||||
This section concerns the following checkpoints:
|
||||
|
||||
- `xlm-mlm-ende-1024` (Masked language modeling, English-German)
|
||||
- `xlm-mlm-enfr-1024` (Masked language modeling, English-French)
|
||||
- `xlm-mlm-enro-1024` (Masked language modeling, English-Romanian)
|
||||
- `xlm-mlm-xnli15-1024` (Masked language modeling, XNLI languages)
|
||||
- `xlm-mlm-tlm-xnli15-1024` (Masked language modeling + Translation, XNLI languages)
|
||||
- `xlm-clm-enfr-1024` (Causal language modeling, English-French)
|
||||
- `xlm-clm-ende-1024` (Causal language modeling, English-German)
|
||||
|
||||
These checkpoints require language embeddings that will specify the language used at inference time. These language
|
||||
embeddings are represented as a tensor that is of the same shape as the input ids passed to the model. The values in
|
||||
these tensors depend on the language used and are identifiable using the `lang2id` and `id2lang` attributes from
|
||||
the tokenizer.
|
||||
|
||||
Here is an example using the `xlm-clm-enfr-1024` checkpoint (Causal language modeling, English-French):
|
||||
|
||||
|
||||
```py
|
||||
>>> import torch
|
||||
>>> from transformers import XLMTokenizer, XLMWithLMHeadModel
|
||||
|
||||
>>> tokenizer = XLMTokenizer.from_pretrained("xlm-clm-enfr-1024")
|
||||
>>> model = XLMWithLMHeadModel.from_pretrained("xlm-clm-enfr-1024")
|
||||
```
|
||||
|
||||
The different languages this model/tokenizer handles, as well as the ids of these languages are visible using the
|
||||
`lang2id` attribute:
|
||||
|
||||
```py
|
||||
>>> print(tokenizer.lang2id)
|
||||
{'en': 0, 'fr': 1}
|
||||
```
|
||||
|
||||
These ids should be used when passing a language parameter during a model pass. Let's define our inputs:
|
||||
|
||||
```py
|
||||
>>> input_ids = torch.tensor([tokenizer.encode("Wikipedia was used to")]) # batch size of 1
|
||||
```
|
||||
|
||||
We should now define the language embedding by using the previously defined language id. We want to create a tensor
|
||||
filled with the appropriate language ids, of the same size as input_ids. For english, the id is 0:
|
||||
|
||||
```py
|
||||
>>> language_id = tokenizer.lang2id['en'] # 0
|
||||
>>> langs = torch.tensor([language_id] * input_ids.shape[1]) # torch.tensor([0, 0, 0, ..., 0])
|
||||
|
||||
>>> # We reshape it to be of size (batch_size, sequence_length)
|
||||
>>> langs = langs.view(1, -1) # is now of shape [1, sequence_length] (we have a batch size of 1)
|
||||
```
|
||||
|
||||
You can then feed it all as input to your model:
|
||||
|
||||
```py
|
||||
>>> outputs = model(input_ids, langs=langs)
|
||||
```
|
||||
|
||||
The example [run_generation.py](https://github.com/huggingface/transformers/tree/master/examples/pytorch/text-generation/run_generation.py) can generate text
|
||||
using the CLM checkpoints from XLM, using the language embeddings.
|
||||
|
||||
### XLM without Language Embeddings
|
||||
|
||||
This section concerns the following checkpoints:
|
||||
|
||||
- `xlm-mlm-17-1280` (Masked language modeling, 17 languages)
|
||||
- `xlm-mlm-100-1280` (Masked language modeling, 100 languages)
|
||||
|
||||
These checkpoints do not require language embeddings at inference time. These models are used to have generic sentence
|
||||
representations, differently from previously-mentioned XLM checkpoints.
|
||||
|
||||
|
||||
## BERT
|
||||
|
||||
BERT has two checkpoints that can be used for multi-lingual tasks:
|
||||
|
||||
- `bert-base-multilingual-uncased` (Masked language modeling + Next sentence prediction, 102 languages)
|
||||
- `bert-base-multilingual-cased` (Masked language modeling + Next sentence prediction, 104 languages)
|
||||
|
||||
These checkpoints do not require language embeddings at inference time. They should identify the language used in the
|
||||
context and infer accordingly.
|
||||
|
||||
## XLM-RoBERTa
|
||||
|
||||
XLM-RoBERTa was trained on 2.5TB of newly created clean CommonCrawl data in 100 languages. It provides strong gains
|
||||
over previously released multi-lingual models like mBERT or XLM on downstream tasks like classification, sequence
|
||||
labeling and question answering.
|
||||
|
||||
Two XLM-RoBERTa checkpoints can be used for multi-lingual tasks:
|
||||
|
||||
- `xlm-roberta-base` (Masked language modeling, 100 languages)
|
||||
- `xlm-roberta-large` (Masked language modeling, 100 languages)
|
||||
Reference in New Issue
Block a user