NLLB tokenizer (#18126)
* NLLB tokenizer * Apply suggestions from code review - Thanks Stefan! Co-authored-by: Stefan Schweter <stefan@schweter.it> * Final touches * Style :) * Update docs/source/en/model_doc/nllb.mdx Co-authored-by: Stefan Schweter <stefan@schweter.it> * Apply suggestions from code review Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com> * PR reviews * Auto models Co-authored-by: Stefan Schweter <stefan@schweter.it> Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
This commit is contained in:
@@ -320,6 +320,8 @@
|
||||
title: MVP
|
||||
- local: model_doc/nezha
|
||||
title: NEZHA
|
||||
- local: model_doc/nllb
|
||||
title: NLLB
|
||||
- local: model_doc/nystromformer
|
||||
title: Nyströmformer
|
||||
- local: model_doc/opt
|
||||
|
||||
@@ -127,6 +127,7 @@ The library currently contains JAX, PyTorch and TensorFlow implementations, pret
|
||||
1. **[MT5](model_doc/mt5)** (from Google AI) released with the paper [mT5: A massively multilingual pre-trained text-to-text transformer](https://arxiv.org/abs/2010.11934) by Linting Xue, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Barua, Colin Raffel.
|
||||
1. **[MVP](model_doc/mvp)** (from RUC AI Box) released with the paper [MVP: Multi-task Supervised Pre-training for Natural Language Generation](https://arxiv.org/abs/2206.12131) by Tianyi Tang, Junyi Li, Wayne Xin Zhao and Ji-Rong Wen.
|
||||
1. **[Nezha](model_doc/nezha)** (from Huawei Noah’s Ark Lab) released with the paper [NEZHA: Neural Contextualized Representation for Chinese Language Understanding](https://arxiv.org/abs/1909.00204) by Junqiu Wei, Xiaozhe Ren, Xiaoguang Li, Wenyong Huang, Yi Liao, Yasheng Wang, Jiashu Lin, Xin Jiang, Xiao Chen and Qun Liu.
|
||||
1. **[NLLB](model_doc/nllb)** (from Meta) released with the paper [No Language Left Behind: Scaling Human-Centered Machine Translation](https://arxiv.org/abs/2207.04672) by the NLLB team.
|
||||
1. **[Nyströmformer](model_doc/nystromformer)** (from the University of Wisconsin - Madison) released with the paper [Nyströmformer: A Nyström-Based Algorithm for Approximating Self-Attention](https://arxiv.org/abs/2102.03902) by Yunyang Xiong, Zhanpeng Zeng, Rudrasis Chakraborty, Mingxing Tan, Glenn Fung, Yin Li, Vikas Singh.
|
||||
1. **[OPT](master/model_doc/opt)** (from Meta AI) released with the paper [OPT: Open Pre-trained Transformer Language Models](https://arxiv.org/abs/2205.01068) by Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen et al.
|
||||
1. **[Pegasus](model_doc/pegasus)** (from Google) released with the paper [PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization](https://arxiv.org/abs/1912.08777) by Jingqing Zhang, Yao Zhao, Mohammad Saleh and Peter J. Liu.
|
||||
|
||||
99
docs/source/en/model_doc/nllb.mdx
Normal file
99
docs/source/en/model_doc/nllb.mdx
Normal file
@@ -0,0 +1,99 @@
|
||||
<!--Copyright 2020 The HuggingFace Team. All rights reserved.
|
||||
|
||||
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
|
||||
the License. You may obtain a copy of the License at
|
||||
|
||||
http://www.apache.org/licenses/LICENSE-2.0
|
||||
|
||||
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
|
||||
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
|
||||
specific language governing permissions and limitations under the License.
|
||||
-->
|
||||
|
||||
# NLLB
|
||||
|
||||
**DISCLAIMER:** If you see something strange, file a [Github Issue](https://github.com/huggingface/transformers/issues/new?assignees=&labels=bug&template=bug-report.yml) and assign
|
||||
@LysandreJik
|
||||
|
||||
## Overview of NLLB
|
||||
|
||||
The NLLB model was presented in [No Language Left Behind: Scaling Human-Centered Machine Translation](https://arxiv.org/abs/2207.04672) by Marta R. Costa-jussà, James Cross, Onur Çelebi,
|
||||
Maha Elbayad, Kenneth Heafield, Kevin Heffernan, Elahe Kalbassi, Janice Lam, Daniel Licht, Jean Maillard, Anna Sun, Skyler Wang, Guillaume Wenzek, Al Youngblood, Bapi Akula,
|
||||
Loic Barrault, Gabriel Mejia Gonzalez, Prangthip Hansanti, John Hoffman, Semarley Jarrett, Kaushik Ram Sadagopan, Dirk Rowe, Shannon Spruit, Chau Tran, Pierre Andrews,
|
||||
Necip Fazil Ayan, Shruti Bhosale, Sergey Edunov, Angela Fan, Cynthia Gao, Vedanuj Goswami, Francisco Guzmán, Philipp Koehn, Alexandre Mourachko, Christophe Ropers,
|
||||
Safiyyah Saleem, Holger Schwenk, and Jeff Wang.
|
||||
|
||||
The abstract of the paper is the following:
|
||||
|
||||
*Driven by the goal of eradicating language barriers on a global scale, machine translation has solidified itself as a key focus of artificial intelligence research today.
|
||||
However, such efforts have coalesced around a small subset of languages, leaving behind the vast majority of mostly low-resource languages. What does it take to break the
|
||||
200 language barrier while ensuring safe, high quality results, all while keeping ethical considerations in mind? In No Language Left Behind, we took on this challenge by
|
||||
first contextualizing the need for low-resource language translation support through exploratory interviews with native speakers. Then, we created datasets and models aimed
|
||||
at narrowing the performance gap between low and high-resource languages. More specifically, we developed a conditional compute model based on Sparsely Gated Mixture of
|
||||
Experts that is trained on data obtained with novel and effective data mining techniques tailored for low-resource languages. We propose multiple architectural and training
|
||||
improvements to counteract overfitting while training on thousands of tasks. Critically, we evaluated the performance of over 40,000 different translation directions using
|
||||
a human-translated benchmark, Flores-200, and combined human evaluation with a novel toxicity benchmark covering all languages in Flores-200 to assess translation safety.
|
||||
Our model achieves an improvement of 44% BLEU relative to the previous state-of-the-art, laying important groundwork towards realizing a universal translation system.*
|
||||
|
||||
This implementation contains the dense models available on release. Let us know via a GitHub issue if you would like to see the MoE models as well.
|
||||
|
||||
This model was contributed by [Lysandre](https://huggingface.co/lysandre). The authors' code can be found [here](https://github.com/facebookresearch/fairseq/tree/nllb).
|
||||
|
||||
## Generating with NLLB
|
||||
|
||||
While generating the target text set the `forced_bos_token_id` to the target language id. The following
|
||||
example shows how to translate English to French using the *facebook/nllb-200-distilled-600M* model.
|
||||
|
||||
Note that we're using the BCP-47 code for French `fra_Latn`. See [here](https://github.com/facebookresearch/flores/blob/main/flores200/README.md#languages-in-flores-200)
|
||||
for the list of all BCP-47 in the Flores 200 dataset.
|
||||
|
||||
```python
|
||||
>>> from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
|
||||
|
||||
>>> tokenizer = AutoTokenizer.from_pretrained("facebook/nllb-200-distilled-600M")
|
||||
>>> model = AutoModelForSeq2SeqLM.from_pretrained("facebook/nllb-200-distilled-600M")
|
||||
|
||||
>>> article = "UN Chief says there is no military solution in Syria"
|
||||
>>> inputs = tokenizer(article, return_tensors="pt")
|
||||
|
||||
>>> translated_tokens = model.generate(
|
||||
... **inputs, forced_bos_token_id=tokenizer.lang_code_to_id["fra_Latn"], max_length=30
|
||||
... )
|
||||
>>> tokenizer.batch_decode(translated_tokens, skip_special_tokens=True)[0]
|
||||
Le chef de l'ONU dit qu'il n'y a pas de solution militaire en Syrie
|
||||
```
|
||||
|
||||
### Generating from any other language than English
|
||||
|
||||
English (`eng_Latn`) is set as the default language from which to translate. In order to specify that you'd like to translate from a different language,
|
||||
you should specify the BCP-47 code in the `src_lang` keyword argument of the tokenizer initialization.
|
||||
|
||||
See example below for a translation from romanian to german:
|
||||
|
||||
```py
|
||||
>>> from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
|
||||
|
||||
>>> tokenizer = AutoTokenizer.from_pretrained(
|
||||
... "facebook/nllb-200-distilled-600M", use_auth_token=True, src_lang="ron_Latn"
|
||||
... )
|
||||
>>> model = AutoModelForSeq2SeqLM.from_pretrained("facebook/nllb-200-distilled-600M", use_auth_token=True)
|
||||
|
||||
>>> article = "Şeful ONU spune că nu există o soluţie militară în Siria"
|
||||
>>> inputs = tokenizer(article, return_tensors="pt")
|
||||
|
||||
>>> translated_tokens = model.generate(
|
||||
... **inputs, forced_bos_token_id=tokenizer.lang_code_to_id["deu_Latn"], max_length=30
|
||||
... )
|
||||
>>> tokenizer.batch_decode(translated_tokens, skip_special_tokens=True)[0]
|
||||
UN-Chef sagt, es gibt keine militärische Lösung in Syrien
|
||||
```
|
||||
|
||||
## NllbTokenizer
|
||||
|
||||
[[autodoc]] NllbTokenizer
|
||||
- as_target_tokenizer
|
||||
- build_inputs_with_special_tokens
|
||||
|
||||
## NllbTokenizerFast
|
||||
|
||||
[[autodoc]] NllbTokenizerFast
|
||||
Reference in New Issue
Block a user