Add XLM-V to Model Doc (#21498)
* doc: introduce new section for XLM-V model * doc: mention more details for XLM-V integration * docs: paper abstract in italics, model identifier for base model added * doc: mention new XLM-V support * auto: add XLM-V mapping * doc: run make fix-copies ;)
This commit is contained in:
@@ -391,6 +391,8 @@
|
||||
title: XLM-RoBERTa
|
||||
- local: model_doc/xlm-roberta-xl
|
||||
title: XLM-RoBERTa-XL
|
||||
- local: model_doc/xlm-v
|
||||
title: XLM-V
|
||||
- local: model_doc/xlnet
|
||||
title: XLNet
|
||||
- local: model_doc/yoso
|
||||
|
||||
@@ -221,6 +221,7 @@ The documentation is organized into five sections:
|
||||
1. **[XLM-ProphetNet](model_doc/xlm-prophetnet)** (from Microsoft Research) released with the paper [ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training](https://arxiv.org/abs/2001.04063) by Yu Yan, Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang and Ming Zhou.
|
||||
1. **[XLM-RoBERTa](model_doc/xlm-roberta)** (from Facebook AI), released together with the paper [Unsupervised Cross-lingual Representation Learning at Scale](https://arxiv.org/abs/1911.02116) by Alexis Conneau*, Kartikay Khandelwal*, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer and Veselin Stoyanov.
|
||||
1. **[XLM-RoBERTa-XL](model_doc/xlm-roberta-xl)** (from Facebook AI), released together with the paper [Larger-Scale Transformers for Multilingual Masked Language Modeling](https://arxiv.org/abs/2105.00572) by Naman Goyal, Jingfei Du, Myle Ott, Giri Anantharaman, Alexis Conneau.
|
||||
1. **[XLM-V](model_doc/xlm-v)** (from Meta AI) released with the paper [XLM-V: Overcoming the Vocabulary Bottleneck in Multilingual Masked Language Models](https://arxiv.org/abs/2301.10472) by Davis Liang, Hila Gonen, Yuning Mao, Rui Hou, Naman Goyal, Marjan Ghazvininejad, Luke Zettlemoyer, Madian Khabsa.
|
||||
1. **[XLNet](model_doc/xlnet)** (from Google/CMU) released with the paper [XLNet: Generalized Autoregressive Pretraining for Language Understanding](https://arxiv.org/abs/1906.08237) by Zhilin Yang*, Zihang Dai*, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, Quoc V. Le.
|
||||
1. **[XLS-R](model_doc/xls_r)** (from Facebook AI) released with the paper [XLS-R: Self-supervised Cross-lingual Speech Representation Learning at Scale](https://arxiv.org/abs/2111.09296) by Arun Babu, Changhan Wang, Andros Tjandra, Kushal Lakhotia, Qiantong Xu, Naman Goyal, Kritika Singh, Patrick von Platen, Yatharth Saraf, Juan Pino, Alexei Baevski, Alexis Conneau, Michael Auli.
|
||||
1. **[XLSR-Wav2Vec2](model_doc/xlsr_wav2vec2)** (from Facebook AI) released with the paper [Unsupervised Cross-Lingual Representation Learning For Speech Recognition](https://arxiv.org/abs/2006.13979) by Alexis Conneau, Alexei Baevski, Ronan Collobert, Abdelrahman Mohamed, Michael Auli.
|
||||
|
||||
43
docs/source/en/model_doc/xlm-v.mdx
Normal file
43
docs/source/en/model_doc/xlm-v.mdx
Normal file
@@ -0,0 +1,43 @@
|
||||
<!--Copyright 2023 The HuggingFace Team. All rights reserved.
|
||||
|
||||
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
|
||||
the License. You may obtain a copy of the License at
|
||||
|
||||
http://www.apache.org/licenses/LICENSE-2.0
|
||||
|
||||
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
|
||||
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
|
||||
specific language governing permissions and limitations under the License.
|
||||
-->
|
||||
|
||||
# XLM-V
|
||||
|
||||
## Overview
|
||||
|
||||
XLM-V is multilingual language model with a one million token vocabulary trained on 2.5TB of data from Common Crawl (same as XLM-R).
|
||||
It was introduced in the [XLM-V: Overcoming the Vocabulary Bottleneck in Multilingual Masked Language Models](https://arxiv.org/abs/2301.10472)
|
||||
paper by Davis Liang, Hila Gonen, Yuning Mao, Rui Hou, Naman Goyal, Marjan Ghazvininejad, Luke Zettlemoyer and Madian Khabsa.
|
||||
|
||||
From the abstract of the XLM-V paper:
|
||||
|
||||
*Large multilingual language models typically rely on a single vocabulary shared across 100+ languages.
|
||||
As these models have increased in parameter count and depth, vocabulary size has remained largely unchanged.
|
||||
This vocabulary bottleneck limits the representational capabilities of multilingual models like XLM-R.
|
||||
In this paper, we introduce a new approach for scaling to very large multilingual vocabularies by
|
||||
de-emphasizing token sharing between languages with little lexical overlap and assigning vocabulary capacity
|
||||
to achieve sufficient coverage for each individual language. Tokenizations using our vocabulary are typically
|
||||
more semantically meaningful and shorter compared to XLM-R. Leveraging this improved vocabulary, we train XLM-V,
|
||||
a multilingual language model with a one million token vocabulary. XLM-V outperforms XLM-R on every task we
|
||||
tested on ranging from natural language inference (XNLI), question answering (MLQA, XQuAD, TyDiQA), and
|
||||
named entity recognition (WikiAnn) to low-resource tasks (Americas NLI, MasakhaNER).*
|
||||
|
||||
Tips:
|
||||
|
||||
- XLM-V is compatible with the XLM-RoBERTa model architecture, only model weights from [`fairseq`](https://github.com/facebookresearch/fairseq)
|
||||
library had to be converted.
|
||||
- The `XLMTokenizer` implementation is used to load the vocab and performs tokenization.
|
||||
|
||||
A XLM-V (base size) model is available under the [`facebook/xlm-v-base`](https://huggingface.co/facebook/xlm-v-base) identifier.
|
||||
|
||||
This model was contributed by [stefan-it](https://huggingface.co/stefan-it), including detailed experiments with XLM-V on downstream tasks.
|
||||
The experiments repository can be found [here](https://github.com/stefan-it/xlm-v-experiments).
|
||||
Reference in New Issue
Block a user