Add mLUKE (#14640)
* implement MLukeTokenizer and LukeForMaskedLM
* update tests
* update docs
* add LukeForMaskedLM to check_repo.py
* update README
* fix test and specify the entity pad id in tokenization_(m)luke
* fix EntityPredictionHeadTransform
@@ -135,6 +135,7 @@ conversion utilities for the following models.
1. **[LED](model_doc/led)** (from AllenAI) released with the paper [Longformer: The Long-Document Transformer](https://arxiv.org/abs/2004.05150) by Iz Beltagy, Matthew E. Peters, Arman Cohan.
1. **[Longformer](model_doc/longformer)** (from AllenAI) released with the paper [Longformer: The Long-Document Transformer](https://arxiv.org/abs/2004.05150) by Iz Beltagy, Matthew E. Peters, Arman Cohan.
1. **[LUKE](model_doc/luke)** (from Studio Ousia) released with the paper [LUKE: Deep Contextualized Entity Representations with Entity-aware Self-attention](https://arxiv.org/abs/2010.01057) by Ikuya Yamada, Akari Asai, Hiroyuki Shindo, Hideaki Takeda, Yuji Matsumoto.
1. **[mLUKE](model_doc/mluke)** (from Studio Ousia) released with the paper [mLUKE: The Power of Entity Representations in Multilingual Pretrained Language Models](https://arxiv.org/abs/2110.08151) by Ryokan Ri, Ikuya Yamada, and Yoshimasa Tsuruoka.
1. **[LXMERT](model_doc/lxmert)** (from UNC Chapel Hill) released with the paper [LXMERT: Learning Cross-Modality Encoder Representations from Transformers for Open-Domain Question Answering](https://arxiv.org/abs/1908.07490) by Hao Tan and Mohit Bansal.
1. **[M2M100](model_doc/m2m_100)** (from Facebook) released with the paper [Beyond English-Centric Multilingual Machine Translation](https://arxiv.org/abs/2010.11125) by Angela Fan, Shruti Bhosale, Holger Schwenk, Zhiyi Ma, Ahmed El-Kishky, Siddharth Goyal, Mandeep Baines, Onur Celebi, Guillaume Wenzek, Vishrav Chaudhary, Naman Goyal, Tom Birch, Vitaliy Liptchinsky, Sergey Edunov, Edouard Grave, Michael Auli, Armand Joulin.
1. **[MarianMT](model_doc/marian)** Machine translation models trained using [OPUS](http://opus.nlpl.eu/) data by Jörg Tiedemann. The [Marian Framework](https://marian-nmt.github.io/) is being developed by the Microsoft Translator Team.
@@ -20,7 +20,7 @@ Rust library `tokenizers <https://github.com/huggingface/tokenizers>`__. The "Fa
1. a significant speed-up in particular when doing batched tokenization and
2. additional methods to map between the original string (character and words) and the token space (e.g. getting the
index of the token comprising a given character or the span of characters corresponding to a given token). Currently
no "Fast" implementation is available for the SentencePiece-based tokenizers (for T5, ALBERT, CamemBERT, XLMRoBERTa
no "Fast" implementation is available for the SentencePiece-based tokenizers (for T5, ALBERT, CamemBERT, XLM-RoBERTa
and XLNet models).
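
The character-to-token mapping described in item 2 can be illustrated with a toy example. This is only a sketch under
stated assumptions: it uses a hypothetical whitespace tokenizer written for this illustration, and the helper names
``whitespace_offsets``, ``char_to_token`` and ``token_to_chars`` are defined here and are not the library's
implementation. It shows how per-token character offsets are enough to answer both kinds of lookup.

```python
from bisect import bisect_right
from typing import List, Optional, Tuple

Span = Tuple[int, int]  # (start, end) character offsets, end exclusive


def whitespace_offsets(text: str) -> List[Span]:
    """Toy tokenizer (illustrative only): one token per whitespace-separated word."""
    offsets: List[Span] = []
    start: Optional[int] = None
    for i, ch in enumerate(text):
        if ch.isspace():
            if start is not None:
                offsets.append((start, i))
                start = None
        elif start is None:
            start = i
    if start is not None:
        offsets.append((start, len(text)))
    return offsets


def char_to_token(offsets: List[Span], char_index: int) -> Optional[int]:
    """Return the index of the token covering char_index, or None for whitespace."""
    starts = [s for s, _ in offsets]
    # Last token whose start is <= char_index (offsets are sorted by start).
    pos = bisect_right(starts, char_index) - 1
    if pos >= 0 and offsets[pos][0] <= char_index < offsets[pos][1]:
        return pos
    return None


def token_to_chars(offsets: List[Span], token_index: int) -> Span:
    """Return the character span of the token at token_index."""
    return offsets[token_index]


text = "mLUKE extends LUKE"
offsets = whitespace_offsets(text)

token_index = char_to_token(offsets, 7)   # character 7 lies inside "extends"
start, end = token_to_chars(offsets, 2)   # span of the third token
print(token_index, text[start:end])
```

A real subword tokenizer produces several tokens per word, but the lookup logic over the offsets stays the same.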

The base classes :class:`~transformers.PreTrainedTokenizer` and :class:`~transformers.PreTrainedTokenizerFast`

@@ -137,6 +137,12 @@ LukeModel
.. autoclass:: transformers.LukeModel
    :members: forward

LukeForMaskedLM
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.LukeForMaskedLM
    :members: forward


LukeForEntityClassification
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

66
docs/source/model_doc/mluke.rst
Normal file
@@ -0,0 +1,66 @@
..
    Copyright 2021 The HuggingFace Team. All rights reserved.

    Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
    the License. You may obtain a copy of the License at

        http://www.apache.org/licenses/LICENSE-2.0

    Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
    an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
    specific language governing permissions and limitations under the License.

mLUKE
-----------------------------------------------------------------------------------------------------------------------

Overview
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The mLUKE model was proposed in `mLUKE: The Power of Entity Representations in Multilingual Pretrained Language Models
<https://arxiv.org/abs/2110.08151>`__ by Ryokan Ri, Ikuya Yamada, and Yoshimasa Tsuruoka. It is a multilingual
extension of the `LUKE model <https://arxiv.org/abs/2010.01057>`__, built on top of XLM-RoBERTa.

mLUKE adds entity embeddings to XLM-RoBERTa, which helps improve performance on various downstream tasks that involve
reasoning about entities, such as named entity recognition, extractive question answering, relation classification,
and cloze-style knowledge completion.

The abstract from the paper is the following:

*Recent studies have shown that multilingual pretrained language models can be effectively improved with cross-lingual
alignment information from Wikipedia entities. However, existing methods only exploit entity information in pretraining
and do not explicitly use entities in downstream tasks. In this study, we explore the effectiveness of leveraging
entity representations for downstream cross-lingual tasks. We train a multilingual language model with 24 languages
with entity representations and show the model consistently outperforms word-based pretrained models in various
cross-lingual transfer tasks. We also analyze the model and the key insight is that incorporating entity
representations into the input allows us to extract more language-agnostic features. We also evaluate the model with a
multilingual cloze prompt task with the mLAMA dataset. We show that entity-based prompt elicits correct factual
knowledge more likely than using only word representations.*

The mLUKE weights can be loaded directly into a LUKE model, like so:

.. code-block::

    from transformers import LukeModel

    # mLUKE checkpoints load into the existing LUKE model classes
    model = LukeModel.from_pretrained('studio-ousia/mluke-base')

Note that mLUKE has its own tokenizer, :class:`~transformers.MLukeTokenizer`. You can initialize it as follows:

.. code-block::

    from transformers import MLukeTokenizer

    tokenizer = MLukeTokenizer.from_pretrained('studio-ousia/mluke-base')

As mLUKE's architecture is equivalent to that of LUKE, one can refer to :doc:`LUKE's documentation page <luke>` for all
tips, code examples and notebooks.
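
As a starting point before turning to the LUKE page, the sketch below shows one way to prepare entity inputs. The
tokenizer's ``__call__`` accepts character-level ``entity_spans``; the helper ``find_entity_spans`` is a hypothetical
convenience function written for this illustration and is not part of the library. It assumes each mention occurs
verbatim in the text and uses its first occurrence. The tokenizer call is left commented out because it needs to
download the checkpoint.

```python
from typing import List, Tuple


def find_entity_spans(text: str, mentions: List[str]) -> List[Tuple[int, int]]:
    """Hypothetical helper: (start, end) character spans, end exclusive,
    for the first occurrence of each mention in text."""
    spans = []
    for mention in mentions:
        start = text.find(mention)
        if start == -1:
            raise ValueError(f"mention {mention!r} not found in text")
        spans.append((start, start + len(mention)))
    return spans


text = "Tokyo is the capital of Japan."
entity_spans = find_entity_spans(text, ["Tokyo", "Japan"])
print(entity_spans)

# Assumed usage with the tokenizer initialized above (requires network access):
# from transformers import MLukeTokenizer
# tokenizer = MLukeTokenizer.from_pretrained("studio-ousia/mluke-base")
# inputs = tokenizer(text, entity_spans=entity_spans, return_tensors="pt")
```

Searching for the first occurrence is the simplest choice; texts with repeated mentions would need explicit offsets.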

This model was contributed by `ryo0634 <https://huggingface.co/ryo0634>`__. The original code can be found `here
<https://github.com/studio-ousia/luke>`__.

MLukeTokenizer
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.MLukeTokenizer
    :members: __call__, save_vocabulary
@@ -17,8 +17,6 @@ Most of the models available in this library are mono-lingual models (English, C
models are available and have different mechanisms than mono-lingual models. This page details the usage of these
models.

The two models that currently support multiple languages are BERT and XLM.

XLM
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

@@ -127,3 +125,17 @@ Two XLM-RoBERTa checkpoints can be used for multi-lingual tasks:

- ``xlm-roberta-base`` (Masked language modeling, 100 languages)
- ``xlm-roberta-large`` (Masked language modeling, 100 languages)

mLUKE
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

mLUKE is based on XLM-RoBERTa and further trained on Wikipedia articles in 24 languages with both a masked language
modeling objective and a masked entity prediction objective.

The model can be used in the same way as other models solely based on word-piece inputs, but it can also be used with
entity representations for further performance gains on entity-related tasks such as relation extraction, named entity
recognition and question answering (see :doc:`LUKE <model_doc/luke>`).

Currently, one mLUKE checkpoint is available:

- ``studio-ousia/mluke-base`` (Masked language modeling + Masked entity prediction, 24 languages)
