Adding the LXMERT pretraining model (MultiModal languageXvision) to HuggingFace's suite of models (#5793)
* added template files for LXMERT and completed the configuration_lxmert.py
* added modeling, tokenization, testing, and finishing touches for lxmert [yet to be tested]
* added model card for lxmert
* cleaning up lxmert code
* Update src/transformers/modeling_lxmert.py
  Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
* Update src/transformers/modeling_tf_lxmert.py
  Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
* Update src/transformers/modeling_tf_lxmert.py
  Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
* Update src/transformers/modeling_lxmert.py
  Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
* tested torch lxmert, changed documentation, updated outputs, and other small fixes
* Update src/transformers/convert_pytorch_checkpoint_to_tf2.py
  Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
* Update src/transformers/convert_pytorch_checkpoint_to_tf2.py
  Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
* Update src/transformers/convert_pytorch_checkpoint_to_tf2.py
  Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
* renaming, other small issues, did not change TF code in this commit
* added lxmert question answering model in pytorch
* added capability to edit number of qa labels for lxmert
* made answer optional for lxmert question answering
* add option to return hidden_states for lxmert
* changed default qa labels for lxmert
* changed config archive path
* squashing 3 commits: merged UI + testing improvements + more UI and testing
* changed some variable names for lxmert
* TF LXMERT
* Various fixes to LXMERT
* Final touches to LXMERT
* AutoTokenizer order
* Add LXMERT to index.rst and README.md
* Merge commit test fixes + Style update
* TensorFlow 2.3.0 sequential model changes variable names
  Remove inherited test
* Update src/transformers/modeling_tf_pytorch_utils.py
* Update docs/source/model_doc/lxmert.rst
  Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
* Update docs/source/model_doc/lxmert.rst
  Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
* Update src/transformers/modeling_tf_lxmert.py
  Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
* added suggestions
* Fixes
* Final fixes for TF model
* Fix docs

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
Co-authored-by: Lysandre <lysandre.debut@reseau.eseo.fr>
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
committed by GitHub
parent 4ebb52afdb
commit ea2c6f1afc
@@ -172,8 +172,9 @@ for Open-Domain Question Answering](https://arxiv.org/abs/2004.04906) by Vladimi
|
||||
Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih.
|
||||
23. **[Pegasus](https://github.com/google-research/pegasus)** (from Google) released with the paper [PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization](https://arxiv.org/abs/1912.08777) by Jingqing Zhang, Yao Zhao, Mohammad Saleh and Peter J. Liu.
|
||||
24. **[MBart](https://github.com/pytorch/fairseq/tree/master/examples/mbart)** (from Facebook) released with the paper [Multilingual Denoising Pre-training for Neural Machine Translation](https://arxiv.org/abs/2001.08210) by Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov, Marjan Ghazvininejad, Mike Lewis, Luke Zettlemoyer.
|
||||
25. **[Other community models](https://huggingface.co/models)**, contributed by the [community](https://huggingface.co/users).
|
||||
26. Want to contribute a new model? We have added a **detailed guide and templates** to guide you in the process of adding a new model. You can find them in the [`templates`](./templates) folder of the repository. Be sure to check the [contributing guidelines](./CONTRIBUTING.md) and contact the maintainers or open an issue to collect feedbacks before starting your PR.
|
||||
25. **[LXMERT](https://github.com/airsplay/lxmert)** (from UNC Chapel Hill) released with the paper [LXMERT: Learning Cross-Modality Encoder Representations from Transformers](https://arxiv.org/abs/1908.07490) by Hao Tan and Mohit Bansal.
|
||||
26. **[Other community models](https://huggingface.co/models)**, contributed by the [community](https://huggingface.co/users).
|
||||
27. Want to contribute a new model? We have added a **detailed guide and templates** to guide you in the process of adding a new model. You can find them in the [`templates`](./templates) folder of the repository. Be sure to check the [contributing guidelines](./CONTRIBUTING.md) and contact the maintainers or open an issue to collect feedbacks before starting your PR.
|
||||
|
||||
These implementations have been tested on several datasets (see the example scripts) and should match the performances of the original implementations (e.g. ~93 F1 on SQuAD for BERT Whole-Word-Masking, ~88 F1 on RocStories for OpenAI GPT, ~18.3 perplexity on WikiText 103 for Transformer-XL, ~0.916 Pearson R coefficient on STS-B for XLNet). You can find more details on the performances in the Examples section of the [documentation](https://huggingface.co/transformers/examples.html).
|
||||
|
||||
|
||||
@@ -128,7 +128,10 @@ conversion utilities for the following models:
|
||||
<https://arxiv.org/abs/1912.08777>`_ by Jingqing Zhang, Yao Zhao, Mohammad Saleh and Peter J. Liu.
|
||||
24. `MBart <https://github.com/pytorch/fairseq/tree/master/examples/mbart>`_ (from Facebook) released with the paper `Multilingual Denoising Pre-training for Neural Machine Translation <https://arxiv.org/abs/2001.08210>`_ by Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov,
|
||||
Marjan Ghazvininejad, Mike Lewis, Luke Zettlemoyer.
|
||||
25. `Other community models <https://huggingface.co/models>`_, contributed by the `community
|
||||
25. `LXMERT <https://github.com/airsplay/lxmert>`_ (from UNC Chapel Hill) released with the paper `LXMERT: Learning
Cross-Modality Encoder Representations from Transformers <https://arxiv.org/abs/1908.07490>`_ by Hao Tan and Mohit Bansal.
|
||||
26. `Other community models <https://huggingface.co/models>`_, contributed by the `community
|
||||
<https://huggingface.co/users>`_.
|
||||
|
||||
.. toctree::
|
||||
@@ -213,6 +216,7 @@ conversion utilities for the following models:
|
||||
model_doc/dpr
|
||||
model_doc/pegasus
|
||||
model_doc/mbart
|
||||
model_doc/lxmert
|
||||
internal/modeling_utils
|
||||
internal/tokenization_utils
|
||||
internal/pipelines_utils
|
||||
|
||||
109
docs/source/model_doc/lxmert.rst
Normal file
@@ -0,0 +1,109 @@
|
||||
LXMERT
|
||||
----------------------------------------------------
|
||||
|
||||
Overview
|
||||
~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
The LXMERT model was proposed in `LXMERT: Learning Cross-Modality Encoder Representations from Transformers <https://arxiv.org/abs/1908.07490>`__
|
||||
by Hao Tan & Mohit Bansal. It is a series of bidirectional transformer encoders (one for the vision modality, one for the language modality, and then one to fuse both modalities)
|
||||
pre-trained using a combination of masked language modeling, visual-language text alignment, ROI-feature regression, masked visual-attribute modeling, masked visual-object modeling, and visual-question answering objectives.
|
||||
The model is pretrained on multiple multi-modal datasets: MSCOCO, Visual-Genome + Visual-Genome Question Answering, VQA 2.0, and GQA.
|
||||
|
||||
The abstract from the paper is the following:
|
||||
|
||||
*Vision-and-language reasoning requires an understanding of visual concepts, language semantics, and, most importantly,
the alignment and relationships between these two modalities. We thus propose the LXMERT (Learning Cross-Modality
Encoder Representations from Transformers) framework to learn these vision-and-language connections. In LXMERT, we
build a large-scale Transformer model that consists of three encoders: an object relationship encoder, a language
encoder, and a cross-modality encoder. Next, to endow our model with the capability of connecting vision and language
semantics, we pre-train the model with large amounts of image-and-sentence pairs, via five diverse representative
pre-training tasks: masked language modeling, masked object prediction (feature regression and label classification),
cross-modality matching, and image question answering. These tasks help in learning both intra-modality and
cross-modality relationships. After fine-tuning from our pretrained parameters, our model achieves the state-of-the-art
results on two visual question answering datasets (i.e., VQA and GQA). We also show the generalizability of our
pretrained cross-modality model by adapting it to a challenging visual-reasoning task, NLVR2, and improve the previous
best result by 22% absolute (54% to 76%). Lastly, we demonstrate detailed ablation studies to prove that both our novel
model components and pretraining strategies significantly contribute to our strong results; and also present several
attention visualizations for the different encoders.*
|
||||
|
||||
Tips:
|
||||
|
||||
- Bounding boxes are not required for the visual feature embeddings; any kind of visual-spatial features will work.
- Both the language hidden states and the visual hidden states that LXMERT outputs are passed through the cross-modality layer, so they
  contain information from both modalities. To access a modality that only attends to itself, select the vision/language hidden states from the first input in the tuple.
- The bi-directional cross-modality encoder attention only returns attention values when the language modality is used as the input and the vision modality is used as the context vector. Further,
  while the cross-modality encoder contains self-attention for each respective modality and cross-attention, only the cross-attention is returned and both self-attention outputs are disregarded.
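
As a minimal sketch of the first tip, the positional input can be any fixed-length spatial descriptor per region. The helper below is hypothetical (it is not part of `transformers`) and assumes boxes arrive as pixel corners `(x1, y1, x2, y2)`; it scales them to the `[0, 1]` range so that each region contributes four values, matching the default `visual_pos_dim` of 4.

```python
from typing import List, Sequence, Tuple

# Assumed box layout for this sketch: (x1, y1, x2, y2) in pixels.
Box = Tuple[float, float, float, float]


def normalize_boxes(boxes: Sequence[Box], image_width: float, image_height: float) -> List[List[float]]:
    """Scale pixel-space corner boxes to [0, 1] so they are resolution independent."""
    if image_width <= 0 or image_height <= 0:
        raise ValueError("image dimensions must be positive")
    normalized = []
    for x1, y1, x2, y2 in boxes:
        normalized.append([x1 / image_width, y1 / image_height, x2 / image_width, y2 / image_height])
    return normalized


# Two hypothetical regions detected in a 640x480 image.
visual_pos = normalize_boxes([(0, 0, 320, 240), (160, 120, 640, 480)], 640, 480)
print(visual_pos)
```

Any other descriptor of the same length (for example centre point plus width and height) could be substituted, as long as it is used consistently during training and inference.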
|
||||
|
||||
The code can be found `here <https://github.com/airsplay/lxmert>`__
|
||||
|
||||
|
||||
LxmertConfig
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
.. autoclass:: transformers.LxmertConfig
|
||||
:members:
|
||||
|
||||
|
||||
LxmertTokenizer
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
.. autoclass:: transformers.LxmertTokenizer
|
||||
:members: build_inputs_with_special_tokens, get_special_tokens_mask,
|
||||
create_token_type_ids_from_sequences, save_vocabulary
|
||||
|
||||
|
||||
Lxmert specific outputs
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
.. autoclass:: transformers.modeling_lxmert.LxmertModelOutput
|
||||
:members:
|
||||
|
||||
.. autoclass:: transformers.modeling_lxmert.LxmertForPreTrainingOutput
|
||||
:members:
|
||||
|
||||
.. autoclass:: transformers.modeling_lxmert.LxmertForQuestionAnsweringOutput
|
||||
:members:
|
||||
|
||||
.. autoclass:: transformers.modeling_tf_lxmert.TFLxmertModelOutput
|
||||
:members:
|
||||
|
||||
.. autoclass:: transformers.modeling_tf_lxmert.TFLxmertForPreTrainingOutput
|
||||
:members:
|
||||
|
||||
|
||||
LxmertModel
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
.. autoclass:: transformers.LxmertModel
|
||||
:members:
|
||||
|
||||
LxmertForPreTraining
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
.. autoclass:: transformers.LxmertForPreTraining
|
||||
:members:
|
||||
|
||||
LxmertForQuestionAnswering
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
.. autoclass:: transformers.LxmertForQuestionAnswering
|
||||
:members:
|
||||
|
||||
|
||||
TFLxmertModel
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
.. autoclass:: transformers.TFLxmertModel
|
||||
:members:
|
||||
|
||||
TFLxmertForPreTraining
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
.. autoclass:: transformers.TFLxmertForPreTraining
|
||||
:members:
|
||||
@@ -364,3 +364,7 @@ For a list that includes community-uploaded models, refer to `https://huggingfac
|
||||
| | ``facebook/mbart-large-en-ro`` | | 24-layer, 1024-hidden, 16-heads, 610M parameters |
|
||||
| | | | mbart-large-cc25 model finetuned on WMT english romanian translation. |
|
||||
+-------------------+------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
|
||||
| Lxmert | ``lxmert-base-uncased`` | | 9-language layers, 5-relationship layers, and 5-cross-modality layers |
|
||||
| | | | 768-hidden, 12-heads (for each layer) ~ 228M parameters |
|
||||
| | | | Starting from lxmert-base checkpoint, trained on over 9 million image-text couplets from COCO, VisualGenome, GQA, VQA |
|
||||
+-------------------+------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
|
||||
|
||||
21
model_cards/uncnlp/lxmert-base-uncased/LICENSE
Normal file
@@ -0,0 +1,21 @@
|
||||
MIT License

Copyright (c) 2019 Hao Tan

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
|
||||
34
model_cards/uncnlp/lxmert-base-uncased/README.md
Normal file
@@ -0,0 +1,34 @@
|
||||
# LXMERT
|
||||
|
||||
## Model Description
|
||||
|
||||
[LXMERT](https://arxiv.org/abs/1908.07490) is a pre-trained multimodal transformer. The model takes an image and a sentence as input and computes cross-modal representations. The model was converted from the [LXMERT GitHub repository](https://github.com/airsplay/lxmert) by [Antonio Mendoza](https://avmendoza.info/) and is authored by [Hao Tan](https://www.cs.unc.edu/~airsplay/).
|
||||
|
||||

|
||||
|
||||
## Usage
|
||||
|
||||
|
||||
## Training Data and Procedure
The model is jointly trained on multiple vision-and-language datasets.
We included two image captioning datasets (i.e., [MS COCO](http://cocodataset.org/#home), [Visual Genome](https://visualgenome.org/)) and three image-question answering datasets (i.e., [VQA](https://visualqa.org/), [GQA](https://cs.stanford.edu/people/dorarad/gqa/), [VG QA](https://github.com/yukezhu/visual7w-toolkit)). The model is pre-trained on the above datasets for 20 epochs (roughly 670K iterations with batch size 256), which takes around 8 days on 4 Titan V cards. The details of training can be found in the [LXMERT paper](https://arxiv.org/pdf/1908.07490.pdf).
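
The figures quoted above can be cross-checked with a little arithmetic. The sketch below uses only the numbers stated in this section (20 epochs, roughly 670K iterations, batch size 256, about 8 days); the variable names are illustrative, and the results are estimates derived from rounded inputs rather than reported measurements.

```python
epochs = 20
iterations = 670_000      # "roughly 670K iterations"
batch_size = 256
training_days = 8         # "around 8 days"

# Total image-sentence pairs processed over the whole run.
examples_seen = iterations * batch_size
# Approximate number of pairs in one pass over the combined datasets.
examples_per_epoch = examples_seen // epochs
# Rough throughput across the 4 GPUs.
iterations_per_day = iterations / training_days

print(f"examples seen:      {examples_seen:,}")
print(f"examples per epoch: {examples_per_epoch:,}")
print(f"iterations per day: {iterations_per_day:,.0f}")
```

Under these rounded inputs, one epoch covers roughly 8.6 million image-sentence pairs.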
|
||||
|
||||
## Eval Results
|
||||
| Split            | [VQA](https://visualqa.org/) | [GQA](https://cs.stanford.edu/people/dorarad/gqa/) | [NLVR2](http://lil.nlp.cornell.edu/nlvr/) |
|----------------- |:----: |:---: |:------:|
| Local Validation | 69.90% | 59.80% | 74.95% |
| Test-Dev         | 72.42% | 60.00% | 74.45% (Test-P) |
| Test-Standard    | 72.54% | 60.33% | 76.18% (Test-U) |
|
||||
|
||||
|
||||
## Reference
|
||||
```bibtex
@inproceedings{tan2019lxmert,
  title={LXMERT: Learning Cross-Modality Encoder Representations from Transformers},
  author={Tan, Hao and Bansal, Mohit},
  booktitle={Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing},
  year={2019}
}
```
|
||||
|
||||
|
||||
BIN
model_cards/uncnlp/lxmert-base-uncased/lxmert_model-1.jpg
Normal file
Binary file not shown.
|
After Width: | Height: | Size: 275 KiB |
@@ -31,6 +31,7 @@ from .configuration_encoder_decoder import EncoderDecoderConfig
|
||||
from .configuration_flaubert import FLAUBERT_PRETRAINED_CONFIG_ARCHIVE_MAP, FlaubertConfig
|
||||
from .configuration_gpt2 import GPT2_PRETRAINED_CONFIG_ARCHIVE_MAP, GPT2Config
|
||||
from .configuration_longformer import LONGFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP, LongformerConfig
|
||||
from .configuration_lxmert import LXMERT_PRETRAINED_CONFIG_ARCHIVE_MAP, LxmertConfig
|
||||
from .configuration_marian import MarianConfig
|
||||
from .configuration_mbart import MBartConfig
|
||||
from .configuration_mmbt import MMBTConfig
|
||||
@@ -156,6 +157,7 @@ from .tokenization_electra import ElectraTokenizer, ElectraTokenizerFast
|
||||
from .tokenization_flaubert import FlaubertTokenizer
|
||||
from .tokenization_gpt2 import GPT2Tokenizer, GPT2TokenizerFast
|
||||
from .tokenization_longformer import LongformerTokenizer, LongformerTokenizerFast
|
||||
from .tokenization_lxmert import LxmertTokenizer, LxmertTokenizerFast
|
||||
from .tokenization_mbart import MBartTokenizer
|
||||
from .tokenization_mobilebert import MobileBertTokenizer, MobileBertTokenizerFast
|
||||
from .tokenization_openai import OpenAIGPTTokenizer, OpenAIGPTTokenizerFast
|
||||
@@ -343,6 +345,15 @@ if is_torch_available():
|
||||
LongformerModel,
|
||||
LongformerSelfAttention,
|
||||
)
|
||||
from .modeling_lxmert import (
|
||||
LxmertEncoder,
|
||||
LxmertForPreTraining,
|
||||
LxmertForQuestionAnswering,
|
||||
LxmertModel,
|
||||
LxmertPreTrainedModel,
|
||||
LxmertVisualFeatureEncoder,
|
||||
LxmertXLayer,
|
||||
)
|
||||
from .modeling_marian import MarianMTModel
|
||||
from .modeling_mbart import MBartForConditionalGeneration
|
||||
from .modeling_mmbt import MMBTForClassification, MMBTModel, ModalEmbeddings
|
||||
@@ -573,6 +584,14 @@ if is_tf_available():
|
||||
TFLongformerModel,
|
||||
TFLongformerSelfAttention,
|
||||
)
|
||||
from .modeling_tf_lxmert import (
|
||||
TF_LXMERT_PRETRAINED_MODEL_ARCHIVE_LIST,
|
||||
TFLxmertForPreTraining,
|
||||
TFLxmertMainLayer,
|
||||
TFLxmertModel,
|
||||
TFLxmertPreTrainedModel,
|
||||
TFLxmertVisualFeatureEncoder,
|
||||
)
|
||||
from .modeling_tf_mobilebert import (
|
||||
TF_MOBILEBERT_PRETRAINED_MODEL_ARCHIVE_LIST,
|
||||
TFMobileBertForMaskedLM,
|
||||
|
||||
@@ -155,5 +155,13 @@ class ConvertCommand(BaseTransformersCLICommand):
|
||||
)
|
||||
|
||||
convert_xlm_checkpoint_to_pytorch(self._tf_checkpoint, self._pytorch_dump_output)
|
||||
elif self._model_type == "lxmert":
|
||||
from transformers.convert_lxmert_original_pytorch_checkpoint_to_pytorch import (
|
||||
convert_lxmert_checkpoint_to_pytorch,
|
||||
)
|
||||
|
||||
convert_lxmert_checkpoint_to_pytorch(self._tf_checkpoint, self._pytorch_dump_output)
|
||||
else:
|
||||
raise ValueError("--model_type should be selected in the list [bert, gpt, gpt2, transfo_xl, xlnet, xlm]")
|
||||
raise ValueError(
|
||||
"--model_type should be selected in the list [bert, gpt, gpt2, transfo_xl, xlnet, xlm, lxmert]"
|
||||
)
|
||||
|
||||
@@ -28,6 +28,7 @@ from .configuration_encoder_decoder import EncoderDecoderConfig
|
||||
from .configuration_flaubert import FLAUBERT_PRETRAINED_CONFIG_ARCHIVE_MAP, FlaubertConfig
|
||||
from .configuration_gpt2 import GPT2_PRETRAINED_CONFIG_ARCHIVE_MAP, GPT2Config
|
||||
from .configuration_longformer import LONGFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP, LongformerConfig
|
||||
from .configuration_lxmert import LXMERT_PRETRAINED_CONFIG_ARCHIVE_MAP, LxmertConfig
|
||||
from .configuration_marian import MarianConfig
|
||||
from .configuration_mbart import MBART_PRETRAINED_CONFIG_ARCHIVE_MAP, MBartConfig
|
||||
from .configuration_mobilebert import MobileBertConfig
|
||||
@@ -66,6 +67,7 @@ ALL_PRETRAINED_CONFIG_ARCHIVE_MAP = dict(
|
||||
ELECTRA_PRETRAINED_CONFIG_ARCHIVE_MAP,
|
||||
LONGFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP,
|
||||
RETRIBERT_PRETRAINED_CONFIG_ARCHIVE_MAP,
|
||||
LXMERT_PRETRAINED_CONFIG_ARCHIVE_MAP,
|
||||
]
|
||||
for key, value, in pretrained_map.items()
|
||||
)
|
||||
@@ -166,6 +168,10 @@ CONFIG_MAPPING = OrderedDict(
|
||||
"encoder-decoder",
|
||||
EncoderDecoderConfig,
|
||||
),
|
||||
(
|
||||
"lxmert",
|
||||
LxmertConfig,
|
||||
),
|
||||
]
|
||||
)
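
For readers unfamiliar with this registry, the hunk above adds a `"lxmert"` key to an ordered mapping from model-type strings to configuration classes. The sketch below illustrates the idea with stand-in classes; `ToyBertConfig`, `ToyLxmertConfig`, `TOY_CONFIG_MAPPING`, and `config_class_for` are hypothetical names, and the actual lookup logic in `transformers` may differ.

```python
from collections import OrderedDict


class ToyBertConfig:
    model_type = "bert"


class ToyLxmertConfig:
    model_type = "lxmert"


# Registry keyed by model-type string, mirroring the shape of CONFIG_MAPPING.
TOY_CONFIG_MAPPING = OrderedDict(
    [
        ("bert", ToyBertConfig),
        ("lxmert", ToyLxmertConfig),
    ]
)


def config_class_for(model_type: str):
    """Return the configuration class registered for `model_type`."""
    try:
        return TOY_CONFIG_MAPPING[model_type]
    except KeyError:
        known = ", ".join(TOY_CONFIG_MAPPING)
        raise ValueError(f"Unknown model type {model_type!r}; expected one of: {known}") from None


print(config_class_for("lxmert").__name__)
```

Registering a new architecture then amounts to adding one `(key, class)` pair, which is what the diff does for LXMERT.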
|
||||
|
||||
|
||||
179
src/transformers/configuration_lxmert.py
Normal file
@@ -0,0 +1,179 @@
|
||||
# coding=utf-8
|
||||
# Copyright 2018, Hao Tan, Mohit Bansal
|
||||
#
|
||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||
# you may not use this file except in compliance with the License.
|
||||
# You may obtain a copy of the License at
|
||||
#
|
||||
# http://www.apache.org/licenses/LICENSE-2.0
|
||||
#
|
||||
# Unless required by applicable law or agreed to in writing, software
|
||||
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
""" LXMERT model configuration """
|
||||
|
||||
|
||||
import logging
|
||||
|
||||
from .configuration_utils import PretrainedConfig
|
||||
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
LXMERT_PRETRAINED_CONFIG_ARCHIVE_MAP = {
|
||||
"unc-nlp/lxmert-base-uncased": "",
|
||||
}
|
||||
|
||||
|
||||
class LxmertConfig(PretrainedConfig):
    r"""
    This is the configuration class to store the configuration of a :class:`~transformers.LxmertModel`.
    It is used to instantiate an LXMERT model according to the specified arguments, defining the model
    architecture.


    Args:
        vocab_size (:obj:`int`, optional, defaults to 30522):
            Vocabulary size of the LXMERT model. Defines the different tokens that
            can be represented by the `input_ids` passed to the forward method of :class:`~transformers.LxmertModel`.
        hidden_size (:obj:`int`, optional, defaults to 768):
            Dimensionality of the encoder layers and the pooler layer.
        r_layers (:obj:`int`, optional, defaults to 5):
            Number of hidden layers in the Transformer visual encoder.
        l_layers (:obj:`int`, optional, defaults to 9):
            Number of hidden layers in the Transformer language encoder.
        x_layers (:obj:`int`, optional, defaults to 5):
            Number of hidden layers in the Transformer cross-modality encoder.
        num_attention_heads (:obj:`int`, optional, defaults to 12):
            Number of attention heads for each attention layer in the Transformer encoder.
        intermediate_size (:obj:`int`, optional, defaults to 3072):
            Dimensionality of the "intermediate" (i.e., feed-forward) layer in the Transformer encoder.
        hidden_act (:obj:`str` or :obj:`function`, optional, defaults to "gelu"):
            The non-linear activation function (function or string) in the encoder and pooler.
            If string, "gelu", "relu", "swish" and "gelu_new" are supported.
        hidden_dropout_prob (:obj:`float`, optional, defaults to 0.1):
            The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
        attention_probs_dropout_prob (:obj:`float`, optional, defaults to 0.1):
            The dropout ratio for the attention probabilities.
        max_position_embeddings (:obj:`int`, optional, defaults to 512):
            The maximum sequence length that this model might ever be used with.
            Typically set this to something large just in case (e.g., 512 or 1024 or 2048).
        type_vocab_size (:obj:`int`, optional, defaults to 2):
            The vocabulary size of the `token_type_ids` passed into :class:`~transformers.LxmertModel`.
        initializer_range (:obj:`float`, optional, defaults to 0.02):
            The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
        layer_norm_eps (:obj:`float`, optional, defaults to 1e-12):
            The epsilon used by the layer normalization layers.
        visual_feat_dim (:obj:`int`, optional, defaults to 2048):
            This represents the last dimension of the pooled-object features used as input for the model,
            representing the size of each object feature itself.
        visual_pos_dim (:obj:`int`, optional, defaults to 4):
            This represents the number of spatial features that are mixed into the visual features.
            The default is set to 4 because most commonly this will represent the location of a bounding box,
            i.e. (x, y, width, height).
        visual_loss_normalizer (:obj:`float`, optional, defaults to 6.67):
            This represents the scaling factor by which each visual loss is multiplied if, during pretraining,
            one decides to train with multiple vision-based loss objectives.
        num_qa_labels (:obj:`int`, optional, defaults to 9500):
            This represents the total number of different question answering (QA) labels. If using more than one
            dataset with QA, the user will need to account for the total number of labels across all of the datasets.
        num_object_labels (:obj:`int`, optional, defaults to 1600):
            This represents the total number of semantically unique objects that LXMERT will be able to classify a
            pooled-object feature as belonging to.
        num_attr_labels (:obj:`int`, optional, defaults to 400):
            This represents the total number of semantically unique attributes that LXMERT will be able to classify a
            pooled-object feature as possessing.
        task_matched (:obj:`bool`, optional, defaults to True):
            This task is used for sentence-image matching. If the sentence correctly describes the image, the label
            will be 1. If the sentence does not correctly describe the image, the label will be 0.
        task_mask_lm (:obj:`bool`, optional, defaults to True):
            This task is the de facto masked language modeling used in pretraining models such as BERT.
        task_obj_predict (:obj:`bool`, optional, defaults to True):
            This task is set to true if the user would like to perform one of the following loss objectives:
            object prediction, attribute prediction, feature regression.
        task_qa (:obj:`bool`, optional, defaults to True):
            This task specifies whether or not LXMERT will calculate the question-answering loss objective.
        visual_obj_loss (:obj:`bool`, optional, defaults to True):
            This task specifies whether or not LXMERT will calculate the object-prediction loss objective.
        visual_attr_loss (:obj:`bool`, optional, defaults to True):
            This task specifies whether or not LXMERT will calculate the attribute-prediction loss objective.
        visual_feat_loss (:obj:`bool`, optional, defaults to True):
            This task specifies whether or not LXMERT will calculate the feature-regression loss objective.
        output_attentions (:obj:`bool`, optional, defaults to False):
            If True, the attentions of the vision, language, and cross-modality layers will be returned.
        output_hidden_states (:obj:`bool`, optional, defaults to False):
            If True, the final cross-modality hidden states for language and vision features will be returned.

    """
|
||||
|
||||
    model_type = "lxmert"

    def __init__(
        self,
        vocab_size=30522,
        hidden_size=768,
        num_attention_heads=12,
        num_labels=2,
        num_qa_labels=9500,
        num_object_labels=1600,
        num_attr_labels=400,
        intermediate_size=3072,
        hidden_act="gelu",
        hidden_dropout_prob=0.1,
        attention_probs_dropout_prob=0.1,
        max_position_embeddings=512,
        type_vocab_size=2,
        initializer_range=0.02,
        layer_norm_eps=1e-12,
        pad_token_id=0,
        l_layers=9,
        x_layers=5,
        r_layers=5,
        visual_feat_dim=2048,
        visual_pos_dim=4,
        visual_loss_normalizer=6.67,
        task_matched=True,
        task_mask_lm=True,
        task_obj_predict=True,
        task_qa=True,
        visual_obj_loss=True,
        visual_attr_loss=True,
        visual_feat_loss=True,
        output_attentions=False,
        output_hidden_states=False,
        **kwargs,
    ):
        super().__init__(**kwargs)
        self.vocab_size = vocab_size
        self.hidden_size = hidden_size
        self.num_attention_heads = num_attention_heads
        self.num_labels = num_labels
        self.hidden_act = hidden_act
        self.intermediate_size = intermediate_size
        self.hidden_dropout_prob = hidden_dropout_prob
        self.attention_probs_dropout_prob = attention_probs_dropout_prob
        self.max_position_embeddings = max_position_embeddings
        self.type_vocab_size = type_vocab_size
        self.initializer_range = initializer_range
        self.layer_norm_eps = layer_norm_eps
        self.num_qa_labels = num_qa_labels
        self.num_object_labels = num_object_labels
        self.num_attr_labels = num_attr_labels
        self.l_layers = l_layers
        self.x_layers = x_layers
        self.r_layers = r_layers
        self.visual_feat_dim = visual_feat_dim
        self.visual_pos_dim = visual_pos_dim
        self.visual_loss_normalizer = visual_loss_normalizer
        self.task_matched = task_matched
        self.task_mask_lm = task_mask_lm
        self.task_obj_predict = task_obj_predict
        self.task_qa = task_qa
        self.visual_obj_loss = visual_obj_loss
        self.visual_attr_loss = visual_attr_loss
        self.visual_feat_loss = visual_feat_loss
        self.output_hidden_states = output_hidden_states
        # Store the constructor argument; assigning the attribute to itself would ignore the caller's value.
        self.output_attentions = output_attentions
        # Per-modality layer counts, keyed by encoder name.
        self.num_hidden_layers = {"vision": r_layers, "cross_encoder": x_layers, "language": l_layers}
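
To make the interaction between the task flags and `visual_loss_normalizer` concrete, here is a minimal sketch that follows the docstring's description: each enabled visual loss is multiplied by the normalizer before being added to the total. `ToyLossConfig` and `combine_losses` are hypothetical, the loss values are made up, and the actual pretraining head may combine its losses differently.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class ToyLossConfig:
    """Stand-in holding only the loss-related flags from the configuration above."""

    task_mask_lm: bool = True
    task_matched: bool = True
    task_obj_predict: bool = True
    task_qa: bool = True
    visual_obj_loss: bool = True
    visual_attr_loss: bool = True
    visual_feat_loss: bool = True
    visual_loss_normalizer: float = 6.67


def combine_losses(config: ToyLossConfig, losses: Dict[str, float]) -> float:
    """Sum the enabled task losses, scaling each visual loss by the normalizer."""
    total = 0.0
    if config.task_mask_lm:
        total += losses["masked_lm"]
    if config.task_matched:
        total += losses["matched"]
    if config.task_qa:
        total += losses["qa"]
    if config.task_obj_predict:
        # The three visual objectives are only considered when task_obj_predict is on.
        visual_flags = {
            "obj": config.visual_obj_loss,
            "attr": config.visual_attr_loss,
            "feat": config.visual_feat_loss,
        }
        for name, enabled in visual_flags.items():
            if enabled:
                total += losses[name] * config.visual_loss_normalizer
    return total


# Made-up per-task loss values, purely for illustration.
example_losses = {"masked_lm": 2.0, "matched": 0.5, "qa": 1.0, "obj": 0.1, "attr": 0.1, "feat": 0.1}
all_tasks = combine_losses(ToyLossConfig(), example_losses)
without_visual = combine_losses(ToyLossConfig(task_obj_predict=False), example_losses)
print(round(all_tasks, 3), without_visual)
```

Turning off `task_obj_predict` removes all three visual terms at once, while the individual `visual_*_loss` flags allow finer control when it is on.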
|
||||
61
src/transformers/convert_lxmert_original_tf_checkpoint_to_pytorch.py
Executable file
@@ -0,0 +1,61 @@
|
||||
# coding=utf-8
|
||||
# Copyright 2018 The HuggingFace Inc. team.
|
||||
#
|
||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||
# you may not use this file except in compliance with the License.
|
||||
# You may obtain a copy of the License at
|
||||
#
|
||||
# http://www.apache.org/licenses/LICENSE-2.0
|
||||
#
|
||||
# Unless required by applicable law or agreed to in writing, software
|
||||
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
"""Convert LXMERT checkpoint."""
|
||||
|
||||
|
||||
import argparse
|
||||
import logging
|
||||
|
||||
import torch
|
||||
|
||||
from transformers import LxmertConfig, LxmertForPreTraining, load_tf_weights_in_lxmert
|
||||
|
||||
|
||||
logging.basicConfig(level=logging.INFO)
|
||||
|
||||
|
||||
def convert_tf_checkpoint_to_pytorch(tf_checkpoint_path, config_file, pytorch_dump_path):
|
||||
# Initialise PyTorch model
|
||||
config = LxmertConfig.from_json_file(config_file)
|
||||
print("Building PyTorch model from configuration: {}".format(str(config)))
|
||||
model = LxmertForPreTraining(config)
|
||||
|
||||
# Load weights from tf checkpoint
|
||||
load_tf_weights_in_lxmert(model, config, tf_checkpoint_path)
|
||||
|
||||
# Save pytorch-model
|
||||
print("Save PyTorch model to {}".format(pytorch_dump_path))
|
||||
torch.save(model.state_dict(), pytorch_dump_path)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
parser = argparse.ArgumentParser()
|
||||
# Required parameters
|
||||
parser.add_argument(
|
||||
"--tf_checkpoint_path", default=None, type=str, required=True, help="Path to the TensorFlow checkpoint."
|
||||
)
|
||||
parser.add_argument(
|
||||
"--config_file",
|
||||
default=None,
|
||||
type=str,
|
||||
required=True,
|
||||
help="The config json file corresponding to the pre-trained model. \n"
|
||||
"This specifies the model architecture.",
|
||||
)
|
||||
parser.add_argument(
|
||||
"--pytorch_dump_path", default=None, type=str, required=True, help="Path to the output PyTorch model."
|
||||
)
|
||||
args = parser.parse_args()
|
||||
convert_tf_checkpoint_to_pytorch(args.tf_checkpoint_path, args.config_file, args.pytorch_dump_path)
|
||||
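The conversion script above takes three required flags. The sketch below rebuilds only the argument parsing, so it runs without TensorFlow or PyTorch; the paths are hypothetical placeholders and `build_parser` is a helper introduced here for illustration.

```python
import argparse


def build_parser():
    # Mirrors the three required flags defined in the script above.
    parser = argparse.ArgumentParser()
    parser.add_argument("--tf_checkpoint_path", type=str, required=True)
    parser.add_argument("--config_file", type=str, required=True)
    parser.add_argument("--pytorch_dump_path", type=str, required=True)
    return parser


# Hypothetical paths, used only to show how the flags map to attributes.
example_argv = [
    "--tf_checkpoint_path", "/tmp/lxmert/model.ckpt",
    "--config_file", "/tmp/lxmert/config.json",
    "--pytorch_dump_path", "/tmp/lxmert/pytorch_model.bin",
]
args = build_parser().parse_args(example_argv)
print(args.tf_checkpoint_path, args.config_file, args.pytorch_dump_path)
```

In the real script these three attributes are passed, in this order, to `convert_tf_checkpoint_to_pytorch`.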
@@ -27,6 +27,7 @@ from transformers import (
|
||||
ELECTRA_PRETRAINED_CONFIG_ARCHIVE_MAP,
|
||||
FLAUBERT_PRETRAINED_CONFIG_ARCHIVE_MAP,
|
||||
GPT2_PRETRAINED_CONFIG_ARCHIVE_MAP,
|
||||
LXMERT_PRETRAINED_CONFIG_ARCHIVE_MAP,
|
||||
OPENAI_GPT_PRETRAINED_CONFIG_ARCHIVE_MAP,
|
||||
ROBERTA_PRETRAINED_CONFIG_ARCHIVE_MAP,
|
||||
T5_PRETRAINED_CONFIG_ARCHIVE_MAP,
|
||||
@@ -43,6 +44,7 @@ from transformers import (
|
||||
ElectraConfig,
|
||||
FlaubertConfig,
|
||||
GPT2Config,
|
||||
LxmertConfig,
|
||||
OpenAIGPTConfig,
|
||||
RobertaConfig,
|
||||
T5Config,
|
||||
@@ -57,6 +59,8 @@ from transformers import (
|
||||
TFElectraForPreTraining,
|
||||
TFFlaubertWithLMHeadModel,
|
||||
TFGPT2LMHeadModel,
|
||||
TFLxmertForPreTraining,
|
||||
TFLxmertVisualFeatureEncoder,
|
||||
TFOpenAIGPTLMHeadModel,
|
||||
TFRobertaForMaskedLM,
|
||||
TFRobertaForSequenceClassification,
|
||||
@@ -94,6 +98,8 @@ if is_torch_available():
|
||||
ElectraForPreTraining,
|
||||
FlaubertWithLMHeadModel,
|
||||
GPT2LMHeadModel,
|
||||
LxmertForPreTraining,
|
||||
LxmertVisualFeatureEncoder,
|
||||
OpenAIGPTLMHeadModel,
|
||||
RobertaForMaskedLM,
|
||||
RobertaForSequenceClassification,
|
||||
@@ -204,6 +210,18 @@ MODEL_CLASSES = {
|
||||
DistilBertForQuestionAnswering,
|
||||
DISTILBERT_PRETRAINED_CONFIG_ARCHIVE_MAP,
|
||||
),
|
||||
"lxmert": (
|
||||
LxmertConfig,
|
||||
TFLxmertForPreTraining,
|
||||
LxmertForPreTraining,
|
||||
LXMERT_PRETRAINED_CONFIG_ARCHIVE_MAP,
|
||||
),
|
||||
"lxmert-visual-feature-encoder": (
|
||||
LxmertConfig,
|
||||
TFLxmertVisualFeatureEncoder,
|
||||
LxmertVisualFeatureEncoder,
|
||||
LXMERT_PRETRAINED_CONFIG_ARCHIVE_MAP,
|
||||
),
|
||||
"ctrl": (
|
||||
CTRLConfig,
|
||||
TFCTRLLMHeadModel,
|
||||
|
||||
@@ -31,6 +31,7 @@ from .configuration_auto import (
|
||||
FlaubertConfig,
|
||||
GPT2Config,
|
||||
LongformerConfig,
|
||||
LxmertConfig,
|
||||
MBartConfig,
|
||||
MobileBertConfig,
|
||||
OpenAIGPTConfig,
|
||||
@@ -116,6 +117,7 @@ from .modeling_longformer import (
|
||||
LongformerForTokenClassification,
|
||||
LongformerModel,
|
||||
)
|
||||
from .modeling_lxmert import LxmertForPreTraining, LxmertModel
|
||||
from .modeling_marian import MarianMTModel
|
||||
from .modeling_mbart import MBartForConditionalGeneration
|
||||
from .modeling_mobilebert import (
|
||||
@@ -200,6 +202,7 @@ MODEL_MAPPING = OrderedDict(
|
||||
(CTRLConfig, CTRLModel),
|
||||
(ElectraConfig, ElectraModel),
|
||||
(ReformerConfig, ReformerModel),
|
||||
(LxmertConfig, LxmertModel),
|
||||
]
|
||||
)
|
||||
|
||||
@@ -224,6 +227,7 @@ MODEL_FOR_PRETRAINING_MAPPING = OrderedDict(
|
||||
(XLMConfig, XLMWithLMHeadModel),
|
||||
(CTRLConfig, CTRLLMHeadModel),
|
||||
(ElectraConfig, ElectraForPreTraining),
|
||||
(LxmertConfig, LxmertForPreTraining),
|
||||
]
|
||||
)
|
||||
|
||||
|
||||
1426
src/transformers/modeling_lxmert.py
Normal file
File diff suppressed because it is too large
1378
src/transformers/modeling_tf_lxmert.py
Normal file
File diff suppressed because it is too large
@@ -883,7 +883,7 @@ MOBILEBERT_START_DOCSTRING = r"""
|
||||
|
||||
MOBILEBERT_INPUTS_DOCSTRING = r"""
|
||||
Args:
|
||||
- input_ids (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`{0}`):
|
||||
+ input_ids (:obj:`np.ndarray` or :obj:`tf.Tensor` of shape :obj:`{0}`):
|
||||
Indices of input sequence tokens in the vocabulary.
|
||||
|
||||
Indices can be obtained using :class:`transformers.MobileBertTokenizer`.
|
||||
@@ -891,28 +891,28 @@ MOBILEBERT_INPUTS_DOCSTRING = r"""
|
||||
:func:`transformers.PreTrainedTokenizer.__call__` for details.
|
||||
|
||||
`What are input IDs? <../glossary.html#input-ids>`__
|
||||
- attention_mask (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`{0}`, `optional`, defaults to :obj:`None`):
|
||||
+ attention_mask (:obj:`np.ndarray` or :obj:`tf.Tensor` of shape :obj:`{0}`, `optional`, defaults to :obj:`None`):
|
||||
Mask to avoid performing attention on padding token indices.
|
||||
Mask values selected in ``[0, 1]``:
|
||||
``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.
|
||||
|
||||
`What are attention masks? <../glossary.html#attention-mask>`__
|
||||
- token_type_ids (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`{0}`, `optional`, defaults to :obj:`None`):
|
||||
+ token_type_ids (:obj:`np.ndarray` or :obj:`tf.Tensor` of shape :obj:`{0}`, `optional`, defaults to :obj:`None`):
|
||||
Segment token indices to indicate first and second portions of the inputs.
|
||||
Indices are selected in ``[0, 1]``: ``0`` corresponds to a `sentence A` token, ``1``
|
||||
corresponds to a `sentence B` token
|
||||
|
||||
`What are token type IDs? <../glossary.html#token-type-ids>`__
|
||||
- position_ids (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`{0}`, `optional`, defaults to :obj:`None`):
|
||||
+ position_ids (:obj:`np.ndarray` or :obj:`tf.Tensor` of shape :obj:`{0}`, `optional`, defaults to :obj:`None`):
|
||||
Indices of positions of each input sequence tokens in the position embeddings.
|
||||
Selected in the range ``[0, config.max_position_embeddings - 1]``.
|
||||
|
||||
`What are position IDs? <../glossary.html#position-ids>`__
|
||||
- head_mask (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(num_heads,)` or :obj:`(num_layers, num_heads)`, `optional`, defaults to :obj:`None`):
|
||||
+ head_mask (:obj:`np.ndarray` or :obj:`tf.Tensor` of shape :obj:`(num_heads,)` or :obj:`(num_layers, num_heads)`, `optional`, defaults to :obj:`None`):
|
||||
Mask to nullify selected heads of the self-attention modules.
|
||||
Mask values selected in ``[0, 1]``:
|
||||
:obj:`1` indicates the head is **not masked**, :obj:`0` indicates the head is **masked**.
|
||||
- inputs_embeds (:obj:`Numpy array` or :obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length, embedding_dim)`, `optional`, defaults to :obj:`None`):
|
||||
+ inputs_embeds (:obj:`np.ndarray` or :obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length, embedding_dim)`, `optional`, defaults to :obj:`None`):
|
||||
Optionally, instead of passing :obj:`input_ids` you can choose to directly pass an embedded representation.
|
||||
This is useful if you want more control over how to convert `input_ids` indices into associated vectors
|
||||
than the model's internal embedding lookup matrix.
|
||||
|
||||
@@ -191,7 +191,7 @@ class TFSequenceClassificationLoss:
|
||||
"""
|
||||
|
||||
def compute_loss(self, labels, logits):
|
||||
- if shape_list(logits)[1] == 1:
|
||||
+ if len(shape_list(logits)) == 1 or shape_list(logits)[1] == 1:
|
||||
loss_fn = tf.keras.losses.MeanSquaredError(reduction=tf.keras.losses.Reduction.NONE)
|
||||
else:
|
||||
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(
|
||||
|
||||
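The hunk above widens the regression check in `compute_loss` so that rank-1 logits are also treated as a single regression target. A minimal sketch of that condition on plain Python lists follows; `uses_regression_loss` is a hypothetical helper, and the list argument stands in for the result of `shape_list(logits)`.

```python
def uses_regression_loss(logits_shape):
    """Return True when the logits describe a single regression target.

    `logits_shape` is a plain list of ints, standing in for the result of
    `shape_list(logits)` in the diff above.
    """
    # The rank check comes first so that a rank-1 shape never reaches the
    # index lookup, which would raise IndexError on a one-element list.
    return len(logits_shape) == 1 or logits_shape[1] == 1


print(uses_regression_loss([8]))     # rank-1 logits
print(uses_regression_loss([8, 1]))  # one output column
print(uses_regression_loss([8, 3]))  # three classes
```

The order of the two tests matters: with only the second test, as in the old line, a rank-1 shape would fail on the index lookup before any loss function was selected.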
@@ -29,6 +29,7 @@ from .configuration_auto import (
|
||||
FlaubertConfig,
|
||||
GPT2Config,
|
||||
LongformerConfig,
|
||||
LxmertConfig,
|
||||
MarianConfig,
|
||||
MBartConfig,
|
||||
MobileBertConfig,
|
||||
@@ -55,6 +56,7 @@ from .tokenization_electra import ElectraTokenizer, ElectraTokenizerFast
|
||||
from .tokenization_flaubert import FlaubertTokenizer
|
||||
from .tokenization_gpt2 import GPT2Tokenizer, GPT2TokenizerFast
|
||||
from .tokenization_longformer import LongformerTokenizer, LongformerTokenizerFast
|
||||
from .tokenization_lxmert import LxmertTokenizer, LxmertTokenizerFast
|
||||
from .tokenization_marian import MarianTokenizer
|
||||
from .tokenization_mbart import MBartTokenizer
|
||||
from .tokenization_mobilebert import MobileBertTokenizer, MobileBertTokenizerFast
|
||||
@@ -91,6 +93,7 @@ TOKENIZER_MAPPING = OrderedDict(
|
||||
(RobertaConfig, (RobertaTokenizer, RobertaTokenizerFast)),
|
||||
(ReformerConfig, (ReformerTokenizer, None)),
|
||||
(ElectraConfig, (ElectraTokenizer, ElectraTokenizerFast)),
|
||||
(LxmertConfig, (LxmertTokenizer, LxmertTokenizerFast)),
|
||||
(BertConfig, (BertTokenizer, BertTokenizerFast)),
|
||||
(OpenAIGPTConfig, (OpenAIGPTTokenizer, OpenAIGPTTokenizerFast)),
|
||||
(GPT2Config, (GPT2Tokenizer, GPT2TokenizerFast)),
|
||||
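The position of the new `LxmertConfig` entry inside the `OrderedDict` can matter if the lookup walks the mapping in order and returns the first `isinstance` match. That lookup is not shown in this diff, so the sketch below is an assumption about how such a mapping is typically consumed; every class and name in it is hypothetical.

```python
from collections import OrderedDict


class ParentConfig:
    """Hypothetical config class."""


class ChildConfig(ParentConfig):
    """Hypothetical subclass; it also passes isinstance checks for ParentConfig."""


# Hypothetical mapping: the more specific class is listed first.
TOY_TOKENIZER_MAPPING = OrderedDict(
    [
        (ChildConfig, "ChildTokenizer"),
        (ParentConfig, "ParentTokenizer"),
    ]
)


def pick_tokenizer(config, mapping):
    # Return the value of the first entry whose key class matches the config.
    for config_class, tokenizer_name in mapping.items():
        if isinstance(config, config_class):
            return tokenizer_name
    raise ValueError("Unrecognized configuration class: {}".format(type(config).__name__))


print(pick_tokenizer(ChildConfig(), TOY_TOKENIZER_MAPPING))
print(pick_tokenizer(ParentConfig(), TOY_TOKENIZER_MAPPING))
```

If the two entries were swapped, a `ChildConfig` instance would match `ParentConfig` first, which is why more specific entries are usually placed earlier in this kind of mapping.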
@@ -128,6 +131,7 @@ class AutoTokenizer:
|
||||
- `xlm`: XLMTokenizer (XLM model)
|
||||
- `ctrl`: CTRLTokenizer (Salesforce CTRL model)
|
||||
- `electra`: ElectraTokenizer (Google ELECTRA model)
|
||||
- `lxmert`: LxmertTokenizer (Lxmert model)
|
||||
|
||||
This class cannot be instantiated using `__init__()` (throws an error).
|
||||
"""
|
||||
@@ -163,6 +167,7 @@ class AutoTokenizer:
|
||||
- `xlm`: XLMTokenizer (XLM model)
|
||||
- `ctrl`: CTRLTokenizer (Salesforce CTRL model)
|
||||
- `electra`: ElectraTokenizer (Google ELECTRA model)
|
||||
- `lxmert`: LxmertTokenizer (Lxmert model)
|
||||
|
||||
Params:
|
||||
pretrained_model_name_or_path: either:
|
||||
|
||||
80
src/transformers/tokenization_lxmert.py
Normal file
@@ -0,0 +1,80 @@
|
||||
# coding=utf-8
|
||||
# Copyright 2020 The Google AI Team, Stanford University and The HuggingFace Inc. team.
|
||||
#
|
||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||
# you may not use this file except in compliance with the License.
|
||||
# You may obtain a copy of the License at
|
||||
#
|
||||
# http://www.apache.org/licenses/LICENSE-2.0
|
||||
#
|
||||
# Unless required by applicable law or agreed to in writing, software
|
||||
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
|
||||
from .tokenization_bert import BertTokenizer, BertTokenizerFast
|
||||
|
||||
|
||||
####################################################
|
||||
# Mapping from the keyword arguments names of Tokenizer `__init__`
|
||||
# to file names for serializing Tokenizer instances
|
||||
####################################################
|
||||
VOCAB_FILES_NAMES = {"vocab_file": "vocab.txt"}
|
||||
|
||||
####################################################
|
||||
# Mapping from the keyword arguments names of Tokenizer `__init__`
|
||||
# to pretrained vocabulary URL for all the model shortcut names.
|
||||
####################################################
|
||||
PRETRAINED_VOCAB_FILES_MAP = {
|
||||
"vocab_file": {
|
||||
"unc-nlp/lxmert-base-uncased": "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-vocab.txt",
|
||||
}
|
||||
}
|
||||
|
||||
####################################################
|
||||
# Mapping from model shortcut names to max length of inputs
|
||||
####################################################
|
||||
PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = {
|
||||
"unc-nlp/lxmert-base-uncased": 512,
|
||||
}
|
||||
####################################################
|
||||
# Mapping from model shortcut names to a dictionary of additional
|
||||
# keyword arguments for Tokenizer `__init__`.
|
||||
# To be used for checkpoint specific configurations.
|
||||
####################################################
|
||||
PRETRAINED_INIT_CONFIGURATION = {
|
||||
"unc-nlp/lxmert-base-uncased": {"do_lower_case": True},
|
||||
}
|
||||
|
||||
|
||||
class LxmertTokenizer(BertTokenizer):
|
||||
r"""
|
||||
Constructs an Lxmert tokenizer.
|
||||
:class:`~transformers.LxmertTokenizer` is identical to :class:`~transformers.BertTokenizer` and runs end-to-end
|
||||
tokenization: punctuation splitting + wordpiece.
|
||||
|
||||
Refer to superclass :class:`~transformers.BertTokenizer` for usage examples and documentation concerning
|
||||
parameters.
|
||||
"""
|
||||
|
||||
vocab_files_names = VOCAB_FILES_NAMES
|
||||
pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP
|
||||
max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES
|
||||
pretrained_init_configuration = PRETRAINED_INIT_CONFIGURATION
|
||||
|
||||
|
||||
class LxmertTokenizerFast(BertTokenizerFast):
|
||||
r"""
|
||||
Constructs a "Fast" Lxmert tokenizer (backed by HuggingFace's `tokenizers` library).
|
||||
|
||||
:class:`~transformers.LxmertTokenizerFast` is identical to :class:`~transformers.BertTokenizerFast` and runs end-to-end
|
||||
tokenization: punctuation splitting + wordpiece.
|
||||
|
||||
Refer to superclass :class:`~transformers.BertTokenizerFast` for usage examples and documentation concerning
|
||||
parameters.
|
||||
"""
|
||||
vocab_files_names = VOCAB_FILES_NAMES
|
||||
pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP
|
||||
max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES
|
||||
pretrained_init_configuration = PRETRAINED_INIT_CONFIGURATION
|
||||
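Both tokenizer classes above add no methods; they only replace class-level lookup tables on their BERT parents. The sketch below shows that pattern with a toy base class. `ToyBaseTokenizer`, `ToyLxmertTokenizer`, and `from_shortcut` are hypothetical names, and whitespace splitting stands in for real WordPiece tokenization.

```python
class ToyBaseTokenizer:
    """Hypothetical base class; the real parent is BertTokenizer."""

    vocab_files_names = {"vocab_file": "vocab.txt"}
    max_model_input_sizes = {}
    pretrained_init_configuration = {}

    def __init__(self, do_lower_case=False):
        self.do_lower_case = do_lower_case

    @classmethod
    def from_shortcut(cls, name):
        # Hypothetical helper: look up per-checkpoint kwargs on the class.
        init_kwargs = cls.pretrained_init_configuration.get(name, {})
        return cls(**init_kwargs)

    def tokenize(self, text):
        # Whitespace splitting only; real WordPiece tokenization is omitted.
        if self.do_lower_case:
            text = text.lower()
        return text.split()


class ToyLxmertTokenizer(ToyBaseTokenizer):
    # Only class-level tables change; all behavior is inherited.
    max_model_input_sizes = {"unc-nlp/lxmert-base-uncased": 512}
    pretrained_init_configuration = {"unc-nlp/lxmert-base-uncased": {"do_lower_case": True}}


tokenizer = ToyLxmertTokenizer.from_shortcut("unc-nlp/lxmert-base-uncased")
print(tokenizer.tokenize("A Dog On A Couch"))
```

Because the subclass carries its own tables, the shortcut name resolves to LXMERT-specific settings while the tokenization logic stays in one place.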
684
tests/test_modeling_lxmert.py
Normal file
@@ -0,0 +1,684 @@
|
||||
# coding=utf-8
|
||||
# Copyright 2018 LXMERT Authors.
|
||||
#
|
||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||
# you may not use this file except in compliance with the License.
|
||||
# You may obtain a copy of the License at
|
||||
#
|
||||
# http://www.apache.org/licenses/LICENSE-2.0
|
||||
#
|
||||
# Unless required by applicable law or agreed to in writing, software
|
||||
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
|
||||
|
||||
import unittest
|
||||
|
||||
from transformers import is_torch_available
|
||||
from transformers.testing_utils import require_torch, slow, torch_device
|
||||
|
||||
from .test_configuration_common import ConfigTester
|
||||
from .test_modeling_common import ModelTesterMixin, ids_tensor
|
||||
|
||||
|
||||
if is_torch_available():
|
||||
import torch
|
||||
|
||||
from transformers import LxmertConfig, LxmertForPreTraining, LxmertForQuestionAnswering, LxmertModel
|
||||
from transformers.modeling_lxmert import LXMERT_PRETRAINED_MODEL_ARCHIVE_LIST
|
||||
|
||||
|
||||
class LxmertModelTester:
|
||||
"""Builds a small LxmertConfig and random inputs for the LXMERT model tests."""
|
||||
|
||||
def __init__(
|
||||
self,
|
||||
parent,
|
||||
vocab_size=300,
|
||||
hidden_size=28,
|
||||
num_attention_heads=2,
|
||||
num_labels=2,
|
||||
intermediate_size=64,
|
||||
hidden_act="gelu",
|
||||
hidden_dropout_prob=0.1,
|
||||
attention_probs_dropout_prob=0.1,
|
||||
max_position_embeddings=512,
|
||||
type_vocab_size=2,
|
||||
initializer_range=0.02,
|
||||
layer_norm_eps=1e-12,
|
||||
pad_token_id=0,
|
||||
num_qa_labels=30,
|
||||
num_object_labels=16,
|
||||
num_attr_labels=4,
|
||||
num_visual_features=10,
|
||||
l_layers=2,
|
||||
x_layers=1,
|
||||
r_layers=1,
|
||||
visual_feat_dim=128,
|
||||
visual_pos_dim=4,
|
||||
visual_loss_normalizer=6.67,
|
||||
seq_length=20,
|
||||
batch_size=4,
|
||||
is_training=True,
|
||||
task_matched=True,
|
||||
task_mask_lm=True,
|
||||
task_obj_predict=True,
|
||||
task_qa=True,
|
||||
visual_obj_loss=True,
|
||||
visual_attr_loss=True,
|
||||
visual_feat_loss=True,
|
||||
use_token_type_ids=True,
|
||||
use_lang_mask=True,
|
||||
output_attentions=False,
|
||||
output_hidden_states=False,
|
||||
scope=None,
|
||||
):
|
||||
self.parent = parent
|
||||
self.vocab_size = vocab_size
|
||||
self.hidden_size = hidden_size
|
||||
self.num_attention_heads = num_attention_heads
|
||||
self.num_labels = num_labels
|
||||
self.intermediate_size = intermediate_size
|
||||
self.hidden_act = hidden_act
|
||||
self.hidden_dropout_prob = hidden_dropout_prob
|
||||
self.attention_probs_dropout_prob = attention_probs_dropout_prob
|
||||
self.max_position_embeddings = max_position_embeddings
|
||||
self.type_vocab_size = type_vocab_size
|
||||
self.initializer_range = initializer_range
|
||||
self.layer_norm_eps = layer_norm_eps
|
||||
self.pad_token_id = pad_token_id
|
||||
self.num_qa_labels = num_qa_labels
|
||||
self.num_object_labels = num_object_labels
|
||||
self.num_attr_labels = num_attr_labels
|
||||
self.l_layers = l_layers
|
||||
self.x_layers = x_layers
|
||||
self.r_layers = r_layers
|
||||
self.visual_feat_dim = visual_feat_dim
|
||||
self.visual_pos_dim = visual_pos_dim
|
||||
self.visual_loss_normalizer = visual_loss_normalizer
|
||||
self.seq_length = seq_length
|
||||
self.batch_size = batch_size
|
||||
self.is_training = is_training
|
||||
self.use_lang_mask = use_lang_mask
|
||||
self.task_matched = task_matched
|
||||
self.task_mask_lm = task_mask_lm
|
||||
self.task_obj_predict = task_obj_predict
|
||||
self.task_qa = task_qa
|
||||
self.visual_obj_loss = visual_obj_loss
|
||||
self.visual_attr_loss = visual_attr_loss
|
||||
self.visual_feat_loss = visual_feat_loss
|
||||
self.num_visual_features = num_visual_features
|
||||
self.use_token_type_ids = use_token_type_ids
|
||||
self.output_attentions = output_attentions
|
||||
self.output_hidden_states = output_hidden_states
|
||||
self.scope = scope
|
||||
self.num_hidden_layers = {"vision": r_layers, "cross_encoder": x_layers, "language": l_layers}
|
||||
|
||||
def prepare_config_and_inputs(self):
|
||||
|
||||
output_attentions = self.output_attentions
|
||||
input_ids = ids_tensor([self.batch_size, self.seq_length], vocab_size=self.vocab_size)
|
||||
visual_feats = torch.rand(self.batch_size, self.num_visual_features, self.visual_feat_dim)
|
||||
bounding_boxes = torch.rand(self.batch_size, self.num_visual_features, 4)
|
||||
|
||||
input_mask = None
|
||||
if self.use_lang_mask:
|
||||
input_mask = ids_tensor([self.batch_size, self.seq_length], vocab_size=2)
|
||||
token_type_ids = None
|
||||
if self.use_token_type_ids:
|
||||
token_type_ids = ids_tensor([self.batch_size, self.seq_length], self.type_vocab_size)
|
||||
obj_labels = None
|
||||
if self.task_obj_predict:
|
||||
obj_labels = {}
|
||||
if self.visual_attr_loss and self.task_obj_predict:
|
||||
obj_labels["attr"] = (
|
||||
ids_tensor([self.batch_size, self.num_visual_features], self.num_attr_labels),
|
||||
ids_tensor([self.batch_size, self.num_visual_features], self.num_attr_labels),
|
||||
)
|
||||
if self.visual_feat_loss and self.task_obj_predict:
|
||||
obj_labels["feat"] = (
|
||||
ids_tensor(
|
||||
[self.batch_size, self.num_visual_features, self.visual_feat_dim], self.num_visual_features
|
||||
),
|
||||
ids_tensor([self.batch_size, self.num_visual_features], self.num_visual_features),
|
||||
)
|
||||
if self.visual_obj_loss and self.task_obj_predict:
|
||||
obj_labels["obj"] = (
|
||||
ids_tensor([self.batch_size, self.num_visual_features], self.num_object_labels),
|
||||
ids_tensor([self.batch_size, self.num_visual_features], self.num_object_labels),
|
||||
)
|
||||
ans = None
|
||||
if self.task_qa:
|
||||
ans = ids_tensor([self.batch_size], self.num_qa_labels)
|
||||
masked_lm_labels = None
|
||||
if self.task_mask_lm:
|
||||
masked_lm_labels = ids_tensor([self.batch_size, self.seq_length], self.vocab_size)
|
||||
matched_label = None
|
||||
if self.task_matched:
|
||||
matched_label = ids_tensor([self.batch_size], self.num_labels)
|
||||
|
||||
config = LxmertConfig(
|
||||
vocab_size=self.vocab_size,
|
||||
hidden_size=self.hidden_size,
|
||||
num_attention_heads=self.num_attention_heads,
|
||||
num_labels=self.num_labels,
|
||||
intermediate_size=self.intermediate_size,
|
||||
hidden_act=self.hidden_act,
|
||||
hidden_dropout_prob=self.hidden_dropout_prob,
|
||||
attention_probs_dropout_prob=self.attention_probs_dropout_prob,
|
||||
max_position_embeddings=self.max_position_embeddings,
|
||||
type_vocab_size=self.type_vocab_size,
|
||||
initializer_range=self.initializer_range,
|
||||
layer_norm_eps=self.layer_norm_eps,
|
||||
pad_token_id=self.pad_token_id,
|
||||
num_qa_labels=self.num_qa_labels,
|
||||
num_object_labels=self.num_object_labels,
|
||||
num_attr_labels=self.num_attr_labels,
|
||||
l_layers=self.l_layers,
|
||||
x_layers=self.x_layers,
|
||||
r_layers=self.r_layers,
|
||||
visual_feat_dim=self.visual_feat_dim,
|
||||
visual_pos_dim=self.visual_pos_dim,
|
||||
visual_loss_normalizer=self.visual_loss_normalizer,
|
||||
task_matched=self.task_matched,
|
||||
task_mask_lm=self.task_mask_lm,
|
||||
task_obj_predict=self.task_obj_predict,
|
||||
task_qa=self.task_qa,
|
||||
visual_obj_loss=self.visual_obj_loss,
|
||||
visual_attr_loss=self.visual_attr_loss,
|
||||
visual_feat_loss=self.visual_feat_loss,
|
||||
output_attentions=self.output_attentions,
|
||||
output_hidden_states=self.output_hidden_states,
|
||||
)
|
||||
|
||||
return (
|
||||
config,
|
||||
input_ids,
|
||||
visual_feats,
|
||||
bounding_boxes,
|
||||
token_type_ids,
|
||||
input_mask,
|
||||
obj_labels,
|
||||
masked_lm_labels,
|
||||
matched_label,
|
||||
ans,
|
||||
output_attentions,
|
||||
)
|
||||
|
||||
def create_and_check_lxmert_model(
|
||||
self,
|
||||
config,
|
||||
input_ids,
|
||||
visual_feats,
|
||||
bounding_boxes,
|
||||
token_type_ids,
|
||||
input_mask,
|
||||
obj_labels,
|
||||
masked_lm_labels,
|
||||
matched_label,
|
||||
ans,
|
||||
output_attentions,
|
||||
):
|
||||
model = LxmertModel(config=config)
|
||||
model.to(torch_device)
|
||||
model.eval()
|
||||
result = model(
|
||||
input_ids,
|
||||
visual_feats,
|
||||
bounding_boxes,
|
||||
token_type_ids=token_type_ids,
|
||||
attention_mask=input_mask,
|
||||
output_attentions=output_attentions,
|
||||
)
|
||||
result = model(
|
||||
input_ids,
|
||||
visual_feats,
|
||||
bounding_boxes,
|
||||
token_type_ids=token_type_ids,
|
||||
attention_mask=input_mask,
|
||||
output_attentions=not output_attentions,
|
||||
)
|
||||
result = model(input_ids, visual_feats, bounding_boxes, return_dict=False)
|
||||
result = model(input_ids, visual_feats, bounding_boxes, return_dict=True)
|
||||
|
||||
self.parent.assertEqual(result.language_output.shape, (self.batch_size, self.seq_length, self.hidden_size))
|
||||
self.parent.assertEqual(
|
||||
result.vision_output.shape, (self.batch_size, self.num_visual_features, self.hidden_size)
|
||||
)
|
||||
self.parent.assertEqual(result.pooled_output.shape, (self.batch_size, self.hidden_size))
|
||||
|
||||
def create_and_check_lxmert_for_question_answering(
|
||||
self,
|
||||
config,
|
||||
input_ids,
|
||||
visual_feats,
|
||||
bounding_boxes,
|
||||
token_type_ids,
|
||||
input_mask,
|
||||
obj_labels,
|
||||
masked_lm_labels,
|
||||
matched_label,
|
||||
ans,
|
||||
output_attentions,
|
||||
):
|
||||
model = LxmertForQuestionAnswering(config=config)
|
||||
model.to(torch_device)
|
||||
model.eval()
|
||||
result = model(
|
||||
input_ids,
|
||||
visual_feats,
|
||||
bounding_boxes,
|
||||
token_type_ids=token_type_ids,
|
||||
attention_mask=input_mask,
|
||||
labels=ans,
|
||||
output_attentions=output_attentions,
|
||||
return_dict=True,
|
||||
)
|
||||
result = model(input_ids, visual_feats, bounding_boxes, labels=ans)
|
||||
result = model(
|
||||
input_ids,
|
||||
visual_feats,
|
||||
bounding_boxes,
|
||||
labels=ans,
|
||||
token_type_ids=token_type_ids,
|
||||
attention_mask=input_mask,
|
||||
output_attentions=output_attentions,
|
||||
)
|
||||
result = model(
|
||||
input_ids,
|
||||
visual_feats,
|
||||
bounding_boxes,
|
||||
token_type_ids=token_type_ids,
|
||||
attention_mask=input_mask,
|
||||
labels=ans,
|
||||
output_attentions=not output_attentions,
|
||||
return_dict=True,
|
||||
)
|
||||
|
||||
self.parent.assertEqual(result.question_answering_score.shape, (self.batch_size, self.num_qa_labels))
|
||||
|
||||
def create_and_check_lxmert_for_pretraining(
|
||||
self,
|
||||
config,
|
||||
input_ids,
|
||||
visual_feats,
|
||||
bounding_boxes,
|
||||
token_type_ids,
|
||||
input_mask,
|
||||
obj_labels,
|
||||
masked_lm_labels,
|
||||
matched_label,
|
||||
ans,
|
||||
output_attentions,
|
||||
):
|
||||
model = LxmertForPreTraining(config=config)
|
||||
model.to(torch_device)
|
||||
model.eval()
|
||||
result = model(
|
||||
input_ids,
|
||||
visual_feats,
|
||||
bounding_boxes,
|
||||
token_type_ids=token_type_ids,
|
||||
attention_mask=input_mask,
|
||||
masked_lm_labels=masked_lm_labels,
|
||||
obj_labels=obj_labels,
|
||||
matched_label=matched_label,
|
||||
ans=ans,
|
||||
output_attentions=output_attentions,
|
||||
return_dict=True,
|
||||
)
|
||||
result = model(
|
||||
input_ids,
|
||||
visual_feats,
|
||||
bounding_boxes,
|
||||
token_type_ids=token_type_ids,
|
||||
attention_mask=input_mask,
|
||||
masked_lm_labels=masked_lm_labels,
|
||||
output_attentions=not output_attentions,
|
||||
return_dict=False,
|
||||
)
|
||||
result = model(
|
||||
input_ids,
|
||||
visual_feats,
|
||||
bounding_boxes,
|
||||
token_type_ids=token_type_ids,
|
||||
attention_mask=input_mask,
|
||||
masked_lm_labels=masked_lm_labels,
|
||||
)
|
||||
result = model(
|
||||
input_ids,
|
||||
visual_feats,
|
||||
bounding_boxes,
|
||||
token_type_ids=token_type_ids,
|
||||
attention_mask=input_mask,
|
||||
obj_labels=obj_labels,
|
||||
)
|
||||
result = model(
|
||||
input_ids,
|
||||
visual_feats,
|
||||
bounding_boxes,
|
||||
token_type_ids=token_type_ids,
|
||||
attention_mask=input_mask,
|
||||
matched_label=matched_label,
|
||||
)
|
||||
result = model(
|
||||
input_ids,
|
||||
visual_feats,
|
||||
bounding_boxes,
|
||||
token_type_ids=token_type_ids,
|
||||
attention_mask=input_mask,
|
||||
ans=ans,
|
||||
)
|
||||
result = model(
|
||||
input_ids,
|
||||
visual_feats,
|
||||
bounding_boxes,
|
||||
token_type_ids=token_type_ids,
|
||||
attention_mask=input_mask,
|
||||
masked_lm_labels=masked_lm_labels,
|
||||
obj_labels=obj_labels,
|
||||
matched_label=matched_label,
|
||||
ans=ans,
|
||||
output_attentions=not output_attentions,
|
||||
return_dict=True,
|
||||
)
|
||||
|
||||
self.parent.assertEqual(result.prediction_logits.shape, (self.batch_size, self.seq_length, self.vocab_size))
|
||||
|
||||
def resize_lxmert_num_qa_labels(
|
||||
self,
|
||||
config,
|
||||
input_ids,
|
||||
visual_feats,
|
||||
bounding_boxes,
|
||||
token_type_ids,
|
||||
input_mask,
|
||||
obj_labels,
|
||||
masked_lm_labels,
|
||||
matched_label,
|
||||
ans,
|
||||
output_attentions,
|
||||
):
|
||||
|
||||
start_labels = config.num_qa_labels
|
||||
num_large_labels = config.num_qa_labels * 2
|
||||
num_small_labels = config.num_qa_labels // 2
|
||||
less_labels_ans = ids_tensor([self.batch_size], num_small_labels)
|
||||
more_labels_ans = ids_tensor([self.batch_size], num_large_labels)
|
||||
model_pretrain = LxmertForPreTraining(config=config)
|
||||
model_qa = LxmertForQuestionAnswering(config=config)
|
||||
config.num_labels = num_small_labels
|
||||
end_labels = config.num_labels
|
||||
|
||||
result_pretrain = model_pretrain(
|
||||
input_ids,
|
||||
visual_feats,
|
||||
bounding_boxes,
|
||||
token_type_ids=token_type_ids,
|
||||
attention_mask=input_mask,
|
||||
ans=ans,
|
||||
return_dict=True,
|
||||
)
|
||||
|
||||
result_qa = model_qa(
|
||||
input_ids,
|
||||
visual_feats,
|
||||
bounding_boxes,
|
||||
labels=ans,
|
||||
token_type_ids=token_type_ids,
|
||||
attention_mask=input_mask,
|
||||
return_dict=True,
|
||||
)
|
||||
|
||||
model_pretrain.resize_num_qa_labels(num_small_labels)
|
||||
model_qa.resize_num_qa_labels(num_small_labels)
|
||||
|
||||
result_pretrain_less = model_pretrain(
|
||||
input_ids,
|
||||
visual_feats,
|
||||
bounding_boxes,
|
||||
token_type_ids=token_type_ids,
|
||||
attention_mask=input_mask,
|
||||
ans=less_labels_ans,
|
||||
return_dict=True,
|
||||
)
|
||||
|
||||
result_qa_less = model_qa(
|
||||
input_ids,
|
||||
visual_feats,
|
||||
bounding_boxes,
|
||||
labels=less_labels_ans,
|
||||
token_type_ids=token_type_ids,
|
||||
attention_mask=input_mask,
|
||||
return_dict=True,
|
||||
)
|
||||
|
||||
model_pretrain.resize_num_qa_labels(num_large_labels)
|
||||
model_qa.resize_num_qa_labels(num_large_labels)
|
||||
|
||||
result_pretrain_more = model_pretrain(
|
||||
input_ids,
|
||||
visual_feats,
|
||||
bounding_boxes,
|
||||
token_type_ids=token_type_ids,
|
||||
attention_mask=input_mask,
|
||||
ans=more_labels_ans,
|
||||
return_dict=True,
|
||||
)
|
||||
|
||||
result_qa_more = model_qa(
|
||||
input_ids,
|
||||
visual_feats,
|
||||
bounding_boxes,
|
||||
labels=more_labels_ans,
|
||||
token_type_ids=token_type_ids,
|
||||
attention_mask=input_mask,
|
||||
return_dict=True,
|
||||
)
|
||||
|
||||
model_qa_labels = model_qa.num_qa_labels
|
||||
|
||||
self.parent.assertNotEqual(start_labels, end_labels)
|
||||
self.parent.assertNotEqual(model_qa_labels, start_labels)
|
||||
self.parent.assertEqual(result_qa.question_answering_score.shape, (self.batch_size, start_labels))
|
||||
self.parent.assertEqual(result_pretrain.question_answering_score.shape, (self.batch_size, start_labels))
|
||||
self.parent.assertEqual(result_qa_less.question_answering_score.shape, (self.batch_size, num_small_labels))
|
||||
self.parent.assertEqual(
|
||||
result_pretrain_less.question_answering_score.shape, (self.batch_size, num_small_labels)
|
||||
)
|
||||
self.parent.assertEqual(result_qa_more.question_answering_score.shape, (self.batch_size, num_large_labels))
|
||||
self.parent.assertEqual(
|
||||
result_pretrain_more.question_answering_score.shape, (self.batch_size, num_large_labels)
|
||||
)
|
||||
|
||||
def prepare_config_and_inputs_for_common(self):
|
||||
config_and_inputs = self.prepare_config_and_inputs()
|
||||
(
|
||||
config,
|
||||
input_ids,
|
||||
visual_feats,
|
||||
bounding_boxes,
|
||||
token_type_ids,
|
||||
input_mask,
|
||||
obj_labels,
|
||||
masked_lm_labels,
|
||||
matched_label,
|
||||
ans,
|
||||
output_attentions,
|
||||
) = config_and_inputs
|
||||
|
||||
inputs_dict = {
|
||||
"input_ids": input_ids,
|
||||
"visual_feats": visual_feats,
|
||||
"visual_pos": bounding_boxes,
|
||||
"token_type_ids": token_type_ids,
|
||||
"attention_mask": input_mask,
|
||||
}
|
||||
|
||||
return config, inputs_dict
|
||||
|
||||
|
||||
@require_torch
|
||||
class LxmertModelTest(ModelTesterMixin, unittest.TestCase):
|
||||
|
||||
all_model_classes = (LxmertModel, LxmertForPreTraining, LxmertForQuestionAnswering) if is_torch_available() else ()
|
||||
|
||||
test_head_masking = False
|
||||
test_pruning = False
|
||||
test_torchscript = False
|
||||
|
||||
def setUp(self):
|
||||
self.model_tester = LxmertModelTester(self)
|
||||
self.config_tester = ConfigTester(self, config_class=LxmertConfig, hidden_size=37)
|
||||
|
||||
def test_config(self):
|
||||
self.config_tester.run_common_tests()
|
||||
|
||||
def test_lxmert_model(self):
|
||||
config_and_inputs = self.model_tester.prepare_config_and_inputs()
|
||||
self.model_tester.create_and_check_lxmert_model(*config_and_inputs)
|
||||
|
||||
def test_lxmert_question_answering(self):
|
||||
config_and_inputs = self.model_tester.prepare_config_and_inputs()
|
||||
self.model_tester.create_and_check_lxmert_for_question_answering(*config_and_inputs)
|
||||
|
||||
def test_lxmert_pretraining(self):
|
||||
config_and_inputs = self.model_tester.prepare_config_and_inputs()
|
||||
self.model_tester.create_and_check_lxmert_for_pretraining(*config_and_inputs)
|
||||
|
||||
def test_lxmert_question_answering_labels_resize(self):
|
||||
config_and_inputs = self.model_tester.prepare_config_and_inputs()
|
||||
self.model_tester.resize_lxmert_num_qa_labels(*config_and_inputs)
|
||||
|
||||
@slow
|
||||
def test_model_from_pretrained(self):
|
||||
for model_name in LXMERT_PRETRAINED_MODEL_ARCHIVE_LIST[:1]:
|
||||
model = LxmertModel.from_pretrained(model_name)
|
||||
self.assertIsNotNone(model)
|
||||
|
||||
def test_attention_outputs(self):
|
||||
config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()
|
||||
seq_len = getattr(self.model_tester, "seq_length", None)
|
||||
encoder_seq_length = getattr(self.model_tester, "encoder_seq_length", seq_len)
|
||||
encoder_key_length = getattr(self.model_tester, "key_length", encoder_seq_length)
|
||||
chunk_length = getattr(self.model_tester, "chunk_length", None)
|
||||
if chunk_length is not None and hasattr(self.model_tester, "num_hashes"):
|
||||
encoder_seq_length = encoder_seq_length * self.model_tester.num_hashes
|
||||
|
||||
for model_class in self.all_model_classes:
|
||||
inputs_dict["output_attentions"] = True
|
||||
inputs_dict["output_hidden_states"] = False
|
||||
model = model_class(config)
|
||||
model.to(torch_device)
|
||||
model.eval()
|
||||
with torch.no_grad():
|
||||
outputs = model(**self._prepare_for_class(inputs_dict, model_class))
|
||||
|
||||
language_attentions, vision_attentions, cross_encoder_attentions = (outputs[-3], outputs[-2], outputs[-1])
|
||||
|
||||
self.assertEqual(len(language_attentions), self.model_tester.num_hidden_layers["language"])
|
||||
self.assertEqual(len(vision_attentions), self.model_tester.num_hidden_layers["vision"])
|
||||
self.assertEqual(len(cross_encoder_attentions), self.model_tester.num_hidden_layers["cross_encoder"])
|
||||
|
||||
# check that output_attentions also works when set through the config
|
||||
del inputs_dict["output_attentions"]
|
||||
config.output_attentions = True
|
||||
model = model_class(config)
|
||||
model.to(torch_device)
|
||||
model.eval()
|
||||
with torch.no_grad():
|
||||
outputs = model(**self._prepare_for_class(inputs_dict, model_class))
|
||||
|
||||
language_attentions, vision_attentions, cross_encoder_attentions = (outputs[-3], outputs[-2], outputs[-1])
|
||||
self.assertEqual(len(language_attentions), self.model_tester.num_hidden_layers["language"])
|
||||
self.assertEqual(len(vision_attentions), self.model_tester.num_hidden_layers["vision"])
|
||||
self.assertEqual(len(cross_encoder_attentions), self.model_tester.num_hidden_layers["cross_encoder"])
|
||||
|
||||
attentions = [language_attentions, vision_attentions, cross_encoder_attentions]
|
||||
attention_shapes = [
|
||||
[self.model_tester.num_attention_heads, encoder_seq_length, encoder_key_length],
|
||||
[
|
||||
self.model_tester.num_attention_heads,
|
||||
self.model_tester.num_visual_features,
|
||||
self.model_tester.num_visual_features,
|
||||
],
|
||||
[self.model_tester.num_attention_heads, encoder_key_length, self.model_tester.num_visual_features],
|
||||
]
|
||||
|
||||
for attention, attention_shape in zip(attentions, attention_shapes):
|
||||
self.assertListEqual(list(attention[0].shape[-3:]), attention_shape)
|
||||
out_len = len(outputs)
|
||||
|
||||
# Check attention is always last and order is fine
|
||||
inputs_dict["output_attentions"] = True
|
||||
inputs_dict["output_hidden_states"] = True
|
||||
model = model_class(config)
|
||||
model.to(torch_device)
|
||||
model.eval()
|
||||
with torch.no_grad():
|
||||
outputs = model(**self._prepare_for_class(inputs_dict, model_class))
|
||||
|
||||
# 2 hidden states were added
|
||||
self.assertEqual(out_len + 2, len(outputs))
|
||||
|
||||
language_attentions, vision_attentions, cross_encoder_attentions = (outputs[-3], outputs[-2], outputs[-1])
|
||||
self.assertEqual(len(language_attentions), self.model_tester.num_hidden_layers["language"])
|
||||
self.assertEqual(len(vision_attentions), self.model_tester.num_hidden_layers["vision"])
|
||||
self.assertEqual(len(cross_encoder_attentions), self.model_tester.num_hidden_layers["cross_encoder"])
|
||||
|
||||
attentions = [language_attentions, vision_attentions, cross_encoder_attentions]
|
||||
attention_shapes = [
|
||||
[self.model_tester.num_attention_heads, encoder_seq_length, encoder_key_length],
|
||||
[
|
||||
self.model_tester.num_attention_heads,
|
||||
self.model_tester.num_visual_features,
|
||||
self.model_tester.num_visual_features,
|
||||
],
|
||||
[self.model_tester.num_attention_heads, encoder_key_length, self.model_tester.num_visual_features],
|
||||
]
|
||||
|
||||
for attention, attention_shape in zip(attentions, attention_shapes):
|
||||
self.assertListEqual(list(attention[0].shape[-3:]), attention_shape)
|
||||
|
||||
def test_hidden_states_output(self):
|
||||
def check_hidden_states_output(inputs_dict, config, model_class):
|
||||
model = model_class(config)
|
||||
model.to(torch_device)
|
||||
model.eval()
|
||||
|
||||
with torch.no_grad():
|
||||
outputs = model(**self._prepare_for_class(inputs_dict, model_class))
|
||||
language_hidden_states, vision_hidden_states = outputs[-2], outputs[-1]
|
||||
|
||||
self.assertEqual(len(language_hidden_states), self.model_tester.num_hidden_layers["language"] + 1)
|
||||
self.assertEqual(len(vision_hidden_states), self.model_tester.num_hidden_layers["vision"] + 1)
|
||||
|
||||
seq_length = self.model_tester.seq_length
|
||||
num_visual_features = self.model_tester.num_visual_features
|
||||
|
||||
self.assertListEqual(
|
||||
list(language_hidden_states[0].shape[-2:]),
|
||||
[seq_length, self.model_tester.hidden_size],
|
||||
)
|
||||
self.assertListEqual(
|
||||
list(vision_hidden_states[0].shape[-2:]),
|
||||
[num_visual_features, self.model_tester.hidden_size],
|
||||
)
|
||||
|
||||
config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()
|
||||
|
||||
for model_class in self.all_model_classes:
|
||||
inputs_dict["output_hidden_states"] = True
|
||||
check_hidden_states_output(inputs_dict, config, model_class)
|
||||
|
||||
# check that output_hidden_states also works when set through the config
|
||||
del inputs_dict["output_hidden_states"]
|
||||
config.output_hidden_states = True
|
||||
|
||||
check_hidden_states_output(inputs_dict, config, model_class)
|
||||
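The attention checks in `test_attention_outputs` compare only the last three dimensions of each attention tensor. A minimal sketch of those expected shapes follows, assuming the key length equals the text sequence length (the default when the tester defines no `key_length`). The helper name `expected_attention_shapes` and the numeric values are hypothetical and used only for illustration.

```python
def expected_attention_shapes(num_heads, seq_len, num_visual_features):
    """Return the trailing [heads, query_len, key_len] dims per attention type."""
    return {
        # language self-attention: text tokens attend to text tokens
        "language": [num_heads, seq_len, seq_len],
        # vision self-attention: visual features attend to visual features
        "vision": [num_heads, num_visual_features, num_visual_features],
        # cross-attention: text tokens attend to visual features
        "cross_encoder": [num_heads, seq_len, num_visual_features],
    }


# Illustrative values only; the real ones come from the model tester.
shapes = expected_attention_shapes(num_heads=2, seq_len=20, num_visual_features=10)
for name, shape in shapes.items():
    print(name, shape)
```

Keeping the three shapes in one mapping makes it harder to confuse the query and key dimensions of the cross-attention case.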
680
tests/test_modeling_tf_lxmert.py
Normal file
@@ -0,0 +1,680 @@
|
||||
# coding=utf-8
|
||||
# Copyright 2018 XXX Authors.
|
||||
#
|
||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||
# you may not use this file except in compliance with the License.
|
||||
# You may obtain a copy of the License at
|
||||
#
|
||||
# http://www.apache.org/licenses/LICENSE-2.0
|
||||
#
|
||||
# Unless required by applicable law or agreed to in writing, software
|
||||
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
import tempfile
|
||||
import unittest
|
||||
|
||||
from transformers import LxmertConfig, is_tf_available
|
||||
from transformers.testing_utils import require_tf, slow
|
||||
|
||||
from .test_configuration_common import ConfigTester
|
||||
from .test_modeling_tf_common import TFModelTesterMixin, ids_tensor
|
||||
|
||||
|
||||
if is_tf_available():
|
||||
import tensorflow as tf
|
||||
|
||||
from transformers.modeling_tf_lxmert import TFLxmertForPreTraining, TFLxmertModel
|
||||
|
||||
|
||||
class TFLxmertModelTester(object):
|
||||
def __init__(
|
||||
self,
|
||||
parent,
|
||||
vocab_size=300,
|
||||
hidden_size=28,
|
||||
num_attention_heads=2,
|
||||
num_labels=2,
|
||||
intermediate_size=64,
|
||||
hidden_act="gelu",
|
||||
hidden_dropout_prob=0.1,
|
||||
attention_probs_dropout_prob=0.1,
|
||||
max_position_embeddings=512,
|
||||
type_vocab_size=2,
|
||||
initializer_range=0.02,
|
||||
layer_norm_eps=1e-12,
|
||||
pad_token_id=0,
|
||||
num_qa_labels=30,
|
||||
num_object_labels=16,
|
||||
num_attr_labels=4,
|
||||
num_visual_features=10,
|
||||
l_layers=2,
|
||||
x_layers=1,
|
||||
r_layers=1,
|
||||
visual_feat_dim=128,
|
||||
visual_pos_dim=4,
|
||||
visual_loss_normalizer=6.67,
|
||||
seq_length=20,
|
||||
batch_size=8,
|
||||
is_training=True,
|
||||
task_matched=True,
|
||||
task_mask_lm=True,
|
||||
task_obj_predict=True,
|
||||
task_qa=True,
|
||||
visual_obj_loss=True,
|
||||
visual_attr_loss=True,
|
||||
visual_feat_loss=True,
|
||||
use_token_type_ids=True,
|
||||
use_lang_mask=True,
|
||||
output_attentions=False,
|
||||
output_hidden_states=False,
|
||||
scope=None,
|
||||
):
|
||||
self.parent = parent
|
||||
self.vocab_size = vocab_size
|
||||
self.hidden_size = hidden_size
|
||||
self.num_attention_heads = num_attention_heads
|
||||
self.num_labels = num_labels
|
||||
self.intermediate_size = intermediate_size
|
||||
self.hidden_act = hidden_act
|
||||
self.hidden_dropout_prob = hidden_dropout_prob
|
||||
self.attention_probs_dropout_prob = attention_probs_dropout_prob
|
||||
self.max_position_embeddings = max_position_embeddings
|
||||
self.type_vocab_size = type_vocab_size
|
||||
self.initializer_range = initializer_range
|
||||
self.layer_norm_eps = layer_norm_eps
|
||||
self.pad_token_id = pad_token_id
|
||||
self.num_qa_labels = num_qa_labels
|
||||
self.num_object_labels = num_object_labels
|
||||
self.num_attr_labels = num_attr_labels
|
||||
self.l_layers = l_layers
|
||||
self.x_layers = x_layers
|
||||
self.r_layers = r_layers
|
||||
self.visual_feat_dim = visual_feat_dim
|
||||
self.visual_pos_dim = visual_pos_dim
|
||||
self.visual_loss_normalizer = visual_loss_normalizer
|
||||
self.seq_length = seq_length
|
||||
self.batch_size = batch_size
|
||||
self.is_training = is_training
|
||||
self.use_lang_mask = use_lang_mask
|
||||
self.task_matched = task_matched
|
||||
self.task_mask_lm = task_mask_lm
|
||||
self.task_obj_predict = task_obj_predict
|
||||
self.task_qa = task_qa
|
||||
self.visual_obj_loss = visual_obj_loss
|
||||
self.visual_attr_loss = visual_attr_loss
|
||||
self.visual_feat_loss = visual_feat_loss
|
||||
self.num_visual_features = num_visual_features
|
||||
self.use_token_type_ids = use_token_type_ids
|
||||
self.output_attentions = output_attentions
|
||||
self.output_hidden_states = output_hidden_states
|
||||
self.scope = scope
|
||||
self.num_hidden_layers = {"vision": r_layers, "cross_encoder": x_layers, "language": l_layers}
|
||||
|
||||
def prepare_config_and_inputs(self):
|
||||
output_attentions = self.output_attentions
|
||||
input_ids = ids_tensor([self.batch_size, self.seq_length], vocab_size=self.vocab_size)
|
||||
visual_feats = tf.random.uniform((self.batch_size, self.num_visual_features, self.visual_feat_dim))
|
||||
bounding_boxes = tf.random.uniform((self.batch_size, self.num_visual_features, 4))
|
||||
|
||||
input_mask = None
|
||||
if self.use_lang_mask:
|
||||
input_mask = ids_tensor([self.batch_size, self.seq_length], vocab_size=2)
|
||||
token_type_ids = None
|
||||
if self.use_token_type_ids:
|
||||
token_type_ids = ids_tensor([self.batch_size, self.seq_length], self.type_vocab_size)
|
||||
obj_labels = None
|
||||
if self.task_obj_predict:
|
||||
obj_labels = {}
|
||||
if self.visual_attr_loss and self.task_obj_predict:
|
||||
obj_labels["attr"] = (
|
||||
ids_tensor([self.batch_size, self.num_visual_features], self.num_attr_labels),
|
||||
ids_tensor([self.batch_size, self.num_visual_features], self.num_attr_labels),
|
||||
)
|
||||
if self.visual_feat_loss and self.task_obj_predict:
|
||||
obj_labels["feat"] = (
|
||||
ids_tensor(
|
||||
[self.batch_size, self.num_visual_features, self.visual_feat_dim], self.num_visual_features
|
||||
),
|
||||
ids_tensor([self.batch_size, self.num_visual_features], self.num_visual_features),
|
||||
)
|
||||
if self.visual_obj_loss and self.task_obj_predict:
|
||||
obj_labels["obj"] = (
|
||||
ids_tensor([self.batch_size, self.num_visual_features], self.num_object_labels),
|
||||
ids_tensor([self.batch_size, self.num_visual_features], self.num_object_labels),
|
||||
)
|
||||
ans = None
|
||||
if self.task_qa:
|
||||
ans = ids_tensor([self.batch_size], self.num_qa_labels)
|
||||
masked_lm_labels = None
|
||||
if self.task_mask_lm:
|
||||
masked_lm_labels = ids_tensor([self.batch_size, self.seq_length], self.vocab_size)
|
||||
matched_label = None
|
||||
if self.task_matched:
|
||||
matched_label = ids_tensor([self.batch_size], self.num_labels)
|
||||
|
||||
config = LxmertConfig(
|
||||
vocab_size=self.vocab_size,
|
||||
hidden_size=self.hidden_size,
|
||||
num_attention_heads=self.num_attention_heads,
|
||||
num_labels=self.num_labels,
|
||||
intermediate_size=self.intermediate_size,
|
||||
hidden_act=self.hidden_act,
|
||||
hidden_dropout_prob=self.hidden_dropout_prob,
|
||||
attention_probs_dropout_prob=self.attention_probs_dropout_prob,
|
||||
max_position_embeddings=self.max_position_embeddings,
|
||||
type_vocab_size=self.type_vocab_size,
|
||||
initializer_range=self.initializer_range,
|
||||
layer_norm_eps=self.layer_norm_eps,
|
||||
pad_token_id=self.pad_token_id,
|
||||
num_qa_labels=self.num_qa_labels,
|
||||
num_object_labels=self.num_object_labels,
|
||||
num_attr_labels=self.num_attr_labels,
|
||||
l_layers=self.l_layers,
|
||||
x_layers=self.x_layers,
|
||||
r_layers=self.r_layers,
|
||||
visual_feat_dim=self.visual_feat_dim,
|
||||
visual_pos_dim=self.visual_pos_dim,
|
||||
visual_loss_normalizer=self.visual_loss_normalizer,
|
||||
task_matched=self.task_matched,
|
||||
task_mask_lm=self.task_mask_lm,
|
||||
task_obj_predict=self.task_obj_predict,
|
||||
task_qa=self.task_qa,
|
||||
visual_obj_loss=self.visual_obj_loss,
|
||||
visual_attr_loss=self.visual_attr_loss,
|
||||
visual_feat_loss=self.visual_feat_loss,
|
||||
output_attentions=self.output_attentions,
|
||||
output_hidden_states=self.output_hidden_states,
|
||||
)
|
||||
|
||||
return (
|
||||
config,
|
||||
input_ids,
|
||||
visual_feats,
|
||||
bounding_boxes,
|
||||
token_type_ids,
|
||||
input_mask,
|
||||
obj_labels,
|
||||
masked_lm_labels,
|
||||
matched_label,
|
||||
ans,
|
||||
output_attentions,
|
||||
)
|
||||
|
||||
def create_and_check_lxmert_model(
|
||||
self,
|
||||
config,
|
||||
input_ids,
|
||||
visual_feats,
|
||||
bounding_boxes,
|
||||
token_type_ids,
|
||||
input_mask,
|
||||
obj_labels,
|
||||
masked_lm_labels,
|
||||
matched_label,
|
||||
ans,
|
||||
output_attentions,
|
||||
):
|
||||
model = TFLxmertModel(config=config)
|
||||
result = model(
|
||||
input_ids,
|
||||
visual_feats,
|
||||
bounding_boxes,
|
||||
token_type_ids=token_type_ids,
|
||||
attention_mask=input_mask,
|
||||
output_attentions=output_attentions,
|
||||
)
|
||||
result = model(
|
||||
input_ids,
|
||||
visual_feats,
|
||||
bounding_boxes,
|
||||
token_type_ids=token_type_ids,
|
||||
attention_mask=input_mask,
|
||||
output_attentions=not output_attentions,
|
||||
)
|
||||
result = model(input_ids, visual_feats, bounding_boxes, return_dict=False)
|
||||
result = model(input_ids, visual_feats, bounding_boxes, return_dict=True)
|
||||
|
||||
self.parent.assertEqual(result.language_output.shape, (self.batch_size, self.seq_length, self.hidden_size))
|
||||
self.parent.assertEqual(
|
||||
result.vision_output.shape, (self.batch_size, self.num_visual_features, self.hidden_size)
|
||||
)
|
||||
self.parent.assertEqual(result.pooled_output.shape, (self.batch_size, self.hidden_size))
|
||||
|
||||
def prepare_config_and_inputs_for_common(self, return_obj_labels=False):
|
||||
config_and_inputs = self.prepare_config_and_inputs()
|
||||
(
|
||||
config,
|
||||
input_ids,
|
||||
visual_feats,
|
||||
bounding_boxes,
|
||||
token_type_ids,
|
||||
input_mask,
|
||||
obj_labels,
|
||||
masked_lm_labels,
|
||||
matched_label,
|
||||
ans,
|
||||
output_attentions,
|
||||
) = config_and_inputs
|
||||
|
||||
inputs_dict = {
|
||||
"input_ids": input_ids,
|
||||
"visual_feats": visual_feats,
|
||||
"visual_pos": bounding_boxes,
|
||||
"token_type_ids": token_type_ids,
|
||||
"attention_mask": input_mask,
|
||||
}
|
||||
|
||||
if return_obj_labels:
|
||||
inputs_dict["obj_labels"] = obj_labels
|
||||
|
||||
return config, inputs_dict
|
||||
|
||||
def create_and_check_lxmert_for_pretraining(
|
||||
self,
|
||||
config,
|
||||
input_ids,
|
||||
visual_feats,
|
||||
bounding_boxes,
|
||||
token_type_ids,
|
||||
input_mask,
|
||||
obj_labels,
|
||||
masked_lm_labels,
|
||||
matched_label,
|
||||
ans,
|
||||
output_attentions,
|
||||
):
|
||||
model = TFLxmertForPreTraining(config=config)
|
||||
result = model(
|
||||
input_ids,
|
||||
visual_feats,
|
||||
bounding_boxes,
|
||||
token_type_ids=token_type_ids,
|
||||
attention_mask=input_mask,
|
||||
masked_lm_labels=masked_lm_labels,
|
||||
obj_labels=obj_labels,
|
||||
matched_label=matched_label,
|
||||
ans=ans,
|
||||
output_attentions=output_attentions,
|
||||
return_dict=True,
|
||||
)
|
||||
result = model(
|
||||
input_ids,
|
||||
visual_feats,
|
||||
bounding_boxes,
|
||||
token_type_ids=token_type_ids,
|
||||
attention_mask=input_mask,
|
||||
masked_lm_labels=masked_lm_labels,
|
||||
output_attentions=not output_attentions,
|
||||
return_dict=False,
|
||||
)
|
||||
result = model(
|
||||
input_ids,
|
||||
visual_feats,
|
||||
bounding_boxes,
|
||||
token_type_ids=token_type_ids,
|
||||
attention_mask=input_mask,
|
||||
masked_lm_labels=masked_lm_labels,
|
||||
)
|
||||
result = model(
|
||||
input_ids,
|
||||
visual_feats,
|
||||
bounding_boxes,
|
||||
token_type_ids=token_type_ids,
|
||||
attention_mask=input_mask,
|
||||
obj_labels=obj_labels,
|
||||
)
|
||||
result = model(
|
||||
input_ids,
|
||||
visual_feats,
|
||||
bounding_boxes,
|
||||
token_type_ids=token_type_ids,
|
||||
attention_mask=input_mask,
|
||||
matched_label=matched_label,
|
||||
)
|
||||
result = model(
|
||||
input_ids,
|
||||
visual_feats,
|
||||
bounding_boxes,
|
||||
token_type_ids=token_type_ids,
|
||||
attention_mask=input_mask,
|
||||
ans=ans,
|
||||
)
|
||||
result = model(
|
||||
input_ids,
|
||||
visual_feats,
|
||||
bounding_boxes,
|
||||
token_type_ids=token_type_ids,
|
||||
attention_mask=input_mask,
|
||||
masked_lm_labels=masked_lm_labels,
|
||||
obj_labels=obj_labels,
|
||||
matched_label=matched_label,
|
||||
ans=ans,
|
||||
output_attentions=not output_attentions,
|
||||
return_dict=True,
|
||||
)
|
||||
|
||||
self.parent.assertEqual(result.prediction_logits.shape, (self.batch_size, self.seq_length, self.vocab_size))
|
||||
|
||||
|
||||
@require_tf
|
||||
class TFLxmertModelTest(TFModelTesterMixin, unittest.TestCase):
|
||||
|
||||
all_model_classes = (TFLxmertModel, TFLxmertForPreTraining) if is_tf_available() else ()
|
||||
|
||||
def setUp(self):
|
||||
self.model_tester = TFLxmertModelTester(self)
|
||||
self.config_tester = ConfigTester(self, config_class=LxmertConfig, hidden_size=37)
|
||||
|
||||
def test_config(self):
|
||||
self.config_tester.run_common_tests()
|
||||
|
||||
def test_lxmert_model(self):
|
||||
config_and_inputs = self.model_tester.prepare_config_and_inputs()
|
||||
self.model_tester.create_and_check_lxmert_model(*config_and_inputs)
|
||||
|
||||
def test_lxmert_for_pretraining(self):
|
||||
config_and_inputs = self.model_tester.prepare_config_and_inputs()
|
||||
self.model_tester.create_and_check_lxmert_for_pretraining(*config_and_inputs)
|
||||
|
||||
@slow
|
||||
def test_model_from_pretrained(self):
|
||||
for model_name in ["unc-nlp/lxmert-base-uncased"]:
|
||||
model = TFLxmertModel.from_pretrained(model_name)
|
||||
self.assertIsNotNone(model)
|
||||
|
||||
def test_attention_outputs(self):
|
||||
config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()
|
||||
|
||||
encoder_seq_length = (
|
||||
self.model_tester.encoder_seq_length
|
||||
if hasattr(self.model_tester, "encoder_seq_length")
|
||||
else self.model_tester.seq_length
|
||||
)
|
||||
encoder_key_length = (
|
||||
self.model_tester.key_length if hasattr(self.model_tester, "key_length") else encoder_seq_length
|
||||
)
|
||||
|
||||
for model_class in self.all_model_classes:
|
||||
inputs_dict["output_attentions"] = True
|
||||
inputs_dict["output_hidden_states"] = False
|
||||
model = model_class(config)
|
||||
outputs = model(self._prepare_for_class(inputs_dict, model_class))
|
||||
language_attentions, vision_attentions, cross_encoder_attentions = (outputs[-3], outputs[-2], outputs[-1])
|
||||
|
||||
self.assertEqual(model.config.output_hidden_states, False)
|
||||
|
||||
self.assertEqual(len(language_attentions), self.model_tester.num_hidden_layers["language"])
|
||||
self.assertEqual(len(vision_attentions), self.model_tester.num_hidden_layers["vision"])
|
||||
self.assertEqual(len(cross_encoder_attentions), self.model_tester.num_hidden_layers["cross_encoder"])
|
||||
|
||||
attentions = [language_attentions, vision_attentions, cross_encoder_attentions]
|
||||
attention_shapes = [
|
||||
[self.model_tester.num_attention_heads, encoder_seq_length, encoder_key_length],
|
||||
[
|
||||
self.model_tester.num_attention_heads,
|
||||
self.model_tester.num_visual_features,
|
||||
self.model_tester.num_visual_features,
|
||||
],
|
||||
[self.model_tester.num_attention_heads, encoder_key_length, self.model_tester.num_visual_features],
|
||||
]
|
||||
|
||||
for attention, attention_shape in zip(attentions, attention_shapes):
|
||||
self.assertListEqual(list(attention[0].shape[-3:]), attention_shape)
|
||||
out_len = len(outputs)
|
||||
|
||||
# Check attention is always last and order is fine
|
||||
inputs_dict["output_attentions"] = True
|
||||
inputs_dict["output_hidden_states"] = True
|
||||
model = model_class(config)
|
||||
outputs = model(self._prepare_for_class(inputs_dict, model_class))
|
||||
|
||||
# 2 hidden states were added
|
||||
self.assertEqual(out_len + 2, len(outputs))
|
||||
language_attentions, vision_attentions, cross_encoder_attentions = (outputs[-3], outputs[-2], outputs[-1])
|
||||
self.assertEqual(len(language_attentions), self.model_tester.num_hidden_layers["language"])
|
||||
self.assertEqual(len(vision_attentions), self.model_tester.num_hidden_layers["vision"])
|
||||
self.assertEqual(len(cross_encoder_attentions), self.model_tester.num_hidden_layers["cross_encoder"])
|
||||
|
||||
attentions = [language_attentions, vision_attentions, cross_encoder_attentions]
|
||||
attention_shapes = [
|
||||
[self.model_tester.num_attention_heads, encoder_seq_length, encoder_key_length],
|
||||
[
|
||||
self.model_tester.num_attention_heads,
|
||||
self.model_tester.num_visual_features,
|
||||
self.model_tester.num_visual_features,
|
||||
],
|
||||
[self.model_tester.num_attention_heads, encoder_key_length, self.model_tester.num_visual_features],
|
||||
]
|
||||
|
||||
for attention, attention_shape in zip(attentions, attention_shapes):
|
||||
self.assertListEqual(list(attention[0].shape[-3:]), attention_shape)
|
||||
|
||||
def test_hidden_states_output(self):
|
||||
config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()
|
||||
|
||||
def check_hidden_states_output(config, inputs_dict, model_class):
|
||||
model = model_class(config)
|
||||
outputs = model(self._prepare_for_class(inputs_dict, model_class))
|
||||
language_hidden_states, vision_hidden_states = outputs[-2], outputs[-1]
|
||||
|
||||
self.assertEqual(len(language_hidden_states), self.model_tester.num_hidden_layers["language"] + 1)
|
||||
self.assertEqual(len(vision_hidden_states), self.model_tester.num_hidden_layers["vision"] + 1)
|
||||
|
||||
seq_length = self.model_tester.seq_length
|
||||
num_visual_features = self.model_tester.num_visual_features
|
||||
|
||||
self.assertListEqual(
|
||||
list(language_hidden_states[0].shape[-2:]),
|
||||
[seq_length, self.model_tester.hidden_size],
|
||||
)
|
||||
self.assertListEqual(
|
||||
list(vision_hidden_states[0].shape[-2:]),
|
||||
[num_visual_features, self.model_tester.hidden_size],
|
||||
)
|
||||
|
||||
for model_class in self.all_model_classes:
|
||||
inputs_dict["output_hidden_states"] = True
|
||||
check_hidden_states_output(config, inputs_dict, model_class)
|
||||
|
||||
del inputs_dict["output_hidden_states"]
|
||||
config.output_hidden_states = True
|
||||
check_hidden_states_output(config, inputs_dict, model_class)
|
||||
|
||||
def test_pt_tf_model_equivalence(self):
|
||||
from transformers import is_torch_available
|
||||
|
||||
if not is_torch_available():
|
||||
return
|
||||
|
||||
import torch
|
||||
|
||||
import transformers
|
||||
|
||||
for model_class in self.all_model_classes:
|
||||
config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common(
|
||||
return_obj_labels="PreTraining" in model_class.__name__
|
||||
)
|
||||
|
||||
pt_model_class_name = model_class.__name__[2:]  # Skip the "TF" at the beginning
|
||||
pt_model_class = getattr(transformers, pt_model_class_name)
|
||||
|
||||
config.output_hidden_states = True
|
||||
config.task_obj_predict = False
|
||||
|
||||
tf_model = model_class(config)
|
||||
pt_model = pt_model_class(config)
|
||||
|
||||
# Check we can load pt model in tf and vice-versa with model => model functions
|
||||
|
||||
tf_model = transformers.load_pytorch_model_in_tf2_model(
|
||||
tf_model, pt_model, tf_inputs=self._prepare_for_class(inputs_dict, model_class)
|
||||
)
|
||||
pt_model = transformers.load_tf2_model_in_pytorch_model(pt_model, tf_model)
|
||||
|
||||
# Check predictions on first output (logits/hidden-states) are close enough given low-level computational differences
|
||||
pt_model.eval()
|
||||
|
||||
# Delete obj labels as we want to compute the hidden states and not the loss
|
||||
|
||||
if "obj_labels" in inputs_dict:
|
||||
del inputs_dict["obj_labels"]
|
||||
|
||||
def torch_type(key):
|
||||
if key in ("visual_feats", "visual_pos"):
|
||||
return torch.float32
|
||||
else:
|
||||
return torch.long
|
||||
|
||||
def recursive_numpy_convert(iterable):
|
||||
return_dict = {}
|
||||
for key, value in iterable.items():
|
||||
if isinstance(value, dict):
|
||||
return_dict[key] = recursive_numpy_convert(value)
|
||||
else:
|
||||
if isinstance(value, (list, tuple)):
|
||||
return_dict[key] = tuple(
|
||||
torch.from_numpy(iter_value.numpy()).to(torch_type(key)) for iter_value in value
|
||||
)
|
||||
else:
|
||||
return_dict[key] = torch.from_numpy(value.numpy()).to(torch_type(key))
|
||||
return return_dict
|
||||
|
||||
pt_inputs_dict = recursive_numpy_convert(self._prepare_for_class(inputs_dict, model_class))
|
||||
|
||||
# need to rename encoder-decoder "inputs" for PyTorch
|
||||
if "inputs" in pt_inputs_dict and self.is_encoder_decoder:
|
||||
pt_inputs_dict["input_ids"] = pt_inputs_dict.pop("inputs")
|
||||
|
||||
with torch.no_grad():
|
||||
pto = pt_model(**pt_inputs_dict)
|
||||
tfo = tf_model(self._prepare_for_class(inputs_dict, model_class), training=False)
|
||||
tf_hidden_states = tfo[0].numpy()
|
||||
pt_hidden_states = pto[0].numpy()
|
||||
|
||||
import numpy as np
|
||||
|
||||
tf_nans = np.copy(np.isnan(tf_hidden_states))
|
||||
pt_nans = np.copy(np.isnan(pt_hidden_states))
|
||||
|
||||
pt_hidden_states[tf_nans] = 0
|
||||
tf_hidden_states[tf_nans] = 0
|
||||
pt_hidden_states[pt_nans] = 0
|
||||
tf_hidden_states[pt_nans] = 0
|
||||
|
||||
max_diff = np.amax(np.abs(tf_hidden_states - pt_hidden_states))
|
||||
# Debug info (remove when fixed)
|
||||
if max_diff >= 2e-2:
|
||||
print("===")
|
||||
print(model_class)
|
||||
print(config)
|
||||
print(inputs_dict)
|
||||
print(pt_inputs_dict)
|
||||
self.assertLessEqual(max_diff, 6e-2)
|
||||
|
||||
# Check we can load pt model in tf and vice-versa with checkpoint => model functions
|
||||
with tempfile.TemporaryDirectory() as tmpdirname:
|
||||
import os
|
||||
|
||||
pt_checkpoint_path = os.path.join(tmpdirname, "pt_model.bin")
|
||||
torch.save(pt_model.state_dict(), pt_checkpoint_path)
|
||||
tf_model = transformers.load_pytorch_checkpoint_in_tf2_model(tf_model, pt_checkpoint_path)
|
||||
|
||||
tf_checkpoint_path = os.path.join(tmpdirname, "tf_model.h5")
|
||||
tf_model.save_weights(tf_checkpoint_path)
|
||||
pt_model = transformers.load_tf2_checkpoint_in_pytorch_model(pt_model, tf_checkpoint_path)
|
||||
|
||||
# Check predictions on first output (logits/hidden-states) are close enough given low-level computational differences
|
||||
pt_model.eval()
|
||||
pt_inputs_dict = dict(
|
||||
(name, torch.from_numpy(key.numpy()).to(torch_type(name)))
|
||||
for name, key in self._prepare_for_class(inputs_dict, model_class).items()
|
||||
)
|
||||
|
||||
for key, value in pt_inputs_dict.items():
|
||||
if key in ("visual_feats", "visual_pos"):
|
||||
pt_inputs_dict[key] = value.to(torch.float32)
|
||||
else:
|
||||
pt_inputs_dict[key] = value.to(torch.long)
|
||||
|
||||
with torch.no_grad():
|
||||
pto = pt_model(**pt_inputs_dict)
|
||||
tfo = tf_model(self._prepare_for_class(inputs_dict, model_class))
|
||||
tfo = tfo[0].numpy()
|
||||
pto = pto[0].numpy()
|
||||
tf_nans = np.copy(np.isnan(tfo))
|
||||
pt_nans = np.copy(np.isnan(pto))
|
||||
|
||||
pto[tf_nans] = 0
|
||||
tfo[tf_nans] = 0
|
||||
pto[pt_nans] = 0
|
||||
tfo[pt_nans] = 0
|
||||
|
||||
max_diff = np.amax(np.abs(tfo - pto))
|
||||
self.assertLessEqual(max_diff, 6e-2)
|
||||
|
||||
def test_save_load(self):
|
||||
for model_class in self.all_model_classes:
|
||||
config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common(
|
||||
return_obj_labels="PreTraining" in model_class.__name__
|
||||
)
|
||||
|
||||
model = model_class(config)
|
||||
outputs = model(self._prepare_for_class(inputs_dict, model_class))
|
||||
|
||||
with tempfile.TemporaryDirectory() as tmpdirname:
|
||||
model.save_pretrained(tmpdirname)
|
||||
model = model_class.from_pretrained(tmpdirname)
|
||||
after_outputs = model(self._prepare_for_class(inputs_dict, model_class))
|
||||
|
||||
self.assert_outputs_same(after_outputs, outputs)
|
||||
|
||||
def test_compile_tf_model(self):
|
||||
optimizer = tf.keras.optimizers.Adam(learning_rate=3e-5, epsilon=1e-08, clipnorm=1.0)
|
||||
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
|
||||
metric = tf.keras.metrics.SparseCategoricalAccuracy("accuracy")
|
||||
|
||||
for model_class in self.all_model_classes:
|
||||
config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common(
|
||||
return_obj_labels="PreTraining" in model_class.__name__
|
||||
)
|
||||
|
||||
input_ids = tf.keras.Input(
|
||||
batch_shape=(self.model_tester.batch_size, self.model_tester.seq_length),
|
||||
name="input_ids",
|
||||
dtype="int32",
|
||||
)
|
||||
visual_feats = tf.keras.Input(
|
||||
batch_shape=(
|
||||
self.model_tester.batch_size,
|
||||
self.model_tester.num_visual_features,
|
||||
self.model_tester.visual_feat_dim,
|
||||
),
|
||||
name="visual_feats",
|
||||
dtype="float32",
|
||||
)
|
||||
visual_pos = tf.keras.Input(
|
||||
batch_shape=(self.model_tester.batch_size, self.model_tester.num_visual_features, 4),
|
||||
name="visual_pos",
|
||||
dtype="float32",
|
||||
)
|
||||
|
||||
# Prepare our model
|
||||
model = model_class(config)
|
||||
|
||||
# Let's load it from the disk to be sure we can use pretrained weights
|
||||
with tempfile.TemporaryDirectory() as tmpdirname:
|
||||
outputs = model(self._prepare_for_class(inputs_dict, model_class)) # build the model
|
||||
model.save_pretrained(tmpdirname)
|
||||
model = model_class.from_pretrained(tmpdirname)
|
||||
|
||||
outputs_dict = model(input_ids, visual_feats, visual_pos)
|
||||
hidden_states = outputs_dict[0]
|
||||
|
||||
# Add a dense layer on top to test integration with other keras modules
|
||||
outputs = tf.keras.layers.Dense(2, activation="softmax", name="outputs")(hidden_states)
|
||||
|
||||
# Compile extended model
|
||||
extended_model = tf.keras.Model(inputs=[input_ids, visual_feats, visual_pos], outputs=[outputs])
|
||||
extended_model.compile(optimizer=optimizer, loss=loss, metrics=[metric])
|
||||
65
tests/test_tokenization_lxmert.py
Normal file
@@ -0,0 +1,65 @@
# coding=utf-8
# Copyright 2018 LXMERT Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.


import os
import unittest

from transformers.tokenization_bert import VOCAB_FILES_NAMES
from transformers.tokenization_lxmert import LxmertTokenizer

from .test_tokenization_common import TokenizerTesterMixin


class LxmertTokenizationTest(TokenizerTesterMixin, unittest.TestCase):

    tokenizer_class = LxmertTokenizer

    def setUp(self):
        super().setUp()

        vocab_tokens = [
            "[UNK]",
            "[CLS]",
            "[SEP]",
            "want",
            "##want",
            "##ed",
            "wa",
            "un",
            "runn",
            "##ing",
            ",",
            "low",
            "lowest",
        ]
        self.vocab_file = os.path.join(self.tmpdirname, VOCAB_FILES_NAMES["vocab_file"])
        with open(self.vocab_file, "w", encoding="utf-8") as vocab_writer:
            vocab_writer.write("".join([x + "\n" for x in vocab_tokens]))

    def get_tokenizer(self, **kwargs):
        return LxmertTokenizer.from_pretrained(self.tmpdirname, **kwargs)

    def get_input_output_texts(self, tokenizer):
        input_text = "UNwant\u00E9d,running"
        output_text = "unwanted, running"
        return input_text, output_text

    def test_full_tokenizer(self):
        tokenizer = self.tokenizer_class(self.vocab_file)

        tokens = tokenizer.tokenize("UNwant\u00E9d,running")
        self.assertListEqual(tokens, ["un", "##want", "##ed", ",", "runn", "##ing"])
        self.assertListEqual(tokenizer.convert_tokens_to_ids(tokens), [7, 4, 5, 10, 8, 9])
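The expected values in `test_full_tokenizer` can be checked by hand. The following is a minimal standalone sketch, not the `LxmertTokenizer` implementation: it assumes BERT-style preprocessing (lowercasing, accent stripping, splitting on punctuation) followed by greedy longest-match WordPiece over the 13-token test vocabulary. The helper names `basic_tokenize` and `wordpiece` are hypothetical and exist only for this illustration.

```python
import unicodedata

# Same toy vocabulary as in setUp(); a token's id is its line index in the vocab file.
VOCAB = ["[UNK]", "[CLS]", "[SEP]", "want", "##want", "##ed", "wa",
         "un", "runn", "##ing", ",", "low", "lowest"]
TOKEN_TO_ID = {tok: i for i, tok in enumerate(VOCAB)}


def basic_tokenize(text):
    """Lowercase, strip accents, and split on whitespace and punctuation (assumed BERT-style)."""
    text = unicodedata.normalize("NFD", text.lower())
    text = "".join(ch for ch in text if unicodedata.category(ch) != "Mn")
    words, current = [], ""
    for ch in text:
        if ch.isspace() or unicodedata.category(ch).startswith("P"):
            if current:
                words.append(current)
                current = ""
            if not ch.isspace():
                words.append(ch)  # punctuation becomes its own token
        else:
            current += ch
    if current:
        words.append(current)
    return words


def wordpiece(word):
    """Greedy longest-match-first split; non-initial pieces carry the '##' prefix."""
    pieces, start = [], 0
    while start < len(word):
        end, match = len(word), None
        while start < end:
            candidate = word[start:end] if start == 0 else "##" + word[start:end]
            if candidate in TOKEN_TO_ID:
                match = candidate
                break
            end -= 1
        if match is None:
            return ["[UNK]"]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


tokens = [p for w in basic_tokenize("UNwant\u00E9d,running") for p in wordpiece(w)]
ids = [TOKEN_TO_ID[t] for t in tokens]
print(tokens)
print(ids)
```

Under these assumptions the sketch reproduces the token list and ids asserted in the test: `"un"` is line 7 of the vocabulary, `"##want"` line 4, and so on.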