Adding the LXMERT pretraining model (MultiModal languageXvision) to HuggingFace's suite of models (#5793)

* added template files for LXMERT and completed the configuration_lxmert.py
* added modeling, tokenization, testing, and finishing touches for lxmert [yet to be tested]
* added model card for lxmert
* cleaning up lxmert code
* Update src/transformers/modeling_lxmert.py
  Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
* Update src/transformers/modeling_tf_lxmert.py
  Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
* Update src/transformers/modeling_tf_lxmert.py
  Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
* Update src/transformers/modeling_lxmert.py
  Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
* tested torch lxmert, changed documentation, updated outputs, and other small fixes
* Update src/transformers/convert_pytorch_checkpoint_to_tf2.py
  Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
* Update src/transformers/convert_pytorch_checkpoint_to_tf2.py
  Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
* Update src/transformers/convert_pytorch_checkpoint_to_tf2.py
  Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
* renaming, other small issues, did not change TF code in this commit
* added lxmert question answering model in pytorch
* added capability to edit number of qa labels for lxmert
* made answer optional for lxmert question answering
* add option to return hidden_states for lxmert
* changed default qa labels for lxmert
* changed config archive path
* squashing 3 commits: merged UI + testing improvements + more UI and testing
* changed some variable names for lxmert
* TF LXMERT
* Various fixes to LXMERT
* Final touches to LXMERT
* AutoTokenizer order
* Add LXMERT to index.rst and README.md
* Merge commit test fixes + Style update
* TensorFlow 2.3.0 sequential model changes variable names
  Remove inherited test
* Update src/transformers/modeling_tf_pytorch_utils.py
* Update docs/source/model_doc/lxmert.rst
  Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
* Update docs/source/model_doc/lxmert.rst
  Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
* Update src/transformers/modeling_tf_lxmert.py
  Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
* added suggestions
* Fixes
* Final fixes for TF model
* Fix docs

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
Co-authored-by: Lysandre <lysandre.debut@reseau.eseo.fr>
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
2020-09-03 04:02:25 -04:00
parent 4ebb52afdb
commit ea2c6f1afc
23 changed files with 4798 additions and 12 deletions
--- a/docs/source/index.rst
+++ b/docs/source/index.rst
@@ -128,7 +128,10 @@ conversion utilities for the following models:
    <https://arxiv.org/abs/1912.08777>`_ by Jingqing Zhang, Yao Zhao, Mohammad Saleh and Peter J. Liu.
 24. `MBart <https://github.com/pytorch/fairseq/tree/master/examples/mbart>`_ (from Facebook) released with the paper  `Multilingual Denoising Pre-training for Neural Machine Translation <https://arxiv.org/abs/2001.08210>`_ by Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov,
    Marjan Ghazvininejad, Mike Lewis, Luke Zettlemoyer.  
-25. `Other community models <https://huggingface.co/models>`_, contributed by the `community
+25. `LXMERT <https://github.com/airsplay/lxmert>`_ (from UNC Chapel Hill) released with the paper `LXMERT: Learning
+    Cross-Modality Encoder Representations from Transformers <https://arxiv.org/abs/1908.07490>`_
+    by Hao Tan and Mohit Bansal.
+26. `Other community models <https://huggingface.co/models>`_, contributed by the `community
    <https://huggingface.co/users>`_.

 .. toctree::
@@ -213,6 +216,7 @@ conversion utilities for the following models:
    model_doc/dpr
    model_doc/pegasus
    model_doc/mbart
+    model_doc/lxmert
    internal/modeling_utils
    internal/tokenization_utils
    internal/pipelines_utils
--- a/docs/source/model_doc/lxmert.rst
+++ b/docs/source/model_doc/lxmert.rst
@@ -0,0 +1,109 @@
+LXMERT
+----------------------------------------------------
+
+Overview
+~~~~~~~~~~~~~~~~~~~~~
+
+The LXMERT model was proposed in `LXMERT: Learning Cross-Modality Encoder Representations from Transformers <https://arxiv.org/abs/1908.07490>`__
+by Hao Tan & Mohit Bansal. It is a series of bidirectional transformer encoders (one for the vision modality, one for the language modality, and one that fuses both modalities)
+pretrained using a combination of masked language modeling, visual-language text alignment, ROI-feature regression, masked visual-attribute modeling, masked visual-object modeling, and visual-question answering objectives.
+The pretraining data combines several multi-modal datasets: MSCOCO, Visual Genome and Visual Genome Question Answering, VQA 2.0, and GQA.
+
+The abstract from the paper is the following:
+
+*Vision-and-language reasoning requires an understanding of visual concepts, language semantics, and, most importantly, the alignment and relationships between these two
+modalities. We thus propose the LXMERT
+(Learning Cross-Modality Encoder Representations from Transformers) framework to learn
+these vision-and-language connections. In
+LXMERT, we build a large-scale Transformer
+model that consists of three encoders: an object relationship encoder, a language encoder,
+and a cross-modality encoder. Next, to endow our model with the capability of connecting vision and language semantics, we
+pre-train the model with large amounts of
+image-and-sentence pairs, via five diverse representative pre-training tasks: masked language modeling, masked object prediction
+(feature regression and label classification),
+cross-modality matching, and image question answering. These tasks help in learning both intra-modality and cross-modality relationships. After fine-tuning from our pretrained parameters, our model achieves the
+state-of-the-art results on two visual question answering datasets (i.e., VQA and GQA).
+We also show the generalizability of our pretrained cross-modality model by adapting it to
+a challenging visual-reasoning task,
+NLVR2,
+and improve the previous best result by 22%
+absolute (54% to 76%). Lastly, we demonstrate detailed ablation studies to prove that
+both our novel model components and pretraining strategies significantly contribute to
+our strong results; and also present several
+attention visualizations for the different encoders.*
+
+Tips:
+
+- Bounding boxes are not required in the visual feature embeddings; any kind of visual-spatial features will work.
+- Both the language hidden states and the visual hidden states that LXMERT outputs are passed through the cross-modality layer, so they
+  contain information from both modalities. To access a modality that only attends to itself, select the vision/language hidden states from the first element in the tuple.
+- The bi-directional cross-modality encoder attention only returns attention values when the language modality is used as the input and the vision modality is used as the context vector. Further,
+  while the cross-modality encoder contains self-attention for each respective modality as well as cross-attention, only the cross-attention is returned; both self-attention outputs are discarded.
+
+The code can be found `here <https://github.com/airsplay/lxmert>`__.
+
+
+LxmertConfig
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. autoclass:: transformers.LxmertConfig
+    :members:
+
+
+LxmertTokenizer
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. autoclass:: transformers.LxmertTokenizer
+    :members: build_inputs_with_special_tokens, get_special_tokens_mask,
+        create_token_type_ids_from_sequences, save_vocabulary
+
+
+Lxmert specific outputs
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. autoclass:: transformers.modeling_lxmert.LxmertModelOutput
+    :members:
+
+.. autoclass:: transformers.modeling_lxmert.LxmertForPreTrainingOutput
+    :members:
+
+.. autoclass:: transformers.modeling_lxmert.LxmertForQuestionAnsweringOutput
+    :members:
+
+.. autoclass:: transformers.modeling_tf_lxmert.TFLxmertModelOutput
+    :members:
+
+.. autoclass:: transformers.modeling_tf_lxmert.TFLxmertForPreTrainingOutput
+    :members:
+
+
+LxmertModel
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. autoclass:: transformers.LxmertModel
+    :members:
+
+LxmertForPreTraining
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. autoclass:: transformers.LxmertForPreTraining
+    :members:
+
+LxmertForQuestionAnswering
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. autoclass:: transformers.LxmertForQuestionAnswering
+    :members:
+
+
+TFLxmertModel
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. autoclass:: transformers.TFLxmertModel
+    :members:
+
+TFLxmertForPreTraining
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. autoclass:: transformers.TFLxmertForPreTraining
+    :members:
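Note on the Tips in the lxmert.rst overview above: the last tip says the returned cross-modality attention uses the language modality as the input (queries) and the vision modality as the context (keys and values). The direction and the resulting shape can be illustrated with a minimal pure-Python sketch. This is not the library's implementation: it assumes a single attention head and no learned query/key/value projections, and the names `cross_attention`, `language`, and `vision` are hypothetical, chosen only for this illustration.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, context):
    """Single-head scaled dot-product cross-attention.

    Simplifying assumption: no learned projections, so queries and
    context vectors are used directly as Q and as K/V.
    Returns (outputs, weights), where weights[i][j] is how strongly
    query i attends to context item j.
    """
    dim = len(queries[0])
    outputs, weights = [], []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, c)) / math.sqrt(dim) for c in context]
        probs = softmax(scores)
        weights.append(probs)
        # Each output is a probability-weighted mix of the context vectors.
        outputs.append([sum(p * c[i] for p, c in zip(probs, context)) for i in range(dim)])
    return outputs, weights


# Toy data (hypothetical): 3 language tokens, 2 visual regions, hidden size 4.
language = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.5, 0.5, 0.0, 0.0]]
vision = [[2.0, 0.0, 1.0, 0.0], [0.0, 2.0, 0.0, 1.0]]

# Language is the input (queries); vision is the context (keys/values).
lang_out, lang_to_vis = cross_attention(language, vision)
```

In this sketch `lang_to_vis` has one row per language token and one column per visual region, which is the orientation the tip describes for the attention values that are returned. Swapping the two arguments would give the vision-to-language direction, which the tip says is not returned.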
--- a/docs/source/pretrained_models.rst
+++ b/docs/source/pretrained_models.rst
@@ -363,4 +363,8 @@ For a list that includes community-uploaded models, refer to `https://huggingfac
 |                   +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
 |                   | ``facebook/mbart-large-en-ro``                             | | 24-layer, 1024-hidden, 16-heads, 610M parameters                                                                                    |
 |                   |                                                            | | mbart-large-cc25 model finetuned on WMT english romanian translation.                                                               |
-+-------------------+------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
++-------------------+------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
+| Lxmert            | ``lxmert-base-uncased``                                    | | 9-language layers, 9-relationship layers, and 12-cross-modality layers                                                              |
+|                   |                                                            | | 768-hidden, 12-heads (for each layer) ~ 228M parameters                                                                             |
+|                   |                                                            | | Starting from lxmert-base checkpoint, trained on over 9 million image-text couplets from COCO, VisualGenome, GQA, VQA               |
++-------------------+------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+