VisualBERT (#10534)
* Init VisualBERT * Add cookie-cutter, Config, and Embeddings * Add preliminary Model * Add Bert analogous classes * Add basic code for NLVR, VQA, Flickr * Update Init * Fix VisualBert Downstream Models * Rename classifier to cls * Comment position_ids buffer * Remove sentence image predictor output * Update output dicts * Remove unnecessary files * Fix Auto Modeling * Fix transformers init * Add conversion script * Add conversion script * Fix docs * Update visualbert modelling * Update configuration * Style fixes * Add model and integration tests * Add all tests * Update model mapping * Add simple detector from original repository * Update docs and configs * Fix style * Fix style * Update docs * Fix style * Fix import issues in style * Fix style * Add changes from review * Fix style * Fix style * Update docs * Fix style * Fix style * Update docs/source/model_doc/visual_bert.rst Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com> * Update src/transformers/models/visual_bert/modeling_visual_bert.py Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com> * Update tests/test_modeling_visual_bert.py Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com> * Update src/transformers/models/visual_bert/modeling_visual_bert.py Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com> * Update src/transformers/models/visual_bert/modeling_visual_bert.py Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com> * Update src/transformers/models/visual_bert/modeling_visual_bert.py Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com> * Add changes from review * Remove convert run script * Add changes from review * Update src/transformers/models/visual_bert/modeling_visual_bert.py Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com> * Update src/transformers/models/visual_bert/modeling_visual_bert.py Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com> * Update src/transformers/models/visual_bert/modeling_visual_bert.py Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com> * Update src/transformers/models/visual_bert/modeling_visual_bert.py Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com> * Update src/transformers/models/visual_bert/modeling_visual_bert.py Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com> * Add changes from review * Add changes from review * Add visual embedding example in docs * Fix "copied from" comments * Add changes from review * Fix error, style, checkpoints * Update docs * Fix integration tests * Fix style Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
This commit is contained in:
@@ -256,22 +256,25 @@ Supported models
|
||||
Words: Transformers for Image Recognition at Scale <https://arxiv.org/abs/2010.11929>`__ by Alexey Dosovitskiy,
|
||||
Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias
|
||||
Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby.
|
||||
54. :doc:`Wav2Vec2 <model_doc/wav2vec2>` (from Facebook AI) released with the paper `wav2vec 2.0: A Framework for
|
||||
54. :doc:`VisualBERT <model_doc/visual_bert>` (from UCLA NLP) released with the paper `VisualBERT: A Simple and
|
||||
Performant Baseline for Vision and Language <https://arxiv.org/pdf/1908.03557>`__ by Liunian Harold Li, Mark
|
||||
Yatskar, Da Yin, Cho-Jui Hsieh, Kai-Wei Chang.
|
||||
55. :doc:`Wav2Vec2 <model_doc/wav2vec2>` (from Facebook AI) released with the paper `wav2vec 2.0: A Framework for
|
||||
Self-Supervised Learning of Speech Representations <https://arxiv.org/abs/2006.11477>`__ by Alexei Baevski, Henry
|
||||
Zhou, Abdelrahman Mohamed, Michael Auli.
|
||||
55. :doc:`XLM <model_doc/xlm>` (from Facebook) released together with the paper `Cross-lingual Language Model
|
||||
56. :doc:`XLM <model_doc/xlm>` (from Facebook) released together with the paper `Cross-lingual Language Model
|
||||
Pretraining <https://arxiv.org/abs/1901.07291>`__ by Guillaume Lample and Alexis Conneau.
|
||||
56. :doc:`XLM-ProphetNet <model_doc/xlmprophetnet>` (from Microsoft Research) released with the paper `ProphetNet:
|
||||
57. :doc:`XLM-ProphetNet <model_doc/xlmprophetnet>` (from Microsoft Research) released with the paper `ProphetNet:
|
||||
Predicting Future N-gram for Sequence-to-Sequence Pre-training <https://arxiv.org/abs/2001.04063>`__ by Yu Yan,
|
||||
Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang and Ming Zhou.
|
||||
57. :doc:`XLM-RoBERTa <model_doc/xlmroberta>` (from Facebook AI), released together with the paper `Unsupervised
|
||||
58. :doc:`XLM-RoBERTa <model_doc/xlmroberta>` (from Facebook AI), released together with the paper `Unsupervised
|
||||
Cross-lingual Representation Learning at Scale <https://arxiv.org/abs/1911.02116>`__ by Alexis Conneau*, Kartikay
|
||||
Khandelwal*, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke
|
||||
Zettlemoyer and Veselin Stoyanov.
|
||||
58. :doc:`XLNet <model_doc/xlnet>` (from Google/CMU) released with the paper `XLNet: Generalized Autoregressive
|
||||
59. :doc:`XLNet <model_doc/xlnet>` (from Google/CMU) released with the paper `XLNet: Generalized Autoregressive
|
||||
Pretraining for Language Understanding <https://arxiv.org/abs/1906.08237>`__ by Zhilin Yang*, Zihang Dai*, Yiming
|
||||
Yang, Jaime Carbonell, Ruslan Salakhutdinov, Quoc V. Le.
|
||||
59. :doc:`XLSR-Wav2Vec2 <model_doc/xlsr_wav2vec2>` (from Facebook AI) released with the paper `Unsupervised
|
||||
60. :doc:`XLSR-Wav2Vec2 <model_doc/xlsr_wav2vec2>` (from Facebook AI) released with the paper `Unsupervised
|
||||
Cross-Lingual Representation Learning For Speech Recognition <https://arxiv.org/abs/2006.13979>`__ by Alexis
|
||||
Conneau, Alexei Baevski, Ronan Collobert, Abdelrahman Mohamed, Michael Auli.
|
||||
|
||||
@@ -389,6 +392,8 @@ Flax), PyTorch, and/or TensorFlow.
|
||||
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
|
||||
| ViT | ❌ | ❌ | ✅ | ❌ | ❌ |
|
||||
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
|
||||
| VisualBert | ❌ | ❌ | ✅ | ❌ | ❌ |
|
||||
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
|
||||
| Wav2Vec2 | ✅ | ❌ | ✅ | ❌ | ❌ |
|
||||
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
|
||||
| XLM | ✅ | ❌ | ✅ | ✅ | ❌ |
|
||||
@@ -537,6 +542,7 @@ Flax), PyTorch, and/or TensorFlow.
|
||||
model_doc/tapas
|
||||
model_doc/transformerxl
|
||||
model_doc/vit
|
||||
model_doc/visual_bert
|
||||
model_doc/wav2vec2
|
||||
model_doc/xlm
|
||||
model_doc/xlmprophetnet
|
||||
|
||||
128
docs/source/model_doc/visual_bert.rst
Normal file
128
docs/source/model_doc/visual_bert.rst
Normal file
@@ -0,0 +1,128 @@
|
||||
..
|
||||
Copyright 2021 The HuggingFace Team. All rights reserved.
|
||||
|
||||
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
|
||||
the License. You may obtain a copy of the License at
|
||||
|
||||
http://www.apache.org/licenses/LICENSE-2.0
|
||||
|
||||
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
|
||||
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
|
||||
specific language governing permissions and limitations under the License.
|
||||
|
||||
VisualBERT
|
||||
-----------------------------------------------------------------------------------------------------------------------
|
||||
|
||||
Overview
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
The VisualBERT model was proposed in `VisualBERT: A Simple and Performant Baseline for Vision and Language
|
||||
<https://arxiv.org/pdf/1908.03557>`__ by Liunian Harold Li, Mark Yatskar, Da Yin, Cho-Jui Hsieh, Kai-Wei Chang.
|
||||
VisualBERT is a neural network trained on a variety of (image, text) pairs.
|
||||
|
||||
The abstract from the paper is the following:
|
||||
|
||||
*We propose VisualBERT, a simple and flexible framework for modeling a broad range of vision-and-language tasks.
|
||||
VisualBERT consists of a stack of Transformer layers that implicitly align elements of an input text and regions in an
|
||||
associated input image with self-attention. We further propose two visually-grounded language model objectives for
|
||||
pre-training VisualBERT on image caption data. Experiments on four vision-and-language tasks including VQA, VCR, NLVR2,
|
||||
and Flickr30K show that VisualBERT outperforms or rivals with state-of-the-art models while being significantly
|
||||
simpler. Further analysis demonstrates that VisualBERT can ground elements of language to image regions without any
|
||||
explicit supervision and is even sensitive to syntactic relationships, tracking, for example, associations between
|
||||
verbs and image regions corresponding to their arguments.*
|
||||
|
||||
Tips:
|
||||
|
||||
1. Most of the checkpoints provided work with the :class:`~transformers.VisualBertForPreTraining` configuration. Other
|
||||
checkpoints provided are the fine-tuned checkpoints for down-stream tasks - VQA ('visualbert-vqa'), VCR
|
||||
('visualbert-vcr'), NLVR2 ('visualbert-nlvr2'). Hence, if you are not working on these downstream tasks, it is
|
||||
recommended that you use the pretrained checkpoints.
|
||||
|
||||
2. For the VCR task, the authors use a fine-tuned detector for generating visual embeddings, for all the checkpoints.
|
||||
We do not provide the detector and its weights as a part of the package, but it will be available in the research
|
||||
projects, and the states can be loaded directly into the detector provided.
|
||||
|
||||
Usage
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
VisualBERT is a multi-modal vision and language model. It can be used for visual question answering, multiple choice,
|
||||
visual reasoning and region-to-phrase correspondence tasks. VisualBERT uses a BERT-like transformer to prepare
|
||||
embeddings for image-text pairs. Both the text and visual features are then projected to a latent space with identical
|
||||
dimension.
|
||||
|
||||
To feed images to the model, each image is passed through a pre-trained object detector and the regions and the
|
||||
bounding boxes are extracted. The authors use the features generated after passing these regions through a pre-trained
|
||||
CNN like ResNet as visual embeddings. They also add absolute position embeddings, and feed the resulting sequence of
|
||||
vectors to a standard BERT model. The text input is concatenated in the front of the visual embeddings in the embedding
|
||||
layer, and is expected to be bound by [CLS] and a [SEP] tokens, as in BERT. The segment IDs must also be set
|
||||
appropriately for the textual and visual parts.
|
||||
|
||||
The :class:`~transformers.BertTokenizer` is used to encode the text. A custom detector/feature extractor must be used
|
||||
to get the visual embeddings. For an example on how to generate visual embeddings, see the `colab notebook
|
||||
<https://colab.research.google.com/drive/1bLGxKdldwqnMVA5x4neY7-l_8fKGWQYI?usp=sharing>`__. The following example shows
|
||||
how to get the last hidden state using :class:`~transformers.VisualBertModel`:
|
||||
|
||||
.. code-block::
|
||||
|
||||
>>> import torch
|
||||
>>> from transformers import BertTokenizer, VisualBertModel
|
||||
|
||||
>>> model = VisualBertModel.from_pretrained("uclanlp/visualbert-vqa-coco-pre")
|
||||
>>> tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
|
||||
|
||||
>>> inputs = tokenizer("What is the man eating?", return_tensors="pt")
|
||||
>>> # this is a custom function that returns the visual embeddings given the image path
|
||||
>>> visual_embeds = get_visual_embeddings(image_path)
|
||||
|
||||
>>> outputs = model(**inputs)
|
||||
>>> last_hidden_state = outputs.last_hidden_state
|
||||
|
||||
This model was contributed by `gchhablani <https://huggingface.co/gchhablani>`__. The original code can be found `here
|
||||
<https://github.com/uclanlp/visualbert>`__.
|
||||
|
||||
VisualBertConfig
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
.. autoclass:: transformers.VisualBertConfig
|
||||
:members:
|
||||
|
||||
VisualBertModel
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
.. autoclass:: transformers.VisualBertModel
|
||||
:members: forward
|
||||
|
||||
|
||||
VisualBertForPreTraining
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
.. autoclass:: transformers.VisualBertForPreTraining
|
||||
:members: forward
|
||||
|
||||
|
||||
VisualBertForQuestionAnswering
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
.. autoclass:: transformers.VisualBertForQuestionAnswering
|
||||
:members: forward
|
||||
|
||||
|
||||
VisualBertForMultipleChoice
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
.. autoclass:: transformers.VisualBertForMultipleChoice
|
||||
:members: forward
|
||||
|
||||
|
||||
VisualBertForVisualReasoning
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
.. autoclass:: transformers.VisualBertForVisualReasoning
|
||||
:members: forward
|
||||
|
||||
|
||||
VisualBertForRegionToPhraseAlignment
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
.. autoclass:: transformers.VisualBertForRegionToPhraseAlignment
|
||||
:members: forward
|
||||
Reference in New Issue
Block a user