Convert rst files (#14888)
* Convert all tutorials and guides * Convert all remaining rst to mdx * Track and fix bad links
This commit is contained in:
123
docs/source/model_doc/visual_bert.mdx
Normal file
123
docs/source/model_doc/visual_bert.mdx
Normal file
@@ -0,0 +1,123 @@
|
||||
<!--Copyright 2021 The HuggingFace Team. All rights reserved.
|
||||
|
||||
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
|
||||
the License. You may obtain a copy of the License at
|
||||
|
||||
http://www.apache.org/licenses/LICENSE-2.0
|
||||
|
||||
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
|
||||
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
|
||||
specific language governing permissions and limitations under the License.
|
||||
-->
|
||||
|
||||
# VisualBERT
|
||||
|
||||
## Overview
|
||||
|
||||
The VisualBERT model was proposed in [VisualBERT: A Simple and Performant Baseline for Vision and Language](https://arxiv.org/pdf/1908.03557) by Liunian Harold Li, Mark Yatskar, Da Yin, Cho-Jui Hsieh, Kai-Wei Chang.
|
||||
VisualBERT is a neural network trained on a variety of (image, text) pairs.
|
||||
|
||||
The abstract from the paper is the following:
|
||||
|
||||
*We propose VisualBERT, a simple and flexible framework for modeling a broad range of vision-and-language tasks.
|
||||
VisualBERT consists of a stack of Transformer layers that implicitly align elements of an input text and regions in an
|
||||
associated input image with self-attention. We further propose two visually-grounded language model objectives for
|
||||
pre-training VisualBERT on image caption data. Experiments on four vision-and-language tasks including VQA, VCR, NLVR2,
|
||||
and Flickr30K show that VisualBERT outperforms or rivals with state-of-the-art models while being significantly
|
||||
simpler. Further analysis demonstrates that VisualBERT can ground elements of language to image regions without any
|
||||
explicit supervision and is even sensitive to syntactic relationships, tracking, for example, associations between
|
||||
verbs and image regions corresponding to their arguments.*
|
||||
|
||||
Tips:
|
||||
|
||||
1. Most of the checkpoints provided work with the [`VisualBertForPreTraining`] configuration. Other
|
||||
checkpoints provided are the fine-tuned checkpoints for down-stream tasks - VQA ('visualbert-vqa'), VCR
|
||||
('visualbert-vcr'), NLVR2 ('visualbert-nlvr2'). Hence, if you are not working on these downstream tasks, it is
|
||||
recommended that you use the pretrained checkpoints.
|
||||
|
||||
2. For the VCR task, the authors use a fine-tuned detector for generating visual embeddings, for all the checkpoints.
|
||||
We do not provide the detector and its weights as a part of the package, but it will be available in the research
|
||||
projects, and the states can be loaded directly into the detector provided.
|
||||
|
||||
## Usage
|
||||
|
||||
VisualBERT is a multi-modal vision and language model. It can be used for visual question answering, multiple choice,
|
||||
visual reasoning and region-to-phrase correspondence tasks. VisualBERT uses a BERT-like transformer to prepare
|
||||
embeddings for image-text pairs. Both the text and visual features are then projected to a latent space with identical
|
||||
dimension.
|
||||
|
||||
To feed images to the model, each image is passed through a pre-trained object detector and the regions and the
|
||||
bounding boxes are extracted. The authors use the features generated after passing these regions through a pre-trained
|
||||
CNN like ResNet as visual embeddings. They also add absolute position embeddings, and feed the resulting sequence of
|
||||
vectors to a standard BERT model. The text input is concatenated in the front of the visual embeddings in the embedding
|
||||
layer, and is expected to be bound by [CLS] and a [SEP] tokens, as in BERT. The segment IDs must also be set
|
||||
appropriately for the textual and visual parts.
|
||||
|
||||
The [`BertTokenizer`] is used to encode the text. A custom detector/feature extractor must be used
|
||||
to get the visual embeddings. The following example notebooks show how to use VisualBERT with Detectron-like models:
|
||||
|
||||
- [VisualBERT VQA demo notebook](https://github.com/huggingface/transformers/tree/master/examples/research_projects/visual_bert) : This notebook
|
||||
contains an example on VisualBERT VQA.
|
||||
|
||||
- [Generate Embeddings for VisualBERT (Colab Notebook)](https://colab.research.google.com/drive/1bLGxKdldwqnMVA5x4neY7-l_8fKGWQYI?usp=sharing) : This notebook contains
|
||||
an example on how to generate visual embeddings.
|
||||
|
||||
The following example shows how to get the last hidden state using [`VisualBertModel`]:
|
||||
|
||||
```python
|
||||
>>> import torch
|
||||
>>> from transformers import BertTokenizer, VisualBertModel
|
||||
|
||||
>>> model = VisualBertModel.from_pretrained("uclanlp/visualbert-vqa-coco-pre")
|
||||
>>> tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
|
||||
|
||||
>>> inputs = tokenizer("What is the man eating?", return_tensors="pt")
|
||||
>>> # this is a custom function that returns the visual embeddings given the image path
|
||||
>>> visual_embeds = get_visual_embeddings(image_path)
|
||||
|
||||
>>> visual_token_type_ids = torch.ones(visual_embeds.shape[:-1], dtype=torch.long)
|
||||
>>> visual_attention_mask = torch.ones(visual_embeds.shape[:-1], dtype=torch.float)
|
||||
>>> inputs.update({
|
||||
... "visual_embeds": visual_embeds,
|
||||
... "visual_token_type_ids": visual_token_type_ids,
|
||||
... "visual_attention_mask": visual_attention_mask
|
||||
... })
|
||||
>>> outputs = model(**inputs)
|
||||
>>> last_hidden_state = outputs.last_hidden_state
|
||||
```
|
||||
|
||||
This model was contributed by [gchhablani](https://huggingface.co/gchhablani). The original code can be found [here](https://github.com/uclanlp/visualbert).
|
||||
|
||||
## VisualBertConfig
|
||||
|
||||
[[autodoc]] VisualBertConfig
|
||||
|
||||
## VisualBertModel
|
||||
|
||||
[[autodoc]] VisualBertModel
|
||||
- forward
|
||||
|
||||
## VisualBertForPreTraining
|
||||
|
||||
[[autodoc]] VisualBertForPreTraining
|
||||
- forward
|
||||
|
||||
## VisualBertForQuestionAnswering
|
||||
|
||||
[[autodoc]] VisualBertForQuestionAnswering
|
||||
- forward
|
||||
|
||||
## VisualBertForMultipleChoice
|
||||
|
||||
[[autodoc]] VisualBertForMultipleChoice
|
||||
- forward
|
||||
|
||||
## VisualBertForVisualReasoning
|
||||
|
||||
[[autodoc]] VisualBertForVisualReasoning
|
||||
- forward
|
||||
|
||||
## VisualBertForRegionToPhraseAlignment
|
||||
|
||||
[[autodoc]] VisualBertForRegionToPhraseAlignment
|
||||
- forward
|
||||
Reference in New Issue
Block a user