Add BridgeTower model (#20775)
* Commit with BTModel and latest HF code * Placeholder classes for BTForMLM and BTForITR * Importing Bert classes from transformers * Removed objectives.py and dist_utils.py * Removed swin_transformer.py * Add image normalization, BridgeTowerForImageAndTextRetrieval * Add center_crop * Removing bert tokenizer and LCI references * Tested config loading from HF transformers hub * Removed state_dict updates and added path to hub * Enable center crop * Getting image_size from config, renaming num_heads and num_layers * Handling max_length in BridgeTowerProcessor * Add BridgeTowerForMaskedLM * Add doc string for BridgeTowerConfig * Add doc strings for BT config, processor, image processor * Adding docs, removed swin * Removed convert_bridgetower_original_to_pytorch.py * Added doc files for bridgetower, removed is_vision * Add support attention_mask=None and BridgeTowerModelOutput * Fix formatting * Fixes with 'make style', 'make quality', 'make fixup' * Remove downstream tasks from BridgeTowerModel * Formatting fixes, add return_dict to BT models * Clean up after doc_test * Update BTModelOutput return type, fix todo in doc * Remove loss_names from init * implement tests and update tuples returned by models * Add image reference to bridgetower.mdx * after make fix-copies, make fixup, make style, make quality, make repo-consistency * Rename class names with BridgeTower prefix * Fix for image_size in BTImageProcessor * implement feature extraction bridgetower tests * Update image_mean and image_std to be list * remove unused import * Removed old comments * Rework CLIP * update config in tests followed config update * Formatting fixes * Add copied from for BridgeTowerPredictionHeadTransform * Update bridgetower.mdx * Update test_feature_extraction_bridgetower.py * Update bridgetower.mdx * BridgeTowerForMaskedLM is conditioned on image too * Add BridgeTowerForMaskedLM * Fixes * Call post_init to init weights * Move freeze layers into method * Remove BTFeatureExtractor, add BT under multimodal models * Remove BTFeatureExtractor, add BT under multimodal models * Code review feedback - cleanup * Rename variables * Formatting and style to PR review feedback * Move center crop after resize * Use named parameters * Style fix for modeling_bridgetower.py * Update docs/source/en/model_doc/bridgetower.mdx Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com> * Update docs/source/en/model_doc/bridgetower.mdx Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com> * Update docs/source/en/model_doc/bridgetower.mdx Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com> * Update src/transformers/models/bridgetower/modeling_bridgetower.py Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com> * Update src/transformers/models/bridgetower/modeling_bridgetower.py Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com> * Update docs/source/en/model_doc/bridgetower.mdx Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com> * Update src/transformers/models/bridgetower/modeling_bridgetower.py Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com> * Rename config params, copy BERT classes, clean comments * Cleanup irtr * Replace Roberta imports, add BTTextConfig and Model * Update docs, add visionconfig, consistent arg names * make fixup * Comments for forward in BTModel and make fixup * correct tests * Remove inconsistent roberta copied from * Add BridgeTowerTextModel to dummy_pt_objects.py * Add BridgeTowerTextModel to IGNORE_NON_TESTED * Update docs for BT Text and Vision Configs * Treat BridgeTowerTextModel as a private model * BridgeTowerTextModel as private * Run make fix-copies * Adding BTTextModel to PRIVATE_MODELS * Fix for issue with BT Text and Image configs * make style changes * Update README_ja.md Add から to BridgeTower's description * Clean up config, .mdx and arg names * Fix init_weights. Remove nn.Sequential * Formatting and style fixes * Re-add tie_word_embeddings in config * update test implementation * update style * remove commented out * fix style * Update README with abs for BridgeTower * fix style * fix mdx file * Update bridgetower.mdx * Update img src in bridgetower.mdx * Update README.md * Update README.md * resolve style failed * Update _toctree.yml * Update README_ja.md * Removed mlp_ratio, rename feats, rename BTCLIPModel * Replace BTCLIP with BTVisionModel,pass in vision_config to BTVisionModel * Add test_initialization support * Add support for output_hidden_states * Update support for output_hidden_states * Add support for output_attentions * Add docstring for output_hidden_states * update tests * add bridgetowervisionmodel as private model * rerun the PR test * Remove model_type, pass configs to classes, renames * Change self.device to use weight device * Remove image_size * Style check fixes * Add hidden_size and num_hidden_layers to BridgeTowerTransformer * Update device setting * cosmetic update * trigger test again * trigger tests again * Update test_modeling_bridgetower.py trigger tests again * Update test_modeling_bridgetower.py * minor update * re-trigger tests * Update docs/source/en/model_doc/bridgetower.mdx Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com> * Remove pad, update max_text_len, doc cleanup, pass eps to LayerNorm * Added copied to, some more review feedback * make fixup * Use BridgeTowerVisionEmbeddings * Code cleanup * Fixes for BridgeTowerVisionEmbeddings * style checks * re-tests * fix embedding * address comment on init file * retrigger tests * update import prepare_image_inputs * update test_image_processing_bridgetower.py to reflect test_image_processing_common.py * retrigger tests Co-authored-by: Shaoyen Tseng <shao-yen.tseng@intel.com> Co-authored-by: Tiep Le <tiep.le@intel.com> Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com> Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com> Co-authored-by: Tiep Le <97980157+tileintel@users.noreply.github.com>
This commit is contained in:
committed by
GitHub
parent
39799fbf85
commit
3a6e4a221c
@@ -516,6 +516,8 @@
|
||||
title: AltCLIP
|
||||
- local: model_doc/blip
|
||||
title: BLIP
|
||||
- local: model_doc/bridgetower
|
||||
title: BridgeTower
|
||||
- local: model_doc/chinese_clip
|
||||
title: Chinese-CLIP
|
||||
- local: model_doc/clip
|
||||
@@ -595,4 +597,4 @@
|
||||
- local: internal/file_utils
|
||||
title: General Utilities
|
||||
title: Internal Helpers
|
||||
title: API
|
||||
title: API
|
||||
|
||||
@@ -68,6 +68,7 @@ The documentation is organized into five sections:
|
||||
1. **[BLIP](model_doc/blip)** (from Salesforce) released with the paper [BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation](https://arxiv.org/abs/2201.12086) by Junnan Li, Dongxu Li, Caiming Xiong, Steven Hoi.
|
||||
1. **[BLOOM](model_doc/bloom)** (from BigScience workshop) released by the [BigScience Workshop](https://bigscience.huggingface.co/).
|
||||
1. **[BORT](model_doc/bort)** (from Alexa) released with the paper [Optimal Subarchitecture Extraction For BERT](https://arxiv.org/abs/2010.10499) by Adrian de Wynter and Daniel J. Perry.
|
||||
1. **[BridgeTower](model_doc/bridgetower)** (from Harbin Institute of Technology/Microsoft Research Asia/Intel Labs) released with the paper [BridgeTower: Building Bridges Between Encoders in Vision-Language Representation Learning](https://arxiv.org/abs/2206.08657) by Xiao Xu, Chenfei Wu, Shachar Rosenman, Vasudev Lal, Wanxiang Che, Nan Duan.
|
||||
1. **[ByT5](model_doc/byt5)** (from Google Research) released with the paper [ByT5: Towards a token-free future with pre-trained byte-to-byte models](https://arxiv.org/abs/2105.13626) by Linting Xue, Aditya Barua, Noah Constant, Rami Al-Rfou, Sharan Narang, Mihir Kale, Adam Roberts, Colin Raffel.
|
||||
1. **[CamemBERT](model_doc/camembert)** (from Inria/Facebook/Sorbonne) released with the paper [CamemBERT: a Tasty French Language Model](https://arxiv.org/abs/1911.03894) by Louis Martin*, Benjamin Muller*, Pedro Javier Ortiz Suárez*, Yoann Dupont, Laurent Romary, Éric Villemonte de la Clergerie, Djamé Seddah and Benoît Sagot.
|
||||
1. **[CANINE](model_doc/canine)** (from Google Research) released with the paper [CANINE: Pre-training an Efficient Tokenization-Free Encoder for Language Representation](https://arxiv.org/abs/2103.06874) by Jonathan H. Clark, Dan Garrette, Iulia Turc, John Wieting.
|
||||
@@ -250,6 +251,7 @@ Flax), PyTorch, and/or TensorFlow.
|
||||
| BlenderbotSmall | ✅ | ✅ | ✅ | ✅ | ✅ |
|
||||
| BLIP | ❌ | ❌ | ✅ | ❌ | ❌ |
|
||||
| BLOOM | ❌ | ✅ | ✅ | ❌ | ❌ |
|
||||
| BridgeTower | ❌ | ❌ | ✅ | ❌ | ❌ |
|
||||
| CamemBERT | ✅ | ✅ | ✅ | ✅ | ❌ |
|
||||
| CANINE | ✅ | ❌ | ✅ | ❌ | ❌ |
|
||||
| Chinese-CLIP | ❌ | ❌ | ✅ | ❌ | ❌ |
|
||||
|
||||
140
docs/source/en/model_doc/bridgetower.mdx
Normal file
140
docs/source/en/model_doc/bridgetower.mdx
Normal file
@@ -0,0 +1,140 @@
|
||||
<!--Copyright 2023 The Intel Labs Team Authors, The Microsoft Research Team Authors and HuggingFace Inc. team. All rights reserved.
|
||||
|
||||
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
|
||||
the License. You may obtain a copy of the License at
|
||||
|
||||
http://www.apache.org/licenses/LICENSE-2.0
|
||||
|
||||
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
|
||||
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
|
||||
specific language governing permissions and limitations under the License.
|
||||
-->
|
||||
|
||||
# BridgeTower
|
||||
|
||||
## Overview
|
||||
|
||||
The BridgeTower model was proposed in [BridgeTower: Building Bridges Between Encoders in Vision-Language Representative Learning](https://arxiv.org/abs/2206.08657) by Xiao Xu, Chenfei Wu, Shachar Rosenman, Vasudev Lal, Wanxiang Che, Nan Duan. The goal of this model is to build a
|
||||
bridge between each uni-modal encoder and the cross-modal encoder to enable comprehensive and detailed interaction at each layer of the cross-modal encoder thus achieving remarkable performance on various downstream tasks with almost negligible additional performance and computational costs.
|
||||
|
||||
This paper has been accepted to the [AAAI'23](https://aaai.org/Conferences/AAAI-23/) conference.
|
||||
|
||||
The abstract from the paper is the following:
|
||||
|
||||
*Vision-Language (VL) models with the TWO-TOWER architecture have dominated visual-language representation learning in recent years.
|
||||
Current VL models either use lightweight uni-modal encoders and learn to extract, align and fuse both modalities simultaneously in a deep cross-modal encoder, or feed the last-layer uni-modal representations from the deep pre-trained uni-modal encoders into the top cross-modal encoder.
|
||||
Both approaches potentially restrict vision-language representation learning and limit model performance. In this paper, we propose BRIDGETOWER, which introduces multiple bridge layers that build a connection between the top layers of uni-modal encoders and each layer of the crossmodal encoder.
|
||||
This enables effective bottom-up cross-modal alignment and fusion between visual and textual representations of different semantic levels of pre-trained uni-modal encoders in the cross-modal encoder. Pre-trained with only 4M images, BRIDGETOWER achieves state-of-the-art performance on various downstream vision-language tasks.
|
||||
In particular, on the VQAv2 test-std set, BRIDGETOWER achieves an accuracy of 78.73%, outperforming the previous state-of-the-art model METER by 1.09% with the same pre-training data and almost negligible additional parameters and computational costs.
|
||||
Notably, when further scaling the model, BRIDGETOWER achieves an accuracy of 81.15%, surpassing models that are pre-trained on orders-of-magnitude larger datasets.*
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/bridgetower_architecture%20.jpg"
|
||||
alt="drawing" width="600"/>
|
||||
|
||||
<small> BridgeTower architecture. Taken from the <a href="https://arxiv.org/abs/2206.08657">original paper.</a> </small>
|
||||
|
||||
## Usage
|
||||
|
||||
BridgeTower consists of a visual encoder, a textual encoder and cross-modal encoder with multiple lightweight bridge layers.
|
||||
The goal of this approach was to build a bridge between each uni-modal encoder and the cross-modal encoder to enable comprehensive and detailed interaction at each layer of the cross-modal encoder.
|
||||
In principle, one can apply any visual, textual or cross-modal encoder in the proposed architecture.
|
||||
|
||||
The [`BridgeTowerProcessor`] wraps [`RobertaTokenizer`] and [`BridgeTowerImageProcessor`] into a single instance to both
|
||||
encode the text and prepare the images respectively.
|
||||
|
||||
The following example shows how to run image-text retrieval using [`BridgeTowerProcessor`] and [`BridgeTowerForImageAndTextRetrieval`].
|
||||
```python
|
||||
>>> from transformers import BridgeTowerProcessor, BridgeTowerForImageAndTextRetrieval
|
||||
>>> import requests
|
||||
>>> from PIL import Image
|
||||
|
||||
>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
|
||||
>>> image = Image.open(requests.get(url, stream=True).raw)
|
||||
>>> texts = ["An image of two cats chilling on a couch", "A football player scoring a goal"]
|
||||
|
||||
>>> processor = BridgeTowerProcessor.from_pretrained("BridgeTower/bridgetower-base-itm-mlm")
|
||||
>>> model = BridgeTowerForImageAndTextRetrieval.from_pretrained("BridgeTower/bridgetower-base-itm-mlm")
|
||||
|
||||
>>> # forward pass
|
||||
>>> scores = dict()
|
||||
>>> for text in texts:
|
||||
... # prepare inputs
|
||||
... encoding = processor(image, text, return_tensors="pt")
|
||||
... outputs = model(**encoding)
|
||||
... scores[text] = outputs.logits[0, 1].item()
|
||||
```
|
||||
|
||||
The following example shows how to run masked language modeling using [`BridgeTowerProcessor`] and [`BridgeTowerForMaskedLM`].
|
||||
|
||||
```python
|
||||
>>> from transformers import BridgeTowerProcessor, BridgeTowerForMaskedLM
|
||||
>>> from PIL import Image
|
||||
>>> import requests
|
||||
|
||||
>>> url = "http://images.cocodataset.org/val2017/000000360943.jpg"
|
||||
>>> image = Image.open(requests.get(url, stream=True).raw).convert("RGB")
|
||||
>>> text = "a <mask> looking out of the window"
|
||||
|
||||
>>> processor = BridgeTowerProcessor.from_pretrained("BridgeTower/bridgetower-base-itm-mlm")
|
||||
>>> model = BridgeTowerForMaskedLM.from_pretrained("BridgeTower/bridgetower-base-itm-mlm")
|
||||
|
||||
>>> # prepare inputs
|
||||
>>> encoding = processor(image, text, return_tensors="pt")
|
||||
|
||||
>>> # forward pass
|
||||
>>> outputs = model(**encoding)
|
||||
|
||||
>>> results = processor.decode(outputs.logits.argmax(dim=-1).squeeze(0).tolist())
|
||||
|
||||
>>> print(results)
|
||||
.a cat looking out of the window.
|
||||
```
|
||||
|
||||
This model was contributed by [Anahita Bhiwandiwalla](https://huggingface.co/anahita-b), [Tiep Le](https://huggingface.co/Tile) and [Shaoyen Tseng](https://huggingface.co/shaoyent). The original code can be found [here](https://github.com/microsoft/BridgeTower).
|
||||
|
||||
|
||||
Tips:
|
||||
|
||||
- This implementation of BridgeTower uses [`RobertaTokenizer`] to generate text embeddings and OpenAI's CLIP/ViT model to compute visual embeddings.
|
||||
- Checkpoints for pre-trained [bridgeTower-base](https://huggingface.co/BridgeTower/bridgetower-base) and [bridgetower masked language modeling and image text matching](https://huggingface.co/BridgeTower/bridgetower-base-itm-mlm) are released.
|
||||
- Please refer to [Table 5](https://arxiv.org/pdf/2206.08657.pdf) for BridgeTower's performance on Image Retrieval and other down stream tasks.
|
||||
- The PyTorch version of this model is only available in torch 1.10 and higher.
|
||||
|
||||
|
||||
## BridgeTowerConfig
|
||||
|
||||
[[autodoc]] BridgeTowerConfig
|
||||
|
||||
## BridgeTowerTextConfig
|
||||
|
||||
[[autodoc]] BridgeTowerTextConfig
|
||||
|
||||
## BridgeTowerVisionConfig
|
||||
|
||||
[[autodoc]] BridgeTowerVisionConfig
|
||||
|
||||
## BridgeTowerImageProcessor
|
||||
|
||||
[[autodoc]] BridgeTowerImageProcessor
|
||||
- preprocess
|
||||
|
||||
## BridgeTowerProcessor
|
||||
|
||||
[[autodoc]] BridgeTowerProcessor
|
||||
- __call__
|
||||
|
||||
## BridgeTowerModel
|
||||
|
||||
[[autodoc]] BridgeTowerModel
|
||||
- forward
|
||||
|
||||
## BridgeTowerForMaskedLM
|
||||
|
||||
[[autodoc]] BridgeTowerForMaskedLM
|
||||
- forward
|
||||
|
||||
## BridgeTowerForImageAndTextRetrieval
|
||||
|
||||
[[autodoc]] BridgeTowerForImageAndTextRetrieval
|
||||
- forward
|
||||
|
||||
Reference in New Issue
Block a user