Adds VIP-llava to transformers (#27932)
* v1
* add-new-model-like
* revert
* fix forward and conversion script
* revert
* fix copies
* fixup
* fix
* Update docs/source/en/index.md
* Apply suggestions from code review
* push
* fix
* fixes here and there
* up
* fixup and fix tests
* Apply suggestions from code review
* add docs
* fixup
* fixes
* docstring
* add docstring
* fixup
* docstring
* fixup
* nit
* docs
* more copies
* fix copies
* nit
* update test
@@ -741,6 +741,8 @@
title: TVP
- local: model_doc/vilt
title: ViLT
- local: model_doc/vipllava
title: VipLlava
- local: model_doc/vision-encoder-decoder
title: Vision Encoder Decoder Models
- local: model_doc/vision-text-dual-encoder
@@ -280,6 +280,7 @@ Flax), PyTorch, and/or TensorFlow.
| [VAN](model_doc/van) | ✅ | ❌ | ❌ |
| [VideoMAE](model_doc/videomae) | ✅ | ❌ | ❌ |
| [ViLT](model_doc/vilt) | ✅ | ❌ | ❌ |
| [VipLlava](model_doc/vipllava) | ✅ | ❌ | ❌ |
| [Vision Encoder decoder](model_doc/vision-encoder-decoder) | ✅ | ✅ | ✅ |
| [VisionTextDualEncoder](model_doc/vision-text-dual-encoder) | ✅ | ✅ | ✅ |
| [VisualBERT](model_doc/visual_bert) | ✅ | ❌ | ❌ |
61
docs/source/en/model_doc/vipllava.md
Normal file
@@ -0,0 +1,61 @@
<!--Copyright 2023 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.

⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.

-->

# VipLlava

## Overview

The VipLlava model was proposed in [Making Large Multimodal Models Understand Arbitrary Visual Prompts](https://arxiv.org/abs/2312.00784) by Mu Cai, Haotian Liu, Siva Karthik Mustikovela, Gregory P. Meyer, Yuning Chai, Dennis Park, and Yong Jae Lee.

VipLlava enhances the training protocol of Llava by marking images during training and interacting with the model through natural cues such as a "red bounding box" or "pointed arrow".

The abstract from the paper is the following:

*While existing large vision-language multimodal models focus on whole image understanding, there is a prominent gap in achieving region-specific comprehension. Current approaches that use textual coordinates or spatial encodings often fail to provide a user-friendly interface for visual prompting. To address this challenge, we introduce a novel multimodal model capable of decoding arbitrary visual prompts. This allows users to intuitively mark images and interact with the model using natural cues like a "red bounding box" or "pointed arrow". Our simple design directly overlays visual markers onto the RGB image, eliminating the need for complex region encodings, yet achieves state-of-the-art performance on region-understanding tasks like Visual7W, PointQA, and Visual Commonsense Reasoning benchmark. Furthermore, we present ViP-Bench, a comprehensive benchmark to assess the capability of models in understanding visual prompts across multiple dimensions, enabling future research in this domain. Code, data, and model are publicly available.*

Tips:

- The architecture is similar to the Llava architecture, except that the multi-modal projector takes a set of concatenated vision hidden states and has an additional layernorm layer.

- We advise users to use `padding_side="left"` for batched generation, as it leads to more accurate results. Simply set `processor.tokenizer.padding_side = "left"` before generating.

- Note that the model has not been explicitly trained to process multiple images in the same prompt. Although this is technically possible, you may experience inaccurate results.

- For better results, we recommend prompting the model with the correct prompt format:

```bash
"USER: <image>\n<prompt>ASSISTANT:"
```

For multi-turn conversations:

```bash
"USER: <image>\n<prompt1>ASSISTANT: <answer1>USER: <prompt2>ASSISTANT: <answer2>USER: <prompt3>ASSISTANT:"
```
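
The two templates above can be assembled programmatically. The following is a minimal sketch, not part of the library: `build_vipllava_prompt` is a hypothetical helper name, and it assumes that the `\n` in the templates stands for a newline character and that only the first user turn carries the `<image>` placeholder.

```python
from typing import List, Optional, Tuple


def build_vipllava_prompt(turns: List[Tuple[str, Optional[str]]]) -> str:
    """Build a prompt string in the "USER: ... ASSISTANT: ..." format.

    `turns` is a list of (user_prompt, assistant_answer) pairs. Pass None as
    the answer of the last turn so that the prompt ends with "ASSISTANT:"
    and the model can complete it.

    Note: this helper is hypothetical and only illustrates the templates.
    """
    if not turns:
        raise ValueError("at least one turn is required")

    parts = []
    for index, (user_prompt, answer) in enumerate(turns):
        # Assumption: the <image> placeholder appears once, in the first turn.
        prefix = "USER: <image>\n" if index == 0 else "USER: "
        parts.append(f"{prefix}{user_prompt}ASSISTANT:")
        if answer is not None:
            # Completed turns append " <answer>" right after "ASSISTANT:".
            parts.append(f" {answer}")
    return "".join(parts)
```

Calling `build_vipllava_prompt([("What is in the red box?", None)])` produces the single-turn template, and passing earlier `(prompt, answer)` pairs before the final `(prompt, None)` produces the multi-turn one.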

The original code can be found [here](https://github.com/mu-cai/ViP-LLaVA).

This model was contributed by [Younes Belkada](https://huggingface.co/ybelkada).

## VipLlavaConfig

[[autodoc]] VipLlavaConfig

## VipLlavaForConditionalGeneration

[[autodoc]] VipLlavaForConditionalGeneration
- forward
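
The tip about left padding for batched generation can be sketched as follows. This is an illustrative outline rather than the documented usage: `generate_batched` is a hypothetical helper, the `checkpoint` argument must be supplied by the caller (this page does not name one), and the snippet assumes the processor accepts `text`, `images`, `padding`, and `return_tensors` as the Llava processor does.

```python
from typing import List


def generate_batched(
    prompts: List[str],
    images: list,
    checkpoint: str,
    max_new_tokens: int = 50,
) -> List[str]:
    """Run batched generation with left padding (hypothetical helper)."""
    # Imported lazily so that defining this function does not require
    # transformers to be installed.
    from transformers import AutoProcessor, VipLlavaForConditionalGeneration

    processor = AutoProcessor.from_pretrained(checkpoint)
    model = VipLlavaForConditionalGeneration.from_pretrained(checkpoint)

    # Left padding keeps every prompt directly adjacent to the tokens
    # generated after it, which the tips recommend for batched generation.
    processor.tokenizer.padding_side = "left"

    # One image per prompt; each prompt should contain one <image> placeholder.
    inputs = processor(text=prompts, images=images, padding=True, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)

    # The decoded strings include the prompt text followed by the completion.
    return processor.batch_decode(output_ids, skip_special_tokens=True)
```

Each entry in `prompts` is expected to follow the `"USER: <image>\n<prompt>ASSISTANT:"` format, with `images` holding one PIL image per prompt.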
@@ -46,6 +46,7 @@ FlashAttention-2 is currently supported for the following architectures:
* [Falcon](https://huggingface.co/docs/transformers/model_doc/falcon#transformers.FalconModel)
* [Llama](https://huggingface.co/docs/transformers/model_doc/llama#transformers.LlamaModel)
* [Llava](https://huggingface.co/docs/transformers/model_doc/llava)
* [VipLlava](https://huggingface.co/docs/transformers/model_doc/vipllava)
* [MBart](https://huggingface.co/docs/transformers/model_doc/mbart#transformers.MBartModel)
* [Mistral](https://huggingface.co/docs/transformers/model_doc/mistral#transformers.MistralModel)
* [Mixtral](https://huggingface.co/docs/transformers/model_doc/mixtral#transformers.MixtralModel)