Llava: add default chat templates (#31691)

* add default chat templates

* Update src/transformers/models/llava/processing_llava.py

Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

* Update src/transformers/models/llava_next/processing_llava_next.py

Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

* more clear docstring and docs

* Update docs/source/en/model_doc/llava.md

Co-authored-by: NielsRogge <48327001+NielsRogge@users.noreply.github.com>

* Update docs/source/en/model_doc/llava_next.md

Co-authored-by: NielsRogge <48327001+NielsRogge@users.noreply.github.com>

* Update docs/source/en/model_doc/vipllava.md

Co-authored-by: NielsRogge <48327001+NielsRogge@users.noreply.github.com>

* add tests

* remove default templates (see #31733)

* load chat template from another file

* Update docs/source/en/model_doc/llava_next.md

Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

* revert some changes in docs

* forgot vipllava

* chat template file is not temporary hack

* warn if loading from processor

* not that file

* similarly modify `save_pretrained`

* Update tests/models/llava_next/test_processor_llava_next.py

Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

* Update tests/models/vipllava/test_processor_vipllava.py

Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

* Update docs/source/en/model_doc/vipllava.md

Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

* Update src/transformers/processing_utils.py

Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

* Update src/transformers/processing_utils.py

Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

* Update docs/source/en/model_doc/vipllava.md

Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

* Update docs/source/en/model_doc/llava.md

Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

* Update docs/source/en/model_doc/llava.md

Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

* Update docs/source/en/model_doc/llava_next.md

Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

* Update docs/source/en/model_doc/llava_next.md

Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

* Update src/transformers/processing_utils.py

Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

* Update docs/source/en/model_doc/llava_next.md

Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

* fix

---------

Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
Co-authored-by: NielsRogge <48327001+NielsRogge@users.noreply.github.com>
This commit is contained in:
Raushan Turganbay
2024-07-19 10:08:56 +05:00
committed by GitHub
parent 271fd8e60d
commit b873234cb6
8 changed files with 318 additions and 19 deletions

View File

@@ -26,7 +26,12 @@ The abstract from the paper is the following:
*While existing large vision-language multimodal models focus on whole image understanding, there is a prominent gap in achieving region-specific comprehension. Current approaches that use textual coordinates or spatial encodings often fail to provide a user-friendly interface for visual prompting. To address this challenge, we introduce a novel multimodal model capable of decoding arbitrary visual prompts. This allows users to intuitively mark images and interact with the model using natural cues like a "red bounding box" or "pointed arrow". Our simple design directly overlays visual markers onto the RGB image, eliminating the need for complex region encodings, yet achieves state-of-the-art performance on region-understanding tasks like Visual7W, PointQA, and Visual Commonsense Reasoning benchmark. Furthermore, we present ViP-Bench, a comprehensive benchmark to assess the capability of models in understanding visual prompts across multiple dimensions, enabling future research in this domain. Code, data, and model are publicly available.*
Tips:
The original code can be found [here](https://github.com/mu-cai/ViP-LLaVA).
This model was contributed by [Younes Belkada](https://huggingface.co/ybelkada)
## Usage tips:
- The architecture is similar than llava architecture except that the multi-modal projector takes a set of concatenated vision hidden states and has an additional layernorm layer on that module.
@@ -34,22 +39,51 @@ Tips:
- Note the model has not been explicitly trained to process multiple images in the same prompt, although this is technically possible, you may experience inaccurate results.
- For better results, we recommend users to prompt the model with the correct prompt format:
- For better results, we recommend users to use the processor's `apply_chat_template()` method to format your prompt correctly. For that you need to construct a conversation history, passing in a plain string will not format your prompt. Each message in the conversation history for chat templates is a dictionary with keys "role" and "content". The "content" should be a list of dictionaries, for "text" and "image" modalities, as follows:
```python
from transformers import AutoProcessor
processor = AutoProcessor.from_pretrained("llava-hf/vip-llava-7b-hf")
conversation = [
{
"role": "user",
"content": [
{"type": "image"},
{"type": "text", "text": "Whats shown in this image?"},
,
},
{
"role": "assistant",
"content": [{"type": "text", "text": "This image shows a red stop sign."},]
},
{
"role": "user",
"content": [
{"type": "text", "text": "Describe the image in more details."},
],
},
]
text_prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
# Note that the template simply formats your prompt, you still have to tokenize it and obtain pixel values for your images
print(text_prompt)
>>> "###Human: <image>\nWhats shown in this image?###Assistant: This image shows a red stop sign.###Human: Describe the image in more details.###Assistant:"
```
- If you want to construct a chat prompt yourself, below is a list of prompt formats accepted by VipLLaVa checkpoints:
```bash
A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's questions.###Human: <image>\n<prompt>###Assistant:
```
For multiple turns conversation:
```bash
A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's questions.###Human: <image>\n<prompt1>###Assistant: <answer1>###Human: <prompt2>###Assistant:
```
The original code can be found [here](https://github.com/mu-cai/ViP-LLaVA).
This model was contributed by [Younes Belkada](https://huggingface.co/ybelkada)
## VipLlavaConfig