Add Idefics2 (#30253)
* Initial add model additions * Test * All weights loading * Can perform full forward pass * Local and remote the same * Matching local and remote * Fixup * Idefics2Model importable; fixup docstrings * Don't skip by default * Remove deprecated use_resampler arg * Remove self.config * DecoupledLinear takes config * Tidy up * Enable eager attention and tidy up * Most tests passing * Update for batch of processed images * Add image processor * Update doc pages * Update conversion script * Remove erroneous breakpoint * Remove accidendtal spelling change * Update to reflect changes on hub - make generate work * Fix up * Image processor tests * Update tests * Add a processor * Add a processor * Update convert script * Update modeling file - remove fixmes * Bug fix * Add processing test * Use processor * Fix up * Update src/transformers/models/idefics2/modeling_idefics2.py Co-authored-by: Victor SANH <victorsanh@gmail.com> * Update src/transformers/models/idefics2/modeling_idefics2.py Co-authored-by: Victor SANH <victorsanh@gmail.com> * Fix test * Update config - PR comments and defaults align with checkpoint * Reviewer comments * Add copied froms for flahs attention * Update src/transformers/models/idefics2/modeling_idefics2.py Co-authored-by: Victor SANH <victorsanh@gmail.com> * Apply suggestions from code review Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com> * Remove qk_layer_norm and freeze_layers functionality * Fix * Remove freeze_layer options from config * Sync with upstream main * Fix attention shapes siglip * Remove Llava-next refs - TO REBASE * Use AutoModel for text model * Add comment to explain vision embeddings * Fix issue with tie_word_embeddings * Address review comments * Fix and fix up * Chat templates for idefics * Fix copies * Fix * Add layer norms to FA2 * Fix tests * Apply suggestions from code review Co-authored-by: Victor SANH <victorsanh@gmail.com> * Fix * Review comments * Update 
src/transformers/models/idefics2/modeling_idefics2.py Co-authored-by: Victor SANH <victorsanh@gmail.com> * Update inputs merger * Merge weights in correct order * Update convert script * Update src/transformers/models/idefics2/processing_idefics2.py Co-authored-by: Victor SANH <victorsanh@gmail.com> * Update template * Model code examples (fix idefics too) * More review comments * Tidy up * Update processing * Fix attention mask preparation * Update inputs_merger inputs * Vectorize inputs_merger * Update src/transformers/models/idefics2/__init__.py Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com> * Update src/transformers/models/idefics2/modeling_idefics2.py * Review comments * saying bye to the `qk_layer_norms` * Simplify * Update latents * Remove erroneuous readme changes * Return images when applying chat template * Fix bug - prompt images are for a single sample * Update src/transformers/models/idefics2/modeling_idefics2.py * image splitting * fix test * some more comment * some comment * Apply suggestions from code review Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com> * Update src/transformers/models/idefics2/image_processing_idefics2.py Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com> * Update processor * Update model tests * Update src/transformers/models/idefics2/processing_idefics2.py Co-authored-by: Victor SANH <victorsanh@gmail.com> * Update src/transformers/models/idefics2/processing_idefics2.py Co-authored-by: Victor SANH <victorsanh@gmail.com> * Don't add BOS in template * Update src/transformers/models/idefics2/processing_idefics2.py Co-authored-by: Victor SANH <victorsanh@gmail.com> * Remove index in examples * Update tests to reflect #13 * Update src/transformers/models/idefics2/processing_idefics2.py Co-authored-by: Victor SANH <victorsanh@gmail.com> * PR comment - consistent typing * Update readme and model doc * Update docs * Update checkpoint references * Update 
examples * Fix and update tests * Small addition * Update tests - remove copied from as no ignore placement copy could be found * Update example * small fixes * Update docs/source/en/model_doc/idefics2.md Co-authored-by: Victor SANH <victorsanh@gmail.com> * Update docs/source/en/model_doc/idefics2.md Co-authored-by: Victor SANH <victorsanh@gmail.com> * Update README.md Co-authored-by: Victor SANH <victorsanh@gmail.com> * Connector model as bridge * Fix up * Fix up * Don't pass model inputs for generation kwargs update * IDEFICS-2 -> Idefics2 * Remove config archive name * IDEFICS-2 -> Idefics2 * Add back llava-next * Update readmes * Add requirements for processor tester * Use custom convert_to_rgb to avoid possible BC * Fix doc example * Fix doc example * Skip model doc tests - as model to large * More doc example - account for image splitting * Update src/transformers/image_transforms.py * Fix config doctest --------- Co-authored-by: Pablo Montalvo <39954772+molbap@users.noreply.github.com> Co-authored-by: ArthurZucker <arthur.zucker@gmail.com> Co-authored-by: Victor SANH <victorsanh@gmail.com> Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
@@ -738,6 +738,8 @@
  title: GroupViT
- local: model_doc/idefics
  title: IDEFICS
- local: model_doc/idefics2
  title: Idefics2
- local: model_doc/instructblip
  title: InstructBLIP
- local: model_doc/kosmos-2
@@ -160,6 +160,7 @@ Flax), PyTorch, and/or TensorFlow.
| [Hubert](model_doc/hubert) | ✅ | ✅ | ❌ |
| [I-BERT](model_doc/ibert) | ✅ | ❌ | ❌ |
| [IDEFICS](model_doc/idefics) | ✅ | ❌ | ❌ |
| [Idefics2](model_doc/idefics2) | ✅ | ❌ | ❌ |
| [ImageGPT](model_doc/imagegpt) | ✅ | ❌ | ❌ |
| [Informer](model_doc/informer) | ✅ | ❌ | ❌ |
| [InstructBLIP](model_doc/instructblip) | ✅ | ❌ | ❌ |
98
docs/source/en/model_doc/idefics2.md
Normal file
@@ -0,0 +1,98 @@
<!--Copyright 2024 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.

⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.

-->

# Idefics2

## Overview

The Idefics2 model was created by the [Hugging Face M4](https://huggingface.co/HuggingFaceM4) team and authored by Léo Tronchon, Hugo Laurençon, and Victor Sanh.
The accompanying blog post can be found [here](https://huggingface.co/blog/idefics2).

Idefics2 is an open multimodal model that accepts arbitrary sequences of image and text inputs and produces text
outputs. The model can answer questions about images, describe visual content, create stories grounded in multiple
images, or simply behave as a pure language model without visual inputs. It improves upon IDEFICS-1, notably on
document understanding, OCR, and visual reasoning. Idefics2 is lightweight (8 billion parameters) and treats
images in their native aspect ratio and resolution, which allows for varying inference efficiency.

Tips:
- Each sample can contain multiple images, and the number of images can vary between samples. The processor will pad the inputs to the maximum number of images in a batch for input to the model.
- The processor has a `do_image_splitting` option. If `True`, each input image is split into 4 sub-images, which are concatenated with the original to form 5 images. This can improve model performance. Make sure `processor.image_processor.do_image_splitting` is set to `False` if the model was not trained with this option.
- `text` passed to the processor should contain an `<image>` token wherever an image should be inserted, and `<end_of_utterance>` at the end of each utterance if the text is a chat message.
- The processor has its own `apply_chat_template` method to convert chat messages to text that can then be passed as `text` to the processor.
|
||||
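The padding behaviour from the first tip can be pictured with a small, dependency-free sketch. It is not the processor's code: `pad_images_per_sample` is a hypothetical helper, strings stand in for image objects, and `None` plus a boolean mask is an assumed way to mark padding slots.

```python
from typing import List, Optional, Tuple


def pad_images_per_sample(
    batch: List[List[str]],
) -> Tuple[List[List[Optional[str]]], List[List[bool]]]:
    """Pad every sample to the largest image count in the batch.

    Returns the padded batch and a mask that is True for real images
    and False for padding slots. (Hypothetical helper, for illustration.)
    """
    max_images = max((len(sample) for sample in batch), default=0)
    padded, mask = [], []
    for sample in batch:
        n_missing = max_images - len(sample)
        padded.append(list(sample) + [None] * n_missing)
        mask.append([True] * len(sample) + [False] * n_missing)
    return padded, mask


# Strings stand in for image objects to keep the sketch dependency-free.
batch = [["cat.jpg", "dog.jpg"], ["bird.jpg"], []]
padded, mask = pad_images_per_sample(batch)
print(padded)
print(mask)
```

Every sample ends up with the same number of image slots, which is what lets a batch be stacked into a single tensor.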
|
||||
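To make the `do_image_splitting` tip concrete, here is a minimal sketch of one way to turn an image into 4 sub-images plus the original. It assumes the sub-images are the four quadrants, which the tip does not state; `split_boxes` is a hypothetical helper, and the image processor's real cropping logic may differ.

```python
from typing import List, Tuple

# (left, upper, right, lower), the box convention used by PIL's Image.crop
Box = Tuple[int, int, int, int]


def split_boxes(width: int, height: int) -> List[Box]:
    """Return crop boxes for four quadrants followed by the full image.

    Assumption: the 4 sub-images are quadrants split at the midpoint.
    """
    mid_w, mid_h = width // 2, height // 2
    return [
        (0, 0, mid_w, mid_h),           # top-left
        (mid_w, 0, width, mid_h),       # top-right
        (0, mid_h, mid_w, height),      # bottom-left
        (mid_w, mid_h, width, height),  # bottom-right
        (0, 0, width, height),          # original image
    ]


boxes = split_boxes(640, 480)
print(len(boxes))
```

With Pillow installed, the boxes could be applied as `[image.crop(box) for box in split_boxes(*image.size)]`. Under this scheme each input image contributes five images to the model, so the image-token budget grows accordingly.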
Example of how to use the processor on chat messages:
```python
import requests
from PIL import Image
from transformers import Idefics2Processor, Idefics2ForConditionalGeneration

url_1 = "http://images.cocodataset.org/val2017/000000039769.jpg"
url_2 = "http://images.cocodataset.org/val2017/000000219578.jpg"

# Download the two example images
image_1 = Image.open(requests.get(url_1, stream=True).raw)
image_2 = Image.open(requests.get(url_2, stream=True).raw)
images = [image_1, image_2]

# One user turn: a question followed by two image placeholders
messages = [{
    "role": "user",
    "content": [
        {"type": "text", "text": "What’s the difference between these two images?"},
        {"type": "image"},
        {"type": "image"},
    ],
}]

processor = Idefics2Processor.from_pretrained("HuggingFaceM4/idefics2-8b")
model = Idefics2ForConditionalGeneration.from_pretrained("HuggingFaceM4/idefics2-8b")

# Render the chat messages into a prompt string
text = processor.apply_chat_template(messages)
# "User: What’s the difference between these two images?<image><image><end_of_utterance>\n"
print(text)

# Return PyTorch tensors so the inputs can be passed directly to `generate`
inputs = processor(images=images, text=text, return_tensors="pt")

generated_ids = model.generate(**inputs)
# `generate` returns token ids; decode them to get the generated text
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)
```
|
||||
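The prompt format shown in the comment above can be reproduced with a small pure-Python sketch. This is not the processor's chat template: `render_chat` is a hypothetical function written only to match the single-turn output string quoted in the example, and the real template may behave differently for other roles or multi-turn chats.

```python
from typing import Dict, List


def render_chat(messages: List[Dict]) -> str:
    """Render chat messages into the prompt format quoted in the example.

    Hypothetical stand-in for `processor.apply_chat_template`.
    """
    parts = []
    for message in messages:
        # "user" -> "User: "
        parts.append(f"{message['role'].capitalize()}: ")
        for item in message["content"]:
            if item["type"] == "text":
                parts.append(item["text"])
            elif item["type"] == "image":
                # Each image entry becomes one placeholder token
                parts.append("<image>")
        parts.append("<end_of_utterance>\n")
    return "".join(parts)


messages = [{
    "role": "user",
    "content": [
        {"type": "text", "text": "What’s the difference between these two images?"},
        {"type": "image"},
        {"type": "image"},
    ],
}]
rendered = render_chat(messages)
print(repr(rendered))
```

The number of `<image>` placeholders in the rendered text has to match the number of images passed to the processor, which is why the message content lists one `{"type": "image"}` entry per image.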
This model was contributed by [amyeroberts](https://huggingface.co/amyeroberts).
The original code can be found [here](https://huggingface.co/HuggingFaceM4/idefics2).


## Idefics2Config

[[autodoc]] Idefics2Config


## Idefics2Model

[[autodoc]] Idefics2Model
    - forward


## Idefics2ForConditionalGeneration

[[autodoc]] Idefics2ForConditionalGeneration
    - forward


## Idefics2ImageProcessor
[[autodoc]] Idefics2ImageProcessor
    - preprocess


## Idefics2Processor
[[autodoc]] Idefics2Processor
    - __call__
@@ -47,6 +47,7 @@ FlashAttention-2 is currently supported for the following architectures:
* [GPTNeo](https://huggingface.co/docs/transformers/model_doc/gpt_neo#transformers.GPTNeoModel)
* [GPTNeoX](https://huggingface.co/docs/transformers/model_doc/gpt_neox#transformers.GPTNeoXModel)
* [GPT-J](https://huggingface.co/docs/transformers/model_doc/gptj#transformers.GPTJModel)
* [Idefics2](https://huggingface.co/docs/transformers/model_doc/idefics2#transformers.Idefics2Model)
* [Falcon](https://huggingface.co/docs/transformers/model_doc/falcon#transformers.FalconModel)
* [Llama](https://huggingface.co/docs/transformers/model_doc/llama#transformers.LlamaModel)
* [Llava](https://huggingface.co/docs/transformers/model_doc/llava)
@@ -96,8 +97,8 @@ model_id = "tiiuae/falcon-7b"
|
||||
tokenizer = AutoTokenizer.from_pretrained(model_id)
|
||||
|
||||
model = AutoModelForCausalLM.from_pretrained(
|
||||
model_id,
|
||||
torch_dtype=torch.bfloat16,
|
||||
model_id,
|
||||
torch_dtype=torch.bfloat16,
|
||||
attn_implementation="flash_attention_2",
|
||||
)
|
||||
```
|
||||
@@ -109,7 +110,7 @@ FlashAttention-2 can only be used when the model's dtype is `fp16` or `bf16`. Ma
|
||||
<br>
|
||||
|
||||
You can also set `use_flash_attention_2=True` to enable FlashAttention-2 but it is deprecated in favor of `attn_implementation="flash_attention_2"`.
|
||||
|
||||
|
||||
</Tip>
|
||||
|
||||
FlashAttention-2 can be combined with other optimization techniques like quantization to further speed up inference. For example, you can combine FlashAttention-2 with 8-bit or 4-bit quantization:
|
||||
@@ -123,14 +124,14 @@ tokenizer = AutoTokenizer.from_pretrained(model_id)
|
||||
|
||||
# load in 8bit
|
||||
model = AutoModelForCausalLM.from_pretrained(
|
||||
model_id,
|
||||
model_id,
|
||||
load_in_8bit=True,
|
||||
attn_implementation="flash_attention_2",
|
||||
)
|
||||
|
||||
# load in 4bit
|
||||
model = AutoModelForCausalLM.from_pretrained(
|
||||
model_id,
|
||||
model_id,
|
||||
load_in_4bit=True,
|
||||
attn_implementation="flash_attention_2",
|
||||
)
|
||||
|
||||