Add auto model for image-text-to-text (#32472)
* Add Auto model for image-text-to-text * Remove donut from processing auto, add chameleon ti image text to text models * add qwen2_vl and llava_onevision * add pixtral to auto model for image-text-to-text * add mllama and idefics3 * remove models in IGNORE_NON_AUTO_CONFIGURED * add AutoModelForImageTextToText to tests and doc
This commit is contained in:
@@ -381,3 +381,7 @@ The following auto classes are available for the following multimodal tasks.
|
||||
### FlaxAutoModelForVision2Seq
|
||||
|
||||
[[autodoc]] FlaxAutoModelForVision2Seq
|
||||
|
||||
### AutoModelForImageTextToText
|
||||
|
||||
[[autodoc]] AutoModelForImageTextToText
|
||||
|
||||
@@ -166,10 +166,10 @@ LLaVa-Next can perform inference with multiple images as input, where images eit
|
||||
import requests
|
||||
from PIL import Image
|
||||
import torch
|
||||
from transformers import AutoProcessor, LlavaNextForConditionalGeneration
|
||||
from transformers import AutoProcessor, AutoModelForImageTextToText
|
||||
|
||||
# Load the model in half-precision
|
||||
model = LlavaNextForConditionalGeneration.from_pretrained("llava-hf/llava-v1.6-mistral-7b-hf", torch_dtype=torch.float16, device_map="auto")
|
||||
model = AutoModelForImageTextToText.from_pretrained("llava-hf/llava-v1.6-mistral-7b-hf", torch_dtype=torch.float16, device_map="auto")
|
||||
processor = AutoProcessor.from_pretrained("llava-hf/llava-v1.6-mistral-7b-hf")
|
||||
|
||||
# Get three different images
|
||||
@@ -246,7 +246,7 @@ We value your feedback to help identify bugs before the full release! Check out
|
||||
Simply change the snippet above with:
|
||||
|
||||
```python
|
||||
from transformers import LlavaNextForConditionalGeneration, BitsAndBytesConfig
|
||||
from transformers import AutoModelForImageTextToText, BitsAndBytesConfig
|
||||
|
||||
# specify how to quantize the model
|
||||
quantization_config = BitsAndBytesConfig(
|
||||
@@ -255,7 +255,7 @@ quantization_config = BitsAndBytesConfig(
|
||||
bnb_4bit_compute_dtype=torch.float16,
|
||||
)
|
||||
|
||||
model = LlavaNextForConditionalGeneration.from_pretrained("llava-hf/llava-v1.6-mistral-7b-hf", quantization_config=quantization_config, device_map="auto")
|
||||
model = AutoModelForImageTextToText.from_pretrained("llava-hf/llava-v1.6-mistral-7b-hf", quantization_config=quantization_config, device_map="auto")
|
||||
```
|
||||
|
||||
### Use Flash-Attention 2 to further speed-up generation
|
||||
@@ -263,9 +263,9 @@ model = LlavaNextForConditionalGeneration.from_pretrained("llava-hf/llava-v1.6-m
|
||||
First make sure to install flash-attn. Refer to the [original repository of Flash Attention](https://github.com/Dao-AILab/flash-attention) regarding that package installation. Simply change the snippet above with:
|
||||
|
||||
```python
|
||||
from transformers import LlavaNextForConditionalGeneration
|
||||
from transformers import AutoModelForImageTextToText
|
||||
|
||||
model = LlavaNextForConditionalGeneration.from_pretrained(
|
||||
model = AutoModelForImageTextToText.from_pretrained(
|
||||
model_id,
|
||||
torch_dtype=torch.float16,
|
||||
low_cpu_mem_usage=True,
|
||||
|
||||
@@ -27,22 +27,22 @@ To begin with, there are multiple types of VLMs:
|
||||
- chat fine-tuned models for conversation
|
||||
- instruction fine-tuned models
|
||||
|
||||
This guide focuses on inference with an instruction-tuned model.
|
||||
This guide focuses on inference with an instruction-tuned model.
|
||||
|
||||
Let's begin installing the dependencies.
|
||||
|
||||
```bash
|
||||
pip install -q transformers accelerate flash_attn
|
||||
pip install -q transformers accelerate flash_attn
|
||||
```
|
||||
|
||||
Let's initialize the model and the processor.
|
||||
Let's initialize the model and the processor.
|
||||
|
||||
```python
|
||||
from transformers import AutoProcessor, Idefics2ForConditionalGeneration
|
||||
from transformers import AutoProcessor, AutoModelForImageTextToText
|
||||
import torch
|
||||
|
||||
device = torch.device("cuda")
|
||||
model = Idefics2ForConditionalGeneration.from_pretrained(
|
||||
model = AutoModelForImageTextToText.from_pretrained(
|
||||
"HuggingFaceM4/idefics2-8b",
|
||||
torch_dtype=torch.bfloat16,
|
||||
attn_implementation="flash_attention_2",
|
||||
@@ -51,7 +51,7 @@ model = Idefics2ForConditionalGeneration.from_pretrained(
|
||||
processor = AutoProcessor.from_pretrained("HuggingFaceM4/idefics2-8b")
|
||||
```
|
||||
|
||||
This model has a [chat template](./chat_templating) that helps user parse chat outputs. Moreover, the model can also accept multiple images as input in a single conversation or message. We will now prepare the inputs.
|
||||
This model has a [chat template](./chat_templating) that helps user parse chat outputs. Moreover, the model can also accept multiple images as input in a single conversation or message. We will now prepare the inputs.
|
||||
|
||||
The image inputs look like the following.
|
||||
|
||||
@@ -74,7 +74,7 @@ images = [Image.open(requests.get(img_urls[0], stream=True).raw),
|
||||
Image.open(requests.get(img_urls[1], stream=True).raw)]
|
||||
```
|
||||
|
||||
Below is an example of the chat template. We can feed conversation turns and the last message as an input by appending it at the end of the template.
|
||||
Below is an example of the chat template. We can feed conversation turns and the last message as an input by appending it at the end of the template.
|
||||
|
||||
|
||||
```python
|
||||
@@ -98,7 +98,7 @@ messages = [
|
||||
{"type": "image"},
|
||||
{"type": "text", "text": "And how about this image?"},
|
||||
]
|
||||
},
|
||||
},
|
||||
]
|
||||
```
|
||||
|
||||
@@ -180,11 +180,11 @@ def model_inference(
|
||||
if acc_text.endswith("<end_of_utterance>"):
|
||||
acc_text = acc_text[:-18]
|
||||
yield acc_text
|
||||
|
||||
|
||||
thread.join()
|
||||
```
|
||||
|
||||
Now let's call the `model_inference` function we created and stream the values.
|
||||
Now let's call the `model_inference` function we created and stream the values.
|
||||
|
||||
```python
|
||||
generator = model_inference(
|
||||
@@ -204,7 +204,7 @@ for value in generator:
|
||||
|
||||
## Fit models in smaller hardware
|
||||
|
||||
VLMs are often large and need to be optimized to fit on smaller hardware. Transformers supports many model quantization libraries, and here we will only show int8 quantization with [Quanto](./quantization/quanto#quanto). int8 quantization offers memory improvements up to 75 percent (if all weights are quantized). However it is no free lunch, since 8-bit is not a CUDA-native precision, the weights are quantized back and forth on the fly, which adds up to latency.
|
||||
VLMs are often large and need to be optimized to fit on smaller hardware. Transformers supports many model quantization libraries, and here we will only show int8 quantization with [Quanto](./quantization/quanto#quanto). int8 quantization offers memory improvements up to 75 percent (if all weights are quantized). However it is no free lunch, since 8-bit is not a CUDA-native precision, the weights are quantized back and forth on the fly, which adds up to latency.
|
||||
|
||||
First, install dependencies.
|
||||
|
||||
@@ -215,18 +215,20 @@ pip install -U quanto bitsandbytes
|
||||
To quantize a model during loading, we need to first create [`QuantoConfig`]. Then load the model as usual, but pass `quantization_config` during model initialization.
|
||||
|
||||
```python
|
||||
from transformers import Idefics2ForConditionalGeneration, AutoTokenizer, QuantoConfig
|
||||
from transformers import AutoModelForImageTextToText, QuantoConfig
|
||||
|
||||
model_id = "HuggingFaceM4/idefics2-8b"
|
||||
quantization_config = QuantoConfig(weights="int8")
|
||||
quantized_model = Idefics2ForConditionalGeneration.from_pretrained(model_id, device_map="cuda", quantization_config=quantization_config)
|
||||
quantized_model = AutoModelForImageTextToText.from_pretrained(
|
||||
model_id, device_map="cuda", quantization_config=quantization_config
|
||||
)
|
||||
```
|
||||
|
||||
And that's it, we can use the model the same way with no changes.
|
||||
And that's it, we can use the model the same way with no changes.
|
||||
|
||||
## Further Reading
|
||||
|
||||
Here are some more resources for the image-text-to-text task.
|
||||
|
||||
- [Image-text-to-text task page](https://huggingface.co/tasks/image-text-to-text) covers model types, use cases, datasets, and more.
|
||||
- [Image-text-to-text task page](https://huggingface.co/tasks/image-text-to-text) covers model types, use cases, datasets, and more.
|
||||
- [Vision Language Models Explained](https://huggingface.co/blog/vlms) is a blog post that covers everything about vision language models and supervised fine-tuning using [TRL](https://huggingface.co/docs/trl/en/index).
|
||||
|
||||
Reference in New Issue
Block a user