Add auto model for image-text-to-text (#32472)

* Add Auto model for image-text-to-text

* Remove donut from processing auto, add chameleon to image text to text models

* add qwen2_vl and llava_onevision

* add pixtral to auto model for image-text-to-text

* add mllama and idefics3

* remove models in IGNORE_NON_AUTO_CONFIGURED

* add AutoModelForImageTextToText to tests and doc
Yoni Gozlan
2024-10-08 14:26:43 +02:00
committed by GitHub
parent 0dbc7090ba
commit e2001c3413
11 changed files with 89 additions and 28 deletions
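
The idea behind an auto class such as the one this commit adds can be sketched without the library. The block below is a simplified, self-contained illustration and not the transformers implementation: every name prefixed with `Toy` or `TOY_` is hypothetical, and the real class resolves the configuration from the checkpoint itself instead of taking it as an argument. It shows only the dispatch step, which looks up the config's `model_type` in a mapping and instantiates the matching class.

```python
from dataclasses import dataclass


# Hypothetical stand-ins for concrete model classes; they only record the checkpoint name.
class ToyLlavaNextModel:
    def __init__(self, checkpoint: str) -> None:
        self.checkpoint = checkpoint


class ToyQwen2VLModel:
    def __init__(self, checkpoint: str) -> None:
        self.checkpoint = checkpoint


# Illustrative mapping from a config's model_type to the class that handles it.
TOY_IMAGE_TEXT_TO_TEXT_MAPPING = {
    "llava_next": ToyLlavaNextModel,
    "qwen2_vl": ToyQwen2VLModel,
}


@dataclass
class ToyConfig:
    model_type: str


class ToyAutoModelForImageTextToText:
    """Picks a concrete class from the mapping based on config.model_type."""

    @classmethod
    def from_config(cls, config: ToyConfig, checkpoint: str):
        try:
            model_cls = TOY_IMAGE_TEXT_TO_TEXT_MAPPING[config.model_type]
        except KeyError:
            supported = ", ".join(sorted(TOY_IMAGE_TEXT_TO_TEXT_MAPPING))
            raise ValueError(
                f"Unsupported model_type {config.model_type!r}; expected one of: {supported}"
            ) from None
        return model_cls(checkpoint)


# The caller never names the concrete class; the mapping decides.
model = ToyAutoModelForImageTextToText.from_config(
    ToyConfig(model_type="llava_next"),
    "llava-hf/llava-v1.6-mistral-7b-hf",
)
print(type(model).__name__)
```

Under this scheme, supporting another architecture amounts to adding one entry to the mapping, which is presumably what the "add ..." bullets in the commit message correspond to.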


@@ -381,3 +381,7 @@ The following auto classes are available for the following multimodal tasks.
### FlaxAutoModelForVision2Seq
[[autodoc]] FlaxAutoModelForVision2Seq
+### AutoModelForImageTextToText
+[[autodoc]] AutoModelForImageTextToText


@@ -166,10 +166,10 @@ LLaVa-Next can perform inference with multiple images as input, where images eit
import requests
from PIL import Image
import torch
-from transformers import AutoProcessor, LlavaNextForConditionalGeneration
+from transformers import AutoProcessor, AutoModelForImageTextToText
# Load the model in half-precision
-model = LlavaNextForConditionalGeneration.from_pretrained("llava-hf/llava-v1.6-mistral-7b-hf", torch_dtype=torch.float16, device_map="auto")
+model = AutoModelForImageTextToText.from_pretrained("llava-hf/llava-v1.6-mistral-7b-hf", torch_dtype=torch.float16, device_map="auto")
processor = AutoProcessor.from_pretrained("llava-hf/llava-v1.6-mistral-7b-hf")
# Get three different images
@@ -246,7 +246,7 @@ We value your feedback to help identify bugs before the full release! Check out
Simply change the snippet above with:
```python
-from transformers import LlavaNextForConditionalGeneration, BitsAndBytesConfig
+from transformers import AutoModelForImageTextToText, BitsAndBytesConfig
# specify how to quantize the model
quantization_config = BitsAndBytesConfig(
@@ -255,7 +255,7 @@ quantization_config = BitsAndBytesConfig(
bnb_4bit_compute_dtype=torch.float16,
)
-model = LlavaNextForConditionalGeneration.from_pretrained("llava-hf/llava-v1.6-mistral-7b-hf", quantization_config=quantization_config, device_map="auto")
+model = AutoModelForImageTextToText.from_pretrained("llava-hf/llava-v1.6-mistral-7b-hf", quantization_config=quantization_config, device_map="auto")
```
### Use Flash-Attention 2 to further speed-up generation
@@ -263,9 +263,9 @@ model = LlavaNextForConditionalGeneration.from_pretrained("llava-hf/llava-v1.6-m
First make sure to install flash-attn. Refer to the [original repository of Flash Attention](https://github.com/Dao-AILab/flash-attention) regarding that package installation. Simply change the snippet above with:
```python
-from transformers import LlavaNextForConditionalGeneration
+from transformers import AutoModelForImageTextToText
-model = LlavaNextForConditionalGeneration.from_pretrained(
+model = AutoModelForImageTextToText.from_pretrained(
model_id,
torch_dtype=torch.float16,
low_cpu_mem_usage=True,