Uniformize kwargs for Pixtral processor (#33521)

* add uniformized pixtral and kwargs

* update doc

* fix _validate_images_text_input_order

* nit
Yoni Gozlan
2024-09-17 14:44:27 -04:00
committed by GitHub
parent c29a8694b0
commit d8500cd229
7 changed files with 255 additions and 62 deletions

@@ -24,7 +24,7 @@ The Pixtral model was released by the Mistral AI team on [Vllm](https://github.c
Tips:
- Pixtral is a multimodal model; its main contributions are 2D RoPE on the images and support for arbitrary image sizes (the images are neither padded together nor resized).
-- This model follows the `Llava` family, meaning image embeddings are inserted in place of the `[IMG]` token placeholders.
+- This model follows the `Llava` family, meaning image embeddings are inserted in place of the `[IMG]` token placeholders.
- The format for one or multiple prompts is the following:
```
"<s>[INST][IMG]\nWhat are the things I should be cautious about when I visit this place?[/INST]"
@@ -35,7 +35,7 @@ This model was contributed by [amyeroberts](https://huggingface.co/amyeroberts)
Here is an example of how to run it:
-```python
+```python
from transformers import LlavaForConditionalGeneration, AutoProcessor
from PIL import Image
@@ -51,7 +51,7 @@ IMG_URLS = [
]
PROMPT = "<s>[INST]Describe the images.\n[IMG][IMG][IMG][IMG][/INST]"
-inputs = processor(text=PROMPT, images=IMG_URLS, return_tensors="pt").to("cuda")
+inputs = processor(images=IMG_URLS, text=PROMPT, return_tensors="pt").to("cuda")
generate_ids = model.generate(**inputs, max_new_tokens=500)
output = processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
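
# A minimal sketch of building the prompt, assuming the layout used above:
# the question first, then one [IMG] placeholder per image, wrapped in [INST] ... [/INST].
# `build_pixtral_prompt` is a hypothetical helper for illustration, not part of transformers.
def build_pixtral_prompt(question: str, num_images: int) -> str:
    if num_images < 1:
        raise ValueError("expected at least one image")
    placeholders = "[IMG]" * num_images  # one placeholder per image
    return f"<s>[INST]{question}\n{placeholders}[/INST]"

# With four placeholders this reproduces the PROMPT string above.
example_prompt = build_pixtral_prompt("Describe the images.", num_images=4)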