Process inputs directly in apply_chat_template in image-text-to-text pipeline (#35616)

* tokenize inputs directly in apply_chat_template

* refactor processing

* revert changes processing llava

* Update docs

* fix issue with str being iterable

* add test chat text only

* change function name
This commit is contained in:
Yoni Gozlan
2025-04-23 13:31:33 -04:00
committed by GitHub
parent 80ea2c05c2
commit 5cd6b64059
3 changed files with 186 additions and 54 deletions

View File

@@ -160,7 +160,48 @@ outputs[0]["generated_text"]
# with a yellow center in the foreground. The flower is surrounded by red and white flowers with green stems
```
## Streaming
If you prefer, you can also load the images separately and pass them to the pipeline like so:
```python
pipe = pipeline("image-text-to-text", model="HuggingFaceTB/SmolVLM-256M-Instruct")
img_urls = [
"https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/cats.png",
"https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg",
]
images = [
Image.open(requests.get(img_urls[0], stream=True).raw),
Image.open(requests.get(img_urls[1], stream=True).raw),
]
messages = [
{
"role": "user",
"content": [
{"type": "image"},
{"type": "image"},
{"type": "text", "text": "What do you see in these images?"},
],
}
]
outputs = pipe(text=messages, images=images, max_new_tokens=50, return_full_text=False)
outputs[0]["generated_text"]
" In the first image, there are two cats sitting on a plant. In the second image, there are flowers with a pinkish hue."
```
The images will still be included in the `"input_text"` field of the output:
```python
outputs[0]['input_text']
"""
[{'role': 'user',
'content': [{'type': 'image',
'image': <PIL.PngImagePlugin.PngImageFile image mode=RGBA size=622x412>},
{'type': 'image',
'image': <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=5184x3456>},
{'type': 'text', 'text': 'What do you see in these images?'}]}]## Streaming
"""
```
We can use [text streaming](./generation_strategies#streaming) for a better generation experience. Transformers supports streaming with the [`TextStreamer`] or [`TextIteratorStreamer`] classes. We will use the [`TextIteratorStreamer`] with IDEFICS-8B.