fix multi-image case for llava-onevision (#38084)

* _get_padding_size module * do not patchify images when processing multi image * modify llava onevision image processor fast * tensor to list of tensors * backward compat * reuse pad_to_square in llave & some clarification * add to doc * fix: consider no image cases (text only or video) * add integration test * style & repo_consistency
2025-05-21 18:50:46 +09:00
parent a21f11fca2
commit 101b3fa4ea
13 changed files with 620 additions and 93 deletions
--- a/docs/source/en/model_doc/llava_onevision.md
+++ b/docs/source/en/model_doc/llava_onevision.md
@@ -147,7 +147,7 @@ print(processor.decode(output[0], skip_special_tokens=True))

 ### Multi image inference

-LLaVa-OneVision can perform inference with multiple images as input, where images either belong to the same prompt or different prompts (in batched inference). For that you have to use checkpoints with an "ov" suffix. Here is how you can do it:
+LLaVa-OneVision can perform inference with multiple images as input, where images either belong to the same prompt or different prompts (in batched inference). For that you have to use checkpoints with an "ov" suffix. For multi-image cases, we recommend using a **nested list of images** as input. Otherwise, every image will be patchified and consume a lot of memory. Here is how you can do it:

 ```python
 import requests