Refactoring of ImageProcessorFast (#35069)

* add init and base image processing functions

* add add_fast_image_processor to transformers-cli

* add working fast image processor clip

* add fast image processor to doc, working tests

* remove "to be implemented" SigLip

* fix unprotected import

* fix unprotected vision import

* update ViTImageProcessorFast

* increase threshold slow fast equivalence

* add fast img blip

* add fast class in tests with cli

* improve cli

* add fast image processor convnext

* add LlavaPatchingMixin and fast image processor for llava_next and llava_onevision

* add device kwarg to ImagesKwargs for fast processing on cuda

* cleanup

* fix unprotected import

* group images by sizes and add batch processing

* Add batch equivalence tests, skip when center_crop is used

* cleanup

* update init and cli

* fix-copies

* refactor convnext, cleanup base

* fix

* remove patching mixins, add piped torchvision transforms for ViT

* fix unbatched processing

* fix f strings

* protect imports

* change llava onevision to class transforms (test)

* fix convnext

* improve formatting (following Pavel's review)

* fix handling device arg

* improve cli

* fix

* fix inits

* Add distinction between preprocess and _preprocess, and support for arbitrary kwargs through valid_extra_kwargs

* uniformize qwen2_vl fast

* fix docstrings

* add fast image processor llava

* remove min_pixels max_pixels from accepted size

* nit

* nit

* refactor fast image processors docstrings

* cleanup and remove fast class transforms

* update add fast image processor transformers cli

* cleanup docstring

* uniformize pixtral fast and make _process_image explicit

* fix prepare image structure llava next/onevision

* Use typed kwargs instead of explicit args

* nit fix import Unpack

* clearly separate pops and gets in base preprocess. Use explicit typed kwargs

* make qwen2_vl preprocess arguments hashable
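
Note on "group images by sizes and add batch processing": the idea is that images sharing the same height and width can be stacked and transformed in one batched call, after which the results are put back in the caller's original order. The following is a minimal sketch of that pattern only, assuming plain nested lists as stand-ins for image tensors. The names `group_by_shape`, `rescale_batch` and `restore_order` are hypothetical and are not the helpers introduced by this commit.

```python
from collections import defaultdict


def group_by_shape(images):
    """Group images (nested lists, rows x cols) by their (height, width)."""
    grouped = defaultdict(list)
    positions = []  # positions[i] = (shape, index inside that shape's group)
    for image in images:
        shape = (len(image), len(image[0]))
        positions.append((shape, len(grouped[shape])))
        grouped[shape].append(image)
    return dict(grouped), positions


def rescale_batch(batch, factor):
    """Stand-in for a batched op: scale every pixel of every image in the batch."""
    return [[[pixel * factor for pixel in row] for row in image] for image in batch]


def restore_order(processed, positions):
    """Return processed images in the order the caller supplied them."""
    return [processed[shape][idx] for shape, idx in positions]


# Hypothetical input: two 2x2 images and one 1x3 image, interleaved.
images = [
    [[2, 4], [6, 8]],   # 2x2
    [[10, 20, 30]],     # 1x3
    [[1, 3], [5, 7]],   # 2x2, lands in the same group as the first image
]
grouped, positions = group_by_shape(images)
# One call per distinct shape instead of one call per image.
processed = {shape: rescale_batch(batch, 0.5) for shape, batch in grouped.items()}
result = restore_order(processed, positions)
```

With real tensors, each group would typically be stacked into a single batch tensor before the transform runs, so the per-image overhead is paid once per distinct size.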
Yoni Gozlan
2025-02-04 17:52:31 -05:00
committed by GitHub
parent 8d73a38606
commit fa56dcc2ab
66 changed files with 4047 additions and 2244 deletions

@@ -61,6 +61,11 @@ The original code can be found [here](https://github.com/salesforce/BLIP).
[[autodoc]] BlipImageProcessor
- preprocess
## BlipImageProcessorFast
[[autodoc]] BlipImageProcessorFast
- preprocess
<frameworkcontent>
<pt>

@@ -251,6 +251,11 @@ The resource should ideally demonstrate something new instead of duplicating an
[[autodoc]] CLIPImageProcessor
- preprocess
## CLIPImageProcessorFast
[[autodoc]] CLIPImageProcessorFast
- preprocess
## CLIPFeatureExtractor
[[autodoc]] CLIPFeatureExtractor

@@ -64,6 +64,11 @@ If you're interested in submitting a resource to be included here, please feel f
[[autodoc]] ConvNextImageProcessor
- preprocess
## ConvNextImageProcessorFast
[[autodoc]] ConvNextImageProcessorFast
- preprocess
<frameworkcontent>
<pt>

@@ -125,6 +125,11 @@ If you're interested in submitting a resource to be included here, please feel f
[[autodoc]] DeiTImageProcessor
- preprocess
## DeiTImageProcessorFast
[[autodoc]] DeiTImageProcessorFast
- preprocess
<frameworkcontent>
<pt>

@@ -195,6 +195,11 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h
[[autodoc]] LlavaImageProcessor
- preprocess
## LlavaImageProcessorFast
[[autodoc]] LlavaImageProcessorFast
- preprocess
## LlavaProcessor
[[autodoc]] LlavaProcessor

@@ -288,6 +288,11 @@ model = AutoModelForImageTextToText.from_pretrained(
[[autodoc]] LlavaNextImageProcessor
- preprocess
## LlavaNextImageProcessorFast
[[autodoc]] LlavaNextImageProcessorFast
- preprocess
## LlavaNextProcessor
[[autodoc]] LlavaNextProcessor

@@ -100,8 +100,8 @@ import torch
from PIL import Image
import requests
processor = AutoProcessor.from_pretrained("llava-hf/llava-onevision-qwen2-7b-ov-hf")
model = LlavaOnevisionForConditionalGeneration.from_pretrained("llava-hf/llava-onevision-qwen2-7b-ov-hf", torch_dtype=torch.float16, low_cpu_mem_usage=True)
processor = AutoProcessor.from_pretrained("llava-hf/llava-onevision-qwen2-7b-ov-hf")
model = LlavaOnevisionForConditionalGeneration.from_pretrained("llava-hf/llava-onevision-qwen2-7b-ov-hf", torch_dtype=torch.float16, low_cpu_mem_usage=True)
model.to("cuda:0")
# prepare image and text prompt, using the appropriate prompt template
@@ -298,8 +298,8 @@ First make sure to install flash-attn. Refer to the [original repository of Flas
from transformers import LlavaOnevisionForConditionalGeneration
model = LlavaOnevisionForConditionalGeneration.from_pretrained(
model_id,
torch_dtype=torch.float16,
model_id,
torch_dtype=torch.float16,
low_cpu_mem_usage=True,
use_flash_attention_2=True
).to(0)
@@ -318,6 +318,11 @@ model = LlavaOnevisionForConditionalGeneration.from_pretrained(
[[autodoc]] LlavaOnevisionImageProcessor
## LlavaOnevisionImageProcessorFast
[[autodoc]] LlavaOnevisionImageProcessorFast
- preprocess
## LlavaOnevisionVideoProcessor
[[autodoc]] LlavaOnevisionVideoProcessor

@@ -214,6 +214,11 @@ Below is an expected speedup diagram that compares inference time between the na
[[autodoc]] SiglipImageProcessor
- preprocess
## SiglipImageProcessorFast
[[autodoc]] SiglipImageProcessorFast
- preprocess
## SiglipProcessor
[[autodoc]] SiglipProcessor

@@ -61,6 +61,11 @@ BLIP は、次のようなさまざまなマルチモーダル タスクを実
[[autodoc]] BlipImageProcessor
- preprocess
## BlipImageProcessorFast
[[autodoc]] BlipImageProcessorFast
- preprocess
<frameworkcontent>
<pt>

@@ -133,6 +133,11 @@ CLIP を使い始めるのに役立つ公式 Hugging Face およびコミュニ
[[autodoc]] CLIPImageProcessor
- preprocess
## CLIPImageProcessorFast
[[autodoc]] CLIPImageProcessorFast
- preprocess
## CLIPFeatureExtractor
[[autodoc]] CLIPFeatureExtractor

@@ -64,6 +64,11 @@ ConvNeXT の使用を開始するのに役立つ公式 Hugging Face およびコ
[[autodoc]] ConvNextImageProcessor
- preprocess
## ConvNextImageProcessorFast
[[autodoc]] ConvNextImageProcessorFast
- preprocess
<frameworkcontent>
<pt>

@@ -98,6 +98,11 @@ DeiT を始めるのに役立つ公式 Hugging Face およびコミュニティ
[[autodoc]] DeiTImageProcessor
- preprocess
## DeiTImageProcessorFast
[[autodoc]] DeiTImageProcessorFast
- preprocess
<frameworkcontent>
<pt>