Refactoring of ImageProcessorFast (#35069)
* add init and base image processing functions * add add_fast_image_processor to transformers-cli * add working fast image processor clip * add fast image processor to doc, working tests * remove "to be implemented" SigLip * fix unprotected import * fix unprotected vision import * update ViTImageProcessorFast * increase threshold slow fast ewuivalence * add fast img blip * add fast class in tests with cli * improve cli * add fast image processor convnext * add LlavaPatchingMixin and fast image processor for llava_next and llava_onevision * add device kwarg to ImagesKwargs for fast processing on cuda * cleanup * fix unprotected import * group images by sizes and add batch processing * Add batch equivalence tests, skip when center_crop is used * cleanup * update init and cli * fix-copies * refactor convnext, cleanup base * fix * remove patching mixins, add piped torchvision transforms for ViT * fix unbatched processing * fix f strings * protect imports * change llava onevision to class transforms (test) * fix convnext * improve formatting (following Pavel review) * fix handling device arg * improve cli * fix * fix inits * Add distinction between preprocess and _preprocess, and support for arbitrary kwargs through valid_extra_kwargs * uniformize qwen2_vl fast * fix docstrings * add add fast image processor llava * remove min_pixels max_pixels from accepted size * nit * nit * refactor fast image processors docstrings * cleanup and remove fast class transforms * update add fast image processor transformers cli * cleanup docstring * uniformize pixtral fast and make _process_image explicit * fix prepare image structure llava next/onevision * Use typed kwargs instead of explicit args * nit fix import Unpack * clearly separate pops and gets in base preprocess. Use explicit typed kwargs * make qwen2_vl preprocess arguments hashable
This commit is contained in:
@@ -61,6 +61,11 @@ The original code can be found [here](https://github.com/salesforce/BLIP).
|
||||
[[autodoc]] BlipImageProcessor
|
||||
- preprocess
|
||||
|
||||
## BlipImageProcessorFast
|
||||
|
||||
[[autodoc]] BlipImageProcessorFast
|
||||
- preprocess
|
||||
|
||||
<frameworkcontent>
|
||||
<pt>
|
||||
|
||||
|
||||
@@ -251,6 +251,11 @@ The resource should ideally demonstrate something new instead of duplicating an
|
||||
[[autodoc]] CLIPImageProcessor
|
||||
- preprocess
|
||||
|
||||
## CLIPImageProcessorFast
|
||||
|
||||
[[autodoc]] CLIPImageProcessorFast
|
||||
- preprocess
|
||||
|
||||
## CLIPFeatureExtractor
|
||||
|
||||
[[autodoc]] CLIPFeatureExtractor
|
||||
|
||||
@@ -64,6 +64,11 @@ If you're interested in submitting a resource to be included here, please feel f
|
||||
[[autodoc]] ConvNextImageProcessor
|
||||
- preprocess
|
||||
|
||||
## ConvNextImageProcessorFast
|
||||
|
||||
[[autodoc]] ConvNextImageProcessorFast
|
||||
- preprocess
|
||||
|
||||
<frameworkcontent>
|
||||
<pt>
|
||||
|
||||
|
||||
@@ -125,6 +125,11 @@ If you're interested in submitting a resource to be included here, please feel f
|
||||
[[autodoc]] DeiTImageProcessor
|
||||
- preprocess
|
||||
|
||||
## DeiTImageProcessorFast
|
||||
|
||||
[[autodoc]] DeiTImageProcessorFast
|
||||
- preprocess
|
||||
|
||||
<frameworkcontent>
|
||||
<pt>
|
||||
|
||||
|
||||
@@ -195,6 +195,11 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h
|
||||
[[autodoc]] LlavaImageProcessor
|
||||
- preprocess
|
||||
|
||||
## LlavaImageProcessorFast
|
||||
|
||||
[[autodoc]] LlavaImageProcessorFast
|
||||
- preprocess
|
||||
|
||||
## LlavaProcessor
|
||||
|
||||
[[autodoc]] LlavaProcessor
|
||||
|
||||
@@ -288,6 +288,11 @@ model = AutoModelForImageTextToText.from_pretrained(
|
||||
[[autodoc]] LlavaNextImageProcessor
|
||||
- preprocess
|
||||
|
||||
## LlavaNextImageProcessorFast
|
||||
|
||||
[[autodoc]] LlavaNextImageProcessorFast
|
||||
- preprocess
|
||||
|
||||
## LlavaNextProcessor
|
||||
|
||||
[[autodoc]] LlavaNextProcessor
|
||||
|
||||
@@ -100,8 +100,8 @@ import torch
|
||||
from PIL import Image
|
||||
import requests
|
||||
|
||||
processor = AutoProcessor.from_pretrained("llava-hf/llava-onevision-qwen2-7b-ov-hf")
|
||||
model = LlavaOnevisionForConditionalGeneration.from_pretrained("llava-hf/llava-onevision-qwen2-7b-ov-hf", torch_dtype=torch.float16, low_cpu_mem_usage=True)
|
||||
processor = AutoProcessor.from_pretrained("llava-hf/llava-onevision-qwen2-7b-ov-hf")
|
||||
model = LlavaOnevisionForConditionalGeneration.from_pretrained("llava-hf/llava-onevision-qwen2-7b-ov-hf", torch_dtype=torch.float16, low_cpu_mem_usage=True)
|
||||
model.to("cuda:0")
|
||||
|
||||
# prepare image and text prompt, using the appropriate prompt template
|
||||
@@ -298,8 +298,8 @@ First make sure to install flash-attn. Refer to the [original repository of Flas
|
||||
from transformers import LlavaOnevisionForConditionalGeneration
|
||||
|
||||
model = LlavaOnevisionForConditionalGeneration.from_pretrained(
|
||||
model_id,
|
||||
torch_dtype=torch.float16,
|
||||
model_id,
|
||||
torch_dtype=torch.float16,
|
||||
low_cpu_mem_usage=True,
|
||||
use_flash_attention_2=True
|
||||
).to(0)
|
||||
@@ -318,6 +318,11 @@ model = LlavaOnevisionForConditionalGeneration.from_pretrained(
|
||||
|
||||
[[autodoc]] LlavaOnevisionImageProcessor
|
||||
|
||||
## LlavaOnevisionImageProcessorFast
|
||||
|
||||
[[autodoc]] LlavaOnevisionImageProcessorFast
|
||||
- preprocess
|
||||
|
||||
## LlavaOnevisionVideoProcessor
|
||||
|
||||
[[autodoc]] LlavaOnevisionVideoProcessor
|
||||
|
||||
@@ -214,6 +214,11 @@ Below is an expected speedup diagram that compares inference time between the na
|
||||
[[autodoc]] SiglipImageProcessor
|
||||
- preprocess
|
||||
|
||||
## SiglipImageProcessorFast
|
||||
|
||||
[[autodoc]] SiglipImageProcessorFast
|
||||
- preprocess
|
||||
|
||||
## SiglipProcessor
|
||||
|
||||
[[autodoc]] SiglipProcessor
|
||||
|
||||
Reference in New Issue
Block a user