Add Idefics2/3 and SmolVLM Fast image processors + improvements for fast image processors (#38157)
* add working idefics2 fast and improvements for fast nested images processing * add fast image processors idefics 3 and smolvlm * cleanup tests * fic doc idefics2 * PR review and fix issues after merge * Force providing disable_grouping to group_images_by_shape * simplify group_images_by_shape * fix modular * Fix nits after review
This commit is contained in:
@@ -162,7 +162,7 @@ To load and run a model using Flash Attention-2, simply change the code snippet
|
||||
```diff
|
||||
model = Idefics2ForConditionalGeneration.from_pretrained(
|
||||
"HuggingFaceM4/idefics2-8b",
|
||||
+ torch_dtype=torch.float16,
|
||||
+ torch_dtype=torch.float16,
|
||||
+ attn_implementation="flash_attention_2",
|
||||
).to(device)
|
||||
```
|
||||
@@ -184,7 +184,7 @@ Quantizing a model is as simple as passing a `quantization_config` to the model.
|
||||
+ )
|
||||
model = Idefics2ForConditionalGeneration.from_pretrained(
|
||||
"HuggingFaceM4/idefics2-8b",
|
||||
+ torch_dtype=torch.float16,
|
||||
+ torch_dtype=torch.float16,
|
||||
+ quantization_config=quantization_config,
|
||||
).to(device)
|
||||
```
|
||||
@@ -218,7 +218,10 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h
|
||||
[[autodoc]] Idefics2ImageProcessor
|
||||
- preprocess
|
||||
|
||||
## Idefics2ImageProcessorFast
|
||||
[[autodoc]] Idefics2ImageProcessorFast
|
||||
- preprocess
|
||||
|
||||
## Idefics2Processor
|
||||
[[autodoc]] Idefics2Processor
|
||||
- __call__
|
||||
- __call__
|
||||
|
||||
@@ -80,6 +80,9 @@ This model was contributed by [amyeroberts](https://huggingface.co/amyeroberts)
|
||||
[[autodoc]] Idefics3ImageProcessor
|
||||
- preprocess
|
||||
|
||||
## Idefics3ImageProcessorFast
|
||||
[[autodoc]] Idefics3ImageProcessorFast
|
||||
- preprocess
|
||||
|
||||
## Idefics3Processor
|
||||
[[autodoc]] Idefics3Processor
|
||||
|
||||
@@ -32,7 +32,7 @@ SmolVLM2 is an adaptation of the Idefics3 model with two main differences:
|
||||
|
||||
Input images are processed either by upsampling (if resizing is enabled) or at their original resolution. The resizing behavior depends on two parameters: do_resize and size.
|
||||
|
||||
Videos should not be upsampled.
|
||||
Videos should not be upsampled.
|
||||
|
||||
If `do_resize` is set to `True`, the model resizes images so that the longest edge is 4*512 pixels by default.
|
||||
The default resizing behavior can be customized by passing a dictionary to the `size` parameter. For example, `{"longest_edge": 4 * 512}` is the default, but you can change it to a different value if needed.
|
||||
@@ -192,11 +192,14 @@ print(generated_texts[0])
|
||||
[[autodoc]] SmolVLMForConditionalGeneration
|
||||
- forward
|
||||
|
||||
|
||||
## SmolVLMImageProcessor
|
||||
[[autodoc]] SmolVLMImageProcessor
|
||||
- preprocess
|
||||
|
||||
## SmolVLMImageProcessorFast
|
||||
[[autodoc]] SmolVLMImageProcessorFast
|
||||
- preprocess
|
||||
|
||||
## SmolVLMVideoProcessor
|
||||
[[autodoc]] SmolVLMVideoProcessor
|
||||
- preprocess
|
||||
|
||||
Reference in New Issue
Block a user