[Model] Cohere2 Vision (#39810)
* Add cohere2_vision to support CohereLabs/command-a-vision-07-2025 * update and add modualr file * update processors and check with orig impl later * delete unused files * image processor reduce LOC and re-use GotOCR2 * update the config to use modular * model tests pass * processor fixes * check model outputs decorator * address one more comment * Update tokens. Temp - need to read from tokenizer' * fix for multi-gpu * Fix image token handling * upadte image token expansion logic * fix a few issues with remote code loading * not related but modular forces us to change all files now * Add overview and code sample to cohere vision docs * add scripts. TMP. * Update inference script * Create script * set dtype in export script * TO revert: modular export fix * Fix scripts * Revert "TO revert: modular export fix" This reverts commit bdb2f305b61027a05f0032ce70d6ca698879191c. * Use modular weights * Upload to hub Removed OOD weights ad script * Updated docs * fix import error Update docs Added pipeline test * Updated docs * Run modular script remove modular for config Added patch_size Added docstrings in modular Fix OOM Add docs, fixup integration tests. 8-gpu passing * tiny updates * address comments + fixup * add test for chat template * check model outputs workaround * aya vision fix check model inputs * Revert "add test for chat template" This reverts commit 42c756e397f588d76b449ff1f93292d8ee0202d8. * reveert more changes * last revert * skip and merge * faulty copy from --------- Co-authored-by: Julian Mack <julian.mack@cohere.com> Co-authored-by: kyle-cohere <kyle@cohere.com>
This commit is contained in:
committed by
GitHub
parent
6c3f27ba61
commit
e1688d28d3
@@ -411,6 +411,8 @@
|
||||
title: Cohere
|
||||
- local: model_doc/cohere2
|
||||
title: Cohere2
|
||||
- local: model_doc/cohere2_vision
|
||||
title: Cohere2Vision
|
||||
- local: model_doc/convbert
|
||||
title: ConvBERT
|
||||
- local: model_doc/cpm
|
||||
|
||||
92
docs/source/en/model_doc/cohere2_vision.md
Normal file
92
docs/source/en/model_doc/cohere2_vision.md
Normal file
@@ -0,0 +1,92 @@
|
||||
# Command A Vision
|
||||
|
||||
<div class="flex flex-wrap space-x-1">
|
||||
<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
|
||||
<img alt="FlashAttention" src="https://img.shields.io/badge/%E2%9A%A1%EF%B8%8E%20FlashAttention-eae0c8?style=flat">
|
||||
<img alt="SDPA" src="https://img.shields.io/badge/SDPA-DE3412?style=flat&logo=pytorch&logoColor=white">
|
||||
<img alt="Tensor parallelism" src="https://img.shields.io/badge/Tensor%20parallelism-06b6d4?style=flat&logoColor=white">
|
||||
</div>
|
||||
|
||||
## Overview
|
||||
|
||||
Command A Vision is a state-of-the-art multimodal model designed to seamlessly integrate visual and textual information for a wide range of applications. By combining advanced computer vision techniques with natural language processing capabilities, Command A Vision enables users to analyze, understand, and generate insights from both visual and textual data.
|
||||
|
||||
The model excels at tasks including image captioning, visual question answering, document understanding, and chart understanding. This makes it a versatile tool for AI practitioners. Its ability to process complex visual and textual inputs makes it useful in settings where text-only representations are imprecise or unavailable, like real-world image understanding and graphics-heavy document processing.
|
||||
|
||||
Command A Vision is built upon a robust architecture that leverages the latest advancements in VLMs. It's highly performant and efficient, even when dealing with large-scale datasets. The model's flexibility makes it suitable for a wide range of use cases, from content moderation and image search to medical imaging analysis and robotics.
|
||||
|
||||
## Usage tips
|
||||
|
||||
The model and image processor can be loaded as follows:
|
||||
|
||||
```python
|
||||
|
||||
import torch
|
||||
from transformers import AutoProcessor, AutoModelForImageTextToText
|
||||
|
||||
model_id = "CohereLabs/command-a-vision-07-2025"
|
||||
|
||||
processor = AutoProcessor.from_pretrained(model_id)
|
||||
model = AutoModelForImageTextToText.from_pretrained(
|
||||
model_id, device_map="auto", torch_dtype=torch.float16
|
||||
)
|
||||
|
||||
# Format message with the Command-A-Vision chat template
|
||||
messages = [
|
||||
{
|
||||
"role": "user",
|
||||
"content": [
|
||||
{
|
||||
"type": "image",
|
||||
"url": "https://images.pexels.com/photos/1108099/pexels-photo-1108099.jpeg",
|
||||
},
|
||||
{"type": "text", "text": "what is in this image?"},
|
||||
],
|
||||
},
|
||||
]
|
||||
|
||||
inputs = processor.apply_chat_template(
|
||||
messages,
|
||||
padding=True,
|
||||
add_generation_prompt=True,
|
||||
tokenize=True,
|
||||
return_dict=True,
|
||||
return_tensors="pt",
|
||||
).to(model.device)
|
||||
|
||||
gen_tokens = model.generate(
|
||||
**inputs,
|
||||
max_new_tokens=300,
|
||||
do_sample=True,
|
||||
temperature=0.3,
|
||||
)
|
||||
|
||||
print(
|
||||
processor.tokenizer.decode(
|
||||
gen_tokens[0][inputs.input_ids.shape[1] :], skip_special_tokens=True
|
||||
)
|
||||
)
|
||||
```
|
||||
|
||||
## Cohere2VisionConfig
|
||||
|
||||
[[autodoc]] Cohere2VisionConfig
|
||||
|
||||
## Cohere2VisionForConditionalGeneration
|
||||
|
||||
[[autodoc]] Cohere2VisionForConditionalGeneration
|
||||
- forward
|
||||
|
||||
## Cohere2VisionModel
|
||||
|
||||
[[autodoc]] Cohere2VisionModel
|
||||
- forward
|
||||
|
||||
## Cohere2VisionImageProcessorFast
|
||||
|
||||
[[autodoc]] Cohere2VisionImageProcessorFast
|
||||
- preprocess
|
||||
|
||||
## Cohere2VisionProcessor
|
||||
|
||||
[[autodoc]] Cohere2VisionProcessor
|
||||
Reference in New Issue
Block a user