Uniformize kwargs for chameleon processor (#32181)
* uniformize kwargs of Chameleon
* fix linter nit
* rm stride default
* add tests for chameleon processor
* fix tests
* add comment on get_component
* rm Chameleon's slow tokenizer
* add check order images text + nit
* update docs and tests
* Fix LlamaTokenizer tests
* fix gated repo access
* fix wrong import

---------

Co-authored-by: yonigozlan <yoni.gozlan@huggingface.co>
commit 0a21381ba3 (parent f2c388e3f9), committed by GitHub
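The "add check order images text" item and the reordered calls in the hunks below both concern how `images` and `text` reach the processor. The following is a minimal sketch of why passing them by keyword is more robust than relying on position. `toy_processor` is a hypothetical stand-in, not the transformers API, and the images-first parameter order is an assumption made for illustration.

```python
# Hypothetical stand-in for a processor call; NOT the transformers API.
# Assumption: the parameter order is images first, then text.
def toy_processor(images=None, text=None):
    """Report which argument ended up in which slot."""
    return {"images": images, "text": text}


prompt = "What do you see in this image?<image>"
image = object()  # placeholder for a PIL image

# Text-first positional call: the prompt lands in the `images` slot.
swapped = toy_processor(prompt, image)

# Keyword call: each value reaches its intended slot regardless of order.
by_keyword = toy_processor(text=prompt, images=image)
```

With keywords, the call site stays correct even if the positional order of the signature changes later.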
@@ -19,7 +19,7 @@ rendered properly in your Markdown viewer.
## Overview
The Chameleon model was proposed in [Chameleon: Mixed-Modal Early-Fusion Foundation Models
-](https://arxiv.org/abs/2405.09818v1) by META AI Chameleon Team. Chameleon is a Vision-Language Model that use vector quantization to tokenize images which enables the model to generate multimodal output. The model takes images and texts as input, including an interleaved format, and generates textual response. Image generation module is not released yet.
+](https://arxiv.org/abs/2405.09818v1) by META AI Chameleon Team. Chameleon is a Vision-Language Model that use vector quantization to tokenize images which enables the model to generate multimodal output. The model takes images and texts as input, including an interleaved format, and generates textual response. Image generation module is not released yet.
The abstract from the paper is the following:
@@ -61,7 +61,7 @@ The original code can be found [here](https://github.com/facebookresearch/chamel
### Single image inference
-Chameleon is a gated model so make sure to have access and login to Hugging Face Hub using a token.
+Chameleon is a gated model so make sure to have access and login to Hugging Face Hub using a token.
Here's how to load the model and perform inference in half-precision (`torch.bfloat16`):
```python
@@ -78,7 +78,7 @@ url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
image = Image.open(requests.get(url, stream=True).raw)
prompt = "What do you see in this image?<image>"
-inputs = processor(prompt, image, return_tensors="pt").to(model.device, dtype=torch.bfloat16)
+inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device, dtype=torch.bfloat16)
# autoregressively complete prompt
output = model.generate(**inputs, max_new_tokens=50)
@@ -117,7 +117,7 @@ prompts = [
# We can simply feed images in the order they have to be used in the text prompt
# Each "<image>" token uses one image leaving the next for the subsequent "<image>" tokens
-inputs = processor(text=prompts, images=[image_stop, image_cats, image_snowman], padding=True, return_tensors="pt").to(device="cuda", dtype=torch.bfloat16)
+inputs = processor(images=[image_stop, image_cats, image_snowman], text=prompts, padding=True, return_tensors="pt").to(device="cuda", dtype=torch.bfloat16)
# Generate
generate_ids = model.generate(**inputs, max_new_tokens=50)
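The comments in this hunk say that each `<image>` token consumes one image, in order. The following is a minimal sketch of that bookkeeping in plain Python, useful as a sanity check before calling the processor. `assign_images` and the sample prompts are hypothetical illustrations; they are not part of transformers, and the real `prompts` list is elided in this diff.

```python
def assign_images(prompts, images, image_token="<image>"):
    """Split a flat image list across prompts, one image per image token.

    Hypothetical helper for illustration only; not a transformers function.
    """
    assigned = []
    cursor = 0
    for prompt in prompts:
        needed = prompt.count(image_token)
        assigned.append(images[cursor:cursor + needed])
        cursor += needed
    # Every image must be consumed, and every token must have an image.
    if cursor != len(images):
        raise ValueError(
            f"{cursor} image tokens found but {len(images)} images given"
        )
    return assigned


# Illustrative prompts (assumed, not taken from the diff).
example_prompts = [
    "What do these images have in common?<image><image>",
    "What is shown in this image?<image>",
]
example_images = ["image_stop", "image_cats", "image_snowman"]
groups = assign_images(example_prompts, example_images)
# groups == [["image_stop", "image_cats"], ["image_snowman"]]
```

A mismatch between the token count and the image count raises `ValueError` here instead of surfacing later as a harder-to-read error.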
@@ -162,8 +162,8 @@ from transformers import ChameleonForConditionalGeneration
model_id = "facebook/chameleon-7b"
model = ChameleonForConditionalGeneration.from_pretrained(
-    model_id,
-    torch_dtype=torch.bfloat16,
+    model_id,
+    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    attn_implementation="flash_attention_2"
).to(0)