Add support for Pixtral (#33449)
* initial commit * gloups * updates * work * weights match * nits * nits * updates to support the tokenizer :) * updates * Pixtral processor (#33454) * rough outline * Add in image break and end tokens * Fix * Udo some formatting changes * Set patch_size default * Fix * Fix token expansion * nit in conversion script * Fix image token list creation * done * add expected results * Process list of list of images (#33465) * updates * working image and processor * this is the expected format * some fixes * push current updated * working mult images! * add a small integration test * Uodate configuration docstring * Formatting * Config docstring fix * simplify model test * fixup modeling and etests * Return BatchMixFeature in image processor * fix some copies * update * nits * Update model docstring * Apply suggestions from code review * Fix up * updates * revert modeling changes * update * update * fix load safe * addd liscence * update * use pixel_values as required by the model * skip some tests and refactor * Add pixtral image processing tests (#33476) * Image processing tests * Add processing tests * woops * defaults reflect pixtral image processor * fixup post merge * images -> pixel values * oups sorry Mr docbuilder * isort * fix * fix processor tests * small fixes * nit * update * last nits * oups this was really breaking! * nits * is composition needs to be true --------- Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
This commit is contained in:
@@ -569,3 +569,50 @@ class LlavaForConditionalGenerationIntegrationTest(unittest.TestCase):
|
||||
|
||||
# check that both inputs are handled correctly and generate the same output
|
||||
self.assertListEqual(output_expanded[:, -20:].tolist(), output[:, -20:].tolist())
|
||||
|
||||
@slow
|
||||
@require_bitsandbytes
|
||||
def test_pixtral(self):
|
||||
model_id = "hf-internal-testing/pixtral-12b"
|
||||
model = LlavaForConditionalGeneration.from_pretrained(model_id)
|
||||
processor = AutoProcessor.from_pretrained(model_id)
|
||||
|
||||
IMG_URLS = [
|
||||
Image.open(requests.get("https://picsum.photos/id/237/400/300", stream=True).raw),
|
||||
Image.open(requests.get("https://picsum.photos/id/231/200/300", stream=True).raw),
|
||||
Image.open(requests.get("https://picsum.photos/id/27/500/500", stream=True).raw),
|
||||
Image.open(requests.get("https://picsum.photos/id/17/150/600", stream=True).raw),
|
||||
]
|
||||
PROMPT = "<s>[INST]Describe the images.\n[IMG][IMG][IMG][IMG][/INST]"
|
||||
|
||||
# image = Image.open(requests.get(url, stream=True).raw)
|
||||
inputs = processor(text=PROMPT, images=IMG_URLS, return_tensors="pt").to("cuda")
|
||||
generate_ids = model.generate(**inputs, max_new_tokens=500)
|
||||
ouptut = processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
|
||||
|
||||
# fmt: off
|
||||
EXPECTED_GENERATION = """
|
||||
Describe the images.
|
||||
Sure, let's break down each image description:
|
||||
|
||||
1. **Image 1:**
|
||||
- **Description:** A black dog with a glossy coat is sitting on a wooden floor. The dog has a focused expression and is looking directly at the camera.
|
||||
- **Details:** The wooden floor has a rustic appearance with visible wood grain patterns. The dog's eyes are a striking color, possibly brown or amber, which contrasts with its black fur.
|
||||
|
||||
2. **Image 2:**
|
||||
- **Description:** A scenic view of a mountainous landscape with a winding road cutting through it. The road is surrounded by lush green vegetation and leads to a distant valley.
|
||||
- **Details:** The mountains are rugged with steep slopes, and the sky is clear, indicating good weather. The winding road adds a sense of depth and perspective to the image.
|
||||
|
||||
3. **Image 3:**
|
||||
- **Description:** A beach scene with waves crashing against the shore. There are several people in the water and on the beach, enjoying the waves and the sunset.
|
||||
- **Details:** The waves are powerful, creating a dynamic and lively atmosphere. The sky is painted with hues of orange and pink from the setting sun, adding a warm glow to the scene.
|
||||
|
||||
4. **Image 4:**
|
||||
- **Description:** A garden path leading to a large tree with a bench underneath it. The path is bordered by well-maintained grass and flowers.
|
||||
- **Details:** The path is made of small stones or gravel, and the tree provides a shaded area with the bench invitingly placed beneath it. The surrounding area is lush and green, suggesting a well-kept garden.
|
||||
|
||||
Each image captures a different scene, from a close-up of a dog to expansive natural landscapes, showcasing various elements of nature and human interaction with it.
|
||||
"""
|
||||
# fmt: on
|
||||
# check that both inputs are handled correctly and generate the same output
|
||||
self.assertListEqual(ouptut, EXPECTED_GENERATION)
|
||||
|
||||
Reference in New Issue
Block a user