[VLMs] support passing embeds along with pixels (#38467)
* VLMs can work with embeds now * update more models * fix tests * fix copies * fixup * fix * style * unskip tests * fix copies * fix tests * style * omni modality models * qwen models had extra indentation * fix some other tests * fix copies * fix test last time * unrelated changes revert * we can't rely only on embeds * delete file * de-flake mistral3 * fix qwen models * fix style * fix tests * fix copies * deflake the test * modular reverted by fixes, fix again * flaky test, overwritten * fix copies * style
This commit is contained in:
committed by
GitHub
parent
20901f1d68
commit
f8b88866f5
@@ -96,6 +96,7 @@ class Idefics3VisionText2TextModelTester:
|
||||
image_token_id=57,
|
||||
):
|
||||
self.parent = parent
|
||||
self.pad_token_id = text_config["pad_token_id"]
|
||||
self.is_training = is_training
|
||||
self.batch_size = batch_size
|
||||
self.num_images = num_images
|
||||
@@ -148,6 +149,7 @@ class Idefics3VisionText2TextModelTester:
|
||||
|
||||
# For simplicity just set the last n tokens to the image token
|
||||
n_image_tokens_per_batch = self.seq_length
|
||||
input_ids[input_ids == self.image_token_id] = self.pad_token_id
|
||||
input_ids[:, -n_image_tokens_per_batch:] = self.image_token_id
|
||||
attention_mask = input_ids.ne(1).to(torch_device)
|
||||
inputs_dict = {
|
||||
|
||||
Reference in New Issue
Block a user