[WIP] Emu3: add model (#33770)
Some checks failed
Release - Conda / build_and_package (push) Has been cancelled
Secret Leaks / trufflehog (push) Has been cancelled

* model can convert to HF and be loaded back

* nit

* works in single batch generation but hallucinates

* use the image tokens

* add image generation

* now it works

* add tests

* update

* add modulare but it doesn't work for porting docstring :(

* skip some tests

* add slow tests

* modular removed the import?

* guess this works

* update

* update

* fix copies

* fix test

* fix copies

* update

* docs

* fix tests

* last fix tests?

* pls

* repo consistency

* more style

* style

* remove file

* address comments

* tiny bits

* update after the new modular

* fix tests

* add one more cond in check attributes

* decompose down/up/mid blocks

* allow static cache generation in VLMs

* nit

* fix copies

* Update docs/source/en/model_doc/emu3.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Update docs/source/en/model_doc/emu3.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Update docs/source/en/model_doc/emu3.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Update docs/source/en/model_doc/emu3.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Update docs/source/en/model_doc/emu3.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Update docs/source/en/model_doc/emu3.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Update docs/source/en/model_doc/emu3.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Update docs/source/en/model_doc/emu3.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* fix VAE upsampling

* Update src/transformers/models/emu3/modular_emu3.py

Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>

* address comments

* state overwritten stuff explicitly

* fix copies

* add the flag for flex attn

---------

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
This commit is contained in:
Raushan Turganbay
2025-01-10 12:23:00 +01:00
committed by Arthur Zucker
parent 59e28c30fa
commit 6bc0fbcfa7
28 changed files with 5722 additions and 5 deletions

View File

@@ -860,6 +860,8 @@
title: DePlot
- local: model_doc/donut
title: Donut
- local: model_doc/emu3
title: Emu3
- local: model_doc/flava
title: FLAVA
- local: model_doc/git

View File

@@ -137,6 +137,7 @@ Flax), PyTorch, and/or TensorFlow.
| [EfficientFormer](model_doc/efficientformer) | ✅ | ✅ | ❌ |
| [EfficientNet](model_doc/efficientnet) | ✅ | ❌ | ❌ |
| [ELECTRA](model_doc/electra) | ✅ | ✅ | ✅ |
| [Emu3](model_doc/emu3) | ✅ | ❌ | ❌ |
| [EnCodec](model_doc/encodec) | ✅ | ❌ | ❌ |
| [Encoder decoder](model_doc/encoder-decoder) | ✅ | ✅ | ✅ |
| [ERNIE](model_doc/ernie) | ✅ | ❌ | ❌ |

View File

@@ -0,0 +1,179 @@
<!--Copyright 2024 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.
-->
# Emu3
## Overview
The Emu3 model was proposed in [Emu3: Next-Token Prediction is All You Need](https://arxiv.org/abs/2409.18869) by Xinlong Wang, Xiaosong Zhang, Zhengxiong Luo, Quan Sun, Yufeng Cui, Jinsheng Wang, Fan Zhang, Yueze Wang, Zhen Li, Qiying Yu, Yingli Zhao, Yulong Ao, Xuebin Min, Tao Li, Boya Wu, Bo Zhao, Bowen Zhang, Liangdong Wang, Guang Liu, Zheqi He, Xi Yang, Jingjing Liu, Yonghua Lin, Tiejun Huang, Zhongyuan Wang.
Emu3 is a multimodal LLM that uses vector quantization to tokenize images into discrete tokens. Discretized image tokens are later fused with text token ids for image and text generation. The model can additionally generate images by predicting image token ids.
The abstract from the paper is the following:
*While next-token prediction is considered a promising path towards artificial general intelligence, it has struggled to excel in multimodal tasks, which are still dominated by diffusion models (e.g., Stable Diffusion) and compositional approaches (e.g., CLIP combined with LLMs). In this paper, we introduce Emu3, a new suite of state-of-the-art multimodal models trained solely with next-token prediction. By tokenizing images, text, and videos into a discrete space, we train a single transformer from scratch on a mixture of multimodal sequences. Emu3 outperforms several well-established task-specific models in both generation and perception tasks, surpassing flagship models such as SDXL and LLaVA-1.6, while eliminating the need for diffusion or compositional architectures. Emu3 is also capable of generating high-fidelity video via predicting the next token in a video sequence. We simplify complex multimodal model designs by converging on a singular focus: tokens, unlocking great potential for scaling both during training and inference. Our results demonstrate that next-token prediction is a promising path towards building general multimodal intelligence beyond language. We open-source key techniques and models to support further research in this direction.*
Tips:
- We advise users to set `processor.tokenizer.padding_side = "left"` before batched generation as it leads to more accurate results.
- Note that the model has been trained with a specific prompt format for chatting. Use `processor.apply_chat_template(my_conversation_dict)` to correctly format your prompts.
- Emu3 has two different checkpoints for image-generation and text-generation, make sure to use the correct checkpoint when loading the model. To generate an image, it is advised to use `prefix_constraints` so that the generated tokens are sampled only from possible image tokens. See more below for usage examples.
> [!TIP]
> Emu3 implementation in Transformers uses a special image token to indicate where to merge image embeddings. The special image token isn't new and uses one of the reserved tokens: `<|extra_0|>`. You have to add `<image>` to your prompt in the place where the image should be embedded for correct generation.
This model was contributed by [RaushanTurganbay](https://huggingface.co/RaushanTurganbay).
The original code can be found [here](https://github.com/baaivision/Emu3).
## Usage example
### Text generation inference
Here's how to load the model and perform inference in half-precision (`torch.bfloat16`) to generate textual output from text or text and image inputs:
```python
from transformers import Emu3Processor, Emu3ForConditionalGeneration
import torch
from PIL import Image
import requests
processor = Emu3Processor.from_pretrained("Emu3-community/Emu3-Chat-hf")
model = Emu3ForConditionalGeneration.from_pretrained("Emu3-community/Emu3-Chat-hf", torch_dtype=torch.bfloat16, device_map="cuda")
# prepare image and text prompt
url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
image = Image.open(requests.get(url, stream=True).raw)
prompt = "What do you see in this image?<image>"
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device, dtype=torch.bfloat16)
# autoregressively complete prompt
output = model.generate(**inputs, max_new_tokens=50)
print(processor.decode(output[0], skip_special_tokens=True))
```
### Image generation inference
Emu3 can also generate images from textual input. Here is how you can do it:
```python
processor = Emu3Processor.from_pretrained("Emu3-community/Emu3-Gen-hf")
model = Emu3ForConditionalGeneration.from_pretrained("Emu3-community/Emu3-Gen-hf", torch_dtype="bfloat16", device_map="auto", attn_implementation="flash_attention_2")
inputs = processor(
text=["a portrait of young girl. masterpiece, film grained, best quality.", "a dog running under the rain"],
padding=True,
return_tensors="pt",
return_for_image_generation=True,
)
inputs = inputs.to(device="cuda:0", dtype=torch.bfloat16)
neg_prompt = "lowres, bad anatomy, bad hands, text, error, missing fingers, extra digit, fewer digits, cropped, worst quality, low quality, normal quality, jpeg artifacts, signature, watermark, username, blurry."
neg_inputs = processor(text=[neg_prompt] * 2, return_tensors="pt").to(device="cuda:0")
image_sizes = inputs.pop("image_sizes")
HEIGHT, WIDTH = image_sizes[0]
VISUAL_TOKENS = model.vocabulary_mapping.image_tokens
def prefix_allowed_tokens_fn(batch_id, input_ids):
height, width = HEIGHT, WIDTH
visual_tokens = VISUAL_TOKENS
image_wrapper_token_id = torch.tensor([processor.tokenizer.image_wrapper_token_id], device=model.device)
eoi_token_id = torch.tensor([processor.tokenizer.eoi_token_id], device=model.device)
eos_token_id = torch.tensor([processor.tokenizer.eos_token_id], device=model.device)
pad_token_id = torch.tensor([processor.tokenizer.pad_token_id], device=model.device)
eof_token_id = torch.tensor([processor.tokenizer.eof_token_id], device=model.device)
eol_token_id = processor.tokenizer.encode("<|extra_200|>", return_tensors="pt")[0]
position = torch.nonzero(input_ids == image_wrapper_token_id, as_tuple=True)[0][0]
offset = input_ids.shape[0] - position
if offset % (width + 1) == 0:
return (eol_token_id, )
elif offset == (width + 1) * height + 1:
return (eof_token_id, )
elif offset == (width + 1) * height + 2:
return (eoi_token_id, )
elif offset == (width + 1) * height + 3:
return (eos_token_id, )
elif offset > (width + 1) * height + 3:
return (pad_token_id, )
else:
return visual_tokens
out = model.generate(
**inputs,
max_new_tokens=50_000, # make sure to have enough tokens for one image
prefix_allowed_tokens_fn=prefix_allowed_tokens_fn,
return_dict_in_generate=True,
negative_prompt_ids=neg_inputs.input_ids, # indicate for Classifier-Free Guidance
negative_prompt_attention_mask=neg_inputs.attention_mask,
)
image = model.decode_image_tokens(out.sequences[:, inputs.input_ids.shape[1]: ], height=HEIGHT, width=WIDTH)
images = processor.postprocess(list(image.float()), return_tensors="PIL.Image.Image") # internally we convert to np but it's not supported in bf16 precision
for i, image in enumerate(images['pixel_values']):
image.save(f"result{i}.png")
```
## Emu3Config
[[autodoc]] Emu3Config
## Emu3VQVAEConfig
[[autodoc]] Emu3VQVAEConfig
## Emu3TextConfig
[[autodoc]] Emu3TextConfig
## Emu3Processor
[[autodoc]] Emu3Processor
## Emu3ImageProcessor
[[autodoc]] Emu3ImageProcessor
- preprocess
## Emu3VQVAE
[[autodoc]] Emu3VQVAE
- forward
## Emu3TextModel
[[autodoc]] Emu3TextModel
- forward
## Emu3ForCausalLM
[[autodoc]] Emu3ForCausalLM
- forward
## Emu3ForConditionalGeneration
[[autodoc]] Emu3ForConditionalGeneration
- forward

View File

@@ -49,6 +49,7 @@ FlashAttention-2 is currently supported for the following architectures:
* [Dbrx](https://huggingface.co/docs/transformers/model_doc/dbrx#transformers.DbrxModel)
* [DiffLlama](https://huggingface.co/docs/transformers/model_doc/diffllama#transformers.DiffLlamaModel)
* [DistilBert](https://huggingface.co/docs/transformers/model_doc/distilbert#transformers.DistilBertModel)
* [Emu3](https://huggingface.co/docs/transformers/model_doc/emu3)
* [Gemma](https://huggingface.co/docs/transformers/model_doc/gemma#transformers.GemmaModel)
* [Gemma2](https://huggingface.co/docs/transformers/model_doc/gemma2#transformers.Gemma2Model)
* [GPT2](https://huggingface.co/docs/transformers/model_doc/gpt2)
@@ -245,6 +246,7 @@ For now, Transformers supports SDPA inference and training for the following arc
* [DistilBert](https://huggingface.co/docs/transformers/model_doc/distilbert#transformers.DistilBertModel)
* [Dpr](https://huggingface.co/docs/transformers/model_doc/dpr#transformers.DprReader)
* [EncoderDecoder](https://huggingface.co/docs/transformers/model_doc/encoder_decoder#transformers.EncoderDecoderModel)
* [Emu3](https://huggingface.co/docs/transformers/model_doc/emu3)
* [Falcon](https://huggingface.co/docs/transformers/model_doc/falcon#transformers.FalconModel)
* [Gemma](https://huggingface.co/docs/transformers/model_doc/gemma#transformers.GemmaModel)
* [Gemma2](https://huggingface.co/docs/transformers/model_doc/gemma2#transformers.Gemma2Model)

View File

@@ -428,6 +428,12 @@ _import_structure = {
"ElectraConfig",
"ElectraTokenizer",
],
"models.emu3": [
"Emu3Config",
"Emu3Processor",
"Emu3TextConfig",
"Emu3VQVAEConfig",
],
"models.encodec": [
"EncodecConfig",
"EncodecFeatureExtractor",
@@ -1222,6 +1228,7 @@ else:
_import_structure["models.donut"].extend(["DonutFeatureExtractor", "DonutImageProcessor"])
_import_structure["models.dpt"].extend(["DPTFeatureExtractor", "DPTImageProcessor"])
_import_structure["models.efficientnet"].append("EfficientNetImageProcessor")
_import_structure["models.emu3"].append("Emu3ImageProcessor")
_import_structure["models.flava"].extend(["FlavaFeatureExtractor", "FlavaImageProcessor", "FlavaProcessor"])
_import_structure["models.fuyu"].extend(["FuyuImageProcessor", "FuyuProcessor"])
_import_structure["models.glpn"].extend(["GLPNFeatureExtractor", "GLPNImageProcessor"])
@@ -2243,6 +2250,15 @@ else:
"load_tf_weights_in_electra",
]
)
_import_structure["models.emu3"].extend(
[
"Emu3ForCausalLM",
"Emu3ForConditionalGeneration",
"Emu3PreTrainedModel",
"Emu3TextModel",
"Emu3VQVAE",
]
)
_import_structure["models.encodec"].extend(
[
"EncodecModel",
@@ -5440,6 +5456,12 @@ if TYPE_CHECKING:
ElectraConfig,
ElectraTokenizer,
)
from .models.emu3 import (
Emu3Config,
Emu3Processor,
Emu3TextConfig,
Emu3VQVAEConfig,
)
from .models.encodec import (
EncodecConfig,
EncodecFeatureExtractor,
@@ -6270,6 +6292,7 @@ if TYPE_CHECKING:
from .models.donut import DonutFeatureExtractor, DonutImageProcessor
from .models.dpt import DPTFeatureExtractor, DPTImageProcessor
from .models.efficientnet import EfficientNetImageProcessor
from .models.emu3 import Emu3ImageProcessor
from .models.flava import (
FlavaFeatureExtractor,
FlavaImageProcessor,
@@ -7139,6 +7162,13 @@ if TYPE_CHECKING:
ElectraPreTrainedModel,
load_tf_weights_in_electra,
)
from .models.emu3 import (
Emu3ForCausalLM,
Emu3ForConditionalGeneration,
Emu3PreTrainedModel,
Emu3TextModel,
Emu3VQVAE,
)
from .models.encodec import (
EncodecModel,
EncodecPreTrainedModel,

View File

@@ -1634,17 +1634,18 @@ class GenerationMixin:
cache_dtype = self.get_output_embeddings().weight.dtype
def get_layer_device_map(execution_device_map: Optional[dict] = None):
num_hidden_layers = self.config.get_text_config().num_hidden_layers
if execution_device_map is None:
return None
elif len(execution_device_map) == 1 and "" in execution_device_map:
return {idx: execution_device_map[""] for idx in range(self.config.num_hidden_layers)}
return {idx: execution_device_map[""] for idx in range(num_hidden_layers)}
layer_device_map = {}
for layer in execution_device_map:
for idx in range(self.config.num_hidden_layers):
for idx in range(num_hidden_layers):
if f".{idx}." in f"{layer}.":
layer_device_map[idx] = execution_device_map[layer]
break
for idx in range(self.config.num_hidden_layers):
for idx in range(num_hidden_layers):
if idx not in layer_device_map:
raise RuntimeError(f"layer {idx} has not been mapped to a device.")
return layer_device_map

View File

@@ -86,6 +86,7 @@ from . import (
dpt,
efficientnet,
electra,
emu3,
encodec,
encoder_decoder,
ernie,

View File

@@ -103,6 +103,7 @@ CONFIG_MAPPING_NAMES = OrderedDict(
("efficientformer", "EfficientFormerConfig"),
("efficientnet", "EfficientNetConfig"),
("electra", "ElectraConfig"),
("emu3", "Emu3Config"),
("encodec", "EncodecConfig"),
("encoder-decoder", "EncoderDecoderConfig"),
("ernie", "ErnieConfig"),
@@ -420,6 +421,7 @@ MODEL_NAMES_MAPPING = OrderedDict(
("efficientformer", "EfficientFormer"),
("efficientnet", "EfficientNet"),
("electra", "ELECTRA"),
("emu3", "Emu3"),
("encodec", "EnCodec"),
("encoder-decoder", "Encoder decoder"),
("ernie", "ERNIE"),

View File

@@ -499,6 +499,7 @@ MODEL_FOR_CAUSAL_LM_MAPPING_NAMES = OrderedDict(
("dbrx", "DbrxForCausalLM"),
("diffllama", "DiffLlamaForCausalLM"),
("electra", "ElectraForCausalLM"),
("emu3", "Emu3ForCausalLM"),
("ernie", "ErnieForCausalLM"),
("falcon", "FalconForCausalLM"),
("falcon_mamba", "FalconMambaForCausalLM"),
@@ -800,6 +801,7 @@ MODEL_FOR_IMAGE_TEXT_TO_TEXT_MAPPING_NAMES = OrderedDict(
("blip", "BlipForConditionalGeneration"),
("blip-2", "Blip2ForConditionalGeneration"),
("chameleon", "ChameleonForConditionalGeneration"),
("emu3", "Emu3ForConditionalGeneration"),
("fuyu", "FuyuForCausalLM"),
("git", "GitForCausalLM"),
("idefics", "IdeficsForVisionText2Text"),
@@ -1428,6 +1430,7 @@ MODEL_FOR_TEXT_ENCODING_MAPPING_NAMES = OrderedDict(
("deberta-v2", "DebertaV2Model"),
("distilbert", "DistilBertModel"),
("electra", "ElectraModel"),
("emu3", "Emu3TextModel"),
("flaubert", "FlaubertModel"),
("ibert", "IBertModel"),
("longformer", "LongformerModel"),

View File

@@ -59,6 +59,7 @@ PROCESSOR_MAPPING_NAMES = OrderedDict(
("clipseg", "CLIPSegProcessor"),
("clvp", "ClvpProcessor"),
("colpali", "ColPaliProcessor"),
("emu3", "Emu3Processor"),
("flava", "FlavaProcessor"),
("fuyu", "FuyuProcessor"),
("git", "GitProcessor"),

View File

@@ -186,6 +186,7 @@ else:
),
),
("electra", ("ElectraTokenizer", "ElectraTokenizerFast" if is_tokenizers_available() else None)),
("emu3", ("GPT2Tokenizer", "GPT2TokenizerFast" if is_tokenizers_available() else None)),
("ernie", ("BertTokenizer", "BertTokenizerFast" if is_tokenizers_available() else None)),
("ernie_m", ("ErnieMTokenizer" if is_sentencepiece_available() else None, None)),
("esm", ("EsmTokenizer", None)),

View File

@@ -62,6 +62,7 @@ class ChameleonProcessor(ProcessorMixin):
attributes = ["image_processor", "tokenizer"]
tokenizer_class = ("LlamaTokenizer", "LlamaTokenizerFast")
valid_kwargs = ["image_seq_length", "image_token"]
image_processor_class = "ChameleonImageProcessor"
def __init__(self, image_processor, tokenizer, image_seq_length: int = 1024, image_token: str = "<image>"):

View File

@@ -0,0 +1,29 @@
# Copyright 2024 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from typing import TYPE_CHECKING
from ...utils import _LazyModule
from ...utils.import_utils import define_import_structure
if TYPE_CHECKING:
from .configuration_emu3 import *
from .image_processing_emu3 import *
from .modeling_emu3 import *
from .processing_emu3 import *
else:
import sys
_file = globals()["__file__"]
sys.modules[__name__] = _LazyModule(__name__, _file, define_import_structure(_file), module_spec=__spec__)

View File

@@ -0,0 +1,327 @@
# coding=utf-8
# Copyright 2024 HuggingFace Inc. team. All rights reserved.
#
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from typing import Dict, List, Optional, Union
from ...configuration_utils import PretrainedConfig
from ...modeling_rope_utils import rope_config_validation
class Emu3VQVAEConfig(PretrainedConfig):
r"""
This is the configuration class to store the configuration of a [`Emu3VQVAE`]. It is used to instantiate an VQ-VAE
model according to the specified arguments, defining the model architecture. Instantiating a configuration with the
defaults will yield a configuration to the VQ model presented in Emu3 paper.
Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
documentation from [`PretrainedConfig`] for more information.
Args:
codebook_size (`int`, *optional*, defaults to 32768):
Codebook size of the VQ model.
embed_dim (`int`, *optional*, defaults to 4):
Dimension of the quantized vector in codebook.
latent_channels (`int`, *optional*, defaults to 4):
Dimension of the output channel of encoder and the input channel of decoder
double_latent (`bool`, *optional*, defaults to `False`):
Whether double the output dim of the encoder.
in_channels (`int`, *optional*, defaults to 3):
Input channel of encoder.
out_channels (`int`, *optional*, defaults to 3):
Output channel of decoder.
temporal_downsample_factor (`int`, *optional*, defaults to 4):
Temporal downsample factor.
base_channels (`int`, *optional*, defaults to 256):
Basic channel number of the intermediate blocks.
channel_multiplier (`List[int]`, *optional*, defaults to `[1, 2, 2, 4]`):
Channel scaling factor of the intermediate blocks.
num_res_blocks (`int`, *optional*, defaults to 2):
Residual block number in each stage.
attn_resolutions (`List[int]`, *optional*, defaults to `[3]`):
Stage indices to apply attention.
hidden_size (`int`, *optional*, defaults to 1024):
Dimension of the hidden representations in the attention layer.
num_attention_heads (`int`, *optional*, defaults to 1):
Number of attention heads for each attention layer.
attention_dropout (`float`, *optional*, defaults to 0.0):
The dropout ratio for the attention probabilities.
```python
>>> from transformers import Emu3VQVAE, Emu3VQVAEConfig
>>> # Initializing a video VQ model of Emu3 configuration
>>> configuration = Emu3VQVAEConfig()
>>> # Initializing a model from the Emu3 VQ model style configuration
>>> model = Emu3VQVAE(configuration)
>>> # Accessing the model configuration
>>> configuration = model.config
```"""
model_type = "emu3_vqgan"
base_config_key = "vq_config"
def __init__(
self,
codebook_size: int = 32768,
embed_dim: int = 4,
latent_channels: int = 4,
double_latent: bool = False,
in_channels: int = 3,
out_channels: int = 3,
temporal_downsample_factor: int = 4,
base_channels: int = 256,
channel_multiplier: List[int] = [1, 2, 2, 4],
num_res_blocks: int = 2,
attn_resolutions: List[int] = [3],
hidden_size: int = 1024,
num_attention_heads: int = 1,
attention_dropout: float = 0.0,
**kwargs,
):
super().__init__(**kwargs)
self.codebook_size = codebook_size
self.embed_dim = embed_dim
self.latent_channels = latent_channels
self.double_latent = double_latent
self.in_channels = in_channels
self.out_channels = out_channels
self.temporal_downsample_factor = temporal_downsample_factor
self.base_channels = base_channels
self.channel_multiplier = channel_multiplier
self.num_res_blocks = num_res_blocks
self.attn_resolutions = attn_resolutions
self.hidden_size = hidden_size
self.num_attention_heads = num_attention_heads
self.attention_dropout = attention_dropout
class Emu3TextConfig(PretrainedConfig):
r"""
This is the configuration class to store the configuration of a [`Emu3TextModel`]. It is used to instantiate a
emu3 model according to the specified arguments, defining the model architecture. Instantiating a
configuration with the defaults will yield a similar configuration to that of the
[Emu3-community/Emu3-Chat-hf](https://huggingface.co/Emu3-community/Emu3-Chat-hf).
Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
documentation from [`PretrainedConfig`] for more information.
Args:
vocab_size (`int`, *optional*, defaults to 184622):
Vocabulary size of the Emu3 model. Defines the number of different tokens that can be represented by the
`inputs_ids` passed when calling [`Emu3Model`]
hidden_size (`int`, *optional*, defaults to 4096):
Dimension of the hidden representations.
intermediate_size (`int`, *optional*, defaults to 14336):
Dimension of the MLP representations.
num_hidden_layers (`int`, *optional*, defaults to 32):
Number of hidden layers in the Transformer decoder.
num_attention_heads (`int`, *optional*, defaults to 32):
Number of attention heads for each attention layer in the Transformer decoder.
num_key_value_heads (`int`, *optional*, defaults to 8):
This is the number of key_value heads that should be used to implement Grouped Query Attention. If
`num_key_value_heads=num_attention_heads`, the model will use Multi Head Attention (MHA), if
`num_key_value_heads=1 the model will use Multi Query Attention (MQA) otherwise GQA is used. When
converting a multi-head checkpoint to a GQA checkpoint, each group key and value head should be constructed
by meanpooling all the original heads within that group. For more details checkout [this
paper](https://arxiv.org/pdf/2305.13245.pdf). If it is not specified, will default to
`num_attention_heads`.
hidden_act (`str` or `function`, *optional*, defaults to `"silu"`):
The non-linear activation function (function or string) in the decoder.
max_position_embeddings (`int`, *optional*, defaults to 9216):
The maximum sequence length that this model might ever be used with. Emu supports up to 9216 tokens,
rms_norm_eps (`float`, *optional*, defaults to 1e-05):
The epsilon used by the rms normalization layers.
use_cache (`bool`, *optional*, defaults to `True`):
Whether or not the model should return the last key/values attentions (not used by all models). Only
relevant if `config.is_decoder=True`.
pad_token_id (`int`, *optional*, defaults to 151643):
Padding token id.
bos_token_id (`int`, *optional*, defaults to 151849):
Beginning of stream token id.
eos_token_id (`int`, *optional*, defaults to 151850):
End of stream token id.
tie_word_embeddings (`bool`, *optional*, defaults to `False`):
Whether to tie weight embeddings
rope_theta (`float`, *optional*, defaults to 1000000.0):
The base period of the RoPE embeddings.
rope_scaling (`Dict`, *optional*):
Dictionary containing the scaling configuration for the RoPE embeddings. NOTE: if you apply new rope type
and you expect the model to work on longer `max_position_embeddings`, we recommend you to update this value
accordingly.
Expected contents:
`rope_type` (`str`):
The sub-variant of RoPE to use. Can be one of ['default', 'linear', 'dynamic', 'yarn', 'longrope',
'llama3'], with 'default' being the original RoPE implementation.
`factor` (`float`, *optional*):
Used with all rope types except 'default'. The scaling factor to apply to the RoPE embeddings. In
most scaling types, a `factor` of x will enable the model to handle sequences of length x *
original maximum pre-trained length.
`original_max_position_embeddings` (`int`, *optional*):
Used with 'dynamic', 'longrope' and 'llama3'. The original max position embeddings used during
pretraining.
`attention_factor` (`float`, *optional*):
Used with 'yarn' and 'longrope'. The scaling factor to be applied on the attention
computation. If unspecified, it defaults to value recommended by the implementation, using the
`factor` field to infer the suggested value.
`beta_fast` (`float`, *optional*):
Only used with 'yarn'. Parameter to set the boundary for extrapolation (only) in the linear
ramp function. If unspecified, it defaults to 32.
`beta_slow` (`float`, *optional*):
Only used with 'yarn'. Parameter to set the boundary for interpolation (only) in the linear
ramp function. If unspecified, it defaults to 1.
`short_factor` (`List[float]`, *optional*):
Only used with 'longrope'. The scaling factor to be applied to short contexts (<
`original_max_position_embeddings`). Must be a list of numbers with the same length as the hidden
size divided by the number of attention heads divided by 2
`long_factor` (`List[float]`, *optional*):
Only used with 'longrope'. The scaling factor to be applied to long contexts (<
`original_max_position_embeddings`). Must be a list of numbers with the same length as the hidden
size divided by the number of attention heads divided by 2
`low_freq_factor` (`float`, *optional*):
Only used with 'llama3'. Scaling factor applied to low frequency components of the RoPE
`high_freq_factor` (`float`, *optional*):
Only used with 'llama3'. Scaling factor applied to high frequency components of the RoPE
mlp_bias (`bool`, *optional*, defaults to `False`):
Whether to use a bias in up_proj, down_proj and gate_proj layers in the MLP layers.
attention_bias (`bool`, *optional*, defaults to `False`):
Whether to use a bias in the query, key, value and output projection layers during self-attention.
attention_dropout (`float`, *optional*, defaults to 0.1):
The dropout ratio for the attention probabilities.
initializer_range (`float`, *optional*, defaults to 0.02):
The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
```python
>>> from transformers import Emu3Model, Emu3Config
>>> # Initializing a Emu3-community/Emu3-Chat-hf style configuration
>>> configuration = Emu3Config()
>>> # Initializing a model from the Emu3-community/Emu3-Chat-hf style configuration
>>> model = Emu3Model(configuration)
>>> # Accessing the model configuration
>>> configuration = model.config
```"""
model_type = "emu3_text_model"
base_config_key = "text_config"
keys_to_ignore_at_inference = ["past_key_values"]
def __init__(
self,
vocab_size: int = 184622,
hidden_size: int = 4096,
intermediate_size: int = 14336,
num_hidden_layers: int = 32,
num_attention_heads: int = 32,
num_key_value_heads: Optional[int] = 8,
hidden_act: str = "silu",
max_position_embeddings: int = 9216,
rms_norm_eps: float = 1e-5,
use_cache: bool = True,
pad_token_id: int = 151643,
bos_token_id: int = 151849,
eos_token_id: int = 151850,
tie_word_embeddings: bool = False,
rope_theta: float = 1000000.0,
rope_scaling: Optional = None,
mlp_bias=False,
attention_bias=False,
attention_dropout: float = 0.1,
initializer_range: float = 0.02,
**kwargs,
):
self.vocab_size = vocab_size
self.max_position_embeddings = max_position_embeddings
self.hidden_size = hidden_size
self.intermediate_size = intermediate_size
self.num_hidden_layers = num_hidden_layers
self.num_attention_heads = num_attention_heads
self.num_key_value_heads = num_key_value_heads
self.hidden_act = hidden_act
self.rms_norm_eps = rms_norm_eps
self.use_cache = use_cache
self.rope_theta = rope_theta
self.rope_scaling = rope_scaling
self.mlp_bias = mlp_bias
self.attention_bias = attention_bias
self.initializer_range = initializer_range
rope_config_validation(self)
self.attention_dropout = attention_dropout
super().__init__(
pad_token_id=pad_token_id,
bos_token_id=bos_token_id,
eos_token_id=eos_token_id,
tie_word_embeddings=tie_word_embeddings,
**kwargs,
)
class Emu3Config(PretrainedConfig):
"""
This is the configuration class to store the configuration of a [`Emu3Model`]. It is used to instantiate a
emu3 model according to the specified arguments, defining the model architecture. Instantiating a
configuration with the defaults will yield a similar configuration to that of the
[Emu3-community/Emu3-Chat-hf](https://huggingface.co/Emu3-community/Emu3-Chat-hf).
Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
documentation from [`PretrainedConfig`] for more information.
Args:
vq_config (`Union[Dict, Emu3VQVAEConfig]`, *optional*):
Emu3VQVAEConfig instance containing the configuration for the VQ-VAE model.
text_config (`Union[Dict, Emu3TextConfig]``, *optional*):
Emu3TextConfig instance containing the configuration for the language model.
vocabulary_map (`dict`, *optional*):
A dictionary containing the vocabulary map from the tokenizer. Used to obtain tokens from the image inputs.
"""
model_type = "emu3"
keys_to_ignore_at_inference = ["past_key_values"]
sub_configs = {"text_config": Emu3TextConfig, "vq_config": Emu3VQVAEConfig}
def __init__(
self,
vq_config: Union[Dict, Emu3VQVAEConfig] = None,
text_config: Union[Dict, Emu3TextConfig] = None,
vocabulary_map: Dict[int, int] = None,
**kwargs,
):
if vq_config is None:
vq_config = Emu3VQVAEConfig()
elif isinstance(vq_config, dict):
vq_config = Emu3VQVAEConfig(**vq_config)
if text_config is None:
text_config = Emu3TextConfig()
elif isinstance(text_config, dict):
text_config = Emu3TextConfig(**text_config)
self.vq_config = vq_config
self.text_config = text_config
self.vocabulary_map = vocabulary_map
super().__init__(**kwargs)
__all__ = ["Emu3Config", "Emu3TextConfig", "Emu3VQVAEConfig"]

View File

@@ -0,0 +1,448 @@
# Copyright 2024 The Emu team, BAAI and The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse
import json
import os
import re
from typing import Dict, Optional
import requests
import torch
from accelerate import init_empty_weights
from PIL import Image
from transformers import (
AutoModel,
AutoModelForCausalLM,
AutoTokenizer,
Emu3Config,
Emu3ForConditionalGeneration,
Emu3ImageProcessor,
Emu3Processor,
Emu3TextConfig,
GenerationConfig,
)
from transformers.models.gpt2.tokenization_gpt2 import bytes_to_unicode
"""
Sample usage:
```
python src/transformers/models/emu3/convert_emu3_weights_to_hf.py \
--vq_model_id BAAI/Emu3-VisionTokenizer --llm_model_id BAAI/Emu3-Chat --output_dir /output/path
```
Thereafter, models can be loaded via:
```py
from transformers import Emu3ForConditionalGeneration, Emu3Processor
model = Emu3ForConditionalGeneration.from_pretrained("/output/path")
processor = Emu3Processor.from_pretrained("/output/path")
```
"""
byte_encoder = bytes_to_unicode()
CHAT_TEMPLATE = "{% for message in messages %}{% if message['role'] != 'system' %}{{ message['role'].upper() + ': '}}{% endif %}{# Render all images first #}{% for content in message['content'] | selectattr('type', 'equalto', 'image') %}{{ '<image>' }}{% endfor %}{# Render all text next #}{% if message['role'] != 'assistant' %}{% for content in message['content'] | selectattr('type', 'equalto', 'text') %}{{ content['text'] + ' '}}{% endfor %}{% else %}{% for content in message['content'] | selectattr('type', 'equalto', 'text') %}{% generation %}{{ content['text'] + ' '}}{% endgeneration %}{% endfor %}{% endif %}{% endfor %}{% if add_generation_prompt %}{{ 'ASSISTANT:' }}{% endif %}"
# Tiktoken to HF conversion, thanks for Xenova
def token_bytes_to_string(b):
return "".join([byte_encoder[ord(char)] for char in b.decode("latin-1")])
# Adapted from https://github.com/openai/tiktoken/issues/60#issuecomment-1499977960
def bpe(mergeable_ranks: Dict[bytes, int], token: bytes, max_rank: Optional[int] = None):
parts = [bytes([b]) for b in token]
while True:
min_idx = None
min_rank = None
for i, pair in enumerate(zip(parts[:-1], parts[1:])):
rank = mergeable_ranks.get(pair[0] + pair[1])
if rank is not None and (min_rank is None or rank < min_rank):
min_idx = i
min_rank = rank
if min_rank is None or (max_rank is not None and min_rank >= max_rank):
break
assert min_idx is not None
parts = parts[:min_idx] + [parts[min_idx] + parts[min_idx + 1]] + parts[min_idx + 2 :]
return parts
def generate_vocab_and_merges(encoder):
mergeable_ranks = encoder._mergeable_ranks
merges = []
vocab = {}
for token, rank in mergeable_ranks.items():
vocab[token_bytes_to_string(token)] = rank
if len(token) == 1:
continue
merged = tuple(bpe(mergeable_ranks, token, max_rank=rank))
assert len(merged) == 2
merges.append(" ".join(map(token_bytes_to_string, merged)))
# Also add special tokens
vocab.update(encoder._special_tokens)
return vocab, merges
def convert_tiktoken(tokenizer, output_dir):
encoder = tokenizer.tokenizer
vocab, merges = generate_vocab_and_merges(encoder)
added_tokens = [
{
"id": id,
"content": content,
"single_word": False,
"lstrip": False,
"rstrip": False,
"normalized": False,
"special": True,
}
for content, id in encoder._special_tokens.items()
if content != "<|extra_0|>"
]
# https://huggingface.co/Xenova/gpt2/raw/main/tokenizer_config.json
tokenizer_config_template = {
"add_prefix_space": False,
"bos_token": "<|extra_203|>",
"clean_up_tokenization_spaces": False,
"eos_token": "<|extra_204|>",
"pad_token": "<|endoftext|>",
}
tokenizer_config_template.update({"tokenizer_class": "GPT2Tokenizer"})
tokenizer_config_template = dict(sorted(tokenizer_config_template.items(), key=lambda x: x[0]))
# add placeholder image token by taking one of the reserved tokens
reserved_token_id = vocab["<|extra_0|>"]
vocab["<image>"] = reserved_token_id
del vocab["<|extra_0|>"]
added_tokens.append(
{
"id": reserved_token_id,
"content": "<image>",
"single_word": False,
"lstrip": False,
"rstrip": False,
"normalized": False,
"special": True,
}
)
os.makedirs(output_dir, exist_ok=True)
pre_tokenizer = {
"type": "ByteLevel",
"add_prefix_space": False,
"trim_offsets": True,
"use_regex": True,
}
# https://huggingface.co/Xenova/gpt2/raw/main/tokenizer.json
tokenizer_template = {
"version": "1.0",
"truncation": None,
"padding": None,
"added_tokens": added_tokens,
"normalizer": None,
"pre_tokenizer": pre_tokenizer,
"post_processor": None,
"decoder": {
"type": "ByteLevel",
"add_prefix_space": True,
"trim_offsets": True,
"use_regex": True,
},
"model": {
"type": "BPE",
"dropout": None,
"unk_token": None,
"continuing_subword_prefix": "",
"end_of_word_suffix": "",
"fuse_unk": False,
"byte_fallback": False,
"vocab": vocab,
"merges": merges,
},
}
# Save to files
with open(os.path.join(output_dir, "vocab.json"), "w", encoding="utf-8") as fp:
json.dump(vocab, fp, indent=2, ensure_ascii=False)
with open(os.path.join(output_dir, "tokenizer.json"), "w", encoding="utf-8") as fp:
json.dump(tokenizer_template, fp, indent=2, ensure_ascii=False)
with open(os.path.join(output_dir, "tokenizer_config.json"), "w", encoding="utf-8") as fp:
json.dump(tokenizer_config_template, fp, indent=2, ensure_ascii=False)
with open(os.path.join(output_dir, "special_tokens_map.json"), "w", encoding="utf-8") as fp:
json.dump(
{
"bos_token": "<|extra_203|>",
"eos_token": "<|extra_204|>",
"pad_token": "<|endoftext|>",
},
fp,
indent=2,
ensure_ascii=False,
)
with open(os.path.join(output_dir, "merges.txt"), "w", encoding="utf-8") as fp:
fp.write("#version: 0.2\n")
fp.write("\n".join(merges))
KEYS_TO_MODIFY_MAPPING = {
"^encoder": "model.vqmodel.encoder",
"^decoder": "model.vqmodel.decoder",
"^post_quant_conv": "model.vqmodel.post_quant_conv",
"^quant_conv": "model.vqmodel.quant_conv",
"^quantize": "model.vqmodel.quantize",
"^model": "text_model.model",
r"lm_head\.weight": "text_model.lm_head.weight",
r"^text_model\.model\.vqmodel": "vqmodel",
# rename QKV proj for the VQ-VAE model because we use SiglipAttention
r"\.q\.": ".q_proj.",
r"\.k\.": ".k_proj.",
r"\.v\.": ".v_proj.",
r"\.proj_out\.": ".out_proj.",
# move the attention norms outside of attention modules
r"mid\.attn_1\.norm\.": "mid.attn_norm.",
r"attn\.0\.norm\.": "attn_norms.0.",
r"attn\.1\.norm\.": "attn_norms.1.",
r"attn\.2\.norm\.": "attn_norms.2.",
r"attn\.3\.norm\.": "attn_norms.3.",
# isolate down/mid/up into separate classes for readability
r"\.down\.": ".down_block.down.",
r"\.up\.": ".up_block.up.",
r"\.mid\.": ".middle_block.",
}
def convert_state_dict_to_hf(old_state_dict, new_state_dict):
for key, value in old_state_dict.items():
# convert conv layers in attn to linear
if (
any(key.endswith(name) for name in ["q.weight", "k.weight", "v.weight", "proj_out.weight"])
and value.ndim == 4
):
value = value.squeeze()
for old_pattern, new_pattern in KEYS_TO_MODIFY_MAPPING.items():
key = re.sub(old_pattern, new_pattern, key)
new_state_dict[key] = value
return new_state_dict
def convert_model(vq_model_id, llm_model_id, output_dir, hub_model_id=None, test_inference=False):
os.makedirs(output_dir, exist_ok=True)
# Convert and save processor
tokenizer_tiktoken = AutoTokenizer.from_pretrained(llm_model_id, trust_remote_code=True)
convert_tiktoken(tokenizer_tiktoken, output_dir)
extra_special_tokens = extra_special_tokens = {
"image_token": "<image>",
"boi_token": "<|image start|>",
"eoi_token": "<|image end|>",
"image_wrapper_token": "<|image token|>",
"eof_token": "<|extra_201|>",
}
tokenizer_converted = AutoTokenizer.from_pretrained(output_dir, extra_special_tokens=extra_special_tokens)
tokenizer_converted.padding_side = "left"
image_processor = Emu3ImageProcessor.from_pretrained(vq_model_id)
processor = Emu3Processor(image_processor, tokenizer_converted, chat_template=CHAT_TEMPLATE)
processor.save_pretrained(output_dir)
# load models
model_llm = AutoModelForCausalLM.from_pretrained(
llm_model_id,
trust_remote_code=True,
)
model_vqgan = AutoModel.from_pretrained(vq_model_id, trust_remote_code=True)
with open(f"{output_dir}/tokenizer.json", "r") as file:
tokenizer_config = json.load(file)
vocabulary_map = tokenizer_config["model"]["vocab"]
text_config = Emu3TextConfig(
max_position_embeddings=model_llm.config.max_position_embeddings,
rope_scaling={"rope_type": "default"},
)
config = Emu3Config(text_config=text_config, vocabulary_map=vocabulary_map)
with init_empty_weights():
model = Emu3ForConditionalGeneration(config=config)
model.generation_config = GenerationConfig(
do_sample=True,
top_k=2048,
max_new_tokens=50_000,
pad_token_id=processor.tokenizer.pad_token_id,
eos_token_id=processor.tokenizer.eos_token_id,
)
state_dict = {}
state_dict = convert_state_dict_to_hf(model_llm.state_dict(), state_dict)
state_dict = convert_state_dict_to_hf(model_vqgan.state_dict(), state_dict)
model.load_state_dict(state_dict, assign=True, strict=True)
model.save_pretrained(output_dir, safe_serialization=True)
if hub_model_id is not None:
model.push_to_hub(hub_model_id)
processor.push_to_hub(hub_model_id)
if test_inference and llm_model_id.endswith("Chat"):
# Short inference on a few examples to check if generation makes sense
print("Loading the checkpoint in a Emu3 model...")
print("*" * 100)
model = Emu3ForConditionalGeneration.from_pretrained(output_dir, torch_dtype=torch.bfloat16, device_map="auto")
processor = Emu3Processor.from_pretrained(output_dir)
conversation = [
{
"role": "system",
"content": [
{"type": "text", "text": "You are a helpful assistant."},
],
},
{
"role": "user",
"content": [
{"type": "text", "text": "Please tell me about this art work and its artist."},
{"type": "image"},
],
},
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
image = Image.open(
requests.get(
"https://uploads4.wikiart.org/images/paul-klee/death-for-the-idea-1915.jpg!Large.jpg", stream=True
).raw
)
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device, torch.bfloat16)
length = inputs.input_ids.shape[1]
out = model.generate(**inputs, max_new_tokens=40, do_sample=False)
generated_text = processor.batch_decode(out[:, length:], skip_special_tokens=True)[0]
print(f"Generation for single-image: {generated_text}")
print("*" * 100)
elif test_inference and llm_model_id.endswith("Gen"):
processor = Emu3Processor.from_pretrained(output_dir)
model = Emu3ForConditionalGeneration.from_pretrained(output_dir, torch_dtype=torch.bfloat16, device_map="auto")
inputs = processor(
text=[
"a portrait of young girl. masterpiece, film grained, best quality.",
"a dog running under the rain",
],
padding=True,
return_tensors="pt",
return_for_image_generation=True,
)
inputs = inputs.to(device="cuda:0", dtype=torch.bfloat16)
neg_prompt = "lowres, bad anatomy, bad hands, text, error, missing fingers, extra digit, fewer digits, cropped, worst quality, low quality, normal quality, jpeg artifacts, signature, watermark, username, blurry."
neg_inputs = processor(text=[neg_prompt] * 2, return_tensors="pt").to(device="cuda:0")
image_sizes = inputs.pop("image_sizes")
HEIGHT, WIDTH = image_sizes[0]
VISUAL_TOKENS = model.vocabulary_mapping.image_tokens
def prefix_allowed_tokens_fn(batch_id, input_ids):
height, width = HEIGHT, WIDTH
visual_tokens = VISUAL_TOKENS
image_token_id = processor.tokenizer.encode("<|image token|>", return_tensors="pt")[0].to(model.device)
eoi_token_id = processor.tokenizer.encode("<|image end|>", return_tensors="pt")[0]
eos_token_id = processor.tokenizer.encode("<|extra_204|>", return_tensors="pt")[0]
pad_token_id = processor.tokenizer.encode("<|endoftext|>", return_tensors="pt")[0]
eol_token_id = processor.tokenizer.encode("<|extra_200|>", return_tensors="pt")[0]
eof_token_id = processor.tokenizer.encode("<|extra_201|>", return_tensors="pt")[0]
position = torch.nonzero(input_ids == image_token_id, as_tuple=True)[0][0]
offset = input_ids.shape[0] - position
if offset % (width + 1) == 0:
return (eol_token_id,)
elif offset == (width + 1) * height + 1:
return (eof_token_id,)
elif offset == (width + 1) * height + 2:
return (eoi_token_id,)
elif offset == (width + 1) * height + 3:
return (eos_token_id,)
elif offset > (width + 1) * height + 3:
return (pad_token_id,)
else:
return visual_tokens
out = model.generate(
**inputs,
prefix_allowed_tokens_fn=prefix_allowed_tokens_fn,
negative_prompt_ids=neg_inputs.input_ids,
negative_prompt_attention_mask=neg_inputs.attention_mask,
)
image = model.decode_image_tokens(out[:, inputs.input_ids.shape[1] :], height=HEIGHT, width=WIDTH)
images = processor.postprocess(
list(image.float()), return_tensors="PIL.Image.Image"
) # internally we convert to np but it's not supported in bf16 precision
for i, image in enumerate(images["pixel_values"]):
image.save(f"result_{i}.png")
def main():
parser = argparse.ArgumentParser()
parser.add_argument(
"--vq_model_id",
help="Model ID of Emu3 VQ-VAE on the hub",
default="BAAI/Emu3-VisionTokenizer",
)
parser.add_argument(
"--llm_model_id",
help="Model ID of Emu3 bacbone LLM on the hub",
default="BAAI/Emu3-Chat",
)
parser.add_argument(
"--output_dir",
help="Location to write HF model",
)
parser.add_argument(
"--hub_model_id",
help="Model ID in the hub where to push the model.",
)
parser.add_argument(
"--test_inference",
action="store_true",
help="Whether to load the model for generation to test it's converted correctly.",
)
args = parser.parse_args()
convert_model(
vq_model_id=args.vq_model_id,
llm_model_id=args.llm_model_id,
output_dir=args.output_dir,
hub_model_id=args.hub_model_id,
test_inference=args.test_inference,
)
if __name__ == "__main__":
main()

View File

@@ -0,0 +1,552 @@
# coding=utf-8
# Copyright 2024 HuggingFace Inc. team. All rights reserved.
#
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import math
from typing import Dict, Iterable, List, Optional, Union
import numpy as np
from ...image_processing_utils import BaseImageProcessor, BatchFeature
from ...image_transforms import convert_to_rgb, pad, resize, to_channel_dimension_format
from ...image_utils import (
OPENAI_CLIP_MEAN,
OPENAI_CLIP_STD,
ChannelDimension,
ImageInput,
PILImageResampling,
VideoInput,
get_image_size,
infer_channel_dimension_format,
is_scaled_image,
is_valid_image,
make_list_of_images,
to_numpy_array,
valid_images,
validate_preprocess_arguments,
)
from ...utils import TensorType, is_vision_available, logging
if is_vision_available():
from PIL import Image
logger = logging.get_logger(__name__)
def make_batched_images(images) -> List[List[ImageInput]]:
"""
Accepts images in list or nested list format, and makes a list of images for preprocessing.
Args:
images (`Union[List[List[ImageInput]], List[ImageInput], ImageInput]`):
The input image.
Returns:
list: A list of images.
"""
if isinstance(images, (list, tuple)) and isinstance(images[0], (list, tuple)) and is_valid_image(images[0][0]):
return [img for img_list in images for img in img_list]
elif isinstance(images, (list, tuple)) and is_valid_image(images[0]):
return images
elif is_valid_image(images):
return [images]
raise ValueError(f"Could not make batched images from {images}")
def smart_resize(
height: int, width: int, factor: int = 28, min_pixels: int = 56 * 56, max_pixels: int = 14 * 14 * 4 * 1280
):
"""Rescales the image so that the following conditions are met:
1. Both dimensions (height and width) are divisible by 'factor'.
2. The total number of pixels is within the range ['min_pixels', 'max_pixels'].
3. The aspect ratio of the image is maintained as closely as possible.
"""
if height < factor or width < factor:
raise ValueError(f"height:{height} or width:{width} must be larger than factor:{factor}")
elif max(height, width) / min(height, width) > 200:
raise ValueError(
f"absolute aspect ratio must be smaller than 200, got {max(height, width) / min(height, width)}"
)
h_bar = round(height / factor) * factor
w_bar = round(width / factor) * factor
if h_bar * w_bar > max_pixels:
beta = math.sqrt((height * width) / max_pixels)
h_bar = math.floor(height / beta / factor) * factor
w_bar = math.floor(width / beta / factor) * factor
elif h_bar * w_bar < min_pixels:
beta = math.sqrt(min_pixels / (height * width))
h_bar = math.ceil(height * beta / factor) * factor
w_bar = math.ceil(width * beta / factor) * factor
return h_bar, w_bar
class Emu3ImageProcessor(BaseImageProcessor):
r"""
Constructs a Emu3 image processor that dynamically resizes images based on the original images.
Args:
do_resize (`bool`, *optional*, defaults to `True`):
Whether to resize the image's (height, width) dimensions.
resample (`PILImageResampling`, *optional*, defaults to `Resampling.BICUBIC`):
Resampling filter to use when resizing the image.
do_rescale (`bool`, *optional*, defaults to `True`):
Whether to rescale the image by the specified scale `rescale_factor`.
rescale_factor (`int` or `float`, *optional*, defaults to `1/255`):
Scale factor to use if rescaling the image.
do_normalize (`bool`, *optional*, defaults to `True`):
Whether to normalize the image.
image_mean (`float` or `List[float]`, *optional*, defaults to `[0.48145466, 0.4578275, 0.40821073]`):
Mean to use if normalizing the image. This is a float or list of floats for each channel in the image.
image_std (`float` or `List[float]`, *optional*, defaults to `[0.26862954, 0.26130258, 0.27577711]`):
Standard deviation to use if normalizing the image. This is a float or list of floats for each channel in the image.
do_convert_rgb (`bool`, *optional*, defaults to `True`):
Whether to convert the image to RGB.
do_pad (`bool`, *optional*, defaults to `True`):
Whether to pad the image. If `True`, will pad the patch dimension of the images in the batch to the largest
number of patches in the batch. Padding will be applied to the bottom and right with zeros.
min_pixels (`int`, *optional*, defaults to `512 * 512`):
The min pixels of the image to resize the image.
max_pixels (`int`, *optional*, defaults to `1024 * 1024`):
The max pixels of the image to resize the image.
spatial_factor (`int`, *optional*, defaults to 8):
The spatial downsample factor the image will be downsampled in feature extracting phase
"""
model_input_names = ["pixel_values"]
def __init__(
self,
do_resize: bool = True,
resample: PILImageResampling = PILImageResampling.BICUBIC,
do_rescale: bool = True,
rescale_factor: Union[int, float] = 1 / 255,
do_normalize: bool = True,
image_mean: Optional[Union[float, List[float]]] = None,
image_std: Optional[Union[float, List[float]]] = None,
do_convert_rgb: bool = True,
do_pad: bool = True,
min_pixels: int = 512 * 512,
max_pixels: int = 1024 * 1024,
spatial_factor: int = 8,
**kwargs,
) -> None:
super().__init__(**kwargs)
self.do_resize = do_resize
self.resample = resample
self.do_rescale = do_rescale
self.rescale_factor = rescale_factor
self.do_normalize = do_normalize
self.image_mean = image_mean if image_mean is not None else OPENAI_CLIP_MEAN
self.image_std = image_std if image_std is not None else OPENAI_CLIP_STD
self.min_pixels = min_pixels
self.max_pixels = max_pixels
self.spatial_factor = spatial_factor
self.size = {"min_pixels": min_pixels, "max_pixels": max_pixels}
self.do_convert_rgb = do_convert_rgb
def _preprocess(
self,
images: Union[ImageInput, VideoInput],
do_resize: bool = None,
resample: PILImageResampling = None,
do_rescale: bool = None,
rescale_factor: float = None,
do_normalize: bool = None,
image_mean: Optional[Union[float, List[float]]] = None,
image_std: Optional[Union[float, List[float]]] = None,
do_convert_rgb: bool = None,
data_format: Optional[ChannelDimension] = ChannelDimension.FIRST,
input_data_format: Optional[Union[str, ChannelDimension]] = None,
):
"""
Preprocess an image or batch of images.
Args:
images (`ImageInput`):
Image or batch of images to preprocess. Expects pixel values ranging from 0 to 255. If pixel values range from 0 to 1, set `do_rescale=False`.
vision_info (`List[Dict]`, *optional*):
Optional list of dictionaries containing additional information about vision inputs.
do_resize (`bool`, *optional*, defaults to `self.do_resize`):
Whether to resize the image.
resample (`PILImageResampling`, *optional*, defaults to `self.resample`):
Resampling filter to use if resizing the image. This can be one of the `PILImageResampling` enums.
do_rescale (`bool`, *optional*, defaults to `self.do_rescale`):
Whether to rescale the image.
rescale_factor (`float`, *optional*, defaults to `self.rescale_factor`):
Scale factor to use if rescaling the image.
do_normalize (`bool`, *optional*, defaults to `self.do_normalize`):
Whether to normalize the image.
image_mean (`float` or `List[float]`, *optional*, defaults to `self.image_mean`):
Mean to use if normalizing the image. Can be a float or a list of floats corresponding to the number of channels in the image.
image_std (`float` or `List[float]`, *optional*, defaults to `self.image_std`):
Standard deviation to use if normalizing the image. Can be a float or a list of floats corresponding to the number of channels in the image.
do_convert_rgb (`bool`, *optional*, defaults to `self.do_convert_rgb`):
Whether to convert the image to RGB.
data_format (`ChannelDimension`, *optional*, defaults to `ChannelDimension.FIRST`):
The channel dimension format for the output image. Can be one of:
- `"channels_first"` or `ChannelDimension.FIRST`: image in (num_channels, height, width) format.
- `"channels_last"` or `ChannelDimension.LAST`: image in (height, width, num_channels) format.
- Unset: Use the channel dimension format of the input image.
input_data_format (`ChannelDimension` or `str`, *optional*):
The channel dimension format for the input image. Can be one of:
- `"channels_first"` or `ChannelDimension.FIRST`: image in (num_channels, height, width) format.
- `"channels_last"` or `ChannelDimension.LAST`: image in (height, width, num_channels) format.
- `"none"` or `ChannelDimension.NONE`: image in (height, width) format. - `"none"` or `ChannelDimension.NONE`: image in (height, width) format.
"""
images = make_list_of_images(images)
if do_convert_rgb:
images = [convert_to_rgb(image) for image in images]
# All transformations expect numpy arrays.
images = [to_numpy_array(image) for image in images]
if is_scaled_image(images[0]) and do_rescale:
logger.warning_once(
"It looks like you are trying to rescale already rescaled images. If the input"
" images have pixel values between 0 and 1, set `do_rescale=False` to avoid rescaling them again."
)
if input_data_format is None:
# We assume that all images have the same channel dimension format.
input_data_format = infer_channel_dimension_format(images[0])
height, width = get_image_size(images[0], channel_dim=input_data_format)
resized_height, resized_width = height, width
processed_images = []
for image in images:
if do_resize:
resized_height, resized_width = smart_resize(
height,
width,
factor=self.spatial_factor,
min_pixels=self.min_pixels,
max_pixels=self.max_pixels,
)
image = resize(
image, size=(resized_height, resized_width), resample=resample, input_data_format=input_data_format
)
if do_rescale:
image = self.rescale(image, scale=rescale_factor, input_data_format=input_data_format)
if do_normalize:
image = self.normalize(
image=image, mean=image_mean, std=image_std, input_data_format=input_data_format
)
image = to_channel_dimension_format(image, data_format, input_channel_dim=input_data_format)
processed_images.append(image)
images = np.array(processed_images)
return images
def _pad_for_batching(
self,
pixel_values: List[np.ndarray],
image_sizes: List[List[int]],
data_format: Optional[Union[str, ChannelDimension]] = None,
input_data_format: Optional[Union[str, ChannelDimension]] = None,
):
"""
Pads images on the `num_of_patches` dimension with zeros to form a batch of same number of patches.
Args:
pixel_values (`List[np.ndarray]`):
An array of pixel values of each images of shape (`batch_size`, `num_patches`, `image_in_3D`)
image_sizes (`List[List[int]]`):
A list of sizes for each image in `pixel_values` in (height, width) format.
data_format (`str` or `ChannelDimension`, *optional*):
The channel dimension format for the output image. Can be one of:
- `"channels_first"` or `ChannelDimension.FIRST`: image in (num_channels, height, width) format.
- `"channels_last"` or `ChannelDimension.LAST`: image in (height, width, num_channels) format.
If unset, will use same as the input image.
input_data_format (`str` or `ChannelDimension`, *optional*):
The channel dimension format for the input image. Can be one of:
- `"channels_first"` or `ChannelDimension.FIRST`: image in (num_channels, height, width) format.
- `"channels_last"` or `ChannelDimension.LAST`: image in (height, width, num_channels) format.
If unset, will use the inferred format of the input image.
Returns:
List[`np.ndarray`]: The padded images.
"""
max_shape = (
max([size[0] for size in image_sizes]),
max([size[1] for size in image_sizes]),
)
pixel_values = [
pad(
image,
padding=((0, max_shape[0] - size[0]), (0, max_shape[1] - size[1])),
data_format=data_format,
input_data_format=input_data_format,
)
for image, size in zip(pixel_values, image_sizes)
]
return pixel_values
def preprocess(
self,
images: ImageInput,
do_resize: bool = None,
size: Dict[str, int] = None,
resample: PILImageResampling = None,
do_rescale: bool = None,
rescale_factor: float = None,
do_normalize: bool = None,
image_mean: Optional[Union[float, List[float]]] = None,
image_std: Optional[Union[float, List[float]]] = None,
do_convert_rgb: bool = None,
do_pad: bool = True,
return_tensors: Optional[Union[str, TensorType]] = None,
data_format: Optional[ChannelDimension] = ChannelDimension.FIRST,
input_data_format: Optional[Union[str, ChannelDimension]] = None,
):
"""
Args:
images (`ImageInput`):
Image to preprocess. Expects a single or batch of images with pixel values ranging from 0 to 255. If
passing in images with pixel values between 0 and 1, set `do_rescale=False`.
do_resize (`bool`, *optional*, defaults to `self.do_resize`):
Whether to resize the image.
size (`Dict[str, int]`, *optional*, defaults to `self.size`):
Size of the image after resizing. Shortest edge of the image is resized to size["shortest_edge"], with
the longest edge resized to keep the input aspect ratio.
resample (`int`, *optional*, defaults to `self.resample`):
Resampling filter to use if resizing the image. This can be one of the enum `PILImageResampling`. Only
has an effect if `do_resize` is set to `True`.
do_rescale (`bool`, *optional*, defaults to `self.do_rescale`):
Whether to rescale the image.
rescale_factor (`float`, *optional*, defaults to `self.rescale_factor`):
Rescale factor to rescale the image by if `do_rescale` is set to `True`.
do_normalize (`bool`, *optional*, defaults to `self.do_normalize`):
Whether to normalize the image.
image_mean (`float` or `List[float]`, *optional*, defaults to `self.image_mean`):
Image mean to use for normalization. Only has an effect if `do_normalize` is set to `True`.
image_std (`float` or `List[float]`, *optional*, defaults to `self.image_std`):
Image standard deviation to use for normalization. Only has an effect if `do_normalize` is set to
`True`.
do_convert_rgb (`bool`, *optional*, defaults to `self.do_convert_rgb`):
Whether to convert the image to RGB.
do_pad (`bool`, *optional*, defaults to `True`):
Whether to pad the image. If `True`, will pad the patch dimension of the images in the batch to the largest
number of patches in the batch. Padding will be applied to the bottom and right with zeros.
return_tensors (`str` or `TensorType`, *optional*):
The type of tensors to return. Can be one of:
- Unset: Return a list of `np.ndarray`.
- `TensorType.TENSORFLOW` or `'tf'`: Return a batch of type `tf.Tensor`.
- `TensorType.PYTORCH` or `'pt'`: Return a batch of type `torch.Tensor`.
- `TensorType.NUMPY` or `'np'`: Return a batch of type `np.ndarray`.
- `TensorType.JAX` or `'jax'`: Return a batch of type `jax.numpy.ndarray`.
data_format (`ChannelDimension` or `str`, *optional*, defaults to `ChannelDimension.FIRST`):
The channel dimension format for the output image. Can be one of:
- `"channels_first"` or `ChannelDimension.FIRST`: image in (num_channels, height, width) format.
- `"channels_last"` or `ChannelDimension.LAST`: image in (height, width, num_channels) format.
- Unset: Use the channel dimension format of the input image.
input_data_format (`ChannelDimension` or `str`, *optional*):
The channel dimension format for the input image. If unset, the channel dimension format is inferred
from the input image. Can be one of:
- `"channels_first"` or `ChannelDimension.FIRST`: image in (num_channels, height, width) format.
- `"channels_last"` or `ChannelDimension.LAST`: image in (height, width, num_channels) format.
- `"none"` or `ChannelDimension.NONE`: image in (height, width) format.
"""
do_resize = do_resize if do_resize is not None else self.do_resize
size = size if size is not None else self.size
resample = resample if resample is not None else self.resample
do_rescale = do_rescale if do_rescale is not None else self.do_rescale
rescale_factor = rescale_factor if rescale_factor is not None else self.rescale_factor
do_normalize = do_normalize if do_normalize is not None else self.do_normalize
image_mean = image_mean if image_mean is not None else self.image_mean
image_std = image_std if image_std is not None else self.image_std
do_convert_rgb = do_convert_rgb if do_convert_rgb is not None else self.do_convert_rgb
do_pad = do_pad if do_pad is not None else self.do_pad
if images is not None:
images = make_batched_images(images)
if images is not None and not valid_images(images):
raise ValueError(
"Invalid image type. Must be of type PIL.Image.Image, numpy.ndarray, "
"torch.Tensor, tf.Tensor or jax.ndarray."
)
validate_preprocess_arguments(
rescale_factor=rescale_factor,
do_normalize=do_normalize,
image_mean=image_mean,
image_std=image_std,
do_resize=do_resize,
size=size,
resample=resample,
)
pixel_values = []
for image in images:
image = self._preprocess(
image,
do_resize=do_resize,
resample=resample,
do_rescale=do_rescale,
rescale_factor=rescale_factor,
do_normalize=do_normalize,
image_mean=image_mean,
image_std=image_std,
data_format=data_format,
do_convert_rgb=do_convert_rgb,
input_data_format=input_data_format,
)
pixel_values.extend(image)
image_sizes = [image.shape[-2:] for image in pixel_values]
if do_pad:
pixel_values = self._pad_for_batching(pixel_values, image_sizes)
pixel_values = np.array(pixel_values)
return BatchFeature(
data={"pixel_values": pixel_values, "image_sizes": image_sizes}, tensor_type=return_tensors
)
def postprocess(
self,
images: ImageInput,
do_rescale: Optional[bool] = None,
rescale_factor: Optional[float] = None,
do_normalize: Optional[bool] = None,
image_mean: Optional[Union[float, List[float]]] = None,
image_std: Optional[Union[float, List[float]]] = None,
return_tensors: Union[str, TensorType] = "PIL.Image.Image",
input_data_format: Optional[Union[str, ChannelDimension]] = None,
):
"""
Postprocess an image or batch of images tensor. Postprocess is the reverse process of preprocess.
The parameters should be same as in preprocess.
Args:
images (`ImageInput`):
Image to postprocess. Expects a single or batch of images with pixel values ranging from -1 to 1.
do_rescale (`bool`, *optional*, defaults to `self.do_rescale`):
Whether to rescale the image.
rescale_factor (`float`, *optional*, defaults to `self.rescale_factor`):
Rescale factor to rescale the image by if `do_rescale` is set to `True`.
do_normalize (`bool`, *optional*, defaults to `self.do_normalize`):
Whether to normalize the image.
image_mean (`float` or `List[float]`, *optional*, defaults to `self.image_mean`):
Image mean to use for normalization. Only has an effect if `do_normalize` is set to `True`.
image_std (`float` or `List[float]`, *optional*, defaults to `self.image_std`):
Image standard deviation to use for normalization. Only has an effect if `do_normalize` is set to `True`.
return_tensors (`str` or `TensorType`, *optional*):
The type of tensors to return. Can be one of:
- Unset: Return a list of `np.ndarray`.
- `TensorType.PYTORCH` or `'pt'`: Return a batch of type `torch.Tensor`.
- `TensorType.NUMPY` or `'np'`: Return a batch of type `np.ndarray`.
input_data_format (`ChannelDimension` or `str`, *optional*):
The channel dimension format for the input image. If unset, the channel dimension format is inferred
from the input image. Can be one of:
- `"channels_first"` or `ChannelDimension.FIRST`: image in (num_channels, height, width) format.
- `"channels_last"` or `ChannelDimension.LAST`: image in (height, width, num_channels) format.
- `"none"` or `ChannelDimension.NONE`: image in (height, width) format.
"""
do_rescale = do_rescale if do_rescale is not None else self.do_rescale
rescale_factor = 1.0 / self.rescale_factor if rescale_factor is None else rescale_factor
do_normalize = do_normalize if do_normalize is not None else self.do_normalize
image_mean = image_mean if image_mean is not None else self.image_mean
image_std = image_std if image_std is not None else self.image_std
images = make_list_of_images(images)
if isinstance(images[0], Image.Image):
return images if len(images) > 1 else images[0]
if input_data_format is None:
# We assume that all images have the same channel dimension format.
input_data_format = infer_channel_dimension_format(images[0])
pixel_values = []
for image in images:
image = to_numpy_array(image)
if do_normalize:
image = self.unnormalize(
image=image, image_mean=image_mean, image_std=image_std, input_data_format=input_data_format
)
if do_rescale:
image = self.rescale(image, scale=rescale_factor, input_data_format=input_data_format)
image = image.clip(0, 255).astype(np.uint8)
if do_normalize and do_rescale and return_tensors == "PIL.Image.Image":
image = to_channel_dimension_format(image, ChannelDimension.LAST, input_channel_dim=input_data_format)
pixel_values.append(Image.fromarray(image))
else:
pixel_values.extend(image)
data = {"pixel_values": pixel_values}
return_tensors = return_tensors if return_tensors != "PIL.Image.Image" else None
return BatchFeature(data=data, tensor_type=return_tensors)
def unnormalize(
self,
image: np.array,
image_mean: Union[float, Iterable[float]],
image_std: Union[float, Iterable[float]],
input_data_format: Optional[Union[str, ChannelDimension]] = None,
) -> np.array:
"""
Unnormalizes `image` using the mean and standard deviation specified by `mean` and `std`.
image = (image * image_std) + image_mean
Args:
image (`torch.Tensor` of shape `(batch_size, num_channels, image_size, image_size)` or `(num_channels, image_size, image_size)`):
Batch of pixel values to postprocess.
image_mean (`float` or `Iterable[float]`):
The mean to use for unnormalization.
image_std (`float` or `Iterable[float]`):
The standard deviation to use for unnormalization.
input_data_format (`ChannelDimension` or `str`, *optional*):
The channel dimension format for the input image. If unset, the channel dimension format is inferred
from the input image. Can be one of:
- `"channels_first"` or `ChannelDimension.FIRST`: image in (num_channels, height, width) format.
- `"channels_last"` or `ChannelDimension.LAST`: image in (height, width, num_channels) format.
- `"none"` or `ChannelDimension.NONE`: image in (height, width) format.
"""
num_channels = 3
if isinstance(image_mean, Iterable):
if len(image_mean) != num_channels:
raise ValueError(f"mean must have {num_channels} elements if it is an iterable, got {len(image_mean)}")
else:
image_mean = [image_mean] * num_channels
if isinstance(image_std, Iterable):
if len(image_std) != num_channels:
raise ValueError(f"std must have {num_channels} elements if it is an iterable, got {len(image_std)}")
else:
image_std = [image_std] * num_channels
rev_image_mean = tuple(-mean / std for mean, std in zip(image_mean, image_std))
rev_image_std = tuple(1 / std for std in image_std)
image = self.normalize(
image=image, mean=rev_image_mean, std=rev_image_std, input_data_format=input_data_format
)
return image
__all__ = ["Emu3ImageProcessor"]

File diff suppressed because it is too large Load Diff

File diff suppressed because it is too large Load Diff

View File

@@ -0,0 +1,217 @@
# coding=utf-8
# Copyright 2024 HuggingFace Inc. team. All rights reserved.
#
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from typing import List, Optional, Union
from ...image_processing_utils import BatchFeature
from ...image_utils import ImageInput
from ...processing_utils import ImagesKwargs, ProcessingKwargs, ProcessorMixin, TextKwargs, Unpack
from ...tokenization_utils_base import PreTokenizedInput, TextInput
class Emu3TextKwargs(TextKwargs, total=False):
return_for_image_generation: bool
class Emu3ImagesKwargs(ImagesKwargs, total=False):
ratio: str
image_area: int
class Emu3ProcessorKwargs(ProcessingKwargs, total=False):
text_kwargs: Emu3TextKwargs
images_kwargs: Emu3ImagesKwargs
_defaults = {
"text_kwargs": {
"return_for_image_generation": False,
},
"images_kwargs": {
"ratio": "1:1",
"image_area": 518400,
},
}
class Emu3Processor(ProcessorMixin):
r"""
Constructs a Emu3 processor which wraps a Emu3 image processor and a GPT2 tokenizer into a single
processor.
[`Emu3Processor`] offers all the functionalities of [`Emu3ImageProcessor`] and [`GPT2TokenizerFast`].
See the [`~Emu3Processor.__call__`] and [`~Emu3Processor.decode`] for more information.
Args:
image_processor ([`Emu3ImageProcessor`]):
The image processor is a required input.
tokenizer ([`Emu3TokenizerFast`]):
The tokenizer is a required input.
chat_template (`str`, *optional*): A Jinja template which will be used to convert lists of messages
in a chat into a tokenizable string.
"""
attributes = ["image_processor", "tokenizer"]
tokenizer_class = ("GPT2Tokenizer", "GPT2TokenizerFast")
image_processor_class = "Emu3ImageProcessor"
def __init__(
self,
image_processor,
tokenizer,
chat_template=None,
**kwargs,
):
self.image_token = tokenizer.image_token # image_token as placeholder to be replaced by vq-vae tokens
self.image_start_token = tokenizer.boi_token # "<|image start|>" fixed tokens for start and end of image
self.image_end_token = tokenizer.eoi_token # "<|image end|>"
self.fake_token_around_image = tokenizer.image_wrapper_token # "<|image token|>" every image starts with it
self.eof_token = tokenizer.eof_token # "<|extra_201|>"
self.bos_token = tokenizer.bos_token
self.downsample_ratio = 8
super().__init__(image_processor, tokenizer, chat_template=chat_template)
def __call__(
self,
images: Optional[ImageInput] = None,
text: Optional[Union[TextInput, PreTokenizedInput, List[TextInput], List[PreTokenizedInput]]] = None,
audio=None,
videos=None,
**kwargs: Unpack[Emu3ProcessorKwargs],
) -> BatchFeature:
"""
Main method to prepare for the model one or several sequences(s) and image(s). This method forwards the `text`
and `kwargs` arguments to Emu3TokenizerFast's [`~Emu3TokenizerFast.__call__`] if `text` is not `None` to encode
the text. To prepare the image(s), this method forwards the `images` and `kwrags` arguments to
CLIPImageProcessor's [`~CLIPImageProcessor.__call__`] if `images` is not `None`. Please refer to the doctsring
of the above two methods for more information.
Args:
images (`PIL.Image.Image`, `np.ndarray`, `torch.Tensor`, `List[PIL.Image.Image]`, `List[np.ndarray]`, `List[torch.Tensor]`):
The image or batch of images to be prepared. Each image can be a PIL image, NumPy array or PyTorch
tensor. Both channels-first and channels-last formats are supported.
text (`str`, `List[str]`, `List[List[str]]`):
The sequence or batch of sequences to be encoded. Each sequence can be a string or a list of strings
(pretokenized string). If the sequences are provided as list of strings (pretokenized), you must set
`is_split_into_words=True` (to lift the ambiguity with a batch of sequences).
return_tensors (`str` or [`~utils.TensorType`], *optional*):
If set, will return tensors of a particular framework. Acceptable values are:
- `'tf'`: Return TensorFlow `tf.constant` objects.
- `'pt'`: Return PyTorch `torch.Tensor` objects.
- `'np'`: Return NumPy `np.ndarray` objects.
- `'jax'`: Return JAX `jnp.ndarray` objects.
Returns:
[`BatchFeature`]: A [`BatchFeature`] with the following fields:
- **input_ids** -- List of token ids to be fed to a model. Returned when `text` is not `None`.
- **attention_mask** -- List of indices specifying which tokens should be attended to by the model (when
`return_attention_mask=True` or if *"attention_mask"* is in `self.model_input_names` and if `text` is not
`None`).
- **pixel_values** -- Pixel values to be fed to a model. Returned when `images` is not `None`.
"""
# check if images and text inputs are reversed for BC
if isinstance(text, str):
text = [text]
elif not isinstance(text, list) and not isinstance(text[0], str):
raise TypeError("Invalid input text. Please provide a string, or a list of strings")
output_kwargs = self._merge_kwargs(
Emu3ProcessorKwargs,
tokenizer_init_kwargs=self.tokenizer.init_kwargs,
**kwargs,
)
return_for_image_generation = output_kwargs["text_kwargs"].pop("return_for_image_generation", False)
ratio = output_kwargs["images_kwargs"].pop("ratio", None)
image_area = output_kwargs["images_kwargs"].pop("image_area", None)
if return_for_image_generation and images is not None:
raise ValueError("You should not provide `images` when `return_for_image_generation=True`")
if not return_for_image_generation and text is None and images is None:
raise ValueError("You must provide either text or images when `return_for_image_generation=False`")
image_features = {}
image_start_tokens = f"{self.image_start_token}"
image_end_tokens = f"{self.eof_token}{self.image_end_token}"
# generate text from image + text input, so we add placeholders for image tokens
if not return_for_image_generation and images is not None:
image_features = self.image_processor(images, **output_kwargs["images_kwargs"])
image_sizes = iter(image_features.image_sizes)
prompt_strings = []
for sample in text:
while self.image_token in sample:
image_size = next(image_sizes)
height, width = image_size
height = height // self.downsample_ratio
width = width // self.downsample_ratio
image_seq_length = height * (width + 1) # +1 for extra row when converting to BPE in modeling code
image_placeholder = f"{image_start_tokens}{height}*{width}{self.fake_token_around_image}{'<placeholder>' * image_seq_length}{image_end_tokens}"
sample = sample.replace(self.image_token, image_placeholder, 1)
sample = f"{self.bos_token}{sample}" # add BOS because PT tokenizer doesn't add it
prompt_strings.append(sample)
text = [sample.replace("<placeholder>", self.image_token) for sample in prompt_strings]
# generate image from text input, so we add begin-of-image tokens from where image generation starts
elif return_for_image_generation:
height, width = self.calculate_generate_size(ratio, image_area, self.downsample_ratio)
image_prompt = f"{image_start_tokens}{height}*{width}{self.fake_token_around_image}"
text = [f"{self.bos_token}{sample}{image_prompt}" for sample in text]
image_features["image_sizes"] = [[height, width]] * len(text)
# else just generate from text-only input, and we do no special treatment for text
data = self.tokenizer(text, **output_kwargs["text_kwargs"])
data.update(**image_features)
return BatchFeature(data=data, tensor_type=output_kwargs["common_kwargs"]["return_tensors"])
def calculate_generate_size(self, ratio, image_area, spatial_factor):
width, height = map(int, ratio.split(":"))
current_area = width * height
target_ratio = (image_area / current_area) ** 0.5
token_height = int(round(height * target_ratio / spatial_factor))
token_width = int(round(width * target_ratio / spatial_factor))
return token_height, token_width
def postprocess(self, images: ImageInput, **kwargs):
return self.image_processor.postprocess(images, **kwargs)
def batch_decode(self, *args, **kwargs):
"""
This method forwards all its arguments to Emu3TokenizerFast's [`~PreTrainedTokenizer.batch_decode`]. Please
refer to the docstring of this method for more information.
"""
return self.tokenizer.batch_decode(*args, **kwargs)
def decode(self, *args, **kwargs):
"""
This method forwards all its arguments to Emu3TokenizerFast's [`~PreTrainedTokenizer.decode`]. Please refer to
the docstring of this method for more information.
"""
return self.tokenizer.decode(*args, **kwargs)
@property
def model_input_names(self):
tokenizer_input_names = self.tokenizer.model_input_names
image_processor_input_names = self.image_processor.model_input_names
return list(dict.fromkeys(tokenizer_input_names + image_processor_input_names))
__all__ = ["Emu3Processor"]

View File

@@ -3933,6 +3933,41 @@ def load_tf_weights_in_electra(*args, **kwargs):
requires_backends(load_tf_weights_in_electra, ["torch"])
class Emu3ForCausalLM(metaclass=DummyObject):
_backends = ["torch"]
def __init__(self, *args, **kwargs):
requires_backends(self, ["torch"])
class Emu3ForConditionalGeneration(metaclass=DummyObject):
_backends = ["torch"]
def __init__(self, *args, **kwargs):
requires_backends(self, ["torch"])
class Emu3PreTrainedModel(metaclass=DummyObject):
_backends = ["torch"]
def __init__(self, *args, **kwargs):
requires_backends(self, ["torch"])
class Emu3TextModel(metaclass=DummyObject):
_backends = ["torch"]
def __init__(self, *args, **kwargs):
requires_backends(self, ["torch"])
class Emu3VQVAE(metaclass=DummyObject):
_backends = ["torch"]
def __init__(self, *args, **kwargs):
requires_backends(self, ["torch"])
class EncodecModel(metaclass=DummyObject):
_backends = ["torch"]

View File

@@ -233,6 +233,13 @@ class EfficientNetImageProcessor(metaclass=DummyObject):
requires_backends(self, ["vision"])
class Emu3ImageProcessor(metaclass=DummyObject):
_backends = ["vision"]
def __init__(self, *args, **kwargs):
requires_backends(self, ["vision"])
class FlavaFeatureExtractor(metaclass=DummyObject):
_backends = ["vision"]

View File

@@ -1626,7 +1626,7 @@ class GenerationTesterMixin:
# checks without adding test complexity. Ditto for `pixel_values_videos` and `pixel_values_images`
pixel_values_is_mutually_exclusive = any(
model_name in model_class.__name__.lower()
for model_name in ["llava", "idefics2", "idefics3", "mllama", "paligemma"]
for model_name in ["llava", "idefics2", "idefics3", "mllama", "paligemma", "emu3"]
)
if pixel_values_is_mutually_exclusive:
inputs_dict.pop("pixel_values", None)
@@ -1700,6 +1700,18 @@ class GenerationTesterMixin:
if "inputs_embeds" not in inspect.signature(model.prepare_inputs_for_generation).parameters.keys():
self.skipTest(reason="This model does not support `inputs_embeds` in generation")
# Some VLMs assume `inputs_embeds` and `pixel_values` are mutually exclusive AND fall in the
# exception above (complex `inputs_embeds` computation). Popping `pixel_values` allow us to run the
# checks without adding test complexity. Ditto for `pixel_values_videos` and `pixel_values_images`
pixel_values_is_mutually_exclusive = any(
model_name in model_class.__name__.lower()
for model_name in ["llava", "idefics2", "idefics3", "mllama", "paligemma", "emu3"]
)
if pixel_values_is_mutually_exclusive:
inputs_dict.pop("pixel_values", None)
inputs_dict.pop("pixel_values_videos", None)
inputs_dict.pop("pixel_values_images", None)
input_ids = inputs_dict.pop("input_ids")
model.config.use_cache = True
@@ -1941,6 +1953,10 @@ class GenerationTesterMixin:
for dtype in (torch.float32, torch.float16):
model = model_class(config).to(torch_device).to(dtype).eval()
inputs_dict = {
k: v.to(dtype) if isinstance(v, torch.Tensor) and torch.is_floating_point(v) else v
for k, v in inputs_dict.items()
}
set_model_for_less_flaky_test(model)
generation_kwargs = {

View File

View File

@@ -0,0 +1,550 @@
# coding=utf-8
# Copyright 2024 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Testing suite for the PyTorch emu3 model."""
import unittest
import numpy as np
import requests
from huggingface_hub import hf_hub_download
from parameterized import parameterized
from transformers import Emu3Config, Emu3TextConfig, is_torch_available, is_vision_available, set_seed
from transformers.testing_utils import (
require_bitsandbytes,
require_torch,
slow,
torch_device,
)
from ...generation.test_utils import GenerationTesterMixin
from ...test_configuration_common import ConfigTester
from ...test_modeling_common import ModelTesterMixin, floats_tensor, ids_tensor
from ...test_pipeline_mixin import PipelineTesterMixin
if is_vision_available():
from PIL import Image
if is_torch_available():
import torch
from transformers import (
Emu3ForCausalLM,
Emu3ForConditionalGeneration,
Emu3Processor,
Emu3TextModel,
)
class Emu3Text2TextModelTester:
def __init__(
self,
parent,
batch_size=13,
seq_length=7,
is_training=False,
vocab_size=99,
hidden_size=32,
num_hidden_layers=2,
num_attention_heads=2,
num_key_value_heads=2,
intermediate_size=37,
max_position_embeddings=512,
initializer_range=0.02,
pad_token_id=0,
bos_token_id=1,
eos_token_id=2,
):
self.parent = parent
self.batch_size = batch_size
self.seq_length = seq_length
self.is_training = is_training
self.vocab_size = vocab_size
self.hidden_size = hidden_size
self.num_hidden_layers = num_hidden_layers
self.num_attention_heads = num_attention_heads
self.num_key_value_heads = num_key_value_heads
self.intermediate_size = intermediate_size
self.max_position_embeddings = max_position_embeddings
self.initializer_range = initializer_range
self.pad_token_id = pad_token_id
self.bos_token_id = bos_token_id
self.eos_token_id = eos_token_id
def prepare_config_and_inputs(self):
input_ids = ids_tensor([self.batch_size, self.seq_length], self.vocab_size)
attention_mask = input_ids.ne(1).to(torch_device)
config = self.get_config()
return config, input_ids, attention_mask
def get_config(self):
return Emu3TextConfig(
vocab_size=self.vocab_size,
hidden_size=self.hidden_size,
num_hidden_layers=self.num_hidden_layers,
num_attention_heads=self.num_attention_heads,
num_key_value_heads=self.num_key_value_heads,
intermediate_size=self.intermediate_size,
max_position_embeddings=self.max_position_embeddings,
is_decoder=False,
initializer_range=self.initializer_range,
pad_token_id=self.pad_token_id,
bos_token_id=self.bos_token_id,
eos_token_id=self.eos_token_id,
)
def prepare_config_and_inputs_for_common(self):
config_and_inputs = self.prepare_config_and_inputs()
(
config,
input_ids,
attention_mask,
) = config_and_inputs
inputs_dict = {"input_ids": input_ids, "attention_mask": attention_mask}
return config, inputs_dict
@require_torch
class Emu3Text2TextModelTest(ModelTesterMixin, GenerationTesterMixin, PipelineTesterMixin, unittest.TestCase):
all_model_classes = (Emu3ForCausalLM,) if is_torch_available() else ()
all_generative_model_classes = (Emu3ForCausalLM,) if is_torch_available() else ()
pipeline_model_mapping = (
{
"text-generation": Emu3ForCausalLM,
}
if is_torch_available()
else {}
)
test_headmasking = False
test_pruning = False
fx_compatible = False
def setUp(self):
self.model_tester = Emu3Text2TextModelTester(self)
self.config_tester = ConfigTester(self, config_class=Emu3TextConfig, hidden_size=37)
def test_config(self):
self.config_tester.run_common_tests()
@parameterized.expand([("linear",), ("dynamic",)])
def test_model_rope_scaling(self, scaling_type):
config, _ = self.model_tester.prepare_config_and_inputs_for_common()
short_input = ids_tensor([1, 10], config.vocab_size)
long_input = ids_tensor([1, int(config.max_position_embeddings * 1.5)], config.vocab_size)
set_seed(42) # Fixed seed at init time so the two models get the same random weights
original_model = Emu3TextModel(config)
original_model.to(torch_device)
original_model.eval()
original_short_output = original_model(short_input).last_hidden_state
original_long_output = original_model(long_input).last_hidden_state
set_seed(42) # Fixed seed at init time so the two models get the same random weights
config.rope_scaling = {"type": scaling_type, "factor": 10.0}
scaled_model = Emu3TextModel(config)
scaled_model.to(torch_device)
scaled_model.eval()
scaled_short_output = scaled_model(short_input).last_hidden_state
scaled_long_output = scaled_model(long_input).last_hidden_state
# Dynamic scaling does not change the RoPE embeddings until it receives an input longer than the original
# maximum sequence length, so the outputs for the short input should match.
if scaling_type == "dynamic":
self.assertTrue(torch.allclose(original_short_output, scaled_short_output, atol=1e-5))
else:
self.assertFalse(torch.allclose(original_short_output, scaled_short_output, atol=1e-5))
# The output should be different for long inputs
self.assertFalse(torch.allclose(original_long_output, scaled_long_output, atol=1e-5))
@unittest.skip("Doesn't work, tensors are not almost same") # TODO raushan fixme
def test_custom_4d_attention_mask(self):
pass
@unittest.skip("Fails with unknown error only on end-to-end compile") # TODO raushan fixme
def test_generate_compile_1_end_to_end(self):
pass
class Emu3Vision2TextModelTester:
def __init__(
self,
parent,
batch_size=13,
seq_length=7,
is_training=False,
vocab_size=99,
hidden_size=32,
num_hidden_layers=2,
num_attention_heads=2,
num_key_value_heads=2,
intermediate_size=37,
max_position_embeddings=512,
initializer_range=0.02,
pad_token_id=0,
bos_token_id=1,
eos_token_id=2,
image_token_id=3,
image_size=30,
codebook_size=20,
temporal_downsample_factor=1,
base_channels=32,
vq_channel_multiplier=[1, 1],
image_seq_length=100,
vq_img_token_start_id=3,
):
self.parent = parent
self.batch_size = batch_size
self.is_training = is_training
self.vocab_size = vocab_size
self.hidden_size = hidden_size
self.num_hidden_layers = num_hidden_layers
self.num_attention_heads = num_attention_heads
self.num_key_value_heads = num_key_value_heads
self.intermediate_size = intermediate_size
self.max_position_embeddings = max_position_embeddings
self.initializer_range = initializer_range
self.pad_token_id = pad_token_id
self.bos_token_id = bos_token_id
self.eos_token_id = eos_token_id
self.image_token_id = image_token_id
self.image_size = image_size
self.codebook_size = codebook_size
self.temporal_downsample_factor = temporal_downsample_factor
self.vq_channel_multiplier = vq_channel_multiplier
self.vq_img_token_start_id = vq_img_token_start_id
self.base_channels = base_channels
self.seq_length = seq_length + image_seq_length
self.image_seq_length = image_seq_length
def prepare_config_and_inputs(self):
config = self.get_config()
input_ids = ids_tensor([self.batch_size, self.seq_length], config.text_config.vocab_size)
attention_mask = input_ids.ne(1).to(torch_device)
input_ids[input_ids == self.image_token_id] = self.pad_token_id
input_ids[:, : self.image_seq_length] = self.image_token_id
pixel_values = floats_tensor(
[
self.batch_size,
3,
self.image_size,
self.image_size,
]
)
image_sizes = [[self.image_size, self.image_size]] * self.batch_size
image_sizes = torch.tensor(image_sizes, device=torch_device, dtype=torch.int64)
return config, input_ids, attention_mask, pixel_values, image_sizes
def get_config(self):
# create dummy vocab map for image2bpe mapping if it needs remapping
# we assume that vocab size is big enough to account for `codebook_size` amount of
# image tokens somewhere at the beginning of total vocab size
vocab_map = {i: chr(i) for i in range(self.vocab_size)}
start = self.vq_img_token_start_id
end = self.vq_img_token_start_id + self.codebook_size
for i in range(start, end):
# dummy str for each token, anything that fits pattern "<|visual token XXXXXX|>"
vocab_map[i] = f"<|visual token{i:06d}|>"
# add tokens that have to be in the vocab, we'll retrieve their ids later in modeling code
vocab_map[self.image_token_id] = "<image>"
vocab_map[self.image_token_id + 1] = "<|extra_200|>"
vocab_map = {v: k for k, v in vocab_map.items()}
text_config = Emu3TextConfig(
vocab_size=self.vocab_size,
hidden_size=self.hidden_size,
num_hidden_layers=self.num_hidden_layers,
num_attention_heads=self.num_attention_heads,
num_key_value_heads=self.num_key_value_heads,
intermediate_size=self.intermediate_size,
max_position_embeddings=self.max_position_embeddings,
initializer_range=self.initializer_range,
pad_token_id=self.pad_token_id,
bos_token_id=self.bos_token_id,
eos_token_id=self.eos_token_id,
)
vq_config = {
"codebook_size": self.codebook_size,
"temporal_downsample_factor": self.temporal_downsample_factor,
"base_channels": self.base_channels,
"channel_multiplier": self.vq_channel_multiplier,
"hidden_size": self.base_channels,
}
return Emu3Config(text_config=text_config, vq_config=vq_config, vocabulary_map=vocab_map)
def prepare_config_and_inputs_for_common(self):
config_and_inputs = self.prepare_config_and_inputs()
(
config,
input_ids,
attention_mask,
pixel_values,
image_sizes,
) = config_and_inputs
inputs_dict = {
"input_ids": input_ids,
"attention_mask": attention_mask,
"pixel_values": pixel_values,
"image_sizes": image_sizes,
}
return config, inputs_dict
@require_torch
class Emu3Vision2TextModelTest(ModelTesterMixin, GenerationTesterMixin, PipelineTesterMixin, unittest.TestCase):
all_model_classes = (Emu3ForConditionalGeneration,) if is_torch_available() else ()
all_generative_model_classes = (Emu3ForConditionalGeneration,) if is_torch_available() else ()
pipeline_model_mapping = {}
test_headmasking = False
test_pruning = False
fx_compatible = False
def setUp(self):
self.model_tester = Emu3Vision2TextModelTester(self)
self.config_tester = ConfigTester(
self, config_class=Emu3Config, has_text_modality=False, common_properties=["vocabulary_map"]
)
def test_config(self):
self.config_tester.run_common_tests()
# overwrite inputs_embeds tests because we need to delete "pixel values" for LVLMs
def test_inputs_embeds(self):
config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()
for model_class in self.all_model_classes:
model = model_class(config)
model.to(torch_device)
model.eval()
inputs = self._prepare_for_class(inputs_dict, model_class)
input_ids = inputs["input_ids"]
del inputs["input_ids"]
del inputs["pixel_values"]
wte = model.get_input_embeddings()
inputs["inputs_embeds"] = wte(input_ids)
with torch.no_grad():
model(**inputs)
# overwrite inputs_embeds tests because we need to delete "pixel values" for LVLMs
# while some other models require pixel_values to be present
def test_inputs_embeds_matches_input_ids(self):
config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()
for model_class in self.all_model_classes:
model = model_class(config)
model.to(torch_device)
model.eval()
inputs = self._prepare_for_class(inputs_dict, model_class)
input_ids = inputs["input_ids"]
del inputs["input_ids"]
del inputs["pixel_values"]
inputs_embeds = model.get_input_embeddings()(input_ids)
with torch.no_grad():
out_ids = model(input_ids=input_ids, **inputs)[0]
out_embeds = model(inputs_embeds=inputs_embeds, **inputs)[0]
self.assertTrue(torch.allclose(out_embeds, out_ids))
@unittest.skip(
"Emu3 has a VQ module that uses `weight.data` directly in forward which prevent offloding on that module"
)
def test_disk_offload_safetensors(self):
pass
@unittest.skip(
"Emu3 has a VQ module that uses `weight.data` directly in forward which prevent offloding on that module"
)
def test_disk_offload_bin(self):
pass
@unittest.skip(
"Emu3 has a VQ module that uses `weight.data` directly in forward which prevent offloding on that module"
)
def test_cpu_offload(self):
pass
@unittest.skip("Doesn't work, tensors are not almost same") # TODO raushan fixme
def test_custom_4d_attention_mask(self):
pass
@unittest.skip("VQ-VAE module doesn't initialize weights properly")
def test_initialization(self):
pass
@unittest.skip("End-to-end compilation is not supported due to dynamic control in `prepare_inputs_for_generation`")
def test_generate_compile_1_end_to_end(self):
pass
@require_torch
class Emu3IntegrationTest(unittest.TestCase):
@slow
@require_bitsandbytes
def test_model_generation(self):
model = Emu3ForConditionalGeneration.from_pretrained(
"Emu3-community/Emu3-Chat-hf", load_in_4bit=True, device_map="auto"
)
processor = Emu3Processor.from_pretrained("Emu3-community/Emu3-Chat-hf")
image = Image.open(
requests.get("https://nineplanets.org/wp-content/uploads/2020/12/the-big-dipper-1.jpg", stream=True).raw
)
prompt = "USER: <image>Describe what do you see here and tell me about the history behind it? ASSISTANT:"
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device, torch.float16)
# greedy generation outputs
EXPECTED_TEXT_COMPLETION = ['USER: 114*143Describe what do you see here and tell me about the history behind it? ASSISTANT: The image depicts the constellation of Ursa Minor, also known as the Little Bear. This constellation was one of the 24 modern constellations introduced by Charles Messier in 178'] # fmt: skip
generated_ids = model.generate(**inputs, max_new_tokens=40, do_sample=False)
text = processor.batch_decode(generated_ids, skip_special_tokens=True)
self.assertEqual(EXPECTED_TEXT_COMPLETION, text)
@slow
@require_bitsandbytes
def test_model_generation_batched(self):
model = Emu3ForConditionalGeneration.from_pretrained(
"Emu3-community/Emu3-Chat-hf", load_in_4bit=True, device_map="auto"
)
processor = Emu3Processor.from_pretrained("Emu3-community/Emu3-Chat-hf")
processor.tokenizer.padding_side = "left"
image = Image.open(
requests.get("https://nineplanets.org/wp-content/uploads/2020/12/the-big-dipper-1.jpg", stream=True).raw
)
image_2 = Image.open(
requests.get("https://www.kxan.com/wp-content/uploads/sites/40/2020/10/ORION.jpg", stream=True).raw
)
prompts = [
"USER: <image>Describe what do you see here and tell me about the history behind it? ASSISTANT:",
"USER: <image>What do you know about the constellation in this image? ASSISTANT:",
]
inputs = processor(images=[image, image_2], text=prompts, padding=True, return_tensors="pt").to(
model.device, torch.float16
)
# greedy generation outputs
EXPECTED_TEXT_COMPLETION = [
'USER: 114*143Describe what do you see here and tell me about the history behind it? ASSISTANT: The image depicts the constellation of Ursa Minor, also known as the Little Bear. This constellation was one of the 24 modern constellations introduced by Charles Messier in 178',
'USER: 75*125What do you know about the constellation in this image? ASSISTANT: The image shows a segment of a wire rope, characterized by its consistent pattern and regular twists, indicative of a high-quality, well-made rope. This type of detail suggests careful manufacturing processes and attention to'
] # fmt: skip
generated_ids = model.generate(**inputs, max_new_tokens=40, do_sample=False)
text = processor.batch_decode(generated_ids, skip_special_tokens=True)
self.assertEqual(EXPECTED_TEXT_COMPLETION, text)
@slow
@require_bitsandbytes
def test_model_generation_multi_image(self):
model = Emu3ForConditionalGeneration.from_pretrained(
"Emu3-community/Emu3-Chat-hf", load_in_4bit=True, device_map="auto"
)
processor = Emu3Processor.from_pretrained("Emu3-community/Emu3-Chat-hf")
image = Image.open(
requests.get("https://nineplanets.org/wp-content/uploads/2020/12/the-big-dipper-1.jpg", stream=True).raw
)
image_2 = Image.open(
requests.get("https://www.kxan.com/wp-content/uploads/sites/40/2020/10/ORION.jpg", stream=True).raw
)
prompt = "USER: <image><image>What do these two images have in common? ASSISTANT:"
inputs = processor(images=[image, image_2], text=prompt, return_tensors="pt").to(model.device, torch.float16)
# greedy generation outputs
EXPECTED_TEXT_COMPLETION = ['USER: 114*14375*125What do these two images have in common? ASSISTANT: The two images both depict a geometric shape - a triangle in the larger image and a line segment in the smaller image. They share a common feature of being created with a series of connected dots, which'] # fmt: skip
generated_ids = model.generate(**inputs, max_new_tokens=40, do_sample=False)
text = processor.batch_decode(generated_ids, skip_special_tokens=True)
self.assertEqual(EXPECTED_TEXT_COMPLETION, text)
@slow
@require_bitsandbytes
def test_model_generate_images(self):
model = Emu3ForConditionalGeneration.from_pretrained(
"Emu3-community/Emu3-Gen-hf", load_in_4bit=True, device_map="auto"
)
processor = Emu3Processor.from_pretrained("Emu3-community/Emu3-Chat-hf")
inputs = processor(
text=["a portrait of young girl. masterpiece, film grained, best quality."],
padding=True,
return_tensors="pt",
return_for_image_generation=True,
).to(model.device)
self.assertTrue(inputs.input_ids.shape[1] == 23)
image_sizes = inputs.pop("image_sizes")
HEIGHT, WIDTH = image_sizes[0]
VISUAL_TOKENS = model.vocabulary_mapping.image_tokens
def prefix_allowed_tokens_fn(batch_id, input_ids):
height, width = HEIGHT, WIDTH
visual_tokens = VISUAL_TOKENS
image_wrapper_token_id = torch.tensor([processor.tokenizer.image_wrapper_token_id], device=model.device)
eoi_token_id = torch.tensor([processor.tokenizer.eoi_token_id], device=model.device)
eos_token_id = torch.tensor([processor.tokenizer.eos_token_id], device=model.device)
pad_token_id = torch.tensor([processor.tokenizer.pad_token_id], device=model.device)
eof_token_id = torch.tensor([processor.tokenizer.eof_token_id], device=model.device)
eol_token_id = processor.tokenizer.encode("<|extra_200|>", return_tensors="pt")[0]
position = torch.nonzero(input_ids == image_wrapper_token_id, as_tuple=True)[0][0]
offset = input_ids.shape[0] - position
if offset % (width + 1) == 0:
return (eol_token_id,)
elif offset == (width + 1) * height + 1:
return (eof_token_id,)
elif offset == (width + 1) * height + 2:
return (eoi_token_id,)
elif offset == (width + 1) * height + 3:
return (eos_token_id,)
elif offset > (width + 1) * height + 3:
return (pad_token_id,)
else:
return visual_tokens
out = model.generate(
**inputs,
max_new_tokens=50_000,
prefix_allowed_tokens_fn=prefix_allowed_tokens_fn,
do_sample=False,
)
self.assertTrue(out.shape[1] == 8216)
image = model.decode_image_tokens(out[:, inputs.input_ids.shape[1] :], height=HEIGHT, width=WIDTH)
images = processor.postprocess(list(image.float()), return_tensors="np")
self.assertTrue(images["pixel_values"].shape == (3, 720, 720))
self.assertTrue(isinstance(images["pixel_values"], np.ndarray))
filepath = hf_hub_download(
repo_id="raushan-testing-hf/images_test",
filename="emu3_generated_pixels.npy",
repo_type="dataset",
)
original_pixels = np.load(filepath)
self.assertTrue(np.allclose(original_pixels, images["pixel_values"]))

View File

@@ -0,0 +1,85 @@
# coding=utf-8
# Copyright 2024 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Testing suite for the PyTorch emu3 model."""
import tempfile
import unittest
import numpy as np
from transformers import Emu3Processor, GPT2TokenizerFast
from transformers.utils import is_vision_available
from ...test_processing_common import ProcessorTesterMixin
if is_vision_available():
from transformers import Emu3ImageProcessor
class Emu3ProcessorTest(ProcessorTesterMixin, unittest.TestCase):
processor_class = Emu3Processor
def setUp(self):
self.tmpdirname = tempfile.mkdtemp()
image_processor = Emu3ImageProcessor()
extra_special_tokens = extra_special_tokens = {
"image_token": "<image>",
"boi_token": "<|image start|>",
"eoi_token": "<|image end|>",
"image_wrapper_token": "<|image token|>",
"eof_token": "<|extra_201|>",
}
tokenizer = GPT2TokenizerFast.from_pretrained(
"openai-community/gpt2", extra_special_tokens=extra_special_tokens
)
tokenizer.pad_token_id = 0
tokenizer.sep_token_id = 1
processor = self.processor_class(
image_processor=image_processor, tokenizer=tokenizer, chat_template="dummy_template"
)
processor.save_pretrained(self.tmpdirname)
def test_processor_for_generation(self):
processor_components = self.prepare_components()
processor = self.processor_class(**processor_components)
# we don't need an image as input because the model will generate one
input_str = "lower newer"
image_input = self.prepare_image_inputs()
inputs = processor(text=input_str, return_for_image_generation=True, return_tensors="pt")
self.assertListEqual(list(inputs.keys()), ["input_ids", "attention_mask", "image_sizes"])
self.assertEqual(inputs[self.text_input_name].shape[-1], 8)
# when `return_for_image_generation` is set, we raise an error that image should not be provided
with self.assertRaises(ValueError):
inputs = processor(
text=input_str, images=image_input, return_for_image_generation=True, return_tensors="pt"
)
def test_processor_postprocess(self):
processor_components = self.prepare_components()
processor = self.processor_class(**processor_components)
input_str = "lower newer"
orig_image_input = self.prepare_image_inputs()
orig_image = np.array(orig_image_input).transpose(2, 0, 1)
inputs = processor(text=input_str, images=orig_image, do_resize=False, return_tensors="np")
normalized_image_input = inputs.pixel_values
unnormalized_images = processor.postprocess(normalized_image_input, return_tensors="np")["pixel_values"]
# For an image where pixels go from 0 to 255 the diff can be 1 due to some numerical precision errors when scaling and unscaling
self.assertTrue(np.abs(orig_image - unnormalized_images).max() >= 1)

View File

@@ -3877,7 +3877,7 @@ class ModelTesterMixin:
for name, submodule in model_eager.named_modules():
class_name = submodule.__class__.__name__
if "SdpaAttention" in class_name or "SdpaSelfAttention" in class_name:
raise ValueError("The eager model should not have SDPA attention layers")
raise ValueError(f"The eager model should not have SDPA attention layers but got {class_name}")
@require_torch_sdpa
def test_sdpa_can_dispatch_composite_models(self):

View File

@@ -266,6 +266,10 @@ def check_attribute_being_used(config_class, attributes, default_value, source_s
f"config.{attribute}" in modeling_source
or f'getattr(config, "{attribute}"' in modeling_source
or f'getattr(self.config, "{attribute}"' in modeling_source
or (
"TextConfig" in config_class.__name__
and f"config.get_text_config().{attribute}" in modeling_source
)
):
attribute_used = True
# Deal with multi-line cases

View File

@@ -139,6 +139,8 @@ IGNORE_NON_TESTED = (
"Qwen2VLModel", # Building part of bigger (tested) model. Tested implicitly through Qwen2VLForConditionalGeneration.
"MllamaTextModel", # Building part of bigger (tested) model. # TODO: add tests
"MllamaVisionModel", # Building part of bigger (tested) model. # TODO: add tests
"Emu3VQVAE", # Building part of bigger (tested) model
"Emu3TextModel", # Building part of bigger (tested) model
]
)
@@ -333,6 +335,8 @@ IGNORE_NON_AUTO_CONFIGURED = PRIVATE_MODELS.copy() + [
"VitPoseForPoseEstimation",
"CLIPTextModel",
"MoshiForConditionalGeneration", # no auto class for speech-to-speech
"Emu3VQVAE", # no autoclass for VQ-VAE models
"Emu3TextModel", # Building part of bigger (tested) model
]
# DO NOT edit this list!