Add evolla rebase main (#36232)

* add evolla * adding protein encoder part * add initial processing test * save processor * add docstring * add evolla processor * add two test * change vision to protein * change resampler to sequence_compressor * change vision to protein * initial update for llama * add initial update for llamaForCausalLM * add `test_processor`, `test_saprot_output`, `test_protein_encoder_output` * change evolla, but still working on it * add test_single_forward * pass test_attention_outputs * pass test_hidden_states_output * pass test_save_load and test_from_pretrained_no_checkpoint * pass test_cpu_offload * skip some tests * update new progress * skip test_model_is_small * pass test_model_weights_reload_no_missing_tied_weights * pass test_model_get_set_embeddings * pass test_cpu_offload * skip test_resize_embeddings * add pipeline_model_mapping * remote old setUp * pass processor save_pretrained and load_pretrained * remove pooling layer * pass test_inputs_embeds_matches_input_ids * pass test_model_is_small * pass test_attention_outputs * pass test_initialization * pass test_model_get_set_embeddings * pass test_single_forward * skip test_disk_offload_bin and test_disk_offload_safetensors * fix most tests * pass test_protein_encoder_output * remove useless code * add EvollaForProteinText2Text * pass test_saprot_output * pass all EvollaModelTest test and remove processor test * add processor test to its own file * skip is_training since esm skipped it and the saprot code causes error when setting is_training True * pass processor tests * solve all except config * pass most cases * change init * add doc to `configuration_evolla.py` * remove image_processing test * remove extra processor test * remove extra modules * remove extra modules * change all configs into one config * pass all evolla test * pass `make fixup` * update short summary * update Evolla-10B-hf * pass check_dummies.py and check_code_quality * fix 
`tests/models/auto/test_tokenization_auto.py::AutoTokenizerTest::test_model_name_edge_cases_in_mappings` * remove dummy codes * change format * fix llava issue * update format * update to solve llama3 access issue * update to make forward right * solve processor save load problem from instructblip solution * remove unexpected file * skip `test_generation_tester_mixin_inheritance` * add `test_single_forward_correct` and `test_inference_natural_language_protein_reasoning` * add `modular_evolla.py` * solved issue #36362 * run `make fixup` * update modular * solve float32 training * add fix * solve `utils/check_docstrings.py` * update * update * update * remove other files and replace sequential and einsum * add use case in document * update the models * update model * change some wrong code * Update src/transformers/models/evolla/modular_evolla.py Co-authored-by: Cyril Vallez <cyril.vallez@gmail.com> * Update src/transformers/models/evolla/modular_evolla.py Co-authored-by: Cyril Vallez <cyril.vallez@gmail.com> * Update src/transformers/models/evolla/modular_evolla.py Co-authored-by: Cyril Vallez <cyril.vallez@gmail.com> * Update src/transformers/models/evolla/modular_evolla.py Co-authored-by: Cyril Vallez <cyril.vallez@gmail.com> * fix issues mentioned in PR * update style and rearrange the placement * fix return_dict argument issue * solve SaProtConfig issue * Solve EvollaSaProtRotaryEmbedding issue * solve attention_mask issue * solve almosst all issues * make style * update config * remove unrelated pickle file * delete pickle files * fix config * simplify a lot * remove past k-v from encoder * continue work * style * skip it from init * fix init * fix init * simplify more * fill in docstrings * change test for generation * skip test * fix style --------- Co-authored-by: Chenchen Han <13980209828@163.com> Co-authored-by: Cyril Vallez <cyril.vallez@huggingface.co> Co-authored-by: Cyril Vallez <cyril.vallez@gmail.com>
2025-07-26 01:11:57 +08:00
parent 2670da66ce
commit 45c7bfb157
15 changed files with 4120 additions and 0 deletions
--- a/docs/source/en/_toctree.yml
+++ b/docs/source/en/_toctree.yml
@@ -975,6 +975,8 @@
        title: Donut
      - local: model_doc/emu3
        title: Emu3
      - local: model_doc/evolla
        title: Evolla
      - local: model_doc/flava
        title: FLAVA
      - local: model_doc/gemma3
--- a/docs/source/en/model_doc/evolla.md
+++ b/docs/source/en/model_doc/evolla.md
@@ -0,0 +1,95 @@
 <!--Copyright 2025 The HuggingFace Team. All rights reserved.
 Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
 the License. You may obtain a copy of the License at
 http://www.apache.org/licenses/LICENSE-2.0
 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
 an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
 specific language governing permissions and limitations under the License.
 ⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
 rendered properly in your Markdown viewer.
 -->
 # Evolla
 ## Overview
 The Evolla model was proposed in [Decoding the Molecular Language of Proteins with Evolla](https://doi.org/10.1101/2025.01.05.630192) by [Zhou et al.](https://doi.org/10.1101/2025.01.05.630192).
 Evolla is an advanced 80-billion-parameter protein-language generative model designed to decode the molecular language of proteins. It integrates information from protein sequences, structures, and user queries to generate precise and contextually nuanced insights into protein function. Trained on an unprecedented AI-generated dataset of 546 million protein question-answer pairs and 150 billion word tokens, Evolla significantly advances research in proteomics and functional genomics, providing expert-level insights and shedding light on the molecular logic encoded in proteins.
 The abstract from the paper is the following:
 *Proteins, nature’s intricate molecular machines, are the products of billions of years of evolution and play fundamental roles in sustaining life. Yet, deciphering their molecular language - that is, understanding how protein sequences and structures encode and determine biological functions - remains a corner-stone challenge in modern biology. Here, we introduce Evolla, an 80 billion frontier protein-language generative model designed to decode the molecular language of proteins. By integrating information from protein sequences, structures, and user queries, Evolla generates precise and contextually nuanced insights into protein function. A key innovation of Evolla lies in its training on an unprecedented AI-generated dataset: 546 million protein question-answer pairs and 150 billion word tokens, designed to reflect the immense complexity and functional diversity of proteins. Post-pretraining, Evolla integrates Direct Preference Optimization (DPO) to refine the model based on preference signals and Retrieval-Augmented Generation (RAG) for external knowledge incorporation, improving response quality and relevance. To evaluate its performance, we propose a novel framework, Instructional Response Space (IRS), demonstrating that Evolla delivers expert-level insights, advancing research in proteomics and functional genomics while shedding light on the molecular logic encoded in proteins. The online demo is available at http://www.chat-protein.com/.*
 Examples:
 ```python
 import torch
 from transformers import EvollaForProteinText2Text, EvollaProcessor
 processor = EvollaProcessor.from_pretrained("westlake-repl/Evolla-10B-DPO-hf")
 model = EvollaForProteinText2Text.from_pretrained("westlake-repl/Evolla-10B-DPO-hf")
 # aa_seq should have same length as foldseek
 protein_inputs = [
    {
        "aa_seq": "MATGGRRG...",
        "foldseek": "###lqpfd...",  # "#" marks a low-confidence foldseek token
    },
    {
        "aa_seq": "MLPGLALL...",
        "foldseek": "dfwwkwad...",
    }
 ]
 message_list = [
    [
        {
            "role": "system",
            "content": "You are an AI expert that can answer any questions about protein.",
        },
        {"role": "user", "content": "What is the function of this protein?"},
    ],
    [
        {
            "role": "system",
            "content": "You are an AI expert that can answer any questions about protein.",
        },
        {"role": "user", "content": "What is the function of this protein?"},
    ]
 ]
 input_dict = processor(
    protein_inputs, message_list, return_tensors="pt", text_max_length=512, protein_max_length=1024
 )
 with torch.no_grad():
    generated_ids = model.generate(**input_dict)
 generated_texts = processor.batch_decode(
    generated_ids, skip_special_tokens=True
 )
 ```
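The processor combines `aa_seq` and `foldseek` into a single structure-aware sequence by pairing each upper-case residue with its lower-case foldseek token, as `process_proteins` in `processing_evolla.py` does. The snippet below is a minimal standalone sketch of that step. The function name `to_structure_aware_sequence` and the explicit length check are illustrative assumptions, not part of the library.

```python
def to_structure_aware_sequence(aa_seq: str, foldseek: str) -> str:
    """Interleave residues (upper case) with foldseek tokens (lower case)."""
    # Assumption: fail loudly on a length mismatch. A bare zip() would
    # silently truncate to the shorter of the two strings.
    if len(aa_seq) != len(foldseek):
        raise ValueError(
            f"aa_seq has length {len(aa_seq)} but foldseek has length {len(foldseek)}"
        )
    # "#" has no case, so low-confidence positions pass through unchanged.
    return "".join(s.upper() + f.lower() for s, f in zip(aa_seq, foldseek))


print(to_structure_aware_sequence("MATG", "##lq"))  # M#A#TlGq
```

Validating the two strings before calling the processor may help catch misaligned inputs early, since the pairing is purely positional.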
 Tips:
 - This model was contributed by [Xibin Bayes Zhou](https://huggingface.co/XibinBayesZhou).
 - The original code can be found [here](https://github.com/westlake-repl/Evolla).
 ## EvollaConfig
 [[autodoc]] EvollaConfig
 ## EvollaModel
 [[autodoc]] EvollaModel
    - forward
 ## EvollaForProteinText2Text
 [[autodoc]] EvollaForProteinText2Text
    - forward
 ## EvollaProcessor
 [[autodoc]] EvollaProcessor
    - __call__
--- a/src/transformers/models/__init__.py
+++ b/src/transformers/models/__init__.py
@@ -110,6 +110,7 @@ if TYPE_CHECKING:
    from .encoder_decoder import *
    from .ernie import *
    from .esm import *
    from .evolla import *
    from .falcon import *
    from .falcon_h1 import *
    from .falcon_mamba import *
--- a/src/transformers/models/auto/configuration_auto.py
+++ b/src/transformers/models/auto/configuration_auto.py
@@ -133,6 +133,7 @@ CONFIG_MAPPING_NAMES = OrderedDict[str, str](
        ("ernie4_5_moe", "Ernie4_5_MoeConfig"),
        ("ernie_m", "ErnieMConfig"),
        ("esm", "EsmConfig"),
        ("evolla", "EvollaConfig"),
        ("falcon", "FalconConfig"),
        ("falcon_h1", "FalconH1Config"),
        ("falcon_mamba", "FalconMambaConfig"),
@@ -528,6 +529,7 @@ MODEL_NAMES_MAPPING = OrderedDict[str, str](
        ("ernie4_5_moe", "Ernie4_5_MoE"),
        ("ernie_m", "ErnieM"),
        ("esm", "ESM"),
        ("evolla", "Evolla"),
        ("falcon", "Falcon"),
        ("falcon3", "Falcon3"),
        ("falcon_h1", "FalconH1"),
--- a/src/transformers/models/auto/modeling_auto.py
+++ b/src/transformers/models/auto/modeling_auto.py
@@ -124,6 +124,7 @@ MODEL_MAPPING_NAMES = OrderedDict(
        ("ernie4_5_moe", "Ernie4_5_MoeModel"),
        ("ernie_m", "ErnieMModel"),
        ("esm", "EsmModel"),
        ("evolla", "EvollaModel"),
        ("falcon", "FalconModel"),
        ("falcon_h1", "FalconH1Model"),
        ("falcon_mamba", "FalconMambaModel"),
@@ -402,6 +403,7 @@ MODEL_FOR_PRETRAINING_MAPPING_NAMES = OrderedDict(
        ("distilbert", "DistilBertForMaskedLM"),
        ("electra", "ElectraForPreTraining"),
        ("ernie", "ErnieForPreTraining"),
        ("evolla", "EvollaForProteinText2Text"),
        ("falcon_mamba", "FalconMambaForCausalLM"),
        ("flaubert", "FlaubertWithLMHeadModel"),
        ("flava", "FlavaForPreTraining"),
@@ -934,6 +936,7 @@ MODEL_FOR_IMAGE_TEXT_TO_TEXT_MAPPING_NAMES = OrderedDict(
        ("blip-2", "Blip2ForConditionalGeneration"),
        ("chameleon", "ChameleonForConditionalGeneration"),
        ("emu3", "Emu3ForConditionalGeneration"),
        ("evolla", "EvollaForProteinText2Text"),
        ("fuyu", "FuyuForCausalLM"),
        ("gemma3", "Gemma3ForConditionalGeneration"),
        ("gemma3n", "Gemma3nForConditionalGeneration"),
--- a/src/transformers/models/auto/processing_auto.py
+++ b/src/transformers/models/auto/processing_auto.py
@@ -64,6 +64,7 @@ PROCESSOR_MAPPING_NAMES = OrderedDict(
        ("colqwen2", "ColQwen2Processor"),
        ("dia", "DiaProcessor"),
        ("emu3", "Emu3Processor"),
        ("evolla", "EvollaProcessor"),
        ("flava", "FlavaProcessor"),
        ("fuyu", "FuyuProcessor"),
        ("gemma3", "Gemma3Processor"),
--- a/src/transformers/models/evolla/__init__.py
+++ b/src/transformers/models/evolla/__init__.py
@@ -0,0 +1,28 @@
 # Copyright 2025 The HuggingFace Team. All rights reserved.
 #
 # Licensed under the Apache License, Version 2.0 (the "License");
 # you may not use this file except in compliance with the License.
 # You may obtain a copy of the License at
 #
 #     http://www.apache.org/licenses/LICENSE-2.0
 #
 # Unless required by applicable law or agreed to in writing, software
 # distributed under the License is distributed on an "AS IS" BASIS,
 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 # See the License for the specific language governing permissions and
 # limitations under the License.
 from typing import TYPE_CHECKING
 from ...utils import _LazyModule
 from ...utils.import_utils import define_import_structure
 if TYPE_CHECKING:
    from .configuration_evolla import *
    from .modeling_evolla import *
    from .processing_evolla import *
 else:
    import sys
    _file = globals()["__file__"]
    sys.modules[__name__] = _LazyModule(__name__, _file, define_import_structure(_file), module_spec=__spec__)
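The `else` branch above swaps the package for a lazy module so that the configuration, modeling, and processing files are only imported when one of their names is first accessed. The following is a minimal sketch of that general idea using only the standard library. It is not the `_LazyModule` implementation from transformers, and the class name `LazyModule` and its `target` argument are hypothetical.

```python
import importlib
import types


class LazyModule(types.ModuleType):
    """Module stand-in that imports its target on first attribute access."""

    def __init__(self, name: str, target: str):
        super().__init__(name)
        self._target = target  # dotted path of the real module
        self._loaded = None    # filled in on first use

    def __getattr__(self, attr):
        # __getattr__ only runs when normal lookup fails, i.e. for names
        # that live in the real module rather than on this stand-in.
        if self._loaded is None:
            self._loaded = importlib.import_module(self._target)
        return getattr(self._loaded, attr)


lazy_json = LazyModule("lazy_json", "json")
assert lazy_json._loaded is None          # nothing imported yet
print(lazy_json.dumps({"a": 1}))          # triggers the import
```

Deferring imports this way keeps the cost of importing the top-level package low when it exposes many heavy submodules.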
--- a/src/transformers/models/evolla/configuration_evolla.py
+++ b/src/transformers/models/evolla/configuration_evolla.py
@@ -0,0 +1,279 @@
 # coding=utf-8
 # Copyright 2025 Westlake Representational Learning Lab (Fajie Yuan Lab) team and the HuggingFace Inc. team. All rights reserved.
 #
 # Licensed under the Apache License, Version 2.0 (the "License");
 # you may not use this file except in compliance with the License.
 # You may obtain a copy of the License at
 #
 #     http://www.apache.org/licenses/LICENSE-2.0
 #
 # Unless required by applicable law or agreed to in writing, software
 # distributed under the License is distributed on an "AS IS" BASIS,
 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 # See the License for the specific language governing permissions and
 # limitations under the License.
 """Evolla model configuration"""
 from ...configuration_utils import PretrainedConfig
 from ...modeling_rope_utils import rope_config_validation
 from ...utils import logging
 logger = logging.get_logger(__name__)
 class SaProtConfig(PretrainedConfig):
    r"""This is the configuration class to store the configuration of an [`EvollaSaProtProteinEncoder`]. It is used to instantiate a
    SaProt model according to the specified arguments, defining the model architecture.
    Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
    documentation from [`PretrainedConfig`] for more information.
    Args:
        vocab_size (`int`, *optional*, defaults to 446):
            Vocabulary size of the protein sequence model. Defines the number of different tokens that can be represented
            by the `input_ids` passed when calling [`EvollaModel`].
        mask_token_id (`int`, *optional*, defaults to 4):
            The id of the *mask* token in the protein sequence model.
        pad_token_id (`int`, *optional*, defaults to 1):
            The id of the *padding* token in the protein sequence model.
        hidden_size (`int`, *optional*, defaults to 1280):
            Dimensionality of the protein sequence model layers and the pooler layer.
        num_hidden_layers (`int`, *optional*, defaults to 33):
            Number of hidden layers in the protein sequence model.
        num_attention_heads (`int`, *optional*, defaults to 20):
            Number of attention heads for each attention layer in the protein sequence model.
        intermediate_size (`int`, *optional*, defaults to 5120):
            Dimensionality of the intermediate layers in the protein sequence model.
        hidden_dropout_prob (`float`, *optional*, defaults to 0.1):
            The dropout ratio for the hidden layers in the protein sequence model.
        attention_probs_dropout_prob (`float`, *optional*, defaults to 0.1):
            The dropout ratio for the attention probabilities in the protein sequence model.
        max_position_embeddings (`int`, *optional*, defaults to 1026):
            The maximum sequence length that the protein sequence model might ever be used with. Typically set this to
            something large just in case (e.g., 512 or 1024 or 2048).
        layer_norm_eps (`float`, *optional*, defaults to 1e-05):
            The epsilon value for the layer normalization layer in the protein sequence model.
        position_embedding_type (`str`, *optional*, defaults to `"rotary"`):
            The type of position embedding to use in the protein sequence model. Currently only `"rotary"` is supported.
        emb_layer_norm_before (`bool`, *optional*, defaults to `False`):
            Whether to apply layer normalization before the position embedding in the protein sequence model.
        token_dropout (`bool`, *optional*, defaults to `True`):
            Whether to apply dropout to the tokens in the protein sequence model."""
    def __init__(
        self,
        vocab_size=446,
        mask_token_id=4,
        pad_token_id=1,
        hidden_size=1280,
        num_hidden_layers=33,
        num_attention_heads=20,
        intermediate_size=5120,
        hidden_dropout_prob=0.1,
        attention_probs_dropout_prob=0.1,
        max_position_embeddings=1026,
        initializer_range=0.02,
        layer_norm_eps=1e-05,
        position_embedding_type="rotary",
        use_cache=True,
        emb_layer_norm_before=False,
        token_dropout=True,
        **kwargs,
    ):
        super().__init__(pad_token_id=pad_token_id, mask_token_id=mask_token_id, **kwargs)
        self.vocab_size = vocab_size
        self.hidden_size = hidden_size
        self.num_hidden_layers = num_hidden_layers
        self.num_attention_heads = num_attention_heads
        self.intermediate_size = intermediate_size
        self.hidden_dropout_prob = hidden_dropout_prob
        self.attention_probs_dropout_prob = attention_probs_dropout_prob
        self.max_position_embeddings = max_position_embeddings
        self.initializer_range = initializer_range
        self.layer_norm_eps = layer_norm_eps
        self.position_embedding_type = position_embedding_type
        self.use_cache = use_cache
        self.emb_layer_norm_before = emb_layer_norm_before
        self.token_dropout = token_dropout
 class EvollaConfig(PretrainedConfig):
    r"""
    This is the configuration class to store the configuration of a [`EvollaModel`]. It is used to instantiate an
    Evolla model according to the specified arguments, defining the model architecture. Instantiating a configuration
    with the defaults will yield a similar configuration to that of the Evolla-10B.
    e.g. [westlake-repl/Evolla-10B-hf](https://huggingface.co/westlake-repl/Evolla-10B-hf)
    Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
    documentation from [`PretrainedConfig`] for more information.
    Args:
        protein_encoder_config (`dict`, *optional*):
            Dictionary of configuration options used to initialize [`SaProtConfig`].
        vocab_size (`int`, *optional*, defaults to 128256):
            Vocabulary size of the Evolla llama model. Defines the number of different tokens that can be represented by the
            `input_ids` passed when calling [`EvollaModel`].
        hidden_size (`int`, *optional*, defaults to 4096):
            Dimensionality of the llama layers and the pooler layer.
        intermediate_size (`int`, *optional*, defaults to 14336):
            Dimensionality of the intermediate layers in the llama model.
        num_hidden_layers (`int`, *optional*, defaults to 32):
            Number of hidden layers in the llama model.
        num_attention_heads (`int`, *optional*, defaults to 32):
            Number of attention heads for each attention layer in the llama model.
        num_key_value_heads (`int`, *optional*, defaults to 8):
            Number of key-value heads for each attention layer in the llama model.
        hidden_act (`str` or `function`, *optional*, defaults to `"silu"`):
            The non-linear activation function (function or string) in the llama model. If string, `"gelu"`, `"relu"`,
            `"selu"` and `"silu"` are supported.
        max_position_embeddings (`int`, *optional*, defaults to 8192):
            The maximum sequence length that this model might ever be used with. Typically set this to something large
            just in case (e.g., 512 or 1024 or 2048).
        rms_norm_eps (`float`, *optional*, defaults to 1e-05):
            The epsilon value for the RMS-norm layer in the llama model.
        rope_theta (`float`, *optional*, defaults to 500000.0):
            The base period of the RoPE embeddings in the llama model.
        rope_scaling (`dict`, *optional*):
            Dictionary containing the scaling configuration for the RoPE embeddings in the llama model. A legacy
            `"type"` key is copied to `"rope_type"`.
        attention_bias (`bool`, *optional*, defaults to `False`):
            Whether to use bias in the attention layer.
        attention_dropout (`float`, *optional*, defaults to 0.0):
            The dropout ratio for the attention layer.
        mlp_bias (`bool`, *optional*, defaults to `False`):
            Whether to use bias in the MLP layer.
        aligner_ffn_mult (`int`, *optional*, defaults to 4):
            The FFN multiplier for the aligner layer.
        aligner_enable_bias (`bool`, *optional*, defaults to `True`):
            Whether to use bias in the aligner layer.
        aligner_attention_probs_dropout_prob (`float`, *optional*, defaults to 0.1):
            The dropout ratio for the attention probabilities in the aligner layer.
        aligner_num_add_layers (`int`, *optional*, defaults to 8):
            The number of additional layers for the aligner layer.
        resampler_depth (`int`, *optional*, defaults to 6):
            The depth of the resampler layer in the llama model.
        resampler_dim_head (`int`, *optional*, defaults to 64):
            The dimension of the heads in the resampler layer in the llama model.
        resampler_heads (`int`, *optional*, defaults to 8):
            The number of heads in the resampler layer in the llama model.
        resampler_num_latents (`int`, *optional*, defaults to 64):
            The number of latents in the resampler layer in the llama model.
        resampler_ff_mult (`int`, *optional*, defaults to 4):
            The FFN multiplier for the resampler layer.
        initializer_range (`float`, *optional*, defaults to 0.02):
            The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
        pad_token_id (`int`, *optional*):
            The id of the *padding* token.
        bos_token_id (`int`, *optional*, defaults to 128000):
            The id of the *beginning-of-sequence* token.
        eos_token_id (`int`, *optional*, defaults to 128009):
            The id of the *end-of-sequence* token.
        use_cache (`bool`, *optional*, defaults to `False`):
            Whether or not the model should return the last key/values attentions (not used by all models).
        tie_word_embeddings (`bool`, *optional*, defaults to `False`):
            Whether or not to tie the input and output word embeddings.
    Example:
    ```python
    >>> from transformers import EvollaModel, EvollaConfig
    >>> # Initializing an Evolla evolla-10b style configuration
    >>> configuration = EvollaConfig()
    >>> # Initializing a model from the evolla-10b style configuration
    >>> model = EvollaModel(configuration)
    >>> # Accessing the model configuration
    >>> configuration = model.config
    ```"""
    model_type = "EvollaModel"
    sub_configs = {"protein_encoder_config": SaProtConfig}
    def __init__(
        self,
        protein_encoder_config=None,
        vocab_size=128256,  # llama vocab size
        hidden_size=4096,  # llama hidden size
        intermediate_size=14336,  # llama intermediate size
        num_hidden_layers=32,  # llama num layers
        num_attention_heads=32,  # llama num heads
        num_key_value_heads=8,  # llama num key-value heads
        hidden_act="silu",  # llama activation function
        max_position_embeddings=8192,  # llama rope max length
        rms_norm_eps=1e-05,
        rope_theta=500000.0,
        rope_scaling=None,
        attention_bias=False,
        attention_dropout=0.0,
        mlp_bias=False,
        aligner_ffn_mult=4,
        aligner_enable_bias=True,
        aligner_attention_probs_dropout_prob=0.1,
        aligner_num_add_layers=8,
        resampler_depth=6,
        resampler_dim_head=64,
        resampler_heads=8,
        resampler_num_latents=64,
        resampler_ff_mult=4,
        initializer_range=0.02,
        pad_token_id=None,
        bos_token_id=128000,
        eos_token_id=128009,
        use_cache=False,
        tie_word_embeddings=False,
        **kwargs,
    ):
        self.vocab_size = vocab_size
        self.hidden_size = hidden_size
        self.intermediate_size = intermediate_size
        self.num_hidden_layers = num_hidden_layers
        self.num_attention_heads = num_attention_heads
        self.num_key_value_heads = num_key_value_heads
        self.hidden_act = hidden_act
        self.max_position_embeddings = max_position_embeddings
        self.rms_norm_eps = rms_norm_eps
        self.tie_word_embeddings = tie_word_embeddings
        self.attention_bias = attention_bias
        self.attention_dropout = attention_dropout
        self.mlp_bias = mlp_bias
        self.aligner_ffn_mult = aligner_ffn_mult
        self.aligner_enable_bias = aligner_enable_bias
        self.aligner_attention_probs_dropout_prob = aligner_attention_probs_dropout_prob
        self.aligner_num_add_layers = aligner_num_add_layers
        self.use_cache = use_cache
        self.initializer_range = initializer_range
        self.resampler_depth = resampler_depth
        self.resampler_dim_head = resampler_dim_head
        self.resampler_heads = resampler_heads
        self.resampler_num_latents = resampler_num_latents
        self.resampler_ff_mult = resampler_ff_mult
        self.rope_theta = rope_theta
        self.rope_scaling = rope_scaling
        # Validate the correctness of rotary position embeddings parameters
        # BC: if there is a 'type' field, copy it to 'rope_type'.
        if self.rope_scaling is not None and "type" in self.rope_scaling:
            self.rope_scaling["rope_type"] = self.rope_scaling["type"]
        rope_config_validation(self)
        # Subconfig
        if protein_encoder_config is None:
            protein_encoder_config = {}
            logger.info("`protein_encoder_config` is `None`. Initializing the `SaProtConfig` with default values.")
        self.protein_encoder_config = SaProtConfig(**protein_encoder_config)
        super().__init__(
            pad_token_id=pad_token_id,
            bos_token_id=bos_token_id,
            eos_token_id=eos_token_id,
            tie_word_embeddings=tie_word_embeddings,
            **kwargs,
        )
 __all__ = ["EvollaConfig"]
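`EvollaConfig.__init__` keeps older checkpoints working by copying a legacy `"type"` key in `rope_scaling` to `"rope_type"` before validation. A minimal sketch of that backward-compatibility step in isolation follows. The helper name `normalize_rope_scaling` is hypothetical, and unlike the config code, which updates the dictionary in place, this version returns a copy.

```python
from typing import Optional


def normalize_rope_scaling(rope_scaling: Optional[dict]) -> Optional[dict]:
    """Return a copy of `rope_scaling` with a legacy "type" key mirrored to "rope_type"."""
    if rope_scaling is None:
        return None
    normalized = dict(rope_scaling)  # assumption: avoid mutating the caller's dict
    if "type" in normalized:
        # As in the config code, the legacy value wins if both keys are present.
        normalized["rope_type"] = normalized["type"]
    return normalized


print(normalize_rope_scaling({"type": "linear", "factor": 2.0}))
```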
--- a/src/transformers/models/evolla/modeling_evolla.py
+++ b/src/transformers/models/evolla/modeling_evolla.py
--- a/src/transformers/models/evolla/modular_evolla.py
+++ b/src/transformers/models/evolla/modular_evolla.py
--- a/src/transformers/models/evolla/processing_evolla.py
+++ b/src/transformers/models/evolla/processing_evolla.py
@@ -0,0 +1,247 @@
 # coding=utf-8
 # Copyright 2025 The HuggingFace Inc. team.
 #
 # Licensed under the Apache License, Version 2.0 (the "License");
 # you may not use this file except in compliance with the License.
 # You may obtain a copy of the License at
 #
 #     http://www.apache.org/licenses/LICENSE-2.0
 #
 # Unless required by applicable law or agreed to in writing, software
 # distributed under the License is distributed on an "AS IS" BASIS,
 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 # See the License for the specific language governing permissions and
 # limitations under the License.
 """
 Processor class for EVOLLA.
 """
 import os
 from typing import Optional, Union
 from ...feature_extraction_utils import BatchFeature
 from ...processing_utils import (
    ProcessorMixin,
 )
 from ..auto import AutoTokenizer
 PROTEIN_VALID_KEYS = ["aa_seq", "foldseek", "msa"]
 class EvollaProcessor(ProcessorMixin):
    r"""
    Constructs an EVOLLA processor which wraps a Llama tokenizer and a SaProt tokenizer (EsmTokenizer) into a single processor.
    [`EvollaProcessor`] offers all the functionalities of [`EsmTokenizer`] and [`LlamaTokenizerFast`]. See the
    docstring of [`~EvollaProcessor.__call__`] and [`~EvollaProcessor.decode`] for more information.
    Args:
        protein_tokenizer (`EsmTokenizer`):
            An instance of [`EsmTokenizer`]. The protein tokenizer is a required input.
        tokenizer (`LlamaTokenizerFast`, *optional*):
            An instance of [`LlamaTokenizerFast`]. Although the argument defaults to `None`, a tokenizer must be
            provided; otherwise a `ValueError` is raised.
        protein_max_length (`int`, *optional*, defaults to 1024):
            The maximum length of the tokenized protein sequence. Longer sequences are truncated.
        text_max_length (`int`, *optional*, defaults to 512):
            The maximum length of the tokenized text prompt. Longer prompts are truncated.
    """
    attributes = ["protein_tokenizer", "tokenizer"]
    valid_kwargs = ["sequence_max_length"]
    # protein_tokenizer_class = "EsmTokenizer"
    # tokenizer_class = "LlamaTokenizerFast"
    protein_tokenizer_class = "AutoTokenizer"
    tokenizer_class = "AutoTokenizer"
    protein_tokenizer_dir_name = "protein_tokenizer"
    # tokenizer_dir_name = "text_tokenizer"
    def __init__(self, protein_tokenizer, tokenizer=None, protein_max_length=1024, text_max_length=512, **kwargs):
        if protein_tokenizer is None:
            raise ValueError("You need to specify a `protein_tokenizer`.")
        if tokenizer is None:
            raise ValueError("You need to specify a `tokenizer`.")
        super().__init__(protein_tokenizer, tokenizer)
        self.tokenizer.pad_token = "<|reserved_special_token_0|>"
        self.protein_max_length = protein_max_length
        self.text_max_length = text_max_length
    def process_proteins(self, proteins, protein_max_length=1024):
        sa_sequences = []
        for protein in proteins:
            aa_seq = protein.get("aa_seq")
            foldseek = protein.get("foldseek")
            sa_sequence = "".join([s.upper() + f.lower() for s, f in zip(aa_seq, foldseek)])
            sa_sequences.append(sa_sequence)
        sa_tokens = self.protein_tokenizer.batch_encode_plus(
            sa_sequences, return_tensors="pt", truncation=True, max_length=protein_max_length, padding=True
        )
        return sa_tokens
    def process_text(
        self,
        texts,
        text_max_length: int = 512,
    ):
        prompts = []
        for messages in texts:
            prompt = self.tokenizer.apply_chat_template(
                messages,
                tokenize=False,
                add_generation_prompt=True,
            )
            prompts.append(prompt)
        prompt_inputs = self.tokenizer(
            prompts,
            add_special_tokens=False,
            return_tensors="pt",
            padding="longest",
            truncation=True,
            max_length=text_max_length,
        )
        return prompt_inputs
    def __call__(
        self,
        proteins: Optional[Union[list[dict], dict]] = None,
        messages_list: Optional[Union[list[list[dict]], list[dict]]] = None,
        protein_max_length: Optional[int] = None,
        text_max_length: Optional[int] = None,
        **kwargs,
    ):
        r"""This method takes batched or non-batched proteins and messages_list and converts them into a format that can be used by
        the model.
        Args:
            proteins (`Union[List[dict], dict]`):
                A list of dictionaries or a single dictionary containing the following keys:
                    - `"aa_seq"` (`str`) -- The amino acid sequence of the protein.
                    - `"foldseek"` (`str`) -- The foldseek string of the protein.
            messages_list (`Union[List[List[dict]], List[dict]]`):
                A list of lists of dictionaries or a list of dictionaries containing the following keys:
                    - `"role"` (`str`) -- The role of the message.
                    - `"content"` (`str`) -- The content of the message.
            protein_max_length (`int`, *optional*, defaults to 1024):
                The maximum length of the tokenized protein sequence. Longer sequences are truncated.
            text_max_length (`int`, *optional*, defaults to 512):
                The maximum length of the tokenized text prompt. Longer prompts are truncated.
        Return:
            a [`BatchFeature`] with the following keys:
                - `protein_input_ids` (`torch.Tensor` of shape `(batch_size, sequence_length)`) -- The input IDs for the protein sequence.
                - `protein_attention_mask` (`torch.Tensor` of shape `(batch_size, sequence_length)`) -- The attention mask for the protein sequence.
                - `input_ids` (`torch.Tensor` of shape `(batch_size, sequence_length)`) -- The input IDs for the text sequence.
                - `attention_mask` (`torch.Tensor` of shape `(batch_size, sequence_length)`) -- The attention mask for the text sequence.
        """
        # proteins and messages_list should be provided
        if proteins is None or messages_list is None:
            raise ValueError("You need to specify `messages_list` and `proteins`.")
        protein_max_length = protein_max_length if protein_max_length is not None else self.protein_max_length
        text_max_length = text_max_length if text_max_length is not None else self.text_max_length
        # proteins should be List[dict]
        if isinstance(proteins, dict):
            proteins = [proteins]
        # messages_list should be List[List[dict]]
        if isinstance(messages_list, (list, tuple)) and not isinstance(messages_list[0], (list, tuple)):
            messages_list = [messages_list]
        # Check if batched proteins are in the correct format
        if isinstance(proteins, (list, tuple)) and not all(isinstance(p, dict) for p in proteins):
            raise ValueError("The proteins should be a list of dictionaries, but not all elements are dictionaries.")
        if isinstance(proteins, (list, tuple)) and not all(
            all(k in PROTEIN_VALID_KEYS for k in p.keys()) for p in proteins
        ):
            raise ValueError(
                "There should be a list of dictionaries with keys: "
                f"{', '.join(PROTEIN_VALID_KEYS)} for each protein. "
                f"But got: {proteins}"
            )
        # Check if batched messages_list is in the correct format
        if isinstance(messages_list, (list, tuple)):
            for messages in messages_list:
                if not isinstance(messages, (list, tuple)):
                    raise ValueError(
                        f"Each element of messages_list should be a list of messages, but got {type(messages)}."
                    )
                if not all(isinstance(m, dict) for m in messages):
                    raise ValueError(
                        "Each element of messages_list should be a list of dictionaries, but not all elements are dictionaries."
                    )
                if any(len(m.keys()) != 2 for m in messages) or any(
                    set(m.keys()) != {"role", "content"} for m in messages
                ):
                    raise ValueError(
                        "Each message in messages_list should be a dictionary with two keys: 'role' and 'content'. "
                        f"But got: {messages}"
                    )
        else:
            raise ValueError(
                f"The messages_list should be a list of lists of dictionaries, but it's {type(messages_list)}."
            )
        sa_tokens = self.process_proteins(proteins, protein_max_length)
        text_tokens = self.process_text(messages_list, text_max_length)
        return BatchFeature(
            data={
                "protein_input_ids": sa_tokens["input_ids"],
                "protein_attention_mask": sa_tokens["attention_mask"],
                "input_ids": text_tokens["input_ids"],
                "attention_mask": text_tokens["attention_mask"],
            }
        )
    def batch_decode(self, *args, **kwargs):
        return self.tokenizer.batch_decode(*args, **kwargs)
    def decode(self, *args, **kwargs):
        return self.tokenizer.decode(*args, **kwargs)
    def protein_batch_decode(self, *args, **kwargs):
        return self.protein_tokenizer.batch_decode(*args, **kwargs)
    def protein_decode(self, *args, **kwargs):
        return self.protein_tokenizer.decode(*args, **kwargs)
    # overwrite to save the protein tokenizer in a separate folder
    # Adapted from instructblip.processing_instructblip.py (https://github.com/huggingface/transformers/blob/9b479a245b793cac2a8b2e87c6d8e81bb24e20c4/src/transformers/models/instructblip/processing_instructblip.py#L191-L221)
    def save_pretrained(self, save_directory, **kwargs):
        # only save the protein tokenizer in sub_dir
        self.protein_tokenizer.save_pretrained(os.path.join(save_directory, self.protein_tokenizer_dir_name))
        # we modify the attributes so that only the text tokenizer is saved in the main folder
        protein_tokenizer_present = "protein_tokenizer" in self.attributes
        # find the correct position of it in the attributes list
        protein_tokenizer_index = self.attributes.index("protein_tokenizer") if protein_tokenizer_present else None
        if protein_tokenizer_present and protein_tokenizer_index is not None:
            self.attributes.remove("protein_tokenizer")
        outputs = super().save_pretrained(save_directory, **kwargs)
        if protein_tokenizer_present and protein_tokenizer_index is not None:
            self.attributes.insert(protein_tokenizer_index, "protein_tokenizer")
        return outputs
    # overwrite to load the protein tokenizer from a separate folder
    # Adapted from instructblip.processing_instructblip.py (https://github.com/huggingface/transformers/blob/9b479a245b793cac2a8b2e87c6d8e81bb24e20c4/src/transformers/models/instructblip/processing_instructblip.py#L191-L221)
    @classmethod
    def from_pretrained(cls, pretrained_model_name_or_path, **kwargs):
        processor = super().from_pretrained(pretrained_model_name_or_path, **kwargs)
        # if return_unused_kwargs a tuple is returned where the second element is 'unused_kwargs'
        if isinstance(processor, tuple):
            processor = processor[0]
        protein_tokenizer = AutoTokenizer.from_pretrained(
            pretrained_model_name_or_path, subfolder=cls.protein_tokenizer_dir_name
        )
        processor.protein_tokenizer = protein_tokenizer
        return processor
 __all__ = ["EvollaProcessor"]
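# A minimal, standalone sketch of two conventions used by `process_proteins` and `__call__` above:
# interleaving the amino acid string with the foldseek string into a structure-aware sequence, and
# wrapping a single protein dict into a one-element batch. `build_sa_sequence` and
# `normalize_proteins` are hypothetical helper names used only for illustration; they are not part
# of the processor's API.

```python
def build_sa_sequence(aa_seq: str, foldseek: str) -> str:
    """Interleave residues (upper case) with foldseek tokens (lower case)."""
    # zip() stops at the shorter string, so mismatched lengths are silently truncated.
    return "".join(s.upper() + f.lower() for s, f in zip(aa_seq, foldseek))


def normalize_proteins(proteins):
    """Wrap a single protein dict in a list so later code always sees a batch."""
    return [proteins] if isinstance(proteins, dict) else list(proteins)


batch = normalize_proteins({"aa_seq": "MLk", "foldseek": "DVV"})
sa_sequences = [build_sa_sequence(p["aa_seq"], p["foldseek"]) for p in batch]
print(sa_sequences)  # ['MdLvKv']
```

# Because `zip` truncates to the shorter input, a caller that wants strict behaviour would need to
# compare `len(aa_seq)` and `len(foldseek)` before building the sequence.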
--- a/tests/models/evolla/__init__.py
+++ b/tests/models/evolla/__init__.py
--- a/tests/models/evolla/test_modeling_evolla.py
+++ b/tests/models/evolla/test_modeling_evolla.py
@@ -0,0 +1,397 @@
 # coding=utf-8
 # Copyright 2025 The HuggingFace Inc. team. All rights reserved.
 #
 # Licensed under the Apache License, Version 2.0 (the "License");
 # you may not use this file except in compliance with the License.
 # You may obtain a copy of the License at
 #
 #     http://www.apache.org/licenses/LICENSE-2.0
 #
 # Unless required by applicable law or agreed to in writing, software
 # distributed under the License is distributed on an "AS IS" BASIS,
 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 # See the License for the specific language governing permissions and
 # limitations under the License.
 """Testing suite for the PyTorch Evolla model."""
 import unittest
 from parameterized import parameterized
 from transformers import BitsAndBytesConfig, EvollaConfig, is_torch_available
 from transformers.testing_utils import (
    TestCasePlus,
    require_bitsandbytes,
    require_torch,
    require_torch_sdpa,
    slow,
    torch_device,
 )
 from transformers.utils import (
    cached_property,
 )
 from ...test_configuration_common import ConfigTester
 from ...test_modeling_common import (
    TEST_EAGER_MATCHES_SDPA_INFERENCE_PARAMETERIZATION,
    ModelTesterMixin,
    _config_zero_init,
    ids_tensor,
    random_attention_mask,
 )
 from ...test_pipeline_mixin import PipelineTesterMixin
 if is_torch_available():
    import torch
    from transformers import EvollaForProteinText2Text, EvollaModel, EvollaProcessor
 class EvollaModelTester:
    def __init__(
        self,
        parent,
        batch_size=1,
        is_training=False,
        text_seq_length=20,
        text_vocab_size=100,
        protein_seq_length=10,
        protein_vocab_size=20,
        hidden_size=4,  # llama hidden size
        intermediate_size=7,  # llama intermediate size
        num_hidden_layers=1,  # llama hidden layers
        num_attention_heads=2,  # llama attention heads
        num_key_value_heads=2,  # llama key value heads
        protein_hidden_size=8,  # protein encoder hidden size
        protein_num_hidden_layers=1,  # protein encoder hidden layers
        protein_num_attention_heads=4,  # protein encoder attention heads
        protein_intermediate_size=11,  # protein encoder intermediate size
        resampler_num_latents=7,  # sequence compressor num latents
        resampler_ff_mult=1,  # sequence compressor ff mult
        resampler_depth=2,  # sequence compressor depth
        resampler_dim_head=4,  # sequence compressor dim head
        resampler_heads=2,  # sequence compressor heads
        aligner_num_add_layers=1,  # sequence aligner num add layers
        aligner_ffn_mult=1,  # sequence aligner ffn mult
        use_input_mask=True,
    ):
        self.parent = parent
        self.batch_size = batch_size
        self.protein_seq_length = protein_seq_length
        self.protein_vocab_size = protein_vocab_size
        self.text_seq_length = text_seq_length
        self.text_vocab_size = text_vocab_size
        self.seq_length = text_seq_length
        self.hidden_size = hidden_size
        self.intermediate_size = intermediate_size
        self.num_hidden_layers = num_hidden_layers
        self.num_attention_heads = num_attention_heads
        self.num_key_value_heads = num_key_value_heads
        self.protein_hidden_size = protein_hidden_size
        self.protein_num_hidden_layers = protein_num_hidden_layers
        self.protein_num_attention_heads = protein_num_attention_heads
        self.protein_intermediate_size = protein_intermediate_size
        self.resampler_num_latents = resampler_num_latents
        self.resampler_ff_mult = resampler_ff_mult
        self.resampler_depth = resampler_depth
        self.resampler_dim_head = resampler_dim_head
        self.resampler_heads = resampler_heads
        self.aligner_num_add_layers = aligner_num_add_layers
        self.aligner_ffn_mult = aligner_ffn_mult
        self.use_input_mask = use_input_mask
        self.is_training = is_training
    @property
    def is_encoder_decoder(self):
        return False
    def prepare_config_and_inputs(self, num_proteins=None):
        batch_size = num_proteins if num_proteins is not None else self.batch_size
        text_input_ids = ids_tensor([batch_size, self.text_seq_length], self.text_vocab_size)
        protein_input_ids = ids_tensor([batch_size, self.protein_seq_length], self.protein_vocab_size)
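        # default to None so the return statement below works when use_input_mask is False
        text_input_mask = None
        protein_input_mask = None
        if self.use_input_mask:
            text_input_mask = random_attention_mask([batch_size, self.text_seq_length])
            protein_input_mask = random_attention_mask([batch_size, self.protein_seq_length])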
        config = self.get_config()
        return (config, text_input_ids, text_input_mask, protein_input_ids, protein_input_mask)
    def get_config(self):
        return EvollaConfig(
            protein_encoder_config={
                "vocab_size": self.protein_vocab_size,
                "hidden_size": self.protein_hidden_size,
                "num_hidden_layers": self.protein_num_hidden_layers,
                "num_attention_heads": self.protein_num_attention_heads,
                "intermediate_size": self.protein_intermediate_size,
            },
            vocab_size=self.text_vocab_size,
            hidden_size=self.hidden_size,
            intermediate_size=self.intermediate_size,
            num_hidden_layers=self.num_hidden_layers,
            num_attention_heads=self.num_attention_heads,
            num_key_value_heads=self.num_key_value_heads,
            aligner_ffn_mult=self.aligner_ffn_mult,
            aligner_num_add_layers=self.aligner_num_add_layers,
            resampler_depth=self.resampler_depth,
            resampler_dim_head=self.resampler_dim_head,
            resampler_heads=self.resampler_heads,
            resampler_num_latents=self.resampler_num_latents,
            resampler_ff_mult=self.resampler_ff_mult,
        )
    def create_and_check_model(
        self,
        config,
        input_ids,
        input_mask,
        protein_input_ids,
        protein_input_mask,
        batch_size=None,
    ):
        batch_size = batch_size if batch_size is not None else self.batch_size
        model = EvollaModel(config=config)
        model.to(torch_device)
        model.eval()
        result = model(
            input_ids,
            attention_mask=input_mask,
            protein_input_ids=protein_input_ids,
            protein_attention_mask=protein_input_mask,
        )
        self.parent.assertEqual(result.last_hidden_state.shape, (batch_size, input_ids.shape[1], self.hidden_size))
    def create_and_check_model_gen(
        self,
        config,
        input_ids,
        input_mask,
        protein_input_ids,
        protein_input_mask,
    ):
        model = EvollaForProteinText2Text(config)
        model.to(torch_device)
        model.eval()
        model.generate(
            input_ids,
            attention_mask=input_mask,
            protein_input_ids=protein_input_ids,
            protein_attention_mask=protein_input_mask,
            max_length=self.seq_length + 2,
        )
    def prepare_config_and_inputs_for_common(self):
        config_and_inputs = self.prepare_config_and_inputs()
        (config, text_input_ids, text_input_mask, protein_input_ids, protein_input_mask) = config_and_inputs
        inputs_dict = {
            "input_ids": text_input_ids,
            "attention_mask": text_input_mask,
            "protein_input_ids": protein_input_ids,
            "protein_attention_mask": protein_input_mask,
        }
        return config, inputs_dict
@require_torch
 class EvollaModelTest(ModelTesterMixin, PipelineTesterMixin, unittest.TestCase):
    all_model_classes = (EvollaModel, EvollaForProteinText2Text) if is_torch_available() else ()
    pipeline_model_mapping = {"feature-extraction": EvollaModel} if is_torch_available() else {}
    test_pruning = False
    test_headmasking = False
    test_torchscript = False
    test_resize_embeddings = False
    maxDiff = None
    def setUp(self):
        self.model_tester = EvollaModelTester(self)
        self.config_tester = ConfigTester(self, config_class=EvollaConfig, hidden_size=37)
    @property
    def is_encoder_decoder(self):
        return self.model_tester.is_encoder_decoder
    def _prepare_for_class(self, inputs_dict, model_class, return_labels=False):
        inputs_dict = super()._prepare_for_class(inputs_dict, model_class, return_labels=return_labels)
        # XXX: EvollaForProteinText2Text has no MODEL_FOR group yet, but it should be the same
        # as MODEL_FOR_CAUSAL_LM_MAPPING_NAMES, so for now manually changing to do the right thing
        # as super won't do it
        if return_labels:
            inputs_dict["labels"] = torch.zeros(
                (self.model_tester.batch_size, self.model_tester.seq_length), dtype=torch.long, device=torch_device
            )
        return inputs_dict
    def test_model_outputs_equivalence(self):
        # save the original value before entering `try` so that `finally` can always restore it
        orig = self.all_model_classes
        try:
            # EvollaModel.forward doesn't have labels input arg - only EvollaForProteinText2Text does
            self.all_model_classes = (EvollaForProteinText2Text,) if is_torch_available() else ()
            super().test_model_outputs_equivalence()
        finally:
            self.all_model_classes = orig
    def test_config(self):
        self.config_tester.run_common_tests()
    def test_model_single_protein(self):
        config_and_inputs = self.model_tester.prepare_config_and_inputs(num_proteins=1)
        self.model_tester.create_and_check_model(*config_and_inputs, batch_size=1)
    def test_model_multiple_proteins(self):
        config_and_inputs = self.model_tester.prepare_config_and_inputs(num_proteins=2)
        self.model_tester.create_and_check_model(*config_and_inputs, batch_size=2)
    def test_generate_single_protein(self):
        config_and_inputs = self.model_tester.prepare_config_and_inputs(num_proteins=1)
        self.model_tester.create_and_check_model_gen(*config_and_inputs)
    def test_generate_multiple_proteins(self):
        config_and_inputs = self.model_tester.prepare_config_and_inputs(num_proteins=2)
        self.model_tester.create_and_check_model_gen(*config_and_inputs)
    def test_saprot_output(self):
        config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()
        config.return_dict = True
        protein_informations = {
            "input_ids": inputs_dict["protein_input_ids"],
            "attention_mask": inputs_dict["protein_attention_mask"],
        }
        for model_class in self.all_model_classes:
            if model_class is not EvollaModel:
                continue
            model = model_class(config)
            model.to(torch_device)
            model.eval()
            protein_encoder_outputs = model.protein_encoder.model(**protein_informations, return_dict=True)
            print(model_class, protein_encoder_outputs)
    def test_protein_encoder_output(self):
        config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()
        config.return_dict = True
        protein_informations = {
            "input_ids": inputs_dict["protein_input_ids"],
            "attention_mask": inputs_dict["protein_attention_mask"],
        }
        for model_class in self.all_model_classes:
            if model_class is not EvollaModel:
                continue
            model = model_class(config)
            model.to(torch_device)
            model.eval()
            protein_encoder_outputs = model.protein_encoder(**protein_informations, return_dict=True)
            print(model_class, protein_encoder_outputs)
    def test_single_forward(self):
        config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()
        config.return_dict = True
        for model_class in self.all_model_classes:
            inputs_dict["output_attentions"] = True
            inputs_dict["output_hidden_states"] = False
            config.return_dict = True
            model = model_class(config)
            model.to(torch_device)
            model.eval()
            with torch.no_grad():
                outputs = model(**self._prepare_for_class(inputs_dict, model_class))
            print(outputs)
    def test_initialization(self):
        # we skip the latents initialization test
        config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()
        configs_no_init = _config_zero_init(config)
        for model_class in self.all_model_classes:
            model = model_class(config=configs_no_init)
            for name, param in model.named_parameters():
                if param.requires_grad:
                    # skip latents
                    if name.endswith("latents"):
                        print(f"Skipping latents {name}")
                        continue
                    self.assertIn(
                        ((param.data.mean() * 1e9).round() / 1e9).item(),
                        [0.0, 1.0],
                        msg=f"Parameter {name} of model {model_class} seems not properly initialized",
                    )
    @parameterized.expand(TEST_EAGER_MATCHES_SDPA_INFERENCE_PARAMETERIZATION)
    @require_torch_sdpa
    @unittest.skip("Evolla requires both text and protein inputs which is currently not done in this test.")
    def test_eager_matches_sdpa_inference(self):
        pass
    @unittest.skip("Evolla does not support eager attention implementation.")
    def test_eager_padding_matches_padding_free_with_position_ids(self):
        pass
    @unittest.skip(
        "Evolla has a separate test runner for generation tests with complex inheritance, causing this check to fail."
    )
    def test_generation_tester_mixin_inheritance(self):
        pass
    @unittest.skip("Evolla requires both text and protein inputs which is currently not done in this test.")
    def test_flex_attention_with_grads(self):
        pass
@require_torch
 class EvollaModelIntegrationTest(TestCasePlus):
    def _prepare_for_inputs(self):
        aa_seq = "MLLEETLKSCPIVKRGKYHYFIHPISDGVPLVEPKLLREVATRIIKIGNFEGVNKIVTAEAMGIPLVTTLSLYTDIPYVIMRKREYKLPGEVPVFQSTGYSKGQLYLNGIEKGDKVIIIDDVISTGGTMIAIINALERAGAEIKDIICVIERGDGKKIVEEKTGYKIKTLVKIDVVDGEVVIL"
        foldseek = "dvvvvqqqpfawdddppdtdgcgclapvpdpddpvvlvvllvlcvvpadpvqaqeeeeeddscpsnvvsncvvpvhyydywylddppdppkdwqwf######gitidpdqaaaheyeyeeaeqdqlrvvlsvvvrcvvrnyhhrayeyaeyhycnqvvccvvpvghyhynwywdqdpsgidtd"
        question = "What is the function of this protein?"
        protein_information = {
            "aa_seq": aa_seq,
            "foldseek": foldseek,
        }
        messages = [
            {"role": "system", "content": "You are an AI expert that can answer any questions about protein."},
            {"role": "user", "content": question},
        ]
        return protein_information, messages
    @cached_property
    def default_processor(self):
        return EvollaProcessor.from_pretrained("westlake-repl/Evolla-10B-hf", revision="refs/pr/11")
    @require_bitsandbytes
    @slow
    def test_inference_natural_language_protein_reasoning(self):
        protein_information, messages = self._prepare_for_inputs()
        processor = self.default_processor
        inputs = processor(
            messages_list=[messages], proteins=[protein_information], return_tensors="pt", padding="longest"
        ).to(torch_device)
        # the CI gpu is small so using quantization to fit
        quantization_config = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_compute_dtype="float16",
        )
        model = EvollaForProteinText2Text.from_pretrained(
            "westlake-repl/Evolla-10B-hf",
            quantization_config=quantization_config,
            device_map="auto",
        )
        generated_ids = model.generate(**inputs, max_new_tokens=100, do_sample=False)
        generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)
        # keep for debugging
        for i, t in enumerate(generated_text):
            t = bytes(t, "utf-8").decode("unicode_escape")
            print(f"{i}:\n{t}\n")
        self.assertIn("This protein", generated_text[0])
        self.assertIn("purine", generated_text[0])
--- a/tests/models/evolla/test_processor_evolla.py
+++ b/tests/models/evolla/test_processor_evolla.py
@@ -0,0 +1,295 @@
 # Copyright 2025 The HuggingFace Team. All rights reserved.
 #
 # Licensed under the Apache License, Version 2.0 (the "License");
 # you may not use this file except in compliance with the License.
 # You may obtain a copy of the License at
 #
 #     http://www.apache.org/licenses/LICENSE-2.0
 #
 # Unless required by applicable law or agreed to in writing, software
 # distributed under the License is distributed on an "AS IS" BASIS,
 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 # See the License for the specific language governing permissions and
 # limitations under the License.
 import random
 import shutil
 import tempfile
 import unittest
 from transformers import (
    AutoProcessor,
    EvollaProcessor,
 )
 from transformers.testing_utils import require_torch
 from transformers.utils import is_torch_available
 from ...test_processing_common import ProcessorTesterMixin
 if is_torch_available():
    import torch
 EVOLLA_VALID_AA = list("ACDEFGHIKLMNPQRSTVWY#")
 EVOLLA_VALID_FS = list("pynwrqhgdlvtmfsaeikc#")
@require_torch
 class EvollaProcessorTest(ProcessorTesterMixin, unittest.TestCase):
    processor_class = EvollaProcessor
    def setUp(self):
        self.tmpdirname = tempfile.mkdtemp()
        processor = EvollaProcessor.from_pretrained("westlake-repl/Evolla-10B-hf")
        processor.save_pretrained(self.tmpdirname)
        self.input_keys = ["protein_input_ids", "protein_attention_mask", "input_ids", "attention_mask"]
    def prepare_input_and_expected_output(self):
        amino_acid_sequence = "AAAA"
        foldseek_sequence = "dddd"
        question = "What is the function of this protein?"
        expected_output = {
            "protein_input_ids": torch.tensor([[0, 13, 13, 13, 13, 2]]),
            "protein_attention_mask": torch.tensor([[1, 1, 1, 1, 1, 1]]),
            "input_ids": torch.tensor(
                [
                    [
                        128000,
                        128006,
                        9125,
                        128007,
                        271,
                        2675,
                        527,
                        459,
                        15592,
                        6335,
                        430,
                        649,
                        4320,
                        904,
                        4860,
                        922,
                        13128,
                        13,
                        128009,
                        128006,
                        882,
                        128007,
                        271,
                        3923,
                        374,
                        279,
                        734,
                        315,
                        420,
                        13128,
                        30,
                        128009,
                        128006,
                        78191,
                        128007,
                        271,
                    ]
                ]
            ),
            "attention_mask": torch.tensor(
                [
                    [
                        1,
                        1,
                        1,
                        1,
                        1,
                        1,
                        1,
                        1,
                        1,
                        1,
                        1,
                        1,
                        1,
                        1,
                        1,
                        1,
                        1,
                        1,
                        1,
                        1,
                        1,
                        1,
                        1,
                        1,
                        1,
                        1,
                        1,
                        1,
                        1,
                        1,
                        1,
                        1,
                        1,
                        1,
                        1,
                        1,
                    ]
                ]
            ),
        }
        protein_dict = {"aa_seq": amino_acid_sequence, "foldseek": foldseek_sequence}
        message = [
            {"role": "system", "content": "You are an AI expert that can answer any questions about protein."},
            {"role": "user", "content": question},
        ]
        return protein_dict, message, expected_output
    def test_processor(self):
        protein_tokenizer = self.get_protein_tokenizer()
        tokenizer = self.get_tokenizer()
        processor = EvollaProcessor(protein_tokenizer, tokenizer)
        protein_dict, message, expected_output = self.prepare_input_and_expected_output()
        inputs = processor(proteins=[protein_dict], messages_list=[message])
        # check if the input is correct
        for key, value in expected_output.items():
            self.assertTrue(
                torch.equal(inputs[key], value),
                f"inputs[key] is {inputs[key]} and expected_output[key] is {expected_output[key]}",
            )
    def get_tokenizer(self, **kwargs):
        return AutoProcessor.from_pretrained(self.tmpdirname, **kwargs).tokenizer
    def get_protein_tokenizer(self, **kwargs):
        return AutoProcessor.from_pretrained(self.tmpdirname, **kwargs).protein_tokenizer
    def tearDown(self):
        shutil.rmtree(self.tmpdirname)
    def prepare_inputs_single(self):
        proteins = {
            "aa_seq": "".join(random.choices(EVOLLA_VALID_AA, k=100)),
            "foldseek": "".join(random.choices(EVOLLA_VALID_FS, k=100)),
        }
        return proteins
    def prepare_inputs_pair(self):
        proteins = [
            {
                "aa_seq": "".join(random.choices(EVOLLA_VALID_AA, k=100)),
                "foldseek": "".join(random.choices(EVOLLA_VALID_FS, k=100)),
            },
            {
                "aa_seq": "".join(random.choices(EVOLLA_VALID_AA, k=100)),
                "foldseek": "".join(random.choices(EVOLLA_VALID_FS, k=100)),
            },
        ]
        return proteins
    def prepare_inputs_long(self):
        proteins = [
            {
                "aa_seq": "".join(random.choices(EVOLLA_VALID_AA, k=100)),
                "foldseek": "".join(random.choices(EVOLLA_VALID_FS, k=100)),
            },
            {
                "aa_seq": "".join(random.choices(EVOLLA_VALID_AA, k=2000)),
                "foldseek": "".join(random.choices(EVOLLA_VALID_FS, k=2000)),
            },
        ]
        return proteins
    def prepare_inputs_short(self):
        proteins = [
            {
                "aa_seq": "".join(random.choices(EVOLLA_VALID_AA, k=1)),
                "foldseek": "".join(random.choices(EVOLLA_VALID_FS, k=1)),
            },
            {
                "aa_seq": "".join(random.choices(EVOLLA_VALID_AA, k=100)),
                "foldseek": "".join(random.choices(EVOLLA_VALID_FS, k=100)),
            },
        ]
        return proteins
    def prepare_inputs_empty(self):
        proteins = [
            {
                "aa_seq": "",
                "foldseek": "",
            },
            {
                "aa_seq": "".join(random.choices(EVOLLA_VALID_AA, k=100)),
                "foldseek": "".join(random.choices(EVOLLA_VALID_FS, k=100)),
            },
        ]
        return proteins
    def prepare_inputs(self, protein_types="pair"):
        r"""
        Prepare inputs for the test.
        Args:
            protein_types (`str`): the kind of protein inputs to prepare.
                - "single": a single valid protein, passed as a dict rather than a list.
                - "pair": a pair of valid proteins.
                - "long": a regular protein (100 residues) and a long protein (2000 residues).
                - "short": a protein with a single residue and a regular protein.
                - "empty": a protein with empty sequences and a regular protein.
        """
        if protein_types == "single":
            proteins = self.prepare_inputs_single()
        elif protein_types == "pair":
            proteins = self.prepare_inputs_pair()
        elif protein_types == "long":
            proteins = self.prepare_inputs_long()
        elif protein_types == "short":
            proteins = self.prepare_inputs_short()
        elif protein_types == "empty":
            proteins = self.prepare_inputs_empty()
        else:
            raise ValueError(
                f"protein_types should be one of 'single', 'pair', 'long', 'short', 'empty', but got {protein_types}"
            )
        questions = ["What is the function of the protein?"] * len(proteins)
        messages_list = []
        for question in questions:
            messages = [
                {"role": "system", "content": "You are an AI expert that can answer any questions about protein."},
                {"role": "user", "content": question},
            ]
            messages_list.append(messages)
        return proteins, messages_list
    def test_tokenizer_decode(self):
        protein_tokenizer = self.get_protein_tokenizer()
        tokenizer = self.get_tokenizer()
        processor = EvollaProcessor(tokenizer=tokenizer, protein_tokenizer=protein_tokenizer, return_tensors="pt")
        predicted_ids = [[1, 4, 5, 8, 1, 0, 8], [3, 4, 3, 1, 1, 8, 9]]
        decoded_processor = processor.batch_decode(predicted_ids)
        decoded_tok = tokenizer.batch_decode(predicted_ids)
        self.assertListEqual(decoded_tok, decoded_processor)
    def test_model_input_names(self):
        protein_tokenizer = self.get_protein_tokenizer()
        tokenizer = self.get_tokenizer()
        processor = EvollaProcessor(tokenizer=tokenizer, protein_tokenizer=protein_tokenizer)
        proteins, messages_list = self.prepare_inputs()
        inputs = processor(messages_list=messages_list, proteins=proteins, padding="longest", return_tensors="pt")
        # The processor should return exactly the keys listed in `self.input_keys`
        self.assertSetEqual(set(inputs.keys()), set(self.input_keys))
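# The `prepare_inputs_*` helpers above draw from the module-level `random` generator, so each run
# sees different proteins. As a hedged sketch (not what the test file does), a seeded
# `random.Random` instance would make a failing case reproducible. `random_protein` is a
# hypothetical helper name, and the two alphabets are copied from the constants at the top of the
# test module.

```python
import random

# Alphabets copied from EVOLLA_VALID_AA and EVOLLA_VALID_FS in the test module.
VALID_AA = list("ACDEFGHIKLMNPQRSTVWY#")
VALID_FS = list("pynwrqhgdlvtmfsaeikc#")


def random_protein(length: int, rng: random.Random) -> dict:
    """Return a protein dict whose two strings have the same length."""
    return {
        "aa_seq": "".join(rng.choices(VALID_AA, k=length)),
        "foldseek": "".join(rng.choices(VALID_FS, k=length)),
    }


rng = random.Random(0)  # fixed seed so a failing case can be reproduced
proteins = [random_protein(n, rng) for n in (0, 1, 100)]
print([len(p["aa_seq"]) for p in proteins])  # [0, 1, 100]
```

# The three lengths mirror the "empty", "short" and regular cases prepared above.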
--- a/utils/check_repo.py
+++ b/utils/check_repo.py
@@ -92,6 +92,7 @@ PRIVATE_MODELS = [
    "Phi4MultimodalAudioModel",
    "Phi4MultimodalVisionModel",
    "Glm4vVisionModel",
    "EvollaSaProtPreTrainedModel",
 ]
 # Update this list for models that are not tested with a comment explaining the reason it should not be.