Add evolla rebase main (#36232)

* add evolla

* adding protein encoder part

* add initial processing test

* save processor

* add docstring

* add evolla processor

* add two test

* change vision to protein

* change resampler to sequence_compressor

* change vision to protein

* initial update for llama

* add initial update for llamaForCausalLM

* add `test_processor`, `test_saprot_output`, `test_protein_encoder_output`

* change evolla, but still working on it

* add test_single_forward

* pass test_attention_outputs

* pass test_hidden_states_output

* pass test_save_load and test_from_pretrained_no_checkpoint

* pass test_cpu_offload

* skip some tests

* update new progress

* skip test_model_is_small

* pass test_model_weights_reload_no_missing_tied_weights

* pass test_model_get_set_embeddings

* pass test_cpu_offload

* skip test_resize_embeddings

* add pipeline_model_mapping

* remove old setUp

* pass processor save_pretrained and load_pretrained

* remove pooling layer

* pass test_inputs_embeds_matches_input_ids

* pass test_model_is_small

* pass test_attention_outputs

* pass test_initialization

* pass test_model_get_set_embeddings

* pass test_single_forward

* skip test_disk_offload_bin and test_disk_offload_safetensors

* fix most tests

* pass test_protein_encoder_output

* remove useless code

* add EvollaForProteinText2Text

* pass test_saprot_output

* pass all EvollaModelTest test and remove processor test

* add processor test to its own file

* skip is_training since esm skipped it and the saprot code causes error when setting is_training True

* pass processor tests

* solve all except config

* pass most cases

* change init

* add doc to `configuration_evolla.py`

* remove image_processing test

* remove extra processor test

* remove extra modules

* remove extra modules

* change all configs into one config

* pass all evolla test

* pass `make fixup`

* update short summary

* update Evolla-10B-hf

* pass check_dummies.py and check_code_quality

* fix  `tests/models/auto/test_tokenization_auto.py::AutoTokenizerTest::test_model_name_edge_cases_in_mappings`

* remove dummy codes

* change format

* fix llava issue

* update format

* update to solve llama3 access issue

* update to make forward right

* solve processor save load problem from instructblip solution

* remove unexpected file

* skip `test_generation_tester_mixin_inheritance`

* add `test_single_forward_correct` and `test_inference_natural_language_protein_reasoning`

* add `modular_evolla.py`

* solved issue #36362

* run `make fixup`

* update modular

* solve float32 training

* add fix

* solve `utils/check_docstrings.py`

* update

* update

* update

* remove other files and replace sequential and einsum

* add use case in document

* update the models

* update model

* change some wrong code

* Update src/transformers/models/evolla/modular_evolla.py

Co-authored-by: Cyril Vallez <cyril.vallez@gmail.com>

* Update src/transformers/models/evolla/modular_evolla.py

Co-authored-by: Cyril Vallez <cyril.vallez@gmail.com>

* Update src/transformers/models/evolla/modular_evolla.py

Co-authored-by: Cyril Vallez <cyril.vallez@gmail.com>

* Update src/transformers/models/evolla/modular_evolla.py

Co-authored-by: Cyril Vallez <cyril.vallez@gmail.com>

* fix issues mentioned in PR

* update style and rearrange the placement

* fix return_dict argument issue

* solve SaProtConfig issue

* Solve EvollaSaProtRotaryEmbedding issue

* solve attention_mask issue

* solve almost all issues

* make style

* update config

* remove unrelated pickle file

* delete pickle files

* fix config

* simplify a lot

* remove past k-v from encoder

* continue work

* style

* skip it from init

* fix init

* fix init

* simplify more

* fill in docstrings

* change test for generation

* skip test

* fix style

---------

Co-authored-by: Chenchen Han <13980209828@163.com>
Co-authored-by: Cyril Vallez <cyril.vallez@huggingface.co>
Co-authored-by: Cyril Vallez <cyril.vallez@gmail.com>
Xibin Bayes Zhou
2025-07-26 01:11:57 +08:00
committed by GitHub
parent 2670da66ce
commit 45c7bfb157
15 changed files with 4120 additions and 0 deletions

@@ -975,6 +975,8 @@
title: Donut
- local: model_doc/emu3
title: Emu3
- local: model_doc/evolla
title: Evolla
- local: model_doc/flava
title: FLAVA
- local: model_doc/gemma3

@@ -0,0 +1,95 @@
<!--Copyright 2025 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.
-->
# Evolla
## Overview
The Evolla model was proposed in [Decoding the Molecular Language of Proteins with Evolla](https://doi.org/10.1101/2025.01.05.630192) by [Zhou et al.](https://doi.org/10.1101/2025.01.05.630192).
Evolla is an advanced 80-billion-parameter protein-language generative model designed to decode the molecular language of proteins. It integrates information from protein sequences, structures, and user queries to generate precise and contextually nuanced insights into protein function. Trained on an unprecedented AI-generated dataset of 546 million protein question-answer pairs and 150 billion word tokens, Evolla significantly advances research in proteomics and functional genomics, providing expert-level insights and shedding light on the molecular logic encoded in proteins.
The abstract from the paper is the following:
*Proteins, natures intricate molecular machines, are the products of billions of years of evolution and play fundamental roles in sustaining life. Yet, deciphering their molecular language - that is, understanding how protein sequences and structures encode and determine biological functions - remains a corner-stone challenge in modern biology. Here, we introduce Evolla, an 80 billion frontier protein-language generative model designed to decode the molecular language of proteins. By integrating information from protein sequences, structures, and user queries, Evolla generates precise and contextually nuanced insights into protein function. A key innovation of Evolla lies in its training on an unprecedented AI-generated dataset: 546 million protein question-answer pairs and 150 billion word tokens, designed to reflect the immense complexity and functional diversity of proteins. Post-pretraining, Evolla integrates Direct Preference Optimization (DPO) to refine the model based on preference signals and Retrieval-Augmented Generation (RAG) for external knowledge incorporation, improving response quality and relevance. To evaluate its performance, we propose a novel framework, Instructional Response Space (IRS), demonstrating that Evolla delivers expert-level insights, advancing research in proteomics and functional genomics while shedding light on the molecular logic encoded in proteins. The online demo is available at http://www.chat-protein.com/.*
Examples:
```python
import torch
from transformers import EvollaForProteinText2Text, EvollaProcessor

processor = EvollaProcessor.from_pretrained("westlake-repl/Evolla-10B-DPO-hf")
model = EvollaForProteinText2Text.from_pretrained("westlake-repl/Evolla-10B-DPO-hf")

# aa_seq should have the same length as foldseek
protein_inputs = [
    {
        "aa_seq": "MATGGRRG...",
        "foldseek": "###lqpfd...",  # "#" marks low-confidence foldseek tokens
    },
    {
        "aa_seq": "MLPGLALL...",
        "foldseek": "dfwwkwad...",
    },
]
# one conversation per protein
messages_list = [
    [
        {
            "role": "system",
            "content": "You are an AI expert that can answer any questions about protein.",
        },
        {"role": "user", "content": "What is the function of this protein?"},
    ],
    [
        {
            "role": "system",
            "content": "You are an AI expert that can answer any questions about protein.",
        },
        {"role": "user", "content": "What is the function of this protein?"},
    ],
]
input_dict = processor(
    protein_inputs, messages_list, return_tensors="pt", text_max_length=512, protein_max_length=1024
)
with torch.no_grad():
    generated_ids = model.generate(**input_dict)
generated_texts = processor.batch_decode(generated_ids, skip_special_tokens=True)
```
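The equal-length requirement on `aa_seq` and `foldseek` follows from how the processor pairs them: `process_proteins` interleaves each upper-cased residue with its lower-cased foldseek token before tokenizing. The standalone sketch below reproduces only that pairing step. `to_sa_sequence` is a hypothetical helper name used for illustration, and the explicit length check is an assumption added here, not something the processor is shown to do.

```python
def to_sa_sequence(aa_seq: str, foldseek: str) -> str:
    """Interleave residues (upper case) with foldseek tokens (lower case)."""
    # Assumption for this sketch: fail loudly on a length mismatch, because
    # zip() would otherwise silently drop the trailing characters.
    if len(aa_seq) != len(foldseek):
        raise ValueError("aa_seq and foldseek must have the same length")
    return "".join(s.upper() + f.lower() for s, f in zip(aa_seq, foldseek))


# Short made-up strings, not real protein data.
print(to_sa_sequence("MATG", "##lq"))  # M#A#TlGq
```

Each residue ends up directly followed by its structure token, which is why a mismatch in length would misalign or truncate the pairs.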
Tips:
- This model was contributed by [Xibin Bayes Zhou](https://huggingface.co/XibinBayesZhou).
- The original code can be found [here](https://github.com/westlake-repl/Evolla).
## EvollaConfig
[[autodoc]] EvollaConfig
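
For backward compatibility, `EvollaConfig.__init__` copies a legacy `type` key in `rope_scaling` to `rope_type` before the RoPE parameters are validated. A minimal sketch of that step on a plain dictionary, assuming nothing beyond what the configuration code does: `normalize_rope_scaling` is a hypothetical name, the `"linear"`/`factor` values are illustrative only, and the sketch works on a copy whereas the configuration updates its own attribute in place.

```python
from typing import Optional


def normalize_rope_scaling(rope_scaling: Optional[dict]) -> Optional[dict]:
    """Copy a legacy 'type' key to 'rope_type', leaving other keys untouched."""
    if rope_scaling is None:
        return None
    normalized = dict(rope_scaling)  # copy so the caller's dict is not mutated
    if "type" in normalized:
        normalized["rope_type"] = normalized["type"]
    return normalized


# Illustrative values only.
print(normalize_rope_scaling({"type": "linear", "factor": 2.0}))
```

The original `type` key is kept alongside `rope_type`, so older checkpoints that only carry `type` still load.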
## EvollaModel
[[autodoc]] EvollaModel
- forward
## EvollaForProteinText2Text
[[autodoc]] EvollaForProteinText2Text
- forward
## EvollaProcessor
[[autodoc]] EvollaProcessor
- __call__
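
`EvollaProcessor.__call__` accepts either a single protein and a single conversation or batches of both, and wraps the non-batched forms into lists before validating them. The sketch below mirrors only that wrapping step in plain Python. `normalize_inputs` is a hypothetical helper name, and the sketch assumes `messages_list` is non-empty, as the processor's own check does.

```python
def normalize_inputs(proteins, messages_list):
    """Wrap single (non-batched) inputs so that both arguments are batched."""
    # A single protein is a dict; a batch is a list of dicts.
    if isinstance(proteins, dict):
        proteins = [proteins]
    # A single conversation is a list of message dicts; a batch is a list of such lists.
    if isinstance(messages_list, (list, tuple)) and not isinstance(messages_list[0], (list, tuple)):
        messages_list = [messages_list]
    return proteins, messages_list


# Made-up inputs for illustration.
protein = {"aa_seq": "MA", "foldseek": "dd"}
conversation = [{"role": "user", "content": "What is the function of this protein?"}]
batched_proteins, batched_messages = normalize_inputs(protein, conversation)
print(len(batched_proteins), len(batched_messages))  # 1 1
```

Inputs that are already batched pass through unchanged, so callers do not need to special-case batch size one.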

@@ -110,6 +110,7 @@ if TYPE_CHECKING:
from .encoder_decoder import *
from .ernie import *
from .esm import *
from .evolla import *
from .falcon import *
from .falcon_h1 import *
from .falcon_mamba import *

@@ -133,6 +133,7 @@ CONFIG_MAPPING_NAMES = OrderedDict[str, str](
("ernie4_5_moe", "Ernie4_5_MoeConfig"),
("ernie_m", "ErnieMConfig"),
("esm", "EsmConfig"),
("evolla", "EvollaConfig"),
("falcon", "FalconConfig"),
("falcon_h1", "FalconH1Config"),
("falcon_mamba", "FalconMambaConfig"),
@@ -528,6 +529,7 @@ MODEL_NAMES_MAPPING = OrderedDict[str, str](
("ernie4_5_moe", "Ernie4_5_MoE"),
("ernie_m", "ErnieM"),
("esm", "ESM"),
("evolla", "Evolla"),
("falcon", "Falcon"),
("falcon3", "Falcon3"),
("falcon_h1", "FalconH1"),

@@ -124,6 +124,7 @@ MODEL_MAPPING_NAMES = OrderedDict(
("ernie4_5_moe", "Ernie4_5_MoeModel"),
("ernie_m", "ErnieMModel"),
("esm", "EsmModel"),
("evolla", "EvollaModel"),
("falcon", "FalconModel"),
("falcon_h1", "FalconH1Model"),
("falcon_mamba", "FalconMambaModel"),
@@ -402,6 +403,7 @@ MODEL_FOR_PRETRAINING_MAPPING_NAMES = OrderedDict(
("distilbert", "DistilBertForMaskedLM"),
("electra", "ElectraForPreTraining"),
("ernie", "ErnieForPreTraining"),
("evolla", "EvollaForProteinText2Text"),
("falcon_mamba", "FalconMambaForCausalLM"),
("flaubert", "FlaubertWithLMHeadModel"),
("flava", "FlavaForPreTraining"),
@@ -934,6 +936,7 @@ MODEL_FOR_IMAGE_TEXT_TO_TEXT_MAPPING_NAMES = OrderedDict(
("blip-2", "Blip2ForConditionalGeneration"),
("chameleon", "ChameleonForConditionalGeneration"),
("emu3", "Emu3ForConditionalGeneration"),
("evolla", "EvollaForProteinText2Text"),
("fuyu", "FuyuForCausalLM"),
("gemma3", "Gemma3ForConditionalGeneration"),
("gemma3n", "Gemma3nForConditionalGeneration"),

@@ -64,6 +64,7 @@ PROCESSOR_MAPPING_NAMES = OrderedDict(
("colqwen2", "ColQwen2Processor"),
("dia", "DiaProcessor"),
("emu3", "Emu3Processor"),
("evolla", "EvollaProcessor"),
("flava", "FlavaProcessor"),
("fuyu", "FuyuProcessor"),
("gemma3", "Gemma3Processor"),

@@ -0,0 +1,28 @@
# Copyright 2025 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from typing import TYPE_CHECKING
from ...utils import _LazyModule
from ...utils.import_utils import define_import_structure
if TYPE_CHECKING:
from .configuration_evolla import *
from .modeling_evolla import *
from .processing_evolla import *
else:
import sys
_file = globals()["__file__"]
sys.modules[__name__] = _LazyModule(__name__, _file, define_import_structure(_file), module_spec=__spec__)

@@ -0,0 +1,279 @@
# coding=utf-8
# Copyright 2025 Westlake Representational Learning Lab (Fajie Yuan Lab) team and the HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Evolla model configuration"""
from ...configuration_utils import PretrainedConfig
from ...modeling_rope_utils import rope_config_validation
from ...utils import logging
logger = logging.get_logger(__name__)
class SaProtConfig(PretrainedConfig):
r"""This is the configuration class to store the configuration of a [`EvollaSaProtProteinEncoder`]. It is used to instantiate a
SaProt model according to the specified arguments, defining the model architecture.
Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
documentation from [`PretrainedConfig`] for more information.
Args:
vocab_size (`int`, *optional*, defaults to 446):
Vocabulary size of the protein sequence model. Defines the number of different tokens that can be represented
by the `input_ids` passed when calling [`EvollaModel`].
mask_token_id (`int`, *optional*, defaults to 4):
The id of the *mask* token in the protein sequence model.
pad_token_id (`int`, *optional*, defaults to 1):
The id of the *padding* token in the protein sequence model.
hidden_size (`int`, *optional*, defaults to 1280):
Dimensionality of the protein sequence model layers and the pooler layer.
num_hidden_layers (`int`, *optional*, defaults to 33):
Number of hidden layers in the protein sequence model.
num_attention_heads (`int`, *optional*, defaults to 20):
Number of attention heads for each attention layer in the protein sequence model.
intermediate_size (`int`, *optional*, defaults to 5120):
Dimensionality of the intermediate layers in the protein sequence model.
hidden_dropout_prob (`float`, *optional*, defaults to 0.1):
The dropout ratio for the hidden layers in the protein sequence model.
attention_probs_dropout_prob (`float`, *optional*, defaults to 0.1):
The dropout ratio for the attention probabilities in the protein sequence model.
max_position_embeddings (`int`, *optional*, defaults to 1026):
The maximum sequence length that the protein sequence model might ever be used with. Typically set this to
something large just in case (e.g., 512 or 1024 or 2048).
layer_norm_eps (`float`, *optional*, defaults to 1e-05):
The epsilon value for the layer normalization layer in the protein sequence model.
position_embedding_type (`str`, *optional*, defaults to `"rotary"`):
The type of position embedding to use in the protein sequence model. Currently only `"rotary"` is supported.
emb_layer_norm_before (`bool`, *optional*, defaults to `False`):
Whether to apply layer normalization before the position embedding in the protein sequence model.
token_dropout (`bool`, *optional*, defaults to `True`):
Whether to apply dropout to the tokens in the protein sequence model."""
def __init__(
self,
vocab_size=446,
mask_token_id=4,
pad_token_id=1,
hidden_size=1280,
num_hidden_layers=33,
num_attention_heads=20,
intermediate_size=5120,
hidden_dropout_prob=0.1,
attention_probs_dropout_prob=0.1,
max_position_embeddings=1026,
initializer_range=0.02,
layer_norm_eps=1e-05,
position_embedding_type="rotary",
use_cache=True,
emb_layer_norm_before=False,
token_dropout=True,
**kwargs,
):
super().__init__(pad_token_id=pad_token_id, mask_token_id=mask_token_id, **kwargs)
self.vocab_size = vocab_size
self.hidden_size = hidden_size
self.num_hidden_layers = num_hidden_layers
self.num_attention_heads = num_attention_heads
self.intermediate_size = intermediate_size
self.hidden_dropout_prob = hidden_dropout_prob
self.attention_probs_dropout_prob = attention_probs_dropout_prob
self.max_position_embeddings = max_position_embeddings
self.initializer_range = initializer_range
self.layer_norm_eps = layer_norm_eps
self.position_embedding_type = position_embedding_type
self.use_cache = use_cache
self.emb_layer_norm_before = emb_layer_norm_before
self.token_dropout = token_dropout
class EvollaConfig(PretrainedConfig):
r"""
This is the configuration class to store the configuration of a [`EvollaModel`]. It is used to instantiate an
Evolla model according to the specified arguments, defining the model architecture. Instantiating a configuration
with the defaults will yield a similar configuration to that of the Evolla-10B.
e.g. [westlake-repl/Evolla-10B-hf](https://huggingface.co/westlake-repl/Evolla-10B-hf)
Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
documentation from [`PretrainedConfig`] for more information.
Args:
protein_encoder_config (`dict`, *optional*):
Dictionary of configuration options used to initialize [`SaProtConfig`].
vocab_size (`int`, *optional*, defaults to 128256):
Vocabulary size of the Evolla llama model. Defines the number of different tokens that can be represented by the
`input_ids` passed when calling [`EvollaModel`].
hidden_size (`int`, *optional*, defaults to 4096):
Dimensionality of the llama layers and the pooler layer.
intermediate_size (`int`, *optional*, defaults to 14336):
Dimensionality of the intermediate layers in the llama model.
num_hidden_layers (`int`, *optional*, defaults to 32):
Number of hidden layers in the llama model.
num_attention_heads (`int`, *optional*, defaults to 32):
Number of attention heads for each attention layer in the llama model.
num_key_value_heads (`int`, *optional*, defaults to 8):
Number of key-value heads used for grouped-query attention in each attention layer of the llama model.
hidden_act (`str` or `function`, *optional*, defaults to `"silu"`):
The non-linear activation function (function or string) in the llama model. If string, `"gelu"`, `"relu"`,
`"selu"` and `"silu"` are supported.
max_position_embeddings (`int`, *optional*, defaults to 8192):
The maximum sequence length that this model might ever be used with. Typically set this to something large
just in case (e.g., 512 or 1024 or 2048).
rms_norm_eps (`float`, *optional*, defaults to 1e-05):
The epsilon value for the RMS-norm layer in the llama model.
rope_theta (`float`, *optional*, defaults to 500000.0):
The base period of the RoPE embeddings in the llama model.
rope_scaling (`dict`, *optional*):
Dictionary containing the scaling configuration for the RoPE embeddings in the llama model. A legacy `type` key, if present, is copied to `rope_type`.
attention_bias (`bool`, *optional*, defaults to `False`):
Whether to use bias in the attention layer.
attention_dropout (`float`, *optional*, defaults to 0.0):
The dropout ratio for the attention layer.
mlp_bias (`bool`, *optional*, defaults to `False`):
Whether to use bias in the MLP layer.
aligner_ffn_mult (`int`, *optional*, defaults to 4):
The FFN multiplier for the aligner layer.
aligner_enable_bias (`bool`, *optional*, defaults to `True`):
Whether to use bias in the aligner layer.
aligner_attention_probs_dropout_prob (`float`, *optional*, defaults to 0.1):
The dropout ratio for the attention probabilities in the aligner layer.
aligner_num_add_layers (`int`, *optional*, defaults to 8):
The number of additional layers for the aligner layer.
resampler_depth (`int`, *optional*, defaults to 6):
The depth of the resampler layer in the llama model.
resampler_dim_head (`int`, *optional*, defaults to 64):
The dimension of the heads in the resampler layer in the llama model.
resampler_heads (`int`, *optional*, defaults to 8):
The number of heads in the resampler layer in the llama model.
resampler_num_latents (`int`, *optional*, defaults to 64):
The number of latents in the resampler layer in the llama model.
resampler_ff_mult (`int`, *optional*, defaults to 4):
The FFN multiplier for the resampler layer.
initializer_range (`float`, *optional*, defaults to 0.02):
The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
pad_token_id (`int`, *optional*):
The id of the *padding* token.
bos_token_id (`int`, *optional*, defaults to 128000):
The id of the *beginning-of-sequence* token.
eos_token_id (`int`, *optional*, defaults to 128009):
The id of the *end-of-sequence* token.
use_cache (`bool`, *optional*, defaults to `False`):
Whether or not the model should return the last key/values attentions (not used by all models).
tie_word_embeddings (`bool`, *optional*, defaults to `False`):
Whether or not to tie the input and output word embeddings.
Example:
```python
>>> from transformers import EvollaModel, EvollaConfig
>>> # Initializing an Evolla evolla-10b style configuration
>>> configuration = EvollaConfig()
>>> # Initializing a model from the evolla-10b style configuration
>>> model = EvollaModel(configuration)
>>> # Accessing the model configuration
>>> configuration = model.config
```"""
model_type = "EvollaModel"
sub_configs = {"protein_encoder_config": SaProtConfig}
def __init__(
self,
protein_encoder_config=None,
vocab_size=128256, # llama vocab size
hidden_size=4096, # llama hidden size
intermediate_size=14336, # llama intermediate size
num_hidden_layers=32, # llama num layers
num_attention_heads=32, # llama num heads
num_key_value_heads=8, # llama num key-value heads
hidden_act="silu", # llama activation function
max_position_embeddings=8192, # llama rope max length
rms_norm_eps=1e-05,
rope_theta=500000.0,
rope_scaling=None,
attention_bias=False,
attention_dropout=0.0,
mlp_bias=False,
aligner_ffn_mult=4,
aligner_enable_bias=True,
aligner_attention_probs_dropout_prob=0.1,
aligner_num_add_layers=8,
resampler_depth=6,
resampler_dim_head=64,
resampler_heads=8,
resampler_num_latents=64,
resampler_ff_mult=4,
initializer_range=0.02,
pad_token_id=None,
bos_token_id=128000,
eos_token_id=128009,
use_cache=False,
tie_word_embeddings=False,
**kwargs,
):
self.vocab_size = vocab_size
self.hidden_size = hidden_size
self.intermediate_size = intermediate_size
self.num_hidden_layers = num_hidden_layers
self.num_attention_heads = num_attention_heads
self.num_key_value_heads = num_key_value_heads
self.hidden_act = hidden_act
self.max_position_embeddings = max_position_embeddings
self.rms_norm_eps = rms_norm_eps
self.tie_word_embeddings = tie_word_embeddings
self.attention_bias = attention_bias
self.attention_dropout = attention_dropout
self.mlp_bias = mlp_bias
self.aligner_ffn_mult = aligner_ffn_mult
self.aligner_enable_bias = aligner_enable_bias
self.aligner_attention_probs_dropout_prob = aligner_attention_probs_dropout_prob
self.aligner_num_add_layers = aligner_num_add_layers
self.use_cache = use_cache
self.initializer_range = initializer_range
self.resampler_depth = resampler_depth
self.resampler_dim_head = resampler_dim_head
self.resampler_heads = resampler_heads
self.resampler_num_latents = resampler_num_latents
self.resampler_ff_mult = resampler_ff_mult
self.rope_theta = rope_theta
self.rope_scaling = rope_scaling
# Validate the correctness of rotary position embeddings parameters
# BC: if there is a 'type' field, copy it to 'rope_type'.
if self.rope_scaling is not None and "type" in self.rope_scaling:
self.rope_scaling["rope_type"] = self.rope_scaling["type"]
rope_config_validation(self)
# Subconfig
if protein_encoder_config is None:
protein_encoder_config = {}
logger.info("`protein_encoder_config` is `None`. Initializing the `SaProtConfig` with default values.")
self.protein_encoder_config = SaProtConfig(**protein_encoder_config)
super().__init__(
pad_token_id=pad_token_id,
bos_token_id=bos_token_id,
eos_token_id=eos_token_id,
tie_word_embeddings=tie_word_embeddings,
**kwargs,
)
__all__ = ["EvollaConfig"]

File diff suppressed because it is too large

File diff suppressed because it is too large

@@ -0,0 +1,247 @@
# coding=utf-8
# Copyright 2025 The HuggingFace Inc. team.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Processor class for EVOLLA.
"""
import os
from typing import Optional, Union
from ...feature_extraction_utils import BatchFeature
from ...processing_utils import (
ProcessorMixin,
)
from ..auto import AutoTokenizer
PROTEIN_VALID_KEYS = ["aa_seq", "foldseek", "msa"]
class EvollaProcessor(ProcessorMixin):
r"""
Constructs an EVOLLA processor which wraps a Llama tokenizer and a SaProt tokenizer (`EsmTokenizer`) into a single processor.
[`EvollaProcessor`] offers all the functionalities of [`EsmTokenizer`] and [`LlamaTokenizerFast`]. See the
docstring of [`~EvollaProcessor.__call__`] and [`~EvollaProcessor.decode`] for more information.
Args:
protein_tokenizer (`EsmTokenizer`):
An instance of [`EsmTokenizer`]. The protein tokenizer is a required input.
tokenizer (`LlamaTokenizerFast`, *optional*):
An instance of [`LlamaTokenizerFast`]. Defaults to `None` in the signature, but a `ValueError` is raised if it is not provided.
protein_max_length (`int`, *optional*, defaults to 1024):
The maximum length of the tokenized protein sequence; longer sequences are truncated.
text_max_length (`int`, *optional*, defaults to 512):
The maximum length of the tokenized text prompt; longer prompts are truncated.
"""
attributes = ["protein_tokenizer", "tokenizer"]
valid_kwargs = ["sequence_max_length"]
# protein_tokenizer_class = "EsmTokenizer"
# tokenizer_class = "LlamaTokenizerFast"
protein_tokenizer_class = "AutoTokenizer"
tokenizer_class = "AutoTokenizer"
protein_tokenizer_dir_name = "protein_tokenizer"
# tokenizer_dir_name = "text_tokenizer"
def __init__(self, protein_tokenizer, tokenizer=None, protein_max_length=1024, text_max_length=512, **kwargs):
if protein_tokenizer is None:
raise ValueError("You need to specify a `protein_tokenizer`.")
if tokenizer is None:
raise ValueError("You need to specify a `tokenizer`.")
super().__init__(protein_tokenizer, tokenizer)
self.tokenizer.pad_token = "<|reserved_special_token_0|>"
self.protein_max_length = protein_max_length
self.text_max_length = text_max_length
def process_proteins(self, proteins, protein_max_length=1024):
sa_sequences = []
for protein in proteins:
aa_seq = protein.get("aa_seq")
foldseek = protein.get("foldseek")
sa_sequence = "".join([s.upper() + f.lower() for s, f in zip(aa_seq, foldseek)])
sa_sequences.append(sa_sequence)
sa_tokens = self.protein_tokenizer.batch_encode_plus(
sa_sequences, return_tensors="pt", truncation=True, max_length=protein_max_length, padding=True
)
return sa_tokens
def process_text(
self,
texts,
text_max_length: int = 512,
):
prompts = []
for messages in texts:
prompt = self.tokenizer.apply_chat_template(
messages,
tokenize=False,
add_generation_prompt=True,
)
prompts.append(prompt)
prompt_inputs = self.tokenizer(
prompts,
add_special_tokens=False,
return_tensors="pt",
padding="longest",
truncation=True,
max_length=text_max_length,
)
return prompt_inputs
def __call__(
self,
proteins: Optional[Union[list[dict], dict]] = None,
messages_list: Optional[Union[list[list[dict]], list[dict]]] = None,
protein_max_length: Optional[int] = None,
text_max_length: Optional[int] = None,
**kwargs,
):
r"""This method takes batched or non-batched proteins and messages_list and converts them into format that can be used by
the model.
Args:
proteins (`Union[List[dict], dict]`):
A list of dictionaries or a single dictionary containing the following keys:
- `"aa_seq"` (`str`) -- The amino acid sequence of the protein.
- `"foldseek"` (`str`) -- The foldseek string of the protein.
messages_list (`Union[List[List[dict]], List[dict]]`):
A list of lists of dictionaries or a list of dictionaries containing the following keys:
- `"role"` (`str`) -- The role of the message.
- `"content"` (`str`) -- The content of the message.
protein_max_length (`int`, *optional*, defaults to 1024):
The maximum length of the tokenized protein sequence; longer sequences are truncated.
text_max_length (`int`, *optional*, defaults to 512):
The maximum length of the tokenized text prompt; longer prompts are truncated.
Return:
A [`BatchFeature`] with the following keys:
- `protein_input_ids` (`torch.Tensor` of shape `(batch_size, sequence_length)`) -- The input IDs for the protein sequence.
- `protein_attention_mask` (`torch.Tensor` of shape `(batch_size, sequence_length)`) -- The attention mask for the protein sequence.
- `input_ids` (`torch.Tensor` of shape `(batch_size, sequence_length)`) -- The input IDs for the text sequence.
- `attention_mask` (`torch.Tensor` of shape `(batch_size, sequence_length)`) -- The attention mask for the text sequence.
"""
# proteins and messages_list should be provided
if proteins is None or messages_list is None:
raise ValueError("You need to specify `messages_list` and `proteins`.")
protein_max_length = protein_max_length if protein_max_length is not None else self.protein_max_length
text_max_length = text_max_length if text_max_length is not None else self.text_max_length
# proteins should be List[dict]
if isinstance(proteins, dict):
proteins = [proteins]
# messages_list should be List[List[dict]]
if isinstance(messages_list, (list, tuple)) and not isinstance(messages_list[0], (list, tuple)):
messages_list = [messages_list]
# Check if batched proteins are in the correct format
if isinstance(proteins, (list, tuple)) and not all(isinstance(p, dict) for p in proteins):
raise ValueError("The proteins should be a list of dictionaries, but not all elements are dictionaries.")
if isinstance(proteins, (list, tuple)) and not all(
all(k in PROTEIN_VALID_KEYS for k in p.keys()) for p in proteins
):
raise ValueError(
"There should be a list of dictionaries with keys: "
f"{', '.join(PROTEIN_VALID_KEYS)} for each protein. "
f"But got: {proteins}"
)
# Check if batched messages_list is in the correct format
if isinstance(messages_list, (list, tuple)):
for messages in messages_list:
if not isinstance(messages, (list, tuple)):
raise ValueError(f"Each entry in messages_list should be a list instead of {type(messages)}.")
if not all(isinstance(m, dict) for m in messages):
raise ValueError(
"Each message in messages_list should be a list of dictionaries, but not all elements are dictionaries."
)
if any(len(m.keys()) != 2 for m in messages) or any(
set(m.keys()) != {"role", "content"} for m in messages
):
raise ValueError(
"Each message in messages_list should be a list of dictionaries with two keys: 'role' and 'content'."
f"But got: {messages}"
)
else:
raise ValueError(
f"The messages_list should be a list of lists of dictionaries, but it's {type(messages_list)}."
)
sa_tokens = self.process_proteins(proteins, protein_max_length)
text_tokens = self.process_text(messages_list, text_max_length)
return BatchFeature(
data={
"protein_input_ids": sa_tokens["input_ids"],
"protein_attention_mask": sa_tokens["attention_mask"],
"input_ids": text_tokens["input_ids"],
"attention_mask": text_tokens["attention_mask"],
}
)
def batch_decode(self, *args, **kwargs):
return self.tokenizer.batch_decode(*args, **kwargs)
def decode(self, *args, **kwargs):
return self.tokenizer.decode(*args, **kwargs)
def protein_batch_decode(self, *args, **kwargs):
return self.protein_tokenizer.batch_decode(*args, **kwargs)
def protein_decode(self, *args, **kwargs):
return self.protein_tokenizer.decode(*args, **kwargs)
# overwrite to save the protein tokenizer in a separate folder
# Adapted from instructblip.processing_instructblip.py (https://github.com/huggingface/transformers/blob/9b479a245b793cac2a8b2e87c6d8e81bb24e20c4/src/transformers/models/instructblip/processing_instructblip.py#L191-L221)
def save_pretrained(self, save_directory, **kwargs):
# only save the protein tokenizer in sub_dir
self.protein_tokenizer.save_pretrained(os.path.join(save_directory, self.protein_tokenizer_dir_name))
# we modify the attributes so that only the text tokenizer is saved in the main folder
protein_tokenizer_present = "protein_tokenizer" in self.attributes
# find the correct position of it in the attributes list
protein_tokenizer_index = self.attributes.index("protein_tokenizer") if protein_tokenizer_present else None
if protein_tokenizer_present and protein_tokenizer_index is not None:
self.attributes.remove("protein_tokenizer")
outputs = super().save_pretrained(save_directory, **kwargs)
if protein_tokenizer_present and protein_tokenizer_index is not None:
self.attributes.insert(protein_tokenizer_index, "protein_tokenizer")
return outputs
# overwrite to load the protein tokenizer from a separate folder
# Adapted from instructblip.processing_instructblip.py (https://github.com/huggingface/transformers/blob/9b479a245b793cac2a8b2e87c6d8e81bb24e20c4/src/transformers/models/instructblip/processing_instructblip.py#L191-L221)
@classmethod
def from_pretrained(cls, pretrained_model_name_or_path, **kwargs):
processor = super().from_pretrained(pretrained_model_name_or_path, **kwargs)
# if return_unused_kwargs a tuple is returned where the second element is 'unused_kwargs'
if isinstance(processor, tuple):
processor = processor[0]
protein_tokenizer = AutoTokenizer.from_pretrained(
pretrained_model_name_or_path, subfolder=cls.protein_tokenizer_dir_name
)
processor.protein_tokenizer = protein_tokenizer
return processor
__all__ = ["EvollaProcessor"]
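
The input checks in `__call__` above can be exercised without loading any tokenizer. The following is a minimal standalone sketch, not the processor's actual code: `normalize_inputs` is a hypothetical helper, and the key set `("aa_seq", "foldseek")` is an assumption inferred from the tests, since the real `PROTEIN_VALID_KEYS` is defined outside this excerpt. The empty-list guard is also an addition of this sketch.

```python
# Hypothetical stand-in for PROTEIN_VALID_KEYS; the real constant is defined elsewhere.
ASSUMED_PROTEIN_KEYS = ("aa_seq", "foldseek")


def normalize_inputs(proteins, messages_list):
    """Wrap single inputs into batches and validate them, roughly as the checks above do."""
    if proteins is None or messages_list is None:
        raise ValueError("You need to specify `messages_list` and `proteins`.")
    # A single protein dict becomes a batch of one.
    if isinstance(proteins, dict):
        proteins = [proteins]
    if not isinstance(messages_list, (list, tuple)):
        raise ValueError(f"messages_list should be a list, but it's {type(messages_list)}.")
    # A single conversation (a flat list of message dicts) becomes a batch of one.
    # The emptiness guard is specific to this sketch.
    if messages_list and not isinstance(messages_list[0], (list, tuple)):
        messages_list = [messages_list]
    if not all(isinstance(p, dict) for p in proteins):
        raise ValueError("Every protein should be a dictionary.")
    if not all(all(k in ASSUMED_PROTEIN_KEYS for k in p) for p in proteins):
        raise ValueError(f"Protein keys should be among {ASSUMED_PROTEIN_KEYS}, but got: {proteins}")
    for messages in messages_list:
        if not isinstance(messages, (list, tuple)):
            raise ValueError(f"Each conversation should be a list, but got {type(messages)}.")
        if not all(isinstance(m, dict) for m in messages):
            raise ValueError("Each message should be a dictionary.")
        if any(set(m) != {"role", "content"} for m in messages):
            raise ValueError(f"Each message needs exactly 'role' and 'content', but got: {messages}")
    return list(proteins), list(messages_list)


# Usage: a single protein and a single conversation are both promoted to batches of one.
proteins, batch = normalize_inputs(
    {"aa_seq": "AAAA", "foldseek": "dddd"},
    [{"role": "user", "content": "What is the function of this protein?"}],
)
```

Running the validation before tokenization means a malformed batch fails with a readable message instead of an error from deep inside a tokenizer.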

@@ -0,0 +1,397 @@
# coding=utf-8
# Copyright 2025 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Testing suite for the PyTorch Evolla model."""
import unittest
from parameterized import parameterized
from transformers import BitsAndBytesConfig, EvollaConfig, is_torch_available
from transformers.testing_utils import (
TestCasePlus,
require_bitsandbytes,
require_torch,
require_torch_sdpa,
slow,
torch_device,
)
from transformers.utils import (
cached_property,
)
from ...test_configuration_common import ConfigTester
from ...test_modeling_common import (
TEST_EAGER_MATCHES_SDPA_INFERENCE_PARAMETERIZATION,
ModelTesterMixin,
_config_zero_init,
ids_tensor,
random_attention_mask,
)
from ...test_pipeline_mixin import PipelineTesterMixin
if is_torch_available():
import torch
from transformers import EvollaForProteinText2Text, EvollaModel, EvollaProcessor
class EvollaModelTester:
def __init__(
self,
parent,
batch_size=1,
is_training=False,
text_seq_length=20,
text_vocab_size=100,
protein_seq_length=10,
protein_vocab_size=20,
hidden_size=4, # llama hidden size
intermediate_size=7, # llama intermediate size
num_hidden_layers=1, # llama hidden layers
num_attention_heads=2, # llama attention heads
num_key_value_heads=2, # llama key value heads
protein_hidden_size=8, # protein encoder hidden size
protein_num_hidden_layers=1, # protein encoder hidden layers
protein_num_attention_heads=4, # protein encoder attention heads
protein_intermediate_size=11, # protein encoder intermediate size
resampler_num_latents=7, # sequence compressor num latents
resampler_ff_mult=1, # sequence compressor ff mult
resampler_depth=2, # sequence compressor depth
resampler_dim_head=4, # sequence compressor dim head
resampler_heads=2, # sequence compressor heads
aligner_num_add_layers=1, # sequence aligner num add layers
aligner_ffn_mult=1, # sequence aligner ffn mult
use_input_mask=True,
):
self.parent = parent
self.batch_size = batch_size
self.protein_seq_length = protein_seq_length
self.protein_vocab_size = protein_vocab_size
self.text_seq_length = text_seq_length
self.text_vocab_size = text_vocab_size
self.seq_length = text_seq_length
self.hidden_size = hidden_size
self.intermediate_size = intermediate_size
self.num_hidden_layers = num_hidden_layers
self.num_attention_heads = num_attention_heads
self.num_key_value_heads = num_key_value_heads
self.protein_hidden_size = protein_hidden_size
self.protein_num_hidden_layers = protein_num_hidden_layers
self.protein_num_attention_heads = protein_num_attention_heads
self.protein_intermediate_size = protein_intermediate_size
self.resampler_num_latents = resampler_num_latents
self.resampler_ff_mult = resampler_ff_mult
self.resampler_depth = resampler_depth
self.resampler_dim_head = resampler_dim_head
self.resampler_heads = resampler_heads
self.aligner_num_add_layers = aligner_num_add_layers
self.aligner_ffn_mult = aligner_ffn_mult
self.use_input_mask = use_input_mask
self.is_training = is_training
@property
def is_encoder_decoder(self):
return False
def prepare_config_and_inputs(self, num_proteins=None):
batch_size = num_proteins if num_proteins is not None else self.batch_size
text_input_ids = ids_tensor([batch_size, self.text_seq_length], self.text_vocab_size)
protein_input_ids = ids_tensor([batch_size, self.protein_seq_length], self.protein_vocab_size)
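# default to None so the return below does not fail when use_input_mask is False
text_input_mask = None
protein_input_mask = None
if self.use_input_mask: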
text_input_mask = random_attention_mask([batch_size, self.text_seq_length])
protein_input_mask = random_attention_mask([batch_size, self.protein_seq_length])
config = self.get_config()
return (config, text_input_ids, text_input_mask, protein_input_ids, protein_input_mask)
def get_config(self):
return EvollaConfig(
protein_encoder_config={
"vocab_size": self.protein_vocab_size,
"hidden_size": self.protein_hidden_size,
"num_hidden_layers": self.protein_num_hidden_layers,
"num_attention_heads": self.protein_num_attention_heads,
"intermediate_size": self.protein_intermediate_size,
},
vocab_size=self.text_vocab_size,
hidden_size=self.hidden_size,
intermediate_size=self.intermediate_size,
num_hidden_layers=self.num_hidden_layers,
num_attention_heads=self.num_attention_heads,
num_key_value_heads=self.num_key_value_heads,
aligner_ffn_mult=self.aligner_ffn_mult,
aligner_num_add_layers=self.aligner_num_add_layers,
resampler_depth=self.resampler_depth,
resampler_dim_head=self.resampler_dim_head,
resampler_heads=self.resampler_heads,
resampler_num_latents=self.resampler_num_latents,
resampler_ff_mult=self.resampler_ff_mult,
)
def create_and_check_model(
self,
config,
input_ids,
input_mask,
protein_input_ids,
protein_input_mask,
batch_size=None,
):
batch_size = batch_size if batch_size is not None else self.batch_size
model = EvollaModel(config=config)
model.to(torch_device)
model.eval()
result = model(
input_ids,
attention_mask=input_mask,
protein_input_ids=protein_input_ids,
protein_attention_mask=protein_input_mask,
)
self.parent.assertEqual(result.last_hidden_state.shape, (batch_size, input_ids.shape[1], self.hidden_size))
def create_and_check_model_gen(
self,
config,
input_ids,
input_mask,
protein_input_ids,
protein_input_mask,
):
model = EvollaForProteinText2Text(config)
model.to(torch_device)
model.eval()
model.generate(
input_ids,
attention_mask=input_mask,
protein_input_ids=protein_input_ids,
protein_attention_mask=protein_input_mask,
max_length=self.seq_length + 2,
)
def prepare_config_and_inputs_for_common(self):
config_and_inputs = self.prepare_config_and_inputs()
(config, text_input_ids, text_input_mask, protein_input_ids, protein_input_mask) = config_and_inputs
inputs_dict = {
"input_ids": text_input_ids,
"attention_mask": text_input_mask,
"protein_input_ids": protein_input_ids,
"protein_attention_mask": protein_input_mask,
}
return config, inputs_dict
@require_torch
class EvollaModelTest(ModelTesterMixin, PipelineTesterMixin, unittest.TestCase):
all_model_classes = (EvollaModel, EvollaForProteinText2Text) if is_torch_available() else ()
pipeline_model_mapping = {"feature-extraction": EvollaModel} if is_torch_available() else {}
test_pruning = False
test_headmasking = False
test_torchscript = False
test_resize_embeddings = False
maxDiff = None
def setUp(self):
self.model_tester = EvollaModelTester(self)
self.config_tester = ConfigTester(self, config_class=EvollaConfig, hidden_size=37)
@property
def is_encoder_decoder(self):
return self.model_tester.is_encoder_decoder
def _prepare_for_class(self, inputs_dict, model_class, return_labels=False):
inputs_dict = super()._prepare_for_class(inputs_dict, model_class, return_labels=return_labels)
# XXX: EvollaForProteinText2Text has no MODEL_FOR group yet, but it should be the same
# as MODEL_FOR_CAUSAL_LM_MAPPING_NAMES, so for now manually changing to do the right thing
# as super won't do it
if return_labels:
inputs_dict["labels"] = torch.zeros(
(self.model_tester.batch_size, self.model_tester.seq_length), dtype=torch.long, device=torch_device
)
return inputs_dict
def test_model_outputs_equivalence(self):
try:
orig = self.all_model_classes
# EvollaModel.forward doesn't have labels input arg - only EvollaForProteinText2Text does
self.all_model_classes = (EvollaForProteinText2Text,) if is_torch_available() else ()
super().test_model_outputs_equivalence()
finally:
self.all_model_classes = orig
def test_config(self):
self.config_tester.run_common_tests()
def test_model_single_protein(self):
config_and_inputs = self.model_tester.prepare_config_and_inputs(num_proteins=1)
self.model_tester.create_and_check_model(*config_and_inputs, batch_size=1)
def test_model_multiple_proteins(self):
config_and_inputs = self.model_tester.prepare_config_and_inputs(num_proteins=2)
self.model_tester.create_and_check_model(*config_and_inputs, batch_size=2)
def test_generate_single_protein(self):
config_and_inputs = self.model_tester.prepare_config_and_inputs(num_proteins=1)
self.model_tester.create_and_check_model_gen(*config_and_inputs)
def test_generate_multiple_proteins(self):
config_and_inputs = self.model_tester.prepare_config_and_inputs(num_proteins=2)
self.model_tester.create_and_check_model_gen(*config_and_inputs)
def test_saprot_output(self):
config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()
config.return_dict = True
protein_informations = {
"input_ids": inputs_dict["protein_input_ids"],
"attention_mask": inputs_dict["protein_attention_mask"],
}
for model_class in self.all_model_classes:
if model_class is not EvollaModel:
continue
model = model_class(config)
model.to(torch_device)
model.eval()
protein_encoder_outputs = model.protein_encoder.model(**protein_informations, return_dict=True)
print(model_class, protein_encoder_outputs)
def test_protein_encoder_output(self):
config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()
config.return_dict = True
protein_informations = {
"input_ids": inputs_dict["protein_input_ids"],
"attention_mask": inputs_dict["protein_attention_mask"],
}
for model_class in self.all_model_classes:
if model_class is not EvollaModel:
continue
model = model_class(config)
model.to(torch_device)
model.eval()
protein_encoder_outputs = model.protein_encoder(**protein_informations, return_dict=True)
print(model_class, protein_encoder_outputs)
def test_single_forward(self):
config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()
config.return_dict = True
for model_class in self.all_model_classes:
inputs_dict["output_attentions"] = True
inputs_dict["output_hidden_states"] = False
config.return_dict = True
model = model_class(config)
model.to(torch_device)
model.eval()
with torch.no_grad():
outputs = model(**self._prepare_for_class(inputs_dict, model_class))
print(outputs)
def test_initialization(self):
# we skip the latents initialization test
config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()
configs_no_init = _config_zero_init(config)
for model_class in self.all_model_classes:
model = model_class(config=configs_no_init)
for name, param in model.named_parameters():
if param.requires_grad:
# skip latents
if name.endswith("latents"):
print(f"Skipping latents {name}")
continue
self.assertIn(
((param.data.mean() * 1e9).round() / 1e9).item(),
[0.0, 1.0],
msg=f"Parameter {name} of model {model_class} seems not properly initialized",
)
@parameterized.expand(TEST_EAGER_MATCHES_SDPA_INFERENCE_PARAMETERIZATION)
@require_torch_sdpa
@unittest.skip("Evolla requires both text and protein inputs which is currently not done in this test.")
def test_eager_matches_sdpa_inference(self):
pass
@unittest.skip("Evolla does not support eager attention implementation.")
def test_eager_padding_matches_padding_free_with_position_ids(self):
pass
@unittest.skip(
"Evolla has a separate test runner for generation tests with complex inheritance, causing this check to fail."
)
def test_generation_tester_mixin_inheritance(self):
pass
@unittest.skip("Evolla requires both text and protein inputs which is currently not done in this test.")
def test_flex_attention_with_grads(self):
pass
@require_torch
class EvollaModelIntegrationTest(TestCasePlus):
def _prepare_for_inputs(self):
aa_seq = "MLLEETLKSCPIVKRGKYHYFIHPISDGVPLVEPKLLREVATRIIKIGNFEGVNKIVTAEAMGIPLVTTLSLYTDIPYVIMRKREYKLPGEVPVFQSTGYSKGQLYLNGIEKGDKVIIIDDVISTGGTMIAIINALERAGAEIKDIICVIERGDGKKIVEEKTGYKIKTLVKIDVVDGEVVIL"
foldseek = "dvvvvqqqpfawdddppdtdgcgclapvpdpddpvvlvvllvlcvvpadpvqaqeeeeeddscpsnvvsncvvpvhyydywylddppdppkdwqwf######gitidpdqaaaheyeyeeaeqdqlrvvlsvvvrcvvrnyhhrayeyaeyhycnqvvccvvpvghyhynwywdqdpsgidtd"
question = "What is the function of this protein?"
protein_information = {
"aa_seq": aa_seq,
"foldseek": foldseek,
}
messages = [
{"role": "system", "content": "You are an AI expert that can answer any questions about protein."},
{"role": "user", "content": question},
]
return protein_information, messages
@cached_property
def default_processor(self):
return EvollaProcessor.from_pretrained("westlake-repl/Evolla-10B-hf", revision="refs/pr/11")
@require_bitsandbytes
@slow
def test_inference_natural_language_protein_reasoning(self):
protein_information, messages = self._prepare_for_inputs()
processor = self.default_processor
inputs = processor(
messages_list=[messages], proteins=[protein_information], return_tensors="pt", padding="longest"
).to(torch_device)
# the CI GPU is small, so use 4-bit quantization to fit the model
quantization_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_compute_dtype="float16",
)
model = EvollaForProteinText2Text.from_pretrained(
"westlake-repl/Evolla-10B-hf",
quantization_config=quantization_config,
device_map="auto",
)
generated_ids = model.generate(**inputs, max_new_tokens=100, do_sample=False)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)
# keep for debugging
for i, t in enumerate(generated_text):
t = bytes(t, "utf-8").decode("unicode_escape")
print(f"{i}:\n{t}\n")
self.assertIn("This protein", generated_text[0])
self.assertIn("purine", generated_text[0])
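
`test_initialization` earlier in this file accepts a parameter when its mean, rounded to nine decimal places, is exactly 0.0 or 1.0, and it skips parameters whose name ends in `latents`. A rough plain-Python sketch of that rule follows. `passes_init_check` is a hypothetical name, and lists of floats stand in for tensors, so this only illustrates the check and does not reproduce the torch code.

```python
def passes_init_check(name, values, skip_suffix="latents"):
    """Return True if a parameter passes the rounded-mean check or is skipped."""
    # Parameters such as the resampler latents are excluded from the check.
    if name.endswith(skip_suffix):
        return True
    mean = sum(values) / len(values)
    # Round to nine decimal places, the same precision the test uses.
    rounded = round(mean * 1e9) / 1e9
    return rounded in (0.0, 1.0)


# Usage: near-zero weights and all-ones weights pass, anything else fails.
print(passes_init_check("layer.weight", [1e-12, -1e-12, 0.0]))
print(passes_init_check("norm.weight", [1.0, 1.0, 1.0]))
print(passes_init_check("layer.bias", [0.02, -0.01]))
```

The rounding tolerates tiny numerical noise while still flagging parameters that were left at an unintended initialization scale.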

@@ -0,0 +1,295 @@
# Copyright 2025 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import random
import shutil
import tempfile
import unittest
from transformers import (
AutoProcessor,
EvollaProcessor,
)
from transformers.testing_utils import require_torch
from transformers.utils import is_torch_available
from ...test_processing_common import ProcessorTesterMixin
if is_torch_available():
import torch
EVOLLA_VALID_AA = list("ACDEFGHIKLMNPQRSTVWY#")
EVOLLA_VALID_FS = list("pynwrqhgdlvtmfsaeikc#")
@require_torch
class EvollaProcessorTest(ProcessorTesterMixin, unittest.TestCase):
processor_class = EvollaProcessor
def setUp(self):
self.tmpdirname = tempfile.mkdtemp()
processor = EvollaProcessor.from_pretrained("westlake-repl/Evolla-10B-hf")
processor.save_pretrained(self.tmpdirname)
self.input_keys = ["protein_input_ids", "protein_attention_mask", "input_ids", "attention_mask"]
def prepare_input_and_expected_output(self):
amino_acid_sequence = "AAAA"
foldseek_sequence = "dddd"
question = "What is the function of this protein?"
expected_output = {
"protein_input_ids": torch.tensor([[0, 13, 13, 13, 13, 2]]),
"protein_attention_mask": torch.tensor([[1, 1, 1, 1, 1, 1]]),
"input_ids": torch.tensor(
[
[
128000,
128006,
9125,
128007,
271,
2675,
527,
459,
15592,
6335,
430,
649,
4320,
904,
4860,
922,
13128,
13,
128009,
128006,
882,
128007,
271,
3923,
374,
279,
734,
315,
420,
13128,
30,
128009,
128006,
78191,
128007,
271,
]
]
),
"attention_mask": torch.tensor(
[
[
1,
1,
1,
1,
1,
1,
1,
1,
1,
1,
1,
1,
1,
1,
1,
1,
1,
1,
1,
1,
1,
1,
1,
1,
1,
1,
1,
1,
1,
1,
1,
1,
1,
1,
1,
1,
]
]
),
}
protein_dict = {"aa_seq": amino_acid_sequence, "foldseek": foldseek_sequence}
message = [
{"role": "system", "content": "You are an AI expert that can answer any questions about protein."},
{"role": "user", "content": question},
]
return protein_dict, message, expected_output
def test_processor(self):
protein_tokenizer = self.get_protein_tokenizer()
tokenizer = self.get_tokenizer()
processor = EvollaProcessor(protein_tokenizer, tokenizer)
protein_dict, message, expected_output = self.prepare_input_and_expected_output()
inputs = processor(proteins=[protein_dict], messages_list=[message])
# check that the processor output matches the expected tensors
for key, value in expected_output.items():
self.assertTrue(
torch.equal(inputs[key], value),
f"inputs[{key!r}] is {inputs[key]} but expected {value}",
)
def get_tokenizer(self, **kwargs):
return AutoProcessor.from_pretrained(self.tmpdirname, **kwargs).tokenizer
def get_protein_tokenizer(self, **kwargs):
return AutoProcessor.from_pretrained(self.tmpdirname, **kwargs).protein_tokenizer
def tearDown(self):
shutil.rmtree(self.tmpdirname)
def prepare_inputs_single(self):
proteins = {
"aa_seq": "".join(random.choices(EVOLLA_VALID_AA, k=100)),
"foldseek": "".join(random.choices(EVOLLA_VALID_FS, k=100)),
}
return proteins
def prepare_inputs_pair(self):
proteins = [
{
"aa_seq": "".join(random.choices(EVOLLA_VALID_AA, k=100)),
"foldseek": "".join(random.choices(EVOLLA_VALID_FS, k=100)),
},
{
"aa_seq": "".join(random.choices(EVOLLA_VALID_AA, k=100)),
"foldseek": "".join(random.choices(EVOLLA_VALID_FS, k=100)),
},
]
return proteins
def prepare_inputs_long(self):
proteins = [
{
"aa_seq": "".join(random.choices(EVOLLA_VALID_AA, k=100)),
"foldseek": "".join(random.choices(EVOLLA_VALID_FS, k=100)),
},
{
"aa_seq": "".join(random.choices(EVOLLA_VALID_AA, k=2000)),
"foldseek": "".join(random.choices(EVOLLA_VALID_FS, k=2000)),
},
]
return proteins
def prepare_inputs_short(self):
proteins = [
{
"aa_seq": "".join(random.choices(EVOLLA_VALID_AA, k=1)),
"foldseek": "".join(random.choices(EVOLLA_VALID_FS, k=1)),
},
{
"aa_seq": "".join(random.choices(EVOLLA_VALID_AA, k=100)),
"foldseek": "".join(random.choices(EVOLLA_VALID_FS, k=100)),
},
]
return proteins
def prepare_inputs_empty(self):
proteins = [
{
"aa_seq": "",
"foldseek": "",
},
{
"aa_seq": "".join(random.choices(EVOLLA_VALID_AA, k=100)),
"foldseek": "".join(random.choices(EVOLLA_VALID_FS, k=100)),
},
]
return proteins
def prepare_inputs(self, protein_types="pair"):
r"""
Prepare inputs for the test.
Args:
protein_types (`str`): which protein fixture to prepare.
- "single": a single valid protein (a dict rather than a list).
- "pair": two valid proteins.
- "long": a valid protein and a very long protein (2000 residues).
- "short": a one-residue protein and a valid protein.
- "empty": a protein with empty sequences and a valid protein.
"""
if protein_types == "single":
proteins = self.prepare_inputs_single()
elif protein_types == "pair":
proteins = self.prepare_inputs_pair()
elif protein_types == "long":
proteins = self.prepare_inputs_long()
elif protein_types == "short":
proteins = self.prepare_inputs_short()
elif protein_types == "empty":
proteins = self.prepare_inputs_empty()
else:
raise ValueError(
f"protein_types should be one of 'single', 'pair', 'long', 'short', 'empty', but got {protein_types}"
)
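# a single protein is a dict, so count it as one protein rather than by its number of keys
num_proteins = 1 if isinstance(proteins, dict) else len(proteins)
questions = ["What is the function of the protein?"] * num_proteins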
messages_list = []
for question in questions:
messages = [
{"role": "system", "content": "You are an AI expert that can answer any questions about protein."},
{"role": "user", "content": question},
]
messages_list.append(messages)
return proteins, messages_list
def test_tokenizer_decode(self):
protein_tokenizer = self.get_protein_tokenizer()
tokenizer = self.get_tokenizer()
processor = EvollaProcessor(tokenizer=tokenizer, protein_tokenizer=protein_tokenizer, return_tensors="pt")
predicted_ids = [[1, 4, 5, 8, 1, 0, 8], [3, 4, 3, 1, 1, 8, 9]]
decoded_processor = processor.batch_decode(predicted_ids)
decoded_tok = tokenizer.batch_decode(predicted_ids)
self.assertListEqual(decoded_tok, decoded_processor)
def test_model_input_names(self):
protein_tokenizer = self.get_protein_tokenizer()
tokenizer = self.get_tokenizer()
processor = EvollaProcessor(tokenizer=tokenizer, protein_tokenizer=protein_tokenizer)
proteins, messages_list = self.prepare_inputs()
inputs = processor(messages_list=messages_list, proteins=proteins, padding="longest", return_tensors="pt")
# The processor should return exactly the keys listed in `self.input_keys`
self.assertSetEqual(set(inputs.keys()), set(self.input_keys))
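
The `prepare_inputs_*` helpers above all follow one pattern: draw random amino-acid and Foldseek strings of matching length, then pair each protein with the same system/user conversation. A compact sketch of that pattern is shown below. `random_protein` and `build_batch` are hypothetical helpers written for illustration, and the optional `seed` argument is an addition that the test file does not have.

```python
import random

EVOLLA_VALID_AA = list("ACDEFGHIKLMNPQRSTVWY#")
EVOLLA_VALID_FS = list("pynwrqhgdlvtmfsaeikc#")
SYSTEM_PROMPT = "You are an AI expert that can answer any questions about protein."


def random_protein(length, rng=random):
    """Build one protein dict with amino-acid and Foldseek strings of equal length."""
    return {
        "aa_seq": "".join(rng.choices(EVOLLA_VALID_AA, k=length)),
        "foldseek": "".join(rng.choices(EVOLLA_VALID_FS, k=length)),
    }


def build_batch(lengths, question="What is the function of the protein?", seed=None):
    """Return (proteins, messages_list) with one conversation per protein."""
    rng = random.Random(seed)  # seeded generator keeps fixtures reproducible
    proteins = [random_protein(n, rng) for n in lengths]
    messages_list = [
        [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": question},
        ]
        for _ in proteins
    ]
    return proteins, messages_list


# Usage: lengths 0, 1, 100 and 2000 correspond to the "empty", "short", normal and "long" cases.
proteins, messages_list = build_batch([0, 1, 100, 2000], seed=0)
```

Parameterizing by length keeps the edge cases (empty, one residue, very long) in one place instead of in five near-identical methods.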

@@ -92,6 +92,7 @@ PRIVATE_MODELS = [
"Phi4MultimodalAudioModel",
"Phi4MultimodalVisionModel",
"Glm4vVisionModel",
"EvollaSaProtPreTrainedModel",
]
# Update this list for models that are not tested with a comment explaining the reason it should not be.