Add evolla rebase main (#36232)

* add evolla
* adding protein encoder part
* add initial processing test
* save processor
* add docstring
* add evolla processor
* add two test
* change vision to protein
* change resampler to sequence_compressor
* change vision to protein
* initial update for llama
* add initial update for llamaForCausalLM
* add `test_processor`, `test_saprot_output`, `test_protein_encoder_output`
* change evolla, but still working on it
* add test_single_forward
* pass test_attention_outputs
* pass test_hidden_states_output
* pass test_save_load and test_from_pretrained_no_checkpoint
* pass test_cpu_offload
* skip some tests
* update new progress
* skip test_model_is_small
* pass test_model_weights_reload_no_missing_tied_weights
* pass test_model_get_set_embeddings
* pass test_cpu_offload
* skip test_resize_embeddings
* add pipeline_model_mapping
* remote old setUp
* pass processor save_pretrained and load_pretrained
* remove pooling layer
* pass test_inputs_embeds_matches_input_ids
* pass test_model_is_small
* pass test_attention_outputs
* pass test_initialization
* pass test_model_get_set_embeddings
* pass test_single_forward
* skip test_disk_offload_bin and test_disk_offload_safetensors
* fix most tests
* pass test_protein_encoder_output
* remove useless code
* add EvollaForProteinText2Text
* pass test_saprot_output
* pass all EvollaModelTest test and remove processor test
* add processor test to its own file
* skip is_training since esm skipped it and the saprot code causes error when setting is_training True
* pass processor tests
* solve all except config
* pass most cases
* change init
* add doc to `configuration_evolla.py`
* remove image_processing test
* remove extra processor test
* remove extra modules
* remove extra modules
* change all configs into one config
* pass all evolla test
* pass `make fixup`
* update short summary
* update Evolla-10B-hf
* pass check_dummies.py and check_code_quality
* fix `tests/models/auto/test_tokenization_auto.py::AutoTokenizerTest::test_model_name_edge_cases_in_mappings`
* remove dummy codes
* change format
* fix llava issue
* update format
* update to solve llama3 access issue
* update to make forward right
* solve processor save load problem from instructblip solution
* remove unexpected file
* skip `test_generation_tester_mixin_inheritance`
* add `test_single_forward_correct` and `test_inference_natural_language_protein_reasoning`
* add `modular_evolla.py`
* solved issue #36362
* run `make fixup`
* update modular
* solve float32 training
* add fix
* solve `utils/check_docstrings.py`
* update
* update
* update
* remove other files and replace sequential and einsum
* add use case in document
* update the models
* update model
* change some wrong code
* Update src/transformers/models/evolla/modular_evolla.py
  Co-authored-by: Cyril Vallez <cyril.vallez@gmail.com>
* Update src/transformers/models/evolla/modular_evolla.py
  Co-authored-by: Cyril Vallez <cyril.vallez@gmail.com>
* Update src/transformers/models/evolla/modular_evolla.py
  Co-authored-by: Cyril Vallez <cyril.vallez@gmail.com>
* Update src/transformers/models/evolla/modular_evolla.py
  Co-authored-by: Cyril Vallez <cyril.vallez@gmail.com>
* fix issues mentioned in PR
* update style and rearrange the placement
* fix return_dict argument issue
* solve SaProtConfig issue
* Solve EvollaSaProtRotaryEmbedding issue
* solve attention_mask issue
* solve almosst all issues
* make style
* update config
* remove unrelated pickle file
* delete pickle files
* fix config
* simplify a lot
* remove past k-v from encoder
* continue work
* style
* skip it from init
* fix init
* fix init
* simplify more
* fill in docstrings
* change test for generation
* skip test
* fix style

---------

Co-authored-by: Chenchen Han <13980209828@163.com>
Co-authored-by: Cyril Vallez <cyril.vallez@huggingface.co>
Co-authored-by: Cyril Vallez <cyril.vallez@gmail.com>
@@ -975,6 +975,8 @@
   title: Donut
 - local: model_doc/emu3
   title: Emu3
+- local: model_doc/evolla
+  title: Evolla
 - local: model_doc/flava
   title: FLAVA
 - local: model_doc/gemma3
95
docs/source/en/model_doc/evolla.md
Normal file
@@ -0,0 +1,95 @@
<!--Copyright 2025 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.

⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.

-->

# Evolla

## Overview

The Evolla model was proposed in [Decoding the Molecular Language of Proteins with Evolla](https://doi.org/10.1101/2025.01.05.630192) by [Zhou et al.](https://doi.org/10.1101/2025.01.05.630192).

Evolla is an advanced 80-billion-parameter protein-language generative model designed to decode the molecular language of proteins. It integrates information from protein sequences, structures, and user queries to generate precise and contextually nuanced insights into protein function. Trained on an unprecedented AI-generated dataset of 546 million protein question-answer pairs and 150 billion word tokens, Evolla significantly advances research in proteomics and functional genomics, providing expert-level insights and shedding light on the molecular logic encoded in proteins.

The abstract from the paper is the following:

*Proteins, nature’s intricate molecular machines, are the products of billions of years of evolution and play fundamental roles in sustaining life. Yet, deciphering their molecular language - that is, understanding how protein sequences and structures encode and determine biological functions - remains a corner-stone challenge in modern biology. Here, we introduce Evolla, an 80 billion frontier protein-language generative model designed to decode the molecular language of proteins. By integrating information from protein sequences, structures, and user queries, Evolla generates precise and contextually nuanced insights into protein function. A key innovation of Evolla lies in its training on an unprecedented AI-generated dataset: 546 million protein question-answer pairs and 150 billion word tokens, designed to reflect the immense complexity and functional diversity of proteins. Post-pretraining, Evolla integrates Direct Preference Optimization (DPO) to refine the model based on preference signals and Retrieval-Augmented Generation (RAG) for external knowledge incorporation, improving response quality and relevance. To evaluate its performance, we propose a novel framework, Instructional Response Space (IRS), demonstrating that Evolla delivers expert-level insights, advancing research in proteomics and functional genomics while shedding light on the molecular logic encoded in proteins. The online demo is available at http://www.chat-protein.com/.*

Examples:

```python
import torch

from transformers import EvollaForProteinText2Text, EvollaProcessor

processor = EvollaProcessor.from_pretrained("westlake-repl/Evolla-10B-DPO-hf")
model = EvollaForProteinText2Text.from_pretrained("westlake-repl/Evolla-10B-DPO-hf")
# aa_seq must have the same length as foldseek
protein_inputs = [
    {

        "aa_seq": "MATGGRRG...",
        "foldseek": "###lqpfd...",  # `#` marks low-confidence foldseek tokens
    },
    {
        "aa_seq": "MLPGLALL...",
        "foldseek": "dfwwkwad...",
    }
]
# one conversation per protein, in the same order as protein_inputs
message_list = [
    [
        {
            "role": "system",
            "content": "You are an AI expert that can answer any questions about protein.",
        },
        {"role": "user", "content": "What is the function of this protein?"},
    ],
    [
        {
            "role": "system",
            "content": "You are an AI expert that can answer any questions about protein.",
        },
        {"role": "user", "content": "What is the function of this protein?"},
    ]
]
input_dict = processor(
    protein_inputs, message_list, return_tensors="pt", text_max_length=512, protein_max_length=1024
)
with torch.no_grad():
    generated_ids = model.generate(**input_dict)
generated_texts = processor.batch_decode(
    generated_ids, skip_special_tokens=True
)
```
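
The example above notes that `aa_seq` and `foldseek` must have the same length and that `#` marks low-confidence foldseek tokens. A minimal sketch of checking those constraints before calling the processor could look like the following. The helpers `check_protein_inputs` and `low_confidence_fraction` are hypothetical and not part of the library; only the key list is taken from `PROTEIN_VALID_KEYS` in `processing_evolla.py`, and whether the processor performs equivalent checks itself is not shown in this diff.

```python
# Key names copied from PROTEIN_VALID_KEYS in processing_evolla.py.
PROTEIN_VALID_KEYS = ["aa_seq", "foldseek", "msa"]


def check_protein_inputs(proteins):
    """Raise ValueError if a protein dict looks malformed (hypothetical helper)."""
    for index, protein in enumerate(proteins):
        # Reject keys that the processor does not list as valid.
        unknown = set(protein) - set(PROTEIN_VALID_KEYS)
        if unknown:
            raise ValueError(f"protein {index}: unexpected keys {sorted(unknown)}")
        aa_seq = protein.get("aa_seq", "")
        foldseek = protein.get("foldseek", "")
        # One foldseek token is expected per amino-acid residue.
        if len(aa_seq) != len(foldseek):
            raise ValueError(
                f"protein {index}: aa_seq has {len(aa_seq)} residues "
                f"but foldseek has {len(foldseek)} tokens"
            )


def low_confidence_fraction(foldseek):
    """Fraction of foldseek tokens written as '#' (hypothetical helper)."""
    return foldseek.count("#") / len(foldseek) if foldseek else 0.0
```

Calling `check_protein_inputs(protein_inputs)` before `processor(...)` would surface a length mismatch early, under the assumption that the full sequences (not the truncated `...` placeholders) are supplied.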

Tips:

- This model was contributed by [Xibin Bayes Zhou](https://huggingface.co/XibinBayesZhou).
- The original code can be found [here](https://github.com/westlake-repl/Evolla).


## EvollaConfig

[[autodoc]] EvollaConfig

## EvollaModel

[[autodoc]] EvollaModel
    - forward

## EvollaForProteinText2Text

[[autodoc]] EvollaForProteinText2Text
    - forward

## EvollaProcessor

[[autodoc]] EvollaProcessor
    - __call__
@@ -110,6 +110,7 @@ if TYPE_CHECKING:
     from .encoder_decoder import *
     from .ernie import *
     from .esm import *
+    from .evolla import *
     from .falcon import *
     from .falcon_h1 import *
     from .falcon_mamba import *
@@ -133,6 +133,7 @@ CONFIG_MAPPING_NAMES = OrderedDict[str, str](
     ("ernie4_5_moe", "Ernie4_5_MoeConfig"),
     ("ernie_m", "ErnieMConfig"),
     ("esm", "EsmConfig"),
+    ("evolla", "EvollaConfig"),
     ("falcon", "FalconConfig"),
     ("falcon_h1", "FalconH1Config"),
     ("falcon_mamba", "FalconMambaConfig"),
@@ -528,6 +529,7 @@ MODEL_NAMES_MAPPING = OrderedDict[str, str](
     ("ernie4_5_moe", "Ernie4_5_MoE"),
     ("ernie_m", "ErnieM"),
     ("esm", "ESM"),
+    ("evolla", "Evolla"),
     ("falcon", "Falcon"),
     ("falcon3", "Falcon3"),
     ("falcon_h1", "FalconH1"),
@@ -124,6 +124,7 @@ MODEL_MAPPING_NAMES = OrderedDict(
     ("ernie4_5_moe", "Ernie4_5_MoeModel"),
     ("ernie_m", "ErnieMModel"),
     ("esm", "EsmModel"),
+    ("evolla", "EvollaModel"),
     ("falcon", "FalconModel"),
     ("falcon_h1", "FalconH1Model"),
     ("falcon_mamba", "FalconMambaModel"),
@@ -402,6 +403,7 @@ MODEL_FOR_PRETRAINING_MAPPING_NAMES = OrderedDict(
     ("distilbert", "DistilBertForMaskedLM"),
     ("electra", "ElectraForPreTraining"),
     ("ernie", "ErnieForPreTraining"),
+    ("evolla", "EvollaForProteinText2Text"),
     ("falcon_mamba", "FalconMambaForCausalLM"),
     ("flaubert", "FlaubertWithLMHeadModel"),
     ("flava", "FlavaForPreTraining"),
@@ -934,6 +936,7 @@ MODEL_FOR_IMAGE_TEXT_TO_TEXT_MAPPING_NAMES = OrderedDict(
     ("blip-2", "Blip2ForConditionalGeneration"),
     ("chameleon", "ChameleonForConditionalGeneration"),
     ("emu3", "Emu3ForConditionalGeneration"),
+    ("evolla", "EvollaForProteinText2Text"),
     ("fuyu", "FuyuForCausalLM"),
     ("gemma3", "Gemma3ForConditionalGeneration"),
     ("gemma3n", "Gemma3nForConditionalGeneration"),
@@ -64,6 +64,7 @@ PROCESSOR_MAPPING_NAMES = OrderedDict(
     ("colqwen2", "ColQwen2Processor"),
     ("dia", "DiaProcessor"),
     ("emu3", "Emu3Processor"),
+    ("evolla", "EvollaProcessor"),
     ("flava", "FlavaProcessor"),
     ("fuyu", "FuyuProcessor"),
     ("gemma3", "Gemma3Processor"),
28
src/transformers/models/evolla/__init__.py
Normal file
@@ -0,0 +1,28 @@
# Copyright 2025 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from typing import TYPE_CHECKING

from ...utils import _LazyModule
from ...utils.import_utils import define_import_structure


if TYPE_CHECKING:
    from .configuration_evolla import *
    from .modeling_evolla import *
    from .processing_evolla import *
else:
    import sys

    _file = globals()["__file__"]
    sys.modules[__name__] = _LazyModule(__name__, _file, define_import_structure(_file), module_spec=__spec__)
279
src/transformers/models/evolla/configuration_evolla.py
Normal file
@@ -0,0 +1,279 @@
# coding=utf-8
# Copyright 2025 Westlake Representational Learning Lab (Fajie Yuan Lab) team and the HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Evolla model configuration"""

from ...configuration_utils import PretrainedConfig
from ...modeling_rope_utils import rope_config_validation
from ...utils import logging


logger = logging.get_logger(__name__)


class SaProtConfig(PretrainedConfig):
    r"""This is the configuration class to store the configuration of an [`EvollaSaProtProteinEncoder`]. It is used to instantiate a
    SaProt model according to the specified arguments, defining the model architecture.

    Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
    documentation from [`PretrainedConfig`] for more information.

    Args:
        vocab_size (`int`, *optional*, defaults to 446):
            Vocabulary size of the protein sequence model. Defines the number of different tokens that can be represented
            by the `input_ids` passed when calling [`EvollaModel`].
        mask_token_id (`int`, *optional*, defaults to 4):
            The id of the *mask* token in the protein sequence model.
        pad_token_id (`int`, *optional*, defaults to 1):
            The id of the *padding* token in the protein sequence model.
        hidden_size (`int`, *optional*, defaults to 1280):
            Dimensionality of the protein sequence model layers and the pooler layer.
        num_hidden_layers (`int`, *optional*, defaults to 33):
            Number of hidden layers in the protein sequence model.
        num_attention_heads (`int`, *optional*, defaults to 20):
            Number of attention heads for each attention layer in the protein sequence model.
        intermediate_size (`int`, *optional*, defaults to 5120):
            Dimensionality of the intermediate layers in the protein sequence model.
        hidden_dropout_prob (`float`, *optional*, defaults to 0.1):
            The dropout ratio for the hidden layers in the protein sequence model.
        attention_probs_dropout_prob (`float`, *optional*, defaults to 0.1):
            The dropout ratio for the attention probabilities in the protein sequence model.
        max_position_embeddings (`int`, *optional*, defaults to 1026):
            The maximum sequence length that the protein sequence model might ever be used with. Typically set this to
            something large just in case (e.g., 512 or 1024 or 2048).
        layer_norm_eps (`float`, *optional*, defaults to 1e-05):
            The epsilon value for the layer normalization layer in the protein sequence model.
        position_embedding_type (`str`, *optional*, defaults to `"rotary"`):
            The type of position embedding to use in the protein sequence model. Currently only `"rotary"` is supported.
        emb_layer_norm_before (`bool`, *optional*, defaults to `False`):
            Whether to apply layer normalization before the position embedding in the protein sequence model.
        token_dropout (`bool`, *optional*, defaults to `True`):
            Whether to apply dropout to the tokens in the protein sequence model."""

    def __init__(
        self,
        vocab_size=446,
        mask_token_id=4,
        pad_token_id=1,
        hidden_size=1280,
        num_hidden_layers=33,
        num_attention_heads=20,
        intermediate_size=5120,
        hidden_dropout_prob=0.1,
        attention_probs_dropout_prob=0.1,
        max_position_embeddings=1026,
        initializer_range=0.02,
        layer_norm_eps=1e-05,
        position_embedding_type="rotary",
        use_cache=True,
        emb_layer_norm_before=False,
        token_dropout=True,
        **kwargs,
    ):
        super().__init__(pad_token_id=pad_token_id, mask_token_id=mask_token_id, **kwargs)

        self.vocab_size = vocab_size
        self.hidden_size = hidden_size
        self.num_hidden_layers = num_hidden_layers
        self.num_attention_heads = num_attention_heads
        self.intermediate_size = intermediate_size
        self.hidden_dropout_prob = hidden_dropout_prob
        self.attention_probs_dropout_prob = attention_probs_dropout_prob
        self.max_position_embeddings = max_position_embeddings
        self.initializer_range = initializer_range
        self.layer_norm_eps = layer_norm_eps
        self.position_embedding_type = position_embedding_type
        self.use_cache = use_cache
        self.emb_layer_norm_before = emb_layer_norm_before
        self.token_dropout = token_dropout


class EvollaConfig(PretrainedConfig):
    r"""
    This is the configuration class to store the configuration of a [`EvollaModel`]. It is used to instantiate an
    Evolla model according to the specified arguments, defining the model architecture. Instantiating a configuration
    with the defaults will yield a similar configuration to that of the Evolla-10B.

    e.g. [westlake-repl/Evolla-10B-hf](https://huggingface.co/westlake-repl/Evolla-10B-hf)

    Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
    documentation from [`PretrainedConfig`] for more information.

    Args:
        protein_encoder_config (`dict`, *optional*):
            Dictionary of configuration options used to initialize [`SaProtConfig`].
        vocab_size (`int`, *optional*, defaults to 128256):
            Vocabulary size of the Evolla llama model. Defines the number of different tokens that can be represented by the
            `input_ids` passed when calling [`EvollaModel`].
        hidden_size (`int`, *optional*, defaults to 4096):
            Dimensionality of the llama layers and the pooler layer.
        intermediate_size (`int`, *optional*, defaults to 14336):
            Dimensionality of the intermediate layers in the llama model.
        num_hidden_layers (`int`, *optional*, defaults to 32):
            Number of hidden layers in the llama model.
        num_attention_heads (`int`, *optional*, defaults to 32):
            Number of attention heads for each attention layer in the llama model.
        num_key_value_heads (`int`, *optional*, defaults to 8):
            Number of key-value heads for each attention layer in the llama model.
        hidden_act (`str` or `function`, *optional*, defaults to `"silu"`):
            The non-linear activation function (function or string) in the llama model. If string, `"gelu"`, `"relu"`,
            `"selu"` and `"silu"` are supported.
        max_position_embeddings (`int`, *optional*, defaults to 8192):
            The maximum sequence length that this model might ever be used with. Typically set this to something large
            just in case (e.g., 512 or 1024 or 2048).
        rms_norm_eps (`float`, *optional*, defaults to 1e-05):
            The epsilon value for the RMS-norm layer in the llama model.
        rope_theta (`float`, *optional*, defaults to 500000.0):
            The base period of the RoPE embeddings in the llama model.
        rope_scaling (`dict`, *optional*):
            Dictionary containing the scaling configuration for the RoPE embeddings in the llama model. A legacy
            `type` key, if present, is copied to `rope_type`.
        attention_bias (`bool`, *optional*, defaults to `False`):
            Whether to use bias in the attention layer.
        attention_dropout (`float`, *optional*, defaults to 0.0):
            The dropout ratio for the attention layer.
        mlp_bias (`bool`, *optional*, defaults to `False`):
            Whether to use bias in the MLP layer.
        aligner_ffn_mult (`int`, *optional*, defaults to 4):
            The FFN multiplier for the aligner layer.
        aligner_enable_bias (`bool`, *optional*, defaults to `True`):
            Whether to use bias in the aligner layer.
        aligner_attention_probs_dropout_prob (`float`, *optional*, defaults to 0.1):
            The dropout ratio for the attention probabilities in the aligner layer.
        aligner_num_add_layers (`int`, *optional*, defaults to 8):
            The number of additional layers for the aligner layer.
        resampler_depth (`int`, *optional*, defaults to 6):
            The depth of the resampler layer in the llama model.
        resampler_dim_head (`int`, *optional*, defaults to 64):
            The dimension of the heads in the resampler layer in the llama model.
        resampler_heads (`int`, *optional*, defaults to 8):
            The number of heads in the resampler layer in the llama model.
        resampler_num_latents (`int`, *optional*, defaults to 64):
            The number of latents in the resampler layer in the llama model.
        resampler_ff_mult (`int`, *optional*, defaults to 4):
            The FFN multiplier for the resampler layer.
        initializer_range (`float`, *optional*, defaults to 0.02):
            The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
        pad_token_id (`int`, *optional*):
            The id of the *padding* token.
        bos_token_id (`int`, *optional*, defaults to 128000):
            The id of the *beginning-of-sequence* token.
        eos_token_id (`int`, *optional*, defaults to 128009):
            The id of the *end-of-sequence* token.
        use_cache (`bool`, *optional*, defaults to `False`):
            Whether or not the model should return the last key/values attentions (not used by all models).
        tie_word_embeddings (`bool`, *optional*, defaults to `False`):
            Whether or not to tie the input and output word embeddings.

    Example:

    ```python
    >>> from transformers import EvollaModel, EvollaConfig

    >>> # Initializing an Evolla evolla-10b style configuration
    >>> configuration = EvollaConfig()

    >>> # Initializing a model from the evolla-10b style configuration
    >>> model = EvollaModel(configuration)

    >>> # Accessing the model configuration
    >>> configuration = model.config
    ```"""

    model_type = "EvollaModel"
    sub_configs = {"protein_encoder_config": SaProtConfig}

    def __init__(
        self,
        protein_encoder_config=None,
        vocab_size=128256,  # llama vocab size
        hidden_size=4096,  # llama hidden size
        intermediate_size=14336,  # llama intermediate size
        num_hidden_layers=32,  # llama num layers
        num_attention_heads=32,  # llama num heads
        num_key_value_heads=8,  # llama num key-value heads
        hidden_act="silu",  # llama activation function
        max_position_embeddings=8192,  # llama rope max length
        rms_norm_eps=1e-05,
        rope_theta=500000.0,
        rope_scaling=None,
        attention_bias=False,
        attention_dropout=0.0,
        mlp_bias=False,
        aligner_ffn_mult=4,
        aligner_enable_bias=True,
        aligner_attention_probs_dropout_prob=0.1,
        aligner_num_add_layers=8,
        resampler_depth=6,
        resampler_dim_head=64,
        resampler_heads=8,
        resampler_num_latents=64,
        resampler_ff_mult=4,
        initializer_range=0.02,
        pad_token_id=None,
        bos_token_id=128000,
        eos_token_id=128009,
        use_cache=False,
        tie_word_embeddings=False,
        **kwargs,
    ):
        self.vocab_size = vocab_size
        self.hidden_size = hidden_size
        self.intermediate_size = intermediate_size
        self.num_hidden_layers = num_hidden_layers
        self.num_attention_heads = num_attention_heads
        self.num_key_value_heads = num_key_value_heads
        self.hidden_act = hidden_act
        self.max_position_embeddings = max_position_embeddings
        self.rms_norm_eps = rms_norm_eps
        self.tie_word_embeddings = tie_word_embeddings
        self.attention_bias = attention_bias
        self.attention_dropout = attention_dropout
        self.mlp_bias = mlp_bias
        self.aligner_ffn_mult = aligner_ffn_mult
        self.aligner_enable_bias = aligner_enable_bias
        self.aligner_attention_probs_dropout_prob = aligner_attention_probs_dropout_prob
        self.aligner_num_add_layers = aligner_num_add_layers
        self.use_cache = use_cache
        self.initializer_range = initializer_range

        self.resampler_depth = resampler_depth
        self.resampler_dim_head = resampler_dim_head
        self.resampler_heads = resampler_heads
        self.resampler_num_latents = resampler_num_latents
        self.resampler_ff_mult = resampler_ff_mult

        self.rope_theta = rope_theta
        self.rope_scaling = rope_scaling
        # Validate the correctness of rotary position embeddings parameters
        # BC: if there is a 'type' field, copy it to 'rope_type'.
        if self.rope_scaling is not None and "type" in self.rope_scaling:
            self.rope_scaling["rope_type"] = self.rope_scaling["type"]
        rope_config_validation(self)

        # Subconfig
        if protein_encoder_config is None:
            protein_encoder_config = {}
            logger.info("`protein_encoder_config` is `None`. Initializing the `SaProtConfig` with default values.")
        self.protein_encoder_config = SaProtConfig(**protein_encoder_config)

        super().__init__(
            pad_token_id=pad_token_id,
            bos_token_id=bos_token_id,
            eos_token_id=eos_token_id,
            tie_word_embeddings=tie_word_embeddings,
            **kwargs,
        )


__all__ = ["EvollaConfig"]
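
The backward-compatibility step in `EvollaConfig.__init__` (a legacy `type` key in `rope_scaling` is copied to `rope_type` before validation) can be illustrated in isolation. This is a minimal sketch, not the library's code: `normalize_rope_scaling` is a hypothetical helper, and unlike the config, which updates the dictionary in place, it returns a copy.

```python
def normalize_rope_scaling(rope_scaling):
    """Mirror a legacy `type` key to `rope_type` (hypothetical helper).

    Returns a new dict so the caller's input is left untouched; the
    config class in this diff mutates `self.rope_scaling` instead.
    """
    if rope_scaling is None:
        return None
    normalized = dict(rope_scaling)
    if "type" in normalized:
        normalized["rope_type"] = normalized["type"]
    return normalized


legacy = {"type": "linear", "factor": 2.0}
print(normalize_rope_scaling(legacy))
```

Dictionaries that already use `rope_type` pass through unchanged, and `None` stays `None`, which matches the `rope_scaling=None` default in the signature.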
1761
src/transformers/models/evolla/modeling_evolla.py
Normal file
File diff suppressed because it is too large
1008
src/transformers/models/evolla/modular_evolla.py
Normal file
File diff suppressed because it is too large
247
src/transformers/models/evolla/processing_evolla.py
Normal file
@@ -0,0 +1,247 @@
# coding=utf-8
# Copyright 2025 The HuggingFace Inc. team.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Processor class for EVOLLA.
"""

import os
from typing import Optional, Union

from ...feature_extraction_utils import BatchFeature
from ...processing_utils import (
    ProcessorMixin,
)
from ..auto import AutoTokenizer


PROTEIN_VALID_KEYS = ["aa_seq", "foldseek", "msa"]


|
class EvollaProcessor(ProcessorMixin):
|
||||||
|
r"""
|
||||||
|
Constructs an EVOLLA processor which wraps a Llama tokenizer and a SaProt tokenizer (`EsmTokenizer`) into a single processor.
|
||||||
|
|
||||||
|
[`EvollaProcessor`] offers all the functionalities of [`EsmTokenizer`] and [`LlamaTokenizerFast`]. See the
|
||||||
|
docstring of [`~EvollaProcessor.__call__`] and [`~EvollaProcessor.decode`] for more information.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
protein_tokenizer (`EsmTokenizer`):
|
||||||
|
An instance of [`EsmTokenizer`]. The protein tokenizer is a required input.
|
||||||
|
tokenizer (`LlamaTokenizerFast`, *optional*):
|
||||||
|
An instance of [`LlamaTokenizerFast`]. Although the argument defaults to `None`, a tokenizer must be provided; a `ValueError` is raised otherwise.
|
||||||
|
protein_max_length (`int`, *optional*, defaults to 1024):
|
||||||
|
The maximum length of the tokenized protein sequence; longer sequences are truncated.
|
||||||
|
text_max_length (`int`, *optional*, defaults to 512):
|
||||||
|
The maximum length of the tokenized text prompt; longer prompts are truncated.
|
||||||
|
"""
|
||||||
|
|
||||||
|
attributes = ["protein_tokenizer", "tokenizer"]
|
||||||
|
valid_kwargs = ["sequence_max_length"]
|
||||||
|
# protein_tokenizer_class = "EsmTokenizer"
|
||||||
|
# tokenizer_class = "LlamaTokenizerFast"
|
||||||
|
protein_tokenizer_class = "AutoTokenizer"
|
||||||
|
tokenizer_class = "AutoTokenizer"
|
||||||
|
protein_tokenizer_dir_name = "protein_tokenizer"
|
||||||
|
# tokenizer_dir_name = "text_tokenizer"
|
||||||
|
|
||||||
|
def __init__(self, protein_tokenizer, tokenizer=None, protein_max_length=1024, text_max_length=512, **kwargs):
|
||||||
|
if protein_tokenizer is None:
|
||||||
|
raise ValueError("You need to specify a `protein_tokenizer`.")
|
||||||
|
if tokenizer is None:
|
||||||
|
raise ValueError("You need to specify a `tokenizer`.")
|
||||||
|
|
||||||
|
super().__init__(protein_tokenizer, tokenizer)
|
||||||
|
|
||||||
|
self.tokenizer.pad_token = "<|reserved_special_token_0|>"
|
||||||
|
self.protein_max_length = protein_max_length
|
||||||
|
self.text_max_length = text_max_length
|
||||||
|
|
||||||
|
def process_proteins(self, proteins, protein_max_length=1024):
|
||||||
|
sa_sequences = []
|
||||||
|
for protein in proteins:
|
||||||
|
aa_seq = protein.get("aa_seq")
|
||||||
|
foldseek = protein.get("foldseek")
|
||||||
|
sa_sequence = "".join([s.upper() + f.lower() for s, f in zip(aa_seq, foldseek)])
|
||||||
|
sa_sequences.append(sa_sequence)
|
||||||
|
|
||||||
|
sa_tokens = self.protein_tokenizer.batch_encode_plus(
|
||||||
|
sa_sequences, return_tensors="pt", truncation=True, max_length=protein_max_length, padding=True
|
||||||
|
)
|
||||||
|
return sa_tokens
|
||||||
|
|
||||||
|
def process_text(
|
||||||
|
self,
|
||||||
|
texts,
|
||||||
|
text_max_length: int = 512,
|
||||||
|
):
|
||||||
|
prompts = []
|
||||||
|
for messages in texts:
|
||||||
|
prompt = self.tokenizer.apply_chat_template(
|
||||||
|
messages,
|
||||||
|
tokenize=False,
|
||||||
|
add_generation_prompt=True,
|
||||||
|
)
|
||||||
|
prompts.append(prompt)
|
||||||
|
|
||||||
|
prompt_inputs = self.tokenizer(
|
||||||
|
prompts,
|
||||||
|
add_special_tokens=False,
|
||||||
|
return_tensors="pt",
|
||||||
|
padding="longest",
|
||||||
|
truncation=True,
|
||||||
|
max_length=text_max_length,
|
||||||
|
)
|
||||||
|
return prompt_inputs
|
||||||
|
|
||||||
|
def __call__(
|
||||||
|
self,
|
||||||
|
proteins: Optional[Union[list[dict], dict]] = None,
|
||||||
|
messages_list: Optional[Union[list[list[dict]], list[dict]]] = None,
|
||||||
|
protein_max_length: Optional[int] = None,
|
||||||
|
text_max_length: Optional[int] = None,
|
||||||
|
**kwargs,
|
||||||
|
):
|
||||||
|
r"""This method takes batched or non-batched proteins and messages_list and converts them into a format that can be used by
|
||||||
|
the model.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
proteins (`Union[List[dict], dict]`):
|
||||||
|
A list of dictionaries or a single dictionary containing the following keys:
|
||||||
|
- `"aa_seq"` (`str`) -- The amino acid sequence of the protein.
|
||||||
|
- `"foldseek"` (`str`) -- The foldseek string of the protein.
|
||||||
|
messages_list (`Union[List[List[dict]], List[dict]]`):
|
||||||
|
A list of lists of dictionaries or a list of dictionaries containing the following keys:
|
||||||
|
- `"role"` (`str`) -- The role of the message.
|
||||||
|
- `"content"` (`str`) -- The content of the message.
|
||||||
|
protein_max_length (`int`, *optional*, defaults to 1024):
|
||||||
|
The maximum length of the tokenized protein sequence; longer sequences are truncated.
|
||||||
|
text_max_length (`int`, *optional*, defaults to 512):
|
||||||
|
The maximum length of the text.
|
||||||
|
|
||||||
|
Return:
|
||||||
|
A [`BatchFeature`] with the following keys:
|
||||||
|
- `protein_input_ids` (`torch.Tensor` of shape `(batch_size, sequence_length)`) -- The input IDs for the protein sequence.
|
||||||
|
- `protein_attention_mask` (`torch.Tensor` of shape `(batch_size, sequence_length)`) -- The attention mask for the protein sequence.
|
||||||
|
- `input_ids` (`torch.Tensor` of shape `(batch_size, sequence_length)`) -- The input IDs for the text sequence.
|
||||||
|
- `attention_mask` (`torch.Tensor` of shape `(batch_size, sequence_length)`) -- The attention mask for the text sequence.
|
||||||
|
"""
|
||||||
|
# proteins and messages_list should be provided
|
||||||
|
if proteins is None or messages_list is None:
|
||||||
|
raise ValueError("You need to specify `messages_list` and `proteins`.")
|
||||||
|
|
||||||
|
protein_max_length = protein_max_length if protein_max_length is not None else self.protein_max_length
|
||||||
|
text_max_length = text_max_length if text_max_length is not None else self.text_max_length
|
||||||
|
|
||||||
|
# proteins should be List[dict]
|
||||||
|
if isinstance(proteins, dict):
|
||||||
|
proteins = [proteins]
|
||||||
|
# messages_list should be List[List[dict]]
|
||||||
|
if isinstance(messages_list, (list, tuple)) and not isinstance(messages_list[0], (list, tuple)):
|
||||||
|
messages_list = [messages_list]
|
||||||
|
# Check if batched proteins are in the correct format
|
||||||
|
if isinstance(proteins, (list, tuple)) and not all(isinstance(p, dict) for p in proteins):
|
||||||
|
raise ValueError("The proteins should be a list of dictionaries, but not all elements are dictionaries.")
|
||||||
|
if isinstance(proteins, (list, tuple)) and not all(
|
||||||
|
all(k in PROTEIN_VALID_KEYS for k in p.keys()) for p in proteins
|
||||||
|
):
|
||||||
|
raise ValueError(
"There should be a list of dictionaries with keys: "
f"{', '.join(PROTEIN_VALID_KEYS)} for each protein. "
|
||||||
|
f"But got: {proteins}"
|
||||||
|
)
|
||||||
|
# Check if batched messages_list is in the correct format
|
||||||
|
if isinstance(messages_list, (list, tuple)):
|
||||||
|
for messages in messages_list:
|
||||||
|
if not isinstance(messages, (list, tuple)):
|
||||||
|
raise ValueError(f"Each element of messages_list should be a list, but got {type(messages)}.")
|
||||||
|
if not all(isinstance(m, dict) for m in messages):
|
||||||
|
raise ValueError(
|
||||||
|
"Each message in messages_list should be a list of dictionaries, but not all elements are dictionaries."
|
||||||
|
)
|
||||||
|
if any(len(m.keys()) != 2 for m in messages) or any(
|
||||||
|
set(m.keys()) != {"role", "content"} for m in messages
|
||||||
|
):
|
||||||
|
raise ValueError(
"Each message in messages_list should be a list of dictionaries with two keys: 'role' and 'content'. "
|
||||||
|
f"But got: {messages}"
|
||||||
|
)
|
||||||
|
else:
|
||||||
|
raise ValueError(
|
||||||
|
f"The messages_list should be a list of lists of dictionaries, but it's {type(messages_list)}."
|
||||||
|
)
|
||||||
|
sa_tokens = self.process_proteins(proteins, protein_max_length)
|
||||||
|
|
||||||
|
text_tokens = self.process_text(messages_list, text_max_length)
|
||||||
|
|
||||||
|
return BatchFeature(
|
||||||
|
data={
|
||||||
|
"protein_input_ids": sa_tokens["input_ids"],
|
||||||
|
"protein_attention_mask": sa_tokens["attention_mask"],
|
||||||
|
"input_ids": text_tokens["input_ids"],
|
||||||
|
"attention_mask": text_tokens["attention_mask"],
|
||||||
|
}
|
||||||
|
)
|
||||||
|
|
||||||
|
def batch_decode(self, *args, **kwargs):
|
||||||
|
return self.tokenizer.batch_decode(*args, **kwargs)
|
||||||
|
|
||||||
|
def decode(self, *args, **kwargs):
|
||||||
|
return self.tokenizer.decode(*args, **kwargs)
|
||||||
|
|
||||||
|
def protein_batch_decode(self, *args, **kwargs):
|
||||||
|
return self.protein_tokenizer.batch_decode(*args, **kwargs)
|
||||||
|
|
||||||
|
def protein_decode(self, *args, **kwargs):
|
||||||
|
return self.protein_tokenizer.decode(*args, **kwargs)
|
||||||
|
|
||||||
|
# overwrite to save the protein tokenizer in a separate folder
|
||||||
|
# Adapted from instructblip.processing_instructblip.py (https://github.com/huggingface/transformers/blob/9b479a245b793cac2a8b2e87c6d8e81bb24e20c4/src/transformers/models/instructblip/processing_instructblip.py#L191-L221)
|
||||||
|
def save_pretrained(self, save_directory, **kwargs):
|
||||||
|
# only save the protein tokenizer in sub_dir
|
||||||
|
self.protein_tokenizer.save_pretrained(os.path.join(save_directory, self.protein_tokenizer_dir_name))
|
||||||
|
|
||||||
|
# we modify the attributes so that only the text tokenizer is saved in the main folder
|
||||||
|
protein_tokenizer_present = "protein_tokenizer" in self.attributes
|
||||||
|
# find the correct position of it in the attributes list
|
||||||
|
protein_tokenizer_index = self.attributes.index("protein_tokenizer") if protein_tokenizer_present else None
|
||||||
|
if protein_tokenizer_present and protein_tokenizer_index is not None:
|
||||||
|
self.attributes.remove("protein_tokenizer")
|
||||||
|
|
||||||
|
outputs = super().save_pretrained(save_directory, **kwargs)
|
||||||
|
|
||||||
|
if protein_tokenizer_present and protein_tokenizer_index is not None:
|
||||||
|
self.attributes.insert(protein_tokenizer_index, "protein_tokenizer")
|
||||||
|
|
||||||
|
return outputs
|
||||||
|
|
||||||
|
# overwrite to load the protein tokenizer from a separate folder
|
||||||
|
# Adapted from instructblip.processing_instructblip.py (https://github.com/huggingface/transformers/blob/9b479a245b793cac2a8b2e87c6d8e81bb24e20c4/src/transformers/models/instructblip/processing_instructblip.py#L191-L221)
|
||||||
|
@classmethod
|
||||||
|
def from_pretrained(cls, pretrained_model_name_or_path, **kwargs):
|
||||||
|
processor = super().from_pretrained(pretrained_model_name_or_path, **kwargs)
|
||||||
|
|
||||||
|
# if return_unused_kwargs a tuple is returned where the second element is 'unused_kwargs'
|
||||||
|
if isinstance(processor, tuple):
|
||||||
|
processor = processor[0]
|
||||||
|
protein_tokenizer = AutoTokenizer.from_pretrained(
|
||||||
|
pretrained_model_name_or_path, subfolder=cls.protein_tokenizer_dir_name
|
||||||
|
)
|
||||||
|
|
||||||
|
processor.protein_tokenizer = protein_tokenizer
|
||||||
|
|
||||||
|
return processor
|
||||||
|
|
||||||
|
|
||||||
|
__all__ = ["EvollaProcessor"]
|
||||||
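`process_proteins` in the file above builds a structure-aware (SA) sequence by pairing each amino-acid letter (upper-cased) with the foldseek letter at the same position (lower-cased). The sketch below reproduces only that string step, without the tokenizer. `build_sa_sequence` is a hypothetical helper name, and the explicit length check is an added assumption: the original relies on `zip`, which silently stops at the shorter input.

```python
def build_sa_sequence(aa_seq: str, foldseek: str) -> str:
    """Interleave amino-acid and foldseek letters into one SA string."""
    # Assumption (not in the original code): reject mismatched lengths
    # instead of letting zip() truncate to the shorter sequence.
    if len(aa_seq) != len(foldseek):
        raise ValueError(
            f"aa_seq and foldseek must have equal length, got {len(aa_seq)} and {len(foldseek)}."
        )
    # Residue letter in upper case, structure letter in lower case.
    return "".join(a.upper() + f.lower() for a, f in zip(aa_seq, foldseek))


sa = build_sa_sequence("MLLE", "dvvv")
print(sa)  # MdLvLvEv
```

Each residue thus becomes a two-character token candidate. The mask character `#` has no upper or lower case, so it passes through unchanged.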
0
tests/models/evolla/__init__.py
Normal file
397
tests/models/evolla/test_modeling_evolla.py
Normal file
@@ -0,0 +1,397 @@
|
|||||||
|
# coding=utf-8
|
||||||
|
# Copyright 2025 The HuggingFace Inc. team. All rights reserved.
|
||||||
|
#
|
||||||
|
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||||
|
# you may not use this file except in compliance with the License.
|
||||||
|
# You may obtain a copy of the License at
|
||||||
|
#
|
||||||
|
# http://www.apache.org/licenses/LICENSE-2.0
|
||||||
|
#
|
||||||
|
# Unless required by applicable law or agreed to in writing, software
|
||||||
|
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||||
|
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||||
|
# See the License for the specific language governing permissions and
|
||||||
|
# limitations under the License.
|
||||||
|
"""Testing suite for the PyTorch Evolla model."""
|
||||||
|
|
||||||
|
import unittest
|
||||||
|
|
||||||
|
from parameterized import parameterized
|
||||||
|
|
||||||
|
from transformers import BitsAndBytesConfig, EvollaConfig, is_torch_available
|
||||||
|
from transformers.testing_utils import (
|
||||||
|
TestCasePlus,
|
||||||
|
require_bitsandbytes,
|
||||||
|
require_torch,
|
||||||
|
require_torch_sdpa,
|
||||||
|
slow,
|
||||||
|
torch_device,
|
||||||
|
)
|
||||||
|
from transformers.utils import (
|
||||||
|
cached_property,
|
||||||
|
)
|
||||||
|
|
||||||
|
from ...test_configuration_common import ConfigTester
|
||||||
|
from ...test_modeling_common import (
|
||||||
|
TEST_EAGER_MATCHES_SDPA_INFERENCE_PARAMETERIZATION,
|
||||||
|
ModelTesterMixin,
|
||||||
|
_config_zero_init,
|
||||||
|
ids_tensor,
|
||||||
|
random_attention_mask,
|
||||||
|
)
|
||||||
|
from ...test_pipeline_mixin import PipelineTesterMixin
|
||||||
|
|
||||||
|
|
||||||
|
if is_torch_available():
|
||||||
|
import torch
|
||||||
|
|
||||||
|
from transformers import EvollaForProteinText2Text, EvollaModel, EvollaProcessor
|
||||||
|
|
||||||
|
|
||||||
|
class EvollaModelTester:
|
||||||
|
def __init__(
|
||||||
|
self,
|
||||||
|
parent,
|
||||||
|
batch_size=1,
|
||||||
|
is_training=False,
|
||||||
|
text_seq_length=20,
|
||||||
|
text_vocab_size=100,
|
||||||
|
protein_seq_length=10,
|
||||||
|
protein_vocab_size=20,
|
||||||
|
hidden_size=4, # llama hidden size
|
||||||
|
intermediate_size=7, # llama intermediate size
|
||||||
|
num_hidden_layers=1, # llama hidden layers
|
||||||
|
num_attention_heads=2, # llama attention heads
|
||||||
|
num_key_value_heads=2, # llama key value heads
|
||||||
|
protein_hidden_size=8, # protein encoder hidden size
|
||||||
|
protein_num_hidden_layers=1, # protein encoder hidden layers
|
||||||
|
protein_num_attention_heads=4, # protein encoder attention heads
|
||||||
|
protein_intermediate_size=11, # protein encoder intermediate size
|
||||||
|
resampler_num_latents=7, # sequence compressor num latents
|
||||||
|
resampler_ff_mult=1, # sequence compressor ff mult
|
||||||
|
resampler_depth=2, # sequence compressor depth
|
||||||
|
resampler_dim_head=4, # sequence compressor dim head
|
||||||
|
resampler_heads=2, # sequence compressor heads
|
||||||
|
aligner_num_add_layers=1, # sequence aligner num add layers
|
||||||
|
aligner_ffn_mult=1, # sequence aligner ffn mult
|
||||||
|
use_input_mask=True,
|
||||||
|
):
|
||||||
|
self.parent = parent
|
||||||
|
self.batch_size = batch_size
|
||||||
|
self.protein_seq_length = protein_seq_length
|
||||||
|
self.protein_vocab_size = protein_vocab_size
|
||||||
|
self.text_seq_length = text_seq_length
|
||||||
|
self.text_vocab_size = text_vocab_size
|
||||||
|
self.seq_length = text_seq_length
|
||||||
|
|
||||||
|
self.hidden_size = hidden_size
|
||||||
|
self.intermediate_size = intermediate_size
|
||||||
|
self.num_hidden_layers = num_hidden_layers
|
||||||
|
self.num_attention_heads = num_attention_heads
|
||||||
|
self.num_key_value_heads = num_key_value_heads
|
||||||
|
self.protein_hidden_size = protein_hidden_size
|
||||||
|
self.protein_num_hidden_layers = protein_num_hidden_layers
|
||||||
|
self.protein_num_attention_heads = protein_num_attention_heads
|
||||||
|
self.protein_intermediate_size = protein_intermediate_size
|
||||||
|
|
||||||
|
self.resampler_num_latents = resampler_num_latents
|
||||||
|
self.resampler_ff_mult = resampler_ff_mult
|
||||||
|
self.resampler_depth = resampler_depth
|
||||||
|
self.resampler_dim_head = resampler_dim_head
|
||||||
|
self.resampler_heads = resampler_heads
|
||||||
|
|
||||||
|
self.aligner_num_add_layers = aligner_num_add_layers
|
||||||
|
self.aligner_ffn_mult = aligner_ffn_mult
|
||||||
|
|
||||||
|
self.use_input_mask = use_input_mask
|
||||||
|
self.is_training = is_training
|
||||||
|
|
||||||
|
@property
|
||||||
|
def is_encoder_decoder(self):
|
||||||
|
return False
|
||||||
|
|
||||||
|
def prepare_config_and_inputs(self, num_proteins=None):
|
||||||
|
batch_size = num_proteins if num_proteins is not None else self.batch_size
|
||||||
|
text_input_ids = ids_tensor([batch_size, self.text_seq_length], self.text_vocab_size)
|
||||||
|
|
||||||
|
protein_input_ids = ids_tensor([batch_size, self.protein_seq_length], self.protein_vocab_size)
|
||||||
|
|
||||||
|
if self.use_input_mask:
|
||||||
|
text_input_mask = random_attention_mask([batch_size, self.text_seq_length])
|
||||||
|
protein_input_mask = random_attention_mask([batch_size, self.protein_seq_length])
|
||||||
|
|
||||||
|
config = self.get_config()
|
||||||
|
return (config, text_input_ids, text_input_mask, protein_input_ids, protein_input_mask)
|
||||||
|
|
||||||
|
def get_config(self):
|
||||||
|
return EvollaConfig(
|
||||||
|
protein_encoder_config={
|
||||||
|
"vocab_size": self.protein_vocab_size,
|
||||||
|
"hidden_size": self.protein_hidden_size,
|
||||||
|
"num_hidden_layers": self.protein_num_hidden_layers,
|
||||||
|
"num_attention_heads": self.protein_num_attention_heads,
|
||||||
|
"intermediate_size": self.protein_intermediate_size,
|
||||||
|
},
|
||||||
|
vocab_size=self.text_vocab_size,
|
||||||
|
hidden_size=self.hidden_size,
|
||||||
|
intermediate_size=self.intermediate_size,
|
||||||
|
num_hidden_layers=self.num_hidden_layers,
|
||||||
|
num_attention_heads=self.num_attention_heads,
|
||||||
|
num_key_value_heads=self.num_key_value_heads,
|
||||||
|
aligner_ffn_mult=self.aligner_ffn_mult,
|
||||||
|
aligner_num_add_layers=self.aligner_num_add_layers,
|
||||||
|
resampler_depth=self.resampler_depth,
|
||||||
|
resampler_dim_head=self.resampler_dim_head,
|
||||||
|
resampler_heads=self.resampler_heads,
|
||||||
|
resampler_num_latents=self.resampler_num_latents,
|
||||||
|
resampler_ff_mult=self.resampler_ff_mult,
|
||||||
|
)
|
||||||
|
|
||||||
|
def create_and_check_model(
|
||||||
|
self,
|
||||||
|
config,
|
||||||
|
input_ids,
|
||||||
|
input_mask,
|
||||||
|
protein_input_ids,
|
||||||
|
protein_input_mask,
|
||||||
|
batch_size=None,
|
||||||
|
):
|
||||||
|
batch_size = batch_size if batch_size is not None else self.batch_size
|
||||||
|
model = EvollaModel(config=config)
|
||||||
|
model.to(torch_device)
|
||||||
|
model.eval()
|
||||||
|
result = model(
|
||||||
|
input_ids,
|
||||||
|
attention_mask=input_mask,
|
||||||
|
protein_input_ids=protein_input_ids,
|
||||||
|
protein_attention_mask=protein_input_mask,
|
||||||
|
)
|
||||||
|
self.parent.assertEqual(result.last_hidden_state.shape, (batch_size, input_ids.shape[1], self.hidden_size))
|
||||||
|
|
||||||
|
def create_and_check_model_gen(
|
||||||
|
self,
|
||||||
|
config,
|
||||||
|
input_ids,
|
||||||
|
input_mask,
|
||||||
|
protein_input_ids,
|
||||||
|
protein_input_mask,
|
||||||
|
):
|
||||||
|
model = EvollaForProteinText2Text(config)
|
||||||
|
model.to(torch_device)
|
||||||
|
model.eval()
|
||||||
|
model.generate(
|
||||||
|
input_ids,
|
||||||
|
attention_mask=input_mask,
|
||||||
|
protein_input_ids=protein_input_ids,
|
||||||
|
protein_attention_mask=protein_input_mask,
|
||||||
|
max_length=self.seq_length + 2,
|
||||||
|
)
|
||||||
|
|
||||||
|
def prepare_config_and_inputs_for_common(self):
|
||||||
|
config_and_inputs = self.prepare_config_and_inputs()
|
||||||
|
(config, text_input_ids, text_input_mask, protein_input_ids, protein_input_mask) = config_and_inputs
|
||||||
|
inputs_dict = {
|
||||||
|
"input_ids": text_input_ids,
|
||||||
|
"attention_mask": text_input_mask,
|
||||||
|
"protein_input_ids": protein_input_ids,
|
||||||
|
"protein_attention_mask": protein_input_mask,
|
||||||
|
}
|
||||||
|
return config, inputs_dict
|
||||||
|
|
||||||
|
|
||||||
|
@require_torch
|
||||||
|
class EvollaModelTest(ModelTesterMixin, PipelineTesterMixin, unittest.TestCase):
|
||||||
|
all_model_classes = (EvollaModel, EvollaForProteinText2Text) if is_torch_available() else ()
|
||||||
|
pipeline_model_mapping = {"feature-extraction": EvollaModel} if is_torch_available() else {}
|
||||||
|
test_pruning = False
|
||||||
|
test_headmasking = False
|
||||||
|
test_torchscript = False
|
||||||
|
test_resize_embeddings = False
|
||||||
|
maxDiff = None
|
||||||
|
|
||||||
|
def setUp(self):
|
||||||
|
self.model_tester = EvollaModelTester(self)
|
||||||
|
self.config_tester = ConfigTester(self, config_class=EvollaConfig, hidden_size=37)
|
||||||
|
|
||||||
|
@property
|
||||||
|
def is_encoder_decoder(self):
|
||||||
|
return self.model_tester.is_encoder_decoder
|
||||||
|
|
||||||
|
def _prepare_for_class(self, inputs_dict, model_class, return_labels=False):
|
||||||
|
inputs_dict = super()._prepare_for_class(inputs_dict, model_class, return_labels=return_labels)
|
||||||
|
# XXX: EvollaForProteinText2Text has no MODEL_FOR group yet, but it should be the same
|
||||||
|
# as MODEL_FOR_CAUSAL_LM_MAPPING_NAMES, so for now manually changing to do the right thing
|
||||||
|
# as super won't do it
|
||||||
|
if return_labels:
|
||||||
|
inputs_dict["labels"] = torch.zeros(
|
||||||
|
(self.model_tester.batch_size, self.model_tester.seq_length), dtype=torch.long, device=torch_device
|
||||||
|
)
|
||||||
|
|
||||||
|
return inputs_dict
|
||||||
|
|
||||||
|
def test_model_outputs_equivalence(self):
|
||||||
|
try:
|
||||||
|
orig = self.all_model_classes
|
||||||
|
# EvollaModel.forward doesn't have labels input arg - only EvollaForProteinText2Text does
|
||||||
|
self.all_model_classes = (EvollaForProteinText2Text,) if is_torch_available() else ()
|
||||||
|
super().test_model_outputs_equivalence()
|
||||||
|
finally:
|
||||||
|
self.all_model_classes = orig
|
||||||
|
|
||||||
|
def test_config(self):
|
||||||
|
self.config_tester.run_common_tests()
|
||||||
|
|
||||||
|
def test_model_single_protein(self):
|
||||||
|
config_and_inputs = self.model_tester.prepare_config_and_inputs(num_proteins=1)
|
||||||
|
self.model_tester.create_and_check_model(*config_and_inputs, batch_size=1)
|
||||||
|
|
||||||
|
def test_model_multiple_proteins(self):
|
||||||
|
config_and_inputs = self.model_tester.prepare_config_and_inputs(num_proteins=2)
|
||||||
|
self.model_tester.create_and_check_model(*config_and_inputs, batch_size=2)
|
||||||
|
|
||||||
|
def test_generate_single_protein(self):
|
||||||
|
config_and_inputs = self.model_tester.prepare_config_and_inputs(num_proteins=1)
|
||||||
|
self.model_tester.create_and_check_model_gen(*config_and_inputs)
|
||||||
|
|
||||||
|
def test_generate_multiple_proteins(self):
|
||||||
|
config_and_inputs = self.model_tester.prepare_config_and_inputs(num_proteins=2)
|
||||||
|
self.model_tester.create_and_check_model_gen(*config_and_inputs)
|
||||||
|
|
||||||
|
def test_saprot_output(self):
|
||||||
|
config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()
|
||||||
|
config.return_dict = True
|
||||||
|
protein_informations = {
|
||||||
|
"input_ids": inputs_dict["protein_input_ids"],
|
||||||
|
"attention_mask": inputs_dict["protein_attention_mask"],
|
||||||
|
}
|
||||||
|
for model_class in self.all_model_classes:
|
||||||
|
if model_class is not EvollaModel:
|
||||||
|
continue
|
||||||
|
model = model_class(config)
|
||||||
|
model.to(torch_device)
|
||||||
|
model.eval()
|
||||||
|
protein_encoder_outputs = model.protein_encoder.model(**protein_informations, return_dict=True)
|
||||||
|
print(model_class, protein_encoder_outputs)
|
||||||
|
|
||||||
|
def test_protein_encoder_output(self):
|
||||||
|
config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()
|
||||||
|
config.return_dict = True
|
||||||
|
protein_informations = {
|
||||||
|
"input_ids": inputs_dict["protein_input_ids"],
|
||||||
|
"attention_mask": inputs_dict["protein_attention_mask"],
|
||||||
|
}
|
||||||
|
for model_class in self.all_model_classes:
|
||||||
|
if model_class is not EvollaModel:
|
||||||
|
continue
|
||||||
|
model = model_class(config)
|
||||||
|
model.to(torch_device)
|
||||||
|
model.eval()
|
||||||
|
protein_encoder_outputs = model.protein_encoder(**protein_informations, return_dict=True)
|
||||||
|
print(model_class, protein_encoder_outputs)
|
||||||
|
|
||||||
|
def test_single_forward(self):
|
||||||
|
config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()
|
||||||
|
config.return_dict = True
|
||||||
|
|
||||||
|
for model_class in self.all_model_classes:
|
||||||
|
inputs_dict["output_attentions"] = True
|
||||||
|
inputs_dict["output_hidden_states"] = False
|
||||||
|
config.return_dict = True
|
||||||
|
model = model_class(config)
|
||||||
|
model.to(torch_device)
|
||||||
|
model.eval()
|
||||||
|
with torch.no_grad():
|
||||||
|
outputs = model(**self._prepare_for_class(inputs_dict, model_class))
|
||||||
|
print(outputs)
|
||||||
|
|
||||||
|
def test_initialization(self):
|
||||||
|
# we skip the latents initialization test
|
||||||
|
config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()
|
||||||
|
|
||||||
|
configs_no_init = _config_zero_init(config)
|
||||||
|
for model_class in self.all_model_classes:
|
||||||
|
model = model_class(config=configs_no_init)
|
||||||
|
for name, param in model.named_parameters():
|
||||||
|
if param.requires_grad:
|
||||||
|
# skip latents
|
||||||
|
if name.endswith("latents"):
|
||||||
|
print(f"Skipping latents {name}")
|
||||||
|
continue
|
||||||
|
self.assertIn(
|
||||||
|
((param.data.mean() * 1e9).round() / 1e9).item(),
|
||||||
|
[0.0, 1.0],
|
||||||
|
msg=f"Parameter {name} of model {model_class} seems not properly initialized",
|
||||||
|
)
|
||||||
|
|
||||||
|
@parameterized.expand(TEST_EAGER_MATCHES_SDPA_INFERENCE_PARAMETERIZATION)
|
||||||
|
@require_torch_sdpa
|
||||||
|
@unittest.skip("Evolla requires both text and protein inputs which is currently not done in this test.")
|
||||||
|
def test_eager_matches_sdpa_inference(self):
|
||||||
|
pass
|
||||||
|
|
||||||
|
@unittest.skip("Evolla does not support eager attention implementation.")
|
||||||
|
def test_eager_padding_matches_padding_free_with_position_ids(self):
|
||||||
|
pass
|
||||||
|
|
||||||
|
@unittest.skip(
|
||||||
|
"Evolla has a separate test runner for generation tests with complex inheritance, causing this check to fail."
|
||||||
|
)
|
||||||
|
def test_generation_tester_mixin_inheritance(self):
|
||||||
|
pass
|
||||||
|
|
||||||
|
@unittest.skip("Evolla requires both text and protein inputs which is currently not done in this test.")
|
||||||
|
def test_flex_attention_with_grads(self):
|
||||||
|
pass
|
||||||
|
|
||||||
|
|
||||||
|
@require_torch
|
||||||
|
class EvollaModelIntegrationTest(TestCasePlus):
|
||||||
|
def _prepare_for_inputs(self):
|
||||||
|
aa_seq = "MLLEETLKSCPIVKRGKYHYFIHPISDGVPLVEPKLLREVATRIIKIGNFEGVNKIVTAEAMGIPLVTTLSLYTDIPYVIMRKREYKLPGEVPVFQSTGYSKGQLYLNGIEKGDKVIIIDDVISTGGTMIAIINALERAGAEIKDIICVIERGDGKKIVEEKTGYKIKTLVKIDVVDGEVVIL"
|
||||||
|
foldseek = "dvvvvqqqpfawdddppdtdgcgclapvpdpddpvvlvvllvlcvvpadpvqaqeeeeeddscpsnvvsncvvpvhyydywylddppdppkdwqwf######gitidpdqaaaheyeyeeaeqdqlrvvlsvvvrcvvrnyhhrayeyaeyhycnqvvccvvpvghyhynwywdqdpsgidtd"
|
||||||
|
question = "What is the function of this protein?"
|
||||||
|
|
||||||
|
protein_information = {
|
||||||
|
"aa_seq": aa_seq,
|
||||||
|
"foldseek": foldseek,
|
||||||
|
}
|
||||||
|
messages = [
|
||||||
|
{"role": "system", "content": "You are an AI expert that can answer any questions about protein."},
|
||||||
|
{"role": "user", "content": question},
|
||||||
|
]
|
||||||
|
return protein_information, messages
|
||||||
|
|
||||||
|
@cached_property
|
||||||
|
def default_processor(self):
|
||||||
|
return EvollaProcessor.from_pretrained("westlake-repl/Evolla-10B-hf", revision="refs/pr/11")
|
||||||
|
|
||||||
|
@require_bitsandbytes
|
||||||
|
@slow
|
||||||
|
def test_inference_natural_language_protein_reasoning(self):
|
||||||
|
protein_information, messages = self._prepare_for_inputs()
|
||||||
|
processor = self.default_processor
|
||||||
|
inputs = processor(
|
||||||
|
messages_list=[messages], proteins=[protein_information], return_tensors="pt", padding="longest"
|
||||||
|
).to(torch_device)
|
||||||
|
|
||||||
|
# the CI gpu is small so using quantization to fit
|
||||||
|
quantization_config = BitsAndBytesConfig(
|
||||||
|
load_in_4bit=True,
|
||||||
|
bnb_4bit_compute_dtype="float16",
|
||||||
|
)
|
||||||
|
model = EvollaForProteinText2Text.from_pretrained(
|
||||||
|
"westlake-repl/Evolla-10B-hf",
|
||||||
|
quantization_config=quantization_config,
|
||||||
|
device_map="auto",
|
||||||
|
)
|
||||||
|
generated_ids = model.generate(**inputs, max_new_tokens=100, do_sample=False)
|
||||||
|
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)
|
||||||
|
|
||||||
|
# keep for debugging
|
||||||
|
for i, t in enumerate(generated_text):
|
||||||
|
t = bytes(t, "utf-8").decode("unicode_escape")
|
||||||
|
print(f"{i}:\n{t}\n")
|
||||||
|
|
||||||
|
self.assertIn("This protein", generated_text[0])
|
||||||
|
|
||||||
|
self.assertIn("purine", generated_text[0])
|
||||||
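The integration test above passes one protein dict and one conversation, each wrapped in a list. `EvollaProcessor.__call__` also accepts the unwrapped forms and normalizes them before validating. The following is a simplified sketch of that normalization and validation, independent of any tokenizer. `normalize_inputs` is a hypothetical name and the error messages are shortened.

```python
PROTEIN_VALID_KEYS = ["aa_seq", "foldseek", "msa"]


def normalize_inputs(proteins, messages_list):
    """Return (list of protein dicts, list of conversations), validating both."""
    # A single protein dict becomes a batch of one.
    if isinstance(proteins, dict):
        proteins = [proteins]
    # A single conversation (a list of message dicts) becomes a batch of one.
    if isinstance(messages_list, (list, tuple)) and not isinstance(messages_list[0], (list, tuple)):
        messages_list = [messages_list]

    for protein in proteins:
        if not isinstance(protein, dict) or not all(k in PROTEIN_VALID_KEYS for k in protein):
            raise ValueError(f"Invalid protein entry: {protein}")
    for messages in messages_list:
        for message in messages:
            # Every message must have exactly the keys 'role' and 'content'.
            if not isinstance(message, dict) or set(message) != {"role", "content"}:
                raise ValueError(f"Invalid message entry: {message}")
    return list(proteins), list(messages_list)


single_protein = {"aa_seq": "AAAA", "foldseek": "dddd"}
single_chat = [{"role": "user", "content": "What is the function of this protein?"}]
batch_proteins, batch_chats = normalize_inputs(single_protein, single_chat)
```

After normalization, both values are batches of length one, which matches the shapes the tests pass explicitly.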
295
tests/models/evolla/test_processor_evolla.py
Normal file
@@ -0,0 +1,295 @@
|
|||||||
|
# Copyright 2025 The HuggingFace Team. All rights reserved.
|
||||||
|
#
|
||||||
|
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||||
|
# you may not use this file except in compliance with the License.
|
||||||
|
# You may obtain a copy of the License at
|
||||||
|
#
|
||||||
|
# http://www.apache.org/licenses/LICENSE-2.0
|
||||||
|
#
|
||||||
|
# Unless required by applicable law or agreed to in writing, software
|
||||||
|
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||||
|
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||||
|
# See the License for the specific language governing permissions and
|
||||||
|
# limitations under the License.
|
||||||
|
|
||||||
|
import random
|
||||||
|
import shutil
|
||||||
|
import tempfile
|
||||||
|
import unittest
|
||||||
|
|
||||||
|
from transformers import (
|
||||||
|
AutoProcessor,
|
||||||
|
EvollaProcessor,
|
||||||
|
)
|
||||||
|
from transformers.testing_utils import require_torch
|
||||||
|
from transformers.utils import is_torch_available
|
||||||
|
|
||||||
|
from ...test_processing_common import ProcessorTesterMixin
|
||||||
|
|
||||||
|
|
||||||
|
if is_torch_available():
|
||||||
|
import torch
|
||||||
|
|
||||||
|
|
||||||
|
EVOLLA_VALID_AA = list("ACDEFGHIKLMNPQRSTVWY#")
|
||||||
|
EVOLLA_VALID_FS = list("pynwrqhgdlvtmfsaeikc#")
|
||||||
|
|
||||||
|
|
||||||
|
@require_torch
class EvollaProcessorTest(ProcessorTesterMixin, unittest.TestCase):
    processor_class = EvollaProcessor

    def setUp(self):
        self.tmpdirname = tempfile.mkdtemp()

        processor = EvollaProcessor.from_pretrained("westlake-repl/Evolla-10B-hf")

        processor.save_pretrained(self.tmpdirname)

        self.input_keys = ["protein_input_ids", "protein_attention_mask", "input_ids", "attention_mask"]

    def prepare_input_and_expected_output(self):
        amino_acid_sequence = "AAAA"
        foldseek_sequence = "dddd"
        question = "What is the function of this protein?"

        expected_output = {
            "protein_input_ids": torch.tensor([[0, 13, 13, 13, 13, 2]]),
            "protein_attention_mask": torch.tensor([[1, 1, 1, 1, 1, 1]]),
            "input_ids": torch.tensor(
                [
                    [
                        128000,
                        128006,
                        9125,
                        128007,
                        271,
                        2675,
                        527,
                        459,
                        15592,
                        6335,
                        430,
                        649,
                        4320,
                        904,
                        4860,
                        922,
                        13128,
                        13,
                        128009,
                        128006,
                        882,
                        128007,
                        271,
                        3923,
                        374,
                        279,
                        734,
                        315,
                        420,
                        13128,
                        30,
                        128009,
                        128006,
                        78191,
                        128007,
                        271,
                    ]
                ]
            ),
            "attention_mask": torch.tensor(
                [
                    [
                        1,
                        1,
                        1,
                        1,
                        1,
                        1,
                        1,
                        1,
                        1,
                        1,
                        1,
                        1,
                        1,
                        1,
                        1,
                        1,
                        1,
                        1,
                        1,
                        1,
                        1,
                        1,
                        1,
                        1,
                        1,
                        1,
                        1,
                        1,
                        1,
                        1,
                        1,
                        1,
                        1,
                        1,
                        1,
                        1,
                    ]
                ]
            ),
        }
        protein_dict = {"aa_seq": amino_acid_sequence, "foldseek": foldseek_sequence}
        message = [
            {"role": "system", "content": "You are an AI expert that can answer any questions about protein."},
            {"role": "user", "content": question},
        ]
        return protein_dict, message, expected_output

    def test_processor(self):
        protein_tokenizer = self.get_protein_tokenizer()
        tokenizer = self.get_tokenizer()

        processor = EvollaProcessor(protein_tokenizer, tokenizer)

        protein_dict, message, expected_output = self.prepare_input_and_expected_output()
        inputs = processor(proteins=[protein_dict], messages_list=[message])

        # check that every processor output matches the expected tensor
        for key, value in expected_output.items():
            self.assertTrue(
                torch.equal(inputs[key], value),
                f"inputs[{key!r}] is {inputs[key]} and expected_output[{key!r}] is {value}",
            )

    def get_tokenizer(self, **kwargs):
        return AutoProcessor.from_pretrained(self.tmpdirname, **kwargs).tokenizer

    def get_protein_tokenizer(self, **kwargs):
        return AutoProcessor.from_pretrained(self.tmpdirname, **kwargs).protein_tokenizer

    def tearDown(self):
        shutil.rmtree(self.tmpdirname)

    def prepare_inputs_single(self):
        proteins = {
            "aa_seq": "".join(random.choices(EVOLLA_VALID_AA, k=100)),
            "foldseek": "".join(random.choices(EVOLLA_VALID_FS, k=100)),
        }
        return proteins

    def prepare_inputs_pair(self):
        proteins = [
            {
                "aa_seq": "".join(random.choices(EVOLLA_VALID_AA, k=100)),
                "foldseek": "".join(random.choices(EVOLLA_VALID_FS, k=100)),
            },
            {
                "aa_seq": "".join(random.choices(EVOLLA_VALID_AA, k=100)),
                "foldseek": "".join(random.choices(EVOLLA_VALID_FS, k=100)),
            },
        ]
        return proteins

    def prepare_inputs_long(self):
        proteins = [
            {
                "aa_seq": "".join(random.choices(EVOLLA_VALID_AA, k=100)),
                "foldseek": "".join(random.choices(EVOLLA_VALID_FS, k=100)),
            },
            {
                "aa_seq": "".join(random.choices(EVOLLA_VALID_AA, k=2000)),
                "foldseek": "".join(random.choices(EVOLLA_VALID_FS, k=2000)),
            },
        ]
        return proteins

    def prepare_inputs_short(self):
        proteins = [
            {
                "aa_seq": "".join(random.choices(EVOLLA_VALID_AA, k=1)),
                "foldseek": "".join(random.choices(EVOLLA_VALID_FS, k=1)),
            },
            {
                "aa_seq": "".join(random.choices(EVOLLA_VALID_AA, k=100)),
                "foldseek": "".join(random.choices(EVOLLA_VALID_FS, k=100)),
            },
        ]
        return proteins

    def prepare_inputs_empty(self):
        proteins = [
            {
                "aa_seq": "",
                "foldseek": "",
            },
            {
                "aa_seq": "".join(random.choices(EVOLLA_VALID_AA, k=100)),
                "foldseek": "".join(random.choices(EVOLLA_VALID_FS, k=100)),
            },
        ]
        return proteins

    def prepare_inputs(self, protein_types="pair"):
        r"""
        Prepare inputs for the test.

        Args:
            protein_types (`str`): the kind of protein inputs to prepare.
                - "single": a single valid protein.
                - "pair": a pair of valid proteins.
                - "long": a regular protein and a long protein (2000 residues).
                - "short": a short protein (only 1 amino acid) and a regular protein.
                - "empty": an empty protein and a regular protein.
        """
        if protein_types == "single":
            proteins = self.prepare_inputs_single()
        elif protein_types == "pair":
            proteins = self.prepare_inputs_pair()
        elif protein_types == "long":
            proteins = self.prepare_inputs_long()
        elif protein_types == "short":
            proteins = self.prepare_inputs_short()
        elif protein_types == "empty":
            proteins = self.prepare_inputs_empty()
        else:
            raise ValueError(
                f"protein_types should be one of 'single', 'pair', 'long', 'short', 'empty', but got {protein_types}"
            )

        questions = ["What is the function of the protein?"] * len(proteins)
        messages_list = []
        for question in questions:
            messages = [
                {"role": "system", "content": "You are an AI expert that can answer any questions about protein."},
                {"role": "user", "content": question},
            ]
            messages_list.append(messages)
        return proteins, messages_list

    def test_tokenizer_decode(self):
        protein_tokenizer = self.get_protein_tokenizer()
        tokenizer = self.get_tokenizer()

        processor = EvollaProcessor(tokenizer=tokenizer, protein_tokenizer=protein_tokenizer, return_tensors="pt")

        predicted_ids = [[1, 4, 5, 8, 1, 0, 8], [3, 4, 3, 1, 1, 8, 9]]

        decoded_processor = processor.batch_decode(predicted_ids)
        decoded_tok = tokenizer.batch_decode(predicted_ids)

        self.assertListEqual(decoded_tok, decoded_processor)

    def test_model_input_names(self):
        protein_tokenizer = self.get_protein_tokenizer()
        tokenizer = self.get_tokenizer()

        processor = EvollaProcessor(tokenizer=tokenizer, protein_tokenizer=protein_tokenizer)
        proteins, messages_list = self.prepare_inputs()

        inputs = processor(messages_list=messages_list, proteins=proteins, padding="longest", return_tensors="pt")

        # The processor should return exactly the keys listed in `self.input_keys`
        self.assertSetEqual(set(inputs.keys()), set(self.input_keys))
@@ -92,6 +92,7 @@ PRIVATE_MODELS = [
     "Phi4MultimodalAudioModel",
     "Phi4MultimodalVisionModel",
     "Glm4vVisionModel",
+    "EvollaSaProtPreTrainedModel",
 ]
 
 # Update this list for models that are not tested with a comment explaining the reason it should not be.