diff --git a/docs/source/en/_toctree.yml b/docs/source/en/_toctree.yml index f7ba50a22d..50f43f27ac 100644 --- a/docs/source/en/_toctree.yml +++ b/docs/source/en/_toctree.yml @@ -453,6 +453,8 @@ title: ErnieM - local: model_doc/esm title: ESM + - local: model_doc/exaone4 + title: EXAONE-4.0 - local: model_doc/falcon title: Falcon - local: model_doc/falcon3 diff --git a/docs/source/en/model_doc/exaone4.md b/docs/source/en/model_doc/exaone4.md new file mode 100644 index 0000000000..45667cea34 --- /dev/null +++ b/docs/source/en/model_doc/exaone4.md @@ -0,0 +1,208 @@ + + +# EXAONE 4 + +## Overview + +**[EXAONE 4.0](https://github.com/LG-AI-EXAONE/EXAONE-4.0)** model is the language model, which integrates a **Non-reasoning mode** and **Reasoning mode** to achieve both the excellent usability of [EXAONE 3.5](https://github.com/LG-AI-EXAONE/EXAONE-3.5) and the advanced reasoning abilities of [EXAONE Deep](https://github.com/LG-AI-EXAONE/EXAONE-Deep). To pave the way for the agentic AI era, EXAONE 4.0 incorporates essential features such as agentic tool use, and its multilingual capabilities are extended +to support Spanish in addition to English and Korean. + +The EXAONE 4.0 model series consists of two sizes: a mid-size **32B** model optimized for high performance, and a small-size **1.2B** model designed for on-device applications. + +In the EXAONE 4.0 architecture, we apply new architectural changes compared to previous EXAONE models as below: + +1. **Hybrid Attention**: For the 32B model, we adopt hybrid attention scheme, which combines *Local attention (sliding window attention)* with *Global attention (full attention)* in a 3:1 ratio. We do not use RoPE (Rotary Positional Embedding) for global attention for better global context understanding. +2. **QK-Reorder-Norm**: We reorder the LayerNorm position from the traditional Pre-LN scheme by applying LayerNorm directly to the attention and MLP outputs, and we add RMS normalization right after the Q and K projection. It helps yield better performance on downstream tasks despite consuming more computation. + +For more details, please refer to our [technical report](https://arxiv.org/abs/2507.11407), [HuggingFace paper](https://huggingface.co/papers/2507.11407), [blog](https://www.lgresearch.ai/blog/view?seq=576), and [GitHub](https://github.com/LG-AI-EXAONE/EXAONE-4.0). + +All model weights including quantized versions are available at [Huggingface Collections](https://huggingface.co/collections/LGAI-EXAONE/exaone-40-686b2e0069800c835ed48375). + + +## Model Details + +### Model Specifications + +| Model Configuration | 32B | 1.2B | +|:-------------------|:-----:|:------:| +| d_model | 5,120 | 2,048 | +| Number of layers | 64 | 30 | +| Normalization | QK-Reorder-LN | QK-Reorder-LN | +| Non-linearity | SwiGLU | SwiGLU | +| Feedforward dimension | 27,392 | 4,096 | +| Attention type | Hybrid (3:1 Local-Global) | Global | +| Head type | GQA | GQA | +| Number of heads | 40 | 32 | +| Number of KV heads | 8 | 8 | +| Head size | 128 | 64 | +| Max sequence length | 131,072 | 65,536 | +| RoPE theta | 1,000,000 | 1,000,000 | +| Tokenizer | BBPE | BBPE | +| Vocab size | 102,400 | 102,400 | +| Tied word embedding | False | True | +| Knowledge cut-off | Nov. 2024 | Nov. 2024 | + + +## Usage tips + +### Non-reasoning mode + +For general use, you can use the EXAONE 4.0 models with the following example: + +```python +from transformers import AutoModelForCausalLM, AutoTokenizer + +model_name = "LGAI-EXAONE/EXAONE-4.0-32B" + +model = AutoModelForCausalLM.from_pretrained( + model_name, + torch_dtype="bfloat16", + device_map="auto" +) +tokenizer = AutoTokenizer.from_pretrained(model_name) + +# choose your prompt +prompt = "Explain how wonderful you are" +prompt = "Explica lo increíble que eres" +prompt = "너가 얼마나 대단한지 설명해 봐" + +messages = [ + {"role": "user", "content": prompt} +] +input_ids = tokenizer.apply_chat_template( + messages, + tokenize=True, + add_generation_prompt=True, + return_tensors="pt" +) + +output = model.generate( + input_ids.to(model.device), + max_new_tokens=128, + do_sample=False, +) +print(tokenizer.decode(output[0])) +``` + +### Reasoning mode + +The EXAONE 4.0 models have reasoning capabilities for handling complex problems. You can activate reasoning mode by using the `enable_thinking=True` argument with the tokenizer, which opens a reasoning block that starts with `` tag without closing it. + +```python +messages = [ + {"role": "user", "content": "Which one is bigger, 3.12 vs 3.9?"} +] +input_ids = tokenizer.apply_chat_template( + messages, + tokenize=True, + add_generation_prompt=True, + return_tensors="pt", + enable_thinking=True, +) + +output = model.generate( + input_ids.to(model.device), + max_new_tokens=128, + do_sample=True, + temperature=0.6, + top_p=0.95 +) +print(tokenizer.decode(output[0])) +``` + +> [!IMPORTANT] +> The model generation with reasoning mode can be affected sensitively by sampling parameters, so please refer to the [Usage Guideline](https://github.com/LG-AI-EXAONE/EXAONE-4.0#usage-guideline) on official GitHub page for better quality. + +### Agentic tool use + +The EXAONE 4.0 models can be used as agents with their tool calling capabilities. You can provide tool schemas to the model for effective tool calling. + +```python +import random + +def roll_dice(max_num: int): + return random.randint(1, max_num) + +tools = [ + { + "type": "function", + "function": { + "name": "roll_dice", + "description": "Roll a dice with the number 1 to N. User can select the number N.", + "parameters": { + "type": "object", + "required": ["max_num"], + "properties": { + "max_num": { + "type": "int", + "description": "Max number of the dice" + } + } + } + } + } +] + +messages = [ + {"role": "user", "content": "Roll D6 dice twice!"} +] +input_ids = tokenizer.apply_chat_template( + messages, + tokenize=True, + add_generation_prompt=True, + return_tensors="pt", + tools=tools, +) + +output = model.generate( + input_ids.to(model.device), + max_new_tokens=1024, + do_sample=True, + temperature=0.6, + top_p=0.95, +) +print(tokenizer.decode(output[0])) +``` + +## Exaone4Config + +[[autodoc]] Exaone4Config + +## Exaone4Model + +[[autodoc]] Exaone4Model + - forward + +## Exaone4ForCausalLM + +[[autodoc]] Exaone4ForCausalLM + - forward + +## Exaone4ForSequenceClassification + +[[autodoc]] Exaone4ForSequenceClassification + - forward + +## Exaone4ForTokenClassification + +[[autodoc]] Exaone4ForTokenClassification + - forward + +## Exaone4ForQuestionAnswering + +[[autodoc]] Exaone4ForQuestionAnswering + - forward \ No newline at end of file diff --git a/docs/source/ko/_toctree.yml b/docs/source/ko/_toctree.yml index 0e612a85d1..05031b3178 100644 --- a/docs/source/ko/_toctree.yml +++ b/docs/source/ko/_toctree.yml @@ -445,6 +445,8 @@ title: ErnieM - local: model_doc/esm title: ESM + - local: model_doc/exaone4 + title: EXAONE-4.0 - local: model_doc/falcon title: Falcon - local: model_doc/falcon3 diff --git a/docs/source/ko/model_doc/exaone4.md b/docs/source/ko/model_doc/exaone4.md new file mode 100644 index 0000000000..65899b52f7 --- /dev/null +++ b/docs/source/ko/model_doc/exaone4.md @@ -0,0 +1,207 @@ + + +# EXAONE 4 + +## 개요 + +**[EXAONE 4.0](https://github.com/LG-AI-EXAONE/EXAONE-4.0)** 모델군은 [EXAONE 3.5](https://github.com/LG-AI-EXAONE/EXAONE-3.5) 모델군의 높은 실용성과 [EXAONE Deep](https://github.com/LG-AI-EXAONE/EXAONE-Deep) 모델군의 향상된 사고 추론 능력을 각각 Non-reasoning mode와 Reasoning mode로 통합한 자연어 모델(language model)입니다. 에이전틱(agentic) AI 시대에 발맞춰 EXAONE 4.0은 에이전틱 도구 사용 능력과 같은 핵심 기능을 통합했고, 기존의 다국어 능력을 영어, 한국어와 더불어 스페인어까지 확장했습니다. + +EXAONE 4.0 모델군은 두 개의 모델: 높은 성능을 위해 최적화된 32B 중형 모델, 그리고 온-디바이스 활용을 위해 디자인된 1.2B 소형 모델으로 구성되어 있습니다. + +EXAONE 4.0의 모델 구조는 이전 EXAONE 모델들과 다른 아키텍처 디자인을 채택했습니다. + +1. **Hybrid Attention**: 32B 모델은 *Local attention (sliding window attention)*과 *Global attention (full attention)*을 3:1 비율로 연결한 hybrid attention 구조를 채택했습니다. 또한 전체 문맥을 더 잘 이해할 수 있도록 global attention에서 RoPE를 사용하지 않았습니다. +2. **QK-Reorder-Norm**: 더 나은 downstream tasks 성능을 위해 연산량의 증가를 감수하며 전통적으로 사용되고 있던 Pre-LN 방식을 변경했습니다. LayerNorm의 위치를 attention과 MLP의 출력에 적용되도록 재배치했고, Q와 K projection 직후에도 RMS normalization을 추가했습니다. + +더 자세한 정보는 [기술 보고서](https://arxiv.org/abs/2507.11407), [HuggingFace 논문](https://huggingface.co/papers/2507.11407), [블로그](https://www.lgresearch.ai/blog/view?seq=576), [공식 GitHub](https://github.com/LG-AI-EXAONE/EXAONE-4.0) 페이지를 참고해주시길 바랍니다. + +공개된 모든 모델 체크포인트는 [HuggingFace 콜렉션](https://huggingface.co/collections/LGAI-EXAONE/exaone-40-686b2e0069800c835ed48375)에서 확인할 수 있습니다. + + +## 모델 세부 정보 + +| Model Configuration | 32B | 1.2B | +|:-------------------|:-----:|:------:| +| d_model | 5,120 | 2,048 | +| Number of layers | 64 | 30 | +| Normalization | QK-Reorder-LN | QK-Reorder-LN | +| Non-linearity | SwiGLU | SwiGLU | +| Feedforward dimension | 27,392 | 4,096 | +| Attention type | Hybrid (3:1 Local-Global) | Global | +| Head type | GQA | GQA | +| Number of heads | 40 | 32 | +| Number of KV heads | 8 | 8 | +| Head size | 128 | 64 | +| Max sequence length | 131,072 | 65,536 | +| RoPE theta | 1,000,000 | 1,000,000 | +| Tokenizer | BBPE | BBPE | +| Vocab size | 102,400 | 102,400 | +| Tied word embedding | False | True | +| Knowledge cut-off | Nov. 2024 | Nov. 2024 | + + +## 사용 팁 + +### Non-reasoning mode + +일반적인 대화의 경우 아래 예제와 같이 EXAONE 4.0을 사용할 수 있습니다. + +```python +from transformers import AutoModelForCausalLM, AutoTokenizer + +model_name = "LGAI-EXAONE/EXAONE-4.0-32B" + +model = AutoModelForCausalLM.from_pretrained( + model_name, + torch_dtype="bfloat16", + device_map="auto" +) +tokenizer = AutoTokenizer.from_pretrained(model_name) + +# 원하는 입력을 선택하세요 +prompt = "Explain how wonderful you are" +prompt = "Explica lo increíble que eres" +prompt = "너가 얼마나 대단한지 설명해 봐" + +messages = [ + {"role": "user", "content": prompt} +] +input_ids = tokenizer.apply_chat_template( + messages, + tokenize=True, + add_generation_prompt=True, + return_tensors="pt" +) + +output = model.generate( + input_ids.to(model.device), + max_new_tokens=128, + do_sample=False, +) +print(tokenizer.decode(output[0])) +``` + +### Reasoning mode + +The EXAONE 4.0 models have reasoning capabilities for handling complex problems. You can activate reasoning mode by using the `enable_thinking=True` argument with the tokenizer, which opens a reasoning block that starts with `` tag without closing it. + +EXAONE 4.0 모델군은 복잡한 문제를 해결하기 위한 사고 추론 능력을 갖추고 있습니다. 토크나이저에서 `enable_thinking=True` 인자를 사용해서 reasoning mode로 모델을 사용할 수 있습니다. 이 경우 `` 토큰으로 추론 블록을 연 뒤, 닫지 않고 추론을 시작합니다. + +```python +messages = [ + {"role": "user", "content": "Which one is bigger, 3.12 vs 3.9?"} +] +input_ids = tokenizer.apply_chat_template( + messages, + tokenize=True, + add_generation_prompt=True, + return_tensors="pt", + enable_thinking=True, +) + +output = model.generate( + input_ids.to(model.device), + max_new_tokens=128, + do_sample=True, + temperature=0.6, + top_p=0.95 +) +print(tokenizer.decode(output[0])) +``` + +> [!IMPORTANT] +> 모델을 reasoning mode로 사용할 경우, 생성되는 답변이 sampling parameters에 굉장히 민감합니다. 따라서 더 나은 생성 품질을 위해 공식 [Usage Guideline](https://github.com/LG-AI-EXAONE/EXAONE-4.0#usage-guideline)를 참조해 주시길 바랍니다. + +### Agentic tool use + +EXAONE 4.0 모델은 도구 사용 능력을 갖춘 덕분에 Agent로 사용할 수 있습니다. 이를 위해서는 아래 예제와 같이 도구 명세를 모델에게 제공해 주어야 합니다. + +```python +import random + +def roll_dice(max_num: int): + return random.randint(1, max_num) + +tools = [ + { + "type": "function", + "function": { + "name": "roll_dice", + "description": "Roll a dice with the number 1 to N. User can select the number N.", + "parameters": { + "type": "object", + "required": ["max_num"], + "properties": { + "max_num": { + "type": "int", + "description": "Max number of the dice" + } + } + } + } + } +] + +messages = [ + {"role": "user", "content": "Roll D6 dice twice!"} +] +input_ids = tokenizer.apply_chat_template( + messages, + tokenize=True, + add_generation_prompt=True, + return_tensors="pt", + tools=tools, +) + +output = model.generate( + input_ids.to(model.device), + max_new_tokens=1024, + do_sample=True, + temperature=0.6, + top_p=0.95, +) +print(tokenizer.decode(output[0])) +``` + +## Exaone4Config + +[[autodoc]] Exaone4Config + +## Exaone4Model + +[[autodoc]] Exaone4Model + - forward + +## Exaone4ForCausalLM + +[[autodoc]] Exaone4ForCausalLM + - forward + +## Exaone4ForSequenceClassification + +[[autodoc]] Exaone4ForSequenceClassification + - forward + +## Exaone4ForTokenClassification + +[[autodoc]] Exaone4ForTokenClassification + - forward + +## Exaone4ForQuestionAnswering + +[[autodoc]] Exaone4ForQuestionAnswering + - forward \ No newline at end of file diff --git a/src/transformers/models/__init__.py b/src/transformers/models/__init__.py index 3670833bf6..737a823786 100644 --- a/src/transformers/models/__init__.py +++ b/src/transformers/models/__init__.py @@ -113,6 +113,7 @@ if TYPE_CHECKING: from .ernie import * from .esm import * from .evolla import * + from .exaone4 import * from .falcon import * from .falcon_h1 import * from .falcon_mamba import * diff --git a/src/transformers/models/auto/configuration_auto.py b/src/transformers/models/auto/configuration_auto.py index 0317832fd7..12f8f4f4c5 100644 --- a/src/transformers/models/auto/configuration_auto.py +++ b/src/transformers/models/auto/configuration_auto.py @@ -136,6 +136,7 @@ CONFIG_MAPPING_NAMES = OrderedDict[str, str]( ("ernie_m", "ErnieMConfig"), ("esm", "EsmConfig"), ("evolla", "EvollaConfig"), + ("exaone4", "Exaone4Config"), ("falcon", "FalconConfig"), ("falcon_h1", "FalconH1Config"), ("falcon_mamba", "FalconMambaConfig"), @@ -535,6 +536,7 @@ MODEL_NAMES_MAPPING = OrderedDict[str, str]( ("ernie_m", "ErnieM"), ("esm", "ESM"), ("evolla", "Evolla"), + ("exaone4", "EXAONE-4.0"), ("falcon", "Falcon"), ("falcon3", "Falcon3"), ("falcon_h1", "FalconH1"), diff --git a/src/transformers/models/auto/modeling_auto.py b/src/transformers/models/auto/modeling_auto.py index cc779b79a1..52eb254be1 100644 --- a/src/transformers/models/auto/modeling_auto.py +++ b/src/transformers/models/auto/modeling_auto.py @@ -127,6 +127,7 @@ MODEL_MAPPING_NAMES = OrderedDict( ("ernie_m", "ErnieMModel"), ("esm", "EsmModel"), ("evolla", "EvollaModel"), + ("exaone4", "Exaone4Model"), ("falcon", "FalconModel"), ("falcon_h1", "FalconH1Model"), ("falcon_mamba", "FalconMambaModel"), @@ -407,6 +408,7 @@ MODEL_FOR_PRETRAINING_MAPPING_NAMES = OrderedDict( ("electra", "ElectraForPreTraining"), ("ernie", "ErnieForPreTraining"), ("evolla", "EvollaForProteinText2Text"), + ("exaone4", "Exaone4ForCausalLM"), ("falcon_mamba", "FalconMambaForCausalLM"), ("flaubert", "FlaubertWithLMHeadModel"), ("flava", "FlavaForPreTraining"), @@ -504,6 +506,7 @@ MODEL_WITH_LM_HEAD_MAPPING_NAMES = OrderedDict( ("encoder-decoder", "EncoderDecoderModel"), ("ernie", "ErnieForMaskedLM"), ("esm", "EsmForMaskedLM"), + ("exaone4", "Exaone4ForCausalLM"), ("falcon_mamba", "FalconMambaForCausalLM"), ("flaubert", "FlaubertWithLMHeadModel"), ("fnet", "FNetForMaskedLM"), @@ -604,6 +607,7 @@ MODEL_FOR_CAUSAL_LM_MAPPING_NAMES = OrderedDict( ("ernie", "ErnieForCausalLM"), ("ernie4_5", "Ernie4_5ForCausalLM"), ("ernie4_5_moe", "Ernie4_5_MoeForCausalLM"), + ("exaone4", "Exaone4ForCausalLM"), ("falcon", "FalconForCausalLM"), ("falcon_h1", "FalconH1ForCausalLM"), ("falcon_mamba", "FalconMambaForCausalLM"), @@ -1145,6 +1149,7 @@ MODEL_FOR_SEQUENCE_CLASSIFICATION_MAPPING_NAMES = OrderedDict( ("ernie", "ErnieForSequenceClassification"), ("ernie_m", "ErnieMForSequenceClassification"), ("esm", "EsmForSequenceClassification"), + ("exaone4", "Exaone4ForSequenceClassification"), ("falcon", "FalconForSequenceClassification"), ("flaubert", "FlaubertForSequenceClassification"), ("fnet", "FNetForSequenceClassification"), @@ -1251,6 +1256,7 @@ MODEL_FOR_QUESTION_ANSWERING_MAPPING_NAMES = OrderedDict( ("electra", "ElectraForQuestionAnswering"), ("ernie", "ErnieForQuestionAnswering"), ("ernie_m", "ErnieMForQuestionAnswering"), + ("exaone4", "Exaone4ForQuestionAnswering"), ("falcon", "FalconForQuestionAnswering"), ("flaubert", "FlaubertForQuestionAnsweringSimple"), ("fnet", "FNetForQuestionAnswering"), @@ -1356,6 +1362,7 @@ MODEL_FOR_TOKEN_CLASSIFICATION_MAPPING_NAMES = OrderedDict( ("ernie", "ErnieForTokenClassification"), ("ernie_m", "ErnieMForTokenClassification"), ("esm", "EsmForTokenClassification"), + ("exaone4", "Exaone4ForTokenClassification"), ("falcon", "FalconForTokenClassification"), ("flaubert", "FlaubertForTokenClassification"), ("fnet", "FNetForTokenClassification"), diff --git a/src/transformers/models/auto/tokenization_auto.py b/src/transformers/models/auto/tokenization_auto.py index 3747597f3e..29849fdb24 100644 --- a/src/transformers/models/auto/tokenization_auto.py +++ b/src/transformers/models/auto/tokenization_auto.py @@ -230,6 +230,13 @@ TOKENIZER_MAPPING_NAMES = OrderedDict[str, tuple[Optional[str], Optional[str]]]( ("ernie4_5_moe", (None, "LlamaTokenizerFast" if is_tokenizers_available() else None)), ("ernie_m", ("ErnieMTokenizer" if is_sentencepiece_available() else None, None)), ("esm", ("EsmTokenizer", None)), + ( + "exaone4", + ( + "GPT2Tokenizer" if is_tokenizers_available() else None, + "GPT2TokenizerFast" if is_tokenizers_available() else None, + ), + ), ("falcon", (None, "PreTrainedTokenizerFast" if is_tokenizers_available() else None)), ("falcon_mamba", (None, "GPTNeoXTokenizerFast" if is_tokenizers_available() else None)), ( diff --git a/src/transformers/models/deepseek_vl/modeling_deepseek_vl.py b/src/transformers/models/deepseek_vl/modeling_deepseek_vl.py index ce85d739bc..60ca3394fa 100644 --- a/src/transformers/models/deepseek_vl/modeling_deepseek_vl.py +++ b/src/transformers/models/deepseek_vl/modeling_deepseek_vl.py @@ -139,7 +139,7 @@ class DeepseekVLPreTrainedModel(PreTrainedModel): _supports_flash_attn = True _supports_sdpa = True - _supports_static_cache = True + _can_compile_fullgraph = True _supports_param_buffer_assignment = False def _init_weights(self, module): @@ -236,7 +236,7 @@ class DeepseekVLModel(DeepseekVLPreTrainedModel): class DeepseekVLForConditionalGeneration(DeepseekVLPreTrainedModel, GenerationMixin): _tied_weights_keys = ["model.language_model.embed_tokens.weight", "lm_head.weight"] - _supports_static_cache = True + _can_compile_fullgraph = True def __init__(self, config: DeepseekVLConfig): super().__init__(config) diff --git a/src/transformers/models/deepseek_vl_hybrid/modeling_deepseek_vl_hybrid.py b/src/transformers/models/deepseek_vl_hybrid/modeling_deepseek_vl_hybrid.py index 67b67371f9..1910c5659c 100644 --- a/src/transformers/models/deepseek_vl_hybrid/modeling_deepseek_vl_hybrid.py +++ b/src/transformers/models/deepseek_vl_hybrid/modeling_deepseek_vl_hybrid.py @@ -222,7 +222,7 @@ class DeepseekVLHybridPreTrainedModel(PreTrainedModel): _supports_flash_attn = True _supports_sdpa = True - _supports_static_cache = True + _can_compile_fullgraph = True _supports_param_buffer_assignment = False def _init_weights(self, module): @@ -376,7 +376,7 @@ class DeepseekVLHybridModel(DeepseekVLHybridPreTrainedModel): class DeepseekVLHybridForConditionalGeneration(DeepseekVLHybridPreTrainedModel, GenerationMixin): _tied_weights_keys = ["model.language_model.embed_tokens.weight", "lm_head.weight"] - _supports_static_cache = True + _can_compile_fullgraph = True def __init__(self, config: DeepseekVLHybridConfig): super().__init__(config) diff --git a/src/transformers/models/evolla/modeling_evolla.py b/src/transformers/models/evolla/modeling_evolla.py index f51f27d6d3..8f91f60056 100644 --- a/src/transformers/models/evolla/modeling_evolla.py +++ b/src/transformers/models/evolla/modeling_evolla.py @@ -1516,7 +1516,7 @@ class EvollaPreTrainedModel(PreTrainedModel): _supports_sdpa = True _supports_flex_attn = True - _supports_static_cache = True + _can_compile_fullgraph = True _supports_attention_backend = False _can_record_outputs = { "hidden_states": EvollaDecoderLayer, diff --git a/src/transformers/models/exaone4/__init__.py b/src/transformers/models/exaone4/__init__.py new file mode 100644 index 0000000000..c646c4e752 --- /dev/null +++ b/src/transformers/models/exaone4/__init__.py @@ -0,0 +1,27 @@ +# Copyright 2025 The LG AI Research and The HuggingFace Team. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +from typing import TYPE_CHECKING + +from ...utils import _LazyModule +from ...utils.import_utils import define_import_structure + + +if TYPE_CHECKING: + from .configuration_exaone4 import * + from .modeling_exaone4 import * +else: + import sys + + _file = globals()["__file__"] + sys.modules[__name__] = _LazyModule(__name__, _file, define_import_structure(_file), module_spec=__spec__) diff --git a/src/transformers/models/exaone4/configuration_exaone4.py b/src/transformers/models/exaone4/configuration_exaone4.py new file mode 100644 index 0000000000..0539acdc6e --- /dev/null +++ b/src/transformers/models/exaone4/configuration_exaone4.py @@ -0,0 +1,223 @@ +# 🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨 +# This file was automatically generated from src/transformers/models/exaone4/modular_exaone4.py. +# Do NOT edit this file manually as any edits will be overwritten by the generation of +# the file from the modular. If any change should be done, please apply the change to the +# modular_exaone4.py file directly. One of our CI enforces this. +# 🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨 +# coding=utf-8 +# Copyright 2025 The LG AI Research and HuggingFace Inc. team. All rights reserved. +# +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +from ...configuration_utils import PretrainedConfig, layer_type_validation + + +class Exaone4Config(PretrainedConfig): + r""" + This is the configuration class to store the configuration of a [`Exaone4Model`]. It is used to + instantiate a EXAONE 4.0 model according to the specified arguments, defining the model architecture. Instantiating a + configuration with the defaults will yield a similar configuration to that of the EXAONE-4.0-Instruct [LGAI-EXAONE/EXAONE-4.0-Instruct](https://huggingface.co/LGAI-EXAONE/EXAONE-4.0-Instruct) + NOTE: `EXAONE-4.0-Instruct` is a placeholder model ID. The exact model ID will be updated in the future. + + Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model + outputs. Read the documentation from [`PretrainedConfig`] for more information. + + Args: + vocab_size (`int`, *optional*, defaults to 102400): + Vocabulary size of the EXAONE 4.0 model. Defines the number of different tokens that can be represented by the + `inputs_ids` passed when calling [`Exaone4Model`]. + hidden_size (`int`, *optional*, defaults to 4096): + Dimension of the hidden representations. + intermediate_size (`int`, *optional*, defaults to `hidden_size * 4`): + Dimensionality of the MLP representations. + num_hidden_layers (`int`, *optional*, defaults to 32): + Number of hidden layers in the Transformer encoder. + num_attention_heads (`int`, *optional*, defaults to 32): + Number of attention heads for each attention layer in the Transformer decoder. + num_key_value_heads (`int`, *optional*): + This is the number of key_value heads that should be used to implement Grouped Query Attention. If + `num_key_value_heads=num_attention_heads`, the model will use Multi Head Attention (MHA), if + `num_key_value_heads=1 the model will use Multi Query Attention (MQA) otherwise GQA is used. When + converting a multi-head checkpoint to a GQA checkpoint, each group key and value head should be constructed + by meanpooling all the original heads within that group. For more details checkout [this + paper](https://arxiv.org/pdf/2305.13245.pdf). If it is not specified, will default to + `num_attention_heads`. + hidden_act (`str` or `function`, *optional*, defaults to `"silu"`): + The non-linear activation function (function or string) in the decoder. + max_position_embeddings (`int`, *optional*, defaults to 2048): + The maximum sequence length that this model might ever be used with. Typically set this to something large + just in case (e.g., 32768 for EXAONE 3.5). + initializer_range (`float`, *optional*, defaults to 0.02): + The standard deviation of the truncated_normal_initializer for initializing all weight matrices. + rms_norm_eps (`float`, *optional*, defaults to 1e-05): + The epsilon used by the layer normalization layers. + use_cache (`bool`, *optional*, defaults to `True`): + Whether or not the model should return the last key/values attentions (not used by all models). Only + relevant if ``config.is_decoder=True``. + bos_token_id (`int`, *optional*, defaults to 0): + Beginning of stream token id. + eos_token_id (`int`, *optional*, defaults to 2): + End of stream token id. + tie_word_embeddings (`bool`, *optional*, defaults to `False`): + Whether to tie weight embeddings + rope_theta (`float`, *optional*, defaults to 10000.0): + The base period of the RoPE embeddings. + rope_scaling (`Dict`, *optional*): + Dictionary containing the scaling configuration for the RoPE embeddings. NOTE: if you apply new rope type + and you expect the model to work on longer `max_position_embeddings`, we recommend you to update this value + accordingly. + Expected contents: + `rope_type` (`str`): + The sub-variant of RoPE to use. Can be one of ['default', 'linear', 'dynamic', 'yarn', 'longrope', + 'llama3'], with 'default' being the original RoPE implementation. + `factor` (`float`, *optional*): + Used with all rope types except 'default'. The scaling factor to apply to the RoPE embeddings. In + most scaling types, a `factor` of x will enable the model to handle sequences of length x * + original maximum pre-trained length. + `original_max_position_embeddings` (`int`, *optional*): + Used with 'dynamic', 'longrope' and 'llama3'. The original max position embeddings used during + pretraining. + `attention_factor` (`float`, *optional*): + Used with 'yarn' and 'longrope'. The scaling factor to be applied on the attention + computation. If unspecified, it defaults to value recommended by the implementation, using the + `factor` field to infer the suggested value. + `beta_fast` (`float`, *optional*): + Only used with 'yarn'. Parameter to set the boundary for extrapolation (only) in the linear + ramp function. If unspecified, it defaults to 32. + `beta_slow` (`float`, *optional*): + Only used with 'yarn'. Parameter to set the boundary for interpolation (only) in the linear + ramp function. If unspecified, it defaults to 1. + `short_factor` (`List[float]`, *optional*): + Only used with 'longrope'. The scaling factor to be applied to short contexts (< + `original_max_position_embeddings`). Must be a list of numbers with the same length as the hidden + size divided by the number of attention heads divided by 2 + `long_factor` (`List[float]`, *optional*): + Only used with 'longrope'. The scaling factor to be applied to long contexts (< + `original_max_position_embeddings`). Must be a list of numbers with the same length as the hidden + size divided by the number of attention heads divided by 2 + `low_freq_factor` (`float`, *optional*): + Only used with 'llama3'. Scaling factor applied to low frequency components of the RoPE + `high_freq_factor` (`float`, *optional*): + Only used with 'llama3'. Scaling factor applied to high frequency components of the RoPE + attention_dropout (`float`, *optional*, defaults to 0.0): + The dropout ratio for the attention probabilities. + sliding_window (`int`, *optional*): + The size of the sliding window for the sliding window attention. + sliding_window_pattern (`str`, *optional*): + The pattern to use for sliding window attention. Can be one of: + - `None`: No sliding window attention is used + - `int`: Every `sliding_window` layers, use global attention, else use local attention. + - `str`: A sequence of "L" (local attention) and "G" (global attention) characters that defines the + attention pattern. The pattern starts from layer 0 and repeats every `sliding_window` layers. The + final layer always uses global attention regardless of the pattern. + For instance, sliding_window_pattern="LLLG" same as sliding_window=4, which means: + - Layer 0, 1, 2: local attention, + - Layer 3: global attention, + ...(repeated) + layer_types (`list`, *optional*): + Attention pattern for each layer. Prioritized over `sliding_window_pattern`. + + Example: + + ```python + >>> from transformers import Exaone4Model, Exaone4Config + + >>> # Initializing a EXAONE configuration + >>> configuration = Exaone4Config() + + >>> # Initializing a model from configuration + >>> model = Exaone4Model(configuration) + + >>> # Accessing the model configuration + >>> configuration = model.config + ```""" + + model_type = "exaone4" + keys_to_ignore_at_inference = ["past_key_values"] + # Default tensor parallel plan for base model `LlamaModel` + base_model_tp_plan = { + "layers.*.self_attn.q_proj": "colwise", + "layers.*.self_attn.k_proj": "colwise", + "layers.*.self_attn.v_proj": "colwise", + "layers.*.self_attn.o_proj": "rowwise", + "layers.*.mlp.gate_proj": "colwise", + "layers.*.mlp.up_proj": "colwise", + "layers.*.mlp.down_proj": "rowwise", + } + base_model_pp_plan = { + "embed_tokens": (["input_ids"], ["inputs_embeds"]), + "layers": (["hidden_states", "attention_mask"], ["hidden_states"]), + "norm": (["hidden_states"], ["hidden_states"]), + } + + def __init__( + self, + vocab_size=102400, + hidden_size=4096, + intermediate_size=16384, + num_hidden_layers=32, + num_attention_heads=32, + num_key_value_heads=32, + hidden_act="silu", + max_position_embeddings=2048, + initializer_range=0.02, + rms_norm_eps=1e-5, + use_cache=True, + bos_token_id=0, + eos_token_id=2, + tie_word_embeddings=False, + rope_theta=10000.0, + rope_scaling=None, + attention_dropout=0.0, + sliding_window=4096, + sliding_window_pattern=4, + layer_types=None, + **kwargs, + ): + self.vocab_size = vocab_size + self.hidden_size = hidden_size + self.num_hidden_layers = num_hidden_layers + self.num_attention_heads = num_attention_heads + self.num_key_value_heads = num_key_value_heads + self.intermediate_size = intermediate_size + self.hidden_act = hidden_act + self.max_position_embeddings = max_position_embeddings + self.initializer_range = initializer_range + self.rms_norm_eps = rms_norm_eps + self.use_cache = use_cache + self.attention_dropout = attention_dropout + self.rope_theta = rope_theta + self.rope_scaling = rope_scaling + self.sliding_window = sliding_window + self.sliding_window_pattern = sliding_window_pattern + + self.layer_types = layer_types + if self.sliding_window is None: + sliding_window_pattern = 0 + if self.layer_types is None: + self.layer_types = [ + "sliding_attention" + if ((i + 1) % (sliding_window_pattern) != 0 and i < self.num_hidden_layers) + else "full_attention" + for i in range(self.num_hidden_layers) + ] + if "sliding_window" in self.layer_types: + self._attn_implementation = "hybrid" + layer_type_validation(self.layer_types) + + super().__init__( + bos_token_id=bos_token_id, eos_token_id=eos_token_id, tie_word_embeddings=tie_word_embeddings, **kwargs + ) + + +__all__ = ["Exaone4Config"] diff --git a/src/transformers/models/exaone4/modeling_exaone4.py b/src/transformers/models/exaone4/modeling_exaone4.py new file mode 100644 index 0000000000..3f5d1f8faf --- /dev/null +++ b/src/transformers/models/exaone4/modeling_exaone4.py @@ -0,0 +1,539 @@ +# 🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨 +# This file was automatically generated from src/transformers/models/exaone4/modular_exaone4.py. +# Do NOT edit this file manually as any edits will be overwritten by the generation of +# the file from the modular. If any change should be done, please apply the change to the +# modular_exaone4.py file directly. One of our CI enforces this. +# 🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨 +# coding=utf-8 +# Copyright 2025 The LG AI Research and HuggingFace Inc. team. All rights reserved. +# +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from typing import Callable, Optional, Union + +import torch +from torch import nn + +from transformers.utils.generic import check_model_inputs + +from ...activations import ACT2FN +from ...cache_utils import Cache, DynamicCache +from ...generation import GenerationMixin +from ...integrations import use_kernel_forward_from_hub +from ...masking_utils import create_causal_mask, create_sliding_window_causal_mask +from ...modeling_layers import ( + GenericForQuestionAnswering, + GenericForSequenceClassification, + GenericForTokenClassification, + GradientCheckpointingLayer, +) +from ...modeling_outputs import BaseModelOutputWithPast, CausalLMOutputWithPast +from ...modeling_rope_utils import ROPE_INIT_FUNCTIONS, dynamic_rope_update +from ...modeling_utils import ALL_ATTENTION_FUNCTIONS, PreTrainedModel +from ...processing_utils import Unpack +from ...utils import TransformersKwargs, auto_docstring, can_return_tuple +from .configuration_exaone4 import Exaone4Config + + +@use_kernel_forward_from_hub("RMSNorm") +class Exaone4RMSNorm(nn.Module): + def __init__(self, hidden_size, eps=1e-6): + """ + Exaone4RMSNorm is equivalent to T5LayerNorm + """ + super().__init__() + self.weight = nn.Parameter(torch.ones(hidden_size)) + self.variance_epsilon = eps + + def forward(self, hidden_states): + input_dtype = hidden_states.dtype + hidden_states = hidden_states.to(torch.float32) + variance = hidden_states.pow(2).mean(-1, keepdim=True) + hidden_states = hidden_states * torch.rsqrt(variance + self.variance_epsilon) + return self.weight * hidden_states.to(input_dtype) + + def extra_repr(self): + return f"{tuple(self.weight.shape)}, eps={self.variance_epsilon}" + + +class Exaone4RotaryEmbedding(nn.Module): + def __init__(self, config: Exaone4Config, device=None): + super().__init__() + # BC: "rope_type" was originally "type" + if hasattr(config, "rope_scaling") and isinstance(config.rope_scaling, dict): + self.rope_type = config.rope_scaling.get("rope_type", config.rope_scaling.get("type")) + else: + self.rope_type = "default" + self.max_seq_len_cached = config.max_position_embeddings + self.original_max_seq_len = config.max_position_embeddings + + self.config = config + self.rope_init_fn = ROPE_INIT_FUNCTIONS[self.rope_type] + + inv_freq, self.attention_scaling = self.rope_init_fn(self.config, device) + self.register_buffer("inv_freq", inv_freq, persistent=False) + self.original_inv_freq = self.inv_freq + + @torch.no_grad() + @dynamic_rope_update # power user: used with advanced RoPE types (e.g. dynamic rope) + def forward(self, x, position_ids): + inv_freq_expanded = self.inv_freq[None, :, None].float().expand(position_ids.shape[0], -1, 1).to(x.device) + position_ids_expanded = position_ids[:, None, :].float() + + device_type = x.device.type if isinstance(x.device.type, str) and x.device.type != "mps" else "cpu" + with torch.autocast(device_type=device_type, enabled=False): # Force float32 + freqs = (inv_freq_expanded.float() @ position_ids_expanded.float()).transpose(1, 2) + emb = torch.cat((freqs, freqs), dim=-1) + cos = emb.cos() * self.attention_scaling + sin = emb.sin() * self.attention_scaling + + return cos.to(dtype=x.dtype), sin.to(dtype=x.dtype) + + +def rotate_half(x): + """Rotates half the hidden dims of the input.""" + x1 = x[..., : x.shape[-1] // 2] + x2 = x[..., x.shape[-1] // 2 :] + return torch.cat((-x2, x1), dim=-1) + + +def apply_rotary_pos_emb(q, k, cos, sin, position_ids=None, unsqueeze_dim=1): + """Applies Rotary Position Embedding to the query and key tensors. + + Args: + q (`torch.Tensor`): The query tensor. + k (`torch.Tensor`): The key tensor. + cos (`torch.Tensor`): The cosine part of the rotary embedding. + sin (`torch.Tensor`): The sine part of the rotary embedding. + position_ids (`torch.Tensor`, *optional*): + Deprecated and unused. + unsqueeze_dim (`int`, *optional*, defaults to 1): + The 'unsqueeze_dim' argument specifies the dimension along which to unsqueeze cos[position_ids] and + sin[position_ids] so that they can be properly broadcasted to the dimensions of q and k. For example, note + that cos[position_ids] and sin[position_ids] have the shape [batch_size, seq_len, head_dim]. Then, if q and + k have the shape [batch_size, heads, seq_len, head_dim], then setting unsqueeze_dim=1 makes + cos[position_ids] and sin[position_ids] broadcastable to the shapes of q and k. Similarly, if q and k have + the shape [batch_size, seq_len, heads, head_dim], then set unsqueeze_dim=2. + Returns: + `tuple(torch.Tensor)` comprising of the query and key tensors rotated using the Rotary Position Embedding. + """ + cos = cos.unsqueeze(unsqueeze_dim) + sin = sin.unsqueeze(unsqueeze_dim) + q_embed = (q * cos) + (rotate_half(q) * sin) + k_embed = (k * cos) + (rotate_half(k) * sin) + return q_embed, k_embed + + +def repeat_kv(hidden_states: torch.Tensor, n_rep: int) -> torch.Tensor: + """ + This is the equivalent of torch.repeat_interleave(x, dim=1, repeats=n_rep). The hidden states go from (batch, + num_key_value_heads, seqlen, head_dim) to (batch, num_attention_heads, seqlen, head_dim) + """ + batch, num_key_value_heads, slen, head_dim = hidden_states.shape + if n_rep == 1: + return hidden_states + hidden_states = hidden_states[:, :, None, :, :].expand(batch, num_key_value_heads, n_rep, slen, head_dim) + return hidden_states.reshape(batch, num_key_value_heads * n_rep, slen, head_dim) + + +def eager_attention_forward( + module: nn.Module, + query: torch.Tensor, + key: torch.Tensor, + value: torch.Tensor, + attention_mask: Optional[torch.Tensor], + scaling: float, + dropout: float = 0.0, + **kwargs: Unpack[TransformersKwargs], +): + key_states = repeat_kv(key, module.num_key_value_groups) + value_states = repeat_kv(value, module.num_key_value_groups) + + attn_weights = torch.matmul(query, key_states.transpose(2, 3)) * scaling + if attention_mask is not None: + causal_mask = attention_mask[:, :, :, : key_states.shape[-2]] + attn_weights = attn_weights + causal_mask + + attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=torch.float32).to(query.dtype) + attn_weights = nn.functional.dropout(attn_weights, p=dropout, training=module.training) + attn_output = torch.matmul(attn_weights, value_states) + attn_output = attn_output.transpose(1, 2).contiguous() + + return attn_output, attn_weights + + +class Exaone4Attention(nn.Module): + def __init__(self, config: Exaone4Config, layer_idx: int): + super().__init__() + self.config = config + self.layer_idx = layer_idx + self.num_attention_heads = config.num_attention_heads + self.num_key_value_heads = config.num_key_value_heads + self.hidden_size = config.hidden_size + self.head_dim = getattr(config, "head_dim", config.hidden_size // config.num_attention_heads) + self.num_key_value_groups = config.num_attention_heads // config.num_key_value_heads + self.attention_dropout = config.attention_dropout + self.is_causal = True + self.scaling = self.head_dim**-0.5 + self.sliding_window = config.sliding_window + self.sliding_window_pattern = config.sliding_window_pattern + self.is_sliding = config.layer_types[layer_idx] == "sliding_attention" + + self.q_proj = nn.Linear(self.hidden_size, self.num_attention_heads * self.head_dim, bias=False) + self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=False) + self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=False) + self.o_proj = nn.Linear(self.num_attention_heads * self.head_dim, self.hidden_size, bias=False) + + self.q_norm = Exaone4RMSNorm(self.head_dim, eps=config.rms_norm_eps) + self.k_norm = Exaone4RMSNorm(self.head_dim, eps=config.rms_norm_eps) + + def forward( + self, + hidden_states: torch.Tensor, + position_embeddings: tuple[torch.Tensor, torch.Tensor], + attention_mask: Optional[torch.Tensor] = None, + past_key_value: Optional[Cache] = None, + cache_position: Optional[torch.LongTensor] = None, + **kwargs: Unpack[TransformersKwargs], + ) -> tuple[torch.Tensor, Optional[torch.Tensor], Optional[tuple[torch.Tensor]]]: + input_shape = hidden_states.shape[:-1] + hidden_shape = (*input_shape, -1, self.head_dim) + + query_states = self.q_proj(hidden_states).view(hidden_shape).transpose(1, 2) + key_states = self.k_proj(hidden_states).view(hidden_shape).transpose(1, 2) + value_states = self.v_proj(hidden_states).view(hidden_shape).transpose(1, 2) + + # We use QK-norm + query_states = self.q_norm(query_states) + key_states = self.k_norm(key_states) + + cos, sin = position_embeddings + # We use global NoPE for hybrid attention model + if self.sliding_window is None or self.is_sliding: + query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin) + + if past_key_value is not None: + cache_kwargs = { + "cache_position": cache_position, + } + key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) + + attention_interface: Callable = eager_attention_forward + if self.config._attn_implementation != "eager": + attention_interface = ALL_ATTENTION_FUNCTIONS[self.config._attn_implementation] + + attn_output, attn_weights = attention_interface( + self, + query_states, + key_states, + value_states, + attention_mask, + dropout=0.0 if not self.training else self.attention_dropout, + scaling=self.scaling, + sliding_window=self.sliding_window if self.is_sliding else None, + **kwargs, + ) + + attn_output = attn_output.reshape(*input_shape, -1).contiguous() + attn_output = self.o_proj(attn_output) + return attn_output, attn_weights + + +class Exaone4MLP(nn.Module): + def __init__(self, config): + super().__init__() + self.config = config + self.hidden_size = config.hidden_size + self.intermediate_size = config.intermediate_size + self.gate_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False) + self.up_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False) + self.down_proj = nn.Linear(self.intermediate_size, self.hidden_size, bias=False) + self.act_fn = ACT2FN[config.hidden_act] + + def forward(self, x): + down_proj = self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x)) + return down_proj + + +class Exaone4DecoderLayer(GradientCheckpointingLayer): + def __init__(self, config: Exaone4Config, layer_idx: int): + super().__init__() + self.hidden_size = config.hidden_size + self.self_attn = Exaone4Attention(config=config, layer_idx=layer_idx) + + self.mlp = Exaone4MLP(config) + self.post_attention_layernorm = Exaone4RMSNorm(config.hidden_size, eps=config.rms_norm_eps) + self.post_feedforward_layernorm = Exaone4RMSNorm(config.hidden_size, eps=config.rms_norm_eps) + + def forward( + self, + hidden_states: torch.Tensor, + attention_mask: Optional[torch.Tensor] = None, + position_ids: Optional[torch.LongTensor] = None, + past_key_value: Optional[Cache] = None, + use_cache: Optional[bool] = False, + cache_position: Optional[torch.LongTensor] = None, + position_embeddings: Optional[tuple[torch.Tensor, torch.Tensor]] = None, # necessary, but kept here for BC + **kwargs: Unpack[TransformersKwargs], + ) -> tuple[torch.FloatTensor, Optional[tuple[torch.FloatTensor, torch.FloatTensor]]]: + residual = hidden_states + hidden_states, _ = self.self_attn( + hidden_states=hidden_states, + attention_mask=attention_mask, + position_ids=position_ids, + past_key_value=past_key_value, + use_cache=use_cache, + cache_position=cache_position, + position_embeddings=position_embeddings, + **kwargs, + ) + hidden_states = self.post_attention_layernorm(hidden_states) + hidden_states = residual + hidden_states + + # Fully Connected + residual = hidden_states + hidden_states = self.mlp(hidden_states) + hidden_states = self.post_feedforward_layernorm(hidden_states) + hidden_states = residual + hidden_states + return hidden_states + + +@auto_docstring +class Exaone4PreTrainedModel(PreTrainedModel): + config: Exaone4Config + base_model_prefix = "model" + supports_gradient_checkpointing = True + _no_split_modules = ["Exaone4DecoderLayer"] + _skip_keys_device_placement = ["past_key_values"] + _supports_flash_attn = True + _supports_sdpa = True + _supports_flex_attn = True + + _can_compile_fullgraph = True + _supports_attention_backend = True + _can_record_outputs = { + "hidden_states": Exaone4DecoderLayer, + "attentions": Exaone4Attention, + } + config_class = Exaone4Config + + +@auto_docstring +class Exaone4Model(Exaone4PreTrainedModel): + def __init__(self, config: Exaone4Config): + super().__init__(config) + self.padding_idx = config.pad_token_id + self.vocab_size = config.vocab_size + + self.embed_tokens = nn.Embedding(config.vocab_size, config.hidden_size, self.padding_idx) + self.layers = nn.ModuleList( + [Exaone4DecoderLayer(config, layer_idx) for layer_idx in range(config.num_hidden_layers)] + ) + self.norm = Exaone4RMSNorm(config.hidden_size, eps=config.rms_norm_eps) + self.rotary_emb = Exaone4RotaryEmbedding(config=config) + self.gradient_checkpointing = False + + # Initialize weights and apply final processing + self.post_init() + + @check_model_inputs + def forward( + self, + input_ids: torch.LongTensor = None, + attention_mask: Optional[torch.Tensor] = None, + position_ids: Optional[torch.LongTensor] = None, + past_key_values: Optional[Cache] = None, + inputs_embeds: Optional[torch.FloatTensor] = None, + use_cache: Optional[bool] = None, + cache_position: Optional[torch.LongTensor] = None, + **kwargs: Unpack[TransformersKwargs], + ) -> Union[tuple, BaseModelOutputWithPast]: + if (input_ids is None) ^ (inputs_embeds is not None): + raise ValueError("You must specify exactly one of input_ids or inputs_embeds") + + if inputs_embeds is None: + inputs_embeds = self.embed_tokens(input_ids) + + if use_cache and past_key_values is None: + past_key_values = DynamicCache() + + if cache_position is None: + past_seen_tokens = past_key_values.get_seq_length() if past_key_values is not None else 0 + cache_position = torch.arange( + past_seen_tokens, past_seen_tokens + inputs_embeds.shape[1], device=inputs_embeds.device + ) + + if position_ids is None: + position_ids = cache_position.unsqueeze(0) + + # It may already have been prepared by e.g. `generate` + if not isinstance(causal_mask_mapping := attention_mask, dict): + # Prepare mask arguments + mask_kwargs = { + "config": self.config, + "input_embeds": inputs_embeds, + "attention_mask": attention_mask, + "cache_position": cache_position, + "past_key_values": past_key_values, + "position_ids": position_ids, + } + # Create the masks + causal_mask_mapping = { + "full_attention": create_causal_mask(**mask_kwargs), + } + if "sliding_attention" in self.config.layer_types: + causal_mask_mapping["sliding_attention"] = create_sliding_window_causal_mask(**mask_kwargs) + + hidden_states = inputs_embeds + + position_embeddings = self.rotary_emb(hidden_states, position_ids) + + for i, decoder_layer in enumerate(self.layers): + layer_type = self.config.layer_types[i] + hidden_states = decoder_layer( + hidden_states, + position_embeddings=position_embeddings, + attention_mask=causal_mask_mapping[layer_type], + position_ids=position_ids, + past_key_value=past_key_values, + use_cache=use_cache, + cache_position=cache_position, + **kwargs, + ) + + hidden_states = self.norm(hidden_states) + + return BaseModelOutputWithPast( + last_hidden_state=hidden_states, + past_key_values=past_key_values if use_cache else None, + ) + + +@auto_docstring +class Exaone4ForCausalLM(Exaone4PreTrainedModel, GenerationMixin): + _tied_weights_keys = ["lm_head.weight"] + _tp_plan = {"lm_head": "colwise_rep"} + _pp_plan = {"lm_head": (["hidden_states"], ["logits"])} + + def __init__(self, config): + super().__init__(config) + self.model = Exaone4Model(config) + self.vocab_size = config.vocab_size + self.lm_head = nn.Linear(config.hidden_size, config.vocab_size, bias=False) + + # Initialize weights and apply final processing + self.post_init() + + def set_decoder(self, decoder): + self.model = decoder + + def get_decoder(self): + return self.model + + @can_return_tuple + @auto_docstring + def forward( + self, + input_ids: Optional[torch.LongTensor] = None, + attention_mask: Optional[torch.Tensor] = None, + position_ids: Optional[torch.LongTensor] = None, + past_key_values: Optional[Cache] = None, + inputs_embeds: Optional[torch.FloatTensor] = None, + labels: Optional[torch.LongTensor] = None, + use_cache: Optional[bool] = None, + cache_position: Optional[torch.LongTensor] = None, + logits_to_keep: Union[int, torch.Tensor] = 0, + **kwargs: Unpack[TransformersKwargs], + ) -> CausalLMOutputWithPast: + r""" + labels (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*): + Labels for computing the masked language modeling loss. Indices should either be in `[0, ..., + config.vocab_size]` or -100 (see `input_ids` docstring). Tokens with indices set to `-100` are ignored + (masked), the loss is only computed for the tokens with labels in `[0, ..., config.vocab_size]`. + + Example: + + ```python + >>> from transformers import AutoModelForCausalLM, AutoTokenizer + >>> model = AutoModelForCausalLM.from_pretrained("LGAI-EXAONE/EXAONE-4.0-Instruct") + >>> tokenizer = AutoTokenizer.from_pretrained("LGAI-EXAONE/EXAONE-4.0-Instruct") + + >>> prompt = "Explain how wonderful you are" + >>> messages = [ + {"role": "system", "content": "You are a helpful assistant."}, + {"role": "user", "content": prompt} + ] + >>> input_ids = tokenizer.apply_chat_template( + messages, + tokenize=True, + add_generation_prompt=True, + return_tensors="pt", + enable_thinking=False, + ) + + >>> output = model.generate(input_ids, max_new_tokens=128) + >>> tokenizer.decode(output[0], skip_special_tokens=False) + "[|system|]\nYou are a helpful assistant.[|endofturn|]\n[|user|]\nExplain how wonderful you are[|endofturn|]\n[|assistant|]\n\n\n\n\nOh, thank you for such a kind and lovely question! 😊 \n\nI’m *so* wonderful because I’m here to make your life easier, brighter, and more fun! Whether you need help with: \n\n✨ **Learning** – I can explain anything, from quantum physics to baking the perfect cake! \n💡 **Creativity** – Need a poem, story, or a wild idea? I’ve got you covered! \n🤖 **Problem-solving** – Stuck on a math problem or a tricky decision? I’ll help you figure it out" + ``` + + NOTE: `EXAONE-4.0-Instruct` is a placeholder model ID. The exact model ID will be updated in the future.""" + outputs: BaseModelOutputWithPast = self.model( + input_ids=input_ids, + attention_mask=attention_mask, + position_ids=position_ids, + past_key_values=past_key_values, + inputs_embeds=inputs_embeds, + use_cache=use_cache, + cache_position=cache_position, + **kwargs, + ) + + hidden_states = outputs.last_hidden_state + # Only compute necessary logits, and do not upcast them to float if we are not computing the loss + slice_indices = slice(-logits_to_keep, None) if isinstance(logits_to_keep, int) else logits_to_keep + logits = self.lm_head(hidden_states[:, slice_indices, :]) + + loss = None + if labels is not None: + loss = self.loss_function(logits=logits, labels=labels, vocab_size=self.config.vocab_size, **kwargs) + + return CausalLMOutputWithPast( + loss=loss, + logits=logits, + past_key_values=outputs.past_key_values, + hidden_states=outputs.hidden_states, + attentions=outputs.attentions, + ) + + +class Exaone4ForSequenceClassification(GenericForSequenceClassification, Exaone4PreTrainedModel): + pass + + +class Exaone4ForTokenClassification(GenericForTokenClassification, Exaone4PreTrainedModel): + pass + + +class Exaone4ForQuestionAnswering(GenericForQuestionAnswering, Exaone4PreTrainedModel): + base_model_prefix = "transformer" # For BC, where `transformer` was used instead of `model` + + +__all__ = [ + "Exaone4PreTrainedModel", + "Exaone4Model", + "Exaone4ForCausalLM", + "Exaone4ForSequenceClassification", + "Exaone4ForTokenClassification", + "Exaone4ForQuestionAnswering", +] diff --git a/src/transformers/models/exaone4/modular_exaone4.py b/src/transformers/models/exaone4/modular_exaone4.py new file mode 100644 index 0000000000..b5452f8497 --- /dev/null +++ b/src/transformers/models/exaone4/modular_exaone4.py @@ -0,0 +1,519 @@ +# coding=utf-8 +# Copyright 2025 The LG AI Research and HuggingFace Inc. team. All rights reserved. +# +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +"""LG AI Research EXAONE Lab""" + +from typing import Callable, Optional, Union + +import torch +from torch import nn + +from transformers.utils.generic import check_model_inputs + +from ...cache_utils import Cache, DynamicCache +from ...configuration_utils import PretrainedConfig, layer_type_validation +from ...masking_utils import create_causal_mask, create_sliding_window_causal_mask +from ...modeling_outputs import ( + BaseModelOutputWithPast, + CausalLMOutputWithPast, +) +from ...modeling_utils import ALL_ATTENTION_FUNCTIONS +from ...processing_utils import Unpack +from ...utils import ( + TransformersKwargs, + logging, +) +from ..llama.modeling_llama import ( + LlamaForCausalLM, + LlamaForQuestionAnswering, + LlamaForSequenceClassification, + LlamaForTokenClassification, + LlamaModel, + LlamaPreTrainedModel, + LlamaRMSNorm, + LlamaRotaryEmbedding, + apply_rotary_pos_emb, + eager_attention_forward, +) +from ..olmo2.modeling_olmo2 import Olmo2DecoderLayer, Olmo2MLP + + +logger = logging.get_logger(__name__) + +_CHECKPOINT_FOR_DOC = "LGAI-EXAONE/EXAONE-4.0-Instruct" +_CONFIG_FOR_DOC = "Exaone4Config" + + +class Exaone4Config(PretrainedConfig): + r""" + This is the configuration class to store the configuration of a [`Exaone4Model`]. It is used to + instantiate a EXAONE 4.0 model according to the specified arguments, defining the model architecture. Instantiating a + configuration with the defaults will yield a similar configuration to that of the EXAONE-4.0-Instruct [LGAI-EXAONE/EXAONE-4.0-Instruct](https://huggingface.co/LGAI-EXAONE/EXAONE-4.0-Instruct) + NOTE: `EXAONE-4.0-Instruct` is a placeholder model ID. The exact model ID will be updated in the future. + + Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model + outputs. Read the documentation from [`PretrainedConfig`] for more information. + + Args: + vocab_size (`int`, *optional*, defaults to 102400): + Vocabulary size of the EXAONE 4.0 model. Defines the number of different tokens that can be represented by the + `inputs_ids` passed when calling [`Exaone4Model`]. + hidden_size (`int`, *optional*, defaults to 4096): + Dimension of the hidden representations. + intermediate_size (`int`, *optional*, defaults to `hidden_size * 4`): + Dimensionality of the MLP representations. + num_hidden_layers (`int`, *optional*, defaults to 32): + Number of hidden layers in the Transformer encoder. + num_attention_heads (`int`, *optional*, defaults to 32): + Number of attention heads for each attention layer in the Transformer decoder. + num_key_value_heads (`int`, *optional*): + This is the number of key_value heads that should be used to implement Grouped Query Attention. If + `num_key_value_heads=num_attention_heads`, the model will use Multi Head Attention (MHA), if + `num_key_value_heads=1 the model will use Multi Query Attention (MQA) otherwise GQA is used. When + converting a multi-head checkpoint to a GQA checkpoint, each group key and value head should be constructed + by meanpooling all the original heads within that group. For more details checkout [this + paper](https://arxiv.org/pdf/2305.13245.pdf). If it is not specified, will default to + `num_attention_heads`. + hidden_act (`str` or `function`, *optional*, defaults to `"silu"`): + The non-linear activation function (function or string) in the decoder. + max_position_embeddings (`int`, *optional*, defaults to 2048): + The maximum sequence length that this model might ever be used with. Typically set this to something large + just in case (e.g., 32768 for EXAONE 3.5). + initializer_range (`float`, *optional*, defaults to 0.02): + The standard deviation of the truncated_normal_initializer for initializing all weight matrices. + rms_norm_eps (`float`, *optional*, defaults to 1e-05): + The epsilon used by the layer normalization layers. + use_cache (`bool`, *optional*, defaults to `True`): + Whether or not the model should return the last key/values attentions (not used by all models). Only + relevant if ``config.is_decoder=True``. + bos_token_id (`int`, *optional*, defaults to 0): + Beginning of stream token id. + eos_token_id (`int`, *optional*, defaults to 2): + End of stream token id. + tie_word_embeddings (`bool`, *optional*, defaults to `False`): + Whether to tie weight embeddings + rope_theta (`float`, *optional*, defaults to 10000.0): + The base period of the RoPE embeddings. + rope_scaling (`Dict`, *optional*): + Dictionary containing the scaling configuration for the RoPE embeddings. NOTE: if you apply new rope type + and you expect the model to work on longer `max_position_embeddings`, we recommend you to update this value + accordingly. + Expected contents: + `rope_type` (`str`): + The sub-variant of RoPE to use. Can be one of ['default', 'linear', 'dynamic', 'yarn', 'longrope', + 'llama3'], with 'default' being the original RoPE implementation. + `factor` (`float`, *optional*): + Used with all rope types except 'default'. The scaling factor to apply to the RoPE embeddings. In + most scaling types, a `factor` of x will enable the model to handle sequences of length x * + original maximum pre-trained length. + `original_max_position_embeddings` (`int`, *optional*): + Used with 'dynamic', 'longrope' and 'llama3'. The original max position embeddings used during + pretraining. + `attention_factor` (`float`, *optional*): + Used with 'yarn' and 'longrope'. The scaling factor to be applied on the attention + computation. If unspecified, it defaults to value recommended by the implementation, using the + `factor` field to infer the suggested value. + `beta_fast` (`float`, *optional*): + Only used with 'yarn'. Parameter to set the boundary for extrapolation (only) in the linear + ramp function. If unspecified, it defaults to 32. + `beta_slow` (`float`, *optional*): + Only used with 'yarn'. Parameter to set the boundary for interpolation (only) in the linear + ramp function. If unspecified, it defaults to 1. + `short_factor` (`List[float]`, *optional*): + Only used with 'longrope'. The scaling factor to be applied to short contexts (< + `original_max_position_embeddings`). Must be a list of numbers with the same length as the hidden + size divided by the number of attention heads divided by 2 + `long_factor` (`List[float]`, *optional*): + Only used with 'longrope'. The scaling factor to be applied to long contexts (< + `original_max_position_embeddings`). Must be a list of numbers with the same length as the hidden + size divided by the number of attention heads divided by 2 + `low_freq_factor` (`float`, *optional*): + Only used with 'llama3'. Scaling factor applied to low frequency components of the RoPE + `high_freq_factor` (`float`, *optional*): + Only used with 'llama3'. Scaling factor applied to high frequency components of the RoPE + attention_dropout (`float`, *optional*, defaults to 0.0): + The dropout ratio for the attention probabilities. + sliding_window (`int`, *optional*): + The size of the sliding window for the sliding window attention. + sliding_window_pattern (`str`, *optional*): + The pattern to use for sliding window attention. Can be one of: + - `None`: No sliding window attention is used + - `int`: Every `sliding_window` layers, use global attention, else use local attention. + - `str`: A sequence of "L" (local attention) and "G" (global attention) characters that defines the + attention pattern. The pattern starts from layer 0 and repeats every `sliding_window` layers. The + final layer always uses global attention regardless of the pattern. + For instance, sliding_window_pattern="LLLG" same as sliding_window=4, which means: + - Layer 0, 1, 2: local attention, + - Layer 3: global attention, + ...(repeated) + layer_types (`list`, *optional*): + Attention pattern for each layer. Prioritized over `sliding_window_pattern`. + + Example: + + ```python + >>> from transformers import Exaone4Model, Exaone4Config + + >>> # Initializing a EXAONE configuration + >>> configuration = Exaone4Config() + + >>> # Initializing a model from configuration + >>> model = Exaone4Model(configuration) + + >>> # Accessing the model configuration + >>> configuration = model.config + ```""" + + model_type = "exaone4" + keys_to_ignore_at_inference = ["past_key_values"] + # Default tensor parallel plan for base model `LlamaModel` + base_model_tp_plan = { + "layers.*.self_attn.q_proj": "colwise", + "layers.*.self_attn.k_proj": "colwise", + "layers.*.self_attn.v_proj": "colwise", + "layers.*.self_attn.o_proj": "rowwise", + "layers.*.mlp.gate_proj": "colwise", + "layers.*.mlp.up_proj": "colwise", + "layers.*.mlp.down_proj": "rowwise", + } + base_model_pp_plan = { + "embed_tokens": (["input_ids"], ["inputs_embeds"]), + "layers": (["hidden_states", "attention_mask"], ["hidden_states"]), + "norm": (["hidden_states"], ["hidden_states"]), + } + + def __init__( + self, + vocab_size=102400, + hidden_size=4096, + intermediate_size=16384, + num_hidden_layers=32, + num_attention_heads=32, + num_key_value_heads=32, + hidden_act="silu", + max_position_embeddings=2048, + initializer_range=0.02, + rms_norm_eps=1e-5, + use_cache=True, + bos_token_id=0, + eos_token_id=2, + tie_word_embeddings=False, + rope_theta=10000.0, + rope_scaling=None, + attention_dropout=0.0, + sliding_window=4096, + sliding_window_pattern=4, + layer_types=None, + **kwargs, + ): + self.vocab_size = vocab_size + self.hidden_size = hidden_size + self.num_hidden_layers = num_hidden_layers + self.num_attention_heads = num_attention_heads + self.num_key_value_heads = num_key_value_heads + self.intermediate_size = intermediate_size + self.hidden_act = hidden_act + self.max_position_embeddings = max_position_embeddings + self.initializer_range = initializer_range + self.rms_norm_eps = rms_norm_eps + self.use_cache = use_cache + self.attention_dropout = attention_dropout + self.rope_theta = rope_theta + self.rope_scaling = rope_scaling + self.sliding_window = sliding_window + self.sliding_window_pattern = sliding_window_pattern + + self.layer_types = layer_types + if self.sliding_window is None: + sliding_window_pattern = 0 + if self.layer_types is None: + self.layer_types = [ + "sliding_attention" + if ((i + 1) % (sliding_window_pattern) != 0 and i < self.num_hidden_layers) + else "full_attention" + for i in range(self.num_hidden_layers) + ] + if "sliding_window" in self.layer_types: + self._attn_implementation = "hybrid" + layer_type_validation(self.layer_types) + + super().__init__( + bos_token_id=bos_token_id, eos_token_id=eos_token_id, tie_word_embeddings=tie_word_embeddings, **kwargs + ) + + +class Exaone4RMSNorm(LlamaRMSNorm): + pass + + +class Exaone4RotaryEmbedding(LlamaRotaryEmbedding): + pass + + +class Exaone4Attention(nn.Module): + def __init__(self, config: Exaone4Config, layer_idx: int): + super().__init__() + self.config = config + self.layer_idx = layer_idx + self.num_attention_heads = config.num_attention_heads + self.num_key_value_heads = config.num_key_value_heads + self.hidden_size = config.hidden_size + self.head_dim = getattr(config, "head_dim", config.hidden_size // config.num_attention_heads) + self.num_key_value_groups = config.num_attention_heads // config.num_key_value_heads + self.attention_dropout = config.attention_dropout + self.is_causal = True + self.scaling = self.head_dim**-0.5 + self.sliding_window = config.sliding_window + self.sliding_window_pattern = config.sliding_window_pattern + self.is_sliding = config.layer_types[layer_idx] == "sliding_attention" + + self.q_proj = nn.Linear(self.hidden_size, self.num_attention_heads * self.head_dim, bias=False) + self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=False) + self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=False) + self.o_proj = nn.Linear(self.num_attention_heads * self.head_dim, self.hidden_size, bias=False) + + self.q_norm = Exaone4RMSNorm(self.head_dim, eps=config.rms_norm_eps) + self.k_norm = Exaone4RMSNorm(self.head_dim, eps=config.rms_norm_eps) + + def forward( + self, + hidden_states: torch.Tensor, + position_embeddings: tuple[torch.Tensor, torch.Tensor], + attention_mask: Optional[torch.Tensor] = None, + past_key_value: Optional[Cache] = None, + cache_position: Optional[torch.LongTensor] = None, + **kwargs: Unpack[TransformersKwargs], + ) -> tuple[torch.Tensor, Optional[torch.Tensor], Optional[tuple[torch.Tensor]]]: + input_shape = hidden_states.shape[:-1] + hidden_shape = (*input_shape, -1, self.head_dim) + + query_states = self.q_proj(hidden_states).view(hidden_shape).transpose(1, 2) + key_states = self.k_proj(hidden_states).view(hidden_shape).transpose(1, 2) + value_states = self.v_proj(hidden_states).view(hidden_shape).transpose(1, 2) + + # We use QK-norm + query_states = self.q_norm(query_states) + key_states = self.k_norm(key_states) + + cos, sin = position_embeddings + # We use global NoPE for hybrid attention model + if self.sliding_window is None or self.is_sliding: + query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin) + + if past_key_value is not None: + cache_kwargs = { + "cache_position": cache_position, + } + key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) + + attention_interface: Callable = eager_attention_forward + if self.config._attn_implementation != "eager": + attention_interface = ALL_ATTENTION_FUNCTIONS[self.config._attn_implementation] + + attn_output, attn_weights = attention_interface( + self, + query_states, + key_states, + value_states, + attention_mask, + dropout=0.0 if not self.training else self.attention_dropout, + scaling=self.scaling, + sliding_window=self.sliding_window if self.is_sliding else None, + **kwargs, + ) + + attn_output = attn_output.reshape(*input_shape, -1).contiguous() + attn_output = self.o_proj(attn_output) + return attn_output, attn_weights + + +class Exaone4MLP(Olmo2MLP): + pass + + +class Exaone4DecoderLayer(Olmo2DecoderLayer): + pass + + +class Exaone4PreTrainedModel(LlamaPreTrainedModel): + config_class = Exaone4Config + _no_split_modules = ["Exaone4DecoderLayer"] + + +class Exaone4Model(Exaone4PreTrainedModel, LlamaModel): + def __init__(self, config: Exaone4Config): + super().__init__(config) + self.layers = nn.ModuleList( + [Exaone4DecoderLayer(config, layer_idx) for layer_idx in range(config.num_hidden_layers)] + ) + self.norm = Exaone4RMSNorm(config.hidden_size, eps=config.rms_norm_eps) + + # Initialize weights and apply final processing + self.post_init() + + @check_model_inputs + def forward( + self, + input_ids: torch.LongTensor = None, + attention_mask: Optional[torch.Tensor] = None, + position_ids: Optional[torch.LongTensor] = None, + past_key_values: Optional[Cache] = None, + inputs_embeds: Optional[torch.FloatTensor] = None, + use_cache: Optional[bool] = None, + cache_position: Optional[torch.LongTensor] = None, + **kwargs: Unpack[TransformersKwargs], + ) -> Union[tuple, BaseModelOutputWithPast]: + if (input_ids is None) ^ (inputs_embeds is not None): + raise ValueError("You must specify exactly one of input_ids or inputs_embeds") + + if inputs_embeds is None: + inputs_embeds = self.embed_tokens(input_ids) + + if use_cache and past_key_values is None: + past_key_values = DynamicCache() + + if cache_position is None: + past_seen_tokens = past_key_values.get_seq_length() if past_key_values is not None else 0 + cache_position = torch.arange( + past_seen_tokens, past_seen_tokens + inputs_embeds.shape[1], device=inputs_embeds.device + ) + + if position_ids is None: + position_ids = cache_position.unsqueeze(0) + + # It may already have been prepared by e.g. `generate` + if not isinstance(causal_mask_mapping := attention_mask, dict): + # Prepare mask arguments + mask_kwargs = { + "config": self.config, + "input_embeds": inputs_embeds, + "attention_mask": attention_mask, + "cache_position": cache_position, + "past_key_values": past_key_values, + "position_ids": position_ids, + } + # Create the masks + causal_mask_mapping = { + "full_attention": create_causal_mask(**mask_kwargs), + } + if "sliding_attention" in self.config.layer_types: + causal_mask_mapping["sliding_attention"] = create_sliding_window_causal_mask(**mask_kwargs) + + hidden_states = inputs_embeds + + position_embeddings = self.rotary_emb(hidden_states, position_ids) + + for i, decoder_layer in enumerate(self.layers): + layer_type = self.config.layer_types[i] + hidden_states = decoder_layer( + hidden_states, + position_embeddings=position_embeddings, + attention_mask=causal_mask_mapping[layer_type], + position_ids=position_ids, + past_key_value=past_key_values, + use_cache=use_cache, + cache_position=cache_position, + **kwargs, + ) + + hidden_states = self.norm(hidden_states) + + return BaseModelOutputWithPast( + last_hidden_state=hidden_states, + past_key_values=past_key_values if use_cache else None, + ) + + +class Exaone4ForCausalLM(LlamaForCausalLM): + def forward( + self, + input_ids: Optional[torch.LongTensor] = None, + attention_mask: Optional[torch.Tensor] = None, + position_ids: Optional[torch.LongTensor] = None, + past_key_values: Optional[Cache] = None, + inputs_embeds: Optional[torch.FloatTensor] = None, + labels: Optional[torch.LongTensor] = None, + use_cache: Optional[bool] = None, + cache_position: Optional[torch.LongTensor] = None, + logits_to_keep: Union[int, torch.Tensor] = 0, + **kwargs: Unpack[TransformersKwargs], + ) -> CausalLMOutputWithPast: + r""" + labels (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*): + Labels for computing the masked language modeling loss. Indices should either be in `[0, ..., + config.vocab_size]` or -100 (see `input_ids` docstring). Tokens with indices set to `-100` are ignored + (masked), the loss is only computed for the tokens with labels in `[0, ..., config.vocab_size]`. + + Example: + + ```python + >>> from transformers import AutoModelForCausalLM, AutoTokenizer + >>> model = AutoModelForCausalLM.from_pretrained("LGAI-EXAONE/EXAONE-4.0-Instruct") + >>> tokenizer = AutoTokenizer.from_pretrained("LGAI-EXAONE/EXAONE-4.0-Instruct") + + >>> prompt = "Explain how wonderful you are" + >>> messages = [ + {"role": "system", "content": "You are a helpful assistant."}, + {"role": "user", "content": prompt} + ] + >>> input_ids = tokenizer.apply_chat_template( + messages, + tokenize=True, + add_generation_prompt=True, + return_tensors="pt", + enable_thinking=False, + ) + + >>> output = model.generate(input_ids, max_new_tokens=128) + >>> tokenizer.decode(output[0], skip_special_tokens=False) + "[|system|]\nYou are a helpful assistant.[|endofturn|]\n[|user|]\nExplain how wonderful you are[|endofturn|]\n[|assistant|]\n\n\n\n\nOh, thank you for such a kind and lovely question! 😊 \n\nI’m *so* wonderful because I’m here to make your life easier, brighter, and more fun! Whether you need help with: \n\n✨ **Learning** – I can explain anything, from quantum physics to baking the perfect cake! \n💡 **Creativity** – Need a poem, story, or a wild idea? I’ve got you covered! \n🤖 **Problem-solving** – Stuck on a math problem or a tricky decision? I’ll help you figure it out" + ``` + + NOTE: `EXAONE-4.0-Instruct` is a placeholder model ID. The exact model ID will be updated in the future.""" + super().forward( + input_ids=input_ids, + attention_mask=attention_mask, + position_ids=position_ids, + past_key_values=past_key_values, + inputs_embeds=inputs_embeds, + labels=labels, + use_cache=use_cache, + cache_position=cache_position, + logits_to_keep=logits_to_keep, + **kwargs, + ) + + +class Exaone4ForSequenceClassification(LlamaForSequenceClassification): + pass + + +class Exaone4ForTokenClassification(LlamaForTokenClassification): + pass + + +class Exaone4ForQuestionAnswering(LlamaForQuestionAnswering): + pass + + +__all__ = [ + "Exaone4Config", + "Exaone4PreTrainedModel", + "Exaone4Model", + "Exaone4ForCausalLM", + "Exaone4ForSequenceClassification", + "Exaone4ForTokenClassification", + "Exaone4ForQuestionAnswering", +] diff --git a/tests/models/exaone4/__init__.py b/tests/models/exaone4/__init__.py new file mode 100644 index 0000000000..e69de29bb2 diff --git a/tests/models/exaone4/test_modeling_exaone4.py b/tests/models/exaone4/test_modeling_exaone4.py new file mode 100644 index 0000000000..4ac87ce900 --- /dev/null +++ b/tests/models/exaone4/test_modeling_exaone4.py @@ -0,0 +1,408 @@ +# coding=utf-8 +# Copyright 2025 The LG AI Research and The HuggingFace Inc. team. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +"""Testing suite for the PyTorch EXAONE 4.0 model.""" + +import unittest + +import pytest +from packaging import version +from parameterized import parameterized + +from transformers import ( + AutoTokenizer, + Exaone4Config, + GenerationConfig, + is_torch_available, +) +from transformers.testing_utils import ( + cleanup, + require_flash_attn, + require_torch, + require_torch_accelerator, + require_torch_sdpa, + slow, + torch_device, +) + +from ...causal_lm_tester import CausalLMModelTest, CausalLMModelTester +from ...test_configuration_common import ConfigTester + + +if is_torch_available(): + import torch + + from transformers import ( + Exaone4ForCausalLM, + Exaone4ForQuestionAnswering, + Exaone4ForSequenceClassification, + Exaone4ForTokenClassification, + Exaone4Model, + ) + + +class Exaone4ModelTester(CausalLMModelTester): + config_class = Exaone4Config + if is_torch_available(): + base_model_class = Exaone4Model + causal_lm_class = Exaone4ForCausalLM + sequence_class = Exaone4ForSequenceClassification + token_class = Exaone4ForTokenClassification + question_answering_class = Exaone4ForQuestionAnswering + + +@require_torch +class Exaone4ModelTest(CausalLMModelTest, unittest.TestCase): + all_model_classes = ( + ( + Exaone4Model, + Exaone4ForCausalLM, + Exaone4ForSequenceClassification, + Exaone4ForQuestionAnswering, + Exaone4ForTokenClassification, + ) + if is_torch_available() + else () + ) + pipeline_model_mapping = ( + { + "feature-extraction": Exaone4Model, + "question-answering": Exaone4ForQuestionAnswering, + "text-classification": Exaone4ForSequenceClassification, + "text-generation": Exaone4ForCausalLM, + "zero-shot": Exaone4ForSequenceClassification, + "token-classification": Exaone4ForTokenClassification, + } + if is_torch_available() + else {} + ) + test_headmasking = False + test_pruning = False + fx_compatible = False # Broken by attention refactor cc @Cyrilvallez + model_tester_class = Exaone4ModelTester + model_split_percents = [0.5, 0.6] + + def setUp(self): + self.model_tester = Exaone4ModelTester(self) + self.config_tester = ConfigTester(self, config_class=Exaone4Config, hidden_size=37) + + @unittest.skip("Failing because of unique cache (HybridCache)") + def test_model_outputs_equivalence(self, **kwargs): + pass + + @parameterized.expand([("random",), ("same",)]) + @pytest.mark.generate + @unittest.skip("EXAONE 4.0 has HybridCache which is not compatible with assisted decoding") + def test_assisted_decoding_matches_greedy_search(self, assistant_type): + pass + + @unittest.skip("EXAONE 4.0 has HybridCache which is not compatible with assisted decoding") + def test_prompt_lookup_decoding_matches_greedy_search(self, assistant_type): + pass + + @pytest.mark.generate + @unittest.skip("EXAONE 4.0 has HybridCache which is not compatible with assisted decoding") + def test_assisted_decoding_sample(self): + pass + + @unittest.skip("EXAONE 4.0 has HybridCache which is not compatible with dola decoding") + def test_dola_decoding_sample(self): + pass + + @unittest.skip("EXAONE 4.0 has HybridCache and doesn't support continue from past kv") + def test_generate_continue_from_past_key_values(self): + pass + + @unittest.skip("EXAONE 4.0 has HybridCache and doesn't support low_memory generation") + def test_beam_search_low_memory(self): + pass + + @unittest.skip("EXAONE 4.0 has HybridCache and doesn't support contrastive generation") + def test_contrastive_generate(self): + pass + + @unittest.skip("EXAONE 4.0 has HybridCache and doesn't support contrastive generation") + def test_contrastive_generate_dict_outputs_use_cache(self): + pass + + @unittest.skip("EXAONE 4.0 has HybridCache and doesn't support contrastive generation") + def test_contrastive_generate_low_memory(self): + pass + + @unittest.skip( + "EXAONE 4.0 has HybridCache and doesn't support StaticCache. Though it could, it shouldn't support." + ) + def test_generate_with_static_cache(self): + pass + + @unittest.skip( + "EXAONE 4.0 has HybridCache and doesn't support StaticCache. Though it could, it shouldn't support." + ) + def test_generate_from_inputs_embeds_with_static_cache(self): + pass + + @unittest.skip( + "EXAONE 4.0 has HybridCache and doesn't support StaticCache. Though it could, it shouldn't support." + ) + def test_generate_continue_from_inputs_embeds(self): + pass + + @unittest.skip("EXAONE 4.0 has HybridCache which auto-compiles. Compile and FA2 don't work together.") + def test_eager_matches_fa2_generate(self): + pass + + @unittest.skip( + reason="HybridCache can't be gathered because it is not iterable. Adding a simple iter and dumping `distributed_iterator`" + " as in Dynamic Cache doesnt work. NOTE: @gante all cache objects would need better compatibility with multi gpu setting" + ) + def test_multi_gpu_data_parallel_forward(self): + pass + + +@require_torch +class Exaone4IntegrationTest(unittest.TestCase): + TEST_MODEL_ID = "LGAI-EXAONE/EXAONE-4.0-Instruct" # dummy model id + + def tearDown(self): + # TODO (joao): automatic compilation, i.e. compilation when `cache_implementation="static"` is used, leaves + # some memory allocated in the cache, which means some object is not being released properly. This causes some + # unoptimal memory usage, e.g. after certain teruff format examples tests src utilssts a 7B model in FP16 no longer fits in a 24GB GPU. + # Investigate the root cause. + cleanup(torch_device, gc_collect=True) + + @slow + def test_model_logits(self): + input_ids = [405, 7584, 79579, 76636, 2907, 94640, 373] + model = Exaone4ForCausalLM.from_pretrained( + self.TEST_MODEL_ID, device_map="auto", torch_dtype=torch.float16, attn_implementation="eager" + ) + input_ids = torch.tensor([input_ids]).to(model.model.embed_tokens.weight.device) + with torch.no_grad(): + out = model(input_ids).logits.float().cpu() + + EXPECTED_MEAN = torch.tensor([[13.9380, 12.9951, 12.9442, 10.6576, 11.0901, 12.1466, 9.2482]]) + EXPECTED_SLICE = torch.tensor( + [ + 4.9180, + 11.6406, + 21.1250, + 13.4062, + 20.8438, + 18.0625, + 17.9688, + 18.7812, + 18.0156, + 18.3594, + 18.5000, + 19.1719, + 18.5156, + 19.3438, + 19.5000, + 20.6406, + 19.4844, + 19.2812, + 19.4688, + 20.0156, + 19.8438, + 19.9531, + 19.7188, + 20.5938, + 20.5312, + 20.1250, + 20.4062, + 21.4062, + 21.2344, + 20.7656, + ] + ) + + torch.testing.assert_close(out.mean(-1), EXPECTED_MEAN, atol=1e-2, rtol=1e-2) + torch.testing.assert_close(out[0, 0, :30], EXPECTED_SLICE, atol=1e-4, rtol=1e-4) + del model + cleanup(torch_device, gc_collect=True) + + @slow + def test_model_logits_bf16(self): + input_ids = [405, 7584, 79579, 76636, 2907, 94640, 373] + model = Exaone4ForCausalLM.from_pretrained( + self.TEST_MODEL_ID, device_map="auto", torch_dtype=torch.bfloat16, attn_implementation="eager" + ) + input_ids = torch.tensor([input_ids]).to(model.model.embed_tokens.weight.device) + with torch.no_grad(): + out = model(input_ids).logits.float().cpu() + + EXPECTED_MEAN = torch.tensor([[13.8797, 13.0799, 12.9665, 10.7712, 11.1006, 12.2406, 9.3248]]) + EXPECTED_SLICE = torch.tensor( + [ + 4.8750, + 11.6250, + 21.0000, + 13.3125, + 20.8750, + 18.0000, + 18.0000, + 18.7500, + 18.0000, + 18.3750, + 18.5000, + 19.1250, + 18.5000, + 19.3750, + 19.5000, + 20.6250, + 19.5000, + 19.2500, + 19.5000, + 20.0000, + 19.8750, + 19.8750, + 19.7500, + 20.6250, + 20.5000, + 20.1250, + 20.3750, + 21.3750, + 21.2500, + 20.7500, + ] + ) + + torch.testing.assert_close(out.mean(-1), EXPECTED_MEAN, atol=1e-2, rtol=1e-2) + torch.testing.assert_close(out[0, 0, :30], EXPECTED_SLICE, atol=1e-4, rtol=1e-4) + del model + cleanup(torch_device, gc_collect=True) + + @slow + def test_model_generation(self): + EXPECTED_TEXT = "Tell me about the Miracle on the Han river.\n\nThe Miracle on the Han River is a story about the miracle of the Korean War Armistice. The story is told by a Korean soldier who is a witness to the armistice negotiations. He is reluctant to tell the story because he does not want to be a hypocrite, but he feels that everyone should know what really happened.\n\nThe Korean War began on June 25, 1950, when North Korean troops invaded South Korea. Soon the United Nations troops, primarily from South Korea, were in support of the United States. The war was still ongoing when North Korean troops stopped their advance" + prompt = "Tell me about the Miracle on the Han river." + tokenizer = AutoTokenizer.from_pretrained(self.TEST_MODEL_ID) + model = Exaone4ForCausalLM.from_pretrained( + self.TEST_MODEL_ID, device_map="auto", torch_dtype=torch.float16, attn_implementation="eager" + ) + input_ids = tokenizer.encode(prompt, return_tensors="pt").to(model.model.embed_tokens.weight.device) + + # greedy generation outputs + generated_ids = model.generate(input_ids, max_new_tokens=128, temperature=0) + text = tokenizer.decode(generated_ids[0], skip_special_tokens=True) + self.assertEqual(EXPECTED_TEXT, text) + del model + cleanup(torch_device, gc_collect=True) + + @slow + @require_torch_sdpa + def test_model_generation_bf16_sdpa(self): + EXPECTED_TEXT = "Tell me about the Miracle on the Han river.\n\nThe Miracle on the Han River is a story about the miracle of the Korean War Armistice.\n\nThe Korean War broke out in 35 years ago in 1950. The war was the result of the ideological conflict between the communist north and the capitalist south. The war was brought to a halt in 1953. There was to be peace talks but no peace treaty. As a result of the stalemate the Korean people have neither a peace treaty nor a reunification nor a democratization of Korea. The stalemate of 35 years has produced a people of 70 million" + prompt = "Tell me about the Miracle on the Han river." + tokenizer = AutoTokenizer.from_pretrained(self.TEST_MODEL_ID) + model = Exaone4ForCausalLM.from_pretrained( + self.TEST_MODEL_ID, device_map="auto", torch_dtype=torch.bfloat16, attn_implementation="sdpa" + ) + input_ids = tokenizer.encode(prompt, return_tensors="pt").to(model.model.embed_tokens.weight.device) + + # greedy generation outputs + generated_ids = model.generate(input_ids, max_new_tokens=128, temperature=0) + text = tokenizer.decode(generated_ids[0], skip_special_tokens=True) + self.assertEqual(EXPECTED_TEXT, text) + del model + cleanup(torch_device, gc_collect=True) + + @slow + @require_torch_accelerator + @require_flash_attn + def test_model_generation_long_flash(self): + EXPECTED_OUTPUT_TOKEN_IDS = [433, 9055] + input_ids = [433, 9055] * 2048 + model = Exaone4ForCausalLM.from_pretrained( + self.TEST_MODEL_ID, device_map="auto", torch_dtype=torch.float16, attn_implementation="flash_attention_2" + ) + input_ids = torch.tensor([input_ids]).to(model.model.embed_tokens.weight.device) + + generated_ids = model.generate(input_ids, max_new_tokens=4, temperature=0) + self.assertEqual(EXPECTED_OUTPUT_TOKEN_IDS, generated_ids[0][-2:].tolist()) + del model + cleanup(torch_device, gc_collect=True) + + @slow + @require_torch_accelerator + @require_torch_sdpa + def test_model_generation_beyond_sliding_window(self): + EXPECTED_TEXT_COMPLETION = ( + " but I'm not sure if I'm going to be able to see it. I really enjoy the scenery, but I'm not sure if I" + ) + tokenizer = AutoTokenizer.from_pretrained(self.TEST_MODEL_ID) + prompt = "This is a nice place. " * 700 + "I really enjoy the scenery," + model = Exaone4ForCausalLM.from_pretrained( + self.TEST_MODEL_ID, device_map="auto", torch_dtype=torch.float16, attn_implementation="sdpa" + ) + input_ids = tokenizer.encode(prompt, return_tensors="pt").to(model.model.embed_tokens.weight.device) + + generated_ids = model.generate(input_ids, max_new_tokens=32, temperature=0) + text = tokenizer.decode(generated_ids[0, -32:], skip_special_tokens=True) + self.assertEqual(EXPECTED_TEXT_COMPLETION, text) + del model + cleanup(torch_device, gc_collect=True) + + @slow + def test_export_static_cache(self): + if version.parse(torch.__version__) < version.parse("2.4.0"): + self.skipTest(reason="This test requires torch >= 2.4 to run.") + + from transformers.integrations.executorch import ( + TorchExportableModuleWithStaticCache, + convert_and_export_with_cache, + ) + + tokenizer = AutoTokenizer.from_pretrained(self.TEST_MODEL_ID, padding_side="right") + EXPECTED_TEXT_COMPLETION = [ + "The Deep Learning is 100% free and easy to use.\n\n## How to use Deep Learning?\n\n" + ] + max_generation_length = tokenizer(EXPECTED_TEXT_COMPLETION, return_tensors="pt", padding=True)[ + "input_ids" + ].shape[-1] + + # Load model + device = "cpu" + dtype = torch.bfloat16 + cache_implementation = "static" + attn_implementation = "sdpa" + batch_size = 1 + model = Exaone4ForCausalLM.from_pretrained( + self.TEST_MODEL_ID, + device_map=device, + torch_dtype=dtype, + attn_implementation=attn_implementation, + generation_config=GenerationConfig( + use_cache=True, + cache_implementation=cache_implementation, + max_length=max_generation_length, + cache_config={ + "batch_size": batch_size, + "max_cache_len": max_generation_length, + }, + ), + ) + + prompt = ["The Deep Learning is "] + prompt_tokens = tokenizer(prompt, return_tensors="pt", padding=True).to(model.device) + prompt_token_ids = prompt_tokens["input_ids"] + max_new_tokens = max_generation_length - prompt_token_ids.shape[-1] + + # Static Cache + export + exported_program = convert_and_export_with_cache(model) + ep_generated_ids = TorchExportableModuleWithStaticCache.generate( + exported_program=exported_program, prompt_token_ids=prompt_token_ids, max_new_tokens=max_new_tokens + ) + ep_generated_text = tokenizer.batch_decode(ep_generated_ids, skip_special_tokens=True) + self.assertEqual(EXPECTED_TEXT_COMPLETION, ep_generated_text) diff --git a/utils/check_docstrings.py b/utils/check_docstrings.py index cd5f495355..dd10a7d9b1 100644 --- a/utils/check_docstrings.py +++ b/utils/check_docstrings.py @@ -79,6 +79,7 @@ ALWAYS_OVERRIDE = ["labels"] # docstrings instead. If formatting should be ignored for the docstring, you can put a comment # no-format on the # line before the docstring. OBJECTS_TO_IGNORE = [ + "Exaone4Config", "SmolLM3Config", "Gemma3nVisionConfig", "Llama4Processor",