add Glm (#33823)
* Create modular_glm.py * Update modular_glm.py * Finalize architecture without all attentions * Add all attentions modules * Finalize modular * Update given last version * Last update * Finalize model * Finalize converter * Update convert_glm_weights_to_hf.py * style * style * Create __init__.py * Aff all inits * Update convert_glm_weights_to_hf.py * Update convert_glm_weights_to_hf.py * Update convert_glm_weights_to_hf.py * Update convert_glm_weights_to_hf.py * Update convert_glm_weights_to_hf.py * Update convert_glm_weights_to_hf.py * Update convert_glm_weights_to_hf.py * Update convert_glm_weights_to_hf.py * Update convert_glm_weights_to_hf.py * Correct the rotary embeddings * Remove apply_residual_connection_post_layernorm (always false) * remove use_rms_norm (always true) * remove past_layer_norm (always true) * Update __init__.py * Update config and license * start adding tests and doc * Add doc + style * Update test_modeling_glm.py * Add dummies * Apply correct modeling * Refactor attention to follow llama * Update __init__.py * Update convert_glm_weights_to_hf.py * Correct bias * remove linear_bias and pdrop (never used) * apply modular * Simplify converter * remove dummies + style * add model_input_names * Add pretraining_tp to config for when eager attention is used * Update modular to remove all pretraining_tp * Update test_modeling_glm.py * Update the __all__ * Update __all__ * Update __init__.py * Update test_modeling_glm.py * add revisions * Add the correct repos and revisions * style * Update __init__.py * update exports * remove import of modular files * style * Apply Llama changes + refine converter * Update convert_glm_weights_to_hf.py * Update convert_glm_weights_to_hf.py * Update convert_glm_weights_to_hf.py * Update convert_glm_weights_to_hf.py * Update convert_glm_weights_to_hf.py * Update convert_glm_weights_to_hf.py * Update convert_glm_weights_to_hf.py * Update convert_glm_weights_to_hf.py * style * Use new modular converter * add 
pretrainedmodel to init * style * Update test_modeling_glm.py * Move config outside modular to please CI about docstrings * Add dummies to please CI * Update glm.md * Update glm.md
@@ -414,6 +414,8 @@
|
|||||||
title: Gemma
|
title: Gemma
|
||||||
- local: model_doc/gemma2
|
- local: model_doc/gemma2
|
||||||
title: Gemma2
|
title: Gemma2
|
||||||
|
- local: model_doc/glm
|
||||||
|
title: GLM
|
||||||
- local: model_doc/openai-gpt
|
- local: model_doc/openai-gpt
|
||||||
title: GPT
|
title: GPT
|
||||||
- local: model_doc/gpt_neo
|
- local: model_doc/gpt_neo
|
||||||
|
|||||||
@@ -150,6 +150,7 @@ Flax), PyTorch, and/or TensorFlow.
|
|||||||
| [Gemma](model_doc/gemma) | ✅ | ❌ | ✅ |
|
| [Gemma](model_doc/gemma) | ✅ | ❌ | ✅ |
|
||||||
| [Gemma2](model_doc/gemma2) | ✅ | ❌ | ❌ |
|
| [Gemma2](model_doc/gemma2) | ✅ | ❌ | ❌ |
|
||||||
| [GIT](model_doc/git) | ✅ | ❌ | ❌ |
|
| [GIT](model_doc/git) | ✅ | ❌ | ❌ |
|
||||||
|
| [GLM](model_doc/glm) | ✅ | ❌ | ❌ |
|
||||||
| [GLPN](model_doc/glpn) | ✅ | ❌ | ❌ |
|
| [GLPN](model_doc/glpn) | ✅ | ❌ | ❌ |
|
||||||
| [GPT Neo](model_doc/gpt_neo) | ✅ | ❌ | ✅ |
|
| [GPT Neo](model_doc/gpt_neo) | ✅ | ❌ | ✅ |
|
||||||
| [GPT NeoX](model_doc/gpt_neox) | ✅ | ❌ | ❌ |
|
| [GPT NeoX](model_doc/gpt_neox) | ✅ | ❌ | ❌ |
|
||||||
|
|||||||
99
docs/source/en/model_doc/glm.md
Normal file
@@ -0,0 +1,99 @@
|
|||||||
|
<!--Copyright 2024 The GLM & ZhipuAI team and The HuggingFace Team. All rights reserved.
|
||||||
|
|
||||||
|
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
|
||||||
|
the License. You may obtain a copy of the License at
|
||||||
|
|
||||||
|
http://www.apache.org/licenses/LICENSE-2.0
|
||||||
|
|
||||||
|
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
|
||||||
|
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
|
||||||
|
specific language governing permissions and limitations under the License.
|
||||||
|
|
||||||
|
⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
|
||||||
|
rendered properly in your Markdown viewer.
|
||||||
|
|
||||||
|
-->
|
||||||
|
|
||||||
|
# GLM
|
||||||
|
|
||||||
|
## Overview
|
||||||
|
|
||||||
|
The GLM Model was proposed
|
||||||
|
in [ChatGLM: A Family of Large Language Models from GLM-130B to GLM-4 All Tools](https://arxiv.org/html/2406.12793v1)
|
||||||
|
by GLM Team, THUDM & ZhipuAI.
|
||||||
|
|
||||||
|
The abstract from the paper is the following:
|
||||||
|
|
||||||
|
*We introduce ChatGLM, an evolving family of large language models that we have been developing over time. This report
|
||||||
|
primarily focuses on the GLM-4 language series, which includes GLM-4, GLM-4-Air, and GLM-4-9B. They represent our most
|
||||||
|
capable models that are trained with all the insights and lessons gained from the preceding three generations of
|
||||||
|
ChatGLM. To date, the GLM-4 models are pre-trained on ten trillions of tokens mostly in Chinese and English, along with
|
||||||
|
a small set of corpus from 24 languages, and aligned primarily for Chinese and English usage. The high-quality alignment
|
||||||
|
is achieved via a multi-stage post-training process, which involves supervised fine-tuning and learning from human
|
||||||
|
feedback. Evaluations show that GLM-4 1) closely rivals or outperforms GPT-4 in terms of general metrics such as MMLU,
|
||||||
|
GSM8K, MATH, BBH, GPQA, and HumanEval, 2) gets close to GPT-4-Turbo in instruction following as measured by IFEval, 3)
|
||||||
|
matches GPT-4 Turbo (128K) and Claude 3 for long context tasks, and 4) outperforms GPT-4 in Chinese alignments as
|
||||||
|
measured by AlignBench. The GLM-4 All Tools model is further aligned to understand user intent and autonomously decide
|
||||||
|
when and which tool(s) to use—including web browser, Python interpreter, text-to-image model, and user-defined
|
||||||
|
functions—to effectively complete complex tasks. In practical applications, it matches and even surpasses GPT-4 All
|
||||||
|
Tools in tasks like accessing online information via web browsing and solving math problems using Python interpreter.
|
||||||
|
Over the course, we have open-sourced a series of models, including ChatGLM-6B (three generations), GLM-4-9B (128K, 1M),
|
||||||
|
GLM-4V-9B, WebGLM, and CodeGeeX, attracting over 10 million downloads on Hugging face in the year 2023 alone.*
|
||||||
|
|
||||||
|
Tips:
|
||||||
|
|
||||||
|
- This model was contributed by [THUDM](https://huggingface.co/THUDM). The most recent code can be
|
||||||
|
found [here](https://github.com/thudm/GLM-4).
|
||||||
|
|
||||||
|
|
||||||
|
## Usage tips
|
||||||
|
|
||||||
|
`GLM-4` can be found on the [Hugging Face Hub](https://huggingface.co/collections/THUDM/glm-4-665fcf188c414b03c2f7e3b7).
|
||||||
|
|
||||||
|
In the following, we demonstrate how to use `glm-4-9b-chat` for inference. Note that we use the ChatML format for dialog; this demo shows how to leverage `apply_chat_template` for this purpose.
|
||||||
|
|
||||||
|
```python
>>> from transformers import AutoModelForCausalLM, AutoTokenizer
>>> device = "cuda" # the device to load the model onto

>>> model = AutoModelForCausalLM.from_pretrained("THUDM/glm-4-9b-chat", device_map="auto")
>>> tokenizer = AutoTokenizer.from_pretrained("THUDM/glm-4-9b-chat")

>>> prompt = "Give me a short introduction to large language model."

>>> messages = [{"role": "user", "content": prompt}]

>>> text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

>>> model_inputs = tokenizer([text], return_tensors="pt").to(device)

>>> generated_ids = model.generate(model_inputs.input_ids, max_new_tokens=512, do_sample=True)

>>> generated_ids = [output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)]

>>> response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
```
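
The list comprehension near the end of the snippet strips the prompt tokens from each generated sequence, because for decoder-only models `generate` returns the prompt followed by the continuation. A minimal sketch of that slicing on plain Python lists; the token ids are made up for illustration and are not real GLM ids:

```python
# Hypothetical token ids; real ones would come from the tokenizer and model.generate.
prompt_ids = [[101, 102, 103]]             # one prompt of three tokens
full_outputs = [[101, 102, 103, 7, 8, 9]]  # the prompt followed by three new tokens

# Keep only what comes after the prompt in each sequence.
new_tokens = [out[len(inp):] for inp, out in zip(prompt_ids, full_outputs)]
print(new_tokens)
```

Decoding only the trimmed ids is what keeps the prompt text out of `response`.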
|
||||||
|
|
||||||
|
## GlmConfig
|
||||||
|
|
||||||
|
[[autodoc]] GlmConfig
|
||||||
|
|
||||||
|
## GlmModel
|
||||||
|
|
||||||
|
[[autodoc]] GlmModel
|
||||||
|
- forward
|
||||||
|
|
||||||
|
## GlmForCausalLM
|
||||||
|
|
||||||
|
[[autodoc]] GlmForCausalLM
|
||||||
|
- forward
|
||||||
|
|
||||||
|
## GlmForSequenceClassification
|
||||||
|
|
||||||
|
[[autodoc]] GlmForSequenceClassification
|
||||||
|
- forward
|
||||||
|
|
||||||
|
## GlmForTokenClassification
|
||||||
|
|
||||||
|
[[autodoc]] GlmForTokenClassification
|
||||||
|
- forward
|
||||||
@@ -42,6 +42,7 @@ FlashAttention-2 is currently supported for the following architectures:
|
|||||||
* [Chameleon](https://huggingface.co/docs/transformers/model_doc/chameleon#transformers.Chameleon)
|
* [Chameleon](https://huggingface.co/docs/transformers/model_doc/chameleon#transformers.Chameleon)
|
||||||
* [CLIP](https://huggingface.co/docs/transformers/model_doc/clip#transformers.CLIPModel)
|
* [CLIP](https://huggingface.co/docs/transformers/model_doc/clip#transformers.CLIPModel)
|
||||||
* [Cohere](https://huggingface.co/docs/transformers/model_doc/cohere#transformers.CohereModel)
|
* [Cohere](https://huggingface.co/docs/transformers/model_doc/cohere#transformers.CohereModel)
|
||||||
|
* [GLM](https://huggingface.co/docs/transformers/model_doc/glm#transformers.GlmModel)
|
||||||
* [Dbrx](https://huggingface.co/docs/transformers/model_doc/dbrx#transformers.DbrxModel)
|
* [Dbrx](https://huggingface.co/docs/transformers/model_doc/dbrx#transformers.DbrxModel)
|
||||||
* [DistilBert](https://huggingface.co/docs/transformers/model_doc/distilbert#transformers.DistilBertModel)
|
* [DistilBert](https://huggingface.co/docs/transformers/model_doc/distilbert#transformers.DistilBertModel)
|
||||||
* [Gemma](https://huggingface.co/docs/transformers/model_doc/gemma#transformers.GemmaModel)
|
* [Gemma](https://huggingface.co/docs/transformers/model_doc/gemma#transformers.GemmaModel)
|
||||||
@@ -216,6 +217,7 @@ For now, Transformers supports SDPA inference and training for the following arc
|
|||||||
* [CamemBERT](https://huggingface.co/docs/transformers/model_doc/camembert#transformers.CamembertModel)
|
* [CamemBERT](https://huggingface.co/docs/transformers/model_doc/camembert#transformers.CamembertModel)
|
||||||
* [Chameleon](https://huggingface.co/docs/transformers/model_doc/chameleon#transformers.Chameleon)
|
* [Chameleon](https://huggingface.co/docs/transformers/model_doc/chameleon#transformers.Chameleon)
|
||||||
* [CLIP](https://huggingface.co/docs/transformers/model_doc/clip#transformers.CLIPModel)
|
* [CLIP](https://huggingface.co/docs/transformers/model_doc/clip#transformers.CLIPModel)
|
||||||
|
* [GLM](https://huggingface.co/docs/transformers/model_doc/glm#transformers.GlmModel)
|
||||||
* [Cohere](https://huggingface.co/docs/transformers/model_doc/cohere#transformers.CohereModel)
|
* [Cohere](https://huggingface.co/docs/transformers/model_doc/cohere#transformers.CohereModel)
|
||||||
* [data2vec_audio](https://huggingface.co/docs/transformers/main/en/model_doc/data2vec#transformers.Data2VecAudioModel)
|
* [data2vec_audio](https://huggingface.co/docs/transformers/main/en/model_doc/data2vec#transformers.Data2VecAudioModel)
|
||||||
* [Dbrx](https://huggingface.co/docs/transformers/model_doc/dbrx#transformers.DbrxModel)
|
* [Dbrx](https://huggingface.co/docs/transformers/model_doc/dbrx#transformers.DbrxModel)
|
||||||
|
|||||||
@@ -454,6 +454,7 @@ _import_structure = {
|
|||||||
"GitProcessor",
|
"GitProcessor",
|
||||||
"GitVisionConfig",
|
"GitVisionConfig",
|
||||||
],
|
],
|
||||||
|
"models.glm": ["GlmConfig"],
|
||||||
"models.glpn": ["GLPNConfig"],
|
"models.glpn": ["GLPNConfig"],
|
||||||
"models.gpt2": [
|
"models.gpt2": [
|
||||||
"GPT2Config",
|
"GPT2Config",
|
||||||
@@ -2294,6 +2295,15 @@ else:
|
|||||||
"GitVisionModel",
|
"GitVisionModel",
|
||||||
]
|
]
|
||||||
)
|
)
|
||||||
|
_import_structure["models.glm"].extend(
|
||||||
|
[
|
||||||
|
"GlmForCausalLM",
|
||||||
|
"GlmForSequenceClassification",
|
||||||
|
"GlmForTokenClassification",
|
||||||
|
"GlmModel",
|
||||||
|
"GlmPreTrainedModel",
|
||||||
|
]
|
||||||
|
)
|
||||||
_import_structure["models.glpn"].extend(
|
_import_structure["models.glpn"].extend(
|
||||||
[
|
[
|
||||||
"GLPNForDepthEstimation",
|
"GLPNForDepthEstimation",
|
||||||
@@ -5304,6 +5314,7 @@ if TYPE_CHECKING:
|
|||||||
GitProcessor,
|
GitProcessor,
|
||||||
GitVisionConfig,
|
GitVisionConfig,
|
||||||
)
|
)
|
||||||
|
from .models.glm import GlmConfig
|
||||||
from .models.glpn import GLPNConfig
|
from .models.glpn import GLPNConfig
|
||||||
from .models.gpt2 import (
|
from .models.gpt2 import (
|
||||||
GPT2Config,
|
GPT2Config,
|
||||||
@@ -7024,6 +7035,13 @@ if TYPE_CHECKING:
|
|||||||
GitPreTrainedModel,
|
GitPreTrainedModel,
|
||||||
GitVisionModel,
|
GitVisionModel,
|
||||||
)
|
)
|
||||||
|
from .models.glm import (
|
||||||
|
GlmForCausalLM,
|
||||||
|
GlmForSequenceClassification,
|
||||||
|
GlmForTokenClassification,
|
||||||
|
GlmModel,
|
||||||
|
GlmPreTrainedModel,
|
||||||
|
)
|
||||||
from .models.glpn import (
|
from .models.glpn import (
|
||||||
GLPNForDepthEstimation,
|
GLPNForDepthEstimation,
|
||||||
GLPNModel,
|
GLPNModel,
|
||||||
|
|||||||
@@ -97,6 +97,7 @@ from . import (
|
|||||||
gemma,
|
gemma,
|
||||||
gemma2,
|
gemma2,
|
||||||
git,
|
git,
|
||||||
|
glm,
|
||||||
glpn,
|
glpn,
|
||||||
gpt2,
|
gpt2,
|
||||||
gpt_bigcode,
|
gpt_bigcode,
|
||||||
|
|||||||
@@ -114,6 +114,7 @@ CONFIG_MAPPING_NAMES = OrderedDict(
|
|||||||
("gemma", "GemmaConfig"),
|
("gemma", "GemmaConfig"),
|
||||||
("gemma2", "Gemma2Config"),
|
("gemma2", "Gemma2Config"),
|
||||||
("git", "GitConfig"),
|
("git", "GitConfig"),
|
||||||
|
("glm", "GlmConfig"),
|
||||||
("glpn", "GLPNConfig"),
|
("glpn", "GLPNConfig"),
|
||||||
("gpt-sw3", "GPT2Config"),
|
("gpt-sw3", "GPT2Config"),
|
||||||
("gpt2", "GPT2Config"),
|
("gpt2", "GPT2Config"),
|
||||||
@@ -416,6 +417,7 @@ MODEL_NAMES_MAPPING = OrderedDict(
|
|||||||
("gemma", "Gemma"),
|
("gemma", "Gemma"),
|
||||||
("gemma2", "Gemma2"),
|
("gemma2", "Gemma2"),
|
||||||
("git", "GIT"),
|
("git", "GIT"),
|
||||||
|
("glm", "GLM"),
|
||||||
("glpn", "GLPN"),
|
("glpn", "GLPN"),
|
||||||
("gpt-sw3", "GPT-Sw3"),
|
("gpt-sw3", "GPT-Sw3"),
|
||||||
("gpt2", "OpenAI GPT-2"),
|
("gpt2", "OpenAI GPT-2"),
|
||||||
|
|||||||
@@ -111,6 +111,7 @@ MODEL_MAPPING_NAMES = OrderedDict(
|
|||||||
("gemma", "GemmaModel"),
|
("gemma", "GemmaModel"),
|
||||||
("gemma2", "Gemma2Model"),
|
("gemma2", "Gemma2Model"),
|
||||||
("git", "GitModel"),
|
("git", "GitModel"),
|
||||||
|
("glm", "GlmModel"),
|
||||||
("glpn", "GLPNModel"),
|
("glpn", "GLPNModel"),
|
||||||
("gpt-sw3", "GPT2Model"),
|
("gpt-sw3", "GPT2Model"),
|
||||||
("gpt2", "GPT2Model"),
|
("gpt2", "GPT2Model"),
|
||||||
@@ -486,6 +487,7 @@ MODEL_FOR_CAUSAL_LM_MAPPING_NAMES = OrderedDict(
|
|||||||
("gemma", "GemmaForCausalLM"),
|
("gemma", "GemmaForCausalLM"),
|
||||||
("gemma2", "Gemma2ForCausalLM"),
|
("gemma2", "Gemma2ForCausalLM"),
|
||||||
("git", "GitForCausalLM"),
|
("git", "GitForCausalLM"),
|
||||||
|
("glm", "GlmForCausalLM"),
|
||||||
("gpt-sw3", "GPT2LMHeadModel"),
|
("gpt-sw3", "GPT2LMHeadModel"),
|
||||||
("gpt2", "GPT2LMHeadModel"),
|
("gpt2", "GPT2LMHeadModel"),
|
||||||
("gpt_bigcode", "GPTBigCodeForCausalLM"),
|
("gpt_bigcode", "GPTBigCodeForCausalLM"),
|
||||||
@@ -941,6 +943,7 @@ MODEL_FOR_SEQUENCE_CLASSIFICATION_MAPPING_NAMES = OrderedDict(
|
|||||||
("funnel", "FunnelForSequenceClassification"),
|
("funnel", "FunnelForSequenceClassification"),
|
||||||
("gemma", "GemmaForSequenceClassification"),
|
("gemma", "GemmaForSequenceClassification"),
|
||||||
("gemma2", "Gemma2ForSequenceClassification"),
|
("gemma2", "Gemma2ForSequenceClassification"),
|
||||||
|
("glm", "GlmForSequenceClassification"),
|
||||||
("gpt-sw3", "GPT2ForSequenceClassification"),
|
("gpt-sw3", "GPT2ForSequenceClassification"),
|
||||||
("gpt2", "GPT2ForSequenceClassification"),
|
("gpt2", "GPT2ForSequenceClassification"),
|
||||||
("gpt_bigcode", "GPTBigCodeForSequenceClassification"),
|
("gpt_bigcode", "GPTBigCodeForSequenceClassification"),
|
||||||
@@ -1131,6 +1134,7 @@ MODEL_FOR_TOKEN_CLASSIFICATION_MAPPING_NAMES = OrderedDict(
|
|||||||
("funnel", "FunnelForTokenClassification"),
|
("funnel", "FunnelForTokenClassification"),
|
||||||
("gemma", "GemmaForTokenClassification"),
|
("gemma", "GemmaForTokenClassification"),
|
||||||
("gemma2", "Gemma2ForTokenClassification"),
|
("gemma2", "Gemma2ForTokenClassification"),
|
||||||
|
("glm", "GlmForTokenClassification"),
|
||||||
("gpt-sw3", "GPT2ForTokenClassification"),
|
("gpt-sw3", "GPT2ForTokenClassification"),
|
||||||
("gpt2", "GPT2ForTokenClassification"),
|
("gpt2", "GPT2ForTokenClassification"),
|
||||||
("gpt_bigcode", "GPTBigCodeForTokenClassification"),
|
("gpt_bigcode", "GPTBigCodeForTokenClassification"),
|
||||||
|
|||||||
@@ -204,6 +204,7 @@ else:
|
|||||||
),
|
),
|
||||||
),
|
),
|
||||||
("git", ("BertTokenizer", "BertTokenizerFast" if is_tokenizers_available() else None)),
|
("git", ("BertTokenizer", "BertTokenizerFast" if is_tokenizers_available() else None)),
|
||||||
|
("glm", (None, "PreTrainedTokenizerFast" if is_tokenizers_available() else None)),
|
||||||
("gpt-sw3", ("GPTSw3Tokenizer" if is_sentencepiece_available() else None, None)),
|
("gpt-sw3", ("GPTSw3Tokenizer" if is_sentencepiece_available() else None, None)),
|
||||||
("gpt2", ("GPT2Tokenizer", "GPT2TokenizerFast" if is_tokenizers_available() else None)),
|
("gpt2", ("GPT2Tokenizer", "GPT2TokenizerFast" if is_tokenizers_available() else None)),
|
||||||
("gpt_bigcode", ("GPT2Tokenizer", "GPT2TokenizerFast" if is_tokenizers_available() else None)),
|
("gpt_bigcode", ("GPT2Tokenizer", "GPT2TokenizerFast" if is_tokenizers_available() else None)),
|
||||||
|
|||||||
27
src/transformers/models/glm/__init__.py
Normal file
@@ -0,0 +1,27 @@
|
|||||||
|
# Copyright 2024 The HuggingFace Team. All rights reserved.
|
||||||
|
#
|
||||||
|
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||||
|
# you may not use this file except in compliance with the License.
|
||||||
|
# You may obtain a copy of the License at
|
||||||
|
#
|
||||||
|
# http://www.apache.org/licenses/LICENSE-2.0
|
||||||
|
#
|
||||||
|
# Unless required by applicable law or agreed to in writing, software
|
||||||
|
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||||
|
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||||
|
# See the License for the specific language governing permissions and
|
||||||
|
# limitations under the License.
|
||||||
|
from typing import TYPE_CHECKING
|
||||||
|
|
||||||
|
from ...utils import _LazyModule
|
||||||
|
from ...utils.import_utils import define_import_structure
|
||||||
|
|
||||||
|
|
||||||
|
if TYPE_CHECKING:
|
||||||
|
from .configuration_glm import *
|
||||||
|
from .modeling_glm import *
|
||||||
|
else:
|
||||||
|
import sys
|
||||||
|
|
||||||
|
_file = globals()["__file__"]
|
||||||
|
sys.modules[__name__] = _LazyModule(__name__, _file, define_import_structure(_file), module_spec=__spec__)
|
||||||
136
src/transformers/models/glm/configuration_glm.py
Normal file
@@ -0,0 +1,136 @@
|
|||||||
|
# coding=utf-8
|
||||||
|
# Copyright 2024 The GLM & ZhipuAI team and HuggingFace Inc. team. All rights reserved.
|
||||||
|
#
|
||||||
|
#
|
||||||
|
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||||
|
# you may not use this file except in compliance with the License.
|
||||||
|
# You may obtain a copy of the License at
|
||||||
|
#
|
||||||
|
# http://www.apache.org/licenses/LICENSE-2.0
|
||||||
|
#
|
||||||
|
# Unless required by applicable law or agreed to in writing, software
|
||||||
|
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||||
|
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||||
|
# See the License for the specific language governing permissions and
|
||||||
|
# limitations under the License.
|
||||||
|
|
||||||
|
from ...configuration_utils import PretrainedConfig
|
||||||
|
|
||||||
|
|
||||||
|
class GlmConfig(PretrainedConfig):
|
||||||
|
r"""
|
||||||
|
This is the configuration class to store the configuration of a [`GlmModel`]. It is used to instantiate a Glm
|
||||||
|
model according to the specified arguments, defining the model architecture. Instantiating a configuration with the
|
||||||
|
defaults will yield a similar configuration to that of the Glm-4-9b-chat.
|
||||||
|
e.g. [THUDM/glm-4-9b-chat](https://huggingface.co/THUDM/glm-4-9b-chat)
|
||||||
|
Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
|
||||||
|
documentation from [`PretrainedConfig`] for more information.
|
||||||
|
Args:
|
||||||
|
vocab_size (`int`, *optional*, defaults to 151552):
|
||||||
|
Vocabulary size of the Glm model. Defines the number of different tokens that can be represented by the
|
||||||
|
`input_ids` passed when calling [`GlmModel`].
|
||||||
|
hidden_size (`int`, *optional*, defaults to 4096):
|
||||||
|
Dimension of the hidden representations.
|
||||||
|
intermediate_size (`int`, *optional*, defaults to 13696):
|
||||||
|
Dimension of the MLP representations.
|
||||||
|
num_hidden_layers (`int`, *optional*, defaults to 40):
|
||||||
|
Number of hidden layers in the Transformer decoder.
|
||||||
|
num_attention_heads (`int`, *optional*, defaults to 32):
|
||||||
|
Number of attention heads for each attention layer in the Transformer decoder.
|
||||||
|
num_key_value_heads (`int`, *optional*, defaults to 2):
|
||||||
|
This is the number of key_value heads that should be used to implement Grouped Query Attention. If
|
||||||
|
`num_key_value_heads=num_attention_heads`, the model will use Multi Head Attention (MHA), if
|
||||||
|
`num_key_value_heads=1` the model will use Multi Query Attention (MQA) otherwise GQA is used. When
|
||||||
|
converting a multi-head checkpoint to a GQA checkpoint, each group key and value head should be constructed
|
||||||
|
by mean-pooling all the original heads within that group. For more details, check out [this
|
||||||
|
paper](https://arxiv.org/pdf/2305.13245.pdf).
|
||||||
|
head_dim (`int`, *optional*, defaults to 128):
|
||||||
|
The attention head dimension.
|
||||||
|
hidden_act (`str` or `function`, *optional*, defaults to `"silu"`):
|
||||||
|
The non-linear activation function (function or string) in the decoder.
|
||||||
|
attention_dropout (`float`, *optional*, defaults to 0.0):
|
||||||
|
The dropout ratio for the attention probabilities.
|
||||||
|
max_position_embeddings (`int`, *optional*, defaults to 131072):
|
||||||
|
The maximum sequence length that this model might ever be used with.
|
||||||
|
initializer_range (`float`, *optional*, defaults to 0.02):
|
||||||
|
The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
|
||||||
|
rms_norm_eps (`float`, *optional*, defaults to 1.5625e-07):
|
||||||
|
The epsilon used by the rms normalization layers.
|
||||||
|
use_cache (`bool`, *optional*, defaults to `True`):
|
||||||
|
Whether or not the model should return the last key/values attentions (not used by all models). Only
|
||||||
|
relevant if `config.is_decoder=True`.
|
||||||
|
tie_word_embeddings (`bool`, *optional*, defaults to `False`):
|
||||||
|
Whether to tie the input and output word embeddings.
|
||||||
|
rope_theta (`float`, *optional*, defaults to 10000.0):
|
||||||
|
The base period of the RoPE embeddings.
|
||||||
|
pad_token_id (`int`, *optional*, defaults to 151329):
|
||||||
|
Padding token id.
|
||||||
|
eos_token_id (`int` | `list`, *optional*, defaults to `[151329, 151336, 151338]`):
|
||||||
|
End of stream token id.
|
||||||
|
bos_token_id (`int`, *optional*):
|
||||||
|
Beginning of stream token id.
|
||||||
|
attention_bias (`bool`, *optional*, defaults to `True`):
|
||||||
|
Whether to use a bias in the query, key, value and output projection layers during self-attention.
|
||||||
|
```python
|
||||||
|
>>> from transformers import GlmModel, GlmConfig
|
||||||
|
>>> # Initializing a Glm glm-4-9b-chat style configuration
|
||||||
|
>>> configuration = GlmConfig()
|
||||||
|
>>> # Initializing a model from the glm-4-9b-chat style configuration
|
||||||
|
>>> model = GlmModel(configuration)
|
||||||
|
>>> # Accessing the model configuration
|
||||||
|
>>> configuration = model.config
|
||||||
|
```"""
|
||||||
|
|
||||||
|
model_type = "glm"
|
||||||
|
keys_to_ignore_at_inference = ["past_key_values"]
|
||||||
|
|
||||||
|
def __init__(
|
||||||
|
self,
|
||||||
|
vocab_size=151552,
|
||||||
|
hidden_size=4096,
|
||||||
|
intermediate_size=13696,
|
||||||
|
num_hidden_layers=40,
|
||||||
|
num_attention_heads=32,
|
||||||
|
num_key_value_heads=2,
|
||||||
|
head_dim=128,
|
||||||
|
hidden_act="silu",
|
||||||
|
attention_dropout=0.0,
|
||||||
|
max_position_embeddings=131072,
|
||||||
|
initializer_range=0.02,
|
||||||
|
rms_norm_eps=0.00000015625,
|
||||||
|
use_cache=True,
|
||||||
|
tie_word_embeddings=False,
|
||||||
|
rope_theta=10000.0,
|
||||||
|
pad_token_id=151329,
|
||||||
|
eos_token_id=[151329, 151336, 151338],
|
||||||
|
bos_token_id=None,
|
||||||
|
attention_bias=True,
|
||||||
|
**kwargs,
|
||||||
|
):
|
||||||
|
self.vocab_size = vocab_size
|
||||||
|
self.max_position_embeddings = max_position_embeddings
|
||||||
|
self.hidden_size = hidden_size
|
||||||
|
self.intermediate_size = intermediate_size
|
||||||
|
self.num_hidden_layers = num_hidden_layers
|
||||||
|
self.num_attention_heads = num_attention_heads
|
||||||
|
self.head_dim = head_dim
|
||||||
|
self.num_key_value_heads = num_key_value_heads
|
||||||
|
self.hidden_act = hidden_act
|
||||||
|
self.initializer_range = initializer_range
|
||||||
|
self.rms_norm_eps = rms_norm_eps
|
||||||
|
self.use_cache = use_cache
|
||||||
|
self.rope_theta = rope_theta
|
||||||
|
self.attention_bias = attention_bias
|
||||||
|
self.attention_dropout = attention_dropout
|
||||||
|
|
||||||
|
super().__init__(
|
||||||
|
pad_token_id=pad_token_id,
|
||||||
|
bos_token_id=bos_token_id,
|
||||||
|
eos_token_id=eos_token_id,
|
||||||
|
tie_word_embeddings=tie_word_embeddings,
|
||||||
|
**kwargs,
|
||||||
|
)
|
||||||
|
|
||||||
|
|
||||||
|
__all__ = ["GlmConfig"]
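
As a rough illustration of the `num_key_value_heads` description in the docstring above, the sketch below names the attention variant implied by the two head counts and computes how many query heads share each key/value head. `attention_variant` is a hypothetical helper written for this note and is not part of `GlmConfig`.

```python
def attention_variant(num_attention_heads: int, num_key_value_heads: int) -> str:
    """Name the attention scheme implied by the two head counts (illustrative helper)."""
    if num_attention_heads % num_key_value_heads != 0:
        raise ValueError("num_attention_heads must be divisible by num_key_value_heads")
    if num_key_value_heads == num_attention_heads:
        return "MHA"  # every query head has its own key/value head
    if num_key_value_heads == 1:
        return "MQA"  # all query heads share a single key/value head
    return "GQA"      # query heads are split into groups sharing a key/value head


# Defaults from GlmConfig above: 32 query heads, 2 key/value heads.
heads, kv_heads = 32, 2
queries_per_kv = heads // kv_heads
variant = attention_variant(heads, kv_heads)
```

With the defaults, each key/value head serves a group of 16 query heads.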
|
||||||
174
src/transformers/models/glm/convert_glm_weights_to_hf.py
Normal file
@@ -0,0 +1,174 @@
|
|||||||
|
import argparse
|
||||||
|
import json
|
||||||
|
import os
|
||||||
|
import re
|
||||||
|
|
||||||
|
import torch
|
||||||
|
from safetensors.torch import load_file
|
||||||
|
from tokenizers import processors
|
||||||
|
|
||||||
|
from transformers import GlmConfig, GlmForCausalLM, PreTrainedTokenizerFast
|
||||||
|
|
||||||
|
|
||||||
|
# fmt: off
|
||||||
|
# `None` means we drop the key
|
||||||
|
STATE_DICT_MAPPING = {
|
||||||
|
# CausalLM keys
|
||||||
|
r"transformer.output_layer.weight": r"lm_head.weight",
|
||||||
|
|
||||||
|
# Model keys
|
||||||
|
r"transformer.embedding.word_embeddings.weight": r"model.embed_tokens.weight",
|
||||||
|
r"transformer.rotary_pos_emb.inv_freq": None,
|
||||||
|
r"transformer.encoder.final_layernorm.weight": r"model.norm.weight",
|
||||||
|
|
||||||
|
# Layers keys
|
||||||
|
r"transformer.encoder.layers.(\d+).input_layernorm.weight": r"model.layers.\1.input_layernorm.weight",
|
||||||
|
r"transformer.encoder.layers.(\d+).post_attention_layernorm.weight": r"model.layers.\1.post_attention_layernorm.weight",
|
||||||
|
|
||||||
|
# Attention keys
|
||||||
|
r"transformer.encoder.layers.(\d+).self_attention.dense.weight": r"model.layers.\1.self_attn.o_proj.weight",
|
||||||
|
# qkv_proj will later be split in q|k|v|_proj
|
||||||
|
r"transformer.encoder.layers.(\d+).self_attention.query_key_value.(weight|bias)": r"model.layers.\1.self_attn.qkv_proj.\2",
|
||||||
|
|
||||||
|
# MLP keys
|
||||||
|
r"transformer.encoder.layers.(\d+).mlp.dense_h_to_4h.weight": r"model.layers.\1.mlp.gate_up_proj.weight",
|
||||||
|
r"transformer.encoder.layers.(\d+).mlp.dense_4h_to_h.weight": r"model.layers.\1.mlp.down_proj.weight",
|
||||||
|
}
|
||||||
|
# fmt: on
|
||||||
|
|
||||||
|
|
||||||
|
def merge_safetensors(input_dir: str):
|
||||||
|
all_files = [os.path.join(input_dir, x) for x in os.listdir(input_dir) if x.endswith(".safetensors")]
|
||||||
|
all_files = sorted(all_files, key=lambda x: int(x.rsplit("-", 3)[1]))
|
||||||
|
|
||||||
|
all_weights = {}
|
||||||
|
for file in all_files:
|
||||||
|
tensors = load_file(file)
|
||||||
|
all_weights.update(tensors)
|
||||||
|
|
||||||
|
return all_weights
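
The sort key in `merge_safetensors` appears to rely on shard names of the form `<prefix>-<index>-of-<total>.safetensors`. That naming is an assumption about the checkpoint layout, and the file names below are made up. A small sketch of what the key extracts:

```python
import os

# Hypothetical shard names in the "<prefix>-<index>-of-<total>.safetensors" form.
names = [
    "model-00003-of-00003.safetensors",
    "model-00001-of-00003.safetensors",
    "model-00002-of-00003.safetensors",
]
paths = [os.path.join("ckpt", n) for n in names]


def shard_index(path: str) -> int:
    # Splitting from the right yields [..., "<index>", "of", "<total>.safetensors"].
    return int(path.rsplit("-", 3)[1])


ordered = sorted(paths, key=shard_index)
```

A single-file checkpoint such as `model.safetensors` contains no `-`, so this key would raise an `IndexError` for it.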
|
||||||
|
|
||||||
|
|
||||||
|
def map_old_key_to_new(old_key):
|
||||||
|
for pattern, replacement in STATE_DICT_MAPPING.items():
|
||||||
|
if replacement is None:
|
||||||
|
if re.fullmatch(pattern, old_key):
|
||||||
|
return None
|
||||||
|
else:
|
||||||
|
new_key, n_replace = re.subn(pattern, replacement, old_key)
|
||||||
|
# Early exit of the loop
|
||||||
|
if n_replace > 0:
|
||||||
|
return new_key
|
||||||
|
|
||||||
|
raise ValueError(f"Key: {old_key} could not be mapped (check the mapping).")
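
To make the renaming logic concrete, here is a self-contained sketch that applies the same loop to two patterns copied from `STATE_DICT_MAPPING`. The `rename` function and the sample keys are illustrative stand-ins, not part of the converter.

```python
import re

# Two patterns copied from STATE_DICT_MAPPING above; None means "drop this key".
mapping = {
    r"transformer.rotary_pos_emb.inv_freq": None,
    r"transformer.encoder.layers.(\d+).self_attention.dense.weight": r"model.layers.\1.self_attn.o_proj.weight",
}


def rename(old_key):
    for pattern, replacement in mapping.items():
        if replacement is None:
            # Dropped keys must match the pattern in full.
            if re.fullmatch(pattern, old_key):
                return None
        else:
            new_key, n_replace = re.subn(pattern, replacement, old_key)
            if n_replace > 0:
                return new_key
    raise ValueError(f"Key: {old_key} could not be mapped.")


renamed = rename("transformer.encoder.layers.7.self_attention.dense.weight")
dropped = rename("transformer.rotary_pos_emb.inv_freq")
```

The dots in the patterns are unescaped, so they match any character. That is harmless for these keys but worth keeping in mind when adding new patterns.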
|
||||||
|
|
||||||
|
|
||||||
|
def convert_state_dict(original_state_dict: dict, config: GlmConfig):
|
||||||
|
new_dict = {}
|
||||||
|
|
||||||
|
head_dim = config.hidden_size // config.num_attention_heads
|
||||||
|
query_size = config.num_attention_heads * head_dim
|
||||||
|
kv_size = config.num_key_value_heads * head_dim
|
||||||
|
|
||||||
|
for old_key, value in original_state_dict.items():
|
||||||
|
new_key = map_old_key_to_new(old_key)
|
||||||
|
if new_key is None:
|
||||||
|
continue
|
||||||
|
|
||||||
|
if "qkv_proj." in new_key:
|
||||||
|
q_proj, k_proj, v_proj = (
|
||||||
|
value[:query_size, ...],
|
||||||
|
value[query_size : query_size + kv_size, ...],
|
||||||
|
value[query_size + kv_size :, ...],
|
||||||
|
)
|
||||||
|
new_dict[new_key.replace("qkv_proj.", "q_proj.")] = q_proj
|
||||||
|
new_dict[new_key.replace("qkv_proj.", "k_proj.")] = k_proj
|
||||||
|
new_dict[new_key.replace("qkv_proj.", "v_proj.")] = v_proj
|
||||||
|
else:
|
||||||
|
new_dict[new_key] = value
|
||||||
|
return new_dict
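
The slicing in `convert_state_dict` implies that the fused `query_key_value` weight stores all query rows first, then the key rows, then the value rows. A minimal sketch of the resulting sizes with the default head counts, using a list of labels as a stand-in for the real tensor:

```python
# Default GlmConfig sizes from this commit: 32 query heads, 2 key/value heads, head_dim 128.
num_attention_heads, num_key_value_heads, head_dim = 32, 2, 128
query_size = num_attention_heads * head_dim  # rows belonging to Q
kv_size = num_key_value_heads * head_dim     # rows belonging to K, and again to V

# Stand-in for the fused weight: one label per output row instead of a real tensor.
fused_rows = ["q"] * query_size + ["k"] * kv_size + ["v"] * kv_size

q_rows = fused_rows[:query_size]
k_rows = fused_rows[query_size : query_size + kv_size]
v_rows = fused_rows[query_size + kv_size :]
```

With grouped-query attention the key and value projections are much smaller than the query projection, which is why the three slices are not equal thirds.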
|
||||||
|
|
||||||
|
|
||||||
|
def convert_config(original_config: dict):
|
||||||
|
key_mapping = {
|
||||||
|
"vocab_size": "padded_vocab_size",
|
||||||
|
"intermediate_size": "ffn_hidden_size",
|
||||||
|
"num_hidden_layers": "num_layers",
|
||||||
|
"max_position_embeddings": "seq_length",
|
||||||
|
"rms_norm_eps": "layernorm_epsilon",
|
||||||
|
"head_dim": "kv_channels",
|
||||||
|
"attention_bias": "add_qkv_bias",
|
||||||
|
}
|
||||||
|
similar_keys_to_keep = [
|
||||||
|
"num_attention_heads",
|
||||||
|
"hidden_size",
|
||||||
|
"attention_dropout",
|
||||||
|
"use_cache",
|
||||||
|
"eos_token_id",
|
||||||
|
"pad_token_id",
|
||||||
|
"tie_word_embeddings",
|
||||||
|
]
|
||||||
|
new_config_kwargs = {k: original_config[v] for k, v in key_mapping.items()}
|
||||||
|
new_config_kwargs.update({k: v for k, v in original_config.items() if k in similar_keys_to_keep})
|
||||||
|
new_config_kwargs["num_key_value_heads"] = (
|
||||||
|
new_config_kwargs["num_attention_heads"]
|
||||||
|
if not original_config["multi_query_attention"]
|
||||||
|
else original_config["multi_query_group_num"]
|
||||||
|
)
|
||||||
|
new_config_kwargs["rope_theta"] = 10000.0 * getattr(original_config, "rope_ratio", 1)
|
||||||
|
|
||||||
|
new_config = GlmConfig(**new_config_kwargs)
|
||||||
|
return new_config
|
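Two Python details matter when a checkpoint config is read with `json.load` and remapped as in `convert_config`: adjacent string literals are concatenated, and `getattr` does not look up dictionary keys. The sketch below illustrates both on toy data; the `rope_ratio` value of 500 is a made-up number for illustration only.

```python
# Pitfall 1: adjacent string literals are concatenated, so a missing comma
# silently merges two list entries into one.
keys_with_missing_comma = ["num_attention_heads" "hidden_size", "use_cache"]
keys_fixed = ["num_attention_heads", "hidden_size", "use_cache"]

# Pitfall 2: `getattr` looks up attributes, not dict keys, so it falls back
# to the default even when the key is present in the dict.
toy_config = {"rope_ratio": 500}  # hypothetical value
via_getattr = getattr(toy_config, "rope_ratio", 1)
via_get = toy_config.get("rope_ratio", 1)

# The RoPE base scales the default 10000.0 by the ratio actually stored.
rope_theta = 10000.0 * via_get
```

In the first case neither `num_attention_heads` nor `hidden_size` would be copied over, because no key in the original config matches the merged string.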
def convert_glm_tokenizer(input_dir):
    fast_tok = PreTrainedTokenizerFast.from_pretrained(input_dir, model_input_names=["input_ids", "attention_mask"])
    # Add the two tokens automatically with post processor
    fast_tok._tokenizer.post_processor = processors.Sequence(
        [
            processors.ByteLevel(trim_offsets=False),
            processors.TemplateProcessing(
                single="[gMASK]:0 <sop>:0 $A:0",
                pair="[gMASK]:0 <sop>:0 $A:0 $B:1",
                special_tokens=[("[gMASK]", 151331), ("<sop>", 151333)],
            ),
        ],
    )

    return fast_tok
|
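The template strings in the post processor read as "prepend `[gMASK]` and `<sop>` to every sequence, and give a second sequence type id 1". A rough pure-Python sketch of that effect on already-tokenized ids is shown below; `apply_glm_template` is a hypothetical helper written for illustration, and the real work is done by the `tokenizers` library.

```python
GMASK_ID = 151331  # id paired with "[gMASK]" in the template
SOP_ID = 151333  # id paired with "<sop>" in the template


def apply_glm_template(ids_a, ids_b=None):
    """Mimic the single/pair templates on lists of token ids."""
    input_ids = [GMASK_ID, SOP_ID] + list(ids_a)
    type_ids = [0, 0] + [0] * len(ids_a)
    if ids_b is not None:
        input_ids += list(ids_b)
        type_ids += [1] * len(ids_b)
    return input_ids, type_ids


# Toy ids, not real vocabulary entries.
single_ids, single_types = apply_glm_template([10, 11])
pair_ids, pair_types = apply_glm_template([10, 11], [20])
```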
def convert_glm_model(input_dir, output_dir):
    # Load and convert config
    with open(os.path.join(input_dir, "config.json")) as f:
        original_config = json.load(f)
    config = convert_config(original_config)
    config.save_pretrained(output_dir)

    # Load and convert weights
    original_state_dict = merge_safetensors(input_dir)
    new_dict = convert_state_dict(original_state_dict, config)
    with torch.device("meta"):
        model = GlmForCausalLM(config)
    model.load_state_dict(new_dict, strict=True, assign=True)
    model.save_pretrained(output_dir)

    # Load and convert tokenizer
    tokenizer = convert_glm_tokenizer(input_dir)
    tokenizer.save_pretrained(output_dir)


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "input_dir",
        type=str,
        help="Location of the local folder copied from the Hub.",
    )
    parser.add_argument(
        "output_dir",
        type=str,
        help="Location to write HF model and tokenizer",
    )

    args = parser.parse_args()
    convert_glm_model(args.input_dir, args.output_dir)
1313
src/transformers/models/glm/modeling_glm.py
Normal file
File diff suppressed because it is too large
188
src/transformers/models/glm/modular_glm.py
Normal file
@@ -0,0 +1,188 @@
# coding=utf-8
# Copyright 2024 The GLM & ZhipuAI team and HuggingFace Inc. team. All rights reserved.
#
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import math
from typing import Optional

import torch
import torch.nn as nn
import torch.utils.checkpoint

from ...utils import logging
from ..gemma.modeling_gemma import (
    GemmaForCausalLM,
    GemmaForSequenceClassification,
    GemmaForTokenClassification,
)
from ..granite.modeling_granite import (
    GraniteAttention,
    GraniteFlashAttention2,
    GraniteSdpaAttention,
)
from ..llama.modeling_llama import (
    LlamaDecoderLayer,
    LlamaModel,
    LlamaPreTrainedModel,
)
from ..phi3.modeling_phi3 import (
    Phi3MLP,
    Phi3RMSNorm,
    Phi3RotaryEmbedding,
)
from .configuration_glm import GlmConfig


logger = logging.get_logger(__name__)


class GlmRMSNorm(Phi3RMSNorm):
    pass


class GlmRotaryEmbedding(Phi3RotaryEmbedding):
    pass


class GlmMLP(Phi3MLP):
    pass


def rotate_half(x):
    """Rotates half the hidden dims of the input."""
    x1 = x[..., 0::2]
    x2 = x[..., 1::2]
    return torch.stack((-x2, x1), dim=-1).flatten(-2)
|
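The `rotate_half` defined here pairs adjacent elements (even index with the following odd index), whereas the Llama-style helper of the same name pairs the first half of the vector with the second half. A small list-based sketch of the difference follows; both function names are illustrative stand-ins, not library functions.

```python
def rotate_half_interleaved(x):
    """Pairwise variant: (x0, x1, x2, x3) -> (-x1, x0, -x3, x2)."""
    out = []
    for i in range(0, len(x), 2):
        out.extend([-x[i + 1], x[i]])
    return out


def rotate_half_split(x):
    """Split-half variant: (x0, x1, x2, x3) -> (-x2, -x3, x0, x1)."""
    half = len(x) // 2
    return [-v for v in x[half:]] + list(x[:half])


x = [1, 2, 3, 4]
interleaved = rotate_half_interleaved(x)  # [-2, 1, -4, 3]
split = rotate_half_split(x)  # [-3, -4, 1, 2]
```

The two layouts are not interchangeable: weights trained with one pairing produce different attention scores under the other unless the projection rows are permuted accordingly.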


def apply_rotary_pos_emb(q, k, cos, sin, position_ids=None, unsqueeze_dim=1):
    """Applies Rotary Position Embedding to the query and key tensors.

    Args:
        q (`torch.Tensor`): The query tensor.
        k (`torch.Tensor`): The key tensor.
        cos (`torch.Tensor`): The cosine part of the rotary embedding.
        sin (`torch.Tensor`): The sine part of the rotary embedding.
        position_ids (`torch.Tensor`, *optional*):
            Deprecated and unused.
        unsqueeze_dim (`int`, *optional*, defaults to 1):
            The 'unsqueeze_dim' argument specifies the dimension along which to unsqueeze cos[position_ids] and
            sin[position_ids] so that they can be properly broadcasted to the dimensions of q and k. For example, note
            that cos[position_ids] and sin[position_ids] have the shape [batch_size, seq_len, head_dim]. Then, if q and
            k have the shape [batch_size, heads, seq_len, head_dim], then setting unsqueeze_dim=1 makes
            cos[position_ids] and sin[position_ids] broadcastable to the shapes of q and k. Similarly, if q and k have
            the shape [batch_size, seq_len, heads, head_dim], then set unsqueeze_dim=2.
    Returns:
        `tuple(torch.Tensor)` comprising of the query and key tensors rotated using the Rotary Position Embedding.
    """
    cos = cos.unsqueeze(unsqueeze_dim)
    sin = sin.unsqueeze(unsqueeze_dim)

    # Interleave them instead of usual shape
    cos = cos[..., : cos.shape[-1] // 2].repeat_interleave(2, dim=-1)
    sin = sin[..., : sin.shape[-1] // 2].repeat_interleave(2, dim=-1)

    # Keep half for later concatenation
    q, q_pass = q[..., : q.shape[-1] // 2], q[..., q.shape[-1] // 2 :]
    k, k_pass = k[..., : k.shape[-1] // 2], k[..., k.shape[-1] // 2 :]

    # Apply rotary embeddings on the first half
    q_embed = (q * cos) + (rotate_half(q) * sin)
    k_embed = (k * cos) + (rotate_half(k) * sin)

    # Concatenate back to full shape
    q_embed = torch.cat([q_embed, q_pass], dim=-1)
    k_embed = torch.cat([k_embed, k_pass], dim=-1)
    return q_embed, k_embed
|
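Taken together, `apply_rotary_pos_emb` rotates only the first half of each head vector, in adjacent pairs, and passes the second half through unchanged. The sketch below reproduces that behaviour for a single vector with the standard library only. It assumes the usual RoPE frequency schedule `base ** (-2i / rot_dim)` over the rotated half and a head dimension divisible by 4; the function name is hypothetical.

```python
import math


def apply_partial_interleaved_rope(vec, position, base=10000.0):
    """Rotate the first half of `vec` pairwise; keep the second half as is."""
    head_dim = len(vec)
    rot_dim = head_dim // 2
    rotated, passthrough = vec[:rot_dim], vec[rot_dim:]
    out = []
    for i in range(rot_dim // 2):
        # One angle per adjacent pair (x[2i], x[2i + 1]).
        theta = position / (base ** (2 * i / rot_dim))
        c, s = math.cos(theta), math.sin(theta)
        a, b = rotated[2 * i], rotated[2 * i + 1]
        out.extend([a * c - b * s, b * c + a * s])
    return out + list(passthrough)


vec = [1.0, 0.0, 0.0, 1.0, 5.0, 6.0, 7.0, 8.0]
at_position_0 = apply_partial_interleaved_rope(vec, position=0)
at_position_1 = apply_partial_interleaved_rope(vec, position=1)
```

At position 0 every angle is zero, so the vector comes back unchanged; at any position the last four entries are untouched and the norm of the rotated half is preserved.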


class GlmAttention(GraniteAttention):
    def __init__(self, config: GlmConfig, layer_idx: Optional[int] = None):
        super().__init__(config, layer_idx)
        self.o_proj = nn.Linear(self.hidden_size, self.hidden_size, bias=False)
        self.scaling = 1 / math.sqrt(self.head_dim)


class GlmFlashAttention2(GlmAttention, GraniteFlashAttention2):
    pass


class GlmSdpaAttention(GraniteSdpaAttention):
    pass


GLM_ATTENTION_CLASSES = {
    "eager": GlmAttention,
    "flash_attention_2": GlmFlashAttention2,
    "sdpa": GlmSdpaAttention,
}
|
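A mapping such as `GLM_ATTENTION_CLASSES` is typically indexed with the attention implementation name stored on the config (Llama-style decoder layers use `config._attn_implementation` for this). The sketch below shows the dispatch pattern with stand-in classes; the class names and the `SimpleNamespace` config are assumptions made for illustration and do not reproduce the generated `modeling_glm.py`.

```python
from types import SimpleNamespace


class EagerAttention:
    def __init__(self, config, layer_idx):
        self.layer_idx = layer_idx


class SdpaAttention(EagerAttention):
    pass


# Stand-in for the real mapping; keys mirror the implementation names.
ATTENTION_CLASSES = {"eager": EagerAttention, "sdpa": SdpaAttention}


def build_attention(config, layer_idx):
    """Pick the attention class from the name stored on the config."""
    return ATTENTION_CLASSES[config._attn_implementation](config, layer_idx)


config = SimpleNamespace(_attn_implementation="sdpa")
attention = build_attention(config, layer_idx=0)
```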


class GlmDecoderLayer(LlamaDecoderLayer):
    def __init__(self, config: GlmConfig, layer_idx: Optional[int] = None):
        super().__init__()

        self.mlp = GlmMLP(config)
        self.input_layernorm = GlmRMSNorm(config.hidden_size, eps=config.rms_norm_eps)
        self.post_attention_layernorm = GlmRMSNorm(config.hidden_size, eps=config.rms_norm_eps)


class GlmPreTrainedModel(LlamaPreTrainedModel):
    pass


class GlmModel(GlmPreTrainedModel, LlamaModel):
    def __init__(self, config: GlmConfig):
        super().__init__(config)
        self.layers = nn.ModuleList(
            [GlmDecoderLayer(config, layer_idx) for layer_idx in range(config.num_hidden_layers)]
        )
        self.norm = GlmRMSNorm(config.hidden_size, eps=config.rms_norm_eps)
        self.rotary_emb = GlmRotaryEmbedding(
            dim=config.head_dim // 2, max_position_embeddings=config.max_position_embeddings, base=config.rope_theta
        )
        self.gradient_checkpointing = False

        # Initialize weights and apply final processing
        self.post_init()


class GlmForCausalLM(GemmaForCausalLM):
    def __init__(self, config: GlmConfig):
        super().__init__(config)
        self.model = GlmModel(config)
        self.post_init()


class GlmForSequenceClassification(GemmaForSequenceClassification):
    def __init__(self, config: GlmConfig):
        super().__init__(config)
        self.model = GlmModel(config)
        self.post_init()


class GlmForTokenClassification(GemmaForTokenClassification):
    def __init__(self, config: GlmConfig):
        super().__init__(config)
        self.model = GlmModel(config)
        self.post_init()


__all__ = [
    "GlmPreTrainedModel",
    "GlmModel",
    "GlmForCausalLM",
    "GlmForSequenceClassification",
    "GlmForTokenClassification",
]
@@ -4368,6 +4368,41 @@ class GitVisionModel(metaclass=DummyObject):
        requires_backends(self, ["torch"])


class GlmForCausalLM(metaclass=DummyObject):
    _backends = ["torch"]

    def __init__(self, *args, **kwargs):
        requires_backends(self, ["torch"])


class GlmForSequenceClassification(metaclass=DummyObject):
    _backends = ["torch"]

    def __init__(self, *args, **kwargs):
        requires_backends(self, ["torch"])


class GlmForTokenClassification(metaclass=DummyObject):
    _backends = ["torch"]

    def __init__(self, *args, **kwargs):
        requires_backends(self, ["torch"])


class GlmModel(metaclass=DummyObject):
    _backends = ["torch"]

    def __init__(self, *args, **kwargs):
        requires_backends(self, ["torch"])


class GlmPreTrainedModel(metaclass=DummyObject):
    _backends = ["torch"]

    def __init__(self, *args, **kwargs):
        requires_backends(self, ["torch"])


class GLPNForDepthEstimation(metaclass=DummyObject):
    _backends = ["torch"]

@@ -1943,6 +1943,13 @@ def create_import_structure_from_path(module_path):
    if "__init__.py" in adjacent_modules:
        adjacent_modules.remove("__init__.py")

    # Modular files should not be imported
    def find_substring(substring, list_):
        return any(substring in x for x in list_)

    if find_substring("modular_", adjacent_modules) and find_substring("modeling_", adjacent_modules):
        adjacent_modules = [module for module in adjacent_modules if "modular_" not in module]

    module_requirements = {}
    for module_name in adjacent_modules:
        # Only modules ending in `.py` are accepted here.
|
||||||
|
|||||||
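The filter added to `create_import_structure_from_path` drops `modular_*` files only when a generated `modeling_*` file sits next to them. A standalone sketch of that rule is given below; the wrapper function `drop_modular_files` and the file lists are illustrative assumptions.

```python
def drop_modular_files(adjacent_modules):
    """Remove modular_* entries when a modeling_* file is also present."""

    def find_substring(substring, list_):
        return any(substring in x for x in list_)

    if find_substring("modular_", adjacent_modules) and find_substring("modeling_", adjacent_modules):
        return [module for module in adjacent_modules if "modular_" not in module]
    return list(adjacent_modules)


with_both = drop_modular_files(["configuration_glm.py", "modeling_glm.py", "modular_glm.py"])
modular_only = drop_modular_files(["configuration_foo.py", "modular_foo.py"])
```

A folder that has only a modular file keeps it, so the rule never leaves a model directory without an importable definition.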
0
tests/models/glm/__init__.py
Normal file
955
tests/models/glm/test_modeling_glm.py
Normal file
@@ -0,0 +1,955 @@
|
|||||||
|
# coding=utf-8
|
||||||
|
# Copyright 2024 The HuggingFace Inc. team. All rights reserved.
|
||||||
|
#
|
||||||
|
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||||
|
# you may not use this file except in compliance with the License.
|
||||||
|
# You may obtain a copy of the License at
|
||||||
|
#
|
||||||
|
# http://www.apache.org/licenses/LICENSE-2.0
|
||||||
|
#
|
||||||
|
# Unless required by applicable law or agreed to in writing, software
|
||||||
|
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||||
|
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||||
|
# See the License for the specific language governing permissions and
|
||||||
|
# limitations under the License.
|
||||||
|
"""Testing suite for the PyTorch Glm model."""
|
||||||
|
|
||||||
|
import inspect
|
||||||
|
import tempfile
|
||||||
|
import unittest
|
||||||
|
|
||||||
|
import numpy as np
|
||||||
|
import pytest
|
||||||
|
from parameterized import parameterized
|
||||||
|
|
||||||
|
from transformers import AutoModelForCausalLM, AutoTokenizer, GlmConfig, is_torch_available
|
||||||
|
from transformers.testing_utils import (
|
||||||
|
is_flaky,
|
||||||
|
require_flash_attn,
|
||||||
|
require_torch,
|
||||||
|
require_torch_accelerator,
|
||||||
|
require_torch_gpu,
|
||||||
|
require_torch_sdpa,
|
||||||
|
slow,
|
||||||
|
torch_device,
|
||||||
|
)
|
||||||
|
from transformers.utils import is_torch_bf16_available_on_device, is_torch_fp16_available_on_device
|
||||||
|
|
||||||
|
from ...generation.test_utils import GenerationTesterMixin
|
||||||
|
from ...test_configuration_common import ConfigTester
|
||||||
|
from ...test_modeling_common import ModelTesterMixin, ids_tensor
|
||||||
|
from ...test_pipeline_mixin import PipelineTesterMixin
|
||||||
|
|
||||||
|
|
||||||
|
if is_torch_available():
|
||||||
|
import torch
|
||||||
|
|
||||||
|
from transformers import (
|
||||||
|
GlmForCausalLM,
|
||||||
|
GlmForSequenceClassification,
|
||||||
|
GlmForTokenClassification,
|
||||||
|
GlmModel,
|
||||||
|
)
|
||||||
|
|
||||||
|
|
||||||
|
@require_torch
|
||||||
|
class GlmModelTester:
|
||||||
|
config_class = GlmConfig
|
||||||
|
if is_torch_available():
|
||||||
|
model_class = GlmModel
|
||||||
|
for_causal_lm_class = GlmForCausalLM
|
||||||
|
for_sequence_class = GlmForSequenceClassification
|
||||||
|
for_token_class = GlmForTokenClassification
|
||||||
|
|
||||||
|
def __init__(
|
||||||
|
self,
|
||||||
|
parent,
|
||||||
|
batch_size=13,
|
||||||
|
seq_length=7,
|
||||||
|
is_training=True,
|
||||||
|
use_input_mask=True,
|
||||||
|
use_token_type_ids=False,
|
||||||
|
use_labels=True,
|
||||||
|
vocab_size=99,
|
||||||
|
hidden_size=32,
|
||||||
|
num_hidden_layers=2,
|
||||||
|
num_attention_heads=4,
|
||||||
|
num_key_value_heads=2,
|
||||||
|
intermediate_size=37,
|
||||||
|
hidden_act="silu",
|
||||||
|
attention_dropout=0.1,
|
||||||
|
max_position_embeddings=512,
|
||||||
|
type_vocab_size=16,
|
||||||
|
type_sequence_label_size=2,
|
||||||
|
initializer_range=0.02,
|
||||||
|
num_labels=3,
|
||||||
|
num_choices=4,
|
||||||
|
pad_token_id=0,
|
||||||
|
scope=None,
|
||||||
|
):
|
||||||
|
self.parent = parent
|
||||||
|
self.batch_size = batch_size
|
||||||
|
self.seq_length = seq_length
|
||||||
|
self.is_training = is_training
|
||||||
|
self.use_input_mask = use_input_mask
|
||||||
|
self.use_token_type_ids = use_token_type_ids
|
||||||
|
self.use_labels = use_labels
|
||||||
|
self.vocab_size = vocab_size
|
||||||
|
self.hidden_size = hidden_size
|
||||||
|
self.num_hidden_layers = num_hidden_layers
|
||||||
|
self.num_attention_heads = num_attention_heads
|
||||||
|
self.num_key_value_heads = num_key_value_heads
|
||||||
|
self.intermediate_size = intermediate_size
|
||||||
|
self.hidden_act = hidden_act
|
||||||
|
self.attention_dropout = attention_dropout
|
||||||
|
self.max_position_embeddings = max_position_embeddings
|
||||||
|
self.type_vocab_size = type_vocab_size
|
||||||
|
self.type_sequence_label_size = type_sequence_label_size
|
||||||
|
self.initializer_range = initializer_range
|
||||||
|
self.num_labels = num_labels
|
||||||
|
self.num_choices = num_choices
|
||||||
|
self.pad_token_id = pad_token_id
|
||||||
|
self.scope = scope
|
||||||
|
self.head_dim = self.hidden_size // self.num_attention_heads
|
||||||
|
|
||||||
|
# Copied from tests.models.mistral.test_modeling_mistral.MistralModelTester.prepare_config_and_inputs
|
||||||
|
def prepare_config_and_inputs(self):
|
||||||
|
input_ids = ids_tensor([self.batch_size, self.seq_length], self.vocab_size)
|
||||||
|
|
||||||
|
input_mask = None
|
||||||
|
if self.use_input_mask:
|
||||||
|
input_mask = torch.tril(torch.ones_like(input_ids).to(torch_device))
|
||||||
|
|
||||||
|
token_type_ids = None
|
||||||
|
if self.use_token_type_ids:
|
||||||
|
token_type_ids = ids_tensor([self.batch_size, self.seq_length], self.type_vocab_size)
|
||||||
|
|
||||||
|
sequence_labels = None
|
||||||
|
token_labels = None
|
||||||
|
choice_labels = None
|
||||||
|
if self.use_labels:
|
||||||
|
sequence_labels = ids_tensor([self.batch_size], self.type_sequence_label_size)
|
||||||
|
token_labels = ids_tensor([self.batch_size, self.seq_length], self.num_labels)
|
||||||
|
choice_labels = ids_tensor([self.batch_size], self.num_choices)
|
||||||
|
|
||||||
|
config = self.get_config()
|
||||||
|
|
||||||
|
return config, input_ids, token_type_ids, input_mask, sequence_labels, token_labels, choice_labels
|
||||||
|
|
||||||
|
def get_config(self):
|
||||||
|
return self.config_class(
|
||||||
|
vocab_size=self.vocab_size,
|
||||||
|
hidden_size=self.hidden_size,
|
||||||
|
num_hidden_layers=self.num_hidden_layers,
|
||||||
|
num_attention_heads=self.num_attention_heads,
|
||||||
|
num_key_value_heads=self.num_key_value_heads,
|
||||||
|
intermediate_size=self.intermediate_size,
|
||||||
|
hidden_act=self.hidden_act,
|
||||||
|
attention_dropout=self.attention_dropout,
|
||||||
|
max_position_embeddings=self.max_position_embeddings,
|
||||||
|
initializer_range=self.initializer_range,
|
||||||
|
pad_token_id=self.pad_token_id,
|
||||||
|
head_dim=self.head_dim,
|
||||||
|
)
|
||||||
|
|
||||||
|
def create_and_check_model(
|
||||||
|
self, config, input_ids, token_type_ids, input_mask, sequence_labels, token_labels, choice_labels
|
||||||
|
):
|
||||||
|
model = self.model_class(config=config)
|
||||||
|
model.to(torch_device)
|
||||||
|
model.eval()
|
||||||
|
result = model(input_ids, attention_mask=input_mask)
|
||||||
|
result = model(input_ids)
|
||||||
|
self.parent.assertEqual(result.last_hidden_state.shape, (self.batch_size, self.seq_length, self.hidden_size))
|
||||||
|
|
||||||
|
def create_and_check_model_as_decoder(
|
||||||
|
self,
|
||||||
|
config,
|
||||||
|
input_ids,
|
||||||
|
token_type_ids,
|
||||||
|
input_mask,
|
||||||
|
sequence_labels,
|
||||||
|
token_labels,
|
||||||
|
choice_labels,
|
||||||
|
encoder_hidden_states,
|
||||||
|
encoder_attention_mask,
|
||||||
|
):
|
||||||
|
model = self.model_class(config)
|
||||||
|
model.to(torch_device)
|
||||||
|
model.eval()
|
||||||
|
result = model(
|
||||||
|
input_ids,
|
||||||
|
attention_mask=input_mask,
|
||||||
|
encoder_hidden_states=encoder_hidden_states,
|
||||||
|
encoder_attention_mask=encoder_attention_mask,
|
||||||
|
)
|
||||||
|
result = model(
|
||||||
|
input_ids,
|
||||||
|
attention_mask=input_mask,
|
||||||
|
encoder_hidden_states=encoder_hidden_states,
|
||||||
|
)
|
||||||
|
result = model(input_ids, attention_mask=input_mask)
|
||||||
|
self.parent.assertEqual(result.last_hidden_state.shape, (self.batch_size, self.seq_length, self.hidden_size))
|
||||||
|
|
||||||
|
def create_and_check_for_causal_lm(
|
||||||
|
self,
|
||||||
|
config,
|
||||||
|
input_ids,
|
||||||
|
token_type_ids,
|
||||||
|
input_mask,
|
||||||
|
sequence_labels,
|
||||||
|
token_labels,
|
||||||
|
choice_labels,
|
||||||
|
encoder_hidden_states,
|
||||||
|
encoder_attention_mask,
|
||||||
|
):
|
||||||
|
model = self.for_causal_lm_class(config=config)
|
||||||
|
model.to(torch_device)
|
||||||
|
model.eval()
|
||||||
|
result = model(input_ids, attention_mask=input_mask, labels=token_labels)
|
||||||
|
self.parent.assertEqual(result.logits.shape, (self.batch_size, self.seq_length, self.vocab_size))
|
||||||
|
|
||||||
|
def create_and_check_decoder_model_past_large_inputs(
|
||||||
|
self,
|
||||||
|
config,
|
||||||
|
input_ids,
|
||||||
|
token_type_ids,
|
||||||
|
input_mask,
|
||||||
|
sequence_labels,
|
||||||
|
token_labels,
|
||||||
|
choice_labels,
|
||||||
|
encoder_hidden_states,
|
||||||
|
encoder_attention_mask,
|
||||||
|
):
|
||||||
|
model = self.for_causal_lm_class(config=config)
|
||||||
|
model.to(torch_device)
|
||||||
|
model.eval()
|
||||||
|
|
||||||
|
# first forward pass
|
||||||
|
outputs = model(
|
||||||
|
input_ids,
|
||||||
|
attention_mask=input_mask,
|
||||||
|
encoder_hidden_states=encoder_hidden_states,
|
||||||
|
encoder_attention_mask=encoder_attention_mask,
|
||||||
|
use_cache=True,
|
||||||
|
)
|
||||||
|
past_key_values = outputs.past_key_values
|
||||||
|
|
||||||
|
# create hypothetical multiple next tokens and extend to next_input_ids
|
||||||
|
next_tokens = ids_tensor((self.batch_size, 3), config.vocab_size)
|
||||||
|
next_mask = ids_tensor((self.batch_size, 3), vocab_size=2)
|
||||||
|
|
||||||
|
# append to next input_ids and attention mask
|
||||||
|
next_input_ids = torch.cat([input_ids, next_tokens], dim=-1)
|
||||||
|
next_attention_mask = torch.cat([input_mask, next_mask], dim=-1)
|
||||||
|
|
||||||
|
output_from_no_past = model(
|
||||||
|
next_input_ids,
|
||||||
|
attention_mask=next_attention_mask,
|
||||||
|
encoder_hidden_states=encoder_hidden_states,
|
||||||
|
encoder_attention_mask=encoder_attention_mask,
|
||||||
|
output_hidden_states=True,
|
||||||
|
)["hidden_states"][0]
|
||||||
|
output_from_past = model(
|
||||||
|
next_tokens,
|
||||||
|
attention_mask=next_attention_mask,
|
||||||
|
encoder_hidden_states=encoder_hidden_states,
|
||||||
|
encoder_attention_mask=encoder_attention_mask,
|
||||||
|
past_key_values=past_key_values,
|
||||||
|
output_hidden_states=True,
|
||||||
|
)["hidden_states"][0]
|
||||||
|
|
||||||
|
# select random slice
|
||||||
|
random_slice_idx = ids_tensor((1,), output_from_past.shape[-1]).item()
|
||||||
|
output_from_no_past_slice = output_from_no_past[:, -3:, random_slice_idx].detach()
|
||||||
|
output_from_past_slice = output_from_past[:, :, random_slice_idx].detach()
|
||||||
|
|
||||||
|
self.parent.assertTrue(output_from_past_slice.shape[1] == next_tokens.shape[1])
|
||||||
|
|
||||||
|
# test that outputs are equal for slice
|
||||||
|
self.parent.assertTrue(torch.allclose(output_from_past_slice, output_from_no_past_slice, atol=1e-3))
|
||||||
|
|
||||||
|
# Copied from tests.models.llama.test_modeling_llama.LlamaModelTester.prepare_config_and_inputs_for_common with Llama->Glm
|
||||||
|
def prepare_config_and_inputs_for_common(self):
|
||||||
|
config_and_inputs = self.prepare_config_and_inputs()
|
||||||
|
(
|
||||||
|
config,
|
||||||
|
input_ids,
|
||||||
|
token_type_ids,
|
||||||
|
input_mask,
|
||||||
|
sequence_labels,
|
||||||
|
token_labels,
|
||||||
|
choice_labels,
|
||||||
|
) = config_and_inputs
|
||||||
|
inputs_dict = {"input_ids": input_ids, "attention_mask": input_mask}
|
||||||
|
return config, inputs_dict
|
||||||
|
|
||||||
|
|
||||||
|
@require_torch
|
||||||
|
class GlmModelTest(ModelTesterMixin, GenerationTesterMixin, PipelineTesterMixin, unittest.TestCase):
|
||||||
|
all_model_classes = (
|
||||||
|
(GlmModel, GlmForCausalLM, GlmForSequenceClassification, GlmForTokenClassification)
|
||||||
|
if is_torch_available()
|
||||||
|
else ()
|
||||||
|
)
|
||||||
|
all_generative_model_classes = (GlmForCausalLM,) if is_torch_available() else ()
|
||||||
|
pipeline_model_mapping = (
|
||||||
|
{
|
||||||
|
"feature-extraction": GlmModel,
|
||||||
|
"text-classification": GlmForSequenceClassification,
|
||||||
|
"token-classification": GlmForTokenClassification,
|
||||||
|
"text-generation": GlmForCausalLM,
|
||||||
|
}
|
||||||
|
if is_torch_available()
|
||||||
|
else {}
|
||||||
|
)
|
||||||
|
test_headmasking = False
|
||||||
|
test_pruning = False
|
||||||
|
|
||||||
|
# used in `test_torch_compile`
|
||||||
|
_torch_compile_test_ckpt = "THUDM/glm-4-9b"
|
||||||
|
_torch_compile_test_revision = "refs/pr/15"
|
||||||
|
|
||||||
|
def setUp(self):
|
||||||
|
self.model_tester = GlmModelTester(self)
|
||||||
|
self.config_tester = ConfigTester(self, config_class=GlmConfig, hidden_size=37)
|
||||||
|
|
||||||
|
def test_config(self):
|
||||||
|
self.config_tester.run_common_tests()
|
||||||
|
|
||||||
|
def test_model(self):
|
||||||
|
config_and_inputs = self.model_tester.prepare_config_and_inputs()
|
||||||
|
self.model_tester.create_and_check_model(*config_and_inputs)
|
||||||
|
|
||||||
|
def test_model_various_embeddings(self):
|
||||||
|
config_and_inputs = self.model_tester.prepare_config_and_inputs()
|
||||||
|
for type in ["absolute", "relative_key", "relative_key_query"]:
|
||||||
|
config_and_inputs[0].position_embedding_type = type
|
||||||
|
self.model_tester.create_and_check_model(*config_and_inputs)
|
||||||
|
|
||||||
|
def test_Glm_sequence_classification_model(self):
|
||||||
|
config, input_dict = self.model_tester.prepare_config_and_inputs_for_common()
|
||||||
|
|
||||||
|
config.num_labels = 3
|
||||||
|
input_ids = input_dict["input_ids"]
|
||||||
|
attention_mask = input_ids.ne(1).to(torch_device)
|
||||||
|
sequence_labels = ids_tensor([self.model_tester.batch_size], self.model_tester.type_sequence_label_size)
|
||||||
|
model = self.model_tester.for_sequence_class(config)
|
||||||
|
model.to(torch_device)
|
||||||
|
model.eval()
|
||||||
|
result = model(input_ids, attention_mask=attention_mask, labels=sequence_labels)
|
||||||
|
self.assertEqual(result.logits.shape, (self.model_tester.batch_size, self.model_tester.num_labels))
|
||||||
|
|
||||||
|
def test_Glm_sequence_classification_model_for_single_label(self):
|
||||||
|
config, input_dict = self.model_tester.prepare_config_and_inputs_for_common()
|
||||||
|
config.num_labels = 3
|
||||||
|
config.problem_type = "single_label_classification"
|
||||||
|
input_ids = input_dict["input_ids"]
|
||||||
|
attention_mask = input_ids.ne(1).to(torch_device)
|
||||||
|
sequence_labels = ids_tensor([self.model_tester.batch_size], self.model_tester.type_sequence_label_size)
|
||||||
|
model = self.model_tester.for_sequence_class(config)
|
||||||
|
model.to(torch_device)
|
||||||
|
model.eval()
|
||||||
|
result = model(input_ids, attention_mask=attention_mask, labels=sequence_labels)
|
||||||
|
self.assertEqual(result.logits.shape, (self.model_tester.batch_size, self.model_tester.num_labels))
|
||||||
|
|
||||||
|
def test_Glm_sequence_classification_model_for_multi_label(self):
|
||||||
|
config, input_dict = self.model_tester.prepare_config_and_inputs_for_common()
|
||||||
|
config.num_labels = 3
|
||||||
|
config.problem_type = "multi_label_classification"
|
||||||
|
input_ids = input_dict["input_ids"]
|
||||||
|
attention_mask = input_ids.ne(1).to(torch_device)
|
||||||
|
sequence_labels = ids_tensor(
|
||||||
|
[self.model_tester.batch_size, config.num_labels], self.model_tester.type_sequence_label_size
|
||||||
|
).to(torch.float)
|
||||||
|
model = self.model_tester.for_sequence_class(config)
|
||||||
|
model.to(torch_device)
|
||||||
|
model.eval()
|
||||||
|
result = model(input_ids, attention_mask=attention_mask, labels=sequence_labels)
|
||||||
|
self.assertEqual(result.logits.shape, (self.model_tester.batch_size, self.model_tester.num_labels))
|
||||||
|
|
||||||
|
def test_Glm_token_classification_model(self):
|
||||||
|
config, input_dict = self.model_tester.prepare_config_and_inputs_for_common()
|
||||||
|
config.num_labels = 3
|
||||||
|
input_ids = input_dict["input_ids"]
|
||||||
|
attention_mask = input_ids.ne(1).to(torch_device)
|
||||||
|
token_labels = ids_tensor([self.model_tester.batch_size, self.model_tester.seq_length], config.num_labels)
|
||||||
|
model = self.model_tester.for_token_class(config=config)
|
||||||
|
model.to(torch_device)
|
||||||
|
model.eval()
|
||||||
|
result = model(input_ids, attention_mask=attention_mask, labels=token_labels)
|
||||||
|
self.assertEqual(
|
||||||
|
result.logits.shape,
|
||||||
|
(self.model_tester.batch_size, self.model_tester.seq_length, self.model_tester.num_labels),
|
||||||
|
)
|
||||||
|
|
||||||
|
@unittest.skip(reason="Glm uses GQA on all models so the KV cache is a non standard format")
|
||||||
|
def test_past_key_values_format(self):
|
||||||
|
pass
|
||||||
|
|
||||||
|
@is_flaky()
|
||||||
|
def test_custom_4d_attention_mask(self):
|
||||||
|
"""Overwrite the common test to use atol=1e-3 instead of 1e-4. Can still rarely fail, thus flaky."""
|
||||||
|
for model_class in self.all_generative_model_classes:
|
||||||
|
if not model_class._supports_static_cache:
|
||||||
|
self.skipTest(f"{model_class.__name__} is not guaranteed to work with custom 4D attention masks")
|
||||||
|
config, _ = self.model_tester.prepare_config_and_inputs_for_common()
|
||||||
|
if getattr(config, "sliding_window", 0) is not None and getattr(config, "sliding_window", 0) > 0:
|
||||||
|
self.skipTest(f"{model_class.__name__} with sliding window attention is not supported by this test")
|
||||||
|
model = model_class(config).to(device=torch_device, dtype=torch.float32)
|
||||||
|
|
||||||
|
(
|
||||||
|
input_ids,
|
||||||
|
position_ids,
|
||||||
|
input_ids_shared_prefix,
|
||||||
|
mask_shared_prefix,
|
||||||
|
position_ids_shared_prefix,
|
||||||
|
) = self._get_custom_4d_mask_test_data()
|
||||||
|
|
||||||
|
logits = model.forward(input_ids, position_ids=position_ids).logits
|
||||||
|
# logits.shape == torch.Size([3, 4, ...])
|
||||||
|
|
||||||
|
logits_shared_prefix = model(
|
||||||
|
input_ids_shared_prefix,
|
||||||
|
attention_mask=mask_shared_prefix,
|
||||||
|
position_ids=position_ids_shared_prefix,
|
||||||
|
)[0]
|
||||||
|
# logits_shared_prefix.shape == torch.Size([1, 6, ...])
|
||||||
|
|
||||||
|
out_last_tokens = logits[:, -1, :] # last tokens in each batch line
|
||||||
|
out_shared_prefix_last_tokens = logits_shared_prefix[0, -3:, :] # last three tokens
|
||||||
|
|
||||||
|
# comparing softmax-normalized logits:
|
||||||
|
normalized_0 = torch.nn.functional.softmax(out_last_tokens, dim=-1)
|
||||||
|
normalized_1 = torch.nn.functional.softmax(out_shared_prefix_last_tokens, dim=-1)
|
||||||
|
print(torch.abs(normalized_0 - normalized_1).max())
|
||||||
|
|
||||||
|
torch.testing.assert_close(normalized_0, normalized_1, rtol=1e-3, atol=1e-3)
|
||||||
|
|
||||||
|
    @require_flash_attn
    @require_torch_gpu
    @pytest.mark.flash_attn_test
    @slow
    def test_flash_attn_2_generate_padding_right(self):
        """Overwrite the common test as the test is flaky on tiny models."""
        model = GlmForCausalLM.from_pretrained(
            "THUDM/glm-4-9b",
            device_map={"": 0},
            torch_dtype=torch.bfloat16,
            revision="refs/pr/15",
        )

        tokenizer = AutoTokenizer.from_pretrained("THUDM/glm-4-9b", revision="refs/pr/15")
        tokenizer.padding_side = "right"

        texts = ["hi", "Hello this is a very long sentence"]
        inputs = tokenizer(texts, return_tensors="pt", padding=True).to(0)

        output_native = model.generate(**inputs, max_new_tokens=15, do_sample=False)
        output_native = tokenizer.batch_decode(output_native)

        model = GlmForCausalLM.from_pretrained(
            "THUDM/glm-4-9b",
            device_map={"": 0},
            attn_implementation="flash_attention_2",
            torch_dtype=torch.bfloat16,
            revision="refs/pr/15",
        )

        output_fa_2 = model.generate(**inputs, max_new_tokens=15, do_sample=False)
        output_fa_2 = tokenizer.batch_decode(output_fa_2)

        self.assertListEqual(output_native, output_fa_2)

    @parameterized.expand([("float16",), ("bfloat16",), ("float32",)])
    @require_torch_sdpa
    @slow
    @is_flaky()
    def test_eager_matches_sdpa_inference(self, torch_dtype: str):
        """Overwrite to add flakiness: some cases can sometimes fail"""
        if torch_dtype == "float16" and not is_torch_fp16_available_on_device(torch_device):
            self.skipTest(f"float16 not supported on {torch_device} (on the specific device currently used)")

        if torch_dtype == "bfloat16" and not is_torch_bf16_available_on_device(torch_device):
            self.skipTest(
                f"bfloat16 not supported on {torch_device} (on the specific device currently used, e.g. Nvidia T4 GPU)"
            )

        # Not sure whether it's fine to put torch.XXX in a decorator if torch is not available so hacking it here instead.
        if torch_dtype == "float16":
            torch_dtype = torch.float16
        elif torch_dtype == "bfloat16":
            torch_dtype = torch.bfloat16
        elif torch_dtype == "float32":
            torch_dtype = torch.float32

        atols = {
            ("cpu", False, torch.float32): 1e-6,
            ("cpu", False, torch.bfloat16): 1e-2,
            ("cpu", True, torch.float32): 1e-6,
            ("cpu", True, torch.bfloat16): 1e-2,
            ("cuda", False, torch.float32): 1e-6,
            ("cuda", False, torch.bfloat16): 1e-2,
            ("cuda", False, torch.float16): 5e-3,
            ("cuda", True, torch.float32): 1e-6,
            ("cuda", True, torch.bfloat16): 1e-2,
            ("cuda", True, torch.float16): 5e-3,
        }
        rtols = {
            ("cpu", False, torch.float32): 1e-4,
            ("cpu", False, torch.bfloat16): 1e-2,
            ("cpu", True, torch.float32): 1e-4,
            ("cpu", True, torch.bfloat16): 1e-2,
            ("cuda", False, torch.float32): 1e-4,
            ("cuda", False, torch.bfloat16): 1e-2,
            ("cuda", False, torch.float16): 5e-3,
            ("cuda", True, torch.float32): 1e-4,
            ("cuda", True, torch.bfloat16): 3e-2,
            ("cuda", True, torch.float16): 5e-3,
        }

        def get_mean_reldiff(failcase, x, ref, atol, rtol):
            return f"{failcase}: mean relative difference: {((x - ref).abs() / (ref.abs() + 1e-12)).mean():.3e}, torch atol = {atol}, torch rtol = {rtol}"

        for model_class in self.all_model_classes:
            config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()
            model = model_class(config)
            # FIXME: we deactivate boolean mask for models using "use_mask_token" in their constructors.
            # These models support masking only in the case `use_mask_token=True`. Otherwise they cannot consume an input mask.
            # This means that the class needs to be instantiated much later, after `use_mask` is set, which means a significant refactor of the code.
            # However, masking there is not done at any layer that matters (i.e. self-attention), so we can safely deactivate it.
            deactivate_mask = "use_mask_token" in inspect.signature(model_class).parameters

            is_encoder_decoder = model.config.is_encoder_decoder

            with tempfile.TemporaryDirectory() as tmpdirname:
                model.save_pretrained(tmpdirname)
                model_sdpa = model_class.from_pretrained(tmpdirname, torch_dtype=torch_dtype)
                model_sdpa = model_sdpa.eval().to(torch_device)

                self.assertTrue(model_sdpa.config._attn_implementation == "sdpa")

                model_eager = model_class.from_pretrained(
                    tmpdirname,
                    torch_dtype=torch_dtype,
                    attn_implementation="eager",
                )
                model_eager = model_eager.eval().to(torch_device)

                self.assertTrue(model_eager.config._attn_implementation == "eager")

                for name, submodule in model_eager.named_modules():
                    class_name = submodule.__class__.__name__
                    if "SdpaAttention" in class_name or "SdpaSelfAttention" in class_name:
                        raise ValueError("The eager model should not have SDPA attention layers")

                has_sdpa = False
                for name, submodule in model_sdpa.named_modules():
                    class_name = submodule.__class__.__name__
                    if "SdpaAttention" in class_name or "SdpaSelfAttention" in class_name:
                        has_sdpa = True
                        break
                if not has_sdpa and model_sdpa.config.model_type != "falcon":
                    raise ValueError("The SDPA model should have SDPA attention layers")

                # We use these for loops instead of parameterized.expand just to avoid loading/saving the model 16 times,
                # but it would be nicer to have an efficient way to use parameterized.expand
                fail_cases = []
                for padding_side in ["left", "right"]:
                    for use_mask in [False, True]:
                        for output_attentions in [True, False]:
                            can_output_attn = "output_attentions" in inspect.signature(model_sdpa.forward).parameters
                            if not (self.has_attentions and can_output_attn) and output_attentions:
                                continue
                            for batch_size in [1, 5]:
                                dummy_input = inputs_dict[model.main_input_name]

                                if dummy_input.dtype in [torch.float32, torch.bfloat16, torch.float16]:
                                    dummy_input = dummy_input.to(torch_dtype)

                                dummy_input = dummy_input[:batch_size]
                                if dummy_input.shape[0] != batch_size:
                                    if dummy_input.dtype in [torch.float32, torch.bfloat16, torch.float16]:
                                        extension = torch.rand(
                                            batch_size - dummy_input.shape[0],
                                            *dummy_input.shape[1:],
                                            dtype=torch_dtype,
                                            device=torch_device,
                                        )
                                        dummy_input = torch.cat((dummy_input, extension), dim=0).to(torch_device)
                                    else:
                                        extension = torch.randint(
                                            high=5,
                                            size=(batch_size - dummy_input.shape[0], *dummy_input.shape[1:]),
                                            dtype=dummy_input.dtype,
                                            device=torch_device,
                                        )
                                        dummy_input = torch.cat((dummy_input, extension), dim=0).to(torch_device)

                                if not use_mask:
                                    dummy_attention_mask = None
                                else:
                                    dummy_attention_mask = inputs_dict.get("attention_mask", None)
                                    if dummy_attention_mask is None:
                                        if is_encoder_decoder:
                                            seqlen = inputs_dict.get("decoder_input_ids", dummy_input).shape[-1]
                                        else:
                                            seqlen = dummy_input.shape[-1]
                                        dummy_attention_mask = (
                                            torch.ones(batch_size, seqlen).to(torch.int64).to(torch_device)
                                        )

                                    dummy_attention_mask = dummy_attention_mask[:batch_size]
                                    if dummy_attention_mask.shape[0] != batch_size:
                                        extension = torch.ones(
                                            batch_size - dummy_attention_mask.shape[0],
                                            *dummy_attention_mask.shape[1:],
                                            dtype=dummy_attention_mask.dtype,
                                            device=torch_device,
                                        )
                                        dummy_attention_mask = torch.cat((dummy_attention_mask, extension), dim=0)
                                        dummy_attention_mask = dummy_attention_mask.to(torch_device)

                                    dummy_attention_mask[:] = 1
                                    if padding_side == "left":
                                        dummy_attention_mask[-1, :-1] = 1
                                        dummy_attention_mask[-1, -4:] = 0
                                    elif padding_side == "right":
                                        dummy_attention_mask[-1, 1:] = 1
                                        dummy_attention_mask[-1, :3] = 0

                                for enable_kernels in [False, True]:
                                    failcase = f"padding_side={padding_side}, use_mask={use_mask}, batch_size={batch_size}, enable_kernels={enable_kernels}"
                                    if is_encoder_decoder:
                                        decoder_input_ids = inputs_dict.get("decoder_input_ids", dummy_input)[
                                            :batch_size
                                        ]
                                        if decoder_input_ids.shape[0] != batch_size:
                                            extension = torch.ones(
                                                batch_size - decoder_input_ids.shape[0],
                                                *decoder_input_ids.shape[1:],
                                                dtype=decoder_input_ids.dtype,
                                                device=torch_device,
                                            )
                                            decoder_input_ids = torch.cat((decoder_input_ids, extension), dim=0)
                                            decoder_input_ids = decoder_input_ids.to(torch_device)

                                        # TODO: never an `attention_mask` arg here?
                                        processed_inputs = {
                                            model.main_input_name: dummy_input,
                                            "decoder_input_ids": decoder_input_ids,
                                            "decoder_attention_mask": dummy_attention_mask,
                                            "output_hidden_states": True,
                                        }
                                    else:
                                        processed_inputs = {
                                            model.main_input_name: dummy_input,
                                            "output_hidden_states": True,
                                        }

                                        # Otherwise fails for e.g. WhisperEncoderModel
                                        if "attention_mask" in inspect.signature(model_eager.forward).parameters:
                                            processed_inputs["attention_mask"] = dummy_attention_mask

                                        if (
                                            self.has_attentions
                                            and "output_attentions" in inspect.signature(model_sdpa.forward).parameters
                                        ):
                                            processed_inputs["output_attentions"] = output_attentions
                                    if not deactivate_mask and (
                                        "bool_masked_pos" in inspect.signature(model_eager.forward).parameters
                                    ):
                                        dummy_mask = torch.ones((self.model_tester.num_masks,))

                                        # In case of additional token (like class) we define a custom `mask_length`
                                        if hasattr(self.model_tester, "mask_length"):
                                            mask_length = self.model_tester.mask_length - dummy_mask.size(0)
                                        else:
                                            mask_length = self.model_tester.seq_length - dummy_mask.size(0)
                                        dummy_mask = torch.cat([dummy_mask, torch.zeros(mask_length)])
                                        dummy_bool_masked_pos = dummy_mask.expand(batch_size, -1).bool()
                                        processed_inputs["bool_masked_pos"] = dummy_bool_masked_pos.to(torch_device)

                                    if "noise" in inspect.signature(model_eager.forward).parameters:
                                        np.random.seed(2)
                                        num_patches = int(
                                            (self.model_tester.image_size // self.model_tester.patch_size) ** 2
                                        )
                                        noise = np.random.uniform(size=(batch_size, num_patches))
                                        processed_inputs["noise"] = torch.from_numpy(noise)

                                    # TODO: test gradients as well (& for FA2 as well!)
                                    with torch.no_grad():
                                        with torch.backends.cuda.sdp_kernel(
                                            enable_flash=enable_kernels,
                                            enable_math=True,
                                            enable_mem_efficient=enable_kernels,
                                        ):
                                            prepared_inputs = self._prepare_for_class(processed_inputs, model_class)
                                            outputs_eager = model_eager(**prepared_inputs)
                                            outputs_sdpa = model_sdpa(**prepared_inputs)

                                    logits_eager = (
                                        outputs_eager.hidden_states[-1]
                                        if not is_encoder_decoder
                                        else outputs_eager.decoder_hidden_states[-1]
                                    )
                                    logits_sdpa = (
                                        outputs_sdpa.hidden_states[-1]
                                        if not is_encoder_decoder
                                        else outputs_sdpa.decoder_hidden_states[-1]
                                    )

                                    if torch_device in ["cpu", "cuda"]:
                                        atol = atols[torch_device, enable_kernels, torch_dtype]
                                        rtol = rtols[torch_device, enable_kernels, torch_dtype]
                                    else:
                                        atol = 1e-7
                                        rtol = 1e-4

                                    # The output for masked tokens deviates slightly - we don't mind that.
                                    if use_mask:
                                        if padding_side == "left":
                                            sub_sdpa = logits_sdpa[:-1]
                                            sub_eager = logits_eager[:-1]
                                            if not torch.allclose(sub_sdpa, sub_eager, atol=atol, rtol=rtol):
                                                fail_cases.append(
                                                    get_mean_reldiff(failcase, sub_sdpa, sub_eager, atol, rtol)
                                                )

                                            sub_sdpa = logits_sdpa[-1, :-4]
                                            sub_eager = logits_eager[-1, :-4]
                                            if not torch.allclose(sub_sdpa, sub_eager, atol=atol, rtol=rtol):
                                                fail_cases.append(
                                                    get_mean_reldiff(failcase, sub_sdpa, sub_eager, atol, rtol)
                                                )

                                            # Testing the padding tokens is not really meaningful but anyway
                                            # sub_sdpa = logits_sdpa[-1, -4:]
                                            # sub_eager = logits_eager[-1, -4:]
                                            # if not torch.allclose(sub_sdpa, sub_eager, atol=atol, rtol=rtol):
                                            #     fail_cases.append(get_mean_reldiff(failcase, sub_sdpa, sub_eager, 4e-2, 4e-2))
                                        elif padding_side == "right":
                                            sub_sdpa = logits_sdpa[:-1]
                                            sub_eager = logits_eager[:-1]
                                            if not torch.allclose(sub_sdpa, sub_eager, atol=atol, rtol=rtol):
                                                fail_cases.append(
                                                    get_mean_reldiff(failcase, sub_sdpa, sub_eager, atol, rtol)
                                                )

                                            sub_sdpa = logits_sdpa[-1, 3:]
                                            sub_eager = logits_eager[-1, 3:]
                                            if not torch.allclose(sub_sdpa, sub_eager, atol=atol, rtol=rtol):
                                                fail_cases.append(
                                                    get_mean_reldiff(failcase, sub_sdpa, sub_eager, atol, rtol)
                                                )

                                            # Testing the padding tokens is not really meaningful but anyway
                                            # sub_sdpa = logits_sdpa[-1, :3]
                                            # sub_eager = logits_eager[-1, :3]
                                            # if not torch.allclose(sub_sdpa, sub_eager, atol=atol, rtol=rtol):
                                            #     fail_cases.append(get_mean_reldiff(failcase, sub_sdpa, sub_eager, 4e-2, 4e-2))

                                    else:
                                        if not torch.allclose(logits_sdpa, logits_eager, atol=atol, rtol=rtol):
                                            fail_cases.append(
                                                get_mean_reldiff(failcase, logits_sdpa, logits_eager, atol, rtol)
                                            )

                self.assertTrue(len(fail_cases) == 0, "\n".join(fail_cases))

    @require_torch_sdpa
    @slow
    @is_flaky()
    def test_eager_matches_sdpa_generate(self):
        """Overwrite to add flakiness: outputs sometimes start to diverge after some tokens"""

        max_new_tokens = 30

        for model_class in self.all_generative_model_classes:
            if not model_class._supports_sdpa:
                self.skipTest(f"{model_class.__name__} does not support SDPA")

            config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()

            dummy_input = inputs_dict[model_class.main_input_name]
            if dummy_input.dtype in [torch.float32, torch.bfloat16]:
                dummy_input = dummy_input.to(torch.float16)

            # make sure that all models have enough positions for generation
            if hasattr(config, "max_position_embeddings"):
                config.max_position_embeddings = max_new_tokens + dummy_input.shape[1] + 1

            model = model_class(config)

            with tempfile.TemporaryDirectory() as tmpdirname:
                model.save_pretrained(tmpdirname)

                dummy_attention_mask = inputs_dict.get("attention_mask", torch.ones_like(dummy_input))

                model_sdpa = model_class.from_pretrained(
                    tmpdirname,
                    torch_dtype=torch.float16,
                    low_cpu_mem_usage=True,
                ).to(torch_device)

                self.assertTrue(model_sdpa.config._attn_implementation == "sdpa")

                model_eager = model_class.from_pretrained(
                    tmpdirname,
                    torch_dtype=torch.float16,
                    low_cpu_mem_usage=True,
                    attn_implementation="eager",
                ).to(torch_device)

                self.assertTrue(model_eager.config._attn_implementation == "eager")

                for name, submodule in model_eager.named_modules():
                    class_name = submodule.__class__.__name__
                    if "SdpaAttention" in class_name or "SdpaSelfAttention" in class_name:
                        raise ValueError("The eager model should not have SDPA attention layers")

                has_sdpa = False
                for name, submodule in model_sdpa.named_modules():
                    class_name = submodule.__class__.__name__
                    if "SdpaAttention" in class_name or "SdpaSelfAttention" in class_name:
                        has_sdpa = True
                        break
                if not has_sdpa:
                    raise ValueError("The SDPA model should have SDPA attention layers")

                # Just test that a large cache works as expected
                res_eager = model_eager.generate(
                    dummy_input, attention_mask=dummy_attention_mask, max_new_tokens=max_new_tokens, do_sample=False
                )

                res_sdpa = model_sdpa.generate(
                    dummy_input, attention_mask=dummy_attention_mask, max_new_tokens=max_new_tokens, do_sample=False
                )

                self.assertTrue(torch.allclose(res_eager, res_sdpa))


@slow
@require_torch_accelerator
class GlmIntegrationTest(unittest.TestCase):
    input_text = ["Hello I am doing", "Hi today"]
    model_id = "THUDM/glm-4-9b"
    revision = "refs/pr/15"
    # This variable is used to determine which CUDA device we are using for our runners (A10 or T4)
    # Depending on the hardware we get different logits / generations
    cuda_compute_capability_major_version = None

    @classmethod
    def setUpClass(cls):
        if is_torch_available() and torch.cuda.is_available():
            # 8 is for A100 / A10 and 7 for T4
            cls.cuda_compute_capability_major_version = torch.cuda.get_device_capability()[0]

    def test_model_9b_fp16(self):
        EXPECTED_TEXTS = [
            "Hello I am doing a project on the history of the internetSolution:\n\nStep 1: Introduction\nThe history of the",
            "Hi today I am going to show you how to make a simple and easy to make a DIY paper flower.",
        ]

        model = AutoModelForCausalLM.from_pretrained(
            self.model_id, low_cpu_mem_usage=True, torch_dtype=torch.float16, revision=self.revision
        ).to(torch_device)

        tokenizer = AutoTokenizer.from_pretrained(self.model_id, revision=self.revision)
        inputs = tokenizer(self.input_text, return_tensors="pt", padding=True).to(torch_device)

        output = model.generate(**inputs, max_new_tokens=20, do_sample=False)
        output_text = tokenizer.batch_decode(output, skip_special_tokens=True)

        self.assertEqual(output_text, EXPECTED_TEXTS)

    def test_model_9b_bf16(self):
        EXPECTED_TEXTS = [
            "Hello I am doing a project on the history of the internetSolution:\n\nStep 1: Introduction\nThe history of the",
            "Hi today I am going to show you how to make a simple and easy to make a DIY paper flower.",
        ]

        model = AutoModelForCausalLM.from_pretrained(
            self.model_id, low_cpu_mem_usage=True, torch_dtype=torch.bfloat16, revision=self.revision
        ).to(torch_device)

        tokenizer = AutoTokenizer.from_pretrained(self.model_id, revision=self.revision)
        inputs = tokenizer(self.input_text, return_tensors="pt", padding=True).to(torch_device)

        output = model.generate(**inputs, max_new_tokens=20, do_sample=False)
        output_text = tokenizer.batch_decode(output, skip_special_tokens=True)

        self.assertEqual(output_text, EXPECTED_TEXTS)

    def test_model_9b_eager(self):
        EXPECTED_TEXTS = [
            "Hello I am doing a project on the history of the internetSolution:\n\nStep 1: Introduction\nThe history of the",
            "Hi today I am going to show you how to make a simple and easy to make a DIY paper flower.",
        ]

        model = AutoModelForCausalLM.from_pretrained(
            self.model_id,
            low_cpu_mem_usage=True,
            torch_dtype=torch.bfloat16,
            attn_implementation="eager",
            revision=self.revision,
        )
        model.to(torch_device)

        tokenizer = AutoTokenizer.from_pretrained(self.model_id, revision=self.revision)
        inputs = tokenizer(self.input_text, return_tensors="pt", padding=True).to(torch_device)

        output = model.generate(**inputs, max_new_tokens=20, do_sample=False)
        output_text = tokenizer.batch_decode(output, skip_special_tokens=True)

        self.assertEqual(output_text, EXPECTED_TEXTS)

    @require_torch_sdpa
    def test_model_9b_sdpa(self):
        EXPECTED_TEXTS = [
            "Hello I am doing a project on the history of the internetSolution:\n\nStep 1: Introduction\nThe history of the",
            "Hi today I am going to show you how to make a simple and easy to make a DIY paper flower.",
        ]

        model = AutoModelForCausalLM.from_pretrained(
            self.model_id,
            low_cpu_mem_usage=True,
            torch_dtype=torch.bfloat16,
            attn_implementation="sdpa",
            revision=self.revision,
        )
        model.to(torch_device)

        tokenizer = AutoTokenizer.from_pretrained(self.model_id, revision=self.revision)
        inputs = tokenizer(self.input_text, return_tensors="pt", padding=True).to(torch_device)

        output = model.generate(**inputs, max_new_tokens=20, do_sample=False)
        output_text = tokenizer.batch_decode(output, skip_special_tokens=True)

        self.assertEqual(output_text, EXPECTED_TEXTS)

    @require_flash_attn
    @pytest.mark.flash_attn_test
    def test_model_9b_flash_attn(self):
        EXPECTED_TEXTS = [
            "Hello I am doing a project on the history of the internetSolution:\n\nStep 1: Introduction\nThe history of the",
            "Hi today I am going to show you how to make a simple and easy to make a DIY paper flower.",
        ]

        model = AutoModelForCausalLM.from_pretrained(
            self.model_id,
            low_cpu_mem_usage=True,
            torch_dtype=torch.bfloat16,
            attn_implementation="flash_attention_2",
            revision=self.revision,
        )
        model.to(torch_device)

        tokenizer = AutoTokenizer.from_pretrained(self.model_id, revision=self.revision)
        inputs = tokenizer(self.input_text, return_tensors="pt", padding=True).to(torch_device)

        output = model.generate(**inputs, max_new_tokens=20, do_sample=False)
        output_text = tokenizer.batch_decode(output, skip_special_tokens=True)

        self.assertEqual(output_text, EXPECTED_TEXTS)
@@ -4938,14 +4938,17 @@ class ModelTesterMixin:
         if not hasattr(self, "_torch_compile_test_ckpt"):
             self.skipTest(f"{self.__class__.__name__} doesn't have the attribute `_torch_compile_test_ckpt`.")
         ckpt = self._torch_compile_test_ckpt
+        revision = "main" if not hasattr(self, "_torch_compile_test_revision") else self._torch_compile_test_revision
 
         os.environ["TOKENIZERS_PARALLELISM"] = "false"
 
         batch_size = 1
         n_iter = 3
 
-        tokenizer = AutoTokenizer.from_pretrained(ckpt)
-        model = AutoModelForCausalLM.from_pretrained(ckpt, torch_dtype=torch.float16).to(torch_device)
+        tokenizer = AutoTokenizer.from_pretrained(ckpt, revision=revision)
+        model = AutoModelForCausalLM.from_pretrained(ckpt, torch_dtype=torch.float16, revision=revision).to(
+            torch_device
+        )
 
         model.generation_config.max_new_tokens = 4
 
@@ -5013,11 +5016,14 @@ class ModelTesterMixin:
         if not hasattr(self, "_torch_compile_test_ckpt"):
             self.skipTest(f"{self.__class__.__name__} doesn't have the attribute `_torch_compile_test_ckpt`.")
         ckpt = self._torch_compile_test_ckpt
+        revision = "main" if not hasattr(self, "_torch_compile_test_revision") else self._torch_compile_test_revision
 
         os.environ["TOKENIZERS_PARALLELISM"] = "false"
 
-        tokenizer = AutoTokenizer.from_pretrained(ckpt)
-        model = AutoModelForCausalLM.from_pretrained(ckpt, torch_dtype=torch.float16).to(torch_device)
+        tokenizer = AutoTokenizer.from_pretrained(ckpt, revision=revision)
+        model = AutoModelForCausalLM.from_pretrained(ckpt, torch_dtype=torch.float16, revision=revision).to(
+            torch_device
+        )
 
         cache_implementation = "static"
         if model.config.model_type == "gemma2":
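The `revision` fallback added in both hunks can be illustrated in isolation. The following is a minimal sketch, assuming two hypothetical stand-in classes (`TesterWithRevision`, `TesterWithoutRevision`) and a hypothetical helper `resolve_revision`, none of which appear in the commit. It uses `getattr` with a default, which should behave the same as the `hasattr` conditional in the diff.

```python
# Hypothetical stand-ins for test classes; only the attribute name
# `_torch_compile_test_revision` is taken from the diff.
class TesterWithRevision:
    _torch_compile_test_revision = "refs/pr/15"


class TesterWithoutRevision:
    pass


def resolve_revision(tester) -> str:
    """Return the revision pinned on the tester, or "main" if none is set."""
    return getattr(tester, "_torch_compile_test_revision", "main")


if __name__ == "__main__":
    print(resolve_revision(TesterWithRevision()))
    print(resolve_revision(TesterWithoutRevision()))
```

A test class that needs a checkpoint from a specific branch or pull request ref would set the attribute, while every other class keeps loading from `main`.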