Compare commits

..

15 Commits

Author SHA1 Message Date
Pedro Cuenca
63cd4c76f3 Llama Guard updates (#37872)
Some checks failed
Release - Conda / build_and_package (push) Has been cancelled
Secret Leaks / trufflehog (push) Has been cancelled
* Unhardcode use_chunked_attention, fix no_rope_layers

* Go back to exhaustive list of bools

* Conversion and modeling updates

* Fix rope

* Unhardcode rope

* Fix context length

* style

* Minor updates to conversion

* Use StaticCache

* Minor simplification

* DynamicCache 🤦

* Style

* Style
2025-04-30 10:34:43 +02:00
Yao Matrix
34f26e2c3e enable internvl UTs on XPU (#37779)
* enable internvl UTs on XPU

Signed-off-by: YAO Matrix <matrix.yao@intel.com>

* fix style

Signed-off-by: YAO Matrix <matrix.yao@intel.com>

* fix style per comments

Signed-off-by: Yao Matrix <matrix.yao@intel.com>

---------

Signed-off-by: YAO Matrix <matrix.yao@intel.com>
Signed-off-by: Yao Matrix <matrix.yao@intel.com>
2025-04-30 10:29:40 +02:00
Guang Yang
a57274466f Allow override inputs to export recipe (#37508)
Add option to specify dynamic shapes during export

Co-authored-by: Guang Yang <guangyang@fb.com>
2025-04-30 10:19:27 +02:00
Matt
481de7204c Skip is_flaky tests in the CI (#37723)
* No more red flaky tests in the CI!

* Remove the CircleCI logic as well

* Revert most changes including is_flaky behaviour

* make fixup

* Move to a more sensible place

* Mark a flaky test that failed on this PR!

* correct import

* update

* update

* update

* update

---------

Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
2025-04-30 09:52:21 +02:00
Aaditya Ura
5f8d17268c Update modeling_llama4.py (#37841)
* Update modeling_llama4.py

* Update modeling_llama4.py

* do not pass device

---------

Co-authored-by: raushan <raushan@huggingface.co>
2025-04-30 00:36:02 +02:00
Kim Juwon
50f8caaa48 🌐 [i18n-KO] Translated electra.md to Korean (#36763)
* docs: ko: electra.md

* feat: nmt draft

* fix: manual edits

* fix: manual edits
2025-04-29 14:03:39 -07:00
regisss
91f3e9422f Add Intel Gaudi doc (#37855)
* Add Intel Gaudi doc

* Use "TIP" instead of "NOTE"

* Address comments from reviews
2025-04-29 13:28:06 -07:00
Pedro Cuenca
c34afa5957 Processor chat template: pass custom kwargs (#37852) 2025-04-29 21:22:10 +02:00
Yaner
66ad8b2db0 docs: Details for ambigious channel dimension assignment (#37600)
* docs: Details for ambigious channel dimension inference

* Update src/transformers/image_utils.py

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

---------

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
2025-04-29 08:12:38 -07:00
Mohamed Mekkouri
096f25ae1f Fix Bitnet tokenizer in pipeline (#37861)
add tokenizer
2025-04-29 15:35:02 +02:00
Chris
da7ae467c4 Fix cache get item return type hints (#37847)
F: Fix cache return hints

Co-authored-by: Joao Gante <joaofranciscocardosogante@gmail.com>
2025-04-29 14:23:52 +01:00
Hicham Tala
aa6b79db43 Fix check of unecessary packages (issue #37626) (#37825)
* Fix check of unecessary packages (issue #37626)

* Reformat using ruff

* And a condition to avoind the risk of matching a random object in `import_utils`

* Reformat
2025-04-29 14:21:05 +01:00
Matt
517367fe9a Revert change that breaks on Torch 2.1 (#37531)
* Revert change that breaks on Torch 2.1

* Add TODO

* Trigger tests

* Trigger tests
2025-04-29 13:27:09 +01:00
Joao Gante
755b0fa2fe [tests] reorganize cache tests and clean memory between tests (#37684) 2025-04-29 12:21:14 +01:00
Joao Gante
3a1acc36ed [tests] fix flaky pattern in test_generate_continue_from_past_key_values (#37724) 2025-04-29 12:20:42 +01:00
20 changed files with 696 additions and 309 deletions

View File

@@ -28,6 +28,8 @@ COMMON_ENV_VARIABLES = {
"TRANSFORMERS_IS_CI": True, "TRANSFORMERS_IS_CI": True,
"PYTEST_TIMEOUT": 120, "PYTEST_TIMEOUT": 120,
"RUN_PIPELINE_TESTS": False, "RUN_PIPELINE_TESTS": False,
# will be adjust in `CircleCIJob.to_dict`.
"RUN_FLAKY": True,
} }
# Disable the use of {"s": None} as the output is way too long, causing the navigation on CircleCI impractical # Disable the use of {"s": None} as the output is way too long, causing the navigation on CircleCI impractical
COMMON_PYTEST_OPTIONS = {"max-worker-restart": 0, "vvv": None, "rsfE":None} COMMON_PYTEST_OPTIONS = {"max-worker-restart": 0, "vvv": None, "rsfE":None}
@@ -126,6 +128,8 @@ class CircleCIJob:
def to_dict(self): def to_dict(self):
env = COMMON_ENV_VARIABLES.copy() env = COMMON_ENV_VARIABLES.copy()
# Do not run tests decorated by @is_flaky on pull requests
env['RUN_FLAKY'] = os.environ.get("CIRCLE_PULL_REQUEST", "") == ""
env.update(self.additional_env) env.update(self.additional_env)
job = { job = {

View File

@@ -149,6 +149,8 @@
title: TPU title: TPU
- local: perf_train_special - local: perf_train_special
title: Apple Silicon title: Apple Silicon
- local: perf_train_gaudi
title: Intel Gaudi
- local: perf_hardware - local: perf_hardware
title: Build your own machine title: Build your own machine
title: Hardware title: Hardware

View File

@@ -0,0 +1,34 @@
<!--Copyright 2025 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.
-->
# Intel Gaudi
The Intel Gaudi AI accelerator family includes [Intel Gaudi 1](https://habana.ai/products/gaudi/), [Intel Gaudi 2](https://habana.ai/products/gaudi2/), and [Intel Gaudi 3](https://habana.ai/products/gaudi3/). Each server is equipped with 8 devices, known as Habana Processing Units (HPUs), providing 128GB of memory on Gaudi 3, 96GB on Gaudi 2, and 32GB on the first-gen Gaudi. For more details on the underlying hardware architecture, check out the [Gaudi Architecture](https://docs.habana.ai/en/latest/Gaudi_Overview/Gaudi_Architecture.html) overview.
[`TrainingArguments`], [`Trainer`] and [`Pipeline`] detect and set the backend device to `hpu` if an Intel Gaudi device is available. No additional changes are required to enable training and inference on your device.
Some modeling code in Transformers is not optimized for HPU lazy mode. If you encounter any errors, set the environment variable below to use eager mode:
```
PT_HPU_LAZY_MODE=0
```
In some cases, you'll also need to enable int64 support to avoid casting issues with long integers:
```
PT_ENABLE_INT64_SUPPORT=1
```
Refer to the [Gaudi docs](https://docs.habana.ai/en/latest/index.html) for more details.
> [!TIP]
> For training and inference with Gaudi-optimized model implementations, we recommend using [Optimum for Intel Gaudi](https://huggingface.co/docs/optimum/main/en/habana/index).

View File

@@ -354,8 +354,8 @@
title: (번역중) DistilBERT title: (번역중) DistilBERT
- local: in_translation - local: in_translation
title: (번역중) DPR title: (번역중) DPR
- local: in_translation - local: model_doc/electra
title: (번역중) ELECTRA title: ELECTRA
- local: model_doc/encoder-decoder - local: model_doc/encoder-decoder
title: 인코더 디코더 모델 title: 인코더 디코더 모델
- local: in_translation - local: in_translation

View File

@@ -0,0 +1,196 @@
<!--Copyright 2020 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.
-->
# ELECTRA[[electra]]
<div class="flex flex-wrap space-x-1">
<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
<img alt="TensorFlow" src="https://img.shields.io/badge/TensorFlow-FF6F00?style=flat&logo=tensorflow&logoColor=white">
<img alt="Flax" src="https://img.shields.io/badge/Flax-29a79b.svg?style=flat&logo=data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAC0AAAAtCAMAAAANxBKoAAAC7lBMVEUAAADg5vYHPVgAoJH+/v76+v39/f9JbLP///9+AIgAnY3///+mcqzt8fXy9fgkXa3Ax9709fr+///9/f8qXq49qp5AaLGMwrv8/P0eW60VWawxYq8yqJzG2dytt9Wyu9elzci519Lf3O3S2efY3OrY0+Xp7PT///////+dqNCexMc6Z7AGpJeGvbenstPZ5ejQ1OfJzOLa7ejh4+/r8fT29vpccbklWK8PVa0AS6ghW63O498vYa+lsdKz1NDRt9Kw1c672tbD3tnAxt7R6OHp5vDe7OrDyuDn6vLl6/EAQKak0MgATakkppo3ZK/Bz9y8w9yzu9jey97axdvHzeG21NHH4trTwthKZrVGZLSUSpuPQJiGAI+GAI8SWKydycLL4d7f2OTi1+S9xNzL0ePT6OLGzeEAo5U0qJw/aLEAo5JFa7JBabEAp5Y4qZ2QxLyKmsm3kL2xoMOehrRNb7RIbbOZgrGre68AUqwAqZqNN5aKJ5N/lMq+qsd8kMa4pcWzh7muhLMEV69juq2kbKqgUaOTR5uMMZWLLZSGAI5VAIdEAH+ovNDHuNCnxcy3qcaYx8K8msGplrx+wLahjbYdXrV6vbMvYK9DrZ8QrZ8tqJuFms+Sos6sw8ecy8RffsNVeMCvmb43aLltv7Q4Y7EZWK4QWa1gt6meZKUdr6GOAZVeA4xPAISyveLUwtivxtKTpNJ2jcqfvcltiMiwwcfAoMVxhL+Kx7xjdrqTe60tsaNQs6KaRKACrJ6UTZwkqpqTL5pkHY4AloSgsd2ptNXPvNOOncuxxsqFl8lmg8apt8FJcr9EbryGxLqlkrkrY7dRa7ZGZLQ5t6iXUZ6PPpgVpZeJCJFKAIGareTa0+KJod3H0deY2M+esM25usmYu8d2zsJOdcBVvrCLbqcAOaaHaKQAMaScWqKBXqCXMJ2RHpiLF5NmJZAdAHN2kta11dKu1M+DkcZLdb+Mcql3TppyRJdzQ5ZtNZNlIY+DF4+voCOQAAAAZ3RSTlMABAT+MEEJ/RH+/TP+Zlv+pUo6Ifz8+fco/fz6+evr39S9nJmOilQaF/7+/f38+smmoYp6b1T+/v7++vj189zU0tDJxsGzsrKSfv34+Pf27dDOysG9t6+n/vv6+vr59uzr1tG+tZ6Qg9Ym3QAABR5JREFUSMeNlVVUG1EQhpcuxEspXqS0SKEtxQp1d3d332STTRpIQhIISQgJhODu7lAoDoUCpe7u7u7+1puGpqnCPOyZvffbOXPm/PsP9JfQgyCC+tmTABTOcbxDz/heENS7/1F+9nhvkHePG0wNDLbGWwdXL+rbLWvpmZHXD8+gMfBjTh+aSe6Gnn7lwQIOTR0c8wfX3PWgv7avbdKwf/ZoBp1Gp/PvuvXW3vw5ib7emnTW4OR+3D4jB9vjNJ/7gNvfWWeH/TO/JyYrsiKCRjVEZA3UB+96kON+DxOQ/NLE8PE5iUYgIXjFnCOlxEQMaSGVxjg4gxOnEycGz8bptuNjVx08LscIgrzH3umcn+KKtiBIyvzOO2O99aAdR8cF19oZalnCtvREUw79tCd5sow1g1UKM6kXqUx4T8wsi3sTjJ3yzDmmhenLXLpo8u45eG5y4Vvbk6kkC4LLtJMowkSQxmk4ggVJEG+7c6QpHT8vvW9X7/o7+3ELmiJi2mEzZJiz8cT6TBlanBk70cB5GGIGC1gRDdZ00yADLW1FL6gqhtvNXNG5S9gdSrk4M1qu7JAsmYshzDS4peoMrU/gT7qQdqYGZaYhxZmVbGJAm/CS/HloWyhRUlknQ9KYcExTwS80d3VNOxUZJpITYyspl0LbhArhpZCD9cRWEQuhYkNGMHToQ/2Cs6swJlb39CsllxdXX6IUKh/H5jbnSsPKjgmoaFQ1f8wRLR0UnGE/RcDEjj2jXG1WVTwUs8+zxfcrVO+vSsuOpVKxCfYZiQ0/aPKuxQbQ8lIz+DClxC8u+snlcJ7Yr1z1JPqUH0V+GDXbOwAib931Y4Imaq0NTIXPXY+N5L18GJ37SVWu+hwXff8l72Ds9XuwYIBaXPq6Shm4l+Vl/5QiOlV+uTk6YR9PxKsI9xNJny31ygK1e+nIRC1N97EGkFPI+jCpiHe5PCEy7oWqWSwRrpOvhFzcbTWMbm3ZJAOn1rUKpYIt/lDhW/5RHHteeWFN60qo98YJuoq1nK3uW5AabyspC1BcIEpOhft+SZAShYoLSvnmSfnYADUERP5jJn2h5XtsgCRuhYQqAvwTwn33+YWEKUI72HX5AtfSAZDe8F2DtPPm77afhl0EkthzuCQU0BWApgQIH9+KB0JhopMM7bJrdTRoleM2JAVNMyPF+wdoaz+XJpGoVAQ7WXUkcV7gT3oUZyi/ISIJAVKhgNp+4b4veCFhYVJw4locdSjZCp9cPUhLF9EZ3KKzURepMEtCDPP3VcWFx4UIiZIklIpFNfHpdEafIF2aRmOcrUmjohbT2WUllbmRvgfbythbQO3222fpDJoufaQPncYYuqoGtUEsCJZL6/3PR5b4syeSjZMQG/T2maGANlXT2v8S4AULWaUkCxfLyW8iW4kdka+nEMjxpL2NCwsYNBp+Q61PF43zyDg9Bm9+3NNySn78jMZUUkumqE4Gp7JmFOdP1vc8PpRrzj9+wPinCy8K1PiJ4aYbnTYpCCbDkBSbzhu2QJ1Gd82t8jI8TH51+OzvXoWbnXUOBkNW+0mWFwGcGOUVpU81/n3TOHb5oMt2FgYGjzau0Nif0Ss7Q3XB33hjjQHjHA5E5aOyIQc8CBrLdQSs3j92VG+3nNEjbkbdbBr9zm04ruvw37vh0QKOdeGIkckc80fX3KH/h7PT4BOjgCty8VZ5ux1MoO5Cf5naca2LAsEgehI+drX8o/0Nu+W0m6K/I9gGPd/dfx/EN/wN62AhsBWuAAAAAElFTkSuQmCC
">
</div>
## 개요[[overview]]
ELECTRA 모델은 [ELECTRA: Pre-training Text Encoders as Discriminators Rather Than
Generators](https://openreview.net/pdf?id=r1xMH1BtvB) 논문에서 제안되었습니다. ELECTRA는 두가지 트랜스포머 모델인 생성 모델과 판별 모델을 학습시키는 새로운 사전학습 접근법입니다. 생성 모델의 역할은 시퀀스에 있는 토큰을 대체하는 것이며 마스킹된 언어 모델로 학습됩니다. 우리가 관심을 가진 판별 모델은 시퀀스에서 어떤 토큰이 생성 모델에 의해 대체되었는지 식별합니다.
논문의 초록은 다음과 같습니다:
*BERT와 같은 마스킹된 언어 모델(MLM) 사전학습 방법은 일부 토큰을 [MASK] 토큰으로 바꿔 손상시키고 난 뒤, 모델이 다시 원본 토큰을 복원하도록 학습합니다. 이런 방식은 다운스트림 NLP 작업을 전이할 때 좋은 성능을 내지만, 효과적으로 사용하기 위해서는 일반적으로 많은 양의 연산이 필요합니다. 따라서 대안으로, 대체 토큰 탐지라고 불리는 샘플-효과적인 사전학습을 제안합니다. 우리의 방법론은 입력에 마스킹을 하는 대신에 소형 생성 모델의 그럴듯한 대안 토큰으로 손상시킵니다. 그리고 나서, 모델이 손상된 토큰의 원래 토큰을 예측하도록 훈련시키는 대신, 판별 모델을 각각의 토큰이 생성 모델의 샘플로 손상되었는지 아닌지 학습합니다. 실험들은 통해 이 새로운 사전학습 방식은 마스킹된 일부 토큰에만 적용되는 기존 방식과 달리 모든 입력 토큰에 대해 학습이 이뤄지기 때문에 마스킹된 언어 모델(MLM)보다 더 효율적임을 입증하였습니다. 결과적으로 소개된 방식이 같은 모델 크기, 데이터, 연산량을 가진 BERT모델로 학습한 결과를 압도하는 문맥 표현 학습을 할 수 있다는 것을 확인했습니다. 특히 작은 모델에서 성능 향상이 두드러지며, 예를 들어 GPU 한 대로 4일간 학습한 모델이 30배 더 많은 계산 자원을 사용한 GPT보다 GLUE 자연어 이해 벤치마크에서 더 나은 성능을 보입니다. 대규모 환경에서도 유효하며 더 적은 연산량으로 RoBERTa와 XLNet과 비슷한 성능을 낼 수 있으며, 동일한 연산량을 가질 경우 이들의 성능을 능가합니다.*
이 모델은 [lysandre](https://huggingface.co/lysandre)이 기여했습니다. 원본 코드는 [이곳](https://github.com/google-research/electra)에서 찾아보실 수 있습니다.
## 사용 팁[[usage-tips]]
- ELECTRA는 사전학습 방법으로 기본 모델인 BERT의 구조와 거의 차이가 없습니다. 유일한 차이는 임베딩 크기와 히든 크기를 구분했다는 점입니다. 임베딩 크기는 일반적으로 더 작고, 히든 크기는 더 큽니다. 임베딩에서 임베딩 크기를 히든 크기로 변환하기 위해 추가로 선형 변환 층이 사용됩니다. 임베딩 크기와 히든 크기가 동일할 경우에는 이 선형 변환 층이 필요하지 않습니다.
- ELECTRA는 또 다른 (작은) 마스킹된 언어 모델을 사용해 사전학습 된 트랜스포머 모델입니다. 작은 언어 모델이 입력 텍스트의 일부를 무작위로 마스킹하고, 그 자리에 새로운 토큰을 삽입합니다. ELECTRA는 원래 토큰과 대체된 토큰을 구분하는 역할을 수행합니다. GAN 훈련과 비슷하지만, 생성 모델은 ELECTRA 모델을 속이는 것이 아니라 원래 텍스트를 복원하는 목표로 몇 단계 학습합니다. 그 후 ELECTRA가 학습을 하게 됩니다.
- [구글 리서치의 구현](https://github.com/google-research/electra)으로 저장된 ELECTRA checkpoints는 생성 모델과 판별 모델을 포함합니다. 변환 스크립트에서는 사용자가 어떤 모델을 어떤 아키텍처로 내보낼지 명시해야 합니다. 일단 Hugging Face 포맷으로 변환되면, 이 체크포인트들은 모든 ELECTRA 모델에서 불러올 수 있습니다. 즉, 판별 모델은 [`ElectraForMaskedLM`] 모델에, 생성 모델은 [`ElectraForPreTraining`]모델에 불러올 수 있다는 의미입니다. (단, 생성 모델에는 분류 헤드가 존재하지 않기 때문에, 해당 부분은 무작위로 초기화됩니다.)
## 참고 자료[[resources]]
- [텍스트 분류 가이드](../tasks/sequence_classification)
- [토큰 분류 가이드](../tasks/token_classification)
- [질의 응답 가이드](../tasks/question_answering)
- [인과 언어 모델링 가이드](../tasks/language_modeling)
- [마스킹된 언어 모델링 가이드](../tasks/masked_language_modeling)
- [객관식 문제 가이드](../tasks/multiple_choice)
## ElectraConfig
[[autodoc]] ElectraConfig
## ElectraTokenizer
[[autodoc]] ElectraTokenizer
## ElectraTokenizerFast
[[autodoc]] ElectraTokenizerFast
## Electra specific outputs
[[autodoc]] models.electra.modeling_electra.ElectraForPreTrainingOutput
[[autodoc]] models.electra.modeling_tf_electra.TFElectraForPreTrainingOutput
<frameworkcontent>
<pt>
## ElectraModel
[[autodoc]] ElectraModel
- forward
## ElectraForPreTraining
[[autodoc]] ElectraForPreTraining
- forward
## ElectraForCausalLM
[[autodoc]] ElectraForCausalLM
- forward
## ElectraForMaskedLM
[[autodoc]] ElectraForMaskedLM
- forward
## ElectraForSequenceClassification
[[autodoc]] ElectraForSequenceClassification
- forward
## ElectraForMultipleChoice
[[autodoc]] ElectraForMultipleChoice
- forward
## ElectraForTokenClassification
[[autodoc]] ElectraForTokenClassification
- forward
## ElectraForQuestionAnswering
[[autodoc]] ElectraForQuestionAnswering
- forward
</pt>
<tf>
## TFElectraModel
[[autodoc]] TFElectraModel
- call
## TFElectraForPreTraining
[[autodoc]] TFElectraForPreTraining
- call
## TFElectraForMaskedLM
[[autodoc]] TFElectraForMaskedLM
- call
## TFElectraForSequenceClassification
[[autodoc]] TFElectraForSequenceClassification
- call
## TFElectraForMultipleChoice
[[autodoc]] TFElectraForMultipleChoice
- call
## TFElectraForTokenClassification
[[autodoc]] TFElectraForTokenClassification
- call
## TFElectraForQuestionAnswering
[[autodoc]] TFElectraForQuestionAnswering
- call
</tf>
<jax>
## FlaxElectraModel
[[autodoc]] FlaxElectraModel
- __call__
## FlaxElectraForPreTraining
[[autodoc]] FlaxElectraForPreTraining
- __call__
## FlaxElectraForCausalLM
[[autodoc]] FlaxElectraForCausalLM
- __call__
## FlaxElectraForMaskedLM
[[autodoc]] FlaxElectraForMaskedLM
- __call__
## FlaxElectraForSequenceClassification
[[autodoc]] FlaxElectraForSequenceClassification
- __call__
## FlaxElectraForMultipleChoice
[[autodoc]] FlaxElectraForMultipleChoice
- __call__
## FlaxElectraForTokenClassification
[[autodoc]] FlaxElectraForTokenClassification
- __call__
## FlaxElectraForQuestionAnswering
[[autodoc]] FlaxElectraForQuestionAnswering
- __call__
</jax>
</frameworkcontent>

View File

@@ -376,7 +376,7 @@ class DynamicCache(Cache):
self.key_cache.append(key_states) self.key_cache.append(key_states)
self.value_cache.append(value_states) self.value_cache.append(value_states)
def __getitem__(self, layer_idx: int) -> List[Tuple[torch.Tensor]]: def __getitem__(self, layer_idx: int) -> Tuple[torch.Tensor, torch.Tensor]:
""" """
Support for backwards-compatible `past_key_value` indexing, e.g. `past_key_value[0][0].shape[2]` to get the Support for backwards-compatible `past_key_value` indexing, e.g. `past_key_value[0][0].shape[2]` to get the
sequence length. sequence length.
@@ -649,7 +649,7 @@ class OffloadedCache(DynamicCache):
self.key_cache[prev_layer_idx] = self.key_cache[prev_layer_idx].to("cpu", non_blocking=True) self.key_cache[prev_layer_idx] = self.key_cache[prev_layer_idx].to("cpu", non_blocking=True)
self.value_cache[prev_layer_idx] = self.value_cache[prev_layer_idx].to("cpu", non_blocking=True) self.value_cache[prev_layer_idx] = self.value_cache[prev_layer_idx].to("cpu", non_blocking=True)
def __getitem__(self, layer_idx: int) -> List[Tuple[torch.Tensor]]: def __getitem__(self, layer_idx: int) -> Tuple[torch.Tensor, torch.Tensor]:
"Gets the cache for this layer to the device. Prefetches the next and evicts the previous layer." "Gets the cache for this layer to the device. Prefetches the next and evicts the previous layer."
if layer_idx < len(self): if layer_idx < len(self):
# Evict the previous layer if necessary # Evict the previous layer if necessary
@@ -1473,7 +1473,7 @@ class EncoderDecoderCache(Cache):
for layer_idx in range(len(cross_attention_cache.key_cache)): for layer_idx in range(len(cross_attention_cache.key_cache)):
self.is_updated[layer_idx] = bool(cross_attention_cache.get_seq_length(layer_idx) > 0) self.is_updated[layer_idx] = bool(cross_attention_cache.get_seq_length(layer_idx) > 0)
def __getitem__(self, layer_idx: int) -> List[Tuple[torch.Tensor]]: def __getitem__(self, layer_idx: int) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor]:
""" """
Support for backwards-compatible `past_key_value` indexing, e.g. `past_key_value[0][0].shape[2]` to get the Support for backwards-compatible `past_key_value` indexing, e.g. `past_key_value[0][0].shape[2]` to get the
sequence length. sequence length.

View File

@@ -151,17 +151,24 @@ def get_imports(filename: Union[str, os.PathLike]) -> list[str]:
content = f.read() content = f.read()
imported_modules = set() imported_modules = set()
import transformers.utils
def recursive_look_for_imports(node): def recursive_look_for_imports(node):
if isinstance(node, ast.Try): if isinstance(node, ast.Try):
return # Don't recurse into Try blocks and ignore imports in them return # Don't recurse into Try blocks and ignore imports in them
elif isinstance(node, ast.If): elif isinstance(node, ast.If):
test = node.test test = node.test
for condition_node in ast.walk(test): for condition_node in ast.walk(test):
if isinstance(condition_node, ast.Call) and getattr(condition_node.func, "id", "").startswith( if isinstance(condition_node, ast.Call):
"is_flash_attn" check_function = getattr(condition_node.func, "id", "")
): if (
# Don't recurse into "if flash_attn_available()" blocks and ignore imports in them check_function.endswith("available")
return and check_function.startswith("is_flash_attn")
or hasattr(transformers.utils.import_utils, check_function)
):
# Don't recurse into "if flash_attn_available()" or any "if library_available" blocks
# that appears in `transformers.utils.import_utils` and ignore imports in them
return
elif isinstance(node, ast.Import): elif isinstance(node, ast.Import):
# Handle 'import x' statements # Handle 'import x' statements
for alias in node.names: for alias in node.names:

View File

@@ -376,7 +376,7 @@ def infer_channel_dimension_format(
if image.shape[first_dim] in num_channels and image.shape[last_dim] in num_channels: if image.shape[first_dim] in num_channels and image.shape[last_dim] in num_channels:
logger.warning( logger.warning(
f"The channel dimension is ambiguous. Got image shape {image.shape}. Assuming channels are the first dimension." f"The channel dimension is ambiguous. Got image shape {image.shape}. Assuming channels are the first dimension. Use the [input_data_format](https://huggingface.co/docs/transformers/main/internal/image_processing_utils#transformers.image_transforms.rescale.input_data_format) parameter to assign the channel dimension."
) )
return ChannelDimension.FIRST return ChannelDimension.FIRST
elif image.shape[first_dim] in num_channels: elif image.shape[first_dim] in num_channels:

View File

@@ -10,6 +10,7 @@
# an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the # an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
# specific language governing permissions and limitations under the License. # specific language governing permissions and limitations under the License.
import logging
from typing import Optional from typing import Optional
import torch import torch
@@ -50,14 +51,22 @@ class TorchExportableModuleForDecoderOnlyLM(torch.nn.Module):
""" """
super().__init__() super().__init__()
if model.config.cache_implementation == "static": if not hasattr(model.config, "use_cache") or model.config.use_cache is False:
raise ValueError("The model must have caching enabled to be performant.")
if not hasattr(model.config, "cache_implementation"):
# If `cache_implementation` is not specified explicitly in the config, `DynamicCache` will
# be used by default, so export will use `StaticCache` by default.
logging.info("Using `StaticCache` for export as `cache_implementation` is not specified in the config.")
self.model = TorchExportableModuleWithStaticCache(model) self.model = TorchExportableModuleWithStaticCache(model)
elif model.config.cache_implementation == "hybrid":
self.model = TorchExportableModuleWithHybridCache(model, max_batch_size, max_cache_len)
else: else:
raise ValueError( if model.config.cache_implementation == "hybrid":
f"Unsupported cache implementation in this export recipe: '{model.config.cache_implementation}'" self.model = TorchExportableModuleWithHybridCache(model, max_batch_size, max_cache_len)
) else:
raise ValueError(
f"Unsupported cache implementation: {model.config.cache_implementation}. "
"Please use `hybrid` or `static`."
)
def forward( def forward(
self, self,
@@ -462,6 +471,8 @@ def convert_and_export_with_cache(
model: PreTrainedModel, model: PreTrainedModel,
example_input_ids: Optional[torch.Tensor] = None, example_input_ids: Optional[torch.Tensor] = None,
example_cache_position: Optional[torch.Tensor] = None, example_cache_position: Optional[torch.Tensor] = None,
dynamic_shapes: Optional[dict] = None,
strict: Optional[bool] = None,
): ):
""" """
Convert a `PreTrainedModel` into an exportable module and export it using `torch.export`, Convert a `PreTrainedModel` into an exportable module and export it using `torch.export`,
@@ -469,8 +480,10 @@ def convert_and_export_with_cache(
Args: Args:
model (`PreTrainedModel`): The pretrained model to be exported. model (`PreTrainedModel`): The pretrained model to be exported.
example_input_ids (`torch.Tensor`): Example input token id used by `torch.export`. example_input_ids (`Optional[torch.Tensor]`): Example input token id used by `torch.export`.
example_cache_position (`torch.Tensor`): Example current cache position used by `torch.export`. example_cache_position (`Optional[torch.Tensor]`): Example current cache position used by `torch.export`.
dynamic_shapes(`Optional[dict]`): Dynamic shapes used by `torch.export`.
strict(`Optional[bool]`): Flag to instruct `torch.export` to use `torchdynamo`.
Returns: Returns:
Exported program (`torch.export.ExportedProgram`): The exported program generated via `torch.export`. Exported program (`torch.export.ExportedProgram`): The exported program generated via `torch.export`.
@@ -489,14 +502,21 @@ def convert_and_export_with_cache(
example_cache_position if example_cache_position is not None else torch.tensor([0], dtype=torch.long) example_cache_position if example_cache_position is not None else torch.tensor([0], dtype=torch.long)
) )
if is_torch_greater_or_equal("2.5.0"): if is_torch_greater_or_equal("2.6.0"):
exported_program = torch.export.export( exported_program = torch.export.export(
TorchExportableModuleWithStaticCache(model), TorchExportableModuleWithStaticCache(model),
args=(example_input_ids,), args=(example_input_ids, example_cache_position),
kwargs={"cache_position": example_cache_position}, kwargs={},
strict=True, dynamic_shapes=dynamic_shapes,
strict=strict if strict is not None else True,
) )
else: else:
if dynamic_shapes is not None:
logging.warning(
"Dynamic shapes spec will be ignored by convert_and_export_with_cache for torch < 2.6.0."
)
if strict is not None:
logging.warning("The strict flag will be ingored by convert_and_export_with_cache for torch < 2.6.0.")
# We have to keep this path for BC. # We have to keep this path for BC.
# #
# Due to issue https://github.com/pytorch/pytorch/issues/128394, we need to switch to use an internal # Due to issue https://github.com/pytorch/pytorch/issues/128394, we need to switch to use an internal

View File

@@ -93,6 +93,7 @@ else:
), ),
("bigbird_pegasus", ("PegasusTokenizer", "PegasusTokenizerFast" if is_tokenizers_available() else None)), ("bigbird_pegasus", ("PegasusTokenizer", "PegasusTokenizerFast" if is_tokenizers_available() else None)),
("biogpt", ("BioGptTokenizer", None)), ("biogpt", ("BioGptTokenizer", None)),
("bitnet", (None, "PreTrainedTokenizerFast" if is_tokenizers_available() else None)),
("blenderbot", ("BlenderbotTokenizer", "BlenderbotTokenizerFast")), ("blenderbot", ("BlenderbotTokenizer", "BlenderbotTokenizerFast")),
("blenderbot-small", ("BlenderbotSmallTokenizer", None)), ("blenderbot-small", ("BlenderbotSmallTokenizer", None)),
("blip", ("BertTokenizer", "BertTokenizerFast" if is_tokenizers_available() else None)), ("blip", ("BertTokenizer", "BertTokenizerFast" if is_tokenizers_available() else None)),

View File

@@ -224,8 +224,13 @@ class Llama4TextConfig(PretrainedConfig):
Only used with 'llama3'. Scaling factor applied to high frequency components of the RoPE Only used with 'llama3'. Scaling factor applied to high frequency components of the RoPE
<TODO> <TODO>
<TODO> <TODO>
no_rope_layers (`int`, *optional*): TODO no_rope_layers (`List[int]`, *optional*):
no_rope_layer_interval (`int`, *optional*, defaults to 4): TODO List with at least the same length as the number of layers in the model.
A `1` at an index position indicates that the corresponding layer will use RoPE,
while a `0` indicates that it's a NoPE layer.
no_rope_layer_interval (`int`, *optional*, defaults to 4):
If `no_rope_layers` is `None`, it will be created using a NoPE layer every
`no_rope_layer_interval` layers.
attention_chunk_size (`int`, *optional*, defaults to 8192): attention_chunk_size (`int`, *optional*, defaults to 8192):
<TODO> <TODO>
attn_temperature_tuning (`bool`, *optional*, defaults to `True`): attn_temperature_tuning (`bool`, *optional*, defaults to `True`):
@@ -339,11 +344,15 @@ class Llama4TextConfig(PretrainedConfig):
self.output_router_logits = output_router_logits self.output_router_logits = output_router_logits
self.router_aux_loss_coef = router_aux_loss_coef self.router_aux_loss_coef = router_aux_loss_coef
self.router_jitter_noise = router_jitter_noise self.router_jitter_noise = router_jitter_noise
# Backwards compatibility
if no_rope_layers == []:
no_rope_layers = None
default_no_rope_layers = [ default_no_rope_layers = [
int((layer_idx + 1) % no_rope_layer_interval != 0) for layer_idx in range(self.num_hidden_layers) int((layer_idx + 1) % no_rope_layer_interval != 0) for layer_idx in range(self.num_hidden_layers)
] ]
# no_rope_layers == [] is invalid as we cannot have 0 layers
self.no_rope_layers = no_rope_layers if no_rope_layers else default_no_rope_layers self.no_rope_layers = no_rope_layers if no_rope_layers else default_no_rope_layers
self.interleave_moe_layer_step = interleave_moe_layer_step self.interleave_moe_layer_step = interleave_moe_layer_step

View File

@@ -65,6 +65,7 @@ ORIGINAL_TO_CONVERTED_KEY_MAPPING = {
r"layers.(\d+).feed_forward.w3.weight": r"language_model.model.layers.\1.feed_forward.up_proj.weight", # might need to be fused for efficiency? r"layers.(\d+).feed_forward.w3.weight": r"language_model.model.layers.\1.feed_forward.up_proj.weight", # might need to be fused for efficiency?
# r"layers.(\d+).feed_forward.mlp.fc1_weight": r"language_model.model.layers.\1.feed_forward.gate_up_proj.weight", # r"layers.(\d+).feed_forward.mlp.fc1_weight": r"language_model.model.layers.\1.feed_forward.gate_up_proj.weight",
r"layers.(\d+).feed_forward.mlp.fc2_weight": r"language_model.model.layers.\1.feed_forward.down_proj.weight", r"layers.(\d+).feed_forward.mlp.fc2_weight": r"language_model.model.layers.\1.feed_forward.down_proj.weight",
r"layers.(\d+).feed_forward.w2.weight": r"language_model.model.layers.\1.feed_forward.down_proj.weight",
r"layers.(\d+).feed_forward.mlp.layer_norm.weight": r"language_model.model.layers.\1.post_attention_layernorm.weight", r"layers.(\d+).feed_forward.mlp.layer_norm.weight": r"language_model.model.layers.\1.post_attention_layernorm.weight",
# Vision encoder mapping # Vision encoder mapping
@@ -166,8 +167,8 @@ def get_concat_dim(key):
return 0 return 0
def compute_intermediate_size(hidden_dim, multiple_of=1024, ffn_dim_multiplier=1.3): def compute_intermediate_size(hidden_dim, ffn_exp=4, multiple_of=1024, ffn_dim_multiplier=1.2):
hidden_dim = 4 * int(2 * hidden_dim / 3) hidden_dim = ffn_exp * int(2 * hidden_dim / 3)
hidden_dim = int(ffn_dim_multiplier * hidden_dim) hidden_dim = int(ffn_dim_multiplier * hidden_dim)
hidden_dim = multiple_of * ((hidden_dim + multiple_of - 1) // multiple_of) hidden_dim = multiple_of * ((hidden_dim + multiple_of - 1) // multiple_of)
return hidden_dim return hidden_dim
@@ -203,6 +204,8 @@ def max_context_length(model_path, instruct=False):
with open(os.path.join(model_path, "params.json"), "r") as f: with open(os.path.join(model_path, "params.json"), "r") as f:
params = json.load(f) params = json.load(f)
params = params.get("model", params) params = params.get("model", params)
if params.get("moe_args") is None:
return 8192
num_experts = params["moe_args"]["num_experts"] num_experts = params["moe_args"]["num_experts"]
return 10485760 if num_experts == 16 else 1048576 return 10485760 if num_experts == 16 else 1048576
@@ -242,24 +245,40 @@ def write_model(
# some constants from original code # some constants from original code
rope_scaling = { rope_scaling = {
"rope_type": "llama3", "rope_type": "llama3",
"factor": 8.0, "factor": params.get("rope_scaling_factor", 8.0),
"low_freq_factor": 1.0, "low_freq_factor": 1.0,
"high_freq_factor": 4.0, "high_freq_factor": params.get("rope_high_freq_factor", 4.0),
"original_max_position_embeddings": 8192, "original_max_position_embeddings": 8192,
} }
config_kwargs.update({"rope_scaling": rope_scaling}) config_kwargs.update({"rope_scaling": rope_scaling})
if attention_chunk_size is None:
config_kwargs.update({"cache_implementation": "static"})
# compute additional params for weight conversion # compute additional params for weight conversion
num_heads_per_shard = num_heads // num_shards num_heads_per_shard = num_heads // num_shards
dim_per_head = dim // num_heads dim_per_head = dim // num_heads
# intermediate_size = compute_intermediate_size(dim, multiple_of=params["multiple_of"]) intermediate_size_mlp = compute_intermediate_size(
dim,
ffn_exp=params["ffn_exp"],
multiple_of=params["multiple_of"],
ffn_dim_multiplier=params["ffn_dim_multiplier"],
)
num_key_value_heads = params["n_kv_heads"] # for GQA / MQA num_key_value_heads = params["n_kv_heads"] # for GQA / MQA
num_experts = params["moe_args"]["num_experts"] if hasattr(params, "moe_args"):
interleave_moe_layer_step = params["moe_args"].get("interleave_moe_layer_step", 1) num_experts = params["moe_args"]["num_experts"]
interleave_moe_layer_step = params["moe_args"].get("interleave_moe_layer_step", 1)
else:
# Dense model (possibly Llama Guard) - disable all moe layers
num_experts = 0
interleave_moe_layer_step = 0
config_kwargs.update({"moe_layers": []})
# Ensure all layers are rope if `nope_layer_interval` is None
no_rope_layer_interval = params["nope_layer_interval"] no_rope_layer_interval = params["nope_layer_interval"]
no_rope_layer_interval = num_heads * 2 if no_rope_layer_interval is None else no_rope_layer_interval
bos_token_id = 200000 bos_token_id = 200000
eos_token_id = [200001, 200007, 200008] if instruct else 200001 eos_token_id = [200001, 200007, 200008] if instruct else 200001
@@ -273,7 +292,7 @@ def write_model(
rope_theta=rope_theta, rope_theta=rope_theta,
num_hidden_layers=num_layers, num_hidden_layers=num_layers,
intermediate_size=8192, intermediate_size=8192,
intermediate_size_mlp=16384, intermediate_size_mlp=intermediate_size_mlp,
max_position_embeddings=max_context_length(input_base_path, instruct), max_position_embeddings=max_context_length(input_base_path, instruct),
num_local_experts=num_experts, num_local_experts=num_experts,
interleave_moe_layer_step=interleave_moe_layer_step, interleave_moe_layer_step=interleave_moe_layer_step,
@@ -336,7 +355,7 @@ def write_model(
sharded_keys = [] sharded_keys = []
for _key in all_keys_raw: for _key in all_keys_raw:
try: try:
if (loaded[0][_key] == loaded[1][_key]).all(): if num_shards == 1 or (loaded[0][_key] == loaded[1][_key]).all():
repeated_keys.append(_key) repeated_keys.append(_key)
else: else:
sharded_keys.append(_key) sharded_keys.append(_key)
@@ -354,7 +373,7 @@ def write_model(
for key in tqdm(all_keys, desc="Renaming and processing all keys", unit="key"): for key in tqdm(all_keys, desc="Renaming and processing all keys", unit="key"):
new_key = new_keys[key] new_key = new_keys[key]
print(key, new_key) print(key, new_key)
if not is_param_same_across_shards(new_key): if num_shards > 1 and not is_param_same_across_shards(new_key):
current_parameter = [chunk.pop(key) for chunk in loaded if not isinstance(chunk[key], io.BytesIO)] current_parameter = [chunk.pop(key) for chunk in loaded if not isinstance(chunk[key], io.BytesIO)]
else: else:
print(f"{key} (now {new_key}) is the same across all shards.") print(f"{key} (now {new_key}) is the same across all shards.")
@@ -565,8 +584,8 @@ LLAMA4_TEXT_POST_TRAIN_SPECIAL_TOKENS = [
"<|python_end|>", "<|python_end|>",
"<|finetune_right_pad|>", "<|finetune_right_pad|>",
] + get_reserved_special_tokens( ] + get_reserved_special_tokens(
"text_post_train", 61, 6 "text_post_train", 61, 8
) # <|text_post_train_reserved_special_token_6|>, ..., <|text_post_train_reserved_special_token_66|> ) # <|text_post_train_reserved_special_token_8|>, ..., <|text_post_train_reserved_special_token_68|>
# 200080, ..., 201133 # 200080, ..., 201133
LLAMA4_VISION_SPECIAL_TOKENS = [ LLAMA4_VISION_SPECIAL_TOKENS = [
@@ -621,15 +640,6 @@ class Llama4Converter(TikTokenConverter):
**kwargs, **kwargs,
) )
# to check
# import tiktoken
# model = tiktoken.Encoding(
# name=Path(model_path).name,
# pat_str=self.O200K_PATTERN,
# mergeable_ranks=mergeable_ranks,
# special_tokens=self.special_tokens,
# )
instruct = chat_template is not None instruct = chat_template is not None
self.update_post_processor(self.converted_tokenizer) self.update_post_processor(self.converted_tokenizer)
# finer special_tokens_map.json # finer special_tokens_map.json
@@ -687,12 +697,10 @@ if __name__ == "__main__":
parser.add_argument( parser.add_argument(
"--input_dir", "--input_dir",
type=str, type=str,
default="/fsx/arthur/Llama-4-17B-Omni-Instruct-Original",
help="Location of the local folder copied from the Hub.", help="Location of the local folder copied from the Hub.",
) )
parser.add_argument( parser.add_argument(
"--output_dir", "--output_dir",
default="llama4_hf_vision",
type=str, type=str,
help="Location to write HF model and tokenizer", help="Location to write HF model and tokenizer",
) )

View File

@@ -20,12 +20,11 @@ from typing import Callable, List, Optional, Tuple, Union
import torch import torch
import torch.nn as nn import torch.nn as nn
import torch.nn.functional as F import torch.nn.functional as F
import torch.utils.checkpoint
from transformers.models.llama4.configuration_llama4 import Llama4VisionConfig from transformers.models.llama4.configuration_llama4 import Llama4VisionConfig
from ...activations import ACT2FN from ...activations import ACT2FN
from ...cache_utils import Cache, HybridChunkedCache from ...cache_utils import Cache, DynamicCache, HybridChunkedCache
from ...generation import GenerationMixin from ...generation import GenerationMixin
from ...integrations.hub_kernels import use_kernel_forward_from_hub from ...integrations.hub_kernels import use_kernel_forward_from_hub
from ...modeling_attn_mask_utils import AttentionMaskConverter from ...modeling_attn_mask_utils import AttentionMaskConverter
@@ -287,7 +286,7 @@ class Llama4TextAttention(nn.Module):
self.attn_temperature_tuning = config.attn_temperature_tuning self.attn_temperature_tuning = config.attn_temperature_tuning
self.attention_dropout = config.attention_dropout self.attention_dropout = config.attention_dropout
self.is_causal = True self.is_causal = True
self.use_rope = int((layer_idx + 1) % 4 != 0) # rope unused for dense layers self.use_rope = config.no_rope_layers[layer_idx]
self.q_proj = nn.Linear( self.q_proj = nn.Linear(
config.hidden_size, config.num_attention_heads * self.head_dim, bias=config.attention_bias config.hidden_size, config.num_attention_heads * self.head_dim, bias=config.attention_bias
) )
@@ -374,7 +373,7 @@ class Llama4TextDecoderLayer(nn.Module):
super().__init__() super().__init__()
self.hidden_size = config.hidden_size self.hidden_size = config.hidden_size
self.self_attn = Llama4TextAttention(config, layer_idx) self.self_attn = Llama4TextAttention(config, layer_idx)
self.use_chunked_attention = int((layer_idx + 1) % 4 != 0) # <=> use rope self.use_chunked_attention = config.attention_chunk_size is not None and bool(config.no_rope_layers[layer_idx])
self.is_moe_layer = layer_idx in config.moe_layers self.is_moe_layer = layer_idx in config.moe_layers
if self.is_moe_layer: # the 128E model interleaves dense / sparse if self.is_moe_layer: # the 128E model interleaves dense / sparse
self.feed_forward = Llama4TextMoe(config) self.feed_forward = Llama4TextMoe(config)
@@ -643,7 +642,10 @@ class Llama4TextModel(Llama4PreTrainedModel):
inputs_embeds = self.embed_tokens(input_ids.to(self.embed_tokens.weight.device)) inputs_embeds = self.embed_tokens(input_ids.to(self.embed_tokens.weight.device))
if use_cache and past_key_values is None: if use_cache and past_key_values is None:
past_key_values = HybridChunkedCache(self.config, inputs_embeds.shape[0], inputs_embeds.shape[1]) if self.config.get_text_config().get("attention_chunk_size") is not None:
past_key_values = HybridChunkedCache(self.config, inputs_embeds.shape[0], inputs_embeds.shape[1])
else:
past_key_values = DynamicCache(self.config, inputs_embeds.shape[0], inputs_embeds.shape[1])
if cache_position is None: if cache_position is None:
past_seen_tokens = past_key_values.get_seq_length() if past_key_values is not None else 0 past_seen_tokens = past_key_values.get_seq_length() if past_key_values is not None else 0
@@ -740,6 +742,7 @@ class Llama4TextModel(Llama4PreTrainedModel):
sequence_length = input_tensor.shape[1] sequence_length = input_tensor.shape[1]
cache_position = cache_position.to(self.device) cache_position = cache_position.to(self.device)
attention_chunk_size = self.config.attention_chunk_size attention_chunk_size = self.config.attention_chunk_size
using_chunked_attention = attention_chunk_size is not None
first_cache_position = cache_position[0] first_cache_position = cache_position[0]
@@ -748,26 +751,28 @@ class Llama4TextModel(Llama4PreTrainedModel):
else: else:
full_cache_length = attention_mask.shape[-1] if attention_mask is not None else sequence_length full_cache_length = attention_mask.shape[-1] if attention_mask is not None else sequence_length
cond1 = first_cache_position >= attention_chunk_size if using_chunked_attention:
cond2 = (first_cache_position < attention_chunk_size) & ( cond1 = first_cache_position >= attention_chunk_size
first_cache_position + sequence_length > attention_chunk_size cond2 = (first_cache_position < attention_chunk_size) & (
) first_cache_position + sequence_length > attention_chunk_size
key_length = ( )
torch.where( key_length = (
cond1, torch.where(
attention_chunk_size + sequence_length - 1, cond1,
torch.where(cond2, first_cache_position + sequence_length, attention_chunk_size), attention_chunk_size + sequence_length - 1,
torch.where(cond2, first_cache_position + sequence_length, attention_chunk_size),
)
if use_cache
else full_cache_length
) )
if use_cache
else full_cache_length
)
if self.config._attn_implementation == "flex_attention": if self.config._attn_implementation == "flex_attention":
if isinstance(attention_mask, torch.Tensor): if isinstance(attention_mask, torch.Tensor):
offsets = (first_cache_position, max(first_cache_position - attention_chunk_size + 1, 0)) if using_chunked_attention:
chunked_attention_mask = make_flex_block_causal_mask( offsets = (first_cache_position, max(first_cache_position - attention_chunk_size + 1, 0))
attention_mask, self.config.attention_chunk_size, sequence_length, key_length, offsets=offsets chunked_attention_mask = make_flex_block_causal_mask(
) attention_mask, attention_chunk_size, sequence_length, key_length, offsets=offsets
)
attention_mask = make_flex_block_causal_mask( attention_mask = make_flex_block_causal_mask(
attention_mask, attention_mask,
query_length=sequence_length, query_length=sequence_length,
@@ -780,15 +785,16 @@ class Llama4TextModel(Llama4PreTrainedModel):
# In case the provided `attention` mask is 2D, we generate a causal mask here (4D). # In case the provided `attention` mask is 2D, we generate a causal mask here (4D).
dtype, device = input_tensor.dtype, input_tensor.device dtype, device = input_tensor.dtype, input_tensor.device
target_length = max(full_cache_length, attention_chunk_size) if using_chunked_attention else full_cache_length
causal_mask = self._prepare_4d_causal_attention_mask_with_cache_position( causal_mask = self._prepare_4d_causal_attention_mask_with_cache_position(
attention_mask, attention_mask,
sequence_length=sequence_length, sequence_length=sequence_length,
target_length=max(full_cache_length, attention_chunk_size), target_length=target_length,
dtype=dtype, dtype=dtype,
cache_position=cache_position, cache_position=cache_position,
batch_size=input_tensor.shape[0], batch_size=input_tensor.shape[0],
) )
if full_cache_length > self.config.attention_chunk_size: if using_chunked_attention and full_cache_length > attention_chunk_size:
start_idx = max(first_cache_position - attention_chunk_size + 1, 0) start_idx = max(first_cache_position - attention_chunk_size + 1, 0)
end_idx = start_idx + key_length end_idx = start_idx + key_length
chunked_attention_mask = self.create_chunked_attention_mask( chunked_attention_mask = self.create_chunked_attention_mask(
@@ -873,7 +879,6 @@ class Llama4TextModel(Llama4PreTrainedModel):
sequence_length: int, sequence_length: int,
target_length: int, target_length: int,
dtype: torch.dtype, dtype: torch.dtype,
device: torch.device,
cache_position: torch.Tensor, cache_position: torch.Tensor,
batch_size: int, batch_size: int,
**kwargs, **kwargs,
@@ -906,16 +911,18 @@ class Llama4TextModel(Llama4PreTrainedModel):
else: else:
min_dtype = torch.finfo(dtype).min min_dtype = torch.finfo(dtype).min
causal_mask = torch.full( causal_mask = torch.full(
(sequence_length, target_length), fill_value=min_dtype, dtype=dtype, device=device (sequence_length, target_length), fill_value=min_dtype, dtype=dtype, device=cache_position.device
) )
if sequence_length != 1: if sequence_length != 1:
causal_mask = torch.triu(causal_mask, diagonal=1) causal_mask = torch.triu(causal_mask, diagonal=1)
causal_mask *= torch.arange(target_length, device=device) > cache_position.to(device).reshape(-1, 1) causal_mask *= torch.arange(target_length, device=cache_position.device) > cache_position.reshape(-1, 1)
causal_mask = causal_mask[None, None, :, :].expand(batch_size, 1, -1, -1) causal_mask = causal_mask[None, None, :, :].expand(batch_size, 1, -1, -1)
if attention_mask is not None: if attention_mask is not None:
causal_mask = causal_mask.clone() # copy to contiguous memory for in-place edit causal_mask = causal_mask.clone() # copy to contiguous memory for in-place edit
mask_length = attention_mask.shape[-1] mask_length = attention_mask.shape[-1]
padding_mask = causal_mask[:, :, :, :mask_length] + attention_mask[:, None, None, :].to(device) padding_mask = causal_mask[:, :, :, :mask_length] + attention_mask[:, None, None, :].to(
cache_position.device
)
padding_mask = padding_mask == 0 padding_mask = padding_mask == 0
causal_mask[:, :, :, :mask_length] = causal_mask[:, :, :, :mask_length].masked_fill( causal_mask[:, :, :, :mask_length] = causal_mask[:, :, :, :mask_length].masked_fill(
padding_mask, min_dtype padding_mask, min_dtype

View File

@@ -1440,6 +1440,9 @@ class ProcessorMixin(PushToHubMixin):
if value is not None and not isinstance(value, dict): if value is not None and not isinstance(value, dict):
processed_kwargs[kwarg_type][key] = value processed_kwargs[kwarg_type][key] = value
# Pass unprocessed custom kwargs
processed_kwargs["template_kwargs"].update(kwargs)
if isinstance(conversation, (list, tuple)) and ( if isinstance(conversation, (list, tuple)) and (
isinstance(conversation[0], (list, tuple)) or hasattr(conversation[0], "content") isinstance(conversation[0], (list, tuple)) or hasattr(conversation[0], "content")
): ):

View File

@@ -240,6 +240,7 @@ def parse_int_from_env(key, default=None):
_run_slow_tests = parse_flag_from_env("RUN_SLOW", default=False) _run_slow_tests = parse_flag_from_env("RUN_SLOW", default=False)
_run_flaky_tests = parse_flag_from_env("RUN_FLAKY", default=True)
_run_custom_tokenizers = parse_flag_from_env("RUN_CUSTOM_TOKENIZERS", default=False) _run_custom_tokenizers = parse_flag_from_env("RUN_CUSTOM_TOKENIZERS", default=False)
_run_staging = parse_flag_from_env("HUGGINGFACE_CO_STAGING", default=False) _run_staging = parse_flag_from_env("HUGGINGFACE_CO_STAGING", default=False)
_run_pipeline_tests = parse_flag_from_env("RUN_PIPELINE_TESTS", default=True) _run_pipeline_tests = parse_flag_from_env("RUN_PIPELINE_TESTS", default=True)
@@ -2614,7 +2615,7 @@ def is_flaky(max_attempts: int = 5, wait_before_retry: Optional[float] = None, d
return test_func_ref(*args, **kwargs) return test_func_ref(*args, **kwargs)
return wrapper return unittest.skipUnless(_run_flaky_tests, "test is flaky")(wrapper)
return decorator return decorator

View File

@@ -3704,7 +3704,10 @@ class Trainer:
arguments, depending on the situation. arguments, depending on the situation.
""" """
if self.use_cpu_amp: if self.use_cpu_amp:
ctx_manager = torch.amp.autocast("cpu", cache_enabled=cache_enabled, dtype=self.amp_dtype) # TODO Matt: This syntax is deprecated and the preferred version is
# torch.amp.autocast("cpu", cache_enabled=cache_enabled, dtype=self.amp_dtype)
# but this is unavailable on Torch 2.1 or earlier. We can change this when we stop supporting 2.1.
ctx_manager = torch.cpu.amp.autocast(cache_enabled=cache_enabled, dtype=self.amp_dtype)
else: else:
ctx_manager = contextlib.nullcontext() ctx_manager = contextlib.nullcontext()

View File

@@ -1097,25 +1097,18 @@ class GenerationTesterMixin:
# test output equality of low versus high memory # test output equality of low versus high memory
model = model_class(config).to(torch_device).eval() model = model_class(config).to(torch_device).eval()
generate_kwargs = {
"top_k": 4,
"penalty_alpha": 0.6,
"max_new_tokens": self.max_new_tokens,
"use_cache": True,
"return_dict_in_generate": True,
"output_scores": True,
}
low_output = model.generate( low_output = model.generate(**inputs_dict, **generate_kwargs, low_memory=True)
top_k=4, high_output = model.generate(**inputs_dict, **generate_kwargs, low_memory=False)
penalty_alpha=0.6, self._check_similar_generate_outputs(low_output, high_output)
low_memory=True,
max_new_tokens=self.max_new_tokens,
**inputs_dict,
use_cache=True,
)
high_output = model.generate(
top_k=4,
penalty_alpha=0.6,
low_memory=False,
max_new_tokens=self.max_new_tokens,
**inputs_dict,
use_cache=True,
)
self.assertListEqual(low_output.tolist(), high_output.tolist())
@parameterized.expand([("random",), ("same",)]) @parameterized.expand([("random",), ("same",)])
@pytest.mark.generate @pytest.mark.generate
@@ -1863,22 +1856,29 @@ class GenerationTesterMixin:
model = model_class(config).to(torch_device) model = model_class(config).to(torch_device)
model.eval() model.eval()
model.generation_config.pad_token_id = model.generation_config.eos_token_id = -1
model.generation_config.forced_eos_token_id = None
model.generation_config.encoder_no_repeat_ngram_size = 0
model.generation_config.use_cache = True
# If "past_key_values" is not returned, skip the test (e.g. RWKV uses a different cache name and format) # If "past_key_values" is not returned, skip the test (e.g. RWKV uses a different cache name and format)
outputs = model(**inputs) outputs = model(**inputs)
if "past_key_values" not in outputs: if "past_key_values" not in outputs:
self.skipTest(reason="This model doesn't return `past_key_values`") self.skipTest(reason="This model doesn't return `past_key_values`")
generate_kwargs = {
"pad_token_id": -1,
"eos_token_id": -1,
"forced_eos_token_id": None,
"encoder_no_repeat_ngram_size": 0,
"use_cache": True,
"do_sample": False,
"return_dict_in_generate": True,
"output_scores": True,
}
# Traditional way of generating text, with `return_dict_in_generate` to return the past key values # Traditional way of generating text, with `return_dict_in_generate` to return the past key values
outputs = model.generate(**inputs, do_sample=False, max_new_tokens=4, return_dict_in_generate=True) outputs = model.generate(**inputs, **generate_kwargs, max_new_tokens=4)
# Let's generate again, but passing the past key values in between (3 + 1 = 4 tokens). Note that the # Let's generate again, but passing the past key values in between (3 + 1 = 4 tokens). Note that the
# inputs may need to be tweaked across `generate` calls (like the attention mask). # inputs may need to be tweaked across `generate` calls (like the attention mask).
outputs_cached = model.generate(**inputs, do_sample=False, max_new_tokens=3, return_dict_in_generate=True) outputs_cached = model.generate(**inputs, **generate_kwargs, max_new_tokens=3)
# Continue from the tokens generated above, preparing the inputs accordingly # Continue from the tokens generated above, preparing the inputs accordingly
inputs["past_key_values"] = outputs_cached.past_key_values inputs["past_key_values"] = outputs_cached.past_key_values
@@ -1901,10 +1901,13 @@ class GenerationTesterMixin:
mode="constant", mode="constant",
value=1, value=1,
) )
outputs_cached = model.generate(**inputs, do_sample=False, max_new_tokens=1, return_dict_in_generate=True) first_caches_scores = outputs_cached.scores
outputs_cached = model.generate(**inputs, **generate_kwargs, max_new_tokens=1)
full_cached_scores = first_caches_scores + outputs_cached.scores
outputs_cached.scores = full_cached_scores
# The two sets of generated text and past kv should be equal to each other # The two sets of generated text and past kv should be equal to each other
self.assertListEqual(outputs.sequences.tolist(), outputs_cached.sequences.tolist()) self._check_similar_generate_outputs(outputs, outputs_cached)
for layer_idx in range(len(outputs_cached.past_key_values)): for layer_idx in range(len(outputs_cached.past_key_values)):
for kv_idx in range(len(outputs_cached.past_key_values[layer_idx])): for kv_idx in range(len(outputs_cached.past_key_values[layer_idx])):
self.assertTrue( self.assertTrue(
@@ -1930,6 +1933,8 @@ class GenerationTesterMixin:
if config.is_encoder_decoder: if config.is_encoder_decoder:
self.skipTest(reason="This model is encoder-decoder") self.skipTest(reason="This model is encoder-decoder")
# TODO (joao, raushan): the correct line below is `if not hasattr(config.get_text_config(), "use_cache")`,
# but it breaks a few models. Fix and then apply `_check_similar_generate_outputs` pattern
if not hasattr(config, "use_cache"): if not hasattr(config, "use_cache"):
self.skipTest(reason=f"{model_class.__name__} doesn't support caching") self.skipTest(reason=f"{model_class.__name__} doesn't support caching")
@@ -1990,32 +1995,6 @@ class GenerationTesterMixin:
) )
) )
@parameterized.expand([("offloaded",)]) # ("offloaded_static",) TODO: @raushan fixme in some models (eg T5)
@require_torch_accelerator
@pytest.mark.generate
def test_offloaded_cache_implementation(self, cache_implementation):
"""Tests we can generate by indicating `cache_implementation` for each possible cache class"""
for model_class in self.all_generative_model_classes:
if not model_class._supports_cache_class:
self.skipTest(reason="This model does not support the new cache format")
config, inputs_dict = self.prepare_config_and_inputs_for_generate()
model = model_class(config).to(torch_device).eval()
generation_kwargs = {
"max_new_tokens": 5,
"use_cache": True,
"cache_implementation": cache_implementation,
}
legacy_results = model.generate(**generation_kwargs, **inputs_dict)
# Most cache classes have their own tests except for some that are tested here
# The ones here do not need special treatment when passing `cache_implementation`
# and are not bound to specific models only
new_results = model.generate(**generation_kwargs, **inputs_dict)
self.assertListEqual(legacy_results.tolist(), new_results.tolist())
@pytest.mark.generate @pytest.mark.generate
def test_generate_with_static_cache(self): def test_generate_with_static_cache(self):
""" """

View File

@@ -27,11 +27,13 @@ from transformers import (
is_vision_available, is_vision_available,
) )
from transformers.testing_utils import ( from transformers.testing_utils import (
Expectations,
cleanup, cleanup,
require_av, require_av,
require_bitsandbytes, require_bitsandbytes,
require_deterministic_for_xpu,
require_torch, require_torch,
require_torch_gpu, require_torch_accelerator,
slow, slow,
torch_device, torch_device,
) )
@@ -177,7 +179,7 @@ class InternVLVisionText2TextModelTester:
model = InternVLForConditionalGeneration(config=config) model = InternVLForConditionalGeneration(config=config)
model.to(torch_device) model.to(torch_device)
model.eval() model.eval()
with torch.autocast(device_type="cuda", dtype=torch.float16): with torch.autocast(device_type=torch_device, dtype=torch.float16):
logits = model( logits = model(
input_ids=input_ids, input_ids=input_ids,
attention_mask=attention_mask, attention_mask=attention_mask,
@@ -279,7 +281,7 @@ class InternVLModelTest(ModelTesterMixin, GenerationTesterMixin, PipelineTesterM
@slow @slow
@require_torch_gpu @require_torch_accelerator
class InternVLQwen2IntegrationTest(unittest.TestCase): class InternVLQwen2IntegrationTest(unittest.TestCase):
def setUp(self): def setUp(self):
self.small_model_checkpoint = "OpenGVLab/InternVL3-1B-hf" self.small_model_checkpoint = "OpenGVLab/InternVL3-1B-hf"
@@ -326,7 +328,14 @@ class InternVLQwen2IntegrationTest(unittest.TestCase):
output = model(**inputs) output = model(**inputs)
actual_logits = output.logits[0, -1, :5].cpu() actual_logits = output.logits[0, -1, :5].cpu()
expected_logits = torch.tensor([11.9375, 14.8750, 14.0625, 10.7500, 6.9062], dtype=torch.bfloat16) expected_logits_all = Expectations(
{
("xpu", 3): torch.tensor([11.7500, 14.7500, 14.1250, 10.5625, 6.7812], dtype=torch.bfloat16),
("cuda", 7): torch.tensor([11.9375, 14.8750, 14.0625, 10.7500, 6.9062], dtype=torch.bfloat16),
}
) # fmt: skip
expected_logits = expected_logits_all.get_expectation()
self.assertTrue( self.assertTrue(
torch.allclose(actual_logits, expected_logits, atol=0.1), torch.allclose(actual_logits, expected_logits, atol=0.1),
f"Actual logits: {actual_logits}" f"Actual logits: {actual_logits}"
@@ -334,6 +343,7 @@ class InternVLQwen2IntegrationTest(unittest.TestCase):
f"\nDifference: {torch.abs(actual_logits - expected_logits)}", f"\nDifference: {torch.abs(actual_logits - expected_logits)}",
) )
@require_deterministic_for_xpu
def test_qwen2_small_model_integration_generate_text_only(self): def test_qwen2_small_model_integration_generate_text_only(self):
processor = AutoProcessor.from_pretrained(self.small_model_checkpoint) processor = AutoProcessor.from_pretrained(self.small_model_checkpoint)
model = InternVLForConditionalGeneration.from_pretrained( model = InternVLForConditionalGeneration.from_pretrained(
@@ -346,7 +356,15 @@ class InternVLQwen2IntegrationTest(unittest.TestCase):
decoded_output = processor.decode( decoded_output = processor.decode(
generate_ids[0, inputs["input_ids"].shape[1] :], skip_special_tokens=True generate_ids[0, inputs["input_ids"].shape[1] :], skip_special_tokens=True
) )
expected_output = "Whispers of dawn,\nSilent whispers of the night,\nNew day's light begins."
expected_outputs = Expectations(
{
("xpu", 3): "Whispers of dawn,\nSilent whispers of the night,\nNew day's light.",
("cuda", 7): "Whispers of dawn,\nSilent whispers of the night,\nNew day's light begins.",
}
) # fmt: skip
expected_output = expected_outputs.get_expectation()
self.assertEqual(decoded_output, expected_output) self.assertEqual(decoded_output, expected_output)
def test_qwen2_small_model_integration_generate_chat_template(self): def test_qwen2_small_model_integration_generate_chat_template(self):
@@ -375,6 +393,7 @@ class InternVLQwen2IntegrationTest(unittest.TestCase):
expected_output = "The image shows two cats lying on a pink blanket. The cat on the left is a tabby" expected_output = "The image shows two cats lying on a pink blanket. The cat on the left is a tabby"
self.assertEqual(decoded_output, expected_output) self.assertEqual(decoded_output, expected_output)
@require_deterministic_for_xpu
def test_qwen2_small_model_integration_batched_generate(self): def test_qwen2_small_model_integration_batched_generate(self):
processor = AutoProcessor.from_pretrained(self.small_model_checkpoint) processor = AutoProcessor.from_pretrained(self.small_model_checkpoint)
model = InternVLForConditionalGeneration.from_pretrained( model = InternVLForConditionalGeneration.from_pretrained(
@@ -404,7 +423,15 @@ class InternVLQwen2IntegrationTest(unittest.TestCase):
) )
# Check second output # Check second output
decoded_output = processor.decode(output[1], skip_special_tokens=True) decoded_output = processor.decode(output[1], skip_special_tokens=True)
expected_output = 'user\n\nDescribe this image\nassistant\nThe image shows a street scene with a traditional Chinese archway, known as a "Chinese Gate" or "Chinese Gate of' # fmt: skip
expected_outputs = Expectations(
{
("xpu", 3): 'user\n\nDescribe this image\nassistant\nThe image shows a street scene with a traditional Chinese archway, known as a "Chinese Gate" or "Chinese Gate"',
("cuda", 7): 'user\n\nDescribe this image\nassistant\nThe image shows a street scene with a traditional Chinese archway, known as a "Chinese Gate" or "Chinese Gate of',
}
) # fmt: skip
expected_output = expected_outputs.get_expectation()
self.assertEqual( self.assertEqual(
decoded_output, decoded_output,
expected_output, expected_output,
@@ -455,7 +482,14 @@ class InternVLQwen2IntegrationTest(unittest.TestCase):
# Check second output # Check second output
decoded_output = processor.decode(output[1], skip_special_tokens=True) decoded_output = processor.decode(output[1], skip_special_tokens=True)
expected_output = 'user\n\nWhat are the differences between these two images?\nassistant\nThe images show the Statue of Liberty and the Golden Gate Bridge from different angles. Here are the differences:\n\n1. **Angle' # fmt: skip expected_outputs = Expectations(
{
("xpu", 3): "user\n\nWhat are the differences between these two images?\nassistant\nThe images show the Statue of Liberty and the Golden Gate Bridge from different angles. Here are the differences:\n\n1. **Foreground",
("cuda", 7): "user\n\nWhat are the differences between these two images?\nassistant\nThe images show the Statue of Liberty and the Golden Gate Bridge from different angles. Here are the differences:\n\n1. **Angle",
}
) # fmt: skip
expected_output = expected_outputs.get_expectation()
self.assertEqual( self.assertEqual(
decoded_output, decoded_output,
expected_output, expected_output,
@@ -495,7 +529,13 @@ class InternVLQwen2IntegrationTest(unittest.TestCase):
output = model.generate(**inputs, do_sample=False, max_new_tokens=25) output = model.generate(**inputs, do_sample=False, max_new_tokens=25)
decoded_output = processor.decode(output[0, inputs["input_ids"].shape[1] :], skip_special_tokens=True) decoded_output = processor.decode(output[0, inputs["input_ids"].shape[1] :], skip_special_tokens=True)
expected_output = 'The man is performing a forehand shot.' # fmt: skip expected_outputs = Expectations(
{
("xpu", 3): "The man is performing a volley.",
("cuda", 7): "The man is performing a forehand shot.",
}
) # fmt: skip
expected_output = expected_outputs.get_expectation()
self.assertEqual( self.assertEqual(
decoded_output, decoded_output,
expected_output, expected_output,
@@ -503,6 +543,7 @@ class InternVLQwen2IntegrationTest(unittest.TestCase):
) )
@require_av @require_av
@require_deterministic_for_xpu
def test_qwen2_small_model_integration_interleaved_images_videos(self): def test_qwen2_small_model_integration_interleaved_images_videos(self):
processor = AutoProcessor.from_pretrained(self.small_model_checkpoint) processor = AutoProcessor.from_pretrained(self.small_model_checkpoint)
model = InternVLForConditionalGeneration.from_pretrained( model = InternVLForConditionalGeneration.from_pretrained(
@@ -564,7 +605,13 @@ class InternVLQwen2IntegrationTest(unittest.TestCase):
decoded_output = processor.decode(output[0], skip_special_tokens=True) decoded_output = processor.decode(output[0], skip_special_tokens=True)
# Batching seems to alter the output slightly, but it is also the case in the original implementation. This seems to be expected: https://github.com/huggingface/transformers/issues/23017#issuecomment-1649630232 # Batching seems to alter the output slightly, but it is also the case in the original implementation. This seems to be expected: https://github.com/huggingface/transformers/issues/23017#issuecomment-1649630232
expected_output = 'user\n\n\nWhat are the differences between these two images?\nassistant\nThe images depict two distinct scenes:\n\n1. **Left Image**: This shows the Statue of Liberty on Liberty Island, with the' # fmt: skip expected_outputs = Expectations(
{
("xpu", 3): "user\n\n\nWhat are the differences between these two images?\nassistant\nThe images depict two distinct scenes:\n\n1. **Left Image:**\n - The Statue of Liberty is prominently featured on an",
("cuda", 7): "user\n\n\nWhat are the differences between these two images?\nassistant\nThe images depict two distinct scenes:\n\n1. **Left Image**: This shows the Statue of Liberty on Liberty Island, with the",
}
) # fmt: skip
expected_output = expected_outputs.get_expectation()
self.assertEqual( self.assertEqual(
decoded_output, decoded_output,
expected_output, expected_output,
@@ -572,7 +619,13 @@ class InternVLQwen2IntegrationTest(unittest.TestCase):
) )
# Check second output # Check second output
decoded_output = processor.decode(output[1], skip_special_tokens=True) decoded_output = processor.decode(output[1], skip_special_tokens=True)
expected_output = 'user\nFrame1: \nFrame2: \nFrame3: \nFrame4: \nFrame5: \nFrame6: \nFrame7: \nFrame8: \nWhat type of shot is the man performing?\nassistant\nA forehand shot' # fmt: skip expected_outputs = Expectations(
{
("xpu", 3): "user\nFrame1: \nFrame2: \nFrame3: \nFrame4: \nFrame5: \nFrame6: \nFrame7: \nFrame8: \nWhat type of shot is the man performing?\nassistant\nThe man is performing a forehand shot.",
("cuda", 7): "user\nFrame1: \nFrame2: \nFrame3: \nFrame4: \nFrame5: \nFrame6: \nFrame7: \nFrame8: \nWhat type of shot is the man performing?\nassistant\nA forehand shot",
}
) # fmt: skip
expected_output = expected_outputs.get_expectation()
self.assertEqual( self.assertEqual(
decoded_output, decoded_output,
expected_output, expected_output,
@@ -590,7 +643,7 @@ class InternVLQwen2IntegrationTest(unittest.TestCase):
@slow @slow
@require_torch_gpu @require_torch_accelerator
class InternVLLlamaIntegrationTest(unittest.TestCase): class InternVLLlamaIntegrationTest(unittest.TestCase):
def setUp(self): def setUp(self):
self.small_model_checkpoint = "OpenGVLab/InternVL2_5-2B-MPO-hf" self.small_model_checkpoint = "OpenGVLab/InternVL2_5-2B-MPO-hf"
@@ -711,7 +764,13 @@ class InternVLLlamaIntegrationTest(unittest.TestCase):
# Check first output # Check first output
decoded_output = processor.decode(output[0], skip_special_tokens=True) decoded_output = processor.decode(output[0], skip_special_tokens=True)
expected_output = 'user\n\nWrite a haiku for this image\nassistant\nMajestic snow-capped peaks,\nWooden dock stretches to the sea,\nSilent water mirrors.' # fmt: skip expected_outputs = Expectations(
{
("xpu", 3): "user\n\nWrite a haiku for this image\nassistant\nMajestic snow-capped peaks,\nWooden path leads to calm lake,\nNature's peaceful grace.",
("cuda", 7): "user\n\nWrite a haiku for this image\nassistant\nMajestic snow-capped peaks,\nWooden dock stretches to the sea,\nSilent water mirrors.",
}
) # fmt: skip
expected_output = expected_outputs.get_expectation()
self.assertEqual( self.assertEqual(
decoded_output, decoded_output,
expected_output, expected_output,
@@ -880,7 +939,13 @@ class InternVLLlamaIntegrationTest(unittest.TestCase):
decoded_output = processor.decode(output[0], skip_special_tokens=True) decoded_output = processor.decode(output[0], skip_special_tokens=True)
# Batching seems to alter the output slightly, but it is also the case in the original implementation. This seems to be expected: https://github.com/huggingface/transformers/issues/23017#issuecomment-1649630232 # Batching seems to alter the output slightly, but it is also the case in the original implementation. This seems to be expected: https://github.com/huggingface/transformers/issues/23017#issuecomment-1649630232
expected_output = 'user\n\n\nWhat are the difference between these two images?\nassistant\nI apologize for the confusion in my previous response. Upon closer inspection, the differences between the two images are:\n\n1. **' # fmt: skip expected_outputs = Expectations(
{
("xpu", 3): "user\n\n\nWhat are the difference between these two images?\nassistant\nI apologize for the confusion in my previous response. After re-examining the images, I can see that they are actually",
("cuda", 7): "user\n\n\nWhat are the difference between these two images?\nassistant\nI apologize for the confusion in my previous response. Upon closer inspection, the differences between the two images are:\n\n1. **",
}
) # fmt: skip
expected_output = expected_outputs.get_expectation()
self.assertEqual( self.assertEqual(
decoded_output, decoded_output,
expected_output, expected_output,
@@ -889,7 +954,13 @@ class InternVLLlamaIntegrationTest(unittest.TestCase):
# Check second output # Check second output
decoded_output = processor.decode(output[1], skip_special_tokens=True) decoded_output = processor.decode(output[1], skip_special_tokens=True)
expected_output = 'user\nFrame1: \nFrame2: \nFrame3: \nFrame4: \nFrame5: \nFrame6: \nFrame7: \nFrame8: \nWhat type of shot is the man performing?\nassistant\nThe man is performing a forehand shot. This is a common shot in tennis where the player swings the racket across their' # fmt: skip expected_outputs = Expectations(
{
("xpu", 3): "user\nFrame1: \nFrame2: \nFrame3: \nFrame4: \nFrame5: \nFrame6: \nFrame7: \nFrame8: \nWhat type of shot is the man performing?\nassistant\nThe man is performing a forehand shot. This is a common shot in tennis where the player swings the racket across their",
("cuda", 7): "user\nFrame1: \nFrame2: \nFrame3: \nFrame4: \nFrame5: \nFrame6: \nFrame7: \nFrame8: \nWhat type of shot is the man performing?\nassistant\nThe man is performing a forehand shot. This is a common shot in tennis where the player swings the racket across their",
}
) # fmt: skip
expected_output = expected_outputs.get_expectation()
self.assertEqual( self.assertEqual(
decoded_output, decoded_output,
expected_output, expected_output,
@@ -898,7 +969,13 @@ class InternVLLlamaIntegrationTest(unittest.TestCase):
# Check third output # Check third output
decoded_output = processor.decode(output[2], skip_special_tokens=True) decoded_output = processor.decode(output[2], skip_special_tokens=True)
expected_output = 'user\n\nWrite a haiku for this image\nassistant\nMajestic snow-capped peaks,\nA wooden path leads to the sea,\nPeaceful, untouched dreams.' # fmt: skip expected_outputs = Expectations(
{
("xpu", 3): "user\n\nWrite a haiku for this image\nassistant\nMajestic snow-capped peaks,\nWooden dock stretches to the sea,\nSilent water mirrors.",
("cuda", 7): "user\n\nWrite a haiku for this image\nassistant\nMajestic snow-capped peaks,\nA wooden path leads to the sea,\nPeaceful, untouched dreams.",
}
) # fmt: skip
expected_output = expected_outputs.get_expectation()
self.assertEqual( self.assertEqual(
decoded_output, decoded_output,
expected_output, expected_output,

View File

@@ -80,7 +80,6 @@ from transformers.testing_utils import (
require_bitsandbytes, require_bitsandbytes,
require_deepspeed, require_deepspeed,
require_flash_attn, require_flash_attn,
require_non_xpu,
require_safetensors, require_safetensors,
require_torch, require_torch,
require_torch_accelerator, require_torch_accelerator,
@@ -2604,7 +2603,7 @@ class ModelTesterMixin:
)[0] )[0]
torch.testing.assert_close(out_embeds, out_ids) torch.testing.assert_close(out_embeds, out_ids)
@require_non_xpu @require_torch_gpu
@require_torch_multi_gpu @require_torch_multi_gpu
def test_multi_gpu_data_parallel_forward(self): def test_multi_gpu_data_parallel_forward(self):
config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common() config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()
@@ -3874,7 +3873,6 @@ class ModelTesterMixin:
with sdpa_kernel(enable_flash=True, enable_math=False, enable_mem_efficient=False): with sdpa_kernel(enable_flash=True, enable_math=False, enable_mem_efficient=False):
_ = model(**inputs_dict) _ = model(**inputs_dict)
@require_non_xpu
@require_torch_sdpa @require_torch_sdpa
@require_torch_accelerator @require_torch_accelerator
@slow @slow
@@ -3887,8 +3885,8 @@ class ModelTesterMixin:
self.skipTest(reason="This test requires an NVIDIA GPU with compute capability >= 8.0") self.skipTest(reason="This test requires an NVIDIA GPU with compute capability >= 8.0")
elif device_type == "rocm" and major < 9: elif device_type == "rocm" and major < 9:
self.skipTest(reason="This test requires an AMD GPU with compute capability >= 9.0") self.skipTest(reason="This test requires an AMD GPU with compute capability >= 9.0")
else: elif device_type not in ["cuda", "rocm", "xpu"]:
self.skipTest(reason="This test requires a Nvidia or AMD GPU") self.skipTest(reason="This test requires a Nvidia or AMD GPU, or an Intel XPU")
torch.compiler.reset() torch.compiler.reset()

View File

@@ -20,11 +20,11 @@ from parameterized import parameterized
from transformers import set_seed from transformers import set_seed
from transformers.testing_utils import ( from transformers.testing_utils import (
CaptureStderr, CaptureStderr,
cleanup,
get_gpu_count, get_gpu_count,
is_torch_available, is_torch_available,
require_gptq, require_gptq,
require_non_xpu, require_non_xpu,
require_read_token,
require_torch, require_torch,
require_torch_accelerator, require_torch_accelerator,
require_torch_gpu, require_torch_gpu,
@@ -53,6 +53,8 @@ if is_torch_available():
@require_torch @require_torch
class CacheTest(unittest.TestCase): class CacheTest(unittest.TestCase):
"""Cache tests that don't require loading models"""
def test_dynamic_cache_retrocompatibility(self): def test_dynamic_cache_retrocompatibility(self):
"""Tests that we can convert back and forth between the legacy cache format and DynamicCache""" """Tests that we can convert back and forth between the legacy cache format and DynamicCache"""
legacy_cache = () legacy_cache = ()
@@ -173,120 +175,17 @@ class CacheTest(unittest.TestCase):
self.assertTrue(cached_keys.shape == (1, 1, 10, 128)) self.assertTrue(cached_keys.shape == (1, 1, 10, 128))
self.assertTrue(cached_values.shape == (1, 1, 10, 128)) self.assertTrue(cached_values.shape == (1, 1, 10, 128))
def test_dynamic_cache_exportability(self):
model = AutoModelForCausalLM.from_pretrained("hf-internal-testing/tiny-random-MistralForCausalLM")
model = model.eval()
tokenizer = AutoTokenizer.from_pretrained("hf-internal-testing/tiny-random-MistralForCausalLM")
prompt = "What is the best way to debug python script?"
inputs = tokenizer(prompt, return_tensors="pt")
attention_mask = inputs.attention_mask
input_ids = inputs.input_ids
past_key_values = DynamicCache()
ep = torch.export.export(
model,
(),
{
"input_ids": input_ids,
"attention_mask": attention_mask,
"past_key_values": past_key_values,
"use_cache": True,
},
strict=False,
)
res = ep.module()(
input_ids=input_ids,
attention_mask=attention_mask,
past_key_values=past_key_values,
use_cache=True,
)
self.assertTrue(len(res.past_key_values.key_cache) == model.config.num_hidden_layers)
self.assertEqual(2 * model.config.num_hidden_layers + 1, len(ep.graph_signature.output_specs))
self.assertEqual(
3,
len(
[
x
for x in ep.graph_signature.input_specs
if x.kind == torch.export.graph_signature.InputKind.USER_INPUT
]
),
)
past_key_values_eager = DynamicCache()
res_eager = model(
input_ids=input_ids,
attention_mask=attention_mask,
past_key_values=past_key_values_eager,
use_cache=True,
)
self.assertTrue(torch.allclose(res.logits, res_eager.logits))
for k1, k2 in zip(res.past_key_values.key_cache, res_eager.past_key_values.key_cache):
self.assertTrue(torch.allclose(k1, k2))
for v1, v2 in zip(res.past_key_values.value_cache, res_eager.past_key_values.value_cache):
self.assertTrue(torch.allclose(v1, v2))
@slow
@require_read_token
def test_static_cache_exportability(self):
"""
Tests that static cache works with `torch.export()`
"""
if not is_torch_greater_or_equal("2.3"):
self.skipTest(reason="This test requires torch >= 2.3 to run.")
set_seed(0)
device = "cpu"
dtype = "bfloat16"
cache_implementation = "static"
attn_implementation = "sdpa" # Export and ExecuTorch only works for SdpaAttention
batch_size = 1
max_cache_len = 1234
model = AutoModelForCausalLM.from_pretrained(
"google/gemma-2b",
device_map=device,
torch_dtype=dtype,
attn_implementation=attn_implementation,
generation_config=GenerationConfig(
use_cache=True,
cache_implementation=cache_implementation,
max_length=max_cache_len,
cache_config={
"batch_size": batch_size,
"max_cache_len": max_cache_len,
"device": device,
},
),
)
# Check if cache config is passed through correctly
self.assertEqual(model.generation_config.use_cache, True)
self.assertEqual(model.generation_config.cache_implementation, cache_implementation)
self.assertEqual(model.generation_config.max_length, max_cache_len)
self.assertTrue(model.generation_config.cache_config is not None)
self.assertEqual(model.generation_config.cache_config.batch_size, batch_size)
self.assertEqual(model.generation_config.cache_config.max_cache_len, max_cache_len)
exported_program = convert_and_export_with_cache(model)
# Check if the exported model is configured with the `StaticCache` correctly
n_static_key_caches = n_static_value_caches = 0
for buffer_name, buffer in exported_program.named_buffers():
if buffer_name.startswith("key_cache"):
self.assertTrue(buffer.shape[0] == batch_size)
self.assertTrue(buffer.shape[2] == max_cache_len)
n_static_key_caches = n_static_key_caches + 1
if buffer_name.startswith("value_cache"):
self.assertTrue(buffer.shape[0] == batch_size)
self.assertTrue(buffer.shape[2] == max_cache_len)
n_static_value_caches = n_static_value_caches + 1
self.assertEqual(n_static_key_caches, model.config.num_hidden_layers)
self.assertEqual(n_static_value_caches, model.config.num_hidden_layers)
@require_torch_accelerator @require_torch_accelerator
@slow
class CacheIntegrationTest(unittest.TestCase): class CacheIntegrationTest(unittest.TestCase):
"""Cache tests that require loading models"""
def tearDown(self):
# Some tests use large models, which might result in suboptimal torch re-allocation if we run multiple tests
# in a row
cleanup(torch_device, gc_collect=True)
@slow
def test_dynamic_cache_hard(self): def test_dynamic_cache_hard(self):
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf", padding_side="left") tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf", padding_side="left")
model = AutoModelForCausalLM.from_pretrained( model = AutoModelForCausalLM.from_pretrained(
@@ -316,6 +215,7 @@ class CacheIntegrationTest(unittest.TestCase):
) )
self.assertEqual(decoded[0], expected_text) self.assertEqual(decoded[0], expected_text)
@slow
def test_dynamic_cache_batched(self): def test_dynamic_cache_batched(self):
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf", padding_side="left") tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf", padding_side="left")
tokenizer.pad_token = tokenizer.eos_token tokenizer.pad_token = tokenizer.eos_token
@@ -331,6 +231,7 @@ class CacheIntegrationTest(unittest.TestCase):
expected_text = ["A sequence: 1, 2, 3, 4, 5, 6, 7, 8,", "A sequence: A, B, C, D, E, F, G, H"] expected_text = ["A sequence: 1, 2, 3, 4, 5, 6, 7, 8,", "A sequence: A, B, C, D, E, F, G, H"]
self.assertListEqual(decoded, expected_text) self.assertListEqual(decoded, expected_text)
@slow
def test_dynamic_cache_beam_search(self): def test_dynamic_cache_beam_search(self):
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf", padding_side="left") tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf", padding_side="left")
model = AutoModelForCausalLM.from_pretrained( model = AutoModelForCausalLM.from_pretrained(
@@ -352,6 +253,7 @@ class CacheIntegrationTest(unittest.TestCase):
] ]
self.assertListEqual(decoded, expected_text) self.assertListEqual(decoded, expected_text)
@slow
def test_hybrid_cache_n_sequences(self): def test_hybrid_cache_n_sequences(self):
tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-9b") tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-9b")
model = AutoModelForCausalLM.from_pretrained( model = AutoModelForCausalLM.from_pretrained(
@@ -379,6 +281,7 @@ class CacheIntegrationTest(unittest.TestCase):
@require_non_xpu @require_non_xpu
@require_gptq @require_gptq
@slow
def test_sink_cache_hard(self): def test_sink_cache_hard(self):
tokenizer = AutoTokenizer.from_pretrained("TheBloke/LLaMa-7B-GPTQ") tokenizer = AutoTokenizer.from_pretrained("TheBloke/LLaMa-7B-GPTQ")
model = AutoModelForCausalLM.from_pretrained("TheBloke/LLaMa-7B-GPTQ", device_map="auto") model = AutoModelForCausalLM.from_pretrained("TheBloke/LLaMa-7B-GPTQ", device_map="auto")
@@ -392,6 +295,7 @@ class CacheIntegrationTest(unittest.TestCase):
decoded = tokenizer.batch_decode(gen_out, skip_special_tokens=True) decoded = tokenizer.batch_decode(gen_out, skip_special_tokens=True)
self.assertTrue(decoded[0].endswith("to perform a variety of tasks. The Transformer is a neural network")) self.assertTrue(decoded[0].endswith("to perform a variety of tasks. The Transformer is a neural network"))
@slow
def test_sink_cache_iterative_prompts(self): def test_sink_cache_iterative_prompts(self):
"""Tests that SinkCache supports more than one new token at once, when shifting the cache""" """Tests that SinkCache supports more than one new token at once, when shifting the cache"""
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta") tokenizer = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta")
@@ -434,13 +338,14 @@ class CacheIntegrationTest(unittest.TestCase):
) )
self.assertTrue(decoded[0].endswith(last_output)) self.assertTrue(decoded[0].endswith(last_output))
@require_torch_gpu
@parameterized.expand( @parameterized.expand(
[ [
("eager", "static"), ("eager", "static"),
("sdpa", "static"), ("sdpa", "static"),
] ]
) )
@require_torch_gpu
@slow
def test_static_cache_greedy_decoding_pad_left(self, attn_implementation, cache_implementation): def test_static_cache_greedy_decoding_pad_left(self, attn_implementation, cache_implementation):
EXPECTED_GENERATION = [ EXPECTED_GENERATION = [
"The best color is the one that complements the skin tone of the", "The best color is the one that complements the skin tone of the",
@@ -479,44 +384,7 @@ class CacheIntegrationTest(unittest.TestCase):
with self.subTest(f"{attn_implementation}, static, compiled"): with self.subTest(f"{attn_implementation}, static, compiled"):
self.assertListEqual(decoded, EXPECTED_GENERATION) self.assertListEqual(decoded, EXPECTED_GENERATION)
@require_torch_gpu @slow
@parameterized.expand(
[
("eager", "static"),
("sdpa", "static"),
]
)
def test_static_cache_greedy_decoding_pad_right(self, attn_implementation, cache_implementation):
EXPECTED_GENERATION = [
"The best color isЋ the one that complements the skin tone of",
"We should not undermind the issues at hand.\nWe should not undermind the issues",
]
tokenizer = AutoTokenizer.from_pretrained(
"NousResearch/Llama-2-7b-chat-hf", padding_side="right", pad_token="<s>"
)
model = AutoModelForCausalLM.from_pretrained(
"NousResearch/Llama-2-7b-chat-hf",
torch_dtype=torch.bfloat16,
attn_implementation=attn_implementation,
).to(torch_device)
inputs = tokenizer(
["The best color is", "We should not undermind the issues at hand"], padding=True, return_tensors="pt"
).to(model.device)
set_seed(0)
gen_out = model.generate(**inputs, do_sample=False, max_new_tokens=10)
decoded = tokenizer.batch_decode(gen_out, skip_special_tokens=True)
with self.subTest(f"{attn_implementation}, dynamic"):
self.assertListEqual(decoded, EXPECTED_GENERATION)
set_seed(0)
model.generation_config.cache_implementation = cache_implementation
gen_out = model.generate(**inputs, do_sample=False, max_new_tokens=10)
decoded = tokenizer.batch_decode(gen_out, skip_special_tokens=True)
with self.subTest(f"{attn_implementation}, static, eager"):
self.assertListEqual(decoded, EXPECTED_GENERATION)
def test_dynamic_cache_extra_left_padding(self): def test_dynamic_cache_extra_left_padding(self):
"""Tests that adding extra left-padding does not affect the generation with the dynamic cache""" """Tests that adding extra left-padding does not affect the generation with the dynamic cache"""
EXPECTED_GENERATION = [ EXPECTED_GENERATION = [
@@ -551,12 +419,8 @@ class CacheIntegrationTest(unittest.TestCase):
decoded = tokenizer.batch_decode(gen_out, skip_special_tokens=True) decoded = tokenizer.batch_decode(gen_out, skip_special_tokens=True)
self.assertListEqual(decoded, EXPECTED_GENERATION) self.assertListEqual(decoded, EXPECTED_GENERATION)
@parameterized.expand( @slow
[ def test_static_cache_extra_left_padding(self):
"static",
]
)
def test_static_cache_extra_left_padding(self, cache_implementation):
"""Tests that adding extra left-padding does not affect the generation with the static cache""" """Tests that adding extra left-padding does not affect the generation with the static cache"""
EXPECTED_GENERATION = [ EXPECTED_GENERATION = [
"The best color is the one that complements the skin tone of the", "The best color is the one that complements the skin tone of the",
@@ -574,7 +438,7 @@ class CacheIntegrationTest(unittest.TestCase):
["The best color is", "We should not undermind the issues at hand"], padding=True, return_tensors="pt" ["The best color is", "We should not undermind the issues at hand"], padding=True, return_tensors="pt"
).to(model.device) ).to(model.device)
model.generation_config.cache_implementation = cache_implementation model.generation_config.cache_implementation = "static"
gen_out = model.generate(**inputs, do_sample=False, max_new_tokens=10) gen_out = model.generate(**inputs, do_sample=False, max_new_tokens=10)
decoded = tokenizer.batch_decode(gen_out, skip_special_tokens=True) decoded = tokenizer.batch_decode(gen_out, skip_special_tokens=True)
@@ -597,6 +461,7 @@ class CacheIntegrationTest(unittest.TestCase):
pass pass
@require_torch_accelerator @require_torch_accelerator
@slow
def test_offloaded_cache_equivalent_to_dynamic_cache(self): def test_offloaded_cache_equivalent_to_dynamic_cache(self):
"""Tests that OffloadedCache produces the same result as the default DynamicCache""" """Tests that OffloadedCache produces the same result as the default DynamicCache"""
model_name = "microsoft/Phi-3-mini-4k-instruct" model_name = "microsoft/Phi-3-mini-4k-instruct"
@@ -625,6 +490,7 @@ class CacheIntegrationTest(unittest.TestCase):
assert torch.all(original_output == offloaded_output).item() assert torch.all(original_output == offloaded_output).item()
@require_torch_accelerator @require_torch_accelerator
@slow
def test_offloaded_cache_uses_less_memory_than_dynamic_cache(self): def test_offloaded_cache_uses_less_memory_than_dynamic_cache(self):
"""Tests that OffloadedCache uses less memory than the default DynamicCache""" """Tests that OffloadedCache uses less memory than the default DynamicCache"""
model_name = "microsoft/Phi-3-mini-4k-instruct" model_name = "microsoft/Phi-3-mini-4k-instruct"
@@ -664,6 +530,7 @@ class CacheIntegrationTest(unittest.TestCase):
assert offloaded_peak_memory < original_peak_memory assert offloaded_peak_memory < original_peak_memory
@require_torch_gpu @require_torch_gpu
@slow
def test_cache_copy(self): def test_cache_copy(self):
model_name = "microsoft/Phi-3-mini-4k-instruct" model_name = "microsoft/Phi-3-mini-4k-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name) tokenizer = AutoTokenizer.from_pretrained(model_name)
@@ -745,6 +612,7 @@ class CacheIntegrationTest(unittest.TestCase):
self.assertEqual(cap.err, "") self.assertEqual(cap.err, "")
@require_torch_multi_gpu @require_torch_multi_gpu
@slow
def test_static_cache_multi_gpu(self): def test_static_cache_multi_gpu(self):
"""Regression test for #35164: static cache with multi-gpu""" """Regression test for #35164: static cache with multi-gpu"""
@@ -764,3 +632,173 @@ class CacheIntegrationTest(unittest.TestCase):
inputs = tokenizer("Today is a beautiful day!", return_tensors="pt").to(0) inputs = tokenizer("Today is a beautiful day!", return_tensors="pt").to(0)
_ = model(**inputs) _ = model(**inputs)
_ = model.generate(**inputs, max_new_tokens=2, cache_implementation="hybrid") _ = model.generate(**inputs, max_new_tokens=2, cache_implementation="hybrid")
@require_torch
class CacheExportIntegrationTest(unittest.TestCase):
"""Cache tests that rely on `torch.export()` and model loading"""
def test_dynamic_cache_exportability(self):
model = AutoModelForCausalLM.from_pretrained("hf-internal-testing/tiny-random-MistralForCausalLM")
model = model.eval()
tokenizer = AutoTokenizer.from_pretrained("hf-internal-testing/tiny-random-MistralForCausalLM")
prompt = "What is the best way to debug python script?"
inputs = tokenizer(prompt, return_tensors="pt")
attention_mask = inputs.attention_mask
input_ids = inputs.input_ids
past_key_values = DynamicCache()
ep = torch.export.export(
model,
(),
{
"input_ids": input_ids,
"attention_mask": attention_mask,
"past_key_values": past_key_values,
"use_cache": True,
},
strict=False,
)
res = ep.module()(
input_ids=input_ids,
attention_mask=attention_mask,
past_key_values=past_key_values,
use_cache=True,
)
self.assertTrue(len(res.past_key_values.key_cache) == model.config.num_hidden_layers)
self.assertEqual(2 * model.config.num_hidden_layers + 1, len(ep.graph_signature.output_specs))
self.assertEqual(
3,
len(
[
x
for x in ep.graph_signature.input_specs
if x.kind == torch.export.graph_signature.InputKind.USER_INPUT
]
),
)
past_key_values_eager = DynamicCache()
res_eager = model(
input_ids=input_ids,
attention_mask=attention_mask,
past_key_values=past_key_values_eager,
use_cache=True,
)
self.assertTrue(torch.allclose(res.logits, res_eager.logits))
for k1, k2 in zip(res.past_key_values.key_cache, res_eager.past_key_values.key_cache):
self.assertTrue(torch.allclose(k1, k2))
for v1, v2 in zip(res.past_key_values.value_cache, res_eager.past_key_values.value_cache):
self.assertTrue(torch.allclose(v1, v2))
def test_static_cache_exportability(self):
"""
Tests that static cache works with `torch.export()`
"""
if not is_torch_greater_or_equal("2.3"):
self.skipTest(reason="This test requires torch >= 2.3 to run.")
set_seed(0)
device = "cpu"
dtype = "bfloat16"
cache_implementation = "static"
attn_implementation = "sdpa" # Export and ExecuTorch only works for SdpaAttention
batch_size = 1
max_cache_len = 1234
model_id = "hf-internal-testing/tiny-random-LlamaForCausalLM"
model = AutoModelForCausalLM.from_pretrained(
model_id,
device_map=device,
torch_dtype=dtype,
attn_implementation=attn_implementation,
generation_config=GenerationConfig(
use_cache=True,
cache_implementation=cache_implementation,
max_length=max_cache_len,
cache_config={
"batch_size": batch_size,
"max_cache_len": max_cache_len,
"device": device,
},
),
)
# Check if cache config is passed through correctly
self.assertEqual(model.generation_config.use_cache, True)
self.assertEqual(model.generation_config.cache_implementation, cache_implementation)
self.assertEqual(model.generation_config.max_length, max_cache_len)
self.assertTrue(model.generation_config.cache_config is not None)
self.assertEqual(model.generation_config.cache_config.batch_size, batch_size)
self.assertEqual(model.generation_config.cache_config.max_cache_len, max_cache_len)
exported_program = convert_and_export_with_cache(model)
# Check if the exported model is configured with the `StaticCache` correctly
n_static_key_caches = n_static_value_caches = 0
for buffer_name, buffer in exported_program.named_buffers():
if buffer_name.startswith("key_cache"):
self.assertTrue(buffer.shape[0] == batch_size)
self.assertTrue(buffer.shape[2] == max_cache_len)
n_static_key_caches = n_static_key_caches + 1
if buffer_name.startswith("value_cache"):
self.assertTrue(buffer.shape[0] == batch_size)
self.assertTrue(buffer.shape[2] == max_cache_len)
n_static_value_caches = n_static_value_caches + 1
self.assertEqual(n_static_key_caches, model.config.num_hidden_layers)
self.assertEqual(n_static_value_caches, model.config.num_hidden_layers)
# Export with dynamic shapes using Dim.AUTO
tokenizer = AutoTokenizer.from_pretrained(model_id)
input_ids = tokenizer("Here's everything I know", return_tensors="pt").input_ids
dynamic_shapes = {"input_ids": {1: torch.export.Dim.AUTO}, "cache_position": None}
exported_program = convert_and_export_with_cache(
model,
example_input_ids=input_ids,
dynamic_shapes=dynamic_shapes,
strict=False,
)
def test_hybrid_cache_exportability(self):
"""
Tests that static cache works with `torch.export()`
"""
if not is_torch_greater_or_equal("2.6"):
self.skipTest(reason="This test requires torch >= 2.6 to run.")
from transformers.integrations.executorch import TorchExportableModuleForDecoderOnlyLM
set_seed(0)
model_id = "hf-internal-testing/tiny-random-Gemma3ForCausalLM"
model = AutoModelForCausalLM.from_pretrained(model_id)
model.eval()
self.assertEqual(model.config.use_cache, True)
self.assertEqual(model.config.cache_implementation, "hybrid")
# Export + HybridCache
model.eval()
max_batch_size = 1
max_cache_len = 23
exportable_module = TorchExportableModuleForDecoderOnlyLM(model, max_batch_size, max_cache_len)
exported_program = exportable_module.export()
n_g_key_caches = n_g_value_caches = 0
for buffer_name, buffer in exported_program.named_buffers():
if buffer_name.startswith("key_cache"):
self.assertTrue(buffer.shape[0] == max_batch_size)
self.assertTrue(buffer.shape[2] == max_cache_len)
n_g_key_caches = n_g_key_caches + 1
if buffer_name.startswith("value_cache"):
self.assertTrue(buffer.shape[0] == max_batch_size)
self.assertTrue(buffer.shape[2] == max_cache_len)
n_g_value_caches = n_g_value_caches + 1
self.assertEqual(n_g_key_caches, model.config.num_hidden_layers)
self.assertEqual(n_g_value_caches, model.config.num_hidden_layers)
# Export with dynamic shapes using Dim.AUTO
tokenizer = AutoTokenizer.from_pretrained(model_id)
input_ids = tokenizer("Here's everything I know", return_tensors="pt").input_ids
dynamic_shapes = {"input_ids": {1: torch.export.Dim.AUTO}, "cache_position": None}
exported_program = exportable_module.export(
input_ids=input_ids,
dynamic_shapes=dynamic_shapes,
strict=False,
)