@@ -13,7 +13,7 @@
|
||||
|
||||
في هذا الدليل، سنستعرض التقنيات الفعالة لتُحسِّن من كفاءة نشر نماذج اللغة الكبيرة:
|
||||
|
||||
1. سنتناول تقنية "دقة أقل" التي أثبتت الأبحاث فعاليتها في تحقيق مزايا حسابية دون التأثير بشكل ملحوظ على أداء النموذج عن طريق العمل بدقة رقمية أقل [8 بت و4 بت](/main_classes/quantization.md).
|
||||
1. سنتناول تقنية "دقة أقل" التي أثبتت الأبحاث فعاليتها في تحقيق مزايا حسابية دون التأثير بشكل ملحوظ على أداء النموذج عن طريق العمل بدقة رقمية أقل [8 بت و4 بت](/main_classes/quantization).
|
||||
|
||||
2. **اFlash Attention:** إن Flash Attention وهي نسخة مُعدَّلة من خوارزمية الانتباه التي لا توفر فقط نهجًا أكثر كفاءة في استخدام الذاكرة، ولكنها تحقق أيضًا كفاءة متزايدة بسبب الاستخدام الأمثل لذاكرة GPU.
|
||||
|
||||
|
||||
@@ -27,7 +27,7 @@ This guide shows you how to quickly start chatting with Transformers from the co
|
||||
|
||||
## chat CLI
|
||||
|
||||
After you've [installed Transformers](./installation.md), chat with a model directly from the command line as shown below. It launches an interactive session with a model, with a few base commands listed at the start of the session.
|
||||
After you've [installed Transformers](./installation), chat with a model directly from the command line as shown below. It launches an interactive session with a model, with a few base commands listed at the start of the session.
|
||||
|
||||
```bash
|
||||
transformers chat Qwen/Qwen2.5-0.5B-Instruct
|
||||
@@ -158,4 +158,4 @@ The easiest solution for improving generation speed is to either quantize a mode
|
||||
You can also try techniques like [speculative decoding](./generation_strategies#speculative-decoding), where a smaller model generates candidate tokens that are verified by the larger model. If the candidate tokens are correct, the larger model can generate more than one token per `forward` pass. This significantly alleviates the bandwidth bottleneck and improves generation speed.
|
||||
|
||||
> [!TIP]
|
||||
> Parameters may not be active for every generated token in MoE models such as [Mixtral](./model_doc/mixtral), [Qwen2MoE](./model_doc/qwen2_moe.md), and [DBRX](./model_doc/dbrx). As a result, MoE models generally have much lower memory bandwidth requirements and can be faster than a regular LLM of the same size. However, techniques like speculative decoding are ineffective with MoE models because parameters become activated with each new speculated token.
|
||||
> Parameters may not be active for every generated token in MoE models such as [Mixtral](./model_doc/mixtral), [Qwen2MoE](./model_doc/qwen2_moe), and [DBRX](./model_doc/dbrx). As a result, MoE models generally have much lower memory bandwidth requirements and can be faster than a regular LLM of the same size. However, techniques like speculative decoding are ineffective with MoE models because parameters become activated with each new speculated token.
|
||||
|
||||
@@ -148,9 +148,9 @@ print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
|
||||
| Option name | Type | Simplified description |
|
||||
|---|---|---|
|
||||
| `max_new_tokens` | `int` | Controls the maximum generation length. Be sure to define it, as it usually defaults to a small value. |
|
||||
| `do_sample` | `bool` | Defines whether generation will sample the next token (`True`), or is greedy instead (`False`). Most use cases should set this flag to `True`. Check [this guide](./generation_strategies.md) for more information. |
|
||||
| `do_sample` | `bool` | Defines whether generation will sample the next token (`True`), or is greedy instead (`False`). Most use cases should set this flag to `True`. Check [this guide](./generation_strategies) for more information. |
|
||||
| `temperature` | `float` | How unpredictable the next selected token will be. High values (`>0.8`) are good for creative tasks, low values (e.g. `<0.4`) for tasks that require "thinking". Requires `do_sample=True`. |
|
||||
| `num_beams` | `int` | When set to `>1`, activates the beam search algorithm. Beam search is good on input-grounded tasks. Check [this guide](./generation_strategies.md) for more information. |
|
||||
| `num_beams` | `int` | When set to `>1`, activates the beam search algorithm. Beam search is good on input-grounded tasks. Check [this guide](./generation_strategies) for more information. |
|
||||
| `repetition_penalty` | `float` | Set it to `>1.0` if you're seeing the model repeat itself often. Larger values apply a larger penalty. |
|
||||
| `eos_token_id` | `list[int]` | The token(s) that will cause generation to stop. The default value is usually good, but you can specify a different token. |
|
||||
|
||||
|
||||
@@ -23,7 +23,7 @@ The crux of these challenges lies in augmenting the computational and memory cap
|
||||
|
||||
In this guide, we will go over the effective techniques for efficient LLM deployment:
|
||||
|
||||
1. **Lower Precision:** Research has shown that operating at reduced numerical precision, namely [8-bit and 4-bit](./main_classes/quantization.md) can achieve computational advantages without a considerable decline in model performance.
|
||||
1. **Lower Precision:** Research has shown that operating at reduced numerical precision, namely [8-bit and 4-bit](./main_classes/quantization) can achieve computational advantages without a considerable decline in model performance.
|
||||
|
||||
2. **Flash Attention:** Flash Attention is a variation of the attention algorithm that not only provides a more memory-efficient approach but also realizes increased efficiency due to optimized GPU memory utilization.
|
||||
|
||||
|
||||
@@ -95,7 +95,7 @@ images = [
|
||||
|
||||
Quantization reduces the memory burden of large models by representing the weights in a lower precision. Refer to the [Quantization](../quantization/overview) overview for more available quantization backends.
|
||||
|
||||
The example below uses [bitsandbytes](../quantization/bitsandbytes.md) to quantize the weights to int4.
|
||||
The example below uses [bitsandbytes](../quantization/bitsandbytes) to quantize the weights to int4.
|
||||
|
||||
```python
|
||||
import requests
|
||||
|
||||
@@ -99,7 +99,7 @@ images = [
|
||||
|
||||
Quantization reduces the memory burden of large models by representing the weights in a lower precision. Refer to the [Quantization](../quantization/overview) overview for more available quantization backends.
|
||||
|
||||
The example below uses [bitsandbytes](../quantization/bitsandbytes.md) to quantize the weights to int4.
|
||||
The example below uses [bitsandbytes](../quantization/bitsandbytes) to quantize the weights to int4.
|
||||
|
||||
```python
|
||||
import requests
|
||||
|
||||
@@ -21,7 +21,7 @@ rendered properly in your Markdown viewer.
|
||||
The Conversational Speech Model (CSM) is the first open-source contextual text-to-speech model [released by Sesame](https://www.sesame.com/research/crossing_the_uncanny_valley_of_voice). It is designed to generate natural-sounding speech with or without conversational context. This context typically consists of multi-turn dialogue between speakers, represented as sequences of text and corresponding spoken audio.
|
||||
|
||||
**Model Architecture:**
|
||||
CSM is composed of two LLaMA-style auto-regressive transformer decoders: a backbone decoder that predicts the first codebook token and a depth decoder that generates the remaining tokens. It uses the pretrained codec model [Mimi](./mimi.md), introduced by Kyutai, to encode speech into discrete codebook tokens and decode them back into audio.
|
||||
CSM is composed of two LLaMA-style auto-regressive transformer decoders: a backbone decoder that predicts the first codebook token and a depth decoder that generates the remaining tokens. It uses the pretrained codec model [Mimi](./mimi), introduced by Kyutai, to encode speech into discrete codebook tokens and decode them back into audio.
|
||||
|
||||
The original csm-1b checkpoint is available under the [Sesame](https://huggingface.co/sesame/csm-1b) organization on Hugging Face.
|
||||
|
||||
|
||||
@@ -26,14 +26,14 @@ rendered properly in your Markdown viewer.
|
||||
|
||||
## Overview
|
||||
|
||||
Dia is an opensource text-to-speech (TTS) model (1.6B parameters) developed by [Nari Labs](https://huggingface.co/nari-labs).
|
||||
It can generate highly realistic dialogue from transcript including nonverbal communications such as laughter and coughing.
|
||||
Dia is an open-source text-to-speech (TTS) model (1.6B parameters) developed by [Nari Labs](https://huggingface.co/nari-labs).
|
||||
It can generate highly realistic dialogue from transcript including non-verbal communications such as laughter and coughing.
|
||||
Furthermore, emotion and tone control is also possible via audio conditioning (voice cloning).
|
||||
|
||||
**Model Architecture:**
|
||||
Dia is an encoder-decoder transformer based on the original transformer architecture. However, some more modern features such as
|
||||
rotational positional embeddings (RoPE) are also included. For its text portion (encoder), a byte tokenizer is utilized while
|
||||
for the audio portion (decoder), a pretrained codec model [DAC](./dac.md) is used - DAC encodes speech into discrete codebook
|
||||
for the audio portion (decoder), a pretrained codec model [DAC](./dac) is used - DAC encodes speech into discrete codebook
|
||||
tokens and decodes them back into audio.
|
||||
|
||||
## Usage Tips
|
||||
|
||||
@@ -27,7 +27,7 @@ rendered properly in your Markdown viewer.
|
||||
|
||||
ERNIE (Enhanced Representation through kNowledge IntEgration) is designed to learn language representation enhanced by knowledge masking strategies, which includes entity-level masking and phrase-level masking.
|
||||
|
||||
Other ERNIE models released by baidu can be found at [Ernie 4.5](./ernie4_5.md), and [Ernie 4.5 MoE](./ernie4_5_moe.md).
|
||||
Other ERNIE models released by baidu can be found at [Ernie 4.5](./ernie4_5), and [Ernie 4.5 MoE](./ernie4_5_moe).
|
||||
|
||||
> [!TIP]
|
||||
> This model was contributed by [nghuyong](https://huggingface.co/nghuyong), and the official code can be found in [PaddleNLP](https://github.com/PaddlePaddle/PaddleNLP) (in PaddlePaddle).
|
||||
|
||||
@@ -29,9 +29,9 @@ rendered properly in your Markdown viewer.
|
||||
|
||||
The Ernie 4.5 model was released in the [Ernie 4.5 Model Family](https://ernie.baidu.com/blog/posts/ernie4.5/) release by baidu.
|
||||
This family of models contains multiple different architectures and model sizes. This model in specific targets the base text
|
||||
model without mixture of experts (moe) with 0.3B parameters in total. It uses the standard [Llama](./llama.md) at its core.
|
||||
model without mixture of experts (moe) with 0.3B parameters in total. It uses the standard [Llama](./llama) at its core.
|
||||
|
||||
Other models from the family can be found at [Ernie 4.5 Moe](./ernie4_5_moe.md).
|
||||
Other models from the family can be found at [Ernie 4.5 Moe](./ernie4_5_moe).
|
||||
|
||||
<div class="flex justify-center">
|
||||
<img src="https://ernie.baidu.com/blog/posts/ernie4.5/overview.png"/>
|
||||
|
||||
@@ -30,10 +30,10 @@ rendered properly in your Markdown viewer.
|
||||
The Ernie 4.5 Moe model was released in the [Ernie 4.5 Model Family](https://ernie.baidu.com/blog/posts/ernie4.5/) release by baidu.
|
||||
This family of models contains multiple different architectures and model sizes. This model in specific targets the base text
|
||||
model with mixture of experts (moe) - one with 21B total, 3B active parameters and another one with 300B total, 47B active parameters.
|
||||
It uses the standard [Llama](./llama.md) at its core combined with a specialized MoE based on [Mixtral](./mixtral.md) with additional shared
|
||||
It uses the standard [Llama](./llama) at its core combined with a specialized MoE based on [Mixtral](./mixtral) with additional shared
|
||||
experts.
|
||||
|
||||
Other models from the family can be found at [Ernie 4.5](./ernie4_5.md).
|
||||
Other models from the family can be found at [Ernie 4.5](./ernie4_5).
|
||||
|
||||
<div class="flex justify-center">
|
||||
<img src="https://ernie.baidu.com/blog/posts/ernie4.5/overview.png"/>
|
||||
|
||||
@@ -30,7 +30,7 @@ Gemma3n is a multimodal model with pretrained and instruction-tuned variants, av
|
||||
large portions of the language model architecture are shared with prior Gemma releases, there are many new additions in
|
||||
this model, including [Alternating Updates][altup] (AltUp), [Learned Augmented Residual Layer][laurel] (LAuReL),
|
||||
[MatFormer][matformer], Per-Layer Embeddings (PLE), [Activation Sparsity with Statistical Top-k][spark-transformer], and KV cache sharing. The language model uses
|
||||
a similar attention pattern to [Gemma 3](./gemma3.md) with alternating 4 local sliding window self-attention layers for
|
||||
a similar attention pattern to [Gemma 3](./gemma3) with alternating 4 local sliding window self-attention layers for
|
||||
every global self-attention layer with a maximum context length of 32k tokens. Gemma 3n introduces
|
||||
[MobileNet v5][mobilenetv5] as the vision encoder, using a default resolution of 768x768 pixels, and adds a newly
|
||||
trained audio encoder based on the [Universal Speech Model][usm] (USM) architecture.
|
||||
|
||||
@@ -169,9 +169,9 @@ model = Idefics2ForConditionalGeneration.from_pretrained(
|
||||
|
||||
## Shrinking down Idefics2 using quantization
|
||||
|
||||
As the Idefics2 model has 8 billion parameters, that would require about 16GB of GPU RAM in half precision (float16), since each parameter is stored in 2 bytes. However, one can shrink down the size of the model using [quantization](../quantization.md). If the model is quantized to 4 bits (or half a byte per parameter), that requires only about 3.5GB of RAM.
|
||||
As the Idefics2 model has 8 billion parameters, that would require about 16GB of GPU RAM in half precision (float16), since each parameter is stored in 2 bytes. However, one can shrink down the size of the model using [quantization](../quantization). If the model is quantized to 4 bits (or half a byte per parameter), that requires only about 3.5GB of RAM.
|
||||
|
||||
Quantizing a model is as simple as passing a `quantization_config` to the model. One can change the code snippet above with the changes below. We'll leverage the BitsAndyBytes quantization (but refer to [this page](../quantization.md) for other quantization methods):
|
||||
Quantizing a model is as simple as passing a `quantization_config` to the model. One can change the code snippet above with the changes below. We'll leverage the BitsAndyBytes quantization (but refer to [this page](../quantization) for other quantization methods):
|
||||
|
||||
```diff
|
||||
+ from transformers import BitsAndBytesConfig
|
||||
@@ -193,7 +193,7 @@ model = Idefics2ForConditionalGeneration.from_pretrained(
|
||||
|
||||
A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with Idefics2. If you're interested in submitting a resource to be included here, please feel free to open a Pull Request and we'll review it! The resource should ideally demonstrate something new instead of duplicating an existing resource.
|
||||
|
||||
- A notebook on how to fine-tune Idefics2 on a custom dataset using the [Trainer](../main_classes/trainer.md) can be found [here](https://colab.research.google.com/drive/1NtcTgRbSBKN7pYD3Vdx1j9m8pt3fhFDB?usp=sharing). It supports both full fine-tuning as well as (quantized) LoRa.
|
||||
- A notebook on how to fine-tune Idefics2 on a custom dataset using the [Trainer](../main_classes/trainer) can be found [here](https://colab.research.google.com/drive/1NtcTgRbSBKN7pYD3Vdx1j9m8pt3fhFDB?usp=sharing). It supports both full fine-tuning as well as (quantized) LoRa.
|
||||
- A script regarding how to fine-tune Idefics2 using the TRL library can be found [here](https://gist.github.com/edbeeching/228652fc6c2b29a1641be5a5778223cb).
|
||||
- Demo notebook regarding fine-tuning Idefics2 for JSON extraction use cases can be found [here](https://github.com/NielsRogge/Transformers-Tutorials/tree/master/Idefics2). 🌎
|
||||
|
||||
|
||||
@@ -30,7 +30,7 @@ The abstract from the paper is the following:
|
||||
|
||||
*We introduce Moshi, a speech-text foundation model and full-duplex spoken dialogue framework. Current systems for spoken dialogue rely on pipelines of independent components, namely voice activity detection, speech recognition, textual dialogue and text-to-speech. Such frameworks cannot emulate the experience of real conversations. First, their complexity induces a latency of several seconds between interactions. Second, text being the intermediate modality for dialogue, non-linguistic information that modifies meaning— such as emotion or non-speech sounds— is lost in the interaction. Finally, they rely on a segmentation into speaker turns, which does not take into account overlapping speech, interruptions and interjections. Moshi solves these independent issues altogether by casting spoken dialogue as speech-to-speech generation. Starting from a text language model backbone, Moshi generates speech as tokens from the residual quantizer of a neural audio codec, while modeling separately its own speech and that of the user into parallel streams. This allows for the removal of explicit speaker turns, and the modeling of arbitrary conversational dynamics. We moreover extend the hierarchical semantic-to-acoustic token generation of previous work to first predict time-aligned text tokens as a prefix to audio tokens. Not only this “Inner Monologue” method significantly improves the linguistic quality of generated speech, but we also illustrate how it can provide streaming speech recognition and text-to-speech. Our resulting model is the first real-time full-duplex spoken large language model, with a theoretical latency of 160ms, 200ms in practice, and is available at github.com/kyutai-labs/moshi.*
|
||||
|
||||
Its architecture is based on [Encodec](model_doc/encodec) with several major differences:
|
||||
Its architecture is based on [Encodec](./encodec) with several major differences:
|
||||
* it uses a much lower frame-rate.
|
||||
* it uses additional transformers for encoding and decoding for better latent contextualization
|
||||
* it uses a different quantization scheme: one codebook is dedicated to semantic projection.
|
||||
|
||||
@@ -115,9 +115,9 @@ The Flash Attention-2 model uses also a more memory efficient cache slicing mech
|
||||
|
||||
## Shrinking down MiniMax using quantization
|
||||
|
||||
As the MiniMax model has 456 billion parameters, that would require about 912GB of GPU RAM in half precision (float16), since each parameter is stored in 2 bytes. However, one can shrink down the size of the model using [quantization](../quantization.md). If the model is quantized to 4 bits (or half a byte per parameter), about 228 GB of RAM is required.
|
||||
As the MiniMax model has 456 billion parameters, that would require about 912GB of GPU RAM in half precision (float16), since each parameter is stored in 2 bytes. However, one can shrink down the size of the model using [quantization](../quantization). If the model is quantized to 4 bits (or half a byte per parameter), about 228 GB of RAM is required.
|
||||
|
||||
Quantizing a model is as simple as passing a `quantization_config` to the model. Below, we'll leverage the bitsandbytes quantization library (but refer to [this page](../quantization.md) for alternative quantization methods):
|
||||
Quantizing a model is as simple as passing a `quantization_config` to the model. Below, we'll leverage the bitsandbytes quantization library (but refer to [this page](../quantization) for alternative quantization methods):
|
||||
|
||||
```python
|
||||
>>> import torch
|
||||
|
||||
@@ -146,9 +146,9 @@ The Flash Attention-2 model uses also a more memory efficient cache slicing mech
|
||||
|
||||
## Shrinking down Mixtral using quantization
|
||||
|
||||
As the Mixtral model has 45 billion parameters, that would require about 90GB of GPU RAM in half precision (float16), since each parameter is stored in 2 bytes. However, one can shrink down the size of the model using [quantization](../quantization.md). If the model is quantized to 4 bits (or half a byte per parameter), a single A100 with 40GB of RAM is enough to fit the entire model, as in that case only about 27 GB of RAM is required.
|
||||
As the Mixtral model has 45 billion parameters, that would require about 90GB of GPU RAM in half precision (float16), since each parameter is stored in 2 bytes. However, one can shrink down the size of the model using [quantization](../quantization). If the model is quantized to 4 bits (or half a byte per parameter), a single A100 with 40GB of RAM is enough to fit the entire model, as in that case only about 27 GB of RAM is required.
|
||||
|
||||
Quantizing a model is as simple as passing a `quantization_config` to the model. Below, we'll leverage the bitsandbytes quantization library (but refer to [this page](../quantization.md) for alternative quantization methods):
|
||||
Quantizing a model is as simple as passing a `quantization_config` to the model. Below, we'll leverage the bitsandbytes quantization library (but refer to [this page](../quantization) for alternative quantization methods):
|
||||
|
||||
```python
|
||||
>>> import torch
|
||||
|
||||
@@ -38,7 +38,7 @@ This model was contributed by [ajati](https://huggingface.co/ajati), [vijaye12](
|
||||
|
||||
## Usage example
|
||||
|
||||
The code snippet below shows how to randomly initialize a PatchTSMixer model. The model is compatible with the [Trainer API](../trainer.md).
|
||||
The code snippet below shows how to randomly initialize a PatchTSMixer model. The model is compatible with the [Trainer API](../trainer).
|
||||
|
||||
```python
|
||||
|
||||
|
||||
@@ -69,11 +69,11 @@ print(tokenizer.decode(outputs[0]))
|
||||
## Model card
|
||||
|
||||
The model cards can be found at:
|
||||
* [Zamba-7B](MODEL_CARD_ZAMBA-7B-v1.md)
|
||||
* [Zamba-7B](https://huggingface.co/Zyphra/Zamba-7B-v1)
|
||||
|
||||
|
||||
## Issues
|
||||
For issues with model output, or community discussion, please use the Hugging Face community [forum](https://huggingface.co/zyphra/zamba-7b)
|
||||
For issues with model output, or community discussion, please use the Hugging Face community [forum](https://huggingface.co/Zyphra/Zamba-7B-v1/discussions)
|
||||
|
||||
|
||||
## License
|
||||
|
||||
@@ -25,7 +25,7 @@ Keypoint detection identifies and locates specific points of interest within an
|
||||
|
||||
In this guide, we will show how to extract keypoints from images.
|
||||
|
||||
For this tutorial, we will use [SuperPoint](./model_doc/superpoint.md), a foundation model for keypoint detection.
|
||||
For this tutorial, we will use [SuperPoint](./model_doc/superpoint), a foundation model for keypoint detection.
|
||||
|
||||
```python
|
||||
from transformers import AutoImageProcessor, SuperPointForKeypointDetection
|
||||
|
||||
@@ -20,7 +20,7 @@ rendered properly in your Markdown viewer.
|
||||
|
||||
Video-text-to-text models, also known as video language models or vision language models with video input, are language models that take a video input. These models can tackle various tasks, from video question answering to video captioning.
|
||||
|
||||
These models have nearly the same architecture as [image-text-to-text](../image_text_to_text.md) models except for some changes to accept video data, since video data is essentially image frames with temporal dependencies. Some image-text-to-text models take in multiple images, but this alone is inadequate for a model to accept videos. Moreover, video-text-to-text models are often trained with all vision modalities. Each example might have videos, multiple videos, images and multiple images. Some of these models can also take interleaved inputs. For example, you can refer to a specific video inside a string of text by adding a video token in text like "What is happening in this video? `<video>`".
|
||||
These models have nearly the same architecture as [image-text-to-text](../image_text_to_text) models except for some changes to accept video data, since video data is essentially image frames with temporal dependencies. Some image-text-to-text models take in multiple images, but this alone is inadequate for a model to accept videos. Moreover, video-text-to-text models are often trained with all vision modalities. Each example might have videos, multiple videos, images and multiple images. Some of these models can also take interleaved inputs. For example, you can refer to a specific video inside a string of text by adding a video token in text like "What is happening in this video? `<video>`".
|
||||
|
||||
In this guide, we provide a brief overview of video LMs and show how to use them with Transformers for inference.
|
||||
|
||||
|
||||
@@ -307,7 +307,7 @@ culture, and they allow us to design the'
|
||||
|
||||
アシストデコーディングを有効にするには、`assistant_model` 引数をモデルで設定します。
|
||||
|
||||
このガイドは、さまざまなデコーディング戦略を可能にする主要なパラメーターを説明しています。さらに高度なパラメーターは [`generate`] メソッドに存在し、[`generate`] メソッドの動作をさらに制御できます。使用可能なパラメーターの完全なリストについては、[APIドキュメント](./main_classes/text_generation.md) を参照してください。
|
||||
このガイドは、さまざまなデコーディング戦略を可能にする主要なパラメーターを説明しています。さらに高度なパラメーターは [`generate`] メソッドに存在し、[`generate`] メソッドの動作をさらに制御できます。使用可能なパラメーターの完全なリストについては、[APIドキュメント](./main_classes/text_generation) を参照してください。
|
||||
|
||||
|
||||
```python
|
||||
|
||||
@@ -111,7 +111,7 @@ BART を始めるのに役立つ公式 Hugging Face およびコミュニティ
|
||||
- [`TFBartForConditionalGeneration`] は、この [サンプル スクリプト](https://github.com/huggingface/transformers/tree/main/examples/tensorflow/summarization) および [ノートブック](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/summarization-tf.ipynb)。
|
||||
- [`FlaxBartForConditionalGeneration`] は、この [サンプル スクリプト](https://github.com/huggingface/transformers/tree/main/examples/flax/summarization) でサポートされています。
|
||||
- [要約](https://huggingface.co/course/chapter7/5?fw=pt#summarization) 🤗 ハグフェイスコースの章。
|
||||
- [要約タスクガイド](../tasks/summarization.md)
|
||||
- [要約タスクガイド](../tasks/summarization)
|
||||
|
||||
<PipelineTag pipeline="fill-mask"/>
|
||||
|
||||
|
||||
@@ -289,7 +289,7 @@ time."\n\nHe added: "I am very proud of the work I have been able to do in the l
|
||||
culture, and they allow us to design the'
|
||||
```
|
||||
|
||||
이 가이드에서는 다양한 디코딩 전략을 가능하게 하는 주요 매개변수를 보여줍니다. [`generate`] 메서드에 대한 고급 매개변수가 존재하므로 [`generate`] 메서드의 동작을 더욱 세부적으로 제어할 수 있습니다. 사용 가능한 매개변수의 전체 목록은 [API 문서](./main_classes/text_generation.md)를 참조하세요.
|
||||
이 가이드에서는 다양한 디코딩 전략을 가능하게 하는 주요 매개변수를 보여줍니다. [`generate`] 메서드에 대한 고급 매개변수가 존재하므로 [`generate`] 메서드의 동작을 더욱 세부적으로 제어할 수 있습니다. 사용 가능한 매개변수의 전체 목록은 [API 문서](./main_classes/text_generation)를 참조하세요.
|
||||
|
||||
### 추론 디코딩(Speculative Decoding)[[speculative-decoding]]
|
||||
|
||||
|
||||
@@ -21,7 +21,7 @@ GPT3/4, [Falcon](https://huggingface.co/tiiuae/falcon-40b), [Llama](https://hugg
|
||||
|
||||
이 가이드에서는 효율적인 대규모 언어 모델 배포를 위한 효과적인 기법들을 살펴보겠습니다.
|
||||
|
||||
1. **낮은 정밀도:** 연구에 따르면, [8비트와 4비트](./main_classes/quantization.md)와 같이 낮은 수치 정밀도로 작동하면 모델 성능의 큰 저하 없이 계산상의 이점을 얻을 수 있습니다.
|
||||
1. **낮은 정밀도:** 연구에 따르면, [8비트와 4비트](./main_classes/quantization)와 같이 낮은 수치 정밀도로 작동하면 모델 성능의 큰 저하 없이 계산상의 이점을 얻을 수 있습니다.
|
||||
|
||||
2. **플래시 어텐션:** 플래시 어텐션은 메모리 효율성을 높일 뿐만 아니라 최적화된 GPU 메모리 활용을 통해 효율성을 향상시키는 어텐션 알고리즘의 변형입니다.
|
||||
|
||||
|
||||
@@ -136,9 +136,9 @@ pip install -U flash-attn --no-build-isolation
|
||||
|
||||
## 양자화로 미스트랄 크기 줄이기[[shrinking-down-mistral-using-quantization]]
|
||||
|
||||
미스트랄 모델은 70억 개의 파라미터를 가지고 있어, 절반의 정밀도(float16)로 약 14GB의 GPU RAM이 필요합니다. 각 파라미터가 2바이트로 저장되기 때문입니다. 하지만 [양자화](../quantization.md)를 사용하면 모델 크기를 줄일 수 있습니다. 모델을 4비트(즉, 파라미터당 반 바이트)로 양자화하면 약 3.5GB의 RAM만 필요합니다.
|
||||
미스트랄 모델은 70억 개의 파라미터를 가지고 있어, 절반의 정밀도(float16)로 약 14GB의 GPU RAM이 필요합니다. 각 파라미터가 2바이트로 저장되기 때문입니다. 하지만 [양자화](../quantization)를 사용하면 모델 크기를 줄일 수 있습니다. 모델을 4비트(즉, 파라미터당 반 바이트)로 양자화하면 약 3.5GB의 RAM만 필요합니다.
|
||||
|
||||
모델을 양자화하는 것은 `quantization_config`를 모델에 전달하는 것만큼 간단합니다. 아래에서는 BitsAndBytes 양자화를 사용하지만, 다른 양자화 방법은 [이 페이지](../quantization.md)를 참고하세요:
|
||||
모델을 양자화하는 것은 `quantization_config`를 모델에 전달하는 것만큼 간단합니다. 아래에서는 BitsAndBytes 양자화를 사용하지만, 다른 양자화 방법은 [이 페이지](../quantization)를 참고하세요:
|
||||
|
||||
```python
|
||||
>>> import torch
|
||||
|
||||
@@ -35,7 +35,7 @@ PatchTSMixer는 MLP-Mixer 아키텍처를 기반으로 한 경량 시계열 모
|
||||
## 사용 예[[usage-example]]
|
||||
|
||||
아래의 코드 스니펫은 PatchTSMixer 모델을 무작위로 초기화하는 방법을 보여줍니다.
|
||||
PatchTSMixer 모델은 [Trainer API](../trainer.md)와 호환됩니다.
|
||||
PatchTSMixer 모델은 [Trainer API](../trainer)와 호환됩니다.
|
||||
|
||||
```python
|
||||
|
||||
|
||||
@@ -57,8 +57,8 @@ rendered properly in your Markdown viewer.
|
||||
|
||||
| 模型 | PyTorch 支持 | TensorFlow 支持 | Flax 支持 |
|
||||
|:------------------------------------------------------------------------:|:---------------:|:------------------:|:------------:|
|
||||
| [ALBERT](../en/model_doc/albert.md) | ✅ | ✅ | ✅ |
|
||||
| [ALIGN](../en/model_doc/align.md) | ✅ | ❌ | ❌ |
|
||||
| [ALBERT](../en/model_doc/albert) | ✅ | ✅ | ✅ |
|
||||
| [ALIGN](../en/model_doc/align) | ✅ | ❌ | ❌ |
|
||||
| [AltCLIP](../en/model_doc/altclip) | ✅ | ❌ | ❌ |
|
||||
| [Audio Spectrogram Transformer](../en/model_doc/audio-spectrogram-transformer) | ✅ | ❌ | ❌ |
|
||||
| [Autoformer](../en/model_doc/autoformer) | ✅ | ❌ | ❌ |
|
||||
|
||||
Reference in New Issue
Block a user