Add support for GrokAdamW optimizer (#32521)
* add grokadamw
* reformat
* code review feedback, unit test
* reformat
* reformat
@@ -432,6 +432,57 @@ trainer = trl.SFTTrainer(
trainer.train()
```

## GrokAdamW optimizer

The GrokAdamW optimizer is designed to improve training performance and stability, particularly for models that benefit from grokking signal functions. To use it, first install the optimizer package with `pip install grokadamw`, then set `optim="grokadamw"` in [`TrainingArguments`].

<Tip>

GrokAdamW is particularly useful for models that need more advanced optimization techniques to reach better performance and stability.

</Tip>

The script below fine-tunes [google/gemma-2b](https://huggingface.co/google/gemma-2b) on the IMDB dataset using the GrokAdamW optimizer:

```python
import datasets
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Define the training arguments; optim="grokadamw" selects the optimizer
args = TrainingArguments(
    output_dir="./test-grokadamw",
    max_steps=1000,
    per_device_train_batch_size=4,
    optim="grokadamw",
    logging_strategy="steps",
    logging_steps=1,
    learning_rate=2e-5,
    save_strategy="no",
    run_name="grokadamw-imdb",
)

# Load the model and tokenizer, and move the model to GPU 0
model_id = "google/gemma-2b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, low_cpu_mem_usage=True).to(0)

# Load the IMDB dataset and tokenize it; the Trainer needs token IDs, not raw text.
# max_length=512 is an arbitrary cap chosen for this example.
train_dataset = datasets.load_dataset("imdb", split="train")
train_dataset = train_dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True,
    remove_columns=["text", "label"],
)

# Initialize the Trainer; the collator pads each batch and builds causal-LM labels
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False),
)

# Train the model
trainer.train()
```

This script fine-tunes the `google/gemma-2b` model on the IMDB dataset with GrokAdamW. Setting `optim="grokadamw"` in `TrainingArguments` selects the optimizer; the tokenized dataset and a language-modeling data collator are then passed to the `Trainer` for training.
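
Because `optim="grokadamw"` depends on a separately installed package, it can help to check for that package before loading a multi-gigabyte model. The sketch below is one possible pre-flight check using only the standard library. The helper names `is_package_installed` and `require_grokadamw` are hypothetical and are not part of Transformers or `grokadamw`.

```python
import importlib.util


def is_package_installed(name: str) -> bool:
    """Return True if the top-level package `name` can be found for import."""
    # find_spec returns None when no top-level module of that name exists.
    return importlib.util.find_spec(name) is not None


def require_grokadamw() -> None:
    """Hypothetical helper: fail early, with an actionable message, if grokadamw is missing."""
    if not is_package_installed("grokadamw"):
        raise ImportError(
            "optim='grokadamw' needs the grokadamw package; "
            "install it with `pip install grokadamw`."
        )
```

Calling `require_grokadamw()` at the top of a training script surfaces a missing dependency immediately instead of after the dataset and model have been loaded.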

## Accelerate and Trainer

The [`Trainer`] class is powered by [Accelerate](https://hf.co/docs/accelerate), a library for easily training PyTorch models in distributed environments with support for integrations such as [FullyShardedDataParallel (FSDP)](https://pytorch.org/blog/introducing-pytorch-fully-sharded-data-parallel-api/) and [DeepSpeed](https://www.deepspeed.ai/).