@@ -67,23 +67,23 @@ You can quickly run a FP4 model on a single GPU by running the following code:
|
|||||||
from transformers import AutoModelForCausalLM
|
from transformers import AutoModelForCausalLM
|
||||||
|
|
||||||
model_name = "bigscience/bloom-2b5"
|
model_name = "bigscience/bloom-2b5"
|
||||||
model_8bit = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", load_in_4bit=True)
|
model_4bit = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", load_in_4bit=True)
|
||||||
```
|
```
|
||||||
Note that `device_map` is optional, but setting `device_map="auto"` is preferred for inference because it efficiently dispatches the model across the available resources.
|
Note that `device_map` is optional, but setting `device_map="auto"` is preferred for inference because it efficiently dispatches the model across the available resources.
|
||||||
|
|
||||||
### Running FP4 models - multi GPU setup
|
### Running FP4 models - multi GPU setup
|
||||||
|
|
||||||
The way to load your mixed 8-bit model in multiple GPUs is as follows (same command as single GPU setup):
|
The way to load your mixed 4-bit model in multiple GPUs is as follows (same command as single GPU setup):
|
||||||
```py
|
```py
|
||||||
model_name = "bigscience/bloom-2b5"
|
model_name = "bigscience/bloom-2b5"
|
||||||
model_8bit = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", load_in_4bit=True)
|
model_4bit = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", load_in_4bit=True)
|
||||||
```
|
```
|
||||||
But you can control the GPU RAM you want to allocate on each GPU using `accelerate`. Use the `max_memory` argument as follows:
|
But you can control the GPU RAM you want to allocate on each GPU using `accelerate`. Use the `max_memory` argument as follows:
|
||||||
|
|
||||||
```py
|
```py
|
||||||
max_memory_mapping = {0: "600MB", 1: "1GB"}
|
max_memory_mapping = {0: "600MB", 1: "1GB"}
|
||||||
model_name = "bigscience/bloom-3b"
|
model_name = "bigscience/bloom-3b"
|
||||||
model_8bit = AutoModelForCausalLM.from_pretrained(
|
model_4bit = AutoModelForCausalLM.from_pretrained(
|
||||||
model_name, device_map="auto", load_in_4bit=True, max_memory=max_memory_mapping
|
model_name, device_map="auto", load_in_4bit=True, max_memory=max_memory_mapping
|
||||||
)
|
)
|
||||||
```
|
```
|
||||||
|
|||||||