GPT-J specific half precision on CPU note (#22086)
* re: #21989
* update re: #21989
* removed cpu option
* make style
@@ -21,21 +21,22 @@ This model was contributed by [Stella Biderman](https://huggingface.co/stellaath
 
 Tips:
 
-- To load [GPT-J](https://huggingface.co/EleutherAI/gpt-j-6B) in float32 one would need at least 2x model size CPU
-RAM: 1x for initial weights and another 1x to load the checkpoint. So for GPT-J it would take at least 48GB of CPU
-RAM to just load the model. To reduce the CPU RAM usage there are a few options. The `torch_dtype` argument can be
-used to initialize the model in half-precision. And the `low_cpu_mem_usage` argument can be used to keep the RAM
-usage to 1x. There is also a [fp16 branch](https://huggingface.co/EleutherAI/gpt-j-6B/tree/float16) which stores
-the fp16 weights, which could be used to further minimize the RAM usage. Combining all this it should take roughly
-12.1GB of CPU RAM to load the model.
+- To load [GPT-J](https://huggingface.co/EleutherAI/gpt-j-6B) in float32 one would need at least 2x model size
+RAM: 1x for initial weights and another 1x to load the checkpoint. So for GPT-J it would take at least 48GB
+RAM to just load the model. To reduce the RAM usage there are a few options. The `torch_dtype` argument can be
+used to initialize the model in half-precision on a CUDA device only. There is also a fp16 branch which stores the fp16 weights,
+which could be used to further minimize the RAM usage:
 
 ```python
 >>> from transformers import GPTJForCausalLM
 >>> import torch
 
+>>> device = "cuda"
 >>> model = GPTJForCausalLM.from_pretrained(
-...     "EleutherAI/gpt-j-6B", revision="float16", torch_dtype=torch.float16, low_cpu_mem_usage=True
-... )
+...     "EleutherAI/gpt-j-6B",
+...     revision="float16",
+...     torch_dtype=torch.float16,
+... ).to(device)
 ```
 
 - The model should fit on 16GB GPU for inference. For training/fine-tuning it would take much more GPU RAM. Adam
@@ -85,7 +86,8 @@ model.
 >>> from transformers import GPTJForCausalLM, AutoTokenizer
 >>> import torch
 
->>> model = GPTJForCausalLM.from_pretrained("EleutherAI/gpt-j-6B", torch_dtype=torch.float16)
+>>> device = "cuda"
+>>> model = GPTJForCausalLM.from_pretrained("EleutherAI/gpt-j-6B", torch_dtype=torch.float16).to(device)
 >>> tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")
 
 >>> prompt = (
@@ -94,7 +96,7 @@ model.
 ...     "researchers was the fact that the unicorns spoke perfect English."
 ... )
 
->>> input_ids = tokenizer(prompt, return_tensors="pt").input_ids
+>>> input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)
 
 >>> gen_tokens = model.generate(
 ...     input_ids,
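
Note on the RAM figures in the first hunk: the "at least 48GB" estimate for float32 and the "roughly 12.1GB" figure in the removed float16 sentence can be reproduced with simple arithmetic. The sketch below is only a back-of-the-envelope check, not part of the commit. It assumes a parameter count of 6,053,381,344 (taken from the public GPT-J model card, not from this diff), decimal gigabytes, and ignores any overhead beyond the raw weights. The helper names `weights_gb` and `load_ram_gb` are hypothetical.

```python
# Back-of-the-envelope RAM estimate for loading GPT-J.
# ASSUMPTION: parameter count as listed on the public GPT-J model card;
# it does not appear anywhere in this commit.
GPTJ_PARAMS = 6_053_381_344

# Bytes used to store one parameter in each precision.
BYTES_PER_PARAM = {"float32": 4, "float16": 2}


def weights_gb(n_params: int, dtype: str) -> float:
    """Size of one copy of the weights in decimal gigabytes (1 GB = 1e9 bytes)."""
    return n_params * BYTES_PER_PARAM[dtype] / 1e9


def load_ram_gb(n_params: int, dtype: str, copies: int = 2) -> float:
    """RAM held while loading, modelled as `copies` full copies of the weights.

    copies=2 mirrors the "1x for initial weights and another 1x to load
    the checkpoint" wording in the documentation text.
    """
    return copies * weights_gb(n_params, dtype)


fp32_peak = load_ram_gb(GPTJ_PARAMS, "float32")              # two float32 copies
fp16_single = load_ram_gb(GPTJ_PARAMS, "float16", copies=1)  # one float16 copy

print(f"float32, 2 copies: {fp32_peak:.1f} GB")
print(f"float16, 1 copy:   {fp16_single:.1f} GB")
```

Under those assumptions the two printed values come out at about 48.4 GB and 12.1 GB, which lines up with the numbers quoted in the documentation text before this change.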