Aligning modeling code for GPT2 to work with vLLM (fallback) (#36934)
* aligning for vllm
* using input shape rather than attn outputs
* remove demo
* revert Conv1D
* style
* style
* Update src/transformers/models/gpt2/modeling_gpt2.py

Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>

* fix copies
* Apply suggestions from code review

Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>

* adding docs about vllm
* chore: style

---------

Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Committed by GitHub
Parent: e94a4807df
Commit: 8a0a508f2b
@@ -73,6 +73,12 @@ echo -e "Hello, I'm a language model" | transformers run --task text-generation
</hfoption>
</hfoptions>

You can also serve the model with vLLM using the Transformers backend.

```
vllm serve openai-community/gpt2 --model-impl transformers
```
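Once the server is up, it can be queried over HTTP. The sketch below is a minimal client, assuming the server runs locally on vLLM's default port (8000) and exposes its OpenAI-compatible `/v1/completions` endpoint. The helper names `build_completion_request` and `complete` are hypothetical and exist only for this illustration.

```python
import json
import urllib.request

# Assumption: `vllm serve` is running on this machine with its default port.
BASE_URL = "http://localhost:8000/v1"


def build_completion_request(prompt: str, max_tokens: int = 20) -> urllib.request.Request:
    """Build a POST request for the OpenAI-compatible completions endpoint."""
    payload = {
        "model": "openai-community/gpt2",  # must match the model being served
        "prompt": prompt,
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{BASE_URL}/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt: str) -> str:
    """Send the prompt to the server and return the generated continuation."""
    with urllib.request.urlopen(build_completion_request(prompt)) as response:
        body = json.load(response)
    return body["choices"][0]["text"]


# Usage (requires a running server):
#   print(complete("Hello, I'm a language model"))
```

Only the standard library is used here to keep the sketch dependency-free; any OpenAI-compatible client pointed at the same base URL should work as well.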
Quantization reduces the memory burden of large models by representing the weights in a lower precision. Refer to the [Quantization](../quantization/overview) overview for more available quantization backends.

The example below uses [bitsandbytes](../quantization/bitsandbytes) to quantize only the weights to 4 bits.
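To put the memory savings in numbers, here is a back-of-the-envelope calculation (separate from the bitsandbytes example itself). It assumes roughly 124 million parameters for the smallest GPT-2 checkpoint and counts weight storage only, ignoring quantization metadata, activations, and any layers left in higher precision. The function name `weight_memory_mb` is hypothetical.

```python
def weight_memory_mb(num_params: int, bits_per_weight: int) -> float:
    """Memory needed to store the weights alone, in megabytes (10**6 bytes)."""
    return num_params * bits_per_weight / 8 / 1e6


# Assumption: approximate parameter count of the smallest GPT-2 checkpoint.
GPT2_PARAMS = 124_000_000

for name, bits in [("float32", 32), ("float16", 16), ("4-bit", 4)]:
    print(f"{name:>8}: {weight_memory_mb(GPT2_PARAMS, bits):.0f} MB")
```

Under these assumptions, going from 32-bit to 4-bit weights cuts weight storage by a factor of eight; real savings will be somewhat smaller.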