Aligning modling code for GPT2 to work with vLLM (fallback) (#36934)

* aligning for vllm

* using input shape rather than attn outputs

* remove demo

* revert Conv1D

* style

* style

* Update src/transformers/models/gpt2/modeling_gpt2.py

Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>

* fix copies

* Apply suggestions from code review

Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>

* adding docs about vllm

* chore: style

---------

Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
This commit is contained in:
Aritra Roy Gosthipaty
2025-05-02 13:25:16 +05:30
committed by GitHub
parent e94a4807df
commit 8a0a508f2b
3 changed files with 13 additions and 0 deletions

View File

@@ -73,6 +73,12 @@ echo -e "Hello, I'm a language model" | transformers run --task text-generation
</hfoption>
</hfoptions>
One can also serve the model using vLLM with the `transformers backend`.
```
vllm serve openai-community/gpt2 --model-imp transformers
```
Quantization reduces the memory burden of large models by representing the weights in a lower precision. Refer to the [Quantization](../quantization/overview) overview for more available quantization backends.
The example below uses [bitsandbytes](../quantization/bitsandbytes) to only quantize the weights to 4-bits.