Aligning modeling code for GPT2 to work with vLLM (fallback) (#36934)

* aligning for vllm
* using input shape rather than attn outputs
* remove demo
* revert Conv1D
* style
* style
* Update src/transformers/models/gpt2/modeling_gpt2.py

Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>

* fix copies
* Apply suggestions from code review

Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>

* adding docs about vllm
* chore: style

---------

Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-05-02 13:25:16 +05:30
parent e94a4807df
commit 8a0a508f2b
3 changed files with 13 additions and 0 deletions
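
The docs change in this commit adds a `vllm serve` command for GPT-2. As a rough illustration of how a server started that way might be queried, here is a minimal sketch. It assumes the server listens on vLLM's default address (`http://localhost:8000`) and exposes the OpenAI-compatible `/v1/completions` route. The helper names `build_completion_request` and `complete` are hypothetical and not part of this commit.

```python
import json
import urllib.request

# Assumption: the server started by `vllm serve` uses vLLM's default host and
# port and serves the OpenAI-compatible completions route.
BASE_URL = "http://localhost:8000"
MODEL = "openai-community/gpt2"


def build_completion_request(prompt, max_tokens=20, base_url=BASE_URL):
    """Build (but do not send) a POST request for the /v1/completions route."""
    payload = {"model": MODEL, "prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt, **kwargs):
    """Send the request and return the generated text of the first choice."""
    request = build_completion_request(prompt, **kwargs)
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["text"]
```

With the server running, `complete("Hello, I'm a language model")` should return the generated continuation as a string. Building the request separately from sending it keeps the payload easy to inspect without a live server.
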
--- a/docs/source/en/model_doc/gpt2.md
+++ b/docs/source/en/model_doc/gpt2.md
@@ -73,6 +73,12 @@ echo -e "Hello, I'm a language model" | transformers run --task text-generation
 </hfoption>
 </hfoptions>

+You can also serve the model with vLLM using the `transformers` backend.
+
+```
+vllm serve openai-community/gpt2 --model-impl transformers
+```
+
 Quantization reduces the memory burden of large models by representing the weights in a lower precision. Refer to the [Quantization](../quantization/overview) overview for more available quantization backends.

 The example below uses [bitsandbytes](../quantization/bitsandbytes) to only quantize the weights to 4-bits.