[docs] Create page on inference servers with transformers backend (#39550)
* draft docs on inference servers * Update docs/source/en/_toctree.yml Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com> * update * dic build failed * Update docs/source/en/transformers_as_backend.md Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com> * Update docs/source/en/_toctree.yml Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com> * Update docs/source/en/transformers_as_backend.md Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com> * Update docs/source/en/transformers_as_backend.md Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com> * Update docs/source/en/transformers_as_backend.md Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com> * Update docs/source/en/transformers_as_backend.md Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com> * Update docs/source/en/transformers_as_backend.md Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com> * Update docs/source/en/transformers_as_backend.md Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com> * Update docs/source/en/transformers_as_backend.md Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com> * Update docs/source/en/transformers_as_backend.md Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com> * Update docs/source/en/transformers_as_backend.md Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com> * Update docs/source/en/transformers_as_backend.md Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com> * Update docs/source/en/transformers_as_backend.md Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com> * Update docs/source/en/transformers_as_backend.md Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com> * Update docs/source/en/transformers_as_backend.md Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com> * Update docs/source/en/transformers_as_backend.md Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com> * Update docs/source/en/transformers_as_backend.md Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com> * Update docs/source/en/transformers_as_backend.md Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com> * Update docs/source/en/transformers_as_backend.md Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com> * Update docs/source/en/transformers_as_backend.md Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com> * Update docs/source/en/transformers_as_backend.md Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com> * Update docs/source/en/transformers_as_backend.md Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com> * Update docs/source/en/transformers_as_backend.md Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com> * Update docs/source/en/transformers_as_backend.md Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com> * Apply suggestions from code review Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com> * apply last suggestions --------- Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com> Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
This commit is contained in:
committed by
GitHub
parent
cd98c1fee3
commit
1806583390
@@ -16,54 +16,9 @@ rendered properly in your Markdown viewer.
|
||||
|
||||
# Serving
|
||||
|
||||
Transformer models can be efficiently deployed using libraries such as vLLM, Text Generation Inference (TGI), and others. These libraries are designed for production-grade user-facing services, and can scale to multiple servers and millions of concurrent users.
|
||||
Transformer models can be efficiently deployed using libraries such as vLLM, Text Generation Inference (TGI), and others. These libraries are designed for production-grade user-facing services, and can scale to multiple servers and millions of concurrent users. Refer to [Transformers as Backend for Inference Servers](./transformers_as_backends) for usage examples.
|
||||
|
||||
You can also serve transformer models easily using the `transformers serve` CLI. This is ideal for experimentation purposes, or to run models locally for personal and private use.
|
||||
|
||||
## TGI
|
||||
|
||||
[TGI](https://huggingface.co/docs/text-generation-inference/index) can serve models that aren't [natively implemented](https://huggingface.co/docs/text-generation-inference/supported_models) by falling back on the Transformers implementation of the model. Some of TGIs high-performance features aren't available in the Transformers implementation, but other features like continuous batching and streaming are still supported.
|
||||
|
||||
> [!TIP]
|
||||
> Refer to the [Non-core model serving](https://huggingface.co/docs/text-generation-inference/basic_tutorials/non_core_models) guide for more details.
|
||||
|
||||
Serve a Transformers implementation the same way you'd serve a TGI model.
|
||||
|
||||
```docker
|
||||
docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:latest --model-id gpt2
|
||||
```
|
||||
|
||||
Add `--trust-remote_code` to the command to serve a custom Transformers model.
|
||||
|
||||
```docker
|
||||
docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:latest --model-id <CUSTOM_MODEL_ID> --trust-remote-code
|
||||
```
|
||||
|
||||
## vLLM
|
||||
|
||||
[vLLM](https://docs.vllm.ai/en/latest/index.html) can also serve a Transformers implementation of a model if it isn't [natively implemented](https://docs.vllm.ai/en/latest/models/supported_models.html#list-of-text-only-language-models) in vLLM.
|
||||
|
||||
Many features like quantization, LoRA adapters, and distributed inference and serving are supported for the Transformers implementation.
|
||||
|
||||
> [!TIP]
|
||||
> Refer to the [Transformers fallback](https://docs.vllm.ai/en/latest/models/supported_models.html#transformers-fallback) section for more details.
|
||||
|
||||
By default, vLLM serves the native implementation and if it doesn't exist, it falls back on the Transformers implementation. But you can also set `--model-impl transformers` to explicitly use the Transformers model implementation.
|
||||
|
||||
```shell
|
||||
vllm serve Qwen/Qwen2.5-1.5B-Instruct \
|
||||
--task generate \
|
||||
--model-impl transformers
|
||||
```
|
||||
|
||||
Add the `trust-remote-code` parameter to enable loading a remote code model.
|
||||
|
||||
```shell
|
||||
vllm serve Qwen/Qwen2.5-1.5B-Instruct \
|
||||
--task generate \
|
||||
--model-impl transformers \
|
||||
--trust-remote-code
|
||||
```
|
||||
Apart from that you can also serve transformer models easily using the `transformers serve` CLI. This is ideal for experimentation purposes, or to run models locally for personal and private use.
|
||||
|
||||
## Serve CLI
|
||||
|
||||
|
||||
Reference in New Issue
Block a user