From e9756cdbc7dcec91ea0dde55c165d6276bd08252 Mon Sep 17 00:00:00 2001
From: Steven Liu <59462357+stevhliu@users.noreply.github.com>
Date: Mon, 10 Mar 2025 13:14:19 -0700
Subject: [PATCH] [docs] Serving LLMs (#36522)

* initial

* fix

* model-impl
---
 docs/source/en/_toctree.yml |  2 ++
 docs/source/en/serving.md   | 64 +++++++++++++++++++++++++++++++++++++
 2 files changed, 66 insertions(+)
 create mode 100644 docs/source/en/serving.md

diff --git a/docs/source/en/_toctree.yml b/docs/source/en/_toctree.yml
index 624f4d7352..33c4a7df57 100644
--- a/docs/source/en/_toctree.yml
+++ b/docs/source/en/_toctree.yml
@@ -74,6 +74,8 @@
     title: Optimizing inference
   - local: kv_cache
     title: KV cache strategies
+  - local: serving
+    title: Serving
   - local: cache_explanation
     title: Caching
   - local: llm_tutorial_optimization
diff --git a/docs/source/en/serving.md b/docs/source/en/serving.md
new file mode 100644
index 0000000000..1c665e4bc4
--- /dev/null
+++ b/docs/source/en/serving.md
@@ -0,0 +1,64 @@
+
+
+# Serving
+
+Transformers models can be served for inference with specialized libraries such as Text Generation Inference (TGI) and vLLM. These libraries are designed to optimize LLM performance and include optimization features that may not be available in Transformers.
+
+## TGI
+
+[TGI](https://huggingface.co/docs/text-generation-inference/index) can serve models that aren't [natively implemented](https://huggingface.co/docs/text-generation-inference/supported_models) by falling back on the Transformers implementation of the model. Some of TGI's high-performance features aren't available in the Transformers implementation, but other features like continuous batching and streaming are still supported.
+
+> [!TIP]
+> Refer to the [Non-core model serving](https://huggingface.co/docs/text-generation-inference/basic_tutorials/non_core_models) guide for more details.
+
+Serve a Transformers implementation the same way you'd serve a TGI model.
+
+```docker
+docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:latest --model-id gpt2
+```
+
+Add `--trust-remote-code` to the command to serve a custom Transformers model.
+
+```docker
+docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:latest --model-id --trust-remote-code
+```
+
+## vLLM
+
+[vLLM](https://docs.vllm.ai/en/latest/index.html) can also serve a Transformers implementation of a model if it isn't [natively implemented](https://docs.vllm.ai/en/latest/models/supported_models.html#list-of-text-only-language-models) in vLLM.
+
+Many features like quantization, LoRA adapters, and distributed inference and serving are supported for the Transformers implementation.
+
+> [!TIP]
+> Refer to the [Transformers fallback](https://docs.vllm.ai/en/latest/models/supported_models.html#transformers-fallback) section for more details.
+
+By default, vLLM serves the native implementation and falls back on the Transformers implementation if a native one doesn't exist. Set `--model-impl transformers` to explicitly use the Transformers model implementation.
+
+```shell
+vllm serve Qwen/Qwen2.5-1.5B-Instruct \
+    --task generate \
+    --model-impl transformers
+```
+
+Add the `--trust-remote-code` parameter to enable loading a remote code model.
+
+```shell
+vllm serve Qwen/Qwen2.5-1.5B-Instruct \
+    --task generate \
+    --model-impl transformers \
+    --trust-remote-code
+```
\ No newline at end of file
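
Once either server is running, it can be queried over HTTP. The sketch below is not part of the patch. It assumes TGI is reachable on host port 8080 (the `-p 8080:80` mapping in the `docker run` command) at its `/generate` route, and that vLLM listens on its default port 8000 with the OpenAI-compatible `/v1/completions` route. The helper names `build_tgi_request`, `build_vllm_request`, and `send` are hypothetical; they only wrap the Python standard library.

```python
import json
import urllib.request

# Assumed endpoints: adjust host and port to match your own deployment.
TGI_URL = "http://localhost:8080/generate"          # host port from `-p 8080:80`
VLLM_URL = "http://localhost:8000/v1/completions"   # vLLM's default port


def _post_json(url, body):
    """Return a POST request carrying `body` as JSON (hypothetical helper)."""
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def build_tgi_request(prompt, max_new_tokens=20):
    """Build a request for TGI's /generate route."""
    body = {"inputs": prompt, "parameters": {"max_new_tokens": max_new_tokens}}
    return _post_json(TGI_URL, body)


def build_vllm_request(prompt, model="Qwen/Qwen2.5-1.5B-Instruct", max_tokens=20):
    """Build a request for vLLM's OpenAI-compatible /v1/completions route."""
    body = {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    return _post_json(VLLM_URL, body)


def send(request, timeout=30):
    """Send the request and decode the JSON reply. Needs a running server."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the TGI container up, `send(build_tgi_request("What is deep learning?"))` should return a JSON object whose `generated_text` field holds the completion. For vLLM, the text is expected under `choices[0]["text"]` in the reply to `send(build_vllm_request(...))`. The request format is the same whether the server uses a native model or the Transformers fallback, so the client side should not need to change.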