[docs] Redesign (#31757)
* toctree * not-doctested.txt * collapse sections * feedback * update * rewrite get started sections * fixes * fix * loading models * fix * customize models * share * fix link * contribute part 1 * contribute pt 2 * fix toctree * tokenization pt 1 * Add new model (#32615) * v1 - working version * fix * fix * fix * fix * rename to correct name * fix title * fixup * rename files * fix * add copied from on tests * rename to `FalconMamba` everywhere and fix bugs * fix quantization + accelerate * fix copies * add `torch.compile` support * fix tests * fix tests and add slow tests * copies on config * merge the latest changes * fix tests * add few lines about instruct * Apply suggestions from code review Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com> * fix * fix tests --------- Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com> * "to be not" -> "not to be" (#32636) * "to be not" -> "not to be" * Update sam.md * Update trainer.py * Update modeling_utils.py * Update test_modeling_utils.py * Update test_modeling_utils.py * fix hfoption tag * tokenization pt. 2 * image processor * fix toctree * backbones * feature extractor * fix file name * processor * update not-doctested * update * make style * fix toctree * revision * make fixup * fix toctree * fix * make style * fix hfoption tag * pipeline * pipeline gradio * pipeline web server * add pipeline * fix toctree * not-doctested * prompting * llm optims * fix toctree * fixes * cache * text generation * fix * chat pipeline * chat stuff * xla * torch.compile * cpu inference * toctree * gpu inference * agents and tools * gguf/tiktoken * finetune * toctree * trainer * trainer pt 2 * optims * optimizers * accelerate * parallelism * fsdp * update * distributed cpu * hardware training * gpu training * gpu training 2 * peft * distrib debug * deepspeed 1 * deepspeed 2 * chat toctree * quant pt 1 * quant pt 2 * fix toctree * fix * fix * quant pt 3 * quant pt 4 * serialization * torchscript * scripts * tpu * review * model addition timeline * modular * more reviews * reviews * fix toctree * reviews reviews * continue reviews * more reviews * modular transformers * more review * zamba2 * fix * all frameworks * pytorch * supported model frameworks * flashattention * rm check_table * not-doctested.txt * rm check_support_list.py * feedback * updates/feedback * review * feedback * fix * update * feedback * updates * update --------- Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com> Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com> Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
This commit is contained in:
299
docs/source/en/chat_extras.md
Normal file
299
docs/source/en/chat_extras.md
Normal file
@@ -0,0 +1,299 @@
|
||||
<!--Copyright 2024 The HuggingFace Team. All rights reserved.
|
||||
|
||||
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
|
||||
the License. You may obtain a copy of the License at
|
||||
|
||||
http://www.apache.org/licenses/LICENSE-2.0
|
||||
|
||||
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
|
||||
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
|
||||
specific language governing permissions and limitations under the License.
|
||||
|
||||
⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be
|
||||
rendered properly in your Markdown viewer.
|
||||
|
||||
-->
|
||||
|
||||
# Tools and RAG
|
||||
|
||||
The [`~PreTrainedTokenizerBase.apply_chat_template`] method supports virtually any additional argument types - strings, lists, dicts - besides the chat message. This makes it possible to use chat templates for many use cases.
|
||||
|
||||
This guide will demonstrate how to use chat templates with tools and retrieval-augmented generation (RAG).
|
||||
|
||||
## Tools
|
||||
|
||||
Tools are functions a large language model (LLM) can call to perform specific tasks. It is a powerful way to extend the capabilities of conversational agents with real-time information, computational tools, or access to large databases.
|
||||
|
||||
Follow the rules below when creating a tool.
|
||||
|
||||
1. The function should have a descriptive name.
|
||||
2. The function arguments must have a type hint in the function header (don't include in the `Args` block).
|
||||
3. The function must have a [Google-style](https://google.github.io/styleguide/pyguide.html#38-comments-and-docstrings) docstring.
|
||||
4. The function can have a return type and `Returns` block, but these are optional because most tool use models ignore them.
|
||||
|
||||
An example tool to get temperature and wind speed is shown below.
|
||||
|
||||
```py
|
||||
def get_current_temperature(location: str, unit: str) -> float:
|
||||
"""
|
||||
Get the current temperature at a location.
|
||||
|
||||
Args:
|
||||
location: The location to get the temperature for, in the format "City, Country"
|
||||
unit: The unit to return the temperature in. (choices: ["celsius", "fahrenheit"])
|
||||
Returns:
|
||||
The current temperature at the specified location in the specified units, as a float.
|
||||
"""
|
||||
return 22. # A real function should probably actually get the temperature!
|
||||
|
||||
def get_current_wind_speed(location: str) -> float:
|
||||
"""
|
||||
Get the current wind speed in km/h at a given location.
|
||||
|
||||
Args:
|
||||
location: The location to get the temperature for, in the format "City, Country"
|
||||
Returns:
|
||||
The current wind speed at the given location in km/h, as a float.
|
||||
"""
|
||||
return 6. # A real function should probably actually get the wind speed!
|
||||
|
||||
tools = [get_current_temperature, get_current_wind_speed]
|
||||
```
|
||||
|
||||
Load a model and tokenizer that supports tool-use like [NousResearch/Hermes-2-Pro-Llama-3-8B](https://hf.co/NousResearch/Hermes-2-Pro-Llama-3-8B), but you can also consider a larger model like [Command-R](./model_doc/cohere) and [Mixtral-8x22B](./model_doc/mixtral) if your hardware can support it.
|
||||
|
||||
```py
|
||||
import torch
|
||||
from transformers import AutoModelForCausalLM, AutoTokenizer
|
||||
|
||||
tokenizer = AutoTokenizer.from_pretrained( "NousResearch/Hermes-2-Pro-Llama-3-8B")
|
||||
tokenizer = AutoTokenizer.from_pretrained( "NousResearch/Hermes-2-Pro-Llama-3-8B")
|
||||
model = AutoModelForCausalLM.from_pretrained( "NousResearch/Hermes-2-Pro-Llama-3-8B", torch_dtype=torch.bfloat16, device_map="auto")
|
||||
```
|
||||
|
||||
Create a chat message.
|
||||
|
||||
```py
|
||||
messages = [
|
||||
{"role": "system", "content": "You are a bot that responds to weather queries. You should reply with the unit used in the queried location."},
|
||||
{"role": "user", "content": "Hey, what's the temperature in Paris right now?"}
|
||||
]
|
||||
```
|
||||
|
||||
Pass `messages` and a list of tools to [`~PreTrainedTokenizerBase.apply_chat_template`]. Then you can pass the inputs to the model for generation.
|
||||
|
||||
```py
|
||||
inputs = tokenizer.apply_chat_template(messages, tools=tools, add_generation_prompt=True, return_dict=True, return_tensors="pt")
|
||||
inputs = {k: v for k, v in inputs.items()}
|
||||
outputs = model.generate(**inputs, max_new_tokens=128)
|
||||
print(tokenizer.decode(outputs[0][len(inputs["input_ids"][0]):]))
|
||||
```
|
||||
|
||||
```txt
|
||||
<tool_call>
|
||||
{"arguments": {"location": "Paris, France", "unit": "celsius"}, "name": "get_current_temperature"}
|
||||
</tool_call><|im_end|>
|
||||
```
|
||||
|
||||
The chat model called the `get_current_temperature` tool with the correct parameters from the docstring. It inferred France as the location based on Paris, and that it should use Celsius for the units of temperature.
|
||||
|
||||
Now append the `get_current_temperature` function and these arguments to the chat message as `tool_call`. The `tool_call` dictionary should be provided to the `assistant` role instead of the `system` or `user`.
|
||||
|
||||
> [!WARNING]
|
||||
> The OpenAI API uses a JSON string as its `tool_call` format. This may cause errors or strange model behavior if used in Transformers, which expects a dict.
|
||||
|
||||
<hfoptions id="tool-call">
|
||||
<hfoption id="Llama">
|
||||
|
||||
```py
|
||||
tool_call = {"name": "get_current_temperature", "arguments": {"location": "Paris, France", "unit": "celsius"}}
|
||||
messages.append({"role": "assistant", "tool_calls": [{"type": "function", "function": tool_call}]})
|
||||
```
|
||||
|
||||
Allow the assistant to read the function outputs and chat with the user.
|
||||
|
||||
```py
|
||||
inputs = tokenizer.apply_chat_template(messages, tools=tools, add_generation_prompt=True, return_dict=True, return_tensors="pt")
|
||||
inputs = {k: v for k, v in inputs.items()}
|
||||
out = model.generate(**inputs, max_new_tokens=128)
|
||||
print(tokenizer.decode(out[0][len(inputs["input_ids"][0]):]))
|
||||
```
|
||||
|
||||
```txt
|
||||
The temperature in Paris, France right now is approximately 12°C (53.6°F).<|im_end|>
|
||||
```
|
||||
|
||||
</hfoption>
|
||||
<hfoption id="Mistral/Mixtral">
|
||||
|
||||
For [Mistral](./model_doc/mistral) and [Mixtral](./model_doc/mixtral) models, you need an additional `tool_call_id`. The `tool_call_id` is 9 randomly generated alphanumeric characters assigned to the `id` key in the `tool_call` dictionary.
|
||||
|
||||
```py
|
||||
tool_call_id = "9Ae3bDc2F"
|
||||
tool_call = {"name": "get_current_temperature", "arguments": {"location": "Paris, France", "unit": "celsius"}}
|
||||
messages.append({"role": "assistant", "tool_calls": [{"type": "function", "id": tool_call_id, "function": tool_call}]})
|
||||
```
|
||||
|
||||
```py
|
||||
inputs = tokenizer.apply_chat_template(messages, tools=tools, add_generation_prompt=True, return_dict=True, return_tensors="pt")
|
||||
inputs = {k: v for k, v in inputs.items()}
|
||||
out = model.generate(**inputs, max_new_tokens=128)
|
||||
print(tokenizer.decode(out[0][len(inputs["input_ids"][0]):]))
|
||||
```
|
||||
|
||||
</hfoption>
|
||||
</hfoptions>
|
||||
|
||||
## Schema
|
||||
|
||||
[`~PreTrainedTokenizerBase.apply_chat_template`] converts functions into a [JSON schema](https://json-schema.org/learn/getting-started-step-by-step) which is passed to the chat template. A LLM never sees the code inside the function. In other words, a LLM doesn't care how the model works technically, it only cares about function **definition** and **arguments**.
|
||||
|
||||
The JSON schema is automatically generated behind the scenes as long as your function follows the [rules](#tools) listed earlier above. But you can use [get_json_schema](https://github.com/huggingface/transformers/blob/14561209291255e51c55260306c7d00c159381a5/src/transformers/utils/chat_template_utils.py#L205) to manually convert a schema for more visibility or debugging.
|
||||
|
||||
```py
|
||||
from transformers.utils import get_json_schema
|
||||
|
||||
def multiply(a: float, b: float):
|
||||
"""
|
||||
A function that multiplies two numbers
|
||||
|
||||
Args:
|
||||
a: The first number to multiply
|
||||
b: The second number to multiply
|
||||
"""
|
||||
return a * b
|
||||
|
||||
schema = get_json_schema(multiply)
|
||||
print(schema)
|
||||
```
|
||||
|
||||
```json
|
||||
{
|
||||
"type": "function",
|
||||
"function": {
|
||||
"name": "multiply",
|
||||
"description": "A function that multiplies two numbers",
|
||||
"parameters": {
|
||||
"type": "object",
|
||||
"properties": {
|
||||
"a": {
|
||||
"type": "number",
|
||||
"description": "The first number to multiply"
|
||||
},
|
||||
"b": {
|
||||
"type": "number",
|
||||
"description": "The second number to multiply"
|
||||
}
|
||||
},
|
||||
"required": ["a", "b"]
|
||||
}
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
You can edit the schema or write one entirely from scratch. This gives you a lot of flexibility to define precise schemas for more complex functions.
|
||||
|
||||
> [!WARNING]
|
||||
> Try keeping your function signatures simple and the arguments to a minimum. These are easier for a model to understand and use than complex functions for example with nested arguments.
|
||||
|
||||
The example below demonstrates writing a schema manually and then passing it to [`~PreTrainedTokenizerBase.apply_chat_template`].
|
||||
|
||||
```py
|
||||
# A simple function that takes no arguments
|
||||
current_time = {
|
||||
"type": "function",
|
||||
"function": {
|
||||
"name": "current_time",
|
||||
"description": "Get the current local time as a string.",
|
||||
"parameters": {
|
||||
'type': 'object',
|
||||
'properties': {}
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
# A more complete function that takes two numerical arguments
|
||||
multiply = {
|
||||
'type': 'function',
|
||||
'function': {
|
||||
'name': 'multiply',
|
||||
'description': 'A function that multiplies two numbers',
|
||||
'parameters': {
|
||||
'type': 'object',
|
||||
'properties': {
|
||||
'a': {
|
||||
'type': 'number',
|
||||
'description': 'The first number to multiply'
|
||||
},
|
||||
'b': {
|
||||
'type': 'number', 'description': 'The second number to multiply'
|
||||
}
|
||||
},
|
||||
'required': ['a', 'b']
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
model_input = tokenizer.apply_chat_template(
|
||||
messages,
|
||||
tools = [current_time, multiply]
|
||||
)
|
||||
```
|
||||
|
||||
## RAG
|
||||
|
||||
Retrieval-augmented generation (RAG) models enhance a models existing knowledge by allowing it to search documents for additional information before returning a query. For RAG models, add a `documents` parameter to [`~PreTrainedTokenizerBase.apply_chat_template`]. This `documents` parameter should be a list of documents, and each document should be a single dict with `title` and `content` keys.
|
||||
|
||||
> [!TIP]
|
||||
> The `documents` parameter for RAG isn't widely supported and many models have chat templates that ignore `documents`. Verify if a model supports `documents` by reading its model card or executing `print(tokenizer.chat_template)` to see if the `documents` key is present. [Command-R](https://hf.co/CohereForAI/c4ai-command-r-08-2024) and [Command-R+](https://hf.co/CohereForAI/c4ai-command-r-plus-08-2024) both support `documents` in their RAG chat templates.
|
||||
|
||||
Create a list of documents to pass to the model.
|
||||
|
||||
```py
|
||||
documents = [
|
||||
{
|
||||
"title": "The Moon: Our Age-Old Foe",
|
||||
"text": "Man has always dreamed of destroying the moon. In this essay, I shall..."
|
||||
},
|
||||
{
|
||||
"title": "The Sun: Our Age-Old Friend",
|
||||
"text": "Although often underappreciated, the sun provides several notable benefits..."
|
||||
}
|
||||
]
|
||||
```
|
||||
|
||||
Set `chat_template="rag"` in [`~PreTrainedTokenizerBase.apply_chat_template`] and generate a response.
|
||||
|
||||
```py
|
||||
from transformers import AutoTokenizer, AutoModelForCausalLM
|
||||
|
||||
# Load the model and tokenizer
|
||||
tokenizer = AutoTokenizer.from_pretrained("CohereForAI/c4ai-command-r-v01-4bit")
|
||||
model = AutoModelForCausalLM.from_pretrained("CohereForAI/c4ai-command-r-v01-4bit", device_map="auto")
|
||||
device = model.device # Get the device the model is loaded on
|
||||
|
||||
# Define conversation input
|
||||
conversation = [
|
||||
{"role": "user", "content": "What has Man always dreamed of?"}
|
||||
]
|
||||
|
||||
input_ids = tokenizer.apply_chat_template(
|
||||
conversation=conversation,
|
||||
documents=documents,
|
||||
chat_template="rag",
|
||||
tokenize=True,
|
||||
add_generation_prompt=True,
|
||||
return_tensors="pt").to(device)
|
||||
|
||||
# Generate a response
|
||||
generated_tokens = model.generate(
|
||||
input_ids,
|
||||
max_new_tokens=100,
|
||||
do_sample=True,
|
||||
temperature=0.3,
|
||||
)
|
||||
|
||||
# Decode and print the generated text along with generation prompt
|
||||
generated_text = tokenizer.decode(generated_tokens[0])
|
||||
print(generated_text)
|
||||
```
|
||||
Reference in New Issue
Block a user