<!--Copyright 2023 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->

# Transformers Agent

<Tip warning={true}>

Transformers Agent is an experimental API which is subject to change at any time. Results returned by the agents
can vary as the APIs or underlying models are prone to change.

</Tip>

Transformers version v4.29.0 introduces Transformers Agent, an API built on the concept of *tools* and *agents*.

In short, it provides a natural language API on top of transformers: we define a set of curated tools, and design an
agent to interpret natural language and to use these tools. It is extensible by design; we curated some relevant tools,
but we'll show you how the system can be extended easily to use any tool developed by the community.

Let's start with a few examples of what can be achieved with this new API. It is particularly powerful when it comes
to multimodal tasks, so let's take it for a spin to caption images, read text out loud, and answer questions about
documents.

```py
agent.run("Caption the following image", image=image)
```

| **Input**                                                                                                                   | **Output**                        |
|-----------------------------------------------------------------------------------------------------------------------------|-----------------------------------|
| <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/beaver.png" width=200> | A beaver is swimming in the water |

---

```py
agent.run("Read the following text out loud", text=text)
```

| **Input**                         | **Output** |
|-----------------------------------|------------|
| A beaver is swimming in the water | <audio controls><source src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tts_example.wav" type="audio/wav"> Your browser does not support the audio element. </audio> |

---

```py
agent.run(
    "In the following `document`, where will the TRRF Scientific Advisory Council Meeting take place?",
    document=document,
)
```

| **Input**                                                                                                                                                                  | **Output**     |
|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------|----------------|
| <img src="https://datasets-server.huggingface.co/assets/hf-internal-testing/example-documents/--/hf-internal-testing--example-documents/test/0/image/image.jpg" width=200> | ballroom foyer |

## Quickstart

Before being able to use `agent.run`, you will need to instantiate an agent, which is a large language model (LLM).
We recommend using the [bigcode/starcoder](https://huggingface.co/bigcode/starcoder) checkpoint as it works very well
for the task at hand and is open-source; other examples are given below.

Start by logging in to have access to the Inference API:

```py
from huggingface_hub import login

# Authenticate with your Hugging Face access token
login("<YOUR_TOKEN>")
```

Then, instantiate the agent:

```py
from transformers import HfAgent

# The agent is backed by the StarCoder model served through the Inference API
agent = HfAgent("https://api-inference.huggingface.co/models/bigcode/starcoder")
```

This uses the Inference API that Hugging Face currently provides for free. If you have your own inference
endpoint for this model (or another one), you can replace the URL above with the URL of your endpoint.

<Tip>

We're showcasing StarCoder as the default in the documentation as the model is free to use and performs admirably well
on simple tasks. However, the checkpoint doesn't hold up when handling more complex prompts. If you're facing such an
issue, we recommend trying out the OpenAI model which, while sadly not open-source, performs better at the time of
writing.

</Tip>

You're now good to go! Let's dive into the two APIs that you now have at your disposal.

### Single execution (run)

Single execution uses the [`~Agent.run`] method of the agent:

```py
agent.run("Draw me a picture of rivers and lakes")
```

<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/rivers_and_lakes.png" width=200>

It automatically selects the tool (or tools) appropriate for the task you want to perform and runs them appropriately. It
can perform one or several tasks in the same instruction (though the more complex your instruction, the more likely
the agent is to fail).

```py
agent.run("Draw me a picture of the sea then transform the picture to add an island.")
```

<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/sea_and_island.png" width=200>

<br/>

Every [`~Agent.run`] operation is independent, so you can run it several times in a row with different tasks.

Note that your `agent` is just a large language model, so small variations in your prompt might yield completely
different results. It's important to explain as clearly as possible the task you want to perform.

If you'd like to keep a state across executions or to pass non-text objects to the agent, you can do so by specifying
variables that you would like the agent to use. For example, you could generate the first image of rivers and lakes,
and ask the model to update that picture to add an island by doing the following:

```python
# The result of the first run is passed to the second run as the `picture` variable
picture = agent.run("Draw me a picture of rivers and lakes")
updated_picture = agent.run("Take that `picture` and add an island to it", picture=picture)
```

<Tip>

This can be helpful when the model is unable to understand your request and mixes tools. An example would be:

```python
agent.run("Draw me the picture of a capybara swimming in the sea")
```

Here, the model could interpret it two ways:
- Have the `text-to-image` tool generate a capybara swimming in the sea
- Or, have the `text-to-image` tool generate a capybara, then use the `image-transformation` tool to have it swim in the sea

In case you would like to force the first scenario, you could do so by passing it the prompt as an argument:

```python
agent.run("Draw me a picture of the `prompt`", prompt="a capybara swimming in the sea")
```

</Tip>

### Chat-based execution (chat)

The agent also has a chat-based approach, using the [`~Agent.chat`] method:

```py
agent.chat("Draw me a picture of rivers and lakes")
```

<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/rivers_and_lakes.png" width=200>

```py
agent.chat("Transform the picture so that there is a rock in there")
```

<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/rivers_and_lakes_and_beaver.png" width=200>

<br/>

This is an interesting approach when you want to keep the state across instructions. It's better suited to
experimentation, but tends to handle single instructions much better than complex ones (which the [`~Agent.run`]
method is better at handling).

This method can also take arguments if you would like to pass non-text types or specific prompts.

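
The difference between the two methods comes down to where variables live between calls. The sketch below models only that idea: `ToyAgent` is a hypothetical class written for illustration (it is not the `transformers` implementation and calls no model), and its fake results are assumptions.

```python
class ToyAgent:
    """Hypothetical stand-in that only models state handling, not an LLM."""

    def __init__(self):
        self.chat_state = {}  # variables remembered across chat() calls

    def run(self, task, **kwargs):
        # run(): every call starts from a fresh state built only from kwargs
        state = dict(kwargs)
        state["result"] = f"output of: {task}"
        return state

    def chat(self, task, **kwargs):
        # chat(): new variables are merged into the state kept from earlier turns
        self.chat_state.update(kwargs)
        self.chat_state["result"] = f"output of: {task}"
        return dict(self.chat_state)


agent = ToyAgent()
first = agent.run("draw rivers", picture="img-1")
second = agent.run("add an island")  # independent: `picture` is gone
print("picture" in second)  # False

agent.chat("draw rivers", picture="img-1")
turn = agent.chat("add a rock")  # stateful: `picture` is still available
print("picture" in turn)  # True
```

This is why, with `run`, anything a later call needs has to be passed again explicitly as a keyword argument.
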
### ⚠️ Remote execution

For demonstration purposes and so that this can be used with all setups, we have created remote executors for several
of the default tools the agent has access to. These are created using
[inference endpoints](https://huggingface.co/inference-endpoints). To see how to set up remote executor tools yourself,
we recommend reading the [custom tool guide](custom_tools).

In order to run with remote tools, specifying `remote=True` to either [`~Agent.run`] or [`~Agent.chat`] is sufficient.

For example, the following command could be run on any device efficiently, without needing significant RAM or GPU:

```python
agent.run("Draw me a picture of rivers and lakes", remote=True)
```

The same can be said for [`~Agent.chat`]:

```py
agent.chat("Draw me a picture of rivers and lakes", remote=True)
```

### What's happening here? What are tools, and what are agents?

#### Agents

The "agent" here is a large language model, and we're prompting it so that it has access to a specific set of tools.

LLMs are pretty good at generating small samples of code, so this API takes advantage of that by prompting the
LLM to give a small sample of code performing a task with a set of tools. This prompt is then completed by the
task you give your agent and the description of the tools you give it. This way it gets access to the doc of the
tools you are using, especially their expected inputs and outputs, and can generate the relevant code.

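
To make the prompt construction concrete, here is a minimal sketch of how a task and a set of tool descriptions could be combined into a single prompt. The template text, the tool names, and the `build_prompt` helper are all hypothetical and only illustrate the idea; the prompt actually used by the library is different and longer.

```python
# Hypothetical tool descriptions; the real ones ship with each tool.
TOOL_DESCRIPTIONS = {
    "image_captioner": "Takes an `image` and returns an English caption as text.",
    "text_reader": "Takes a `text` in English and returns audio reading it out loud.",
}

# Hypothetical template; the actual prompt used by the library is longer.
PROMPT_TEMPLATE = """I will ask you to perform a task. Write Python code using only these tools:

<<all_tools>>

Task: "<<task>>"

Answer:
"""


def build_prompt(task, tool_descriptions):
    """Fill the template with one `- name: description` line per tool, then the task."""
    tool_lines = "\n".join(f"- {name}: {desc}" for name, desc in tool_descriptions.items())
    return PROMPT_TEMPLATE.replace("<<all_tools>>", tool_lines).replace("<<task>>", task)


prompt = build_prompt("Caption the following image", TOOL_DESCRIPTIONS)
print(prompt)
```

The LLM only ever sees text, so the quality of each tool description directly affects the code it writes.
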
#### Tools

Tools are very simple: they're a single function, with a name, and a description. We then use these tools' descriptions
to prompt the agent. Through the prompt, we show the agent how it would leverage tools in order to perform what was
requested in the query.

This is using brand-new tools and not pipelines, because the agent writes better code with very atomic tools.
Pipelines are more refactored and often combine several tasks in one. Tools are meant to be focused on
one very simple task only.

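
As an illustration of "a single function, with a name, and a description", here is a minimal sketch. `SimpleTool` is a hypothetical class written for this example, not the tool class provided by `transformers`.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class SimpleTool:
    """Hypothetical minimal tool: a name, a description, and one function."""

    name: str
    description: str
    func: Callable[..., object]

    def __call__(self, *args, **kwargs):
        # Calling the tool simply calls the wrapped function
        return self.func(*args, **kwargs)


# One atomic task per tool: this one only upper-cases text.
upper_caser = SimpleTool(
    name="text_upper_caser",
    description="Takes a `text` and returns it in upper case.",
    func=lambda text: text.upper(),
)

print(f"{upper_caser.name}: {upper_caser.description}")
print(upper_caser("a beaver is swimming"))  # A BEAVER IS SWIMMING
```
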
#### Code-execution?!

The generated code is then executed with our small Python interpreter on the set of inputs passed along with your tools.
We hear you screaming "Arbitrary code execution!" in the back, but let us explain why that is not the case.

The only functions that can be called are the tools you provided and the print function, so you're already
limited in what can be executed. You should be safe if it's limited to Hugging Face tools.

Then, we don't allow any attribute lookup or imports (which shouldn't be needed anyway for passing along
inputs/outputs to a small set of functions) so all the most obvious attacks (and you'd need to prompt the LLM
to output them anyway) shouldn't be an issue. If you want to be on the super safe side, you can execute the
`run()` method with the additional argument `return_code=True`, in which case the agent will just return the code
to execute and you can decide whether to do it or not.

The execution will stop at any line trying to perform an illegal operation or if there is a regular Python error
with the code generated by the agent.

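
The restrictions described above can be illustrated with a small evaluator built on the standard library's `ast` module. This is a minimal sketch under simplifying assumptions, not the interpreter shipped with `transformers`: the names `evaluate`, `InterpreterError`, and the `shout` tool are hypothetical, and only assignments, constants, variables, and calls to listed tools are supported.

```python
import ast


class InterpreterError(ValueError):
    """Raised when the generated code uses a construct that is not allowed."""


def evaluate(code, tools, state=None):
    """Run `code` statement by statement, allowing only assignments and calls to `tools`."""
    state = {} if state is None else state
    result = None
    for node in ast.parse(code).body:
        if isinstance(node, ast.Assign) and len(node.targets) == 1 and isinstance(node.targets[0], ast.Name):
            result = _eval_expr(node.value, tools, state)
            state[node.targets[0].id] = result
        elif isinstance(node, ast.Expr):
            result = _eval_expr(node.value, tools, state)
        else:
            # Imports, loops, function definitions, etc. all stop the execution.
            raise InterpreterError(f"{type(node).__name__} is not allowed")
    return result


def _eval_expr(node, tools, state):
    if isinstance(node, ast.Constant):
        return node.value
    if isinstance(node, ast.Name):
        if node.id in state:
            return state[node.id]
        raise InterpreterError(f"Unknown variable: {node.id}")
    if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
        if node.func.id not in tools:
            raise InterpreterError(f"Unknown tool: {node.func.id}")
        if any(kw.arg is None for kw in node.keywords):
            raise InterpreterError("**kwargs unpacking is not allowed")
        args = [_eval_expr(arg, tools, state) for arg in node.args]
        kwargs = {kw.arg: _eval_expr(kw.value, tools, state) for kw in node.keywords}
        return tools[node.func.id](*args, **kwargs)
    # Attribute lookups, lambdas, comprehensions, etc. are rejected.
    raise InterpreterError(f"{type(node).__name__} is not allowed")


tools = {"print": print, "shout": lambda text: text.upper()}
print(evaluate("message = shout(text)\nmessage", tools, state={"text": "hello"}))  # HELLO

for bad in ("import os", "text.upper()"):
    try:
        evaluate(bad, tools, state={"text": "hello"})
    except InterpreterError as error:
        print(f"Rejected {bad!r}: {error}")
```

Because the evaluator walks the syntax tree itself instead of calling `exec`, anything it does not explicitly handle is rejected by default.
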
### A curated set of tools

We identify a set of tools that can empower such agents. Here is an up-to-date list of the tools we have integrated
in `transformers`:

- **Document question answering**: given a document (such as a PDF) in image format, answer a question on this document ([Donut](../model_doc/donut))
- **Text question answering**: given a long text and a question, answer the question in the text ([Flan-T5](../model_doc/flan-t5))
- **Unconditional image captioning**: Caption the image! ([BLIP](../model_doc/blip))
- **Image question answering**: given an image, answer a question on this image ([VILT](../model_doc/vilt))
- **Image segmentation**: given an image and a prompt, output the segmentation mask of that prompt ([CLIPSeg](../model_doc/clipseg))
- **Speech to text**: given an audio recording of a person talking, transcribe the speech into text ([Whisper](../model_doc/whisper))
- **Text to speech**: convert text to speech ([SpeechT5](../model_doc/speecht5))
- **Zero-shot text classification**: given a text and a list of labels, identify to which label the text corresponds the most ([BART](../model_doc/bart))
- **Text summarization**: summarize a long text in one or a few sentences ([BART](../model_doc/bart))
- **Translation**: translate the text into a given language ([NLLB](../model_doc/nllb))

These tools have an integration in transformers, and can be used manually as well, for example:

```py
from transformers import load_tool

# Load a single tool by name and call it directly, without an agent
tool = load_tool("text-to-speech")
audio = tool("This is a text to speech tool")
```

### Custom tools

While we identify a curated set of tools, we strongly believe that the main value provided by this implementation is
the ability to quickly create and share custom tools.

By pushing the code of a tool to a Hugging Face Space or a model repository, you're then able to leverage the tool
directly with the agent. We've added a few
**transformers-agnostic** tools to the `huggingface-tools` organization:

- **Text downloader**: to download a text from a web URL
- **Text to image**: generate an image according to a prompt, leveraging stable diffusion
- **Image transformation**: modify an image given an initial image and a prompt, leveraging instruct pix2pix stable diffusion

The text-to-image tool we have been using since the beginning is actually a remote tool that lives in
[*huggingface-tools/text-to-image*](https://huggingface.co/spaces/huggingface-tools/text-to-image)! We will
continue releasing such tools on this and other organizations, to further supercharge this implementation.

The agents have by default access to tools that reside on `huggingface-tools`.
We explain how you can write and share your own tools, as well as leverage any custom tool that resides on the Hub, in
the [following guide](custom_tools).

### Leveraging different agents

We showcase here how to use the [bigcode/starcoder](https://huggingface.co/bigcode/starcoder) model as an LLM, but
it isn't the only model available. We also support the OpenAssistant model and OpenAI's davinci models (3.5 and 4).

We're planning on supporting local language models in a later version.

The tools defined in this implementation are agnostic to the agent used; we are showcasing the agents that work with
our prompts below, but the tools can also be used with Langchain, Minichain, or any other Agent-based library.

#### Example code for the OpenAssistant model

```py
from transformers import HfAgent

# Same Inference API URL pattern as the StarCoder example above
agent = HfAgent(
    url_endpoint="https://api-inference.huggingface.co/models/OpenAssistant/oasst-sft-1-pythia-12b",
    token="<HF_TOKEN>",
)
```

#### Example code for OpenAI models

```py
from transformers import OpenAiAgent

agent = OpenAiAgent(model="text-davinci-003", api_key="<API_KEY>")
```

### Code generation

So far we have shown how to use the agents to perform actions for you. However, the agent is really only generating code
that we then execute using a very restricted Python interpreter. In case you would like to use the code generated in
a different setting, the agent can be prompted to return the code, along with tool definition and accurate imports.

For example, the following instruction

```python
agent.run("Draw me a picture of rivers and lakes", return_code=True)
```

returns the following code

```python
from transformers import load_tool

image_generator = load_tool("huggingface-tools/text-to-image")

image = image_generator(prompt="rivers and lakes")
```

that you can then modify and execute yourself.
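
Before executing returned code, it can be worth checking which tools it loads. The sketch below does this with the standard library's `ast` module only; `list_loaded_tools` is a hypothetical helper written for this example, and the code string is copied from the example above rather than produced by a live agent.

```python
import ast

# Code string as shown above; in practice it would come from `return_code=True`.
generated_code = '''
from transformers import load_tool

image_generator = load_tool("huggingface-tools/text-to-image")

image = image_generator(prompt="rivers and lakes")
'''


def list_loaded_tools(code):
    """Return the string arguments of every `load_tool(...)` call found in `code`."""
    loaded = []
    for node in ast.walk(ast.parse(code)):
        if (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Name)
            and node.func.id == "load_tool"
            and node.args
            and isinstance(node.args[0], ast.Constant)
            and isinstance(node.args[0].value, str)
        ):
            loaded.append(node.args[0].value)
    return loaded


print(list_loaded_tools(generated_code))  # ['huggingface-tools/text-to-image']
```

The code is only parsed here, never executed, so the check itself does not load any tool.
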