[docs] Redesign (#31757)
* toctree * not-doctested.txt * collapse sections * feedback * update * rewrite get started sections * fixes * fix * loading models * fix * customize models * share * fix link * contribute part 1 * contribute pt 2 * fix toctree * tokenization pt 1 * Add new model (#32615) * v1 - working version * fix * fix * fix * fix * rename to correct name * fix title * fixup * rename files * fix * add copied from on tests * rename to `FalconMamba` everywhere and fix bugs * fix quantization + accelerate * fix copies * add `torch.compile` support * fix tests * fix tests and add slow tests * copies on config * merge the latest changes * fix tests * add few lines about instruct * Apply suggestions from code review Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com> * fix * fix tests --------- Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com> * "to be not" -> "not to be" (#32636) * "to be not" -> "not to be" * Update sam.md * Update trainer.py * Update modeling_utils.py * Update test_modeling_utils.py * Update test_modeling_utils.py * fix hfoption tag * tokenization pt. 
2 * image processor * fix toctree * backbones * feature extractor * fix file name * processor * update not-doctested * update * make style * fix toctree * revision * make fixup * fix toctree * fix * make style * fix hfoption tag * pipeline * pipeline gradio * pipeline web server * add pipeline * fix toctree * not-doctested * prompting * llm optims * fix toctree * fixes * cache * text generation * fix * chat pipeline * chat stuff * xla * torch.compile * cpu inference * toctree * gpu inference * agents and tools * gguf/tiktoken * finetune * toctree * trainer * trainer pt 2 * optims * optimizers * accelerate * parallelism * fsdp * update * distributed cpu * hardware training * gpu training * gpu training 2 * peft * distrib debug * deepspeed 1 * deepspeed 2 * chat toctree * quant pt 1 * quant pt 2 * fix toctree * fix * fix * quant pt 3 * quant pt 4 * serialization * torchscript * scripts * tpu * review * model addition timeline * modular * more reviews * reviews * fix toctree * reviews reviews * continue reviews * more reviews * modular transformers * more review * zamba2 * fix * all frameworks * pytorch * supported model frameworks * flashattention * rm check_table * not-doctested.txt * rm check_support_list.py * feedback * updates/feedback * review * feedback * fix * update * feedback * updates * update --------- Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com> Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com> Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
@@ -14,69 +14,41 @@ rendered properly in your Markdown viewer.

-->

# Export to ONNX
# ONNX

Deploying 🤗 Transformers models in production environments often requires, or can benefit from, exporting the models into
a serialized format that can be loaded and executed on specialized runtimes and hardware.
[ONNX](http://onnx.ai) is an open standard that defines a common set of operators and a file format to represent deep learning models in different frameworks, including PyTorch and TensorFlow. When a model is exported to ONNX, the operators construct a computational graph (or *intermediate representation*) which represents the flow of data through the model. Standardized operators and data types make it easy to switch between frameworks.
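To make the idea of a computational graph built from standardized operators more concrete, here is a toy sketch in plain Python. It does not use the ONNX library or file format. The `Node` class, the `run_graph` function, and the `OPERATORS` registry are hypothetical names made up for illustration; only the operator names (`MatMul`, `Add`, `Relu`) are borrowed from the ONNX operator set.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


def matmul(x: List[float], w: List[List[float]]) -> List[float]:
    # Multiply a vector of length n by an n x m matrix.
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]


def add(a: List[float], b: List[float]) -> List[float]:
    # Element-wise addition of two vectors.
    return [p + q for p, q in zip(a, b)]


def relu(a: List[float]) -> List[float]:
    # Clamp negative values to zero.
    return [max(0.0, v) for v in a]


# Hypothetical registry: a fixed set of named operators that any
# "runtime" agreeing on these names could implement.
OPERATORS: Dict[str, Callable] = {"MatMul": matmul, "Add": add, "Relu": relu}


@dataclass
class Node:
    op_type: str        # name of a standardized operator
    inputs: List[str]   # names of the values this node reads
    output: str         # name of the value this node produces


def run_graph(nodes: List[Node], values: Dict[str, list]) -> Dict[str, list]:
    # Evaluate the nodes in order; each node reads named values and writes one.
    values = dict(values)
    for node in nodes:
        args = [values[name] for name in node.inputs]
        values[node.output] = OPERATORS[node.op_type](*args)
    return values


# A tiny "model": y = Relu(x @ W + b), described as data rather than code.
graph = [
    Node("MatMul", ["x", "W"], "xw"),
    Node("Add", ["xw", "b"], "z"),
    Node("Relu", ["z"], "y"),
]

result = run_graph(
    graph,
    {"x": [1.0, -2.0], "W": [[1.0, 0.0], [0.0, 1.0]], "b": [0.5, 0.5]},
)
print(result["y"])  # [1.5, 0.0]
```

Because the graph is plain data that refers to operators by name, any runtime that implements the same operators could in principle execute it, which is the property the paragraph above attributes to ONNX.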

🤗 Optimum is an extension of Transformers that enables exporting models from PyTorch or TensorFlow to serialized formats
such as ONNX and TFLite through its `exporters` module. 🤗 Optimum also provides a set of performance optimization tools to train
and run models on targeted hardware with maximum efficiency.
The [Optimum](https://huggingface.co/docs/optimum/index) library exports a model to ONNX with configuration objects which are supported for [many architectures](https://huggingface.co/docs/optimum/exporters/onnx/overview) and can be easily extended. If a model isn't supported, feel free to make a [contribution](https://huggingface.co/docs/optimum/exporters/onnx/usage_guides/contribute) to Optimum.

This guide demonstrates how you can export 🤗 Transformers models to ONNX with 🤗 Optimum. For the guide on exporting models to TFLite,
please refer to the [Export to TFLite page](tflite).
The benefits of exporting to ONNX include the following.

## Export to ONNX
- [Graph optimization](https://huggingface.co/docs/optimum/onnxruntime/usage_guides/optimization) and [quantization](https://huggingface.co/docs/optimum/onnxruntime/usage_guides/quantization) for improving inference.
- Use the [`~optimum.onnxruntime.ORTModel`] API to run a model with [ONNX Runtime](https://onnxruntime.ai/).
- Use [optimized inference pipelines](https://huggingface.co/docs/optimum/main/en/onnxruntime/usage_guides/pipelines) for ONNX models.

[ONNX (Open Neural Network eXchange)](http://onnx.ai) is an open standard that defines a common set of operators and a
common file format to represent deep learning models in a wide variety of frameworks, including PyTorch and
TensorFlow. When a model is exported to the ONNX format, these operators are used to
construct a computational graph (often called an _intermediate representation_) which
represents the flow of data through the neural network.
Export a Transformers model to ONNX with the Optimum CLI or the `optimum.onnxruntime` module.

By exposing a graph with standardized operators and data types, ONNX makes it easy to
switch between frameworks. For example, a model trained in PyTorch can be exported to
ONNX format and then imported in TensorFlow (and vice versa).
## Optimum CLI

Once exported to ONNX format, a model can be:
- optimized for inference via techniques such as [graph optimization](https://huggingface.co/docs/optimum/onnxruntime/usage_guides/optimization) and [quantization](https://huggingface.co/docs/optimum/onnxruntime/usage_guides/quantization).
- run with ONNX Runtime via [`ORTModelForXXX` classes](https://huggingface.co/docs/optimum/onnxruntime/package_reference/modeling_ort),
which follow the same `AutoModel` API as the one you are used to in 🤗 Transformers.
- run with [optimized inference pipelines](https://huggingface.co/docs/optimum/main/en/onnxruntime/usage_guides/pipelines),
which have the same API as the [`pipeline`] function in 🤗 Transformers.

🤗 Optimum provides support for the ONNX export by leveraging configuration objects. These configuration objects come
ready-made for a number of model architectures, and are designed to be easily extendable to other architectures.

For the list of ready-made configurations, please refer to [🤗 Optimum documentation](https://huggingface.co/docs/optimum/exporters/onnx/overview).

There are two ways to export a 🤗 Transformers model to ONNX; both are shown here:

- export with 🤗 Optimum via CLI.
- export with 🤗 Optimum with `optimum.onnxruntime`.

### Exporting a 🤗 Transformers model to ONNX with CLI

To export a 🤗 Transformers model to ONNX, first install an extra dependency:
Run the command below to install Optimum and the [exporters](https://huggingface.co/docs/optimum/exporters/overview) module.

```bash
pip install optimum[exporters]
```

To check out all available arguments, refer to the [🤗 Optimum docs](https://huggingface.co/docs/optimum/exporters/onnx/usage_guides/export_a_model#exporting-a-model-to-onnx-using-the-cli),
or view help in command line:
> [!TIP]
> Refer to the [Export a model to ONNX with optimum.exporters.onnx](https://huggingface.co/docs/optimum/exporters/onnx/usage_guides/export_a_model#exporting-a-model-to-onnx-using-the-cli) guide for all available arguments, or list them with the command below.
> ```bash
> optimum-cli export onnx --help
> ```

```bash
optimum-cli export onnx --help
```

To export a model's checkpoint from the 🤗 Hub, for example, `distilbert/distilbert-base-uncased-distilled-squad`, run the following command:
Set the `--model` argument to export a PyTorch or TensorFlow model from the Hub.

```bash
optimum-cli export onnx --model distilbert/distilbert-base-uncased-distilled-squad distilbert_base_uncased_squad_onnx/
```

You should see the logs indicating progress and showing where the resulting `model.onnx` is saved, like this:
You should see logs indicating the progress and showing where the resulting `model.onnx` is saved.

```bash
Validating ONNX model distilbert_base_uncased_squad_onnx/model.onnx...
@@ -90,20 +62,13 @@ Validating ONNX model distilbert_base_uncased_squad_onnx/model.onnx...
The ONNX export succeeded and the exported model was saved at: distilbert_base_uncased_squad_onnx
```
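Besides reading the logs, a script can check that the export produced a file before moving on. The sketch below uses only the standard library and assumes the output directory and the `model.onnx` file name shown in the logs above; `find_onnx_files` and `export_succeeded` are hypothetical helper names, not part of Optimum.

```python
from pathlib import Path
from typing import List


def find_onnx_files(output_dir: str) -> List[Path]:
    # List every *.onnx file directly inside the export directory.
    # A missing directory simply yields an empty list.
    return sorted(Path(output_dir).glob("*.onnx"))


def export_succeeded(output_dir: str) -> bool:
    # Assumption: a successful export writes a file named `model.onnx`,
    # as in the log output above.
    return any(path.name == "model.onnx" for path in find_onnx_files(output_dir))


if __name__ == "__main__":
    # Directory name taken from the export command above.
    print(export_succeeded("distilbert_base_uncased_squad_onnx"))
```

This only confirms that the file exists. It does not validate the graph itself, which the exporter already does according to the `Validating ONNX model` log line.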

The example above illustrates exporting a checkpoint from 🤗 Hub. When exporting a local model, first make sure that you
saved both the model's weights and tokenizer files in the same directory (`local_path`). When using CLI, pass the
`local_path` to the `--model` argument instead of the checkpoint name on 🤗 Hub and provide the `--task` argument.
You can review the list of supported tasks in the [🤗 Optimum documentation](https://huggingface.co/docs/optimum/exporters/task_manager).
If the `--task` argument is not provided, it will default to the model architecture without any task-specific head.
For local models, make sure the model weights and tokenizer files are saved in the same directory, for example `local_path`. Pass the directory to the `--model` argument and use `--task` to indicate the [task](https://huggingface.co/docs/optimum/exporters/task_manager) a model can perform. If `--task` isn't provided, the model architecture without a task-specific head is used.

```bash
optimum-cli export onnx --model local_path --task question-answering distilbert_base_uncased_squad_onnx/
```
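When the export is driven from a script rather than typed by hand, the same command line can be assembled programmatically. The following is a minimal sketch using only the standard library. `build_export_command` is a hypothetical helper, and it relies only on the `--model` and `--task` arguments and the positional output directory shown in the commands above.

```python
import shlex
from typing import List, Optional


def build_export_command(
    model: str, output_dir: str, task: Optional[str] = None
) -> List[str]:
    # `model` is either a Hub checkpoint name or a local directory.
    command = ["optimum-cli", "export", "onnx", "--model", model]
    # `--task` is optional; when omitted, the exporter falls back to its default.
    if task is not None:
        command += ["--task", task]
    # The output directory is the final positional argument.
    command.append(output_dir)
    return command


hub_command = build_export_command(
    "distilbert/distilbert-base-uncased-distilled-squad",
    "distilbert_base_uncased_squad_onnx/",
)
local_command = build_export_command(
    "local_path",
    "distilbert_base_uncased_squad_onnx/",
    task="question-answering",
)

# Print the commands in a copy-pasteable form.
print(shlex.join(hub_command))
print(shlex.join(local_command))
```

To actually run the export, the list can be passed to `subprocess.run(local_command, check=True)`, which requires `optimum-cli` to be installed and on the `PATH`.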

The resulting `model.onnx` file can then be run on one of the [many
accelerators](https://onnx.ai/supported-tools.html#deployModel) that support the ONNX
standard. For example, we can load and run the model with [ONNX
Runtime](https://onnxruntime.ai/) as follows:
The `model.onnx` file can be deployed with any [accelerator](https://onnx.ai/supported-tools.html#deployModel) that supports ONNX. The example below demonstrates loading and running a model with ONNX Runtime.

```python
>>> from transformers import AutoTokenizer
@@ -115,16 +80,9 @@ Runtime](https://onnxruntime.ai/) as follows:
>>> outputs = model(**inputs)
```

The process is identical for TensorFlow checkpoints on the Hub. For instance, here's how you would
export a pure TensorFlow checkpoint from the [Keras organization](https://huggingface.co/keras-io):
## optimum.onnxruntime

```bash
optimum-cli export onnx --model keras-io/transformers-qa distilbert_base_cased_squad_onnx/
```

### Exporting a 🤗 Transformers model to ONNX with `optimum.onnxruntime`

Alternative to CLI, you can export a 🤗 Transformers model to ONNX programmatically like so:
The `optimum.onnxruntime` module supports programmatically exporting a Transformers model. Instantiate a [`~optimum.onnxruntime.ORTModel`] for a task and set `export=True`. Use [`~OptimizedModel.save_pretrained`] to save the ONNX model.

```python
>>> from optimum.onnxruntime import ORTModelForSequenceClassification
@@ -133,78 +91,9 @@ Alternative to CLI, you can export a 🤗 Transformers model to ONNX programmati
>>> model_checkpoint = "distilbert/distilbert-base-uncased-distilled-squad"
>>> save_directory = "onnx/"

>>> # Load a model from transformers and export it to ONNX
>>> ort_model = ORTModelForSequenceClassification.from_pretrained(model_checkpoint, export=True)
>>> tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

>>> # Save the onnx model and tokenizer
>>> ort_model.save_pretrained(save_directory)
>>> tokenizer.save_pretrained(save_directory)
```

### Exporting a model for an unsupported architecture

If you wish to contribute by adding support for a model that cannot be currently exported, you should first check if it is
supported in [`optimum.exporters.onnx`](https://huggingface.co/docs/optimum/exporters/onnx/overview),
and if it is not, [contribute to 🤗 Optimum](https://huggingface.co/docs/optimum/exporters/onnx/usage_guides/contribute)
directly.

### Exporting a model with `transformers.onnx`

<Tip warning={true}>

`transformers.onnx` is no longer maintained. Please export models with 🤗 Optimum as described above. This section will be removed in a future version.

</Tip>

To export a 🤗 Transformers model to ONNX with `transformers.onnx`, install extra dependencies:

```bash
pip install transformers[onnx]
```

Use `transformers.onnx` package as a Python module to export a checkpoint using a ready-made configuration:

```bash
python -m transformers.onnx --model=distilbert/distilbert-base-uncased onnx/
```

This exports an ONNX graph of the checkpoint defined by the `--model` argument. Pass any checkpoint on the 🤗 Hub or one that's stored locally.
The resulting `model.onnx` file can then be run on one of the many accelerators that support the ONNX standard. For example,
load and run the model with ONNX Runtime as follows:

```python
>>> from transformers import AutoTokenizer
>>> from onnxruntime import InferenceSession

>>> tokenizer = AutoTokenizer.from_pretrained("distilbert/distilbert-base-uncased")
>>> session = InferenceSession("onnx/model.onnx")
>>> # ONNX Runtime expects NumPy arrays as input
>>> inputs = tokenizer("Using DistilBERT with ONNX Runtime!", return_tensors="np")
>>> outputs = session.run(output_names=["last_hidden_state"], input_feed=dict(inputs))
```

The required output names (like `["last_hidden_state"]`) can be obtained by taking a look at the ONNX configuration of
each model. For example, for DistilBERT we have:

```python
>>> from transformers.models.distilbert import DistilBertConfig, DistilBertOnnxConfig

>>> config = DistilBertConfig()
>>> onnx_config = DistilBertOnnxConfig(config)
>>> print(list(onnx_config.outputs.keys()))
["last_hidden_state"]
```

The process is identical for TensorFlow checkpoints on the Hub. For example, export a pure TensorFlow checkpoint like so:

```bash
python -m transformers.onnx --model=keras-io/transformers-qa onnx/
```

To export a model that's stored locally, save the model's weights and tokenizer files in the same directory (e.g. `local-pt-checkpoint`),
then export it to ONNX by pointing the `--model` argument of the `transformers.onnx` package to the desired directory:

```bash
python -m transformers.onnx --model=local-pt-checkpoint onnx/
```