Enable ONNX/ONNXRuntime optimizations through converter script (#6131)
* Add onnxruntime transformers optimization support Signed-off-by: Morgan Funtowicz <morgan@huggingface.co> * Added Optimization section in ONNX/ONNXRuntime documentation. Signed-off-by: Morgan Funtowicz <morgan@huggingface.co> * Improve note reference Signed-off-by: Morgan Funtowicz <morgan@huggingface.co> * Fixing imports order. Signed-off-by: Morgan Funtowicz <morgan@huggingface.co> * Add warning about different level of optimization between torch and tf export. Signed-off-by: Morgan Funtowicz <morgan@huggingface.co> * Address @LysandreJik wording suggestion Co-authored-by: Lysandre Debut <lysandre@huggingface.co> * Address @LysandreJik wording suggestion Co-authored-by: Lysandre Debut <lysandre@huggingface.co> * Always optimize model before quantization for maximum performances. Signed-off-by: Morgan Funtowicz <morgan@huggingface.co> * Address comments on the documentation. Signed-off-by: Morgan Funtowicz <morgan@huggingface.co> * Improve TensorFlow optimization message as suggested by @yufenglee Signed-off-by: Morgan Funtowicz <morgan@huggingface.co> * Removed --optimize parameter Signed-off-by: Morgan Funtowicz <morgan@huggingface.co> * Warn the user about current quantization limitation when model is larger than 2GB. Signed-off-by: Morgan Funtowicz <morgan@huggingface.co> * Trigger CI for last check * Small change in print for the optimization section. Signed-off-by: Morgan Funtowicz <morgan@huggingface.co> Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
This commit is contained in:
@@ -5,7 +5,7 @@ Exporting transformers models
|
||||
ONNX / ONNXRuntime
|
||||
==============================================
|
||||
|
||||
Projects ONNX (Open Neural Network eXchange) and ONNXRuntime (ORT) are part of an effort from leading industries in the AI field
|
||||
Projects `ONNX (Open Neural Network eXchange) <http://onnx.ai>`_ and `ONNXRuntime (ORT) <https://microsoft.github.io/onnxruntime/>`_ are part of an effort from leading industries in the AI field
|
||||
to provide a unified and community-driven format to store and, by extension, efficiently execute neural network leveraging a variety
|
||||
of hardware and dedicated optimizations.
|
||||
|
||||
@@ -34,9 +34,36 @@ The conversion tool works for both PyTorch and Tensorflow models and ensures:
|
||||
|
||||
Also, the conversion tool supports different options which let you tune the behavior of the generated model:
|
||||
|
||||
* Change the target opset version of the generated model: More recent opset generally supports more operator and enables faster inference.
|
||||
* Export pipeline specific prediction heads: Allow to export model along with its task-specific prediction head(s).
|
||||
* Use the external data format (PyTorch only): Lets you export model which size is above 2Gb (`More info <https://github.com/pytorch/pytorch/pull/33062>`_).
|
||||
* **Change the target opset version of the generated model.** (More recent opset generally supports more operators and enables faster inference)
|
||||
|
||||
* **Export pipeline-specific prediction heads.** (Allow to export model along with its task-specific prediction head(s))
|
||||
|
||||
* **Use the external data format (PyTorch only).** (Lets you export model which size is above 2Gb (`More info <https://github.com/pytorch/pytorch/pull/33062>`_))
|
||||
|
||||
|
||||
Optimizations
|
||||
------------------------------------------------
|
||||
|
||||
ONNXRuntime includes some transformers-specific transformations to leverage optimized operations in the graph.
|
||||
Below are some of the operators which can be enabled to speed up inference through ONNXRuntime (*see note below*):
|
||||
|
||||
* Constant folding
|
||||
* Attention Layer fusing
|
||||
* Skip connection LayerNormalization fusing
|
||||
* FastGeLU approximation
|
||||
|
||||
|
||||
Fortunately, you can let ONNXRuntime find all the possible optimized operators for you. Simply add ``--optimize``
|
||||
when exporting your model through ``convert_graph_to_onnx.py``.
|
||||
|
||||
Example:
|
||||
|
||||
.. code-block:: bash
|
||||
|
||||
python convert_graph_to_onnx.py --framework <pt, tf> --model bert-base-cased --optimize bert-base-cased.onnx
|
||||
|
||||
.. note::
|
||||
For more information about the optimizations enabled by ONNXRuntime, please have a look at the (`ONNXRuntime Github <https://github.com/microsoft/onnxruntime/tree/master/onnxruntime/python/tools/transformers>`_)
|
||||
|
||||
Quantization
|
||||
------------------------------------------------
|
||||
@@ -85,6 +112,8 @@ Example of quantized BERT model export:
|
||||
above command will contain the original ONNX model storing `float32` weights.
|
||||
The second one, with ``-quantized`` suffix, will hold the quantized parameters.
|
||||
|
||||
.. note::
|
||||
The quantization export gives the best performances when used in combination with ``--optimize``.
|
||||
|
||||
TorchScript
|
||||
=======================================
|
||||
|
||||
Reference in New Issue
Block a user