Enable ONNX/ONNXRuntime optimizations through converter script (#6131)

* Add onnxruntime transformers optimization support
  Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Added Optimization section in ONNX/ONNXRuntime documentation.
  Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Improve note reference
  Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Fixing imports order.
  Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Add warning about different level of optimization between torch and tf export.
  Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Address @LysandreJik wording suggestion
  Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
* Address @LysandreJik wording suggestion
  Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
* Always optimize model before quantization for maximum performances.
  Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Address comments on the documentation.
  Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Improve TensorFlow optimization message as suggested by @yufenglee
  Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Removed --optimize parameter
  Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Warn the user about current quantization limitation when model is larger than 2GB.
  Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
* Trigger CI for last check
* Small change in print for the optimization section.
  Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
2020-07-31 09:45:13 +02:00
parent c0b93a1c7a
commit 7231f7b503
2 changed files with 95 additions and 18 deletions
--- a/docs/source/serialization.rst
+++ b/docs/source/serialization.rst
@@ -5,7 +5,7 @@ Exporting transformers models
 ONNX / ONNXRuntime
 ==============================================

-Projects ONNX (Open Neural Network eXchange) and ONNXRuntime (ORT) are part of an effort from leading industries in the AI field
+Projects `ONNX (Open Neural Network eXchange) <http://onnx.ai>`_ and `ONNXRuntime (ORT) <https://microsoft.github.io/onnxruntime/>`_ are part of an effort from leading industries in the AI field
 to provide a unified and community-driven format to store and, by extension, efficiently execute neural network leveraging a variety
 of hardware and dedicated optimizations.

@@ -34,9 +34,36 @@ The conversion tool works for both PyTorch and Tensorflow models and ensures:

 Also, the conversion tool supports different options which let you tune the behavior of the generated model:

-* Change the target opset version of the generated model: More recent opset generally supports more operator and enables faster inference.
-* Export pipeline specific prediction heads: Allow to export model along with its task-specific prediction head(s).
-* Use the external data format (PyTorch only): Lets you export model which size is above 2Gb (`More info <https://github.com/pytorch/pytorch/pull/33062>`_).
+* **Change the target opset version of the generated model.**  (More recent opsets generally support more operators and enable faster inference)
+
+* **Export pipeline-specific prediction heads.**  (Lets you export the model along with its task-specific prediction head(s))
+
+* **Use the external data format (PyTorch only).**  (Lets you export a model whose size is above 2GB (`More info <https://github.com/pytorch/pytorch/pull/33062>`_))
+
+
+Optimizations
+------------------------------------------------
+
+ONNXRuntime includes some transformers-specific graph transformations that replace parts of the graph with optimized operations.
+Below are some of the optimizations which can be enabled to speed up inference through ONNXRuntime (*see note below*):
+
+* Constant folding
+* Attention Layer fusing
+* Skip connection LayerNormalization fusing
+* FastGeLU approximation
+
+
+Fortunately, you can let ONNXRuntime find all the possible optimized operators for you. Simply add ``--optimize``
+when exporting your model through ``convert_graph_to_onnx.py``.
+
+Example:
+
+.. code-block:: bash
+
+    python convert_graph_to_onnx.py --framework <pt, tf> --model bert-base-cased --optimize bert-base-cased.onnx
+
+.. note::
+    For more information about the optimizations enabled by ONNXRuntime, please have a look at the `ONNXRuntime GitHub <https://github.com/microsoft/onnxruntime/tree/master/onnxruntime/python/tools/transformers>`_ repository.

 Quantization
 ------------------------------------------------
@@ -85,6 +112,8 @@ Example of quantized BERT model export:
    above command will contain the original ONNX model storing `float32` weights.
    The second one, with ``-quantized`` suffix, will hold the quantized parameters.

+.. note::
+    The quantization export gives the best performance when used in combination with ``--optimize``.

 TorchScript
 =======================================