Added capability to quantize a model while exporting through ONNX. (#6089)

* Added capability to quantize a model while exporting through ONNX.
  We do not support multiple extensions
* Reformat files
* More quality
* Ensure test_generate_identified_name compares the same object types
* Added documentation everywhere on ONNX exporter
* Use pathlib.Path instead of plain-old string
* Use f-string everywhere
* Use the correct parameters for black formatting
* Use Python 3 super() style.
* Use packaging.version to ensure installed onnxruntime version match requirements
* Fixing imports sorting order.
* Missing raise(s)
* Added quantization documentation
* Fix some spelling.
* Fix bad list header format

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
2020-07-29 13:21:29 +02:00
parent 25de74ccfe
commit 6c002853a6
3 changed files with 288 additions and 46 deletions
--- a/docs/source/serialization.rst
+++ b/docs/source/serialization.rst
@@ -21,9 +21,10 @@ The following command shows how easy it is to export a BERT model from the libra
    python convert_graph_to_onnx.py --framework <pt, tf> --model bert-base-cased bert-base-cased.onnx

 The conversion tool works for both PyTorch and Tensorflow models and ensures:
-    * The model and its weights are correctly initialized from the Hugging Face model hub or a local checkpoint.
-    * The inputs and outputs are correctly generated to their ONNX counterpart.
-    * The generated model can be correctly loaded through onnxruntime.
+
+* The model and its weights are correctly initialized from the Hugging Face model hub or a local checkpoint.
+* The inputs and outputs are correctly mapped to their ONNX counterparts.
+* The generated model can be correctly loaded through onnxruntime.

 .. note::
    Currently, inputs and outputs are always exported with dynamic sequence axes preventing some optimizations
@@ -32,9 +33,57 @@ The conversion tool works for both PyTorch and Tensorflow models and ensures:


 Also, the conversion tool supports different options which let you tune the behavior of the generated model:
-    * Change the target opset version of the generated model: More recent opset generally supports more operator and enables faster inference.
-    * Export pipeline specific prediction heads: Allow to export model along with its task-specific prediction head(s).
-    * Use the external data format (PyTorch only): Lets you export model which size is above 2Gb (`More info <https://github.com/pytorch/pytorch/pull/33062>`_).
+
+* Change the target opset version of the generated model: More recent opsets generally support more operators and enable faster inference.
+* Export pipeline-specific prediction heads: Lets you export the model along with its task-specific prediction head(s).
+* Use the external data format (PyTorch only): Lets you export a model whose size is above 2 GB (`More info <https://github.com/pytorch/pytorch/pull/33062>`_).
+
+Quantization
+------------------------------------------------
+
+The ONNX exporter supports generating a quantized version of the model to allow more efficient inference.
+
+Quantization works by converting the memory representation of the parameters in the neural network
+to a compact integer format. By default, the weights of a neural network are stored as single-precision floats (`float32`),
+which can express a wide range of floating-point numbers with decent precision.
+These properties are especially useful during training, where you want a fine-grained representation.
+
+On the other hand, after the training phase, it has been shown that one can greatly reduce the range and the precision of `float32` numbers
+without significantly changing the performance of the neural network.
+
+More technically, `float32` parameters are converted to a type requiring fewer bits to represent each number, thus reducing
+the overall size of the model. Here, we map `float32` values to `int8` values (a single-byte integer representation)
+according to the following formula:
+
+.. math::
+    y_{float32} = scale * (x_{int8} - zero\_point)
+
+.. note::
+    The quantization process will infer the parameters `scale` and `zero_point` from the neural network parameters.
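+
+The mapping above can be sketched in plain Python. This is only an illustration, not the code used by the exporter: it
+assumes the affine convention of the ONNX ``DequantizeLinear`` operator, where the zero point is an integer subtracted
+before scaling (``scale * (x_int8 - zero_point)``), and the helper names ``compute_qparams``, ``quantize_values`` and
+``dequantize_values`` are hypothetical.
+
```python
from typing import List, Tuple

QMIN, QMAX = -128, 127  # representable range of a signed 8-bit integer


def compute_qparams(values: List[float]) -> Tuple[float, int]:
    """Infer (scale, zero_point) so the observed value range maps onto [QMIN, QMAX]."""
    # Widen the range to include 0.0 so that zero is represented exactly.
    low, high = min(min(values), 0.0), max(max(values), 0.0)
    # Fall back to a scale of 1.0 when every value is zero (assumption of this sketch).
    scale = (high - low) / (QMAX - QMIN) or 1.0
    zero_point = max(QMIN, min(QMAX, int(round(QMIN - low / scale))))
    return scale, zero_point


def quantize_values(values: List[float], scale: float, zero_point: int) -> List[int]:
    """float -> int8: divide by scale, round, shift by zero_point, then clamp."""
    return [max(QMIN, min(QMAX, int(round(v / scale)) + zero_point)) for v in values]


def dequantize_values(q_values: List[int], scale: float, zero_point: int) -> List[float]:
    """int8 -> float: scale * (q - zero_point)."""
    return [scale * (q - zero_point) for q in q_values]


weights = [-1.0, -0.25, 0.0, 0.5, 1.0]
scale, zero_point = compute_qparams(weights)
q_weights = quantize_values(weights, scale, zero_point)
restored = dequantize_values(q_weights, scale, zero_point)
```
+
+Round-tripping a list of weights through these helpers should reproduce each value to within about one ``scale`` step.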
+
+Using small integers has several advantages when it comes to inference:
+
+* Storing 8 bits per parameter instead of the 32 bits of `float32` reduces the size of the model and makes it load faster.
+* Integer operations can execute up to an order of magnitude faster on modern hardware.
+* Integer operations require less power to do the computations.
+
+To convert a transformers model to ONNX IR with quantized weights, you just need to specify ``--quantize``
+when using ``convert_graph_to_onnx.py``. You can also have a look at the ``quantize()`` utility function in this
+same script file.
+
+Example of quantized BERT model export:
+
+.. code-block:: bash
+
+    python convert_graph_to_onnx.py --framework <pt, tf> --model bert-base-cased --quantize bert-base-cased.onnx
+
+.. note::
+    Quantization support requires ONNX Runtime >= 1.4.0
+
+.. note::
+    When exporting a quantized model you will end up with two different ONNX files. The one specified at the end of the
+    above command will contain the original ONNX model storing `float32` weights.
+    The second one, with the ``-quantized`` suffix, will hold the quantized parameters.
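+
+As a small illustration, the name of the second file could be derived as follows. This assumes the ``-quantized``
+suffix is inserted just before the file extension, which you should check against the files the script actually
+writes; ``quantized_path`` is a hypothetical helper and not part of ``convert_graph_to_onnx.py``.
+
```python
from pathlib import Path


def quantized_path(output: Path, identifier: str = "-quantized") -> Path:
    """Return `output` with `identifier` appended to its stem, keeping the extension."""
    return output.with_name(f"{output.stem}{identifier}{output.suffix}")


original = Path("bert-base-cased.onnx")
quantized = quantized_path(original)
print(original.name, quantized.name)
```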


 TorchScript