Breakup export guide (#19271)

* split onnx and torchscript docs * make style * apply reviews
2022-10-03 13:18:29 -07:00
parent 18c06208c4
commit 68f50f3453
3 changed files with 347 additions and 327 deletions
--- a/docs/source/en/_toctree.yml
+++ b/docs/source/en/_toctree.yml
@@ -33,7 +33,9 @@
    - local: converting_tensorflow_models
      title: Converting from TensorFlow checkpoints
    - local: serialization
-      title: Export 🤗 Transformers models
+      title: Export to ONNX
    - local: torchscript
      title: Export to TorchScript
    - local: troubleshooting
      title: Troubleshoot
    title: General usage
--- a/docs/source/en/serialization.mdx
+++ b/docs/source/en/serialization.mdx
@@ -10,36 +10,36 @@ an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express o
 specific language governing permissions and limitations under the License.
 -->
-# Export 🤗 Transformers Models
+# Export to ONNX
-If you need to deploy 🤗 Transformers models in production environments, we
+If you need to deploy 🤗 Transformers models in production environments, we recommend
-recommend exporting them to a serialized format that can be loaded and executed
+exporting them to a serialized format that can be loaded and executed on specialized
-on specialized runtimes and hardware. In this guide, we'll show you how to
+runtimes and hardware. In this guide, we'll show you how to export 🤗 Transformers
-export 🤗 Transformers models in two widely used formats: ONNX and TorchScript.
+models to [ONNX (Open Neural Network eXchange)](http://onnx.ai).
-Once exported, a model can optimized for inference via techniques such as
+<Tip>
-quantization and pruning. If you are interested in optimizing your models to run
+
-with maximum efficiency, check out the [🤗 Optimum
+Once exported, a model can be optimized for inference via techniques such as
 quantization and pruning. If you are interested in optimizing your models to run with
 maximum efficiency, check out the [🤗 Optimum
 library](https://github.com/huggingface/optimum).
-## ONNX
+</Tip>
-The [ONNX (Open Neural Network eXchange)](http://onnx.ai) project is an open
+ONNX is an open standard that defines a common set of operators and a common file format
-standard that defines a common set of operators and a common file format to
+to represent deep learning models in a wide variety of frameworks, including PyTorch and
-represent deep learning models in a wide variety of frameworks, including
+TensorFlow. When a model is exported to the ONNX format, these operators are used to
-PyTorch and TensorFlow. When a model is exported to the ONNX format, these
+construct a computational graph (often called an _intermediate representation_) which
-operators are used to construct a computational graph (often called an
+represents the flow of data through the neural network.
 _intermediate representation_) which represents the flow of data through the
 neural network.
-By exposing a graph with standardized operators and data types, ONNX makes it
+By exposing a graph with standardized operators and data types, ONNX makes it easy to
-easy to switch between frameworks. For example, a model trained in PyTorch can
+switch between frameworks. For example, a model trained in PyTorch can be exported to
-be exported to ONNX format and then imported in TensorFlow (and vice versa).
+ONNX format and then imported in TensorFlow (and vice versa).
-🤗 Transformers provides a `transformers.onnx` package that enables you to
+🤗 Transformers provides a [`transformers.onnx`](main_classes/onnx) package that enables
-convert model checkpoints to an ONNX graph by leveraging configuration objects.
+you to convert model checkpoints to an ONNX graph by leveraging configuration objects.
-These configuration objects come ready made for a number of model architectures,
+These configuration objects come ready made for a number of model architectures, and are
-and are designed to be easily extendable to other architectures.
+designed to be easily extendable to other architectures.
 Ready-made configurations include the following architectures:
@@ -106,10 +106,10 @@ In the next two sections, we'll show you how to:
 * Export a supported model using the `transformers.onnx` package.
 * Export a custom model for an unsupported architecture.
-### Exporting a model to ONNX
+## Exporting a model to ONNX
-To export a 🤗 Transformers model to ONNX, you'll first need to install some
+To export a 🤗 Transformers model to ONNX, you'll first need to install some extra
-extra dependencies:
+dependencies:
 ```bash
 pip install transformers[onnx]
@@ -141,7 +141,7 @@ Exporting a checkpoint using a ready-made configuration can be done as follows:
 python -m transformers.onnx --model=distilbert-base-uncased onnx/
 ```
-which should show the following logs:
+You should see the following logs:
 ```bash
 Validating ONNX model...
@@ -152,13 +152,13 @@ Validating ONNX model...
 All good, model saved at: onnx/model.onnx
 ```
-This exports an ONNX graph of the checkpoint defined by the `--model` argument.
+This exports an ONNX graph of the checkpoint defined by the `--model` argument. In this
-In this example it is `distilbert-base-uncased`, but it can be any checkpoint on
+example, it is `distilbert-base-uncased`, but it can be any checkpoint on the Hugging
-the Hugging Face Hub or one that's stored locally.
+Face Hub or one that's stored locally.
 The resulting `model.onnx` file can then be run on one of the [many
-accelerators](https://onnx.ai/supported-tools.html#deployModel) that support the
+accelerators](https://onnx.ai/supported-tools.html#deployModel) that support the ONNX
-ONNX standard. For example, we can load and run the model with [ONNX
+standard. For example, we can load and run the model with [ONNX
 Runtime](https://onnxruntime.ai/) as follows:
 ```python
@@ -172,9 +172,8 @@ Runtime](https://onnxruntime.ai/) as follows:
 >>> outputs = session.run(output_names=["last_hidden_state"], input_feed=dict(inputs))
 ```
-The required output names (i.e. `["last_hidden_state"]`) can be obtained by
+The required output names (like `["last_hidden_state"]`) can be obtained by taking a
-taking a look at the ONNX configuration of each model. For example, for
+look at the ONNX configuration of each model. For example, for DistilBERT we have:
 DistilBERT we have:
 ```python
 >>> from transformers.models.distilbert import DistilBertConfig, DistilBertOnnxConfig
@@ -185,20 +184,19 @@ DistilBERT we have:
 ["last_hidden_state"]
 ```
-The process is identical for TensorFlow checkpoints on the Hub. For example, we
+The process is identical for TensorFlow checkpoints on the Hub. For example, we can
-can export a pure TensorFlow checkpoint from the [Keras
+export a pure TensorFlow checkpoint from the [Keras
 organization](https://huggingface.co/keras-io) as follows:
 ```bash
 python -m transformers.onnx --model=keras-io/transformers-qa onnx/
 ```
-To export a model that's stored locally, you'll need to have the model's weights
+To export a model that's stored locally, you'll need to have the model's weights and
-and tokenizer files stored in a directory. For example, we can load and save a
+tokenizer files stored in a directory. For example, we can load and save a checkpoint as
-checkpoint as follows:
+follows:
-<frameworkcontent>
+<frameworkcontent> <pt>
 <pt>
 ```python
 >>> from transformers import AutoTokenizer, AutoModelForSequenceClassification
@@ -216,8 +214,7 @@ argument of the `transformers.onnx` package to the desired directory:
 ```bash
 python -m transformers.onnx --model=local-pt-checkpoint onnx/
 ```
-</pt>
+</pt> <tf>
 <tf>
 ```python
 >>> from transformers import AutoTokenizer, TFAutoModelForSequenceClassification
@@ -235,14 +232,13 @@ argument of the `transformers.onnx` package to the desired directory:
 ```bash
 python -m transformers.onnx --model=local-tf-checkpoint onnx/
 ```
-</tf>
+</tf> </frameworkcontent>
 </frameworkcontent>
-### Selecting features for different model topologies
+## Selecting features for different model tasks
-Each ready-made configuration comes with a set of _features_ that enable you to
+Each ready-made configuration comes with a set of _features_ that enable you to export
-export models for different types of topologies or tasks. As shown in the table
+models for different types of tasks. As shown in the table below, each feature is
-below, each feature is associated with a different auto class:
+associated with a different `AutoClass`:
 | Feature                              | Auto Class                           |
 | ------------------------------------ | ------------------------------------ |
@@ -255,7 +251,7 @@ below, each feature is associated with a different auto class:
 | `token-classification`               | `AutoModelForTokenClassification`    |
 For each configuration, you can find the list of supported features via the
-`FeaturesManager`. For example, for DistilBERT we have:
+[`~transformers.onnx.FeaturesManager`]. For example, for DistilBERT we have:
 ```python
 >>> from transformers.onnx.features import FeaturesManager
@@ -266,15 +262,15 @@ For each configuration, you can find the list of supported features via the
 ```
 You can then pass one of these features to the `--feature` argument in the
-`transformers.onnx` package. For example, to export a text-classification model
+`transformers.onnx` package. For example, to export a text-classification model we can
-we can pick a fine-tuned model from the Hub and run:
+pick a fine-tuned model from the Hub and run:
 ```bash
 python -m transformers.onnx --model=distilbert-base-uncased-finetuned-sst-2-english \
                            --feature=sequence-classification onnx/
 ```
-which will display the following logs:
+This displays the following logs:
 ```bash
 Validating ONNX model...
@@ -285,37 +281,35 @@ Validating ONNX model...
 All good, model saved at: onnx/model.onnx
 ```
-Notice that in this case, the output names from the fine-tuned model are
+Notice that in this case, the output names from the fine-tuned model are `logits`
-`logits` instead of the `last_hidden_state` we saw with the
+instead of the `last_hidden_state` we saw with the `distilbert-base-uncased` checkpoint
-`distilbert-base-uncased` checkpoint earlier. This is expected since the
+earlier. This is expected since the fine-tuned model has a sequence classification head.
 fine-tuned model has a sequence classification head.
 <Tip>
-The features that have a `with-past` suffix (e.g. `causal-lm-with-past`)
+The features that have a `with-past` suffix (like `causal-lm-with-past`) correspond to
-correspond to model topologies with precomputed hidden states (key and values
+model classes with precomputed hidden states (key and values in the attention blocks)
-in the attention blocks) that can be used for fast autoregressive decoding.
+that can be used for fast autoregressive decoding.
 </Tip>
-### Exporting a model for an unsupported architecture
+## Exporting a model for an unsupported architecture
-If you wish to export a model whose architecture is not natively supported by
+If you wish to export a model whose architecture is not natively supported by the
-the library, there are three main steps to follow:
+library, there are three main steps to follow:
 1. Implement a custom ONNX configuration.
 2. Export the model to ONNX.
 3. Validate the outputs of the PyTorch and exported models.
-In this section, we'll look at how DistilBERT was implemented to show what's
+In this section, we'll look at how DistilBERT was implemented to show what's involved
-involved with each step.
+with each step.
-#### Implementing a custom ONNX configuration
+### Implementing a custom ONNX configuration
-Let's start with the ONNX configuration object. We provide three abstract
+Let's start with the ONNX configuration object. We provide three abstract classes that
-classes that you should inherit from, depending on the type of model
+you should inherit from, depending on the type of model architecture you wish to export:
 architecture you wish to export:
 * Encoder-based models inherit from [`~onnx.config.OnnxConfig`]
 * Decoder-based models inherit from [`~onnx.config.OnnxConfigWithPast`]
@@ -347,25 +341,24 @@ Since DistilBERT is an encoder-based model, its configuration inherits from
 ...         )
 ```
-Every configuration object must implement the `inputs` property and return a
+Every configuration object must implement the `inputs` property and return a mapping,
-mapping, where each key corresponds to an expected input, and each value
+where each key corresponds to an expected input, and each value indicates the axis of
-indicates the axis of that input. For DistilBERT, we can see that two inputs are
+that input. For DistilBERT, we can see that two inputs are required: `input_ids` and
-required: `input_ids` and `attention_mask`. These inputs have the same shape of
+`attention_mask`. These inputs have the same shape of `(batch_size, sequence_length)`
-`(batch_size, sequence_length)` which is why we see the same axes used in the
+which is why we see the same axes used in the configuration.
 configuration.
 <Tip>
-Notice that `inputs` property for `DistilBertOnnxConfig` returns an
+Notice that `inputs` property for `DistilBertOnnxConfig` returns an `OrderedDict`. This
-`OrderedDict`. This ensures that the inputs are matched with their relative
+ensures that the inputs are matched with their relative position within the
-position within the `PreTrainedModel.forward()` method when tracing the graph.
+`PreTrainedModel.forward()` method when tracing the graph. We recommend using an
-We recommend using an `OrderedDict` for the `inputs` and `outputs` properties
+`OrderedDict` for the `inputs` and `outputs` properties when implementing custom ONNX
-when implementing custom ONNX configurations.
+configurations.
 </Tip>
-Once you have implemented an ONNX configuration, you can instantiate it by
+Once you have implemented an ONNX configuration, you can instantiate it by providing the
-providing the base model's configuration as follows:
+base model's configuration as follows:
 ```python
 >>> from transformers import AutoConfig
@@ -374,8 +367,8 @@ providing the base model's configuration as follows:
 >>> onnx_config = DistilBertOnnxConfig(config)
 ```
-The resulting object has several useful properties. For example you can view the
+The resulting object has several useful properties. For example, you can view the ONNX
-ONNX operator set that will be used during the export:
+operator set that will be used during the export:
 ```python
 >>> print(onnx_config.default_onnx_opset)
@@ -389,15 +382,14 @@ You can also view the outputs associated with the model as follows:
 OrderedDict([("last_hidden_state", {0: "batch", 1: "sequence"})])
 ```
-Notice that the outputs property follows the same structure as the inputs; it
+Notice that the outputs property follows the same structure as the inputs; it returns an
-returns an `OrderedDict` of named outputs and their shapes. The output structure
+`OrderedDict` of named outputs and their shapes. The output structure is linked to the
-is linked to the choice of feature that the configuration is initialised with.
+choice of feature that the configuration is initialised with. By default, the ONNX
-By default, the ONNX configuration is initialized with the `default` feature
+configuration is initialized with the `default` feature that corresponds to exporting a
-that corresponds to exporting a model loaded with the `AutoModel` class. If you
+model loaded with the `AutoModel` class. If you want to export a model for another task,
-want to export a different model topology, just provide a different feature to
+just provide a different feature to the `task` argument when you initialize the ONNX
-the `task` argument when you initialize the ONNX configuration. For example, if
+configuration. For example, if we wished to export DistilBERT with a sequence
-we wished to export DistilBERT with a sequence classification head, we could
+classification head, we could use:
 use:
 ```python
 >>> from transformers import AutoConfig
@@ -410,18 +402,18 @@ OrderedDict([('logits', {0: 'batch'})])
 <Tip>
-All of the base properties and methods associated with [`~onnx.config.OnnxConfig`] and the
+All of the base properties and methods associated with [`~onnx.config.OnnxConfig`] and
-other configuration classes can be overriden if needed. Check out
+the other configuration classes can be overriden if needed. Check out [`BartOnnxConfig`]
-[`BartOnnxConfig`] for an advanced example.
+for an advanced example.
 </Tip>
-#### Exporting the model
+### Exporting the model
-Once you have implemented the ONNX configuration, the next step is to export the
+Once you have implemented the ONNX configuration, the next step is to export the model.
-model. Here we can use the `export()` function provided by the
+Here we can use the `export()` function provided by the `transformers.onnx` package.
-`transformers.onnx` package. This function expects the ONNX configuration, along
+This function expects the ONNX configuration, along with the base model and tokenizer,
-with the base model and tokenizer, and the path to save the exported file:
+and the path to save the exported file:
 ```python
 >>> from pathlib import Path
@@ -436,10 +428,9 @@ with the base model and tokenizer, and the path to save the exported file:
 >>> onnx_inputs, onnx_outputs = export(tokenizer, base_model, onnx_config, onnx_config.default_onnx_opset, onnx_path)
 ```
-The `onnx_inputs` and `onnx_outputs` returned by the `export()` function are
+The `onnx_inputs` and `onnx_outputs` returned by the `export()` function are lists of
-lists of the keys defined in the `inputs` and `outputs` properties of the
+the keys defined in the `inputs` and `outputs` properties of the configuration. Once the
-configuration. Once the model is exported, you can test that the model is well
+model is exported, you can test that the model is well formed as follows:
 formed as follows:
 ```python
 >>> import onnx
@@ -450,21 +441,20 @@ formed as follows:
 <Tip>
-If your model is larger than 2GB, you will see that many additional files are
+If your model is larger than 2GB, you will see that many additional files are created
-created during the export. This is _expected_ because ONNX uses [Protocol
+during the export. This is _expected_ because ONNX uses [Protocol
-Buffers](https://developers.google.com/protocol-buffers/) to store the model and
+Buffers](https://developers.google.com/protocol-buffers/) to store the model and these
-these have a size limit of 2GB. See the [ONNX
+have a size limit of 2GB. See the [ONNX
-documentation](https://github.com/onnx/onnx/blob/master/docs/ExternalData.md)
+documentation](https://github.com/onnx/onnx/blob/master/docs/ExternalData.md) for
-for instructions on how to load models with external data.
+instructions on how to load models with external data.
 </Tip>
-#### Validating the model outputs
+### Validating the model outputs
-The final step is to validate that the outputs from the base and exported model
+The final step is to validate that the outputs from the base and exported model agree
-agree within some absolute tolerance. Here we can use the
+within some absolute tolerance. Here we can use the `validate_model_outputs()` function
-`validate_model_outputs()` function provided by the `transformers.onnx` package
+provided by the `transformers.onnx` package as follows:
 as follows:
 ```python
 >>> from transformers.onnx import validate_model_outputs
@@ -474,220 +464,23 @@ as follows:
 ... )
 ```
-This function uses the `OnnxConfig.generate_dummy_inputs()` method to generate
+This function uses the [`~transformers.onnx.OnnxConfig.generate_dummy_inputs`] method to
-inputs for the base and exported model, and the absolute tolerance can be
+generate inputs for the base and exported model, and the absolute tolerance can be
-defined in the configuration. We generally find numerical agreement in the 1e-6
+defined in the configuration. We generally find numerical agreement in the 1e-6 to 1e-4
-to 1e-4 range, although anything smaller than 1e-3 is likely to be OK.
+range, although anything smaller than 1e-3 is likely to be OK.
-### Contributing a new configuration to 🤗 Transformers
+## Contributing a new configuration to 🤗 Transformers
-We are looking to expand the set of ready-made configurations and welcome
+We are looking to expand the set of ready-made configurations and welcome contributions
-contributions from the community! If you would like to contribute your addition
+from the community! If you would like to contribute your addition to the library, you
-to the library, you will need to:
+will need to:
 * Implement the ONNX configuration in the corresponding `configuration_<model_name>.py`
 file
-* Include the model architecture and corresponding features in [`~onnx.features.FeatureManager`]
+* Include the model architecture and corresponding features in
  [`~onnx.features.FeatureManager`]
 * Add your model architecture to the tests in `test_onnx_v2.py`
 Check out how the configuration for [IBERT was
-contributed](https://github.com/huggingface/transformers/pull/14868/files) to
+contributed](https://github.com/huggingface/transformers/pull/14868/files) to get an
-get an idea of what's involved.
+idea of what's involved.
 ## TorchScript
 <Tip>
 This is the very beginning of our experiments with TorchScript and we are still exploring its capabilities with
 variable-input-size models. It is a focus of interest to us and we will deepen our analysis in upcoming releases,
 with more code examples, a more flexible implementation, and benchmarks comparing python-based codes with compiled
 TorchScript.
 </Tip>
 According to Pytorch's documentation: "TorchScript is a way to create serializable and optimizable models from PyTorch
 code". Pytorch's two modules [JIT and TRACE](https://pytorch.org/docs/stable/jit.html) allow the developer to export
 their model to be re-used in other programs, such as efficiency-oriented C++ programs.
 We have provided an interface that allows the export of 🤗 Transformers models to TorchScript so that they can be reused
 in a different environment than a Pytorch-based python program. Here we explain how to export and use our models using
 TorchScript.
 Exporting a model requires two things:
 - a forward pass with dummy inputs.
 - model instantiation with the `torchscript` flag.
 These necessities imply several things developers should be careful about. These are detailed below.
 ### TorchScript flag and tied weights
 This flag is necessary because most of the language models in this repository have tied weights between their
 `Embedding` layer and their `Decoding` layer. TorchScript does not allow the export of models that have tied
 weights, therefore it is necessary to untie and clone the weights beforehand.
 This implies that models instantiated with the `torchscript` flag have their `Embedding` layer and `Decoding`
 layer separate, which means that they should not be trained down the line. Training would de-synchronize the two
 layers, leading to unexpected results.
 This is not the case for models that do not have a Language Model head, as those do not have tied weights. These models
 can be safely exported without the `torchscript` flag.
 ### Dummy inputs and standard lengths
 The dummy inputs are used to do a model forward pass. While the inputs' values are propagating through the layers,
 Pytorch keeps track of the different operations executed on each tensor. These recorded operations are then used to
 create the "trace" of the model.
 The trace is created relatively to the inputs' dimensions. It is therefore constrained by the dimensions of the dummy
 input, and will not work for any other sequence length or batch size. When trying with a different size, an error such
 as:
 `The expanded size of the tensor (3) must match the existing size (7) at non-singleton dimension 2`
 will be raised. It is therefore recommended to trace the model with a dummy input size at least as large as the largest
 input that will be fed to the model during inference. Padding can be performed to fill the missing values. As the model
 will have been traced with a large input size however, the dimensions of the different matrix will be large as well,
 resulting in more calculations.
 It is recommended to be careful of the total number of operations done on each input and to follow performance closely
 when exporting varying sequence-length models.
 ### Using TorchScript in Python
 Below is an example, showing how to save, load models as well as how to use the trace for inference.
 #### Saving a model
 This snippet shows how to use TorchScript to export a `BertModel`. Here the `BertModel` is instantiated according
 to a `BertConfig` class and then saved to disk under the filename `traced_bert.pt`
 ```python
 from transformers import BertModel, BertTokenizer, BertConfig
 import torch
 enc = BertTokenizer.from_pretrained("bert-base-uncased")
 # Tokenizing input text
 text = "[CLS] Who was Jim Henson ? [SEP] Jim Henson was a puppeteer [SEP]"
 tokenized_text = enc.tokenize(text)
 # Masking one of the input tokens
 masked_index = 8
 tokenized_text[masked_index] = "[MASK]"
 indexed_tokens = enc.convert_tokens_to_ids(tokenized_text)
 segments_ids = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1]
 # Creating a dummy input
 tokens_tensor = torch.tensor([indexed_tokens])
 segments_tensors = torch.tensor([segments_ids])
 dummy_input = [tokens_tensor, segments_tensors]
 # Initializing the model with the torchscript flag
 # Flag set to True even though it is not necessary as this model does not have an LM Head.
 config = BertConfig(
    vocab_size_or_config_json_file=32000,
    hidden_size=768,
    num_hidden_layers=12,
    num_attention_heads=12,
    intermediate_size=3072,
    torchscript=True,
 )
 # Instantiating the model
 model = BertModel(config)
 # The model needs to be in evaluation mode
 model.eval()
 # If you are instantiating the model with *from_pretrained* you can also easily set the TorchScript flag
 model = BertModel.from_pretrained("bert-base-uncased", torchscript=True)
 # Creating the trace
 traced_model = torch.jit.trace(model, [tokens_tensor, segments_tensors])
 torch.jit.save(traced_model, "traced_bert.pt")
 ```
 #### Loading a model
 This snippet shows how to load the `BertModel` that was previously saved to disk under the name `traced_bert.pt`.
 We are re-using the previously initialised `dummy_input`.
 ```python
 loaded_model = torch.jit.load("traced_bert.pt")
 loaded_model.eval()
 all_encoder_layers, pooled_output = loaded_model(*dummy_input)
 ```
 #### Using a traced model for inference
 Using the traced model for inference is as simple as using its `__call__` dunder method:
 ```python
 traced_model(tokens_tensor, segments_tensors)
 ```
 ### Deploying HuggingFace TorchScript models on AWS using the Neuron SDK
 AWS introduced the [Amazon EC2 Inf1](https://aws.amazon.com/ec2/instance-types/inf1/)
 instance family for low cost, high performance machine learning inference in the cloud.
 The Inf1 instances are powered by the AWS Inferentia chip, a custom-built hardware accelerator,
 specializing in deep learning inferencing workloads.
 [AWS Neuron](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/#)
 is the SDK for Inferentia that supports tracing and optimizing transformers models for
 deployment on Inf1. The Neuron SDK provides:
 1. Easy-to-use API with one line of code change to trace and optimize a TorchScript model for inference in the cloud.
 2. Out of the box performance optimizations for [improved cost-performance](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/neuron-guide/benchmark/>)
 3. Support for HuggingFace transformers models built with either [PyTorch](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/src/examples/pytorch/bert_tutorial/tutorial_pretrained_bert.html)
   or [TensorFlow](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/src/examples/tensorflow/huggingface_bert/huggingface_bert.html).
 #### Implications
 Transformers Models based on the [BERT (Bidirectional Encoder Representations from Transformers)](https://huggingface.co/docs/transformers/main/model_doc/bert)
 architecture, or its variants such as [distilBERT](https://huggingface.co/docs/transformers/main/model_doc/distilbert)
 and [roBERTa](https://huggingface.co/docs/transformers/main/model_doc/roberta)
 will run best on Inf1 for non-generative tasks such as Extractive Question Answering,
 Sequence Classification, Token Classification. Alternatively, text generation
 tasks can be adapted to run on Inf1, according to this [AWS Neuron MarianMT tutorial](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/src/examples/pytorch/transformers-marianmt.html).
 More information about models that can be converted out of the box on Inferentia can be
 found in the [Model Architecture Fit section of the Neuron documentation](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/neuron-guide/models/models-inferentia.html#models-inferentia).
 #### Dependencies
 Using AWS Neuron to convert models requires the following dependencies and environment:
 * A [Neuron SDK environment](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/neuron-guide/neuron-frameworks/pytorch-neuron/index.html#installation-guide),
  which comes pre-configured on [AWS Deep Learning AMI](https://docs.aws.amazon.com/dlami/latest/devguide/tutorial-inferentia-launching.html).
 #### Converting a Model for AWS Neuron
 Using the same script as in [Using TorchScript in Python](https://huggingface.co/docs/transformers/main/en/serialization#using-torchscript-in-python)
 to trace a "BertModel", you import `torch.neuron` framework extension to access
 the components of the Neuron SDK through a Python API.
 ```python
 from transformers import BertModel, BertTokenizer, BertConfig
 import torch
 import torch.neuron
 ```
 And only modify the tracing line of code
 from:
 ```python
 torch.jit.trace(model, [tokens_tensor, segments_tensors])
 ```
 to:
 ```python
 torch.neuron.trace(model, [token_tensor, segments_tensors])
 ```
 This change enables Neuron SDK to trace the model and optimize it to run in Inf1 instances.
 To learn more about AWS Neuron SDK features, tools, example tutorials and latest updates,
 please see the [AWS NeuronSDK documentation](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/index.html).
--- a/docs/source/en/torchscript.mdx
+++ b/docs/source/en/torchscript.mdx
@@ -0,0 +1,225 @@
 <!--Copyright 2022 The HuggingFace Team. All rights reserved.
 Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
 the License. You may obtain a copy of the License at
 http://www.apache.org/licenses/LICENSE-2.0
 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
 an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
 specific language governing permissions and limitations under the License.
 -->
 # Export to TorchScript
 <Tip>
 This is the very beginning of our experiments with TorchScript and we are still
 exploring its capabilities with variable-input-size models. It is a focus of interest to
 us and we will deepen our analysis in upcoming releases, with more code examples, a more
 flexible implementation, and benchmarks comparing Python-based codes with compiled
 TorchScript.
 </Tip>
 According to the [TorchScript documentation](https://pytorch.org/docs/stable/jit.html):
 > TorchScript is a way to create serializable and optimizable models from PyTorch code.
 There are two PyTorch modules, [JIT and
 TRACE](https://pytorch.org/docs/stable/jit.html), that allow developers to export their
 models to be reused in other programs like efficiency-oriented C++ programs.
 We provide an interface that allows you to export 🤗 Transformers models to TorchScript
 so they can be reused in a different environment than PyTorch-based Python programs.
 Here, we explain how to export and use our models using TorchScript.
 Exporting a model requires two things:
 - model instantiation with the `torchscript` flag
 - a forward pass with dummy inputs
 These necessities imply several things developers should be careful about as detailed
 below.
 ## TorchScript flag and tied weights
 The `torchscript` flag is necessary because most of the 🤗 Transformers language models
 have tied weights between their `Embedding` layer and their `Decoding` layer.
 TorchScript does not allow you to export models that have tied weights, so it is
 necessary to untie and clone the weights beforehand.
 Models instantiated with the `torchscript` flag have their `Embedding` layer and
 `Decoding` layer separated, which means that they should not be trained down the line.
 Training would desynchronize the two layers, leading to unexpected results.
 This is not the case for models that do not have a language model head, as those do not
 have tied weights. These models can be safely exported without the `torchscript` flag.
 ## Dummy inputs and standard lengths
 The dummy inputs are used for a models forward pass. While the inputs' values are
 propagated through the layers, PyTorch keeps track of the different operations executed
 on each tensor. These recorded operations are then used to create the *trace* of the
 model.
 The trace is created relative to the inputs' dimensions. It is therefore constrained by
 the dimensions of the dummy input, and will not work for any other sequence length or
 batch size. When trying with a different size, the following error is raised:
 ```
 `The expanded size of the tensor (3) must match the existing size (7) at non-singleton dimension 2`
 ```
 We recommended you trace the model with a dummy input size at least as large as the
 largest input that will be fed to the model during inference. Padding can help fill the
 missing values. However, since the model is traced with a larger input size, the
 dimensions of the matrix will also be large, resulting in more calculations.
 Be careful of the total number of operations done on each input and follow the
 performance closely when exporting varying sequence-length models.
 ## Using TorchScript in Python
 This section demonstrates how to save and load models as well as how to use the trace
 for inference.
 ### Saving a model
 To export a `BertModel` with TorchScript, instantiate `BertModel` from the `BertConfig`
 class and then save it to disk under the filename `traced_bert.pt`:
 ```python
 from transformers import BertModel, BertTokenizer, BertConfig
 import torch
 enc = BertTokenizer.from_pretrained("bert-base-uncased")
 # Tokenizing input text
 text = "[CLS] Who was Jim Henson ? [SEP] Jim Henson was a puppeteer [SEP]"
 tokenized_text = enc.tokenize(text)
 # Masking one of the input tokens
 masked_index = 8
 tokenized_text[masked_index] = "[MASK]"
 indexed_tokens = enc.convert_tokens_to_ids(tokenized_text)
 segments_ids = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1]
 # Creating a dummy input
 tokens_tensor = torch.tensor([indexed_tokens])
 segments_tensors = torch.tensor([segments_ids])
 dummy_input = [tokens_tensor, segments_tensors]
 # Initializing the model with the torchscript flag
 # Flag set to True even though it is not necessary as this model does not have an LM Head.
 config = BertConfig(
    vocab_size_or_config_json_file=32000,
    hidden_size=768,
    num_hidden_layers=12,
    num_attention_heads=12,
    intermediate_size=3072,
    torchscript=True,
 )
 # Instantiating the model
 model = BertModel(config)
 # The model needs to be in evaluation mode
 model.eval()
 # If you are instantiating the model with *from_pretrained* you can also easily set the TorchScript flag
 model = BertModel.from_pretrained("bert-base-uncased", torchscript=True)
 # Creating the trace
 traced_model = torch.jit.trace(model, [tokens_tensor, segments_tensors])
 torch.jit.save(traced_model, "traced_bert.pt")
 ```
 ### Loading a model
 Now you can load the previously saved `BertModel`, `traced_bert.pt`, from disk and use
 it on the previously initialised `dummy_input`:
 ```python
 loaded_model = torch.jit.load("traced_bert.pt")
 loaded_model.eval()
 all_encoder_layers, pooled_output = loaded_model(*dummy_input)
 ```
 ### Using a traced model for inference
 Use the traced model for inference by using its `__call__` dunder method:
 ```python
 traced_model(tokens_tensor, segments_tensors)
 ```
 ## Deploy Hugging Face TorchScript models to AWS with the Neuron SDK
 AWS introduced the [Amazon EC2 Inf1](https://aws.amazon.com/ec2/instance-types/inf1/)
 instance family for low cost, high performance machine learning inference in the cloud.
 The Inf1 instances are powered by the AWS Inferentia chip, a custom-built hardware
 accelerator, specializing in deep learning inferencing workloads. [AWS
 Neuron](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/#) is the SDK for
 Inferentia that supports tracing and optimizing transformers models for deployment on
 Inf1. The Neuron SDK provides:
 1. Easy-to-use API with one line of code change to trace and optimize a TorchScript
   model for inference in the cloud.
 2. Out of the box performance optimizations for [improved
   cost-performance](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/neuron-guide/benchmark/>).
 3. Support for Hugging Face transformers models built with either
   [PyTorch](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/src/examples/pytorch/bert_tutorial/tutorial_pretrained_bert.html)
   or
   [TensorFlow](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/src/examples/tensorflow/huggingface_bert/huggingface_bert.html).
 ### Implications
 Transformers models based on the [BERT (Bidirectional Encoder Representations from
 Transformers)](https://huggingface.co/docs/transformers/main/model_doc/bert)
 architecture, or its variants such as
 [distilBERT](https://huggingface.co/docs/transformers/main/model_doc/distilbert) and
 [roBERTa](https://huggingface.co/docs/transformers/main/model_doc/roberta) run best on
 Inf1 for non-generative tasks such as extractive question answering, sequence
 classification, and token classification. However, text generation tasks can still be
 adapted to run on Inf1 according to this [AWS Neuron MarianMT
 tutorial](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/src/examples/pytorch/transformers-marianmt.html).
 More information about models that can be converted out of the box on Inferentia can be
 found in the [Model Architecture
 Fit](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/neuron-guide/models/models-inferentia.html#models-inferentia)
 section of the Neuron documentation.
 ### Dependencies
 Using AWS Neuron to convert models requires a [Neuron SDK
 environment](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/neuron-guide/neuron-frameworks/pytorch-neuron/index.html#installation-guide)
 which comes preconfigured on [AWS Deep Learning
 AMI](https://docs.aws.amazon.com/dlami/latest/devguide/tutorial-inferentia-launching.html).
 ### Converting a model for AWS Neuron
 Convert a model for AWS NEURON using the same code from [Using TorchScript in
 Python](serialization#using-torchscript-in-python) to trace a `BertModel`. Import the
 `torch.neuron` framework extension to access the components of the Neuron SDK through a
 Python API:
 ```python
 from transformers import BertModel, BertTokenizer, BertConfig
 import torch
 import torch.neuron
 ```
 You only need to modify the following line:
 ```diff
 - torch.jit.trace(model, [tokens_tensor, segments_tensors])
 + torch.neuron.trace(model, [token_tensor, segments_tensors])
 ```
 This enables the Neuron SDK to trace the model and optimize it for Inf1 instances.
 To learn more about AWS Neuron SDK features, tools, example tutorials and latest
 updates, please see the [AWS NeuronSDK
 documentation](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/index.html).