Patch release: v4.9.2

Tpu tie weights (#13030 )
* Fix tied weights on TPU * Manually tie weights in no trainer examples * Fix for test * One last missing * Gettning owned by my scripts * Address review comments * Fix test * Fix tests * Fix reformer tests
2021-08-09 16:01:36 +02:00 · 2021-08-09 15:53:05 +02:00 · 2021-08-09 15:52:16 +02:00 · 2021-08-09 15:52:07 +02:00 · 2021-08-09 15:51:04 +02:00 · 2021-08-09 15:50:58 +02:00
32 changed files with 680 additions and 198 deletions
--- a/.circleci/deploy.sh
+++ b/.circleci/deploy.sh
@@ -67,4 +67,5 @@ deploy_doc "25dee4a" v4.6.0
 deploy_doc "7a6c9fa" v4.7.0
 deploy_doc "9252a51" v4.8.0
 deploy_doc "1366172" v4.8.1
-deploy_doc "96d1cfb"  # v4.8.2 Latest stable release
+deploy_doc "96d1cfb" v4.8.2
 deploy_doc "72aee83" # v4.9.0 Latest stable release
--- a/docs/source/_static/js/custom.js
+++ b/docs/source/_static/js/custom.js
@@ -1,10 +1,11 @@
 // These two things need to be updated at each release for the version selector.
 // Last stable version
-const stableVersion = "v4.8.2"
+const stableVersion = "v4.9.0"
 // Dictionary doc folder to label. The last stable version should have an empty key.
 const versionMapping = {
    "master": "master",
-    "": "v4.8.0/v4.8.1/v4.8.2 (stable)",
+    "": "v4.9.0 (stable)",
    "v4.8.2": "v4.8.0/v4.8.1/v4.8.2",
    "v4.7.0": "v4.7.0",
    "v4.6.0": "v4.6.0",
    "v4.5.1": "v4.5.0/v4.5.1",
--- a/docs/source/serialization.rst
+++ b/docs/source/serialization.rst
@@ -99,6 +99,30 @@ It will be exported under ``onnx/bert-base-cased``. You should see similar logs:
                    -[✓] all values close (atol: 0.0001)
    All good, model saved at: onnx/bert-base-cased/model.onnx
 This export can now be used in the ONNX inference runtime:
 .. code-block::
    import onnxruntime as ort
    from transformers import BertTokenizerFast
    tokenizer = BertTokenizerFast.from_pretrained("bert-base-cased")
    ort_session = ort.InferenceSession("onnx/bert-base-cased/model.onnx")
    inputs = tokenizer("Using BERT in ONNX!", return_tensors="np")
    outputs = ort_session.run(["last_hidden_state", "pooler_output"], dict(inputs))
 The outputs used (:obj:`["last_hidden_state", "pooler_output"]`) can be obtained by taking a look at the ONNX
 configuration of each model. For example, for BERT:
 .. code-block::
    from transformers.models.bert import BertOnnxConfig, BertConfig
    config = BertConfig()
    onnx_config = BertOnnxConfig(config)
    output_keys = list(onnx_config.outputs.keys())
 Implementing a custom configuration for an unsupported architecture
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
@@ -142,6 +166,12 @@ An important fact to notice is the use of `OrderedDict` in both inputs and outpu
 as inputs are matched against their relative position within the `PreTrainedModel.forward()` prototype and outputs are
 match against there position in the returned `BaseModelOutputX` instance.
 An example of such an addition is visible here, for the MBart model: `Making MBART ONNX-convertible
 <https://github.com/huggingface/transformers/pull/13049/commits/d097adcebd89a520f04352eb215a85916934204f>`__
 If you would like to contribute your addition to the library, we recommend you implement tests. An example of such
 tests is visible here: `Adding tests to the MBART ONNX conversion
 <https://github.com/huggingface/transformers/pull/13049/commits/5d642f65abf45ceeb72bd855ca7bfe2506a58e6a>`__
 Graph conversion
 -----------------------------------------------------------------------------------------------------------------------
--- a/examples/pytorch/language-modeling/run_clm_no_trainer.py
+++ b/examples/pytorch/language-modeling/run_clm_no_trainer.py
@@ -35,7 +35,7 @@ from torch.utils.data.dataloader import DataLoader
 from tqdm.auto import tqdm
 import transformers
-from accelerate import Accelerator
+from accelerate import Accelerator, DistributedType
 from transformers import (
    CONFIG_MAPPING,
    MODEL_MAPPING,
@@ -403,6 +403,10 @@ def main():
        model, optimizer, train_dataloader, eval_dataloader
    )
    # On TPU, the tie weights in our model have been disconnected, so we need to restore the ties.
    if accelerator.distributed_type == DistributedType.TPU:
        model.tie_weights()
    # Note -> the training dataloader needs to be prepared before we grab his length below (cause its length will be
    # shorter in multiprocess)
--- a/examples/pytorch/language-modeling/run_mlm_no_trainer.py
+++ b/examples/pytorch/language-modeling/run_mlm_no_trainer.py
@@ -35,7 +35,7 @@ from torch.utils.data.dataloader import DataLoader
 from tqdm.auto import tqdm
 import transformers
-from accelerate import Accelerator
+from accelerate import Accelerator, DistributedType
 from transformers import (
    CONFIG_MAPPING,
    MODEL_MAPPING,
@@ -448,6 +448,10 @@ def main():
        model, optimizer, train_dataloader, eval_dataloader
    )
    # On TPU, the tie weights in our model have been disconnected, so we need to restore the ties.
    if accelerator.distributed_type == DistributedType.TPU:
        model.tie_weights()
    # Note -> the training dataloader needs to be prepared before we grab his length below (cause its length will be
    # shorter in multiprocess)
--- a/setup.py
+++ b/setup.py
@@ -337,7 +337,7 @@ install_requires = [
 setup(
    name="transformers",
-    version="4.9.0",  # expected format is one of x.y.z.dev0, or x.y.z.rc1 or x.y.z (no to dashes, yes to dots)
+    version="4.9.2",  # expected format is one of x.y.z.dev0, or x.y.z.rc1 or x.y.z (no to dashes, yes to dots)
    author="Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Sam Shleifer, Patrick von Platen, Sylvain Gugger, Suraj Patil, Stas Bekman, Google AI Language Team Authors, Open AI team Authors, Facebook AI Authors, Carnegie Mellon University Authors",
    author_email="thomas@huggingface.co",
    description="State-of-the-art Natural Language Processing for TensorFlow 2.0 and PyTorch",
--- a/src/transformers/init.py
+++ b/src/transformers/init.py
@@ -22,7 +22,7 @@
 # to defer the actual importing for when the objects are requested. This way `import transformers` provides the names
 # in the namespace without actually importing anything (and especially none of the backends).
-__version__ = "4.9.0"
+__version__ = "4.9.2"
 # Work around to update TensorFlow's absl.logging threshold which alters the
 # default Python logging output behavior when present.
--- a/src/transformers/file_utils.py
+++ b/src/transformers/file_utils.py
@@ -274,8 +274,9 @@ PRESET_MIRROR_DICT = {
    "bfsu": "https://mirrors.bfsu.edu.cn/hugging-face-models",
 }
-# This is the version of torch required to run torch.fx features.
+# This is the version of torch required to run torch.fx features and torch.onnx with dictionary inputs.
 TORCH_FX_REQUIRED_VERSION = version.parse("1.8")
 TORCH_ONNX_DICT_INPUTS_MINIMUM_VERSION = version.parse("1.8")
 _is_offline_mode = True if os.environ.get("TRANSFORMERS_OFFLINE", "0").upper() in ENV_VARS_TRUE_VALUES else False
@@ -297,7 +298,7 @@ def is_torch_cuda_available():
        return False
-_torch_fx_available = False
+_torch_fx_available = _torch_onnx_dict_inputs_support_available = False
 if _torch_available:
    torch_version = version.parse(importlib_metadata.version("torch"))
    _torch_fx_available = (torch_version.major, torch_version.minor) == (
@@ -305,11 +306,17 @@ if _torch_available:
        TORCH_FX_REQUIRED_VERSION.minor,
    )
    _torch_onnx_dict_inputs_support_available = torch_version >= TORCH_ONNX_DICT_INPUTS_MINIMUM_VERSION
 def is_torch_fx_available():
    return _torch_fx_available
 def is_torch_onnx_dict_inputs_support_available():
    return _torch_onnx_dict_inputs_support_available
 def is_tf_available():
    return _tf_available
--- a/src/transformers/modeling_utils.py
+++ b/src/transformers/modeling_utils.py
@@ -594,6 +594,10 @@ class PreTrainedModel(nn.Module, ModuleUtilsMixin, GenerationMixin, PushToHubMix
                self = getattr(self, self.base_model_prefix)
            self._tie_encoder_decoder_weights(self.encoder, self.decoder, self.base_model_prefix)
        for module in self.modules():
            if hasattr(module, "_tie_weights"):
                module._tie_weights()
    @staticmethod
    def _tie_encoder_decoder_weights(encoder: nn.Module, decoder: nn.Module, base_model_prefix: str):
        uninitialized_encoder_weights: List[str] = []
--- a/src/transformers/models/albert/modeling_albert.py
+++ b/src/transformers/models/albert/modeling_albert.py
@@ -860,8 +860,6 @@ class AlbertMLMHead(nn.Module):
        self.dense = nn.Linear(config.hidden_size, config.embedding_size)
        self.decoder = nn.Linear(config.embedding_size, config.vocab_size)
        self.activation = ACT2FN[config.hidden_act]
        # Need a link between the two variables so that the bias is correctly resized with `resize_token_embeddings`
        self.decoder.bias = self.bias
    def forward(self, hidden_states):
@@ -874,6 +872,10 @@ class AlbertMLMHead(nn.Module):
        return prediction_scores
    def _tie_weights(self):
        # To tie those two weights if they get disconnected (on TPU or when the bias is resized)
        self.bias = self.decoder.bias
 class AlbertSOPHead(nn.Module):
    def __init__(self, config):
--- a/src/transformers/models/bert_generation/modeling_bert_generation.py
+++ b/src/transformers/models/bert_generation/modeling_bert_generation.py
@@ -430,16 +430,18 @@ class BertGenerationEncoder(BertGenerationPreTrainedModel):
 class BertGenerationOnlyLMHead(nn.Module):
    def __init__(self, config):
        super().__init__()
-        self.decoder = nn.Linear(config.hidden_size, config.vocab_size, bias=False)
+        self.decoder = nn.Linear(config.hidden_size, config.vocab_size)
        self.bias = nn.Parameter(torch.zeros(config.vocab_size))
        # Need a link between the two variables so that the bias is correctly resized with `resize_token_embeddings`
        self.decoder.bias = self.bias
    def forward(self, hidden_states):
        logits = self.decoder(hidden_states)
        return logits
    def _tie_weights(self):
        # To tie those two weights if they get disconnected (on TPU or when the bias is resized)
        self.bias = self.decoder.bias
@add_start_docstrings(
    """BertGeneration Model with a `language modeling` head on top for CLM fine-tuning. """,
--- a/src/transformers/models/gpt_neo/init.py
+++ b/src/transformers/models/gpt_neo/init.py
@@ -21,7 +21,7 @@ from ...file_utils import _LazyModule, is_flax_available, is_torch_available
 _import_structure = {
-    "configuration_gpt_neo": ["GPT_NEO_PRETRAINED_CONFIG_ARCHIVE_MAP", "GPTNeoConfig"],
+    "configuration_gpt_neo": ["GPT_NEO_PRETRAINED_CONFIG_ARCHIVE_MAP", "GPTNeoConfig", "GPTNeoOnnxConfig"],
 }
 if is_torch_available():
@@ -43,7 +43,7 @@ if is_flax_available():
 if TYPE_CHECKING:
-    from .configuration_gpt_neo import GPT_NEO_PRETRAINED_CONFIG_ARCHIVE_MAP, GPTNeoConfig
+    from .configuration_gpt_neo import GPT_NEO_PRETRAINED_CONFIG_ARCHIVE_MAP, GPTNeoConfig, GPTNeoOnnxConfig
    if is_torch_available():
        from .modeling_gpt_neo import (
--- a/src/transformers/models/gpt_neo/configuration_gpt_neo.py
+++ b/src/transformers/models/gpt_neo/configuration_gpt_neo.py
@@ -14,7 +14,12 @@
 # limitations under the License.
 """ GPT Neo model configuration """
 from collections import OrderedDict
 from typing import Any, Dict, Iterable, Mapping, Optional
 from ... import PreTrainedTokenizer, TensorType, is_torch_available
 from ...configuration_utils import PretrainedConfig
 from ...onnx import OnnxConfigWithPast, PatchingSpec
 from ...utils import logging
@@ -173,3 +178,162 @@ class GPTNeoConfig(PretrainedConfig):
    @property
    def num_hidden_layers(self):
        return self.num_layers
 def custom_unfold(input, dimension, size, step):
    """Custom torch.Tensor.unfold implementation to enable the export to ONNX."""
    import torch
    shape = input.size()
    rank = len(shape)
    sizedim = shape[dimension]
    low_indices = torch.arange(0, sizedim, step)
    min_length = torch.div(sizedim - size, step, rounding_mode="floor") + 1
    indices = torch.arange(size) + low_indices[:min_length][:, None]
    s = [slice(None)] * rank
    s[dimension] = indices
    sliced = input[s]
    perm = list(range(0, rank + 1))
    perm.append(perm.pop(dimension + 1))
    return sliced.permute(perm)
 def custom_get_block_length_and_num_blocks(seq_length, window_size):
    """
    Custom implementation for GPTNeoAttentionMixin._get_block_length_and_num_blocks to enable the export to ONNX as
    original implmentation uses Python variables and control flow.
    """
    import torch
    candidates = torch.arange(1, window_size)
    remainders = torch.remainder(seq_length, candidates)
    divisor_indices = remainders == 0
    divisors = candidates[divisor_indices]
    largest_divisor = torch.max(divisors)
    return largest_divisor, torch.div(seq_length, largest_divisor, rounding_mode="floor")
 class GPTNeoOnnxConfig(OnnxConfigWithPast):
    def __init__(self, config: PretrainedConfig, task: str = "default", use_past: bool = False):
        if is_torch_available():
            import torch
            from .modeling_gpt_neo import GPTNeoAttentionMixin
            patching_specs = [
                PatchingSpec(torch.Tensor, name="unfold", custom_op=custom_unfold),
                PatchingSpec(
                    GPTNeoAttentionMixin,
                    name="_get_block_length_and_num_blocks",
                    custom_op=custom_get_block_length_and_num_blocks,
                    op_wrapper=staticmethod,
                ),
            ]
        super().__init__(config, task=task, patching_specs=patching_specs, use_past=use_past)
        self._num_local_attention = len([type_ for type_ in self._config.attention_layers if type_ == "local"])
        self._key_values_dynamic_axis = []
        for i in range(self._config.num_layers):
            if self._config.attention_layers[i] == "local":
                self._key_values_dynamic_axis.append({0: "batch", 1: "sequence"})
            else:
                self._key_values_dynamic_axis.append({0: "batch", 2: "sequence"})
                self._key_values_dynamic_axis.append({0: "batch", 2: "sequence"})
    @property
    def _number_key_values(self):
        return (self._config.num_layers * 2) - self._num_local_attention
    @property
    def inputs(self) -> Mapping[str, Mapping[int, str]]:
        common_inputs = OrderedDict({"input_ids": {0: "batch", 1: "sequence"}})
        if self.use_past:
            for i in range(self._config.num_layers):
                if self._config.attention_layers[i] == "local":
                    common_inputs[f"past_key_values.{i}.key_value"] = {0: "batch", 1: "sequence"}
                else:
                    common_inputs[f"past_key_values.{i}.key"] = {0: "batch", 2: "sequence"}
                    common_inputs[f"past_key_values.{i}.value"] = {0: "batch", 2: "sequence"}
        common_inputs["attention_mask"] = {0: "batch", 1: "sequence"}
        return common_inputs
    @property
    def outputs(self) -> Mapping[str, Mapping[int, str]]:
        common_outputs = super().outputs
        if self.use_past:
            for i in range(self._config.num_layers):
                if self._config.attention_layers[i] == "local":
                    common_outputs[f"present.{i}.key_value"] = {0: "batch", 1: "sequence"}
                else:
                    common_outputs[f"present.{i}.key"] = {0: "batch", 2: "sequence"}
                    common_outputs[f"present.{i}.value"] = {0: "batch", 2: "sequence"}
        return common_outputs
    def generate_dummy_inputs(
        self,
        tokenizer: PreTrainedTokenizer,
        batch_size: int = -1,
        seq_length: int = -1,
        is_pair: bool = False,
        framework: Optional[TensorType] = None,
    ) -> Mapping[str, Any]:
        common_inputs = super().generate_dummy_inputs(tokenizer, batch_size, seq_length, is_pair, framework)
        # We need to order the input in the way they appears in the forward()
        ordered_inputs = OrderedDict({"input_ids": common_inputs["input_ids"]})
        batch = common_inputs["input_ids"].shape[0]
        past_shapes = {
            "global": (batch, self._config.num_heads, 1, self._config.hidden_size // self._config.num_attention_heads),
            "local": (batch, 1, self._config.hidden_size),
        }
        # Need to add the past_keys
        if self.use_past:
            if not is_torch_available():
                raise ValueError("Cannot generate dummy past_keys inputs without PyTorch installed.")
            else:
                import torch
                ordered_inputs["past_key_values"] = []
                for i in range(self._config.num_layers):
                    attention_type = self._config.attention_layers[i]
                    if attention_type == "global":
                        ordered_inputs["past_key_values"].append(
                            (
                                torch.zeros(past_shapes[attention_type]),
                                torch.zeros(past_shapes[attention_type]),
                            )
                        )
                    else:
                        ordered_inputs["past_key_values"].append((torch.zeros(past_shapes[attention_type]),))
        ordered_inputs["attention_mask"] = common_inputs["attention_mask"]
        if self.use_past:
            ordered_inputs["attention_mask"] = torch.cat(
                [ordered_inputs["attention_mask"], torch.zeros(batch, 1)], dim=1
            )
        return ordered_inputs
    @staticmethod
    def flatten_output_collection_property(name: str, field: Iterable[Any]) -> Dict[str, Any]:
        if name in ["present", "past_key_values"]:
            flatten_output = {}
            for idx, t in enumerate(field):
                if len(t) == 1:
                    flatten_output[f"{name}.{idx}.key_value"] = t[0]
                else:
                    flatten_output[f"{name}.{idx}.key"] = t[0]
                    flatten_output[f"{name}.{idx}.value"] = t[1]
            return flatten_output
        return super().flatten_output_collection_property(name, field)
--- a/src/transformers/models/gpt_neo/modeling_gpt_neo.py
+++ b/src/transformers/models/gpt_neo/modeling_gpt_neo.py
@@ -1121,7 +1121,7 @@ class GPTNeoForSequenceClassification(GPTNeoPreTrainedModel):
                    f"unexpected if using padding tokens in conjunction with `inputs_embeds.`"
                )
-        pooled_logits = logits[range(batch_size), sequence_lengths]
+        pooled_logits = logits[torch.arange(batch_size), sequence_lengths]
        loss = None
        if labels is not None:
--- a/src/transformers/models/ibert/modeling_ibert.py
+++ b/src/transformers/models/ibert/modeling_ibert.py
@@ -948,10 +948,8 @@ class IBertLMHead(nn.Module):
        self.dense = nn.Linear(config.hidden_size, config.hidden_size)
        self.layer_norm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)
-        self.decoder = nn.Linear(config.hidden_size, config.vocab_size, bias=False)
+        self.decoder = nn.Linear(config.hidden_size, config.vocab_size)
        self.bias = nn.Parameter(torch.zeros(config.vocab_size))
        # Need a link between the two variables so that the bias is correctly resized with `resize_token_embeddings`
        self.decoder.bias = self.bias
    def forward(self, features, **kwargs):
@@ -964,6 +962,10 @@ class IBertLMHead(nn.Module):
        return x
    def _tie_weights(self):
        # To tie those two weights if they get disconnected (on TPU or when the bias is resized)
        self.bias = self.decoder.bias
@add_start_docstrings(
    """
--- a/src/transformers/models/longformer/modeling_longformer.py
+++ b/src/transformers/models/longformer/modeling_longformer.py
@@ -1336,10 +1336,8 @@ class LongformerLMHead(nn.Module):
        self.dense = nn.Linear(config.hidden_size, config.hidden_size)
        self.layer_norm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)
-        self.decoder = nn.Linear(config.hidden_size, config.vocab_size, bias=False)
+        self.decoder = nn.Linear(config.hidden_size, config.vocab_size)
        self.bias = nn.Parameter(torch.zeros(config.vocab_size))
        # Need a link between the two variables so that the bias is correctly resized with `resize_token_embeddings`
        self.decoder.bias = self.bias
    def forward(self, features, **kwargs):
@@ -1352,6 +1350,10 @@ class LongformerLMHead(nn.Module):
        return x
    def _tie_weights(self):
        # To tie those two weights if they get disconnected (on TPU or when the bias is resized)
        self.bias = self.decoder.bias
 class LongformerPreTrainedModel(PreTrainedModel):
    """
--- a/src/transformers/models/mbart/init.py
+++ b/src/transformers/models/mbart/init.py
@@ -28,7 +28,7 @@ from ...file_utils import (
 _import_structure = {
-    "configuration_mbart": ["MBART_PRETRAINED_CONFIG_ARCHIVE_MAP", "MBartConfig"],
+    "configuration_mbart": ["MBART_PRETRAINED_CONFIG_ARCHIVE_MAP", "MBartConfig", "MBartOnnxConfig"],
 }
 if is_sentencepiece_available():
@@ -68,7 +68,7 @@ if is_flax_available():
 if TYPE_CHECKING:
-    from .configuration_mbart import MBART_PRETRAINED_CONFIG_ARCHIVE_MAP, MBartConfig
+    from .configuration_mbart import MBART_PRETRAINED_CONFIG_ARCHIVE_MAP, MBartConfig, MBartOnnxConfig
    if is_sentencepiece_available():
        from .tokenization_mbart import MBartTokenizer
--- a/src/transformers/models/mbart/configuration_mbart.py
+++ b/src/transformers/models/mbart/configuration_mbart.py
@@ -13,6 +13,10 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 """ MBART model configuration """
 from collections import OrderedDict
 from typing import Mapping
 from transformers.onnx import OnnxConfigWithPast
 from ...configuration_utils import PretrainedConfig
 from ...utils import logging
@@ -171,3 +175,32 @@ class MBartConfig(PretrainedConfig):
    @property
    def hidden_size(self) -> int:
        return self.d_model
 class MBartOnnxConfig(OnnxConfigWithPast):
    @property
    def inputs(self) -> Mapping[str, Mapping[int, str]]:
        return OrderedDict(
            [
                ("input_ids", {0: "batch", 1: "sequence"}),
                ("attention_mask", {0: "batch", 1: "sequence"}),
            ]
        )
    @property
    def outputs(self) -> Mapping[str, Mapping[int, str]]:
        if self.use_past:
            return OrderedDict(
                [
                    ("last_hidden_state", {0: "batch", 1: "sequence"}),
                    ("past_keys", {0: "batch", 2: "sequence"}),
                    ("encoder_last_hidden_state", {0: "batch", 1: "sequence"}),
                ]
            )
        else:
            return OrderedDict(
                [
                    ("last_hidden_state", {0: "batch", 1: "sequence"}),
                    ("encoder_last_hidden_state", {0: "batch", 1: "sequence"}),
                ]
            )
--- a/src/transformers/models/reformer/modeling_reformer.py
+++ b/src/transformers/models/reformer/modeling_reformer.py
@@ -1747,8 +1747,6 @@ class ReformerOnlyLMHead(nn.Module):
        self.chunk_size_lm_head = config.chunk_size_lm_head
        self.decoder = nn.Linear(2 * config.hidden_size, config.vocab_size, bias=False)
        self.bias = nn.Parameter(torch.zeros(config.vocab_size))
        # Need a link between the two variables so that the bias is correctly resized with `resize_token_embeddings`
        self.decoder.bias = self.bias
    def forward(self, hidden_states):
@@ -1758,6 +1756,10 @@ class ReformerOnlyLMHead(nn.Module):
        hidden_states = self.decoder(hidden_states)
        return hidden_states
    def _tie_weights(self):
        # To tie those two weights if they get disconnected (on TPU or when the bias is resized)
        self.bias = self.decoder.bias
 class ReformerPreTrainedModel(PreTrainedModel):
    """
--- a/src/transformers/models/roberta/modeling_roberta.py
+++ b/src/transformers/models/roberta/modeling_roberta.py
@@ -1124,10 +1124,8 @@ class RobertaLMHead(nn.Module):
        self.dense = nn.Linear(config.hidden_size, config.hidden_size)
        self.layer_norm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)
-        self.decoder = nn.Linear(config.hidden_size, config.vocab_size, bias=False)
+        self.decoder = nn.Linear(config.hidden_size, config.vocab_size)
        self.bias = nn.Parameter(torch.zeros(config.vocab_size))
        # Need a link between the two variables so that the bias is correctly resized with `resize_token_embeddings`
        self.decoder.bias = self.bias
    def forward(self, features, **kwargs):
@@ -1140,6 +1138,10 @@ class RobertaLMHead(nn.Module):
        return x
    def _tie_weights(self):
        # To tie those two weights if they get disconnected (on TPU or when the bias is resized)
        self.bias = self.decoder.bias
@add_start_docstrings(
    """
--- a/src/transformers/models/t5/configuration_t5.py
+++ b/src/transformers/models/t5/configuration_t5.py
@@ -14,10 +14,11 @@
 # limitations under the License.
 """ T5 model configuration """
 from collections import OrderedDict
-from typing import Any, Mapping, Optional
+from typing import Any, Dict, Iterable, Mapping, Optional
 from transformers import PreTrainedTokenizer, TensorType
 from ... import is_torch_available
 from ...configuration_utils import PretrainedConfig
 from ...onnx import OnnxConfigWithPast
 from ...utils import logging
@@ -140,9 +141,6 @@ class T5Config(PretrainedConfig):
 class T5OnnxConfig(OnnxConfigWithPast):
    def __init__(self, config: PretrainedConfig, use_past: bool = False):
        super().__init__(config, use_past)
    @property
    def inputs(self) -> Mapping[str, Mapping[int, str]]:
        common_inputs = OrderedDict(
@@ -155,29 +153,30 @@ class T5OnnxConfig(OnnxConfigWithPast):
        )
        if self.use_past:
-            for i in range(self._config.num_layers):
+            for i in range(0, self._config.num_layers):
-                common_inputs[f"past_key_values.{i}.decoder.0"] = ({0: "batch", 2: "past_sequence"},)
+                common_inputs[f"past_key_values.{i}.decoder.key"] = {0: "batch", 2: "past_sequence"}
-                common_inputs[f"past_key_values.{i}.decoder.1"] = ({0: "batch", 2: "past_sequence"},)
+                common_inputs[f"past_key_values.{i}.decoder.value"] = {0: "batch", 2: "past_sequence"}
-                common_inputs[f"past_key_values.{i}.encoder.0"] = ({0: "batch", 2: "past_sequence"},)
+                common_inputs[f"past_key_values.{i}.encoder.key"] = {0: "batch", 2: "past_sequence"}
-                common_inputs[f"past_key_values.{i}.encoder.1"] = ({0: "batch", 2: "past_sequence"},)
+                common_inputs[f"past_key_values.{i}.encoder.value"] = {0: "batch", 2: "past_sequence"}
        return common_inputs
    @property
    def outputs(self) -> Mapping[str, Mapping[int, str]]:
-        common_outputs = OrderedDict(
+        common_outputs = super().outputs
-            [
+
-                ("last_hidden_state", {0: "batch", 1: "decoder_sequence"}),
+        if "last_hidden_state" in common_outputs:
-                ("encoder_last_hidden_state", {0: "batch", 2: "encoder_sequence"}),
+            common_outputs["last_hidden_state"] = {0: "batch", 1: "decoder_sequence"}
            ]
        )
        if self.use_past:
            for i in range(self._config.num_layers):
-                common_outputs[f"past_key_values.{i}.decoder.0"] = ({0: "batch", 2: "decoder_sequence"},)
+                common_outputs[f"present.{i}.decoder.key"] = {0: "batch", 2: "decoder_sequence"}
-                common_outputs[f"past_key_values.{i}.decoder.1"] = ({0: "batch", 2: "decoder_sequence"},)
+                common_outputs[f"present.{i}.decoder.value"] = {0: "batch", 2: "decoder_sequence"}
-                common_outputs[f"past_key_values.{i}.encoder.0"] = ({0: "batch", 2: "encoder_sequence"},)
+                common_outputs[f"present.{i}.encoder.key"] = {0: "batch", 2: "encoder_sequence"}
-                common_outputs[f"past_key_values.{i}.encoder.1"] = ({0: "batch", 2: "encoder_sequence"},)
+                common_outputs[f"present.{i}.encoder.value"] = {0: "batch", 2: "encoder_sequence"}
        if self.task == "default":
            common_outputs["encoder_last_hidden_state"] = {0: "batch", 2: "encoder_sequence"}
        return common_outputs
@@ -189,8 +188,6 @@ class T5OnnxConfig(OnnxConfigWithPast):
        is_pair: bool = False,
        framework: Optional[TensorType] = None,
    ) -> Mapping[str, Any]:
        if self.use_past:
            raise NotImplementedError()
        # Generate encoder inputs
        encoder_inputs = super().generate_dummy_inputs(tokenizer, batch_size, seq_length, is_pair, framework)
@@ -199,4 +196,45 @@ class T5OnnxConfig(OnnxConfigWithPast):
        decoder_inputs = super().generate_dummy_inputs(tokenizer, batch_size, 1, is_pair, framework)
        decoder_inputs = {f"decoder_{name}": tensor for name, tensor in decoder_inputs.items()}
-        return dict(**encoder_inputs, **decoder_inputs)
+        ordered_inputs = dict(**encoder_inputs, **decoder_inputs)
        if self.use_past:
            if not is_torch_available():
                raise ValueError("Cannot generate dummy past_keys inputs without PyTorch installed.")
            else:
                import torch
            batch = encoder_inputs["input_ids"].shape[0]
            encoder_seq_length = encoder_inputs["input_ids"].shape[1]
            encoder_shape = (
                batch,
                self._config.num_heads,
                encoder_seq_length,
                self._config.hidden_size // self._config.num_heads,
            )
            decoder_shape = (batch, self._config.num_heads, 1, self._config.hidden_size // self._config.num_heads)
            ordered_inputs["past_key_values"] = []
            for _ in range(self._config.num_layers):
                ordered_inputs["past_key_values"].append(
                    (
                        torch.zeros(decoder_shape),
                        torch.zeros(decoder_shape),
                        torch.zeros(encoder_shape),
                        torch.zeros(encoder_shape),
                    )
                )
        return ordered_inputs
    @staticmethod
    def flatten_output_collection_property(name: str, field: Iterable[Any]) -> Dict[str, Any]:
        if name in ["present", "past_key_values"]:
            flatten_output = {}
            for idx, t in enumerate(field):
                flatten_output[f"{name}.{idx}.decoder.key"] = t[0]
                flatten_output[f"{name}.{idx}.decoder.value"] = t[1]
                flatten_output[f"{name}.{idx}.encoder.key"] = t[2]
                flatten_output[f"{name}.{idx}.encoder.value"] = t[3]
            return flatten_output
        return super().flatten_output_collection_property(name, field)
--- a/src/transformers/models/t5/modeling_t5.py
+++ b/src/transformers/models/t5/modeling_t5.py
@@ -426,8 +426,6 @@ class T5Attention(nn.Module):
        # past_key_value[0] is (batch_size, n_heads, q_len - 1, dim_per_head)
        batch_size, seq_length = hidden_states.shape[:2]
        int_seq_length = int(seq_length)
        real_seq_length = seq_length
        if past_key_value is not None:
@@ -496,7 +494,7 @@ class T5Attention(nn.Module):
            # if key and values are already calculated
            # we want only the last query position bias
            if past_key_value is not None:
-                position_bias = position_bias[:, :, -int_seq_length:, :]
+                position_bias = position_bias[:, :, -hidden_states.size(1) :, :]
            if mask is not None:
                position_bias = position_bias + mask  # (batch_size, n_heads, seq_length, key_length)
@@ -626,7 +624,7 @@ class T5Block(nn.Module):
            if len(past_key_value) != expected_num_past_key_values:
                raise ValueError(
                    f"There should be {expected_num_past_key_values} past states. "
-                    f"{'2 (past / key) for cross attention' if expected_num_past_key_values == 4 else ''}."
+                    f"{'2 (past / key) for cross attention. ' if expected_num_past_key_values == 4 else ''}"
                    f"Got {len(past_key_value)} past key / value states"
                )
--- a/src/transformers/onnx/init.py
+++ b/src/transformers/onnx/init.py
@@ -13,6 +13,6 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
-from .config import EXTERNAL_DATA_FORMAT_SIZE_LIMIT, OnnxConfig, OnnxConfigWithPast
+from .config import EXTERNAL_DATA_FORMAT_SIZE_LIMIT, OnnxConfig, OnnxConfigWithPast, PatchingSpec
 from .convert import export, validate_model_outputs
 from .utils import ParameterFormat, compute_serialized_parameters_size
--- a/src/transformers/onnx/main.py
+++ b/src/transformers/onnx/main.py
@@ -14,101 +14,22 @@
 from argparse import ArgumentParser
 from pathlib import Path
 from typing import Callable, Tuple
 from transformers.models.albert import AlbertOnnxConfig
 from transformers.models.auto import AutoTokenizer
 from transformers.models.bart import BartOnnxConfig
 from transformers.models.bert import BertOnnxConfig
 from transformers.models.distilbert import DistilBertOnnxConfig
 from transformers.models.gpt2 import GPT2OnnxConfig
 from transformers.models.longformer import LongformerOnnxConfig
 from transformers.models.roberta import RobertaOnnxConfig
 from transformers.models.t5 import T5OnnxConfig
 from transformers.models.xlm_roberta import XLMRobertaOnnxConfig
 from .. import is_torch_available
 from ..utils import logging
 from .convert import export, validate_model_outputs
-
+from .features import FeaturesManager
 if is_torch_available():
    from transformers import AutoModel, PreTrainedModel
    FEATURES_TO_AUTOMODELS = {
        "default": AutoModel,
    }
 # Set of model topologies we support associated to the features supported by each topology and the factory
 SUPPORTED_MODEL_KIND = {
    "albert": {"default": AlbertOnnxConfig.default},
    "bart": {"default": BartOnnxConfig.default},
    "bert": {"default": BertOnnxConfig.default},
    "distilbert": {"default": DistilBertOnnxConfig.default},
    "gpt2": {"default": GPT2OnnxConfig.default},
    "longformer": {"default": LongformerOnnxConfig.default},
    "roberta": {"default": RobertaOnnxConfig},
    "t5": {"default": T5OnnxConfig.default},
    "xlm-roberta": {"default": XLMRobertaOnnxConfig.default},
 }
 def get_model_from_features(features: str, model: str):
    """
    Attempt to retrieve a model from a model's name and the features to be enabled.
    Args:
        features: The features required
        model: The name of the model to export
    Returns:
    """
    if features not in FEATURES_TO_AUTOMODELS:
        raise KeyError(f"Unknown feature: {features}." f"Possible values are {list(FEATURES_TO_AUTOMODELS.values())}")
    return FEATURES_TO_AUTOMODELS[features].from_pretrained(model)
 def check_supported_model_or_raise(model: PreTrainedModel, features: str = "default") -> Tuple[str, Callable]:
    """
    Check whether or not the model has the requested features
    Args:
        model: The model to export
        features: The name of the features to check if they are avaiable
    Returns:
        (str) The type of the model (OnnxConfig) The OnnxConfig instance holding the model export properties
    """
    if model.config.model_type not in SUPPORTED_MODEL_KIND:
        raise KeyError(
            f"{model.config.model_type} ({model.name}) is not supported yet. "
            f"Only {SUPPORTED_MODEL_KIND} are supported. "
            f"If you want to support ({model.config.model_type}) please propose a PR or open up an issue."
        )
    # Look for the features
    model_features = SUPPORTED_MODEL_KIND[model.config.model_type]
    if features not in model_features:
        raise ValueError(
            f"{model.config.model_type} doesn't support features {features}. "
            f"Supported values are: {list(model_features.keys())}"
        )
    return model.config.model_type, SUPPORTED_MODEL_KIND[model.config.model_type][features]
 def main():
    parser = ArgumentParser("Hugging Face ONNX Exporter tool")
    parser.add_argument("-m", "--model", type=str, required=True, help="Model's name of path on disk to load.")
    parser.add_argument(
-        "--features",
+        "--feature",
-        choices=["default"],
+        choices=list(FeaturesManager.AVAILABLE_FEATURES),
        default="default",
-        help="Export the model with some additional features.",
+        help="Export the model with some additional feature.",
    )
    parser.add_argument(
        "--opset", type=int, default=12, help="ONNX opset version to export the model with (default 12)."
@@ -127,8 +48,8 @@ def main():
    # Allocate the model
    tokenizer = AutoTokenizer.from_pretrained(args.model)
-    model = get_model_from_features(args.features, args.model)
+    model = FeaturesManager.get_model_from_feature(args.feature, args.model)
-    model_kind, model_onnx_config = check_supported_model_or_raise(model, features=args.features)
+    model_kind, model_onnx_config = FeaturesManager.check_supported_model_or_raise(model, feature=args.feature)
    onnx_config = model_onnx_config(model.config)
    # Ensure the requested opset is sufficient
--- a/src/transformers/onnx/config.py
+++ b/src/transformers/onnx/config.py
@@ -11,9 +11,10 @@
 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 # See the License for the specific language governing permissions and
 # limitations under the License.
 import dataclasses
 from abc import ABC, abstractmethod
 from collections import OrderedDict
-from typing import Any, Mapping, Optional
+from typing import Any, Callable, Dict, Iterable, List, Mapping, Optional
 from transformers import PretrainedConfig, PreTrainedTokenizer, TensorType
@@ -26,6 +27,27 @@ DEFAULT_ONNX_OPSET = 11
 EXTERNAL_DATA_FORMAT_SIZE_LIMIT = 2 * 1024 * 1024 * 1024
@dataclasses.dataclass
 class PatchingSpec:
    """
    Data class that holds patching specifications.
    Args:
        o: Module / object where the op to patch is located
        name: Name of the op to monkey patch
        custom_op: Custom op that patches the original op
        orig_op: Original op that is being patched
        op_wrapper: Wrapper (optional) that wraps both the original and custom ops.
            It is useful for ops that are class or static methods for instance.
    """
    o: Any
    name: str
    custom_op: Callable
    orig_op: Optional[Callable] = None
    op_wrapper: Optional[Callable] = None
 class OnnxConfig(ABC):
    """
    Base class for ONNX exportable model describing metadata on how to export the model through the ONNX format.
@@ -34,11 +56,39 @@ class OnnxConfig(ABC):
    DEFAULT_FIXED_BATCH = 2
    DEFAULT_FIXED_SEQUENCE = 8
-    def __init__(self, config: PretrainedConfig):
+    _TASKS_TO_COMMON_OUTPUTS = {
        "default": OrderedDict({"last_hidden_state": {0: "batch", 1: "sequence"}}),
        "causal-lm": OrderedDict({"logits": {0: "batch", 1: "sequence"}}),
        "seq2seq-lm": OrderedDict({"logits": {0: "batch", 1: "decoder_sequence"}}),
        "sequence-classification": OrderedDict({"logits": {0: "batch"}}),
        "token-classification": OrderedDict({"logits": {0: "batch", 1: "sequence"}}),
        "multiple-choice": OrderedDict({"logits": {0: "batch"}}),
        "question-answering": OrderedDict(
            {
                "start_logits": {0: "batch", 1: "sequence"},
                "end_logits": {0: "batch", 1: "sequence"},
            }
        ),
    }
    def __init__(self, config: PretrainedConfig, task: str = "default", patching_specs: List[PatchingSpec] = None):
        self._config = config
        if task not in self._TASKS_TO_COMMON_OUTPUTS:
            raise ValueError(
                f"{task} is not a supported task, supported tasks: {self._TASKS_TO_COMMON_OUTPUTS.keys()}"
            )
        self.task = task
        self._patching_specs = []
        for spec in patching_specs if patching_specs is not None else []:
            final_spec = spec
            if spec.orig_op is None:
                final_spec = dataclasses.replace(spec, orig_op=getattr(spec.o, spec.name))
            self._patching_specs.append(final_spec)
    @classmethod
-    def default(cls, config: PretrainedConfig) -> "OnnxConfig":
+    def from_model_config(cls, config: PretrainedConfig, task: str = "default") -> "OnnxConfig":
        """
        Instantiate a OnnxConfig for a specific model
@@ -48,7 +98,7 @@ class OnnxConfig(ABC):
        Returns:
            OnnxConfig for this model
        """
-        return cls(config)
+        return cls(config, task=task)
    @property
    @abstractmethod
@@ -62,7 +112,6 @@ class OnnxConfig(ABC):
        raise NotImplementedError()
    @property
    @abstractmethod
    def outputs(self) -> Mapping[str, Mapping[int, str]]:
        """
        Mapping containing the axis definition of the output tensors to provide to the model
@@ -70,7 +119,7 @@ class OnnxConfig(ABC):
        Returns:
            For each output: its name associated to the axes symbolic name and the axis position within the tensor
        """
-        raise NotImplementedError()
+        return self._TASKS_TO_COMMON_OUTPUTS[self.task]
    @property
    def values_override(self) -> Optional[Mapping[str, Any]]:
@@ -170,14 +219,48 @@ class OnnxConfig(ABC):
        dummy_input = [" ".join([tokenizer.unk_token]) * seq_length] * batch_size
        return dict(tokenizer(dummy_input, return_tensors=framework))
    def patch_ops(self):
        for spec in self._patching_specs:
            custom_op = spec.custom_op if spec.op_wrapper is None else spec.op_wrapper(spec.custom_op)
            setattr(spec.o, spec.name, custom_op)
    def restore_ops(self):
        for spec in self._patching_specs:
            orig_op = spec.orig_op if spec.op_wrapper is None else spec.op_wrapper(spec.orig_op)
            setattr(spec.o, spec.name, orig_op)
    @staticmethod
    def flatten_output_collection_property(name: str, field: Iterable[Any]) -> Dict[str, Any]:
        """
        Flatten any potential nested structure expanding the name of the field with the index of the element within the
        structure.
        Args:
            name: The name of the nested structure
            field: The structure to, potentially, be flattened
        Returns:
            (Dict[str, Any]): Outputs with flattened structure and key mapping this new structure.
        """
        from itertools import chain
        return {f"{name}.{idx}": item for idx, item in enumerate(chain.from_iterable(field))}
 class OnnxConfigWithPast(OnnxConfig, ABC):
-    def __init__(self, config: PretrainedConfig, use_past: bool = False):
+    def __init__(
-        super().__init__(config)
+        self,
        config: PretrainedConfig,
        task: str = "default",
        patching_specs: List[PatchingSpec] = None,
        use_past: bool = False,
    ):
        super().__init__(config, task=task, patching_specs=patching_specs)
        self.use_past = use_past
    @classmethod
-    def with_past(cls, config: PretrainedConfig) -> "OnnxConfigWithPast":
+    def with_past(cls, config: PretrainedConfig, task: str = "default") -> "OnnxConfigWithPast":
        """
        Instantiate a OnnxConfig with `use_past` attribute set to True
@@ -187,7 +270,7 @@ class OnnxConfigWithPast(OnnxConfig, ABC):
        Returns:
            OnnxConfig with `.use_past = True`
        """
-        return cls(config, use_past=True)
+        return cls(config, task=task, use_past=True)
    @property
    def values_override(self) -> Optional[Mapping[str, Any]]:
@@ -221,3 +304,15 @@ class OnnxConfigWithPast(OnnxConfig, ABC):
        # Generate dummy inputs according to compute batch and sequence
        dummy_input = [" ".join([tokenizer.unk_token]) * seq_length] * batch_size
        return OrderedDict(dict(tokenizer(dummy_input, return_tensors=framework)))
    @staticmethod
    def flatten_output_collection_property(name: str, field: Iterable[Any]) -> Dict[str, Any]:
        if name in ["present", "past_key_values"]:
            flatten_output = {}
            for idx, t in enumerate(field):
                flatten_output[f"{name}.{idx}.key"] = t[0]
                flatten_output[f"{name}.{idx}.value"] = t[1]
            return flatten_output
        return super().flatten_output_collection_property(name, field)
--- a/src/transformers/onnx/convert.py
+++ b/src/transformers/onnx/convert.py
@@ -21,9 +21,9 @@ import numpy as np
 from packaging.version import Version, parse
 from .. import PreTrainedModel, PreTrainedTokenizer, TensorType, TFPreTrainedModel, is_torch_available
 from ..file_utils import is_torch_onnx_dict_inputs_support_available
 from ..utils import logging
 from .config import OnnxConfig
 from .utils import flatten_output_collection_property
 logger = logging.get_logger(__name__)  # pylint: disable=invalid-name
@@ -79,11 +79,16 @@ def export(
    """
    if not is_torch_available():
-        raise Exception("Cannot convert because PyTorch is not installed. Please install torch first.")
+        raise ImportError("Cannot convert because PyTorch is not installed. Please install torch first.")
    import torch
    from torch.onnx import export
    from ..file_utils import torch_version
    if not is_torch_onnx_dict_inputs_support_available():
        raise AssertionError(f"Unsupported PyTorch version, minimum required is 1.8.0, got: {torch_version}")
    logger.info(f"Using framework PyTorch: {torch.__version__}")
    torch.set_grad_enabled(False)
    model.config.return_dict = True
@@ -105,6 +110,8 @@ def export(
    if not inputs_match:
        raise ValueError("Model and config inputs doesn't match")
    config.patch_ops()
    # export can works with named args but the dict containing named args as to be last element of the args tuple
    export(
        model,
@@ -119,6 +126,8 @@ def export(
        opset_version=opset,
    )
    config.restore_ops()
    return matched_inputs, onnx_outputs
@@ -134,6 +143,8 @@ def validate_model_outputs(
    logger.info("Validating ONNX model...")
    # TODO: generate inputs with a different batch_size and seq_len that was used for conversion to properly test
    # dynamic input shapes.
    reference_model_inputs = config.generate_dummy_inputs(tokenizer, framework=TensorType.PYTORCH)
    # Create ONNX Runtime session
@@ -146,8 +157,12 @@ def validate_model_outputs(
    # We flatten potential collection of outputs (i.e. past_keys) to a flat structure
    for name, value in ref_outputs.items():
        # Overwriting the output name as "present" since it is the name used for the ONNX ouputs
        # ("past_key_values" being taken for the ONNX inputs)
        if name == "past_key_values":
            name = "present"
        if isinstance(value, (list, tuple)):
-            value = flatten_output_collection_property(name, value)
+            value = config.flatten_output_collection_property(name, value)
            ref_outputs_dict.update(value)
        else:
            ref_outputs_dict[name] = value
@@ -156,7 +171,7 @@ def validate_model_outputs(
    onnx_inputs = {}
    for name, value in reference_model_inputs.items():
        if isinstance(value, (list, tuple)):
-            value = flatten_output_collection_property(name, value)
+            value = config.flatten_output_collection_property(name, value)
            onnx_inputs.update({tensor_name: pt_tensor.numpy() for tensor_name, pt_tensor in value.items()})
        else:
            onnx_inputs[name] = value.numpy()
@@ -180,7 +195,7 @@ def validate_model_outputs(
    # Check the shape and values match
    for name, ort_value in zip(onnx_named_outputs, onnx_outputs):
-        ref_value = ref_outputs_dict[name].numpy()
+        ref_value = ref_outputs_dict[name].detach().numpy()
        logger.info(f'\t- Validating ONNX Model output "{name}":')
        # Shape
@@ -191,7 +206,7 @@ def validate_model_outputs(
                f"Got {ref_value.shape} (reference) and {ort_value.shape} (ONNX)"
            )
        else:
-            logger.info(f"\t\t-[✓] {ort_value.shape} matchs {ref_value.shape}")
+            logger.info(f"\t\t-[✓] {ort_value.shape} matches {ref_value.shape}")
        # Values
        if not np.allclose(ref_value, ort_value, atol=atol):
--- a/src/transformers/onnx/features.py
+++ b/src/transformers/onnx/features.py
@@ -0,0 +1,141 @@
 from functools import partial, reduce
 from typing import Callable, Tuple
 from .. import is_torch_available
 from ..models.albert import AlbertOnnxConfig
 from ..models.bart import BartOnnxConfig
 from ..models.bert import BertOnnxConfig
 from ..models.distilbert import DistilBertOnnxConfig
 from ..models.gpt2 import GPT2OnnxConfig
 from ..models.gpt_neo import GPTNeoOnnxConfig
 from ..models.longformer import LongformerOnnxConfig
 from ..models.mbart import MBartOnnxConfig
 from ..models.roberta import RobertaOnnxConfig
 from ..models.t5 import T5OnnxConfig
 from ..models.xlm_roberta import XLMRobertaOnnxConfig
 if is_torch_available():
    from transformers import PreTrainedModel
    from transformers.models.auto import (
        AutoModel,
        AutoModelForCausalLM,
        AutoModelForMultipleChoice,
        AutoModelForQuestionAnswering,
        AutoModelForSeq2SeqLM,
        AutoModelForSequenceClassification,
        AutoModelForTokenClassification,
    )
 def supported_features_mapping(*supported_features, onnx_config_cls=None):
    """Generates the mapping between supported features and their corresponding OnnxConfig."""
    if onnx_config_cls is None:
        raise ValueError("A OnnxConfig class must be provided")
    mapping = {}
    for feature in supported_features:
        if "-with-past" in feature:
            task = feature.replace("-with-past", "")
            mapping[feature] = partial(onnx_config_cls.with_past, task=task)
        else:
            mapping[feature] = partial(onnx_config_cls.from_model_config, task=feature)
    return mapping
 class FeaturesManager:
    _TASKS_TO_AUTOMODELS = {
        "default": AutoModel,
        "causal-lm": AutoModelForCausalLM,
        "seq2seq-lm": AutoModelForSeq2SeqLM,
        "sequence-classification": AutoModelForSequenceClassification,
        "token-classification": AutoModelForTokenClassification,
        "multiple-choice": AutoModelForMultipleChoice,
        "question-answering": AutoModelForQuestionAnswering,
    }
    # Set of model topologies we support associated to the features supported by each topology and the factory
    _SUPPORTED_MODEL_KIND = {
        "albert": supported_features_mapping("default", onnx_config_cls=AlbertOnnxConfig),
        "bart": supported_features_mapping("default", onnx_config_cls=BartOnnxConfig),
        "mbart": supported_features_mapping("default", onnx_config_cls=MBartOnnxConfig),
        "bert": supported_features_mapping("default", onnx_config_cls=BertOnnxConfig),
        "distilbert": supported_features_mapping("default", onnx_config_cls=DistilBertOnnxConfig),
        "gpt2": supported_features_mapping("default", onnx_config_cls=GPT2OnnxConfig),
        "longformer": supported_features_mapping("default", onnx_config_cls=LongformerOnnxConfig),
        "roberta": supported_features_mapping("default", onnx_config_cls=RobertaOnnxConfig),
        "t5": supported_features_mapping(
            "default", "default-with-past", "seq2seq-lm", "seq2seq-lm-with-past", onnx_config_cls=T5OnnxConfig
        ),
        "xlm-roberta": supported_features_mapping("default", onnx_config_cls=XLMRobertaOnnxConfig),
        "gpt-neo": supported_features_mapping(
            "default",
            "causal-lm",
            "sequence-classification",
            "default-with-past",
            "causal-lm-with-past",
            "sequence-classification-with-past",
            onnx_config_cls=GPTNeoOnnxConfig,
        ),
    }
    AVAILABLE_FEATURES = sorted(reduce(lambda s1, s2: s1 | s2, (v.keys() for v in _SUPPORTED_MODEL_KIND.values())))
    @staticmethod
    def feature_to_task(feature: str) -> str:
        return feature.replace("-with-past", "")
    @staticmethod
    def get_model_from_feature(feature: str, model: str):
        """
        Attempt to retrieve a model from a model's name and the feature to be enabled.
        Args:
            feature: The feature required
            model: The name of the model to export
        Returns:
        """
        task = FeaturesManager.feature_to_task(feature)
        if task not in FeaturesManager._TASKS_TO_AUTOMODELS:
            raise KeyError(
                f"Unknown task: {feature}."
                f"Possible values are {list(FeaturesManager._TASKS_TO_AUTOMODELS.values())}"
            )
        return FeaturesManager._TASKS_TO_AUTOMODELS[task].from_pretrained(model)
    @staticmethod
    def check_supported_model_or_raise(model: PreTrainedModel, feature: str = "default") -> Tuple[str, Callable]:
        """
        Check whether or not the model has the requested features
        Args:
            model: The model to export
            feature: The name of the feature to check if it is avaiable
        Returns:
            (str) The type of the model (OnnxConfig) The OnnxConfig instance holding the model export properties
        """
        model_type = model.config.model_type.replace("_", "-")
        model_name = getattr(model, "name", "")
        model_name = f"({model_name})" if model_name else ""
        if model_type not in FeaturesManager._SUPPORTED_MODEL_KIND:
            raise KeyError(
                f"{model.config.model_type} ({model_name}) is not supported yet. "
                f"Only {FeaturesManager._SUPPORTED_MODEL_KIND} are supported. "
                f"If you want to support ({model.config.model_type}) please propose a PR or open up an issue."
            )
        # Look for the features
        model_features = FeaturesManager._SUPPORTED_MODEL_KIND[model_type]
        if feature not in model_features:
            raise ValueError(
                f"{model.config.model_type} doesn't support feature {feature}. "
                f"Supported values are: {list(model_features.keys())}"
            )
        return model.config.model_type, FeaturesManager._SUPPORTED_MODEL_KIND[model_type][feature]
--- a/src/transformers/onnx/utils.py
+++ b/src/transformers/onnx/utils.py
@@ -14,7 +14,6 @@
 from ctypes import c_float, sizeof
 from enum import Enum
 from typing import Any, Dict, Iterable
 class ParameterFormat(Enum):
@@ -62,21 +61,3 @@ def compute_serialized_parameters_size(num_parameters: int, dtype: ParameterForm
        Size (in byte) taken to save all the parameters
    """
    return num_parameters * dtype.size
 def flatten_output_collection_property(name: str, field: Iterable[Any]) -> Dict[str, Any]:
    """
    Flatten any potential nested structure expanding the name of the field with the index of the element within the
    structure.
    Args:
        name: The name of the nested structure
        field: The structure to, potentially, be flattened
    Returns:
        (Dict[str, Any]): Outputs with flattened structure and key mapping this new structure.
    """
    from itertools import chain
    return {f"{name}.{idx}": item for idx, item in enumerate(chain.from_iterable(field))}
--- a/src/transformers/testing_utils.py
+++ b/src/transformers/testing_utils.py
@@ -25,6 +25,7 @@ from distutils.util import strtobool
 from io import StringIO
 from pathlib import Path
 from typing import Iterator, Union
 from unittest import mock
 from transformers import logging as transformers_logging
@@ -1007,7 +1008,7 @@ def mockenv(**kwargs):
        use_tf = os.getenv("USE_TF", False)
    """
-    return unittest.mock.patch.dict(os.environ, kwargs)
+    return mock.patch.dict(os.environ, kwargs)
 # from https://stackoverflow.com/a/34333710/9201239
--- a/src/transformers/trainer.py
+++ b/src/transformers/trainer.py
@@ -364,7 +364,7 @@ class Trainer:
        self.tokenizer = tokenizer
        if self.place_model_on_device:
-            model = model.to(args.device)
+            self._move_model_to_device(model, args.device)
        # Force n_gpu to 1 to avoid DataParallel as MP will manage the GPUs
        if self.is_model_parallel:
@@ -505,6 +505,12 @@ class Trainer:
        """
        self.callback_handler.remove_callback(callback)
    def _move_model_to_device(self, model, device):
        model = model.to(device)
        # Moving a model to an XLA device disconnects the tied weights, so we have to retie them.
        if self.args.parallel_mode == ParallelMode.TPU and hasattr(model, "tie_weights"):
            model.tie_weights()
    def _remove_unused_columns(self, dataset: "datasets.Dataset", description: Optional[str] = None):
        if not self.args.remove_unused_columns:
            return dataset
@@ -1016,7 +1022,7 @@ class Trainer:
        # do_train is not a reliable argument, as it might not be set and .train() still called, so
        # the following is a workaround:
        if args.fp16_full_eval and not args.do_train:
-            self.model = self.model.to(args.device)
+            self._move_model_to_device(self.model, args.device)
        if "model_path" in kwargs:
            resume_from_checkpoint = kwargs.pop("model_path")
@@ -1077,7 +1083,7 @@ class Trainer:
        # If model was re-initialized, put it on the right device and update self.model_wrapped
        if model_reloaded:
            if self.place_model_on_device:
-                self.model = self.model.to(args.device)
+                self._move_model_to_device(self.model, args.device)
            self.model_wrapped = self.model
        # Keeping track whether we can can len() on the dataset or not
@@ -2515,10 +2521,11 @@ class Trainer:
        Returns:
            The url of the commit of your model in the given repository.
        """
        if not self.args.should_save:
            return
        if self.args.should_save:
            self.create_model_card(model_name=self.args.push_to_hub_model_id, **kwargs)
        # Needs to be executed on all processes for TPU training, but will only save on the processed determined by
        # self.args.should_save.
        self.save_model()
        # Only push from one node.
--- a/src/transformers/training_args.py
+++ b/src/transformers/training_args.py
@@ -1052,6 +1052,8 @@ class TrainingArguments:
                    logger.debug(f"{self.process_index}: waiting for the {main_process_desc} to perform {desc}")
                    if is_torch_tpu_available():
                        xm.rendezvous(desc)
                    elif is_sagemaker_dp_enabled():
                        sm_dist.Barrier()
                    else:
                        torch.distributed.barrier()
                yield
@@ -1061,6 +1063,8 @@ class TrainingArguments:
                    logger.debug(f"{self.process_index}: {main_process_desc} completed {desc}, releasing all replicas")
                    if is_torch_tpu_available():
                        xm.rendezvous(desc)
                    elif is_sagemaker_dp_enabled():
                        sm_dist.Barrier()
                    else:
                        torch.distributed.barrier()
        else:
--- a/tests/test_onnx_v2.py
+++ b/tests/test_onnx_v2.py
@@ -9,6 +9,8 @@ from transformers import (  # LongformerConfig,; T5Config,
    BartConfig,
    DistilBertConfig,
    GPT2Config,
    GPTNeoConfig,
    MBartConfig,
    RobertaConfig,
    XLMRobertaConfig,
    is_torch_available,
@@ -20,17 +22,21 @@ from transformers.models.distilbert import DistilBertOnnxConfig
 # from transformers.models.longformer import LongformerOnnxConfig
 from transformers.models.gpt2 import GPT2OnnxConfig
 from transformers.models.gpt_neo import GPTNeoOnnxConfig
 from transformers.models.mbart import MBartOnnxConfig
 from transformers.models.roberta import RobertaOnnxConfig
 # from transformers.models.t5 import T5OnnxConfig
 from transformers.models.xlm_roberta import XLMRobertaOnnxConfig
-from transformers.onnx import EXTERNAL_DATA_FORMAT_SIZE_LIMIT, OnnxConfig, ParameterFormat, validate_model_outputs
+from transformers.onnx import (
-from transformers.onnx.config import DEFAULT_ONNX_OPSET, OnnxConfigWithPast
+    EXTERNAL_DATA_FORMAT_SIZE_LIMIT,
-from transformers.onnx.utils import (
+    OnnxConfig,
-    compute_effective_axis_dimension,
+    ParameterFormat,
-    compute_serialized_parameters_size,
+    export,
-    flatten_output_collection_property,
+    validate_model_outputs,
 )
 from transformers.onnx.config import DEFAULT_ONNX_OPSET, OnnxConfigWithPast
 from transformers.onnx.utils import compute_effective_axis_dimension, compute_serialized_parameters_size
 from transformers.testing_utils import require_onnx, require_torch, slow
@@ -40,6 +46,15 @@ class OnnxUtilsTestCaseV2(TestCase):
    Cover all the utilities involved to export ONNX models
    """
    @require_torch
    @patch("transformers.onnx.convert.is_torch_onnx_dict_inputs_support_available", return_value=False)
    def test_ensure_pytorch_version_ge_1_8_0(self, mock_is_torch_onnx_dict_inputs_support_available):
        """
        Ensure we raise an Exception if the pytorch version is unsupported (< 1.8.0)
        """
        self.assertRaises(AssertionError, export, None, None, None, None, None)
        mock_is_torch_onnx_dict_inputs_support_available.assert_called()
    def test_compute_effective_axis_dimension(self):
        """
        When exporting ONNX model with dynamic axis (batch or sequence) we set batch_size and/or sequence_length = -1.
@@ -78,7 +93,7 @@ class OnnxUtilsTestCaseV2(TestCase):
        ONNX exporter will export nested collections as ${collection_name}.${level_idx_0}.${level_idx_1}...${idx_n}
        """
        self.assertEqual(
-            flatten_output_collection_property("past_key", [[0], [1], [2]]),
+            OnnxConfig.flatten_output_collection_property("past_key", [[0], [1], [2]]),
            {
                "past_key.0": 0,
                "past_key.1": 1,
@@ -136,11 +151,13 @@ class OnnxConfigWithPastTestCaseV2(TestCase):
        for name, config in OnnxConfigWithPastTestCaseV2.SUPPORTED_WITH_PAST_CONFIGS:
            with self.subTest(name):
                self.assertFalse(
-                    OnnxConfigWithPast.default(config()).use_past, "OnnxConfigWithPast.default() should not use_past"
+                    OnnxConfigWithPast.from_model_config(config()).use_past,
                    "OnnxConfigWithPast.from_model_config() should not use_past",
                )
                self.assertTrue(
-                    OnnxConfigWithPast.with_past(config()).use_past, "OnnxConfigWithPast.default() should use_past"
+                    OnnxConfigWithPast.with_past(config()).use_past,
                    "OnnxConfigWithPast.from_model_config() should use_past",
                )
    @patch.multiple(OnnxConfigWithPast, __abstractmethods__=set())
@@ -152,7 +169,7 @@ class OnnxConfigWithPastTestCaseV2(TestCase):
            with self.subTest(name):
                # without past
-                onnx_config_default = OnnxConfigWithPast.default(config())
+                onnx_config_default = OnnxConfigWithPast.from_model_config(config())
                self.assertIsNotNone(onnx_config_default.values_override, "values_override should not be None")
                self.assertIn("use_cache", onnx_config_default.values_override, "use_cache should be present")
                self.assertFalse(
@@ -175,19 +192,23 @@ if is_torch_available():
        BertModel,
        DistilBertModel,
        GPT2Model,
        GPTNeoModel,
        MBartModel,
        RobertaModel,
        XLMRobertaModel,
    )
    PYTORCH_EXPORT_DEFAULT_MODELS = {
-        ("ALBERT", "albert-base-v2", AlbertModel, AlbertConfig, AlbertOnnxConfig),
+        ("ALBERT", "hf-internal-testing/tiny-albert", AlbertModel, AlbertConfig, AlbertOnnxConfig),
        ("BART", "facebook/bart-base", BartModel, BartConfig, BartOnnxConfig),
        ("BERT", "bert-base-cased", BertModel, BertConfig, BertOnnxConfig),
        ("DistilBERT", "distilbert-base-cased", DistilBertModel, DistilBertConfig, DistilBertOnnxConfig),
        ("GPT2", "gpt2", GPT2Model, GPT2Config, GPT2OnnxConfig),
        ("GPT-Neo", "EleutherAI/gpt-neo-125M", GPTNeoModel, GPTNeoConfig, GPTNeoOnnxConfig),
        # ("LongFormer", "longformer-base-4096", LongformerModel, LongformerConfig, LongformerOnnxConfig),
        ("Roberta", "roberta-base", RobertaModel, RobertaConfig, RobertaOnnxConfig),
        ("XLM-Roberta", "roberta-base", XLMRobertaModel, XLMRobertaConfig, XLMRobertaOnnxConfig),
        ("MBart", "sshleifer/tiny-mbart", MBartModel, MBartConfig, MBartOnnxConfig),
        # ("T5", "t5-small", T5Model, T5Config, T5OnnxConfig),
    }
@@ -210,11 +231,11 @@ class OnnxExportTestCaseV2(TestCase):
        for name, model, model_class, config_class, onnx_config_class in PYTORCH_EXPORT_DEFAULT_MODELS:
            with self.subTest(name):
-                self.assertTrue(hasattr(onnx_config_class, "default"))
+                self.assertTrue(hasattr(onnx_config_class, "from_model_config"))
                tokenizer = AutoTokenizer.from_pretrained(model)
-                model = model_class(config_class())
+                model = model_class(config_class.from_pretrained(model))
-                onnx_config = onnx_config_class.default(model.config)
+                onnx_config = onnx_config_class.from_model_config(model.config)
                with NamedTemporaryFile("w") as output:
                    onnx_inputs, onnx_outputs = export(
Author	SHA1	Message	Date
Lysandre	41981a25cd	Patch release: v4.9.2 Some checks failed Release - Conda / build_and_package (push) Has been cancelled Details	2021-08-09 16:01:36 +02:00
Sylvain Gugger	ec784223ea	Tpu tie weights (#13030 ) * Fix tied weights on TPU * Manually tie weights in no trainer examples * Fix for test * One last missing * Gettning owned by my scripts * Address review comments * Fix test * Fix tests * Fix reformer tests	2021-08-09 15:53:05 +02:00
Lysandre Debut	bfd53549b0	Add to ONNX docs (#13048 ) * Add to ONNX docs * Add MBART example * Update docs/source/serialization.rst Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com> Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>	2021-08-09 15:52:16 +02:00
Lysandre Debut	226763a262	Add MBART to models exportable with ONNX (#13049 ) * Add MBART to models exportable with ONNX * unittest mock * Add tests * Misc fixes	2021-08-09 15:52:07 +02:00
Lysandre Debut	f595ea33d9	Put smaller ALBERT model (#13028 )	2021-08-09 15:51:04 +02:00
Michael Benayoun	a12fa50693	T5 with past ONNX export (#13014 ) T5 with past ONNX export, and more explicit past_key_values inputs and outputs names for ONNX model Authored-by: Michael Benayoun <michael@huggingface.co>	2021-08-09 15:50:58 +02:00
Michael Benayoun	94b7db97bf	GPT-Neo ONNX export (#12911 ) GPT-Neo ONNX export and task / feature refactoring Authored-by: Michael Benayoun <michael@huggingface.co>	2021-08-09 15:50:44 +02:00
Sylvain Gugger	2c255a2e0c	Fix push_to_hub for TPUs (#12895 )	2021-08-09 15:47:29 +02:00
Funtowicz Morgan	ca272fc523	ONNX v2 raises an Exception when using PyTorch < 1.8.0 (#12933 ) * Raise an issue if the pytorch version is < 1.8.0 * Attempt to add a test to ensure it correctly raises. * Missing docstring. * Second attempt, patch with string absolute import. * Let's do the call before checking it was called ... * use the correct function ... 🤦 * Raise ImportError and AssertionError respectively when unable to find torch and torch version is not sufficient. * Correct path mock patching * relax constraint for torch_onnx_dict_inputs to ge instead of eq. * Style. * Split each version requirements for torch. * Let's compare version directly. * Import torch_version after checking pytorch is installed. * @require_torch	2021-08-09 15:43:31 +02:00
Sylvain Gugger	bff1c71e84	Release: v4.9.1 Some checks failed Release - Conda / build_and_package (push) Has been cancelled Details	2021-07-26 10:21:55 -04:00
Sylvain Gugger	8ee16d84ce	Fix barrier for SM distributed (#12853 )	2021-07-26 10:21:07 -04:00
Sylvain Gugger	6cab8b32e3	Add doc for v4.9.0	2021-07-26 10:20:51 -04:00