Release: v2.7.0

fix lm labels in docstring (#3529)
[T5] make decoder input ids optional for t5 training (#3521 )
2020-03-30 08:49:24 -04:00 · 2020-03-30 14:26:24 +02:00 · 2020-03-30 13:45:26 +02:00 · 2020-03-30 13:35:53 +02:00 · 2020-03-29 13:25:42 -04:00 · 2020-03-29 10:51:13 -04:00
72 changed files with 5180 additions and 957 deletions
--- a/.circleci/config.yml
+++ b/.circleci/config.yml
@@ -85,6 +85,8 @@ jobs:
        parallelism: 1
        steps:
            - checkout
+            # we need a version of isort with https://github.com/timothycrosley/isort/pull/1000
+            - run: sudo pip install git+git://github.com/timothycrosley/isort.git@e63ae06ec7d70b06df9e528357650281a3d3ec22#egg=isort
            - run: sudo pip install .[tf,torch,quality]
            - run: black --check --line-length 119 --target-version py35 examples templates tests src utils
            - run: isort --check-only --recursive examples templates tests src utils
--- a/docs/source/conf.py
+++ b/docs/source/conf.py
@@ -26,7 +26,7 @@ author = u'huggingface'
 # The short X.Y version
 version = u''
 # The full version, including alpha/beta/rc tags
-release = u'2.6.0'
+release = u'2.7.0'


 # -- General configuration ---------------------------------------------------
--- a/docs/source/index.rst
+++ b/docs/source/index.rst
@@ -103,3 +103,4 @@ The library currently contains PyTorch and Tensorflow implementations, pre-train
    model_doc/xlmroberta
    model_doc/flaubert
    model_doc/bart
+    model_doc/t5
--- a/docs/source/model_doc/t5.rst
+++ b/docs/source/model_doc/t5.rst
@@ -0,0 +1,101 @@
+T5
+----------------------------------------------------
+**DISCLAIMER:** This model is still a work in progress. If you see something strange,
+file a `Github Issue <https://github.com/huggingface/transformers/issues/new?assignees=&labels=&template=bug-report.md&title>`_.
+
+Overview
+~~~~~~~~
+The T5 model was presented in `Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer <https://arxiv.org/pdf/1910.10683.pdf>`_ by Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu.
+The abstract from the paper is the following:
+
+*Transfer learning, where a model is first pre-trained on a data-rich task before being fine-tuned on a downstream task, has emerged as a powerful technique in natural language processing (NLP). The effectiveness of transfer learning has given rise to a diversity of approaches, methodology, and practice. 
+In this paper, we explore the landscape of transfer learning techniques for NLP by introducing a unified framework that converts every language problem into a text-to-text format. 
+Our systematic study compares pre-training objectives, architectures, unlabeled datasets, transfer approaches, and other factors on dozens of language understanding tasks. 
+By combining the insights from our exploration with scale and our new "Colossal Clean Crawled Corpus", we achieve state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more. 
+To facilitate future work on transfer learning for NLP, we release our dataset, pre-trained models, and code.*
+
+The authors' code can be found `here <https://github.com/google-research/text-to-text-transfer-transformer>`_.
+
+Training
+~~~~~~~~~~~~~~~~~~~~
+T5 is an encoder-decoder model and converts all NLP problems into a text-to-text format. It is trained using teacher forcing.
+This means that for training we always need an input sequence and a target sequence. 
+The input sequence is fed to the model using ``input_ids``. The target sequence is shifted to the right, *i.e.* prepended by a start-sequence token, and fed to the decoder using ``decoder_input_ids``. In teacher-forcing style, the EOS token is then appended to the target sequence, and the result corresponds to the ``lm_labels``. The PAD token is used as the start-sequence token.
+T5 can be trained / fine-tuned in both a supervised and an unsupervised fashion.
+
+- Unsupervised denoising training
+  In this setup, spans of the input sequence are masked by so-called sentinel tokens (*a.k.a.* unique mask tokens)
+  and the output sequence is formed as a concatenation of the same sentinel tokens and the *real* masked tokens.
+  Each sentinel token represents a unique mask token for this sentence; they should start with ``<extra_id_0>``, ``<extra_id_1>``, ... up to ``<extra_id_99>``. By default, 100 sentinel tokens are available in ``T5Tokenizer``.
+  *E.g.* the sentence "The cute dog walks in the park" with the masks put on "cute dog" and "the" should be processed as follows:
+
+::
+
+  input_ids = tokenizer.encode('The <extra_id_0> walks in <extra_id_1> park', return_tensors='pt')
+  lm_labels = tokenizer.encode('<extra_id_0> cute dog <extra_id_1> the <extra_id_2> </s>', return_tensors='pt')
+  # the forward function automatically creates the correct decoder_input_ids
+  model(input_ids=input_ids, lm_labels=lm_labels)
+
+- Supervised training
+  In this setup the input sequence and output sequence form a standard sequence-to-sequence input-output mapping.
+  In translation, for instance, the input sequence "The house is wonderful." and output sequence "Das Haus ist wunderbar." should
+  be processed as follows:
+  
+::
+
+  input_ids = tokenizer.encode('translate English to German: The house is wonderful. </s>', return_tensors='pt')
+  lm_labels = tokenizer.encode('Das Haus ist wunderbar. </s>', return_tensors='pt')
+  # the forward function automatically creates the correct decoder_input_ids
+  model(input_ids=input_ids, lm_labels=lm_labels)
+
+Tips
+~~~~~~~~~~~~~~~~~~~~
+- T5 is an encoder-decoder model pre-trained on a multi-task mixture of unsupervised 
+  and supervised tasks and for which each task is converted into a text-to-text format.
+  T5 works well on a variety of tasks out-of-the-box by prepending a different prefix to the input corresponding to each task, e.g. *translate English to German: ...* for translation or *summarize: ...* for summarization.
+  For more information about which prefix to use, see Appendix D of the `paper <https://arxiv.org/pdf/1910.10683.pdf>`_.
+- For sequence to sequence generation, it is recommended to use ``T5ForConditionalGeneration.generate()``. The method takes care of feeding the encoded input via cross-attention layers to the decoder and auto-regressively generates the decoder output.
+- T5 uses relative scalar embeddings. Encoder input padding can be done on the left and on the right.
+
+
+T5Config
+~~~~~~~~~~~~~~~~~~~~~
+
+.. autoclass:: transformers.T5Config
+    :members:
+
+
+T5Tokenizer
+~~~~~~~~~~~~~~~~~~~~~
+
+.. autoclass:: transformers.T5Tokenizer
+    :members: build_inputs_with_special_tokens, get_special_tokens_mask,
+        create_token_type_ids_from_sequences, save_vocabulary
+
+
+T5Model
+~~~~~~~~~~~~~~~~~~~~
+
+.. autoclass:: transformers.T5Model
+    :members:
+
+
+T5ForConditionalGeneration
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. autoclass:: transformers.T5ForConditionalGeneration
+    :members:
+
+
+TFT5Model
+~~~~~~~~~~~~~~~~~~~~
+
+.. autoclass:: transformers.TFT5Model
+    :members:
+
+
+TFT5ForConditionalGeneration
+~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. autoclass:: transformers.TFT5ForConditionalGeneration
+    :members:
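
Reviewer sketch (not part of the patch): the Training section added to `t5.rst` describes teacher forcing, where the decoder input is the target sequence shifted one position to the right, with the PAD token acting as the start token. The snippet below illustrates that shift on plain Python lists. The helper name `shift_right` and all token ids are hypothetical and chosen only for illustration; the library itself operates on tensors and may differ in detail.

```python
from typing import List

PAD_ID = 0  # assumption: id of the PAD token, used here as the start token
EOS_ID = 1  # assumption: id of the EOS token


def shift_right(labels: List[int], start_id: int = PAD_ID) -> List[int]:
    """Prepend the start token and drop the last label so lengths match."""
    return [start_id] + labels[:-1]


# Hypothetical ids for a target sentence that already ends with EOS.
labels = [644, 4598, 229, 19250, EOS_ID]
decoder_input_ids = shift_right(labels)

# At step t the decoder has seen decoder_input_ids[: t + 1]
# and is trained to predict labels[t].
for seen, expected in zip(decoder_input_ids, labels):
    print(f"decoder input {seen:>5} -> label {expected}")
```

This is the transformation the docs refer to when they say the forward function creates the correct `decoder_input_ids` from `lm_labels`.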
--- a/docs/source/pretrained_models.rst
+++ b/docs/source/pretrained_models.rst
@@ -275,7 +275,6 @@ For a list that includes community-uploaded models, refer to `https://huggingfac
 |                   |                                                            | | FlauBERT large architecture                                                                                                         |
 |                   |                                                            | (see `details <https://github.com/getalp/Flaubert>`__)                                                                                |
 +-------------------+------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
-+-------------------+------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
 | Bart              | ``bart-large``                                             | | 12-layer, 1024-hidden, 16-heads, 406M parameters                                                                                    |
 |                   |                                                            | (see `details <https://github.com/pytorch/fairseq/tree/master/examples/bart>`_)                                                       |
 |                   +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
@@ -285,6 +284,3 @@ For a list that includes community-uploaded models, refer to `https://huggingfac
 |                   | ``bart-large-cnn``                                         | | 12-layer, 1024-hidden, 16-heads, 406M parameters       (same as base)                                                               |
 |                   |                                                            | | bart-large base architecture finetuned on cnn summarization task                                                                    |
 +-------------------+------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
-
-
-.. <https://huggingface.co/transformers/examples.html>`__
--- a/examples/ner/utils_ner.py
+++ b/examples/ner/utils_ner.py
@@ -112,12 +112,15 @@ def convert_examples_to_features(
        label_ids = []
        for word, label in zip(example.words, example.labels):
            word_tokens = tokenizer.tokenize(word)
-            tokens.extend(word_tokens)
-            # Use the real label id for the first token of the word, and padding ids for the remaining tokens
-            label_ids.extend([label_map[label]] + [pad_token_label_id] * (len(word_tokens) - 1))
+
+            # bert-base-multilingual-cased sometimes outputs "nothing" ([]) when calling tokenize with just a space.
+            if len(word_tokens) > 0:
+                tokens.extend(word_tokens)
+                # Use the real label id for the first token of the word, and padding ids for the remaining tokens
+                label_ids.extend([label_map[label]] + [pad_token_label_id] * (len(word_tokens) - 1))

        # Account for [CLS] and [SEP] with "- 2" and with "- 3" for RoBERTa.
-        special_tokens_count = 3 if sep_token_extra else 2
+        special_tokens_count = tokenizer.num_added_tokens()
        if len(tokens) > max_seq_length - special_tokens_count:
            tokens = tokens[: (max_seq_length - special_tokens_count)]
            label_ids = label_ids[: (max_seq_length - special_tokens_count)]
--- a/examples/requirements.txt
+++ b/examples/requirements.txt
@@ -3,3 +3,6 @@ tensorboard
 scikit-learn
 seqeval
 psutil
+sacrebleu
+rouge-score
+tensorflow_datasets
--- a/examples/run_language_modeling.py
+++ b/examples/run_language_modeling.py
@@ -38,7 +38,6 @@ from torch.utils.data.distributed import DistributedSampler
 from tqdm import tqdm, trange

 from transformers import (
-    CONFIG_MAPPING,
    MODEL_WITH_LM_HEAD_MAPPING,
    WEIGHTS_NAME,
    AdamW,
@@ -679,7 +678,12 @@ def main():
    elif args.model_name_or_path:
        config = AutoConfig.from_pretrained(args.model_name_or_path, cache_dir=args.cache_dir)
    else:
-        config = CONFIG_MAPPING[args.model_type]()
+        # When we release a pip version exposing CONFIG_MAPPING,
+        # we can do `config = CONFIG_MAPPING[args.model_type]()`.
+        raise ValueError(
+            "You are instantiating a new config instance from scratch. This is not supported, but you can do it from another script, save it, "
+            "and load it from here, using --config_name"
+        )

    if args.tokenizer_name:
        tokenizer = AutoTokenizer.from_pretrained(args.tokenizer_name, cache_dir=args.cache_dir)
@@ -687,8 +691,8 @@ def main():
        tokenizer = AutoTokenizer.from_pretrained(args.model_name_or_path, cache_dir=args.cache_dir)
    else:
        raise ValueError(
-            "You are instantiating a new {} tokenizer. This is not supported, but you can do it from another script, save it,"
-            "and load it from here, using --tokenizer_name".format(AutoTokenizer.__name__)
+            "You are instantiating a new tokenizer from scratch. This is not supported, but you can do it from another script, save it, "
+            "and load it from here, using --tokenizer_name"
        )

    if args.block_size <= 0:
@@ -706,7 +710,7 @@ def main():
        )
    else:
        logger.info("Training new model from scratch")
-        model = AutoModelWithLMHead(config=config)
+        model = AutoModelWithLMHead.from_config(config)

    model.to(args.device)

--- a/examples/summarization/bart/README.md
+++ b/examples/summarization/bart/README.md
@@ -1,23 +1,29 @@
-### Get the CNN Data
+### Get Preprocessed CNN Data
 To be able to reproduce the authors' results on the CNN/Daily Mail dataset you first need to download both CNN and Daily Mail datasets [from Kyunghyun Cho's website](https://cs.nyu.edu/~kcho/DMQA/) (the links next to "Stories") in the same folder. Then uncompress the archives by running:

 ```bash
-tar -xvf cnn_stories.tgz && tar -xvf dailymail_stories.tgz
+wget https://s3.amazonaws.com/datasets.huggingface.co/summarization/cnn_dm.tgz
+tar -xzvf cnn_dm.tgz
 ```
+
 this should make a directory called cnn_dm/ with files like `test.source`. 
 To use your own data, copy that files format. Each article to be summarized is on its own line.

-### Usage
+### Evaluation
 To create summaries for each article in dataset, run:
 ```bash
 python evaluate_cnn.py <path_to_test.source> cnn_test_summaries.txt
 ```
 the default batch size, 8, fits in 16GB GPU memory, but may need to be adjusted to fit your system.

+
+### Training
+Run/modify `run_train.sh`
+    
 ### Where is the code?
 The core model is in `src/transformers/modeling_bart.py`. This directory only contains examples.

-### (WIP) Rouge Scores
+## (WIP) Rouge Scores

 ### Stanford CoreNLP Setup
 ```
--- a/examples/summarization/bart/run_bart_sum.py
+++ b/examples/summarization/bart/run_bart_sum.py
@@ -0,0 +1,172 @@
+import argparse
+import glob
+import logging
+import os
+import time
+
+import torch
+from torch.utils.data import DataLoader
+
+from transformer_base import BaseTransformer, add_generic_args, generic_train, get_linear_schedule_with_warmup
+from utils import SummarizationDataset
+
+
+logger = logging.getLogger(__name__)
+
+
+class BartSystem(BaseTransformer):
+
+    mode = "language-modeling"
+
+    def __init__(self, hparams):
+        super(BartSystem, self).__init__(hparams, num_labels=None, mode=self.mode)
+
+    def forward(
+        self, input_ids, attention_mask=None, decoder_input_ids=None, decoder_attention_mask=None, lm_labels=None
+    ):
+        return self.model(
+            input_ids,
+            attention_mask=attention_mask,
+            decoder_input_ids=decoder_input_ids,
+            decoder_attention_mask=decoder_attention_mask,
+            lm_labels=lm_labels,
+        )
+
+    def _step(self, batch):
+        y = batch["target_ids"]
+        y_ids = y[:, :-1].contiguous()
+        lm_labels = y[:, 1:].clone()
+        lm_labels[y[:, 1:] == self.tokenizer.pad_token_id] = -100
+        outputs = self(
+            input_ids=batch["source_ids"],
+            attention_mask=batch["source_mask"],
+            decoder_input_ids=y_ids,
+            lm_labels=lm_labels,
+        )
+
+        loss = outputs[0]
+
+        return loss
+
+    def training_step(self, batch, batch_idx):
+        loss = self._step(batch)
+
+        tensorboard_logs = {"train_loss": loss}
+        return {"loss": loss, "log": tensorboard_logs}
+
+    def validation_step(self, batch, batch_idx):
+        loss = self._step(batch)
+        return {"val_loss": loss}
+
+    def validation_end(self, outputs):
+        avg_loss = torch.stack([x["val_loss"] for x in outputs]).mean()
+        tensorboard_logs = {"val_loss": avg_loss}
+        return {"avg_val_loss": avg_loss, "log": tensorboard_logs}
+
+    def test_step(self, batch, batch_idx):
+        generated_ids = self.model.generate(
+            batch["source_ids"],
+            attention_mask=batch["source_mask"],
+            num_beams=1,
+            max_length=80,
+            repetition_penalty=2.5,
+            length_penalty=1.0,
+            early_stopping=True,
+        )
+        preds = [
+            self.tokenizer.decode(g, skip_special_tokens=True, clean_up_tokenization_spaces=True)
+            for g in generated_ids
+        ]
+        target = [
+            self.tokenizer.decode(t, skip_special_tokens=True, clean_up_tokenization_spaces=True)
+            for t in batch["target_ids"]
+        ]
+        loss = self._step(batch)
+
+        return {"val_loss": loss, "preds": preds, "target": target}
+
+    def test_end(self, outputs):
+        return self.validation_end(outputs)
+
+    def test_epoch_end(self, outputs):
+        output_test_predictions_file = os.path.join(self.hparams.output_dir, "test_predictions.txt")
+        output_test_targets_file = os.path.join(self.hparams.output_dir, "test_targets.txt")
+        # write predictions and targets for later rouge evaluation.
+        with open(output_test_predictions_file, "w+") as p_writer, open(output_test_targets_file, "w+") as t_writer:
+            for output_batch in outputs:
+                p_writer.writelines(s + "\n" for s in output_batch["preds"])
+                t_writer.writelines(s + "\n" for s in output_batch["target"])
+            p_writer.close()
+            t_writer.close()
+
+        return self.test_end(outputs)
+
+    def train_dataloader(self):
+        train_dataset = SummarizationDataset(
+            self.tokenizer, data_dir=self.hparams.data_dir, type_path="train", block_size=self.hparams.max_seq_length
+        )
+        dataloader = DataLoader(train_dataset, batch_size=self.hparams.train_batch_size)
+        t_total = (
+            (len(dataloader.dataset) // (self.hparams.train_batch_size * max(1, self.hparams.n_gpu)))
+            // self.hparams.gradient_accumulation_steps
+            * float(self.hparams.num_train_epochs)
+        )
+        scheduler = get_linear_schedule_with_warmup(
+            self.opt, num_warmup_steps=self.hparams.warmup_steps, num_training_steps=t_total
+        )
+        self.lr_scheduler = scheduler
+        return dataloader
+
+    def val_dataloader(self):
+        val_dataset = SummarizationDataset(
+            self.tokenizer, data_dir=self.hparams.data_dir, type_path="val", block_size=self.hparams.max_seq_length
+        )
+        return DataLoader(val_dataset, batch_size=self.hparams.eval_batch_size)
+
+    def test_dataloader(self):
+        test_dataset = SummarizationDataset(
+            self.tokenizer, data_dir=self.hparams.data_dir, type_path="test", block_size=self.hparams.max_seq_length
+        )
+        return DataLoader(test_dataset, batch_size=self.hparams.eval_batch_size)
+
+    @staticmethod
+    def add_model_specific_args(parser, root_dir):
+        BaseTransformer.add_model_specific_args(parser, root_dir)
+        # Add BART specific options
+        parser.add_argument(
+            "--max_seq_length",
+            default=1024,
+            type=int,
+            help="The maximum total input sequence length after tokenization. Sequences longer "
+            "than this will be truncated, sequences shorter will be padded.",
+        )
+
+        parser.add_argument(
+            "--data_dir",
+            default=None,
+            type=str,
+            required=True,
+            help="The input data dir. Should contain the dataset files for the CNN/DM summarization task.",
+        )
+        return parser
+
+
+if __name__ == "__main__":
+    parser = argparse.ArgumentParser()
+    add_generic_args(parser, os.getcwd())
+    parser = BartSystem.add_model_specific_args(parser, os.getcwd())
+    args = parser.parse_args()
+
+    # If output_dir not provided, a folder will be generated in pwd
+    if args.output_dir is None:
+        args.output_dir = os.path.join("./results", f"{args.task}_{args.model_type}_{time.strftime('%Y%m%d_%H%M%S')}",)
+        os.makedirs(args.output_dir)
+
+    model = BartSystem(args)
+    trainer = generic_train(model, args)
+
+    # Optionally, predict on the test set and write to output_dir
+    if args.do_predict:
+        checkpoints = list(sorted(glob.glob(os.path.join(args.output_dir, "checkpointepoch=*.ckpt"), recursive=True)))
+        # load_from_checkpoint returns a new instance, so keep the result for testing
+        model = BartSystem.load_from_checkpoint(checkpoints[-1])
+        trainer.test(model)
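
Reviewer sketch (not part of the patch): `BartSystem._step` above builds the decoder inputs and labels from a single padded target row by dropping the last token for the inputs, dropping the first token for the labels, and replacing padded label positions with `-100`, which is the default `ignore_index` of PyTorch's cross-entropy loss. The pure-Python version below mirrors that slicing for one row. The function name `split_targets` and the token ids are hypothetical.

```python
from typing import List, Tuple

IGNORE_INDEX = -100  # label value that the loss skips


def split_targets(target_ids: List[int], pad_id: int) -> Tuple[List[int], List[int]]:
    """Mirror y[:, :-1] and y[:, 1:] from _step for a single row."""
    decoder_input_ids = target_ids[:-1]  # everything except the last token
    labels = [
        IGNORE_INDEX if tok == pad_id else tok  # mask padding in the labels
        for tok in target_ids[1:]  # everything except the first token
    ]
    return decoder_input_ids, labels


# Hypothetical row: <s>=0, two content tokens, </s>=2, then two pads (pad_id=1).
decoder_input_ids, labels = split_targets([0, 8, 9, 2, 1, 1], pad_id=1)
print(decoder_input_ids)
print(labels)
```

Padding stays in the decoder inputs but contributes nothing to the loss, because the corresponding label positions are masked.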
--- a/examples/summarization/bart/run_train.sh
+++ b/examples/summarization/bart/run_train.sh
@@ -0,0 +1,23 @@
+# Install newest ptl.
+pip install -U git+http://github.com/PyTorchLightning/pytorch-lightning/
+
+
+export OUTPUT_DIR_NAME=bart_sum
+export CURRENT_DIR=${PWD}
+export OUTPUT_DIR=${CURRENT_DIR}/${OUTPUT_DIR_NAME}
+
+# Make output directory if it doesn't exist
+mkdir -p $OUTPUT_DIR
+
+# Add parent directory to python path to access transformer_base.py
+export PYTHONPATH="../../":"${PYTHONPATH}"
+
+python run_bart_sum.py \
+--data_dir=./cnn-dailymail/cnn_dm \
+--model_type=bart \
+--model_name_or_path=bart-large \
+--learning_rate=3e-5 \
+--train_batch_size=4 \
+--eval_batch_size=4 \
+--output_dir=$OUTPUT_DIR \
+--do_train
--- a/examples/summarization/bart/test_bart_examples.py
+++ b/examples/summarization/bart/test_bart_examples.py
@@ -1,4 +1,5 @@
 import logging
+import os
 import sys
 import tempfile
 import unittest
@@ -8,6 +9,8 @@ from unittest.mock import patch
 from .evaluate_cnn import _run_generate


+output_file_name = "output_bart_sum.txt"
+
 articles = [" New York (CNN)When Liana Barrientos was 23 years old, she got married in Westchester County."]

 logging.basicConfig(level=logging.DEBUG)
@@ -19,10 +22,11 @@ class TestBartExamples(unittest.TestCase):
    def test_bart_cnn_cli(self):
        stream_handler = logging.StreamHandler(sys.stdout)
        logger.addHandler(stream_handler)
-        tmp = Path(tempfile.gettempdir()) / "utest_generations.hypo"
+        tmp = Path(tempfile.gettempdir()) / "utest_generations_bart_sum.hypo"
        with tmp.open("w") as f:
            f.write("\n".join(articles))
-        testargs = ["evaluate_cnn.py", str(tmp), "output.txt"]
+        testargs = ["evaluate_cnn.py", str(tmp), output_file_name]
        with patch.object(sys, "argv", testargs):
            _run_generate()
-            self.assertTrue(Path("output.txt").exists())
+            self.assertTrue(Path(output_file_name).exists())
+            os.remove(Path(output_file_name))
--- a/examples/summarization/bart/utils.py
+++ b/examples/summarization/bart/utils.py
@@ -0,0 +1,43 @@
+import os
+
+from torch.utils.data import Dataset
+
+
+class SummarizationDataset(Dataset):
+    def __init__(self, tokenizer, data_dir="./cnn-dailymail/cnn_dm/", type_path="train", block_size=1024):
+        super().__init__()
+        self.tokenizer = tokenizer
+
+        self.source = []
+        self.target = []
+
+        print("loading " + type_path + " source.")
+
+        with open(os.path.join(data_dir, type_path + ".source"), "r") as f:
+            for text in f.readlines():  # each text is a line and a full story
+                tokenized = tokenizer.batch_encode_plus(
+                    [text], max_length=block_size, pad_to_max_length=True, return_tensors="pt"
+                )
+                self.source.append(tokenized)
+            f.close()
+
+        print("loading " + type_path + " target.")
+
+        with open(os.path.join(data_dir, type_path + ".target"), "r") as f:
+            for text in f.readlines():  # each text is a line and a summary
+                tokenized = tokenizer.batch_encode_plus(
+                    [text], max_length=56, pad_to_max_length=True, return_tensors="pt"
+                )
+                self.target.append(tokenized)
+            f.close()
+
+    def __len__(self):
+        return len(self.source)
+
+    def __getitem__(self, index):
+        source_ids = self.source[index]["input_ids"].squeeze()
+        target_ids = self.target[index]["input_ids"].squeeze()
+
+        src_mask = self.source[index]["attention_mask"].squeeze()  # might need to squeeze
+
+        return {"source_ids": source_ids, "source_mask": src_mask, "target_ids": target_ids}
--- a/examples/summarization/t5/README.md
+++ b/examples/summarization/t5/README.md
@@ -0,0 +1,25 @@
+***This script evaluates the multitask pre-trained checkpoint for ``t5-base`` (see paper [here](https://arxiv.org/pdf/1910.10683.pdf)) on the CNN/Daily Mail test dataset. Please note that the results in the paper were attained using a model fine-tuned on summarization, so that results will be worse here by approx. 0.5 ROUGE points***
+
+### Get the CNN Data
+First, you need to download the CNN data. It's about 400 MB and can be downloaded by
+running
+
+```bash
+python download_cnn_daily_mail.py cnn_articles_input_data.txt cnn_articles_reference_summaries.txt
+```
+
+You should confirm that each file has 11490 lines:
+
+```bash
+wc -l cnn_articles_input_data.txt # should print 11490
+wc -l cnn_articles_reference_summaries.txt # should print 11490
+```
+
+### Usage
+
+To create summaries for each article in the dataset, run:
+```bash
+python evaluate_cnn.py t5-base cnn_articles_input_data.txt cnn_generated_articles_summaries.txt cnn_articles_reference_summaries.txt rouge_score.txt
+```
+The default batch size, 8, fits in 16GB GPU memory, but may need to be adjusted to fit your system.
+The rouge scores "rouge1, rouge2, rougeL" are automatically created and saved in ``rouge_score.txt``.
--- a/examples/summarization/t5/__init__.py
+++ b/examples/summarization/t5/__init__.py
--- a/examples/summarization/t5/download_cnn_daily_mail.py
+++ b/examples/summarization/t5/download_cnn_daily_mail.py
@@ -0,0 +1,31 @@
+import argparse
+from pathlib import Path
+
+import tensorflow_datasets as tfds
+
+
+def main(input_path, reference_path, data_dir):
+    cnn_ds = tfds.load("cnn_dailymail", split="test", shuffle_files=False, data_dir=data_dir)
+    cnn_ds_iter = tfds.as_numpy(cnn_ds)
+
+    test_articles_file = Path(input_path).open("w")
+    test_summaries_file = Path(reference_path).open("w")
+
+    for example in cnn_ds_iter:
+        test_articles_file.write(example["article"].decode("utf-8") + "\n")
+        test_articles_file.flush()
+        test_summaries_file.write(example["highlights"].decode("utf-8").replace("\n", " ") + "\n")
+        test_summaries_file.flush()
+
+
+if __name__ == "__main__":
+    parser = argparse.ArgumentParser()
+    parser.add_argument("input_path", type=str, help="where to save the articles input data")
+    parser.add_argument(
+        "reference_path", type=str, help="where to save the reference summaries",
+    )
+    parser.add_argument(
+        "--data_dir", type=str, default="~/tensorflow_datasets", help="where to save the tensorflow datasets.",
+    )
+    args = parser.parse_args()
+    main(args.input_path, args.reference_path, args.data_dir)
--- a/examples/summarization/t5/evaluate_cnn.py
+++ b/examples/summarization/t5/evaluate_cnn.py
@@ -0,0 +1,101 @@
+import argparse
+from pathlib import Path
+
+import torch
+from tqdm import tqdm
+
+from rouge_score import rouge_scorer, scoring
+from transformers import T5ForConditionalGeneration, T5Tokenizer
+
+
+def chunks(lst, n):
+    """Yield successive n-sized chunks from lst."""
+    for i in range(0, len(lst), n):
+        yield lst[i : i + n]
+
+
+def generate_summaries(lns, output_file_path, model_size, batch_size, device):
+    output_file = Path(output_file_path).open("w")
+
+    model = T5ForConditionalGeneration.from_pretrained(model_size)
+    model.to(device)
+
+    tokenizer = T5Tokenizer.from_pretrained(model_size)
+
+    # update config with summarization specific params
+    task_specific_params = model.config.task_specific_params
+    if task_specific_params is not None:
+        model.config.update(task_specific_params.get("summarization", {}))
+
+    for batch in tqdm(list(chunks(lns, batch_size))):
+        batch = [model.config.prefix + text for text in batch]
+
+        dct = tokenizer.batch_encode_plus(batch, max_length=512, return_tensors="pt", pad_to_max_length=True)
+        input_ids = dct["input_ids"].to(device)
+        attention_mask = dct["attention_mask"].to(device)
+
+        summaries = model.generate(input_ids=input_ids, attention_mask=attention_mask)
+        dec = [tokenizer.decode(g, skip_special_tokens=True, clean_up_tokenization_spaces=False) for g in summaries]
+
+        for hypothesis in dec:
+            output_file.write(hypothesis + "\n")
+            output_file.flush()
+
+
+def calculate_rouge(output_lns, reference_lns, score_path):
+    score_file = Path(score_path).open("w")
+    scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
+    aggregator = scoring.BootstrapAggregator()
+
+    for reference_ln, output_ln in zip(reference_lns, output_lns):
+        scores = scorer.score(reference_ln, output_ln)
+        aggregator.add_scores(scores)
+
+    result = aggregator.aggregate()
+    score_file.write(
+        "ROUGE_1: \n{} \n\n ROUGE_2: \n{} \n\n ROUGE_L: \n{} \n\n".format(
+            result["rouge1"], result["rouge2"], result["rougeL"]
+        )
+    )
+
+
+def run_generate():
+    parser = argparse.ArgumentParser()
+    parser.add_argument(
+        "model_size",
+        type=str,
+        help="T5 model size, either 't5-small', 't5-base' or 't5-large'. Defaults to base.",
+        default="t5-base",
+    )
+    parser.add_argument(
+        "input_path", type=str, help="like cnn_dm/test_articles_input.txt",
+    )
+    parser.add_argument(
+        "output_path", type=str, help="where to save summaries",
+    )
+    parser.add_argument("reference_path", type=str, help="like cnn_dm/test_reference_summaries.txt")
+    parser.add_argument(
+        "score_path", type=str, help="where to save the rouge score",
+    )
+    parser.add_argument(
+        "--batch_size", type=int, default=8, required=False, help="batch size: how many to summarize at a time",
+    )
+    parser.add_argument(
+        "--no_cuda", action="store_true", help="Whether to force the execution on CPU.",
+    )
+
+    args = parser.parse_args()
+    args.device = torch.device("cuda" if torch.cuda.is_available() and not args.no_cuda else "cpu")
+
+    source_lns = [x.rstrip() for x in open(args.input_path).readlines()]
+
+    generate_summaries(source_lns, args.output_path, args.model_size, args.batch_size, args.device)
+
+    output_lns = [x.rstrip() for x in open(args.output_path).readlines()]
+    reference_lns = [x.rstrip() for x in open(args.reference_path).readlines()]
+
+    calculate_rouge(output_lns, reference_lns, args.score_path)
+
+
+if __name__ == "__main__":
+    run_generate()
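
Reviewer sketch (not part of the patch): a boolean switch such as `--no_cuda` is easy to get wrong with `argparse`, because `type=bool` simply calls `bool()` on the raw string and every non-empty string is truthy. The comparison below uses two throwaway parsers (`parser_a` and `parser_b` are hypothetical names) to contrast that behaviour with `action="store_true"`.

```python
import argparse

# Variant A: type=bool applies bool() to the argument string.
parser_a = argparse.ArgumentParser()
parser_a.add_argument("--no_cuda", default=False, type=bool)

# Variant B: a real on/off switch that takes no value.
parser_b = argparse.ArgumentParser()
parser_b.add_argument("--no_cuda", action="store_true")

# bool("False") is True, so variant A cannot be switched off explicitly.
a_false = parser_a.parse_args(["--no_cuda", "False"]).no_cuda

# Variant B is False when the flag is absent and True when it is present.
b_absent = parser_b.parse_args([]).no_cuda
b_present = parser_b.parse_args(["--no_cuda"]).no_cuda

print(a_false, b_absent, b_present)
```

With the switch form, CPU execution is requested by passing `--no_cuda` with no value.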
--- a/examples/summarization/t5/test_t5_examples.py
+++ b/examples/summarization/t5/test_t5_examples.py
@@ -0,0 +1,35 @@
+import logging
+import os
+import sys
+import tempfile
+import unittest
+from pathlib import Path
+from unittest.mock import patch
+
+from .evaluate_cnn import run_generate
+
+
+output_file_name = "output_t5_sum.txt"
+score_file_name = "score_t5_sum.txt"
+
+articles = ["New York (CNN)When Liana Barrientos was 23 years old, she got married in Westchester County."]
+
+logging.basicConfig(level=logging.DEBUG)
+
+logger = logging.getLogger()
+
+
+class TestT5Examples(unittest.TestCase):
+    def test_t5_cli(self):
+        stream_handler = logging.StreamHandler(sys.stdout)
+        logger.addHandler(stream_handler)
+        tmp = Path(tempfile.gettempdir()) / "utest_generations_t5_sum.hypo"
+        with tmp.open("w") as f:
+            f.write("\n".join(articles))
+        testargs = ["evaluate_cnn.py", "t5-small", str(tmp), output_file_name, str(tmp), score_file_name]
+        with patch.object(sys, "argv", testargs):
+            run_generate()
+            self.assertTrue(Path(output_file_name).exists())
+            self.assertTrue(Path(score_file_name).exists())
+            os.remove(Path(output_file_name))
+            os.remove(Path(score_file_name))
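The test above drives the command-line entry point in-process by temporarily replacing `sys.argv`. A minimal sketch of that pattern, using only the standard library, is shown here; `run_cli` and the file name `articles.txt` are hypothetical stand-ins for `run_generate` and its inputs, not code from this change.

```python
import argparse
import sys
from unittest.mock import patch


def run_cli():
    """Hypothetical stand-in for run_generate(): parse sys.argv and return the namespace."""
    parser = argparse.ArgumentParser()
    parser.add_argument("input_path", type=str)
    parser.add_argument("--batch_size", type=int, default=8)
    # parse_args() with no arguments reads sys.argv[1:] at call time.
    return parser.parse_args()


# argv[0] is the program name, which argparse skips.
fake_argv = ["run_cli.py", "articles.txt", "--batch_size", "2"]
original_argv = sys.argv

with patch.object(sys, "argv", fake_argv):
    args = run_cli()

# Outside the with-block, sys.argv is restored to its previous value.
print(args.input_path, args.batch_size)
```

Because the patch is undone when the `with` block exits, other tests in the same process see the real `sys.argv` again.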
--- a/examples/transformer_base.py
+++ b/examples/transformer_base.py
@@ -53,10 +53,9 @@ class BaseTransformer(pl.LightningModule):
        super(BaseTransformer, self).__init__()
        self.hparams = hparams
        self.hparams.model_type = self.hparams.model_type.lower()
-
        config = AutoConfig.from_pretrained(
            self.hparams.config_name if self.hparams.config_name else self.hparams.model_name_or_path,
-            num_labels=num_labels,
+            **({"num_labels": num_labels} if num_labels is not None else {}),
            cache_dir=self.hparams.cache_dir if self.hparams.cache_dir else None,
        )
        tokenizer = AutoTokenizer.from_pretrained(
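The `num_labels` change in this hunk forwards the keyword only when a value was actually supplied, presumably so that an explicit `None` does not override whatever default the config would otherwise use. A minimal sketch of the idiom follows; `from_pretrained` and `build_config` here are hypothetical stand-ins, not the `AutoConfig` API, and the default of `2` is an assumption chosen for illustration.

```python
def from_pretrained(name, num_labels=2, cache_dir=None):
    """Hypothetical loader whose num_labels parameter has its own default."""
    return {"name": name, "num_labels": num_labels, "cache_dir": cache_dir}


def build_config(name, num_labels=None):
    # Forward num_labels only when the caller supplied one;
    # an empty dict unpacks to no keyword arguments at all.
    optional = {"num_labels": num_labels} if num_labels is not None else {}
    return from_pretrained(name, **optional, cache_dir=None)


print(build_config("some-model"))                # loader default is kept
print(build_config("some-model", num_labels=5))  # caller's value is forwarded
```

Passing `num_labels=None` straight through would instead replace the loader's default with `None`.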
--- a/examples/translation/t5/README.md
+++ b/examples/translation/t5/README.md
@@ -0,0 +1,51 @@
+***This script evaluates the multitask pre-trained checkpoint for ``t5-base`` (see paper [here](https://arxiv.org/pdf/1910.10683.pdf)) on the English to German WMT dataset. Please note that the results in the paper were attained using a model fine-tuned on translation, so that results will be worse here by approx. 1.5 BLEU points***
+
+### Intro
+
+This example shows how T5 (see the official [paper](https://arxiv.org/abs/1910.10683)) can be
+evaluated on the WMT English-German dataset.
+
+### Get the WMT Data
+
+To be able to reproduce the authors' results on WMT English to German, you first need to download 
+the WMT14 en-de news datasets.
+Go to Stanford's official NLP [website](https://nlp.stanford.edu/projects/nmt/) and find "newstest2013.en" and "newstest2013.de" under WMT'14 English-German data, or download the dataset directly via:
+
+```bash
+curl https://nlp.stanford.edu/projects/nmt/data/wmt14.en-de/newstest2013.en > newstest2013.en
+curl https://nlp.stanford.edu/projects/nmt/data/wmt14.en-de/newstest2013.de > newstest2013.de
+```
+
+You should have 3000 sentences in each file. You can verify this by running:
+
+```bash
+wc -l newstest2013.en  # should give 3000
+```
+
+### Usage
+
+Let's check the longest and shortest sentence in our file to find reasonable decoding hyperparameters: 
+
+Get the longest and shortest sentence:
+
+```bash 
+awk '{print NF}' newstest2013.en | sort -n | head -1 # shortest sentence has 1 word
+awk '{print NF}' newstest2013.en | sort -n | tail -1 # longest sentence has 106 words
+```
+
+We will set our `max_length` to ~3 times the longest sentence and leave `min_length` at its default value of 0.
+We decode with beam search using `num_beams=4`, as proposed in the paper. As is common with beam search, we also set `early_stopping=True` and `length_penalty=2.0`.
+
+To create a translation for each sentence in the dataset and get a final BLEU score, run:
+```bash
+python evaluate_wmt.py <path_to_newstest2013.en> newstest2013_de_translations.txt <path_to_newstest2013.de> newstest2013_en_de_bleu.txt
+```
+The default batch size, 16, fits in 16GB GPU memory, but may need to be adjusted to fit your system.
+
+### Where is the code?
+The core model is in `src/transformers/modeling_t5.py`. This directory only contains examples.
+
+### BLEU Scores
+
+The BLEU score is calculated using [sacrebleu](https://github.com/mjpost/sacreBLEU) by mjpost.
+To get the BLEU score we used 
--- a/examples/translation/t5/__init__.py
+++ b/examples/translation/t5/__init__.py
--- a/examples/translation/t5/evaluate_wmt.py
+++ b/examples/translation/t5/evaluate_wmt.py
@@ -0,0 +1,90 @@
+import argparse
+from pathlib import Path
+
+import torch
+from tqdm import tqdm
+
+from sacrebleu import corpus_bleu
+from transformers import T5ForConditionalGeneration, T5Tokenizer
+
+
+def chunks(lst, n):
+    """Yield successive n-sized chunks from lst."""
+    for i in range(0, len(lst), n):
+        yield lst[i : i + n]
+
+
+def generate_translations(lns, output_file_path, batch_size, device):
+    output_file = Path(output_file_path).open("w")
+
+    model = T5ForConditionalGeneration.from_pretrained("t5-base")
+    model.to(device)
+
+    tokenizer = T5Tokenizer.from_pretrained("t5-base")
+
+    # update config with translation specific params
+    task_specific_params = model.config.task_specific_params
+    if task_specific_params is not None:
+        model.config.update(task_specific_params.get("translation_en_to_de", {}))
+
+    for batch in tqdm(list(chunks(lns, batch_size))):
+        batch = [model.config.prefix + text for text in batch]
+
+        dct = tokenizer.batch_encode_plus(batch, max_length=512, return_tensors="pt", pad_to_max_length=True)
+
+        input_ids = dct["input_ids"].to(device)
+        attention_mask = dct["attention_mask"].to(device)
+
+        translations = model.generate(input_ids=input_ids, attention_mask=attention_mask)
+        dec = [tokenizer.decode(g, skip_special_tokens=True, clean_up_tokenization_spaces=False) for g in translations]
+
+        for hypothesis in dec:
+            output_file.write(hypothesis + "\n")
+            output_file.flush()
+
+
+def calculate_bleu_score(output_lns, refs_lns, score_path):
+    bleu = corpus_bleu(output_lns, [refs_lns])
+    result = "BLEU score: {}".format(bleu.score)
+    with Path(score_path).open("w") as score_file:
+        score_file.write(result)
+
+
+def run_generate():
+    parser = argparse.ArgumentParser()
+    parser.add_argument(
+        "input_path", type=str, help="like wmt/newstest2013.en",
+    )
+    parser.add_argument(
+        "output_path", type=str, help="where to save translation",
+    )
+    parser.add_argument(
+        "reference_path", type=str, help="like wmt/newstest2013.de",
+    )
+    parser.add_argument(
+        "score_path", type=str, help="where to save the bleu score",
+    )
+    parser.add_argument(
+        "--batch_size", type=int, default=16, required=False, help="batch size: how many to translate at a time",
+    )
+    parser.add_argument(
+        "--no_cuda", action="store_true", help="Whether to force the execution on CPU.",
+    )
+
+    args = parser.parse_args()
+    args.device = torch.device("cuda" if torch.cuda.is_available() and not args.no_cuda else "cpu")
+
+    dash_pattern = (" ##AT##-##AT## ", "-")
+
+    input_lns = [x.strip().replace(dash_pattern[0], dash_pattern[1]) for x in open(args.input_path).readlines()]
+
+    generate_translations(input_lns, args.output_path, args.batch_size, args.device)
+
+    output_lns = [x.strip() for x in open(args.output_path).readlines()]
+    refs_lns = [x.strip().replace(dash_pattern[0], dash_pattern[1]) for x in open(args.reference_path).readlines()]
+
+    calculate_bleu_score(output_lns, refs_lns, args.score_path)
+
+
+if __name__ == "__main__":
+    run_generate()
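The script above batches its input with `chunks` and undoes the ` ##AT##-##AT## ` hyphen marker before translating and scoring. The sketch below isolates those two steps so they can be tried without loading a model; the sample lines are made up for illustration, and `normalize` is a hypothetical helper name wrapping the same `strip().replace(...)` call the script applies inline.

```python
def chunks(lst, n):
    """Yield successive n-sized chunks from lst (same helper as in the script)."""
    for i in range(0, len(lst), n):
        yield lst[i : i + n]


DASH_PATTERN = (" ##AT##-##AT## ", "-")


def normalize(line):
    # Strip surrounding whitespace and turn the tokenized-hyphen marker back into "-".
    return line.strip().replace(DASH_PATTERN[0], DASH_PATTERN[1])


# Made-up sample data standing in for lines read from newstest2013.en.
lines = ["a well ##AT##-##AT## known fact\n", "second line\n", "third line\n"]

cleaned = [normalize(x) for x in lines]
batches = list(chunks(cleaned, 2))
print(batches)  # [['a well-known fact', 'second line'], ['third line']]
```

The last batch is simply shorter when the number of lines is not a multiple of the batch size, so no padding of the list itself is needed.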
--- a/examples/translation/t5/test_t5_examples.py
+++ b/examples/translation/t5/test_t5_examples.py
@@ -0,0 +1,43 @@
+import logging
+import os
+import sys
+import tempfile
+import unittest
+from pathlib import Path
+from unittest.mock import patch
+
+from .evaluate_wmt import run_generate
+
+
+text = ["When Liana Barrientos was 23 years old, she got married in Westchester County."]
+translation = ["Als Liana Barrientos 23 Jahre alt war, heiratete sie in Westchester County."]
+
+output_file_name = "output_t5_trans.txt"
+score_file_name = "score_t5_trans.txt"
+
+logging.basicConfig(level=logging.DEBUG)
+
+logger = logging.getLogger()
+
+
+class TestT5Examples(unittest.TestCase):
+    def test_t5_cli(self):
+        stream_handler = logging.StreamHandler(sys.stdout)
+        logger.addHandler(stream_handler)
+
+        tmp_source = Path(tempfile.gettempdir()) / "utest_generations_t5_trans.hypo"
+        with tmp_source.open("w") as f:
+            f.write("\n".join(text))
+
+        tmp_target = Path(tempfile.gettempdir()) / "utest_generations_t5_trans.target"
+        with tmp_target.open("w") as f:
+            f.write("\n".join(translation))
+
+        testargs = ["evaluate_wmt.py", str(tmp_source), output_file_name, str(tmp_target), score_file_name]
+
+        with patch.object(sys, "argv", testargs):
+            run_generate()
+            self.assertTrue(Path(output_file_name).exists())
+            self.assertTrue(Path(score_file_name).exists())
+            os.remove(Path(output_file_name))
+            os.remove(Path(score_file_name))
--- a/examples/utils_multiple_choice.py
+++ b/examples/utils_multiple_choice.py
@@ -320,7 +320,9 @@ def convert_examples_to_features(
            else:
                text_b = example.question + " " + ending

-            inputs = tokenizer.encode_plus(text_a, text_b, add_special_tokens=True, max_length=max_length,)
+            inputs = tokenizer.encode_plus(
+                text_a, text_b, add_special_tokens=True, max_length=max_length, return_token_type_ids=True
+            )
            if "num_truncated_tokens" in inputs and inputs["num_truncated_tokens"] > 0:
                logger.info(
                    "Attention! you are cropping tokens (swag task is ok). "
--- a/model_cards/dbmdz/bert-base-german-cased/README.md
+++ b/model_cards/dbmdz/bert-base-german-cased/README.md
@@ -1,5 +1,6 @@
 ---
 language: german
+license: mit
 ---

 # 🤗 + 📚 dbmdz German BERT models
--- a/model_cards/dbmdz/bert-base-german-europeana-cased/README.md
+++ b/model_cards/dbmdz/bert-base-german-europeana-cased/README.md
@@ -1,5 +1,6 @@
 ---
 language: german
+license: mit
 tags:
  - "historic german"
 ---
--- a/model_cards/dbmdz/bert-base-german-europeana-uncased/README.md
+++ b/model_cards/dbmdz/bert-base-german-europeana-uncased/README.md
@@ -1,5 +1,6 @@
 ---
 language: german
+license: mit
 tags:
  - "historic german"
 ---
--- a/model_cards/dbmdz/bert-base-german-uncased/README.md
+++ b/model_cards/dbmdz/bert-base-german-uncased/README.md
@@ -1,5 +1,6 @@
 ---
 language: german
+license: mit
 ---

 # 🤗 + 📚 dbmdz German BERT models
--- a/model_cards/dbmdz/bert-base-italian-cased/README.md
+++ b/model_cards/dbmdz/bert-base-italian-cased/README.md
@@ -1,5 +1,6 @@
 ---
 language: italian
+license: mit
 ---

 # 🤗 + 📚 dbmdz BERT models
--- a/model_cards/dbmdz/bert-base-italian-uncased/README.md
+++ b/model_cards/dbmdz/bert-base-italian-uncased/README.md
@@ -1,5 +1,6 @@
 ---
 language: italian
+license: mit
 ---

 # 🤗 + 📚 dbmdz BERT models
--- a/model_cards/dbmdz/bert-base-italian-xxl-cased/README.md
+++ b/model_cards/dbmdz/bert-base-italian-xxl-cased/README.md
@@ -1,5 +1,6 @@
 ---
 language: italian
+license: mit
 ---

 # 🤗 + 📚 dbmdz BERT models
--- a/model_cards/dbmdz/bert-base-italian-xxl-uncased/README.md
+++ b/model_cards/dbmdz/bert-base-italian-xxl-uncased/README.md
@@ -1,5 +1,6 @@
 ---
 language: italian
+license: mit
 ---

 # 🤗 + 📚 dbmdz BERT models
--- a/model_cards/dbmdz/bert-base-turkish-128k-cased/README.md
+++ b/model_cards/dbmdz/bert-base-turkish-128k-cased/README.md
@@ -1,5 +1,6 @@
 ---
 language: turkish
+license: mit
 ---

 # 🤗 + 📚 dbmdz Turkish BERT model
--- a/model_cards/dbmdz/bert-base-turkish-128k-uncased/README.md
+++ b/model_cards/dbmdz/bert-base-turkish-128k-uncased/README.md
@@ -1,5 +1,6 @@
 ---
 language: turkish
+license: mit
 ---

 # 🤗 + 📚 dbmdz Turkish BERT model
--- a/model_cards/dbmdz/bert-base-turkish-cased/README.md
+++ b/model_cards/dbmdz/bert-base-turkish-cased/README.md
@@ -1,5 +1,6 @@
 ---
 language: turkish
+license: mit
 ---

 # 🤗 + 📚 dbmdz Turkish BERT model
--- a/model_cards/dbmdz/bert-base-turkish-uncased/README.md
+++ b/model_cards/dbmdz/bert-base-turkish-uncased/README.md
@@ -1,5 +1,6 @@
 ---
 language: turkish
+license: mit
 ---

 # 🤗 + 📚 dbmdz Turkish BERT model
--- a/model_cards/dbmdz/distilbert-base-turkish-cased/README.md
+++ b/model_cards/dbmdz/distilbert-base-turkish-cased/README.md
@@ -1,5 +1,6 @@
 ---
 language: turkish
+license: mit
 ---

 # 🤗 + 📚 dbmdz Distilled Turkish BERT model
--- a/model_cards/gsarti/biobert-nli/README.md
+++ b/model_cards/gsarti/biobert-nli/README.md
@@ -0,0 +1,37 @@
+# BioBERT-NLI
+
+This is the model [BioBERT](https://github.com/dmis-lab/biobert) [1] fine-tuned on the [SNLI](https://nlp.stanford.edu/projects/snli/) and the [MultiNLI](https://www.nyu.edu/projects/bowman/multinli/) datasets using the [`sentence-transformers` library](https://github.com/UKPLab/sentence-transformers/) to produce universal sentence embeddings [2].
+
+The model uses the original BERT wordpiece vocabulary and was trained using the **average pooling strategy** and a **softmax loss**.
+
+**Base model**: `monologg/biobert_v1.1_pubmed` from HuggingFace's `AutoModel`.
+
+**Training time**: ~6 hours on the NVIDIA Tesla P100 GPU provided in Kaggle Notebooks.
+
+**Parameters**:
+
+| Parameter        | Value |
+|------------------|-------|
+| Batch size       | 64    |
+| Training steps   | 30000 |
+| Warmup steps     | 1450  |
+| Lowercasing      | False |
+| Max. Seq. Length | 128   |
+
+**Performances**: The performance was evaluated on the test portion of the [STS dataset](http://ixa2.si.ehu.es/stswiki/index.php/STSbenchmark) using Spearman rank correlation and compared to the performances of a general BERT base model obtained with the same procedure to verify their similarity.
+
+| Model                         | Score       |
+|-------------------------------|-------------|
+| `biobert-nli` (this)          | 73.40       |
+| `gsarti/scibert-nli`          | 74.50       |
+| `bert-base-nli-mean-tokens`[3]| 77.12       |
+
+An example usage for similarity-based scientific paper retrieval is provided in the [Covid Papers Browser](https://github.com/gsarti/covid-papers-browser) repository.
+
+**References:**
+
+[1] J. Lee et al., [BioBERT: a pre-trained biomedical language representation model for biomedical text mining](https://academic.oup.com/bioinformatics/article/36/4/1234/5566506)
+
+[2] A. Conneau et al., [Supervised Learning of Universal Sentence Representations from Natural Language Inference Data](https://www.aclweb.org/anthology/D17-1070/)
+
+[3] N. Reimers and I. Gurevych, [Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks](https://www.aclweb.org/anthology/D19-1410/)
--- a/model_cards/huseinzol05/bert-base-bahasa-cased/README.md
+++ b/model_cards/huseinzol05/bert-base-bahasa-cased/README.md
@@ -32,13 +32,54 @@ Preprocessing steps can reproduce from here, [Malaya/pretrained-model/preprocess
 You can use this model by installing `torch` or `tensorflow` and Huggingface library `transformers`. And you can use it directly by initializing it like this:  

 ```python
-from transformers import XLNetTokenizer, BertModel
+from transformers import AlbertTokenizer, BertModel

 model = BertModel.from_pretrained('huseinzol05/bert-base-bahasa-cased')
-tokenizer = XLNetTokenizer.from_pretrained('huseinzol05/bert-base-bahasa-cased')
+tokenizer = AlbertTokenizer.from_pretrained(
+    'huseinzol05/bert-base-bahasa-cased',
+    unk_token = '[UNK]',
+    pad_token = '[PAD]',
+    do_lower_case = False,
+)
 ```

-We use [google/sentencepiece](https://github.com/google/sentencepiece) to train the tokenizer, so to use it, need to load from `XLNetTokenizer`.
+We use [google/sentencepiece](https://github.com/google/sentencepiece) to train the tokenizer, so you need to load it with `AlbertTokenizer`.
+
+## Example using AutoModelWithLMHead
+
+```python
+from transformers import AlbertTokenizer, AutoModelWithLMHead, pipeline
+
+model = AutoModelWithLMHead.from_pretrained('huseinzol05/bert-base-bahasa-cased')
+tokenizer = AlbertTokenizer.from_pretrained(
+    'huseinzol05/bert-base-bahasa-cased',
+    unk_token = '[UNK]',
+    pad_token = '[PAD]',
+    do_lower_case = False,
+)
+fill_mask = pipeline('fill-mask', model = model, tokenizer = tokenizer)
+print(fill_mask('makan ayam dengan [MASK]'))
+```
+
+Output is,
+
+```text
+[{'sequence': '[CLS] makan ayam dengan rendang[SEP]',
+  'score': 0.10812027007341385,
+  'token': 2446},
+ {'sequence': '[CLS] makan ayam dengan kicap[SEP]',
+  'score': 0.07653367519378662,
+  'token': 12928},
+ {'sequence': '[CLS] makan ayam dengan nasi[SEP]',
+  'score': 0.06839974224567413,
+  'token': 450},
+ {'sequence': '[CLS] makan ayam dengan ayam[SEP]',
+  'score': 0.059544261544942856,
+  'token': 638},
+ {'sequence': '[CLS] makan ayam dengan sayur[SEP]',
+  'score': 0.05294966697692871,
+  'token': 1639}]
+```

 ## Results

--- a/model_cards/huseinzol05/xlnet-base-bahasa-cased/README.md
+++ b/model_cards/huseinzol05/xlnet-base-bahasa-cased/README.md
@@ -0,0 +1,64 @@
+---
+language: malay
+---
+
+# Bahasa XLNet Model
+
+Pretrained XLNet base language model for Malay and Indonesian. 
+
+## Pretraining Corpus
+
+`XLNET-base-bahasa-cased` model was pretrained on ~1.8 billion words. We trained on both standard and social media language structures. Below is the list of data we trained on:
+
+1. [dumping wikipedia](https://github.com/huseinzol05/Malaya-Dataset#wikipedia-1).
+2. [local instagram](https://github.com/huseinzol05/Malaya-Dataset#instagram).
+3. [local twitter](https://github.com/huseinzol05/Malaya-Dataset#twitter-1).
+4. [local news](https://github.com/huseinzol05/Malaya-Dataset#public-news).
+5. [local parliament text](https://github.com/huseinzol05/Malaya-Dataset#parliament).
+6. [local singlish/manglish text](https://github.com/huseinzol05/Malaya-Dataset#singlish-text).
+7. [IIUM Confession](https://github.com/huseinzol05/Malaya-Dataset#iium-confession).
+8. [Wattpad](https://github.com/huseinzol05/Malaya-Dataset#wattpad).
+9. [Academia PDF](https://github.com/huseinzol05/Malaya-Dataset#academia-pdf).
+
+Preprocessing steps can be reproduced from here: [Malaya/pretrained-model/preprocess](https://github.com/huseinzol05/Malaya/tree/master/pretrained-model/preprocess).
+
+## Pretraining details
+
+- This model was trained using zihangdai's XLNet GitHub [repository](https://github.com/zihangdai/xlnet) on 3 Titan V100 32GB VRAM.
+- All steps can be reproduced from here: [Malaya/pretrained-model/xlnet](https://github.com/huseinzol05/Malaya/tree/master/pretrained-model/xlnet).
+
+## Load Pretrained Model
+
+You can use this model by installing `torch` or `tensorflow` and the Huggingface `transformers` library. You can then initialize it directly like this:
+
+```python
+from transformers import XLNetTokenizer, XLNetModel
+
+model = XLNetModel.from_pretrained('huseinzol05/xlnet-base-bahasa-cased')
+tokenizer = XLNetTokenizer.from_pretrained(
+    'huseinzol05/xlnet-base-bahasa-cased', do_lower_case = False
+)
+```
+
+## Example using AutoModelWithLMHead
+
+```python
+from transformers import XLNetTokenizer, AutoModelWithLMHead, pipeline
+
+model = AutoModelWithLMHead.from_pretrained('huseinzol05/xlnet-base-bahasa-cased')
+tokenizer = XLNetTokenizer.from_pretrained(
+    'huseinzol05/xlnet-base-bahasa-cased', do_lower_case = False
+)
+fill_mask = pipeline('fill-mask', model = model, tokenizer = tokenizer)
+print(fill_mask('makan ayam dengan [MASK]'))
+```
+
+## Results
+
+For further details on the model performance, see the accuracy page from Malaya, https://malaya.readthedocs.io/en/latest/Accuracy.html, where we compare with traditional models.
+
+## Acknowledgement
+
+Thanks to [Im Big](https://www.facebook.com/imbigofficial/), [LigBlou](https://www.facebook.com/ligblou), [Mesolitica](https://mesolitica.com/) and [KeyReply](https://www.keyreply.com/) for sponsoring AWS, Google and GPU clouds to train XLNet for Bahasa. 
+
+
--- a/model_cards/mrm8488/GPT-2-finetuned-CORD19/README.md
+++ b/model_cards/mrm8488/GPT-2-finetuned-CORD19/README.md
@@ -0,0 +1,60 @@
+---
+language: english
+thumbnail:
+---
+
+# GPT-2 + CORD19 dataset : 🦠 ✍ ⚕
+
+**GPT-2** fine-tuned on **biorxiv_medrxiv** and **comm_use_subset files** from [CORD-19](https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge) dataset.
+
+
+## Datasets details:
+
+| Dataset                | # Files |
+| ---------------------- | ----- |
+| biorxiv_medrxiv        | 885  |
+| comm_use_subset        | 9K   |
+
+## Model training
+
+The model was trained on a Tesla P100 GPU and 25GB of RAM with the following command:
+
+```bash
+
+export TRAIN_FILE=/path/to/dataset/train.txt
+
+python run_language_modeling.py \
+    --model_type gpt2 \
+    --model_name_or_path gpt2 \
+    --do_train \
+    --train_data_file $TRAIN_FILE \
+    --num_train_epochs 4 \
+    --output_dir model_output \
+    --overwrite_output_dir \
+    --save_steps 10000 \
+    --per_gpu_train_batch_size 3
+```
+
+<img alt="training loss" src="https://svgshare.com/i/JTf.svg" title="GPT-2-finetuned-CORD19-loss" width="600" height="300" />
+
+## Model in action / Example of usage: ✒
+
+You can get the following script [here](https://github.com/huggingface/transformers/blob/master/examples/run_generation.py)
+
+```bash
+python run_generation.py \
+    --model_type gpt2 \
+    --model_name_or_path mrm8488/GPT-2-finetuned-CORD19 \
+    --length 200
+```
+```txt
+# Input: the effects of COVID-19 on the lungs
+# Output: === GENERATED SEQUENCE 1 ===
+the effects of COVID-19 on the lungs are currently debated (86). The role of this virus in the pathogenesis of pneumonia and lung cancer is still debated. MERS-CoV is also known to cause acute respiratory distress syndrome (87) and is associated with increased expression of pulmonary fibrosis markers (88). Thus, early airway inflammation may play an important role in the pathogenesis of coronavirus pneumonia and may contribute to the severe disease and/or mortality observed in coronavirus patients.
+Pneumonia is an acute, often fatal disease characterized by severe edema, leakage of oxygen and bronchiolar inflammation. Viruses include coronaviruses, and the role of oxygen depletion is complicated by lung injury and fibrosis in the lung, in addition to susceptibility to other lung diseases. The progression of the disease may be variable, depending on the lung injury, pathologic role, prognosis, and the immune status of the patient. Inflammatory responses to respiratory viruses cause various pathologies of the respiratory
+```
+
+
+> Created by [Manuel Romero/@mrm8488](https://twitter.com/mrm8488) | [LinkedIn](https://www.linkedin.com/in/manuel-romero-cs/)
+
+> Made with <span style="color: #e25555;">&hearts;</span> in Spain
--- a/model_cards/mrm8488/bert-spanish-cased-finetuned-pos/README.md
+++ b/model_cards/mrm8488/bert-spanish-cased-finetuned-pos/README.md
@@ -5,7 +5,7 @@ thumbnail: https://i.imgur.com/jgBdimh.png

 # Spanish BERT (BETO) + POS

-This model is a fine-tuned on [NER-C](https://www.kaggle.com/nltkdata/conll-corpora) Of the Spanish BERT cased [(BETO)](https://github.com/dccuchile/beto) for **POS** (Part of Speech tagging) downstream task.
+This model is a version of the Spanish BERT cased [(BETO)](https://github.com/dccuchile/beto) fine-tuned on the Spanish [CONLL CORPORA](https://www.kaggle.com/nltkdata/conll-corpora) for the **POS** (Part of Speech tagging) downstream task.

 ## Details of the downstream task (POS) - Dataset

--- a/model_cards/twmkn9/albert-base-v2-squad2/README.md
+++ b/model_cards/twmkn9/albert-base-v2-squad2/README.md
@@ -1,22 +1,24 @@
-This model is ALBERT base v2 trained on SQuAD v2 as:
+This model is [ALBERT base v2](https://huggingface.co/albert-base-v2) trained on SQuAD v2 as:

 ```
-python run_squad.py 
--model_type albert 
--model_name_or_path albert-base-v2 
--do_train 
--do_eval 
--overwrite_cache 
--do_lower_case 
--version_2_with_negative 
--train_file $SQUAD_DIR/train-v2.0.json 
--predict_file $SQUAD_DIR/dev-v2.0.json 
--per_gpu_train_batch_size 8 
--num_train_epochs 3 
--learning_rate 3e-5 
--max_seq_length 384 
--doc_stride 128 
--output_dir ./tmp/albert_base_fine/
+export SQUAD_DIR=../../squad2
+python3 run_squad.py \
+    --model_type albert \
+    --model_name_or_path albert-base-v2 \
+    --do_train \
+    --do_eval \
+    --overwrite_cache \
+    --do_lower_case \
+    --version_2_with_negative \
+    --save_steps 100000 \
+    --train_file $SQUAD_DIR/train-v2.0.json \
+    --predict_file $SQUAD_DIR/dev-v2.0.json \
+    --per_gpu_train_batch_size 8 \
+    --num_train_epochs 3 \
+    --learning_rate 3e-5 \
+    --max_seq_length 384 \
+    --doc_stride 128 \
+    --output_dir ./tmp/albert_fine/
 ```

 Performance on a dev subset is close to the original paper:
--- a/model_cards/twmkn9/bert-base-uncased-squad2/README.md
+++ b/model_cards/twmkn9/bert-base-uncased-squad2/README.md
@@ -1,22 +1,24 @@
-This model is BERT base uncased trained on SQuAD v2 as:
+This model is [BERT base uncased](https://huggingface.co/bert-base-uncased) trained on SQuAD v2 as:

 ```
-python run_squad.py 
--model_type bert 
--model_name_or_path bert-base-uncased
--do_train 
--do_eval 
--overwrite_cache 
--do_lower_case 
--version_2_with_negative 
--train_file $SQUAD_DIR/train-v2.0.json 
--predict_file $SQUAD_DIR/dev-v2.0.json 
--per_gpu_train_batch_size 8 
--num_train_epochs 3 
--learning_rate 3e-5 
--max_seq_length 384 
--doc_stride 128 
--output_dir ./tmp/bert_base_fine/
+export SQUAD_DIR=../../squad2
+python3 run_squad.py \
+    --model_type bert \
+    --model_name_or_path bert-base-uncased \
+    --do_train \
+    --do_eval \
+    --overwrite_cache \
+    --do_lower_case \
+    --version_2_with_negative \
+    --save_steps 100000 \
+    --train_file $SQUAD_DIR/train-v2.0.json \
+    --predict_file $SQUAD_DIR/dev-v2.0.json \
+    --per_gpu_train_batch_size 8 \
+    --num_train_epochs 3 \
+    --learning_rate 3e-5 \
+    --max_seq_length 384 \
+    --doc_stride 128 \
+    --output_dir ./tmp/bert_fine_tuned/
 ```

 Performance on a dev subset is close to the original paper:
--- a/model_cards/twmkn9/distilbert-base-uncased-squad2/README.md
+++ b/model_cards/twmkn9/distilbert-base-uncased-squad2/README.md
@@ -0,0 +1,45 @@
+This model is [Distilbert base uncased](https://huggingface.co/distilbert-base-uncased) trained on SQuAD v2 as:
+
+```
+export SQUAD_DIR=../../squad2
+python3 run_squad.py \
+    --model_type distilbert \
+    --model_name_or_path distilbert-base-uncased \
+    --do_train \
+    --do_eval \
+    --overwrite_cache \
+    --do_lower_case \
+    --version_2_with_negative \
+    --save_steps 100000 \
+    --train_file $SQUAD_DIR/train-v2.0.json \
+    --predict_file $SQUAD_DIR/dev-v2.0.json \
+    --per_gpu_train_batch_size 8 \
+    --num_train_epochs 3 \
+    --learning_rate 3e-5 \
+    --max_seq_length 384 \
+    --doc_stride 128 \
+    --output_dir ./tmp/distilbert_fine_tuned/
+```
+
+Performance on a dev subset is close to the original paper:
+
+```
+Results: 
+{
+    'exact': 64.88976637051661, 
+    'f1': 68.1776176526635, 
+    'total': 6078, 
+    'HasAns_exact': 69.7594501718213, 
+    'HasAns_f1': 76.62665295288285, 
+    'HasAns_total': 2910, 
+    'NoAns_exact': 60.416666666666664, 
+    'NoAns_f1': 60.416666666666664, 
+    'NoAns_total': 3168, 
+    'best_exact': 64.88976637051661, 
+    'best_exact_thresh': 0.0, 
+    'best_f1': 68.17761765266337, 
+    'best_f1_thresh': 0.0
+}
+```
+
+We are hopeful this might save you time, energy, and compute. Cheers!
--- a/model_cards/twmkn9/distilroberta-base-squad2/README.md
+++ b/model_cards/twmkn9/distilroberta-base-squad2/README.md
@@ -0,0 +1,44 @@
+This model is [Distilroberta base](https://huggingface.co/distilroberta-base) trained on SQuAD v2 as:
+
+```
+export SQUAD_DIR=../../squad2
+python3 run_squad.py \
+    --model_type roberta \
+    --model_name_or_path distilroberta-base \
+    --do_train \
+    --do_eval \
+    --overwrite_cache \
+    --do_lower_case \
+    --version_2_with_negative \
+    --save_steps 100000 \
+    --train_file $SQUAD_DIR/train-v2.0.json \
+    --predict_file $SQUAD_DIR/dev-v2.0.json \
+    --per_gpu_train_batch_size 8 \
+    --num_train_epochs 3 \
+    --learning_rate 3e-5 \
+    --max_seq_length 384 \
+    --doc_stride 128 \
+    --output_dir ./tmp/distilroberta_fine_tuned/
+```
+
+Performance on a dev subset is close to the original paper:
+
+```
+Results: 
+{
+    'exact': 70.9279368213228, 
+    'f1': 74.60439802429168, 
+    'total': 6078, 
+    'HasAns_exact': 67.62886597938144, 
+    'HasAns_f1': 75.30774267754136, 
+    'HasAns_total': 2910, 
+    'NoAns_exact': 73.95833333333333, 
+    'NoAns_f1': 73.95833333333333, 'NoAns_total': 3168, 
+    'best_exact': 70.94438960184272, 
+    'best_exact_thresh': 0.0, 
+    'best_f1': 74.62085080481161, 
+    'best_f1_thresh': 0.0
+}
+```
+
+We are hopeful this might save you time, energy, and compute. Cheers!
--- a/notebooks/03-pipelines.ipynb
+++ b/notebooks/03-pipelines.ipynb
--- a/setup.py
+++ b/setup.py
@@ -76,14 +76,14 @@ extras["testing"] = ["pytest", "pytest-xdist"]
 extras["docs"] = ["recommonmark", "sphinx", "sphinx-markdown-tables", "sphinx-rtd-theme"]
 extras["quality"] = [
    "black",
-    "isort",
+    "isort @ git+git://github.com/timothycrosley/isort.git@e63ae06ec7d70b06df9e528357650281a3d3ec22#egg=isort",
    "flake8",
 ]
 extras["dev"] = extras["testing"] + extras["quality"] + ["mecab-python3", "scikit-learn", "tensorflow", "torch"]

 setup(
    name="transformers",
-    version="2.6.0",
+    version="2.7.0",
    author="Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Sam Shleifer, Google AI Language Team Authors, Open AI team Authors, Facebook AI Authors, Carnegie Mellon University Authors",
    author_email="thomas@huggingface.co",
    description="State-of-the-art Natural Language Processing for TensorFlow 2.0 and PyTorch",
@@ -97,6 +97,8 @@ setup(
    install_requires=[
        "numpy",
        "tokenizers == 0.5.2",
+        # dataclasses for Python versions that don't have it
+        "dataclasses;python_version<'3.7'",
        # accessing files from S3 directly
        "boto3",
        # filesystem locks e.g. to prevent parallel downloads
--- a/src/transformers/__init__.py
+++ b/src/transformers/__init__.py
@@ -2,7 +2,7 @@
 # There's no way to ignore "F401 '...' imported but unused" warnings in this
 # module, but to preserve other warnings. So, don't check this module at all.

-__version__ = "2.6.0"
+__version__ = "2.7.0"

 # Work around to update TensorFlow's absl.logging threshold which alters the
 # default Python logging output behavior when present.
@@ -32,7 +32,7 @@ from .benchmark_utils import (
    stop_memory_tracing,
 )
 from .configuration_albert import ALBERT_PRETRAINED_CONFIG_ARCHIVE_MAP, AlbertConfig
-from .configuration_auto import ALL_PRETRAINED_CONFIG_ARCHIVE_MAP, AutoConfig
+from .configuration_auto import ALL_PRETRAINED_CONFIG_ARCHIVE_MAP, CONFIG_MAPPING, AutoConfig
 from .configuration_bart import BartConfig
 from .configuration_bert import BERT_PRETRAINED_CONFIG_ARCHIVE_MAP, BertConfig
 from .configuration_camembert import CAMEMBERT_PRETRAINED_CONFIG_ARCHIVE_MAP, CamembertConfig
@@ -116,10 +116,11 @@ from .pipelines import (
    SummarizationPipeline,
    TextClassificationPipeline,
    TokenClassificationPipeline,
+    TranslationPipeline,
    pipeline,
 )
 from .tokenization_albert import AlbertTokenizer
-from .tokenization_auto import AutoTokenizer
+from .tokenization_auto import TOKENIZER_MAPPING, AutoTokenizer
 from .tokenization_bart import BartTokenizer
 from .tokenization_bert import BasicTokenizer, BertTokenizer, BertTokenizerFast, WordpieceTokenizer
 from .tokenization_bert_japanese import BertJapaneseTokenizer, CharacterTokenizer, MecabTokenizer
@@ -221,6 +222,7 @@ if is_torch_available():
        XLMModel,
        XLMWithLMHeadModel,
        XLMForSequenceClassification,
+        XLMForTokenClassification,
        XLMForQuestionAnswering,
        XLMForQuestionAnsweringSimple,
        XLM_PRETRAINED_MODEL_ARCHIVE_MAP,
--- a/src/transformers/configuration_bart.py
+++ b/src/transformers/configuration_bart.py
@@ -26,6 +26,7 @@ BART_PRETRAINED_CONFIG_ARCHIVE_MAP = {
    "bart-large": "https://s3.amazonaws.com/models.huggingface.co/bert/facebook/bart-large/config.json",
    "bart-large-mnli": "https://s3.amazonaws.com/models.huggingface.co/bert/facebook/bart-large-mnli/config.json",
    "bart-large-cnn": "https://s3.amazonaws.com/models.huggingface.co/bert/facebook/bart-large-cnn/config.json",
+    "bart-large-xsum": "https://s3.amazonaws.com/models.huggingface.co/bert/facebook/bart-large-xsum/config.json",
 }


--- a/src/transformers/configuration_utils.py
+++ b/src/transformers/configuration_utils.py
@@ -78,9 +78,6 @@ class PretrainedConfig(object):
        self.top_k = kwargs.pop("top_k", 50)
        self.top_p = kwargs.pop("top_p", 1.0)
        self.repetition_penalty = kwargs.pop("repetition_penalty", 1.0)
-        self.bos_token_id = kwargs.pop("bos_token_id", None)
-        self.pad_token_id = kwargs.pop("pad_token_id", None)
-        self.eos_token_id = kwargs.pop("eos_token_id", None)
        self.length_penalty = kwargs.pop("length_penalty", 1.0)
        self.no_repeat_ngram_size = kwargs.pop("no_repeat_ngram_size", 0)
        self.num_return_sequences = kwargs.pop("num_return_sequences", 1)
@@ -94,6 +91,16 @@ class PretrainedConfig(object):
        self.label2id = kwargs.pop("label2id", dict(zip(self.id2label.values(), self.id2label.keys())))
        self.label2id = dict((key, int(value)) for key, value in self.label2id.items())

+        # Tokenizer arguments TODO: eventually tokenizer and models should share the same config
+        self.prefix = kwargs.pop("prefix", None)
+        self.bos_token_id = kwargs.pop("bos_token_id", None)
+        self.pad_token_id = kwargs.pop("pad_token_id", None)
+        self.eos_token_id = kwargs.pop("eos_token_id", None)
+        self.decoder_start_token_id = kwargs.pop("decoder_start_token_id", None)
+
+        # task specific arguments
+        self.task_specific_params = kwargs.pop("task_specific_params", None)
+
        # Additional attributes without default values
        for key, value in kwargs.items():
            try:
@@ -373,3 +380,14 @@ class PretrainedConfig(object):
        """
        with open(json_file_path, "w", encoding="utf-8") as writer:
            writer.write(self.to_json_string())
+
+    def update(self, config_dict: Dict):
+        """
+        Updates attributes of this class
+        with attributes from `config_dict`.
+
+        Args:
+            config_dict (:obj:`Dict[str, Any]`): Dictionary of attributes that shall be updated for this class.
+        """
+        for key, value in config_dict.items():
+            setattr(self, key, value)
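Note on the `update` method above: a minimal sketch of how it might be combined with the new `task_specific_params` attribute. `TinyConfig` is a hypothetical stand-in for `PretrainedConfig`, and the `"summarization"` key and `max_length` value are illustrative assumptions, not values taken from this release.

```python
from typing import Any, Dict


class TinyConfig:
    """Hypothetical stand-in for PretrainedConfig; only the pieces needed here."""

    def __init__(self, **kwargs):
        # Mirrors the kwargs.pop(...) pattern used in the diff above.
        self.prefix = kwargs.pop("prefix", None)
        self.task_specific_params = kwargs.pop("task_specific_params", None)

    def update(self, config_dict: Dict[str, Any]) -> None:
        # Same loop as the added method: every key becomes an attribute.
        for key, value in config_dict.items():
            setattr(self, key, value)


# Illustrative values only.
config = TinyConfig(task_specific_params={"summarization": {"max_length": 142}})
config.update(config.task_specific_params["summarization"])
```

Because `update` uses `setattr` without validation, a misspelled key silently creates a new attribute instead of raising an error.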
--- a/src/transformers/convert_bart_original_pytorch_checkpoint_to_pytorch.py
+++ b/src/transformers/convert_bart_original_pytorch_checkpoint_to_pytorch.py
@@ -17,6 +17,7 @@

 import argparse
 import logging
+import os
 from pathlib import Path

 import fairseq
@@ -30,10 +31,11 @@ from transformers import (
    BartModel,
    BartTokenizer,
 )
+from transformers.modeling_bart import _make_linear_from_emb


-FAIRSEQ_MODELS = ["bart.large", "bart.large.mnli", "bart.large.cnn"]
-
+FAIRSEQ_MODELS = ["bart.large", "bart.large.mnli", "bart.large.cnn", "bart_xsum/model.pt"]
+extra_arch = {"bart.large": BartModel, "bart.large.mnli": BartForSequenceClassification}
 if version.parse(fairseq.__version__) < version.parse("0.9.0"):
    raise Exception("requires fairseq >= 0.9.0")

@@ -57,62 +59,79 @@ def rename_key(dct, old, new):
    dct[new] = val


-def convert_bart_checkpoint(checkpoint_path, pytorch_dump_folder_path):
+def load_xsum_checkpoint(checkpoint_path):
+    """Checkpoint path should end in model.pt"""
+    sd = torch.load(checkpoint_path, map_location="cpu")
+    hub_interface = torch.hub.load("pytorch/fairseq", "bart.large.cnn").eval()
+    hub_interface.model.load_state_dict(sd["model"])
+    return hub_interface
+
+
+@torch.no_grad()
+def convert_bart_checkpoint(checkpoint_path, pytorch_dump_folder_path, hf_checkpoint_name=None):
    """
    Copy/paste/tweak model's weights to our BART structure.
    """
-    bart = torch.hub.load("pytorch/fairseq", checkpoint_path)
-    bart.eval()  # disable dropout
+    if not os.path.exists(checkpoint_path):
+        bart = torch.hub.load("pytorch/fairseq", checkpoint_path).eval()
+    else:
+        bart = load_xsum_checkpoint(checkpoint_path)
+
    bart.model.upgrade_state_dict(bart.model.state_dict())
-    hf_model_name = checkpoint_path.replace(".", "-")
-    config = BartConfig.from_pretrained(hf_model_name)
+    if hf_checkpoint_name is None:
+        hf_checkpoint_name = checkpoint_path.replace(".", "-")
+    config = BartConfig.from_pretrained(hf_checkpoint_name)
    tokens = bart.encode(SAMPLE_TEXT).unsqueeze(0)
-    tokens2 = BartTokenizer.from_pretrained(hf_model_name).encode(SAMPLE_TEXT, return_tensors="pt").unsqueeze(0)
+    tokens2 = BartTokenizer.from_pretrained(hf_checkpoint_name).encode(SAMPLE_TEXT, return_tensors="pt").unsqueeze(0)
    assert torch.eq(tokens, tokens2).all()

-    if checkpoint_path in ["bart.large", "bart.large.cnn"]:
-        state_dict = bart.model.state_dict()
-        for k in IGNORE_KEYS:
-            state_dict.pop(k, None)
-        state_dict["shared.weight"] = state_dict["decoder.embed_tokens.weight"]
-        model = BartModel(config)
-        their_output = bart.extract_features(tokens)
-    else:  # MNLI Case
+    if checkpoint_path == "bart.large.mnli":
        state_dict = bart.state_dict()
-        for k in IGNORE_KEYS:
-            state_dict.pop(k, None)
+        remove_ignore_keys_(state_dict)
        state_dict["model.shared.weight"] = state_dict["model.decoder.embed_tokens.weight"]
        for src, dest in rename_keys:
            rename_key(state_dict, src, dest)
-        model = BartForSequenceClassification(config)
-        their_output = bart.predict("mnli", tokens, return_logits=True)
+        model = BartForSequenceClassification(config).eval()
+        model.load_state_dict(state_dict)
+        fairseq_output = bart.predict("mnli", tokens, return_logits=True)
+        new_model_outputs = model(tokens)[0]  # logits
+    else:  # no classification heads to worry about
+        state_dict = bart.model.state_dict()
+        remove_ignore_keys_(state_dict)
+        state_dict["shared.weight"] = state_dict["decoder.embed_tokens.weight"]
+        fairseq_output = bart.extract_features(tokens)
+        if hf_checkpoint_name == "bart-large":
+            model = BartModel(config).eval()
+            model.load_state_dict(state_dict)
+            new_model_outputs = model(tokens)[0]
+        else:
+            model = BartForConditionalGeneration(config).eval()  # an existing summarization ckpt
+            model.model.load_state_dict(state_dict)
+            if hasattr(model, "lm_head"):
+                model.lm_head = _make_linear_from_emb(model.model.shared)
+            new_model_outputs = model.model(tokens)[0]

-    # Load state dict
-    model.load_state_dict(state_dict)
-    model.eval()
    # Check results
-
-    if checkpoint_path == "bart.large.cnn":
-        model = BartForConditionalGeneration(config, base_model=model)
-        assert "lm_head.weight" in model.state_dict()
-        assert model.lm_head.out_features == config.max_position_embeddings
-        model.eval()
-        our_outputs = model.model(tokens)[0]
-    else:
-        our_outputs = model(tokens)[0]
-    assert their_output.shape == our_outputs.shape
-    assert (their_output == our_outputs).all().item()
+    assert fairseq_output.shape == new_model_outputs.shape
+    assert (fairseq_output == new_model_outputs).all().item()
    Path(pytorch_dump_folder_path).mkdir(exist_ok=True)
    model.save_pretrained(pytorch_dump_folder_path)


+def remove_ignore_keys_(state_dict):
+    for k in IGNORE_KEYS:
+        state_dict.pop(k, None)
+
+
 if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    # Required parameters
-    parser.add_argument("fairseq_path", choices=FAIRSEQ_MODELS, type=str, help="")
-
-    parser.add_argument("pytorch_dump_folder_path", default=None, type=str, help="Path to the output PyTorch model.")
-    args = parser.parse_args()
-    convert_bart_checkpoint(
-        args.fairseq_path, args.pytorch_dump_folder_path,
+    parser.add_argument(
+        "fairseq_path", type=str, help="bart.large, bart.large.cnn or a path to a model.pt on local filesystem."
    )
+    parser.add_argument("pytorch_dump_folder_path", default=None, type=str, help="Path to the output PyTorch model.")
+    parser.add_argument(
+        "--hf_config", default=None, type=str, help="Which huggingface architecture to use: bart-large-xsum"
+    )
+    args = parser.parse_args()
+    convert_bart_checkpoint(args.fairseq_path, args.pytorch_dump_folder_path, hf_checkpoint_name=args.hf_config)
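Note on the checkpoint dispatch above: the converter now treats `fairseq_path` as a local file when it exists on disk and as a `torch.hub` model name otherwise, and derives the Hugging Face name by replacing dots with dashes unless `--hf_config` is given. The sketch below isolates only that decision. `describe_checkpoint` is a hypothetical helper, not part of the script, and it loads no weights.

```python
import os
import tempfile


def describe_checkpoint(checkpoint_path, hf_checkpoint_name=None):
    """Hypothetical helper: report the weight source and the HF name that would be used."""
    source = "local file" if os.path.exists(checkpoint_path) else "torch.hub"
    if hf_checkpoint_name is None:
        # Default used by the script: "bart.large.cnn" -> "bart-large-cnn".
        hf_checkpoint_name = checkpoint_path.replace(".", "-")
    return source, hf_checkpoint_name


hub_case = describe_checkpoint("bart.large.cnn")

with tempfile.TemporaryDirectory() as tmp_dir:
    local_path = os.path.join(tmp_dir, "model.pt")
    open(local_path, "wb").close()  # empty placeholder file, never loaded
    local_default = describe_checkpoint(local_path)
    local_explicit = describe_checkpoint(local_path, hf_checkpoint_name="bart-large-xsum")
```

For a local path the default name ends in `model-pt`, which is not a known configuration, so passing `--hf_config` appears to be required in practice when converting a local `model.pt`.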
--- a/src/transformers/data/processors/squad.py
+++ b/src/transformers/data/processors/squad.py
@@ -139,6 +139,7 @@ def squad_convert_example_to_features(example, max_seq_length, doc_stride, max_q
            pad_to_max_length=True,
            stride=max_seq_length - doc_stride - len(truncated_query) - sequence_pair_added_tokens,
            truncation_strategy="only_second" if tokenizer.padding_side == "right" else "only_first",
+            return_token_type_ids=True,
        )

        paragraph_len = min(
--- a/src/transformers/data/processors/utils.py
+++ b/src/transformers/data/processors/utils.py
@@ -16,8 +16,11 @@

 import copy
 import csv
+import dataclasses
 import json
 import logging
+from dataclasses import dataclass
+from typing import Optional

 from ...file_utils import is_tf_available, is_torch_available

@@ -25,7 +28,8 @@ from ...file_utils import is_tf_available, is_torch_available
 logger = logging.getLogger(__name__)


-class InputExample(object):
+@dataclass(frozen=True)
+class InputExample:
    """
    A single training/test example for simple sequence classification.

@@ -39,23 +43,14 @@ class InputExample(object):
            specified for train and dev examples, but not for test examples.
    """

-    def __init__(self, guid, text_a, text_b=None, label=None):
-        self.guid = guid
-        self.text_a = text_a
-        self.text_b = text_b
-        self.label = label
-
-    def __repr__(self):
-        return str(self.to_json_string())
-
-    def to_dict(self):
-        """Serializes this instance to a Python dictionary."""
-        output = copy.deepcopy(self.__dict__)
-        return output
+    guid: str
+    text_a: str
+    text_b: Optional[str] = None
+    label: Optional[str] = None

    def to_json_string(self):
        """Serializes this instance to a JSON string."""
-        return json.dumps(self.to_dict(), indent=2, sort_keys=True) + "\n"
+        return json.dumps(dataclasses.asdict(self), indent=2, sort_keys=True) + "\n"


 class InputFeatures(object):
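Note on the `InputExample` change above: a self-contained sketch of what the frozen dataclass implies for callers. The class body is copied from the diff. The example values are illustrative assumptions.

```python
import dataclasses
import json
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class InputExample:
    guid: str
    text_a: str
    text_b: Optional[str] = None
    label: Optional[str] = None

    def to_json_string(self):
        """Serializes this instance to a JSON string."""
        return json.dumps(dataclasses.asdict(self), indent=2, sort_keys=True) + "\n"


example = InputExample(guid="train-1", text_a="A sentence.", label="1")
serialized = example.to_json_string()

# frozen=True turns attribute assignment into an error.
try:
    example.label = "0"
    mutated = True
except dataclasses.FrozenInstanceError:
    mutated = False

# dataclasses.replace builds a modified copy instead.
relabeled = dataclasses.replace(example, label="0")
```

Code that previously called `example.to_dict()` or assigned to fields after construction would need to switch to `dataclasses.asdict(example)` and `dataclasses.replace(...)`, since `to_dict` is removed in this diff.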
--- a/src/transformers/modeling_auto.py
+++ b/src/transformers/modeling_auto.py
@@ -99,6 +99,7 @@ from .modeling_xlm import (
    XLM_PRETRAINED_MODEL_ARCHIVE_MAP,
    XLMForQuestionAnsweringSimple,
    XLMForSequenceClassification,
+    XLMForTokenClassification,
    XLMModel,
    XLMWithLMHeadModel,
 )
@@ -235,6 +236,7 @@ MODEL_FOR_TOKEN_CLASSIFICATION_MAPPING = OrderedDict(
    [
        (DistilBertConfig, DistilBertForTokenClassification),
        (CamembertConfig, CamembertForTokenClassification),
+        (XLMConfig, XLMForTokenClassification),
        (XLMRobertaConfig, XLMRobertaForTokenClassification),
        (RobertaConfig, RobertaForTokenClassification),
        (BertConfig, BertForTokenClassification),
@@ -418,12 +420,12 @@ class AutoModelForPreTraining(object):
            config (:class:`~transformers.PretrainedConfig`):
                The model class to instantiate is selected based on the configuration class:

-                - isInstance of `distilbert` configuration class: :class:`~transformers.DistilBertModelForMaskedLM` (DistilBERT model)
-                - isInstance of `roberta` configuration class: :class:`~transformers.RobertaModelForMaskedLM` (RoBERTa model)
+                - isInstance of `distilbert` configuration class: :class:`~transformers.DistilBertForMaskedLM` (DistilBERT model)
+                - isInstance of `roberta` configuration class: :class:`~transformers.RobertaForMaskedLM` (RoBERTa model)
                - isInstance of `bert` configuration class: :class:`~transformers.BertForPreTraining` (Bert model)
                - isInstance of `openai-gpt` configuration class: :class:`~transformers.OpenAIGPTLMHeadModel` (OpenAI GPT model)
-                - isInstance of `gpt2` configuration class: :class:`~transformers.GPT2ModelLMHeadModel` (OpenAI GPT-2 model)
-                - isInstance of `ctrl` configuration class: :class:`~transformers.CTRLModelLMHeadModel` (Salesforce CTRL  model)
+                - isInstance of `gpt2` configuration class: :class:`~transformers.GPT2LMHeadModel` (OpenAI GPT-2 model)
+                - isInstance of `ctrl` configuration class: :class:`~transformers.CTRLLMHeadModel` (Salesforce CTRL  model)
                - isInstance of `transfo-xl` configuration class: :class:`~transformers.TransfoXLLMHeadModel` (Transformer-XL model)
                - isInstance of `xlnet` configuration class: :class:`~transformers.XLNetLMHeadModel` (XLNet model)
                - isInstance of `xlm` configuration class: :class:`~transformers.XLMWithLMHeadModel` (XLM model)
@@ -559,12 +561,12 @@ class AutoModelWithLMHead(object):
            config (:class:`~transformers.PretrainedConfig`):
                The model class to instantiate is selected based on the configuration class:

-                - isInstance of `distilbert` configuration class: :class:`~transformers.DistilBertModelForMaskedLM` (DistilBERT model)
-                - isInstance of `roberta` configuration class: :class:`~transformers.RobertaModelForMaskedLM` (RoBERTa model)
-                - isInstance of `bert` configuration class: :class:`~transformers.BertModelForMaskedLM` (Bert model)
+                - isInstance of `distilbert` configuration class: :class:`~transformers.DistilBertForMaskedLM` (DistilBERT model)
+                - isInstance of `roberta` configuration class: :class:`~transformers.RobertaForMaskedLM` (RoBERTa model)
+                - isInstance of `bert` configuration class: :class:`~transformers.BertForMaskedLM` (Bert model)
                - isInstance of `openai-gpt` configuration class: :class:`~transformers.OpenAIGPTLMHeadModel` (OpenAI GPT model)
-                - isInstance of `gpt2` configuration class: :class:`~transformers.GPT2ModelLMHeadModel` (OpenAI GPT-2 model)
-                - isInstance of `ctrl` configuration class: :class:`~transformers.CTRLModelLMHeadModel` (Salesforce CTRL  model)
+                - isInstance of `gpt2` configuration class: :class:`~transformers.GPT2LMHeadModel` (OpenAI GPT-2 model)
+                - isInstance of `ctrl` configuration class: :class:`~transformers.CTRLLMHeadModel` (Salesforce CTRL  model)
                - isInstance of `transfo-xl` configuration class: :class:`~transformers.TransfoXLLMHeadModel` (Transformer-XL model)
                - isInstance of `xlnet` configuration class: :class:`~transformers.XLNetLMHeadModel` (XLNet model)
                - isInstance of `xlm` configuration class: :class:`~transformers.XLMWithLMHeadModel` (XLM model)
@@ -701,14 +703,14 @@ class AutoModelForSequenceClassification(object):
            config (:class:`~transformers.PretrainedConfig`):
                The model class to instantiate is selected based on the configuration class:

-                - isInstance of `distilbert` configuration class: :class:`~transformers.DistilBertModelForSequenceClassification` (DistilBERT model)
-                - isInstance of `albert` configuration class: :class:`~transformers.AlbertModelForSequenceClassification` (ALBERT model)
-                - isInstance of `camembert` configuration class: :class:`~transformers.CamembertModelForSequenceClassification` (CamemBERT model)
-                - isInstance of `xlm roberta` configuration class: :class:`~transformers.XLMRobertaModelForSequenceClassification` (XLM-RoBERTa model)
-                - isInstance of `roberta` configuration class: :class:`~transformers.RobertaModelForSequenceClassification` (RoBERTa model)
-                - isInstance of `bert` configuration class: :class:`~transformers.BertModelForSequenceClassification` (Bert model)
-                - isInstance of `xlnet` configuration class: :class:`~transformers.XLNetModelForSequenceClassification` (XLNet model)
-                - isInstance of `xlm` configuration class: :class:`~transformers.XLMModelForSequenceClassification` (XLM model)
+                - isInstance of `distilbert` configuration class: :class:`~transformers.DistilBertForSequenceClassification` (DistilBERT model)
+                - isInstance of `albert` configuration class: :class:`~transformers.AlbertForSequenceClassification` (ALBERT model)
+                - isInstance of `camembert` configuration class: :class:`~transformers.CamembertForSequenceClassification` (CamemBERT model)
+                - isInstance of `xlm roberta` configuration class: :class:`~transformers.XLMRobertaForSequenceClassification` (XLM-RoBERTa model)
+                - isInstance of `roberta` configuration class: :class:`~transformers.RobertaForSequenceClassification` (RoBERTa model)
+                - isInstance of `bert` configuration class: :class:`~transformers.BertForSequenceClassification` (Bert model)
+                - isInstance of `xlnet` configuration class: :class:`~transformers.XLNetForSequenceClassification` (XLNet model)
+                - isInstance of `xlm` configuration class: :class:`~transformers.XLMForSequenceClassification` (XLM model)
                - isInstance of `flaubert` configuration class: :class:`~transformers.FlaubertForSequenceClassification` (Flaubert model)


@@ -848,11 +850,11 @@ class AutoModelForQuestionAnswering(object):
            config (:class:`~transformers.PretrainedConfig`):
                The model class to instantiate is selected based on the configuration class:

-                - isInstance of `distilbert` configuration class: :class:`~transformers.DistilBertModelForQuestionAnswering` (DistilBERT model)
-                - isInstance of `albert` configuration class: :class:`~transformers.AlbertModelForQuestionAnswering` (ALBERT model)
+                - isInstance of `distilbert` configuration class: :class:`~transformers.DistilBertForQuestionAnswering` (DistilBERT model)
+                - isInstance of `albert` configuration class: :class:`~transformers.AlbertForQuestionAnswering` (ALBERT model)
                - isInstance of `bert` configuration class: :class:`~transformers.BertForQuestionAnswering` (Bert model)
-                - isInstance of `xlnet` configuration class: :class:`~transformers.XLNetModelForQuestionAnswering` (XLNet model)
-                - isInstance of `xlm` configuration class: :class:`~transformers.XLMModelForQuestionAnswering` (XLM model)
+                - isInstance of `xlnet` configuration class: :class:`~transformers.XLNetForQuestionAnswering` (XLNet model)
+                - isInstance of `xlm` configuration class: :class:`~transformers.XLMForQuestionAnswering` (XLM model)
                - isInstance of `flaubert` configuration class: :class:`~transformers.FlaubertForQuestionAnswering` (Flaubert model)

        Examples::
@@ -989,8 +991,10 @@ class AutoModelForTokenClassification:
                The model class to instantiate is selected based on the configuration class:

                - isInstance of `distilbert` configuration class: :class:`~transformers.DistilBertForTokenClassification` (DistilBERT model)
+                - isInstance of `xlm` configuration class: :class:`~transformers.XLMForTokenClassification` (XLM model)
                - isInstance of `xlm roberta` configuration class: :class:`~transformers.XLMRobertaForTokenClassification` (XLMRoberta model)
                - isInstance of `bert` configuration class: :class:`~transformers.BertForTokenClassification` (Bert model)
+                - isInstance of `albert` configuration class: :class:`~transformers.AlbertForTokenClassification` (ALBERT model)
                - isInstance of `xlnet` configuration class: :class:`~transformers.XLNetForTokenClassification` (XLNet model)
                - isInstance of `camembert` configuration class: :class:`~transformers.CamembertForTokenClassification` (Camembert model)
                - isInstance of `roberta` configuration class: :class:`~transformers.RobertaForTokenClassification` (Roberta model)
@@ -1025,6 +1029,7 @@ class AutoModelForTokenClassification:
        The model class to instantiate is selected as the first pattern matching
        in the `pretrained_model_name_or_path` string (in the following order):
            - contains `distilbert`: :class:`~transformers.DistilBertForTokenClassification` (DistilBERT model)
+            - contains `xlm`: :class:`~transformers.XLMForTokenClassification` (XLM model)
            - contains `xlm-roberta`: :class:`~transformers.XLMRobertaForTokenClassification` (XLM-RoBERTa model)
            - contains `camembert`: :class:`~transformers.CamembertForTokenClassification` (Camembert model)
            - contains `bert`: :class:`~transformers.BertForTokenClassification` (Bert model)
--- a/src/transformers/modeling_bart.py
+++ b/src/transformers/modeling_bart.py
@@ -34,6 +34,7 @@ BART_PRETRAINED_MODEL_ARCHIVE_MAP = {
    "bart-large": "https://s3.amazonaws.com/models.huggingface.co/bert/facebook/bart-large/pytorch_model.bin",
    "bart-large-mnli": "https://s3.amazonaws.com/models.huggingface.co/bert/facebook/bart-large-mnli/pytorch_model.bin",
    "bart-large-cnn": "https://s3.amazonaws.com/models.huggingface.co/bert/facebook/bart-large-cnn/pytorch_model.bin",
+    "bart-large-xsum": "https://s3.amazonaws.com/models.huggingface.co/bert/facebook/bart-large-xsum/pytorch_model.bin",
 }

 BART_START_DOCSTRING = r"""
@@ -72,47 +73,50 @@ BART_INPUTS_DOCSTRING = r"""
            Mask to avoid performing attention on padding token indices in input_ids.
            Mask values selected in ``[0, 1]``:
            ``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.
+        encoder_outputs (:obj:`tuple(tuple(torch.FloatTensor))`, `optional`, defaults to :obj:`None`):
+            Tuple consists of (`last_hidden_state`, `optional`: `hidden_states`, `optional`: `attentions`).
+            `last_hidden_state` of shape :obj:`(batch_size, sequence_length, hidden_size)` is a sequence of hidden-states at the output of the last layer of the encoder.
+            Used in the cross-attention of the decoder.
        decoder_input_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, target_sequence_length)`, `optional`, defaults to :obj:`None`):
            Provide for translation and summarization training. By default, the model will create this tensor by shifting the input_ids right, following the paper.
-        decoder_attention_mask (:obj:`torch.Tensor` of shape :obj:`(batch_size, 1, tgt_seq_len, tgt_seq_len)`, `optional`, defaults to :obj:`None`):
-            Default behavior: generate a tensor that ignores pad tokens and future tokens, as in the paper.
+        decoder_attention_mask (:obj:`torch.BoolTensor` of shape :obj:`(batch_size, tgt_seq_len)`, `optional`, defaults to :obj:`None`):
+            Default behavior: generate a tensor that ignores pad tokens in decoder_input_ids. Causal mask will also be used by default.
            If you want to change padding behavior, you should read :func:`~transformers.modeling_bart._prepare_bart_decoder_inputs` and modify.
            See diagram 1 in the paper for more info on the default strategy.
 """
-LARGE_NEGATIVE = -1e8
+
+
+def invert_mask(attention_mask):
+    assert attention_mask.dim() == 2
+    return attention_mask.eq(0)


 def _prepare_bart_decoder_inputs(
-    config, input_ids, decoder_input_ids=None, decoder_attn_mask=None, mask_dtype=None,
+    config, input_ids, decoder_input_ids=None, decoder_padding_mask=None, causal_mask_dtype=torch.float32
 ):
-    """Prepare masks that ignore padding tokens in the decoder and a causal lm mask for the decoder if
+    """Prepare masks that ignore padding tokens in the decoder and a causal mask for the decoder if
    none are provided. This mimics the default behavior in fairseq. To override it pass in masks.
    Note: this is not called during generation
    """
    pad_token_id = config.pad_token_id
-    need_causal_mask = not config.output_past
    if decoder_input_ids is None:
        decoder_input_ids = shift_tokens_right(input_ids, pad_token_id)
-    bsz, tgt_len = decoder_input_ids.size()[:2]
-    if decoder_attn_mask is None:
+    bsz, tgt_len = decoder_input_ids.size()
+    if decoder_padding_mask is None:
        decoder_padding_mask = make_padding_mask(decoder_input_ids, pad_token_id)
-        if need_causal_mask:
-            causal_lm_mask = torch.triu(fill_with_neg_inf(torch.zeros(tgt_len, tgt_len)), 1)
-        else:
-            causal_lm_mask = None
-        new_shape = (bsz, tgt_len, tgt_len)
-        # make it broadcastable so can just be added to the attention coefficients
-        decoder_attn_mask = _combine_masks(decoder_padding_mask, causal_lm_mask, new_shape).to(device=input_ids.device)
-        if mask_dtype is not None:
-            decoder_attn_mask = decoder_attn_mask.to(mask_dtype)
-    assert decoder_attn_mask is None or decoder_attn_mask.shape == (bsz, 1, tgt_len, tgt_len)
-    return decoder_input_ids, decoder_attn_mask
+    else:
+        decoder_padding_mask = invert_mask(decoder_padding_mask)
+    causal_mask = torch.triu(fill_with_neg_inf(torch.zeros(tgt_len, tgt_len)), 1).to(
+        dtype=causal_mask_dtype, device=decoder_input_ids.device
+    )
+    return decoder_input_ids, decoder_padding_mask, causal_mask


 class PretrainedBartModel(PreTrainedModel):
    config_class = BartConfig
    base_model_prefix = "model"
    pretrained_model_archive_map = BART_PRETRAINED_MODEL_ARCHIVE_MAP
+    encoder_outputs_batch_dim_idx = 1  # outputs shaped (seq_len, bs, ...)

    def _init_weights(self, module):
        std = self.config.init_std
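Note on the mask changes in the hunk above: the padding mask and the causal mask are now returned separately instead of being combined into one `(bsz, 1, tgt_len, tgt_len)` tensor. The sketch below reproduces the two shapes with plain Python lists rather than `torch`, so it illustrates the semantics only and is not the library code. `causal_mask` is a hypothetical helper name.

```python
NEG_INF = float("-inf")


def invert_mask(attention_mask):
    # Input convention: 1 = real token, 0 = padding.
    # Output convention: True = padding position to ignore.
    return [[value == 0 for value in row] for row in attention_mask]


def causal_mask(tgt_len):
    # Strictly upper triangle is -inf, so position i cannot attend to j > i.
    # Plain-list analogue of torch.triu(..., 1) applied to a -inf-filled matrix.
    return [[NEG_INF if j > i else 0.0 for j in range(tgt_len)] for i in range(tgt_len)]


# Two sequences of length 3; the second ends with one pad token.
padding = invert_mask([[1, 1, 1], [1, 1, 0]])
causal = causal_mask(3)
```

The padding mask has one row per sequence, `(batch_size, tgt_len)`, while the causal mask is a single `(tgt_len, tgt_len)` matrix shared by the whole batch.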
@@ -128,13 +132,10 @@ class PretrainedBartModel(PreTrainedModel):
    @property
    def dummy_inputs(self):
        pad_token = self.config.pad_token_id
-        input_ids = torch.tensor([[0, 6, 10, 4, 2], [0, 8, 12, 2, pad_token]])
-        decoder_input_ids, decoder_attn_mask = _prepare_bart_decoder_inputs(self.config, input_ids,)
+        input_ids = torch.tensor([[0, 6, 10, 4, 2], [0, 8, 12, 2, pad_token]], device=self.device)
        dummy_inputs = {
-            "decoder_input_ids": decoder_input_ids,
            "attention_mask": input_ids.ne(pad_token),
            "input_ids": input_ids,
-            "decoder_attention_mask": decoder_attn_mask,
        }
        return dummy_inputs

@@ -152,21 +153,6 @@ def _check_shapes(shape_1, shape2):
        raise AssertionError("shape mismatch: {} != {}".format(shape_1, shape2))


-def _combine_masks(key_padding_mask, causal_lm_mask, targ_size):
-    """Make one mask of shape (bsz, 1, tgt_len, src_len) """
-    a = torch.zeros(targ_size)  # targ_size is(bsz, tgt_len, src_len)
-    b = torch.zeros(targ_size)
-    if key_padding_mask is not None:  # (bsz, tgt_len) -> targ_size
-        _check_shapes(key_padding_mask.shape, targ_size[:2])
-        reshaped = key_padding_mask.unsqueeze(2).expand(*targ_size)
-        a[reshaped] = LARGE_NEGATIVE
-
-    if causal_lm_mask is not None:  # (tgt_len, src_len) -> targ_size
-        _check_shapes(causal_lm_mask.shape, targ_size[-2:])
-        b = causal_lm_mask.unsqueeze(0).expand(*targ_size)
-    return (a + b).unsqueeze(1).clamp(LARGE_NEGATIVE,)
-
-
 def shift_tokens_right(input_ids, pad_token_id):
    """Shift input ids one token to the right, and wrap the last non pad token (usually <eos>)."""
    prev_output_tokens = input_ids.clone()
@@ -216,7 +202,9 @@ class EncoderLayer(nn.Module):
            encoded output of shape `(seq_len, batch, embed_dim)`
        """
        residual = x
-        x, attn_weights = self.self_attn(query=x, key=x, key_padding_mask=encoder_padding_mask,)
+        x, attn_weights = self.self_attn(
+            query=x, key=x, key_padding_mask=encoder_padding_mask, need_weights=self.output_attentions
+        )
        x = F.dropout(x, p=self.dropout, training=self.training)
        x = residual + x
        x = self.self_attn_layer_norm(x)
@@ -278,8 +266,7 @@ class BartEncoder(nn.Module):
        """
        # check attention mask and invert
        if attention_mask is not None:
-            assert attention_mask.dim() == 2
-            attention_mask = attention_mask.eq(0)
+            attention_mask = invert_mask(attention_mask)

        inputs_embeds = self.embed_tokens(input_ids)
        embed_pos = self.embed_positions(input_ids)
@@ -315,6 +302,7 @@ class DecoderLayer(nn.Module):
    def __init__(self, config: BartConfig):
        super().__init__()
        self.embed_dim = config.d_model
+        self.output_attentions = config.output_attentions
        self.self_attn = SelfAttention(
            embed_dim=self.embed_dim, num_heads=config.decoder_attention_heads, dropout=config.attention_dropout,
        )
@@ -335,21 +323,34 @@ class DecoderLayer(nn.Module):
        self.final_layer_norm = LayerNorm(self.embed_dim)

    def forward(
-        self, x, encoder_hidden_states, encoder_attn_mask=None, layer_state=None, attention_mask=None,
+        self,
+        x,
+        encoder_hidden_states,
+        encoder_attn_mask=None,
+        layer_state=None,
+        causal_mask=None,
+        decoder_padding_mask=None,
    ):
        residual = x

        if layer_state is None:
            layer_state = {}
        # next line mutates layer state
-        x, self_attn_weights = self.self_attn(query=x, key=x, layer_state=layer_state, attn_mask=attention_mask,)
+        x, self_attn_weights = self.self_attn(
+            query=x,
+            key=x,
+            layer_state=layer_state,
+            key_padding_mask=decoder_padding_mask,
+            attn_mask=causal_mask,
+            need_weights=self.output_attentions,
+        )
        x = F.dropout(x, p=self.dropout, training=self.training)
        x = residual + x
        x = self.self_attn_layer_norm(x)
        residual = x
        assert self.encoder_attn.cache_key != self.self_attn.cache_key

-        x, encoder_attn_weights = self.encoder_attn(
+        x, _ = self.encoder_attn(
            query=x,
            key=encoder_hidden_states,
            key_padding_mask=encoder_attn_mask,
@@ -406,7 +407,8 @@ class BartDecoder(nn.Module):
        input_ids,
        encoder_hidden_states,
        encoder_padding_mask,
-        combined_mask,
+        decoder_padding_mask,
+        decoder_causal_mask,
        decoder_cached_states=None,
        generation_mode=False,
        **unused
@@ -431,8 +433,7 @@ class BartDecoder(nn.Module):
        """
        # check attention mask and invert
        if encoder_padding_mask is not None:
-            assert encoder_padding_mask.dim() == 2
-            encoder_padding_mask = encoder_padding_mask.eq(0)
+            encoder_padding_mask = invert_mask(encoder_padding_mask)

        # embed positions
        positions = self.embed_positions(input_ids, generation_mode=generation_mode)
@@ -452,7 +453,6 @@ class BartDecoder(nn.Module):
        all_hidden_states = ()
        all_self_attns = ()
        next_decoder_cache = []
-
        for i, decoder_layer in enumerate(self.layers):
            decoder_layer  # type: DecoderLayer
            # add LayerDrop (see https://arxiv.org/abs/1909.11556 for description)
@@ -462,7 +462,12 @@ class BartDecoder(nn.Module):

            layer_state = decoder_cached_states[i] if decoder_cached_states is not None else None
            x, layer_self_attn, layer_past = decoder_layer(
-                x, encoder_hidden_states, encoder_padding_mask, layer_state=layer_state, attention_mask=combined_mask,
+                x,
+                encoder_hidden_states,
+                encoder_attn_mask=encoder_padding_mask,
+                decoder_padding_mask=decoder_padding_mask,
+                layer_state=layer_state,
+                causal_mask=decoder_causal_mask,
            )

            if self.output_past:
@@ -526,6 +531,7 @@ class SelfAttention(nn.Module):
        key_padding_mask: Optional[Tensor] = None,
        layer_state: Optional[Dict[str, Optional[Tensor]]] = None,
        attn_mask: Optional[Tensor] = None,
+        need_weights=False,
    ) -> Tuple[Tensor, Optional[Tensor]]:
        """Input shape: Time(SeqLen) x Batch x Channel"""
        static_kv = self.encoder_decoder_attention  # type: bool
@@ -597,7 +603,10 @@ class SelfAttention(nn.Module):
        assert attn_output.size() == (bsz * self.num_heads, tgt_len, self.head_dim)
        attn_output = attn_output.transpose(0, 1).contiguous().view(tgt_len, bsz, embed_dim)
        attn_output = self.out_proj(attn_output)
-        attn_weights = attn_weights.view(bsz, self.num_heads, tgt_len, src_len)
+        if need_weights:
+            attn_weights = attn_weights.view(bsz, self.num_heads, tgt_len, src_len)
+        else:
+            attn_weights = None
        return attn_output, attn_weights

    def _use_saved_state(self, k, v, saved_state, key_padding_mask, static_kv, bsz):
@@ -726,6 +735,8 @@ def _filter_out_falsey_values(tup) -> Tuple:


 # Public API
+def _get_shape(t):
+    return getattr(t, "shape", None)


@add_start_docstrings(
@@ -759,13 +770,16 @@ class BartModel(PretrainedBartModel):

        # make masks if user doesn't supply
        if not generation_mode:
-            decoder_input_ids, decoder_attention_mask = _prepare_bart_decoder_inputs(
+            decoder_input_ids, decoder_padding_mask, causal_mask = _prepare_bart_decoder_inputs(
                self.config,
                input_ids,
                decoder_input_ids=decoder_input_ids,
-                decoder_attn_mask=decoder_attention_mask,
-                mask_dtype=self.shared.weight.dtype,
+                decoder_padding_mask=decoder_attention_mask,
+                causal_mask_dtype=self.shared.weight.dtype,
            )
+        else:
+            decoder_padding_mask, causal_mask = None, None
+
        assert decoder_input_ids is not None
        if encoder_outputs is None:
            encoder_outputs = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
@@ -775,7 +789,8 @@ class BartModel(PretrainedBartModel):
            decoder_input_ids,
            encoder_outputs[0],
            attention_mask,
-            decoder_attention_mask,
+            decoder_padding_mask,
+            decoder_causal_mask=causal_mask,
            decoder_cached_states=decoder_cached_states,
            generation_mode=generation_mode,
        )
@@ -804,13 +819,8 @@ class BartForConditionalGeneration(PretrainedBartModel):

    def __init__(self, config: BartConfig):
        super().__init__(config)
-        # if base_model is None:
        base_model = BartModel(config)
        self.model = base_model
-        self.lm_head = _make_linear_from_emb(self.model.shared)
-
-    def tie_weights(self):
-        pass  # hack to prevent changing lm_head.out_features. The input and output embeddings are still the same.

    @add_start_docstrings_to_callable(BART_INPUTS_DOCSTRING)
    def forward(
@@ -875,7 +885,7 @@ class BartForConditionalGeneration(PretrainedBartModel):
            decoder_cached_states=decoder_cached_states,
            generation_mode=generation_mode,
        )
-        lm_logits = self.lm_head(outputs[0])
+        lm_logits = F.linear(outputs[0], self.model.shared.weight)
        outputs = (lm_logits,) + outputs[1:]  # Add hidden states and attention if they are here
        if lm_labels is not None:
            loss_fct = nn.CrossEntropyLoss()
@@ -893,7 +903,6 @@ class BartForConditionalGeneration(PretrainedBartModel):
            encoder_outputs, decoder_cached_states = past, None
        else:
            encoder_outputs, decoder_cached_states = past
-
        return {
            "input_ids": None,  # encoder_outputs is defined. input_ids not needed
            "encoder_outputs": encoder_outputs,
@@ -932,7 +941,7 @@ class BartForConditionalGeneration(PretrainedBartModel):
        return self.model.encoder

    def get_output_embeddings(self):
-        return self.lm_head
+        return _make_linear_from_emb(self.model.shared)  # make it on the fly


@add_start_docstrings(
@@ -968,7 +977,7 @@ class BartForSequenceClassification(PretrainedBartModel):
    Returns:
        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers.BartConfig`) and inputs:
            loss (:obj:`torch.FloatTensor` of shape :obj:`(1,)`, `optional`, returned when :obj:`label` is provided):
-                Classification  loss (cross entropy)
+                Classification loss (cross entropy)
            logits (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, config.num_labels)`):
                Classification (or regression if config.num_labels==1) scores (before SoftMax).
            hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):
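
Note on the `lm_logits` change in `BartForConditionalGeneration` above: the separate `lm_head` module is replaced by `F.linear(outputs[0], self.model.shared.weight)`. With no bias argument, `F.linear(x, W)` computes `x @ W.T`, so each logit is the dot product of a decoder hidden state with one row of the shared embedding matrix. The sketch below illustrates this with plain Python lists; the function name `tied_logits` and the toy numbers are assumptions made for illustration and are not part of the library.

```python
def tied_logits(hidden, embedding):
    """Project one hidden state onto the vocabulary using the embedding rows.

    hidden:    list of d_model floats (one decoder position)
    embedding: list of vocab_size rows, each a list of d_model floats
    Returns one logit per vocabulary entry (hidden . embedding[v]).
    """
    return [sum(h * w for h, w in zip(hidden, row)) for row in embedding]


# Toy setup (hypothetical values): vocab_size = 3, d_model = 2
embedding = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
hidden = [2.0, 3.0]

logits = tied_logits(hidden, embedding)
print(logits)  # [2.0, 3.0, 5.0]
```

Because the projection reads the embedding matrix directly, input and output embeddings cannot drift apart, which is presumably why the `tie_weights` override could be dropped in the same change.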
--- a/src/transformers/modeling_t5.py
+++ b/src/transformers/modeling_t5.py
@@ -27,7 +27,7 @@ from torch import nn
 from torch.nn import CrossEntropyLoss

 from .configuration_t5 import T5Config
-from .file_utils import DUMMY_INPUTS, DUMMY_MASK, add_start_docstrings
+from .file_utils import DUMMY_INPUTS, DUMMY_MASK, add_start_docstrings, add_start_docstrings_to_callable
 from .modeling_utils import PreTrainedModel, prune_linear_layer


@@ -457,6 +457,7 @@ class T5PreTrainedModel(PreTrainedModel):
    pretrained_model_archive_map = T5_PRETRAINED_MODEL_ARCHIVE_MAP
    load_tf_weights = load_tf_weights_in_t5
    base_model_prefix = "transformer"
+    encoder_outputs_batch_dim_idx = 0  # outputs shaped (bs, ...)

    @property
    def dummy_inputs(self):
@@ -501,6 +502,27 @@ class T5PreTrainedModel(PreTrainedModel):
            if module.has_relative_attention_bias:
                module.relative_attention_bias.weight.data.normal_(mean=0.0, std=factor * ((d_model) ** -0.5))

+    def _shift_right(self, input_ids):
+        decoder_start_token_id = self.config.decoder_start_token_id
+        pad_token_id = self.config.pad_token_id
+
+        assert (
+            decoder_start_token_id is not None
+        ), "self.model.config.decoder_start_token_id has to be defined. In T5 it is usually set to the pad_token_id. See T5 docs for more information"
+
+        # shift inputs to the right
+        shifted_input_ids = input_ids.new_zeros(input_ids.shape)
+        shifted_input_ids[..., 1:] = input_ids[..., :-1].clone()
+        shifted_input_ids[..., 0] = decoder_start_token_id
+
+        assert pad_token_id is not None, "self.model.config.pad_token_id has to be defined."
+        # replace possible -100 values in lm_labels by `pad_token_id`
+        shifted_input_ids.masked_fill_(shifted_input_ids == -100, pad_token_id)
+
+        assert torch.all(shifted_input_ids >= 0).item(), "Verify that `lm_labels` has only non-negative values and -100"
+
+        return shifted_input_ids
+

 class T5Stack(T5PreTrainedModel):
    def __init__(self, config, embed_tokens=None):
@@ -695,30 +717,38 @@ T5_START_DOCSTRING = r"""    The T5 model was proposed in
 """

 T5_INPUTS_DOCSTRING = r"""
-    Inputs:
-        **input_ids**: ``torch.LongTensor`` of shape ``(batch_size, sequence_length)``:
+    Args:
+        input_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`):
            Indices of input sequence tokens in the vocabulary.
-            To match pre-training, T5 input sequence should be formatted with [CLS] and [SEP] tokens as follows:
-
-            (a) For sequence pairs:
-
-                ``tokens:         [CLS] is this jack ##son ##ville ? [SEP] no it is not . [SEP]``
-
-            (b) For single sequences:
-
-                ``tokens:         [CLS] the dog is hairy . [SEP]``
-
            T5 is a model with relative position embeddings so you should be able to pad the inputs on
-            the right or the left.
-
            Indices can be obtained using :class:`transformers.T5Tokenizer`.
            See :func:`transformers.PreTrainedTokenizer.encode` and
            :func:`transformers.PreTrainedTokenizer.convert_tokens_to_ids` for details.
-        **attention_mask**: (`optional`) ``torch.FloatTensor`` of shape ``(batch_size, sequence_length)``:
+            To learn more about how to prepare :obj:`input_ids` for pre-training, take a look at
+            `T5 Training <./t5.html#training>`_ .
+        attention_mask (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):
            Mask to avoid performing attention on padding token indices.
            Mask values selected in ``[0, 1]``:
            ``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.
-        **head_mask**: (`optional`) ``torch.FloatTensor`` of shape ``(num_heads,)`` or ``(num_layers, num_heads)``:
+        encoder_outputs (:obj:`tuple(tuple(torch.FloatTensor))`, `optional`, defaults to :obj:`None`):
+            Tuple consisting of (`last_hidden_state`, `optional`: `hidden_states`, `optional`: `attentions`).
+            `last_hidden_state` of shape :obj:`(batch_size, sequence_length, hidden_size)` is a sequence of hidden-states at the output of the last layer of the encoder.
+            Used in the cross-attention of the decoder.
+        decoder_input_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, target_sequence_length)`, `optional`, defaults to :obj:`None`):
+            Provide for sequence to sequence training. T5 uses the pad_token_id as the starting token for decoder_input_ids generation.
+            To learn more about how to prepare :obj:`decoder_input_ids` for pre-training, take a look at
+            `T5 Training <./t5.html#training>`_ .
+        decoder_attention_mask (:obj:`torch.BoolTensor` of shape :obj:`(batch_size, target_sequence_length)`, `optional`, defaults to :obj:`None`):
+            Default behavior: generate a tensor that ignores pad tokens in decoder_input_ids. Causal mask will also be used by default.
+        inputs_embeds (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`, defaults to :obj:`None`):
+            Optionally, instead of passing :obj:`input_ids` you can choose to directly pass an embedded representation.
+            This is useful if you want more control over how to convert `input_ids` indices into associated vectors
+            than the model's internal embedding lookup matrix.
+        decoder_inputs_embeds (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, target_sequence_length, hidden_size)`, `optional`, defaults to :obj:`None`):
+            Optionally, instead of passing :obj:`decoder_input_ids` you can choose to directly pass an embedded representation.
+            This is useful if you want more control over how to convert `decoder_input_ids` indices into associated vectors
+            than the model's internal embedding lookup matrix.
+        head_mask (:obj:`torch.FloatTensor` of shape :obj:`(num_heads,)` or :obj:`(num_layers, num_heads)`, `optional`, defaults to :obj:`None`):
            Mask to nullify selected heads of the self-attention modules.
            Mask values selected in ``[0, 1]``:
            ``1`` indicates the head is **not masked**, ``0`` indicates the head is **masked**.
@@ -728,31 +758,8 @@ T5_INPUTS_DOCSTRING = r"""
@add_start_docstrings(
    "The bare T5 Model transformer outputting raw hidden-states" "without any specific head on top.",
    T5_START_DOCSTRING,
-    T5_INPUTS_DOCSTRING,
 )
 class T5Model(T5PreTrainedModel):
-    r"""
-    Outputs: `Tuple` comprising various elements depending on the configuration (config) and inputs:
-        **last_hidden_state**: ``torch.FloatTensor`` of shape ``(batch_size, sequence_length, hidden_size)``
-            Sequence of hidden-states at the output of the last layer of the model.
-        **hidden_states**: (`optional`, returned when ``config.output_hidden_states=True``)
-            list of ``torch.FloatTensor`` (one for the output of each layer + the output of the embeddings)
-            of shape ``(batch_size, sequence_length, hidden_size)``:
-            Hidden-states of the model at the output of each layer plus the initial embedding outputs.
-        **attentions**: (`optional`, returned when ``config.output_attentions=True``)
-            list of ``torch.FloatTensor`` (one for each layer) of shape ``(batch_size, num_heads, sequence_length, sequence_length)``:
-            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.
-
-    Examples::
-
-        tokenizer = T5Tokenizer.from_pretrained('t5-small')
-        model = T5Model.from_pretrained('t5-small')
-        input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0)  # Batch size 1
-        outputs = model(input_ids=input_ids)
-        last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple
-
-    """
-
    def __init__(self, config):
        super().__init__(config)
        self.shared = nn.Embedding(config.vocab_size, config.d_model)
@@ -782,6 +789,7 @@ class T5Model(T5PreTrainedModel):
        for layer, heads in heads_to_prune.items():
            self.encoder.layer[layer].attention.prune_heads(heads)

+    @add_start_docstrings_to_callable(T5_INPUTS_DOCSTRING)
    def forward(
        self,
        input_ids=None,
@@ -793,6 +801,34 @@ class T5Model(T5PreTrainedModel):
        decoder_inputs_embeds=None,
        head_mask=None,
    ):
+        r"""
+    Return:
+        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers.T5Config`) and inputs.
+        last_hidden_state (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`):
+            Sequence of hidden-states at the output of the last layer of the model.
+        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):
+            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)
+            of shape :obj:`(batch_size, sequence_length, hidden_size)`.
+
+            Hidden-states of the model at the output of each layer plus the initial embedding outputs.
+        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):
+            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape
+            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.
+
+            Attention weights after the attention softmax, used to compute the weighted average in the self-attention
+            heads.
+
+    Examples::
+
+            from transformers import T5Tokenizer, T5Model
+
+            tokenizer = T5Tokenizer.from_pretrained('t5-small')
+            model = T5Model.from_pretrained('t5-small')
+            input_ids = tokenizer.encode("Hello, my dog is cute", return_tensors="pt")  # Batch size 1
+            outputs = model(input_ids=input_ids, decoder_input_ids=input_ids)
+            last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple
+
+        """

        # Encode if needed (training, first prediction pass)
        if encoder_outputs is None:
@@ -815,38 +851,8 @@ class T5Model(T5PreTrainedModel):
        return decoder_outputs + encoder_outputs


-@add_start_docstrings("""T5 Model with a `language modeling` head on top. """, T5_START_DOCSTRING, T5_INPUTS_DOCSTRING)
+@add_start_docstrings("""T5 Model with a `language modeling` head on top. """, T5_START_DOCSTRING)
 class T5ForConditionalGeneration(T5PreTrainedModel):
-    r"""
-        **lm_labels**: (`optional`) ``torch.LongTensor`` of shape ``(batch_size, sequence_length)``:
-            Labels for computing the masked language modeling loss.
-            Indices should either be in ``[0, ..., config.vocab_size]`` or -100 (see ``input_ids`` docstring).
-            Tokens with indices set to ``-100`` are ignored (masked), the loss is only computed for the tokens with labels
-            in ``[0, ..., config.vocab_size]``.
-
-    Outputs: `Tuple` comprising various elements depending on the configuration (config) and inputs:
-        **loss**: (`optional`, returned when ``lm_labels`` is provided) ``torch.FloatTensor`` of shape ``(1,)``:
-            Masked language modeling loss.
-        **prediction_scores**: ``torch.FloatTensor`` of shape ``(batch_size, sequence_length, config.vocab_size)``
-            Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
-        **hidden_states**: (`optional`, returned when ``config.output_hidden_states=True``)
-            list of ``torch.FloatTensor`` (one for the output of each layer + the output of the embeddings)
-            of shape ``(batch_size, sequence_length, hidden_size)``:
-            Hidden-states of the model at the output of each layer plus the initial embedding outputs.
-        **attentions**: (`optional`, returned when ``config.output_attentions=True``)
-            list of ``torch.FloatTensor`` (one for each layer) of shape ``(batch_size, num_heads, sequence_length, sequence_length)``:
-            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.
-
-    Examples::
-
-        tokenizer = T5Tokenizer.from_pretrained('t5-small')
-        model = T5ForConditionalGeneration.from_pretrained('t5-small')
-        input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0)  # Batch size 1
-        outputs = model(input_ids=input_ids, lm_labels=input_ids)
-        loss, prediction_scores = outputs[:2]
-
-    """
-
    def __init__(self, config):
        super().__init__(config)
        self.model_dim = config.d_model
@@ -878,6 +884,7 @@ class T5ForConditionalGeneration(T5PreTrainedModel):
    def get_encoder(self):
        return self.encoder

+    @add_start_docstrings_to_callable(T5_INPUTS_DOCSTRING)
    def forward(
        self,
        input_ids=None,
@@ -890,6 +897,45 @@ class T5ForConditionalGeneration(T5PreTrainedModel):
        decoder_inputs_embeds=None,
        head_mask=None,
    ):
+        r"""
+        lm_labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):
+                Labels for computing the language modeling loss.
+                Indices should be in :obj:`[-100, 0, ..., config.vocab_size - 1]`.
+                If neither :obj:`decoder_input_ids` nor :obj:`decoder_inputs_embeds` is given, the decoder inputs are obtained by shifting :obj:`lm_labels` to the right.
+                All labels set to ``-100`` are ignored (masked); the loss is only
+                computed for labels in ``[0, ..., config.vocab_size - 1]``.
+
+    Returns:
+        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers.T5Config`) and inputs.
+        loss (:obj:`torch.FloatTensor` of shape :obj:`(1,)`, `optional`, returned when :obj:`lm_labels` is provided):
+            Language modeling loss (cross entropy).
+        prediction_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, config.vocab_size)`):
+            Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
+        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):
+            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)
+            of shape :obj:`(batch_size, sequence_length, hidden_size)`.
+
+            Hidden-states of the model at the output of each layer plus the initial embedding outputs.
+        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):
+            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape
+            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.
+            Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
+
+    Examples::
+
+        from transformers import T5Tokenizer, T5ForConditionalGeneration
+
+        tokenizer = T5Tokenizer.from_pretrained('t5-small')
+        model = T5ForConditionalGeneration.from_pretrained('t5-small')
+        input_ids = tokenizer.encode("Hello, my dog is cute", return_tensors="pt")  # Batch size 1
+        outputs = model(input_ids=input_ids, decoder_input_ids=input_ids, lm_labels=input_ids)
+        loss, prediction_scores = outputs[:2]
+
+        tokenizer = T5Tokenizer.from_pretrained('t5-small')
+        model = T5ForConditionalGeneration.from_pretrained('t5-small')
+        input_ids = tokenizer.encode("summarize: Hello, my dog is cute", return_tensors="pt")  # Batch size 1
+        outputs = model.generate(input_ids)
+        """

        # Encode if needed (training, first prediction pass)
        if encoder_outputs is None:
@@ -900,6 +946,10 @@ class T5ForConditionalGeneration(T5PreTrainedModel):

        hidden_states = encoder_outputs[0]

+        if lm_labels is not None and decoder_input_ids is None and decoder_inputs_embeds is None:
+            # get decoder inputs from shifting lm labels to the right
+            decoder_input_ids = self._shift_right(lm_labels)
+
        # Decode
        decoder_outputs = self.decoder(
            input_ids=decoder_input_ids,
@@ -918,10 +968,8 @@ class T5ForConditionalGeneration(T5PreTrainedModel):

        decoder_outputs = (lm_logits,) + decoder_outputs[1:]  # Add hidden states and attention if they are here
        if lm_labels is not None:
-            shift_logits = lm_logits[..., :-1, :].contiguous()
-            shift_labels = lm_labels[..., 1:].contiguous()
            loss_fct = CrossEntropyLoss(ignore_index=-100)
-            loss = loss_fct(shift_logits.view(-1, shift_logits.size(-1)), shift_labels.view(-1))
+            loss = loss_fct(lm_logits.view(-1, lm_logits.size(-1)), lm_labels.view(-1))
            decoder_outputs = (
                loss,
            ) + decoder_outputs  # TODO(thom): Add z_loss https://github.com/tensorflow/mesh/blob/fa19d69eafc9a482aff0b59ddd96b025c0cb207d/mesh_tensorflow/layers.py#L666
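
Note on `_shift_right` and the loss change in `modeling_t5.py` above: when only `lm_labels` is given, the decoder inputs are the labels shifted one position to the right, with `decoder_start_token_id` in front and any `-100` replaced by `pad_token_id`. The sketch below mirrors that logic on nested Python lists instead of tensors. The function name `shift_right`, the token ids, and the choice of `0` for both special ids are illustrative assumptions, and rows are assumed to be non-empty.

```python
def shift_right(labels, decoder_start_token_id, pad_token_id):
    """List-based illustration of the shifting done by `_shift_right`.

    labels: list of rows, each a non-empty list of token ids, where -100
            marks positions ignored by the loss.
    """
    shifted = []
    for row in labels:
        # Drop the last label and prepend the decoder start token.
        new_row = [decoder_start_token_id] + list(row[:-1])
        # -100 is only meaningful to the loss; use the pad id as input instead.
        new_row = [pad_token_id if tok == -100 else tok for tok in new_row]
        assert all(tok >= 0 for tok in new_row), "labels may only contain ids >= 0 or -100"
        shifted.append(new_row)
    return shifted


# Hypothetical label row: three real tokens followed by two ignored positions.
labels = [[37, 1782, 1, -100, -100]]
decoder_input_ids = shift_right(labels, decoder_start_token_id=0, pad_token_id=0)
print(decoder_input_ids)  # [[0, 37, 1782, 1, 0]]
```

Since position `t` of the shifted input holds label `t - 1`, the logits at position `t` can be compared with label `t` directly, which is consistent with the removal of `shift_logits` and `shift_labels` from the loss computation.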
--- a/src/transformers/modeling_tf_t5.py
+++ b/src/transformers/modeling_tf_t5.py
@@ -24,7 +24,7 @@ import math
 import tensorflow as tf

 from .configuration_t5 import T5Config
-from .file_utils import DUMMY_INPUTS, DUMMY_MASK, add_start_docstrings
+from .file_utils import DUMMY_INPUTS, DUMMY_MASK, add_start_docstrings, add_start_docstrings_to_callable
 from .modeling_tf_utils import TFPreTrainedModel, TFSharedEmbeddings, shape_list


@@ -630,31 +630,41 @@ T5_START_DOCSTRING = r"""    The T5 model was proposed in
 """

 T5_INPUTS_DOCSTRING = r"""
-    Inputs:
-        **input_ids**: ``Numpy array`` or ``tf.Tensor`` of shape ``(batch_size, sequence_length)``:
+    Args:
+        decoder_input_ids is usually passed as a `dict` (see T5 description above for more information) containing all of the following.
+        decoder_input_ids (:obj:`tf.Tensor` of shape :obj:`(batch_size, target_sequence_length)`, `optional`, defaults to :obj:`None`):
+            Provide for sequence to sequence training. T5 uses the pad_token_id as the starting token for decoder_input_ids generation.
+
+        input_ids (:obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length)`):
            Indices of input sequence tokens in the vocabulary.
-            To match pre-training, T5 input sequence should be formatted with [CLS] and [SEP] tokens as follows:
-
-            (a) For sequence pairs:
-
-                ``tokens:         [CLS] is this jack ##son ##ville ? [SEP] no it is not . [SEP]``
-
-            (b) For single sequences:
-
-                ``tokens:         [CLS] the dog is hairy . [SEP]``
-
-
            T5 is a model with relative position embeddings so you should be able to pad the inputs on
            the right or the left.
-
            Indices can be obtained using :class:`transformers.T5Tokenizer`.
+            To learn more about how to prepare :obj:`input_ids` for pre-training, take a look at
+            `T5 Training <./t5.html#training>`_ .
            See :func:`transformers.PreTrainedTokenizer.encode` and
            :func:`transformers.PreTrainedTokenizer.convert_tokens_to_ids` for details.
-        **attention_mask**: (`optional`) ``Numpy array`` or ``tf.Tensor`` of shape ``(batch_size, sequence_length)``:
+        attention_mask (:obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):
            Mask to avoid performing attention on padding token indices.
            Mask values selected in ``[0, 1]``:
            ``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.
-        **head_mask**: (`optional`) ``Numpy array`` or ``tf.Tensor`` of shape ``(num_heads,)`` or ``(num_layers, num_heads)``:
+        encoder_outputs (:obj:`tuple(tuple(tf.Tensor))`, `optional`, defaults to :obj:`None`):
+            Tuple consisting of (`last_hidden_state`, `optional`: `hidden_states`, `optional`: `attentions`).
+            `last_hidden_state` of shape :obj:`(batch_size, sequence_length, hidden_size)` is a sequence of hidden-states at the output of the last layer of the encoder.
+            Used in the cross-attention of the decoder.
+        decoder_attention_mask (:obj:`tf.Tensor` of shape :obj:`(batch_size, target_sequence_length)`, `optional`, defaults to :obj:`None`):
+            Default behavior: generate a tensor that ignores pad tokens in decoder_input_ids. Causal mask will also be used by default.
+        inputs_embeds (:obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`, defaults to :obj:`None`):
+            Optionally, instead of passing :obj:`input_ids` you can choose to directly pass an embedded representation.
+            This is useful if you want more control over how to convert `input_ids` indices into associated vectors
+            than the model's internal embedding lookup matrix.
+        decoder_inputs_embeds (:obj:`tf.Tensor` of shape :obj:`(batch_size, target_sequence_length, hidden_size)`, `optional`, defaults to :obj:`None`):
+            Optionally, instead of passing :obj:`decoder_input_ids` you can choose to directly pass an embedded representation.
+            This is useful if you want more control over how to convert `decoder_input_ids` indices into associated vectors
+            than the model's internal embedding lookup matrix.
+            To learn more about how to prepare :obj:`decoder_input_ids` for pre-training, take a look at
+            `T5 Training <./t5.html#training>`_ .
+        head_mask (:obj:`tf.Tensor` of shape :obj:`(num_heads,)` or :obj:`(num_layers, num_heads)`, `optional`, defaults to :obj:`None`):
            Mask to nullify selected heads of the self-attention modules.
            Mask values selected in ``[0, 1]``:
            ``1`` indicates the head is **not masked**, ``0`` indicates the head is **masked**.
@@ -664,34 +674,8 @@ T5_INPUTS_DOCSTRING = r"""
@add_start_docstrings(
    "The bare T5 Model transformer outputting raw hidden-states" "without any specific head on top.",
    T5_START_DOCSTRING,
-    T5_INPUTS_DOCSTRING,
 )
 class TFT5Model(TFT5PreTrainedModel):
-    r"""
-    Outputs: `Tuple` comprising various elements depending on the configuration (config) and inputs:
-        **last_hidden_state**: ``tf.Tensor`` of shape ``(batch_size, sequence_length, hidden_size)``
-            Sequence of hidden-states at the output of the last layer of the model.
-        **hidden_states**: (`optional`, returned when ``config.output_hidden_states=True``)
-            list of ``tf.Tensor`` (one for the output of each layer + the output of the embeddings)
-            of shape ``(batch_size, sequence_length, hidden_size)``:
-            Hidden-states of the model at the output of each layer plus the initial embedding outputs.
-        **attentions**: (`optional`, returned when ``config.output_attentions=True``)
-            list of ``tf.Tensor`` (one for each layer) of shape ``(batch_size, num_heads, sequence_length, sequence_length)``:
-            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.
-
-    Examples::
-
-        import tensorflow as tf
-        from transformers import T5Tokenizer, TFT5Model
-
-        tokenizer = T5Tokenizer.from_pretrained('t5-small')
-        model = TFT5Model.from_pretrained('t5-small')
-        input_ids = tf.constant(tokenizer.encode("Hello, my dog is cute"))[None, :]  # Batch size 1
-        outputs = model(input_ids=input_ids)
-        last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple
-
-    """
-
    def __init__(self, config, *inputs, **kwargs):
        super().__init__(config, *inputs, **kwargs)
        self.shared = TFSharedEmbeddings(config.vocab_size, config.d_model, name="shared")
@@ -715,7 +699,36 @@ class TFT5Model(TFT5PreTrainedModel):
    def get_output_embeddings(self):
        return self.shared

+    @add_start_docstrings_to_callable(T5_INPUTS_DOCSTRING)
    def call(self, decoder_input_ids, **kwargs):
+        r"""
+    Return:
+        :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers.T5Config`) and inputs.
+        last_hidden_state (:obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`):
+            Sequence of hidden-states at the output of the last layer of the model.
+        hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_hidden_states=True``):
+            Tuple of :obj:`tf.Tensor` (one for the output of the embeddings + one for the output of each layer)
+            of shape :obj:`(batch_size, sequence_length, hidden_size)`.
+
+            Hidden-states of the model at the output of each layer plus the initial embedding outputs.
+        attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_attentions=True``):
+            Tuple of :obj:`tf.Tensor` (one for each layer) of shape
+            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.
+
+            Attention weights after the attention softmax, used to compute the weighted average in the self-attention
+            heads.
+
+    Examples::
+
+        from transformers import T5Tokenizer, TFT5Model
+
+        tokenizer = T5Tokenizer.from_pretrained('t5-small')
+        model = TFT5Model.from_pretrained('t5-small')
+        input_ids = tokenizer.encode("Hello, my dog is cute", return_tensors="tf")  # Batch size 1
+        outputs = model(input_ids, input_ids=input_ids)
+        last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple
+
+        """

        if isinstance(decoder_input_ids, dict):
            kwargs.update(decoder_input_ids)
@@ -753,33 +766,8 @@ class TFT5Model(TFT5PreTrainedModel):
        return decoder_outputs + encoder_outputs


-@add_start_docstrings("""T5 Model with a `language modeling` head on top. """, T5_START_DOCSTRING, T5_INPUTS_DOCSTRING)
+@add_start_docstrings("""T5 Model with a `language modeling` head on top. """, T5_START_DOCSTRING)
 class TFT5ForConditionalGeneration(TFT5PreTrainedModel):
-    r"""
-    Outputs: `Tuple` comprising various elements depending on the configuration (config) and inputs:
-        **prediction_scores**: ``Numpy array`` or ``tf.Tensor`` of shape ``(batch_size, sequence_length, config.vocab_size)``
-            Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
-        **hidden_states**: (`optional`, returned when ``config.output_hidden_states=True``)
-            list of ``Numpy array`` or ``tf.Tensor`` (one for the output of each layer + the output of the embeddings)
-            of shape ``(batch_size, sequence_length, hidden_size)``:
-            Hidden-states of the model at the output of each layer plus the initial embedding outputs.
-        **attentions**: (`optional`, returned when ``config.output_attentions=True``)
-            list of ``Numpy array`` or ``tf.Tensor`` (one for each layer) of shape ``(batch_size, num_heads, sequence_length, sequence_length)``:
-            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.
-
-    Examples::
-
-        import tensorflow as tf
-        from transformers import T5Tokenizer, TFT5ForConditionalGeneration
-
-        tokenizer = T5Tokenizer.from_pretrained('t5-small')
-        model = TFT5ForConditionalGeneration.from_pretrained('t5-small')
-        input_ids = tf.constant(tokenizer.encode("Hello, my dog is cute"))[None, :]  # Batch size 1
-        outputs = model(input_ids=input_ids)
-        prediction_scores = outputs[0]
-
-    """
-
    def __init__(self, config, *inputs, **kwargs):
        super().__init__(config, *inputs, **kwargs)
        self.model_dim = config.d_model
@@ -808,7 +796,42 @@ class TFT5ForConditionalGeneration(TFT5PreTrainedModel):
    def get_encoder(self):
        return self.encoder

+    @add_start_docstrings_to_callable(T5_INPUTS_DOCSTRING)
    def call(self, decoder_input_ids, **kwargs):
+        r"""
+    Return:
+        :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers.T5Config`) and inputs.
+        loss (:obj:`tf.Tensor` of shape :obj:`(1,)`, `optional`, returned when :obj:`lm_label` is provided):
+            Classification loss (cross entropy).
+        prediction_scores (:obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length, config.vocab_size)`):
+            Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
+        hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_hidden_states=True``):
+            Tuple of :obj:`tf.Tensor` (one for the output of the embeddings + one for the output of each layer)
+            of shape :obj:`(batch_size, sequence_length, hidden_size)`.
+
+            Hidden-states of the model at the output of each layer plus the initial embedding outputs.
+        attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_attentions=True``):
+            Tuple of :obj:`tf.Tensor` (one for each layer) of shape
+            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.
+
+            Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
+
+    Examples::
+
+        from transformers import T5Tokenizer, TFT5ForConditionalGeneration
+
+        tokenizer = T5Tokenizer.from_pretrained('t5-small')
+        model = TFT5ForConditionalGeneration.from_pretrained('t5-small')
+        input_ids = tokenizer.encode("Hello, my dog is cute", return_tensors="tf")  # Batch size 1
+        outputs = model(input_ids, input_ids=input_ids)
+        prediction_scores = outputs[0]
+
+        tokenizer = T5Tokenizer.from_pretrained('t5-small')
+        model = TFT5ForConditionalGeneration.from_pretrained('t5-small')
+        input_ids = tokenizer.encode("summarize: Hello, my dog is cute", return_tensors="tf")  # Batch size 1
+        model.generate(input_ids)
+
+        """

        if isinstance(decoder_input_ids, dict):
            kwargs.update(decoder_input_ids)
--- a/src/transformers/modeling_tf_utils.py
+++ b/src/transformers/modeling_tf_utils.py
@@ -231,7 +231,7 @@ class TFPreTrainedModel(tf.keras.Model, TFModelUtilsMixin):

    def save_pretrained(self, save_directory):
        """ Save a model and its configuration file to a directory, so that it
-            can be re-loaded using the `:func:`~transformers.PreTrainedModel.from_pretrained`` class method.
+            can be re-loaded using the :func:`~transformers.PreTrainedModel.from_pretrained` class method.
        """
        assert os.path.isdir(
            save_directory
@@ -523,8 +523,8 @@ class TFPreTrainedModel(tf.keras.Model, TFModelUtilsMixin):
            pad_token_id: (`optional`) int
                Pad token. Defaults to pad_token_id as defined in the models config.

-            eos_token_ids: (`optional`) int or list of int
-                End of sequence token or list of tokens to stop the generation. Default to 0.
+            eos_token_id: (`optional`) int
+                EOS token. Defaults to eos_token_id as defined in the model's config.

            length_penalty: (`optional`) float
                Exponential penalty to the length. Default to 1.
@@ -541,7 +541,7 @@ class TFPreTrainedModel(tf.keras.Model, TFModelUtilsMixin):
                ``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.
                Defaults to `None`.

-            `What are attention masks? <../glossary.html#attention-mask>`__
+                `What are attention masks? <../glossary.html#attention-mask>`__

            decoder_start_token_id=None: (`optional`) int
                If an encoder-decoder model starts decoding with a different token than BOS.
@@ -610,7 +610,9 @@ class TFPreTrainedModel(tf.keras.Model, TFModelUtilsMixin):
        num_return_sequences = (
            num_return_sequences if num_return_sequences is not None else self.config.num_return_sequences
        )
-        decoder_start_token_id = decoder_start_token_id if decoder_start_token_id is not None else bos_token_id
+        decoder_start_token_id = (
+            decoder_start_token_id if decoder_start_token_id is not None else self.config.decoder_start_token_id
+        )

        if input_ids is not None:
            batch_size = shape_list(input_ids)[0]  # overriden by the input batch_size
@@ -635,9 +637,6 @@ class TFPreTrainedModel(tf.keras.Model, TFModelUtilsMixin):
        assert (eos_token_id is None) or (
            isinstance(eos_token_id, int) and (eos_token_id >= 0)
        ), "`eos_token_id` should be a positive integer."
-        assert (
-            decoder_start_token_id is not None or self.config.is_encoder_decoder is False
-        ), "`decoder_start_token_id` has to be defined if model is encoder-decoder model"
        assert length_penalty > 0, "`length_penalty` should be strictely positive."
        assert (
            isinstance(num_return_sequences, int) and num_return_sequences > 0
@@ -708,8 +707,12 @@ class TFPreTrainedModel(tf.keras.Model, TFModelUtilsMixin):
            )  # shape: (batch_size * num_return_sequences * num_beams, cur_len)

        if self.config.is_encoder_decoder:
+            if decoder_start_token_id is None:
+                decoder_start_token_id = bos_token_id

-            assert bos_token_id is not None, "Encoder Decoder Models need to have a bos_token_id"
+            assert (
+                decoder_start_token_id is not None
+            ), "decoder_start_token_id or bos_token_id has to be defined for encoder-decoder generation"
            assert hasattr(self, "get_encoder"), "{} should have a 'get_encoder' function defined".format(self)
            assert callable(self.get_encoder), "{} should be a method".format(self.get_encoder)

@@ -996,10 +999,12 @@ class TFPreTrainedModel(tf.keras.Model, TFModelUtilsMixin):
            # set eos token prob to zero if min_length is not reached
            if eos_token_id is not None and cur_len < min_length:
                # create eos_token_id boolean mask
+                num_batch_hypotheses = batch_size * num_beams
+
                is_token_logit_eos_token = tf.convert_to_tensor(
                    [True if token is eos_token_id else False for token in range(vocab_size)], dtype=tf.bool
                )
-                eos_token_indices_mask = tf.broadcast_to(is_token_logit_eos_token, [batch_size, vocab_size])
+                eos_token_indices_mask = tf.broadcast_to(is_token_logit_eos_token, [num_batch_hypotheses, vocab_size])

                scores = set_tensor_by_indices_to_value(scores, eos_token_indices_mask, -float("inf"))

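
The start-token resolution changed in the hunks above can be summarized in a standalone sketch. This is a minimal illustration, not library code: `resolve_decoder_start_token_id` is a hypothetical helper name, and its three arguments stand in for the `generate` argument, `config.decoder_start_token_id`, and `bos_token_id` respectively.

```python
from typing import Optional


def resolve_decoder_start_token_id(
    explicit_id: Optional[int],
    config_id: Optional[int],
    bos_token_id: Optional[int],
) -> int:
    """Hypothetical helper mirroring the fallback order shown in the diff."""
    # 1. An explicit argument wins; otherwise use the value from the config.
    start_id = explicit_id if explicit_id is not None else config_id
    # 2. For encoder-decoder models, fall back to the BOS token.
    if start_id is None:
        start_id = bos_token_id
    # 3. One of the three sources must have defined a value by now.
    if start_id is None:
        raise ValueError(
            "decoder_start_token_id or bos_token_id has to be defined for encoder-decoder generation"
        )
    return start_id
```

Note that the checks use `is not None` rather than truthiness, so a start token id of `0` is treated as a valid value and does not trigger the fallback.
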
--- a/src/transformers/modeling_utils.py
+++ b/src/transformers/modeling_utils.py
@@ -108,6 +108,10 @@ class ModuleUtilsMixin:
            module.mem_rss_post_forward = 0
            module.mem_rss_pre_forward = 0

+    @property
+    def device(self):
+        return next(self.parameters()).device
+

 class PreTrainedModel(nn.Module, ModuleUtilsMixin):
    r""" Base class for all models.
@@ -717,13 +721,10 @@ class PreTrainedModel(nn.Module, ModuleUtilsMixin):
                Padding token. Default to specicic model pad_token_id or None if it does not exist.

            bos_token_id: (`optional`) int
-                BOS token. Defaults to bos_token_id as defined in the models config.
+                BOS token. Defaults to `bos_token_id` as defined in the model's config.

-            pad_token_id: (`optional`) int
-                Pad token. Defaults to pad_token_id as defined in the models config.
-
-            eos_token_ids: (`optional`) int or list of int
-                End of sequence token or list of tokens to stop the generation. Default to eos_token_ids as defined in the models config.
+            eos_token_id: (`optional`) int
+                EOS token. Defaults to `eos_token_id` as defined in the model's config.

            length_penalty: (`optional`) float
                Exponential penalty to the length. Default to 1.
@@ -809,7 +810,9 @@ class PreTrainedModel(nn.Module, ModuleUtilsMixin):
        num_return_sequences = (
            num_return_sequences if num_return_sequences is not None else self.config.num_return_sequences
        )
-        decoder_start_token_id = decoder_start_token_id if decoder_start_token_id is not None else bos_token_id
+        decoder_start_token_id = (
+            decoder_start_token_id if decoder_start_token_id is not None else self.config.decoder_start_token_id
+        )

        if input_ids is not None:
            batch_size = input_ids.shape[0]  # overriden by the input batch_size
@@ -831,9 +834,6 @@ class PreTrainedModel(nn.Module, ModuleUtilsMixin):
        assert pad_token_id is None or (
            isinstance(pad_token_id, int) and (pad_token_id >= 0)
        ), "`pad_token_id` should be a positive integer."
-        assert (
-            decoder_start_token_id is not None or self.config.is_encoder_decoder is False
-        ), "`decoder_start_token_id` has to be defined if model is encoder-decoder model"
        assert (eos_token_id is None) or (
            isinstance(eos_token_id, int) and (eos_token_id >= 0)
        ), "`eos_token_id` should be a positive integer."
@@ -896,6 +896,21 @@ class PreTrainedModel(nn.Module, ModuleUtilsMixin):
            effective_batch_size = batch_size
            effective_batch_mult = 1

+        if self.config.is_encoder_decoder:
+            if decoder_start_token_id is None:
+                decoder_start_token_id = bos_token_id
+
+            assert (
+                decoder_start_token_id is not None
+            ), "decoder_start_token_id or bos_token_id has to be defined for encoder-decoder generation"
+            assert hasattr(self, "get_encoder"), "{} should have a 'get_encoder' function defined".format(self)
+            assert callable(self.get_encoder), "{} should be a method".format(self.get_encoder)
+
+            # get encoder and store encoder outputs
+            encoder = self.get_encoder()
+
+            encoder_outputs = encoder(input_ids, attention_mask=attention_mask)
+
        # Expand input ids if num_beams > 1 or num_return_sequences > 1
        if num_return_sequences > 1 or num_beams > 1:
            input_ids_len = input_ids.shape[-1]
@@ -912,15 +927,6 @@ class PreTrainedModel(nn.Module, ModuleUtilsMixin):
            )  # shape: (batch_size * num_return_sequences * num_beams, cur_len)

        if self.config.is_encoder_decoder:
-            assert bos_token_id is not None, "Encoder Decoder Models need to have a bos_token_id"
-            assert hasattr(self, "get_encoder"), "{} should have a 'get_encoder' function defined".format(self)
-            assert callable(self.get_encoder), "{} should be a method".format(self.get_encoder)
-
-            # get encoder and store encoder outputs
-            encoder = self.get_encoder()
-
-            encoder_outputs = encoder(input_ids, attention_mask=attention_mask)
-
            # create empty decoder_input_ids
            input_ids = torch.full(
                (effective_batch_size * num_beams, 1),
@@ -929,6 +935,18 @@ class PreTrainedModel(nn.Module, ModuleUtilsMixin):
                device=next(self.parameters()).device,
            )
            cur_len = 1
+            batch_idx = self.encoder_outputs_batch_dim_idx
+            assert (
+                batch_size == encoder_outputs[0].shape[batch_idx]
+            ), f"expected encoder_outputs[0] to have batch size {batch_size}, got {encoder_outputs[0].shape[batch_idx]}"
+            expanded_idx = (
+                torch.arange(batch_size)
+                .view(-1, 1)
+                .repeat(1, num_beams * effective_batch_mult)
+                .view(-1)
+                .to(input_ids.device)
+            )
+            encoder_outputs = (encoder_outputs[0].index_select(batch_idx, expanded_idx), *encoder_outputs[1:])
        else:
            encoder_outputs = None
            cur_len = input_ids.shape[-1]
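
The encoder-output expansion in the last hunk repeats each batch row once per beam (and per returned sequence). The sketch below reproduces that index pattern in plain Python lists instead of `torch` tensors, purely for illustration; `expanded_batch_indices` and `expand_rows` are hypothetical names standing in for the `torch.arange(...).view(-1, 1).repeat(...).view(-1)` chain and `index_select`.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def expanded_batch_indices(batch_size: int, num_beams: int, effective_batch_mult: int = 1) -> List[int]:
    """Return each batch index repeated num_beams * effective_batch_mult times, in order."""
    repeats = num_beams * effective_batch_mult
    # Row i of the original batch appears `repeats` times consecutively.
    return [i for i in range(batch_size) for _ in range(repeats)]


def expand_rows(rows: Sequence[T], indices: Sequence[int]) -> List[T]:
    """Plain-Python stand-in for selecting rows along the batch dimension."""
    return [rows[i] for i in indices]


indices = expanded_batch_indices(batch_size=2, num_beams=3)
expanded = expand_rows(["enc_row_0", "enc_row_1"], indices)
```

With a batch of 2 and 3 beams, `indices` is `[0, 0, 0, 1, 1, 1]`, so every beam hypothesis of a given input sees that input's encoder output.
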
--- a/src/transformers/modeling_xlm.py
+++ b/src/transformers/modeling_xlm.py
@@ -1040,3 +1040,98 @@ class XLMForQuestionAnswering(XLMPreTrainedModel):
        outputs = outputs + transformer_outputs[1:]  # Keep new_mems and attention/hidden states if they are here

        return outputs
+
+
+@add_start_docstrings(
+    """XLM Model with a token classification head on top (a linear layer on top of
+    the hidden-states output) e.g. for Named-Entity-Recognition (NER) tasks. """,
+    XLM_START_DOCSTRING,
+)
+class XLMForTokenClassification(XLMPreTrainedModel):
+    def __init__(self, config):
+        super().__init__(config)
+        self.num_labels = config.num_labels
+
+        self.transformer = XLMModel(config)
+        self.dropout = nn.Dropout(config.dropout)
+        self.classifier = nn.Linear(config.hidden_size, config.num_labels)
+
+        self.init_weights()
+
+    @add_start_docstrings_to_callable(XLM_INPUTS_DOCSTRING)
+    def forward(
+        self,
+        input_ids=None,
+        attention_mask=None,
+        langs=None,
+        token_type_ids=None,
+        position_ids=None,
+        head_mask=None,
+        labels=None,
+    ):
+        r"""
+        labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):
+            Labels for computing the token classification loss.
+            Indices should be in ``[0, ..., config.num_labels - 1]``.
+
+    Returns:
+        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers.XLMConfig`) and inputs:
+        loss (:obj:`torch.FloatTensor` of shape :obj:`(1,)`, `optional`, returned when ``labels`` is provided):
+            Classification loss.
+        scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, config.num_labels)`):
+            Classification scores (before SoftMax).
+        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):
+            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)
+            of shape :obj:`(batch_size, sequence_length, hidden_size)`.
+
+            Hidden-states of the model at the output of each layer plus the initial embedding outputs.
+        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):
+            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape
+            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.
+
+            Attention weights after the attention softmax, used to compute the weighted average in the self-attention
+            heads.
+
+    Examples::
+
+        from transformers import XLMTokenizer, XLMForTokenClassification
+        import torch
+
+        tokenizer = XLMTokenizer.from_pretrained('xlm-mlm-100-1280')
+        model = XLMForTokenClassification.from_pretrained('xlm-mlm-100-1280')
+        input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0)  # Batch size 1
+        labels = torch.tensor([1] * input_ids.size(1)).unsqueeze(0)  # Batch size 1
+        outputs = model(input_ids, labels=labels)
+        loss, scores = outputs[:2]
+
+        """
+        outputs = self.transformer(
+            input_ids,
+            attention_mask=attention_mask,
+            langs=langs,
+            token_type_ids=token_type_ids,
+            position_ids=position_ids,
+            head_mask=head_mask,
+        )
+
+        sequence_output = outputs[0]
+
+        sequence_output = self.dropout(sequence_output)
+        logits = self.classifier(sequence_output)
+
+        outputs = (logits,) + outputs[2:]  # add hidden states and attention if they are here
+        if labels is not None:
+            loss_fct = CrossEntropyLoss()
+            # Only keep active parts of the loss
+            if attention_mask is not None:
+                active_loss = attention_mask.view(-1) == 1
+                active_logits = logits.view(-1, self.num_labels)
+                active_labels = torch.where(
+                    active_loss, labels.view(-1), torch.tensor(loss_fct.ignore_index).type_as(labels)
+                )
+                loss = loss_fct(active_logits, active_labels)
+            else:
+                loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))
+            outputs = (loss,) + outputs
+
+        return outputs  # (loss), scores, (hidden_states), (attentions)
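
The loss computation in `XLMForTokenClassification.forward` keeps only positions where the attention mask is 1, by replacing the labels of padded positions with the loss function's `ignore_index`. The sketch below shows that label rewrite on flat Python lists rather than tensors; `mask_padded_labels` is a hypothetical helper, and `-100` is the default `ignore_index` of `torch.nn.CrossEntropyLoss`.

```python
from typing import List

# Default `ignore_index` of torch.nn.CrossEntropyLoss.
IGNORE_INDEX = -100


def mask_padded_labels(labels: List[int], attention_mask: List[int]) -> List[int]:
    """Replace labels at masked (padding) positions with IGNORE_INDEX.

    Plain-Python analogue of the torch.where(active_loss, labels, ignore_index) step.
    """
    if len(labels) != len(attention_mask):
        raise ValueError("labels and attention_mask must have the same length")
    return [label if mask == 1 else IGNORE_INDEX for label, mask in zip(labels, attention_mask)]


active_labels = mask_padded_labels([1, 2, 0, 0], [1, 1, 0, 0])
```

Positions carrying `IGNORE_INDEX` contribute nothing to the cross-entropy loss, so padding tokens do not affect training.
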
--- a/src/transformers/pipelines.py
+++ b/src/transformers/pipelines.py
@@ -31,6 +31,7 @@ from .configuration_auto import ALL_PRETRAINED_CONFIG_ARCHIVE_MAP, AutoConfig
 from .configuration_bart import BartConfig
 from .configuration_distilbert import DistilBertConfig
 from .configuration_roberta import RobertaConfig
+from .configuration_t5 import T5Config
 from .configuration_utils import PretrainedConfig
 from .configuration_xlm import XLMConfig
 from .data import SquadExample, squad_convert_examples_to_features
@@ -60,7 +61,6 @@ if is_torch_available():
        AutoModelForTokenClassification,
        AutoModelWithLMHead,
    )
-    from .modeling_bart import BartForConditionalGeneration


 logger = logging.getLogger(__name__)
@@ -130,7 +130,9 @@ class PipelineDataFormat:

    SUPPORTED_FORMATS = ["json", "csv", "pipe"]

-    def __init__(self, output_path: Optional[str], input_path: Optional[str], column: Optional[str], overwrite=False):
+    def __init__(
+        self, output_path: Optional[str], input_path: Optional[str], column: Optional[str], overwrite=False,
+    ):
        self.output_path = output_path
        self.input_path = input_path
        self.column = column.split(",") if column is not None else [""]
@@ -176,7 +178,7 @@ class PipelineDataFormat:

    @staticmethod
    def from_str(
-        format: str, output_path: Optional[str], input_path: Optional[str], column: Optional[str], overwrite=False
+        format: str, output_path: Optional[str], input_path: Optional[str], column: Optional[str], overwrite=False,
    ):
        if format == "json":
            return JsonPipelineDataFormat(output_path, input_path, column, overwrite=overwrite)
@@ -189,7 +191,9 @@ class PipelineDataFormat:


 class CsvPipelineDataFormat(PipelineDataFormat):
-    def __init__(self, output_path: Optional[str], input_path: Optional[str], column: Optional[str], overwrite=False):
+    def __init__(
+        self, output_path: Optional[str], input_path: Optional[str], column: Optional[str], overwrite=False,
+    ):
        super().__init__(output_path, input_path, column, overwrite=overwrite)

    def __iter__(self):
@@ -210,7 +214,9 @@ class CsvPipelineDataFormat(PipelineDataFormat):


 class JsonPipelineDataFormat(PipelineDataFormat):
-    def __init__(self, output_path: Optional[str], input_path: Optional[str], column: Optional[str], overwrite=False):
+    def __init__(
+        self, output_path: Optional[str], input_path: Optional[str], column: Optional[str], overwrite=False,
+    ):
        super().__init__(output_path, input_path, column, overwrite=overwrite)

        with open(input_path, "r") as f:
@@ -336,6 +342,7 @@ class Pipeline(_ScikitCompat):
        tokenizer: PreTrainedTokenizer,
        modelcard: Optional[ModelCard] = None,
        framework: Optional[str] = None,
+        task: str = "",
        args_parser: ArgumentHandler = None,
        device: int = -1,
        binary_output: bool = False,
@@ -356,6 +363,11 @@ class Pipeline(_ScikitCompat):
        if self.framework == "pt" and self.device.type == "cuda":
            self.model = self.model.to(self.device)

+        # Update config with task specific parameters
+        task_specific_params = self.model.config.task_specific_params
+        if task_specific_params is not None and task in task_specific_params:
+            self.model.config.update(task_specific_params.get(task))
+
    def save_pretrained(self, save_directory):
        """
        Save the pipeline's model and tokenizer to the specified save_directory
@@ -420,7 +432,7 @@ class Pipeline(_ScikitCompat):
        """
        args = ["input_ids", "attention_mask"]

-        if not isinstance(self.model.config, (DistilBertConfig, XLMConfig, RobertaConfig, BartConfig)):
+        if not isinstance(self.model.config, (DistilBertConfig, XLMConfig, RobertaConfig, BartConfig, T5Config)):
            args += ["token_type_ids"]

        # PR #1548 (CLI) There is an issue with attention_mask
@@ -432,14 +444,18 @@ class Pipeline(_ScikitCompat):
        else:
            return {k: [feature[k] for feature in features] for k in args}

-    def _parse_and_tokenize(self, *texts, **kwargs):
+    def _parse_and_tokenize(self, *texts, pad_to_max_length=False, **kwargs):
        """
        Parse arguments and tokenize
        """
        # Parse arguments
        inputs = self._args_parser(*texts, **kwargs)
        inputs = self.tokenizer.batch_encode_plus(
-            inputs, add_special_tokens=True, return_tensors=self.framework, max_length=self.tokenizer.max_len
+            inputs,
+            add_special_tokens=True,
+            return_tensors=self.framework,
+            max_length=self.tokenizer.max_len,
+            pad_to_max_length=pad_to_max_length,
        )

        # Filter out features not available on specific models
@@ -520,6 +536,7 @@ class FeatureExtractionPipeline(Pipeline):
        framework: Optional[str] = None,
        args_parser: ArgumentHandler = None,
        device: int = -1,
+        task: str = "",
    ):
        super().__init__(
            model=model,
@@ -529,6 +546,7 @@ class FeatureExtractionPipeline(Pipeline):
            args_parser=args_parser,
            device=device,
            binary_output=True,
+            task=task,
        )

    def __call__(self, *args, **kwargs):
@@ -625,6 +643,7 @@ class FillMaskPipeline(Pipeline):
        args_parser: ArgumentHandler = None,
        device: int = -1,
        topk=5,
+        task: str = "",
    ):
        super().__init__(
            model=model,
@@ -634,6 +653,7 @@ class FillMaskPipeline(Pipeline):
            args_parser=args_parser,
            device=device,
            binary_output=True,
+            task=task,
        )

        self.topk = topk
@@ -725,6 +745,7 @@ class NerPipeline(Pipeline):
        device: int = -1,
        binary_output: bool = False,
        ignore_labels=["O"],
+        task: str = "",
    ):
        super().__init__(
            model=model,
@@ -734,6 +755,7 @@ class NerPipeline(Pipeline):
            args_parser=args_parser,
            device=device,
            binary_output=binary_output,
+            task=task,
        )

        self._basic_tokenizer = BasicTokenizer(do_lower_case=False)
@@ -896,6 +918,7 @@ class QuestionAnsweringPipeline(Pipeline):
        modelcard: Optional[ModelCard] = None,
        framework: Optional[str] = None,
        device: int = -1,
+        task: str = "",
        **kwargs
    ):
        super().__init__(
@@ -905,6 +928,7 @@ class QuestionAnsweringPipeline(Pipeline):
            framework=framework,
            args_parser=QuestionAnsweringArgumentHandler(),
            device=device,
+            task=task,
            **kwargs,
        )

@@ -1102,7 +1126,11 @@ class QuestionAnsweringPipeline(Pipeline):
            chars_idx += len(word) + 1

        # Join text with spaces
-        return {"answer": " ".join(words), "start": max(0, char_start_idx), "end": min(len(text), char_end_idx)}
+        return {
+            "answer": " ".join(words),
+            "start": max(0, char_start_idx),
+            "end": min(len(text), char_end_idx),
+        }


 class SummarizationPipeline(Pipeline):
@@ -1111,12 +1139,16 @@ class SummarizationPipeline(Pipeline):

    Usage::

+        # use bart in pytorch
        summarizer = pipeline("summarization")
-        summarizer("Sam Shleifer writes the best docstring examples in the whole world.")
+        summarizer("Sam Shleifer writes the best docstring examples in the whole world.", min_length=5, max_length=20)
+
+        # use t5 in tf
+        summarizer = pipeline("summarization", model="t5-base", tokenizer="t5-base", framework="tf")
+        summarizer("Sam Shleifer writes the best docstring examples in the whole world.", min_length=5, max_length=20)

    Supported Models:
-        The models that this pipeline can use are models that have been fine-tuned on a summarization task, which is
-        currently only ``BartForConditionalGeneration.from_pretrained('bart-large-cnn')``
+        The models that this pipeline can use are models that have been fine-tuned on a summarization task, currently: '`bart-large-cnn`', '`t5-small`', '`t5-base`', '`t5-large`', '`t5-3b`', '`t5-11b`'.

    Arguments:
        model (:obj:`str` or :obj:`~transformers.PreTrainedModel` or :obj:`~transformers.TFPreTrainedModel`, `optional`, defaults to :obj:`None`):
@@ -1147,17 +1179,8 @@ class SummarizationPipeline(Pipeline):
            on the associated CUDA device id.
    """

-    task = "summarization"
-
    def __call__(
-        self,
-        *documents,
-        return_tensors=False,
-        return_text=True,
-        max_length=142,
-        min_length=21,
-        clean_up_tokenization_spaces=False,
-        **generate_kwargs
+        self, *documents, return_tensors=False, return_text=True, clean_up_tokenization_spaces=False, **generate_kwargs
    ):
        r"""
        Args:
@@ -1165,10 +1188,6 @@ class SummarizationPipeline(Pipeline):
            return_text: (bool, default=True) whether to add a decoded "summary_text" to each result
            return_tensors: (bool, default=False) whether to return the raw "summary_token_ids" to each result

-            max_length: (`optional`) int
-                The max length of the sequence to be generated. Does not include tokens in input_ids.
-            min_len: (`optional`) int
-            no_repeat_ngram_size:  (`optional`) int. ban ngrams of this length from being repeated in the generated text
            clean_up_tokenization_spaces: (`optional`) bool whether to include extra spaces in the output
            **generate_kwargs: extra kwargs passed to `self.model.generate`_

@@ -1180,19 +1199,60 @@ class SummarizationPipeline(Pipeline):

        """
        assert return_tensors or return_text, "You must specify return_tensors=True or return_text=True"
-        if self.framework == "tf":
-            raise NotImplementedError("Tensorflow not supported")
-        with self.device_placement():
-            inputs = self._parse_and_tokenize(*documents)
-            inputs = self.ensure_tensor_on_device(**inputs)
-            summaries = self.model.generate(
-                inputs["input_ids"],
-                attention_mask=inputs["attention_mask"],
-                max_length=max_length,
-                min_length=min_length,
-                do_sample=False,
-                **generate_kwargs,
+        assert len(documents) > 0, "Please provide a document to summarize"
+
+        if self.framework == "tf" and "BartForConditionalGeneration" in self.model.__class__.__name__:
+            raise NotImplementedError(
+                "Tensorflow is not yet supported for Bart. Please consider using T5, e.g. `t5-base`"
            )
+
+        prefix = self.model.config.prefix if self.model.config.prefix is not None else ""
+
+        if isinstance(documents[0], list):
+            assert (
+                self.tokenizer.pad_token_id is not None
+            ), "Please make sure that the tokenizer has a pad_token_id when using a batch input"
+
+            documents = ([prefix + document for document in documents[0]],)
+            pad_to_max_length = True
+
+        elif isinstance(documents[0], str):
+            documents = (prefix + documents[0],)
+            pad_to_max_length = False
+        else:
+            raise ValueError(
+                " `documents[0]`: {} has the wrong format. It should be either of type `str` or type `list`".format(
+                    documents[0]
+                )
+            )
+
+        with self.device_placement():
+            inputs = self._parse_and_tokenize(*documents, pad_to_max_length=pad_to_max_length)
+
+            if self.framework == "pt":
+                inputs = self.ensure_tensor_on_device(**inputs)
+                input_length = inputs["input_ids"].shape[-1]
+            elif self.framework == "tf":
+                input_length = tf.shape(inputs["input_ids"])[-1].numpy()
+
+            if input_length < self.model.config.min_length // 2:
+                logger.warning(
+                    "Your min_length is set to {}, but your input_length is only {}. You might consider decreasing min_length manually, e.g. summarizer('...', min_length=10)".format(
+                        self.model.config.min_length, input_length
+                    )
+                )
+
+            if input_length < self.model.config.max_length:
+                logger.warning(
+                    "Your max_length is set to {}, but your input_length is only {}. You might consider decreasing max_length manually, e.g. summarizer('...', max_length=50)".format(
+                        self.model.config.max_length, input_length
+                    )
+                )
+
+            summaries = self.model.generate(
+                inputs["input_ids"], attention_mask=inputs["attention_mask"], **generate_kwargs,
+            )
+
            results = []
            for summary in summaries:
                record = {}
@@ -1200,7 +1260,115 @@ class SummarizationPipeline(Pipeline):
                    record["summary_token_ids"] = summary
                if return_text:
                    record["summary_text"] = self.tokenizer.decode(
-                        summary, skip_special_tokens=True, clean_up_tokenization_spaces=clean_up_tokenization_spaces
+                        summary, skip_special_tokens=True, clean_up_tokenization_spaces=clean_up_tokenization_spaces,
+                    )
+                results.append(record)
+            return results
+
+
+class TranslationPipeline(Pipeline):
+    """
+    Translates from one language to another.
+
+    Usage::
+        en_fr_translator = pipeline("translation_en_to_fr")
+        en_fr_translator("How old are you?")
+
+    Supported Models: "t5-small", "t5-base", "t5-large", "t5-3b", "t5-11b"
+
+    Arguments:
+        model (:obj:`str` or :obj:`~transformers.PreTrainedModel` or :obj:`~transformers.TFPreTrainedModel`, `optional`, defaults to :obj:`None`):
+            The model that will be used by the pipeline to make predictions. This can be :obj:`None`, a string
+            checkpoint identifier or an actual pre-trained model inheriting from
+            :class:`~transformers.PreTrainedModel` for PyTorch and :class:`~transformers.TFPreTrainedModel` for
+            TensorFlow.
+            If :obj:`None`, the default of the pipeline will be loaded.
+        tokenizer (:obj:`str` or :obj:`~transformers.PreTrainedTokenizer`, `optional`, defaults to :obj:`None`):
+            The tokenizer that will be used by the pipeline to encode data for the model. This can be :obj:`None`,
+            a string checkpoint identifier or an actual pre-trained tokenizer inheriting from
+            :class:`~transformers.PreTrainedTokenizer`.
+            If :obj:`None`, the default of the pipeline will be loaded.
+        modelcard (:obj:`str` or :class:`~transformers.ModelCard`, `optional`, defaults to :obj:`None`):
+            Model card attributed to the model for this pipeline.
+        framework (:obj:`str`, `optional`, defaults to :obj:`None`):
+            The framework to use, either "pt" for PyTorch or "tf" for TensorFlow. The specified framework must be
+            installed.
+            If no framework is specified, will default to the one currently installed. If no framework is specified
+            and both frameworks are installed, will default to PyTorch.
+        args_parser (:class:`~transformers.pipelines.ArgumentHandler`, `optional`, defaults to :obj:`None`):
+            Reference to the object in charge of parsing supplied pipeline parameters.
+        device (:obj:`int`, `optional`, defaults to :obj:`-1`):
+            Device ordinal for CPU/GPU support. Setting this to -1 will leverage CPU, >=0 will run the model
+            on the associated CUDA device id.
+    """
+
+    def __call__(
+        self, *texts, return_tensors=False, return_text=True, clean_up_tokenization_spaces=False, **generate_kwargs
+    ):
+        r"""
+        Args:
+            *texts: (list of strings) texts to be translated
+            return_text: (bool, default=True) whether to add a decoded "translation_text" to each result
+            return_tensors: (bool, default=False) whether to return the raw "translation_token_ids" to each result
+
+            **generate_kwargs: extra kwargs passed to `self.model.generate`_
+
+        Returns:
+            list of dicts with 'translation_text' and/or 'translation_token_ids' for each text_to_translate
+        .. _`self.model.generate`:
+            https://huggingface.co/transformers/model_doc/bart.html#transformers.BartForConditionalGeneration.generate
+        """
+        assert return_tensors or return_text, "You must specify return_tensors=True or return_text=True"
+
+        prefix = self.model.config.prefix if self.model.config.prefix is not None else ""
+
+        if isinstance(texts[0], list):
+            assert (
+                self.tokenizer.pad_token_id is not None
+            ), "Please make sure that the tokenizer has a pad_token_id when using a batch input"
+            texts = ([prefix + text for text in texts[0]],)
+            pad_to_max_length = True
+
+        elif isinstance(texts[0], str):
+            texts = (prefix + texts[0],)
+            pad_to_max_length = False
+        else:
+            raise ValueError(
+                " `texts[0]`: {} has the wrong format. It should be either of type `str` or type `list`".format(
+                    texts[0]
+                )
+            )
+
+        with self.device_placement():
+            inputs = self._parse_and_tokenize(*texts, pad_to_max_length=pad_to_max_length)
+
+            if self.framework == "pt":
+                inputs = self.ensure_tensor_on_device(**inputs)
+                input_length = inputs["input_ids"].shape[-1]
+
+            elif self.framework == "tf":
+                input_length = tf.shape(inputs["input_ids"])[-1].numpy()
+
+            if input_length > 0.9 * self.model.config.max_length:
+                logger.warning(
+                    "Your input_length: {} is bigger than 0.9 * max_length: {}. You might consider increasing your max_length manually, e.g. translator('...', max_length=400)".format(
+                        input_length, self.model.config.max_length
+                    )
+                )
+
+            translations = self.model.generate(
+                inputs["input_ids"], attention_mask=inputs["attention_mask"], **generate_kwargs,
+            )
+            results = []
+            for translation in translations:
+                record = {}
+                if return_tensors:
+                    record["translation_token_ids"] = translation
+                if return_text:
+                    record["translation_text"] = self.tokenizer.decode(
+                        translation,
+                        skip_special_tokens=True,
+                        clean_up_tokenization_spaces=clean_up_tokenization_spaces,
                    )
                results.append(record)
            return results
@@ -1266,14 +1434,44 @@ SUPPORTED_TASKS = {
    },
    "summarization": {
        "impl": SummarizationPipeline,
-        "pt": BartForConditionalGeneration if is_torch_available() else None,
-        "tf": None,
+        "tf": TFAutoModelWithLMHead if is_tf_available() else None,
+        "pt": AutoModelWithLMHead if is_torch_available() else None,
        "default": {
            "model": {"pt": "bart-large-cnn", "tf": None},
            "config": None,
            "tokenizer": ("bart-large-cnn", {"use_fast": False}),
        },
    },
+    "translation_en_to_fr": {
+        "impl": TranslationPipeline,
+        "tf": TFAutoModelWithLMHead if is_tf_available() else None,
+        "pt": AutoModelWithLMHead if is_torch_available() else None,
+        "default": {
+            "model": {"pt": "t5-base", "tf": "t5-base"},
+            "config": None,
+            "tokenizer": ("t5-base", {"use_fast": False}),
+        },
+    },
+    "translation_en_to_de": {
+        "impl": TranslationPipeline,
+        "tf": TFAutoModelWithLMHead if is_tf_available() else None,
+        "pt": AutoModelWithLMHead if is_torch_available() else None,
+        "default": {
+            "model": {"pt": "t5-base", "tf": "t5-base"},
+            "config": None,
+            "tokenizer": ("t5-base", {"use_fast": False}),
+        },
+    },
+    "translation_en_to_ro": {
+        "impl": TranslationPipeline,
+        "tf": TFAutoModelWithLMHead if is_tf_available() else None,
+        "pt": AutoModelWithLMHead if is_torch_available() else None,
+        "default": {
+            "model": {"pt": "t5-base", "tf": "t5-base"},
+            "config": None,
+            "tokenizer": ("t5-base", {"use_fast": False}),
+        },
+    },
 }


@@ -1361,7 +1559,7 @@ def pipeline(
    framework = framework or get_framework(model)

    targeted_task = SUPPORTED_TASKS[task]
-    task, model_class = targeted_task["impl"], targeted_task[framework]
+    task_class, model_class = targeted_task["impl"], targeted_task[framework]

    # Use default model/config/tokenizer for the task if no model is provided
    if model is None:
@@ -1422,4 +1620,4 @@ def pipeline(
            )
        model = model_class.from_pretrained(model, config=config, **model_kwargs)

-    return task(model=model, tokenizer=tokenizer, modelcard=modelcard, framework=framework, **kwargs)
+    return task_class(model=model, tokenizer=tokenizer, modelcard=modelcard, framework=framework, task=task, **kwargs,)
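The `TranslationPipeline.__call__` added above prepends the model's configured prefix to each input and decides whether padding is needed, depending on whether it receives a single string or a list of strings. A minimal, dependency-free sketch of that branch follows. The function name `normalize_inputs` is hypothetical, and the prefix string is only an illustrative value, not one read from a real config.

```python
def normalize_inputs(texts, prefix=None):
    """Sketch of the str-vs-list branch; returns (texts, pad_to_max_length)."""
    prefix = prefix if prefix is not None else ""
    first = texts[0]
    if isinstance(first, list):
        # Batch input: prefix every text; a batch has to be padded.
        return ([prefix + text for text in first],), True
    if isinstance(first, str):
        # Single input: prefix it; no padding required.
        return (prefix + first,), False
    raise ValueError(
        "`texts[0]`: {} has the wrong format. It should be `str` or `list`.".format(first)
    )


# Illustrative prefix (an assumption for this sketch).
PREFIX = "translate English to French: "

single, pad_single = normalize_inputs(("How old are you?",), PREFIX)
batch, pad_batch = normalize_inputs((["Hello", "Goodbye"],), PREFIX)
```

Returning the padding flag together with the normalized texts keeps both decisions in one place, which is roughly what the two `isinstance` branches in the diff do with `pad_to_max_length`.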
--- a/src/transformers/tokenization_bart.py
+++ b/src/transformers/tokenization_bart.py
@@ -19,7 +19,7 @@ from .tokenization_roberta import RobertaTokenizer
 # vocab and merges same as roberta
 vocab_url = "https://s3.amazonaws.com/models.huggingface.co/bert/roberta-large-vocab.json"
 merges_url = "https://s3.amazonaws.com/models.huggingface.co/bert/roberta-large-merges.txt"
-_all_bart_models = ["bart-large", "bart-large-mnli", "bart-large-cnn"]
+_all_bart_models = ["bart-large", "bart-large-mnli", "bart-large-cnn", "bart-large-xsum"]


 class BartTokenizer(RobertaTokenizer):
--- a/src/transformers/tokenization_t5.py
+++ b/src/transformers/tokenization_t5.py
@@ -61,14 +61,34 @@ PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = {

 class T5Tokenizer(PreTrainedTokenizer):
    """
-        SentencePiece based tokenizer. Peculiarities:
+        Constructs a T5 tokenizer. Based on `SentencePiece <https://github.com/google/sentencepiece>`__ .

-            - requires `SentencePiece <https://github.com/google/sentencepiece>`_
-            - `extra_ids` add a number of extra ids added to the end of the vocabulary for use as sentinels.
-                These tokens are accessible as `<extra_id_{%d}>` where `{%d}` is a number between 0 and extra_ids-1.
-                Extra tokens are indexed from the end of the vocabulary up to beginnning (<extra_id_0> is the last token in the vocabulary)
-                (like in T5 preprocessing
+        This tokenizer inherits from :class:`~transformers.PreTrainedTokenizer` which contains most of the methods. Users
+        should refer to the superclass for more information regarding methods.
+
+        Args:
+            vocab_file (:obj:`string`):
+                `SentencePiece <https://github.com/google/sentencepiece>`__ file (generally has a `.spm` extension) that
+                contains the vocabulary necessary to instantiate a tokenizer.
+            eos_token (:obj:`string`, `optional`, defaults to "</s>"):
+                The end of sequence token.
+
+                .. note::
+
+                    When building a sequence using special tokens, this is not the token that is used for the end
+                    of sequence. The token used is the :obj:`sep_token`.
+            unk_token (:obj:`string`, `optional`, defaults to "<unk>"):
+                The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this
+                token instead.
+            pad_token (:obj:`string`, `optional`, defaults to "<pad>"):
+                The token used for padding, for example when batching sequences of different lengths.
+            extra_ids (:obj:`int`, `optional`, defaults to :obj:`100`):
+                Number of extra ids added to the end of the vocabulary for use as sentinels.
+                These tokens are accessible as "<extra_id_{%d}>" where "{%d}" is a number between 0 and extra_ids-1.
+                Extra tokens are indexed from the end of the vocabulary towards the beginning ("<extra_id_0>" is the last token in the vocabulary, like in T5 preprocessing
                see: https://github.com/google-research/text-to-text-transfer-transformer/blob/9fd7b14a769417be33bc6c850f9598764913c833/t5/data/preprocessors.py#L2117)
+            additional_special_tokens (:obj:`List[str]`, `optional`, defaults to :obj:`None`):
+                Additional special tokens used by the tokenizer.
    """

    vocab_files_names = VOCAB_FILES_NAMES
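The `extra_ids` description above says sentinel tokens are indexed from the end of the vocabulary, with "<extra_id_0>" as the last token. A small sketch of that mapping, assuming `vocab_size` counts the sentinels as well. The helper name `sentinel_token_to_id` and the size 32100 are illustrative choices for this example, not values taken from the diff.

```python
import re


def sentinel_token_to_id(token, vocab_size):
    """Map "<extra_id_N>" to an id counted backwards from the end of the vocabulary."""
    match = re.fullmatch(r"<extra_id_(\d+)>", token)
    if match is None:
        raise ValueError("not a sentinel token: {!r}".format(token))
    # "<extra_id_0>" is the last id, "<extra_id_1>" the one before it, and so on.
    return vocab_size - 1 - int(match.group(1))


VOCAB_SIZE = 32100  # illustrative total size, sentinels included (assumption)
last_id = sentinel_token_to_id("<extra_id_0>", VOCAB_SIZE)
first_id = sentinel_token_to_id("<extra_id_99>", VOCAB_SIZE)
```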
--- a/src/transformers/tokenization_utils.py
+++ b/src/transformers/tokenization_utils.py
@@ -1997,3 +1997,14 @@ class PreTrainedTokenizerFast(PreTrainedTokenizer):
            files = self._tokenizer.save(folder, name=file)

        return tuple(files)
+
+
+def trim_batch(
+    input_ids, pad_token_id, attention_mask=None,
+):
+    """Remove columns that are populated exclusively by pad_token_id"""
+    keep_column_mask = input_ids.ne(pad_token_id).any(dim=0)
+    if attention_mask is None:
+        return input_ids[:, keep_column_mask]
+    else:
+        return (input_ids[:, keep_column_mask], attention_mask[:, keep_column_mask])
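`trim_batch` above drops every column that contains only `pad_token_id`, and applies the same column selection to the attention mask when one is given. The sketch below reproduces that logic on plain nested lists so it runs without a tensor library. `trim_batch_lists` is a hypothetical name, and the sketch assumes a non-empty, rectangular batch.

```python
def trim_batch_lists(input_ids, pad_token_id, attention_mask=None):
    """List-based sketch: remove columns populated exclusively by pad_token_id."""
    n_cols = len(input_ids[0])
    # A column is kept if at least one row holds a non-pad token there.
    keep = [any(row[col] != pad_token_id for row in input_ids) for col in range(n_cols)]

    def select(rows):
        return [[value for value, kept in zip(row, keep) if kept] for row in rows]

    if attention_mask is None:
        return select(input_ids)
    return select(input_ids), select(attention_mask)


ids = [[5, 6, 0, 0], [7, 0, 0, 0]]
mask = [[1, 1, 0, 0], [1, 0, 0, 0]]
trimmed_ids = trim_batch_lists(ids, pad_token_id=0)
trimmed_ids_2, trimmed_mask = trim_batch_lists(ids, pad_token_id=0, attention_mask=mask)
```

As in the original, the return type depends on whether a mask was passed: a single object without one, a pair with one.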
--- a/tests/test_modeling_auto.py
+++ b/tests/test_modeling_auto.py
@@ -37,6 +37,8 @@ if is_torch_available():
        BertForSequenceClassification,
        AutoModelForQuestionAnswering,
        BertForQuestionAnswering,
+        AutoModelForTokenClassification,
+        BertForTokenClassification,
    )
    from transformers.modeling_bert import BERT_PRETRAINED_MODEL_ARCHIVE_MAP
    from transformers.modeling_auto import (
@@ -109,7 +111,7 @@ class AutoModelTest(unittest.TestCase):
            self.assertIsNotNone(model)
            self.assertIsInstance(model, BertForSequenceClassification)

-    # @slow
+    @slow
    def test_question_answering_model_from_pretrained(self):
        logging.basicConfig(level=logging.INFO)
        for model_name in list(BERT_PRETRAINED_MODEL_ARCHIVE_MAP.keys())[:1]:
@@ -122,6 +124,19 @@ class AutoModelTest(unittest.TestCase):
            self.assertIsNotNone(model)
            self.assertIsInstance(model, BertForQuestionAnswering)

+    @slow
+    def test_token_classification_model_from_pretrained(self):
+        logging.basicConfig(level=logging.INFO)
+        for model_name in list(BERT_PRETRAINED_MODEL_ARCHIVE_MAP.keys())[:1]:
+            config = AutoConfig.from_pretrained(model_name)
+            self.assertIsNotNone(config)
+            self.assertIsInstance(config, BertConfig)
+
+            model = AutoModelForTokenClassification.from_pretrained(model_name)
+            model, loading_info = AutoModelForTokenClassification.from_pretrained(model_name, output_loading_info=True)
+            self.assertIsNotNone(model)
+            self.assertIsInstance(model, BertForTokenClassification)
+
    def test_from_pretrained_identifier(self):
        logging.basicConfig(level=logging.INFO)
        model = AutoModelWithLMHead.from_pretrained(SMALL_MODEL_IDENTIFIER)
--- a/tests/test_modeling_bart.py
+++ b/tests/test_modeling_bart.py
@@ -36,8 +36,8 @@ if is_torch_available():
    from transformers.modeling_bart import (
        BART_PRETRAINED_MODEL_ARCHIVE_MAP,
        shift_tokens_right,
+        invert_mask,
        _prepare_bart_decoder_inputs,
-        LARGE_NEGATIVE,
    )
    from transformers.tokenization_bart import BartTokenizer

@@ -113,7 +113,8 @@ class BARTModelTest(ModelTesterMixin, unittest.TestCase):
    test_pruning = False
    test_torchscript = False
    test_head_masking = False
-    test_resize_embeddings = False  # This requires inputs_dict['input_ids']
+    test_resize_embeddings = True  # This requires inputs_dict['input_ids']
+    test_missing_keys = False  # because BartForConditionalGeneration and BartModel now have identical state_dict

    def setUp(self):
        self.model_tester = ModelTester(self)
@@ -122,10 +123,9 @@ class BARTModelTest(ModelTesterMixin, unittest.TestCase):
    def test_config(self):
        self.config_tester.run_common_tests()

-    def test_advanced_inputs(self):
+    def test_initialization_more(self):
        # (config, input_ids, token_type_ids, input_mask, *unused) = \
        config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()
-        decoder_input_ids, decoder_attn_mask = _prepare_bart_decoder_inputs(config, inputs_dict["input_ids"])
        model = BartModel(config)
        model.to(torch_device)
        model.eval()
@@ -141,9 +141,17 @@ class BARTModelTest(ModelTesterMixin, unittest.TestCase):
        _check_var(model.encoder.layers[0].fc1)
        _check_var(model.encoder.embed_positions)

+    def test_advanced_inputs(self):
+        config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()
+        inputs_dict["input_ids"][:, -2:] = config.pad_token_id
+        decoder_input_ids, decoder_attn_mask, causal_mask = _prepare_bart_decoder_inputs(
+            config, inputs_dict["input_ids"]
+        )
+        model = BartModel(config).to(torch_device).eval()
+
        decoder_features_with_created_mask = model(**inputs_dict)[0]
        decoder_features_with_passed_mask = model(
-            decoder_attention_mask=decoder_attn_mask, decoder_input_ids=decoder_input_ids, **inputs_dict
+            decoder_attention_mask=invert_mask(decoder_attn_mask), decoder_input_ids=decoder_input_ids, **inputs_dict
        )[0]
        _assert_tensors_equal(decoder_features_with_passed_mask, decoder_features_with_created_mask)
        useless_mask = torch.zeros_like(decoder_attn_mask)
@@ -237,7 +245,7 @@ class BartHeadTests(unittest.TestCase):
        lm_labels = ids_tensor([batch_size, input_ids.shape[1]], self.vocab_size).to(torch_device)
        lm_model = BartForConditionalGeneration(config)
        lm_model.to(torch_device)
-        loss, logits, enc_features = lm_model(input_ids=input_ids, lm_labels=lm_labels, decoder_input_ids=input_ids)
+        loss, logits, enc_features = lm_model(input_ids=input_ids, lm_labels=lm_labels)
        expected_shape = (batch_size, input_ids.shape[1], config.vocab_size)
        self.assertEqual(logits.shape, expected_shape)
        self.assertIsInstance(loss.item(), float)
@@ -335,41 +343,39 @@ class BartHeadTests(unittest.TestCase):
        model.generate(num_beams=4, do_sample=True, early_stopping=False, num_return_sequences=3)

    def test_dummy_inputs(self):
-        config, *_ = self._get_config_and_data(output_past=True)
+        config, *_ = self._get_config_and_data()
        model = BartForConditionalGeneration(config).eval().to(torch_device)
        model(**model.dummy_inputs)

    def test_prepare_bart_decoder_inputs(self):
        config, *_ = self._get_config_and_data(output_past=False)
-        input_ids = _long_tensor(([4, 4, 2]))  # only used for .device if decoder_input_ids is passed
+        input_ids = _long_tensor(([4, 4, 2]))
        decoder_input_ids = _long_tensor([[26388, 2, config.pad_token_id]])
-        ignore = LARGE_NEGATIVE
-        decoder_input_ids, decoder_attn_mask = _prepare_bart_decoder_inputs(config, input_ids, decoder_input_ids)
-        expected_mask = torch.tensor(
-            [
-                [0, ignore, ignore],
-                [0, 0, ignore],
-                [ignore, ignore, ignore],  # never attend to the final token, because its pad
-            ]
-        ).to(input_ids.device)
-        self.assertEqual(decoder_attn_mask.size(), (1, 1, 3, 3))
-        self.assertTrue(torch.eq(expected_mask, decoder_attn_mask).all())
-
-        # Test no causal mask
-        config, *_ = self._get_config_and_data(output_past=True)
-        expected_just_padding_mask = torch.tensor(
-            [[0, 0, 0], [0, 0, 0], [ignore, ignore, ignore]]  # never attend to the final token, because its pad
-        ).to(input_ids.device)
-        _, decoder_attn_mask_no_causal_mask = _prepare_bart_decoder_inputs(config, input_ids, decoder_input_ids)
-        self.assertEqual(decoder_attn_mask_no_causal_mask.size(), (1, 1, 3, 3))
-        self.assertTrue(torch.eq(expected_just_padding_mask, decoder_attn_mask_no_causal_mask).all())
-
-        decoder_input_ids = _long_tensor([[0, 26388, 4133, 2]])
-        # Attend to everything if no pad tokens and no causal mask
-        _, decoder_attn_mask_no_padding_no_causal_mask = _prepare_bart_decoder_inputs(
+        ignore = float("-inf")
+        decoder_input_ids, decoder_attn_mask, causal_mask = _prepare_bart_decoder_inputs(
            config, input_ids, decoder_input_ids
        )
-        self.assertTrue(torch.eq(decoder_attn_mask_no_padding_no_causal_mask, 0).all())
+        expected_causal_mask = torch.tensor(
+            [[0, ignore, ignore], [0, 0, ignore], [0, 0, 0]]  # causal part only: each position sees itself and earlier positions
+        ).to(input_ids.device)
+        self.assertEqual(decoder_attn_mask.size(), decoder_input_ids.size())
+        self.assertTrue(torch.eq(expected_causal_mask, causal_mask).all())
+
+    def test_resize_tokens_embeddings_more(self):
+        config, input_ids, _ = self._get_config_and_data()
+
+        def _get_embs(m):
+            return (m.get_input_embeddings().weight.data.clone(), m.get_output_embeddings().weight.data.clone())
+
+        model = BartForConditionalGeneration(config).eval().to(torch_device)
+        input, output = _get_embs(model)
+        self.assertTrue(torch.eq(input, output).all())
+        new_vocab_size = 45
+        model.resize_token_embeddings(new_vocab_size)
+        input_new, output_new = _get_embs(model)
+        self.assertEqual(input_new.shape, (new_vocab_size, config.d_model))
+        self.assertEqual(output_new.shape, (new_vocab_size, config.d_model))
+        self.assertTrue(torch.eq(input_new, output_new).all())


 def _assert_tensors_equal(a, b, atol=1e-12, prefix=""):
@@ -444,6 +450,38 @@ class BartModelIntegrationTests(unittest.TestCase):
            model = BartModel.from_pretrained(model_name, cache_dir=CACHE_DIR)
            self.assertIsNotNone(model)

+    @slow
+    def test_xsum_summarization_same_as_fairseq(self):
+        model = BartForConditionalGeneration.from_pretrained("bart-large-xsum").to(torch_device)
+        tok = BartTokenizer.from_pretrained("bart-large")
+
+        PGE_ARTICLE = """ PG&E stated it scheduled the blackouts in response to forecasts for high winds amid dry conditions. The aim is to reduce the risk of wildfires. Nearly 800 thousand customers were scheduled to be affected by the shutoffs which were expected to last through at least midday tomorrow."""
+        EXPECTED_SUMMARY = "California's largest power company has begun shutting off power to tens of thousands of homes and businesses in the state."
+        dct = tok.batch_encode_plus([PGE_ARTICLE], max_length=1024, pad_to_max_length=True, return_tensors="pt",)
+
+        hypotheses_batch = model.generate(
+            input_ids=dct["input_ids"].to(torch_device),
+            attention_mask=dct["attention_mask"].to(torch_device),
+            num_beams=2,
+            max_length=62,
+            min_length=11,
+            length_penalty=1.0,
+            no_repeat_ngram_size=3,
+            early_stopping=True,
+            decoder_start_token_id=model.config.eos_token_id,
+        )
+
+        decoded = [
+            tok.decode(g, skip_special_tokens=True, clean_up_tokenization_spaces=False) for g in hypotheses_batch
+        ]
+        self.assertEqual(EXPECTED_SUMMARY, decoded[0])
+
+    def test_xsum_config_generation_params(self):
+        config = BartConfig.from_pretrained("bart-large-xsum")
+        expected_params = dict(num_beams=6, do_sample=False, early_stopping=True, length_penalty=1.0)
+        config_params = {k: getattr(config, k, "MISSING") for k, v in expected_params.items()}
+        self.assertDictEqual(expected_params, config_params)
+
    @slow
    def test_cnn_summarization_same_as_fairseq(self):
        hf = BartForConditionalGeneration.from_pretrained("bart-large-cnn", output_past=True,).to(torch_device)
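The updated `test_prepare_bart_decoder_inputs` above compares the returned causal mask with a lower-triangular pattern: 0 where attention is allowed and `float("-inf")` above the diagonal. A dependency-free sketch of how such a pattern can be built, with `causal_mask` as a hypothetical helper name rather than a function from the library:

```python
def causal_mask(size, ignore=float("-inf")):
    """Row i may attend to columns 0..i; later columns receive the ignore value."""
    return [[0.0 if col <= row else ignore for col in range(size)] for row in range(size)]


mask_3 = causal_mask(3)
```

Because the ignore value is added to attention scores before the softmax, `-inf` gives those positions zero weight, which is why the test switches `ignore` to `float("-inf")`.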
--- a/tests/test_modeling_common.py
+++ b/tests/test_modeling_common.py
@@ -58,6 +58,7 @@ class ModelTesterMixin:
    test_pruning = True
    test_resize_embeddings = True
    test_head_masking = True
+    test_missing_keys = True
    is_encoder_decoder = False

    def test_save_load(self):
@@ -527,6 +528,8 @@ class ModelTesterMixin:
            self.assertTrue(x is None or isinstance(x, torch.nn.Linear))

    def test_correct_missing_keys(self):
+        if not self.test_missing_keys:
+            return
        config, _ = self.model_tester.prepare_config_and_inputs_for_common()

        for model_class in self.all_model_classes:
--- a/tests/test_modeling_t5.py
+++ b/tests/test_modeling_t5.py
@@ -24,6 +24,7 @@ from .utils import CACHE_DIR, require_torch, slow, torch_device


 if is_torch_available():
+    import torch
    from transformers import T5Config, T5Model, T5ForConditionalGeneration
    from transformers.modeling_t5 import T5_PRETRAINED_MODEL_ARCHIVE_MAP

@@ -57,8 +58,9 @@ class T5ModelTest(ModelTesterMixin, unittest.TestCase):
            relative_attention_num_buckets=8,
            dropout_rate=0.1,
            initializer_factor=0.002,
-            eos_token_ids=[1],
+            eos_token_id=1,
            pad_token_id=0,
+            decoder_start_token_id=0,
            scope=None,
        ):
            self.parent = parent
@@ -78,8 +80,9 @@ class T5ModelTest(ModelTesterMixin, unittest.TestCase):
            self.dropout_rate = dropout_rate
            self.initializer_factor = initializer_factor
            self.scope = scope
-            self.eos_token_ids = eos_token_ids
+            self.eos_token_id = eos_token_id
            self.pad_token_id = pad_token_id
+            self.decoder_start_token_id = decoder_start_token_id

        def prepare_config_and_inputs(self):
            input_ids = ids_tensor([self.batch_size, self.encoder_seq_length], self.vocab_size)
@@ -106,9 +109,10 @@ class T5ModelTest(ModelTesterMixin, unittest.TestCase):
                relative_attention_num_buckets=self.relative_attention_num_buckets,
                dropout_rate=self.dropout_rate,
                initializer_factor=self.initializer_factor,
-                eos_token_ids=self.eos_token_ids,
+                eos_token_id=self.eos_token_id,
                bos_token_id=self.pad_token_id,
                pad_token_id=self.pad_token_id,
+                decoder_start_token_id=self.decoder_start_token_id,
            )

            return (
@@ -123,6 +127,39 @@ class T5ModelTest(ModelTesterMixin, unittest.TestCase):
        def check_loss_output(self, result):
            self.parent.assertListEqual(list(result["loss"].size()), [])

+        def check_prepare_lm_labels_via_shift_left(
+            self, config, input_ids, decoder_input_ids, attention_mask, decoder_attention_mask, lm_labels,
+        ):
+            model = T5Model(config=config)
+            model.to(torch_device)
+            model.eval()
+
+            # make sure that lm_labels are correctly padded from the right
+            lm_labels.masked_fill_((lm_labels == self.decoder_start_token_id), self.eos_token_id)
+
+            # add causal pad token mask
+            triangular_mask = torch.tril(lm_labels.new_ones(lm_labels.shape)).logical_not()
+            lm_labels.masked_fill_(triangular_mask, self.pad_token_id)
+            decoder_input_ids = model._shift_right(lm_labels)
+
+            for i, (decoder_input_ids_slice, lm_labels_slice) in enumerate(zip(decoder_input_ids, lm_labels)):
+                # first item
+                self.parent.assertEqual(decoder_input_ids_slice[0].item(), self.decoder_start_token_id)
+                if i < decoder_input_ids_slice.shape[-1]:
+                    if i < decoder_input_ids.shape[-1] - 1:
+                        # items before diagonal
+                        self.parent.assertListEqual(
+                            decoder_input_ids_slice[1 : i + 1].tolist(), lm_labels_slice[:i].tolist()
+                        )
+                    # pad items after diagonal
+                    if i < decoder_input_ids.shape[-1] - 2:
+                        self.parent.assertListEqual(
+                            decoder_input_ids_slice[i + 2 :].tolist(), lm_labels_slice[i + 1 : -1].tolist()
+                        )
+                else:
+                    # all items after square
+                    self.parent.assertListEqual(decoder_input_ids_slice[1:].tolist(), lm_labels_slice[:-1].tolist())
+
        def create_and_check_t5_model(
            self, config, input_ids, decoder_input_ids, attention_mask, decoder_attention_mask, lm_labels,
        ):
@@ -197,6 +234,10 @@ class T5ModelTest(ModelTesterMixin, unittest.TestCase):
    def test_config(self):
        self.config_tester.run_common_tests()

+    def test_shift_right(self):
+        config_and_inputs = self.model_tester.prepare_config_and_inputs()
+        self.model_tester.check_prepare_lm_labels_via_shift_left(*config_and_inputs)
+
    def test_t5_model(self):
        config_and_inputs = self.model_tester.prepare_config_and_inputs()
        self.model_tester.create_and_check_t5_model(*config_and_inputs)
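The new `test_shift_right` above checks two properties of the decoder inputs derived from `lm_labels`: the first position holds `decoder_start_token_id`, and the remaining positions equal the labels with the last item dropped. A list-based sketch of a shift with those two properties follows. `shift_right` here is an illustrative stand-in, not the model's `_shift_right` method, and it assumes non-empty rows.

```python
def shift_right(labels, decoder_start_token_id):
    """Prepend the start token to each row and drop the row's last label."""
    return [[decoder_start_token_id] + row[:-1] for row in labels]


labels = [[11, 12, 13, 1], [21, 22, 1, 0]]
decoder_input_ids = shift_right(labels, decoder_start_token_id=0)
```

With this shift, the decoder input at position t is the label at position t - 1, so the decoder never sees the token it is asked to predict at that step.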
--- a/tests/test_modeling_tf_t5.py
+++ b/tests/test_modeling_tf_t5.py
@@ -52,7 +52,7 @@ class TFT5ModelTest(TFModelTesterMixin, unittest.TestCase):
            relative_attention_num_buckets=8,
            dropout_rate=0.1,
            initializer_factor=0.002,
-            eos_token_ids=[1],
+            eos_token_id=1,
            pad_token_id=0,
            scope=None,
        ):
@@ -71,7 +71,7 @@ class TFT5ModelTest(TFModelTesterMixin, unittest.TestCase):
            self.relative_attention_num_buckets = relative_attention_num_buckets
            self.dropout_rate = dropout_rate
            self.initializer_factor = initializer_factor
-            self.eos_token_ids = eos_token_ids
+            self.eos_token_id = eos_token_id
            self.pad_token_id = pad_token_id
            self.scope = scope

@@ -97,7 +97,7 @@ class TFT5ModelTest(TFModelTesterMixin, unittest.TestCase):
                relative_attention_num_buckets=self.relative_attention_num_buckets,
                dropout_rate=self.dropout_rate,
                initializer_factor=self.initializer_factor,
-                eos_token_ids=self.eos_token_ids,
+                eos_token_id=self.eos_token_id,
                bos_token_id=self.pad_token_id,
                pad_token_id=self.pad_token_id,
            )
--- a/tests/test_modeling_xlm.py
+++ b/tests/test_modeling_xlm.py
@@ -29,6 +29,7 @@ if is_torch_available():
        XLMConfig,
        XLMModel,
        XLMWithLMHeadModel,
+        XLMForTokenClassification,
        XLMForQuestionAnswering,
        XLMForSequenceClassification,
        XLMForQuestionAnsweringSimple,
@@ -350,6 +351,32 @@ class XLMModelTest(ModelTesterMixin, unittest.TestCase):
                list(result["logits"].size()), [self.batch_size, self.type_sequence_label_size]
            )

+        def create_and_check_xlm_for_token_classification(
+            self,
+            config,
+            input_ids,
+            token_type_ids,
+            input_lengths,
+            sequence_labels,
+            token_labels,
+            is_impossible_labels,
+            input_mask,
+        ):
+            config.num_labels = self.num_labels
+            model = XLMForTokenClassification(config)
+            model.to(torch_device)
+            model.eval()
+
+            loss, logits = model(input_ids, attention_mask=input_mask, labels=token_labels)
+            result = {
+                "loss": loss,
+                "logits": logits,
+            }
+            self.parent.assertListEqual(
+                list(result["logits"].size()), [self.batch_size, self.seq_length, self.num_labels]
+            )
+            self.check_loss_output(result)
+
        def prepare_config_and_inputs_for_common(self):
            config_and_inputs = self.prepare_config_and_inputs()
            (
@@ -392,6 +419,10 @@ class XLMModelTest(ModelTesterMixin, unittest.TestCase):
        config_and_inputs = self.model_tester.prepare_config_and_inputs()
        self.model_tester.create_and_check_xlm_sequence_classif(*config_and_inputs)

+    def test_xlm_for_token_classification(self):
+        config_and_inputs = self.model_tester.prepare_config_and_inputs()
+        self.model_tester.create_and_check_xlm_for_token_classification(*config_and_inputs)
+
    @slow
    def test_model_from_pretrained(self):
        for model_name in list(XLM_PRETRAINED_MODEL_ARCHIVE_MAP.keys())[:1]:
--- a/tests/test_pipelines.py
+++ b/tests/test_pipelines.py
@@ -78,6 +78,15 @@ TF_FILL_MASK_FINETUNED_MODELS = [
    (("distilroberta-base", {"use_fast": False}), "distilroberta-base", None),
 ]

+SUMMARIZATION_FINETUNED_MODELS = {("bart-large-cnn", "bart-large-cnn"), ("t5-small", "t5-small")}
+TF_SUMMARIZATION_FINETUNED_MODELS = {("t5-small", "t5-small")}
+
+TRANSLATION_FINETUNED_MODELS = {
+    ("t5-small", "t5-small", "translation_en_to_de"),
+    ("t5-small", "t5-small", "translation_en_to_ro"),
+}
+TF_TRANSLATION_FINETUNED_MODELS = {("t5-small", "t5-small", "translation_en_to_fr")}
+

 class MonoColumnInputTestCase(unittest.TestCase):
    def _test_mono_column_pipeline(
@@ -252,10 +261,44 @@ class MonoColumnInputTestCase(unittest.TestCase):
        valid_inputs = ["A string like this", ["list of strings entry 1", "list of strings v2"]]
        invalid_inputs = [4, "<mask>"]
        mandatory_keys = ["summary_text"]
-        nlp = pipeline(task="summarization")
-        self._test_mono_column_pipeline(
-            nlp, valid_inputs, invalid_inputs, mandatory_keys,
-        )
+        for model, tokenizer in SUMMARIZATION_FINETUNED_MODELS:
+            nlp = pipeline(task="summarization", model=model, tokenizer=tokenizer)
+            self._test_mono_column_pipeline(
+                nlp, valid_inputs, invalid_inputs, mandatory_keys,
+            )
+
+    @require_tf
+    def test_tf_summarization(self):
+        valid_inputs = ["A string like this", ["list of strings entry 1", "list of strings v2"]]
+        invalid_inputs = [4, "<mask>"]
+        mandatory_keys = ["summary_text"]
+        for model, tokenizer in TF_SUMMARIZATION_FINETUNED_MODELS:
+            nlp = pipeline(task="summarization", model=model, tokenizer=tokenizer, framework="tf")
+            self._test_mono_column_pipeline(
+                nlp, valid_inputs, invalid_inputs, mandatory_keys,
+            )
+
+    @require_torch
+    def test_translation(self):
+        valid_inputs = ["A string like this", ["list of strings entry 1", "list of strings v2"]]
+        invalid_inputs = [4, "<mask>"]
+        mandatory_keys = ["translation_text"]
+        for model, tokenizer, task in TRANSLATION_FINETUNED_MODELS:
+            nlp = pipeline(task=task, model=model, tokenizer=tokenizer)
+            self._test_mono_column_pipeline(
+                nlp, valid_inputs, invalid_inputs, mandatory_keys,
+            )
+
+    @require_tf
+    def test_tf_translation(self):
+        valid_inputs = ["A string like this", ["list of strings entry 1", "list of strings v2"]]
+        invalid_inputs = [4, "<mask>"]
+        mandatory_keys = ["translation_text"]
+        for model, tokenizer, task in TF_TRANSLATION_FINETUNED_MODELS:
+            nlp = pipeline(task=task, model=model, tokenizer=tokenizer, framework="tf")
+            self._test_mono_column_pipeline(
+                nlp, valid_inputs, invalid_inputs, mandatory_keys,
+            )


 class MultiColumnInputTestCase(unittest.TestCase):
Author	SHA1	Message	Date
LysandreJik	6f5a12a583	Release: v2.7.0	2020-03-30 08:49:24 -04:00
Patrick von Platen	296252c49e	fix lm lables in docstring (#3529 )	2020-03-30 14:26:24 +02:00
Patrick von Platen	75ec6c9e3a	[T5] make decoder input ids optional for t5 training (#3521 ) * make decoder input ids optional for t5 training * lm_lables should not be shifted in t5 * add tests * finish shift right functionality for PT T5 * move shift right to correct class * cleaner code * replace -100 values with pad token id * add assert statement * remove unnecessary for loop * make style	2020-03-30 13:45:26 +02:00
Patrick von Platen	5b44e0a31b	[T5] Add training documenation (#3507 ) * Add clear description of how to train T5 * correct docstring in T5 * correct typo * correct docstring format * update t5 model docs * implement collins feedback * fix typo and add more explanation for sentinal tokens * delete unnecessary todos	2020-03-30 13:35:53 +02:00
Sam Shleifer	33ef7002e1	[Docs] examples/summarization/bart: Simplify CNN/DM preprocessi… (#3516 )	2020-03-29 13:25:42 -04:00
Sam Shleifer	f6a23d1911	[BART] add bart-large-xsum weights (#3422 )	2020-03-29 10:51:13 -04:00
Stefan Schweter	601ac5b1dc	[model_cards]: use MIT license for all dbmdz models	2020-03-27 18:06:25 -04:00
Patrick von Platen	17dceae7a1	Fix circle ci flaky fail of wmt example (#3485 ) * force bleu * fix wrong file name * rename file * different filenames for each example test * test files should clean up after themselves * test files should clean up after themselves * do not force bleu * correct typo * fix isort	2020-03-27 13:01:28 -04:00
Patrick von Platen	00ea100e96	add summarization and translation to notebook (#3478 )	2020-03-27 11:05:37 -04:00
Funtowicz Morgan	b08259a120	run_ner.py / bert-base-multilingual-cased can output empty tokens (#2991 ) * Use tokenizer.num_added_tokens to count number of added special_tokens instead of hardcoded numbers. Signed-off-by: Morgan Funtowicz <morgan@huggingface.co> * run_ner.py - Do not add a label to the labels_ids if word_tokens is empty. This can happen when using bert-base-multilingual-cased with an input containing an unique space. In this case, the tokenizer will output just an empty word_tokens thus leading to an non-consistent behavior over the labels_ids tokens adding one more tokens than tokens vector. Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>	2020-03-27 10:59:55 -04:00
Patrick von Platen	f4f4946836	Rename `t5-large` to `t5-base` in README.md	2020-03-27 15:57:58 +01:00
Patrick von Platen	fa9af2468a	Add T5 to docs (#3461 ) * add t5 docs basis * improve docs * add t5 docs * improve t5 docstring * add t5 tokenizer docstring * finish docstring * make style * add pretrained models * correct typo * make examples work * finalize docs	2020-03-27 10:57:16 -04:00
Lysandre Debut	ff80b73157	Add option to choose T5 model size. (#3480 ) T5-small in test isort	2020-03-27 15:56:59 +01:00
LysandreJik	e2c05f06ef	Correct indentation in docstring For some reason Sphinx extremely dislikes this and crashes.	2020-03-27 09:28:52 -04:00
Sam Shleifer	3ee431dd4c	[Bart/Memory] Two separate, smaller decoder attention masks (#3371 )	2020-03-26 21:34:15 -04:00
Manuel Romero	53fe733805	Model Cards: Fix grammar error (#3467 )	2020-03-26 21:33:33 -04:00
Sam Shleifer	c10decf7a0	[Bart: example] drop columns that are exclusively pad_token_id… (#3400 ) * trim seq_len below 1024 if there are columns full of pad_token_id * Centralize trim_batch so SummarizationDataset can use it too	2020-03-26 19:33:54 -04:00
Sam Shleifer	63f4d8cad0	[Bart/Memory] SelfAttention only returns weights if config.outp… (#3369 )	2020-03-26 18:42:39 -04:00
Sam Shleifer	2b2a2f8df2	[Bart] Fix: put dummy_inputs on correct device (#3398 ) * Dummy inputs to model.device * Move self.device to ModuleUtilsMixin	2020-03-26 18:42:09 -04:00
Sam Shleifer	1a5aefc95c	[Seq2Seq Generation] Call encoder before expanding input_ids (#3370 )	2020-03-26 18:41:19 -04:00
Sam Shleifer	39371ee454	[Bart/Memory] don't create lm_head (#3323 ) * delete lm_head, skips weight tying * Fixed s3	2020-03-26 18:40:39 -04:00
Patrick von Platen	5ad2ea06af	Add wmt translation example (#3428 ) * add translation example * make style * adapt docstring * add gpu device as input for example * small renaming * better README	2020-03-26 19:07:59 +01:00
Patrick von Platen	b4fb94fe6d	revert unpin isort commit	2020-03-26 13:19:18 -04:00
Patrick von Platen	e703e923ca	Add t5 summarization example (#3411 ) * rebase to master * change tf to pytorch * change to pytorch * small fix * renaming * add gpu training possibility * renaming * improve README * incoorporate collins feedback * better Readme * better README.md	2020-03-26 18:17:55 +01:00
sakares saengkaew	1a6c546c6f	Add missing token classification for XLM (#3277 ) * Add the missing token classification for XLM * fix styling * Add XLMForTokenClassification to AutoModelForTokenClassification class * Fix docstring typo for non-existing class * Add the missing token classification for XLM * fix styling * fix styling * Add XLMForTokenClassification to AutoModelForTokenClassification class * Fix docstring typo for non-existing class * Add missing description for AlbertForTokenClassification * fix styling * Add missing docstring for AlBert * Slow tests should be slow Co-authored-by: Sakares Saengkaew <s.sakares@gmail.com> Co-authored-by: LysandreJik <lysandre.debut@reseau.eseo.fr>	2020-03-26 10:22:13 -04:00
Patrick von Platen	311970546f	rename string in pipeline	2020-03-26 14:59:49 +01:00
Manuel Romero	7420a6a9cc	Create card for model GPT-2-finetuned-CORD19	2020-03-26 09:10:09 -04:00
Patrick von Platen	022e8fab97	Adds translation pipeline (#3419 ) * fix merge conflicts * add t5 summarization example * change parameters for t5 summarization * make style * add first code snippet for translation * only add prefixes * add prefix patterns * make style * renaming * fix conflicts * remove unused patterns * solve conflicts * fix merge conflicts * remove translation example * remove summarization example * make sure tensors are in numpy for float comparsion * re-add t5 config * fix t5 import config typo * make style * remove unused numpy statements * update doctstring * import translation pipeline	2020-03-26 13:50:58 +01:00
HUSEIN ZOLKEPLI	3c5c567507	Update model card huseinzol05/bert-base-bahasa-cased (#3425 ) * add bert bahasa readme * update readme * update readme * added xlnet	2020-03-26 07:50:27 -04:00
Patrick von Platen	9c683ef01e	Add t5 to pipeline(task='summarization') (#3413 ) * solve conflicts * move warnings below * incorporate changes * add pad_to_max_length to pipelines * add bug fix for T5 beam search * add prefix patterns * make style * fix conflicts * adapt pipelines for task specific parameters * improve docstring * remove unused patterns	2020-03-26 11:03:13 +01:00
Lysandre Debut	ffcffebe85	Force the return of token type IDs (#3439 )	2020-03-26 09:41:36 +01:00
Travis McGuire	010e0460b2	Updated/added model cards (#3435 )	2020-03-25 16:40:03 -04:00
Patrick von Platen	ffa17fe322	Extend config with task specific configs. (#3433 ) * add new default configs * change prefix default to None	2020-03-25 21:32:04 +01:00
Julien Chaumond	83272a3853	Experiment w/ dataclasses (including Py36) (#3423 ) * [ci] Also run test_examples in py37 (will revert at the end of the experiment) * InputExample: use immutable dataclass * [deps] Install dataclasses for Py<3.7 * [skip ci] Revert "[ci] Also run test_examples in py37" This reverts commit d29afd9959786b77759b0b8fa4e6b4335b952015.	2020-03-25 11:10:20 -04:00
Gabriele Sarti	ccbe839ee0	Added BioBERT-NLI model card (#3421 )	2020-03-24 21:15:55 -04:00
Andre Carrera	3d76df3a12	BART for summarization training with CNN/DM using pytorch-lightning	2020-03-24 21:00:24 -04:00
Julien Chaumond	eaabaaf750	[run_language_modeling] Fix: initialize a new model from a config object	2020-03-24 17:56:40 -04:00
Julien Chaumond	f8823bad9a	Expose missing mappings (see #3415 )	2020-03-24 17:46:25 -04:00
Julien Chaumond	d0c36a7b72	[ci] Partial revert of `18eec3a984` due to `fbc5bf10cf`	2020-03-24 12:10:43 -04:00