Compare commits

...

39 Commits

Author SHA1 Message Date
LysandreJik
6f5a12a583 Release: v2.7.0
Some checks failed
GitHub-hosted runner / check_code_quality (push) Has been cancelled
2020-03-30 08:49:24 -04:00
Patrick von Platen
296252c49e fix lm lables in docstring (#3529) 2020-03-30 14:26:24 +02:00
Patrick von Platen
75ec6c9e3a [T5] make decoder input ids optional for t5 training (#3521)
* make decoder input ids optional for t5 training

* lm_lables should not be shifted in t5

* add tests

* finish shift right functionality for PT T5

* move shift right to correct class

* cleaner code

* replace -100 values with pad token id

* add assert statement

* remove unnecessary for loop

* make style
2020-03-30 13:45:26 +02:00
Patrick von Platen
5b44e0a31b [T5] Add training documenation (#3507)
* Add clear description of how to train T5

* correct docstring in T5

* correct typo

* correct docstring format

* update t5 model docs

* implement collins feedback

* fix typo and add more explanation for sentinal tokens

* delete unnecessary todos
2020-03-30 13:35:53 +02:00
Sam Shleifer
33ef7002e1 [Docs] examples/summarization/bart: Simplify CNN/DM preprocessi… (#3516) 2020-03-29 13:25:42 -04:00
Sam Shleifer
f6a23d1911 [BART] add bart-large-xsum weights (#3422) 2020-03-29 10:51:13 -04:00
Stefan Schweter
601ac5b1dc [model_cards]: use MIT license for all dbmdz models 2020-03-27 18:06:25 -04:00
Patrick von Platen
17dceae7a1 Fix circle ci flaky fail of wmt example (#3485)
* force bleu

* fix wrong file name

* rename file

* different filenames for each example test

* test files should clean up after themselves

* test files should clean up after themselves

* do not force bleu

* correct typo

* fix isort
2020-03-27 13:01:28 -04:00
Patrick von Platen
00ea100e96 add summarization and translation to notebook (#3478) 2020-03-27 11:05:37 -04:00
Funtowicz Morgan
b08259a120 run_ner.py / bert-base-multilingual-cased can output empty tokens (#2991)
* Use tokenizer.num_added_tokens to count number of added special_tokens instead of hardcoded numbers.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* run_ner.py - Do not add a label to the labels_ids if word_tokens is empty.

This can happen when using bert-base-multilingual-cased with an input containing an unique space.
In this case, the tokenizer will output just an empty word_tokens thus leading to an non-consistent behavior
over the labels_ids tokens adding one more tokens than tokens vector.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
2020-03-27 10:59:55 -04:00
Patrick von Platen
f4f4946836 Rename t5-large to t5-base in README.md 2020-03-27 15:57:58 +01:00
Patrick von Platen
fa9af2468a Add T5 to docs (#3461)
* add t5 docs basis

* improve docs

* add t5 docs

* improve t5 docstring

* add t5 tokenizer docstring

* finish docstring

* make style

* add pretrained models

* correct typo

* make examples work

* finalize docs
2020-03-27 10:57:16 -04:00
Lysandre Debut
ff80b73157 Add option to choose T5 model size. (#3480)
T5-small in test


isort
2020-03-27 15:56:59 +01:00
LysandreJik
e2c05f06ef Correct indentation in docstring
For some reason Sphinx extremely dislikes this and crashes.
2020-03-27 09:28:52 -04:00
Sam Shleifer
3ee431dd4c [Bart/Memory] Two separate, smaller decoder attention masks (#3371) 2020-03-26 21:34:15 -04:00
Manuel Romero
53fe733805 Model Cards: Fix grammar error (#3467) 2020-03-26 21:33:33 -04:00
Sam Shleifer
c10decf7a0 [Bart: example] drop columns that are exclusively pad_token_id… (#3400)
* trim seq_len below 1024 if there are columns full of pad_token_id
* Centralize trim_batch so SummarizationDataset can use it too
2020-03-26 19:33:54 -04:00
Sam Shleifer
63f4d8cad0 [Bart/Memory] SelfAttention only returns weights if config.outp… (#3369) 2020-03-26 18:42:39 -04:00
Sam Shleifer
2b2a2f8df2 [Bart] Fix: put dummy_inputs on correct device (#3398)
* Dummy inputs to model.device

* Move self.device to ModuleUtilsMixin
2020-03-26 18:42:09 -04:00
Sam Shleifer
1a5aefc95c [Seq2Seq Generation] Call encoder before expanding input_ids (#3370) 2020-03-26 18:41:19 -04:00
Sam Shleifer
39371ee454 [Bart/Memory] don't create lm_head (#3323)
* delete lm_head, skips weight tying
* Fixed s3
2020-03-26 18:40:39 -04:00
Patrick von Platen
5ad2ea06af Add wmt translation example (#3428)
* add translation example

* make style

* adapt docstring

* add gpu device as input for example

* small renaming

* better README
2020-03-26 19:07:59 +01:00
Patrick von Platen
b4fb94fe6d revert unpin isort commit 2020-03-26 13:19:18 -04:00
Patrick von Platen
e703e923ca Add t5 summarization example (#3411)
* rebase to master

* change tf to pytorch

* change to pytorch

* small fix

* renaming

* add gpu training possibility

* renaming

* improve README

* incoorporate collins feedback

* better Readme

* better README.md
2020-03-26 18:17:55 +01:00
sakares saengkaew
1a6c546c6f Add missing token classification for XLM (#3277)
* Add the missing token classification for XLM

* fix styling

* Add XLMForTokenClassification to AutoModelForTokenClassification class

* Fix docstring typo for non-existing class

* Add the missing token classification for XLM

* fix styling

* fix styling

* Add XLMForTokenClassification to AutoModelForTokenClassification class

* Fix docstring typo for non-existing class

* Add missing description for AlbertForTokenClassification

* fix styling

* Add missing docstring for AlBert

* Slow tests should be slow

Co-authored-by: Sakares Saengkaew <s.sakares@gmail.com>
Co-authored-by: LysandreJik <lysandre.debut@reseau.eseo.fr>
2020-03-26 10:22:13 -04:00
Patrick von Platen
311970546f rename string in pipeline 2020-03-26 14:59:49 +01:00
Manuel Romero
7420a6a9cc Create card for model GPT-2-finetuned-CORD19 2020-03-26 09:10:09 -04:00
Patrick von Platen
022e8fab97 Adds translation pipeline (#3419)
* fix merge conflicts

* add t5 summarization example

* change parameters for t5 summarization

* make style

* add first code snippet for translation

* only add prefixes

* add prefix patterns

* make style

* renaming

* fix conflicts

* remove unused patterns

* solve conflicts

* fix merge conflicts

* remove translation example

* remove summarization example

* make sure tensors are in numpy for float comparsion

* re-add t5 config

* fix t5 import config typo

* make style

* remove unused numpy statements

* update doctstring

* import translation pipeline
2020-03-26 13:50:58 +01:00
HUSEIN ZOLKEPLI
3c5c567507 Update model card huseinzol05/bert-base-bahasa-cased (#3425)
* add bert bahasa readme

* update readme

* update readme

* added xlnet
2020-03-26 07:50:27 -04:00
Patrick von Platen
9c683ef01e Add t5 to pipeline(task='summarization') (#3413)
* solve conflicts

* move warnings below

* incorporate changes

* add pad_to_max_length to pipelines

* add bug fix for T5 beam search

* add prefix patterns

* make style

* fix conflicts

* adapt pipelines for task specific parameters

* improve docstring

* remove unused patterns
2020-03-26 11:03:13 +01:00
Lysandre Debut
ffcffebe85 Force the return of token type IDs (#3439) 2020-03-26 09:41:36 +01:00
Travis McGuire
010e0460b2 Updated/added model cards (#3435) 2020-03-25 16:40:03 -04:00
Patrick von Platen
ffa17fe322 Extend config with task specific configs. (#3433)
* add new default configs

* change prefix default to None
2020-03-25 21:32:04 +01:00
Julien Chaumond
83272a3853 Experiment w/ dataclasses (including Py36) (#3423)
* [ci] Also run test_examples in py37

(will revert at the end of the experiment)

* InputExample: use immutable dataclass

* [deps] Install dataclasses for Py<3.7

* [skip ci] Revert "[ci] Also run test_examples in py37"

This reverts commit d29afd9959786b77759b0b8fa4e6b4335b952015.
2020-03-25 11:10:20 -04:00
Gabriele Sarti
ccbe839ee0 Added BioBERT-NLI model card (#3421) 2020-03-24 21:15:55 -04:00
Andre Carrera
3d76df3a12 BART for summarization training with CNN/DM using pytorch-lightning 2020-03-24 21:00:24 -04:00
Julien Chaumond
eaabaaf750 [run_language_modeling] Fix: initialize a new model from a config object 2020-03-24 17:56:40 -04:00
Julien Chaumond
f8823bad9a Expose missing mappings (see #3415) 2020-03-24 17:46:25 -04:00
Julien Chaumond
d0c36a7b72 [ci] Partial revert of 18eec3a984 due to fbc5bf10cf 2020-03-24 12:10:43 -04:00
72 changed files with 5180 additions and 957 deletions

View File

@@ -85,6 +85,8 @@ jobs:
parallelism: 1
steps:
- checkout
# we need a version of isort with https://github.com/timothycrosley/isort/pull/1000
- run: sudo pip install git+git://github.com/timothycrosley/isort.git@e63ae06ec7d70b06df9e528357650281a3d3ec22#egg=isort
- run: sudo pip install .[tf,torch,quality]
- run: black --check --line-length 119 --target-version py35 examples templates tests src utils
- run: isort --check-only --recursive examples templates tests src utils

View File

@@ -26,7 +26,7 @@ author = u'huggingface'
# The short X.Y version
version = u''
# The full version, including alpha/beta/rc tags
release = u'2.6.0'
release = u'2.7.0'
# -- General configuration ---------------------------------------------------

View File

@@ -103,3 +103,4 @@ The library currently contains PyTorch and Tensorflow implementations, pre-train
model_doc/xlmroberta
model_doc/flaubert
model_doc/bart
model_doc/t5

View File

@@ -0,0 +1,101 @@
T5
----------------------------------------------------
**DISCLAIMER:** This model is still a work in progress, if you see something strange,
file a `Github Issue <https://github.com/huggingface/transformers/issues/new?assignees=&labels=&template=bug-report.md&title>`_
Overview
~~~~~
The T5 model was presented in `Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer <https://arxiv.org/pdf/1910.10683.pdf>`_ by Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, Peter J. Liu in
Here the abstract:
*Transfer learning, where a model is first pre-trained on a data-rich task before being fine-tuned on a downstream task, has emerged as a powerful technique in natural language processing (NLP). The effectiveness of transfer learning has given rise to a diversity of approaches, methodology, and practice.
In this paper, we explore the landscape of transfer learning techniques for NLP by introducing a unified framework that converts every language problem into a text-to-text format.
Our systematic study compares pre-training objectives, architectures, unlabeled datasets, transfer approaches, and other factors on dozens of language understanding tasks.
By combining the insights from our exploration with scale and our new "Colossal Clean Crawled Corpus", we achieve state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more.
To facilitate future work on transfer learning for NLP, we release our dataset, pre-trained models, and code.*
The Authors' code can be found `here <https://github.com/google-research/text-to-text-transfer-transformer>`_ .
Training
~~~~~~~~~~~~~~~~~~~~
T5 is an encoder-decoder model and converts all NLP problems into a text-to-text format. It is trained using teacher forcing.
This means that for training we always need an input sequence and a target sequence.
The input sequence is fed to the model using ``input_ids``. The target sequence is shifted to the right, *i.e.* perprended by a start-sequence token and fed to the decoder using the `decoder_input_ids`. In teacher-forcing style, the target sequence is then appended by the EOS token and corresponds to the ``lm_labels``. The PAD token is hereby used as the start-sequence token.
T5 can be trained / fine-tuned both in a supervised and unsupervised fashion.
- Unsupervised denoising training
In this setup spans of the input sequence are masked by so-called sentinel tokens (*a.k.a* unique mask tokens)
and the output sequence is formed as a concatenation of the same sentinel tokens and the *real* masked tokens.
Each sentinel tokens represents a unique mask token for this sentence and should start with ``<extra_id_1>``, ``<extrac_id_2>``, ... up to ``<extra_id_100>``. As a default 100 sentinel tokens are available in ``T5Tokenizer``.
*E.g.* the sentence "The cute dog walks in the park" with the masks put on "cute dog" and "the" should be processed as follows:
::
input_ids = tokenizer.encode('The <extra_id_1> walks in <extra_id_2> park', return_tensors='pt')
lm_labels = tokenizer.encode('<extra_id_1> cute dog <extra_id_2> the <extra_id_3> </s>', return_tensors='pt')
# the forward function automatically creates the correct decoder_input_ids
model(input_ids=input_ids, lm_labels=lm_labels)
- Supervised training
In this setup the input sequence and output sequence are standard sequence to sequence input output mapping.
In translation, *e.g.* the input sequence "The house is wonderful." and output sequence "Das Haus ist wunderbar." should
be processed as follows:
::
input_ids = tokenizer.encode('translate English to German: The house is wonderful. </s>', return_tensors='pt')
lm_labels = tokenizer.encode('Das Haus ist wunderbar. </s>', return_tensors='pt')
# the forward function automatically creates the correct decoder_input_ids
model(input_ids=input_ids, lm_labels=lm_labels)
Tips
~~~~~~~~~~~~~~~~~~~~
- T5 is an encoder-decoder model pre-trained on a multi-task mixture of unsupervised
and supervised tasks and for which each task is converted into a text-to-text format.
T5 works well on a variety of tasks out-of-the-box by prepending a different prefix to the input corresponding to each task, e.g.: for translation: *translate English to German: ..., summarize: ...*.
For more information about which prefix to use, it is easiest to look into Appendix D of the `paper <https://arxiv.org/pdf/1910.10683.pdf>`_ .
- For sequence to sequence generation, it is recommended to use ``T5ForConditionalGeneration.generate()``. The method takes care of feeding the encoded input via cross-attention layers to the decoder and auto-regressively generates the decoder output.
- T5 uses relative scalar embeddings. Encoder input padding can be done on the left and on the right.
T5Config
~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.T5Config
:members:
T5Tokenizer
~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.T5Tokenizer
:members: build_inputs_with_special_tokens, get_special_tokens_mask,
create_token_type_ids_from_sequences, save_vocabulary
T5Model
~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.T5Model
:members:
T5ForConditionalGeneration
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.T5ForConditionalGeneration
:members:
TFT5Model
~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.TFT5Model
:members:
TFT5ForConditionalGeneration
~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.TFT5ForConditionalGeneration
:members:

View File

@@ -275,7 +275,6 @@ For a list that includes community-uploaded models, refer to `https://huggingfac
| | | | FlauBERT large architecture |
| | | (see `details <https://github.com/getalp/Flaubert>`__) |
+-------------------+------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
+-------------------+------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
| Bart | ``bart-large`` | | 12-layer, 1024-hidden, 16-heads, 406M parameters |
| | | (see `details <https://github.com/pytorch/fairseq/tree/master/examples/bart>`_) |
| +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
@@ -285,6 +284,3 @@ For a list that includes community-uploaded models, refer to `https://huggingfac
| | ``bart-large-cnn`` | | 12-layer, 1024-hidden, 16-heads, 406M parameters (same as base) |
| | | | bart-large base architecture finetuned on cnn summarization task |
+-------------------+------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+
.. <https://huggingface.co/transformers/examples.html>`__

View File

@@ -112,12 +112,15 @@ def convert_examples_to_features(
label_ids = []
for word, label in zip(example.words, example.labels):
word_tokens = tokenizer.tokenize(word)
tokens.extend(word_tokens)
# Use the real label id for the first token of the word, and padding ids for the remaining tokens
label_ids.extend([label_map[label]] + [pad_token_label_id] * (len(word_tokens) - 1))
# bert-base-multilingual-cased sometimes output "nothing ([]) when calling tokenize with just a space.
if len(word_tokens) > 0:
tokens.extend(word_tokens)
# Use the real label id for the first token of the word, and padding ids for the remaining tokens
label_ids.extend([label_map[label]] + [pad_token_label_id] * (len(word_tokens) - 1))
# Account for [CLS] and [SEP] with "- 2" and with "- 3" for RoBERTa.
special_tokens_count = 3 if sep_token_extra else 2
special_tokens_count = tokenizer.num_added_tokens()
if len(tokens) > max_seq_length - special_tokens_count:
tokens = tokens[: (max_seq_length - special_tokens_count)]
label_ids = label_ids[: (max_seq_length - special_tokens_count)]

View File

@@ -3,3 +3,6 @@ tensorboard
scikit-learn
seqeval
psutil
sacrebleu
rouge-score
tensorflow_datasets

View File

@@ -38,7 +38,6 @@ from torch.utils.data.distributed import DistributedSampler
from tqdm import tqdm, trange
from transformers import (
CONFIG_MAPPING,
MODEL_WITH_LM_HEAD_MAPPING,
WEIGHTS_NAME,
AdamW,
@@ -679,7 +678,12 @@ def main():
elif args.model_name_or_path:
config = AutoConfig.from_pretrained(args.model_name_or_path, cache_dir=args.cache_dir)
else:
config = CONFIG_MAPPING[args.model_type]()
# When we release a pip version exposing CONFIG_MAPPING,
# we can do `config = CONFIG_MAPPING[args.model_type]()`.
raise ValueError(
"You are instantiating a new config instance from scratch. This is not supported, but you can do it from another script, save it,"
"and load it from here, using --config_name"
)
if args.tokenizer_name:
tokenizer = AutoTokenizer.from_pretrained(args.tokenizer_name, cache_dir=args.cache_dir)
@@ -687,8 +691,8 @@ def main():
tokenizer = AutoTokenizer.from_pretrained(args.model_name_or_path, cache_dir=args.cache_dir)
else:
raise ValueError(
"You are instantiating a new {} tokenizer. This is not supported, but you can do it from another script, save it,"
"and load it from here, using --tokenizer_name".format(AutoTokenizer.__name__)
"You are instantiating a new tokenizer from scratch. This is not supported, but you can do it from another script, save it,"
"and load it from here, using --tokenizer_name"
)
if args.block_size <= 0:
@@ -706,7 +710,7 @@ def main():
)
else:
logger.info("Training new model from scratch")
model = AutoModelWithLMHead(config=config)
model = AutoModelWithLMHead.from_config(config)
model.to(args.device)

View File

@@ -1,23 +1,29 @@
### Get the CNN Data
### Get Preprocessed CNN Data
To be able to reproduce the authors' results on the CNN/Daily Mail dataset you first need to download both CNN and Daily Mail datasets [from Kyunghyun Cho's website](https://cs.nyu.edu/~kcho/DMQA/) (the links next to "Stories") in the same folder. Then uncompress the archives by running:
```bash
tar -xvf cnn_stories.tgz && tar -xvf dailymail_stories.tgz
wget https://s3.amazonaws.com/datasets.huggingface.co/summarization/cnn_dm.tgz
tar -xzvf cnn_dm.tgz
```
this should make a directory called cnn_dm/ with files like `test.source`.
To use your own data, copy that files format. Each article to be summarized is on its own line.
### Usage
### Evaluation
To create summaries for each article in dataset, run:
```bash
python evaluate_cnn.py <path_to_test.source> cnn_test_summaries.txt
```
the default batch size, 8, fits in 16GB GPU memory, but may need to be adjusted to fit your system.
### Training
Run/modify `run_train.sh`
### Where is the code?
The core model is in `src/transformers/modeling_bart.py`. This directory only contains examples.
### (WIP) Rouge Scores
## (WIP) Rouge Scores
### Stanford CoreNLP Setup
```

View File

@@ -0,0 +1,172 @@
import argparse
import glob
import logging
import os
import time
import torch
from torch.utils.data import DataLoader
from transformer_base import BaseTransformer, add_generic_args, generic_train, get_linear_schedule_with_warmup
from utils import SummarizationDataset
logger = logging.getLogger(__name__)
class BartSystem(BaseTransformer):
mode = "language-modeling"
def __init__(self, hparams):
super(BartSystem, self).__init__(hparams, num_labels=None, mode=self.mode)
def forward(
self, input_ids, attention_mask=None, decoder_input_ids=None, decoder_attention_mask=None, lm_labels=None
):
return self.model(
input_ids,
attention_mask=attention_mask,
decoder_input_ids=decoder_input_ids,
decoder_attention_mask=decoder_attention_mask,
lm_labels=lm_labels,
)
def _step(self, batch):
y = batch["target_ids"]
y_ids = y[:, :-1].contiguous()
lm_labels = y[:, 1:].clone()
lm_labels[y[:, 1:] == self.tokenizer.pad_token_id] = -100
outputs = self(
input_ids=batch["source_ids"],
attention_mask=batch["source_mask"],
decoder_input_ids=y_ids,
lm_labels=lm_labels,
)
loss = outputs[0]
return loss
def training_step(self, batch, batch_idx):
loss = self._step(batch)
tensorboard_logs = {"train_loss": loss}
return {"loss": loss, "log": tensorboard_logs}
def validation_step(self, batch, batch_idx):
loss = self._step(batch)
return {"val_loss": loss}
def validation_end(self, outputs):
avg_loss = torch.stack([x["val_loss"] for x in outputs]).mean()
tensorboard_logs = {"val_loss": avg_loss}
return {"avg_val_loss": avg_loss, "log": tensorboard_logs}
def test_step(self, batch, batch_idx):
generated_ids = self.model.generate(
batch["source_ids"],
attention_mask=batch["source_mask"],
num_beams=1,
max_length=80,
repetition_penalty=2.5,
length_penalty=1.0,
early_stopping=True,
)
preds = [
self.tokenizer.decode(g, skip_special_tokens=True, clean_up_tokenization_spaces=True)
for g in generated_ids
]
target = [
self.tokenizer.decode(t, skip_special_tokens=True, clean_up_tokenization_spaces=True)
for t in batch["target_ids"]
]
loss = self._step(batch)
return {"val_loss": loss, "preds": preds, "target": target}
def test_end(self, outputs):
return self.validation_end(outputs)
def test_epoch_end(self, outputs):
output_test_predictions_file = os.path.join(self.hparams.output_dir, "test_predictions.txt")
output_test_targets_file = os.path.join(self.hparams.output_dir, "test_targets.txt")
# write predictions and targets for later rouge evaluation.
with open(output_test_predictions_file, "w+") as p_writer, open(output_test_targets_file, "w+") as t_writer:
for output_batch in outputs:
p_writer.writelines(s + "\n" for s in output_batch["preds"])
t_writer.writelines(s + "\n" for s in output_batch["target"])
p_writer.close()
t_writer.close()
return self.test_end(outputs)
def train_dataloader(self):
train_dataset = SummarizationDataset(
self.tokenizer, data_dir=self.hparams.data_dir, type_path="train", block_size=self.hparams.max_seq_length
)
dataloader = DataLoader(train_dataset, batch_size=self.hparams.train_batch_size)
t_total = (
(len(dataloader.dataset) // (self.hparams.train_batch_size * max(1, self.hparams.n_gpu)))
// self.hparams.gradient_accumulation_steps
* float(self.hparams.num_train_epochs)
)
scheduler = get_linear_schedule_with_warmup(
self.opt, num_warmup_steps=self.hparams.warmup_steps, num_training_steps=t_total
)
self.lr_scheduler = scheduler
return dataloader
def val_dataloader(self):
val_dataset = SummarizationDataset(
self.tokenizer, data_dir=self.hparams.data_dir, type_path="val", block_size=self.hparams.max_seq_length
)
return DataLoader(val_dataset, batch_size=self.hparams.eval_batch_size)
def test_dataloader(self):
test_dataset = SummarizationDataset(
self.tokenizer, data_dir=self.hparams.data_dir, type_path="test", block_size=self.hparams.max_seq_length
)
return DataLoader(test_dataset, batch_size=self.hparams.eval_batch_size)
@staticmethod
def add_model_specific_args(parser, root_dir):
BaseTransformer.add_model_specific_args(parser, root_dir)
# Add BART specific options
parser.add_argument(
"--max_seq_length",
default=1024,
type=int,
help="The maximum total input sequence length after tokenization. Sequences longer "
"than this will be truncated, sequences shorter will be padded.",
)
parser.add_argument(
"--data_dir",
default=None,
type=str,
required=True,
help="The input data dir. Should contain the dataset files for the CNN/DM summarization task.",
)
return parser
if __name__ == "__main__":
parser = argparse.ArgumentParser()
add_generic_args(parser, os.getcwd())
parser = BartSystem.add_model_specific_args(parser, os.getcwd())
args = parser.parse_args()
# If output_dir not provided, a folder will be generated in pwd
if args.output_dir is None:
args.output_dir = os.path.join("./results", f"{args.task}_{args.model_type}_{time.strftime('%Y%m%d_%H%M%S')}",)
os.makedirs(args.output_dir)
model = BartSystem(args)
trainer = generic_train(model, args)
# Optionally, predict on dev set and write to output_dir
if args.do_predict:
checkpoints = list(sorted(glob.glob(os.path.join(args.output_dir, "checkpointepoch=*.ckpt"), recursive=True)))
BartSystem.load_from_checkpoint(checkpoints[-1])
trainer.test(model)

View File

@@ -0,0 +1,23 @@
# Install newest ptl.
pip install -U git+http://github.com/PyTorchLightning/pytorch-lightning/
export OUTPUT_DIR_NAME=bart_sum
export CURRENT_DIR=${PWD}
export OUTPUT_DIR=${CURRENT_DIR}/${OUTPUT_DIR_NAME}
# Make output directory if it doesn't exist
mkdir -p $OUTPUT_DIR
# Add parent directory to python path to access transformer_base.py
export PYTHONPATH="../../":"${PYTHONPATH}"
python run_bart_sum.py \
--data_dir=./cnn-dailymail/cnn_dm \
--model_type=bart \
--model_name_or_path=bart-large \
--learning_rate=3e-5 \
--train_batch_size=4 \
--eval_batch_size=4 \
--output_dir=$OUTPUT_DIR \
--do_train

View File

@@ -1,4 +1,5 @@
import logging
import os
import sys
import tempfile
import unittest
@@ -8,6 +9,8 @@ from unittest.mock import patch
from .evaluate_cnn import _run_generate
output_file_name = "output_bart_sum.txt"
articles = [" New York (CNN)When Liana Barrientos was 23 years old, she got married in Westchester County."]
logging.basicConfig(level=logging.DEBUG)
@@ -19,10 +22,11 @@ class TestBartExamples(unittest.TestCase):
def test_bart_cnn_cli(self):
stream_handler = logging.StreamHandler(sys.stdout)
logger.addHandler(stream_handler)
tmp = Path(tempfile.gettempdir()) / "utest_generations.hypo"
tmp = Path(tempfile.gettempdir()) / "utest_generations_bart_sum.hypo"
with tmp.open("w") as f:
f.write("\n".join(articles))
testargs = ["evaluate_cnn.py", str(tmp), "output.txt"]
testargs = ["evaluate_cnn.py", str(tmp), output_file_name]
with patch.object(sys, "argv", testargs):
_run_generate()
self.assertTrue(Path("output.txt").exists())
self.assertTrue(Path(output_file_name).exists())
os.remove(Path(output_file_name))

View File

@@ -0,0 +1,43 @@
import os
from torch.utils.data import Dataset
class SummarizationDataset(Dataset):
def __init__(self, tokenizer, data_dir="./cnn-dailymail/cnn_dm/", type_path="train", block_size=1024):
super(SummarizationDataset,).__init__()
self.tokenizer = tokenizer
self.source = []
self.target = []
print("loading " + type_path + " source.")
with open(os.path.join(data_dir, type_path + ".source"), "r") as f:
for text in f.readlines(): # each text is a line and a full story
tokenized = tokenizer.batch_encode_plus(
[text], max_length=block_size, pad_to_max_length=True, return_tensors="pt"
)
self.source.append(tokenized)
f.close()
print("loading " + type_path + " target.")
with open(os.path.join(data_dir, type_path + ".target"), "r") as f:
for text in f.readlines(): # each text is a line and a summary
tokenized = tokenizer.batch_encode_plus(
[text], max_length=56, pad_to_max_length=True, return_tensors="pt"
)
self.target.append(tokenized)
f.close()
def __len__(self):
return len(self.source)
def __getitem__(self, index):
source_ids = self.source[index]["input_ids"].squeeze()
target_ids = self.target[index]["input_ids"].squeeze()
src_mask = self.source[index]["attention_mask"].squeeze() # might need to squeeze
return {"source_ids": source_ids, "source_mask": src_mask, "target_ids": target_ids}

View File

@@ -0,0 +1,25 @@
***This script evaluates the the multitask pre-trained checkpoint for ``t5-base`` (see paper [here](https://arxiv.org/pdf/1910.10683.pdf)) on the CNN/Daily Mail test dataset. Please note that the results in the paper were attained using a model fine-tuned on summarization, so that results will be worse here by approx. 0.5 ROUGE points***
### Get the CNN Data
First, you need to download the CNN data. It's about ~400 MB and can be downloaded by
running
```bash
python download_cnn_daily_mail.py cnn_articles_input_data.txt cnn_articles_reference_summaries.txt
```
You should confirm that each file has 11490 lines:
```bash
wc -l cnn_articles_input_data.txt # should print 11490
wc -l cnn_articles_reference_summaries.txt # should print 11490
```
### Usage
To create summaries for each article in dataset, run:
```bash
python evaluate_cnn.py cnn_articles_input_data.txt cnn_generated_articles_summaries.txt cnn_articles_reference_summaries.txt rouge_score.txt
```
The default batch size, 8, fits in 16GB GPU memory, but may need to be adjusted to fit your system.
The rouge scores "rouge1, rouge2, rougeL" are automatically created and saved in ``rouge_score.txt``.

View File

View File

@@ -0,0 +1,31 @@
import argparse
from pathlib import Path
import tensorflow_datasets as tfds
def main(input_path, reference_path, data_dir):
cnn_ds = tfds.load("cnn_dailymail", split="test", shuffle_files=False, data_dir=data_dir)
cnn_ds_iter = tfds.as_numpy(cnn_ds)
test_articles_file = Path(input_path).open("w")
test_summaries_file = Path(reference_path).open("w")
for example in cnn_ds_iter:
test_articles_file.write(example["article"].decode("utf-8") + "\n")
test_articles_file.flush()
test_summaries_file.write(example["highlights"].decode("utf-8").replace("\n", " ") + "\n")
test_summaries_file.flush()
if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument("input_path", type=str, help="where to save the articles input data")
parser.add_argument(
"reference_path", type=str, help="where to save the reference summaries",
)
parser.add_argument(
"--data_dir", type=str, default="~/tensorflow_datasets", help="where to save the tensorflow datasets.",
)
args = parser.parse_args()
main(args.input_path, args.reference_path, args.data_dir)

View File

@@ -0,0 +1,101 @@
import argparse
from pathlib import Path
import torch
from tqdm import tqdm
from rouge_score import rouge_scorer, scoring
from transformers import T5ForConditionalGeneration, T5Tokenizer
def chunks(lst, n):
"""Yield successive n-sized chunks from lst."""
for i in range(0, len(lst), n):
yield lst[i : i + n]
def generate_summaries(lns, output_file_path, model_size, batch_size, device):
output_file = Path(output_file_path).open("w")
model = T5ForConditionalGeneration.from_pretrained(model_size)
model.to(device)
tokenizer = T5Tokenizer.from_pretrained(model_size)
# update config with summarization specific params
task_specific_params = model.config.task_specific_params
if task_specific_params is not None:
model.config.update(task_specific_params.get("summarization", {}))
for batch in tqdm(list(chunks(lns, batch_size))):
batch = [model.config.prefix + text for text in batch]
dct = tokenizer.batch_encode_plus(batch, max_length=512, return_tensors="pt", pad_to_max_length=True)
input_ids = dct["input_ids"].to(device)
attention_mask = dct["attention_mask"].to(device)
summaries = model.generate(input_ids=input_ids, attention_mask=attention_mask)
dec = [tokenizer.decode(g, skip_special_tokens=True, clean_up_tokenization_spaces=False) for g in summaries]
for hypothesis in dec:
output_file.write(hypothesis + "\n")
output_file.flush()
def calculate_rouge(output_lns, reference_lns, score_path):
score_file = Path(score_path).open("w")
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
aggregator = scoring.BootstrapAggregator()
for reference_ln, output_ln in zip(reference_lns, output_lns):
scores = scorer.score(reference_ln, output_ln)
aggregator.add_scores(scores)
result = aggregator.aggregate()
score_file.write(
"ROUGE_1: \n{} \n\n ROUGE_2: \n{} \n\n ROUGE_L: \n{} \n\n".format(
result["rouge1"], result["rouge2"], result["rougeL"]
)
)
def run_generate():
parser = argparse.ArgumentParser()
parser.add_argument(
"model_size",
type=str,
help="T5 model size, either 't5-small', 't5-base' or 't5-large'. Defaults to base.",
default="t5-base",
)
parser.add_argument(
"input_path", type=str, help="like cnn_dm/test_articles_input.txt",
)
parser.add_argument(
"output_path", type=str, help="where to save summaries",
)
parser.add_argument("reference_path", type=str, help="like cnn_dm/test_reference_summaries.txt")
parser.add_argument(
"score_path", type=str, help="where to save the rouge score",
)
parser.add_argument(
"--batch_size", type=int, default=8, required=False, help="batch size: how many to summarize at a time",
)
parser.add_argument(
"--no_cuda", default=False, type=bool, help="Whether to force the execution on CPU.",
)
args = parser.parse_args()
args.device = torch.device("cuda" if torch.cuda.is_available() and not args.no_cuda else "cpu")
source_lns = [x.rstrip() for x in open(args.input_path).readlines()]
generate_summaries(source_lns, args.output_path, args.model_size, args.batch_size, args.device)
output_lns = [x.rstrip() for x in open(args.output_path).readlines()]
reference_lns = [x.rstrip() for x in open(args.reference_path).readlines()]
calculate_rouge(output_lns, reference_lns, args.score_path)
if __name__ == "__main__":
run_generate()

View File

@@ -0,0 +1,35 @@
import logging
import os
import sys
import tempfile
import unittest
from pathlib import Path
from unittest.mock import patch
from .evaluate_cnn import run_generate
output_file_name = "output_t5_sum.txt"
score_file_name = "score_t5_sum.txt"
articles = ["New York (CNN)When Liana Barrientos was 23 years old, she got married in Westchester County."]
logging.basicConfig(level=logging.DEBUG)
logger = logging.getLogger()
class TestT5Examples(unittest.TestCase):
def test_t5_cli(self):
stream_handler = logging.StreamHandler(sys.stdout)
logger.addHandler(stream_handler)
tmp = Path(tempfile.gettempdir()) / "utest_generations_t5_sum.hypo"
with tmp.open("w") as f:
f.write("\n".join(articles))
testargs = ["evaluate_cnn.py", "t5-small", str(tmp), output_file_name, str(tmp), score_file_name]
with patch.object(sys, "argv", testargs):
run_generate()
self.assertTrue(Path(output_file_name).exists())
self.assertTrue(Path(score_file_name).exists())
os.remove(Path(output_file_name))
os.remove(Path(score_file_name))

View File

@@ -53,10 +53,9 @@ class BaseTransformer(pl.LightningModule):
super(BaseTransformer, self).__init__()
self.hparams = hparams
self.hparams.model_type = self.hparams.model_type.lower()
config = AutoConfig.from_pretrained(
self.hparams.config_name if self.hparams.config_name else self.hparams.model_name_or_path,
num_labels=num_labels,
**({"num_labels": num_labels} if num_labels is not None else {}),
cache_dir=self.hparams.cache_dir if self.hparams.cache_dir else None,
)
tokenizer = AutoTokenizer.from_pretrained(

View File

@@ -0,0 +1,51 @@
***This script evaluates the multitask pre-trained checkpoint for ``t5-base`` (see paper [here](https://arxiv.org/pdf/1910.10683.pdf)) on the English to German WMT dataset. Please note that the results in the paper were attained using a model fine-tuned on translation, so that results will be worse here by approx. 1.5 BLEU points***
### Intro
This example shows how T5 (here the official [paper](https://arxiv.org/abs/1910.10683)) can be
evaluated on the WMT English-German dataset.
### Get the WMT Data
To be able to reproduce the authors' results on WMT English to German, you first need to download
the WMT14 en-de news datasets.
Go on Stanford's official NLP [website](https://nlp.stanford.edu/projects/nmt/) and find "newstest2013.en" and "newstest2013.de" under WMT'14 English-German data or download the dataset directly via:
```bash
curl https://nlp.stanford.edu/projects/nmt/data/wmt14.en-de/newstest2013.en > newstest2013.en
curl https://nlp.stanford.edu/projects/nmt/data/wmt14.en-de/newstest2013.de > newstest2013.de
```
You should have 3000 sentence in each file. You can verify this by running:
```bash
wc -l newstest2013.en # should give 3000
```
### Usage
Let's check the longest and shortest sentence in our file to find reasonable decoding hyperparameters:
Get the longest and shortest sentence:
```bash
awk '{print NF}' newstest2013.en | sort -n | head -1 # shortest sentence has 1 word
awk '{print NF}' newstest2013.en | sort -n | tail -1 # longest sentence has 106 words
```
We will set our `max_length` to ~3 times the longest sentence and leave `min_length` to its default value of 0.
We decode with beam search `num_beams=4` as proposed in the paper. Also as is common in beam search we set `early_stopping=True` and `length_penalty=2.0`.
To create translation for each in dataset and get a final BLEU score, run:
```bash
python evaluate_wmt.py <path_to_newstest2013.en> newstest2013_de_translations.txt <path_to_newstest2013.de> newsstest2013_en_de_bleu.txt
```
the default batch size, 16, fits in 16GB GPU memory, but may need to be adjusted to fit your system.
### Where is the code?
The core model is in `src/transformers/modeling_t5.py`. This directory only contains examples.
### BLEU Scores
The BLEU score is calculated using [sacrebleu](https://github.com/mjpost/sacreBLEU) by mjpost.
To get the BLEU score we used

View File

View File

@@ -0,0 +1,90 @@
import argparse
from pathlib import Path
import torch
from tqdm import tqdm
from sacrebleu import corpus_bleu
from transformers import T5ForConditionalGeneration, T5Tokenizer
def chunks(lst, n):
"""Yield successive n-sized chunks from lst."""
for i in range(0, len(lst), n):
yield lst[i : i + n]
def generate_translations(lns, output_file_path, batch_size, device):
output_file = Path(output_file_path).open("w")
model = T5ForConditionalGeneration.from_pretrained("t5-base")
model.to(device)
tokenizer = T5Tokenizer.from_pretrained("t5-base")
# update config with summarization specific params
task_specific_params = model.config.task_specific_params
if task_specific_params is not None:
model.config.update(task_specific_params.get("translation_en_to_de", {}))
for batch in tqdm(list(chunks(lns, batch_size))):
batch = [model.config.prefix + text for text in batch]
dct = tokenizer.batch_encode_plus(batch, max_length=512, return_tensors="pt", pad_to_max_length=True)
input_ids = dct["input_ids"].to(device)
attention_mask = dct["attention_mask"].to(device)
translations = model.generate(input_ids=input_ids, attention_mask=attention_mask)
dec = [tokenizer.decode(g, skip_special_tokens=True, clean_up_tokenization_spaces=False) for g in translations]
for hypothesis in dec:
output_file.write(hypothesis + "\n")
output_file.flush()
def calculate_bleu_score(output_lns, refs_lns, score_path):
bleu = corpus_bleu(output_lns, [refs_lns])
result = "BLEU score: {}".format(bleu.score)
score_file = Path(score_path).open("w")
score_file.write(result)
def run_generate():
parser = argparse.ArgumentParser()
parser.add_argument(
"input_path", type=str, help="like wmt/newstest2013.en",
)
parser.add_argument(
"output_path", type=str, help="where to save translation",
)
parser.add_argument(
"reference_path", type=str, help="like wmt/newstest2013.de",
)
parser.add_argument(
"score_path", type=str, help="where to save the bleu score",
)
parser.add_argument(
"--batch_size", type=int, default=16, required=False, help="batch size: how many to summarize at a time",
)
parser.add_argument(
"--no_cuda", default=False, type=bool, help="Whether to force the execution on CPU.",
)
args = parser.parse_args()
args.device = torch.device("cuda" if torch.cuda.is_available() and not args.no_cuda else "cpu")
dash_pattern = (" ##AT##-##AT## ", "-")
input_lns = [x.strip().replace(dash_pattern[0], dash_pattern[1]) for x in open(args.input_path).readlines()]
generate_translations(input_lns, args.output_path, args.batch_size, args.device)
output_lns = [x.strip() for x in open(args.output_path).readlines()]
refs_lns = [x.strip().replace(dash_pattern[0], dash_pattern[1]) for x in open(args.reference_path).readlines()]
calculate_bleu_score(output_lns, refs_lns, args.score_path)
if __name__ == "__main__":
run_generate()

View File

@@ -0,0 +1,43 @@
import logging
import os
import sys
import tempfile
import unittest
from pathlib import Path
from unittest.mock import patch
from .evaluate_wmt import run_generate
text = ["When Liana Barrientos was 23 years old, she got married in Westchester County."]
translation = ["Als Liana Barrientos 23 Jahre alt war, heiratete sie in Westchester County."]
output_file_name = "output_t5_trans.txt"
score_file_name = "score_t5_trans.txt"
logging.basicConfig(level=logging.DEBUG)
logger = logging.getLogger()
class TestT5Examples(unittest.TestCase):
def test_t5_cli(self):
stream_handler = logging.StreamHandler(sys.stdout)
logger.addHandler(stream_handler)
tmp_source = Path(tempfile.gettempdir()) / "utest_generations_t5_trans.hypo"
with tmp_source.open("w") as f:
f.write("\n".join(text))
tmp_target = Path(tempfile.gettempdir()) / "utest_generations_t5_trans.target"
with tmp_target.open("w") as f:
f.write("\n".join(translation))
testargs = ["evaluate_wmt.py", str(tmp_source), output_file_name, str(tmp_target), score_file_name]
with patch.object(sys, "argv", testargs):
run_generate()
self.assertTrue(Path(output_file_name).exists())
self.assertTrue(Path(score_file_name).exists())
os.remove(Path(output_file_name))
os.remove(Path(score_file_name))

View File

@@ -320,7 +320,9 @@ def convert_examples_to_features(
else:
text_b = example.question + " " + ending
inputs = tokenizer.encode_plus(text_a, text_b, add_special_tokens=True, max_length=max_length,)
inputs = tokenizer.encode_plus(
text_a, text_b, add_special_tokens=True, max_length=max_length, return_token_type_ids=True
)
if "num_truncated_tokens" in inputs and inputs["num_truncated_tokens"] > 0:
logger.info(
"Attention! you are cropping tokens (swag task is ok). "

View File

@@ -1,5 +1,6 @@
---
language: german
license: mit
---
# 🤗 + 📚 dbmdz German BERT models

View File

@@ -1,5 +1,6 @@
---
language: german
license: mit
tags:
- "historic german"
---

View File

@@ -1,5 +1,6 @@
---
language: german
license: mit
tags:
- "historic german"
---

View File

@@ -1,5 +1,6 @@
---
language: german
license: mit
---
# 🤗 + 📚 dbmdz German BERT models

View File

@@ -1,5 +1,6 @@
---
language: italian
license: mit
---
# 🤗 + 📚 dbmdz BERT models

View File

@@ -1,5 +1,6 @@
---
language: italian
license: mit
---
# 🤗 + 📚 dbmdz BERT models

View File

@@ -1,5 +1,6 @@
---
language: italian
license: mit
---
# 🤗 + 📚 dbmdz BERT models

View File

@@ -1,5 +1,6 @@
---
language: italian
license: mit
---
# 🤗 + 📚 dbmdz BERT models

View File

@@ -1,5 +1,6 @@
---
language: turkish
license: mit
---
# 🤗 + 📚 dbmdz Turkish BERT model

View File

@@ -1,5 +1,6 @@
---
language: turkish
license: mit
---
# 🤗 + 📚 dbmdz Turkish BERT model

View File

@@ -1,5 +1,6 @@
---
language: turkish
license: mit
---
# 🤗 + 📚 dbmdz Turkish BERT model

View File

@@ -1,5 +1,6 @@
---
language: turkish
license: mit
---
# 🤗 + 📚 dbmdz Turkish BERT model

View File

@@ -1,5 +1,6 @@
---
language: turkish
license: mit
---
# 🤗 + 📚 dbmdz Distilled Turkish BERT model

View File

@@ -0,0 +1,37 @@
# BioBERT-NLI
This is the model [BioBERT](https://github.com/dmis-lab/biobert) [1] fine-tuned on the [SNLI](https://nlp.stanford.edu/projects/snli/) and the [MultiNLI](https://www.nyu.edu/projects/bowman/multinli/) datasets using the [`sentence-transformers` library](https://github.com/UKPLab/sentence-transformers/) to produce universal sentence embeddings [2].
The model uses the original BERT wordpiece vocabulary and was trained using the **average pooling strategy** and a **softmax loss**.
**Base model**: `monologg/biobert_v1.1_pubmed` from HuggingFace's `AutoModel`.
**Training time**: ~6 hours on the NVIDIA Tesla P100 GPU provided in Kaggle Notebooks.
**Parameters**:
| Parameter | Value |
|------------------|-------|
| Batch size | 64 |
| Training steps | 30000 |
| Warmup steps | 1450 |
| Lowercasing | False |
| Max. Seq. Length | 128 |
**Performances**: The performance was evaluated on the test portion of the [STS dataset](http://ixa2.si.ehu.es/stswiki/index.php/STSbenchmark) using Spearman rank correlation and compared to the performances of a general BERT base model obtained with the same procedure to verify their similarity.
| Model | Score |
|-------------------------------|-------------|
| `biobert-nli` (this) | 73.40 |
| `gsarti/scibert-nli` | 74.50 |
| `bert-base-nli-mean-tokens`[3]| 77.12 |
An example usage for similarity-based scientific paper retrieval is provided in the [Covid Papers Browser](https://github.com/gsarti/covid-papers-browser) repository.
**References:**
[1] J. Lee et al, [BioBERT: a pre-trained biomedical language representation model for biomedical text mining](https://academic.oup.com/bioinformatics/article/36/4/1234/5566506)
[2] A. Conneau et al., [Supervised Learning of Universal Sentence Representations from Natural Language Inference Data](https://www.aclweb.org/anthology/D17-1070/)
[3] N. Reimers et I. Gurevych, [Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks](https://www.aclweb.org/anthology/D19-1410/)

View File

@@ -32,13 +32,54 @@ Preprocessing steps can reproduce from here, [Malaya/pretrained-model/preprocess
You can use this model by installing `torch` or `tensorflow` and Huggingface library `transformers`. And you can use it directly by initializing it like this:
```python
from transformers import XLNetTokenizer, BertModel
from transformers import AlbertTokenizer, BertModel
model = BertModel.from_pretrained('huseinzol05/bert-base-bahasa-cased')
tokenizer = XLNetTokenizer.from_pretrained('huseinzol05/bert-base-bahasa-cased')
tokenizer = AlbertTokenizer.from_pretrained(
'huseinzol05/bert-base-bahasa-cased',
unk_token = '[UNK]',
pad_token = '[PAD]',
do_lower_case = False,
)
```
We use [google/sentencepiece](https://github.com/google/sentencepiece) to train the tokenizer, so to use it, need to load from `XLNetTokenizer`.
We use [google/sentencepiece](https://github.com/google/sentencepiece) to train the tokenizer, so to use it, need to load from `AlbertTokenizer`.
## Example using AutoModelWithLMHead
```python
from transformers import AlbertTokenizer, AutoModelWithLMHead, pipeline
model = AutoModelWithLMHead.from_pretrained('huseinzol05/bert-base-bahasa-cased')
tokenizer = AlbertTokenizer.from_pretrained(
'huseinzol05/bert-base-bahasa-cased',
unk_token = '[UNK]',
pad_token = '[PAD]',
do_lower_case = False,
)
fill_mask = pipeline('fill-mask', model = model, tokenizer = tokenizer)
print(fill_mask('makan ayam dengan [MASK]'))
```
Output is,
```text
[{'sequence': '[CLS] makan ayam dengan rendang[SEP]',
'score': 0.10812027007341385,
'token': 2446},
{'sequence': '[CLS] makan ayam dengan kicap[SEP]',
'score': 0.07653367519378662,
'token': 12928},
{'sequence': '[CLS] makan ayam dengan nasi[SEP]',
'score': 0.06839974224567413,
'token': 450},
{'sequence': '[CLS] makan ayam dengan ayam[SEP]',
'score': 0.059544261544942856,
'token': 638},
{'sequence': '[CLS] makan ayam dengan sayur[SEP]',
'score': 0.05294966697692871,
'token': 1639}]
```
## Results

View File

@@ -0,0 +1,64 @@
---
language: malay
---
# Bahasa XLNet Model
Pretrained XLNet base language model for Malay and Indonesian.
## Pretraining Corpus
`XLNET-base-bahasa-cased` model was pretrained on ~1.8 Billion words. We trained on both standard and social media language structures, and below is list of data we trained on,
1. [dumping wikipedia](https://github.com/huseinzol05/Malaya-Dataset#wikipedia-1).
2. [local instagram](https://github.com/huseinzol05/Malaya-Dataset#instagram).
3. [local twitter](https://github.com/huseinzol05/Malaya-Dataset#twitter-1).
4. [local news](https://github.com/huseinzol05/Malaya-Dataset#public-news).
5. [local parliament text](https://github.com/huseinzol05/Malaya-Dataset#parliament).
6. [local singlish/manglish text](https://github.com/huseinzol05/Malaya-Dataset#singlish-text).
7. [IIUM Confession](https://github.com/huseinzol05/Malaya-Dataset#iium-confession).
8. [Wattpad](https://github.com/huseinzol05/Malaya-Dataset#wattpad).
9. [Academia PDF](https://github.com/huseinzol05/Malaya-Dataset#academia-pdf).
Preprocessing steps can reproduce from here, [Malaya/pretrained-model/preprocess](https://github.com/huseinzol05/Malaya/tree/master/pretrained-model/preprocess).
## Pretraining details
- This model was trained using zihangdai XLNet's github [repository](https://github.com/zihangdai/xlnet) on 3 Titan V100 32GB VRAM.
- All steps can reproduce from here, [Malaya/pretrained-model/xlnet](https://github.com/huseinzol05/Malaya/tree/master/pretrained-model/xlnet).
## Load Pretrained Model
You can use this model by installing `torch` or `tensorflow` and Huggingface library `transformers`. And you can use it directly by initializing it like this:
```python
from transformers import XLNetTokenizer, XLNetModel
model = XLNetModel.from_pretrained('huseinzol05/xlnet-base-bahasa-cased')
tokenizer = XLNetTokenizer.from_pretrained(
'huseinzol05/xlnet-base-bahasa-cased', do_lower_case = False
)
```
## Example using AutoModelWithLMHead
```python
from transformers import AlbertTokenizer, AutoModelWithLMHead, pipeline
model = AutoModelWithLMHead.from_pretrained('huseinzol05/xlnet-base-bahasa-cased')
tokenizer = XLNetTokenizer.from_pretrained(
'huseinzol05/xlnet-base-bahasa-cased', do_lower_case = False
)
fill_mask = pipeline('fill-mask', model = model, tokenizer = tokenizer)
print(fill_mask('makan ayam dengan [MASK]'))
```
## Results
For further details on the model performance, simply checkout accuracy page from Malaya, https://malaya.readthedocs.io/en/latest/Accuracy.html, we compared with traditional models.
## Acknowledgement
Thanks to [Im Big](https://www.facebook.com/imbigofficial/), [LigBlou](https://www.facebook.com/ligblou), [Mesolitica](https://mesolitica.com/) and [KeyReply](https://www.keyreply.com/) for sponsoring AWS, Google and GPU clouds to train XLNet for Bahasa.

View File

@@ -0,0 +1,60 @@
---
language: english
thumbnail:
---
# GPT-2 + CORD19 dataset : 🦠 ✍ ⚕
**GPT-2** fine-tuned on **biorxiv_medrxiv** and **comm_use_subset files** from [CORD-19](https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge) dataset.
## Datasets details:
| Dataset | # Files |
| ---------------------- | ----- |
| biorxiv_medrxiv | 885 |
| comm_use_subse | 9K |
## Model training
The model was trained on a Tesla P100 GPU and 25GB of RAM with the following command:
```bash
export TRAIN_FILE=/path/to/dataset/train.txt
python run_language_modeling.py \
--model_type gpt2 \
--model_name_or_path gpt2 \
--do_train \
--train_data_file $TRAIN_FILE \
--num_train_epochs 4 \
--output_dir model_output \
--overwrite_output_dir \
--save_steps 10000 \
--per_gpu_train_batch_size 3
```
<img alt="training loss" src="https://svgshare.com/i/JTf.svg' title='GTP-2-finetuned-CORDS19-loss" width="600" height="300" />
## Model in action / Example of usage: ✒
You can get the following script [here](https://github.com/huggingface/transformers/blob/master/examples/run_generation.py)
```bash
python run_generation.py \
--model_type gpt2 \
--model_name_or_path mrm8488/GPT-2-finetuned-CORD19 \
--length 200
```
```txt
# Input: the effects of COVID-19 on the lungs
# Output: === GENERATED SEQUENCE 1 ===
the effects of COVID-19 on the lungs are currently debated (86). The role of this virus in the pathogenesis of pneumonia and lung cancer is still debated. MERS-CoV is also known to cause acute respiratory distress syndrome (87) and is associated with increased expression of pulmonary fibrosis markers (88). Thus, early airway inflammation may play an important role in the pathogenesis of coronavirus pneumonia and may contribute to the severe disease and/or mortality observed in coronavirus patients.
Pneumonia is an acute, often fatal disease characterized by severe edema, leakage of oxygen and bronchiolar inflammation. Viruses include coronaviruses, and the role of oxygen depletion is complicated by lung injury and fibrosis in the lung, in addition to susceptibility to other lung diseases. The progression of the disease may be variable, depending on the lung injury, pathologic role, prognosis, and the immune status of the patient. Inflammatory responses to respiratory viruses cause various pathologies of the respiratory
```
> Created by [Manuel Romero/@mrm8488](https://twitter.com/mrm8488) | [LinkedIn](https://www.linkedin.com/in/manuel-romero-cs/)
> Made with <span style="color: #e25555;">&hearts;</span> in Spain

View File

@@ -5,7 +5,7 @@ thumbnail: https://i.imgur.com/jgBdimh.png
# Spanish BERT (BETO) + POS
This model is a fine-tuned on [NER-C](https://www.kaggle.com/nltkdata/conll-corpora) Of the Spanish BERT cased [(BETO)](https://github.com/dccuchile/beto) for **POS** (Part of Speech tagging) downstream task.
This model is a fine-tuned on Spanish [CONLL CORPORA](https://www.kaggle.com/nltkdata/conll-corpora) version of the Spanish BERT cased [(BETO)](https://github.com/dccuchile/beto) for **POS** (Part of Speech tagging) downstream task.
## Details of the downstream task (POS) - Dataset

View File

@@ -1,22 +1,24 @@
This model is ALBERT base v2 trained on SQuAD v2 as:
This model is [ALBERT base v2](https://huggingface.co/albert-base-v2) trained on SQuAD v2 as:
```
python run_squad.py
--model_type albert
--model_name_or_path albert-base-v2
--do_train
--do_eval
--overwrite_cache
--do_lower_case
--version_2_with_negative
--train_file $SQUAD_DIR/train-v2.0.json
--predict_file $SQUAD_DIR/dev-v2.0.json
--per_gpu_train_batch_size 8
--num_train_epochs 3
--learning_rate 3e-5
--max_seq_length 384
--doc_stride 128
--output_dir ./tmp/albert_base_fine/
export SQUAD_DIR=../../squad2
python3 run_squad.py
--model_type albert
--model_name_or_path albert-base-v2
--do_train
--do_eval
--overwrite_cache
--do_lower_case
--version_2_with_negative
--save_steps 100000
--train_file $SQUAD_DIR/train-v2.0.json
--predict_file $SQUAD_DIR/dev-v2.0.json
--per_gpu_train_batch_size 8
--num_train_epochs 3
--learning_rate 3e-5
--max_seq_length 384
--doc_stride 128
--output_dir ./tmp/albert_fine/
```
Performance on a dev subset is close to the original paper:

View File

@@ -1,22 +1,24 @@
This model is BERT base uncased trained on SQuAD v2 as:
This model is [BERT base uncased](https://huggingface.co/bert-base-uncased) trained on SQuAD v2 as:
```
python run_squad.py
--model_type bert
--model_name_or_path bert-base-uncased
--do_train
--do_eval
--overwrite_cache
--do_lower_case
--version_2_with_negative
--train_file $SQUAD_DIR/train-v2.0.json
--predict_file $SQUAD_DIR/dev-v2.0.json
--per_gpu_train_batch_size 8
--num_train_epochs 3
--learning_rate 3e-5
--max_seq_length 384
--doc_stride 128
--output_dir ./tmp/bert_base_fine/
export SQUAD_DIR=../../squad2
python3 run_squad.py
--model_type bert
--model_name_or_path bert-base-uncased
--do_train
--do_eval
--overwrite_cache
--do_lower_case
--version_2_with_negative
--save_steps 100000
--train_file $SQUAD_DIR/train-v2.0.json
--predict_file $SQUAD_DIR/dev-v2.0.json
--per_gpu_train_batch_size 8
--num_train_epochs 3
--learning_rate 3e-5
--max_seq_length 384
--doc_stride 128
--output_dir ./tmp/bert_fine_tuned/
```
Performance on a dev subset is close to the original paper:

View File

@@ -0,0 +1,45 @@
This model is [Distilbert base uncased](https://huggingface.co/distilbert-base-uncased) trained on SQuAD v2 as:
```
export SQUAD_DIR=../../squad2
python3 run_squad.py
--model_type distilbert
--model_name_or_path distilbert-base-uncased
--do_train
--do_eval
--overwrite_cache
--do_lower_case
--version_2_with_negative
--save_steps 100000
--train_file $SQUAD_DIR/train-v2.0.json
--predict_file $SQUAD_DIR/dev-v2.0.json
--per_gpu_train_batch_size 8
--num_train_epochs 3
--learning_rate 3e-5
--max_seq_length 384
--doc_stride 128
--output_dir ./tmp/distilbert_fine_tuned/
```
Performance on a dev subset is close to the original paper:
```
Results:
{
'exact': 64.88976637051661,
'f1': 68.1776176526635,
'total': 6078,
'HasAns_exact': 69.7594501718213,
'HasAns_f1': 76.62665295288285,
'HasAns_total': 2910,
'NoAns_exact': 60.416666666666664,
'NoAns_f1': 60.416666666666664,
'NoAns_total': 3168,
'best_exact': 64.88976637051661,
'best_exact_thresh': 0.0,
'best_f1': 68.17761765266337,
'best_f1_thresh': 0.0
}
```
We are hopeful this might save you time, energy, and compute. Cheers!

View File

@@ -0,0 +1,44 @@
This model is [Distilroberta base](https://huggingface.co/distilroberta-base) trained on SQuAD v2 as:
```
export SQUAD_DIR=../../squad2
python3 run_squad.py
--model_type robberta
--model_name_or_path distilroberta-base
--do_train
--do_eval
--overwrite_cache
--do_lower_case
--version_2_with_negative
--save_steps 100000
--train_file $SQUAD_DIR/train-v2.0.json
--predict_file $SQUAD_DIR/dev-v2.0.json
--per_gpu_train_batch_size 8
--num_train_epochs 3
--learning_rate 3e-5
--max_seq_length 384
--doc_stride 128
--output_dir ./tmp/distilroberta_fine_tuned/
```
Performance on a dev subset is close to the original paper:
```
Results:
{
'exact': 70.9279368213228,
'f1': 74.60439802429168,
'total': 6078,
'HasAns_exact': 67.62886597938144,
'HasAns_f1': 75.30774267754136,
'HasAns_total': 2910,
'NoAns_exact': 73.95833333333333,
'NoAns_f1': 73.95833333333333, 'NoAns_total': 3168,
'best_exact': 70.94438960184272,
'best_exact_thresh': 0.0,
'best_f1': 74.62085080481161,
'best_f1_thresh': 0.0
}
```
We are hopeful this might save you time, energy, and compute. Cheers!

File diff suppressed because it is too large Load Diff

View File

@@ -76,14 +76,14 @@ extras["testing"] = ["pytest", "pytest-xdist"]
extras["docs"] = ["recommonmark", "sphinx", "sphinx-markdown-tables", "sphinx-rtd-theme"]
extras["quality"] = [
"black",
"isort",
"isort @ git+git://github.com/timothycrosley/isort.git@e63ae06ec7d70b06df9e528357650281a3d3ec22#egg=isort",
"flake8",
]
extras["dev"] = extras["testing"] + extras["quality"] + ["mecab-python3", "scikit-learn", "tensorflow", "torch"]
setup(
name="transformers",
version="2.6.0",
version="2.7.0",
author="Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Sam Shleifer, Google AI Language Team Authors, Open AI team Authors, Facebook AI Authors, Carnegie Mellon University Authors",
author_email="thomas@huggingface.co",
description="State-of-the-art Natural Language Processing for TensorFlow 2.0 and PyTorch",
@@ -97,6 +97,8 @@ setup(
install_requires=[
"numpy",
"tokenizers == 0.5.2",
# dataclasses for Python versions that don't have it
"dataclasses;python_version<'3.7'",
# accessing files from S3 directly
"boto3",
# filesystem locks e.g. to prevent parallel downloads

View File

@@ -2,7 +2,7 @@
# There's no way to ignore "F401 '...' imported but unused" warnings in this
# module, but to preserve other warnings. So, don't check this module at all.
__version__ = "2.6.0"
__version__ = "2.7.0"
# Work around to update TensorFlow's absl.logging threshold which alters the
# default Python logging output behavior when present.
@@ -32,7 +32,7 @@ from .benchmark_utils import (
stop_memory_tracing,
)
from .configuration_albert import ALBERT_PRETRAINED_CONFIG_ARCHIVE_MAP, AlbertConfig
from .configuration_auto import ALL_PRETRAINED_CONFIG_ARCHIVE_MAP, AutoConfig
from .configuration_auto import ALL_PRETRAINED_CONFIG_ARCHIVE_MAP, CONFIG_MAPPING, AutoConfig
from .configuration_bart import BartConfig
from .configuration_bert import BERT_PRETRAINED_CONFIG_ARCHIVE_MAP, BertConfig
from .configuration_camembert import CAMEMBERT_PRETRAINED_CONFIG_ARCHIVE_MAP, CamembertConfig
@@ -116,10 +116,11 @@ from .pipelines import (
SummarizationPipeline,
TextClassificationPipeline,
TokenClassificationPipeline,
TranslationPipeline,
pipeline,
)
from .tokenization_albert import AlbertTokenizer
from .tokenization_auto import AutoTokenizer
from .tokenization_auto import TOKENIZER_MAPPING, AutoTokenizer
from .tokenization_bart import BartTokenizer
from .tokenization_bert import BasicTokenizer, BertTokenizer, BertTokenizerFast, WordpieceTokenizer
from .tokenization_bert_japanese import BertJapaneseTokenizer, CharacterTokenizer, MecabTokenizer
@@ -221,6 +222,7 @@ if is_torch_available():
XLMModel,
XLMWithLMHeadModel,
XLMForSequenceClassification,
XLMForTokenClassification,
XLMForQuestionAnswering,
XLMForQuestionAnsweringSimple,
XLM_PRETRAINED_MODEL_ARCHIVE_MAP,

View File

@@ -26,6 +26,7 @@ BART_PRETRAINED_CONFIG_ARCHIVE_MAP = {
"bart-large": "https://s3.amazonaws.com/models.huggingface.co/bert/facebook/bart-large/config.json",
"bart-large-mnli": "https://s3.amazonaws.com/models.huggingface.co/bert/facebook/bart-large-mnli/config.json",
"bart-large-cnn": "https://s3.amazonaws.com/models.huggingface.co/bert/facebook/bart-large-cnn/config.json",
"bart-large-xsum": "https://s3.amazonaws.com/models.huggingface.co/bert/facebook/bart-large-xsum/config.json",
}

View File

@@ -78,9 +78,6 @@ class PretrainedConfig(object):
self.top_k = kwargs.pop("top_k", 50)
self.top_p = kwargs.pop("top_p", 1.0)
self.repetition_penalty = kwargs.pop("repetition_penalty", 1.0)
self.bos_token_id = kwargs.pop("bos_token_id", None)
self.pad_token_id = kwargs.pop("pad_token_id", None)
self.eos_token_id = kwargs.pop("eos_token_id", None)
self.length_penalty = kwargs.pop("length_penalty", 1.0)
self.no_repeat_ngram_size = kwargs.pop("no_repeat_ngram_size", 0)
self.num_return_sequences = kwargs.pop("num_return_sequences", 1)
@@ -94,6 +91,16 @@ class PretrainedConfig(object):
self.label2id = kwargs.pop("label2id", dict(zip(self.id2label.values(), self.id2label.keys())))
self.label2id = dict((key, int(value)) for key, value in self.label2id.items())
# Tokenizer arguments TODO: eventually tokenizer and models should share the same config
self.prefix = kwargs.pop("prefix", None)
self.bos_token_id = kwargs.pop("bos_token_id", None)
self.pad_token_id = kwargs.pop("pad_token_id", None)
self.eos_token_id = kwargs.pop("eos_token_id", None)
self.decoder_start_token_id = kwargs.pop("decoder_start_token_id", None)
# task specific arguments
self.task_specific_params = kwargs.pop("task_specific_params", None)
# Additional attributes without default values
for key, value in kwargs.items():
try:
@@ -373,3 +380,14 @@ class PretrainedConfig(object):
"""
with open(json_file_path, "w", encoding="utf-8") as writer:
writer.write(self.to_json_string())
def update(self, config_dict: Dict):
"""
Updates attributes of this class
with attributes from `config_dict`.
Args:
:obj:`Dict[str, any]`: Dictionary of attributes that shall be updated for this class.
"""
for key, value in config_dict.items():
setattr(self, key, value)

View File

@@ -17,6 +17,7 @@
import argparse
import logging
import os
from pathlib import Path
import fairseq
@@ -30,10 +31,11 @@ from transformers import (
BartModel,
BartTokenizer,
)
from transformers.modeling_bart import _make_linear_from_emb
FAIRSEQ_MODELS = ["bart.large", "bart.large.mnli", "bart.large.cnn"]
FAIRSEQ_MODELS = ["bart.large", "bart.large.mnli", "bart.large.cnn", "bart_xsum/model.pt"]
extra_arch = {"bart.large": BartModel, "bart.large.mnli": BartForSequenceClassification}
if version.parse(fairseq.__version__) < version.parse("0.9.0"):
raise Exception("requires fairseq >= 0.9.0")
@@ -57,62 +59,79 @@ def rename_key(dct, old, new):
dct[new] = val
def convert_bart_checkpoint(checkpoint_path, pytorch_dump_folder_path):
def load_xsum_checkpoint(checkpoint_path):
"""Checkpoint path should end in model.pt"""
sd = torch.load(checkpoint_path, map_location="cpu")
hub_interface = torch.hub.load("pytorch/fairseq", "bart.large.cnn").eval()
hub_interface.model.load_state_dict(sd["model"])
return hub_interface
@torch.no_grad()
def convert_bart_checkpoint(checkpoint_path, pytorch_dump_folder_path, hf_checkpoint_name=None):
"""
Copy/paste/tweak model's weights to our BERT structure.
"""
bart = torch.hub.load("pytorch/fairseq", checkpoint_path)
bart.eval() # disable dropout
if not os.path.exists(checkpoint_path):
bart = torch.hub.load("pytorch/fairseq", checkpoint_path).eval()
else:
bart = load_xsum_checkpoint(checkpoint_path)
bart.model.upgrade_state_dict(bart.model.state_dict())
hf_model_name = checkpoint_path.replace(".", "-")
config = BartConfig.from_pretrained(hf_model_name)
if hf_checkpoint_name is None:
hf_checkpoint_name = checkpoint_path.replace(".", "-")
config = BartConfig.from_pretrained(hf_checkpoint_name)
tokens = bart.encode(SAMPLE_TEXT).unsqueeze(0)
tokens2 = BartTokenizer.from_pretrained(hf_model_name).encode(SAMPLE_TEXT, return_tensors="pt").unsqueeze(0)
tokens2 = BartTokenizer.from_pretrained(hf_checkpoint_name).encode(SAMPLE_TEXT, return_tensors="pt").unsqueeze(0)
assert torch.eq(tokens, tokens2).all()
if checkpoint_path in ["bart.large", "bart.large.cnn"]:
state_dict = bart.model.state_dict()
for k in IGNORE_KEYS:
state_dict.pop(k, None)
state_dict["shared.weight"] = state_dict["decoder.embed_tokens.weight"]
model = BartModel(config)
their_output = bart.extract_features(tokens)
else: # MNLI Case
if checkpoint_path == "bart.large.mnli":
state_dict = bart.state_dict()
for k in IGNORE_KEYS:
state_dict.pop(k, None)
remove_ignore_keys_(state_dict)
state_dict["model.shared.weight"] = state_dict["model.decoder.embed_tokens.weight"]
for src, dest in rename_keys:
rename_key(state_dict, src, dest)
model = BartForSequenceClassification(config)
their_output = bart.predict("mnli", tokens, return_logits=True)
model = BartForSequenceClassification(config).eval()
model.load_state_dict(state_dict)
fairseq_output = bart.predict("mnli", tokens, return_logits=True)
new_model_outputs = model(tokens)[0] # logits
else: # no classification heads to worry about
state_dict = bart.model.state_dict()
remove_ignore_keys_(state_dict)
state_dict["shared.weight"] = state_dict["decoder.embed_tokens.weight"]
fairseq_output = bart.extract_features(tokens)
if hf_checkpoint_name == "bart-large":
model = BartModel(config).eval()
model.load_state_dict(state_dict)
new_model_outputs = model(tokens).model[0]
else:
model = BartForConditionalGeneration(config).eval() # an existing summarization ckpt
model.model.load_state_dict(state_dict)
if hasattr(model, "lm_head"):
model.lm_head = _make_linear_from_emb(model.model.shared)
new_model_outputs = model.model(tokens)[0]
# Load state dict
model.load_state_dict(state_dict)
model.eval()
# Check results
if checkpoint_path == "bart.large.cnn":
model = BartForConditionalGeneration(config, base_model=model)
assert "lm_head.weight" in model.state_dict()
assert model.lm_head.out_features == config.max_position_embeddings
model.eval()
our_outputs = model.model(tokens)[0]
else:
our_outputs = model(tokens)[0]
assert their_output.shape == our_outputs.shape
assert (their_output == our_outputs).all().item()
assert fairseq_output.shape == new_model_outputs.shape
assert (fairseq_output == new_model_outputs).all().item()
Path(pytorch_dump_folder_path).mkdir(exist_ok=True)
model.save_pretrained(pytorch_dump_folder_path)
def remove_ignore_keys_(state_dict):
for k in IGNORE_KEYS:
state_dict.pop(k, None)
if __name__ == "__main__":
parser = argparse.ArgumentParser()
# Required parameters
parser.add_argument("fairseq_path", choices=FAIRSEQ_MODELS, type=str, help="")
parser.add_argument("pytorch_dump_folder_path", default=None, type=str, help="Path to the output PyTorch model.")
args = parser.parse_args()
convert_bart_checkpoint(
args.fairseq_path, args.pytorch_dump_folder_path,
parser.add_argument(
"fairseq_path", type=str, help="bart.large, bart.large.cnn or a path to a model.pt on local filesystem."
)
parser.add_argument("pytorch_dump_folder_path", default=None, type=str, help="Path to the output PyTorch model.")
parser.add_argument(
"--hf_config", default=None, type=str, help="Which huggingface architecture to use: bart-large-xsum"
)
args = parser.parse_args()
convert_bart_checkpoint(args.fairseq_path, args.pytorch_dump_folder_path, hf_checkpoint_name=args.hf_config)

View File

@@ -139,6 +139,7 @@ def squad_convert_example_to_features(example, max_seq_length, doc_stride, max_q
pad_to_max_length=True,
stride=max_seq_length - doc_stride - len(truncated_query) - sequence_pair_added_tokens,
truncation_strategy="only_second" if tokenizer.padding_side == "right" else "only_first",
return_token_type_ids=True,
)
paragraph_len = min(

View File

@@ -16,8 +16,11 @@
import copy
import csv
import dataclasses
import json
import logging
from dataclasses import dataclass
from typing import Optional
from ...file_utils import is_tf_available, is_torch_available
@@ -25,7 +28,8 @@ from ...file_utils import is_tf_available, is_torch_available
logger = logging.getLogger(__name__)
class InputExample(object):
@dataclass(frozen=True)
class InputExample:
"""
A single training/test example for simple sequence classification.
@@ -39,23 +43,14 @@ class InputExample(object):
specified for train and dev examples, but not for test examples.
"""
def __init__(self, guid, text_a, text_b=None, label=None):
self.guid = guid
self.text_a = text_a
self.text_b = text_b
self.label = label
def __repr__(self):
return str(self.to_json_string())
def to_dict(self):
"""Serializes this instance to a Python dictionary."""
output = copy.deepcopy(self.__dict__)
return output
guid: str
text_a: str
text_b: Optional[str] = None
label: Optional[str] = None
def to_json_string(self):
"""Serializes this instance to a JSON string."""
return json.dumps(self.to_dict(), indent=2, sort_keys=True) + "\n"
return json.dumps(dataclasses.asdict(self), indent=2, sort_keys=True) + "\n"
class InputFeatures(object):

View File

@@ -99,6 +99,7 @@ from .modeling_xlm import (
XLM_PRETRAINED_MODEL_ARCHIVE_MAP,
XLMForQuestionAnsweringSimple,
XLMForSequenceClassification,
XLMForTokenClassification,
XLMModel,
XLMWithLMHeadModel,
)
@@ -235,6 +236,7 @@ MODEL_FOR_TOKEN_CLASSIFICATION_MAPPING = OrderedDict(
[
(DistilBertConfig, DistilBertForTokenClassification),
(CamembertConfig, CamembertForTokenClassification),
(XLMConfig, XLMForTokenClassification),
(XLMRobertaConfig, XLMRobertaForTokenClassification),
(RobertaConfig, RobertaForTokenClassification),
(BertConfig, BertForTokenClassification),
@@ -418,12 +420,12 @@ class AutoModelForPreTraining(object):
config (:class:`~transformers.PretrainedConfig`):
The model class to instantiate is selected based on the configuration class:
- isInstance of `distilbert` configuration class: :class:`~transformers.DistilBertModelForMaskedLM` (DistilBERT model)
- isInstance of `roberta` configuration class: :class:`~transformers.RobertaModelForMaskedLM` (RoBERTa model)
- isInstance of `distilbert` configuration class: :class:`~transformers.DistilBertForMaskedLM` (DistilBERT model)
- isInstance of `roberta` configuration class: :class:`~transformers.RobertaForMaskedLM` (RoBERTa model)
- isInstance of `bert` configuration class: :class:`~transformers.BertForPreTraining` (Bert model)
- isInstance of `openai-gpt` configuration class: :class:`~transformers.OpenAIGPTLMHeadModel` (OpenAI GPT model)
- isInstance of `gpt2` configuration class: :class:`~transformers.GPT2ModelLMHeadModel` (OpenAI GPT-2 model)
- isInstance of `ctrl` configuration class: :class:`~transformers.CTRLModelLMHeadModel` (Salesforce CTRL model)
- isInstance of `gpt2` configuration class: :class:`~transformers.GPT2LMHeadModel` (OpenAI GPT-2 model)
- isInstance of `ctrl` configuration class: :class:`~transformers.CTRLLMHeadModel` (Salesforce CTRL model)
- isInstance of `transfo-xl` configuration class: :class:`~transformers.TransfoXLLMHeadModel` (Transformer-XL model)
- isInstance of `xlnet` configuration class: :class:`~transformers.XLNetLMHeadModel` (XLNet model)
- isInstance of `xlm` configuration class: :class:`~transformers.XLMWithLMHeadModel` (XLM model)
@@ -559,12 +561,12 @@ class AutoModelWithLMHead(object):
config (:class:`~transformers.PretrainedConfig`):
The model class to instantiate is selected based on the configuration class:
- isInstance of `distilbert` configuration class: :class:`~transformers.DistilBertModelForMaskedLM` (DistilBERT model)
- isInstance of `roberta` configuration class: :class:`~transformers.RobertaModelForMaskedLM` (RoBERTa model)
- isInstance of `bert` configuration class: :class:`~transformers.BertModelForMaskedLM` (Bert model)
- isInstance of `distilbert` configuration class: :class:`~transformers.DistilBertForMaskedLM` (DistilBERT model)
- isInstance of `roberta` configuration class: :class:`~transformers.RobertaForMaskedLM` (RoBERTa model)
- isInstance of `bert` configuration class: :class:`~transformers.BertForMaskedLM` (Bert model)
- isInstance of `openai-gpt` configuration class: :class:`~transformers.OpenAIGPTLMHeadModel` (OpenAI GPT model)
- isInstance of `gpt2` configuration class: :class:`~transformers.GPT2ModelLMHeadModel` (OpenAI GPT-2 model)
- isInstance of `ctrl` configuration class: :class:`~transformers.CTRLModelLMHeadModel` (Salesforce CTRL model)
- isInstance of `gpt2` configuration class: :class:`~transformers.GPT2LMHeadModel` (OpenAI GPT-2 model)
- isInstance of `ctrl` configuration class: :class:`~transformers.CTRLLMHeadModel` (Salesforce CTRL model)
- isInstance of `transfo-xl` configuration class: :class:`~transformers.TransfoXLLMHeadModel` (Transformer-XL model)
- isInstance of `xlnet` configuration class: :class:`~transformers.XLNetLMHeadModel` (XLNet model)
- isInstance of `xlm` configuration class: :class:`~transformers.XLMWithLMHeadModel` (XLM model)
@@ -701,14 +703,14 @@ class AutoModelForSequenceClassification(object):
config (:class:`~transformers.PretrainedConfig`):
The model class to instantiate is selected based on the configuration class:
- isInstance of `distilbert` configuration class: :class:`~transformers.DistilBertModelForSequenceClassification` (DistilBERT model)
- isInstance of `albert` configuration class: :class:`~transformers.AlbertModelForSequenceClassification` (ALBERT model)
- isInstance of `camembert` configuration class: :class:`~transformers.CamembertModelForSequenceClassification` (CamemBERT model)
- isInstance of `xlm roberta` configuration class: :class:`~transformers.XLMRobertaModelForSequenceClassification` (XLM-RoBERTa model)
- isInstance of `roberta` configuration class: :class:`~transformers.RobertaModelForSequenceClassification` (RoBERTa model)
- isInstance of `bert` configuration class: :class:`~transformers.BertModelForSequenceClassification` (Bert model)
- isInstance of `xlnet` configuration class: :class:`~transformers.XLNetModelForSequenceClassification` (XLNet model)
- isInstance of `xlm` configuration class: :class:`~transformers.XLMModelForSequenceClassification` (XLM model)
- isInstance of `distilbert` configuration class: :class:`~transformers.DistilBertForSequenceClassification` (DistilBERT model)
- isInstance of `albert` configuration class: :class:`~transformers.AlbertForSequenceClassification` (ALBERT model)
- isInstance of `camembert` configuration class: :class:`~transformers.CamembertForSequenceClassification` (CamemBERT model)
- isInstance of `xlm roberta` configuration class: :class:`~transformers.XLMRobertaForSequenceClassification` (XLM-RoBERTa model)
- isInstance of `roberta` configuration class: :class:`~transformers.RobertaForSequenceClassification` (RoBERTa model)
- isInstance of `bert` configuration class: :class:`~transformers.BertForSequenceClassification` (Bert model)
- isInstance of `xlnet` configuration class: :class:`~transformers.XLNetForSequenceClassification` (XLNet model)
- isInstance of `xlm` configuration class: :class:`~transformers.XLMForSequenceClassification` (XLM model)
- isInstance of `flaubert` configuration class: :class:`~transformers.FlaubertForSequenceClassification` (Flaubert model)
@@ -848,11 +850,11 @@ class AutoModelForQuestionAnswering(object):
config (:class:`~transformers.PretrainedConfig`):
The model class to instantiate is selected based on the configuration class:
- isInstance of `distilbert` configuration class: :class:`~transformers.DistilBertModelForQuestionAnswering` (DistilBERT model)
- isInstance of `albert` configuration class: :class:`~transformers.AlbertModelForQuestionAnswering` (ALBERT model)
- isInstance of `distilbert` configuration class: :class:`~transformers.DistilBertForQuestionAnswering` (DistilBERT model)
- isInstance of `albert` configuration class: :class:`~transformers.AlbertForQuestionAnswering` (ALBERT model)
- isInstance of `bert` configuration class: :class:`~transformers.BertModelForQuestionAnswering` (Bert model)
- isInstance of `xlnet` configuration class: :class:`~transformers.XLNetModelForQuestionAnswering` (XLNet model)
- isInstance of `xlm` configuration class: :class:`~transformers.XLMModelForQuestionAnswering` (XLM model)
- isInstance of `xlnet` configuration class: :class:`~transformers.XLNetForQuestionAnswering` (XLNet model)
- isInstance of `xlm` configuration class: :class:`~transformers.XLMForQuestionAnswering` (XLM model)
- isInstance of `flaubert` configuration class: :class:`~transformers.FlaubertForQuestionAnswering` (XLM model)
Examples::
@@ -989,8 +991,10 @@ class AutoModelForTokenClassification:
The model class to instantiate is selected based on the configuration class:
- isInstance of `distilbert` configuration class: :class:`~transformers.DistilBertModelForTokenClassification` (DistilBERT model)
- isInstance of `xlm` configuration class: :class:`~transformers.XLMForTokenClassification` (XLM model)
- isInstance of `xlm roberta` configuration class: :class:`~transformers.XLMRobertaModelForTokenClassification` (XLMRoberta model)
- isInstance of `bert` configuration class: :class:`~transformers.BertModelForTokenClassification` (Bert model)
- isInstance of `albert` configuration class: :class:`~transformers.AlbertForTokenClassification` (AlBert model)
- isInstance of `xlnet` configuration class: :class:`~transformers.XLNetModelForTokenClassification` (XLNet model)
- isInstance of `camembert` configuration class: :class:`~transformers.CamembertModelForTokenClassification` (Camembert model)
- isInstance of `roberta` configuration class: :class:`~transformers.RobertaModelForTokenClassification` (Roberta model)
@@ -1025,6 +1029,7 @@ class AutoModelForTokenClassification:
The model class to instantiate is selected as the first pattern matching
in the `pretrained_model_name_or_path` string (in the following order):
- contains `distilbert`: :class:`~transformers.DistilBertForTokenClassification` (DistilBERT model)
- contains `xlm`: :class:`~transformers.XLMForTokenClassification` (XLM model)
- contains `xlm-roberta`: :class:`~transformers.XLMRobertaForTokenClassification` (XLM-RoBERTa?Para model)
- contains `camembert`: :class:`~transformers.CamembertForTokenClassification` (Camembert model)
- contains `bert`: :class:`~transformers.BertForTokenClassification` (Bert model)

View File

@@ -34,6 +34,7 @@ BART_PRETRAINED_MODEL_ARCHIVE_MAP = {
"bart-large": "https://s3.amazonaws.com/models.huggingface.co/bert/facebook/bart-large/pytorch_model.bin",
"bart-large-mnli": "https://s3.amazonaws.com/models.huggingface.co/bert/facebook/bart-large-mnli/pytorch_model.bin",
"bart-large-cnn": "https://s3.amazonaws.com/models.huggingface.co/bert/facebook/bart-large-cnn/pytorch_model.bin",
"bart-large-xsum": "https://s3.amazonaws.com/models.huggingface.co/bert/facebook/bart-large-xsum/pytorch_model.bin",
}
BART_START_DOCSTRING = r"""
@@ -72,47 +73,50 @@ BART_INPUTS_DOCSTRING = r"""
Mask to avoid performing attention on padding token indices in input_ids.
Mask values selected in ``[0, 1]``:
``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.
encoder_outputs (:obj:`tuple(tuple(torch.FloatTensor)`, `optional`, defaults to :obj:`None`):
Tuple consists of (`last_hidden_state`, `optional`: `hidden_states`, `optional`: `attentions`)
`last_hidden_state` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`, defaults to :obj:`None`) is a sequence of hidden-states at the output of the last layer of the encoder.
Used in the cross-attention of the decoder.
decoder_input_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, target_sequence_length)`, `optional`, defaults to :obj:`None`):
Provide for translation and summarization training. By default, the model will create this tensor by shifting the input_ids right, following the paper.
decoder_attention_mask (:obj:`torch.Tensor` of shape :obj:`(batch_size, 1, tgt_seq_len, tgt_seq_len)`, `optional`, defaults to :obj:`None`):
Default behavior: generate a tensor that ignores pad tokens and future tokens, as in the paper.
decoder_attention_mask (:obj:`torch.BoolTensor` of shape :obj:`(batch_size, tgt_seq_len)`, `optional`, defaults to :obj:`None`):
Default behavior: generate a tensor that ignores pad tokens in decoder_input_ids. Causal mask will also be used by default.
If you want to change padding behavior, you should read :func:`~transformers.modeling_bart._prepare_decoder_inputs` and modify.
See diagram 1 in the paper for more info on the default strategy
"""
LARGE_NEGATIVE = -1e8
def invert_mask(attention_mask):
assert attention_mask.dim() == 2
return attention_mask.eq(0)
def _prepare_bart_decoder_inputs(
config, input_ids, decoder_input_ids=None, decoder_attn_mask=None, mask_dtype=None,
config, input_ids, decoder_input_ids=None, decoder_padding_mask=None, causal_mask_dtype=torch.float32
):
"""Prepare masks that ignore padding tokens in the decoder and a causal lm mask for the decoder if
"""Prepare masks that ignore padding tokens in the decoder and a causal mask for the decoder if
none are provided. This mimics the default behavior in fairseq. To override it pass in masks.
Note: this is not called during generation
"""
pad_token_id = config.pad_token_id
need_causal_mask = not config.output_past
if decoder_input_ids is None:
decoder_input_ids = shift_tokens_right(input_ids, pad_token_id)
bsz, tgt_len = decoder_input_ids.size()[:2]
if decoder_attn_mask is None:
bsz, tgt_len = decoder_input_ids.size()
if decoder_padding_mask is None:
decoder_padding_mask = make_padding_mask(decoder_input_ids, pad_token_id)
if need_causal_mask:
causal_lm_mask = torch.triu(fill_with_neg_inf(torch.zeros(tgt_len, tgt_len)), 1)
else:
causal_lm_mask = None
new_shape = (bsz, tgt_len, tgt_len)
# make it broadcastable so can just be added to the attention coefficients
decoder_attn_mask = _combine_masks(decoder_padding_mask, causal_lm_mask, new_shape).to(device=input_ids.device)
if mask_dtype is not None:
decoder_attn_mask = decoder_attn_mask.to(mask_dtype)
assert decoder_attn_mask is None or decoder_attn_mask.shape == (bsz, 1, tgt_len, tgt_len)
return decoder_input_ids, decoder_attn_mask
else:
decoder_padding_mask = invert_mask(decoder_padding_mask)
causal_mask = torch.triu(fill_with_neg_inf(torch.zeros(tgt_len, tgt_len)), 1).to(
dtype=causal_mask_dtype, device=decoder_input_ids.device
)
return decoder_input_ids, decoder_padding_mask, causal_mask
class PretrainedBartModel(PreTrainedModel):
config_class = BartConfig
base_model_prefix = "model"
pretrained_model_archive_map = BART_PRETRAINED_MODEL_ARCHIVE_MAP
encoder_outputs_batch_dim_idx = 1 # outputs shaped (seq_len, bs, ...)
def _init_weights(self, module):
std = self.config.init_std
@@ -128,13 +132,10 @@ class PretrainedBartModel(PreTrainedModel):
@property
def dummy_inputs(self):
pad_token = self.config.pad_token_id
input_ids = torch.tensor([[0, 6, 10, 4, 2], [0, 8, 12, 2, pad_token]])
decoder_input_ids, decoder_attn_mask = _prepare_bart_decoder_inputs(self.config, input_ids,)
input_ids = torch.tensor([[0, 6, 10, 4, 2], [0, 8, 12, 2, pad_token]], device=self.device)
dummy_inputs = {
"decoder_input_ids": decoder_input_ids,
"attention_mask": input_ids.ne(pad_token),
"input_ids": input_ids,
"decoder_attention_mask": decoder_attn_mask,
}
return dummy_inputs
@@ -152,21 +153,6 @@ def _check_shapes(shape_1, shape2):
raise AssertionError("shape mismatch: {} != {}".format(shape_1, shape2))
def _combine_masks(key_padding_mask, causal_lm_mask, targ_size):
"""Make one mask of shape (bsz, 1, tgt_len, src_len) """
a = torch.zeros(targ_size) # targ_size is(bsz, tgt_len, src_len)
b = torch.zeros(targ_size)
if key_padding_mask is not None: # (bsz, tgt_len) -> targ_size
_check_shapes(key_padding_mask.shape, targ_size[:2])
reshaped = key_padding_mask.unsqueeze(2).expand(*targ_size)
a[reshaped] = LARGE_NEGATIVE
if causal_lm_mask is not None: # (tgt_len, src_len) -> targ_size
_check_shapes(causal_lm_mask.shape, targ_size[-2:])
b = causal_lm_mask.unsqueeze(0).expand(*targ_size)
return (a + b).unsqueeze(1).clamp(LARGE_NEGATIVE,)
def shift_tokens_right(input_ids, pad_token_id):
"""Shift input ids one token to the right, and wrap the last non pad token (usually <eos>)."""
prev_output_tokens = input_ids.clone()
@@ -216,7 +202,9 @@ class EncoderLayer(nn.Module):
encoded output of shape `(seq_len, batch, embed_dim)`
"""
residual = x
x, attn_weights = self.self_attn(query=x, key=x, key_padding_mask=encoder_padding_mask,)
x, attn_weights = self.self_attn(
query=x, key=x, key_padding_mask=encoder_padding_mask, need_weights=self.output_attentions
)
x = F.dropout(x, p=self.dropout, training=self.training)
x = residual + x
x = self.self_attn_layer_norm(x)
@@ -278,8 +266,7 @@ class BartEncoder(nn.Module):
"""
# check attention mask and invert
if attention_mask is not None:
assert attention_mask.dim() == 2
attention_mask = attention_mask.eq(0)
attention_mask = invert_mask(attention_mask)
inputs_embeds = self.embed_tokens(input_ids)
embed_pos = self.embed_positions(input_ids)
@@ -315,6 +302,7 @@ class DecoderLayer(nn.Module):
def __init__(self, config: BartConfig):
super().__init__()
self.embed_dim = config.d_model
self.output_attentions = config.output_attentions
self.self_attn = SelfAttention(
embed_dim=self.embed_dim, num_heads=config.decoder_attention_heads, dropout=config.attention_dropout,
)
@@ -335,21 +323,34 @@ class DecoderLayer(nn.Module):
self.final_layer_norm = LayerNorm(self.embed_dim)
def forward(
self, x, encoder_hidden_states, encoder_attn_mask=None, layer_state=None, attention_mask=None,
self,
x,
encoder_hidden_states,
encoder_attn_mask=None,
layer_state=None,
causal_mask=None,
decoder_padding_mask=None,
):
residual = x
if layer_state is None:
layer_state = {}
# next line mutates layer state
x, self_attn_weights = self.self_attn(query=x, key=x, layer_state=layer_state, attn_mask=attention_mask,)
x, self_attn_weights = self.self_attn(
query=x,
key=x,
layer_state=layer_state,
key_padding_mask=decoder_padding_mask,
attn_mask=causal_mask,
need_weights=self.output_attentions,
)
x = F.dropout(x, p=self.dropout, training=self.training)
x = residual + x
x = self.self_attn_layer_norm(x)
residual = x
assert self.encoder_attn.cache_key != self.self_attn.cache_key
x, encoder_attn_weights = self.encoder_attn(
x, _ = self.encoder_attn(
query=x,
key=encoder_hidden_states,
key_padding_mask=encoder_attn_mask,
@@ -406,7 +407,8 @@ class BartDecoder(nn.Module):
input_ids,
encoder_hidden_states,
encoder_padding_mask,
combined_mask,
decoder_padding_mask,
decoder_causal_mask,
decoder_cached_states=None,
generation_mode=False,
**unused
@@ -431,8 +433,7 @@ class BartDecoder(nn.Module):
"""
# check attention mask and invert
if encoder_padding_mask is not None:
assert encoder_padding_mask.dim() == 2
encoder_padding_mask = encoder_padding_mask.eq(0)
encoder_padding_mask = invert_mask(encoder_padding_mask)
# embed positions
positions = self.embed_positions(input_ids, generation_mode=generation_mode)
@@ -452,7 +453,6 @@ class BartDecoder(nn.Module):
all_hidden_states = ()
all_self_attns = ()
next_decoder_cache = []
for i, decoder_layer in enumerate(self.layers):
decoder_layer # type: DecoderLayer
# add LayerDrop (see https://arxiv.org/abs/1909.11556 for description)
@@ -462,7 +462,12 @@ class BartDecoder(nn.Module):
layer_state = decoder_cached_states[i] if decoder_cached_states is not None else None
x, layer_self_attn, layer_past = decoder_layer(
x, encoder_hidden_states, encoder_padding_mask, layer_state=layer_state, attention_mask=combined_mask,
x,
encoder_hidden_states,
encoder_attn_mask=encoder_padding_mask,
decoder_padding_mask=decoder_padding_mask,
layer_state=layer_state,
causal_mask=decoder_causal_mask,
)
if self.output_past:
@@ -526,6 +531,7 @@ class SelfAttention(nn.Module):
key_padding_mask: Optional[Tensor] = None,
layer_state: Optional[Dict[str, Optional[Tensor]]] = None,
attn_mask: Optional[Tensor] = None,
need_weights=False,
) -> Tuple[Tensor, Optional[Tensor]]:
"""Input shape: Time(SeqLen) x Batch x Channel"""
static_kv = self.encoder_decoder_attention # type: bool
@@ -597,7 +603,10 @@ class SelfAttention(nn.Module):
assert attn_output.size() == (bsz * self.num_heads, tgt_len, self.head_dim)
attn_output = attn_output.transpose(0, 1).contiguous().view(tgt_len, bsz, embed_dim)
attn_output = self.out_proj(attn_output)
attn_weights = attn_weights.view(bsz, self.num_heads, tgt_len, src_len)
if need_weights:
attn_weights = attn_weights.view(bsz, self.num_heads, tgt_len, src_len)
else:
attn_weights = None
return attn_output, attn_weights
def _use_saved_state(self, k, v, saved_state, key_padding_mask, static_kv, bsz):
@@ -726,6 +735,8 @@ def _filter_out_falsey_values(tup) -> Tuple:
# Public API
def _get_shape(t):
return getattr(t, "shape", None)
@add_start_docstrings(
@@ -759,13 +770,16 @@ class BartModel(PretrainedBartModel):
# make masks if user doesn't supply
if not generation_mode:
decoder_input_ids, decoder_attention_mask = _prepare_bart_decoder_inputs(
decoder_input_ids, decoder_padding_mask, causal_mask = _prepare_bart_decoder_inputs(
self.config,
input_ids,
decoder_input_ids=decoder_input_ids,
decoder_attn_mask=decoder_attention_mask,
mask_dtype=self.shared.weight.dtype,
decoder_padding_mask=decoder_attention_mask,
causal_mask_dtype=self.shared.weight.dtype,
)
else:
decoder_padding_mask, causal_mask = None, None
assert decoder_input_ids is not None
if encoder_outputs is None:
encoder_outputs = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
@@ -775,7 +789,8 @@ class BartModel(PretrainedBartModel):
decoder_input_ids,
encoder_outputs[0],
attention_mask,
decoder_attention_mask,
decoder_padding_mask,
decoder_causal_mask=causal_mask,
decoder_cached_states=decoder_cached_states,
generation_mode=generation_mode,
)
@@ -804,13 +819,8 @@ class BartForConditionalGeneration(PretrainedBartModel):
def __init__(self, config: BartConfig):
super().__init__(config)
# if base_model is None:
base_model = BartModel(config)
self.model = base_model
self.lm_head = _make_linear_from_emb(self.model.shared)
def tie_weights(self):
pass # hack to prevent changing lm_head.out_features. The input and output embeddings are still the same.
@add_start_docstrings_to_callable(BART_INPUTS_DOCSTRING)
def forward(
@@ -875,7 +885,7 @@ class BartForConditionalGeneration(PretrainedBartModel):
decoder_cached_states=decoder_cached_states,
generation_mode=generation_mode,
)
lm_logits = self.lm_head(outputs[0])
lm_logits = F.linear(outputs[0], self.model.shared.weight)
outputs = (lm_logits,) + outputs[1:] # Add hidden states and attention if they are here
if lm_labels is not None:
loss_fct = nn.CrossEntropyLoss()
@@ -893,7 +903,6 @@ class BartForConditionalGeneration(PretrainedBartModel):
encoder_outputs, decoder_cached_states = past, None
else:
encoder_outputs, decoder_cached_states = past
return {
"input_ids": None, # encoder_outputs is defined. input_ids not needed
"encoder_outputs": encoder_outputs,
@@ -932,7 +941,7 @@ class BartForConditionalGeneration(PretrainedBartModel):
return self.model.encoder
def get_output_embeddings(self):
return self.lm_head
return _make_linear_from_emb(self.model.shared) # make it on the fly
@add_start_docstrings(
@@ -968,7 +977,7 @@ class BartForSequenceClassification(PretrainedBartModel):
Returns:
:obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers.BartConfig`) and inputs:
loss (:obj:`torch.FloatTensor` of shape :obj:`(1,)`, `optional`, returned when :obj:`label` is provided):
Classification loss (cross entropy)
Classification loss (cross entropy)
logits (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, config.num_labels)`):
Classification (or regression if config.num_labels==1) scores (before SoftMax).
hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):

View File

@@ -27,7 +27,7 @@ from torch import nn
from torch.nn import CrossEntropyLoss
from .configuration_t5 import T5Config
from .file_utils import DUMMY_INPUTS, DUMMY_MASK, add_start_docstrings
from .file_utils import DUMMY_INPUTS, DUMMY_MASK, add_start_docstrings, add_start_docstrings_to_callable
from .modeling_utils import PreTrainedModel, prune_linear_layer
@@ -457,6 +457,7 @@ class T5PreTrainedModel(PreTrainedModel):
pretrained_model_archive_map = T5_PRETRAINED_MODEL_ARCHIVE_MAP
load_tf_weights = load_tf_weights_in_t5
base_model_prefix = "transformer"
encoder_outputs_batch_dim_idx = 0 # outputs shaped (bs, ...)
@property
def dummy_inputs(self):
@@ -501,6 +502,27 @@ class T5PreTrainedModel(PreTrainedModel):
if module.has_relative_attention_bias:
module.relative_attention_bias.weight.data.normal_(mean=0.0, std=factor * ((d_model) ** -0.5))
def _shift_right(self, input_ids):
decoder_start_token_id = self.config.decoder_start_token_id
pad_token_id = self.config.pad_token_id
assert (
decoder_start_token_id is not None
), "self.model.config.decoder_start_token_id has to be defined. In T5 it is usually set to the pad_token_id. See T5 docs for more information"
# shift inputs to the right
shifted_input_ids = input_ids.new_zeros(input_ids.shape)
shifted_input_ids[..., 1:] = input_ids[..., :-1].clone()
shifted_input_ids[..., 0] = decoder_start_token_id
assert pad_token_id is not None, "self.model.config.pad_token_id has to be defined."
# replace possible -100 values in lm_labels by `pad_token_id`
shifted_input_ids.masked_fill_(shifted_input_ids == -100, pad_token_id)
assert torch.all(shifted_input_ids >= 0).item(), "Verify that `lm_labels` has only positive values and -100"
return shifted_input_ids
class T5Stack(T5PreTrainedModel):
def __init__(self, config, embed_tokens=None):
@@ -695,30 +717,38 @@ T5_START_DOCSTRING = r""" The T5 model was proposed in
"""
T5_INPUTS_DOCSTRING = r"""
Inputs:
**input_ids**: ``torch.LongTensor`` of shape ``(batch_size, sequence_length)``:
Args:
input_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`):
Indices of input sequence tokens in the vocabulary.
To match pre-training, T5 input sequence should be formatted with [CLS] and [SEP] tokens as follows:
(a) For sequence pairs:
``tokens: [CLS] is this jack ##son ##ville ? [SEP] no it is not . [SEP]``
(b) For single sequences:
``tokens: [CLS] the dog is hairy . [SEP]``
T5 is a model with relative position embeddings so you should be able to pad the inputs on
the right or the left.
Indices can be obtained using :class:`transformers.T5Tokenizer`.
See :func:`transformers.PreTrainedTokenizer.encode` and
:func:`transformers.PreTrainedTokenizer.convert_tokens_to_ids` for details.
**attention_mask**: (`optional`) ``torch.FloatTensor`` of shape ``(batch_size, sequence_length)``:
To know more on how to prepare :obj:`input_ids` for pre-training take a look at
`T5 Training <./t5.html#training>`_ .
attention_mask (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):
Mask to avoid performing attention on padding token indices.
Mask values selected in ``[0, 1]``:
``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.
**head_mask**: (`optional`) ``torch.FloatTensor`` of shape ``(num_heads,)`` or ``(num_layers, num_heads)``:
encoder_outputs (:obj:`tuple(tuple(torch.FloatTensor)`, `optional`, defaults to :obj:`None`):
Tuple consists of (`last_hidden_state`, `optional`: `hidden_states`, `optional`: `attentions`)
`last_hidden_state` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`, defaults to :obj:`None`) is a sequence of hidden-states at the output of the last layer of the encoder.
Used in the cross-attention of the decoder.
decoder_input_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, target_sequence_length)`, `optional`, defaults to :obj:`None`):
Provide for sequence to sequence training. T5 uses the pad_token_id as the starting token for decoder_input_ids generation.
To know more on how to prepare :obj:`decoder_input_ids` for pre-training take a look at
`T5 Training <./t5.html#training>`_ .
decoder_attention_mask (:obj:`torch.BoolTensor` of shape :obj:`(batch_size, tgt_seq_len)`, `optional`, defaults to :obj:`None`):
Default behavior: generate a tensor that ignores pad tokens in decoder_input_ids. Causal mask will also be used by default.
inputs_embeds (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`, defaults to :obj:`None`):
Optionally, instead of passing :obj:`input_ids` you can choose to directly pass an embedded representation.
This is useful if you want more control over how to convert `input_ids` indices into associated vectors
than the model's internal embedding lookup matrix.
decoder_inputs_embeds (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, target_sequence_length, hidden_size)`, `optional`, defaults to :obj:`None`):
Optionally, instead of passing :obj:`decoder_input_ids` you can choose to directly pass an embedded representation.
This is useful if you want more control over how to convert `decoder_input_ids` indices into associated vectors
than the model's internal embedding lookup matrix.
head_mask: (:obj:`torch.FloatTensor` of shape :obj:`(num_heads,)` or :obj:`(num_layers, num_heads)`, `optional`, defaults to :obj:`None`):
Mask to nullify selected heads of the self-attention modules.
Mask values selected in ``[0, 1]``:
``1`` indicates the head is **not masked**, ``0`` indicates the head is **masked**.
@@ -728,31 +758,8 @@ T5_INPUTS_DOCSTRING = r"""
@add_start_docstrings(
"The bare T5 Model transformer outputting raw hidden-states" "without any specific head on top.",
T5_START_DOCSTRING,
T5_INPUTS_DOCSTRING,
)
class T5Model(T5PreTrainedModel):
r"""
Outputs: `Tuple` comprising various elements depending on the configuration (config) and inputs:
**last_hidden_state**: ``torch.FloatTensor`` of shape ``(batch_size, sequence_length, hidden_size)``
Sequence of hidden-states at the output of the last layer of the model.
**hidden_states**: (`optional`, returned when ``config.output_hidden_states=True``)
list of ``torch.FloatTensor`` (one for the output of each layer + the output of the embeddings)
of shape ``(batch_size, sequence_length, hidden_size)``:
Hidden-states of the model at the output of each layer plus the initial embedding outputs.
**attentions**: (`optional`, returned when ``config.output_attentions=True``)
list of ``torch.FloatTensor`` (one for each layer) of shape ``(batch_size, num_heads, sequence_length, sequence_length)``:
Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.
Examples::
tokenizer = T5Tokenizer.from_pretrained('t5-small')
model = T5Model.from_pretrained('t5-small')
input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0) # Batch size 1
outputs = model(input_ids=input_ids)
last_hidden_states = outputs[0] # The last hidden-state is the first element of the output tuple
"""
def __init__(self, config):
super().__init__(config)
self.shared = nn.Embedding(config.vocab_size, config.d_model)
@@ -782,6 +789,7 @@ class T5Model(T5PreTrainedModel):
for layer, heads in heads_to_prune.items():
self.encoder.layer[layer].attention.prune_heads(heads)
@add_start_docstrings_to_callable(T5_INPUTS_DOCSTRING)
def forward(
self,
input_ids=None,
@@ -793,6 +801,34 @@ class T5Model(T5PreTrainedModel):
decoder_inputs_embeds=None,
head_mask=None,
):
r"""
Return:
:obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers.T5Config`) and inputs.
last_hidden_state (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`):
Sequence of hidden-states at the output of the last layer of the model.
hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):
Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)
of shape :obj:`(batch_size, sequence_length, hidden_size)`.
Hidden-states of the model at the output of each layer plus the initial embedding outputs.
attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):
Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape
:obj:`(batch_size, num_heads, sequence_length, sequence_length)`.
Attentions weights after the attention softmax, used to compute the weighted average in the self-attention
heads.
Examples::
from transformers import T5Tokenizer, T5Model
tokenizer = T5Tokenizer.from_pretrained('t5-small')
model = T5Model.from_pretrained('t5-small')
input_ids = tokenizer.encode("Hello, my dog is cute", return_tensors="pt") # Batch size 1
outputs = model(input_ids=input_ids, decoder_input_ids=input_ids)
last_hidden_states = outputs[0] # The last hidden-state is the first element of the output tuple
"""
# Encode if needed (training, first prediction pass)
if encoder_outputs is None:
@@ -815,38 +851,8 @@ class T5Model(T5PreTrainedModel):
return decoder_outputs + encoder_outputs
@add_start_docstrings("""T5 Model with a `language modeling` head on top. """, T5_START_DOCSTRING, T5_INPUTS_DOCSTRING)
@add_start_docstrings("""T5 Model with a `language modeling` head on top. """, T5_START_DOCSTRING)
class T5ForConditionalGeneration(T5PreTrainedModel):
r"""
**lm_labels**: (`optional`) ``torch.LongTensor`` of shape ``(batch_size, sequence_length)``:
Labels for computing the masked language modeling loss.
Indices should either be in ``[0, ..., config.vocab_size]`` or -100 (see ``input_ids`` docstring).
Tokens with indices set to ``-100`` are ignored (masked), the loss is only computed for the tokens with labels
in ``[0, ..., config.vocab_size]``.
Outputs: `Tuple` comprising various elements depending on the configuration (config) and inputs:
**loss**: (`optional`, returned when ``lm_labels`` is provided) ``torch.FloatTensor`` of shape ``(1,)``:
Masked language modeling loss.
**prediction_scores**: ``torch.FloatTensor`` of shape ``(batch_size, sequence_length, config.vocab_size)``
Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
**hidden_states**: (`optional`, returned when ``config.output_hidden_states=True``)
list of ``torch.FloatTensor`` (one for the output of each layer + the output of the embeddings)
of shape ``(batch_size, sequence_length, hidden_size)``:
Hidden-states of the model at the output of each layer plus the initial embedding outputs.
**attentions**: (`optional`, returned when ``config.output_attentions=True``)
list of ``torch.FloatTensor`` (one for each layer) of shape ``(batch_size, num_heads, sequence_length, sequence_length)``:
Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.
Examples::
tokenizer = T5Tokenizer.from_pretrained('t5-small')
model = T5ForConditionalGeneration.from_pretrained('t5-small')
input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0) # Batch size 1
outputs = model(input_ids=input_ids, lm_labels=input_ids)
loss, prediction_scores = outputs[:2]
"""
def __init__(self, config):
super().__init__(config)
self.model_dim = config.d_model
@@ -878,6 +884,7 @@ class T5ForConditionalGeneration(T5PreTrainedModel):
def get_encoder(self):
return self.encoder
@add_start_docstrings_to_callable(T5_INPUTS_DOCSTRING)
def forward(
self,
input_ids=None,
@@ -890,6 +897,45 @@ class T5ForConditionalGeneration(T5PreTrainedModel):
decoder_inputs_embeds=None,
head_mask=None,
):
r"""
lm_labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`, defaults to :obj:`None`):
Labels for computing the sequence classification/regression loss.
Indices should be in :obj:`[-100, 0, ..., config.vocab_size - 1]`.
If :obj:`config.num_labels > 1` a classification loss is computed (Cross-Entropy).
All labels set to ``-100`` are ignored (masked), the loss is only
computed for labels in ``[0, ..., config.vocab_size]``
Returns:
:obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers.T5Config`) and inputs.
loss (:obj:`torch.FloatTensor` of shape :obj:`(1,)`, `optional`, returned when :obj:`lm_label` is provided):
Classification loss (cross entropy).
prediction_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, config.vocab_size)`)
Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):
Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)
of shape :obj:`(batch_size, sequence_length, hidden_size)`.
Hidden-states of the model at the output of each layer plus the initial embedding outputs.
attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):
Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape
:obj:`(batch_size, num_heads, sequence_length, sequence_length)`.
Attentions weights after the attention softmax, used to compute the weighted average in the self-attention.
Examples::
from transformers import T5Tokenizer, T5ForConditionalGeneration
tokenizer = T5Tokenizer.from_pretrained('t5-small')
model = T5ForConditionalGeneration.from_pretrained('t5-small')
input_ids = tokenizer.encode("Hello, my dog is cute", return_tensors="pt") # Batch size 1
outputs = model(input_ids=input_ids, decoder_input_ids=input_ids, lm_labels=input_ids)
loss, prediction_scores = outputs[:2]
tokenizer = T5Tokenizer.from_pretrained('t5-small')
model = T5ForConditionalGeneration.from_pretrained('t5-small')
input_ids = tokenizer.encode("summarize: Hello, my dog is cute", return_tensors="pt") # Batch size 1
outputs = model.generate(input_ids)
"""
# Encode if needed (training, first prediction pass)
if encoder_outputs is None:
@@ -900,6 +946,10 @@ class T5ForConditionalGeneration(T5PreTrainedModel):
hidden_states = encoder_outputs[0]
if lm_labels is not None and decoder_input_ids is None and decoder_inputs_embeds is None:
# get decoder inputs from shifting lm labels to the right
decoder_input_ids = self._shift_right(lm_labels)
# Decode
decoder_outputs = self.decoder(
input_ids=decoder_input_ids,
@@ -918,10 +968,8 @@ class T5ForConditionalGeneration(T5PreTrainedModel):
decoder_outputs = (lm_logits,) + decoder_outputs[1:] # Add hidden states and attention if they are here
if lm_labels is not None:
shift_logits = lm_logits[..., :-1, :].contiguous()
shift_labels = lm_labels[..., 1:].contiguous()
loss_fct = CrossEntropyLoss(ignore_index=-100)
loss = loss_fct(shift_logits.view(-1, shift_logits.size(-1)), shift_labels.view(-1))
loss = loss_fct(lm_logits.view(-1, lm_logits.size(-1)), lm_labels.view(-1))
decoder_outputs = (
loss,
) + decoder_outputs # TODO(thom): Add z_loss https://github.com/tensorflow/mesh/blob/fa19d69eafc9a482aff0b59ddd96b025c0cb207d/mesh_tensorflow/layers.py#L666

View File

@@ -24,7 +24,7 @@ import math
import tensorflow as tf
from .configuration_t5 import T5Config
from .file_utils import DUMMY_INPUTS, DUMMY_MASK, add_start_docstrings
from .file_utils import DUMMY_INPUTS, DUMMY_MASK, add_start_docstrings, add_start_docstrings_to_callable
from .modeling_tf_utils import TFPreTrainedModel, TFSharedEmbeddings, shape_list
@@ -630,31 +630,41 @@ T5_START_DOCSTRING = r""" The T5 model was proposed in
"""
T5_INPUTS_DOCSTRING = r"""
Inputs:
**input_ids**: ``Numpy array`` or ``tf.Tensor`` of shape ``(batch_size, sequence_length)``:
Args:
decoder_input_ids are usually used as a `dict` (see T5 description above for more information) containing all the following.
decoder_input_ids (:obj:`tf.Tensor` of shape :obj:`(batch_size, target_sequence_length)`, `optional`, defaults to :obj:`None`):
Provide for sequence to sequence training. T5 uses the pad_token_id as the starting token for decoder_input_ids generation.
input_ids (:obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length)`):
Indices of input sequence tokens in the vocabulary.
To match pre-training, T5 input sequence should be formatted with [CLS] and [SEP] tokens as follows:
(a) For sequence pairs:
``tokens: [CLS] is this jack ##son ##ville ? [SEP] no it is not . [SEP]``
(b) For single sequences:
``tokens: [CLS] the dog is hairy . [SEP]``
T5 is a model with relative position embeddings so you should be able to pad the inputs on
the right or the left.
Indices can be obtained using :class:`transformers.T5Tokenizer`.
To know more on how to prepare :obj:`input_ids` for pre-training take a look at
`T5 Training <./t5.html#training>`_ .
See :func:`transformers.PreTrainedTokenizer.encode` and
:func:`transformers.PreTrainedTokenizer.convert_tokens_to_ids` for details.
**attention_mask**: (`optional`) ``Numpy array`` or ``tf.Tensor`` of shape ``(batch_size, sequence_length)``:
attention_mask (:obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):
Mask to avoid performing attention on padding token indices.
Mask values selected in ``[0, 1]``:
``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.
**head_mask**: (`optional`) ``Numpy array`` or ``tf.Tensor`` of shape ``(num_heads,)`` or ``(num_layers, num_heads)``:
encoder_outputs (:obj:`tuple(tuple(tf.FloatTensor)`, `optional`, defaults to :obj:`None`):
Tuple consists of (`last_hidden_state`, `optional`: `hidden_states`, `optional`: `attentions`)
`last_hidden_state` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`, defaults to :obj:`None`) is a sequence of hidden-states at the output of the last layer of the encoder.
Used in the cross-attention of the decoder.
decoder_attention_mask (:obj:`tf.Tensor` of shape :obj:`(batch_size, tgt_seq_len)`, `optional`, defaults to :obj:`None`):
Default behavior: generate a tensor that ignores pad tokens in decoder_input_ids. Causal mask will also be used by default.
inputs_embeds (:obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`, defaults to :obj:`None`):
Optionally, instead of passing :obj:`input_ids` you can choose to directly pass an embedded representation.
This is useful if you want more control over how to convert `input_ids` indices into associated vectors
than the model's internal embedding lookup matrix.
decoder_inputs_embeds (:obj:`tf.Tensor` of shape :obj:`(batch_size, target_sequence_length, hidden_size)`, `optional`, defaults to :obj:`None`):
Optionally, instead of passing :obj:`decoder_input_ids` you can choose to directly pass an embedded representation.
This is useful if you want more control over how to convert `decoder_input_ids` indices into associated vectors
than the model's internal embedding lookup matrix.
To know more on how to prepare :obj:`decoder_input_ids` for pre-training take a look at
`T5 Training <./t5.html#training>`_ .
head_mask: (:obj:`tf.Tensor` of shape :obj:`(num_heads,)` or :obj:`(num_layers, num_heads)`, `optional`, defaults to :obj:`None`):
Mask to nullify selected heads of the self-attention modules.
Mask values selected in ``[0, 1]``:
``1`` indicates the head is **not masked**, ``0`` indicates the head is **masked**.
@@ -664,34 +674,8 @@ T5_INPUTS_DOCSTRING = r"""
@add_start_docstrings(
"The bare T5 Model transformer outputting raw hidden-states" "without any specific head on top.",
T5_START_DOCSTRING,
T5_INPUTS_DOCSTRING,
)
class TFT5Model(TFT5PreTrainedModel):
r"""
Outputs: `Tuple` comprising various elements depending on the configuration (config) and inputs:
**last_hidden_state**: ``tf.Tensor`` of shape ``(batch_size, sequence_length, hidden_size)``
Sequence of hidden-states at the output of the last layer of the model.
**hidden_states**: (`optional`, returned when ``config.output_hidden_states=True``)
list of ``tf.Tensor`` (one for the output of each layer + the output of the embeddings)
of shape ``(batch_size, sequence_length, hidden_size)``:
Hidden-states of the model at the output of each layer plus the initial embedding outputs.
**attentions**: (`optional`, returned when ``config.output_attentions=True``)
list of ``tf.Tensor`` (one for each layer) of shape ``(batch_size, num_heads, sequence_length, sequence_length)``:
Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.
Examples::
import tensorflow as tf
from transformers import T5Tokenizer, TFT5Model
tokenizer = T5Tokenizer.from_pretrained('t5-small')
model = TFT5Model.from_pretrained('t5-small')
input_ids = tf.constant(tokenizer.encode("Hello, my dog is cute"))[None, :] # Batch size 1
outputs = model(input_ids=input_ids)
last_hidden_states = outputs[0] # The last hidden-state is the first element of the output tuple
"""
def __init__(self, config, *inputs, **kwargs):
super().__init__(config, *inputs, **kwargs)
self.shared = TFSharedEmbeddings(config.vocab_size, config.d_model, name="shared")
@@ -715,7 +699,36 @@ class TFT5Model(TFT5PreTrainedModel):
def get_output_embeddings(self):
return self.shared
@add_start_docstrings_to_callable(T5_INPUTS_DOCSTRING)
def call(self, decoder_input_ids, **kwargs):
r"""
Return:
:obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers.T5Config`) and inputs.
last_hidden_state (:obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`):
Sequence of hidden-states at the output of the last layer of the model.
hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_hidden_states=True``):
Tuple of :obj:`tf.Tensor` (one for the output of the embeddings + one for the output of each layer)
of shape :obj:`(batch_size, sequence_length, hidden_size)`.
Hidden-states of the model at the output of each layer plus the initial embedding outputs.
attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_attentions=True``):
Tuple of :obj:`tf.Tensor` (one for each layer) of shape
:obj:`(batch_size, num_heads, sequence_length, sequence_length)`.
Attentions weights after the attention softmax, used to compute the weighted average in the self-attention
heads.
Examples::
from transformers import T5Tokenizer, TFT5Model
tokenizer = T5Tokenizer.from_pretrained('t5-small')
model = TFT5Model.from_pretrained('t5-small')
input_ids = tokenizer.encode("Hello, my dog is cute", return_tensors="tf") # Batch size 1
outputs = model(input_ids, input_ids=input_ids)
last_hidden_states = outputs[0] # The last hidden-state is the first element of the output tuple
"""
if isinstance(decoder_input_ids, dict):
kwargs.update(decoder_input_ids)
@@ -753,33 +766,8 @@ class TFT5Model(TFT5PreTrainedModel):
return decoder_outputs + encoder_outputs
@add_start_docstrings("""T5 Model with a `language modeling` head on top. """, T5_START_DOCSTRING, T5_INPUTS_DOCSTRING)
@add_start_docstrings("""T5 Model with a `language modeling` head on top. """, T5_START_DOCSTRING)
class TFT5ForConditionalGeneration(TFT5PreTrainedModel):
r"""
Outputs: `Tuple` comprising various elements depending on the configuration (config) and inputs:
**prediction_scores**: ``Numpy array`` or ``tf.Tensor`` of shape ``(batch_size, sequence_length, config.vocab_size)``
Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
**hidden_states**: (`optional`, returned when ``config.output_hidden_states=True``)
list of ``Numpy array`` or ``tf.Tensor`` (one for the output of each layer + the output of the embeddings)
of shape ``(batch_size, sequence_length, hidden_size)``:
Hidden-states of the model at the output of each layer plus the initial embedding outputs.
**attentions**: (`optional`, returned when ``config.output_attentions=True``)
list of ``Numpy array`` or ``tf.Tensor`` (one for each layer) of shape ``(batch_size, num_heads, sequence_length, sequence_length)``:
Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.
Examples::
import tensorflow as tf
from transformers import T5Tokenizer, TFT5ForConditionalGeneration
tokenizer = T5Tokenizer.from_pretrained('t5-small')
model = TFT5ForConditionalGeneration.from_pretrained('t5-small')
input_ids = tf.constant(tokenizer.encode("Hello, my dog is cute"))[None, :] # Batch size 1
outputs = model(input_ids=input_ids)
prediction_scores = outputs[0]
"""
def __init__(self, config, *inputs, **kwargs):
super().__init__(config, *inputs, **kwargs)
self.model_dim = config.d_model
@@ -808,7 +796,42 @@ class TFT5ForConditionalGeneration(TFT5PreTrainedModel):
def get_encoder(self):
return self.encoder
@add_start_docstrings_to_callable(T5_INPUTS_DOCSTRING)
def call(self, decoder_input_ids, **kwargs):
r"""
Return:
:obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers.T5Config`) and inputs.
loss (:obj:`tf.Tensor` of shape :obj:`(1,)`, `optional`, returned when :obj:`lm_label` is provided):
Classification loss (cross entropy).
prediction_scores (:obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length, config.vocab_size)`)
Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_hidden_states=True``):
Tuple of :obj:`tf.Tensor` (one for the output of the embeddings + one for the output of each layer)
of shape :obj:`(batch_size, sequence_length, hidden_size)`.
Hidden-states of the model at the output of each layer plus the initial embedding outputs.
attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``config.output_attentions=True``):
Tuple of :obj:`tf.Tensor` (one for each layer) of shape
:obj:`(batch_size, num_heads, sequence_length, sequence_length)`.
Attentions weights after the attention softmax, used to compute the weighted average in the self-attention.
Examples::
from transformers import T5Tokenizer, TFT5ForConditionalGeneration
tokenizer = T5Tokenizer.from_pretrained('t5-small')
model = TFT5ForConditionalGeneration.from_pretrained('t5-small')
input_ids = tokenizer.encode("Hello, my dog is cute", return_tensors="tf") # Batch size 1
outputs = model(input_ids, input_ids=input_ids)
prediction_scores = outputs[0]
tokenizer = T5Tokenizer.from_pretrained('t5-small')
model = TFT5ForConditionalGeneration.from_pretrained('t5-small')
input_ids = tokenizer.encode("summarize: Hello, my dog is cute", return_tensors="tf") # Batch size 1
model.generate(input_ids)
"""
if isinstance(decoder_input_ids, dict):
kwargs.update(decoder_input_ids)

View File

@@ -231,7 +231,7 @@ class TFPreTrainedModel(tf.keras.Model, TFModelUtilsMixin):
def save_pretrained(self, save_directory):
""" Save a model and its configuration file to a directory, so that it
can be re-loaded using the `:func:`~transformers.PreTrainedModel.from_pretrained`` class method.
can be re-loaded using the :func:`~transformers.PreTrainedModel.from_pretrained` class method.
"""
assert os.path.isdir(
save_directory
@@ -523,8 +523,8 @@ class TFPreTrainedModel(tf.keras.Model, TFModelUtilsMixin):
pad_token_id: (`optional`) int
Pad token. Defaults to pad_token_id as defined in the models config.
eos_token_ids: (`optional`) int or list of int
End of sequence token or list of tokens to stop the generation. Default to 0.
eos_token_id: (`optional`) int
EOS token. Defaults to eos_token_id as defined in the models config.
length_penalty: (`optional`) float
Exponential penalty to the length. Default to 1.
@@ -541,7 +541,7 @@ class TFPreTrainedModel(tf.keras.Model, TFModelUtilsMixin):
``1`` for tokens that are NOT MASKED, ``0`` for MASKED tokens.
Defaults to `None`.
`What are attention masks? <../glossary.html#attention-mask>`__
`What are attention masks? <../glossary.html#attention-mask>`__
decoder_start_token_id=None: (`optional`) int
If an encoder-decoder model starts decoding with a different token than BOS.
@@ -610,7 +610,9 @@ class TFPreTrainedModel(tf.keras.Model, TFModelUtilsMixin):
num_return_sequences = (
num_return_sequences if num_return_sequences is not None else self.config.num_return_sequences
)
decoder_start_token_id = decoder_start_token_id if decoder_start_token_id is not None else bos_token_id
decoder_start_token_id = (
decoder_start_token_id if decoder_start_token_id is not None else self.config.decoder_start_token_id
)
if input_ids is not None:
batch_size = shape_list(input_ids)[0] # overriden by the input batch_size
@@ -635,9 +637,6 @@ class TFPreTrainedModel(tf.keras.Model, TFModelUtilsMixin):
assert (eos_token_id is None) or (
isinstance(eos_token_id, int) and (eos_token_id >= 0)
), "`eos_token_id` should be a positive integer."
assert (
decoder_start_token_id is not None or self.config.is_encoder_decoder is False
), "`decoder_start_token_id` has to be defined if model is encoder-decoder model"
assert length_penalty > 0, "`length_penalty` should be strictely positive."
assert (
isinstance(num_return_sequences, int) and num_return_sequences > 0
@@ -708,8 +707,12 @@ class TFPreTrainedModel(tf.keras.Model, TFModelUtilsMixin):
) # shape: (batch_size * num_return_sequences * num_beams, cur_len)
if self.config.is_encoder_decoder:
if decoder_start_token_id is None:
decoder_start_token_id = bos_token_id
assert bos_token_id is not None, "Encoder Decoder Models need to have a bos_token_id"
assert (
decoder_start_token_id is not None
), "decoder_start_token_id or bos_token_id has to be defined for encoder-decoder generation"
assert hasattr(self, "get_encoder"), "{} should have a 'get_encoder' function defined".format(self)
assert callable(self.get_encoder), "{} should be a method".format(self.get_encoder)
@@ -996,10 +999,12 @@ class TFPreTrainedModel(tf.keras.Model, TFModelUtilsMixin):
# set eos token prob to zero if min_length is not reached
if eos_token_id is not None and cur_len < min_length:
# create eos_token_id boolean mask
num_batch_hypotheses = batch_size * num_beams
is_token_logit_eos_token = tf.convert_to_tensor(
[True if token is eos_token_id else False for token in range(vocab_size)], dtype=tf.bool
)
eos_token_indices_mask = tf.broadcast_to(is_token_logit_eos_token, [batch_size, vocab_size])
eos_token_indices_mask = tf.broadcast_to(is_token_logit_eos_token, [num_batch_hypotheses, vocab_size])
scores = set_tensor_by_indices_to_value(scores, eos_token_indices_mask, -float("inf"))

View File

@@ -108,6 +108,10 @@ class ModuleUtilsMixin:
module.mem_rss_post_forward = 0
module.mem_rss_pre_forward = 0
@property
def device(self):
return next(self.parameters()).device
class PreTrainedModel(nn.Module, ModuleUtilsMixin):
r""" Base class for all models.
@@ -717,13 +721,10 @@ class PreTrainedModel(nn.Module, ModuleUtilsMixin):
Padding token. Default to specicic model pad_token_id or None if it does not exist.
bos_token_id: (`optional`) int
BOS token. Defaults to bos_token_id as defined in the models config.
BOS token. Defaults to `bos_token_id` as defined in the models config.
pad_token_id: (`optional`) int
Pad token. Defaults to pad_token_id as defined in the models config.
eos_token_ids: (`optional`) int or list of int
End of sequence token or list of tokens to stop the generation. Default to eos_token_ids as defined in the models config.
eos_token_id: (`optional`) int
EOS token. Defaults to `eos_token_id` as defined in the models config.
length_penalty: (`optional`) float
Exponential penalty to the length. Default to 1.
@@ -809,7 +810,9 @@ class PreTrainedModel(nn.Module, ModuleUtilsMixin):
num_return_sequences = (
num_return_sequences if num_return_sequences is not None else self.config.num_return_sequences
)
decoder_start_token_id = decoder_start_token_id if decoder_start_token_id is not None else bos_token_id
decoder_start_token_id = (
decoder_start_token_id if decoder_start_token_id is not None else self.config.decoder_start_token_id
)
if input_ids is not None:
batch_size = input_ids.shape[0] # overriden by the input batch_size
@@ -831,9 +834,6 @@ class PreTrainedModel(nn.Module, ModuleUtilsMixin):
assert pad_token_id is None or (
isinstance(pad_token_id, int) and (pad_token_id >= 0)
), "`pad_token_id` should be a positive integer."
assert (
decoder_start_token_id is not None or self.config.is_encoder_decoder is False
), "`decoder_start_token_id` has to be defined if model is encoder-decoder model"
assert (eos_token_id is None) or (
isinstance(eos_token_id, int) and (eos_token_id >= 0)
), "`eos_token_id` should be a positive integer."
@@ -896,6 +896,21 @@ class PreTrainedModel(nn.Module, ModuleUtilsMixin):
effective_batch_size = batch_size
effective_batch_mult = 1
if self.config.is_encoder_decoder:
if decoder_start_token_id is None:
decoder_start_token_id = bos_token_id
assert (
decoder_start_token_id is not None
), "decoder_start_token_id or bos_token_id has to be defined for encoder-decoder generation"
assert hasattr(self, "get_encoder"), "{} should have a 'get_encoder' function defined".format(self)
assert callable(self.get_encoder), "{} should be a method".format(self.get_encoder)
# get encoder and store encoder outputs
encoder = self.get_encoder()
encoder_outputs = encoder(input_ids, attention_mask=attention_mask)
# Expand input ids if num_beams > 1 or num_return_sequences > 1
if num_return_sequences > 1 or num_beams > 1:
input_ids_len = input_ids.shape[-1]
@@ -912,15 +927,6 @@ class PreTrainedModel(nn.Module, ModuleUtilsMixin):
) # shape: (batch_size * num_return_sequences * num_beams, cur_len)
if self.config.is_encoder_decoder:
assert bos_token_id is not None, "Encoder Decoder Models need to have a bos_token_id"
assert hasattr(self, "get_encoder"), "{} should have a 'get_encoder' function defined".format(self)
assert callable(self.get_encoder), "{} should be a method".format(self.get_encoder)
# get encoder and store encoder outputs
encoder = self.get_encoder()
encoder_outputs = encoder(input_ids, attention_mask=attention_mask)
# create empty decoder_input_ids
input_ids = torch.full(
(effective_batch_size * num_beams, 1),
@@ -929,6 +935,18 @@ class PreTrainedModel(nn.Module, ModuleUtilsMixin):
device=next(self.parameters()).device,
)
cur_len = 1
batch_idx = self.encoder_outputs_batch_dim_idx
assert (
batch_size == encoder_outputs[0].shape[batch_idx]
), f"expected encoder_outputs[0] to have 1st dimension bs={batch_size}, got {encoder_outputs[0].shape[1]} "
expanded_idx = (
torch.arange(batch_size)
.view(-1, 1)
.repeat(1, num_beams * effective_batch_mult)
.view(-1)
.to(input_ids.device)
)
encoder_outputs = (encoder_outputs[0].index_select(batch_idx, expanded_idx), *encoder_outputs[1:])
else:
encoder_outputs = None
cur_len = input_ids.shape[-1]

View File

@@ -1040,3 +1040,98 @@ class XLMForQuestionAnswering(XLMPreTrainedModel):
outputs = outputs + transformer_outputs[1:] # Keep new_mems and attention/hidden states if they are here
return outputs
@add_start_docstrings(
"""XLM Model with a token classification head on top (a linear layer on top of
the hidden-states output) e.g. for Named-Entity-Recognition (NER) tasks. """,
XLM_START_DOCSTRING,
)
class XLMForTokenClassification(XLMPreTrainedModel):
def __init__(self, config):
super().__init__(config)
self.num_labels = config.num_labels
self.transformer = XLMModel(config)
self.dropout = nn.Dropout(config.dropout)
self.classifier = nn.Linear(config.hidden_size, config.num_labels)
self.init_weights()
@add_start_docstrings_to_callable(XLM_INPUTS_DOCSTRING)
def forward(
self,
input_ids=None,
attention_mask=None,
langs=None,
token_type_ids=None,
position_ids=None,
head_mask=None,
labels=None,
):
r"""
labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):
Labels for computing the token classification loss.
Indices should be in ``[0, ..., config.num_labels - 1]``.
Returns:
:obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers.XLMConfig`) and inputs:
loss (:obj:`torch.FloatTensor` of shape :obj:`(1,)`, `optional`, returned when ``labels`` is provided) :
Classification loss.
scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, config.num_labels)`)
Classification scores (before SoftMax).
hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):
Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)
of shape :obj:`(batch_size, sequence_length, hidden_size)`.
Hidden-states of the model at the output of each layer plus the initial embedding outputs.
attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):
Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape
:obj:`(batch_size, num_heads, sequence_length, sequence_length)`.
Attentions weights after the attention softmax, used to compute the weighted average in the self-attention
heads.
Examples::
from transformers import XLMTokenizer, XLMForTokenClassification
import torch
tokenizer = XLMTokenizer.from_pretrained('xlm-mlm-100-1280')
model = XLMForTokenClassification.from_pretrained('xlm-mlm-100-1280')
input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0) # Batch size 1
labels = torch.tensor([1] * input_ids.size(1)).unsqueeze(0) # Batch size 1
outputs = model(input_ids, labels=labels)
loss, scores = outputs[:2]
"""
outputs = self.transformer(
input_ids,
attention_mask=attention_mask,
langs=langs,
token_type_ids=token_type_ids,
position_ids=position_ids,
head_mask=head_mask,
)
sequence_output = outputs[0]
sequence_output = self.dropout(sequence_output)
logits = self.classifier(sequence_output)
outputs = (logits,) + outputs[2:] # add hidden states and attention if they are here
if labels is not None:
loss_fct = CrossEntropyLoss()
# Only keep active parts of the loss
if attention_mask is not None:
active_loss = attention_mask.view(-1) == 1
active_logits = logits.view(-1, self.num_labels)
active_labels = torch.where(
active_loss, labels.view(-1), torch.tensor(loss_fct.ignore_index).type_as(labels)
)
loss = loss_fct(active_logits, active_labels)
else:
loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))
outputs = (loss,) + outputs
return outputs # (loss), scores, (hidden_states), (attentions)

View File

@@ -31,6 +31,7 @@ from .configuration_auto import ALL_PRETRAINED_CONFIG_ARCHIVE_MAP, AutoConfig
from .configuration_bart import BartConfig
from .configuration_distilbert import DistilBertConfig
from .configuration_roberta import RobertaConfig
from .configuration_t5 import T5Config
from .configuration_utils import PretrainedConfig
from .configuration_xlm import XLMConfig
from .data import SquadExample, squad_convert_examples_to_features
@@ -60,7 +61,6 @@ if is_torch_available():
AutoModelForTokenClassification,
AutoModelWithLMHead,
)
from .modeling_bart import BartForConditionalGeneration
logger = logging.getLogger(__name__)
@@ -130,7 +130,9 @@ class PipelineDataFormat:
SUPPORTED_FORMATS = ["json", "csv", "pipe"]
def __init__(self, output_path: Optional[str], input_path: Optional[str], column: Optional[str], overwrite=False):
def __init__(
self, output_path: Optional[str], input_path: Optional[str], column: Optional[str], overwrite=False,
):
self.output_path = output_path
self.input_path = input_path
self.column = column.split(",") if column is not None else [""]
@@ -176,7 +178,7 @@ class PipelineDataFormat:
@staticmethod
def from_str(
format: str, output_path: Optional[str], input_path: Optional[str], column: Optional[str], overwrite=False
format: str, output_path: Optional[str], input_path: Optional[str], column: Optional[str], overwrite=False,
):
if format == "json":
return JsonPipelineDataFormat(output_path, input_path, column, overwrite=overwrite)
@@ -189,7 +191,9 @@ class PipelineDataFormat:
class CsvPipelineDataFormat(PipelineDataFormat):
def __init__(self, output_path: Optional[str], input_path: Optional[str], column: Optional[str], overwrite=False):
def __init__(
self, output_path: Optional[str], input_path: Optional[str], column: Optional[str], overwrite=False,
):
super().__init__(output_path, input_path, column, overwrite=overwrite)
def __iter__(self):
@@ -210,7 +214,9 @@ class CsvPipelineDataFormat(PipelineDataFormat):
class JsonPipelineDataFormat(PipelineDataFormat):
def __init__(self, output_path: Optional[str], input_path: Optional[str], column: Optional[str], overwrite=False):
def __init__(
self, output_path: Optional[str], input_path: Optional[str], column: Optional[str], overwrite=False,
):
super().__init__(output_path, input_path, column, overwrite=overwrite)
with open(input_path, "r") as f:
@@ -336,6 +342,7 @@ class Pipeline(_ScikitCompat):
tokenizer: PreTrainedTokenizer,
modelcard: Optional[ModelCard] = None,
framework: Optional[str] = None,
task: str = "",
args_parser: ArgumentHandler = None,
device: int = -1,
binary_output: bool = False,
@@ -356,6 +363,11 @@ class Pipeline(_ScikitCompat):
if self.framework == "pt" and self.device.type == "cuda":
self.model = self.model.to(self.device)
# Update config with task specific parameters
task_specific_params = self.model.config.task_specific_params
if task_specific_params is not None and task in task_specific_params:
self.model.config.update(task_specific_params.get(task))
def save_pretrained(self, save_directory):
"""
Save the pipeline's model and tokenizer to the specified save_directory
@@ -420,7 +432,7 @@ class Pipeline(_ScikitCompat):
"""
args = ["input_ids", "attention_mask"]
if not isinstance(self.model.config, (DistilBertConfig, XLMConfig, RobertaConfig, BartConfig)):
if not isinstance(self.model.config, (DistilBertConfig, XLMConfig, RobertaConfig, BartConfig, T5Config)):
args += ["token_type_ids"]
# PR #1548 (CLI) There is an issue with attention_mask
@@ -432,14 +444,18 @@ class Pipeline(_ScikitCompat):
else:
return {k: [feature[k] for feature in features] for k in args}
def _parse_and_tokenize(self, *texts, **kwargs):
def _parse_and_tokenize(self, *texts, pad_to_max_length=False, **kwargs):
"""
Parse arguments and tokenize
"""
# Parse arguments
inputs = self._args_parser(*texts, **kwargs)
inputs = self.tokenizer.batch_encode_plus(
inputs, add_special_tokens=True, return_tensors=self.framework, max_length=self.tokenizer.max_len
inputs,
add_special_tokens=True,
return_tensors=self.framework,
max_length=self.tokenizer.max_len,
pad_to_max_length=pad_to_max_length,
)
# Filter out features not available on specific models
@@ -520,6 +536,7 @@ class FeatureExtractionPipeline(Pipeline):
framework: Optional[str] = None,
args_parser: ArgumentHandler = None,
device: int = -1,
task: str = "",
):
super().__init__(
model=model,
@@ -529,6 +546,7 @@ class FeatureExtractionPipeline(Pipeline):
args_parser=args_parser,
device=device,
binary_output=True,
task=task,
)
def __call__(self, *args, **kwargs):
@@ -625,6 +643,7 @@ class FillMaskPipeline(Pipeline):
args_parser: ArgumentHandler = None,
device: int = -1,
topk=5,
task: str = "",
):
super().__init__(
model=model,
@@ -634,6 +653,7 @@ class FillMaskPipeline(Pipeline):
args_parser=args_parser,
device=device,
binary_output=True,
task=task,
)
self.topk = topk
@@ -725,6 +745,7 @@ class NerPipeline(Pipeline):
device: int = -1,
binary_output: bool = False,
ignore_labels=["O"],
task: str = "",
):
super().__init__(
model=model,
@@ -734,6 +755,7 @@ class NerPipeline(Pipeline):
args_parser=args_parser,
device=device,
binary_output=binary_output,
task=task,
)
self._basic_tokenizer = BasicTokenizer(do_lower_case=False)
@@ -896,6 +918,7 @@ class QuestionAnsweringPipeline(Pipeline):
modelcard: Optional[ModelCard] = None,
framework: Optional[str] = None,
device: int = -1,
task: str = "",
**kwargs
):
super().__init__(
@@ -905,6 +928,7 @@ class QuestionAnsweringPipeline(Pipeline):
framework=framework,
args_parser=QuestionAnsweringArgumentHandler(),
device=device,
task=task,
**kwargs,
)
@@ -1102,7 +1126,11 @@ class QuestionAnsweringPipeline(Pipeline):
chars_idx += len(word) + 1
# Join text with spaces
return {"answer": " ".join(words), "start": max(0, char_start_idx), "end": min(len(text), char_end_idx)}
return {
"answer": " ".join(words),
"start": max(0, char_start_idx),
"end": min(len(text), char_end_idx),
}
class SummarizationPipeline(Pipeline):
@@ -1111,12 +1139,16 @@ class SummarizationPipeline(Pipeline):
Usage::
# use bart in pytorch
summarizer = pipeline("summarization")
summarizer("Sam Shleifer writes the best docstring examples in the whole world.")
summarizer("Sam Shleifer writes the best docstring examples in the whole world.", min_length=5, max_length=20)
# use t5 in tf
summarizer = pipeline("summarization", model="t5-base", tokenizer="t5-base", framework="tf")
summarizer("Sam Shleifer writes the best docstring examples in the whole world.", min_length=5, max_length=20)
Supported Models:
The models that this pipeline can use are models that have been fine-tuned on a summarization task, which is
currently only ``BartForConditionalGeneration.from_pretrained('bart-large-cnn')``
The models that this pipeline can use are models that have been fine-tuned on a summarization task, which is currently, '`bart-large-cnn`', '`t5-small`', '`t5-base`', '`t5-large`', '`t5-3b`', '`t5-11b`'.
Arguments:
model (:obj:`str` or :obj:`~transformers.PreTrainedModel` or :obj:`~transformers.TFPreTrainedModel`, `optional`, defaults to :obj:`None`):
@@ -1147,17 +1179,8 @@ class SummarizationPipeline(Pipeline):
on the associated CUDA device id.
"""
task = "summarization"
def __call__(
self,
*documents,
return_tensors=False,
return_text=True,
max_length=142,
min_length=21,
clean_up_tokenization_spaces=False,
**generate_kwargs
self, *documents, return_tensors=False, return_text=True, clean_up_tokenization_spaces=False, **generate_kwargs
):
r"""
Args:
@@ -1165,10 +1188,6 @@ class SummarizationPipeline(Pipeline):
return_text: (bool, default=True) whether to add a decoded "summary_text" to each result
return_tensors: (bool, default=False) whether to return the raw "summary_token_ids" to each result
max_length: (`optional`) int
The max length of the sequence to be generated. Does not include tokens in input_ids.
min_len: (`optional`) int
no_repeat_ngram_size: (`optional`) int. ban ngrams of this length from being repeated in the generated text
clean_up_tokenization_spaces: (`optional`) bool whether to include extra spaces in the output
**generate_kwargs: extra kwargs passed to `self.model.generate`_
@@ -1180,19 +1199,60 @@ class SummarizationPipeline(Pipeline):
"""
assert return_tensors or return_text, "You must specify return_tensors=True or return_text=True"
if self.framework == "tf":
raise NotImplementedError("Tensorflow not supported")
with self.device_placement():
inputs = self._parse_and_tokenize(*documents)
inputs = self.ensure_tensor_on_device(**inputs)
summaries = self.model.generate(
inputs["input_ids"],
attention_mask=inputs["attention_mask"],
max_length=max_length,
min_length=min_length,
do_sample=False,
**generate_kwargs,
assert len(documents) > 0, "Please provide a document to summarize"
if self.framework == "tf" and "BartForConditionalGeneration" in self.model.__class__.__name__:
raise NotImplementedError(
"Tensorflow is not yet supported for Bart. Please consider using T5, e.g. `t5-base`"
)
prefix = self.model.config.prefix if self.model.config.prefix is not None else ""
if isinstance(documents[0], list):
assert (
self.tokenizer.pad_token_id is not None
), "Please make sure that the tokenizer has a pad_token_id when using a batch input"
documents = ([prefix + document for document in documents[0]],)
pad_to_max_length = True
elif isinstance(documents[0], str):
documents = (prefix + documents[0],)
pad_to_max_length = False
else:
raise ValueError(
" `documents[0]`: {} have the wrong format. The should be either of type `str` or type `list`".format(
documents[0]
)
)
with self.device_placement():
inputs = self._parse_and_tokenize(*documents, pad_to_max_length=pad_to_max_length)
if self.framework == "pt":
inputs = self.ensure_tensor_on_device(**inputs)
input_length = inputs["input_ids"].shape[-1]
elif self.framework == "tf":
input_length = tf.shape(inputs["input_ids"])[-1].numpy()
if input_length < self.model.config.min_length // 2:
logger.warning(
"Your min_length is set to {}, but you input_length is only {}. You might consider decreasing min_length manually, e.g. summarizer('...', min_length=10)".format(
self.model.config.min_length, input_length
)
)
if input_length < self.model.config.max_length:
logger.warning(
"Your max_length is set to {}, but you input_length is only {}. You might consider decreasing max_length manually, e.g. summarizer('...', max_length=50)".format(
self.model.config.max_length, input_length
)
)
summaries = self.model.generate(
inputs["input_ids"], attention_mask=inputs["attention_mask"], **generate_kwargs,
)
results = []
for summary in summaries:
record = {}
@@ -1200,7 +1260,115 @@ class SummarizationPipeline(Pipeline):
record["summary_token_ids"] = summary
if return_text:
record["summary_text"] = self.tokenizer.decode(
summary, skip_special_tokens=True, clean_up_tokenization_spaces=clean_up_tokenization_spaces
summary, skip_special_tokens=True, clean_up_tokenization_spaces=clean_up_tokenization_spaces,
)
results.append(record)
return results
class TranslationPipeline(Pipeline):
"""
Translates from one language to another.
Usage::
en_fr_translator = pipeline("translation_en_to_fr")
en_fr_translator("How old are you?")
Supported Models: "t5-small", "t5-base", "t5-large", "t5-3b", "t5-11b"
Arguments:
model (:obj:`str` or :obj:`~transformers.PreTrainedModel` or :obj:`~transformers.TFPreTrainedModel`, `optional`, defaults to :obj:`None`):
The model that will be used by the pipeline to make predictions. This can be :obj:`None`, a string
checkpoint identifier or an actual pre-trained model inheriting from
:class:`~transformers.PreTrainedModel` for PyTorch and :class:`~transformers.TFPreTrainedModel` for
TensorFlow.
If :obj:`None`, the default of the pipeline will be loaded.
tokenizer (:obj:`str` or :obj:`~transformers.PreTrainedTokenizer`, `optional`, defaults to :obj:`None`):
The tokenizer that will be used by the pipeline to encode data for the model. This can be :obj:`None`,
a string checkpoint identifier or an actual pre-trained tokenizer inheriting from
:class:`~transformers.PreTrainedTokenizer`.
If :obj:`None`, the default of the pipeline will be loaded.
modelcard (:obj:`str` or :class:`~transformers.ModelCard`, `optional`, defaults to :obj:`None`):
Model card attributed to the model for this pipeline.
framework (:obj:`str`, `optional`, defaults to :obj:`None`):
The framework to use, either "pt" for PyTorch or "tf" for TensorFlow. The specified framework must be
installed.
If no framework is specified, will default to the one currently installed. If no framework is specified
and both frameworks are installed, will default to PyTorch.
args_parser (:class:`~transformers.pipelines.ArgumentHandler`, `optional`, defaults to :obj:`None`):
Reference to the object in charge of parsing supplied pipeline parameters.
device (:obj:`int`, `optional`, defaults to :obj:`-1`):
Device ordinal for CPU/GPU supports. Setting this to -1 will leverage CPU, >=0 will run the model
on the associated CUDA device id.
"""
def __call__(
self, *texts, return_tensors=False, return_text=True, clean_up_tokenization_spaces=False, **generate_kwargs
):
r"""
Args:
*texts: (list of strings) texts to be translated
return_text: (bool, default=True) whether to add a decoded "translation_text" to each result
return_tensors: (bool, default=False) whether to return the raw "translation_token_ids" to each result
**generate_kwargs: extra kwargs passed to `self.model.generate`_
Returns:
list of dicts with 'translation_text' and/or 'translation_token_ids' for each text_to_translate
.. _`self.model.generate`:
https://huggingface.co/transformers/model_doc/bart.html#transformers.BartForConditionalGeneration.generate
"""
assert return_tensors or return_text, "You must specify return_tensors=True or return_text=True"
prefix = self.model.config.prefix if self.model.config.prefix is not None else ""
if isinstance(texts[0], list):
assert (
self.tokenizer.pad_token_id is not None
), "Please make sure that the tokenizer has a pad_token_id when using a batch input"
texts = ([prefix + text for text in texts[0]],)
pad_to_max_length = True
elif isinstance(texts[0], str):
texts = (prefix + texts[0],)
pad_to_max_length = False
else:
raise ValueError(
" `documents[0]`: {} have the wrong format. The should be either of type `str` or type `list`".format(
texts[0]
)
)
with self.device_placement():
inputs = self._parse_and_tokenize(*texts, pad_to_max_length=pad_to_max_length)
if self.framework == "pt":
inputs = self.ensure_tensor_on_device(**inputs)
input_length = inputs["input_ids"].shape[-1]
elif self.framework == "tf":
input_length = tf.shape(inputs["input_ids"])[-1].numpy()
if input_length > 0.9 * self.model.config.max_length:
logger.warning(
"Your input_length: {} is bigger than 0.9 * max_length: {}. You might consider increasing your max_length manually, e.g. translator('...', max_length=400)".format(
input_length, self.model.config.max_length
)
)
translations = self.model.generate(
inputs["input_ids"], attention_mask=inputs["attention_mask"], **generate_kwargs,
)
results = []
for translation in translations:
record = {}
if return_tensors:
record["translation_token_ids"] = translation
if return_text:
record["translation_text"] = self.tokenizer.decode(
translation,
skip_special_tokens=True,
clean_up_tokenization_spaces=clean_up_tokenization_spaces,
)
results.append(record)
return results
@@ -1266,14 +1434,44 @@ SUPPORTED_TASKS = {
},
"summarization": {
"impl": SummarizationPipeline,
"pt": BartForConditionalGeneration if is_torch_available() else None,
"tf": None,
"tf": TFAutoModelWithLMHead if is_tf_available() else None,
"pt": AutoModelWithLMHead if is_torch_available() else None,
"default": {
"model": {"pt": "bart-large-cnn", "tf": None},
"config": None,
"tokenizer": ("bart-large-cnn", {"use_fast": False}),
},
},
"translation_en_to_fr": {
"impl": TranslationPipeline,
"tf": TFAutoModelWithLMHead if is_tf_available() else None,
"pt": AutoModelWithLMHead if is_torch_available() else None,
"default": {
"model": {"pt": "t5-base", "tf": "t5-base"},
"config": None,
"tokenizer": ("t5-base", {"use_fast": False}),
},
},
"translation_en_to_de": {
"impl": TranslationPipeline,
"tf": TFAutoModelWithLMHead if is_tf_available() else None,
"pt": AutoModelWithLMHead if is_torch_available() else None,
"default": {
"model": {"pt": "t5-base", "tf": "t5-base"},
"config": None,
"tokenizer": ("t5-base", {"use_fast": False}),
},
},
"translation_en_to_ro": {
"impl": TranslationPipeline,
"tf": TFAutoModelWithLMHead if is_tf_available() else None,
"pt": AutoModelWithLMHead if is_torch_available() else None,
"default": {
"model": {"pt": "t5-base", "tf": "t5-base"},
"config": None,
"tokenizer": ("t5-base", {"use_fast": False}),
},
},
}
@@ -1361,7 +1559,7 @@ def pipeline(
framework = framework or get_framework(model)
targeted_task = SUPPORTED_TASKS[task]
task, model_class = targeted_task["impl"], targeted_task[framework]
task_class, model_class = targeted_task["impl"], targeted_task[framework]
# Use default model/config/tokenizer for the task if no model is provided
if model is None:
@@ -1422,4 +1620,4 @@ def pipeline(
)
model = model_class.from_pretrained(model, config=config, **model_kwargs)
return task(model=model, tokenizer=tokenizer, modelcard=modelcard, framework=framework, **kwargs)
return task_class(model=model, tokenizer=tokenizer, modelcard=modelcard, framework=framework, task=task, **kwargs,)

View File

@@ -19,7 +19,7 @@ from .tokenization_roberta import RobertaTokenizer
# vocab and merges same as roberta
vocab_url = "https://s3.amazonaws.com/models.huggingface.co/bert/roberta-large-vocab.json"
merges_url = "https://s3.amazonaws.com/models.huggingface.co/bert/roberta-large-merges.txt"
_all_bart_models = ["bart-large", "bart-large-mnli", "bart-large-cnn"]
_all_bart_models = ["bart-large", "bart-large-mnli", "bart-large-cnn", "bart-large-xsum"]
class BartTokenizer(RobertaTokenizer):

View File

@@ -61,14 +61,34 @@ PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = {
class T5Tokenizer(PreTrainedTokenizer):
"""
SentencePiece based tokenizer. Peculiarities:
Constructs an XLNet tokenizer. Based on `SentencePiece <https://github.com/google/sentencepiece>`__ .
- requires `SentencePiece <https://github.com/google/sentencepiece>`_
- `extra_ids` add a number of extra ids added to the end of the vocabulary for use as sentinels.
These tokens are accessible as `<extra_id_{%d}>` where `{%d}` is a number between 0 and extra_ids-1.
Extra tokens are indexed from the end of the vocabulary up to beginnning (<extra_id_0> is the last token in the vocabulary)
(like in T5 preprocessing
This tokenizer inherits from :class:`~transformers.PreTrainedTokenizer` which contains most of the methods. Users
should refer to the superclass for more information regarding methods.
Args:
vocab_file (:obj:`string`):
`SentencePiece <https://github.com/google/sentencepiece>`__ file (generally has a `.spm` extension) that
contains the vocabulary necessary to instantiate a tokenizer.
eos_token (:obj:`string`, `optional`, defaults to "</s>"):
The end of sequence token.
.. note::
When building a sequence using special tokens, this is not the token that is used for the end
of sequence. The token used is the :obj:`sep_token`.
unk_token (:obj:`string`, `optional`, defaults to "<unk>"):
The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this
token instead.
pad_token (:obj:`string`, `optional`, defaults to "<pad>"):
The token used for padding, for example when batching sequences of different lengths.
extra_ids (:obj:`List[str]`, `optional`, defaults to :obj:`100`):
Add a number of extra ids added to the end of the vocabulary for use as sentinels.
These tokens are accessible as "<extra_id_{%d}>" where "{%d}" is a number between 0 and extra_ids-1.
Extra tokens are indexed from the end of the vocabulary up to beginnning ("<extra_id_0>" is the last token in the vocabulary like in T5 preprocessing
see: https://github.com/google-research/text-to-text-transfer-transformer/blob/9fd7b14a769417be33bc6c850f9598764913c833/t5/data/preprocessors.py#L2117)
additional_special_tokens (:obj:`List[str]`, `optional`, defaults to :obj:`None`):
Additional special tokens used by the tokenizer.
"""
vocab_files_names = VOCAB_FILES_NAMES

View File

@@ -1997,3 +1997,14 @@ class PreTrainedTokenizerFast(PreTrainedTokenizer):
files = self._tokenizer.save(folder, name=file)
return tuple(files)
def trim_batch(
input_ids, pad_token_id, attention_mask=None,
):
"""Remove columns that are populated exclusively by pad_token_id"""
keep_column_mask = input_ids.ne(pad_token_id).any(dim=0)
if attention_mask is None:
return input_ids[:, keep_column_mask]
else:
return (input_ids[:, keep_column_mask], attention_mask[:, keep_column_mask])

View File

@@ -37,6 +37,8 @@ if is_torch_available():
BertForSequenceClassification,
AutoModelForQuestionAnswering,
BertForQuestionAnswering,
AutoModelForTokenClassification,
BertForTokenClassification,
)
from transformers.modeling_bert import BERT_PRETRAINED_MODEL_ARCHIVE_MAP
from transformers.modeling_auto import (
@@ -109,7 +111,7 @@ class AutoModelTest(unittest.TestCase):
self.assertIsNotNone(model)
self.assertIsInstance(model, BertForSequenceClassification)
# @slow
@slow
def test_question_answering_model_from_pretrained(self):
logging.basicConfig(level=logging.INFO)
for model_name in list(BERT_PRETRAINED_MODEL_ARCHIVE_MAP.keys())[:1]:
@@ -122,6 +124,19 @@ class AutoModelTest(unittest.TestCase):
self.assertIsNotNone(model)
self.assertIsInstance(model, BertForQuestionAnswering)
@slow
def test_token_classification_model_from_pretrained(self):
logging.basicConfig(level=logging.INFO)
for model_name in list(BERT_PRETRAINED_MODEL_ARCHIVE_MAP.keys())[:1]:
config = AutoConfig.from_pretrained(model_name)
self.assertIsNotNone(config)
self.assertIsInstance(config, BertConfig)
model = AutoModelForTokenClassification.from_pretrained(model_name)
model, loading_info = AutoModelForTokenClassification.from_pretrained(model_name, output_loading_info=True)
self.assertIsNotNone(model)
self.assertIsInstance(model, BertForTokenClassification)
def test_from_pretrained_identifier(self):
logging.basicConfig(level=logging.INFO)
model = AutoModelWithLMHead.from_pretrained(SMALL_MODEL_IDENTIFIER)

View File

@@ -36,8 +36,8 @@ if is_torch_available():
from transformers.modeling_bart import (
BART_PRETRAINED_MODEL_ARCHIVE_MAP,
shift_tokens_right,
invert_mask,
_prepare_bart_decoder_inputs,
LARGE_NEGATIVE,
)
from transformers.tokenization_bart import BartTokenizer
@@ -113,7 +113,8 @@ class BARTModelTest(ModelTesterMixin, unittest.TestCase):
test_pruning = False
test_torchscript = False
test_head_masking = False
test_resize_embeddings = False # This requires inputs_dict['input_ids']
test_resize_embeddings = True # This requires inputs_dict['input_ids']
test_missing_keys = False # because BartForConditionalGeneration and BartModel now have identical state_dict
def setUp(self):
self.model_tester = ModelTester(self)
@@ -122,10 +123,9 @@ class BARTModelTest(ModelTesterMixin, unittest.TestCase):
def test_config(self):
self.config_tester.run_common_tests()
def test_advanced_inputs(self):
def test_initialization_more(self):
# (config, input_ids, token_type_ids, input_mask, *unused) = \
config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()
decoder_input_ids, decoder_attn_mask = _prepare_bart_decoder_inputs(config, inputs_dict["input_ids"])
model = BartModel(config)
model.to(torch_device)
model.eval()
@@ -141,9 +141,17 @@ class BARTModelTest(ModelTesterMixin, unittest.TestCase):
_check_var(model.encoder.layers[0].fc1)
_check_var(model.encoder.embed_positions)
def test_advanced_inputs(self):
config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()
inputs_dict["input_ids"][:, -2:] = config.pad_token_id
decoder_input_ids, decoder_attn_mask, causal_mask = _prepare_bart_decoder_inputs(
config, inputs_dict["input_ids"]
)
model = BartModel(config).to(torch_device).eval()
decoder_features_with_created_mask = model(**inputs_dict)[0]
decoder_features_with_passed_mask = model(
decoder_attention_mask=decoder_attn_mask, decoder_input_ids=decoder_input_ids, **inputs_dict
decoder_attention_mask=invert_mask(decoder_attn_mask), decoder_input_ids=decoder_input_ids, **inputs_dict
)[0]
_assert_tensors_equal(decoder_features_with_passed_mask, decoder_features_with_created_mask)
useless_mask = torch.zeros_like(decoder_attn_mask)
@@ -237,7 +245,7 @@ class BartHeadTests(unittest.TestCase):
lm_labels = ids_tensor([batch_size, input_ids.shape[1]], self.vocab_size).to(torch_device)
lm_model = BartForConditionalGeneration(config)
lm_model.to(torch_device)
loss, logits, enc_features = lm_model(input_ids=input_ids, lm_labels=lm_labels, decoder_input_ids=input_ids)
loss, logits, enc_features = lm_model(input_ids=input_ids, lm_labels=lm_labels)
expected_shape = (batch_size, input_ids.shape[1], config.vocab_size)
self.assertEqual(logits.shape, expected_shape)
self.assertIsInstance(loss.item(), float)
@@ -335,41 +343,39 @@ class BartHeadTests(unittest.TestCase):
model.generate(num_beams=4, do_sample=True, early_stopping=False, num_return_sequences=3)
def test_dummy_inputs(self):
config, *_ = self._get_config_and_data(output_past=True)
config, *_ = self._get_config_and_data()
model = BartForConditionalGeneration(config).eval().to(torch_device)
model(**model.dummy_inputs)
def test_prepare_bart_decoder_inputs(self):
config, *_ = self._get_config_and_data(output_past=False)
input_ids = _long_tensor(([4, 4, 2])) # only used for .device if decoder_input_ids is passed
input_ids = _long_tensor(([4, 4, 2]))
decoder_input_ids = _long_tensor([[26388, 2, config.pad_token_id]])
ignore = LARGE_NEGATIVE
decoder_input_ids, decoder_attn_mask = _prepare_bart_decoder_inputs(config, input_ids, decoder_input_ids)
expected_mask = torch.tensor(
[
[0, ignore, ignore],
[0, 0, ignore],
[ignore, ignore, ignore], # never attend to the final token, because its pad
]
).to(input_ids.device)
self.assertEqual(decoder_attn_mask.size(), (1, 1, 3, 3))
self.assertTrue(torch.eq(expected_mask, decoder_attn_mask).all())
# Test no causal mask
config, *_ = self._get_config_and_data(output_past=True)
expected_just_padding_mask = torch.tensor(
[[0, 0, 0], [0, 0, 0], [ignore, ignore, ignore]] # never attend to the final token, because its pad
).to(input_ids.device)
_, decoder_attn_mask_no_causal_mask = _prepare_bart_decoder_inputs(config, input_ids, decoder_input_ids)
self.assertEqual(decoder_attn_mask_no_causal_mask.size(), (1, 1, 3, 3))
self.assertTrue(torch.eq(expected_just_padding_mask, decoder_attn_mask_no_causal_mask).all())
decoder_input_ids = _long_tensor([[0, 26388, 4133, 2]])
# Attend to everything if no pad tokens and no causal mask
_, decoder_attn_mask_no_padding_no_causal_mask = _prepare_bart_decoder_inputs(
ignore = float("-inf")
decoder_input_ids, decoder_attn_mask, causal_mask = _prepare_bart_decoder_inputs(
config, input_ids, decoder_input_ids
)
self.assertTrue(torch.eq(decoder_attn_mask_no_padding_no_causal_mask, 0).all())
expected_causal_mask = torch.tensor(
[[0, ignore, ignore], [0, 0, ignore], [0, 0, 0]] # never attend to the final token, because its pad
).to(input_ids.device)
self.assertEqual(decoder_attn_mask.size(), decoder_input_ids.size())
self.assertTrue(torch.eq(expected_causal_mask, causal_mask).all())
def test_resize_tokens_embeddings_more(self):
config, input_ids, _ = self._get_config_and_data()
def _get_embs(m):
return (m.get_input_embeddings().weight.data.clone(), m.get_output_embeddings().weight.data.clone())
model = BartForConditionalGeneration(config).eval().to(torch_device)
input, output = _get_embs(model)
self.assertTrue(torch.eq(input, output).all())
new_vocab_size = 45
model.resize_token_embeddings(new_vocab_size)
input_new, output_new = _get_embs(model)
self.assertEqual(input_new.shape, (new_vocab_size, config.d_model))
self.assertEqual(output_new.shape, (new_vocab_size, config.d_model))
self.assertTrue(torch.eq(input_new, output_new).all())
def _assert_tensors_equal(a, b, atol=1e-12, prefix=""):
@@ -444,6 +450,38 @@ class BartModelIntegrationTests(unittest.TestCase):
model = BartModel.from_pretrained(model_name, cache_dir=CACHE_DIR)
self.assertIsNotNone(model)
@slow
def test_xsum_summarization_same_as_fairseq(self):
model = BartForConditionalGeneration.from_pretrained("bart-large-xsum").to(torch_device)
tok = BartTokenizer.from_pretrained("bart-large")
PGE_ARTICLE = """ PG&E stated it scheduled the blackouts in response to forecasts for high winds amid dry conditions. The aim is to reduce the risk of wildfires. Nearly 800 thousand customers were scheduled to be affected by the shutoffs which were expected to last through at least midday tomorrow."""
EXPECTED_SUMMARY = "California's largest power company has begun shutting off power to tens of thousands of homes and businesses in the state."
dct = tok.batch_encode_plus([PGE_ARTICLE], max_length=1024, pad_to_max_length=True, return_tensors="pt",)
hypotheses_batch = model.generate(
input_ids=dct["input_ids"].to(torch_device),
attention_mask=dct["attention_mask"].to(torch_device),
num_beams=2,
max_length=62,
min_length=11,
length_penalty=1.0,
no_repeat_ngram_size=3,
early_stopping=True,
decoder_start_token_id=model.config.eos_token_id,
)
decoded = [
tok.decode(g, skip_special_tokens=True, clean_up_tokenization_spaces=False) for g in hypotheses_batch
]
self.assertEqual(EXPECTED_SUMMARY, decoded[0])
def test_xsum_config_generation_params(self):
config = BartConfig.from_pretrained("bart-large-xsum")
expected_params = dict(num_beams=6, do_sample=False, early_stopping=True, length_penalty=1.0)
config_params = {k: getattr(config, k, "MISSING") for k, v in expected_params.items()}
self.assertDictEqual(expected_params, config_params)
@slow
def test_cnn_summarization_same_as_fairseq(self):
hf = BartForConditionalGeneration.from_pretrained("bart-large-cnn", output_past=True,).to(torch_device)

View File

@@ -58,6 +58,7 @@ class ModelTesterMixin:
test_pruning = True
test_resize_embeddings = True
test_head_masking = True
test_missing_keys = True
is_encoder_decoder = False
def test_save_load(self):
@@ -527,6 +528,8 @@ class ModelTesterMixin:
self.assertTrue(x is None or isinstance(x, torch.nn.Linear))
def test_correct_missing_keys(self):
if not self.test_missing_keys:
return
config, _ = self.model_tester.prepare_config_and_inputs_for_common()
for model_class in self.all_model_classes:

View File

@@ -24,6 +24,7 @@ from .utils import CACHE_DIR, require_torch, slow, torch_device
if is_torch_available():
import torch
from transformers import T5Config, T5Model, T5ForConditionalGeneration
from transformers.modeling_t5 import T5_PRETRAINED_MODEL_ARCHIVE_MAP
@@ -57,8 +58,9 @@ class T5ModelTest(ModelTesterMixin, unittest.TestCase):
relative_attention_num_buckets=8,
dropout_rate=0.1,
initializer_factor=0.002,
eos_token_ids=[1],
eos_token_id=1,
pad_token_id=0,
decoder_start_token_id=0,
scope=None,
):
self.parent = parent
@@ -78,8 +80,9 @@ class T5ModelTest(ModelTesterMixin, unittest.TestCase):
self.dropout_rate = dropout_rate
self.initializer_factor = initializer_factor
self.scope = scope
self.eos_token_ids = eos_token_ids
self.eos_token_id = eos_token_id
self.pad_token_id = pad_token_id
self.decoder_start_token_id = decoder_start_token_id
def prepare_config_and_inputs(self):
input_ids = ids_tensor([self.batch_size, self.encoder_seq_length], self.vocab_size)
@@ -106,9 +109,10 @@ class T5ModelTest(ModelTesterMixin, unittest.TestCase):
relative_attention_num_buckets=self.relative_attention_num_buckets,
dropout_rate=self.dropout_rate,
initializer_factor=self.initializer_factor,
eos_token_ids=self.eos_token_ids,
eos_token_id=self.eos_token_id,
bos_token_id=self.pad_token_id,
pad_token_id=self.pad_token_id,
decoder_start_token_id=self.decoder_start_token_id,
)
return (
@@ -123,6 +127,39 @@ class T5ModelTest(ModelTesterMixin, unittest.TestCase):
def check_loss_output(self, result):
self.parent.assertListEqual(list(result["loss"].size()), [])
def check_prepare_lm_labels_via_shift_left(
self, config, input_ids, decoder_input_ids, attention_mask, decoder_attention_mask, lm_labels,
):
model = T5Model(config=config)
model.to(torch_device)
model.eval()
# make sure that lm_labels are correctly padded from the right
lm_labels.masked_fill_((lm_labels == self.decoder_start_token_id), self.eos_token_id)
# add casaul pad token mask
triangular_mask = torch.tril(lm_labels.new_ones(lm_labels.shape)).logical_not()
lm_labels.masked_fill_(triangular_mask, self.pad_token_id)
decoder_input_ids = model._shift_right(lm_labels)
for i, (decoder_input_ids_slice, lm_labels_slice) in enumerate(zip(decoder_input_ids, lm_labels)):
# first item
self.parent.assertEqual(decoder_input_ids_slice[0].item(), self.decoder_start_token_id)
if i < decoder_input_ids_slice.shape[-1]:
if i < decoder_input_ids.shape[-1] - 1:
# items before diagonal
self.parent.assertListEqual(
decoder_input_ids_slice[1 : i + 1].tolist(), lm_labels_slice[:i].tolist()
)
# pad items after diagonal
if i < decoder_input_ids.shape[-1] - 2:
self.parent.assertListEqual(
decoder_input_ids_slice[i + 2 :].tolist(), lm_labels_slice[i + 1 : -1].tolist()
)
else:
# all items after square
self.parent.assertListEqual(decoder_input_ids_slice[1:].tolist(), lm_labels_slice[:-1].tolist())
def create_and_check_t5_model(
self, config, input_ids, decoder_input_ids, attention_mask, decoder_attention_mask, lm_labels,
):
@@ -197,6 +234,10 @@ class T5ModelTest(ModelTesterMixin, unittest.TestCase):
def test_config(self):
self.config_tester.run_common_tests()
def test_shift_right(self):
config_and_inputs = self.model_tester.prepare_config_and_inputs()
self.model_tester.check_prepare_lm_labels_via_shift_left(*config_and_inputs)
def test_t5_model(self):
config_and_inputs = self.model_tester.prepare_config_and_inputs()
self.model_tester.create_and_check_t5_model(*config_and_inputs)

View File

@@ -52,7 +52,7 @@ class TFT5ModelTest(TFModelTesterMixin, unittest.TestCase):
relative_attention_num_buckets=8,
dropout_rate=0.1,
initializer_factor=0.002,
eos_token_ids=[1],
eos_token_id=1,
pad_token_id=0,
scope=None,
):
@@ -71,7 +71,7 @@ class TFT5ModelTest(TFModelTesterMixin, unittest.TestCase):
self.relative_attention_num_buckets = relative_attention_num_buckets
self.dropout_rate = dropout_rate
self.initializer_factor = initializer_factor
self.eos_token_ids = eos_token_ids
self.eos_token_id = eos_token_id
self.pad_token_id = pad_token_id
self.scope = scope
@@ -97,7 +97,7 @@ class TFT5ModelTest(TFModelTesterMixin, unittest.TestCase):
relative_attention_num_buckets=self.relative_attention_num_buckets,
dropout_rate=self.dropout_rate,
initializer_factor=self.initializer_factor,
eos_token_ids=self.eos_token_ids,
eos_token_id=self.eos_token_id,
bos_token_id=self.pad_token_id,
pad_token_id=self.pad_token_id,
)

View File

@@ -29,6 +29,7 @@ if is_torch_available():
XLMConfig,
XLMModel,
XLMWithLMHeadModel,
XLMForTokenClassification,
XLMForQuestionAnswering,
XLMForSequenceClassification,
XLMForQuestionAnsweringSimple,
@@ -350,6 +351,32 @@ class XLMModelTest(ModelTesterMixin, unittest.TestCase):
list(result["logits"].size()), [self.batch_size, self.type_sequence_label_size]
)
def create_and_check_xlm_for_token_classification(
self,
config,
input_ids,
token_type_ids,
input_lengths,
sequence_labels,
token_labels,
is_impossible_labels,
input_mask,
):
config.num_labels = self.num_labels
model = XLMForTokenClassification(config)
model.to(torch_device)
model.eval()
loss, logits = model(input_ids, attention_mask=input_mask, labels=token_labels)
result = {
"loss": loss,
"logits": logits,
}
self.parent.assertListEqual(
list(result["logits"].size()), [self.batch_size, self.seq_length, self.num_labels]
)
self.check_loss_output(result)
def prepare_config_and_inputs_for_common(self):
config_and_inputs = self.prepare_config_and_inputs()
(
@@ -392,6 +419,10 @@ class XLMModelTest(ModelTesterMixin, unittest.TestCase):
config_and_inputs = self.model_tester.prepare_config_and_inputs()
self.model_tester.create_and_check_xlm_sequence_classif(*config_and_inputs)
def test_xlm_for_token_classification(self):
config_and_inputs = self.model_tester.prepare_config_and_inputs()
self.model_tester.create_and_check_xlm_for_token_classification(*config_and_inputs)
@slow
def test_model_from_pretrained(self):
for model_name in list(XLM_PRETRAINED_MODEL_ARCHIVE_MAP.keys())[:1]:

View File

@@ -78,6 +78,15 @@ TF_FILL_MASK_FINETUNED_MODELS = [
(("distilroberta-base", {"use_fast": False}), "distilroberta-base", None),
]
SUMMARIZATION_FINETUNED_MODELS = {("bart-large-cnn", "bart-large-cnn"), ("t5-small", "t5-small")}
TF_SUMMARIZATION_FINETUNED_MODELS = {("t5-small", "t5-small")}
TRANSLATION_FINETUNED_MODELS = {
("t5-small", "t5-small", "translation_en_to_de"),
("t5-small", "t5-small", "translation_en_to_ro"),
}
TF_TRANSLATION_FINETUNED_MODELS = {("t5-small", "t5-small", "translation_en_to_fr")}
class MonoColumnInputTestCase(unittest.TestCase):
def _test_mono_column_pipeline(
@@ -252,10 +261,44 @@ class MonoColumnInputTestCase(unittest.TestCase):
valid_inputs = ["A string like this", ["list of strings entry 1", "list of strings v2"]]
invalid_inputs = [4, "<mask>"]
mandatory_keys = ["summary_text"]
nlp = pipeline(task="summarization")
self._test_mono_column_pipeline(
nlp, valid_inputs, invalid_inputs, mandatory_keys,
)
for model, tokenizer in SUMMARIZATION_FINETUNED_MODELS:
nlp = pipeline(task="summarization", model=model, tokenizer=tokenizer)
self._test_mono_column_pipeline(
nlp, valid_inputs, invalid_inputs, mandatory_keys,
)
@require_tf
def test_tf_summarization(self):
valid_inputs = ["A string like this", ["list of strings entry 1", "list of strings v2"]]
invalid_inputs = [4, "<mask>"]
mandatory_keys = ["summary_text"]
for model, tokenizer in TF_SUMMARIZATION_FINETUNED_MODELS:
nlp = pipeline(task="summarization", model=model, tokenizer=tokenizer, framework="tf")
self._test_mono_column_pipeline(
nlp, valid_inputs, invalid_inputs, mandatory_keys,
)
@require_torch
def test_translation(self):
valid_inputs = ["A string like this", ["list of strings entry 1", "list of strings v2"]]
invalid_inputs = [4, "<mask>"]
mandatory_keys = ["translation_text"]
for model, tokenizer, task in TRANSLATION_FINETUNED_MODELS:
nlp = pipeline(task=task, model=model, tokenizer=tokenizer)
self._test_mono_column_pipeline(
nlp, valid_inputs, invalid_inputs, mandatory_keys,
)
@require_tf
def test_tf_translation(self):
valid_inputs = ["A string like this", ["list of strings entry 1", "list of strings v2"]]
invalid_inputs = [4, "<mask>"]
mandatory_keys = ["translation_text"]
for model, tokenizer, task in TF_TRANSLATION_FINETUNED_MODELS:
nlp = pipeline(task=task, model=model, tokenizer=tokenizer, framework="tf")
self._test_mono_column_pipeline(
nlp, valid_inputs, invalid_inputs, mandatory_keys,
)
class MultiColumnInputTestCase(unittest.TestCase):