[PyTorch Bart] Split Bart into different models (#9343)

* first try

* remove old template

* finish bart

* finish mbart

* delete unnecessary line

* init pegasus

* save intermediate

* correct pegasus

* finish pegasus

* remove cookie cutter leftover

* add marian

* finish blenderbot

* replace in file

* correctly split blenderbot

* delete "old" folder

* correct "add statement"

* adapt config for tf comp

* correct configs for tf

* remove ipdb

* fix more stuff

* fix mbart

* push pegasus fix

* fix mbart

* more fixes

* fix research projects code

* finish docs for bart, mbart, and marian

* delete unnecessary file

* correct attn typo

* correct configs

* remove pegasus for seq class

* correct peg docs

* correct peg docs

* finish configs

* further improve docs

* add copied from statements to mbart

* fix copied from in mbart

* add copy statements to marian

* add copied from to marian

* add pegasus copied from

* finish pegasus

* finish copied from

* Apply suggestions from code review

* make style

* backward comp blenderbot

* apply lysandres and sylvains suggestions

* apply suggestions

* push last fixes

* fix docs

* fix tok tests

* fix imports code style

* fix doc
This commit is contained in:
Patrick von Platen
2021-01-05 22:00:05 +01:00
committed by GitHub
parent 4eec5d0cf6
commit eef66035a2
59 changed files with 9273 additions and 2271 deletions

View File

@@ -220,6 +220,8 @@ TensorFlow and/or Flax.
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+ +-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| Blenderbot | ✅ | ❌ | ✅ | ✅ | ❌ | | Blenderbot | ✅ | ❌ | ✅ | ✅ | ❌ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+ +-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| BlenderbotSmall | ✅ | ❌ | ✅ | ❌ | ❌ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| CTRL | ✅ | ❌ | ✅ | ✅ | ❌ | | CTRL | ✅ | ❌ | ✅ | ✅ | ❌ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+ +-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| CamemBERT | ✅ | ✅ | ✅ | ✅ | ❌ | | CamemBERT | ✅ | ✅ | ✅ | ✅ | ❌ |
@@ -361,6 +363,7 @@ TensorFlow and/or Flax.
model_doc/bertweet model_doc/bertweet
model_doc/bertgeneration model_doc/bertgeneration
model_doc/blenderbot model_doc/blenderbot
model_doc/blenderbot_small
model_doc/camembert model_doc/camembert
model_doc/ctrl model_doc/ctrl
model_doc/deberta model_doc/deberta

View File

@@ -64,7 +64,6 @@ Implementation Notes
summarization, see the example in that docstrings. summarization, see the example in that docstrings.
- Models that load the `facebook/bart-large-cnn` weights will not have a :obj:`mask_token_id`, or be able to perform - Models that load the `facebook/bart-large-cnn` weights will not have a :obj:`mask_token_id`, or be able to perform
mask-filling tasks. mask-filling tasks.
- For training/forward passes that don't involve beam search, pass :obj:`use_cache=False`.
Mask Filling Mask Filling
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

View File

@@ -43,13 +43,10 @@ Implementation Notes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
- Blenderbot uses a standard `seq2seq model transformer <https://arxiv.org/pdf/1706.03762.pdf>`__ based architecture. - Blenderbot uses a standard `seq2seq model transformer <https://arxiv.org/pdf/1706.03762.pdf>`__ based architecture.
- It inherits completely from :class:`~transformers.BartForConditionalGeneration`
- Even though blenderbot is one model, it uses two tokenizers :class:`~transformers.BlenderbotSmallTokenizer` for 90M
checkpoint and :class:`~transformers.BlenderbotTokenizer` for all other checkpoints.
- :class:`~transformers.BlenderbotSmallTokenizer` will always return :class:`~transformers.BlenderbotSmallTokenizer`,
regardless of checkpoint. To use the 3B parameter checkpoint, you must call
:class:`~transformers.BlenderbotTokenizer` directly.
- Available checkpoints can be found in the `model hub <https://huggingface.co/models?search=blenderbot>`__. - Available checkpoints can be found in the `model hub <https://huggingface.co/models?search=blenderbot>`__.
- This is the `default` Blenderbot model class. However, some smaller checkpoints, such as
``facebook/blenderbot_small_90M``, have a different architecture and consequently should be used with
`BlenderbotSmall <https://huggingface.co/transformers/master/model_doc/blenderbot_small.html>`__.
Usage Usage
@@ -59,26 +56,15 @@ Here is an example of model usage:
.. code-block:: .. code-block::
>>> from transformers import BlenderbotSmallTokenizer, BlenderbotForConditionalGeneration >>> from transformers import BlenderbotTokenizer, BlenderbotForConditionalGeneration
>>> mname = 'facebook/blenderbot-90M' >>> mname = 'facebook/blenderbot-400M-distill'
>>> model = BlenderbotForConditionalGeneration.from_pretrained(mname) >>> model = BlenderbotForConditionalGeneration.from_pretrained(mname)
>>> tokenizer = BlenderbotSmallTokenizer.from_pretrained(mname) >>> tokenizer = BlenderbotTokenizer.from_pretrained(mname)
>>> UTTERANCE = "My friends are cool but they eat too many carbs." >>> UTTERANCE = "My friends are cool but they eat too many carbs."
>>> inputs = tokenizer([UTTERANCE], return_tensors='pt') >>> inputs = tokenizer([UTTERANCE], return_tensors='pt')
>>> reply_ids = model.generate(**inputs) >>> reply_ids = model.generate(**inputs)
>>> print([tokenizer.decode(g, skip_special_tokens=True, clean_up_tokenization_spaces=False) for g in reply_ids]) >>> print(tokenizer.batch_decode(reply_ids))
["<s> That's unfortunate. Are they trying to lose weight or are they just trying to be healthier?</s>"]
Here is how you can check out config values:
.. code-block::
>>> from transformers import BlenderbotConfig
>>> config_90 = BlenderbotConfig.from_pretrained("facebook/blenderbot-90M")
>>> config_90.to_diff_dict() # show interesting Values.
>>> configuration_3B = BlenderbotConfig("facebook/blenderbot-3B")
>>> configuration_3B.to_diff_dict()
BlenderbotConfig BlenderbotConfig
@@ -93,12 +79,6 @@ BlenderbotTokenizer
.. autoclass:: transformers.BlenderbotTokenizer .. autoclass:: transformers.BlenderbotTokenizer
:members: build_inputs_with_special_tokens :members: build_inputs_with_special_tokens
BlenderbotSmallTokenizer
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.BlenderbotSmallTokenizer
:members:
BlenderbotModel BlenderbotModel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -106,7 +86,7 @@ BlenderbotModel
See :obj:`transformers.BartModel` for arguments to `forward` and `generate` See :obj:`transformers.BartModel` for arguments to `forward` and `generate`
.. autoclass:: transformers.BlenderbotModel .. autoclass:: transformers.BlenderbotModel
:members: :members: forward
BlenderbotForConditionalGeneration BlenderbotForConditionalGeneration
@@ -115,7 +95,7 @@ BlenderbotForConditionalGeneration
See :obj:`transformers.BartForConditionalGeneration` for arguments to `forward` and `generate` See :obj:`transformers.BartForConditionalGeneration` for arguments to `forward` and `generate`
.. autoclass:: transformers.BlenderbotForConditionalGeneration .. autoclass:: transformers.BlenderbotForConditionalGeneration
:members: :members: forward
TFBlenderbotForConditionalGeneration TFBlenderbotForConditionalGeneration

View File

@@ -0,0 +1,70 @@
..
Copyright 2020 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
Blenderbot Small
-----------------------------------------------------------------------------------------------------------------------
Note that :class:`~transformers.BlenderbotSmallModel` and
:class:`~transformers.BlenderbotSmallForConditionalGeneration` are only used in combination with the checkpoint
`facebook/blenderbot-90M <https://huggingface.co/facebook/blenderbot-90M>`__. Larger Blenderbot checkpoints should
instead be used with :class:`~transformers.BlenderbotModel` and
:class:`~transformers.BlenderbotForConditionalGeneration`
Overview
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The Blender chatbot model was proposed in `Recipes for building an open-domain chatbot
<https://arxiv.org/pdf/2004.13637.pdf>`__ Stephen Roller, Emily Dinan, Naman Goyal, Da Ju, Mary Williamson, Yinhan Liu,
Jing Xu, Myle Ott, Kurt Shuster, Eric M. Smith, Y-Lan Boureau, Jason Weston on 30 Apr 2020.
The abstract of the paper is the following:
*Building open-domain chatbots is a challenging area for machine learning research. While prior work has shown that
scaling neural models in the number of parameters and the size of the data they are trained on gives improved results,
we show that other ingredients are important for a high-performing chatbot. Good conversation requires a number of
skills that an expert conversationalist blends in a seamless way: providing engaging talking points and listening to
their partners, and displaying knowledge, empathy and personality appropriately, while maintaining a consistent
persona. We show that large scale models can learn these skills when given appropriate training data and choice of
generation strategy. We build variants of these recipes with 90M, 2.7B and 9.4B parameter models, and make our models
and code publicly available. Human evaluations show our best models are superior to existing approaches in multi-turn
dialogue in terms of engagingness and humanness measurements. We then discuss the limitations of this work by analyzing
failure cases of our models.*
The authors' code can be found `here <https://github.com/facebookresearch/ParlAI>`__ .
BlenderbotSmallConfig
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.BlenderbotSmallConfig
:members:
BlenderbotSmallTokenizer
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.BlenderbotSmallTokenizer
:members: build_inputs_with_special_tokens, get_special_tokens_mask,
create_token_type_ids_from_sequences, save_vocabulary
BlenderbotSmallModel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.BlenderbotSmallModel
:members: forward
BlenderbotSmallForConditionalGeneration
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.BlenderbotSmallForConditionalGeneration
:members: forward

View File

@@ -33,7 +33,6 @@ Implementation Notes
- The modeling code is the same as :class:`~transformers.BartForConditionalGeneration` with a few minor modifications: - The modeling code is the same as :class:`~transformers.BartForConditionalGeneration` with a few minor modifications:
- static (sinusoid) positional embeddings (:obj:`MarianConfig.static_position_embeddings=True`) - static (sinusoid) positional embeddings (:obj:`MarianConfig.static_position_embeddings=True`)
- a new final_logits_bias (:obj:`MarianConfig.add_bias_logits=True`)
- no layernorm_embedding (:obj:`MarianConfig.normalize_embedding=False`) - no layernorm_embedding (:obj:`MarianConfig.normalize_embedding=False`)
- the model starts generating with :obj:`pad_token_id` (which has 0 as a token_embedding) as the prefix (Bart uses - the model starts generating with :obj:`pad_token_id` (which has 0 as a token_embedding) as the prefix (Bart uses
:obj:`<s/>`), :obj:`<s/>`),
@@ -56,9 +55,10 @@ Examples
- Since Marian models are smaller than many other translation models available in the library, they can be useful for - Since Marian models are smaller than many other translation models available in the library, they can be useful for
fine-tuning experiments and integration tests. fine-tuning experiments and integration tests.
- :prefix_link:`Fine-tune on TPU <examples/seq2seq/builtin_trainer/train_distil_marian_enro_tpu.sh>` - `Fine-tune on GPU
- :prefix_link:`Fine-tune on GPU <examples/seq2seq/builtin_trainer/train_distil_marian_enro.sh>` <https://github.com/huggingface/transformers/blob/master/examples/research_projects/seq2seq-distillation/train_distil_marian_enro_teacher.sh>`__
- :prefix_link:`Fine-tune on GPU with pytorch-lightning <examples/seq2seq/distil_marian_no_teacher.sh>` - `Fine-tune on GPU with pytorch-lightning
<https://github.com/huggingface/transformers/blob/master/examples/research_projects/seq2seq-distillation/train_distil_marian_no_teacher.sh>`__
Multilingual Models Multilingual Models
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -179,10 +179,18 @@ MarianTokenizer
:members: prepare_seq2seq_batch :members: prepare_seq2seq_batch
MarianModel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.MarianModel
:members: forward
MarianMTModel MarianMTModel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.MarianMTModel .. autoclass:: transformers.MarianMTModel
:members: forward
TFMarianMTModel TFMarianMTModel

View File

@@ -111,6 +111,19 @@ MBartForConditionalGeneration
:members: :members:
MBartForQuestionAnswering
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.MBartForQuestionAnswering
:members:
MBartForSequenceClassification
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.MBartForSequenceClassification
TFMBartForConditionalGeneration TFMBartForConditionalGeneration
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

View File

@@ -65,7 +65,6 @@ Implementation Notes
- Some key configuration differences: - Some key configuration differences:
- static, sinusoidal position embeddings - static, sinusoidal position embeddings
- no :obj:`layernorm_embedding` (:obj:`PegasusConfig.normalize_embedding=False`)
- the model starts generating with pad_token_id (which has 0 token_embedding) as the prefix. - the model starts generating with pad_token_id (which has 0 token_embedding) as the prefix.
- more beams are used (:obj:`num_beams=8`) - more beams are used (:obj:`num_beams=8`)
- All pretrained pegasus checkpoints are the same besides three attributes: :obj:`tokenizer.model_max_length` (maximum - All pretrained pegasus checkpoints are the same besides three attributes: :obj:`tokenizer.model_max_length` (maximum
@@ -122,12 +121,14 @@ PegasusModel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.PegasusModel .. autoclass:: transformers.PegasusModel
:members: forward
PegasusForConditionalGeneration PegasusForConditionalGeneration
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: transformers.PegasusForConditionalGeneration .. autoclass:: transformers.PegasusForConditionalGeneration
:members: forward
TFPegasusForConditionalGeneration TFPegasusForConditionalGeneration

View File

@@ -1,645 +0,0 @@
import itertools
import json
import linecache
import math
import os
import pickle
import socket
from logging import getLogger
from pathlib import Path
from typing import Callable, Dict, Iterable, List, Tuple, Union
import git
import numpy as np
import torch
import torch.distributed as dist
from rouge_score import rouge_scorer, scoring
from sacrebleu import corpus_bleu
from torch import nn
from torch.utils.data import Dataset, Sampler
from sentence_splitter import add_newline_to_end_of_each_sentence
from transformers import BartTokenizer, EvalPrediction, PreTrainedTokenizer, T5Tokenizer
from transformers.file_utils import cached_property
from transformers.models.bart.modeling_bart import shift_tokens_right
try:
from fairseq.data.data_utils import batch_by_size
FAIRSEQ_AVAILABLE = True
except (ImportError, ModuleNotFoundError):
FAIRSEQ_AVAILABLE = False
def label_smoothed_nll_loss(lprobs, target, epsilon, ignore_index=-100):
"""From fairseq"""
if target.dim() == lprobs.dim() - 1:
target = target.unsqueeze(-1)
nll_loss = -lprobs.gather(dim=-1, index=target)
smooth_loss = -lprobs.sum(dim=-1, keepdim=True)
if ignore_index is not None:
pad_mask = target.eq(ignore_index)
nll_loss.masked_fill_(pad_mask, 0.0)
smooth_loss.masked_fill_(pad_mask, 0.0)
else:
nll_loss = nll_loss.squeeze(-1)
smooth_loss = smooth_loss.squeeze(-1)
nll_loss = nll_loss.sum() # mean()? Scared to break other math.
smooth_loss = smooth_loss.sum()
eps_i = epsilon / lprobs.size(-1)
loss = (1.0 - epsilon) * nll_loss + eps_i * smooth_loss
return loss, nll_loss
def lmap(f: Callable, x: Iterable) -> List:
"""list(map(f, x))"""
return list(map(f, x))
def calculate_bleu(output_lns, refs_lns, **kwargs) -> dict:
"""Uses sacrebleu's corpus_bleu implementation."""
return {"bleu": round(corpus_bleu(output_lns, [refs_lns], **kwargs).score, 4)}
def build_compute_metrics_fn(task_name: str, tokenizer: PreTrainedTokenizer) -> Callable[[EvalPrediction], Dict]:
def non_pad_len(tokens: np.ndarray) -> int:
return np.count_nonzero(tokens != tokenizer.pad_token_id)
def decode_pred(pred: EvalPrediction) -> Tuple[List[str], List[str]]:
pred_str = tokenizer.batch_decode(pred.predictions, skip_special_tokens=True)
label_str = tokenizer.batch_decode(pred.label_ids, skip_special_tokens=True)
pred_str = lmap(str.strip, pred_str)
label_str = lmap(str.strip, label_str)
return pred_str, label_str
def summarization_metrics(pred: EvalPrediction) -> Dict:
pred_str, label_str = decode_pred(pred)
rouge: Dict = calculate_rouge(pred_str, label_str)
summ_len = np.round(np.mean(lmap(non_pad_len, pred.predictions)), 1)
rouge.update({"gen_len": summ_len})
return rouge
def translation_metrics(pred: EvalPrediction) -> Dict:
pred_str, label_str = decode_pred(pred)
bleu: Dict = calculate_bleu(pred_str, label_str)
gen_len = np.round(np.mean(lmap(non_pad_len, pred.predictions)), 1)
bleu.update({"gen_len": gen_len})
return bleu
compute_metrics_fn = summarization_metrics if "summarization" in task_name else translation_metrics
return compute_metrics_fn
def trim_batch(
input_ids,
pad_token_id,
attention_mask=None,
):
"""Remove columns that are populated exclusively by pad_token_id"""
keep_column_mask = input_ids.ne(pad_token_id).any(dim=0)
if attention_mask is None:
return input_ids[:, keep_column_mask]
else:
return (input_ids[:, keep_column_mask], attention_mask[:, keep_column_mask])
class AbstractSeq2SeqDataset(Dataset):
def __init__(
self,
tokenizer,
data_dir,
max_source_length,
max_target_length,
type_path="train",
n_obs=None,
prefix="",
**dataset_kwargs
):
super().__init__()
self.src_file = Path(data_dir).joinpath(type_path + ".source")
self.tgt_file = Path(data_dir).joinpath(type_path + ".target")
self.len_file = Path(data_dir).joinpath(type_path + ".len")
if os.path.exists(self.len_file):
self.src_lens = pickle_load(self.len_file)
self.used_char_len = False
else:
self.src_lens = self.get_char_lens(self.src_file)
self.used_char_len = True
self.max_source_length = max_source_length
self.max_target_length = max_target_length
assert min(self.src_lens) > 0, f"found empty line in {self.src_file}"
self.tokenizer = tokenizer
self.prefix = prefix if prefix is not None else ""
if n_obs is not None:
self.src_lens = self.src_lens[:n_obs]
self.pad_token_id = self.tokenizer.pad_token_id
self.dataset_kwargs = dataset_kwargs
dataset_kwargs.update({"add_prefix_space": True} if isinstance(self.tokenizer, BartTokenizer) else {})
def __len__(self):
return len(self.src_lens)
@staticmethod
def get_char_lens(data_file):
return [len(x) for x in Path(data_file).open().readlines()]
@cached_property
def tgt_lens(self):
"""Length in characters of target documents"""
return self.get_char_lens(self.tgt_file)
def make_sortish_sampler(self, batch_size, distributed=False, shuffle=True, **kwargs):
if distributed:
return DistributedSortishSampler(self, batch_size, shuffle=shuffle, **kwargs)
else:
return SortishSampler(self.src_lens, batch_size, shuffle=shuffle)
def make_dynamic_sampler(self, max_tokens_per_batch=1024, **kwargs):
assert FAIRSEQ_AVAILABLE, "Dynamic batch size requires `pip install fairseq`"
assert not self.used_char_len, "You must call python make_len_file.py before calling make_dynamic_sampler"
sorted_indices = list(self.make_sortish_sampler(1024, shuffle=False))
def num_tokens_in_example(i):
return min(self.src_lens[i], self.max_target_length)
# call fairseq cython function
batch_sampler: List[List[int]] = batch_by_size(
sorted_indices,
num_tokens_fn=num_tokens_in_example,
max_tokens=max_tokens_per_batch,
required_batch_size_multiple=64,
)
shuffled_batches = [batch_sampler[i] for i in np.random.permutation(range(len(batch_sampler)))]
# move the largest batch to the front to OOM quickly (uses an approximation for padding)
approximate_toks_per_batch = [max(self.src_lens[i] for i in batch) * len(batch) for batch in shuffled_batches]
largest_batch_idx = np.argmax(approximate_toks_per_batch)
shuffled_batches[0], shuffled_batches[largest_batch_idx] = (
shuffled_batches[largest_batch_idx],
shuffled_batches[0],
)
return shuffled_batches
def __getitem__(self, item):
raise NotImplementedError("You must implement this")
def collate_fn(self, batch):
raise NotImplementedError("You must implement this")
class LegacySeq2SeqDataset(AbstractSeq2SeqDataset):
def __getitem__(self, index) -> Dict[str, torch.Tensor]:
"""Call tokenizer on src and tgt_lines"""
index = index + 1 # linecache starts at 1
source_line = self.prefix + linecache.getline(str(self.src_file), index).rstrip("\n")
tgt_line = linecache.getline(str(self.tgt_file), index).rstrip("\n")
assert source_line, f"empty source line for index {index}"
assert tgt_line, f"empty tgt line for index {index}"
source_inputs = self.encode_line(self.tokenizer, source_line, self.max_source_length)
target_inputs = self.encode_line(self.tokenizer, tgt_line, self.max_target_length)
source_ids = source_inputs["input_ids"].squeeze()
target_ids = target_inputs["input_ids"].squeeze()
src_mask = source_inputs["attention_mask"].squeeze()
return {
"input_ids": source_ids,
"attention_mask": src_mask,
"labels": target_ids,
}
def encode_line(self, tokenizer, line, max_length, pad_to_max_length=True, return_tensors="pt"):
"""Only used by LegacyDataset"""
return tokenizer(
[line],
max_length=max_length,
padding="max_length" if pad_to_max_length else None,
truncation=True,
return_tensors=return_tensors,
**self.dataset_kwargs,
)
def collate_fn(self, batch) -> Dict[str, torch.Tensor]:
input_ids = torch.stack([x["input_ids"] for x in batch])
masks = torch.stack([x["attention_mask"] for x in batch])
target_ids = torch.stack([x["labels"] for x in batch])
pad_token_id = self.pad_token_id
y = trim_batch(target_ids, pad_token_id)
source_ids, source_mask = trim_batch(input_ids, pad_token_id, attention_mask=masks)
batch = {
"input_ids": source_ids,
"attention_mask": source_mask,
"labels": y,
}
return batch
class Seq2SeqDataset(AbstractSeq2SeqDataset):
"""A dataset that calls prepare_seq2seq_batch."""
def __getitem__(self, index) -> Dict[str, str]:
index = index + 1 # linecache starts at 1
source_line = self.prefix + linecache.getline(str(self.src_file), index).rstrip("\n")
tgt_line = linecache.getline(str(self.tgt_file), index).rstrip("\n")
assert source_line, f"empty source line for index {index}"
assert tgt_line, f"empty tgt line for index {index}"
return {"tgt_texts": tgt_line, "src_texts": source_line, "id": index - 1}
def collate_fn(self, batch) -> Dict[str, torch.Tensor]:
"""Call prepare_seq2seq_batch."""
batch_encoding: Dict[str, torch.Tensor] = self.tokenizer.prepare_seq2seq_batch(
[x["src_texts"] for x in batch],
tgt_texts=[x["tgt_texts"] for x in batch],
max_length=self.max_source_length,
max_target_length=self.max_target_length,
return_tensors="pt",
**self.dataset_kwargs,
).data
batch_encoding["ids"] = torch.tensor([x["id"] for x in batch])
return batch_encoding
class Seq2SeqDataCollator:
def __init__(self, tokenizer, data_args, tpu_num_cores=None):
self.tokenizer = tokenizer
self.pad_token_id = tokenizer.pad_token_id
assert (
self.pad_token_id is not None
), f"pad_token_id is not defined for ({self.tokenizer.__class__.__name__}), it must be defined."
self.data_args = data_args
self.tpu_num_cores = tpu_num_cores
self.dataset_kwargs = {"add_prefix_space": True} if isinstance(tokenizer, BartTokenizer) else {}
if data_args.src_lang is not None:
self.dataset_kwargs["src_lang"] = data_args.src_lang
if data_args.tgt_lang is not None:
self.dataset_kwargs["tgt_lang"] = data_args.tgt_lang
def __call__(self, batch) -> Dict[str, torch.Tensor]:
if hasattr(self.tokenizer, "prepare_seq2seq_batch"):
batch = self._encode(batch)
input_ids, attention_mask, labels = (
batch["input_ids"],
batch["attention_mask"],
batch["labels"],
)
else:
input_ids = torch.stack([x["input_ids"] for x in batch])
attention_mask = torch.stack([x["attention_mask"] for x in batch])
labels = torch.stack([x["labels"] for x in batch])
labels = trim_batch(labels, self.pad_token_id)
input_ids, attention_mask = trim_batch(input_ids, self.pad_token_id, attention_mask=attention_mask)
if isinstance(self.tokenizer, T5Tokenizer):
decoder_input_ids = self._shift_right_t5(labels)
else:
decoder_input_ids = shift_tokens_right(labels, self.pad_token_id)
batch = {
"input_ids": input_ids,
"attention_mask": attention_mask,
"decoder_input_ids": decoder_input_ids,
"labels": labels,
}
return batch
def _shift_right_t5(self, input_ids):
# shift inputs to the right
shifted_input_ids = input_ids.new_zeros(input_ids.shape)
shifted_input_ids[..., 1:] = input_ids[..., :-1].clone()
shifted_input_ids[..., 0] = self.pad_token_id
return shifted_input_ids
def _encode(self, batch) -> Dict[str, torch.Tensor]:
batch_encoding = self.tokenizer.prepare_seq2seq_batch(
[x["src_texts"] for x in batch],
tgt_texts=[x["tgt_texts"] for x in batch],
max_length=self.data_args.max_source_length,
max_target_length=self.data_args.max_target_length,
padding="max_length" if self.tpu_num_cores is not None else "longest", # TPU hack
return_tensors="pt",
**self.dataset_kwargs,
)
return batch_encoding.data
class SortishSampler(Sampler):
"Go through the text data by order of src length with a bit of randomness. From fastai repo."
def __init__(self, data, batch_size, shuffle=True):
self.data, self.bs, self.shuffle = data, batch_size, shuffle
def __len__(self) -> int:
return len(self.data)
def __iter__(self):
return iter(sortish_sampler_indices(self.data, self.bs, shuffle=self.shuffle))
def sortish_sampler_indices(data: List, bs: int, shuffle=True) -> np.array:
"Go through the text data by order of src length with a bit of randomness. From fastai repo."
if not shuffle:
return np.argsort(np.array(data) * -1)
def key_fn(i):
return data[i]
idxs = np.random.permutation(len(data))
sz = bs * 50
ck_idx = [idxs[i : i + sz] for i in range(0, len(idxs), sz)]
sort_idx = np.concatenate([sorted(s, key=key_fn, reverse=True) for s in ck_idx])
sz = bs
ck_idx = [sort_idx[i : i + sz] for i in range(0, len(sort_idx), sz)]
max_ck = np.argmax([key_fn(ck[0]) for ck in ck_idx]) # find the chunk with the largest key,
ck_idx[0], ck_idx[max_ck] = ck_idx[max_ck], ck_idx[0] # then make sure it goes first.
sort_idx = np.concatenate(np.random.permutation(ck_idx[1:])) if len(ck_idx) > 1 else np.array([], dtype=np.int)
sort_idx = np.concatenate((ck_idx[0], sort_idx))
return sort_idx
class DistributedSortishSampler(Sampler):
"""Copied from torch DistributedSampler"""
def __init__(self, dataset, batch_size, num_replicas=None, rank=None, add_extra_examples=True, shuffle=True):
if num_replicas is None:
if not dist.is_available():
raise RuntimeError("Requires distributed package to be available")
num_replicas = dist.get_world_size()
if rank is None:
if not dist.is_available():
raise RuntimeError("Requires distributed package to be available")
rank = dist.get_rank()
self.dataset = dataset
self.num_replicas = num_replicas
self.rank = rank
self.epoch = 0
if add_extra_examples:
self.num_samples = int(math.ceil(len(self.dataset) * 1.0 / self.num_replicas))
self.total_size = self.num_samples * self.num_replicas
else:
self.total_size = len(dataset)
self.num_samples = len(self.available_indices)
self.batch_size = batch_size
self.add_extra_examples = add_extra_examples
self.shuffle = shuffle
def __iter__(self) -> Iterable:
g = torch.Generator()
g.manual_seed(self.epoch)
sortish_data = [self.dataset.src_lens[i] for i in self.available_indices]
sortish_indices = sortish_sampler_indices(sortish_data, self.batch_size, shuffle=self.shuffle)
indices = [self.available_indices[i] for i in sortish_indices]
assert len(indices) == self.num_samples
return iter(indices)
@cached_property
def available_indices(self) -> np.array:
indices = list(range(len(self.dataset)))
# add extra samples to make it evenly divisible
indices += indices[: (self.total_size - len(indices))]
assert len(indices) == self.total_size
# subsample
available_indices = indices[self.rank : self.total_size : self.num_replicas]
return available_indices
def __len__(self):
return self.num_samples
def set_epoch(self, epoch):
self.epoch = epoch
logger = getLogger(__name__)
def use_task_specific_params(model, task):
"""Update config with summarization specific params."""
task_specific_params = model.config.task_specific_params
if task_specific_params is not None:
pars = task_specific_params.get(task, {})
logger.info(f"using task specific params for {task}: {pars}")
model.config.update(pars)
def pickle_load(path):
"""pickle.load(path)"""
with open(path, "rb") as f:
return pickle.load(f)
def pickle_save(obj, path):
"""pickle.dump(obj, path)"""
with open(path, "wb") as f:
return pickle.dump(obj, f)
def flatten_list(summary_ids: List[List]):
return [x for x in itertools.chain.from_iterable(summary_ids)]
def save_git_info(folder_path: str) -> None:
"""Save git information to output_dir/git_log.json"""
repo_infos = get_git_info()
save_json(repo_infos, os.path.join(folder_path, "git_log.json"))
def save_json(content, path, indent=4, **json_dump_kwargs):
with open(path, "w") as f:
json.dump(content, f, indent=indent, **json_dump_kwargs)
def load_json(path):
with open(path) as f:
return json.load(f)
def get_git_info():
try:
repo = git.Repo(search_parent_directories=True)
repo_infos = {
"repo_id": str(repo),
"repo_sha": str(repo.head.object.hexsha),
"repo_branch": str(repo.active_branch),
"hostname": str(socket.gethostname()),
}
return repo_infos
except TypeError:
return {
"repo_id": None,
"repo_sha": None,
"repo_branch": None,
"hostname": None,
}
ROUGE_KEYS = ["rouge1", "rouge2", "rougeL", "rougeLsum"]
def extract_rouge_mid_statistics(dct):
new_dict = {}
for k1, v1 in dct.items():
mid = v1.mid
new_dict[k1] = {stat: round(getattr(mid, stat), 4) for stat in ["precision", "recall", "fmeasure"]}
return new_dict
def calculate_rouge(
pred_lns: List[str],
tgt_lns: List[str],
use_stemmer=True,
rouge_keys=ROUGE_KEYS,
return_precision_and_recall=False,
bootstrap_aggregation=True,
newline_sep=True,
) -> Dict:
"""Calculate rouge using rouge_scorer package.
Args:
pred_lns: list of summaries generated by model
tgt_lns: list of groundtruth summaries (e.g. contents of val.target)
use_stemmer: Bool indicating whether Porter stemmer should be used to
strip word suffixes to improve matching.
rouge_keys: which metrics to compute, defaults to rouge1, rouge2, rougeL, rougeLsum
return_precision_and_recall: (False) whether to also return precision and recall.
bootstrap_aggregation: whether to do the typical bootstrap resampling of scores. Defaults to True, if False
this function returns a collections.defaultdict[metric: list of values for each observation for each subscore]``
newline_sep:(default=True) whether to add newline between sentences. This is essential for calculation rougeL
on multi sentence summaries (CNN/DM dataset).
Returns:
Dict[score: value] if aggregate else defaultdict(list) keyed by rouge_keys
"""
scorer = rouge_scorer.RougeScorer(rouge_keys, use_stemmer=use_stemmer)
aggregator = scoring.BootstrapAggregator()
for pred, tgt in zip(tgt_lns, pred_lns):
# rougeLsum expects "\n" separated sentences within a summary
if newline_sep:
pred = add_newline_to_end_of_each_sentence(pred)
tgt = add_newline_to_end_of_each_sentence(tgt)
scores = scorer.score(pred, tgt)
aggregator.add_scores(scores)
if bootstrap_aggregation:
result = aggregator.aggregate()
if return_precision_and_recall:
return extract_rouge_mid_statistics(result) # here we return dict
else:
return {k: round(v.mid.fmeasure * 100, 4) for k, v in result.items()}
else:
return aggregator._scores # here we return defaultdict(list)
# Utilities for freezing parameters and checking whether they are frozen
def freeze_params(model: nn.Module):
"""Set requires_grad=False for each of model.parameters()"""
for par in model.parameters():
par.requires_grad = False
def freeze_embeds(model):
"""Freeze token embeddings and positional embeddings for bart, just token embeddings for t5."""
model_type = model.config.model_type
if model_type == "t5":
freeze_params(model.shared)
for d in [model.encoder, model.decoder]:
freeze_params(d.embed_tokens)
elif model_type == "fsmt":
for d in [model.model.encoder, model.model.decoder]:
freeze_params(d.embed_positions)
freeze_params(d.embed_tokens)
else:
freeze_params(model.model.shared)
for d in [model.model.encoder, model.model.decoder]:
freeze_params(d.embed_positions)
freeze_params(d.embed_tokens)
def grad_status(model: nn.Module) -> Iterable:
return (par.requires_grad for par in model.parameters())
def any_requires_grad(model: nn.Module) -> bool:
return any(grad_status(model))
def assert_all_frozen(model):
model_grads: List[bool] = list(grad_status(model))
n_require_grad = sum(lmap(int, model_grads))
npars = len(model_grads)
assert not any(model_grads), f"{n_require_grad/npars:.1%} of {npars} weights require grad"
def assert_not_all_frozen(model):
model_grads: List[bool] = list(grad_status(model))
npars = len(model_grads)
assert any(model_grads), f"none of {npars} weights require grad"
def parse_numeric_n_bool_cl_kwargs(unparsed_args: List[str]) -> Dict[str, Union[int, float, bool]]:
"""
Parse an argv list of unspecified command line args to a dict.
Assumes all values are either numeric or boolean in the form of true/false.
"""
result = {}
assert len(unparsed_args) % 2 == 0, f"got odd number of unparsed args: {unparsed_args}"
num_pairs = len(unparsed_args) // 2
for pair_num in range(num_pairs):
i = 2 * pair_num
assert unparsed_args[i].startswith("--")
if unparsed_args[i + 1].lower() == "true":
value = True
elif unparsed_args[i + 1].lower() == "false":
value = False
else:
try:
value = int(unparsed_args[i + 1])
except ValueError:
value = float(unparsed_args[i + 1]) # this can raise another informative ValueError
result[unparsed_args[i][2:]] = value
return result
def write_txt_file(ordered_tgt, path):
f = Path(path).open("w")
for ln in ordered_tgt:
f.write(ln + "\n")
f.flush()
def chunks(lst, n):
"""Yield successive n-sized chunks from lst."""
for i in range(0, len(lst), n):
yield lst[i : i + n]
def check_output_dir(args, expected_items=0):
"""
Checks whether to bail out if output_dir already exists and has more than expected_items in it
`args`: needs to have the following attributes of `args`:
- output_dir
- do_train
- overwrite_output_dir
`expected_items`: normally 0 (default) - i.e. empty dir, but in some cases a few files are expected (e.g. recovery from OOM)
"""
if (
os.path.exists(args.output_dir)
and len(os.listdir(args.output_dir)) > expected_items
and args.do_train
and not args.overwrite_output_dir
):
raise ValueError(
f"Output directory ({args.output_dir}) already exists and "
f"has {len(os.listdir(args.output_dir))} items in it (expected {expected_items} items). "
"Use --overwrite_output_dir to overcome."
)

View File

@@ -23,7 +23,7 @@ from pack_dataset import pack_data_dir
from parameterized import parameterized from parameterized import parameterized
from save_len_file import save_len_file from save_len_file import save_len_file
from transformers import AutoTokenizer from transformers import AutoTokenizer
from transformers.models.bart.modeling_bart import shift_tokens_right from transformers.models.mbart.modeling_mbart import shift_tokens_right
from transformers.testing_utils import TestCasePlus, require_torch_non_multi_gpu_but_fix_me, slow from transformers.testing_utils import TestCasePlus, require_torch_non_multi_gpu_but_fix_me, slow
from utils import FAIRSEQ_AVAILABLE, DistributedSortishSampler, LegacySeq2SeqDataset, Seq2SeqDataset from utils import FAIRSEQ_AVAILABLE, DistributedSortishSampler, LegacySeq2SeqDataset, Seq2SeqDataset

View File

@@ -33,9 +33,8 @@ from torch import nn
from torch.utils.data import Dataset, Sampler from torch.utils.data import Dataset, Sampler
from sentence_splitter import add_newline_to_end_of_each_sentence from sentence_splitter import add_newline_to_end_of_each_sentence
from transformers import BartTokenizer, EvalPrediction, PreTrainedTokenizer, T5Tokenizer from transformers import BartTokenizer, EvalPrediction, PreTrainedTokenizer
from transformers.file_utils import cached_property from transformers.file_utils import cached_property
from transformers.models.bart.modeling_bart import shift_tokens_right
try: try:
@@ -305,15 +304,9 @@ class Seq2SeqDataCollator:
labels = trim_batch(labels, self.pad_token_id) labels = trim_batch(labels, self.pad_token_id)
input_ids, attention_mask = trim_batch(input_ids, self.pad_token_id, attention_mask=attention_mask) input_ids, attention_mask = trim_batch(input_ids, self.pad_token_id, attention_mask=attention_mask)
if isinstance(self.tokenizer, T5Tokenizer):
decoder_input_ids = self._shift_right_t5(labels)
else:
decoder_input_ids = shift_tokens_right(labels, self.pad_token_id)
batch = { batch = {
"input_ids": input_ids, "input_ids": input_ids,
"attention_mask": attention_mask, "attention_mask": attention_mask,
"decoder_input_ids": decoder_input_ids,
"labels": labels, "labels": labels,
} }
return batch return batch

View File

@@ -120,11 +120,11 @@ from .models.bert import (
from .models.bert_generation import BertGenerationConfig from .models.bert_generation import BertGenerationConfig
from .models.bert_japanese import BertJapaneseTokenizer, CharacterTokenizer, MecabTokenizer from .models.bert_japanese import BertJapaneseTokenizer, CharacterTokenizer, MecabTokenizer
from .models.bertweet import BertweetTokenizer from .models.bertweet import BertweetTokenizer
from .models.blenderbot import ( from .models.blenderbot import BLENDERBOT_PRETRAINED_CONFIG_ARCHIVE_MAP, BlenderbotConfig, BlenderbotTokenizer
BLENDERBOT_PRETRAINED_CONFIG_ARCHIVE_MAP, from .models.blenderbot_small import (
BlenderbotConfig, BLENDERBOT_SMALL_PRETRAINED_CONFIG_ARCHIVE_MAP,
BlenderbotSmallConfig,
BlenderbotSmallTokenizer, BlenderbotSmallTokenizer,
BlenderbotTokenizer,
) )
from .models.camembert import CAMEMBERT_PRETRAINED_CONFIG_ARCHIVE_MAP, CamembertConfig from .models.camembert import CAMEMBERT_PRETRAINED_CONFIG_ARCHIVE_MAP, CamembertConfig
from .models.ctrl import CTRL_PRETRAINED_CONFIG_ARCHIVE_MAP, CTRLConfig, CTRLTokenizer from .models.ctrl import CTRL_PRETRAINED_CONFIG_ARCHIVE_MAP, CTRLConfig, CTRLTokenizer
@@ -415,6 +415,11 @@ if is_torch_available():
BlenderbotForConditionalGeneration, BlenderbotForConditionalGeneration,
BlenderbotModel, BlenderbotModel,
) )
from .models.blenderbot_small import (
BLENDERBOT_SMALL_PRETRAINED_MODEL_ARCHIVE_LIST,
BlenderbotSmallForConditionalGeneration,
BlenderbotSmallModel,
)
from .models.camembert import ( from .models.camembert import (
CAMEMBERT_PRETRAINED_MODEL_ARCHIVE_LIST, CAMEMBERT_PRETRAINED_MODEL_ARCHIVE_LIST,
CamembertForCausalLM, CamembertForCausalLM,
@@ -536,8 +541,13 @@ if is_torch_available():
LxmertVisualFeatureEncoder, LxmertVisualFeatureEncoder,
LxmertXLayer, LxmertXLayer,
) )
from .models.marian import MarianMTModel from .models.marian import MarianModel, MarianMTModel
from .models.mbart import MBartForConditionalGeneration, MBartModel from .models.mbart import (
MBartForConditionalGeneration,
MBartForQuestionAnswering,
MBartForSequenceClassification,
MBartModel,
)
from .models.mmbt import MMBTForClassification, MMBTModel, ModalEmbeddings from .models.mmbt import MMBTForClassification, MMBTModel, ModalEmbeddings
from .models.mobilebert import ( from .models.mobilebert import (
MOBILEBERT_PRETRAINED_MODEL_ARCHIVE_LIST, MOBILEBERT_PRETRAINED_MODEL_ARCHIVE_LIST,

View File

@@ -23,6 +23,10 @@ from ..bart.configuration_bart import BART_PRETRAINED_CONFIG_ARCHIVE_MAP, BartCo
from ..bert.configuration_bert import BERT_PRETRAINED_CONFIG_ARCHIVE_MAP, BertConfig from ..bert.configuration_bert import BERT_PRETRAINED_CONFIG_ARCHIVE_MAP, BertConfig
from ..bert_generation.configuration_bert_generation import BertGenerationConfig from ..bert_generation.configuration_bert_generation import BertGenerationConfig
from ..blenderbot.configuration_blenderbot import BLENDERBOT_PRETRAINED_CONFIG_ARCHIVE_MAP, BlenderbotConfig from ..blenderbot.configuration_blenderbot import BLENDERBOT_PRETRAINED_CONFIG_ARCHIVE_MAP, BlenderbotConfig
from ..blenderbot_small.configuration_blenderbot_small import (
BLENDERBOT_SMALL_PRETRAINED_CONFIG_ARCHIVE_MAP,
BlenderbotSmallConfig,
)
from ..camembert.configuration_camembert import CAMEMBERT_PRETRAINED_CONFIG_ARCHIVE_MAP, CamembertConfig from ..camembert.configuration_camembert import CAMEMBERT_PRETRAINED_CONFIG_ARCHIVE_MAP, CamembertConfig
from ..ctrl.configuration_ctrl import CTRL_PRETRAINED_CONFIG_ARCHIVE_MAP, CTRLConfig from ..ctrl.configuration_ctrl import CTRL_PRETRAINED_CONFIG_ARCHIVE_MAP, CTRLConfig
from ..deberta.configuration_deberta import DEBERTA_PRETRAINED_CONFIG_ARCHIVE_MAP, DebertaConfig from ..deberta.configuration_deberta import DEBERTA_PRETRAINED_CONFIG_ARCHIVE_MAP, DebertaConfig
@@ -68,6 +72,7 @@ ALL_PRETRAINED_CONFIG_ARCHIVE_MAP = dict(
for pretrained_map in [ for pretrained_map in [
# Add archive maps here # Add archive maps here
LED_PRETRAINED_CONFIG_ARCHIVE_MAP, LED_PRETRAINED_CONFIG_ARCHIVE_MAP,
BLENDERBOT_SMALL_PRETRAINED_CONFIG_ARCHIVE_MAP,
BERT_PRETRAINED_CONFIG_ARCHIVE_MAP, BERT_PRETRAINED_CONFIG_ARCHIVE_MAP,
BART_PRETRAINED_CONFIG_ARCHIVE_MAP, BART_PRETRAINED_CONFIG_ARCHIVE_MAP,
BLENDERBOT_PRETRAINED_CONFIG_ARCHIVE_MAP, BLENDERBOT_PRETRAINED_CONFIG_ARCHIVE_MAP,
@@ -108,6 +113,7 @@ CONFIG_MAPPING = OrderedDict(
[ [
# Add configs here # Add configs here
("led", LEDConfig), ("led", LEDConfig),
("blenderbot-small", BlenderbotSmallConfig),
("retribert", RetriBertConfig), ("retribert", RetriBertConfig),
("mt5", MT5Config), ("mt5", MT5Config),
("t5", T5Config), ("t5", T5Config),
@@ -154,6 +160,7 @@ MODEL_NAMES_MAPPING = OrderedDict(
[ [
# Add full (and cased) model names here # Add full (and cased) model names here
("led", "LED"), ("led", "LED"),
("blenderbot-small", "BlenderbotSmall"),
("retribert", "RetriBERT"), ("retribert", "RetriBERT"),
("t5", "T5"), ("t5", "T5"),
("mobilebert", "MobileBERT"), ("mobilebert", "MobileBERT"),

View File

@@ -51,6 +51,7 @@ from ..bert.modeling_bert import (
) )
from ..bert_generation.modeling_bert_generation import BertGenerationDecoder, BertGenerationEncoder from ..bert_generation.modeling_bert_generation import BertGenerationDecoder, BertGenerationEncoder
from ..blenderbot.modeling_blenderbot import BlenderbotForConditionalGeneration, BlenderbotModel from ..blenderbot.modeling_blenderbot import BlenderbotForConditionalGeneration, BlenderbotModel
from ..blenderbot_small.modeling_blenderbot_small import BlenderbotSmallForConditionalGeneration, BlenderbotSmallModel
from ..camembert.modeling_camembert import ( from ..camembert.modeling_camembert import (
CamembertForCausalLM, CamembertForCausalLM,
CamembertForMaskedLM, CamembertForMaskedLM,
@@ -116,8 +117,13 @@ from ..longformer.modeling_longformer import (
LongformerModel, LongformerModel,
) )
from ..lxmert.modeling_lxmert import LxmertForPreTraining, LxmertForQuestionAnswering, LxmertModel from ..lxmert.modeling_lxmert import LxmertForPreTraining, LxmertForQuestionAnswering, LxmertModel
from ..marian.modeling_marian import MarianMTModel from ..marian.modeling_marian import MarianModel, MarianMTModel
from ..mbart.modeling_mbart import MBartForConditionalGeneration, MBartModel from ..mbart.modeling_mbart import (
MBartForConditionalGeneration,
MBartForQuestionAnswering,
MBartForSequenceClassification,
MBartModel,
)
from ..mobilebert.modeling_mobilebert import ( from ..mobilebert.modeling_mobilebert import (
MobileBertForMaskedLM, MobileBertForMaskedLM,
MobileBertForMultipleChoice, MobileBertForMultipleChoice,
@@ -215,6 +221,7 @@ from .configuration_auto import (
BertConfig, BertConfig,
BertGenerationConfig, BertGenerationConfig,
BlenderbotConfig, BlenderbotConfig,
BlenderbotSmallConfig,
CamembertConfig, CamembertConfig,
CTRLConfig, CTRLConfig,
DebertaConfig, DebertaConfig,
@@ -260,6 +267,7 @@ MODEL_MAPPING = OrderedDict(
[ [
# Base model mapping # Base model mapping
(LEDConfig, LEDModel), (LEDConfig, LEDModel),
(BlenderbotSmallConfig, BlenderbotSmallModel),
(RetriBertConfig, RetriBertModel), (RetriBertConfig, RetriBertModel),
(MT5Config, MT5Model), (MT5Config, MT5Model),
(T5Config, T5Model), (T5Config, T5Model),
@@ -297,6 +305,7 @@ MODEL_MAPPING = OrderedDict(
(ProphetNetConfig, ProphetNetModel), (ProphetNetConfig, ProphetNetModel),
(MPNetConfig, MPNetModel), (MPNetConfig, MPNetModel),
(TapasConfig, TapasModel), (TapasConfig, TapasModel),
(MarianConfig, MarianModel),
] ]
) )
@@ -336,6 +345,7 @@ MODEL_WITH_LM_HEAD_MAPPING = OrderedDict(
[ [
# Model with LM heads mapping # Model with LM heads mapping
(LEDConfig, LEDForConditionalGeneration), (LEDConfig, LEDForConditionalGeneration),
(BlenderbotSmallConfig, BlenderbotSmallForConditionalGeneration),
(LayoutLMConfig, LayoutLMForMaskedLM), (LayoutLMConfig, LayoutLMForMaskedLM),
(T5Config, T5ForConditionalGeneration), (T5Config, T5ForConditionalGeneration),
(DistilBertConfig, DistilBertForMaskedLM), (DistilBertConfig, DistilBertForMaskedLM),
@@ -417,6 +427,7 @@ MODEL_FOR_SEQ_TO_SEQ_CAUSAL_LM_MAPPING = OrderedDict(
[ [
# Model for Seq2Seq Causal LM mapping # Model for Seq2Seq Causal LM mapping
(LEDConfig, LEDForConditionalGeneration), (LEDConfig, LEDForConditionalGeneration),
(BlenderbotSmallConfig, BlenderbotSmallForConditionalGeneration),
(MT5Config, MT5ForConditionalGeneration), (MT5Config, MT5ForConditionalGeneration),
(T5Config, T5ForConditionalGeneration), (T5Config, T5ForConditionalGeneration),
(PegasusConfig, PegasusForConditionalGeneration), (PegasusConfig, PegasusForConditionalGeneration),
@@ -439,6 +450,7 @@ MODEL_FOR_SEQUENCE_CLASSIFICATION_MAPPING = OrderedDict(
(AlbertConfig, AlbertForSequenceClassification), (AlbertConfig, AlbertForSequenceClassification),
(CamembertConfig, CamembertForSequenceClassification), (CamembertConfig, CamembertForSequenceClassification),
(XLMRobertaConfig, XLMRobertaForSequenceClassification), (XLMRobertaConfig, XLMRobertaForSequenceClassification),
(MBartConfig, MBartForSequenceClassification),
(BartConfig, BartForSequenceClassification), (BartConfig, BartForSequenceClassification),
(LongformerConfig, LongformerForSequenceClassification), (LongformerConfig, LongformerForSequenceClassification),
(RobertaConfig, RobertaForSequenceClassification), (RobertaConfig, RobertaForSequenceClassification),
@@ -469,6 +481,7 @@ MODEL_FOR_QUESTION_ANSWERING_MAPPING = OrderedDict(
(AlbertConfig, AlbertForQuestionAnswering), (AlbertConfig, AlbertForQuestionAnswering),
(CamembertConfig, CamembertForQuestionAnswering), (CamembertConfig, CamembertForQuestionAnswering),
(BartConfig, BartForQuestionAnswering), (BartConfig, BartForQuestionAnswering),
(MBartConfig, MBartForQuestionAnswering),
(LongformerConfig, LongformerForQuestionAnswering), (LongformerConfig, LongformerForQuestionAnswering),
(XLMRobertaConfig, XLMRobertaForQuestionAnswering), (XLMRobertaConfig, XLMRobertaForQuestionAnswering),
(RobertaConfig, RobertaForQuestionAnswering), (RobertaConfig, RobertaForQuestionAnswering),

View File

@@ -24,7 +24,7 @@ from ..bart.tokenization_bart import BartTokenizer
from ..bert.tokenization_bert import BertTokenizer from ..bert.tokenization_bert import BertTokenizer
from ..bert_japanese.tokenization_bert_japanese import BertJapaneseTokenizer from ..bert_japanese.tokenization_bert_japanese import BertJapaneseTokenizer
from ..bertweet.tokenization_bertweet import BertweetTokenizer from ..bertweet.tokenization_bertweet import BertweetTokenizer
from ..blenderbot.tokenization_blenderbot import BlenderbotSmallTokenizer from ..blenderbot_small.tokenization_blenderbot_small import BlenderbotSmallTokenizer
from ..ctrl.tokenization_ctrl import CTRLTokenizer from ..ctrl.tokenization_ctrl import CTRLTokenizer
from ..deberta.tokenization_deberta import DebertaTokenizer from ..deberta.tokenization_deberta import DebertaTokenizer
from ..distilbert.tokenization_distilbert import DistilBertTokenizer from ..distilbert.tokenization_distilbert import DistilBertTokenizer
@@ -197,12 +197,12 @@ TOKENIZER_MAPPING = OrderedDict(
(AlbertConfig, (AlbertTokenizer, AlbertTokenizerFast)), (AlbertConfig, (AlbertTokenizer, AlbertTokenizerFast)),
(CamembertConfig, (CamembertTokenizer, CamembertTokenizerFast)), (CamembertConfig, (CamembertTokenizer, CamembertTokenizerFast)),
(PegasusConfig, (PegasusTokenizer, PegasusTokenizerFast)), (PegasusConfig, (PegasusTokenizer, PegasusTokenizerFast)),
(MBartConfig, (BarthezTokenizer, BarthezTokenizerFast)),
(MBartConfig, (MBartTokenizer, MBartTokenizerFast)), (MBartConfig, (MBartTokenizer, MBartTokenizerFast)),
(XLMRobertaConfig, (XLMRobertaTokenizer, XLMRobertaTokenizerFast)), (XLMRobertaConfig, (XLMRobertaTokenizer, XLMRobertaTokenizerFast)),
(MarianConfig, (MarianTokenizer, None)), (MarianConfig, (MarianTokenizer, None)),
(BlenderbotConfig, (BlenderbotSmallTokenizer, None)), (BlenderbotConfig, (BlenderbotSmallTokenizer, None)),
(LongformerConfig, (LongformerTokenizer, LongformerTokenizerFast)), (LongformerConfig, (LongformerTokenizer, LongformerTokenizerFast)),
(BartConfig, (BarthezTokenizer, BarthezTokenizerFast)),
(BartConfig, (BartTokenizer, BartTokenizerFast)), (BartConfig, (BartTokenizer, BartTokenizerFast)),
(LongformerConfig, (LongformerTokenizer, LongformerTokenizerFast)), (LongformerConfig, (LongformerTokenizer, LongformerTokenizerFast)),
(RobertaConfig, (RobertaTokenizer, RobertaTokenizerFast)), (RobertaConfig, (RobertaTokenizer, RobertaTokenizerFast)),

View File

@@ -15,9 +15,8 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and # See the License for the specific language governing permissions and
# limitations under the License. # limitations under the License.
from ...file_utils import is_tf_available, is_tokenizers_available, is_torch_available from ...file_utils import is_tf_available, is_tokenizers_available, is_torch_available
from .configuration_bart import BartConfig from .configuration_bart import BART_PRETRAINED_CONFIG_ARCHIVE_MAP, BartConfig
from .tokenization_bart import BartTokenizer from .tokenization_bart import BartTokenizer

View File

@@ -1,5 +1,5 @@
# coding=utf-8 # coding=utf-8
# Copyright 2020 The Fairseq Authors and The HuggingFace Inc. team. # Copyright 2021 The Fairseq Authors and The HuggingFace Inc. team. All rights reserved.
# #
# Licensed under the Apache License, Version 2.0 (the "License"); # Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License. # you may not use this file except in compliance with the License.
@@ -12,7 +12,7 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and # See the License for the specific language governing permissions and
# limitations under the License. # limitations under the License.
""" BART configuration """ """ BART model configuration """
from ...configuration_utils import PretrainedConfig from ...configuration_utils import PretrainedConfig
from ...utils import logging from ...utils import logging
@@ -21,34 +21,33 @@ from ...utils import logging
logger = logging.get_logger(__name__) logger = logging.get_logger(__name__)
BART_PRETRAINED_CONFIG_ARCHIVE_MAP = { BART_PRETRAINED_CONFIG_ARCHIVE_MAP = {
"facebook/bart-base": "https://huggingface.co/facebook/bart-base/resolve/main/config.json",
"facebook/bart-large": "https://huggingface.co/facebook/bart-large/resolve/main/config.json", "facebook/bart-large": "https://huggingface.co/facebook/bart-large/resolve/main/config.json",
"facebook/bart-large-mnli": "https://huggingface.co/facebook/bart-large-mnli/resolve/main/config.json", # See all BART models at https://huggingface.co/models?filter=bart
"facebook/bart-large-cnn": "https://huggingface.co/facebook/bart-large-cnn/resolve/main/config.json",
"facebook/bart-large-xsum": "https://huggingface.co/facebook/bart-large-xsum/resolve/main/config.json",
"facebook/mbart-large-en-ro": "https://huggingface.co/facebook/mbart-large-en-ro/resolve/main/config.json",
"yjernite/bart_eli5": "https://huggingface.co/yjernite/bart_eli5/resolve/main/config.json",
} }
class BartConfig(PretrainedConfig): class BartConfig(PretrainedConfig):
r""" r"""
This is the configuration class to store the configuration of a :class:`~transformers.BartModel`. It is used to This is the configuration class to store the configuration of a :class:`~transformers.BartModel`. It is used to
instantiate a BART model according to the specified arguments, defining the model architecture. instantiate a BART model according to the specified arguments, defining the model architecture. Instantiating a
configuration with the defaults will yield a similar configuration to that of the BART `facebook/bart-large
<https://huggingface.co/facebook/bart-large>`__ architecture.
Configuration objects inherit from :class:`~transformers.PretrainedConfig` and can be used to control the model Configuration objects inherit from :class:`~transformers.PretrainedConfig` and can be used to control the model
outputs. Read the documentation from :class:`~transformers.PretrainedConfig` for more information. outputs. Read the documentation from :class:`~transformers.PretrainedConfig` for more information.
Args: Args:
vocab_size (:obj:`int`, `optional`, defaults to 50265): vocab_size (:obj:`int`, `optional`, defaults to 50265):
Vocabulary size of the BERT model. Defines the number of different tokens that can be represented by the Vocabulary size of the BART model. Defines the number of different tokens that can be represented by the
:obj:`inputs_ids` passed when calling :class:`~transformers.BartModel`. :obj:`inputs_ids` passed when calling :class:`~transformers.BartModel` or
:class:`~transformers.TFBartModel`.
d_model (:obj:`int`, `optional`, defaults to 1024): d_model (:obj:`int`, `optional`, defaults to 1024):
Dimensionality of the layers and the pooler layer. Dimensionality of the layers and the pooler layer.
encoder_layers (:obj:`int`, `optional`, defaults to 12): encoder_layers (:obj:`int`, `optional`, defaults to 12):
Number of encoder layers, 6 are used for the `bart-base` model. Number of encoder layers.
decoder_layers (:obj:`int`, `optional`, defaults to 12): decoder_layers (:obj:`int`, `optional`, defaults to 12):
Number of decoder layers, 6 are used for the `bart-base` model. Number of decoder layers.
encoder_attention_heads (:obj:`int`, `optional`, defaults to 16): encoder_attention_heads (:obj:`int`, `optional`, defaults to 16):
Number of attention heads for each attention layer in the Transformer encoder. Number of attention heads for each attention layer in the Transformer encoder.
decoder_attention_heads (:obj:`int`, `optional`, defaults to 16): decoder_attention_heads (:obj:`int`, `optional`, defaults to 16):
@@ -73,145 +72,113 @@ class BartConfig(PretrainedConfig):
just in case (e.g., 512 or 1024 or 2048). just in case (e.g., 512 or 1024 or 2048).
init_std (:obj:`float`, `optional`, defaults to 0.02): init_std (:obj:`float`, `optional`, defaults to 0.02):
The standard deviation of the truncated_normal_initializer for initializing all weight matrices. The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
add_bias_logits (:obj:`bool`, `optional`, defaults to :obj:`False`): force_bos_token_to_be_generated (:obj:`bool`, `optional`, defaults to :obj:`False`):
This should be completed, specific to marian. Whether or not to force BOS token to be generated at step 1 (after ``decoder_start_token_id``), only
normalize_before (:obj:`bool`, `optional`, defaults to :obj:`False`): :obj:`True` for `bart-large-cnn`.
Call layernorm before attention ops.
normalize_embedding (:obj:`bool`, `optional`, defaults to :obj:`True`):
Call layernorm after embeddings.
static_position_embeddings (:obj:`bool`, `optional`, defaults to :obj:`False`):
Don't learn positional embeddings, use sinusoidal.
add_final_layer_norm (:obj:`bool`, `optional`, defaults to :obj:`False`):
Why not add another layernorm?
do_blenderbot_90_layernorm (:obj:`bool`, `optional`, defaults to :obj:`False`):
Blenderbot-90m checkpoint uses `layernorm_embedding` one line earlier in the decoder.
scale_embedding (:obj:`bool`, `optional`, defaults to :obj:`False`):
Scale embeddings by diving by sqrt(d_model).
eos_token_id (:obj:`int`, `optional`, defaults to 2)
End of stream token id.
pad_token_id (:obj:`int`, `optional`, defaults to 1)
Padding token id.
bos_token_id (:obj:`int`, `optional`, defaults to 0)
Beginning of stream token id.
encoder_layerdrop: (:obj:`float`, `optional`, defaults to 0.0): encoder_layerdrop: (:obj:`float`, `optional`, defaults to 0.0):
The LayerDrop probability for the encoder. See the `LayerDrop paper <see The LayerDrop probability for the encoder. See the `LayerDrop paper <see
https://arxiv.org/abs/1909.11556>`__ for more details. https://arxiv.org/abs/1909.11556>`__ for more details.
decoder_layerdrop: (:obj:`float`, `optional`, defaults to 0.0): decoder_layerdrop: (:obj:`float`, `optional`, defaults to 0.0):
The LayerDrop probability for the decoder. See the `LayerDrop paper <see The LayerDrop probability for the decoder. See the `LayerDrop paper <see
https://arxiv.org/abs/1909.11556>`__ for more details. https://arxiv.org/abs/1909.11556>`__ for more details.
extra_pos_embeddings: (:obj:`int`, `optional`, defaults to 2): gradient_checkpointing (:obj:`bool`, `optional`, defaults to :obj:`False`):
How many extra learned positional embeddings to use. Should be set to :obj:`pad_token_id+1`. If True, use gradient checkpointing to save memory at the expense of slower backward pass.
num_labels: (:obj:`int`, `optional`, defaults to 3): scale_embedding (:obj:`bool`, `optional`, defaults to :obj:`False`):
The number of labels to use in :class:`~transformers.BartForSequenceClassification`. Scale embeddings by diving by sqrt(d_model).
is_encoder_decoder (:obj:`bool`, `optional`, defaults to :obj:`True`):
Whether this is an encoder/decoder model.
force_bos_token_to_be_generated (:obj:`bool`, `optional`, defaults to :obj:`False`):
Whether or not to force BOS token to be generated at step 1 (after ``decoder_start_token_id``), only
:obj:`True` for `bart-large-cnn`.
use_cache (:obj:`bool`, `optional`, defaults to :obj:`True`): use_cache (:obj:`bool`, `optional`, defaults to :obj:`True`):
Whether or not the model should return the last key/values attentions (not used by all models). Whether or not the model should return the last key/values attentions (not used by all models).
num_labels: (:obj:`int`, `optional`, defaults to 3):
The number of labels to use in :class:`~transformers.BartForSequenceClassification`.
Example::
>>> from transformers import BartModel, BartConfig
>>> # Initializing a BART facebook/bart-large style configuration
>>> configuration = BartConfig()
>>> # Initializing a model from the facebook/bart-large style configuration
>>> model = BartModel(configuration)
>>> # Accessing the model configuration
>>> configuration = model.config
""" """
model_type = "bart" model_type = "bart"
keys_to_ignore_at_inference = ["past_key_values"] keys_to_ignore_at_inference = ["past_key_values"]
def __init__( def __init__(
self, self,
activation_dropout=0.0,
extra_pos_embeddings=2,
activation_function="gelu",
vocab_size=50265, vocab_size=50265,
d_model=1024, max_position_embeddings=1024,
encoder_ffn_dim=4096,
encoder_layers=12, encoder_layers=12,
encoder_ffn_dim=4096,
encoder_attention_heads=16, encoder_attention_heads=16,
decoder_ffn_dim=4096,
decoder_layers=12, decoder_layers=12,
decoder_ffn_dim=4096,
decoder_attention_heads=16, decoder_attention_heads=16,
encoder_layerdrop=0.0, encoder_layerdrop=0.0,
decoder_layerdrop=0.0, decoder_layerdrop=0.0,
attention_dropout=0.0, activation_function="gelu",
d_model=1024,
dropout=0.1, dropout=0.1,
max_position_embeddings=1024, attention_dropout=0.0,
activation_dropout=0.0,
init_std=0.02, init_std=0.02,
classifier_dropout=0.0, classifier_dropout=0.0,
num_labels=3,
is_encoder_decoder=True,
normalize_before=False,
add_final_layer_norm=False,
do_blenderbot_90_layernorm=False,
scale_embedding=False, scale_embedding=False,
normalize_embedding=True, gradient_checkpointing=False,
static_position_embeddings=False,
add_bias_logits=False,
force_bos_token_to_be_generated=False, force_bos_token_to_be_generated=False,
use_cache=True, use_cache=True,
num_labels=3,
pad_token_id=1, pad_token_id=1,
bos_token_id=0, bos_token_id=0,
eos_token_id=2, eos_token_id=2,
**common_kwargs is_encoder_decoder=True,
decoder_start_token_id=2,
**kwargs
): ):
r"""
:class:`~transformers.BartConfig` is the configuration class for `BartModel`.
Examples::
>>> from transformers import BartConfig, BartModel
>>> config = BartConfig.from_pretrained('facebook/bart-large')
>>> model = BartModel(config)
"""
if "hidden_size" in common_kwargs:
raise ValueError("hidden size is called d_model")
super().__init__( super().__init__(
num_labels=num_labels, num_labels=num_labels,
pad_token_id=pad_token_id, pad_token_id=pad_token_id,
bos_token_id=bos_token_id, bos_token_id=bos_token_id,
eos_token_id=eos_token_id, eos_token_id=eos_token_id,
is_encoder_decoder=is_encoder_decoder, is_encoder_decoder=is_encoder_decoder,
**common_kwargs, decoder_start_token_id=decoder_start_token_id,
**kwargs,
) )
self.vocab_size = vocab_size self.vocab_size = vocab_size
self.d_model = d_model # encoder_embed_dim and decoder_embed_dim self.max_position_embeddings = max_position_embeddings
self.d_model = d_model
self.encoder_ffn_dim = encoder_ffn_dim self.encoder_ffn_dim = encoder_ffn_dim
self.encoder_layers = self.num_hidden_layers = encoder_layers self.encoder_layers = encoder_layers
self.encoder_attention_heads = encoder_attention_heads self.encoder_attention_heads = encoder_attention_heads
self.encoder_layerdrop = encoder_layerdrop
self.decoder_layerdrop = decoder_layerdrop
self.decoder_ffn_dim = decoder_ffn_dim self.decoder_ffn_dim = decoder_ffn_dim
self.decoder_layers = decoder_layers self.decoder_layers = decoder_layers
self.decoder_attention_heads = decoder_attention_heads self.decoder_attention_heads = decoder_attention_heads
self.max_position_embeddings = max_position_embeddings self.dropout = dropout
self.init_std = init_std # Normal(0, this parameter)
self.activation_function = activation_function
# Params introduced for Mbart
self.scale_embedding = scale_embedding # scale factor will be sqrt(d_model) if True
self.normalize_embedding = normalize_embedding # True for mbart, False otherwise
self.normalize_before = normalize_before # combo of fairseq's encoder_ and decoder_normalize_before
self.add_final_layer_norm = add_final_layer_norm
# Params introduced for Marian
self.add_bias_logits = add_bias_logits
self.static_position_embeddings = static_position_embeddings
# 3 Types of Dropout
self.attention_dropout = attention_dropout self.attention_dropout = attention_dropout
self.activation_dropout = activation_dropout self.activation_dropout = activation_dropout
self.dropout = dropout self.activation_function = activation_function
self.init_std = init_std
# Classifier stuff self.encoder_layerdrop = encoder_layerdrop
self.decoder_layerdrop = decoder_layerdrop
self.classifier_dropout = classifier_dropout self.classifier_dropout = classifier_dropout
# pos embedding offset
self.extra_pos_embeddings = extra_pos_embeddings
# bart has a hack that offsets positional embeddings by 2, other models don't do this
self.force_bos_token_to_be_generated = force_bos_token_to_be_generated
self.do_blenderbot_90_layernorm = do_blenderbot_90_layernorm
self.use_cache = use_cache self.use_cache = use_cache
self.num_hidden_layers = encoder_layers
self.gradient_checkpointing = gradient_checkpointing
self.scale_embedding = scale_embedding # scale factor will be sqrt(d_model) if True
self.force_bos_token_to_be_generated = force_bos_token_to_be_generated # only relevant for CNN
# IMPORTANT
# DELETE ALL OF THE FOLLOWING LINES AS SOON AS TF IS READY
self.extra_pos_embeddings = 2
self.normalize_before = False
self.add_final_layer_norm = False
self.do_blenderbot_90_layernorm = False
self.normalize_embedding = True
self.static_position_embeddings = False
self.add_bias_logits = False
@property @property
def num_attention_heads(self) -> int: def num_attention_heads(self) -> int:
@@ -220,11 +187,3 @@ class BartConfig(PretrainedConfig):
@property @property
def hidden_size(self) -> int: def hidden_size(self) -> int:
return self.d_model return self.d_model
def is_valid_mbart(self) -> bool:
"""Is the configuration aligned with the MBART paper."""
if self.normalize_before and self.add_final_layer_norm and self.scale_embedding:
return True
if self.normalize_before or self.add_final_layer_norm or self.scale_embedding:
logger.info("This configuration is a mixture of MBART and BART settings")
return False

348
src/transformers/models/bart/modeling_bart.py Normal file → Executable file
View File

@@ -1,5 +1,5 @@
# coding=utf-8 # coding=utf-8
# Copyright 2020 The Facebook AI Research Team Authors and The HuggingFace Inc. team. # Copyright 2021 The Fairseq Authors and The HuggingFace Inc. team. All rights reserved.
# #
# Licensed under the Apache License, Version 2.0 (the "License"); # Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License. # you may not use this file except in compliance with the License.
@@ -12,17 +12,18 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and # See the License for the specific language governing permissions and
# limitations under the License. # limitations under the License.
"""PyTorch BART model, ported from the fairseq repo.""" """ PyTorch BART model. """
import math import math
import random import random
import warnings import warnings
from typing import Optional, Tuple from typing import Optional, Tuple
import numpy as np
import torch import torch
import torch.nn.functional as F import torch.nn.functional as F
from torch import nn from torch import nn
from torch.nn import CrossEntropyLoss, LayerNorm from torch.nn import CrossEntropyLoss
from ...activations import ACT2FN from ...activations import ACT2FN
from ...file_utils import ( from ...file_utils import (
@@ -52,32 +53,24 @@ _TOKENIZER_FOR_DOC = "BartTokenizer"
BART_PRETRAINED_MODEL_ARCHIVE_LIST = [ BART_PRETRAINED_MODEL_ARCHIVE_LIST = [
"facebook/bart-base",
"facebook/bart-large", "facebook/bart-large",
"facebook/bart-large-mnli", # See all BART models at https://huggingface.co/models?filter=bart
"facebook/bart-large-cnn",
"facebook/bart-large-xsum",
"facebook/mbart-large-en-ro",
] ]
# This list is incomplete. See all BART models at https://huggingface.co/models?filter=bart
def shift_tokens_right(input_ids: torch.Tensor, pad_token_id: int): def shift_tokens_right(input_ids: torch.Tensor, pad_token_id: int, decoder_start_token_id: int):
""" """
Shift input ids one token to the right, and wrap the last non pad token (usually <eos>). Shift input ids one token to the right.
""" """
prev_output_tokens = input_ids.clone() shifted_input_ids = input_ids.new_zeros(input_ids.shape)
shifted_input_ids[:, 1:] = input_ids[:, :-1].clone()
shifted_input_ids[:, 0] = decoder_start_token_id
assert pad_token_id is not None, "self.model.config.pad_token_id has to be defined." assert pad_token_id is not None, "self.model.config.pad_token_id has to be defined."
# replace possible -100 values in labels by `pad_token_id` # replace possible -100 values in labels by `pad_token_id`
prev_output_tokens.masked_fill_(prev_output_tokens == -100, pad_token_id) shifted_input_ids.masked_fill_(shifted_input_ids == -100, pad_token_id)
index_of_eos = (prev_output_tokens.ne(pad_token_id).sum(dim=1) - 1).unsqueeze(-1) return shifted_input_ids
decoder_start_tokens = prev_output_tokens.gather(1, index_of_eos).squeeze()
prev_output_tokens[:, 1:] = prev_output_tokens[:, :-1].clone()
prev_output_tokens[:, 0] = decoder_start_tokens
return prev_output_tokens
def _make_causal_mask(input_ids_shape: torch.Size, dtype: torch.dtype, past_key_values_length: int = 0): def _make_causal_mask(input_ids_shape: torch.Size, dtype: torch.dtype, past_key_values_length: int = 0):
@@ -111,18 +104,15 @@ def _expand_mask(mask: torch.Tensor, dtype: torch.dtype, tgt_len: Optional[int]
class BartLearnedPositionalEmbedding(nn.Embedding): class BartLearnedPositionalEmbedding(nn.Embedding):
""" """
This module learns positional embeddings up to a fixed maximum size. Padding ids are ignored by either offsetting This module learns positional embeddings up to a fixed maximum size.
based on padding_idx or by setting padding_idx to None and ensuring that the appropriate position ids are passed to
the forward function.
""" """
def __init__(self, num_embeddings: int, embedding_dim: int, padding_idx: int, offset: int): def __init__(self, num_embeddings: int, embedding_dim: int, padding_idx: int):
assert padding_idx is not None, "`padding_idx` should not be None, but of type int"
# Bart is set up so that if padding_idx is specified then offset the embedding ids by 2 # Bart is set up so that if padding_idx is specified then offset the embedding ids by 2
# and adjust num_embeddings appropriately. Other models dont have this hack # and adjust num_embeddings appropriately. Other models dont have this hack
self.offset = offset self.offset = 2
assert padding_idx is not None, "`padding_idx` should not be None, but of type int" super().__init__(num_embeddings + self.offset, embedding_dim, padding_idx=padding_idx)
num_embeddings += offset
super().__init__(num_embeddings, embedding_dim, padding_idx=padding_idx)
def forward(self, input_ids_shape: torch.Size, past_key_values_length: int = 0): def forward(self, input_ids_shape: torch.Size, past_key_values_length: int = 0):
"""`input_ids_shape` is expected to be [bsz x seqlen].""" """`input_ids_shape` is expected to be [bsz x seqlen]."""
@@ -133,40 +123,6 @@ class BartLearnedPositionalEmbedding(nn.Embedding):
return super().forward(positions + self.offset) return super().forward(positions + self.offset)
class BartSinusoidalPositionalEmbedding(nn.Embedding):
"""This module produces sinusoidal positional embeddings of any length."""
def __init__(self, num_positions: int, embedding_dim: int, padding_idx: Optional[int] = None):
super().__init__(num_positions, embedding_dim)
self.weight = self._init_weight(self.weight)
@staticmethod
def _init_weight(out: nn.Parameter):
"""
Identical to the XLM create_sinusoidal_embeddings except features are not interleaved. The cos features are in
the 2nd half of the vector. [dim // 2:]
"""
n_pos, dim = out.shape
position_enc = np.array(
[[pos / np.power(10000, 2 * (j // 2) / dim) for j in range(dim)] for pos in range(n_pos)]
)
out.requires_grad = False # set early to avoid an error in pytorch-1.8+
sentinel = dim // 2 if dim % 2 == 0 else (dim // 2) + 1
out[:, 0:sentinel] = torch.FloatTensor(np.sin(position_enc[:, 0::2]))
out[:, sentinel:] = torch.FloatTensor(np.cos(position_enc[:, 1::2]))
out.detach_()
return out
@torch.no_grad()
def forward(self, input_ids_shape: torch.Size, past_key_values_length: int = 0):
"""`input_ids_shape` is expected to be [bsz x seqlen]."""
bsz, seq_len = input_ids_shape[:2]
positions = torch.arange(
past_key_values_length, past_key_values_length + seq_len, dtype=torch.long, device=self.weight.device
)
return super().forward(positions)
class BartAttention(nn.Module): class BartAttention(nn.Module):
"""Multi-headed attention from 'Attention Is All You Need' paper""" """Multi-headed attention from 'Attention Is All You Need' paper"""
@@ -310,14 +266,13 @@ class BartEncoderLayer(nn.Module):
num_heads=config.encoder_attention_heads, num_heads=config.encoder_attention_heads,
dropout=config.attention_dropout, dropout=config.attention_dropout,
) )
self.normalize_before = config.normalize_before self.self_attn_layer_norm = nn.LayerNorm(self.embed_dim)
self.self_attn_layer_norm = LayerNorm(self.embed_dim)
self.dropout = config.dropout self.dropout = config.dropout
self.activation_fn = ACT2FN[config.activation_function] self.activation_fn = ACT2FN[config.activation_function]
self.activation_dropout = config.activation_dropout self.activation_dropout = config.activation_dropout
self.fc1 = nn.Linear(self.embed_dim, config.encoder_ffn_dim) self.fc1 = nn.Linear(self.embed_dim, config.encoder_ffn_dim)
self.fc2 = nn.Linear(config.encoder_ffn_dim, self.embed_dim) self.fc2 = nn.Linear(config.encoder_ffn_dim, self.embed_dim)
self.final_layer_norm = LayerNorm(self.embed_dim) self.final_layer_norm = nn.LayerNorm(self.embed_dim)
def forward(self, hidden_states: torch.Tensor, attention_mask: torch.Tensor, output_attentions: bool = False): def forward(self, hidden_states: torch.Tensor, attention_mask: torch.Tensor, output_attentions: bool = False):
""" """
@@ -325,33 +280,36 @@ class BartEncoderLayer(nn.Module):
hidden_states (:obj:`torch.FloatTensor`): input to the layer of shape `(seq_len, batch, embed_dim)` hidden_states (:obj:`torch.FloatTensor`): input to the layer of shape `(seq_len, batch, embed_dim)`
attention_mask (:obj:`torch.FloatTensor`): attention mask of size attention_mask (:obj:`torch.FloatTensor`): attention mask of size
`(batch, 1, tgt_len, src_len)` where padding elements are indicated by very large negative values. `(batch, 1, tgt_len, src_len)` where padding elements are indicated by very large negative values.
output_attentions (:obj:`bool`): Whether the base model outputs attentions. This requires the attentions tensor to be reshaped in this function. output_attentions (:obj:`bool`, `optional`):
Whether or not to return the attentions tensors of all attention layers. See ``attentions`` under
returned tensors for more detail.
""" """
residual = hidden_states residual = hidden_states
if self.normalize_before:
hidden_states = self.self_attn_layer_norm(hidden_states)
hidden_states, attn_weights, _ = self.self_attn( hidden_states, attn_weights, _ = self.self_attn(
hidden_states=hidden_states, attention_mask=attention_mask, output_attentions=output_attentions hidden_states=hidden_states, attention_mask=attention_mask, output_attentions=output_attentions
) )
hidden_states = F.dropout(hidden_states, p=self.dropout, training=self.training) hidden_states = F.dropout(hidden_states, p=self.dropout, training=self.training)
hidden_states = residual + hidden_states hidden_states = residual + hidden_states
if not self.normalize_before:
hidden_states = self.self_attn_layer_norm(hidden_states) hidden_states = self.self_attn_layer_norm(hidden_states)
residual = hidden_states residual = hidden_states
if self.normalize_before:
hidden_states = self.final_layer_norm(hidden_states)
hidden_states = self.activation_fn(self.fc1(hidden_states)) hidden_states = self.activation_fn(self.fc1(hidden_states))
hidden_states = F.dropout(hidden_states, p=self.activation_dropout, training=self.training) hidden_states = F.dropout(hidden_states, p=self.activation_dropout, training=self.training)
hidden_states = self.fc2(hidden_states) hidden_states = self.fc2(hidden_states)
hidden_states = F.dropout(hidden_states, p=self.dropout, training=self.training) hidden_states = F.dropout(hidden_states, p=self.dropout, training=self.training)
hidden_states = residual + hidden_states hidden_states = residual + hidden_states
if not self.normalize_before:
hidden_states = self.final_layer_norm(hidden_states) hidden_states = self.final_layer_norm(hidden_states)
if torch.isinf(hidden_states).any() or torch.isnan(hidden_states).any(): if torch.isinf(hidden_states).any() or torch.isnan(hidden_states).any():
clamp_value = torch.finfo(hidden_states.dtype).max - 1000 clamp_value = torch.finfo(hidden_states.dtype).max - 1000
hidden_states = torch.clamp(hidden_states, min=-clamp_value, max=clamp_value) hidden_states = torch.clamp(hidden_states, min=-clamp_value, max=clamp_value)
return hidden_states, attn_weights
outputs = (hidden_states,)
if output_attentions:
outputs += (attn_weights,)
return outputs
class BartDecoderLayer(nn.Module): class BartDecoderLayer(nn.Module):
@@ -368,19 +326,18 @@ class BartDecoderLayer(nn.Module):
self.dropout = config.dropout self.dropout = config.dropout
self.activation_fn = ACT2FN[config.activation_function] self.activation_fn = ACT2FN[config.activation_function]
self.activation_dropout = config.activation_dropout self.activation_dropout = config.activation_dropout
self.normalize_before = config.normalize_before
self.self_attn_layer_norm = LayerNorm(self.embed_dim) self.self_attn_layer_norm = nn.LayerNorm(self.embed_dim)
self.encoder_attn = BartAttention( self.encoder_attn = BartAttention(
self.embed_dim, self.embed_dim,
config.decoder_attention_heads, config.decoder_attention_heads,
dropout=config.attention_dropout, dropout=config.attention_dropout,
is_decoder=True, is_decoder=True,
) )
self.encoder_attn_layer_norm = LayerNorm(self.embed_dim) self.encoder_attn_layer_norm = nn.LayerNorm(self.embed_dim)
self.fc1 = nn.Linear(self.embed_dim, config.decoder_ffn_dim) self.fc1 = nn.Linear(self.embed_dim, config.decoder_ffn_dim)
self.fc2 = nn.Linear(config.decoder_ffn_dim, self.embed_dim) self.fc2 = nn.Linear(config.decoder_ffn_dim, self.embed_dim)
self.final_layer_norm = LayerNorm(self.embed_dim) self.final_layer_norm = nn.LayerNorm(self.embed_dim)
def forward( def forward(
self, self,
@@ -389,7 +346,8 @@ class BartDecoderLayer(nn.Module):
encoder_hidden_states: Optional[torch.Tensor] = None, encoder_hidden_states: Optional[torch.Tensor] = None,
encoder_attention_mask: Optional[torch.Tensor] = None, encoder_attention_mask: Optional[torch.Tensor] = None,
past_key_value: Optional[Tuple[torch.Tensor]] = None, past_key_value: Optional[Tuple[torch.Tensor]] = None,
output_attentions: Optional[torch.Tensor] = False, output_attentions: Optional[bool] = False,
use_cache: Optional[bool] = True,
): ):
""" """
Args: Args:
@@ -400,11 +358,11 @@ class BartDecoderLayer(nn.Module):
encoder_attention_mask (:obj:`torch.FloatTensor`): encoder attention mask of size encoder_attention_mask (:obj:`torch.FloatTensor`): encoder attention mask of size
`(batch, 1, tgt_len, src_len)` where padding elements are indicated by very large negative values. `(batch, 1, tgt_len, src_len)` where padding elements are indicated by very large negative values.
past_key_value (:obj:`Tuple(torch.FloatTensor)`): cached past key and value projection states past_key_value (:obj:`Tuple(torch.FloatTensor)`): cached past key and value projection states
output_attentions (:obj:`bool`): Whether the base model outputs attentions. This requires the attentions tensor to be reshaped in this function. output_attentions (:obj:`bool`, `optional`):
Whether or not to return the attentions tensors of all attention layers. See ``attentions`` under
returned tensors for more detail.
""" """
residual = hidden_states residual = hidden_states
if self.normalize_before:
hidden_states = self.self_attn_layer_norm(hidden_states)
# Self Attention # Self Attention
# decoder uni-directional self-attention cached key/values tuple is at positions 1,2 # decoder uni-directional self-attention cached key/values tuple is at positions 1,2
@@ -418,7 +376,6 @@ class BartDecoderLayer(nn.Module):
) )
hidden_states = F.dropout(hidden_states, p=self.dropout, training=self.training) hidden_states = F.dropout(hidden_states, p=self.dropout, training=self.training)
hidden_states = residual + hidden_states hidden_states = residual + hidden_states
if not self.normalize_before:
hidden_states = self.self_attn_layer_norm(hidden_states) hidden_states = self.self_attn_layer_norm(hidden_states)
# Cross-Attention Block # Cross-Attention Block
@@ -426,8 +383,6 @@ class BartDecoderLayer(nn.Module):
cross_attn_weights = None cross_attn_weights = None
if encoder_hidden_states is not None: if encoder_hidden_states is not None:
residual = hidden_states residual = hidden_states
if self.normalize_before:
hidden_states = self.encoder_attn_layer_norm(hidden_states)
# cross_attn cached key/values tuple is at positions 3,4 of present_key_value tuple # cross_attn cached key/values tuple is at positions 3,4 of present_key_value tuple
cross_attn_past_key_value = past_key_value[-2:] if past_key_value is not None else None cross_attn_past_key_value = past_key_value[-2:] if past_key_value is not None else None
@@ -440,7 +395,6 @@ class BartDecoderLayer(nn.Module):
) )
hidden_states = F.dropout(hidden_states, p=self.dropout, training=self.training) hidden_states = F.dropout(hidden_states, p=self.dropout, training=self.training)
hidden_states = residual + hidden_states hidden_states = residual + hidden_states
if not self.normalize_before:
hidden_states = self.encoder_attn_layer_norm(hidden_states) hidden_states = self.encoder_attn_layer_norm(hidden_states)
# add cross-attn to positions 3,4 of present_key_value tuple # add cross-attn to positions 3,4 of present_key_value tuple
@@ -448,22 +402,22 @@ class BartDecoderLayer(nn.Module):
# Fully Connected # Fully Connected
residual = hidden_states residual = hidden_states
if self.normalize_before:
hidden_states = self.final_layer_norm(hidden_states)
hidden_states = self.activation_fn(self.fc1(hidden_states)) hidden_states = self.activation_fn(self.fc1(hidden_states))
hidden_states = F.dropout(hidden_states, p=self.activation_dropout, training=self.training) hidden_states = F.dropout(hidden_states, p=self.activation_dropout, training=self.training)
hidden_states = self.fc2(hidden_states) hidden_states = self.fc2(hidden_states)
hidden_states = F.dropout(hidden_states, p=self.dropout, training=self.training) hidden_states = F.dropout(hidden_states, p=self.dropout, training=self.training)
hidden_states = residual + hidden_states hidden_states = residual + hidden_states
if not self.normalize_before:
hidden_states = self.final_layer_norm(hidden_states) hidden_states = self.final_layer_norm(hidden_states)
return ( outputs = (hidden_states,)
hidden_states,
self_attn_weights, if output_attentions:
present_key_value, outputs += (self_attn_weights, cross_attn_weights)
cross_attn_weights,
) if use_cache:
outputs += (present_key_value,)
return outputs
class BartClassificationHead(nn.Module): class BartClassificationHead(nn.Module):
@@ -500,8 +454,6 @@ class BartPretrainedModel(PreTrainedModel):
module.weight.data.normal_(mean=0.0, std=std) module.weight.data.normal_(mean=0.0, std=std)
if module.bias is not None: if module.bias is not None:
module.bias.data.zero_() module.bias.data.zero_()
elif isinstance(module, BartSinusoidalPositionalEmbedding):
pass
elif isinstance(module, nn.Embedding): elif isinstance(module, nn.Embedding):
module.weight.data.normal_(mean=0.0, std=std) module.weight.data.normal_(mean=0.0, std=std)
if module.padding_idx is not None: if module.padding_idx is not None:
@@ -536,10 +488,10 @@ BART_START_DOCSTRING = r"""
general usage and behavior. general usage and behavior.
Parameters: Parameters:
config (:class:`~transformers.BartConfig`): Model configuration class with all the parameters of the model. config (:class:`~transformers.BartConfig`):
Initializing with a config file does not load the weights associated with the model, only the Model configuration class with all the parameters of the model. Initializing with a config file does not
configuration. Check out the :meth:`~transformers.PreTrainedModel.from_pretrained` method to load the model load the weights associated with the model, only the configuration. Check out the
weights. :meth:`~transformers.PreTrainedModel.from_pretrained` method to load the model weights.
""" """
BART_GENERATION_EXAMPLE = r""" BART_GENERATION_EXAMPLE = r"""
@@ -547,9 +499,8 @@ BART_GENERATION_EXAMPLE = r"""
>>> from transformers import BartTokenizer, BartForConditionalGeneration, BartConfig >>> from transformers import BartTokenizer, BartForConditionalGeneration, BartConfig
>>> # see ``examples/summarization/bart/run_eval.py`` for a longer example >>> model = BartForConditionalGeneration.from_pretrained('facebook/bart-large')
>>> model = BartForConditionalGeneration.from_pretrained('facebook/bart-large-cnn') >>> tokenizer = BartTokenizer.from_pretrained('facebook/bart-large')
>>> tokenizer = BartTokenizer.from_pretrained('facebook/bart-large-cnn')
>>> ARTICLE_TO_SUMMARIZE = "My friends are cool but they eat too many carbs." >>> ARTICLE_TO_SUMMARIZE = "My friends are cool but they eat too many carbs."
>>> inputs = tokenizer([ARTICLE_TO_SUMMARIZE], max_length=1024, return_tensors='pt') >>> inputs = tokenizer([ARTICLE_TO_SUMMARIZE], max_length=1024, return_tensors='pt')
@@ -557,6 +508,22 @@ BART_GENERATION_EXAMPLE = r"""
>>> # Generate Summary >>> # Generate Summary
>>> summary_ids = model.generate(inputs['input_ids'], num_beams=4, max_length=5, early_stopping=True) >>> summary_ids = model.generate(inputs['input_ids'], num_beams=4, max_length=5, early_stopping=True)
>>> print([tokenizer.decode(g, skip_special_tokens=True, clean_up_tokenization_spaces=False) for g in summary_ids]) >>> print([tokenizer.decode(g, skip_special_tokens=True, clean_up_tokenization_spaces=False) for g in summary_ids])
Mask filling example::
>>> from transformers import BartTokenizer, BartForConditionalGeneration
>>> tokenizer = BartTokenizer.from_pretrained('facebook/bart-large')
>>> TXT = "My friends are <mask> but they eat too many carbs."
>>> model = BartForConditionalGeneration.from_pretrained('facebook/bart-large')
>>> input_ids = tokenizer([TXT], return_tensors='pt')['input_ids']
>>> logits = model(input_ids).logits
>>> masked_index = (input_ids[0] == tokenizer.mask_token_id).nonzero().item()
>>> probs = logits[0, masked_index].softmax(dim=0)
>>> values, predictions = probs.topk(5)
>>> tokenizer.decode(predictions).split()
""" """
BART_INPUTS_DOCSTRING = r""" BART_INPUTS_DOCSTRING = r"""
@@ -578,9 +545,22 @@ BART_INPUTS_DOCSTRING = r"""
`What are attention masks? <../glossary.html#attention-mask>`__ `What are attention masks? <../glossary.html#attention-mask>`__
decoder_input_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, target_sequence_length)`, `optional`): decoder_input_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, target_sequence_length)`, `optional`):
Provide for translation and summarization training. By default, the model will create this tensor by Indices of decoder input sequence tokens in the vocabulary.
shifting the :obj:`input_ids` to the right, following the paper.
decoder_attention_mask (:obj:`torch.LongTensor` of shape :obj:`(batch_size, tgt_seq_len)`, `optional`): Indices can be obtained using :class:`~transformers.BartTokenizer`. See
:meth:`transformers.PreTrainedTokenizer.encode` and :meth:`transformers.PreTrainedTokenizer.__call__` for
details.
`What are input IDs? <../glossary.html#input-ids>`__
Bart uses the :obj:`eos_token_id` as the starting token for :obj:`decoder_input_ids` generation. If
:obj:`past_key_values` is used, optionally only the last :obj:`decoder_input_ids` have to be input (see
:obj:`past_key_values`).
For translation and summarization training, :obj:`decoder_input_ids` should be provided. If no
:obj:`decoder_input_ids` is provided, the model will create this tensor by shifting the :obj:`input_ids` to
the right for denoising pre-training following the paper.
decoder_attention_mask (:obj:`torch.LongTensor` of shape :obj:`(batch_size, target_sequence_length)`, `optional`):
Default behavior: generate a tensor that ignores pad tokens in :obj:`decoder_input_ids`. Causal mask will Default behavior: generate a tensor that ignores pad tokens in :obj:`decoder_input_ids`. Causal mask will
also be used by default. also be used by default.
@@ -641,30 +621,22 @@ class BartEncoder(BartPretrainedModel):
self.layerdrop = config.encoder_layerdrop self.layerdrop = config.encoder_layerdrop
embed_dim = config.d_model embed_dim = config.d_model
self.embed_scale = math.sqrt(embed_dim) if config.scale_embedding else 1.0
self.padding_idx = config.pad_token_id self.padding_idx = config.pad_token_id
self.max_source_positions = config.max_position_embeddings self.max_source_positions = config.max_position_embeddings
self.embed_scale = math.sqrt(embed_dim) if config.scale_embedding else 1.0
if embed_tokens is not None: if embed_tokens is not None:
self.embed_tokens = embed_tokens self.embed_tokens = embed_tokens
else: else:
self.embed_tokens = nn.Embedding(config.vocab_size, embed_dim, self.padding_idx) self.embed_tokens = nn.Embedding(config.vocab_size, embed_dim, self.padding_idx)
if config.static_position_embeddings:
self.embed_positions = BartSinusoidalPositionalEmbedding(
config.max_position_embeddings, embed_dim, self.padding_idx
)
else:
self.embed_positions = BartLearnedPositionalEmbedding( self.embed_positions = BartLearnedPositionalEmbedding(
config.max_position_embeddings, config.max_position_embeddings,
embed_dim, embed_dim,
self.padding_idx, self.padding_idx,
config.extra_pos_embeddings,
) )
self.layers = nn.ModuleList([BartEncoderLayer(config) for _ in range(config.encoder_layers)]) self.layers = nn.ModuleList([BartEncoderLayer(config) for _ in range(config.encoder_layers)])
self.layernorm_embedding = LayerNorm(embed_dim) if config.normalize_embedding else nn.Identity() self.layernorm_embedding = nn.LayerNorm(embed_dim)
# mbart has one extra layer_norm
self.layer_norm = LayerNorm(config.d_model) if config.add_final_layer_norm else None
self.init_weights() self.init_weights()
@@ -747,15 +719,28 @@ class BartEncoder(BartPretrainedModel):
# add LayerDrop (see https://arxiv.org/abs/1909.11556 for description) # add LayerDrop (see https://arxiv.org/abs/1909.11556 for description)
dropout_probability = random.uniform(0, 1) dropout_probability = random.uniform(0, 1)
if self.training and (dropout_probability < self.layerdrop): # skip the layer if self.training and (dropout_probability < self.layerdrop): # skip the layer
attn = None layer_outputs = (None, None)
else: else:
hidden_states, attn = encoder_layer(hidden_states, attention_mask, output_attentions=output_attentions) if getattr(self.config, "gradient_checkpointing", False):
def create_custom_forward(module):
def custom_forward(*inputs):
return module(*inputs, output_attentions)
return custom_forward
layer_outputs = torch.utils.checkpoint.checkpoint(
create_custom_forward(encoder_layer),
hidden_states,
attention_mask,
)
else:
layer_outputs = encoder_layer(hidden_states, attention_mask, output_attentions=output_attentions)
hidden_states = layer_outputs[0]
if output_attentions: if output_attentions:
all_attentions = all_attentions + (attn,) all_attentions = all_attentions + (layer_outputs[1],)
if self.layer_norm:
hidden_states = self.layer_norm(hidden_states)
if output_hidden_states: if output_hidden_states:
encoder_states = encoder_states + (hidden_states,) encoder_states = encoder_states + (hidden_states,)
@@ -780,7 +765,6 @@ class BartDecoder(BartPretrainedModel):
super().__init__(config) super().__init__(config)
self.dropout = config.dropout self.dropout = config.dropout
self.layerdrop = config.decoder_layerdrop self.layerdrop = config.decoder_layerdrop
self.do_blenderbot_90_layernorm = config.do_blenderbot_90_layernorm # layernorm variant
self.padding_idx = config.pad_token_id self.padding_idx = config.pad_token_id
self.max_target_positions = config.max_position_embeddings self.max_target_positions = config.max_position_embeddings
self.embed_scale = math.sqrt(config.d_model) if config.scale_embedding else 1.0 self.embed_scale = math.sqrt(config.d_model) if config.scale_embedding else 1.0
@@ -790,20 +774,13 @@ class BartDecoder(BartPretrainedModel):
else: else:
self.embed_tokens = nn.Embedding(config.vocab_size, config.d_model, self.padding_idx) self.embed_tokens = nn.Embedding(config.vocab_size, config.d_model, self.padding_idx)
if config.static_position_embeddings:
self.embed_positions = BartSinusoidalPositionalEmbedding(
config.max_position_embeddings, config.d_model, config.pad_token_id
)
else:
self.embed_positions = BartLearnedPositionalEmbedding( self.embed_positions = BartLearnedPositionalEmbedding(
config.max_position_embeddings, config.max_position_embeddings,
config.d_model, config.d_model,
self.padding_idx, self.padding_idx,
config.extra_pos_embeddings,
) )
self.layers = nn.ModuleList([BartDecoderLayer(config) for _ in range(config.decoder_layers)]) self.layers = nn.ModuleList([BartDecoderLayer(config) for _ in range(config.decoder_layers)])
self.layernorm_embedding = LayerNorm(config.d_model) if config.normalize_embedding else nn.Identity() self.layernorm_embedding = nn.LayerNorm(config.d_model)
self.layer_norm = LayerNorm(config.d_model) if config.add_final_layer_norm else None
self.init_weights() self.init_weights()
@@ -902,33 +879,6 @@ class BartDecoder(BartPretrainedModel):
input_shape, inputs_embeds.dtype, past_key_values_length=past_key_values_length input_shape, inputs_embeds.dtype, past_key_values_length=past_key_values_length
).to(self.device) ).to(self.device)
# create decoder_padding_mask if not provided and needed
# 4.12.20 (PVP): Not a fan of this "magical" function that
# automatically creates attention_mask for padded tokens
# => this is inconsistent with other models
# => Pegasus uses the pad_token as decoder_start_token_id, so that this could
# pose some problems.
if (
attention_mask is None
and input_ids is not None
and input_shape[-1] > 1
and self.config.pad_token_id in input_ids
):
# should be kept for backwards compatibility
attention_mask = input_ids.ne(self.config.pad_token_id).to(torch.long)
# never mask leading token, even if it is pad
attention_mask[:, 0] = attention_mask[:, 1]
if past_key_values_length > 0:
attention_mask = torch.cat(
[
torch.ones(
(input_shape[0], past_key_values_length), dtype=torch.long, device=input_ids.device
),
attention_mask,
],
dim=-1,
)
if attention_mask is not None and combined_attention_mask is not None: if attention_mask is not None and combined_attention_mask is not None:
# [bsz, seq_len] -> [bsz, 1, tgt_seq_len, src_seq_len] # [bsz, seq_len] -> [bsz, 1, tgt_seq_len, src_seq_len]
combined_attention_mask = combined_attention_mask + _expand_mask( combined_attention_mask = combined_attention_mask + _expand_mask(
@@ -943,10 +893,6 @@ class BartDecoder(BartPretrainedModel):
# embed positions # embed positions
positions = self.embed_positions(input_shape, past_key_values_length) positions = self.embed_positions(input_shape, past_key_values_length)
if self.do_blenderbot_90_layernorm:
hidden_states = self.layernorm_embedding(inputs_embeds)
hidden_states += positions
else:
hidden_states = inputs_embeds + positions hidden_states = inputs_embeds + positions
hidden_states = self.layernorm_embedding(hidden_states) hidden_states = self.layernorm_embedding(hidden_states)
@@ -967,30 +913,51 @@ class BartDecoder(BartPretrainedModel):
past_key_value = past_key_values[idx] if past_key_values is not None else None past_key_value = past_key_values[idx] if past_key_values is not None else None
hidden_states, layer_self_attn, present_key_value, layer_cross_attn = decoder_layer( if getattr(self.config, "gradient_checkpointing", False):
if use_cache:
raise ValueError(
"When using `gradient_checkpointing, make sure that `use_cache=False` and `config.use_cache=False`."
)
def create_custom_forward(module):
def custom_forward(*inputs):
# None for past_key_value
return module(*inputs, output_attentions, use_cache)
return custom_forward
layer_outputs = torch.utils.checkpoint.checkpoint(
create_custom_forward(decoder_layer),
hidden_states,
combined_attention_mask,
encoder_hidden_states,
encoder_attention_mask,
None,
)
else:
layer_outputs = decoder_layer(
hidden_states, hidden_states,
attention_mask=combined_attention_mask, attention_mask=combined_attention_mask,
encoder_hidden_states=encoder_hidden_states, encoder_hidden_states=encoder_hidden_states,
encoder_attention_mask=encoder_attention_mask, encoder_attention_mask=encoder_attention_mask,
past_key_value=past_key_value, past_key_value=past_key_value,
output_attentions=output_attentions, output_attentions=output_attentions,
use_cache=use_cache,
) )
hidden_states = layer_outputs[0]
if use_cache: if use_cache:
next_decoder_cache += (present_key_value,) next_decoder_cache += (layer_outputs[3 if output_attentions else 1],)
if output_attentions: if output_attentions:
all_self_attns += (layer_self_attn,) all_self_attns += (layer_outputs[1],)
all_cross_attentions += (layer_cross_attn,) all_cross_attentions += (layer_outputs[2],)
# add hidden states from the last decoder layer # add hidden states from the last decoder layer
if output_hidden_states: if output_hidden_states:
all_hidden_states += (hidden_states,) all_hidden_states += (hidden_states,)
# if config.add_final_layer_norm (mBART)
if self.layer_norm:
hidden_states = self.layer_norm(hidden_states)
next_cache = next_decoder_cache if use_cache else None next_cache = next_decoder_cache if use_cache else None
if not return_dict: if not return_dict:
return tuple( return tuple(
@@ -1060,12 +1027,12 @@ class BartModel(BartPretrainedModel):
return_dict=None, return_dict=None,
): ):
# 4.12.20 (PVP): Not a fan of this "magical" function and # different to other models, Bart automatically creates decoder_input_ids from
# also wonder how often it's actually used ... keep now # input_ids if no decoder_input_ids are provided
# for backward compatibility
# -> is this used for backward compatibility
if decoder_input_ids is None and decoder_inputs_embeds is None: if decoder_input_ids is None and decoder_inputs_embeds is None:
decoder_input_ids = shift_tokens_right(input_ids, self.config.pad_token_id) decoder_input_ids = shift_tokens_right(
input_ids, self.config.pad_token_id, self.config.decoder_start_token_id
)
output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
output_hidden_states = ( output_hidden_states = (
@@ -1083,7 +1050,7 @@ class BartModel(BartPretrainedModel):
output_hidden_states=output_hidden_states, output_hidden_states=output_hidden_states,
return_dict=return_dict, return_dict=return_dict,
) )
# If the user passed a tuple for encoder_outputs, we wrap it in a BaseModelOutput when return_dict=False # If the user passed a tuple for encoder_outputs, we wrap it in a BaseModelOutput when return_dict=True
elif return_dict and not isinstance(encoder_outputs, BaseModelOutput): elif return_dict and not isinstance(encoder_outputs, BaseModelOutput):
encoder_outputs = BaseModelOutput( encoder_outputs = BaseModelOutput(
last_hidden_state=encoder_outputs[0], last_hidden_state=encoder_outputs[0],
@@ -1192,31 +1159,14 @@ class BartForConditionalGeneration(BartPretrainedModel):
(masked), the loss is only computed for the tokens with labels in ``[0, ..., config.vocab_size]``. (masked), the loss is only computed for the tokens with labels in ``[0, ..., config.vocab_size]``.
Returns: Returns:
Conditional generation example::
>>> # Mask filling only works for bart-large
>>> from transformers import BartTokenizer, BartForConditionalGeneration
>>> tokenizer = BartTokenizer.from_pretrained('facebook/bart-large')
>>> TXT = "My friends are <mask> but they eat too many carbs."
>>> model = BartForConditionalGeneration.from_pretrained('facebook/bart-large')
>>> input_ids = tokenizer([TXT], return_tensors='pt')['input_ids']
>>> logits = model(input_ids).logits
>>> masked_index = (input_ids[0] == tokenizer.mask_token_id).nonzero().item()
>>> probs = logits[0, masked_index].softmax(dim=0)
>>> values, predictions = probs.topk(5)
>>> tokenizer.decode(predictions).split()
>>> # ['good', 'great', 'all', 'really', 'very']
""" """
return_dict = return_dict if return_dict is not None else self.config.use_return_dict return_dict = return_dict if return_dict is not None else self.config.use_return_dict
if labels is not None: if labels is not None:
use_cache = False
if decoder_input_ids is None: if decoder_input_ids is None:
decoder_input_ids = shift_tokens_right(labels, self.config.pad_token_id) decoder_input_ids = shift_tokens_right(
labels, self.config.pad_token_id, self.config.decoder_start_token_id
)
outputs = self.model( outputs = self.model(
input_ids, input_ids,
@@ -1237,7 +1187,6 @@ class BartForConditionalGeneration(BartPretrainedModel):
masked_lm_loss = None masked_lm_loss = None
if labels is not None: if labels is not None:
loss_fct = CrossEntropyLoss() loss_fct = CrossEntropyLoss()
# TODO(SS): do we need to ignore pad tokens in labels?
masked_lm_loss = loss_fct(lm_logits.view(-1, self.config.vocab_size), labels.view(-1)) masked_lm_loss = loss_fct(lm_logits.view(-1, self.config.vocab_size), labels.view(-1))
if not return_dict: if not return_dict:
@@ -1288,7 +1237,10 @@ class BartForConditionalGeneration(BartPretrainedModel):
def _reorder_cache(past, beam_idx): def _reorder_cache(past, beam_idx):
reordered_past = () reordered_past = ()
for layer_past in past: for layer_past in past:
reordered_past += (tuple(past_state.index_select(0, beam_idx) for past_state in layer_past),) # cached cross_attention states don't have to be reordered -> they are always the same
reordered_past += (
tuple(past_state.index_select(0, beam_idx) for past_state in layer_past[:2]) + layer_past[2:],
)
return reordered_past return reordered_past

View File

@@ -545,7 +545,7 @@ BART_INPUTS_DOCSTRING = r"""
decoder_input_ids (:obj:`tf.Tensor` of shape :obj:`(batch_size, target_sequence_length)`, `optional`): decoder_input_ids (:obj:`tf.Tensor` of shape :obj:`(batch_size, target_sequence_length)`, `optional`):
Provide for translation and summarization training. By default, the model will create this tensor by Provide for translation and summarization training. By default, the model will create this tensor by
shifting the input_ids right, following the paper. shifting the input_ids right, following the paper.
decoder_attention_mask (:obj:`tf.Tensor` of shape :obj:`(batch_size, tgt_seq_len)`, `optional`): decoder_attention_mask (:obj:`tf.Tensor` of shape :obj:`(batch_size, target_sequence_length)`, `optional`):
will be made by default and ignore pad tokens. It is not recommended to set this for most use cases. will be made by default and ignore pad tokens. It is not recommended to set this for most use cases.
encoder_outputs (:obj:`tf.FloatTensor`, `optional`): encoder_outputs (:obj:`tf.FloatTensor`, `optional`):
hidden states at the output of the last layer of the encoder. Used in the cross-attention of the decoder. hidden states at the output of the last layer of the encoder. Used in the cross-attention of the decoder.

View File

@@ -18,7 +18,7 @@
from ...file_utils import is_tf_available, is_torch_available from ...file_utils import is_tf_available, is_torch_available
from .configuration_blenderbot import BLENDERBOT_PRETRAINED_CONFIG_ARCHIVE_MAP, BlenderbotConfig from .configuration_blenderbot import BLENDERBOT_PRETRAINED_CONFIG_ARCHIVE_MAP, BlenderbotConfig
from .tokenization_blenderbot import BlenderbotSmallTokenizer, BlenderbotTokenizer from .tokenization_blenderbot import BlenderbotTokenizer
if is_torch_available(): if is_torch_available():
@@ -26,7 +26,9 @@ if is_torch_available():
BLENDERBOT_PRETRAINED_MODEL_ARCHIVE_LIST, BLENDERBOT_PRETRAINED_MODEL_ARCHIVE_LIST,
BlenderbotForConditionalGeneration, BlenderbotForConditionalGeneration,
BlenderbotModel, BlenderbotModel,
BlenderbotPreTrainedModel,
) )
if is_tf_available(): if is_tf_available():
from .modeling_tf_blenderbot import TFBlenderbotForConditionalGeneration from .modeling_tf_blenderbot import TFBlenderbotForConditionalGeneration

View File

@@ -1,8 +1,7 @@
#!/usr/bin/env python3
# coding=utf-8 # coding=utf-8
# Copyright (c) Facebook, Inc. and Huggingface, 2020 # Copyright 2021 The Facebook, Inc. and The HuggingFace Inc. team. All rights reserved.
# #
# This source code is licensed under the MIT license found in the; # Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License. # you may not use this file except in compliance with the License.
# You may obtain a copy of the License at # You may obtain a copy of the License at
# #
@@ -13,46 +12,49 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and # See the License for the specific language governing permissions and
# limitations under the License. # limitations under the License.
# LICENSE file in the root directory of this source tree. """ Blenderbot model configuration """
"""
BlenderbotConfig has the same signature as BartConfig. We only rewrite the signature in order to document
blenderbot-90M defaults.
"""
from ..bart.configuration_bart import BartConfig
from ...configuration_utils import PretrainedConfig
from ...utils import logging
logger = logging.get_logger(__name__)
BLENDERBOT_PRETRAINED_CONFIG_ARCHIVE_MAP = { BLENDERBOT_PRETRAINED_CONFIG_ARCHIVE_MAP = {
"facebook/blenderbot-3B": "https://cdn.huggingface.co/facebook/blenderbot-3B/config.json", "facebook/blenderbot-3B": "https://huggingface.co/facebook/blenderbot-3B/resolve/main/config.json",
"facebook/blenderbot-90M": "https://cdn.huggingface.co/facebook/blenderbot-90M/config.json", # See all Blenderbot models at https://huggingface.co/models?filter=blenderbot
} }
class BlenderbotConfig(BartConfig): class BlenderbotConfig(PretrainedConfig):
r""" r"""
This is the configuration class to store the configuration of a This is the configuration class to store the configuration of a :class:`~transformers.BlenderbotModel`. It is used
:class:`~transformers.BlenderbotForConditionalGeneration`. It inherits from :class:`~transformers.BartConfig` and to instantiate an Blenderbot model according to the specified arguments, defining the model architecture.
has the same signature with different defaults. Instantiating a configuration with the defaults will yield a similar configuration to that of the Blenderbot
`facebook/blenderbot-3B <https://huggingface.co/facebook/blenderbot-3B>`__ architecture.
Configuration objects inherit from :class:`~transformers.PretrainedConfig` and can be used to control the model Configuration objects inherit from :class:`~transformers.PretrainedConfig` and can be used to control the model
outputs. Read the documentation from :class:`~transformers.PretrainedConfig` for more information. outputs. Read the documentation from :class:`~transformers.PretrainedConfig` for more information.
Args: Args:
vocab_size (:obj:`int`, `optional`, defaults to 54944): vocab_size (:obj:`int`, `optional`, defaults to 50265):
Vocabulary size of the BERT model. Defines the number of different tokens that can be represented by the Vocabulary size of the Blenderbot model. Defines the number of different tokens that can be represented by
:obj:`inputs_ids` passed when calling :class:`~transformers.BlenderbotForConditionalGeneration`. the :obj:`inputs_ids` passed when calling :class:`~transformers.BlenderbotModel` or
d_model (:obj:`int`, `optional`, defaults to 512): :class:`~transformers.TFBlenderbotModel`.
d_model (:obj:`int`, `optional`, defaults to 1024):
Dimensionality of the layers and the pooler layer. Dimensionality of the layers and the pooler layer.
encoder_layers (:obj:`int`, `optional`, defaults to 8): encoder_layers (:obj:`int`, `optional`, defaults to 12):
Number of encoder layers, 6 are used for the `blenderbot-90M` model. Number of encoder layers.
decoder_layers (:obj:`int`, `optional`, defaults to 8): decoder_layers (:obj:`int`, `optional`, defaults to 12):
Number of decoder layers, 6 are used for the `blenderbot-90M` model. Number of decoder layers.
encoder_attention_heads (:obj:`int`, `optional`, defaults to 16): encoder_attention_heads (:obj:`int`, `optional`, defaults to 16):
Number of attention heads for each attention layer in the Transformer encoder. Number of attention heads for each attention layer in the Transformer encoder.
decoder_attention_heads (:obj:`int`, `optional`, defaults to 16): decoder_attention_heads (:obj:`int`, `optional`, defaults to 16):
Number of attention heads for each attention layer in the Transformer decoder. Number of attention heads for each attention layer in the Transformer decoder.
decoder_ffn_dim (:obj:`int`, `optional`, defaults to 2048): decoder_ffn_dim (:obj:`int`, `optional`, defaults to 4096):
Dimensionality of the "intermediate" (often named feed-forward) layer in decoder. Dimensionality of the "intermediate" (often named feed-forward) layer in decoder.
encoder_ffn_dim (:obj:`int`, `optional`, defaults to 2048): encoder_ffn_dim (:obj:`int`, `optional`, defaults to 4096):
Dimensionality of the "intermediate" (often named feed-forward) layer in decoder. Dimensionality of the "intermediate" (often named feed-forward) layer in decoder.
activation_function (:obj:`str` or :obj:`function`, `optional`, defaults to :obj:`"gelu"`): activation_function (:obj:`str` or :obj:`function`, `optional`, defaults to :obj:`"gelu"`):
The non-linear activation function (function or string) in the encoder and pooler. If string, The non-linear activation function (function or string) in the encoder and pooler. If string,
@@ -65,117 +67,115 @@ class BlenderbotConfig(BartConfig):
The dropout ratio for activations inside the fully connected layer. The dropout ratio for activations inside the fully connected layer.
classifier_dropout (:obj:`float`, `optional`, defaults to 0.0): classifier_dropout (:obj:`float`, `optional`, defaults to 0.0):
The dropout ratio for classifier. The dropout ratio for classifier.
max_position_embeddings (:obj:`int`, `optional`, defaults to 512): max_position_embeddings (:obj:`int`, `optional`, defaults to 1024):
The maximum sequence length that this model might ever be used with. Typically set this to something large The maximum sequence length that this model might ever be used with. Typically set this to something large
just in case (e.g., 512 or 1024 or 2048). just in case (e.g., 512 or 1024 or 2048).
init_std (:obj:`float`, `optional`, defaults to 0.02): init_std (:obj:`float`, `optional`, defaults to 0.02):
The standard deviation of the truncated_normal_initializer for initializing all weight matrices. The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
add_bias_logits (:obj:`bool`, `optional`, defaults to :obj:`False`):
This should be completed, specific to marian.
normalize_before (:obj:`bool`, `optional`, defaults to :obj:`False`):
Call layernorm before attention ops.
normalize_embedding (:obj:`bool`, `optional`, defaults to :obj:`True`):
Call layernorm after embeddings.
static_position_embeddings (:obj:`bool`, `optional`, defaults to :obj:`False`):
Don't learn positional embeddings, use sinusoidal.
add_final_layer_norm (:obj:`bool`, `optional`, defaults to :obj:`False`):
Why not add another layernorm?
do_blenderbot_90_layernorm (:obj:`bool`, `optional`, defaults to :obj:`True`):
Blenderbot-90m checkpoint uses `layernorm_embedding` one line earlier in the decoder.
scale_embedding (:obj:`bool`, `optional`, defaults to :obj:`False`):
Scale embeddings by diving by sqrt(d_model).
eos_token_id (:obj:`int`, `optional`, defaults to 2)
End of stream token id.
pad_token_id (:obj:`int`, `optional`, defaults to 1)
Padding token id.
bos_token_id (:obj:`int`, `optional`, defaults to 0)
Beginning of stream token id.
encoder_layerdrop: (:obj:`float`, `optional`, defaults to 0.0): encoder_layerdrop: (:obj:`float`, `optional`, defaults to 0.0):
The LayerDrop probability for the encoder. See the `LayerDrop paper <see The LayerDrop probability for the encoder. See the `LayerDrop paper <see
https://arxiv.org/abs/1909.11556>`__ for more details. https://arxiv.org/abs/1909.11556>`__ for more details.
decoder_layerdrop: (:obj:`float`, `optional`, defaults to 0.0): decoder_layerdrop: (:obj:`float`, `optional`, defaults to 0.0):
The LayerDrop probability for the decoder. See the `LayerDrop paper <see The LayerDrop probability for the decoder. See the `LayerDrop paper <see
https://arxiv.org/abs/1909.11556>`__ for more details. https://arxiv.org/abs/1909.11556>`__ for more details.
extra_pos_embeddings: (:obj:`int`, `optional`, defaults to 2): gradient_checkpointing (:obj:`bool`, `optional`, defaults to :obj:`False`):
How many extra learned positional embeddings to use. Should be set to :obj:`pad_token_id+1`. If True, use gradient checkpointing to save memory at the expense of slower backward pass.
is_encoder_decoder (:obj:`bool`, `optional`, defaults to :obj:`True`): scale_embedding (:obj:`bool`, `optional`, defaults to :obj:`False`):
Whether this is an encoder/decoder model. Scale embeddings by diving by sqrt(d_model).
force_bos_token_to_be_generated (:obj:`bool`, `optional`, defaults to :obj:`False`): use_cache (:obj:`bool`, `optional`, defaults to :obj:`True`):
Whether or not to force BOS token to be generated at step 1 (after ``decoder_start_token_id``), Whether or not the model should return the last key/values attentions (not used by all models)
Example::
>>> from transformers import BlenderbotModel, BlenderbotConfig
>>> # Initializing a Blenderbot facebook/blenderbot-3B style configuration
>>> configuration = BlenderbotConfig()
>>> # Initializing a model from the facebook/blenderbot-3B style configuration
>>> model = BlenderbotModel(configuration)
>>> # Accessing the model configuration
>>> configuration = model.config
""" """
model_type = "blenderbot" model_type = "blenderbot"
keys_to_ignore_at_inference = ["past_key_values"]
def __init__( def __init__(
self, self,
activation_dropout=0.0, vocab_size=8008,
extra_pos_embeddings=0, max_position_embeddings=128,
activation_function="gelu", encoder_layers=2,
vocab_size=54944, encoder_ffn_dim=10240,
d_model=512, encoder_attention_heads=32,
encoder_ffn_dim=2048, decoder_layers=24,
encoder_layers=8, decoder_ffn_dim=10240,
encoder_attention_heads=16, decoder_attention_heads=32,
decoder_ffn_dim=2048,
decoder_layers=8,
decoder_attention_heads=16,
encoder_layerdrop=0.0, encoder_layerdrop=0.0,
decoder_layerdrop=0.0, decoder_layerdrop=0.0,
attention_dropout=0.0, use_cache=True,
dropout=0.1,
max_position_embeddings=512,
classifier_dropout=0.0,
is_encoder_decoder=True, is_encoder_decoder=True,
pad_token_id=1, activation_function="gelu",
bos_token_id=0, d_model=2560,
eos_token_id=2, dropout=0.1,
normalize_before=False, attention_dropout=0.0,
add_final_layer_norm=False, activation_dropout=0.0,
do_blenderbot_90_layernorm=True, init_std=0.02,
decoder_start_token_id=1,
classifier_dropout=0.0,
scale_embedding=False, scale_embedding=False,
normalize_embedding=True, gradient_checkpointing=False,
static_position_embeddings=False, pad_token_id=0,
add_bias_logits=False, bos_token_id=1,
force_bos_token_to_be_generated=False, eos_token_id=2,
**common_kwargs **kwargs
): ):
r"""
Examples::
>>> from transformers import BlenderbotConfig
>>> config = BlenderbotConfig.from_pretrained('facebook/blenderbot-90M')
"""
if "hidden_size" in common_kwargs:
raise ValueError("hidden size is called d_model")
super().__init__( super().__init__(
pad_token_id=pad_token_id, pad_token_id=pad_token_id,
bos_token_id=bos_token_id, bos_token_id=bos_token_id,
eos_token_id=eos_token_id, eos_token_id=eos_token_id,
is_encoder_decoder=is_encoder_decoder, is_encoder_decoder=is_encoder_decoder,
vocab_size=vocab_size, decoder_start_token_id=decoder_start_token_id,
d_model=d_model, **kwargs,
encoder_ffn_dim=encoder_ffn_dim,
encoder_layers=encoder_layers,
encoder_layerdrop=encoder_layerdrop,
encoder_attention_heads=encoder_attention_heads,
decoder_layerdrop=decoder_layerdrop,
decoder_ffn_dim=decoder_ffn_dim,
decoder_layers=decoder_layers,
normalize_before=normalize_before,
normalize_embedding=normalize_embedding,
static_position_embeddings=static_position_embeddings,
add_bias_logits=add_bias_logits,
force_bos_token_to_be_generated=force_bos_token_to_be_generated,
do_blenderbot_90_layernorm=do_blenderbot_90_layernorm,
add_final_layer_norm=add_final_layer_norm,
scale_embedding=scale_embedding,
attention_dropout=attention_dropout,
dropout=dropout,
classifier_dropout=classifier_dropout,
activation_dropout=activation_dropout,
max_position_embeddings=max_position_embeddings,
extra_pos_embeddings=extra_pos_embeddings,
activation_function=activation_function,
decoder_attention_heads=decoder_attention_heads,
**common_kwargs,
) )
self.vocab_size = vocab_size
self.max_position_embeddings = max_position_embeddings
self.d_model = d_model
self.encoder_ffn_dim = encoder_ffn_dim
self.encoder_layers = encoder_layers
self.encoder_attention_heads = encoder_attention_heads
self.decoder_ffn_dim = decoder_ffn_dim
self.decoder_layers = decoder_layers
self.decoder_attention_heads = decoder_attention_heads
self.dropout = dropout
self.attention_dropout = attention_dropout
self.activation_dropout = activation_dropout
self.activation_function = activation_function
self.init_std = init_std
self.encoder_layerdrop = encoder_layerdrop
self.decoder_layerdrop = decoder_layerdrop
self.classifier_dropout = classifier_dropout
self.use_cache = use_cache
self.num_hidden_layers = encoder_layers
self.gradient_checkpointing = gradient_checkpointing
self.scale_embedding = scale_embedding # scale factor will be sqrt(d_model) if True
# IMPORTANT
# DELETE ALL OF THE FOLLOWING LINES AS SOON AS TF IS READY
self.extra_pos_embeddings = 0
self.normalize_before = True
self.add_final_layer_norm = True
self.do_blenderbot_90_layernorm = True
self.normalize_embedding = False
self.static_position_embeddings = False
self.add_bias_logits = False
self.force_bos_token_to_be_generated = False
@property
def num_attention_heads(self) -> int:
return self.encoder_attention_heads
@property
def hidden_size(self) -> int:
return self.d_model

1213
src/transformers/models/blenderbot/modeling_blenderbot.py Normal file → Executable file

File diff suppressed because it is too large Load Diff

View File

@@ -1,8 +1,7 @@
#!/usr/bin/env python3
# coding=utf-8 # coding=utf-8
# Copyright (c) Facebook, Inc. and its affiliates. # Copyright 2021 The Facebook Inc. and The HuggingFace Inc. team. All rights reserved.
# #
# This source code is licensed under the MIT license found in the; # Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License. # you may not use this file except in compliance with the License.
# You may obtain a copy of the License at # You may obtain a copy of the License at
# #
@@ -13,15 +12,10 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and # See the License for the specific language governing permissions and
# limitations under the License. # limitations under the License.
# LICENSE file in the root directory of this source tree. """Tokenization class for Blenderbot."""
""""BlenderbotTokenizer and BlenderbotSmallTokenizer"""
import json
import os
from typing import Dict, List, Optional, Tuple
import regex as re from typing import List
from ...tokenization_utils import PreTrainedTokenizer
from ...utils import logging from ...utils import logging
from ..roberta.tokenization_roberta import RobertaTokenizer from ..roberta.tokenization_roberta import RobertaTokenizer
@@ -93,177 +87,3 @@ def get_pairs(word):
pairs = set(pairs) pairs = set(pairs)
return pairs return pairs
class BlenderbotSmallTokenizer(PreTrainedTokenizer):
"""
Constructs a Blenderbot-90M tokenizer based on BPE (Byte-Pair-Encoding)
This tokenizer inherits from :class:`~transformers.PreTrainedTokenizer` which contains most of the main methods.
Users should refer to the superclass for more information regarding methods.
Args:
vocab_file (:obj:`str`):
File containing the vocabulary.
merges_file (:obj:`str`):
Path to the merges file.
bos_token (:obj:`str`, `optional`, defaults to :obj:`"__start__"`):
The beginning of sentence token.
eos_token (:obj:`str`, `optional`, defaults to :obj:`"__end__"`):
The end of sentence token.
unk_token (:obj:`str`, `optional`, defaults to :obj:`"__unk__"`):
The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this
token instead.
pad_token (:obj:`str`, `optional`, defaults to :obj:`"__pad__"`):
The token used for padding, for example when batching sequences of different lengths.
**kwargs
Additional keyword arguments passed along to :class:`~transformers.PreTrainedTokenizer`
"""
vocab_files_names = {"vocab_file": "vocab.json", "merges_file": "merges.txt"}
pretrained_vocab_files_map = {
"vocab_file": {"facebook/blenderbot-90M": "https://cdn.huggingface.co/facebook/blenderbot-90M/vocab.json"},
"merges_file": {"facebook/blenderbot-90M": "https://cdn.huggingface.co/facebook/blenderbot-90M/merges.txt"},
}
max_model_input_sizes = {"facebook/blenderbot-90M": 512}
def __init__(
self,
vocab_file,
merges_file,
bos_token="__start__",
eos_token="__end__",
unk_token="__unk__",
pad_token="__null__",
**kwargs
):
super().__init__(unk_token=unk_token, bos_token=bos_token, eos_token=eos_token, pad_token=pad_token, **kwargs)
with open(vocab_file, encoding="utf-8") as vocab_handle:
self.encoder = json.load(vocab_handle)
self.decoder = {v: k for k, v in self.encoder.items()}
with open(merges_file, encoding="utf-8") as merges_handle:
merges = merges_handle.read().split("\n")[1:-1]
merges = [tuple(merge.split()) for merge in merges]
self.bpe_ranks = dict(zip(merges, range(len(merges))))
self.cache = {}
@property
def vocab_size(self) -> int:
return len(self.encoder)
def get_vocab(self) -> Dict:
return dict(self.encoder, **self.added_tokens_encoder)
def bpe(self, token: str) -> str:
if token in self.cache:
return self.cache[token]
token = re.sub("([.,!?()])", r" \1", token)
token = re.sub("(')", r" \1 ", token)
token = re.sub(r"\s{2,}", " ", token)
if "\n" in token:
token = token.replace("\n", " __newln__")
tokens = token.split(" ")
words = []
for token in tokens:
if not len(token):
continue
token = token.lower()
word = tuple(token)
word = tuple(list(word[:-1]) + [word[-1] + "</w>"])
pairs = get_pairs(word)
if not pairs:
words.append(token)
continue
while True:
bigram = min(pairs, key=lambda pair: self.bpe_ranks.get(pair, float("inf")))
if bigram not in self.bpe_ranks:
break
first, second = bigram
new_word = []
i = 0
while i < len(word):
try:
j = word.index(first, i)
new_word.extend(word[i:j])
i = j
except ValueError:
new_word.extend(word[i:])
break
if word[i] == first and i < len(word) - 1 and word[i + 1] == second:
new_word.append(first + second)
i += 2
else:
new_word.append(word[i])
i += 1
new_word = tuple(new_word)
word = new_word
if len(word) == 1:
break
else:
pairs = get_pairs(word)
word = "@@ ".join(word)
word = word[:-4]
self.cache[token] = word
words.append(word)
return " ".join(words)
def _tokenize(self, text: str) -> List[str]:
""" Split a string into tokens using BPE."""
split_tokens = []
words = re.findall(r"\S+\n?", text)
for token in words:
split_tokens.extend([t for t in self.bpe(token).split(" ")])
return split_tokens
def _convert_token_to_id(self, token: str) -> int:
""" Converts a token to an id using the vocab. """
token = token.lower()
return self.encoder.get(token, self.encoder.get(self.unk_token))
def _convert_id_to_token(self, index: int) -> str:
"""Converts an index (integer) in a token (str) using the vocab."""
return self.decoder.get(index, self.unk_token)
def convert_tokens_to_string(self, tokens: List[str]) -> str:
""" Converts a sequence of tokens in a single string. """
out_string = " ".join(tokens).replace("@@ ", "").strip()
return out_string
def save_vocabulary(self, save_directory: str, filename_prefix: Optional[str] = None) -> Tuple[str]:
if not os.path.isdir(save_directory):
logger.error("Vocabulary path ({}) should be a directory".format(save_directory))
return
vocab_file = os.path.join(
save_directory, (filename_prefix + "-" if filename_prefix else "") + VOCAB_FILES_NAMES["vocab_file"]
)
merge_file = os.path.join(
save_directory, (filename_prefix + "-" if filename_prefix else "") + VOCAB_FILES_NAMES["merges_file"]
)
with open(vocab_file, "w", encoding="utf-8") as f:
f.write(json.dumps(self.encoder, ensure_ascii=False))
index = 0
with open(merge_file, "w", encoding="utf-8") as writer:
writer.write("#version: 0.2\n")
for bpe_tokens, token_index in sorted(self.bpe_ranks.items(), key=lambda kv: kv[1]):
if index != token_index:
logger.warning(
"Saving vocabulary to {}: BPE merge indices are not consecutive."
" Please check that the tokenizer is not corrupted!".format(merge_file)
)
index = token_index
writer.write(" ".join(bpe_tokens) + "\n")
index += 1
return vocab_file, merge_file

View File

@@ -0,0 +1,29 @@
# flake8: noqa
# There's no way to ignore "F401 '...' imported but unused" warnings in this
# module, but to preserve other warnings. So, don't check this module at all.
# Copyright 2020 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from ...file_utils import is_torch_available
from .configuration_blenderbot_small import BLENDERBOT_SMALL_PRETRAINED_CONFIG_ARCHIVE_MAP, BlenderbotSmallConfig
from .tokenization_blenderbot_small import BlenderbotSmallTokenizer
if is_torch_available():
from .modeling_blenderbot_small import (
BLENDERBOT_SMALL_PRETRAINED_MODEL_ARCHIVE_LIST,
BlenderbotSmallForConditionalGeneration,
BlenderbotSmallModel,
BlenderbotSmallPreTrainedModel,
)

View File

@@ -0,0 +1,170 @@
# coding=utf-8
# Copyright 2021 The Facebook, Inc. and The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" BlenderbotSmall model configuration """
from ...configuration_utils import PretrainedConfig
from ...utils import logging
logger = logging.get_logger(__name__)
BLENDERBOT_SMALL_PRETRAINED_CONFIG_ARCHIVE_MAP = {
"facebook/blenderbot_small-90M": "https://huggingface.co/facebook/blenderbot_small-90M/resolve/main/config.json",
# See all BlenderbotSmall models at https://huggingface.co/models?filter=blenderbot_small
}
class BlenderbotSmallConfig(PretrainedConfig):
r"""
This is the configuration class to store the configuration of a :class:`~transformers.BlenderbotSmallModel`. It is
used to instantiate an BlenderbotSmall model according to the specified arguments, defining the model architecture.
Instantiating a configuration with the defaults will yield a similar configuration to that of the BlenderbotSmall
`facebook/blenderbot_small-90M <https://huggingface.co/facebook/blenderbot_small-90M>`__ architecture.
Configuration objects inherit from :class:`~transformers.PretrainedConfig` and can be used to control the model
outputs. Read the documentation from :class:`~transformers.PretrainedConfig` for more information.
Args:
vocab_size (:obj:`int`, `optional`, defaults to 50265):
Vocabulary size of the BlenderbotSmall model. Defines the number of different tokens that can be
represented by the :obj:`inputs_ids` passed when calling :class:`~transformers.BlenderbotSmallModel` or
:class:`~transformers.TFBlenderbotSmallModel`.
d_model (:obj:`int`, `optional`, defaults to 512):
Dimensionality of the layers and the pooler layer.
encoder_layers (:obj:`int`, `optional`, defaults to 8):
Number of encoder layers.
decoder_layers (:obj:`int`, `optional`, defaults to 8):
Number of decoder layers.
encoder_attention_heads (:obj:`int`, `optional`, defaults to 16):
Number of attention heads for each attention layer in the Transformer encoder.
decoder_attention_heads (:obj:`int`, `optional`, defaults to 16):
Number of attention heads for each attention layer in the Transformer decoder.
decoder_ffn_dim (:obj:`int`, `optional`, defaults to 2048):
Dimensionality of the "intermediate" (often named feed-forward) layer in decoder.
encoder_ffn_dim (:obj:`int`, `optional`, defaults to 2048):
Dimensionality of the "intermediate" (often named feed-forward) layer in decoder.
activation_function (:obj:`str` or :obj:`function`, `optional`, defaults to :obj:`"gelu"`):
The non-linear activation function (function or string) in the encoder and pooler. If string,
:obj:`"gelu"`, :obj:`"relu"`, :obj:`"silu"` and :obj:`"gelu_new"` are supported.
dropout (:obj:`float`, `optional`, defaults to 0.1):
The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
attention_dropout (:obj:`float`, `optional`, defaults to 0.0):
The dropout ratio for the attention probabilities.
activation_dropout (:obj:`float`, `optional`, defaults to 0.0):
The dropout ratio for activations inside the fully connected layer.
classifier_dropout (:obj:`float`, `optional`, defaults to 0.0):
The dropout ratio for classifier.
max_position_embeddings (:obj:`int`, `optional`, defaults to 512):
The maximum sequence length that this model might ever be used with. Typically set this to something large
just in case (e.g., 512 or 1024 or 2048).
init_std (:obj:`float`, `optional`, defaults to 0.02):
The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
encoder_layerdrop: (:obj:`float`, `optional`, defaults to 0.0):
The LayerDrop probability for the encoder. See the `LayerDrop paper <see
https://arxiv.org/abs/1909.11556>`__ for more details.
decoder_layerdrop: (:obj:`float`, `optional`, defaults to 0.0):
The LayerDrop probability for the decoder. See the `LayerDrop paper <see
https://arxiv.org/abs/1909.11556>`__ for more details.
gradient_checkpointing (:obj:`bool`, `optional`, defaults to :obj:`False`):
If True, use gradient checkpointing to save memory at the expense of slower backward pass.
scale_embedding (:obj:`bool`, `optional`, defaults to :obj:`False`):
Scale embeddings by diving by sqrt(d_model).
use_cache (:obj:`bool`, `optional`, defaults to :obj:`True`):
Whether or not the model should return the last key/values attentions (not used by all models)
Example::
>>> from transformers import BlenderbotSmallModel, BlenderbotSmallConfig
>>> # Initializing a BlenderbotSmall facebook/blenderbot_small-90M style configuration
>>> configuration = BlenderbotSmallConfig()
>>> # Initializing a model from the facebook/blenderbot_small-90M style configuration
>>> model = BlenderbotSmallModel(configuration)
>>> # Accessing the model configuration
>>> configuration = model.config
"""
model_type = "blenderbot-small"
keys_to_ignore_at_inference = ["past_key_values"]
def __init__(
self,
vocab_size=50265,
max_position_embeddings=512,
encoder_layers=8,
encoder_ffn_dim=2048,
encoder_attention_heads=16,
decoder_layers=8,
decoder_ffn_dim=2048,
decoder_attention_heads=16,
encoder_layerdrop=0.0,
decoder_layerdrop=0.0,
use_cache=True,
is_encoder_decoder=True,
activation_function="gelu",
d_model=512,
dropout=0.1,
attention_dropout=0.0,
activation_dropout=0.0,
init_std=0.02,
decoder_start_token_id=1,
classifier_dropout=0.0,
scale_embedding=False,
gradient_checkpointing=False,
pad_token_id=0,
bos_token_id=1,
eos_token_id=2,
**kwargs
):
super().__init__(
pad_token_id=pad_token_id,
bos_token_id=bos_token_id,
eos_token_id=eos_token_id,
is_encoder_decoder=is_encoder_decoder,
decoder_start_token_id=decoder_start_token_id,
**kwargs,
)
self.vocab_size = vocab_size
self.max_position_embeddings = max_position_embeddings
self.d_model = d_model
self.encoder_ffn_dim = encoder_ffn_dim
self.encoder_layers = encoder_layers
self.encoder_attention_heads = encoder_attention_heads
self.decoder_ffn_dim = decoder_ffn_dim
self.decoder_layers = decoder_layers
self.decoder_attention_heads = decoder_attention_heads
self.dropout = dropout
self.attention_dropout = attention_dropout
self.activation_dropout = activation_dropout
self.activation_function = activation_function
self.init_std = init_std
self.encoder_layerdrop = encoder_layerdrop
self.decoder_layerdrop = decoder_layerdrop
self.classifier_dropout = classifier_dropout
self.use_cache = use_cache
self.num_hidden_layers = encoder_layers
self.gradient_checkpointing = gradient_checkpointing
self.scale_embedding = scale_embedding # scale factor will be sqrt(d_model) if True
@property
def num_attention_heads(self) -> int:
return self.encoder_attention_heads
@property
def hidden_size(self) -> int:
return self.d_model

File diff suppressed because it is too large Load Diff

View File

@@ -0,0 +1,228 @@
# coding=utf-8
# Copyright 2021 The Facebook Inc. and The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Tokenization class for BlenderbotSmall."""
import json
import os
from typing import Dict, List, Optional, Tuple
import regex as re
from ...tokenization_utils import PreTrainedTokenizer
from ...utils import logging
logger = logging.get_logger(__name__)
VOCAB_FILES_NAMES = {
"vocab_file": "vocab.json",
"merges_file": "merges.txt",
# "tokenizer_config_file": "tokenizer_config.json",
}
def get_pairs(word):
"""
Return set of symbol pairs in a word.
Word is represented as tuple of symbols (symbols being variable-length strings).
"""
pairs = set()
prev_char = word[0]
for char in word[1:]:
pairs.add((prev_char, char))
prev_char = char
pairs = set(pairs)
return pairs
class BlenderbotSmallTokenizer(PreTrainedTokenizer):
"""
Constructs a Blenderbot-90M tokenizer based on BPE (Byte-Pair-Encoding)
This tokenizer inherits from :class:`~transformers.PreTrainedTokenizer` which contains most of the main methods.
Users should refer to the superclass for more information regarding methods.
Args:
vocab_file (:obj:`str`):
File containing the vocabulary.
merges_file (:obj:`str`):
Path to the merges file.
bos_token (:obj:`str`, `optional`, defaults to :obj:`"__start__"`):
The beginning of sentence token.
eos_token (:obj:`str`, `optional`, defaults to :obj:`"__end__"`):
The end of sentence token.
unk_token (:obj:`str`, `optional`, defaults to :obj:`"__unk__"`):
The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this
token instead.
pad_token (:obj:`str`, `optional`, defaults to :obj:`"__pad__"`):
The token used for padding, for example when batching sequences of different lengths.
**kwargs
Additional keyword arguments passed along to :class:`~transformers.PreTrainedTokenizer`
"""
vocab_files_names = {"vocab_file": "vocab.json", "merges_file": "merges.txt"}
pretrained_vocab_files_map = {
"vocab_file": {
"facebook/blenderbot_small-90M": "https://cdn.huggingface.co/facebook/blenderbot_small-90M/vocab.json"
},
"merges_file": {
"facebook/blenderbot_small-90M": "https://cdn.huggingface.co/facebook/blenderbot_small-90M/merges.txt"
},
}
max_model_input_sizes = {"facebook/blenderbot_small-90M": 512}
def __init__(
self,
vocab_file,
merges_file,
bos_token="__start__",
eos_token="__end__",
unk_token="__unk__",
pad_token="__null__",
**kwargs
):
super().__init__(unk_token=unk_token, bos_token=bos_token, eos_token=eos_token, pad_token=pad_token, **kwargs)
with open(vocab_file, encoding="utf-8") as vocab_handle:
self.encoder = json.load(vocab_handle)
self.decoder = {v: k for k, v in self.encoder.items()}
with open(merges_file, encoding="utf-8") as merges_handle:
merges = merges_handle.read().split("\n")[1:-1]
merges = [tuple(merge.split()) for merge in merges]
self.bpe_ranks = dict(zip(merges, range(len(merges))))
self.cache = {}
@property
def vocab_size(self) -> int:
return len(self.encoder)
def get_vocab(self) -> Dict:
return dict(self.encoder, **self.added_tokens_encoder)
def bpe(self, token: str) -> str:
if token in self.cache:
return self.cache[token]
token = re.sub("([.,!?()])", r" \1", token)
token = re.sub("(')", r" \1 ", token)
token = re.sub(r"\s{2,}", " ", token)
if "\n" in token:
token = token.replace("\n", " __newln__")
tokens = token.split(" ")
words = []
for token in tokens:
if not len(token):
continue
token = token.lower()
word = tuple(token)
word = tuple(list(word[:-1]) + [word[-1] + "</w>"])
pairs = get_pairs(word)
if not pairs:
words.append(token)
continue
while True:
bigram = min(pairs, key=lambda pair: self.bpe_ranks.get(pair, float("inf")))
if bigram not in self.bpe_ranks:
break
first, second = bigram
new_word = []
i = 0
while i < len(word):
try:
j = word.index(first, i)
new_word.extend(word[i:j])
i = j
except ValueError:
new_word.extend(word[i:])
break
if word[i] == first and i < len(word) - 1 and word[i + 1] == second:
new_word.append(first + second)
i += 2
else:
new_word.append(word[i])
i += 1
new_word = tuple(new_word)
word = new_word
if len(word) == 1:
break
else:
pairs = get_pairs(word)
word = "@@ ".join(word)
word = word[:-4]
self.cache[token] = word
words.append(word)
return " ".join(words)
def _tokenize(self, text: str) -> List[str]:
""" Split a string into tokens using BPE."""
split_tokens = []
words = re.findall(r"\S+\n?", text)
for token in words:
split_tokens.extend([t for t in self.bpe(token).split(" ")])
return split_tokens
def _convert_token_to_id(self, token: str) -> int:
""" Converts a token to an id using the vocab. """
token = token.lower()
return self.encoder.get(token, self.encoder.get(self.unk_token))
def _convert_id_to_token(self, index: int) -> str:
"""Converts an index (integer) in a token (str) using the vocab."""
return self.decoder.get(index, self.unk_token)
def convert_tokens_to_string(self, tokens: List[str]) -> str:
""" Converts a sequence of tokens in a single string. """
out_string = " ".join(tokens).replace("@@ ", "").strip()
return out_string
def save_vocabulary(self, save_directory: str, filename_prefix: Optional[str] = None) -> Tuple[str]:
if not os.path.isdir(save_directory):
logger.error("Vocabulary path ({}) should be a directory".format(save_directory))
return
vocab_file = os.path.join(
save_directory, (filename_prefix + "-" if filename_prefix else "") + VOCAB_FILES_NAMES["vocab_file"]
)
merge_file = os.path.join(
save_directory, (filename_prefix + "-" if filename_prefix else "") + VOCAB_FILES_NAMES["merges_file"]
)
with open(vocab_file, "w", encoding="utf-8") as f:
f.write(json.dumps(self.encoder, ensure_ascii=False))
index = 0
with open(merge_file, "w", encoding="utf-8") as writer:
writer.write("#version: 0.2\n")
for bpe_tokens, token_index in sorted(self.bpe_ranks.items(), key=lambda kv: kv[1]):
if index != token_index:
logger.warning(
"Saving vocabulary to {}: BPE merge indices are not consecutive."
" Please check that the tokenizer is not corrupted!".format(merge_file)
)
index = token_index
writer.write(" ".join(bpe_tokens) + "\n")
index += 1
return vocab_file, merge_file

View File

@@ -0,0 +1,103 @@
# coding=utf-8
# Copyright 2021, The Facebook, Inc. and The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Fast tokenization class for BlenderbotSmall."""
from typing import List, Optional
from tokenizers import ByteLevelBPETokenizer
from ...tokenization_utils_fast import PreTrainedTokenizerFast
from ...utils import logging
from .tokenization_blenderbot_small import BlenderbotSmallTokenizer
logger = logging.get_logger(__name__)
VOCAB_FILES_NAMES = {}
PRETRAINED_VOCAB_FILES_MAP = {}
PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = {
"facebook/blenderbot_small-90M": 512,
}
class BlenderbotSmallTokenizerFast(PreTrainedTokenizerFast):
"""
Construct a "fast" BlenderbotSmall tokenizer (backed by HuggingFace's `tokenizers` library).
Args:
vocab_file (:obj:`str`):
Path to the vocabulary file.
"""
vocab_files_names = VOCAB_FILES_NAMES
pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP
max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES
slow_tokenizer_class = BlenderbotSmallTokenizer
def __init__(
self,
vocab_file,
merges_file,
unk_token="<|endoftext|>",
bos_token="<|endoftext|>",
eos_token="<|endoftext|>",
add_prefix_space=False,
trim_offsets=True,
**kwargs
):
super().__init__(
ByteLevelBPETokenizer(
vocab_file=vocab_file,
merges_file=merges_file,
add_prefix_space=add_prefix_space,
trim_offsets=trim_offsets,
),
bos_token=bos_token,
eos_token=eos_token,
unk_token=unk_token,
**kwargs,
)
self.add_prefix_space = add_prefix_space
def build_inputs_with_special_tokens(self, token_ids_0, token_ids_1=None):
output = [self.bos_token_id] + token_ids_0 + [self.eos_token_id]
if token_ids_1 is None:
return output
return output + [self.eos_token_id] + token_ids_1 + [self.eos_token_id]
def create_token_type_ids_from_sequences(
self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None
) -> List[int]:
"""
Create a mask from the two sequences passed to be used in a sequence-pair classification task. BlenderbotSmall
does not make use of token type ids, therefore a list of zeros is returned.
Args:
token_ids_0 (:obj:`List[int]`):
List of IDs.
token_ids_1 (:obj:`List[int]`, `optional`):
Optional second list of IDs for sequence pairs.
Returns:
:obj:`List[int]`: List of zeros.
"""
sep = [self.sep_token_id]
cls = [self.cls_token_id]
if token_ids_1 is None:
return len(cls + token_ids_0 + sep) * [0]
return len(cls + token_ids_0 + sep + sep + token_ids_1 + sep) * [0]

View File

@@ -77,10 +77,21 @@ ENCODER_DECODER_INPUTS_DOCSTRING = r"""
`What are attention masks? <../glossary.html#attention-mask>`__ `What are attention masks? <../glossary.html#attention-mask>`__
decoder_input_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, target_sequence_length)`, `optional`): decoder_input_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, target_sequence_length)`, `optional`):
Indices of decoder input sequence tokens in the vocabulary.
Indices can be obtained using :class:`~transformers.PreTrainedTokenizer`. See
:meth:`transformers.PreTrainedTokenizer.encode` and :meth:`transformers.PreTrainedTokenizer.__call__` for
details.
`What are input IDs? <../glossary.html#input-ids>`__
If :obj:`past_key_values` is used, optionally only the last :obj:`decoder_input_ids` have to be input (see
:obj:`past_key_values`).
Provide for sequence to sequence training to the decoder. Indices can be obtained using Provide for sequence to sequence training to the decoder. Indices can be obtained using
:class:`~transformers.PretrainedTokenizer`. See :meth:`transformers.PreTrainedTokenizer.encode` and :class:`~transformers.PretrainedTokenizer`. See :meth:`transformers.PreTrainedTokenizer.encode` and
:meth:`transformers.PreTrainedTokenizer.__call__` for details. :meth:`transformers.PreTrainedTokenizer.__call__` for details.
decoder_attention_mask (:obj:`torch.BoolTensor` of shape :obj:`(batch_size, tgt_seq_len)`, `optional`): decoder_attention_mask (:obj:`torch.BoolTensor` of shape :obj:`(batch_size, target_sequence_length)`, `optional`):
Default behavior: generate a tensor that ignores pad tokens in :obj:`decoder_input_ids`. Causal mask will Default behavior: generate a tensor that ignores pad tokens in :obj:`decoder_input_ids`. Causal mask will
also be used by default. also be used by default.
encoder_outputs (:obj:`tuple(torch.FloatTensor)`, `optional`): encoder_outputs (:obj:`tuple(torch.FloatTensor)`, `optional`):

View File

@@ -235,7 +235,7 @@ FSMT_INPUTS_DOCSTRING = r"""
decoder_input_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, target_sequence_length)`, `optional`): decoder_input_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, target_sequence_length)`, `optional`):
Provide for translation and summarization training. By default, the model will create this tensor by Provide for translation and summarization training. By default, the model will create this tensor by
shifting the input_ids right, following the paper. shifting the input_ids right, following the paper.
decoder_attention_mask (:obj:`torch.BoolTensor` of shape :obj:`(batch_size, tgt_seq_len)`, `optional`): decoder_attention_mask (:obj:`torch.BoolTensor` of shape :obj:`(batch_size, target_sequence_length)`, `optional`):
Default behavior: generate a tensor that ignores pad tokens in :obj:`decoder_input_ids`. Causal mask will Default behavior: generate a tensor that ignores pad tokens in :obj:`decoder_input_ids`. Causal mask will
also be used by default. If you want to change padding behavior, you should read also be used by default. If you want to change padding behavior, you should read
:func:`modeling_fstm._prepare_fstm_decoder_inputs` and modify. See diagram 1 in the paper for more info on :func:`modeling_fstm._prepare_fstm_decoder_inputs` and modify. See diagram 1 in the paper for more info on

View File

@@ -15,16 +15,20 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and # See the License for the specific language governing permissions and
# limitations under the License. # limitations under the License.
from ...file_utils import is_sentencepiece_available, is_tf_available, is_tokenizers_available, is_torch_available
from ...file_utils import is_sentencepiece_available, is_tf_available, is_torch_available from .configuration_marian import MARIAN_PRETRAINED_CONFIG_ARCHIVE_MAP, MarianConfig
from .configuration_marian import MarianConfig
if is_sentencepiece_available(): if is_sentencepiece_available():
from .tokenization_marian import MarianTokenizer from .tokenization_marian import MarianTokenizer
if is_torch_available(): if is_torch_available():
from .modeling_marian import MarianMTModel from .modeling_marian import (
MARIAN_PRETRAINED_MODEL_ARCHIVE_LIST,
MarianModel,
MarianMTModel,
MarianPreTrainedModel,
)
if is_tf_available(): if is_tf_available():
from .modeling_tf_marian import TFMarianMTModel from .modeling_tf_marian import TFMarianMTModel

View File

@@ -1,5 +1,5 @@
# coding=utf-8 # coding=utf-8
# Copyright 2020 The OPUS-NMT Team, Marian team, and The HuggingFace Inc. team. # Copyright 2021 The Marian Team Authors and The HuggingFace Inc. team. All rights reserved.
# #
# Licensed under the Apache License, Version 2.0 (the "License"); # Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License. # you may not use this file except in compliance with the License.
@@ -14,40 +14,48 @@
# limitations under the License. # limitations under the License.
""" Marian model configuration """ """ Marian model configuration """
from ..bart.configuration_bart import BartConfig from ...configuration_utils import PretrainedConfig
from ...utils import logging
PRETRAINED_CONFIG_ARCHIVE_MAP = { logger = logging.get_logger(__name__)
MARIAN_PRETRAINED_CONFIG_ARCHIVE_MAP = {
"Helsinki-NLP/opus-mt-en-de": "https://huggingface.co/Helsinki-NLP/opus-mt-en-de/resolve/main/config.json", "Helsinki-NLP/opus-mt-en-de": "https://huggingface.co/Helsinki-NLP/opus-mt-en-de/resolve/main/config.json",
# See all Marian models at https://huggingface.co/models?filter=marian
} }
class MarianConfig(BartConfig): class MarianConfig(PretrainedConfig):
""" r"""
This is the configuration class to store the configuration of a :class:`~transformers.MarianMTModel`. It is used to This is the configuration class to store the configuration of a :class:`~transformers.MarianModel`. It is used to
instantiate a Marian model according to the specified arguments, defining the model architecture. instantiate an Marian model according to the specified arguments, defining the model architecture. Instantiating a
configuration with the defaults will yield a similar configuration to that of the Marian
`Helsinki-NLP/opus-mt-en-de <https://huggingface.co/Helsinki-NLP/opus-mt-en-de>`__ architecture.
Configuration objects inherit from :class:`~transformers.PretrainedConfig` and can be used to control the model Configuration objects inherit from :class:`~transformers.PretrainedConfig` and can be used to control the model
outputs. Read the documentation from :class:`~transformers.PretrainedConfig` for more information. outputs. Read the documentation from :class:`~transformers.PretrainedConfig` for more information.
Args: Args:
vocab_size (:obj:`int`, `optional`, defaults to 58101): vocab_size (:obj:`int`, `optional`, defaults to 50265):
Vocabulary size of the Marian model. Defines the number of different tokens that can be represented by the Vocabulary size of the Marian model. Defines the number of different tokens that can be represented by the
:obj:`inputs_ids` passed when calling :class:`~transformers.MarianMTModel`. :obj:`inputs_ids` passed when calling :class:`~transformers.MarianModel` or
d_model (:obj:`int`, `optional`, defaults to 512): :class:`~transformers.TFMarianModel`.
d_model (:obj:`int`, `optional`, defaults to 1024):
Dimensionality of the layers and the pooler layer. Dimensionality of the layers and the pooler layer.
encoder_layers (:obj:`int`, `optional`, defaults to 6): encoder_layers (:obj:`int`, `optional`, defaults to 12):
Number of encoder layers. Number of encoder layers.
decoder_layers (:obj:`int`, `optional`, defaults to 6): decoder_layers (:obj:`int`, `optional`, defaults to 12):
Number of decoder layers. Number of decoder layers.
encoder_attention_heads (:obj:`int`, `optional`, defaults to 8): encoder_attention_heads (:obj:`int`, `optional`, defaults to 16):
Number of attention heads for each attention layer in the Transformer encoder. Number of attention heads for each attention layer in the Transformer encoder.
decoder_attention_heads (:obj:`int`, `optional`, defaults to 8): decoder_attention_heads (:obj:`int`, `optional`, defaults to 16):
Number of attention heads for each attention layer in the Transformer decoder. Number of attention heads for each attention layer in the Transformer decoder.
decoder_ffn_dim (:obj:`int`, `optional`, defaults to 2048): decoder_ffn_dim (:obj:`int`, `optional`, defaults to 4096):
Dimensionality of the "intermediate" (i.e., feed-forward) layer in decoder. Dimensionality of the "intermediate" (often named feed-forward) layer in decoder.
encoder_ffn_dim (:obj:`int`, `optional`, defaults to 2048): encoder_ffn_dim (:obj:`int`, `optional`, defaults to 4096):
Dimensionality of the "intermediate" (i.e., feed-forward) layer in decoder. Dimensionality of the "intermediate" (often named feed-forward) layer in decoder.
activation_function (:obj:`str` or :obj:`function`, `optional`, defaults to :obj:`"gelu"`): activation_function (:obj:`str` or :obj:`function`, `optional`, defaults to :obj:`"gelu"`):
The non-linear activation function (function or string) in the encoder and pooler. If string, The non-linear activation function (function or string) in the encoder and pooler. If string,
:obj:`"gelu"`, :obj:`"relu"`, :obj:`"silu"` and :obj:`"gelu_new"` are supported. :obj:`"gelu"`, :obj:`"relu"`, :obj:`"silu"` and :obj:`"gelu_new"` are supported.
@@ -59,42 +67,113 @@ class MarianConfig(BartConfig):
The dropout ratio for activations inside the fully connected layer. The dropout ratio for activations inside the fully connected layer.
classifier_dropout (:obj:`float`, `optional`, defaults to 0.0): classifier_dropout (:obj:`float`, `optional`, defaults to 0.0):
The dropout ratio for classifier. The dropout ratio for classifier.
max_position_embeddings (:obj:`int`, `optional`, defaults to 512): max_position_embeddings (:obj:`int`, `optional`, defaults to 1024):
The maximum sequence length that this model might ever be used with. Typically set this to something large The maximum sequence length that this model might ever be used with. Typically set this to something large
just in case (e.g., 512 or 1024 or 2048). just in case (e.g., 512 or 1024 or 2048).
init_std (:obj:`float`, `optional`, defaults to 0.02): init_std (:obj:`float`, `optional`, defaults to 0.02):
The standard deviation of the truncated_normal_initializer for initializing all weight matrices. The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
add_bias_logits (:obj:`bool`, `optional`, defaults to :obj:`False`):
This should be completed, specific to marian.
normalize_before (:obj:`bool`, `optional`, defaults to :obj:`False`):
Call layernorm before attention ops.
normalize_embedding (:obj:`bool`, `optional`, defaults to :obj:`False`):
Call layernorm after embeddings.
static_position_embeddings (:obj:`bool`, `optional`, defaults to :obj:`True`):
Don't learn positional embeddings, use sinusoidal.
add_final_layer_norm (:obj:`bool`, `optional`, defaults to :obj:`False`):
Why not add another layernorm?
scale_embedding (:obj:`bool`, `optional`, defaults to :obj:`False`):
Scale embeddings by diving by sqrt(d_model).
eos_token_id (:obj:`int`, `optional`, defaults to 2)
End of stream token id.
pad_token_id (:obj:`int`, `optional`, defaults to 1)
Padding token id.
bos_token_id (:obj:`int`, `optional`, defaults to 0)
Beginning of stream token id.
encoder_layerdrop: (:obj:`float`, `optional`, defaults to 0.0): encoder_layerdrop: (:obj:`float`, `optional`, defaults to 0.0):
The LayerDrop probability for the encoder. See the `LayerDrop paper <see The LayerDrop probability for the encoder. See the `LayerDrop paper <see
https://arxiv.org/abs/1909.11556>`__ for more details. https://arxiv.org/abs/1909.11556>`__ for more details.
decoder_layerdrop: (:obj:`float`, `optional`, defaults to 0.0): decoder_layerdrop: (:obj:`float`, `optional`, defaults to 0.0):
The LayerDrop probability for the decoder. See the `LayerDrop paper <see The LayerDrop probability for the decoder. See the `LayerDrop paper <see
https://arxiv.org/abs/1909.11556>`__ for more details. https://arxiv.org/abs/1909.11556>`__ for more details.
extra_pos_embeddings: (:obj:`int`, `optional`, defaults to 2): gradient_checkpointing (:obj:`bool`, `optional`, defaults to :obj:`False`):
How many extra learned positional embeddings to use. If True, use gradient checkpointing to save memory at the expense of slower backward pass.
is_encoder_decoder (:obj:`bool`, `optional`, defaults to :obj:`True`): scale_embedding (:obj:`bool`, `optional`, defaults to :obj:`False`):
Whether this is an encoder/decoder model Scale embeddings by diving by sqrt(d_model).
force_bos_token_to_be_generated (:obj:`bool`, `optional`, defaults to :obj:`False`): use_cache (:obj:`bool`, `optional`, defaults to :obj:`True`):
Whether or not to force BOS token to be generated at step 1 (after ``decoder_start_token_id``). Whether or not the model should return the last key/values attentions (not used by all models)
"""
Examples::
>>> from transformers import MarianModel, MarianConfig
>>> # Initializing a Marian Helsinki-NLP/opus-mt-en-de style configuration
>>> configuration = MarianConfig()
>>> # Initializing a model from the Helsinki-NLP/opus-mt-en-de style configuration
>>> model = MarianModel(configuration)
>>> # Accessing the model configuration
>>> configuration = model.config
"""
model_type = "marian" model_type = "marian"
keys_to_ignore_at_inference = ["past_key_values"] keys_to_ignore_at_inference = ["past_key_values"]
def __init__(
self,
vocab_size=50265,
max_position_embeddings=1024,
encoder_layers=12,
encoder_ffn_dim=4096,
encoder_attention_heads=16,
decoder_layers=12,
decoder_ffn_dim=4096,
decoder_attention_heads=16,
encoder_layerdrop=0.0,
decoder_layerdrop=0.0,
use_cache=True,
is_encoder_decoder=True,
activation_function="gelu",
d_model=1024,
dropout=0.1,
attention_dropout=0.0,
activation_dropout=0.0,
init_std=0.02,
decoder_start_token_id=58100,
classifier_dropout=0.0,
scale_embedding=False,
gradient_checkpointing=False,
pad_token_id=58100,
eos_token_id=0,
**kwargs
):
super().__init__(
pad_token_id=pad_token_id,
eos_token_id=eos_token_id,
is_encoder_decoder=is_encoder_decoder,
decoder_start_token_id=decoder_start_token_id,
**kwargs,
)
self.vocab_size = vocab_size
self.max_position_embeddings = max_position_embeddings
self.d_model = d_model
self.encoder_ffn_dim = encoder_ffn_dim
self.encoder_layers = encoder_layers
self.encoder_attention_heads = encoder_attention_heads
self.decoder_ffn_dim = decoder_ffn_dim
self.decoder_layers = decoder_layers
self.decoder_attention_heads = decoder_attention_heads
self.dropout = dropout
self.attention_dropout = attention_dropout
self.activation_dropout = activation_dropout
self.activation_function = activation_function
self.init_std = init_std
self.encoder_layerdrop = encoder_layerdrop
self.decoder_layerdrop = decoder_layerdrop
self.classifier_dropout = classifier_dropout
self.use_cache = use_cache
self.num_hidden_layers = encoder_layers
self.gradient_checkpointing = gradient_checkpointing
self.scale_embedding = scale_embedding # scale factor will be sqrt(d_model) if True
# IMPORTANT
# DELETE ALL OF THE FOLLOWING LINES AS SOON AS TF IS READY
self.extra_pos_embeddings = 0
self.normalize_before = False
self.add_final_layer_norm = False
self.do_blenderbot_90_layernorm = False
self.normalize_embedding = False
self.static_position_embeddings = True
self.add_bias_logits = False
self.force_bos_token_to_be_generated = False
@property
def num_attention_heads(self) -> int:
return self.encoder_attention_heads
@property
def hidden_size(self) -> int:
return self.d_model

1184
src/transformers/models/marian/modeling_marian.py Normal file → Executable file

File diff suppressed because it is too large Load Diff

View File

@@ -84,7 +84,7 @@ class MarianTokenizer(PreTrainedTokenizer):
>>> tok = MarianTokenizer.from_pretrained('Helsinki-NLP/opus-mt-en-de') >>> tok = MarianTokenizer.from_pretrained('Helsinki-NLP/opus-mt-en-de')
>>> src_texts = [ "I am a small frog.", "Tom asked his teacher for advice."] >>> src_texts = [ "I am a small frog.", "Tom asked his teacher for advice."]
>>> tgt_texts = ["Ich bin ein kleiner Frosch.", "Tom bat seinen Lehrer um Rat."] # optional >>> tgt_texts = ["Ich bin ein kleiner Frosch.", "Tom bat seinen Lehrer um Rat."] # optional
>>> batch_enc: BatchEncoding = tok.prepare_seq2seq_batch(src_texts, tgt_texts=tgt_texts, return_tensors="pt") >>> batch_enc = tok.prepare_seq2seq_batch(src_texts, tgt_texts=tgt_texts, return_tensors="pt")
>>> # keys [input_ids, attention_mask, labels]. >>> # keys [input_ids, attention_mask, labels].
>>> # model(**batch) should work >>> # model(**batch) should work
""" """

View File

@@ -15,9 +15,8 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and # See the License for the specific language governing permissions and
# limitations under the License. # limitations under the License.
from ...file_utils import is_sentencepiece_available, is_tf_available, is_tokenizers_available, is_torch_available from ...file_utils import is_sentencepiece_available, is_tf_available, is_tokenizers_available, is_torch_available
from .configuration_mbart import MBartConfig from .configuration_mbart import MBART_PRETRAINED_CONFIG_ARCHIVE_MAP, MBartConfig
if is_sentencepiece_available(): if is_sentencepiece_available():
@@ -27,7 +26,14 @@ if is_tokenizers_available():
from .tokenization_mbart_fast import MBartTokenizerFast from .tokenization_mbart_fast import MBartTokenizerFast
if is_torch_available(): if is_torch_available():
from .modeling_mbart import MBartForConditionalGeneration, MBartModel from .modeling_mbart import (
MBART_PRETRAINED_MODEL_ARCHIVE_LIST,
MBartForConditionalGeneration,
MBartForQuestionAnswering,
MBartForSequenceClassification,
MBartModel,
MBartPreTrainedModel,
)
if is_tf_available(): if is_tf_available():
from .modeling_tf_mbart import TFMBartForConditionalGeneration from .modeling_tf_mbart import TFMBartForConditionalGeneration

View File

@@ -1,5 +1,5 @@
# coding=utf-8 # coding=utf-8
# Copyright 2020 The Fairseq Authors and The HuggingFace Inc. team. # Copyright 2021, The Facebook AI Research Team and The HuggingFace Inc. team. All rights reserved.
# #
# Licensed under the Apache License, Version 2.0 (the "License"); # Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License. # you may not use this file except in compliance with the License.
@@ -12,33 +12,36 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and # See the License for the specific language governing permissions and
# limitations under the License. # limitations under the License.
""" MBART configuration """ """ MBART model configuration """
from ...configuration_utils import PretrainedConfig
from ...utils import logging from ...utils import logging
from ..bart.configuration_bart import BartConfig
logger = logging.get_logger(__name__) logger = logging.get_logger(__name__)
MBART_PRETRAINED_CONFIG_ARCHIVE_MAP = { MBART_PRETRAINED_CONFIG_ARCHIVE_MAP = {
"facebook/mbart-large-en-ro": "https://huggingface.co/facebook/mbart-large-en-ro/resolve/main/config.json",
"facebook/mbart-large-cc25": "https://huggingface.co/facebook/mbart-large-cc25/resolve/main/config.json", "facebook/mbart-large-cc25": "https://huggingface.co/facebook/mbart-large-cc25/resolve/main/config.json",
# See all MBART models at https://huggingface.co/models?filter=mbart
} }
class MBartConfig(BartConfig): class MBartConfig(PretrainedConfig):
""" r"""
This is the configuration class to store the configuration of a This is the configuration class to store the configuration of a :class:`~transformers.MBartModel`. It is used to
:class:`~transformers.MBartForConditionalGeneration`. It is used to instantiate a BART model according to the instantiate an MBART model according to the specified arguments, defining the model architecture. Instantiating a
specified arguments, defining the model architecture. configuration with the defaults will yield a similar configuration to that of the MBART `facebook/mbart-large-cc25
<https://huggingface.co/facebook/mbart-large-cc25>`__ architecture.
Configuration objects inherit from :class:`~transformers.PretrainedConfig` and can be used to control the model Configuration objects inherit from :class:`~transformers.PretrainedConfig` and can be used to control the model
outputs. Read the documentation from :class:`~transformers.PretrainedConfig` for more information. outputs. Read the documentation from :class:`~transformers.PretrainedConfig` for more information.
Args: Args:
vocab_size (:obj:`int`, `optional`, defaults to 250027): vocab_size (:obj:`int`, `optional`, defaults to 50265):
Vocabulary size of the MBART model. Defines the number of different tokens that can be represented by the Vocabulary size of the MBART model. Defines the number of different tokens that can be represented by the
:obj:`inputs_ids` passed when calling :class:`~transformers.MBartForConditionalGeneration`. :obj:`inputs_ids` passed when calling :class:`~transformers.MBartModel` or
:class:`~transformers.TFMBartModel`.
d_model (:obj:`int`, `optional`, defaults to 1024): d_model (:obj:`int`, `optional`, defaults to 1024):
Dimensionality of the layers and the pooler layer. Dimensionality of the layers and the pooler layer.
encoder_layers (:obj:`int`, `optional`, defaults to 12): encoder_layers (:obj:`int`, `optional`, defaults to 12):
@@ -50,9 +53,9 @@ class MBartConfig(BartConfig):
decoder_attention_heads (:obj:`int`, `optional`, defaults to 16): decoder_attention_heads (:obj:`int`, `optional`, defaults to 16):
Number of attention heads for each attention layer in the Transformer decoder. Number of attention heads for each attention layer in the Transformer decoder.
decoder_ffn_dim (:obj:`int`, `optional`, defaults to 4096): decoder_ffn_dim (:obj:`int`, `optional`, defaults to 4096):
Dimensionality of the "intermediate" (i.e., feed-forward) layer in decoder. Dimensionality of the "intermediate" (often named feed-forward) layer in decoder.
encoder_ffn_dim (:obj:`int`, `optional`, defaults to 4096): encoder_ffn_dim (:obj:`int`, `optional`, defaults to 4096):
Dimensionality of the "intermediate" (i.e., feed-forward) layer in decoder. Dimensionality of the "intermediate" (often named feed-forward) layer in decoder.
activation_function (:obj:`str` or :obj:`function`, `optional`, defaults to :obj:`"gelu"`): activation_function (:obj:`str` or :obj:`function`, `optional`, defaults to :obj:`"gelu"`):
The non-linear activation function (function or string) in the encoder and pooler. If string, The non-linear activation function (function or string) in the encoder and pooler. If string,
:obj:`"gelu"`, :obj:`"relu"`, :obj:`"silu"` and :obj:`"gelu_new"` are supported. :obj:`"gelu"`, :obj:`"relu"`, :obj:`"silu"` and :obj:`"gelu_new"` are supported.
@@ -69,37 +72,108 @@ class MBartConfig(BartConfig):
just in case (e.g., 512 or 1024 or 2048). just in case (e.g., 512 or 1024 or 2048).
init_std (:obj:`float`, `optional`, defaults to 0.02): init_std (:obj:`float`, `optional`, defaults to 0.02):
The standard deviation of the truncated_normal_initializer for initializing all weight matrices. The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
add_bias_logits (:obj:`bool`, `optional`, defaults to :obj:`False`):
This should be completed, specific to marian.
normalize_before (:obj:`bool`, `optional`, defaults to :obj:`True`):
Call layernorm before attention ops.
normalize_embedding (:obj:`bool`, `optional`, defaults to :obj:`True`):
Call layernorm after embeddings. Only True for Bart.
static_position_embeddings (:obj:`bool`, `optional`, defaults to :obj:`False`):
Don't learn positional embeddings, use sinusoidal.
add_final_layer_norm (:obj:`bool`, `optional`, defaults to :obj:`True`):
Why not add another layernorm?
scale_embedding (:obj:`bool`, `optional`, defaults to :obj:`False`):
Scale embeddings by diving by sqrt(d_model).
eos_token_id (:obj:`int`, `optional`, defaults to 2)
End of stream token id.
pad_token_id (:obj:`int`, `optional`, defaults to 1)
Padding token id.
bos_token_id (:obj:`int`, `optional`, defaults to 0)
Beginning of stream token id.
encoder_layerdrop: (:obj:`float`, `optional`, defaults to 0.0): encoder_layerdrop: (:obj:`float`, `optional`, defaults to 0.0):
The LayerDrop probability for the encoder. See the `LayerDrop paper <see The LayerDrop probability for the encoder. See the `LayerDrop paper <see
https://arxiv.org/abs/1909.11556>`__ for more details. https://arxiv.org/abs/1909.11556>`__ for more details.
decoder_layerdrop: (:obj:`float`, `optional`, defaults to 0.0): decoder_layerdrop: (:obj:`float`, `optional`, defaults to 0.0):
The LayerDrop probability for the decoder. See the `LayerDrop paper <see The LayerDrop probability for the decoder. See the `LayerDrop paper <see
https://arxiv.org/abs/1909.11556>`__ for more details. https://arxiv.org/abs/1909.11556>`__ for more details.
extra_pos_embeddings: (:obj:`int`, `optional`, defaults to 2): gradient_checkpointing (:obj:`bool`, `optional`, defaults to :obj:`False`):
How many extra learned positional embeddings to use. Should be equal to :obj:`pad_token_id+1`. If True, use gradient checkpointing to save memory at the expense of slower backward pass.
is_encoder_decoder (:obj:`bool`, `optional`, defaults to :obj:`True`): scale_embedding (:obj:`bool`, `optional`, defaults to :obj:`False`):
Whether this is an encoder/decoder model Scale embeddings by diving by sqrt(d_model).
force_bos_token_to_be_generated (:obj:`bool`, `optional`, defaults to :obj:`False`): use_cache (:obj:`bool`, `optional`, defaults to :obj:`True`):
Whether or not to force BOS token to be generated at step 1 (after ``decoder_start_token_id``). Whether or not the model should return the last key/values attentions (not used by all models)
"""
Example::
>>> from transformers import MBartModel, MBartConfig
>>> # Initializing a MBART facebook/mbart-large-cc25 style configuration
>>> configuration = MBartConfig()
>>> # Initializing a model from the facebook/mbart-large-cc25 style configuration
>>> model = MBartModel(configuration)
>>> # Accessing the model configuration
>>> configuration = model.config
"""
model_type = "mbart" model_type = "mbart"
keys_to_ignore_at_inference = ["past_key_values"] keys_to_ignore_at_inference = ["past_key_values"]
def __init__(
self,
vocab_size=50265,
max_position_embeddings=1024,
encoder_layers=12,
encoder_ffn_dim=4096,
encoder_attention_heads=16,
decoder_layers=12,
decoder_ffn_dim=4096,
decoder_attention_heads=16,
encoder_layerdrop=0.0,
decoder_layerdrop=0.0,
use_cache=True,
is_encoder_decoder=True,
activation_function="gelu",
d_model=1024,
dropout=0.1,
attention_dropout=0.0,
activation_dropout=0.0,
init_std=0.02,
classifier_dropout=0.0,
scale_embedding=False,
gradient_checkpointing=False,
pad_token_id=1,
bos_token_id=0,
eos_token_id=2,
**kwargs
):
super().__init__(
pad_token_id=pad_token_id,
bos_token_id=bos_token_id,
eos_token_id=eos_token_id,
is_encoder_decoder=is_encoder_decoder,
**kwargs,
)
self.vocab_size = vocab_size
self.max_position_embeddings = max_position_embeddings
self.d_model = d_model
self.encoder_ffn_dim = encoder_ffn_dim
self.encoder_layers = encoder_layers
self.encoder_attention_heads = encoder_attention_heads
self.decoder_ffn_dim = decoder_ffn_dim
self.decoder_layers = decoder_layers
self.decoder_attention_heads = decoder_attention_heads
self.dropout = dropout
self.attention_dropout = attention_dropout
self.activation_dropout = activation_dropout
self.activation_function = activation_function
self.init_std = init_std
self.encoder_layerdrop = encoder_layerdrop
self.decoder_layerdrop = decoder_layerdrop
self.classifier_dropout = classifier_dropout
self.use_cache = use_cache
self.num_hidden_layers = encoder_layers
self.gradient_checkpointing = gradient_checkpointing
self.scale_embedding = scale_embedding # scale factor will be sqrt(d_model) if True
# IMPORTANT
# DELETE ALL OF THE FOLLOWING LINES AS SOON AS TF IS READY
self.extra_pos_embeddings = 2
self.normalize_before = True
self.add_final_layer_norm = True
self.do_blenderbot_90_layernorm = False
self.normalize_embedding = True
self.static_position_embeddings = False
self.add_bias_logits = False
self.force_bos_token_to_be_generated = False
@property
def num_attention_heads(self) -> int:
return self.encoder_attention_heads
@property
def hidden_size(self) -> int:
return self.d_model

1470
src/transformers/models/mbart/modeling_mbart.py Normal file → Executable file

File diff suppressed because it is too large Load Diff

View File

@@ -15,9 +15,8 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and # See the License for the specific language governing permissions and
# limitations under the License. # limitations under the License.
from ...file_utils import is_sentencepiece_available, is_tf_available, is_tokenizers_available, is_torch_available from ...file_utils import is_sentencepiece_available, is_tf_available, is_tokenizers_available, is_torch_available
from .configuration_pegasus import PegasusConfig from .configuration_pegasus import PEGASUS_PRETRAINED_CONFIG_ARCHIVE_MAP, PegasusConfig
if is_sentencepiece_available(): if is_sentencepiece_available():
@@ -27,7 +26,12 @@ if is_tokenizers_available():
from .tokenization_pegasus_fast import PegasusTokenizerFast from .tokenization_pegasus_fast import PegasusTokenizerFast
if is_torch_available(): if is_torch_available():
from .modeling_pegasus import PegasusForConditionalGeneration, PegasusModel from .modeling_pegasus import (
PEGASUS_PRETRAINED_MODEL_ARCHIVE_LIST,
PegasusForConditionalGeneration,
PegasusModel,
PegasusPreTrainedModel,
)
if is_tf_available(): if is_tf_available():
from .modeling_tf_pegasus import TFPegasusForConditionalGeneration from .modeling_tf_pegasus import TFPegasusForConditionalGeneration

View File

@@ -1,5 +1,5 @@
# coding=utf-8 # coding=utf-8
# Copyright 2020 Google and The HuggingFace Inc. team. # Copyright 2021, Google and The HuggingFace Inc. team. All rights reserved.
# #
# Licensed under the Apache License, Version 2.0 (the "License"); # Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License. # you may not use this file except in compliance with the License.
@@ -14,84 +14,48 @@
# limitations under the License. # limitations under the License.
""" PEGASUS model configuration """ """ PEGASUS model configuration """
from ...configuration_utils import PretrainedConfig
from ...utils import logging from ...utils import logging
from ..bart.configuration_bart import BartConfig
logger = logging.get_logger(__name__) logger = logging.get_logger(__name__)
# These config values do not vary between checkpoints PEGASUS_PRETRAINED_CONFIG_ARCHIVE_MAP = {
DEFAULTS = dict( "google/pegasus-large": "https://huggingface.co/google/pegasus-large/resolve/main/config.json",
vocab_size=96103, # See all PEGASUS models at https://huggingface.co/models?filter=pegasus
max_position_embeddings=512,
d_model=1024,
encoder_ffn_dim=4096,
decoder_ffn_dim=4096,
encoder_attention_heads=16,
decoder_attention_heads=16,
encoder_layers=16,
decoder_layers=16,
dropout=0.1,
attention_dropout=0.1,
activation_dropout=0.1,
pad_token_id=0,
eos_token_id=1,
is_encoder_decoder=True,
normalize_before=True,
scale_embedding=True,
normalize_embedding=False,
add_final_layer_norm=True,
static_position_embeddings=True,
num_beams=8,
activation_function="relu",
)
# Config values that vary between checkpoints: for testing and conversion
task_specific_params = {
# These are task specific params for pegasus-large and normal params for finetuned checkpoints
"summarization_xsum": {"length_penalty": 0.6, "max_length": 64, "max_position_embeddings": 512},
"summarization_cnn_dailymail": {"length_penalty": 0.8, "max_length": 128, "max_position_embeddings": 1024},
"summarization_newsroom": {"length_penalty": 0.8, "max_length": 128, "max_position_embeddings": 512},
"summarization_wikihow": {"length_penalty": 0.6, "max_length": 256, "max_position_embeddings": 512},
"summarization_multi_news": {"length_penalty": 0.8, "max_length": 256, "max_position_embeddings": 1024},
"summarization_reddit_tifu": {"length_penalty": 0.6, "max_length": 128, "max_position_embeddings": 512},
"summarization_big_patent": {"length_penalty": 0.7, "max_length": 256, "max_position_embeddings": 1024},
"summarization_arxiv": {"length_penalty": 0.8, "max_length": 256, "max_position_embeddings": 1024},
"summarization_pubmed": {"length_penalty": 0.8, "max_length": 256, "max_position_embeddings": 1024},
"summarization_gigaword": {"length_penalty": 0.6, "max_length": 32, "max_position_embeddings": 128},
"summarization_aeslc": {"length_penalty": 0.6, "max_length": 32, "max_position_embeddings": 512},
"summarization_billsum": {"length_penalty": 0.6, "max_length": 256, "max_position_embeddings": 1024},
# this last entry is useless -- just for consistency
"summarization_large": {"length_penalty": 0.8, "max_length": 256, "max_position_embeddings": 1024},
} }
class PegasusConfig(BartConfig): class PegasusConfig(PretrainedConfig):
""" r"""
This is the configuration class to store the configuration of a This is the configuration class to store the configuration of a :class:`~transformers.PegasusModel`. It is used to
:class:`~transformers.PegasusForConditionalGeneration`. It is used to instantiate a Pegasus model according to the instantiate an PEGASUS model according to the specified arguments, defining the model architecture. Instantiating a
specified arguments, defining the model architecture. configuration with the defaults will yield a similar configuration to that of the PEGASUS `google/pegasus-large
<https://huggingface.co/google/pegasus-large>`__ architecture.
Configuration objects inherit from :class:`~transformers.PretrainedConfig` and can be used to control the model Configuration objects inherit from :class:`~transformers.PretrainedConfig` and can be used to control the model
outputs. Read the documentation from :class:`~transformers.PretrainedConfig` for more information. outputs. Read the documentation from :class:`~transformers.PretrainedConfig` for more information.
Args: Args:
vocab_size (:obj:`int`, `optional`, defaults to 96103): vocab_size (:obj:`int`, `optional`, defaults to 50265):
Vocabulary size of the Pegasus model. Defines the number of different tokens that can be represented by the Vocabulary size of the PEGASUS model. Defines the number of different tokens that can be represented by the
:obj:`inputs_ids` passed when calling :class:`~transformers.PegasusForConditionalGeneration`. :obj:`inputs_ids` passed when calling :class:`~transformers.PegasusModel` or
:class:`~transformers.TFPegasusModel`.
d_model (:obj:`int`, `optional`, defaults to 1024): d_model (:obj:`int`, `optional`, defaults to 1024):
Dimensionality of the layers and the pooler layer. Dimensionality of the layers and the pooler layer.
encoder_layers (:obj:`int`, `optional`, defaults to 16): encoder_layers (:obj:`int`, `optional`, defaults to 12):
Number of encoder layers. Number of encoder layers.
decoder_layers (:obj:`int`, `optional`, defaults to 16): decoder_layers (:obj:`int`, `optional`, defaults to 12):
Number of decoder layers. Number of decoder layers.
encoder_attention_heads (:obj:`int`, `optional`, defaults to 16): encoder_attention_heads (:obj:`int`, `optional`, defaults to 16):
Number of attention heads for each attention layer in the Transformer encoder. Number of attention heads for each attention layer in the Transformer encoder.
decoder_attention_heads (:obj:`int`, `optional`, defaults to 16): decoder_attention_heads (:obj:`int`, `optional`, defaults to 16):
Number of attention heads for each attention layer in the Transformer decoder. Number of attention heads for each attention layer in the Transformer decoder.
decoder_ffn_dim (:obj:`int`, `optional`, defaults to 4096): decoder_ffn_dim (:obj:`int`, `optional`, defaults to 4096):
Dimensionality of the "intermediate" (i.e., feed-forward) layer in decoder. Dimensionality of the "intermediate" (often named feed-forward) layer in decoder.
encoder_ffn_dim (:obj:`int`, `optional`, defaults to 4096): encoder_ffn_dim (:obj:`int`, `optional`, defaults to 4096):
Dimensionality of the "intermediate" (i.e., feed-forward) layer in decoder. Dimensionality of the "intermediate" (often named feed-forward) layer in decoder.
activation_function (:obj:`str` or :obj:`function`, `optional`, defaults to :obj:`"gelu"`): activation_function (:obj:`str` or :obj:`function`, `optional`, defaults to :obj:`"gelu"`):
The non-linear activation function (function or string) in the encoder and pooler. If string, The non-linear activation function (function or string) in the encoder and pooler. If string,
:obj:`"gelu"`, :obj:`"relu"`, :obj:`"silu"` and :obj:`"gelu_new"` are supported. :obj:`"gelu"`, :obj:`"relu"`, :obj:`"silu"` and :obj:`"gelu_new"` are supported.
@@ -108,38 +72,108 @@ class PegasusConfig(BartConfig):
just in case (e.g., 512 or 1024 or 2048). just in case (e.g., 512 or 1024 or 2048).
init_std (:obj:`float`, `optional`, defaults to 0.02): init_std (:obj:`float`, `optional`, defaults to 0.02):
The standard deviation of the truncated_normal_initializer for initializing all weight matrices. The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
add_bias_logits (:obj:`bool`, `optional`, defaults to :obj:`False`):
This should be completed, specific to marian.
normalize_before (:obj:`bool`, `optional`, defaults to :obj:`True`):
Call layernorm before attention ops.
normalize_embedding (:obj:`bool`, `optional`, defaults to :obj:`False`):
Call layernorm after embeddings.
static_position_embeddings (:obj:`bool`, `optional`, defaults to :obj:`True`):
Don't learn positional embeddings, use sinusoidal.
add_final_layer_norm (:obj:`bool`, `optional`, defaults to :obj:`True`):
Why not add another layernorm?
scale_embedding (:obj:`bool`, `optional`, defaults to :obj:`True`):
Scale embeddings by diving by sqrt(d_model).
eos_token_id (:obj:`int`, `optional`, defaults to 2)
End of stream token id.
pad_token_id (:obj:`int`, `optional`, defaults to 1)
Padding token id.
bos_token_id (:obj:`int`, `optional`, defaults to 0)
Beginning of stream token id.
encoder_layerdrop: (:obj:`float`, `optional`, defaults to 0.0): encoder_layerdrop: (:obj:`float`, `optional`, defaults to 0.0):
The LayerDrop probability for the encoder. See the `LayerDrop paper <see The LayerDrop probability for the encoder. See the `LayerDrop paper <see
https://arxiv.org/abs/1909.11556>`__ for more details. https://arxiv.org/abs/1909.11556>`__ for more details.
decoder_layerdrop: (:obj:`float`, `optional`, defaults to 0.0): decoder_layerdrop: (:obj:`float`, `optional`, defaults to 0.0):
The LayerDrop probability for the decoder. See the `LayerDrop paper <see The LayerDrop probability for the decoder. See the `LayerDrop paper <see
https://arxiv.org/abs/1909.11556>`__ for more details. https://arxiv.org/abs/1909.11556>`__ for more details.
extra_pos_embeddings: (:obj:`int`, `optional`, defaults to 2): gradient_checkpointing (:obj:`bool`, `optional`, defaults to :obj:`False`):
How many extra learned positional embeddings to use. Should be pad_token_id+1 for bart. If True, use gradient checkpointing to save memory at the expense of slower backward pass.
is_encoder_decoder (:obj:`bool`, `optional`, defaults to :obj:`True`): scale_embedding (:obj:`bool`, `optional`, defaults to :obj:`False`):
Whether this is an encoder/decoder model Scale embeddings by diving by sqrt(d_model).
force_bos_token_to_be_generated (:obj:`bool`, `optional`, defaults to :obj:`False`): use_cache (:obj:`bool`, `optional`, defaults to :obj:`True`):
Whether or not to force BOS token to be generated at step 1 (after ``decoder_start_token_id``). Whether or not the model should return the last key/values attentions (not used by all models)
"""
Example::
>>> from transformers import PegasusModel, PegasusConfig
>>> # Initializing a PEGASUS google/pegasus-large style configuration
>>> configuration = PegasusConfig()
>>> # Initializing a model from the google/pegasus-large style configuration
>>> model = PegasusModel(configuration)
>>> # Accessing the model configuration
>>> configuration = model.config
"""
model_type = "pegasus" model_type = "pegasus"
keys_to_ignore_at_inference = ["past_key_values"] keys_to_ignore_at_inference = ["past_key_values"]
# The implementation of the config object is in BartConfig
def __init__(
self,
vocab_size=50265,
max_position_embeddings=1024,
encoder_layers=12,
encoder_ffn_dim=4096,
encoder_attention_heads=16,
decoder_layers=12,
decoder_ffn_dim=4096,
decoder_attention_heads=16,
encoder_layerdrop=0.0,
decoder_layerdrop=0.0,
use_cache=True,
is_encoder_decoder=True,
activation_function="gelu",
d_model=1024,
dropout=0.1,
attention_dropout=0.0,
activation_dropout=0.0,
init_std=0.02,
decoder_start_token_id=0,
classifier_dropout=0.0,
scale_embedding=False,
gradient_checkpointing=False,
pad_token_id=0,
eos_token_id=1,
**kwargs
):
super().__init__(
pad_token_id=pad_token_id,
eos_token_id=eos_token_id,
is_encoder_decoder=is_encoder_decoder,
decoder_start_token_id=decoder_start_token_id,
**kwargs,
)
self.vocab_size = vocab_size
self.max_position_embeddings = max_position_embeddings
self.d_model = d_model
self.encoder_ffn_dim = encoder_ffn_dim
self.encoder_layers = encoder_layers
self.encoder_attention_heads = encoder_attention_heads
self.decoder_ffn_dim = decoder_ffn_dim
self.decoder_layers = decoder_layers
self.decoder_attention_heads = decoder_attention_heads
self.dropout = dropout
self.attention_dropout = attention_dropout
self.activation_dropout = activation_dropout
self.activation_function = activation_function
self.init_std = init_std
self.encoder_layerdrop = encoder_layerdrop
self.decoder_layerdrop = decoder_layerdrop
self.classifier_dropout = classifier_dropout
self.use_cache = use_cache
self.num_hidden_layers = encoder_layers
self.gradient_checkpointing = gradient_checkpointing
self.scale_embedding = scale_embedding # scale factor will be sqrt(d_model) if True
# IMPORTANT
# DELETE ALL OF THE FOLLOWING LINES AS SOON AS TF IS READY
self.extra_pos_embeddings = 0
self.normalize_before = True
self.add_final_layer_norm = True
self.do_blenderbot_90_layernorm = False
self.normalize_embedding = False
self.static_position_embeddings = True
self.add_bias_logits = False
self.force_bos_token_to_be_generated = False
@property
def num_attention_heads(self) -> int:
return self.encoder_attention_heads
@property
def hidden_size(self) -> int:
return self.d_model

1244
src/transformers/models/pegasus/modeling_pegasus.py Normal file → Executable file

File diff suppressed because it is too large Load Diff

View File

@@ -88,9 +88,19 @@ PROPHETNET_INPUTS_DOCSTRING = r"""
`What are attention masks? <../glossary.html#attention-mask>`__ `What are attention masks? <../glossary.html#attention-mask>`__
decoder_input_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, target_sequence_length)`, `optional`): decoder_input_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, target_sequence_length)`, `optional`):
Provide for translation and summarization training. By default, the model will create this tensor by Indices of decoder input sequence tokens in the vocabulary.
shifting the :obj:`input_ids` to the right, following the paper.
decoder_attention_mask (:obj:`torch.BoolTensor` of shape :obj:`(batch_size, tgt_seq_len)`, `optional`): Indices can be obtained using :class:`~transformers.PreTrainedTokenizer`. See
:meth:`transformers.PreTrainedTokenizer.encode` and :meth:`transformers.PreTrainedTokenizer.__call__` for
details.
`What are input IDs? <../glossary.html#input-ids>`__
ProphetNet uses the :obj:`eos_token_id` as the starting token for :obj:`decoder_input_ids` generation. If
:obj:`past_key_values` is used, optionally only the last :obj:`decoder_input_ids` have to be input (see
:obj:`past_key_values`).
decoder_attention_mask (:obj:`torch.BoolTensor` of shape :obj:`(batch_size, target_sequence_length)`, `optional`):
Default behavior: generate a tensor that ignores pad tokens in :obj:`decoder_input_ids`. Causal mask will Default behavior: generate a tensor that ignores pad tokens in :obj:`decoder_input_ids`. Causal mask will
also be used by default. also be used by default.

View File

@@ -1028,14 +1028,22 @@ T5_INPUTS_DOCSTRING = r"""
`What are attention masks? <../glossary.html#attention-mask>`__ `What are attention masks? <../glossary.html#attention-mask>`__
decoder_input_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, target_sequence_length)`, `optional`): decoder_input_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, target_sequence_length)`, `optional`):
Provide for sequence to sequence training. T5 uses the :obj:`pad_token_id` as the starting token for Indices of decoder input sequence tokens in the vocabulary.
:obj:`decoder_input_ids` generation. If :obj:`past_key_values` is used, optionally only the last
:obj:`decoder_input_ids` have to be input (see :obj:`past_key_values`). Indices can be obtained using :class:`~transformers.BartTokenizer`. See
:meth:`transformers.PreTrainedTokenizer.encode` and :meth:`transformers.PreTrainedTokenizer.__call__` for
details.
`What are input IDs? <../glossary.html#input-ids>`__
T5 uses the :obj:`pad_token_id` as the starting token for :obj:`decoder_input_ids` generation. If
:obj:`past_key_values` is used, optionally only the last :obj:`decoder_input_ids` have to be input (see
:obj:`past_key_values`).
To know more on how to prepare :obj:`decoder_input_ids` for pretraining take a look at `T5 Training To know more on how to prepare :obj:`decoder_input_ids` for pretraining take a look at `T5 Training
<./t5.html#training>`__. If :obj:`decoder_input_ids` and :obj:`decoder_inputs_embeds` are both unset, <./t5.html#training>`__. If :obj:`decoder_input_ids` and :obj:`decoder_inputs_embeds` are both unset,
:obj:`decoder_input_ids` takes the value of :obj:`input_ids`. :obj:`decoder_input_ids` takes the value of :obj:`input_ids`.
decoder_attention_mask (:obj:`torch.BoolTensor` of shape :obj:`(batch_size, tgt_seq_len)`, `optional`): decoder_attention_mask (:obj:`torch.BoolTensor` of shape :obj:`(batch_size, target_sequence_length)`, `optional`):
Default behavior: generate a tensor that ignores pad tokens in :obj:`decoder_input_ids`. Causal mask will Default behavior: generate a tensor that ignores pad tokens in :obj:`decoder_input_ids`. Causal mask will
also be used by default. also be used by default.
encoder_outputs (:obj:`tuple(tuple(torch.FloatTensor)`, `optional`): encoder_outputs (:obj:`tuple(tuple(torch.FloatTensor)`, `optional`):

View File

@@ -922,7 +922,7 @@ T5_INPUTS_DOCSTRING = r"""
- 0 for tokens that are **masked**. - 0 for tokens that are **masked**.
`What are attention masks? <../glossary.html#attention-mask>`__ `What are attention masks? <../glossary.html#attention-mask>`__
decoder_attention_mask (:obj:`tf.Tensor` of shape :obj:`(batch_size, tgt_seq_len)`, `optional`): decoder_attention_mask (:obj:`tf.Tensor` of shape :obj:`(batch_size, target_sequence_length)`, `optional`):
Default behavior: generate a tensor that ignores pad tokens in :obj:`decoder_input_ids`. Causal mask will Default behavior: generate a tensor that ignores pad tokens in :obj:`decoder_input_ids`. Causal mask will
also be used by default. also be used by default.
encoder_outputs (:obj:`tuple(tuple(tf.FloatTensor)`, `optional`): encoder_outputs (:obj:`tuple(tuple(tf.FloatTensor)`, `optional`):

View File

@@ -609,6 +609,27 @@ class BlenderbotModel:
requires_pytorch(self) requires_pytorch(self)
BLENDERBOT_SMALL_PRETRAINED_MODEL_ARCHIVE_LIST = None
class BlenderbotSmallForConditionalGeneration:
def __init__(self, *args, **kwargs):
requires_pytorch(self)
@classmethod
def from_pretrained(self, *args, **kwargs):
requires_pytorch(self)
class BlenderbotSmallModel:
def __init__(self, *args, **kwargs):
requires_pytorch(self)
@classmethod
def from_pretrained(self, *args, **kwargs):
requires_pytorch(self)
CAMEMBERT_PRETRAINED_MODEL_ARCHIVE_LIST = None CAMEMBERT_PRETRAINED_MODEL_ARCHIVE_LIST = None
@@ -1327,6 +1348,15 @@ class LxmertXLayer:
requires_pytorch(self) requires_pytorch(self)
class MarianModel:
def __init__(self, *args, **kwargs):
requires_pytorch(self)
@classmethod
def from_pretrained(self, *args, **kwargs):
requires_pytorch(self)
class MarianMTModel: class MarianMTModel:
def __init__(self, *args, **kwargs): def __init__(self, *args, **kwargs):
requires_pytorch(self) requires_pytorch(self)
@@ -1345,6 +1375,24 @@ class MBartForConditionalGeneration:
requires_pytorch(self) requires_pytorch(self)
class MBartForQuestionAnswering:
def __init__(self, *args, **kwargs):
requires_pytorch(self)
@classmethod
def from_pretrained(self, *args, **kwargs):
requires_pytorch(self)
class MBartForSequenceClassification:
def __init__(self, *args, **kwargs):
requires_pytorch(self)
@classmethod
def from_pretrained(self, *args, **kwargs):
requires_pytorch(self)
class MBartModel: class MBartModel:
def __init__(self, *args, **kwargs): def __init__(self, *args, **kwargs):
requires_pytorch(self) requires_pytorch(self)

View File

@@ -135,6 +135,10 @@ class {{cookiecutter.camelcase_modelname}}Config(PretrainedConfig):
>>> configuration = model.config >>> configuration = model.config
""" """
model_type = "{{cookiecutter.lowercase_modelname}}" model_type = "{{cookiecutter.lowercase_modelname}}"
{% if cookiecutter.is_encoder_decoder_model == "False" -%}
{% else -%}
keys_to_ignore_at_inference = ["past_key_values"]
{% endif -%}
def __init__( def __init__(
self, self,

View File

@@ -1849,7 +1849,7 @@ class TF{{cookiecutter.camelcase_modelname}}PreTrainedModel(TFPreTrainedModel):
decoder_input_ids (:obj:`tf.Tensor` of shape :obj:`(batch_size, target_sequence_length)`, `optional`): decoder_input_ids (:obj:`tf.Tensor` of shape :obj:`(batch_size, target_sequence_length)`, `optional`):
Provide for translation and summarization training. By default, the model will create this tensor by Provide for translation and summarization training. By default, the model will create this tensor by
shifting the input_ids right, following the paper. shifting the input_ids right, following the paper.
decoder_attention_mask (:obj:`tf.Tensor` of shape :obj:`(batch_size, tgt_seq_len)`, `optional`): decoder_attention_mask (:obj:`tf.Tensor` of shape :obj:`(batch_size, target_sequence_length)`, `optional`):
will be made by default and ignore pad tokens. It is not recommended to set this for most use cases. will be made by default and ignore pad tokens. It is not recommended to set this for most use cases.
encoder_outputs (:obj:`tf.FloatTensor`, `optional`): encoder_outputs (:obj:`tf.FloatTensor`, `optional`):
hidden states at the output of the last layer of the encoder. Used in the cross-attention of the decoder. hidden states at the output of the last layer of the encoder. Used in the cross-attention of the decoder.

View File

@@ -1601,17 +1601,6 @@ def _expand_mask(
return inverted_mask.masked_fill(inverted_mask.bool(), torch.finfo(dtype).min) return inverted_mask.masked_fill(inverted_mask.bool(), torch.finfo(dtype).min)
def {{cookiecutter.camelcase_modelname}}LayerNorm(normalized_shape: torch.Size, eps: float = 1e-5, elementwise_affine: bool = True):
if torch.cuda.is_available():
try:
from apex.normalization import FusedLayerNorm
return FusedLayerNorm(normalized_shape, eps, elementwise_affine)
except ImportError:
pass
return torch.nn.LayerNorm(normalized_shape, eps, elementwise_affine)
class {{cookiecutter.camelcase_modelname}}LearnedPositionalEmbedding(nn.Embedding): class {{cookiecutter.camelcase_modelname}}LearnedPositionalEmbedding(nn.Embedding):
""" """
This module learns positional embeddings up to a fixed maximum size. This module learns positional embeddings up to a fixed maximum size.
@@ -1619,7 +1608,6 @@ class {{cookiecutter.camelcase_modelname}}LearnedPositionalEmbedding(nn.Embeddin
def __init__(self, num_embeddings: int, embedding_dim: int, padding_idx: int): def __init__(self, num_embeddings: int, embedding_dim: int, padding_idx: int):
assert padding_idx is not None, "`padding_idx` should not be None, but of type int" assert padding_idx is not None, "`padding_idx` should not be None, but of type int"
num_embeddings
super().__init__(num_embeddings, embedding_dim, padding_idx=padding_idx) super().__init__(num_embeddings, embedding_dim, padding_idx=padding_idx)
def forward(self, input_ids_shape: torch.Size, past_key_values_length: int = 0): def forward(self, input_ids_shape: torch.Size, past_key_values_length: int = 0):
@@ -1774,13 +1762,13 @@ class {{cookiecutter.camelcase_modelname}}EncoderLayer(nn.Module):
num_heads=config.encoder_attention_heads, num_heads=config.encoder_attention_heads,
dropout=config.attention_dropout, dropout=config.attention_dropout,
) )
self.self_attn_layer_norm = {{cookiecutter.camelcase_modelname}}LayerNorm(self.embed_dim) self.self_attn_layer_norm = nn.LayerNorm(self.embed_dim)
self.dropout = config.dropout self.dropout = config.dropout
self.activation_fn = ACT2FN[config.activation_function] self.activation_fn = ACT2FN[config.activation_function]
self.activation_dropout = config.activation_dropout self.activation_dropout = config.activation_dropout
self.fc1 = nn.Linear(self.embed_dim, config.encoder_ffn_dim) self.fc1 = nn.Linear(self.embed_dim, config.encoder_ffn_dim)
self.fc2 = nn.Linear(config.encoder_ffn_dim, self.embed_dim) self.fc2 = nn.Linear(config.encoder_ffn_dim, self.embed_dim)
self.final_layer_norm = {{cookiecutter.camelcase_modelname}}LayerNorm(self.embed_dim) self.final_layer_norm = nn.LayerNorm(self.embed_dim)
def forward(self, hidden_states: torch.Tensor, attention_mask: torch.Tensor, output_attentions: bool = False): def forward(self, hidden_states: torch.Tensor, attention_mask: torch.Tensor, output_attentions: bool = False):
""" """
@@ -1788,8 +1776,9 @@ class {{cookiecutter.camelcase_modelname}}EncoderLayer(nn.Module):
hidden_states (:obj:`torch.FloatTensor`): input to the layer of shape `(seq_len, batch, embed_dim)` hidden_states (:obj:`torch.FloatTensor`): input to the layer of shape `(seq_len, batch, embed_dim)`
attention_mask (:obj:`torch.FloatTensor`): attention mask of size attention_mask (:obj:`torch.FloatTensor`): attention mask of size
`(batch, 1, tgt_len, src_len)` where padding elements are indicated by very large negative values. `(batch, 1, tgt_len, src_len)` where padding elements are indicated by very large negative values.
output_attentions (:obj:`bool`): Whether the base model outputs attentions. output_attentions (:obj:`bool`, `optional`):
This requires the attentions tensor to be reshaped in this function. Whether or not to return the attentions tensors of all attention layers. See ``attentions`` under
returned tensors for more detail.
""" """
residual = hidden_states residual = hidden_states
hidden_states, attn_weights, _ = self.self_attn( hidden_states, attn_weights, _ = self.self_attn(
@@ -1834,17 +1823,17 @@ class {{cookiecutter.camelcase_modelname}}DecoderLayer(nn.Module):
self.activation_fn = ACT2FN[config.activation_function] self.activation_fn = ACT2FN[config.activation_function]
self.activation_dropout = config.activation_dropout self.activation_dropout = config.activation_dropout
self.self_attn_layer_norm = {{cookiecutter.camelcase_modelname}}LayerNorm(self.embed_dim) self.self_attn_layer_norm = nn.LayerNorm(self.embed_dim)
self.encoder_attn = {{cookiecutter.camelcase_modelname}}Attention( self.encoder_attn = {{cookiecutter.camelcase_modelname}}Attention(
self.embed_dim, self.embed_dim,
config.decoder_attention_heads, config.decoder_attention_heads,
dropout=config.attention_dropout, dropout=config.attention_dropout,
is_decoder=True, is_decoder=True,
) )
self.encoder_attn_layer_norm = {{cookiecutter.camelcase_modelname}}LayerNorm(self.embed_dim) self.encoder_attn_layer_norm = nn.LayerNorm(self.embed_dim)
self.fc1 = nn.Linear(self.embed_dim, config.decoder_ffn_dim) self.fc1 = nn.Linear(self.embed_dim, config.decoder_ffn_dim)
self.fc2 = nn.Linear(config.decoder_ffn_dim, self.embed_dim) self.fc2 = nn.Linear(config.decoder_ffn_dim, self.embed_dim)
self.final_layer_norm = {{cookiecutter.camelcase_modelname}}LayerNorm(self.embed_dim) self.final_layer_norm = nn.LayerNorm(self.embed_dim)
def forward( def forward(
self, self,
@@ -1865,8 +1854,9 @@ class {{cookiecutter.camelcase_modelname}}DecoderLayer(nn.Module):
encoder_attention_mask (:obj:`torch.FloatTensor`): encoder attention mask of size encoder_attention_mask (:obj:`torch.FloatTensor`): encoder attention mask of size
`(batch, 1, tgt_len, src_len)` where padding elements are indicated by very large negative values. `(batch, 1, tgt_len, src_len)` where padding elements are indicated by very large negative values.
past_key_value (:obj:`Tuple(torch.FloatTensor)`): cached past key and value projection states past_key_value (:obj:`Tuple(torch.FloatTensor)`): cached past key and value projection states
output_attentions (:obj:`bool`): Whether the base model outputs attentions. output_attentions (:obj:`bool`, `optional`):
This requires the attentions tensor to be reshaped in this function. Whether or not to return the attentions tensors of all attention layers. See ``attentions`` under
returned tensors for more detail.
""" """
residual = hidden_states residual = hidden_states
@@ -2031,7 +2021,7 @@ class {{cookiecutter.camelcase_modelname}}PreTrainedModel(PreTrainedModel):
decoder_input_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, target_sequence_length)`, `optional`): decoder_input_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, target_sequence_length)`, `optional`):
Provide for translation and summarization training. By default, the model will create this tensor by Provide for translation and summarization training. By default, the model will create this tensor by
shifting the :obj:`input_ids` to the right, following the paper. shifting the :obj:`input_ids` to the right, following the paper.
decoder_attention_mask (:obj:`torch.LongTensor` of shape :obj:`(batch_size, tgt_seq_len)`, `optional`): decoder_attention_mask (:obj:`torch.LongTensor` of shape :obj:`(batch_size, target_sequence_length)`, `optional`):
Default behavior: generate a tensor that ignores pad tokens in :obj:`decoder_input_ids`. Causal mask will Default behavior: generate a tensor that ignores pad tokens in :obj:`decoder_input_ids`. Causal mask will
also be used by default. also be used by default.
@@ -2094,7 +2084,7 @@ class {{cookiecutter.camelcase_modelname}}Encoder({{cookiecutter.camelcase_model
embed_dim = config.d_model embed_dim = config.d_model
self.padding_idx = config.pad_token_id self.padding_idx = config.pad_token_id
self.max_source_positions = config.max_position_embeddings self.max_source_positions = config.max_position_embeddings
self.embed_scale = math.sqrt(config.d_model) if config.scale_embedding else 1.0 self.embed_scale = math.sqrt(embed_dim) if config.scale_embedding else 1.0
if embed_tokens is not None: if embed_tokens is not None:
self.embed_tokens = embed_tokens self.embed_tokens = embed_tokens
@@ -2107,7 +2097,7 @@ class {{cookiecutter.camelcase_modelname}}Encoder({{cookiecutter.camelcase_model
self.padding_idx, self.padding_idx,
) )
self.layers = nn.ModuleList([{{cookiecutter.camelcase_modelname}}EncoderLayer(config) for _ in range(config.encoder_layers)]) self.layers = nn.ModuleList([{{cookiecutter.camelcase_modelname}}EncoderLayer(config) for _ in range(config.encoder_layers)])
self.layernorm_embedding = {{cookiecutter.camelcase_modelname}}LayerNorm(embed_dim) self.layernorm_embedding = nn.LayerNorm(embed_dim)
self.init_weights() self.init_weights()
@@ -2251,7 +2241,7 @@ class {{cookiecutter.camelcase_modelname}}Decoder({{cookiecutter.camelcase_model
self.padding_idx, self.padding_idx,
) )
self.layers = nn.ModuleList([{{cookiecutter.camelcase_modelname}}DecoderLayer(config) for _ in range(config.decoder_layers)]) self.layers = nn.ModuleList([{{cookiecutter.camelcase_modelname}}DecoderLayer(config) for _ in range(config.decoder_layers)])
self.layernorm_embedding = {{cookiecutter.camelcase_modelname}}LayerNorm(config.d_model) self.layernorm_embedding = nn.LayerNorm(config.d_model)
self.init_weights() self.init_weights()
@@ -2513,7 +2503,7 @@ class {{cookiecutter.camelcase_modelname}}Model({{cookiecutter.camelcase_modelna
output_hidden_states=output_hidden_states, output_hidden_states=output_hidden_states,
return_dict=return_dict, return_dict=return_dict,
) )
# If the user passed a tuple for encoder_outputs, we wrap it in a BaseModelOutput when return_dict=False # If the user passed a tuple for encoder_outputs, we wrap it in a BaseModelOutput when return_dict=True
elif return_dict and not isinstance(encoder_outputs, BaseModelOutput): elif return_dict and not isinstance(encoder_outputs, BaseModelOutput):
encoder_outputs = BaseModelOutput( encoder_outputs = BaseModelOutput(
last_hidden_state=encoder_outputs[0], last_hidden_state=encoder_outputs[0],

View File

@@ -480,8 +480,6 @@ import copy
import tempfile import tempfile
import unittest import unittest
import timeout_decorator # noqa
from transformers import is_torch_available from transformers import is_torch_available
from transformers.file_utils import cached_property from transformers.file_utils import cached_property
from transformers.testing_utils import require_sentencepiece, require_tokenizers, require_torch, slow, torch_device from transformers.testing_utils import require_sentencepiece, require_tokenizers, require_torch, slow, torch_device

File diff suppressed because one or more lines are too long

View File

@@ -1,6 +1,5 @@
#!/usr/bin/env python3
# coding=utf-8 # coding=utf-8
# Copyright 2020 The HuggingFace Team. All rights reserved. # Copyright 2021, The HuggingFace Inc. team. All rights reserved.
# #
# Licensed under the Apache License, Version 2.0 (the "License"); # Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License. # you may not use this file except in compliance with the License.
@@ -13,7 +12,10 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and # See the License for the specific language governing permissions and
# limitations under the License. # limitations under the License.
"""Tests for BlenderBot""" """ Testing suite for the PyTorch Blenderbot model. """
import tempfile
import unittest import unittest
from transformers import is_torch_available from transformers import is_torch_available
@@ -21,113 +23,236 @@ from transformers.file_utils import cached_property
from transformers.testing_utils import require_sentencepiece, require_tokenizers, require_torch, slow, torch_device from transformers.testing_utils import require_sentencepiece, require_tokenizers, require_torch, slow, torch_device
from .test_configuration_common import ConfigTester from .test_configuration_common import ConfigTester
from .test_generation_utils import GenerationTesterMixin
from .test_modeling_common import ModelTesterMixin, ids_tensor from .test_modeling_common import ModelTesterMixin, ids_tensor
if is_torch_available(): if is_torch_available():
import torch import torch
from transformers import ( from transformers import BlenderbotConfig, BlenderbotForConditionalGeneration, BlenderbotModel, BlenderbotTokenizer
AutoModelForSeq2SeqLM, from transformers.models.blenderbot.modeling_blenderbot import BlenderbotDecoder, BlenderbotEncoder
AutoTokenizer,
BlenderbotConfig,
BlenderbotForConditionalGeneration,
BlenderbotModel,
BlenderbotSmallTokenizer,
BlenderbotTokenizer,
)
TOK_DECODE_KW = dict(skip_special_tokens=True, clean_up_tokenization_spaces=True)
FASTER_GEN_KWARGS = dict(num_beams=1, early_stopping=True, min_length=15, max_length=25) def prepare_blenderbot_inputs_dict(
config,
input_ids,
decoder_input_ids,
attention_mask=None,
decoder_attention_mask=None,
):
if attention_mask is None:
attention_mask = input_ids.ne(config.pad_token_id)
if decoder_attention_mask is None:
decoder_attention_mask = decoder_input_ids.ne(config.pad_token_id)
return {
"input_ids": input_ids,
"decoder_input_ids": decoder_input_ids,
"attention_mask": attention_mask,
"decoder_attention_mask": attention_mask,
}
@require_torch @require_torch
class BlenderbotModelTester: class BlenderbotModelTester:
# Required attributes def __init__(
vocab_size = 99 self,
batch_size = 13 parent,
seq_length = 7 batch_size=13,
num_hidden_layers = 2 seq_length=7,
hidden_size = 16 is_training=True,
num_attention_heads = 4 use_labels=False,
is_training = True vocab_size=99,
hidden_size=16,
def __init__(self, parent): num_hidden_layers=2,
torch.manual_seed(0) num_attention_heads=4,
intermediate_size=4,
hidden_act="gelu",
hidden_dropout_prob=0.1,
attention_probs_dropout_prob=0.1,
max_position_embeddings=20,
eos_token_id=2,
pad_token_id=1,
bos_token_id=0,
):
self.parent = parent self.parent = parent
self.config = BlenderbotConfig( self.batch_size = batch_size
d_model=self.hidden_size, self.seq_length = seq_length
dropout=0.0, self.is_training = is_training
activation_function="gelu", self.use_labels = use_labels
self.vocab_size = vocab_size
self.hidden_size = hidden_size
self.num_hidden_layers = num_hidden_layers
self.num_attention_heads = num_attention_heads
self.intermediate_size = intermediate_size
self.hidden_act = hidden_act
self.hidden_dropout_prob = hidden_dropout_prob
self.attention_probs_dropout_prob = attention_probs_dropout_prob
self.max_position_embeddings = max_position_embeddings
self.eos_token_id = eos_token_id
self.pad_token_id = pad_token_id
self.bos_token_id = bos_token_id
def prepare_config_and_inputs(self):
input_ids = ids_tensor([self.batch_size, self.seq_length], self.vocab_size)
input_ids = ids_tensor([self.batch_size, self.seq_length], self.vocab_size).clamp(
3,
)
input_ids[:, -1] = self.eos_token_id # Eos Token
decoder_input_ids = ids_tensor([self.batch_size, self.seq_length], self.vocab_size)
config = BlenderbotConfig(
vocab_size=self.vocab_size, vocab_size=self.vocab_size,
d_model=self.hidden_size,
encoder_layers=self.num_hidden_layers, encoder_layers=self.num_hidden_layers,
decoder_layers=self.num_hidden_layers, decoder_layers=self.num_hidden_layers,
encoder_attention_heads=self.num_attention_heads, encoder_attention_heads=self.num_attention_heads,
decoder_attention_heads=self.num_attention_heads, decoder_attention_heads=self.num_attention_heads,
attention_dropout=0.0, encoder_ffn_dim=self.intermediate_size,
encoder_ffn_dim=4, decoder_ffn_dim=self.intermediate_size,
decoder_ffn_dim=4, dropout=self.hidden_dropout_prob,
do_blenderbot_90_layernorm=False, attention_dropout=self.attention_probs_dropout_prob,
normalize_before=True, max_position_embeddings=self.max_position_embeddings,
max_position_embeddings=50, eos_token_id=self.eos_token_id,
static_position_embeddings=False, bos_token_id=self.bos_token_id,
scale_embedding=True, pad_token_id=self.pad_token_id,
bos_token_id=0,
eos_token_id=2,
pad_token_id=1,
num_beams=1,
min_length=3,
max_length=10,
) )
inputs_dict = prepare_blenderbot_inputs_dict(config, input_ids, decoder_input_ids)
return config, inputs_dict
def prepare_config_and_inputs_for_common(self): def prepare_config_and_inputs_for_common(self):
input_ids = ids_tensor([self.batch_size, self.seq_length], self.vocab_size) config, inputs_dict = self.prepare_config_and_inputs()
attention_mask = ids_tensor([self.batch_size, self.seq_length], vocab_size=2) return config, inputs_dict
inputs_dict = {"input_ids": input_ids, "attention_mask": attention_mask}
return self.config, inputs_dict def create_and_check_decoder_model_past_large_inputs(self, config, inputs_dict):
model = BlenderbotModel(config=config).get_decoder().to(torch_device).eval()
input_ids = inputs_dict["input_ids"]
attention_mask = inputs_dict["attention_mask"]
# first forward pass
outputs = model(input_ids, attention_mask=attention_mask, use_cache=True)
output, past_key_values = outputs.to_tuple()
# create hypothetical multiple next token and extent to next_input_ids
next_tokens = ids_tensor((self.batch_size, 3), config.vocab_size)
next_attn_mask = ids_tensor((self.batch_size, 3), 2)
# append to next input_ids and
next_input_ids = torch.cat([input_ids, next_tokens], dim=-1)
next_attention_mask = torch.cat([attention_mask, next_attn_mask], dim=-1)
output_from_no_past = model(next_input_ids, attention_mask=next_attention_mask)["last_hidden_state"]
output_from_past = model(next_tokens, attention_mask=next_attention_mask, past_key_values=past_key_values)[
"last_hidden_state"
]
# select random slice
random_slice_idx = ids_tensor((1,), output_from_past.shape[-1]).item()
output_from_no_past_slice = output_from_no_past[:, -3:, random_slice_idx].detach()
output_from_past_slice = output_from_past[:, :, random_slice_idx].detach()
self.parent.assertTrue(output_from_past_slice.shape[1] == next_tokens.shape[1])
# test that outputs are equal for slice
self.parent.assertTrue(torch.allclose(output_from_past_slice, output_from_no_past_slice, atol=1e-2))
def check_encoder_decoder_model_standalone(self, config, inputs_dict):
model = BlenderbotModel(config=config).to(torch_device).eval()
outputs = model(**inputs_dict)
encoder_last_hidden_state = outputs.encoder_last_hidden_state
last_hidden_state = outputs.last_hidden_state
with tempfile.TemporaryDirectory() as tmpdirname:
encoder = model.get_encoder()
encoder.save_pretrained(tmpdirname)
encoder = BlenderbotEncoder.from_pretrained(tmpdirname).to(torch_device)
encoder_last_hidden_state_2 = encoder(inputs_dict["input_ids"], attention_mask=inputs_dict["attention_mask"])[
0
]
self.parent.assertTrue((encoder_last_hidden_state_2 - encoder_last_hidden_state).abs().max().item() < 1e-3)
with tempfile.TemporaryDirectory() as tmpdirname:
decoder = model.get_decoder()
decoder.save_pretrained(tmpdirname)
decoder = BlenderbotDecoder.from_pretrained(tmpdirname).to(torch_device)
last_hidden_state_2 = decoder(
input_ids=inputs_dict["decoder_input_ids"],
attention_mask=inputs_dict["decoder_attention_mask"],
encoder_hidden_states=encoder_last_hidden_state,
encoder_attention_mask=inputs_dict["attention_mask"],
)[0]
self.parent.assertTrue((last_hidden_state_2 - last_hidden_state).abs().max().item() < 1e-3)
@require_torch @require_torch
class BlenderbotTesterMixin(ModelTesterMixin, unittest.TestCase): class BlenderbotModelTest(ModelTesterMixin, GenerationTesterMixin, unittest.TestCase):
if is_torch_available(): all_model_classes = (BlenderbotModel, BlenderbotForConditionalGeneration) if is_torch_available() else ()
all_generative_model_classes = (BlenderbotForConditionalGeneration,) all_generative_model_classes = (BlenderbotForConditionalGeneration,) if is_torch_available() else ()
all_model_classes = (BlenderbotForConditionalGeneration, BlenderbotModel)
else:
all_generative_model_classes = ()
all_model_classes = ()
is_encoder_decoder = True is_encoder_decoder = True
test_head_masking = False
test_pruning = False test_pruning = False
test_head_masking = False
test_missing_keys = False test_missing_keys = False
test_torchscript = False
def setUp(self): def setUp(self):
self.model_tester = BlenderbotModelTester(self) self.model_tester = BlenderbotModelTester(self)
self.config_tester = ConfigTester(self, config_class=BlenderbotConfig) self.config_tester = ConfigTester(self, config_class=BlenderbotConfig)
def test_initialization_module(self): def test_config(self):
config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common() self.config_tester.run_common_tests()
model = BlenderbotForConditionalGeneration(config).model
model.to(torch_device)
model.eval()
enc_embeds = model.encoder.embed_tokens.weight
assert (enc_embeds == model.shared.weight).all().item()
self.assertAlmostEqual(torch.std(enc_embeds).item(), config.init_std, 2)
def test_embed_pos_shape(self): def test_save_load_strict(self):
config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common() config, inputs_dict = self.model_tester.prepare_config_and_inputs()
model = BlenderbotForConditionalGeneration(config) for model_class in self.all_model_classes:
expected_shape = (config.max_position_embeddings + config.extra_pos_embeddings, config.d_model) model = model_class(config)
assert model.model.encoder.embed_positions.weight.shape == expected_shape
model.model.decoder.embed_positions.weight.shape == expected_shape
@unittest.skip("This test is flaky") with tempfile.TemporaryDirectory() as tmpdirname:
def test_feed_forward_chunking(self): model.save_pretrained(tmpdirname)
pass model2, info = model_class.from_pretrained(tmpdirname, output_loading_info=True)
self.assertEqual(info["missing_keys"], [])
@unittest.skip("TODO: Decoder embeddings cannot be resized at the moment") def test_decoder_model_past_with_large_inputs(self):
def test_resize_embeddings_untied(self): config_and_inputs = self.model_tester.prepare_config_and_inputs()
pass self.model_tester.create_and_check_decoder_model_past_large_inputs(*config_and_inputs)
def test_encoder_decoder_model_standalone(self):
config_and_inputs = self.model_tester.prepare_config_and_inputs_for_common()
self.model_tester.check_encoder_decoder_model_standalone(*config_and_inputs)
def test_generate_fp16(self):
config, input_dict = self.model_tester.prepare_config_and_inputs()
input_ids = input_dict["input_ids"]
attention_mask = input_ids.ne(1).to(torch_device)
model = BlenderbotForConditionalGeneration(config).eval().to(torch_device)
if torch_device == "cuda":
model.half()
model.generate(input_ids, attention_mask=attention_mask)
model.generate(num_beams=4, do_sample=True, early_stopping=False, num_return_sequences=3)
def assert_tensors_close(a, b, atol=1e-12, prefix=""):
"""If tensors have different shapes, different values or a and b are not both tensors, raise a nice Assertion error."""
if a is None and b is None:
return True
try:
if torch.allclose(a, b, atol=atol):
return True
raise
except Exception:
pct_different = (torch.gt((a - b).abs(), atol)).float().mean().item()
if a.numel() > 100:
msg = f"tensor values are {pct_different:.1%} percent different."
else:
msg = f"{a} != {b}"
if prefix:
msg = prefix + ": " + msg
raise AssertionError(msg)
@unittest.skipUnless(torch_device != "cpu", "3B test too slow on CPU.") @unittest.skipUnless(torch_device != "cpu", "3B test too slow on CPU.")
@@ -143,6 +268,9 @@ class Blenderbot3BIntegrationTests(unittest.TestCase):
@slow @slow
def test_generation_from_short_input_same_as_parlai_3B(self): def test_generation_from_short_input_same_as_parlai_3B(self):
FASTER_GEN_KWARGS = dict(num_beams=1, early_stopping=True, min_length=15, max_length=25)
TOK_DECODE_KW = dict(skip_special_tokens=True, clean_up_tokenization_spaces=True)
torch.cuda.empty_cache() torch.cuda.empty_cache()
model = BlenderbotForConditionalGeneration.from_pretrained(self.ckpt).half().to(torch_device) model = BlenderbotForConditionalGeneration.from_pretrained(self.ckpt).half().to(torch_device)
@@ -164,53 +292,3 @@ class Blenderbot3BIntegrationTests(unittest.TestCase):
assert "I think it's because we are so worried about what people think of us." == reply.strip() assert "I think it's because we are so worried about what people think of us." == reply.strip()
del model del model
@require_torch
class Blenderbot90MIntegrationTests(unittest.TestCase):
ckpt = "facebook/blenderbot-90M"
@cached_property
def model(self):
model = AutoModelForSeq2SeqLM.from_pretrained(self.ckpt).to(torch_device)
if torch_device == "cuda":
model = model.half()
return model
@cached_property
def tokenizer(self):
return AutoTokenizer.from_pretrained(self.ckpt)
@slow
def test_90_generation_from_long_input(self):
src_text = [
"Social anxiety\nWow, I am never shy. Do you have anxiety?\nYes. I end up sweating and blushing and feel like\
i'm going to throw up.\nand why is that?"
]
model_inputs = self.tokenizer(src_text, return_tensors="pt").to(torch_device)
# model does not have "token_type_ids"
model_inputs.pop("token_type_ids")
assert isinstance(self.tokenizer, BlenderbotSmallTokenizer)
generated_ids = self.model.generate(**model_inputs)[0]
reply = self.tokenizer.decode(generated_ids, **TOK_DECODE_KW)
assert reply in (
"i don't know. i just feel like i'm going to throw up. it's not fun.",
"i'm not sure. i just feel like i've been feeling like i have to be in a certain place",
)
def test_90_generation_from_short_input(self):
model_inputs = self.tokenizer(["sam"], return_tensors="pt").to(torch_device)
# model does not have "token_type_ids"
model_inputs.pop("token_type_ids")
generated_utterances = self.model.generate(**model_inputs)
clean_txt = self.tokenizer.decode(generated_utterances[0], **TOK_DECODE_KW)
assert clean_txt in (
"have you ever been to a sam club? it's a great club in the south.",
"have you ever heard of sam harris? he's an american singer, songwriter, and actor.",
)

View File

@@ -0,0 +1,315 @@
# coding=utf-8
# Copyright 2021, The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" Testing suite for the PyTorch BlenderbotSmall model. """
import tempfile
import unittest
from transformers import is_torch_available
from transformers.file_utils import cached_property
from transformers.testing_utils import require_torch, slow, torch_device
from .test_configuration_common import ConfigTester
from .test_generation_utils import GenerationTesterMixin
from .test_modeling_common import ModelTesterMixin, ids_tensor
if is_torch_available():
import torch
from transformers import (
BlenderbotSmallConfig,
BlenderbotSmallForConditionalGeneration,
BlenderbotSmallModel,
BlenderbotSmallTokenizer,
)
from transformers.models.blenderbot_small.modeling_blenderbot_small import (
BlenderbotSmallDecoder,
BlenderbotSmallEncoder,
)
def prepare_blenderbot_small_inputs_dict(
config,
input_ids,
decoder_input_ids,
attention_mask=None,
decoder_attention_mask=None,
):
if attention_mask is None:
attention_mask = input_ids.ne(config.pad_token_id)
if decoder_attention_mask is None:
decoder_attention_mask = decoder_input_ids.ne(config.pad_token_id)
return {
"input_ids": input_ids,
"decoder_input_ids": decoder_input_ids,
"attention_mask": attention_mask,
"decoder_attention_mask": attention_mask,
}
@require_torch
class BlenderbotSmallModelTester:
def __init__(
self,
parent,
batch_size=13,
seq_length=7,
is_training=True,
use_labels=False,
vocab_size=99,
hidden_size=16,
num_hidden_layers=2,
num_attention_heads=4,
intermediate_size=4,
hidden_act="gelu",
hidden_dropout_prob=0.1,
attention_probs_dropout_prob=0.1,
max_position_embeddings=20,
eos_token_id=2,
pad_token_id=1,
bos_token_id=0,
):
self.parent = parent
self.batch_size = batch_size
self.seq_length = seq_length
self.is_training = is_training
self.use_labels = use_labels
self.vocab_size = vocab_size
self.hidden_size = hidden_size
self.num_hidden_layers = num_hidden_layers
self.num_attention_heads = num_attention_heads
self.intermediate_size = intermediate_size
self.hidden_act = hidden_act
self.hidden_dropout_prob = hidden_dropout_prob
self.attention_probs_dropout_prob = attention_probs_dropout_prob
self.max_position_embeddings = max_position_embeddings
self.eos_token_id = eos_token_id
self.pad_token_id = pad_token_id
self.bos_token_id = bos_token_id
def prepare_config_and_inputs(self):
input_ids = ids_tensor([self.batch_size, self.seq_length], self.vocab_size)
input_ids = ids_tensor([self.batch_size, self.seq_length], self.vocab_size).clamp(
3,
)
input_ids[:, -1] = self.eos_token_id # Eos Token
decoder_input_ids = ids_tensor([self.batch_size, self.seq_length], self.vocab_size)
config = BlenderbotSmallConfig(
vocab_size=self.vocab_size,
d_model=self.hidden_size,
encoder_layers=self.num_hidden_layers,
decoder_layers=self.num_hidden_layers,
encoder_attention_heads=self.num_attention_heads,
decoder_attention_heads=self.num_attention_heads,
encoder_ffn_dim=self.intermediate_size,
decoder_ffn_dim=self.intermediate_size,
dropout=self.hidden_dropout_prob,
attention_dropout=self.attention_probs_dropout_prob,
max_position_embeddings=self.max_position_embeddings,
eos_token_id=self.eos_token_id,
bos_token_id=self.bos_token_id,
pad_token_id=self.pad_token_id,
)
inputs_dict = prepare_blenderbot_small_inputs_dict(config, input_ids, decoder_input_ids)
return config, inputs_dict
def prepare_config_and_inputs_for_common(self):
config, inputs_dict = self.prepare_config_and_inputs()
return config, inputs_dict
def create_and_check_decoder_model_past_large_inputs(self, config, inputs_dict):
model = BlenderbotSmallModel(config=config).get_decoder().to(torch_device).eval()
input_ids = inputs_dict["input_ids"]
attention_mask = inputs_dict["attention_mask"]
# first forward pass
outputs = model(input_ids, attention_mask=attention_mask, use_cache=True)
output, past_key_values = outputs.to_tuple()
# create hypothetical multiple next token and extent to next_input_ids
next_tokens = ids_tensor((self.batch_size, 3), config.vocab_size)
next_attn_mask = ids_tensor((self.batch_size, 3), 2)
# append to next input_ids and
next_input_ids = torch.cat([input_ids, next_tokens], dim=-1)
next_attention_mask = torch.cat([attention_mask, next_attn_mask], dim=-1)
output_from_no_past = model(next_input_ids, attention_mask=next_attention_mask)["last_hidden_state"]
output_from_past = model(next_tokens, attention_mask=next_attention_mask, past_key_values=past_key_values)[
"last_hidden_state"
]
# select random slice
random_slice_idx = ids_tensor((1,), output_from_past.shape[-1]).item()
output_from_no_past_slice = output_from_no_past[:, -3:, random_slice_idx].detach()
output_from_past_slice = output_from_past[:, :, random_slice_idx].detach()
self.parent.assertTrue(output_from_past_slice.shape[1] == next_tokens.shape[1])
# test that outputs are equal for slice
self.parent.assertTrue(torch.allclose(output_from_past_slice, output_from_no_past_slice, atol=1e-2))
def check_encoder_decoder_model_standalone(self, config, inputs_dict):
model = BlenderbotSmallModel(config=config).to(torch_device).eval()
outputs = model(**inputs_dict)
encoder_last_hidden_state = outputs.encoder_last_hidden_state
last_hidden_state = outputs.last_hidden_state
with tempfile.TemporaryDirectory() as tmpdirname:
encoder = model.get_encoder()
encoder.save_pretrained(tmpdirname)
encoder = BlenderbotSmallEncoder.from_pretrained(tmpdirname).to(torch_device)
encoder_last_hidden_state_2 = encoder(inputs_dict["input_ids"], attention_mask=inputs_dict["attention_mask"])[
0
]
self.parent.assertTrue((encoder_last_hidden_state_2 - encoder_last_hidden_state).abs().max().item() < 1e-3)
with tempfile.TemporaryDirectory() as tmpdirname:
decoder = model.get_decoder()
decoder.save_pretrained(tmpdirname)
decoder = BlenderbotSmallDecoder.from_pretrained(tmpdirname).to(torch_device)
last_hidden_state_2 = decoder(
input_ids=inputs_dict["decoder_input_ids"],
attention_mask=inputs_dict["decoder_attention_mask"],
encoder_hidden_states=encoder_last_hidden_state,
encoder_attention_mask=inputs_dict["attention_mask"],
)[0]
self.parent.assertTrue((last_hidden_state_2 - last_hidden_state).abs().max().item() < 1e-3)
@require_torch
class BlenderbotSmallModelTest(ModelTesterMixin, GenerationTesterMixin, unittest.TestCase):
all_model_classes = (BlenderbotSmallModel, BlenderbotSmallForConditionalGeneration) if is_torch_available() else ()
all_generative_model_classes = (BlenderbotSmallForConditionalGeneration,) if is_torch_available() else ()
is_encoder_decoder = True
test_pruning = False
test_head_masking = False
test_missing_keys = False
def setUp(self):
self.model_tester = BlenderbotSmallModelTester(self)
self.config_tester = ConfigTester(self, config_class=BlenderbotSmallConfig)
def test_config(self):
self.config_tester.run_common_tests()
def test_save_load_strict(self):
config, inputs_dict = self.model_tester.prepare_config_and_inputs()
for model_class in self.all_model_classes:
model = model_class(config)
with tempfile.TemporaryDirectory() as tmpdirname:
model.save_pretrained(tmpdirname)
model2, info = model_class.from_pretrained(tmpdirname, output_loading_info=True)
self.assertEqual(info["missing_keys"], [])
def test_decoder_model_past_with_large_inputs(self):
config_and_inputs = self.model_tester.prepare_config_and_inputs()
self.model_tester.create_and_check_decoder_model_past_large_inputs(*config_and_inputs)
def test_encoder_decoder_model_standalone(self):
config_and_inputs = self.model_tester.prepare_config_and_inputs_for_common()
self.model_tester.check_encoder_decoder_model_standalone(*config_and_inputs)
def test_generate_fp16(self):
config, input_dict = self.model_tester.prepare_config_and_inputs()
input_ids = input_dict["input_ids"]
attention_mask = input_ids.ne(1).to(torch_device)
model = BlenderbotSmallForConditionalGeneration(config).eval().to(torch_device)
if torch_device == "cuda":
model.half()
model.generate(input_ids, attention_mask=attention_mask)
model.generate(num_beams=4, do_sample=True, early_stopping=False, num_return_sequences=3)
def assert_tensors_close(a, b, atol=1e-12, prefix=""):
"""If tensors have different shapes, different values or a and b are not both tensors, raise a nice Assertion error."""
if a is None and b is None:
return True
try:
if torch.allclose(a, b, atol=atol):
return True
raise
except Exception:
pct_different = (torch.gt((a - b).abs(), atol)).float().mean().item()
if a.numel() > 100:
msg = f"tensor values are {pct_different:.1%} percent different."
else:
msg = f"{a} != {b}"
if prefix:
msg = prefix + ": " + msg
raise AssertionError(msg)
@require_torch
class Blenderbot90MIntegrationTests(unittest.TestCase):
ckpt = "facebook/blenderbot-90M"
@cached_property
def model(self):
model = BlenderbotSmallForConditionalGeneration.from_pretrained(self.ckpt).to(torch_device)
if torch_device == "cuda":
model = model.half()
return model
@cached_property
def tokenizer(self):
return BlenderbotSmallTokenizer.from_pretrained(self.ckpt)
@slow
def test_90_generation_from_long_input(self):
src_text = [
"Social anxiety\nWow, I am never shy. Do you have anxiety?\nYes. I end up sweating and blushing and feel like\
i'm going to throw up.\nand why is that?"
]
model_inputs = self.tokenizer(src_text, return_tensors="pt").to(torch_device)
# model does not have "token_type_ids"
model_inputs.pop("token_type_ids")
assert isinstance(self.tokenizer, BlenderbotSmallTokenizer)
generated_ids = self.model.generate(**model_inputs)[0]
reply = self.tokenizer.decode(generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True)
assert reply in (
"i don't know. i just feel like i'm going to throw up. it's not fun.",
"i'm not sure. i just feel like i've been feeling like i have to be in a certain place",
)
def test_90_generation_from_short_input(self):
model_inputs = self.tokenizer(["sam"], return_tensors="pt").to(torch_device)
# model does not have "token_type_ids"
model_inputs.pop("token_type_ids")
generated_utterances = self.model.generate(**model_inputs)
clean_txt = self.tokenizer.decode(
generated_utterances[0], skip_special_tokens=True, clean_up_tokenization_spaces=True
)
assert clean_txt in (
"have you ever been to a sam club? it's a great club in the south.",
"have you ever heard of sam harris? he's an american singer, songwriter, and actor.",
)

View File

@@ -1,5 +1,5 @@
# coding=utf-8 # coding=utf-8
# Copyright 2020 HuggingFace Inc. team. # Copyright 2021, The HuggingFace Inc. team. All rights reserved.
# #
# Licensed under the Apache License, Version 2.0 (the "License"); # Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License. # you may not use this file except in compliance with the License.
@@ -12,58 +12,268 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and # See the License for the specific language governing permissions and
# limitations under the License. # limitations under the License.
""" Testing suite for the PyTorch Marian model. """
import tempfile
import unittest import unittest
from transformers import AutoConfig, AutoTokenizer, MarianConfig, MarianTokenizer, is_torch_available from transformers import is_torch_available
from transformers.file_utils import cached_property from transformers.file_utils import cached_property
from transformers.hf_api import HfApi from transformers.hf_api import HfApi
from transformers.testing_utils import require_sentencepiece, require_tokenizers, require_torch, slow, torch_device from transformers.testing_utils import require_sentencepiece, require_tokenizers, require_torch, slow, torch_device
from .test_modeling_common import ModelTesterMixin from .test_configuration_common import ConfigTester
from .test_generation_utils import GenerationTesterMixin
from .test_modeling_common import ModelTesterMixin, ids_tensor
if is_torch_available(): if is_torch_available():
import torch import torch
from transformers import AutoModelWithLMHead, MarianMTModel from transformers import (
from transformers.models.bart.modeling_bart import shift_tokens_right AutoConfig,
AutoModelWithLMHead,
AutoTokenizer,
MarianConfig,
MarianModel,
MarianMTModel,
TranslationPipeline,
)
from transformers.models.marian.convert_marian_to_pytorch import ( from transformers.models.marian.convert_marian_to_pytorch import (
ORG_NAME, ORG_NAME,
convert_hf_name_to_opus_name, convert_hf_name_to_opus_name,
convert_opus_name_to_hf_name, convert_opus_name_to_hf_name,
) )
from transformers.pipelines import TranslationPipeline from transformers.models.marian.modeling_marian import MarianDecoder, MarianEncoder, shift_tokens_right
class ModelTester: def prepare_marian_inputs_dict(
def __init__(self, parent): config,
self.config = MarianConfig( input_ids,
vocab_size=99, decoder_input_ids,
d_model=24, attention_mask=None,
encoder_layers=2, decoder_attention_mask=None,
decoder_layers=2, ):
encoder_attention_heads=2, if attention_mask is None:
decoder_attention_heads=2, attention_mask = input_ids.ne(config.pad_token_id)
encoder_ffn_dim=32, if decoder_attention_mask is None:
decoder_ffn_dim=32, decoder_attention_mask = decoder_input_ids.ne(config.pad_token_id)
max_position_embeddings=48, return {
add_final_layer_norm=True, "input_ids": input_ids,
) "decoder_input_ids": decoder_input_ids,
"attention_mask": attention_mask,
def prepare_config_and_inputs_for_common(self): "decoder_attention_mask": attention_mask,
return self.config, {} }
@require_torch @require_torch
class SelectiveCommonTest(unittest.TestCase): class MarianModelTester:
all_model_classes = (MarianMTModel,) if is_torch_available() else () def __init__(
self,
parent,
batch_size=13,
seq_length=7,
is_training=True,
use_labels=False,
vocab_size=99,
hidden_size=16,
num_hidden_layers=2,
num_attention_heads=4,
intermediate_size=4,
hidden_act="gelu",
hidden_dropout_prob=0.1,
attention_probs_dropout_prob=0.1,
max_position_embeddings=20,
eos_token_id=2,
pad_token_id=1,
bos_token_id=0,
decoder_start_token_id=3,
):
self.parent = parent
self.batch_size = batch_size
self.seq_length = seq_length
self.is_training = is_training
self.use_labels = use_labels
self.vocab_size = vocab_size
self.hidden_size = hidden_size
self.num_hidden_layers = num_hidden_layers
self.num_attention_heads = num_attention_heads
self.intermediate_size = intermediate_size
self.hidden_act = hidden_act
self.hidden_dropout_prob = hidden_dropout_prob
self.attention_probs_dropout_prob = attention_probs_dropout_prob
self.max_position_embeddings = max_position_embeddings
self.eos_token_id = eos_token_id
self.pad_token_id = pad_token_id
self.bos_token_id = bos_token_id
self.decoder_start_token_id = decoder_start_token_id
test_save_load__keys_to_ignore_on_save = ModelTesterMixin.test_save_load__keys_to_ignore_on_save def prepare_config_and_inputs(self):
input_ids = ids_tensor([self.batch_size, self.seq_length], self.vocab_size)
input_ids = ids_tensor([self.batch_size, self.seq_length], self.vocab_size).clamp(
3,
)
input_ids[:, -1] = self.eos_token_id # Eos Token
decoder_input_ids = ids_tensor([self.batch_size, self.seq_length], self.vocab_size)
config = MarianConfig(
vocab_size=self.vocab_size,
d_model=self.hidden_size,
encoder_layers=self.num_hidden_layers,
decoder_layers=self.num_hidden_layers,
encoder_attention_heads=self.num_attention_heads,
decoder_attention_heads=self.num_attention_heads,
encoder_ffn_dim=self.intermediate_size,
decoder_ffn_dim=self.intermediate_size,
dropout=self.hidden_dropout_prob,
attention_dropout=self.attention_probs_dropout_prob,
max_position_embeddings=self.max_position_embeddings,
eos_token_id=self.eos_token_id,
bos_token_id=self.bos_token_id,
pad_token_id=self.pad_token_id,
decoder_start_token_id=self.decoder_start_token_id,
)
inputs_dict = prepare_marian_inputs_dict(config, input_ids, decoder_input_ids)
return config, inputs_dict
def prepare_config_and_inputs_for_common(self):
config, inputs_dict = self.prepare_config_and_inputs()
return config, inputs_dict
def create_and_check_decoder_model_past_large_inputs(self, config, inputs_dict):
model = MarianModel(config=config).get_decoder().to(torch_device).eval()
input_ids = inputs_dict["input_ids"]
attention_mask = inputs_dict["attention_mask"]
# first forward pass
outputs = model(input_ids, attention_mask=attention_mask, use_cache=True)
output, past_key_values = outputs.to_tuple()
# create hypothetical multiple next token and extent to next_input_ids
next_tokens = ids_tensor((self.batch_size, 3), config.vocab_size)
next_attn_mask = ids_tensor((self.batch_size, 3), 2)
# append to next input_ids and
next_input_ids = torch.cat([input_ids, next_tokens], dim=-1)
next_attention_mask = torch.cat([attention_mask, next_attn_mask], dim=-1)
output_from_no_past = model(next_input_ids, attention_mask=next_attention_mask)["last_hidden_state"]
output_from_past = model(next_tokens, attention_mask=next_attention_mask, past_key_values=past_key_values)[
"last_hidden_state"
]
# select random slice
random_slice_idx = ids_tensor((1,), output_from_past.shape[-1]).item()
output_from_no_past_slice = output_from_no_past[:, -3:, random_slice_idx].detach()
output_from_past_slice = output_from_past[:, :, random_slice_idx].detach()
self.parent.assertTrue(output_from_past_slice.shape[1] == next_tokens.shape[1])
# test that outputs are equal for slice
self.parent.assertTrue(torch.allclose(output_from_past_slice, output_from_no_past_slice, atol=1e-2))
def check_encoder_decoder_model_standalone(self, config, inputs_dict):
model = MarianModel(config=config).to(torch_device).eval()
outputs = model(**inputs_dict)
encoder_last_hidden_state = outputs.encoder_last_hidden_state
last_hidden_state = outputs.last_hidden_state
with tempfile.TemporaryDirectory() as tmpdirname:
encoder = model.get_encoder()
encoder.save_pretrained(tmpdirname)
encoder = MarianEncoder.from_pretrained(tmpdirname).to(torch_device)
encoder_last_hidden_state_2 = encoder(inputs_dict["input_ids"], attention_mask=inputs_dict["attention_mask"])[
0
]
self.parent.assertTrue((encoder_last_hidden_state_2 - encoder_last_hidden_state).abs().max().item() < 1e-3)
with tempfile.TemporaryDirectory() as tmpdirname:
decoder = model.get_decoder()
decoder.save_pretrained(tmpdirname)
decoder = MarianDecoder.from_pretrained(tmpdirname).to(torch_device)
last_hidden_state_2 = decoder(
input_ids=inputs_dict["decoder_input_ids"],
attention_mask=inputs_dict["decoder_attention_mask"],
encoder_hidden_states=encoder_last_hidden_state,
encoder_attention_mask=inputs_dict["attention_mask"],
)[0]
self.parent.assertTrue((last_hidden_state_2 - last_hidden_state).abs().max().item() < 1e-3)
@require_torch
class MarianModelTest(ModelTesterMixin, GenerationTesterMixin, unittest.TestCase):
all_model_classes = (MarianModel, MarianMTModel) if is_torch_available() else ()
all_generative_model_classes = (MarianMTModel,) if is_torch_available() else ()
is_encoder_decoder = True
test_pruning = False
test_head_masking = False
test_missing_keys = False
def setUp(self): def setUp(self):
self.model_tester = ModelTester(self) self.model_tester = MarianModelTester(self)
self.config_tester = ConfigTester(self, config_class=MarianConfig)
def test_config(self):
self.config_tester.run_common_tests()
def test_save_load_strict(self):
config, inputs_dict = self.model_tester.prepare_config_and_inputs()
for model_class in self.all_model_classes:
model = model_class(config)
with tempfile.TemporaryDirectory() as tmpdirname:
model.save_pretrained(tmpdirname)
model2, info = model_class.from_pretrained(tmpdirname, output_loading_info=True)
self.assertEqual(info["missing_keys"], [])
def test_decoder_model_past_with_large_inputs(self):
config_and_inputs = self.model_tester.prepare_config_and_inputs()
self.model_tester.create_and_check_decoder_model_past_large_inputs(*config_and_inputs)
def test_encoder_decoder_model_standalone(self):
config_and_inputs = self.model_tester.prepare_config_and_inputs_for_common()
self.model_tester.check_encoder_decoder_model_standalone(*config_and_inputs)
def test_generate_fp16(self):
config, input_dict = self.model_tester.prepare_config_and_inputs()
input_ids = input_dict["input_ids"]
attention_mask = input_ids.ne(1).to(torch_device)
model = MarianMTModel(config).eval().to(torch_device)
if torch_device == "cuda":
model.half()
model.generate(input_ids, attention_mask=attention_mask)
model.generate(num_beams=4, do_sample=True, early_stopping=False, num_return_sequences=3)
def assert_tensors_close(a, b, atol=1e-12, prefix=""):
"""If tensors have different shapes, different values or a and b are not both tensors, raise a nice Assertion error."""
if a is None and b is None:
return True
try:
if torch.allclose(a, b, atol=atol):
return True
raise
except Exception:
pct_different = (torch.gt((a - b).abs(), atol)).float().mean().item()
if a.numel() > 100:
msg = f"tensor values are {pct_different:.1%} percent different."
else:
msg = f"{a} != {b}"
if prefix:
msg = prefix + ": " + msg
raise AssertionError(msg)
def _long_tensor(tok_lst):
return torch.tensor(tok_lst, dtype=torch.long, device=torch_device)
class ModelManagementTests(unittest.TestCase): class ModelManagementTests(unittest.TestCase):
@@ -107,7 +317,7 @@ class MarianIntegrationTest(unittest.TestCase):
return cls return cls
@cached_property @cached_property
def tokenizer(self) -> MarianTokenizer: def tokenizer(self):
return AutoTokenizer.from_pretrained(self.model_name) return AutoTokenizer.from_pretrained(self.model_name)
@property @property
@@ -163,7 +373,9 @@ class TestMarian_EN_DE_More(MarianIntegrationTest):
"labels", "labels",
} }
self.assertSetEqual(desired_keys, set(model_inputs.keys())) self.assertSetEqual(desired_keys, set(model_inputs.keys()))
model_inputs["decoder_input_ids"] = shift_tokens_right(model_inputs.labels, self.tokenizer.pad_token_id) model_inputs["decoder_input_ids"] = shift_tokens_right(
model_inputs.labels, self.tokenizer.pad_token_id, self.model.config.decoder_start_token_id
)
model_inputs["return_dict"] = True model_inputs["return_dict"] = True
model_inputs["use_cache"] = False model_inputs["use_cache"] = False
with torch.no_grad(): with torch.no_grad():

View File

@@ -1,4 +1,5 @@
# Copyright 2020 The HuggingFace Team. All rights reserved. # coding=utf-8
# Copyright 2021, The HuggingFace Inc. team. All rights reserved.
# #
# Licensed under the Apache License, Version 2.0 (the "License"); # Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License. # you may not use this file except in compliance with the License.
@@ -11,61 +12,294 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and # See the License for the specific language governing permissions and
# limitations under the License. # limitations under the License.
""" Testing suite for the PyTorch MBART model. """
import copy
import tempfile
import unittest import unittest
from transformers import is_torch_available from transformers import is_torch_available
from transformers.file_utils import cached_property from transformers.file_utils import cached_property
from transformers.testing_utils import require_sentencepiece, require_tokenizers, require_torch, slow, torch_device from transformers.testing_utils import require_sentencepiece, require_tokenizers, require_torch, slow, torch_device
from .test_modeling_common import ModelTesterMixin from .test_configuration_common import ConfigTester
from .test_generation_utils import GenerationTesterMixin
from .test_modeling_common import ModelTesterMixin, ids_tensor
if is_torch_available(): if is_torch_available():
import torch import torch
from transformers import ( from transformers import (
AutoModelForSeq2SeqLM,
AutoTokenizer, AutoTokenizer,
BatchEncoding, BatchEncoding,
MBartConfig, MBartConfig,
MBartForConditionalGeneration, MBartForConditionalGeneration,
MBartForQuestionAnswering,
MBartForSequenceClassification,
MBartModel, MBartModel,
) )
from transformers.models.mbart.modeling_mbart import MBartDecoder, MBartEncoder
EN_CODE = 250004 def prepare_mbart_inputs_dict(
RO_CODE = 250020 config,
input_ids,
decoder_input_ids,
attention_mask=None,
decoder_attention_mask=None,
):
if attention_mask is None:
attention_mask = input_ids.ne(config.pad_token_id)
if decoder_attention_mask is None:
decoder_attention_mask = decoder_input_ids.ne(config.pad_token_id)
return {
"input_ids": input_ids,
"decoder_input_ids": decoder_input_ids,
"attention_mask": attention_mask,
"decoder_attention_mask": attention_mask,
}
@require_torch @require_torch
class ModelTester: class MBartModelTester:
def __init__(self, parent): def __init__(
self.config = MBartConfig( self,
parent,
batch_size=13,
seq_length=7,
is_training=True,
use_labels=False,
vocab_size=99, vocab_size=99,
d_model=24, hidden_size=16,
encoder_layers=2, num_hidden_layers=2,
decoder_layers=2, num_attention_heads=4,
encoder_attention_heads=2, intermediate_size=4,
decoder_attention_heads=2, hidden_act="gelu",
encoder_ffn_dim=32, hidden_dropout_prob=0.1,
decoder_ffn_dim=32, attention_probs_dropout_prob=0.1,
max_position_embeddings=48, max_position_embeddings=20,
add_final_layer_norm=True, eos_token_id=2,
pad_token_id=1,
bos_token_id=0,
):
self.parent = parent
self.batch_size = batch_size
self.seq_length = seq_length
self.is_training = is_training
self.use_labels = use_labels
self.vocab_size = vocab_size
self.hidden_size = hidden_size
self.num_hidden_layers = num_hidden_layers
self.num_attention_heads = num_attention_heads
self.intermediate_size = intermediate_size
self.hidden_act = hidden_act
self.hidden_dropout_prob = hidden_dropout_prob
self.attention_probs_dropout_prob = attention_probs_dropout_prob
self.max_position_embeddings = max_position_embeddings
self.eos_token_id = eos_token_id
self.pad_token_id = pad_token_id
self.bos_token_id = bos_token_id
def prepare_config_and_inputs(self):
input_ids = ids_tensor([self.batch_size, self.seq_length], self.vocab_size)
input_ids = ids_tensor([self.batch_size, self.seq_length], self.vocab_size).clamp(
3,
) )
input_ids[:, -1] = self.eos_token_id # Eos Token
decoder_input_ids = ids_tensor([self.batch_size, self.seq_length], self.vocab_size)
config = MBartConfig(
vocab_size=self.vocab_size,
d_model=self.hidden_size,
encoder_layers=self.num_hidden_layers,
decoder_layers=self.num_hidden_layers,
encoder_attention_heads=self.num_attention_heads,
decoder_attention_heads=self.num_attention_heads,
encoder_ffn_dim=self.intermediate_size,
decoder_ffn_dim=self.intermediate_size,
dropout=self.hidden_dropout_prob,
attention_dropout=self.attention_probs_dropout_prob,
max_position_embeddings=self.max_position_embeddings,
eos_token_id=self.eos_token_id,
bos_token_id=self.bos_token_id,
pad_token_id=self.pad_token_id,
)
inputs_dict = prepare_mbart_inputs_dict(config, input_ids, decoder_input_ids)
return config, inputs_dict
def prepare_config_and_inputs_for_common(self): def prepare_config_and_inputs_for_common(self):
return self.config, {} config, inputs_dict = self.prepare_config_and_inputs()
return config, inputs_dict
def create_and_check_decoder_model_past_large_inputs(self, config, inputs_dict):
model = MBartModel(config=config).get_decoder().to(torch_device).eval()
input_ids = inputs_dict["input_ids"]
attention_mask = inputs_dict["attention_mask"]
# first forward pass
outputs = model(input_ids, attention_mask=attention_mask, use_cache=True)
output, past_key_values = outputs.to_tuple()
# create hypothetical multiple next token and extent to next_input_ids
next_tokens = ids_tensor((self.batch_size, 3), config.vocab_size)
next_attn_mask = ids_tensor((self.batch_size, 3), 2)
# append to next input_ids and
next_input_ids = torch.cat([input_ids, next_tokens], dim=-1)
next_attention_mask = torch.cat([attention_mask, next_attn_mask], dim=-1)
output_from_no_past = model(next_input_ids, attention_mask=next_attention_mask)["last_hidden_state"]
output_from_past = model(next_tokens, attention_mask=next_attention_mask, past_key_values=past_key_values)[
"last_hidden_state"
]
# select random slice
random_slice_idx = ids_tensor((1,), output_from_past.shape[-1]).item()
output_from_no_past_slice = output_from_no_past[:, -3:, random_slice_idx].detach()
output_from_past_slice = output_from_past[:, :, random_slice_idx].detach()
self.parent.assertTrue(output_from_past_slice.shape[1] == next_tokens.shape[1])
# test that outputs are equal for slice
self.parent.assertTrue(torch.allclose(output_from_past_slice, output_from_no_past_slice, atol=1e-2))
def check_encoder_decoder_model_standalone(self, config, inputs_dict):
model = MBartModel(config=config).to(torch_device).eval()
outputs = model(**inputs_dict)
encoder_last_hidden_state = outputs.encoder_last_hidden_state
last_hidden_state = outputs.last_hidden_state
with tempfile.TemporaryDirectory() as tmpdirname:
encoder = model.get_encoder()
encoder.save_pretrained(tmpdirname)
encoder = MBartEncoder.from_pretrained(tmpdirname).to(torch_device)
encoder_last_hidden_state_2 = encoder(inputs_dict["input_ids"], attention_mask=inputs_dict["attention_mask"])[
0
]
self.parent.assertTrue((encoder_last_hidden_state_2 - encoder_last_hidden_state).abs().max().item() < 1e-3)
with tempfile.TemporaryDirectory() as tmpdirname:
decoder = model.get_decoder()
decoder.save_pretrained(tmpdirname)
decoder = MBartDecoder.from_pretrained(tmpdirname).to(torch_device)
last_hidden_state_2 = decoder(
input_ids=inputs_dict["decoder_input_ids"],
attention_mask=inputs_dict["decoder_attention_mask"],
encoder_hidden_states=encoder_last_hidden_state,
encoder_attention_mask=inputs_dict["attention_mask"],
)[0]
self.parent.assertTrue((last_hidden_state_2 - last_hidden_state).abs().max().item() < 1e-3)
@require_torch @require_torch
class SelectiveCommonTest(unittest.TestCase): class MBartModelTest(ModelTesterMixin, GenerationTesterMixin, unittest.TestCase):
all_model_classes = (MBartForConditionalGeneration, MBartModel) if is_torch_available() else () all_model_classes = (
(MBartModel, MBartForConditionalGeneration, MBartForSequenceClassification, MBartForQuestionAnswering)
test_save_load__keys_to_ignore_on_save = ModelTesterMixin.test_save_load__keys_to_ignore_on_save if is_torch_available()
else ()
)
all_generative_model_classes = (MBartForConditionalGeneration,) if is_torch_available() else ()
is_encoder_decoder = True
test_pruning = False
test_head_masking = False
test_missing_keys = False
def setUp(self): def setUp(self):
self.model_tester = ModelTester(self) self.model_tester = MBartModelTester(self)
self.config_tester = ConfigTester(self, config_class=MBartConfig)
def test_config(self):
self.config_tester.run_common_tests()
def test_save_load_strict(self):
config, inputs_dict = self.model_tester.prepare_config_and_inputs()
for model_class in self.all_model_classes:
model = model_class(config)
with tempfile.TemporaryDirectory() as tmpdirname:
model.save_pretrained(tmpdirname)
model2, info = model_class.from_pretrained(tmpdirname, output_loading_info=True)
self.assertEqual(info["missing_keys"], [])
def test_decoder_model_past_with_large_inputs(self):
config_and_inputs = self.model_tester.prepare_config_and_inputs()
self.model_tester.create_and_check_decoder_model_past_large_inputs(*config_and_inputs)
def test_encoder_decoder_model_standalone(self):
config_and_inputs = self.model_tester.prepare_config_and_inputs_for_common()
self.model_tester.check_encoder_decoder_model_standalone(*config_and_inputs)
# MBartForSequenceClassification does not support inputs_embeds
def test_inputs_embeds(self):
config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()
for model_class in (MBartModel, MBartForConditionalGeneration, MBartForQuestionAnswering):
model = model_class(config)
model.to(torch_device)
model.eval()
inputs = copy.deepcopy(self._prepare_for_class(inputs_dict, model_class))
if not self.is_encoder_decoder:
input_ids = inputs["input_ids"]
del inputs["input_ids"]
else:
encoder_input_ids = inputs["input_ids"]
decoder_input_ids = inputs.get("decoder_input_ids", encoder_input_ids)
del inputs["input_ids"]
inputs.pop("decoder_input_ids", None)
wte = model.get_input_embeddings()
if not self.is_encoder_decoder:
inputs["inputs_embeds"] = wte(input_ids)
else:
inputs["inputs_embeds"] = wte(encoder_input_ids)
inputs["decoder_inputs_embeds"] = wte(decoder_input_ids)
with torch.no_grad():
model(**inputs)[0]
def test_generate_fp16(self):
config, input_dict = self.model_tester.prepare_config_and_inputs()
input_ids = input_dict["input_ids"]
attention_mask = input_ids.ne(1).to(torch_device)
model = MBartForConditionalGeneration(config).eval().to(torch_device)
if torch_device == "cuda":
model.half()
model.generate(input_ids, attention_mask=attention_mask)
model.generate(num_beams=4, do_sample=True, early_stopping=False, num_return_sequences=3)
def assert_tensors_close(a, b, atol=1e-12, prefix=""):
"""If tensors have different shapes, different values or a and b are not both tensors, raise a nice Assertion error."""
if a is None and b is None:
return True
try:
if torch.allclose(a, b, atol=atol):
return True
raise
except Exception:
pct_different = (torch.gt((a - b).abs(), atol)).float().mean().item()
if a.numel() > 100:
msg = f"tensor values are {pct_different:.1%} percent different."
else:
msg = f"{a} != {b}"
if prefix:
msg = prefix + ": " + msg
raise AssertionError(msg)
def _long_tensor(tok_lst):
return torch.tensor(tok_lst, dtype=torch.long, device=torch_device)
@require_torch @require_torch
@@ -83,7 +317,7 @@ class AbstractSeq2SeqIntegrationTest(unittest.TestCase):
@cached_property @cached_property
def model(self): def model(self):
"""Only load the model if needed.""" """Only load the model if needed."""
model = AutoModelForSeq2SeqLM.from_pretrained(self.checkpoint_name).to(torch_device) model = MBartForConditionalGeneration.from_pretrained(self.checkpoint_name).to(torch_device)
if "cuda" in torch_device: if "cuda" in torch_device:
model = model.half() model = model.half()
return model return model
@@ -102,7 +336,7 @@ class MBartEnroIntegrationTest(AbstractSeq2SeqIntegrationTest):
"Şeful ONU declară că nu există o soluţie militară în Siria", "Şeful ONU declară că nu există o soluţie militară în Siria",
'Secretarul General Ban Ki-moon declară că răspunsul său la intensificarea sprijinului militar al Rusiei pentru Siria este că "nu există o soluţie militară" la conflictul de aproape cinci ani şi că noi arme nu vor face decât să înrăutăţească violenţa şi mizeria pentru milioane de oameni.', 'Secretarul General Ban Ki-moon declară că răspunsul său la intensificarea sprijinului militar al Rusiei pentru Siria este că "nu există o soluţie militară" la conflictul de aproape cinci ani şi că noi arme nu vor face decât să înrăutăţească violenţa şi mizeria pentru milioane de oameni.',
] ]
expected_src_tokens = [8274, 127873, 25916, 7, 8622, 2071, 438, 67485, 53, 187895, 23, 51712, 2, EN_CODE] expected_src_tokens = [8274, 127873, 25916, 7, 8622, 2071, 438, 67485, 53, 187895, 23, 51712, 2, 250004]
@slow @slow
def test_enro_generate_one(self): def test_enro_generate_one(self):
@@ -128,7 +362,6 @@ class MBartEnroIntegrationTest(AbstractSeq2SeqIntegrationTest):
expected = {"scale_embedding": True, "output_past": True} expected = {"scale_embedding": True, "output_past": True}
for name in mbart_models: for name in mbart_models:
config = MBartConfig.from_pretrained(name) config = MBartConfig.from_pretrained(name)
self.assertTrue(config.is_valid_mbart())
for k, v in expected.items(): for k, v in expected.items():
try: try:
self.assertEqual(v, getattr(config, k)) self.assertEqual(v, getattr(config, k))

View File

@@ -1,4 +1,5 @@
# Copyright 2020 The HuggingFace Team. All rights reserved. # coding=utf-8
# Copyright 2021, The HuggingFace Inc. team. All rights reserved.
# #
# Licensed under the Apache License, Version 2.0 (the "License"); # Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License. # you may not use this file except in compliance with the License.
@@ -11,56 +12,252 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and # See the License for the specific language governing permissions and
# limitations under the License. # limitations under the License.
""" Testing suite for the PyTorch PEGASUS model. """
import tempfile
import unittest import unittest
from transformers import AutoConfig, AutoTokenizer, is_torch_available from transformers import is_torch_available
from transformers.file_utils import cached_property from transformers.file_utils import cached_property
from transformers.models.pegasus.configuration_pegasus import task_specific_params
from transformers.testing_utils import require_sentencepiece, require_tokenizers, require_torch, slow, torch_device from transformers.testing_utils import require_sentencepiece, require_tokenizers, require_torch, slow, torch_device
from transformers.utils.logging import ERROR, set_verbosity
from .test_modeling_bart import PGE_ARTICLE from .test_configuration_common import ConfigTester
from .test_modeling_common import ModelTesterMixin from .test_generation_utils import GenerationTesterMixin
from .test_modeling_common import ModelTesterMixin, ids_tensor
from .test_modeling_mbart import AbstractSeq2SeqIntegrationTest from .test_modeling_mbart import AbstractSeq2SeqIntegrationTest
if is_torch_available(): if is_torch_available():
import torch
from transformers import AutoModelForSeq2SeqLM, PegasusConfig, PegasusForConditionalGeneration, PegasusModel from transformers import AutoModelForSeq2SeqLM, PegasusConfig, PegasusForConditionalGeneration, PegasusModel
from transformers.models.pegasus.modeling_pegasus import PegasusDecoder, PegasusEncoder
XSUM_ENTRY_LONGER = """ The London trio are up for best UK act and best album, as well as getting two nominations in the best song category."We got told like this morning 'Oh I think you're nominated'", said Dappy."And I was like 'Oh yeah, which one?' And now we've got nominated for four awards. I mean, wow!"Bandmate Fazer added: "We thought it's best of us to come down and mingle with everyone and say hello to the cameras. And now we find we've got four nominations."The band have two shots at the best song prize, getting the nod for their Tynchy Stryder collaboration Number One, and single Strong Again.Their album Uncle B will also go up against records by the likes of Beyonce and Kanye West.N-Dubz picked up the best newcomer Mobo in 2007, but female member Tulisa said they wouldn't be too disappointed if they didn't win this time around."At the end of the day we're grateful to be where we are in our careers."If it don't happen then it don't happen - live to fight another day and keep on making albums and hits for the fans."Dappy also revealed they could be performing live several times on the night.The group will be doing Number One and also a possible rendition of the War Child single, I Got Soul.The charity song is a re-working of The Killers' All These Things That I've Done and is set to feature artists like Chipmunk, Ironik and Pixie Lott.This year's Mobos will be held outside of London for the first time, in Glasgow on 30 September.N-Dubz said they were looking forward to performing for their Scottish fans and boasted about their recent shows north of the border."We just done Edinburgh the other day," said Dappy."We smashed up an N-Dubz show over there. We done Aberdeen about three or four months ago - we smashed up that show over there! Everywhere we go we smash it up!" """
set_verbosity(ERROR) def prepare_pegasus_inputs_dict(
config,
input_ids,
decoder_input_ids,
attention_mask=None,
decoder_attention_mask=None,
):
if attention_mask is None:
attention_mask = input_ids.ne(config.pad_token_id)
if decoder_attention_mask is None:
decoder_attention_mask = decoder_input_ids.ne(config.pad_token_id)
return {
"input_ids": input_ids,
"decoder_input_ids": decoder_input_ids,
"attention_mask": attention_mask,
"decoder_attention_mask": attention_mask,
}
@require_torch @require_torch
class ModelTester: class PegasusModelTester:
def __init__(self, parent): def __init__(
self.config = PegasusConfig( self,
parent,
batch_size=13,
seq_length=7,
is_training=True,
use_labels=False,
vocab_size=99, vocab_size=99,
d_model=24, hidden_size=16,
encoder_layers=2, num_hidden_layers=2,
decoder_layers=2, num_attention_heads=4,
encoder_attention_heads=2, intermediate_size=4,
decoder_attention_heads=2, hidden_act="gelu",
encoder_ffn_dim=32, hidden_dropout_prob=0.1,
decoder_ffn_dim=32, attention_probs_dropout_prob=0.1,
max_position_embeddings=48, max_position_embeddings=20,
add_final_layer_norm=True, eos_token_id=2,
pad_token_id=1,
bos_token_id=0,
):
self.parent = parent
self.batch_size = batch_size
self.seq_length = seq_length
self.is_training = is_training
self.use_labels = use_labels
self.vocab_size = vocab_size
self.hidden_size = hidden_size
self.num_hidden_layers = num_hidden_layers
self.num_attention_heads = num_attention_heads
self.intermediate_size = intermediate_size
self.hidden_act = hidden_act
self.hidden_dropout_prob = hidden_dropout_prob
self.attention_probs_dropout_prob = attention_probs_dropout_prob
self.max_position_embeddings = max_position_embeddings
self.eos_token_id = eos_token_id
self.pad_token_id = pad_token_id
self.bos_token_id = bos_token_id
def prepare_config_and_inputs(self):
input_ids = ids_tensor([self.batch_size, self.seq_length], self.vocab_size)
input_ids = ids_tensor([self.batch_size, self.seq_length], self.vocab_size).clamp(
3,
) )
input_ids[:, -1] = self.eos_token_id # Eos Token
decoder_input_ids = ids_tensor([self.batch_size, self.seq_length], self.vocab_size)
config = PegasusConfig(
vocab_size=self.vocab_size,
d_model=self.hidden_size,
encoder_layers=self.num_hidden_layers,
decoder_layers=self.num_hidden_layers,
encoder_attention_heads=self.num_attention_heads,
decoder_attention_heads=self.num_attention_heads,
encoder_ffn_dim=self.intermediate_size,
decoder_ffn_dim=self.intermediate_size,
dropout=self.hidden_dropout_prob,
attention_dropout=self.attention_probs_dropout_prob,
max_position_embeddings=self.max_position_embeddings,
eos_token_id=self.eos_token_id,
bos_token_id=self.bos_token_id,
pad_token_id=self.pad_token_id,
)
inputs_dict = prepare_pegasus_inputs_dict(config, input_ids, decoder_input_ids)
return config, inputs_dict
def prepare_config_and_inputs_for_common(self): def prepare_config_and_inputs_for_common(self):
return self.config, {} config, inputs_dict = self.prepare_config_and_inputs()
return config, inputs_dict
def create_and_check_decoder_model_past_large_inputs(self, config, inputs_dict):
model = PegasusModel(config=config).get_decoder().to(torch_device).eval()
input_ids = inputs_dict["input_ids"]
attention_mask = inputs_dict["attention_mask"]
# first forward pass
outputs = model(input_ids, attention_mask=attention_mask, use_cache=True)
output, past_key_values = outputs.to_tuple()
# create hypothetical multiple next token and extent to next_input_ids
next_tokens = ids_tensor((self.batch_size, 3), config.vocab_size)
next_attn_mask = ids_tensor((self.batch_size, 3), 2)
# append to next input_ids and
next_input_ids = torch.cat([input_ids, next_tokens], dim=-1)
next_attention_mask = torch.cat([attention_mask, next_attn_mask], dim=-1)
output_from_no_past = model(next_input_ids, attention_mask=next_attention_mask)["last_hidden_state"]
output_from_past = model(next_tokens, attention_mask=next_attention_mask, past_key_values=past_key_values)[
"last_hidden_state"
]
# select random slice
random_slice_idx = ids_tensor((1,), output_from_past.shape[-1]).item()
output_from_no_past_slice = output_from_no_past[:, -3:, random_slice_idx].detach()
output_from_past_slice = output_from_past[:, :, random_slice_idx].detach()
self.parent.assertTrue(output_from_past_slice.shape[1] == next_tokens.shape[1])
# test that outputs are equal for slice
self.parent.assertTrue(torch.allclose(output_from_past_slice, output_from_no_past_slice, atol=1e-2))
def check_encoder_decoder_model_standalone(self, config, inputs_dict):
model = PegasusModel(config=config).to(torch_device).eval()
outputs = model(**inputs_dict)
encoder_last_hidden_state = outputs.encoder_last_hidden_state
last_hidden_state = outputs.last_hidden_state
with tempfile.TemporaryDirectory() as tmpdirname:
encoder = model.get_encoder()
encoder.save_pretrained(tmpdirname)
encoder = PegasusEncoder.from_pretrained(tmpdirname).to(torch_device)
encoder_last_hidden_state_2 = encoder(inputs_dict["input_ids"], attention_mask=inputs_dict["attention_mask"])[
0
]
self.parent.assertTrue((encoder_last_hidden_state_2 - encoder_last_hidden_state).abs().max().item() < 1e-3)
with tempfile.TemporaryDirectory() as tmpdirname:
decoder = model.get_decoder()
decoder.save_pretrained(tmpdirname)
decoder = PegasusDecoder.from_pretrained(tmpdirname).to(torch_device)
last_hidden_state_2 = decoder(
input_ids=inputs_dict["decoder_input_ids"],
attention_mask=inputs_dict["decoder_attention_mask"],
encoder_hidden_states=encoder_last_hidden_state,
encoder_attention_mask=inputs_dict["attention_mask"],
)[0]
self.parent.assertTrue((last_hidden_state_2 - last_hidden_state).abs().max().item() < 1e-3)
@require_torch @require_torch
class SelectiveCommonTest(unittest.TestCase): class PegasusModelTest(ModelTesterMixin, GenerationTesterMixin, unittest.TestCase):
all_model_classes = (PegasusForConditionalGeneration, PegasusModel) if is_torch_available() else () all_model_classes = (PegasusModel, PegasusForConditionalGeneration) if is_torch_available() else ()
all_generative_model_classes = (PegasusForConditionalGeneration,) if is_torch_available() else ()
test_save_load__keys_to_ignore_on_save = ModelTesterMixin.test_save_load__keys_to_ignore_on_save is_encoder_decoder = True
test_pruning = False
test_head_masking = False
test_missing_keys = False
def setUp(self): def setUp(self):
self.model_tester = ModelTester(self) self.model_tester = PegasusModelTester(self)
self.config_tester = ConfigTester(self, config_class=PegasusConfig)
def test_config(self):
self.config_tester.run_common_tests()
def test_save_load_strict(self):
config, inputs_dict = self.model_tester.prepare_config_and_inputs()
for model_class in self.all_model_classes:
model = model_class(config)
with tempfile.TemporaryDirectory() as tmpdirname:
model.save_pretrained(tmpdirname)
model2, info = model_class.from_pretrained(tmpdirname, output_loading_info=True)
self.assertEqual(info["missing_keys"], [])
def test_decoder_model_past_with_large_inputs(self):
config_and_inputs = self.model_tester.prepare_config_and_inputs()
self.model_tester.create_and_check_decoder_model_past_large_inputs(*config_and_inputs)
def test_encoder_decoder_model_standalone(self):
config_and_inputs = self.model_tester.prepare_config_and_inputs_for_common()
self.model_tester.check_encoder_decoder_model_standalone(*config_and_inputs)
def test_generate_fp16(self):
config, input_dict = self.model_tester.prepare_config_and_inputs()
input_ids = input_dict["input_ids"]
attention_mask = input_ids.ne(1).to(torch_device)
model = PegasusForConditionalGeneration(config).eval().to(torch_device)
if torch_device == "cuda":
model.half()
model.generate(input_ids, attention_mask=attention_mask)
model.generate(num_beams=4, do_sample=True, early_stopping=False, num_return_sequences=3)
def assert_tensors_close(a, b, atol=1e-12, prefix=""):
"""If tensors have different shapes, different values or a and b are not both tensors, raise a nice Assertion error."""
if a is None and b is None:
return True
try:
if torch.allclose(a, b, atol=atol):
return True
raise
except Exception:
pct_different = (torch.gt((a - b).abs(), atol)).float().mean().item()
if a.numel() > 100:
msg = f"tensor values are {pct_different:.1%} percent different."
else:
msg = f"{a} != {b}"
if prefix:
msg = prefix + ": " + msg
raise AssertionError(msg)
def _long_tensor(tok_lst):
return torch.tensor(tok_lst, dtype=torch.long, device=torch_device)
@require_torch @require_torch
@@ -68,7 +265,11 @@ class SelectiveCommonTest(unittest.TestCase):
@require_tokenizers @require_tokenizers
class PegasusXSUMIntegrationTest(AbstractSeq2SeqIntegrationTest): class PegasusXSUMIntegrationTest(AbstractSeq2SeqIntegrationTest):
checkpoint_name = "google/pegasus-xsum" checkpoint_name = "google/pegasus-xsum"
src_text = [PGE_ARTICLE, XSUM_ENTRY_LONGER] src_text = [
""" PG&E stated it scheduled the blackouts in response to forecasts for high winds amid dry conditions. The aim is to reduce the risk of wildfires. Nearly 800 thousand customers were scheduled to be affected by the shutoffs which were expected to last through at least midday tomorrow.""",
""" The London trio are up for best UK act and best album, as well as getting two nominations in the best song category."We got told like this morning 'Oh I think you're nominated'", said Dappy."And I was like 'Oh yeah, which one?' And now we've got nominated for four awards. I mean, wow!"Bandmate Fazer added: "We thought it's best of us to come down and mingle with everyone and say hello to the cameras. And now we find we've got four nominations."The band have two shots at the best song prize, getting the nod for their Tynchy Stryder collaboration Number One, and single Strong Again.Their album Uncle B will also go up against records by the likes of Beyonce and Kanye West.N-Dubz picked up the best newcomer Mobo in 2007, but female member Tulisa said they wouldn't be too disappointed if they didn't win this time around."At the end of the day we're grateful to be where we are in our careers."If it don't happen then it don't happen - live to fight another day and keep on making albums and hits for the fans."Dappy also revealed they could be performing live several times on the night.The group will be doing Number One and also a possible rendition of the War Child single, I Got Soul.The charity song is a re-working of The Killers' All These Things That I've Done and is set to feature artists like Chipmunk, Ironik and Pixie Lott.This year's Mobos will be held outside of London for the first time, in Glasgow on 30 September.N-Dubz said they were looking forward to performing for their Scottish fans and boasted about their recent shows north of the border."We just done Edinburgh the other day," said Dappy."We smashed up an N-Dubz show over there. We done Aberdeen about three or four months ago - we smashed up that show over there! Everywhere we go we smash it up!" """,
]
tgt_text = [ tgt_text = [
"California's largest electricity provider has turned off power to hundreds of thousands of customers.", "California's largest electricity provider has turned off power to hundreds of thousands of customers.",
"Pop group N-Dubz have revealed they were surprised to get four nominations for this year's Mobo Awards.", "Pop group N-Dubz have revealed they were surprised to get four nominations for this year's Mobo Awards.",
@@ -99,28 +300,3 @@ class PegasusXSUMIntegrationTest(AbstractSeq2SeqIntegrationTest):
"California's largest electricity provider has begun", "California's largest electricity provider has begun",
"N-Dubz have revealed they were", "N-Dubz have revealed they were",
] ]
class PegasusConfigTests(unittest.TestCase):
@slow
def test_task_specific_params(self):
"""Test that task_specific params['summarization_xsum'] == config['pegasus_xsum'] """
failures = []
pegasus_prefix = "google/pegasus"
n_prefix_chars = len("summarization_")
for task, desired_settings in task_specific_params.items():
dataset = task[n_prefix_chars:]
mname = f"{pegasus_prefix}-{dataset}"
cfg = AutoConfig.from_pretrained(mname)
for k, v in desired_settings.items():
actual_value = getattr(cfg, k)
if actual_value != v:
failures.append(f"config for {mname} had {k}: {actual_value}, expected {v}")
tokenizer = AutoTokenizer.from_pretrained(mname)
n_pos_embeds = desired_settings["max_position_embeddings"]
if n_pos_embeds != tokenizer.model_max_length:
failures.append(f"tokenizer.model_max_length {tokenizer.model_max_length} expected {n_pos_embeds}")
# error
all_fails = "\n".join(failures)
assert not failures, f"The following configs have unexpected settings: {all_fails}"

View File

@@ -246,16 +246,16 @@ TOLERANCE = 1e-4
class TFBartModelIntegrationTest(unittest.TestCase): class TFBartModelIntegrationTest(unittest.TestCase):
def test_inference_no_head(self): def test_inference_no_head(self):
model = TFBartModel.from_pretrained("facebook/bart-large", from_pt=True) model = TFBartModel.from_pretrained("facebook/bart-large", from_pt=True)
input_ids = _long_tensor([[0, 31414, 232, 328, 740, 1140, 12695, 69, 46078, 1588, 2]]) input_ids = _long_tensor([[0, 31414, 232, 328, 740, 1140, 12695, 69, 46078, 1588, 2]])
inputs_dict = prepare_bart_inputs_dict(model.config, input_ids) attention_mask = tf.cast(tf.math.not_equal(input_ids, model.config.pad_token_id), tf.int8)
# with torch.no_grad(): output = model(input_ids=input_ids, attention_mask=attention_mask)[0]
output = model(**inputs_dict)[0]
expected_shape = (1, 11, 1024) expected_shape = (1, 11, 1024)
self.assertEqual(output.shape, expected_shape) self.assertEqual(output.shape, expected_shape)
expected_slice = tf.Tensor( expected_slice = tf.convert_to_tensor(
[[0.7144, 0.8143, -1.2813], [0.7144, 0.8143, -1.2813], [-0.0467, 2.5911, -2.1845]], [[0.7144, 0.8143, -1.2813], [0.7144, 0.8143, -1.2813], [-0.0467, 2.5911, -2.1845]],
) )
self.assertTrue(tf.debugging.assert_near(output[:, :3, :3], expected_slice, atol=TOLERANCE)) tf.debugging.assert_near(output[:, :3, :3], expected_slice, atol=TOLERANCE)
def test_cnn_summarization_same_as_fairseq_hard(self): def test_cnn_summarization_same_as_fairseq_hard(self):
hf = TFBartForConditionalGeneration.from_pretrained("facebook/bart-large-cnn", from_pt=True) hf = TFBartForConditionalGeneration.from_pretrained("facebook/bart-large-cnn", from_pt=True)

View File

@@ -21,7 +21,6 @@ from transformers.file_utils import cached_property
from transformers.testing_utils import is_pt_tf_cross_test, require_sentencepiece, require_tf, require_tokenizers, slow from transformers.testing_utils import is_pt_tf_cross_test, require_sentencepiece, require_tf, require_tokenizers, slow
from .test_configuration_common import ConfigTester from .test_configuration_common import ConfigTester
from .test_modeling_pegasus import PGE_ARTICLE, XSUM_ENTRY_LONGER
from .test_modeling_tf_bart import TFBartModelTester from .test_modeling_tf_bart import TFBartModelTester
from .test_modeling_tf_common import TFModelTesterMixin from .test_modeling_tf_common import TFModelTesterMixin
@@ -115,7 +114,10 @@ class TestTFPegasusCommon(TFModelTesterMixin, unittest.TestCase):
@require_sentencepiece @require_sentencepiece
@require_tokenizers @require_tokenizers
class TFPegasusIntegrationTests(unittest.TestCase): class TFPegasusIntegrationTests(unittest.TestCase):
src_text = [PGE_ARTICLE, XSUM_ENTRY_LONGER] src_text = [
""" PG&E stated it scheduled the blackouts in response to forecasts for high winds amid dry conditions. The aim is to reduce the risk of wildfires. Nearly 800 thousand customers were scheduled to be affected by the shutoffs which were expected to last through at least midday tomorrow.""",
""" The London trio are up for best UK act and best album, as well as getting two nominations in the best song category."We got told like this morning 'Oh I think you're nominated'", said Dappy."And I was like 'Oh yeah, which one?' And now we've got nominated for four awards. I mean, wow!"Bandmate Fazer added: "We thought it's best of us to come down and mingle with everyone and say hello to the cameras. And now we find we've got four nominations."The band have two shots at the best song prize, getting the nod for their Tynchy Stryder collaboration Number One, and single Strong Again.Their album Uncle B will also go up against records by the likes of Beyonce and Kanye West.N-Dubz picked up the best newcomer Mobo in 2007, but female member Tulisa said they wouldn't be too disappointed if they didn't win this time around."At the end of the day we're grateful to be where we are in our careers."If it don't happen then it don't happen - live to fight another day and keep on making albums and hits for the fans."Dappy also revealed they could be performing live several times on the night.The group will be doing Number One and also a possible rendition of the War Child single, I Got Soul.The charity song is a re-working of The Killers' All These Things That I've Done and is set to feature artists like Chipmunk, Ironik and Pixie Lott.This year's Mobos will be held outside of London for the first time, in Glasgow on 30 September.N-Dubz said they were looking forward to performing for their Scottish fans and boasted about their recent shows north of the border."We just done Edinburgh the other day," said Dappy."We smashed up an N-Dubz show over there. We done Aberdeen about three or four months ago - we smashed up that show over there! Everywhere we go we smash it up!" """,
]
expected_text = [ expected_text = [
"California's largest electricity provider has cut power to hundreds of thousands of customers in an effort to reduce the risk of wildfires.", "California's largest electricity provider has cut power to hundreds of thousands of customers in an effort to reduce the risk of wildfires.",
'N-Dubz have revealed they\'re "grateful" to have been nominated for four Mobo Awards.', 'N-Dubz have revealed they\'re "grateful" to have been nominated for four Mobo Awards.',

View File

@@ -14,78 +14,10 @@
# See the License for the specific language governing permissions and # See the License for the specific language governing permissions and
# limitations under the License. # limitations under the License.
"""Tests for Blenderbot Tokenizers, including common tests for BlenderbotSmallTokenizer.""" """Tests for Blenderbot Tokenizers, including common tests for BlenderbotSmallTokenizer."""
import json
import os
import unittest import unittest
from transformers.file_utils import cached_property from transformers.file_utils import cached_property
from transformers.models.blenderbot.tokenization_blenderbot import ( from transformers.models.blenderbot.tokenization_blenderbot import BlenderbotTokenizer
VOCAB_FILES_NAMES,
BlenderbotSmallTokenizer,
BlenderbotTokenizer,
)
from .test_tokenization_common import TokenizerTesterMixin
class BlenderbotSmallTokenizerTest(TokenizerTesterMixin, unittest.TestCase):
tokenizer_class = BlenderbotSmallTokenizer
def setUp(self):
super().setUp()
vocab = ["__start__", "adapt", "act", "ap@@", "te", "__end__", "__unk__"]
vocab_tokens = dict(zip(vocab, range(len(vocab))))
merges = ["#version: 0.2", "a p", "t e</w>", "ap t</w>", "a d", "ad apt</w>", "a c", "ac t</w>", ""]
self.special_tokens_map = {"unk_token": "__unk__", "bos_token": "__start__", "eos_token": "__end__"}
self.vocab_file = os.path.join(self.tmpdirname, VOCAB_FILES_NAMES["vocab_file"])
self.merges_file = os.path.join(self.tmpdirname, VOCAB_FILES_NAMES["merges_file"])
with open(self.vocab_file, "w", encoding="utf-8") as fp:
fp.write(json.dumps(vocab_tokens) + "\n")
with open(self.merges_file, "w", encoding="utf-8") as fp:
fp.write("\n".join(merges))
def get_tokenizer(self, **kwargs):
kwargs.update(self.special_tokens_map)
return BlenderbotSmallTokenizer.from_pretrained(self.tmpdirname, **kwargs)
def get_input_output_texts(self, tokenizer):
input_text = "adapt act apte"
output_text = "adapt act apte"
return input_text, output_text
def test_full_blenderbot_small_tokenizer(self):
tokenizer = BlenderbotSmallTokenizer(self.vocab_file, self.merges_file, **self.special_tokens_map)
text = "adapt act apte"
bpe_tokens = ["adapt", "act", "ap@@", "te"]
tokens = tokenizer.tokenize(text)
self.assertListEqual(tokens, bpe_tokens)
input_tokens = [tokenizer.bos_token] + tokens + [tokenizer.eos_token]
input_bpe_tokens = [0, 1, 2, 3, 4, 5]
self.assertListEqual(tokenizer.convert_tokens_to_ids(input_tokens), input_bpe_tokens)
def test_special_tokens_small_tok(self):
tok = BlenderbotSmallTokenizer.from_pretrained("facebook/blenderbot-90M")
assert tok("sam").input_ids == [1384]
src_text = "I am a small frog."
encoded = tok([src_text], padding=False, truncation=False)["input_ids"]
decoded = tok.batch_decode(encoded, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
assert src_text != decoded # I wish it did!
assert decoded == "i am a small frog ."
def test_empty_word_small_tok(self):
tok = BlenderbotSmallTokenizer.from_pretrained("facebook/blenderbot-90M")
src_text = "I am a small frog ."
src_text_dot = "."
encoded = tok(src_text)["input_ids"]
encoded_dot = tok(src_text_dot)["input_ids"]
assert encoded[-1] == encoded_dot[0]
class Blenderbot3BTokenizerTests(unittest.TestCase): class Blenderbot3BTokenizerTests(unittest.TestCase):

View File

@@ -31,7 +31,7 @@ if _sentencepiece_available:
if is_torch_available(): if is_torch_available():
from transformers.models.bart.modeling_bart import shift_tokens_right from transformers.models.mbart.modeling_mbart import shift_tokens_right
EN_CODE = 250004 EN_CODE = 250004
RO_CODE = 250020 RO_CODE = 250020
@@ -196,6 +196,7 @@ class MBartEnroIntegrationTest(unittest.TestCase):
self.src_text, tgt_texts=self.tgt_text, return_tensors="pt" self.src_text, tgt_texts=self.tgt_text, return_tensors="pt"
) )
batch["decoder_input_ids"] = shift_tokens_right(batch.labels, self.tokenizer.pad_token_id) batch["decoder_input_ids"] = shift_tokens_right(batch.labels, self.tokenizer.pad_token_id)
for k in batch: for k in batch:
batch[k] = batch[k].tolist() batch[k] = batch[k].tolist()
# batch = {k: v.tolist() for k,v in batch.items()} # batch = {k: v.tolist() for k,v in batch.items()}

View File

@@ -0,0 +1,86 @@
#!/usr/bin/env python3
# coding=utf-8
# Copyright 2020 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Tests for the Blenderbot small tokenizer."""
import json
import os
import unittest
from transformers.models.blenderbot_small.tokenization_blenderbot_small import (
VOCAB_FILES_NAMES,
BlenderbotSmallTokenizer,
)
from .test_tokenization_common import TokenizerTesterMixin
class BlenderbotSmallTokenizerTest(TokenizerTesterMixin, unittest.TestCase):
tokenizer_class = BlenderbotSmallTokenizer
def setUp(self):
super().setUp()
vocab = ["__start__", "adapt", "act", "ap@@", "te", "__end__", "__unk__"]
vocab_tokens = dict(zip(vocab, range(len(vocab))))
merges = ["#version: 0.2", "a p", "t e</w>", "ap t</w>", "a d", "ad apt</w>", "a c", "ac t</w>", ""]
self.special_tokens_map = {"unk_token": "__unk__", "bos_token": "__start__", "eos_token": "__end__"}
self.vocab_file = os.path.join(self.tmpdirname, VOCAB_FILES_NAMES["vocab_file"])
self.merges_file = os.path.join(self.tmpdirname, VOCAB_FILES_NAMES["merges_file"])
with open(self.vocab_file, "w", encoding="utf-8") as fp:
fp.write(json.dumps(vocab_tokens) + "\n")
with open(self.merges_file, "w", encoding="utf-8") as fp:
fp.write("\n".join(merges))
def get_tokenizer(self, **kwargs):
kwargs.update(self.special_tokens_map)
return BlenderbotSmallTokenizer.from_pretrained(self.tmpdirname, **kwargs)
def get_input_output_texts(self, tokenizer):
input_text = "adapt act apte"
output_text = "adapt act apte"
return input_text, output_text
def test_full_blenderbot_small_tokenizer(self):
tokenizer = BlenderbotSmallTokenizer(self.vocab_file, self.merges_file, **self.special_tokens_map)
text = "adapt act apte"
bpe_tokens = ["adapt", "act", "ap@@", "te"]
tokens = tokenizer.tokenize(text)
self.assertListEqual(tokens, bpe_tokens)
input_tokens = [tokenizer.bos_token] + tokens + [tokenizer.eos_token]
input_bpe_tokens = [0, 1, 2, 3, 4, 5]
self.assertListEqual(tokenizer.convert_tokens_to_ids(input_tokens), input_bpe_tokens)
def test_special_tokens_small_tok(self):
tok = BlenderbotSmallTokenizer.from_pretrained("facebook/blenderbot-90M")
assert tok("sam").input_ids == [1384]
src_text = "I am a small frog."
encoded = tok([src_text], padding=False, truncation=False)["input_ids"]
decoded = tok.batch_decode(encoded, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
assert src_text != decoded # I wish it did!
assert decoded == "i am a small frog ."
def test_empty_word_small_tok(self):
tok = BlenderbotSmallTokenizer.from_pretrained("facebook/blenderbot-90M")
src_text = "I am a small frog ."
src_text_dot = "."
encoded = tok(src_text)["input_ids"]
encoded_dot = tok(src_text_dot)["input_ids"]
assert encoded[-1] == encoded_dot[0]

View File

@@ -35,6 +35,14 @@ IGNORE_NON_TESTED = [
"BartDecoder", # Building part of bigger (tested) model. "BartDecoder", # Building part of bigger (tested) model.
"BartEncoder", # Building part of bigger (tested) model. "BartEncoder", # Building part of bigger (tested) model.
"BertLMHeadModel", # Needs to be setup as decoder. "BertLMHeadModel", # Needs to be setup as decoder.
"BlenderbotSmallEncoder", # Building part of bigger (tested) model.
"BlenderbotSmallDecoder", # Building part of bigger (tested) model.
"BlenderbotEncoder", # Building part of bigger (tested) model.
"BlenderbotDecoder", # Building part of bigger (tested) model.
"MBartEncoder", # Building part of bigger (tested) model.
"MBartDecoder", # Building part of bigger (tested) model.
"PegasusEncoder", # Building part of bigger (tested) model.
"PegasusDecoder", # Building part of bigger (tested) model.
"DPREncoder", # Building part of bigger (tested) model. "DPREncoder", # Building part of bigger (tested) model.
"DPRSpanPredictor", # Building part of bigger (tested) model. "DPRSpanPredictor", # Building part of bigger (tested) model.
"ProphetNetDecoderWrapper", # Building part of bigger (tested) model. "ProphetNetDecoderWrapper", # Building part of bigger (tested) model.
@@ -70,6 +78,10 @@ IGNORE_NON_AUTO_CONFIGURED = [
"LEDDecoder", "LEDDecoder",
"BartDecoder", "BartDecoder",
"BartEncoder", "BartEncoder",
"BlenderbotSmallEncoder",
"BlenderbotSmallDecoder",
"BlenderbotEncoder",
"BlenderbotDecoder",
"DPRContextEncoder", "DPRContextEncoder",
"DPREncoder", "DPREncoder",
"DPRReader", "DPRReader",
@@ -78,7 +90,11 @@ IGNORE_NON_AUTO_CONFIGURED = [
"FunnelBaseModel", "FunnelBaseModel",
"GPT2DoubleHeadsModel", "GPT2DoubleHeadsModel",
"MT5EncoderModel", "MT5EncoderModel",
"MBartEncoder",
"MBartDecoder",
"OpenAIGPTDoubleHeadsModel", "OpenAIGPTDoubleHeadsModel",
"PegasusEncoder",
"PegasusDecoder",
"ProphetNetDecoder", "ProphetNetDecoder",
"ProphetNetEncoder", "ProphetNetEncoder",
"ProphetNetDecoderWrapper", "ProphetNetDecoderWrapper",