diff --git a/docs/source/main_classes/optimizer_schedules.rst b/docs/source/main_classes/optimizer_schedules.rst index ec4998389b..3d8536cdfa 100644 --- a/docs/source/main_classes/optimizer_schedules.rst +++ b/docs/source/main_classes/optimizer_schedules.rst @@ -17,7 +17,6 @@ The ``.optimization`` module provides: ~~~~~~~~~~~~~~~~~~~ .. autoclass:: transformers.AdamWeightDecay - :members: .. autofunction:: transformers.create_optimizer diff --git a/docs/source/main_classes/pipelines.rst b/docs/source/main_classes/pipelines.rst index 9a5fbb2e70..04f918b362 100644 --- a/docs/source/main_classes/pipelines.rst +++ b/docs/source/main_classes/pipelines.rst @@ -7,7 +7,7 @@ Recognition, Masked Language Modeling, Sentiment Analysis, Feature Extraction an There are two categories of pipeline abstractions to be aware about: -- The :class:`~transformers.pipeline` which is the most powerful object encapsulating all other pipelines +- The :func:`~transformers.pipeline` which is the most powerful object encapsulating all other pipelines - The other task-specific pipelines, such as :class:`~transformers.TokenClassificationPipeline` or :class:`~transformers.QuestionAnsweringPipeline` @@ -17,8 +17,7 @@ The pipeline abstraction The `pipeline` abstraction is a wrapper around all the other available pipelines. It is instantiated as any other pipeline but requires an additional argument which is the `task`. -.. autoclass:: transformers.pipeline - :members: +... autofunction:: transformers.pipeline The task specific pipelines diff --git a/docs/source/model_doc/auto.rst b/docs/source/model_doc/auto.rst index 541d03a8e5..55140d5a99 100644 --- a/docs/source/model_doc/auto.rst +++ b/docs/source/model_doc/auto.rst @@ -30,35 +30,35 @@ Instantiating one of ``AutoModel``, ``AutoConfig`` and ``AutoTokenizer`` will di ``AutoModelForPreTraining`` -~~~~~~~~~~~~~~~~~~~~~ +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ .. 
autoclass:: transformers.AutoModelForPreTraining :members: ``AutoModelWithLMHead`` -~~~~~~~~~~~~~~~~~~~~~ +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ .. autoclass:: transformers.AutoModelWithLMHead :members: ``AutoModelForSequenceClassification`` -~~~~~~~~~~~~~~~~~~~~~ +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ .. autoclass:: transformers.AutoModelForSequenceClassification :members: ``AutoModelForQuestionAnswering`` -~~~~~~~~~~~~~~~~~~~~~ +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ .. autoclass:: transformers.AutoModelForQuestionAnswering :members: ``AutoModelForTokenClassification`` -~~~~~~~~~~~~~~~~~~~~~ +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ .. autoclass:: transformers.AutoModelForTokenClassification :members: diff --git a/docs/source/model_doc/encoderdecoder.rst b/docs/source/model_doc/encoderdecoder.rst index 71c873314c..f3105d9131 100644 --- a/docs/source/model_doc/encoderdecoder.rst +++ b/docs/source/model_doc/encoderdecoder.rst @@ -1,5 +1,5 @@ Encoder Decoder Models ------------ +------------------------ This class can wrap an encoder model, such as ``BertModel`` and a decoder modeling with a language modeling head, such as ``BertForMaskedLM`` into a encoder-decoder model. @@ -10,7 +10,7 @@ An application of this architecture could be *summarization* using two pretraine ``EncoderDecoderConfig`` -~~~~~~~~~~~~~~~~~~~~~ +~~~~~~~~~~~~~~~~~~~~~~~~~ .. autoclass:: transformers.EncoderDecoderConfig :members: diff --git a/docs/source/model_doc/reformer.rst b/docs/source/model_doc/reformer.rst index 2c3b3faf1c..016c8937a3 100644 --- a/docs/source/model_doc/reformer.rst +++ b/docs/source/model_doc/reformer.rst @@ -4,7 +4,7 @@ Reformer file a `Github Issue `_ Overview -~~~~~ +~~~~~~~~~~ The Reformer model was presented in `Reformer: The Efficient Transformer `_ by Nikita Kitaev, Łukasz Kaiser, Anselm Levskaya. Here the abstract: @@ -13,7 +13,7 @@ Here the abstract: The Authors' code can be found `here `_ . 
Axial Positional Encodings -~~~~~~~~~~~~~~~~~~~~ +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Axial Positional Encodings were first implemented in Google's `trax library `_ and developed by the authors of this model's paper. In models that are treating very long input sequences, the conventional position id encodings store an embedings vector of size :math:`d` being the ``config.hidden_size`` for every position :math:`i, \ldots, n_s`, with :math:`n_s` being ``config.max_embedding_size``. *E.g.*, having a sequence length of :math:`n_s = 2^{19} \approx 0.5M` and a ``config.hidden_size`` of :math:`d = 2^{10} \approx 1000` would result in a position encoding matrix: .. math:: diff --git a/docs/source/pretrained_models.rst b/docs/source/pretrained_models.rst index 44e4dded6d..27f048dc25 100644 --- a/docs/source/pretrained_models.rst +++ b/docs/source/pretrained_models.rst @@ -22,10 +22,12 @@ For a list that includes community-uploaded models, refer to `https://huggingfac | +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+ | | ``bert-base-multilingual-uncased`` | | (Original, not recommended) 12-layer, 768-hidden, 12-heads, 110M parameters. | | | | | Trained on lower-cased text in the top 102 languages with the largest Wikipedias | +| | | | | | | (see `details `__). | | +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+ | | ``bert-base-multilingual-cased`` | | (New, **recommended**) 12-layer, 768-hidden, 12-heads, 110M parameters. | | | | | Trained on cased text in the top 104 languages with the largest Wikipedias | +| | | | | | | (see `details `__). 
| | +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+ | | ``bert-base-chinese`` | | 12-layer, 768-hidden, 12-heads, 110M parameters. | @@ -33,64 +35,79 @@ For a list that includes community-uploaded models, refer to `https://huggingfac | +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+ | | ``bert-base-german-cased`` | | 12-layer, 768-hidden, 12-heads, 110M parameters. | | | | | Trained on cased German text by Deepset.ai | +| | | | | | | (see `details on deepset.ai website `__). | | +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+ | | ``bert-large-uncased-whole-word-masking`` | | 24-layer, 1024-hidden, 16-heads, 340M parameters. | | | | | Trained on lower-cased English text using Whole-Word-Masking | +| | | | | | | (see `details `__). | | +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+ | | ``bert-large-cased-whole-word-masking`` | | 24-layer, 1024-hidden, 16-heads, 340M parameters. | | | | | Trained on cased English text using Whole-Word-Masking | +| | | | | | | (see `details `__). | | +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+ | | ``bert-large-uncased-whole-word-masking-finetuned-squad`` | | 24-layer, 1024-hidden, 16-heads, 340M parameters. 
| | | | | The ``bert-large-uncased-whole-word-masking`` model fine-tuned on SQuAD | +| | | | | | | (see details of fine-tuning in the `example section `__). | | +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+ | | ``bert-large-cased-whole-word-masking-finetuned-squad`` | | 24-layer, 1024-hidden, 16-heads, 340M parameters | | | | | The ``bert-large-cased-whole-word-masking`` model fine-tuned on SQuAD | +| | | | | | | (see `details of fine-tuning in the example section `__) | | +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+ | | ``bert-base-cased-finetuned-mrpc`` | | 12-layer, 768-hidden, 12-heads, 110M parameters. | | | | | The ``bert-base-cased`` model fine-tuned on MRPC | +| | | | | | | (see `details of fine-tuning in the example section `__) | | +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+ | | ``bert-base-german-dbmdz-cased`` | | 12-layer, 768-hidden, 12-heads, 110M parameters. | | | | | Trained on cased German text by DBMDZ | +| | | | | | | (see `details on dbmdz repository `__). | | +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+ | | ``bert-base-german-dbmdz-uncased`` | | 12-layer, 768-hidden, 12-heads, 110M parameters. | | | | | Trained on uncased German text by DBMDZ | +| | | | | | | (see `details on dbmdz repository `__). 
| | +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+ | | ``cl-tohoku/bert-base-japanese`` | | 12-layer, 768-hidden, 12-heads, 110M parameters. | | | | | Trained on Japanese text. Text is tokenized with MeCab and WordPiece. | | | | | `MeCab `__ is required for tokenization. | +| | | | | | | (see `details on cl-tohoku repository `__). | | +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+ | | ``cl-tohoku/bert-base-japanese-whole-word-masking`` | | 12-layer, 768-hidden, 12-heads, 110M parameters. | | | | | Trained on Japanese text using Whole-Word-Masking. Text is tokenized with MeCab and WordPiece. | | | | | `MeCab `__ is required for tokenization. | +| | | | | | | (see `details on cl-tohoku repository `__). | | +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+ | | ``cl-tohoku/bert-base-japanese-char`` | | 12-layer, 768-hidden, 12-heads, 110M parameters. | | | | | Trained on Japanese text. Text is tokenized into characters. | +| | | | | | | (see `details on cl-tohoku repository `__). | | +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+ | | ``cl-tohoku/bert-base-japanese-char-whole-word-masking`` | | 12-layer, 768-hidden, 12-heads, 110M parameters. | | | | | Trained on Japanese text using Whole-Word-Masking. Text is tokenized into characters. | +| | | | | | | (see `details on cl-tohoku repository `__). 
| | +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+ | | ``TurkuNLP/bert-base-finnish-cased-v1`` | | 12-layer, 768-hidden, 12-heads, 110M parameters. | | | | | Trained on cased Finnish text. | +| | | | | | | (see `details on turkunlp.org `__). | | +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+ | | ``TurkuNLP/bert-base-finnish-uncased-v1`` | | 12-layer, 768-hidden, 12-heads, 110M parameters. | | | | | Trained on uncased Finnish text. | +| | | | | | | (see `details on turkunlp.org `__). | | +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+ | | ``wietsedv/bert-base-dutch-cased`` | | 12-layer, 768-hidden, 12-heads, 110M parameters. | | | | | Trained on cased Dutch text. | +| | | | | | | (see `details on wietsedv repository `__). | +-------------------+------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+ | GPT | ``openai-gpt`` | | 12-layer, 768-hidden, 12-heads, 110M parameters. 
| @@ -149,54 +166,67 @@ For a list that includes community-uploaded models, refer to `https://huggingfac +-------------------+------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+ | RoBERTa | ``roberta-base`` | | 12-layer, 768-hidden, 12-heads, 125M parameters | | | | | RoBERTa using the BERT-base architecture | +| | | | | | | (see `details `__) | | +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+ | | ``roberta-large`` | | 24-layer, 1024-hidden, 16-heads, 355M parameters | | | | | RoBERTa using the BERT-large architecture | +| | | | | | | (see `details `__) | | +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+ | | ``roberta-large-mnli`` | | 24-layer, 1024-hidden, 16-heads, 355M parameters | | | | | ``roberta-large`` fine-tuned on `MNLI `__. | +| | | | | | | (see `details `__) | | +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+ | | ``distilroberta-base`` | | 6-layer, 768-hidden, 12-heads, 82M parameters | | | | | The DistilRoBERTa model distilled from the RoBERTa model `roberta-base` checkpoint. 
| +| | | | | | | (see `details `__) | | +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+ | | ``roberta-base-openai-detector`` | | 12-layer, 768-hidden, 12-heads, 125M parameters | | | | | ``roberta-base`` fine-tuned by OpenAI on the outputs of the 1.5B-parameter GPT-2 model. | +| | | | | | | (see `details `__) | | +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+ | | ``roberta-large-openai-detector`` | | 24-layer, 1024-hidden, 16-heads, 355M parameters | | | | | ``roberta-large`` fine-tuned by OpenAI on the outputs of the 1.5B-parameter GPT-2 model. | +| | | | | | | (see `details `__) | +-------------------+------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+ | DistilBERT | ``distilbert-base-uncased`` | | 6-layer, 768-hidden, 12-heads, 66M parameters | | | | | The DistilBERT model distilled from the BERT model `bert-base-uncased` checkpoint | +| | | | | | | (see `details `__) | | +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+ | | ``distilbert-base-uncased-distilled-squad`` | | 6-layer, 768-hidden, 12-heads, 66M parameters | | | | | The DistilBERT model distilled from the BERT model `bert-base-uncased` checkpoint, with an additional linear layer. 
| +| | | | | | | (see `details `__) | | +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+ | | ``distilbert-base-cased`` | | 6-layer, 768-hidden, 12-heads, 65M parameters | | | | | The DistilBERT model distilled from the BERT model `bert-base-cased` checkpoint | +| | | | | | | (see `details `__) | | +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+ | | ``distilbert-base-cased-distilled-squad`` | | 6-layer, 768-hidden, 12-heads, 65M parameters | | | | | The DistilBERT model distilled from the BERT model `bert-base-cased` checkpoint, with an additional question answering layer. | +| | | | | | | (see `details `__) | | +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+ | | ``distilgpt2`` | | 6-layer, 768-hidden, 12-heads, 82M parameters | | | | | The DistilGPT2 model distilled from the GPT2 model `gpt2` checkpoint. | +| | | | | | | (see `details `__) | | +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+ | | ``distilbert-base-german-cased`` | | 6-layer, 768-hidden, 12-heads, 66M parameters | | | | | The German DistilBERT model distilled from the German DBMDZ BERT model `bert-base-german-dbmdz-cased` checkpoint. 
| +| | | | | | | (see `details `__) | | +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+ | | ``distilbert-base-multilingual-cased`` | | 6-layer, 768-hidden, 12-heads, 134M parameters | | | | | The multilingual DistilBERT model distilled from the Multilingual BERT model `bert-base-multilingual-cased` checkpoint. | +| | | | | | | (see `details `__) | +-------------------+------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+ | CTRL | ``ctrl`` | | 48-layer, 1280-hidden, 16-heads, 1.6B parameters | @@ -204,38 +234,47 @@ For a list that includes community-uploaded models, refer to `https://huggingfac +-------------------+------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+ | CamemBERT | ``camembert-base`` | | 12-layer, 768-hidden, 12-heads, 110M parameters | | | | | CamemBERT using the BERT-base architecture | +| | | | | | | (see `details `__) | +-------------------+------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+ | ALBERT | ``albert-base-v1`` | | 12 repeating layers, 128 embedding, 768-hidden, 12-heads, 11M parameters | | | | | ALBERT base model | +| | | | | | | (see `details `__) | | +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+ | | ``albert-large-v1`` | | 24 repeating layers, 128 embedding, 1024-hidden, 16-heads, 17M parameters | | | | | 
ALBERT large model | +| | | | | | | (see `details `__) | | +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+ | | ``albert-xlarge-v1`` | | 24 repeating layers, 128 embedding, 2048-hidden, 16-heads, 58M parameters | | | | | ALBERT xlarge model | +| | | | | | | (see `details `__) | | +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+ | | ``albert-xxlarge-v1`` | | 12 repeating layer, 128 embedding, 4096-hidden, 64-heads, 223M parameters | | | | | ALBERT xxlarge model | +| | | | | | | (see `details `__) | | +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+ | | ``albert-base-v2`` | | 12 repeating layers, 128 embedding, 768-hidden, 12-heads, 11M parameters | | | | | ALBERT base model with no dropout, additional training data and longer training | +| | | | | | | (see `details `__) | | +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+ | | ``albert-large-v2`` | | 24 repeating layers, 128 embedding, 1024-hidden, 16-heads, 17M parameters | | | | | ALBERT large model with no dropout, additional training data and longer training | +| | | | | | | (see `details `__) | | +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+ | | ``albert-xlarge-v2`` | | 24 repeating layers, 128 embedding, 2048-hidden, 16-heads, 58M parameters | | | | | ALBERT xlarge model 
with no dropout, additional training data and longer training | +| | | | | | | (see `details `__) | | +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+ | | ``albert-xxlarge-v2`` | | 12 repeating layer, 128 embedding, 4096-hidden, 64-heads, 223M parameters | | | | | ALBERT xxlarge model with no dropout, additional training data and longer training | +| | | | | | | (see `details `__) | +-------------------+------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+ | T5 | ``t5-small`` | | ~60M parameters with 6-layers, 512-hidden-state, 2048 feed-forward hidden-state, 8-heads, | @@ -261,21 +300,26 @@ For a list that includes community-uploaded models, refer to `https://huggingfac +-------------------+------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+ | FlauBERT | ``flaubert/flaubert_small_cased`` | | 6-layer, 512-hidden, 8-heads, 54M parameters | | | | | FlauBERT small architecture | +| | | | | | | (see `details `__) | | +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+ | | ``flaubert/flaubert_base_uncased`` | | 12-layer, 768-hidden, 12-heads, 137M parameters | | | | | FlauBERT base architecture with uncased vocabulary | +| | | | | | | (see `details `__) | | +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+ | | ``flaubert/flaubert_base_cased`` | | 
12-layer, 768-hidden, 12-heads, 138M parameters | | | | | FlauBERT base architecture with cased vocabulary | +| | | | | | | (see `details `__) | | +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+ | | ``flaubert/flaubert_large_cased`` | | 24-layer, 1024-hidden, 16-heads, 373M parameters | | | | | FlauBERT large architecture | +| | | | | | | (see `details `__) | +-------------------+------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+ | Bart | ``facebook/bart-large`` | | 24-layer, 1024-hidden, 16-heads, 406M parameters | +| | | | | | | (see `details `_) | | +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+ | | ``facebook/bart-base`` | | 12-layer, 768-hidden, 16-heads, 139M parameters | diff --git a/docs/source/usage.rst b/docs/source/usage.rst index 315993e6ba..5d035c4ab7 100644 --- a/docs/source/usage.rst +++ b/docs/source/usage.rst @@ -692,7 +692,8 @@ following array should be the output: :: - [('[CLS]', 'O'), ('Hu', 'I-ORG'), ('##gging', 'I-ORG'), ('Face', 'I-ORG'), ('Inc', 'I-ORG'), ('.', 'O'), ('is', 'O'), ('a', 'O'), ('company', 'O'), ('based', 'O'), ('in', 'O'), ('New', 'I-LOC'), ('York', 'I-LOC'), ('City', 'I-LOC'), ('.', 'O'), ('Its', 'O'), ('headquarters', 'O'), ('are', 'O'), ('in', 'O'), ('D', 'I-LOC'), ('##UM', 'I-LOC'), ('##BO', 'I-LOC'), (',', 'O'), ('therefore', 'O'), ('very', 'O'), ('##c', 'O'), ('##lose', 'O'), ('to', 'O'), ('the', 'O'), ('Manhattan', 'I-LOC'), ('Bridge', 'I-LOC'), ('.', 'O'), ('[SEP]', 'O')] + [('[CLS]', 'O'), ('Hu', 'I-ORG'), ('##gging', 'I-ORG'), ('Face', 'I-ORG'), ('Inc', 'I-ORG'), ('.', 'O'), 
('is', 'O'), ('a', 'O'), ('company', 'O'), ('based', 'O'), ('in', 'O'), ('New', 'I-LOC'), ('York', 'I-LOC'), ('City', 'I-LOC'), ('.', 'O'), ('Its', 'O'), ('headquarters', 'O'), ('are', 'O'), ('in', 'O'), ('D', 'I-LOC'), ('##UM', 'I-LOC'), ('##BO', 'I-LOC'), (',', 'O'), ('therefore', 'O'), ('very', 'O'), ('##c', 'O'), ('##lose', 'O'), ('to', 'O'), ('the', 'O'), ('Manhattan', 'I-LOC'), ('Bridge', 'I-LOC'), ('.', 'O'), ('[SEP]', 'O')] + Summarization ---------------------------------------------------- @@ -769,7 +770,8 @@ Here Google`s T5 model is used that was only pre-trained on a multi-task mixed d # T5 uses a max_length of 512 so we cut the article to 512 tokens. inputs = tokenizer.encode("summarize: " + ARTICLE, return_tensors="tf", max_length=512) outputs = model.generate(inputs, max_length=150, min_length=40, length_penalty=2.0, num_beams=4, early_stopping=True) - print(outputs) + print(outputs) + Translation ---------------------------------------------------- diff --git a/src/transformers/configuration_auto.py b/src/transformers/configuration_auto.py index c3806c12d5..e2c436f882 100644 --- a/src/transformers/configuration_auto.py +++ b/src/transformers/configuration_auto.py @@ -134,6 +134,7 @@ class AutoConfig: The configuration class to instantiate is selected based on the `model_type` property of the config object, or when it's missing, falling back to using pattern matching on the `pretrained_model_name_or_path` string: + - `t5`: :class:`~transformers.T5Config` (T5 model) - `distilbert`: :class:`~transformers.DistilBertConfig` (DistilBERT model) - `albert`: :class:`~transformers.AlbertConfig` (ALBERT model) diff --git a/src/transformers/configuration_t5.py b/src/transformers/configuration_t5.py index 05c1d87b88..9925c14758 100644 --- a/src/transformers/configuration_t5.py +++ b/src/transformers/configuration_t5.py @@ -53,7 +53,7 @@ class T5Config(PretrainedConfig): probabilities. 
n_positions: The maximum sequence length that this model might ever be used with. Typically set this to something large just in case - (e.g., 512 or 1024 or 2048). `n_positions` can also be accessed via the property `max_position_embeddings'. + (e.g., 512 or 1024 or 2048). `n_positions` can also be accessed via the property `max_position_embeddings`. type_vocab_size: The vocabulary size of the `token_type_ids` passed into `T5Model`. initializer_factor: A factor for initializing all weight matrices (should be kept to 1.0, used for initialization testing). diff --git a/src/transformers/configuration_xlnet.py b/src/transformers/configuration_xlnet.py index fbed440404..2c17696805 100644 --- a/src/transformers/configuration_xlnet.py +++ b/src/transformers/configuration_xlnet.py @@ -84,11 +84,12 @@ class XLNetConfig(PretrainedConfig): Argument used when doing sequence summary. Used in for the multiple choice head in :class:transformers.XLNetForSequenceClassification` and :class:`~transformers.XLNetForMultipleChoice`. Is one of the following options: - - 'last' => take the last token hidden state (like XLNet) - - 'first' => take the first token hidden state (like Bert) - - 'mean' => take the mean of all tokens hidden states - - 'cls_index' => supply a Tensor of classification token position (GPT/GPT-2) - - 'attn' => Not implemented now, use multi-head attention + + - 'last' => take the last token hidden state (like XLNet) + - 'first' => take the first token hidden state (like Bert) + - 'mean' => take the mean of all tokens hidden states + - 'cls_index' => supply a Tensor of classification token position (GPT/GPT-2) + - 'attn' => Not implemented now, use multi-head attention summary_use_proj (:obj:`boolean`, optional, defaults to :obj:`True`): Argument used when doing sequence summary. Used in for the multiple choice head in :class:`~transformers.XLNetForSequenceClassification` and :class:`~transformers.XLNetForMultipleChoice`. 
diff --git a/src/transformers/data/processors/utils.py b/src/transformers/data/processors/utils.py index 0212c58643..4550e5756b 100644 --- a/src/transformers/data/processors/utils.py +++ b/src/transformers/data/processors/utils.py @@ -83,7 +83,8 @@ class DataProcessor: """Base class for data converters for sequence classification data sets.""" def get_example_from_tensor_dict(self, tensor_dict): - """Gets an example from a dict with tensorflow tensors + """Gets an example from a dict with tensorflow tensors. + Args: tensor_dict: Keys and values should match the corresponding Glue tensorflow_dataset examples. @@ -91,15 +92,15 @@ class DataProcessor: raise NotImplementedError() def get_train_examples(self, data_dir): - """Gets a collection of `InputExample`s for the train set.""" + """Gets a collection of :class:`InputExample` for the train set.""" raise NotImplementedError() def get_dev_examples(self, data_dir): - """Gets a collection of `InputExample`s for the dev set.""" + """Gets a collection of :class:`InputExample` for the dev set.""" raise NotImplementedError() def get_test_examples(self, data_dir): - """Gets a collection of `InputExample`s for the test set.""" + """Gets a collection of :class:`InputExample` for the test set.""" raise NotImplementedError() def get_labels(self): diff --git a/src/transformers/modeling_auto.py b/src/transformers/modeling_auto.py index da9281d192..e7e2a98baa 100644 --- a/src/transformers/modeling_auto.py +++ b/src/transformers/modeling_auto.py @@ -393,6 +393,7 @@ class AutoModel: The `from_pretrained()` method takes care of returning the correct model class instance based on the `model_type` property of the config object, or when it's missing, falling back to using pattern matching on the `pretrained_model_name_or_path` string: + - `t5`: :class:`~transformers.T5Model` (T5 model) - `distilbert`: :class:`~transformers.DistilBertModel` (DistilBERT model) - `albert`: :class:`~transformers.AlbertModel` (ALBERT model) @@ -546,6 +547,7 
@@ class AutoModelForPreTraining: The `from_pretrained()` method takes care of returning the correct model class instance based on the `model_type` property of the config object, or when it's missing, falling back to using pattern matching on the `pretrained_model_name_or_path` string: + - `t5`: :class:`~transformers.T5ModelWithLMHead` (T5 model) - `distilbert`: :class:`~transformers.DistilBertForMaskedLM` (DistilBERT model) - `albert`: :class:`~transformers.AlbertForMaskedLM` (ALBERT model) @@ -698,6 +700,7 @@ class AutoModelWithLMHead: The `from_pretrained()` method takes care of returning the correct model class instance based on the `model_type` property of the config object, or when it's missing, falling back to using pattern matching on the `pretrained_model_name_or_path` string: + - `t5`: :class:`~transformers.T5ForConditionalGeneration` (T5 model) - `distilbert`: :class:`~transformers.DistilBertForMaskedLM` (DistilBERT model) - `albert`: :class:`~transformers.AlbertForMaskedLM` (ALBERT model) @@ -845,6 +848,7 @@ class AutoModelForCausalLM: The `from_pretrained()` method takes care of returning the correct model class instance based on the `model_type` property of the config object, or when it's missing, falling back to using pattern matching on the `pretrained_model_name_or_path` string: + - `bert`: :class:`~transformers.BertLMHeadModel` (Bert model) - `openai-gpt`: :class:`~transformers.OpenAIGPTLMHeadModel` (OpenAI GPT model) - `gpt2`: :class:`~transformers.GPT2LMHeadModel` (OpenAI GPT-2 model) @@ -982,6 +986,7 @@ class AutoModelForMaskedLM: The `from_pretrained()` method takes care of returning the correct model class instance based on the `model_type` property of the config object, or when it's missing, falling back to using pattern matching on the `pretrained_model_name_or_path` string: + - `distilbert`: :class:`~transformers.DistilBertForMaskedLM` (DistilBERT model) - `albert`: :class:`~transformers.AlbertForMaskedLM` (ALBERT model) - `camembert`: 
:class:`~transformers.CamembertForMaskedLM` (CamemBERT model) @@ -1118,6 +1123,7 @@ class AutoModelForSeq2SeqLM: The `from_pretrained()` method takes care of returning the correct model class instance based on the `model_type` property of the config object, or when it's missing, falling back to using pattern matching on the `pretrained_model_name_or_path` string: + - `t5`: :class:`~transformers.T5ForConditionalGeneration` (T5 model) - `bart`: :class:`~transformers.BartForConditionalGeneration` (Bert model) - `marian`: :class:`~transformers.MarianMTModel` (Marian model) @@ -1256,6 +1262,7 @@ class AutoModelForSequenceClassification: The `from_pretrained()` method takes care of returning the correct model class instance based on the `model_type` property of the config object, or when it's missing, falling back to using pattern matching on the `pretrained_model_name_or_path` string: + - `distilbert`: :class:`~transformers.DistilBertForSequenceClassification` (DistilBERT model) - `albert`: :class:`~transformers.AlbertForSequenceClassification` (ALBERT model) - `camembert`: :class:`~transformers.CamembertForSequenceClassification` (CamemBERT model) @@ -1402,6 +1409,7 @@ class AutoModelForQuestionAnswering: The `from_pretrained()` method takes care of returning the correct model class instance based on the `model_type` property of the config object, or when it's missing, falling back to using pattern matching on the `pretrained_model_name_or_path` string: + - `distilbert`: :class:`~transformers.DistilBertForQuestionAnswering` (DistilBERT model) - `albert`: :class:`~transformers.AlbertForQuestionAnswering` (ALBERT model) - `bert`: :class:`~transformers.BertForQuestionAnswering` (Bert model) @@ -1547,6 +1555,7 @@ class AutoModelForTokenClassification: The `from_pretrained()` method takes care of returning the correct model class instance based on the `model_type` property of the config object, or when it's missing, falling back to using pattern matching on the 
`pretrained_model_name_or_path` string: + - `distilbert`: :class:`~transformers.DistilBertForTokenClassification` (DistilBERT model) - `xlm`: :class:`~transformers.XLMForTokenClassification` (XLM model) - `xlm-roberta`: :class:`~transformers.XLMRobertaForTokenClassification` (XLM-RoBERTa?Para model) diff --git a/src/transformers/modeling_electra.py b/src/transformers/modeling_electra.py index d4aa2eb50f..e95b02b808 100644 --- a/src/transformers/modeling_electra.py +++ b/src/transformers/modeling_electra.py @@ -745,9 +745,10 @@ class ElectraForTokenClassification(ElectraPreTrainedModel): @add_start_docstrings( - """ELECTRA Model with a span classification head on top for extractive question-answering tasks like SQuAD (a linear layers on top of - the hidden-states output to compute `span start logits` and `span end logits`). """, - ELECTRA_INPUTS_DOCSTRING, + """ + ELECTRA Model with a span classification head on top for extractive question-answering tasks like SQuAD (a linear + layers on top of the hidden-states output to compute `span start logits` and `span end logits`).""", + ELECTRA_START_DOCSTRING, ) class ElectraForQuestionAnswering(ElectraPreTrainedModel): config_class = ElectraConfig diff --git a/src/transformers/modeling_longformer.py b/src/transformers/modeling_longformer.py index 7d70c311c3..fc8a064d42 100644 --- a/src/transformers/modeling_longformer.py +++ b/src/transformers/modeling_longformer.py @@ -435,7 +435,7 @@ class LongformerSelfAttention(nn.Module): LONGFORMER_START_DOCSTRING = r""" - This model is a PyTorch `torch.nn.Module `_ sub-class. + This model is a PyTorch `torch.nn.Module `__ sub-class. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage and behavior. @@ -467,7 +467,7 @@ LONGFORMER_INPUTS_DOCSTRING = r""" Tokens with global attention attends to all other tokens, and all other tokens attend to them. 
This is important for task-specific finetuning because it makes the model more flexible at representing the task. For example, for classification, the <s> token should be given global attention. For QA, all question tokens should also have - global attention. Please refer to the Longformer paper https://arxiv.org/abs/2004.05150 for more details. + global attention. Please refer to the `Longformer paper <https://arxiv.org/abs/2004.05150>`__ for more details. Mask values selected in ``[0, 1]``: ``0`` for local attention (a sliding window attention), ``1`` for global attention (tokens that attend to all other tokens, and all other tokens attend to them). @@ -500,7 +500,7 @@ class LongformerModel(RobertaModel): """ This class overrides :class:`~transformers.RobertaModel` to provide the ability to process long sequences following the selfattention approach described in `Longformer: the Long-Document Transformer - <https://arxiv.org/abs/2004.05150>`_ by Iz Beltagy, Matthew E. Peters, and Arman Cohan. Longformer selfattention + <https://arxiv.org/abs/2004.05150>`__ by Iz Beltagy, Matthew E. Peters, and Arman Cohan. Longformer selfattention combines a local (sliding window) and global attention to extend to long documents without the O(n^2) increase in memory and compute. diff --git a/src/transformers/modeling_reformer.py b/src/transformers/modeling_reformer.py index b1e1df54e5..ca50d4eb8c 100644 --- a/src/transformers/modeling_reformer.py +++ b/src/transformers/modeling_reformer.py @@ -1451,14 +1451,10 @@ class ReformerPreTrainedModel(PreTrainedModel): REFORMER_START_DOCSTRING = r""" - Reformer was proposed in - `Reformer: The Efficient Transformer`_ + Reformer was proposed in `Reformer: The Efficient Transformer <https://arxiv.org/abs/2001.04451>`__ by Nikita Kitaev, Łukasz Kaiser, Anselm Levskaya. - .. _`Reformer: The Efficient Transformer`: - https://arxiv.org/abs/2001.04451 - - This model is a PyTorch `torch.nn.Module `_ sub-class. + This model is a PyTorch `torch.nn.Module `__ sub-class.
Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage and behavior. diff --git a/src/transformers/modeling_t5.py b/src/transformers/modeling_t5.py index 76ce73d4af..a2080b4932 100644 --- a/src/transformers/modeling_t5.py +++ b/src/transformers/modeling_t5.py @@ -775,19 +775,14 @@ class T5Stack(T5PreTrainedModel): return outputs # last-layer hidden state, (presents,) (all hidden states), (all attentions) -T5_START_DOCSTRING = r""" The T5 model was proposed in - `Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer`_ - by Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, Peter J. Liu. +T5_START_DOCSTRING = r""" + The T5 model was proposed in `Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer + <https://arxiv.org/abs/1910.10683>`__ by Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, + Michael Matena, Yanqi Zhou, Wei Li, Peter J. Liu. It's an encoder decoder transformer pre-trained in a text-to-text denoising generative setting. - This model is a PyTorch `torch.nn.Module`_ sub-class. Use it as a regular PyTorch Module and - refer to the PyTorch documentation for all matter related to general usage and behavior. - - .. _`Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer`: - https://arxiv.org/abs/1910.10683 - - .. _`torch.nn.Module`: - https://pytorch.org/docs/stable/nn.html#module + This model is a PyTorch `torch.nn.Module `__ sub-class. Use it as a + regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage and behavior. Parameters: config (:class:`~transformers.T5Config`): Model configuration class with all the parameters of the model. @@ -804,7 +799,7 @@ T5_INPUTS_DOCSTRING = r""" See :func:`transformers.PreTrainedTokenizer.encode` and :func:`transformers.PreTrainedTokenizer.convert_tokens_to_ids` for details.
To know more on how to prepare :obj:`input_ids` for pre-training take a look at - `T5 Training <./t5.html#training>`_ . + `T5 Training <./t5.html#training>`__. attention_mask (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`): Mask to avoid performing attention on padding token indices. Mask values selected in ``[0, 1]``: @@ -817,7 +812,7 @@ T5_INPUTS_DOCSTRING = r""" Provide for sequence to sequence training. T5 uses the pad_token_id as the starting token for decoder_input_ids generation. If `decoder_past_key_value_states` is used, optionally only the last `decoder_input_ids` have to be input (see `decoder_past_key_value_states`). To know more on how to prepare :obj:`decoder_input_ids` for pre-training take a look at - `T5 Training <./t5.html#training>`_ . + `T5 Training <./t5.html#training>`__. decoder_attention_mask (:obj:`torch.BoolTensor` of shape :obj:`(batch_size, tgt_seq_len)`, `optional`, defaults to :obj:`None`): Default behavior: generate a tensor that ignores pad tokens in decoder_input_ids. Causal mask will also be used by default. decoder_past_key_value_states (:obj:`tuple(tuple(torch.FloatTensor))` of length :obj:`config.n_layers` with each tuple having 4 tensors of shape :obj:`(batch_size, num_heads, sequence_length - 1, embed_size_per_head)`): @@ -902,8 +897,8 @@ class T5Model(T5PreTrainedModel): output_attentions=None, ): r""" - Return: - :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers.T5Config`) and inputs. + Returns: + :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers.T5Config`) and inputs: last_hidden_state (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`): Sequence of hidden-states at the output of the last layer of the model. 
If `decoder_past_key_value_states` is used only the last hidden-state of the sequences of shape :obj:`(batch_size, 1, hidden_size)` is output. @@ -925,13 +920,13 @@ class T5Model(T5PreTrainedModel): Examples:: - from transformers import T5Tokenizer, T5Model + from transformers import T5Tokenizer, T5Model - tokenizer = T5Tokenizer.from_pretrained('t5-small') - model = T5Model.from_pretrained('t5-small') - input_ids = tokenizer.encode("Hello, my dog is cute", return_tensors="pt") # Batch size 1 - outputs = model(input_ids=input_ids, decoder_input_ids=input_ids) - last_hidden_states = outputs[0] # The last hidden-state is the first element of the output tuple + tokenizer = T5Tokenizer.from_pretrained('t5-small') + model = T5Model.from_pretrained('t5-small') + input_ids = tokenizer.encode("Hello, my dog is cute", return_tensors="pt") # Batch size 1 + outputs = model(input_ids=input_ids, decoder_input_ids=input_ids) + last_hidden_states = outputs[0] # The last hidden-state is the first element of the output tuple """ @@ -1030,15 +1025,15 @@ class T5ForConditionalGeneration(T5PreTrainedModel): ): r""" labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`, defaults to :obj:`None`): - Labels for computing the sequence classification/regression loss. - Indices should be in :obj:`[-100, 0, ..., config.vocab_size - 1]`. - All labels set to ``-100`` are ignored (masked), the loss is only - computed for labels in ``[0, ..., config.vocab_size]`` + Labels for computing the sequence classification/regression loss. + Indices should be in :obj:`[-100, 0, ..., config.vocab_size - 1]`. + All labels set to ``-100`` are ignored (masked), the loss is only + computed for labels in ``[0, ..., config.vocab_size]`` kwargs (:obj:`Dict[str, any]`, optional, defaults to `{}`): - Used to hide legacy arguments that have been deprecated. + Used to hide legacy arguments that have been deprecated. 
Returns: - :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers.T5Config`) and inputs. + :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers.T5Config`) and inputs: loss (:obj:`torch.FloatTensor` of shape :obj:`(1,)`, `optional`, returned when :obj:`labels` is provided): Classification loss (cross entropy). prediction_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, config.vocab_size)`) diff --git a/src/transformers/modeling_tf_albert.py b/src/transformers/modeling_tf_albert.py index 58ece306f5..7fe3a4c2b9 100644 --- a/src/transformers/modeling_tf_albert.py +++ b/src/transformers/modeling_tf_albert.py @@ -705,38 +705,38 @@ class TFAlbertModel(TFAlbertPreTrainedModel): @add_start_docstrings_to_callable(ALBERT_INPUTS_DOCSTRING.format("(batch_size, sequence_length)")) def call(self, inputs, **kwargs): r""" - Returns: - :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers.AlbertConfig`) and inputs: - last_hidden_state (:obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`): - Sequence of hidden-states at the output of the last layer of the model. - pooler_output (:obj:`tf.Tensor` of shape :obj:`(batch_size, hidden_size)`): - Last layer hidden-state of the first token of the sequence (classification token) - further processed by a Linear layer and a Tanh activation function. The Linear - layer weights are trained from the next sentence prediction (classification) - objective during Albert pretraining. This output is usually *not* a good summary - of the semantic content of the input, you're often better with averaging or pooling - the sequence of hidden-states for the whole input sequence. 
- hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when :obj:`config.output_hidden_states=True`): - tuple of :obj:`tf.Tensor` (one for the output of the embeddings + one for the output of each layer) - of shape :obj:`(batch_size, sequence_length, hidden_size)`. + Returns: + :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers.AlbertConfig`) and inputs: + last_hidden_state (:obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`): + Sequence of hidden-states at the output of the last layer of the model. + pooler_output (:obj:`tf.Tensor` of shape :obj:`(batch_size, hidden_size)`): + Last layer hidden-state of the first token of the sequence (classification token) + further processed by a Linear layer and a Tanh activation function. The Linear + layer weights are trained from the next sentence prediction (classification) + objective during Albert pretraining. This output is usually *not* a good summary + of the semantic content of the input, you're often better with averaging or pooling + the sequence of hidden-states for the whole input sequence. + hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when :obj:`config.output_hidden_states=True`): + tuple of :obj:`tf.Tensor` (one for the output of the embeddings + one for the output of each layer) + of shape :obj:`(batch_size, sequence_length, hidden_size)`. - Hidden-states of the model at the output of each layer plus the initial embedding outputs. - attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``output_attentions=True`` is passed or ``config.output_attentions=True``): - tuple of :obj:`tf.Tensor` (one for each layer) of shape - :obj:`(batch_size, num_heads, sequence_length, sequence_length)`: + Hidden-states of the model at the output of each layer plus the initial embedding outputs. 
+ attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``output_attentions=True`` is passed or ``config.output_attentions=True``): + tuple of :obj:`tf.Tensor` (one for each layer) of shape + :obj:`(batch_size, num_heads, sequence_length, sequence_length)`: - Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads. + Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads. - Examples:: + Examples:: - import tensorflow as tf - from transformers import AlbertTokenizer, TFAlbertModel + import tensorflow as tf + from transformers import AlbertTokenizer, TFAlbertModel - tokenizer = AlbertTokenizer.from_pretrained('albert-base-v2') - model = TFAlbertModel.from_pretrained('albert-base-v2') - input_ids = tf.constant(tokenizer.encode("Hello, my dog is cute"))[None, :] # Batch size 1 - outputs = model(input_ids) - last_hidden_states = outputs[0] # The last hidden-state is the first element of the output tuple + tokenizer = AlbertTokenizer.from_pretrained('albert-base-v2') + model = TFAlbertModel.from_pretrained('albert-base-v2') + input_ids = tf.constant(tokenizer.encode("Hello, my dog is cute"))[None, :] # Batch size 1 + outputs = model(input_ids) + last_hidden_states = outputs[0] # The last hidden-state is the first element of the output tuple """ outputs = self.albert(inputs, **kwargs) diff --git a/src/transformers/modeling_tf_electra.py b/src/transformers/modeling_tf_electra.py index ecb8db56a8..d29770de41 100644 --- a/src/transformers/modeling_tf_electra.py +++ b/src/transformers/modeling_tf_electra.py @@ -408,12 +408,11 @@ class TFElectraModel(TFElectraPreTrainedModel): @add_start_docstrings( - """ -Electra model with a binary classification head on top as used during pre-training for identifying generated -tokens. + """Electra model with a binary classification head on top as used during pre-training for identifying generated + tokens. 
-Even though both the discriminator and generator may be loaded into this model, the discriminator is -the only model of the two to have the correct classification head to be used for this model.""", + Even though both the discriminator and generator may be loaded into this model, the discriminator is + the only model of the two to have the correct classification head to be used for this model.""", ELECTRA_START_DOCSTRING, ) class TFElectraForPreTraining(TFElectraPreTrainedModel): @@ -501,11 +500,10 @@ class TFElectraMaskedLMHead(tf.keras.layers.Layer): @add_start_docstrings( - """ -Electra model with a language modeling head on top. + """Electra model with a language modeling head on top. -Even though both the discriminator and generator may be loaded into this model, the generator is -the only model of the two to have been trained for the masked language modeling task.""", + Even though both the discriminator and generator may be loaded into this model, the generator is + the only model of the two to have been trained for the masked language modeling task.""", ELECTRA_START_DOCSTRING, ) class TFElectraForMaskedLM(TFElectraPreTrainedModel): @@ -588,10 +586,9 @@ class TFElectraForMaskedLM(TFElectraPreTrainedModel): @add_start_docstrings( - """ -Electra model with a token classification head on top. + """Electra model with a token classification head on top. 
-Both the discriminator and generator may be loaded into this model.""", + Both the discriminator and generator may be loaded into this model.""", ELECTRA_START_DOCSTRING, ) class TFElectraForTokenClassification(TFElectraPreTrainedModel, TFTokenClassificationLoss): diff --git a/src/transformers/modeling_tf_t5.py b/src/transformers/modeling_tf_t5.py index 515ee74265..55d3e4ed39 100644 --- a/src/transformers/modeling_tf_t5.py +++ b/src/transformers/modeling_tf_t5.py @@ -772,19 +772,15 @@ class TFT5PreTrainedModel(TFPreTrainedModel): return dummy_inputs -T5_START_DOCSTRING = r""" The T5 model was proposed in - `Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer`_ - by Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, Peter J. Liu. +T5_START_DOCSTRING = r""" + The T5 model was proposed in `Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer + <https://arxiv.org/abs/1910.10683>`__ by Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, + Michael Matena, Yanqi Zhou, Wei Li, Peter J. Liu. It's an encoder decoder transformer pre-trained in a text-to-text denoising generative setting. - This model is a tf.keras.Model `tf.keras.Model`_ sub-class. Use it as a regular TF 2.0 Keras Model and - refer to the TF 2.0 documentation for all matter related to general usage and behavior. - - .. _`Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer`: - https://arxiv.org/abs/1910.10683 - - .. _`tf.keras.Model`: - https://www.tensorflow.org/versions/r2.0/api_docs/python/tf/keras/Model + This model is a `tf.keras.Model `__ + sub-class. Use it as a regular TF 2.0 Keras Model and refer to the TF 2.0 documentation for all matter related to + general usage and behavior.
Note on the model inputs: TF 2.0 models accepts two formats as inputs: @@ -796,7 +792,7 @@ T5_START_DOCSTRING = r""" The T5 model was proposed in If you choose this second option, there are three possibilities you can use to gather all the input Tensors in the first positional argument : - - a single Tensor with inputs only and nothing else: `model(inputs_ids) + - a single Tensor with inputs only and nothing else: `model(inputs_ids)` - a list of varying length with one or several input Tensors IN THE ORDER given in the docstring: `model([inputs, attention_mask])` or `model([inputs, attention_mask, token_type_ids])` - a dictionary with one or several input Tensors associaed to the input names given in the docstring: @@ -818,7 +814,7 @@ T5_INPUTS_DOCSTRING = r""" the right or the left. Indices can be obtained using :class:`transformers.T5Tokenizer`. To know more on how to prepare :obj:`inputs` for pre-training take a look at - `T5 Training <./t5.html#training>`_ . + `T5 Training <./t5.html#training>`__. See :func:`transformers.PreTrainedTokenizer.encode` and :func:`transformers.PreTrainedTokenizer.convert_tokens_to_ids` for details. decoder_input_ids (:obj:`tf.Tensor` of shape :obj:`(batch_size, target_sequence_length)`, `optional`, defaults to :obj:`None`): @@ -850,7 +846,7 @@ T5_INPUTS_DOCSTRING = r""" This is useful if you want more control over how to convert `decoder_input_ids` indices into associated vectors than the model's internal embedding lookup matrix. To know more on how to prepare :obj:`decoder_input_ids` for pre-training take a look at - `T5 Training <./t5.html#training>`_ . + `T5 Training <./t5.html#training>`__. head_mask: (:obj:`tf.Tensor` of shape :obj:`(num_heads,)` or :obj:`(num_layers, num_heads)`, `optional`, defaults to :obj:`None`): Mask to nullify selected heads of the self-attention modules. 
Mask values selected in ``[0, 1]``: @@ -897,8 +893,8 @@ class TFT5Model(TFT5PreTrainedModel): @add_start_docstrings_to_callable(T5_INPUTS_DOCSTRING) def call(self, inputs, **kwargs): r""" - Return: - :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers.T5Config`) and inputs. + Returns: + :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers.T5Config`) and inputs: last_hidden_state (:obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`): Sequence of hidden-states at the output of the last layer of the model. If `decoder_past_key_value_states` is used only the last hidden-state of the sequences of shape :obj:`(batch_size, 1, hidden_size)` is output. @@ -1024,8 +1020,8 @@ class TFT5ForConditionalGeneration(TFT5PreTrainedModel): @add_start_docstrings_to_callable(T5_INPUTS_DOCSTRING) def call(self, inputs, **kwargs): r""" - Return: - :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers.T5Config`) and inputs. + Returns: + :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers.T5Config`) and inputs: loss (:obj:`tf.Tensor` of shape :obj:`(1,)`, `optional`, returned when :obj:`lm_label` is provided): Classification loss (cross entropy). prediction_scores (:obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length, config.vocab_size)`) diff --git a/src/transformers/modeling_tf_utils.py b/src/transformers/modeling_tf_utils.py index b49e132bee..b16973738f 100644 --- a/src/transformers/modeling_tf_utils.py +++ b/src/transformers/modeling_tf_utils.py @@ -294,7 +294,6 @@ class TFPreTrainedModel(tf.keras.Model, TFModelUtilsMixin): Parameters: pretrained_model_name_or_path: either: - - a string with the `shortcut name` of a pre-trained model to load from cache or download, e.g.: ``bert-base-uncased``. 
- a string with the `identifier name` of a pre-trained model that was user-uploaded to our S3, e.g.: ``dbmdz/bert-base-german-cased``. - a path to a `directory` containing model weights saved using :func:`~transformers.PreTrainedModel.save_pretrained`, e.g.: ``./my_model_directory/``. @@ -306,11 +305,11 @@ class TFPreTrainedModel(tf.keras.Model, TFModelUtilsMixin): config: (`optional`) one of: - an instance of a class derived from :class:`~transformers.PretrainedConfig`, or - a string valid as input to :func:`~transformers.PretrainedConfig.from_pretrained()` - Configuration for the model to use instead of an automatically loaded configuation. Configuration can be automatically loaded when: - - the model is a model provided by the library (loaded with the ``shortcut-name`` string of a pretrained model), or - - the model was saved using :func:`~transformers.PreTrainedModel.save_pretrained` and is reloaded by suppling the save directory. - - the model is loaded by suppling a local directory as ``pretrained_model_name_or_path`` and a configuration JSON file named `config.json` is found in the directory. + Configuration for the model to use instead of an automatically loaded configuration. Configuration can be automatically loaded when: + - the model is a model provided by the library (loaded with the ``shortcut-name`` string of a pretrained model), or + - the model was saved using :func:`~transformers.PreTrainedModel.save_pretrained` and is reloaded by supplying the save directory. + - the model is loaded by supplying a local directory as ``pretrained_model_name_or_path`` and a configuration JSON file named `config.json` is found in the directory. from_pt: (`optional`) boolean, default False: Load the model weights from a PyTorch state_dict save file (see docstring of pretrained_model_name_or_path argument).
diff --git a/src/transformers/modeling_utils.py b/src/transformers/modeling_utils.py index 3f84bb4bad..3711e3724e 100644 --- a/src/transformers/modeling_utils.py +++ b/src/transformers/modeling_utils.py @@ -530,6 +530,7 @@ class PreTrainedModel(nn.Module, ModuleUtilsMixin): config: (`optional`) one of: - an instance of a class derived from :class:`~transformers.PretrainedConfig`, or - a string valid as input to :func:`~transformers.PretrainedConfig.from_pretrained()` + Configuration for the model to use instead of an automatically loaded configuation. Configuration can be automatically loaded when: - the model is a model provided by the library (loaded with the ``shortcut-name`` string of a pretrained model), or - the model was saved using :func:`~transformers.PreTrainedModel.save_pretrained` and is reloaded by suppling the save directory. diff --git a/src/transformers/optimization_tf.py b/src/transformers/optimization_tf.py index 47b5370f0c..65d41f29d9 100644 --- a/src/transformers/optimization_tf.py +++ b/src/transformers/optimization_tf.py @@ -97,13 +97,13 @@ def create_optimizer( class AdamWeightDecay(tf.keras.optimizers.Adam): """Adam enables L2 weight decay and clip_by_global_norm on gradients. - Just adding the square of the weights to the loss function is *not* the - correct way of using L2 regularization/weight decay with Adam, since that will - interact with the m and v parameters in strange ways. - Instead we want ot decay the weights in a manner that doesn't interact with - the m/v parameters. This is equivalent to adding the square of the weights to - the loss with plain (non-momentum) SGD. - """ + Just adding the square of the weights to the loss function is *not* the + correct way of using L2 regularization/weight decay with Adam, since that will + interact with the m and v parameters in strange ways. + Instead we want to decay the weights in a manner that doesn't interact with + the m/v parameters.
This is equivalent to adding the square of the weights to + the loss with plain (non-momentum) SGD. + """ def __init__( self, @@ -198,11 +198,11 @@ class AdamWeightDecay(tf.keras.optimizers.Adam): # Extracted from https://github.com/OpenNMT/OpenNMT-tf/blob/master/opennmt/optimizers/utils.py class GradientAccumulator(object): """Gradient accumulation utility. - When used with a distribution strategy, the accumulator should be called in a - replica context. Gradients will be accumulated locally on each replica and - without synchronization. Users should then call ``.gradients``, scale the - gradients if required, and pass the result to ``apply_gradients``. - """ + When used with a distribution strategy, the accumulator should be called in a + replica context. Gradients will be accumulated locally on each replica and + without synchronization. Users should then call ``.gradients``, scale the + gradients if required, and pass the result to ``apply_gradients``. + """ # We use the ON_READ synchronization policy so that no synchronization is # performed on assignment. To get the value, we call .value() which returns the diff --git a/src/transformers/pipelines.py b/src/transformers/pipelines.py index bbc626e0bc..7c7c4d2afc 100755 --- a/src/transformers/pipelines.py +++ b/src/transformers/pipelines.py @@ -323,6 +323,7 @@ class Pipeline(_ScikitCompat): Base class implementing pipelined operations. Pipeline workflow is defined as a sequence of the following operations: + Input -> Tokenization -> Model Inference -> Post-Processing (Task dependent) -> Output Pipeline supports running on CPU or GPU through the device argument. 
Users can specify diff --git a/src/transformers/tokenization_auto.py b/src/transformers/tokenization_auto.py index cdd85982fe..308136970f 100644 --- a/src/transformers/tokenization_auto.py +++ b/src/transformers/tokenization_auto.py @@ -103,6 +103,7 @@ class AutoTokenizer: The `from_pretrained()` method takes care of returning the correct tokenizer class instance based on the `model_type` property of the config object, or when it's missing, falling back to using pattern matching on the `pretrained_model_name_or_path` string: + - `t5`: T5Tokenizer (T5 model) - `distilbert`: DistilBertTokenizer (DistilBert model) - `albert`: AlbertTokenizer (ALBERT model) @@ -136,6 +137,7 @@ class AutoTokenizer: The tokenizer class to instantiate is selected based on the `model_type` property of the config object, or when it's missing, falling back to using pattern matching on the `pretrained_model_name_or_path` string: + - `t5`: T5Tokenizer (T5 model) - `distilbert`: DistilBertTokenizer (DistilBert model) - `albert`: AlbertTokenizer (ALBERT model) diff --git a/src/transformers/tokenization_utils_base.py b/src/transformers/tokenization_utils_base.py index 64e3634372..28e8354c88 100644 --- a/src/transformers/tokenization_utils_base.py +++ b/src/transformers/tokenization_utils_base.py @@ -1408,7 +1408,7 @@ class PreTrainedTokenizerBase(SpecialTokensMixin): The sequence or batch of sequences to be encoded. Each sequence can be a string or a list of strings (pre-tokenized string). If the sequences are provided as list of strings (pretokenized), you must set `is_pretokenized=True` - (to lift the ambiguity with a batch of sequences) + (to lift the ambiguity with a batch of sequences) text_pair (:obj:`str`, :obj:`List[str]`, :obj:`List[List[str]]``): The sequence or batch of sequences to be encoded. Each sequence can be a string or a list of strings (pre-tokenized string).