Doc styling (#8067)

* Important files
* Styling them all
* Revert "Styling them all"
  This reverts commit 7d029395fdae8513b8281cbc2a6c239f8093503e.
* Syling them for realsies
* Fix syntax error
* Fix benchmark_utils
* More fixes
* Fix modeling auto and script
* Remove new line
* Fixes
* More fixes
* Fix more files
* Style
* Add FSMT
* More fixes
* More fixes
* More fixes
* More fixes
* Fixes
* More fixes
* More fixes
* Last fixes
* Make sphinx happy
@@ -8,8 +8,8 @@ The RoBERTa model was proposed in `RoBERTa: A Robustly Optimized BERT Pretrainin
 <https://arxiv.org/abs/1907.11692>`_ by Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer
 Levy, Mike Lewis, Luke Zettlemoyer, Veselin Stoyanov. It is based on Google's BERT model released in 2018.
 
-It builds on BERT and modifies key hyperparameters, removing the next-sentence pretraining
-objective and training with much larger mini-batches and learning rates.
+It builds on BERT and modifies key hyperparameters, removing the next-sentence pretraining objective and training with
+much larger mini-batches and learning rates.
 
 The abstract from the paper is the following:
 
@@ -17,15 +17,15 @@ The abstract from the paper is the following:
 approaches is challenging. Training is computationally expensive, often done on private datasets of different sizes,
 and, as we will show, hyperparameter choices have significant impact on the final results. We present a replication
 study of BERT pretraining (Devlin et al., 2019) that carefully measures the impact of many key hyperparameters and
-training data size. We find that BERT was significantly undertrained, and can match or exceed the performance of
-every model published after it. Our best model achieves state-of-the-art results on GLUE, RACE and SQuAD. These
-results highlight the importance of previously overlooked design choices, and raise questions about the source
-of recently reported improvements. We release our models and code.*
+training data size. We find that BERT was significantly undertrained, and can match or exceed the performance of every
+model published after it. Our best model achieves state-of-the-art results on GLUE, RACE and SQuAD. These results
+highlight the importance of previously overlooked design choices, and raise questions about the source of recently
+reported improvements. We release our models and code.*
 
 Tips:
 
-- This implementation is the same as :class:`~transformers.BertModel` with a tiny embeddings tweak as well as a
-  setup for Roberta pretrained models.
+- This implementation is the same as :class:`~transformers.BertModel` with a tiny embeddings tweak as well as a setup
+  for Roberta pretrained models.
 - RoBERTa has the same architecture as BERT, but uses a byte-level BPE as a tokenizer (same as GPT-2) and uses a
   different pretraining scheme.
 - RoBERTa doesn't have :obj:`token_type_ids`, you don't need to indicate which token belongs to which segment. Just