Doc styling (#8067)
* Important files * Styling them all * Revert "Styling them all" This reverts commit 7d029395fdae8513b8281cbc2a6c239f8093503e. * Syling them for realsies * Fix syntax error * Fix benchmark_utils * More fixes * Fix modeling auto and script * Remove new line * Fixes * More fixes * Fix more files * Style * Add FSMT * More fixes * More fixes * More fixes * More fixes * Fixes * More fixes * More fixes * Last fixes * Make sphinx happy
This commit is contained in:
@@ -12,34 +12,28 @@ identify which tokens were replaced by the generator in the sequence.
|
||||
|
||||
The abstract from the paper is the following:
|
||||
|
||||
*Masked language modeling (MLM) pre-training methods such as BERT corrupt
|
||||
the input by replacing some tokens with [MASK] and then train a model to
|
||||
reconstruct the original tokens. While they produce good results when transferred
|
||||
to downstream NLP tasks, they generally require large amounts of compute to be
|
||||
effective. As an alternative, we propose a more sample-efficient pre-training task
|
||||
called replaced token detection. Instead of masking the input, our approach
|
||||
corrupts it by replacing some tokens with plausible alternatives sampled from a small
|
||||
generator network. Then, instead of training a model that predicts the original
|
||||
identities of the corrupted tokens, we train a discriminative model that predicts
|
||||
whether each token in the corrupted input was replaced by a generator sample
|
||||
or not. Thorough experiments demonstrate this new pre-training task is more
|
||||
efficient than MLM because the task is defined over all input tokens rather than
|
||||
just the small subset that was masked out. As a result, the contextual representations
|
||||
learned by our approach substantially outperform the ones learned by BERT
|
||||
given the same model size, data, and compute. The gains are particularly strong
|
||||
for small models; for example, we train a model on one GPU for 4 days that
|
||||
outperforms GPT (trained using 30x more compute) on the GLUE natural language
|
||||
understanding benchmark. Our approach also works well at scale, where it
|
||||
performs comparably to RoBERTa and XLNet while using less than 1/4 of their
|
||||
compute and outperforms them when using the same amount of compute.*
|
||||
*Masked language modeling (MLM) pre-training methods such as BERT corrupt the input by replacing some tokens with
|
||||
[MASK] and then train a model to reconstruct the original tokens. While they produce good results when transferred to
|
||||
downstream NLP tasks, they generally require large amounts of compute to be effective. As an alternative, we propose a
|
||||
more sample-efficient pre-training task called replaced token detection. Instead of masking the input, our approach
|
||||
corrupts it by replacing some tokens with plausible alternatives sampled from a small generator network. Then, instead
|
||||
of training a model that predicts the original identities of the corrupted tokens, we train a discriminative model that
|
||||
predicts whether each token in the corrupted input was replaced by a generator sample or not. Thorough experiments
|
||||
demonstrate this new pre-training task is more efficient than MLM because the task is defined over all input tokens
|
||||
rather than just the small subset that was masked out. As a result, the contextual representations learned by our
|
||||
approach substantially outperform the ones learned by BERT given the same model size, data, and compute. The gains are
|
||||
particularly strong for small models; for example, we train a model on one GPU for 4 days that outperforms GPT (trained
|
||||
using 30x more compute) on the GLUE natural language understanding benchmark. Our approach also works well at scale,
|
||||
where it performs comparably to RoBERTa and XLNet while using less than 1/4 of their compute and outperforms them when
|
||||
using the same amount of compute.*
|
||||
|
||||
Tips:
|
||||
|
||||
- ELECTRA is the pretraining approach, therefore there is nearly no changes done to the underlying model: BERT. The
|
||||
only change is the separation of the embedding size and the hidden size: the embedding size is generally smaller,
|
||||
while the hidden size is larger. An additional projection layer (linear) is used to project the embeddings from
|
||||
their embedding size to the hidden size. In the case where the embedding size is the same as the hidden size, no
|
||||
projection layer is used.
|
||||
while the hidden size is larger. An additional projection layer (linear) is used to project the embeddings from their
|
||||
embedding size to the hidden size. In the case where the embedding size is the same as the hidden size, no projection
|
||||
layer is used.
|
||||
- The ELECTRA checkpoints saved using `Google Research's implementation <https://github.com/google-research/electra>`__
|
||||
contain both the generator and discriminator. The conversion script requires the user to name which model to export
|
||||
into the correct architecture. Once converted to the HuggingFace format, these checkpoints may be loaded into all
|
||||
|
||||
Reference in New Issue
Block a user