Doc styling (#8067)
* Important files * Styling them all * Revert "Styling them all" This reverts commit 7d029395fdae8513b8281cbc2a6c239f8093503e. * Syling them for realsies * Fix syntax error * Fix benchmark_utils * More fixes * Fix modeling auto and script * Remove new line * Fixes * More fixes * Fix more files * Style * Add FSMT * More fixes * More fixes * More fixes * More fixes * Fixes * More fixes * More fixes * Last fixes * Make sphinx happy
This commit is contained in:
@@ -19,14 +19,14 @@ downstream tasks. However, at some point further model increases become harder d
|
||||
longer training times, and unexpected model degradation. To address these problems, we present two parameter-reduction
|
||||
techniques to lower memory consumption and increase the training speed of BERT. Comprehensive empirical evidence shows
|
||||
that our proposed methods lead to models that scale much better compared to the original BERT. We also use a
|
||||
self-supervised loss that focuses on modeling inter-sentence coherence, and show it consistently helps downstream
|
||||
tasks with multi-sentence inputs. As a result, our best model establishes new state-of-the-art results on the GLUE,
|
||||
RACE, and SQuAD benchmarks while having fewer parameters compared to BERT-large.*
|
||||
self-supervised loss that focuses on modeling inter-sentence coherence, and show it consistently helps downstream tasks
|
||||
with multi-sentence inputs. As a result, our best model establishes new state-of-the-art results on the GLUE, RACE, and
|
||||
SQuAD benchmarks while having fewer parameters compared to BERT-large.*
|
||||
|
||||
Tips:
|
||||
|
||||
- ALBERT is a model with absolute position embeddings so it's usually advised to pad the inputs on
|
||||
the right rather than the left.
|
||||
- ALBERT is a model with absolute position embeddings so it's usually advised to pad the inputs on the right rather
|
||||
than the left.
|
||||
- ALBERT uses repeating layers which results in a small memory footprint, however the computational cost remains
|
||||
similar to a BERT-like architecture with the same number of hidden layers as it has to iterate through the same
|
||||
number of (repeating) layers.
|
||||
|
||||
Reference in New Issue
Block a user