ALBERT Modeling + required changes to utilities

2020-01-15 14:20:17 -05:00
parent f81b6c95f2
commit 00df3d4de0
4 changed files with 259 additions and 166 deletions
--- a/docs/source/model_doc/albert.rst
+++ b/docs/source/model_doc/albert.rst
@@ -1,63 +1,92 @@
 ALBERT
 ----------------------------------------------------

-``AlbertConfig``
+Overview
+~~~~~~~~~~~~~~~~~~~~~
+
+The ALBERT model was proposed in `ALBERT: A Lite BERT for Self-supervised Learning of Language Representations <https://arxiv.org/abs/1909.11942>`_
+by Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, Radu Soricut. It presents
+two parameter-reduction techniques to lower memory consumption and increase the trainig speed of BERT:
+
+- Splitting the embedding matrix into two smaller matrices
+- Using repeating layers split among groups
+
+The abstract from the paper is the following:
+
+*Increasing model size when pretraining natural language representations often results in improved performance on
+downstream tasks. However, at some point further model increases become harder due to GPU/TPU memory limitations,
+longer training times, and unexpected model degradation. To address these problems, we present two parameter-reduction
+techniques to lower memory consumption and increase the training speed of BERT. Comprehensive empirical evidence shows
+that our proposed methods lead to models that scale much better compared to the original BERT. We also use a
+self-supervised loss that focuses on modeling inter-sentence coherence, and show it consistently helps downstream
+tasks with multi-sentence inputs. As a result, our best model establishes new state-of-the-art results on the GLUE,
+RACE, and SQuAD benchmarks while having fewer parameters compared to BERT-large.*
+
+Tips:
+
+- ALBERT is a model with absolute position embeddings so it's usually advised to pad the inputs on
+  the right rather than the left.
+- ALBERT uses repeating layers which results in a small memory footprint, however the computational cost remains
+  similar to a BERT-like architecture with the same number of hidden layers as it has to iterate through the same
+  number of (repeating) layers.
+
+AlbertConfig
 ~~~~~~~~~~~~~~~~~~~~~

 .. autoclass:: transformers.AlbertConfig
    :members:


-``AlbertTokenizer``
+AlbertTokenizer
 ~~~~~~~~~~~~~~~~~~~~~

 .. autoclass:: transformers.AlbertTokenizer
    :members:


-``AlbertModel``
+AlbertModel
 ~~~~~~~~~~~~~~~~~~~~

 .. autoclass:: transformers.AlbertModel
    :members:


-``AlbertForMaskedLM``
+AlbertForMaskedLM
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

 .. autoclass:: transformers.AlbertForMaskedLM
    :members:


-``AlbertForSequenceClassification``
+AlbertForSequenceClassification
 ~~~~~~~~~~~~~~~~~~~~~~~~~~

 .. autoclass:: transformers.AlbertForSequenceClassification
    :members:


-``AlbertForQuestionAnswering``
+AlbertForQuestionAnswering
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

 .. autoclass:: transformers.AlbertForQuestionAnswering
    :members:


-``TFAlbertModel``
+TFAlbertModel
 ~~~~~~~~~~~~~~~~~~~~

 .. autoclass:: transformers.TFAlbertModel
    :members:


-``TFAlbertForMaskedLM``
+TFAlbertForMaskedLM
 ~~~~~~~~~~~~~~~~~~~~~~~~~~

 .. autoclass:: transformers.TFAlbertForMaskedLM
    :members:


-``TFAlbertForSequenceClassification``
+TFAlbertForSequenceClassification
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

 .. autoclass:: transformers.TFAlbertForSequenceClassification