Tips + whitespaces

2020-01-21 15:58:25 -05:00
parent 0e9899f451
commit 9ddf60b694
34 changed files with 452 additions and 369 deletions
--- a/docs/source/model_doc/distilbert.rst
+++ b/docs/source/model_doc/distilbert.rst
@@ -1,86 +1,96 @@
 DistilBERT
 ----------------------------------------------------

-DistilBERT is a small, fast, cheap and light Transformer model
-trained by distilling Bert base. It has 40% less parameters than
-`bert-base-uncased`, runs 60% faster while preserving over 95% of
-Bert's performances as measured on the GLUE language understanding benchmark.
+The DistilBERT model was proposed in the blog post
+`Smaller, faster, cheaper, lighter: Introducing DistilBERT, a distilled version of BERT <https://medium.com/huggingface/distilbert-8cf3380435b5>`__,
+and the paper `DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter <https://arxiv.org/abs/1910.01108>`__.
+DistilBERT is a small, fast, cheap and light Transformer model trained by distilling BERT base. It has 40% fewer
+parameters than `bert-base-uncased` and runs 60% faster while preserving over 95% of BERT's performance as measured on
+the GLUE language understanding benchmark.

-Here are the differences between the interface of Bert and DistilBert:
+The abstract from the paper is the following:
+
+*As Transfer Learning from large-scale pre-trained models becomes more prevalent in Natural Language Processing (NLP),
+operating these large models in on-the-edge and/or under constrained computational training or inference budgets
+remains challenging. In this work, we propose a method to pre-train a smaller general-purpose language representation
+model, called DistilBERT, which can then be fine-tuned with good performances on a wide range of tasks like its larger
+counterparts. While most prior work investigated the use of distillation for building task-specific models, we
+leverage knowledge distillation during the pre-training phase and show that it is possible to reduce the size of a
+BERT model by 40%, while retaining 97% of its language understanding capabilities and being 60% faster. To leverage
+the inductive biases learned by larger models during pre-training, we introduce a triple loss combining language
+modeling, distillation and cosine-distance losses. Our smaller, faster and lighter model is cheaper to pre-train
+and we demonstrate its capabilities for on-device computations in a proof-of-concept experiment and a comparative
+on-device study.*
+
+Tips:

 - DistilBert doesn't have `token_type_ids`, so you don't need to indicate which token belongs to which segment. Just separate your segments with the separation token `tokenizer.sep_token` (or `[SEP]`).
 - DistilBert doesn't have options to select the input positions (`position_ids` input). This could be added if necessary, though; just let us know if you need this option.
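
The first tip can be sketched in plain Python, assuming the segments are joined as raw text before tokenization. `join_segments` is a hypothetical helper written for illustration and is not part of `transformers`; in practice the `[SEP]` default would be replaced by the `tokenizer.sep_token` of a loaded `DistilBertTokenizer`.

```python
from typing import List


def join_segments(segments: List[str], sep_token: str = "[SEP]") -> str:
    """Join text segments with the separator token.

    Hypothetical helper (an assumption, not a library function): because
    DistilBert has no `token_type_ids`, the separator token is the only
    marker of where one segment ends and the next begins.
    """
    # Drop empty segments so no two separators end up side by side.
    cleaned = [s.strip() for s in segments if s.strip()]
    return f" {sep_token} ".join(cleaned)


question = "Who proposed DistilBERT?"
context = "DistilBERT was proposed in a blog post and a paper."

# With a real tokenizer, pass sep_token=tokenizer.sep_token instead of the default.
text = join_segments([question, context])
print(text)
```

The resulting string can then be tokenized as a single input, with no segment ids passed to the model.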

-For more information on DistilBERT, please refer to our
-`detailed blog post`_

-.. _`detailed blog post`:
-    https://medium.com/huggingface/distilbert-8cf3380435b5
-
-
-``DistilBertConfig``
+DistilBertConfig
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

 .. autoclass:: transformers.DistilBertConfig
    :members:


-``DistilBertTokenizer``
+DistilBertTokenizer
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

 .. autoclass:: transformers.DistilBertTokenizer
    :members:


-``DistilBertModel``
+DistilBertModel
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

 .. autoclass:: transformers.DistilBertModel
    :members:


-``DistilBertForMaskedLM``
+DistilBertForMaskedLM
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

 .. autoclass:: transformers.DistilBertForMaskedLM
    :members:


-``DistilBertForSequenceClassification``
+DistilBertForSequenceClassification
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

 .. autoclass:: transformers.DistilBertForSequenceClassification
    :members:


-``DistilBertForQuestionAnswering``
+DistilBertForQuestionAnswering
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

 .. autoclass:: transformers.DistilBertForQuestionAnswering
    :members:

-``TFDistilBertModel``
+TFDistilBertModel
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

 .. autoclass:: transformers.TFDistilBertModel
    :members:


-``TFDistilBertForMaskedLM``
+TFDistilBertForMaskedLM
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

 .. autoclass:: transformers.TFDistilBertForMaskedLM
    :members:


-``TFDistilBertForSequenceClassification``
+TFDistilBertForSequenceClassification
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

 .. autoclass:: transformers.TFDistilBertForSequenceClassification
    :members:


-``TFDistilBertForQuestionAnswering``
+TFDistilBertForQuestionAnswering
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

 .. autoclass:: transformers.TFDistilBertForQuestionAnswering