GPT-2 PyTorch models + better tips for BERT

2020-01-16 17:16:26 -05:00
parent dbeb7fb4e6
commit bd0d3fd76e
3 changed files with 196 additions and 146 deletions
--- a/docs/source/model_doc/bert.rst
+++ b/docs/source/model_doc/bert.rst
@@ -27,7 +27,13 @@ Tips:

 - BERT is a model with absolute position embeddings so it's usually advised to pad the inputs on
  the right rather than the left.
-
+- BERT was trained with a masked language modeling (MLM) objective. It is therefore effective at predicting masked
+  tokens and at NLU in general, but is not optimal for text generation. Models trained with a causal language
+  modeling (CLM) objective are better in that regard.
+- Alongside MLM, BERT was trained with a next sentence prediction (NSP) objective, using the [CLS] token as an
+  approximate representation of the whole sequence. The user may use this token (the first token in a sequence built
+  with special tokens) to get a sequence-level prediction rather than a token-level prediction. However, averaging the
+  token representations over the sequence may yield better results than using the [CLS] token.

 BertConfig
 ~~~~~~~~~~~~~~~~~~~~~
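
The last tip in the hunk above contrasts two ways of getting one vector per sequence: taking the first ([CLS]) position, or averaging over the sequence. A minimal, framework-free sketch of both follows. It assumes right-padded inputs, as the first tip advises. The helper names `cls_pool` and `masked_mean_pool` and the toy hidden states are hypothetical stand-ins for real model output, not part of this commit or of the library's API.

```python
def cls_pool(hidden_states):
    """Return the vector at position 0, where [CLS] sits in a BERT input."""
    return list(hidden_states[0])


def masked_mean_pool(hidden_states, attention_mask):
    """Average token vectors, skipping positions whose mask is 0 (padding)."""
    kept = [vec for vec, keep in zip(hidden_states, attention_mask) if keep]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


# Toy stand-in for one sequence of model output (hypothetical values):
# 4 positions, hidden size 2; the last position is right padding.
hidden_states = [[1.0, 0.0], [3.0, 2.0], [5.0, 4.0], [9.0, 9.0]]
attention_mask = [1, 1, 1, 0]

cls_vector = cls_pool(hidden_states)
mean_vector = masked_mean_pool(hidden_states, attention_mask)
print(cls_vector, mean_vector)
```

The mask matters for the average: without it, the padding vector would be included and would skew the result. Which of the two pooled vectors works better is task-dependent, which is why the tip says only that averaging "may" yield better results.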