diff --git a/docs/source/index.rst b/docs/source/index.rst
index ada3fc1656..ded234354d 100644
--- a/docs/source/index.rst
+++ b/docs/source/index.rst
@@ -38,32 +38,32 @@ Pytorch-Transformers
 This repository contains op-for-op PyTorch reimplementations, pre-trained models and fine-tuning examples for:
 
 
-* `Google's BERT model <https://github.com/google-research/bert>`_\ ,
-* `OpenAI's GPT model <https://github.com/openai/finetune-transformer-lm>`_\ ,
-* `Google/CMU's Transformer-XL model <https://github.com/kimiyoung/transformer-xl>`_\ , and
-* `OpenAI's GPT-2 model <https://blog.openai.com/better-language-models/>`_.
+* `Google's BERT model <https://github.com/google-research/bert>`__\ ,
+* `OpenAI's GPT model <https://github.com/openai/finetune-transformer-lm>`__\ ,
+* `Google/CMU's Transformer-XL model <https://github.com/kimiyoung/transformer-xl>`__\ , and
+* `OpenAI's GPT-2 model <https://blog.openai.com/better-language-models/>`__.
 
-These implementations have been tested on several datasets (see the examples) and should match the performances of the associated TensorFlow implementations (e.g. ~91 F1 on SQuAD for BERT, ~88 F1 on RocStories for OpenAI GPT and ~18.3 perplexity on WikiText 103 for the Transformer-XL). You can find more details in the `Examples <./examples.html>`_ section.
+These implementations have been tested on several datasets (see the examples) and should match the performance of the associated TensorFlow implementations (e.g. ~91 F1 on SQuAD for BERT, ~88 F1 on RocStories for OpenAI GPT and ~18.3 perplexity on WikiText 103 for the Transformer-XL). You can find more details in the `Examples <./examples.html>`__ section.
 
 Here are some information on these models:
 
-**BERT** was released together with the paper `BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding <https://arxiv.org/abs/1810.04805>`_ by Jacob Devlin, Ming-Wei Chang, Kenton Lee and Kristina Toutanova.
-This PyTorch implementation of BERT is provided with `Google's pre-trained models <https://github.com/google-research/bert>`_\ , examples, notebooks and a command-line interface to load any pre-trained TensorFlow checkpoint for BERT is also provided.
+**BERT** was released together with the paper `BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding <https://arxiv.org/abs/1810.04805>`__ by Jacob Devlin, Ming-Wei Chang, Kenton Lee and Kristina Toutanova.
+This PyTorch implementation of BERT is provided with `Google's pre-trained models <https://github.com/google-research/bert>`__\ , examples and notebooks. A command-line interface to load any pre-trained TensorFlow checkpoint for BERT is also provided.
 
-**OpenAI GPT** was released together with the paper `Improving Language Understanding by Generative Pre-Training <https://blog.openai.com/language-unsupervised/>`_ by Alec Radford, Karthik Narasimhan, Tim Salimans and Ilya Sutskever.
-This PyTorch implementation of OpenAI GPT is an adaptation of the `PyTorch implementation by HuggingFace <https://github.com/huggingface/pytorch-openai-transformer-lm>`_ and is provided with `OpenAI's pre-trained model <https://github.com/openai/finetune-transformer-lm>`__ and a command-line interface that was used to convert the pre-trained NumPy checkpoint in PyTorch.
+**OpenAI GPT** was released together with the paper `Improving Language Understanding by Generative Pre-Training <https://blog.openai.com/language-unsupervised/>`__ by Alec Radford, Karthik Narasimhan, Tim Salimans and Ilya Sutskever.
+This PyTorch implementation of OpenAI GPT is an adaptation of the `PyTorch implementation by HuggingFace <https://github.com/huggingface/pytorch-openai-transformer-lm>`__ and is provided with `OpenAI's pre-trained model <https://github.com/openai/finetune-transformer-lm>`__ and a command-line interface that was used to convert the pre-trained NumPy checkpoint to PyTorch.
 
-**Google/CMU's Transformer-XL** was released together with the paper `Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context <http://arxiv.org/abs/1901.02860>`_ by Zihang Dai\*, Zhilin Yang\* , Yiming Yang, Jaime Carbonell, Quoc V. Le, Ruslan Salakhutdinov.
-This PyTorch implementation of Transformer-XL is an adaptation of the original `PyTorch implementation <https://github.com/kimiyoung/transformer-xl>`_ which has been slightly modified to match the performances of the TensorFlow implementation and allow to re-use the pretrained weights. A command-line interface is provided to convert TensorFlow checkpoints in PyTorch models.
+**Google/CMU's Transformer-XL** was released together with the paper `Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context <http://arxiv.org/abs/1901.02860>`__ by Zihang Dai\*, Zhilin Yang\* , Yiming Yang, Jaime Carbonell, Quoc V. Le, Ruslan Salakhutdinov.
+This PyTorch implementation of Transformer-XL is an adaptation of the original `PyTorch implementation <https://github.com/kimiyoung/transformer-xl>`__ which has been slightly modified to match the performance of the TensorFlow implementation and to allow re-use of the pretrained weights. A command-line interface is provided to convert TensorFlow checkpoints into PyTorch models.
 
-**OpenAI GPT-2** was released together with the paper `Language Models are Unsupervised Multitask Learners <https://blog.openai.com/better-language-models/>`_ by Alec Radford\*, Jeffrey Wu\* , Rewon Child, David Luan, Dario Amodei\*\* and Ilya Sutskever\*\*.
-This PyTorch implementation of OpenAI GPT-2 is an adaptation of the `OpenAI's implementation <https://github.com/openai/gpt-2>`_ and is provided with `OpenAI's pre-trained model <https://github.com/openai/gpt-2>`__ and a command-line interface that was used to convert the TensorFlow checkpoint in PyTorch.
+**OpenAI GPT-2** was released together with the paper `Language Models are Unsupervised Multitask Learners <https://blog.openai.com/better-language-models/>`__ by Alec Radford\*, Jeffrey Wu\* , Rewon Child, David Luan, Dario Amodei\*\* and Ilya Sutskever\*\*.
+This PyTorch implementation of OpenAI GPT-2 is an adaptation of `OpenAI's implementation <https://github.com/openai/gpt-2>`__ and is provided with `OpenAI's pre-trained model <https://github.com/openai/gpt-2>`__ and a command-line interface that was used to convert the TensorFlow checkpoint to PyTorch.
 
-**Facebook Research's XLM** was released together with the paper `Cross-lingual Language Model Pretraining <https://arxiv.org/abs/1901.07291>`_ by Guillaume Lample and Alexis Conneau.
-This PyTorch implementation of XLM is an adaptation of the original `PyTorch implementation <https://github.com/facebookresearch/XLM>`_. TODO Lysandre filled
+**Facebook Research's XLM** was released together with the paper `Cross-lingual Language Model Pretraining <https://arxiv.org/abs/1901.07291>`__ by Guillaume Lample and Alexis Conneau.
+This PyTorch implementation of XLM is an adaptation of the original `PyTorch implementation <https://github.com/facebookresearch/XLM>`__. TODO Lysandre filled
 
-**Google's XLNet** was released together with the paper `XLNet: Generalized Autoregressive Pretraining for Language Understanding <https://arxiv.org/abs/1906.08237>`_ by Zhilin Yang\*, Zihang Dai\*, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov and Quoc V. Le.
-This PyTorch implementation of XLM is an adaptation of the `Tensorflow implementation <https://github.com/zihangdai/xlnet>`_. TODO Lysandre filled
+**Google's XLNet** was released together with the paper `XLNet: Generalized Autoregressive Pretraining for Language Understanding <https://arxiv.org/abs/1906.08237>`__ by Zhilin Yang\*, Zihang Dai\*, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov and Quoc V. Le.
+This PyTorch implementation of XLNet is an adaptation of the `TensorFlow implementation <https://github.com/zihangdai/xlnet>`__. TODO Lysandre filled
 
 
 Content
@@ -74,25 +74,25 @@ Content
 
    * - Section
      - Description
-   * - `Installation <./installation.html>`_
+   * - `Installation <./installation.html>`__
      - How to install the package
-   * - `Philosphy <./philosophy.html>`_
+   * - `Philosophy <./philosophy.html>`__
      - The philosophy behind this package
-   * - `Usage <./usage.html>`_
+   * - `Usage <./usage.html>`__
      - Quickstart examples
-   * - `Examples <./examples.html>`_
+   * - `Examples <./examples.html>`__
      - Detailed examples on how to fine-tune Bert
-   * - `Notebooks <./notebooks.html>`_
+   * - `Notebooks <./notebooks.html>`__
      - Introduction on the provided Jupyter Notebooks
-   * - `TPU <./tpu.html>`_
+   * - `TPU <./tpu.html>`__
      - Notes on TPU support and pretraining scripts
-   * - `Command-line interface <./cli.html>`_
+   * - `Command-line interface <./cli.html>`__
      - Convert a TensorFlow checkpoint in a PyTorch dump
-   * - `Migration <./migration.html>`_
+   * - `Migration <./migration.html>`__
      - Migrating from ``pytorch_pretrained_BERT`` (v0.6) to ``pytorch_transformers`` (v1.0)
-   * - `Bertology <./bertology.html>`_
+   * - `Bertology <./bertology.html>`__
      - TODO Lysandre didn't know how to fill
-   * - `TorchScript <./torchscript.html>`_
+   * - `TorchScript <./torchscript.html>`__
      - Convert a model to TorchScript for use in other programming languages
 
 .. list-table::
@@ -100,19 +100,19 @@ Content
 
    * - Section
      - Description
-   * - `Overview <./model_doc/overview.html>`_
+   * - `Overview <./model_doc/overview.html>`__
      - Overview of the package
-   * - `BERT <./model_doc/bert.html>`_
+   * - `BERT <./model_doc/bert.html>`__
      - BERT Models, Tokenizers and optimizers
-   * - `OpenAI GPT <./model_doc/gpt.html>`_
+   * - `OpenAI GPT <./model_doc/gpt.html>`__
      - GPT Models, Tokenizers and optimizers
-   * - `TransformerXL <./model_doc/transformerxl.html>`_
+   * - `TransformerXL <./model_doc/transformerxl.html>`__
      - TransformerXL Models, Tokenizers and optimizers
-   * - `OpenAI GPT2 <./model_doc/gpt2.html>`_
+   * - `OpenAI GPT2 <./model_doc/gpt2.html>`__
      - GPT2 Models, Tokenizers and optimizers
-   * - `XLM <./model_doc/xlm.html>`_
+   * - `XLM <./model_doc/xlm.html>`__
      - XLM Models, Tokenizers and optimizers
-   * - `XLNet <./model_doc/xlnet.html>`_
+   * - `XLNet <./model_doc/xlnet.html>`__
      - XLNet Models, Tokenizers and optimizers
 
 TODO Lysandre filled: might need an introduction for both parts. Is it even necessary, since there is a summary? Up to you Thom.
@@ -120,68 +120,68 @@ TODO Lysandre filled: might need an introduction for both parts. Is it even nece
 Overview
 --------
 
-This package comprises the following classes that can be imported in Python and are detailed in the `documentation <./model_doc/overview.html>`_ section of this package:
+This package comprises the following classes that can be imported in Python and are detailed in the `documentation <./model_doc/overview.html>`__ section of this package:
 
 
 *
-  Eight **Bert** PyTorch models (\ ``torch.nn.Module``\ ) with pre-trained weights (in the `modeling_bert.py <./_modules/pytorch_transformers/modeling_bert.html>`_ file):
+  Eight **Bert** PyTorch models (\ ``torch.nn.Module``\ ) with pre-trained weights (in the `modeling_bert.py <./_modules/pytorch_transformers/modeling_bert.html>`__ file):
 
 
-  * `BertModel <./model_doc/bert.html#pytorch_transformers.BertModel>`_ - raw BERT Transformer model (\ **fully pre-trained**\ ),
-  * `BertForMaskedLM <./model_doc/bert.html#pytorch_transformers.BertForMaskedLM>`_ - BERT Transformer with the pre-trained masked language modeling head on top (\ **fully pre-trained**\ ),
-  * `BertForNextSentencePrediction <./model_doc/bert.html#pytorch_transformers.BertForNextSentencePrediction>`_ - BERT Transformer with the pre-trained next sentence prediction classifier on top  (\ **fully pre-trained**\ ),
-  * `BertForPreTraining <./model_doc/bert.html#pytorch_transformers.BertForPreTraining>`_ - BERT Transformer with masked language modeling head and next sentence prediction classifier on top (\ **fully pre-trained**\ ),
-  * `BertForSequenceClassification <./model_doc/bert.html#pytorch_transformers.BertForSequenceClassification>`_ - BERT Transformer with a sequence classification head on top (BERT Transformer is **pre-trained**\ , the sequence classification head **is only initialized and has to be trained**\ ),
-  * `BertForMultipleChoice <./model_doc/bert.html#pytorch_transformers.BertForMultipleChoice>`_ - BERT Transformer with a multiple choice head on top (used for task like Swag) (BERT Transformer is **pre-trained**\ , the multiple choice classification head **is only initialized and has to be trained**\ ),
-  * `BertForTokenClassification <./model_doc/bert.html#pytorch_transformers.BertForTokenClassification>`_ - BERT Transformer with a token classification head on top (BERT Transformer is **pre-trained**\ , the token classification head **is only initialized and has to be trained**\ ),
-  * `BertForQuestionAnswering <./model_doc/bert.html#pytorch_transformers.BertForQuestionAnswering>`_ - BERT Transformer with a token classification head on top (BERT Transformer is **pre-trained**\ , the token classification head **is only initialized and has to be trained**\ ).
+  * `BertModel <./model_doc/bert.html#pytorch_transformers.BertModel>`__ - raw BERT Transformer model (\ **fully pre-trained**\ ),
+  * `BertForMaskedLM <./model_doc/bert.html#pytorch_transformers.BertForMaskedLM>`__ - BERT Transformer with the pre-trained masked language modeling head on top (\ **fully pre-trained**\ ),
+  * `BertForNextSentencePrediction <./model_doc/bert.html#pytorch_transformers.BertForNextSentencePrediction>`__ - BERT Transformer with the pre-trained next sentence prediction classifier on top  (\ **fully pre-trained**\ ),
+  * `BertForPreTraining <./model_doc/bert.html#pytorch_transformers.BertForPreTraining>`__ - BERT Transformer with masked language modeling head and next sentence prediction classifier on top (\ **fully pre-trained**\ ),
+  * `BertForSequenceClassification <./model_doc/bert.html#pytorch_transformers.BertForSequenceClassification>`__ - BERT Transformer with a sequence classification head on top (BERT Transformer is **pre-trained**\ , the sequence classification head **is only initialized and has to be trained**\ ),
+  * `BertForMultipleChoice <./model_doc/bert.html#pytorch_transformers.BertForMultipleChoice>`__ - BERT Transformer with a multiple choice head on top (used for tasks like Swag) (BERT Transformer is **pre-trained**\ , the multiple choice classification head **is only initialized and has to be trained**\ ),
+  * `BertForTokenClassification <./model_doc/bert.html#pytorch_transformers.BertForTokenClassification>`__ - BERT Transformer with a token classification head on top (BERT Transformer is **pre-trained**\ , the token classification head **is only initialized and has to be trained**\ ),
+  * `BertForQuestionAnswering <./model_doc/bert.html#pytorch_transformers.BertForQuestionAnswering>`__ - BERT Transformer with a token classification head on top (BERT Transformer is **pre-trained**\ , the token classification head **is only initialized and has to be trained**\ ).
 
 *
-  Three **OpenAI GPT** PyTorch models (\ ``torch.nn.Module``\ ) with pre-trained weights (in the `modeling_openai.py <./_modules/pytorch_transformers/modeling_openai.html>`_ file):
+  Three **OpenAI GPT** PyTorch models (\ ``torch.nn.Module``\ ) with pre-trained weights (in the `modeling_openai.py <./_modules/pytorch_transformers/modeling_openai.html>`__ file):
 
 
-  * `OpenAIGPTModel <./model_doc/gpt.html#pytorch_transformers.OpenAIGPTModel>`_ - raw OpenAI GPT Transformer model (\ **fully pre-trained**\ ),
-  * `OpenAIGPTLMHeadModel <./model_doc/gpt.html#pytorch_transformers.OpenAIGPTLMHeadModel>`_ - OpenAI GPT Transformer with the tied language modeling head on top (\ **fully pre-trained**\ ),
-  * `OpenAIGPTDoubleHeadsModel <./model_doc/gpt.html#pytorch_transformers.OpenAIGPTDoubleHeadsModel>`_ - OpenAI GPT Transformer with the tied language modeling head and a multiple choice classification head on top (OpenAI GPT Transformer is **pre-trained**\ , the multiple choice classification head **is only initialized and has to be trained**\ ),
+  * `OpenAIGPTModel <./model_doc/gpt.html#pytorch_transformers.OpenAIGPTModel>`__ - raw OpenAI GPT Transformer model (\ **fully pre-trained**\ ),
+  * `OpenAIGPTLMHeadModel <./model_doc/gpt.html#pytorch_transformers.OpenAIGPTLMHeadModel>`__ - OpenAI GPT Transformer with the tied language modeling head on top (\ **fully pre-trained**\ ),
+  * `OpenAIGPTDoubleHeadsModel <./model_doc/gpt.html#pytorch_transformers.OpenAIGPTDoubleHeadsModel>`__ - OpenAI GPT Transformer with the tied language modeling head and a multiple choice classification head on top (OpenAI GPT Transformer is **pre-trained**\ , the multiple choice classification head **is only initialized and has to be trained**\ ),
 
 *
-  Two **Transformer-XL** PyTorch models (\ ``torch.nn.Module``\ ) with pre-trained weights (in the `modeling_transfo_xl.py <./_modules/pytorch_transformers/modeling_transfo_xl.html>`_ file):
+  Two **Transformer-XL** PyTorch models (\ ``torch.nn.Module``\ ) with pre-trained weights (in the `modeling_transfo_xl.py <./_modules/pytorch_transformers/modeling_transfo_xl.html>`__ file):
 
 
-  * `TransfoXLModel <./model_doc/transformerxl.html#pytorch_transformers.TransfoXLModel>`_ - Transformer-XL model which outputs the last hidden state and memory cells (\ **fully pre-trained**\ ),
-  * `TransfoXLLMHeadModel <./model_doc/transformerxl.html#pytorch_transformers.TransfoXLLMHeadModel>`_ - Transformer-XL with the tied adaptive softmax head on top for language modeling which outputs the logits/loss and memory cells (\ **fully pre-trained**\ ),
+  * `TransfoXLModel <./model_doc/transformerxl.html#pytorch_transformers.TransfoXLModel>`__ - Transformer-XL model which outputs the last hidden state and memory cells (\ **fully pre-trained**\ ),
+  * `TransfoXLLMHeadModel <./model_doc/transformerxl.html#pytorch_transformers.TransfoXLLMHeadModel>`__ - Transformer-XL with the tied adaptive softmax head on top for language modeling which outputs the logits/loss and memory cells (\ **fully pre-trained**\ ),
 
 *
-  Three **OpenAI GPT-2** PyTorch models (\ ``torch.nn.Module``\ ) with pre-trained weights (in the `modeling_gpt2.py <./_modules/pytorch_transformers/modeling_gpt2.html>`_ file):
+  Three **OpenAI GPT-2** PyTorch models (\ ``torch.nn.Module``\ ) with pre-trained weights (in the `modeling_gpt2.py <./_modules/pytorch_transformers/modeling_gpt2.html>`__ file):
 
 
-  * `GPT2Model <./model_doc/gpt2.html#pytorch_transformers.GPT2Model>`_ - raw OpenAI GPT-2 Transformer model (\ **fully pre-trained**\ ),
-  * `GPT2LMHeadModel <./model_doc/gpt2.html#pytorch_transformers.GPT2LMHeadModel>`_ - OpenAI GPT-2 Transformer with the tied language modeling head on top (\ **fully pre-trained**\ ),
-  * `GPT2DoubleHeadsModel <./model_doc/gpt2.html#pytorch_transformers.GPT2DoubleHeadsModel>`_ - OpenAI GPT-2 Transformer with the tied language modeling head and a multiple choice classification head on top (OpenAI GPT-2 Transformer is **pre-trained**\ , the multiple choice classification head **is only initialized and has to be trained**\ ),
+  * `GPT2Model <./model_doc/gpt2.html#pytorch_transformers.GPT2Model>`__ - raw OpenAI GPT-2 Transformer model (\ **fully pre-trained**\ ),
+  * `GPT2LMHeadModel <./model_doc/gpt2.html#pytorch_transformers.GPT2LMHeadModel>`__ - OpenAI GPT-2 Transformer with the tied language modeling head on top (\ **fully pre-trained**\ ),
+  * `GPT2DoubleHeadsModel <./model_doc/gpt2.html#pytorch_transformers.GPT2DoubleHeadsModel>`__ - OpenAI GPT-2 Transformer with the tied language modeling head and a multiple choice classification head on top (OpenAI GPT-2 Transformer is **pre-trained**\ , the multiple choice classification head **is only initialized and has to be trained**\ ),
 
 *
-  Four **XLM** PyTorch models (\ ``torch.nn.Module``\ ) with pre-trained weights (in the `modeling_xlm.py <./_modules/pytorch_transformers/modeling_xlm.html>`_ file):
+  Four **XLM** PyTorch models (\ ``torch.nn.Module``\ ) with pre-trained weights (in the `modeling_xlm.py <./_modules/pytorch_transformers/modeling_xlm.html>`__ file):
 
 
-  * `XLMModel <./model_doc/xlm.html#pytorch_transformers.XLMModel>`_ - raw XLM Transformer model (\ **fully pre-trained**\ ),
-  * `XLMWithLMHeadModel <./model_doc/xlm.html#pytorch_transformers.XLMWithLMHeadModel>`_ - XLM Transformer with the tied language modeling head on top (\ **fully pre-trained**\ ),
-  * `XLMForSequenceClassification <./model_doc/xlm.html#pytorch_transformers.XLMForSequenceClassification>`_ - XLM Transformer with a sequence classification head on top (XLM Transformer is **pre-trained**\ , the sequence classification head **is only initialized and has to be trained**\ ),
-  * `XLMForQuestionAnswering <./model_doc/xlm.html#pytorch_transformers.XLMForQuestionAnswering>`_ - XLM Transformer with a token classification head on top (XLM Transformer is **pre-trained**\ , the token classification head **is only initialized and has to be trained**\ )
+  * `XLMModel <./model_doc/xlm.html#pytorch_transformers.XLMModel>`__ - raw XLM Transformer model (\ **fully pre-trained**\ ),
+  * `XLMWithLMHeadModel <./model_doc/xlm.html#pytorch_transformers.XLMWithLMHeadModel>`__ - XLM Transformer with the tied language modeling head on top (\ **fully pre-trained**\ ),
+  * `XLMForSequenceClassification <./model_doc/xlm.html#pytorch_transformers.XLMForSequenceClassification>`__ - XLM Transformer with a sequence classification head on top (XLM Transformer is **pre-trained**\ , the sequence classification head **is only initialized and has to be trained**\ ),
+  * `XLMForQuestionAnswering <./model_doc/xlm.html#pytorch_transformers.XLMForQuestionAnswering>`__ - XLM Transformer with a token classification head on top (XLM Transformer is **pre-trained**\ , the token classification head **is only initialized and has to be trained**\ )
 
 *
-  Four **XLNet** PyTorch models (\ ``torch.nn.Module``\ ) with pre-trained weights (in the `modeling_xlnet.py <./_modules/pytorch_transformers/modeling_xlnet.html>`_ file):
+  Four **XLNet** PyTorch models (\ ``torch.nn.Module``\ ) with pre-trained weights (in the `modeling_xlnet.py <./_modules/pytorch_transformers/modeling_xlnet.html>`__ file):
 
 
-  * `XLNetModel <./model_doc/xlnet.html#pytorch_transformers.XLNetModel>`_ - raw XLNet Transformer model (\ **fully pre-trained**\ ),
-  * `XLNetLMHeadModel <./model_doc/xlnet.html#pytorch_transformers.XLNetLMHeadModel>`_ - XLNet Transformer with the tied language modeling head on top (\ **fully pre-trained**\ ),
-  * `XLNetForSequenceClassification <./model_doc/xlnet.html#pytorch_transformers.XLNetForSequenceClassification>`_ - XLNet Transformer with a sequence classification head on top (XLM Transformer is **pre-trained**\ , the sequence classification head **is only initialized and has to be trained**\ ),
-  * `XLNetForQuestionAnswering <./model_doc/xlnet.html#pytorch_transformers.XLNetForQuestionAnswering>`_ - XLNet Transformer with a token classification head on top (XLNet Transformer is **pre-trained**\ , the token classification head **is only initialized and has to be trained**\ )
+  * `XLNetModel <./model_doc/xlnet.html#pytorch_transformers.XLNetModel>`__ - raw XLNet Transformer model (\ **fully pre-trained**\ ),
+  * `XLNetLMHeadModel <./model_doc/xlnet.html#pytorch_transformers.XLNetLMHeadModel>`__ - XLNet Transformer with the tied language modeling head on top (\ **fully pre-trained**\ ),
+  * `XLNetForSequenceClassification <./model_doc/xlnet.html#pytorch_transformers.XLNetForSequenceClassification>`__ - XLNet Transformer with a sequence classification head on top (XLNet Transformer is **pre-trained**\ , the sequence classification head **is only initialized and has to be trained**\ ),
+  * `XLNetForQuestionAnswering <./model_doc/xlnet.html#pytorch_transformers.XLNetForQuestionAnswering>`__ - XLNet Transformer with a token classification head on top (XLNet Transformer is **pre-trained**\ , the token classification head **is only initialized and has to be trained**\ )
 
 
 TODO Lysandre filled: I filled in XLM and XLNet. I didn't do the Tokenizers because I don't know the current philosophy behind them.
 
 *
-  Tokenizers for **BERT** (using word-piece) (in the `tokenization_bert.py <./_modules/pytorch_transformers/tokenization_bert.html>`_ file):
+  Tokenizers for **BERT** (using word-piece) (in the `tokenization_bert.py <./_modules/pytorch_transformers/tokenization_bert.html>`__ file):
 
   * ``BasicTokenizer`` - basic tokenization (punctuation splitting, lower casing, etc.),
   * ``WordpieceTokenizer`` - WordPiece tokenization,
@@ -189,44 +189,44 @@ TODO Lysandre filled: I filled in XLM and XLNet. I didn't do the Tokenizers beca
 
 
 *
-  Tokenizer for **OpenAI GPT** (using Byte-Pair-Encoding) (in the `tokenization_openai.py <./_modules/pytorch_transformers/tokenization_openai.html>`_ file):
+  Tokenizer for **OpenAI GPT** (using Byte-Pair-Encoding) (in the `tokenization_openai.py <./_modules/pytorch_transformers/tokenization_openai.html>`__ file):
 
   * ``OpenAIGPTTokenizer`` - perform Byte-Pair-Encoding (BPE) tokenization.
 
 
 *
-  Tokenizer for **OpenAI GPT-2** (using byte-level Byte-Pair-Encoding) (in the `tokenization_gpt2.py <./_modules/pytorch_transformers/tokenization_gpt2.html>`_ file):
+  Tokenizer for **OpenAI GPT-2** (using byte-level Byte-Pair-Encoding) (in the `tokenization_gpt2.py <./_modules/pytorch_transformers/tokenization_gpt2.html>`__ file):
 
   * ``GPT2Tokenizer`` - perform byte-level Byte-Pair-Encoding (BPE) tokenization.
 
 
 *
-  Tokenizer for **Transformer-XL** (word tokens ordered by frequency for adaptive softmax) (in the `tokenization_transfo_xl.py <./_modules/pytorch_transformers/tokenization_transfo_xl.html>`_ file):
+  Tokenizer for **Transformer-XL** (word tokens ordered by frequency for adaptive softmax) (in the `tokenization_transfo_xl.py <./_modules/pytorch_transformers/tokenization_transfo_xl.html>`__ file):
 
   * ``OpenAIGPTTokenizer`` - perform word tokenization and can order words by frequency in a corpus for use in an adaptive softmax.
 
 
 *
-  Tokenizer for **XLNet** (SentencePiece based tokenizer) (in the `tokenization_xlnet.py <./_modules/pytorch_transformers/tokenization_xlnet.html>`_ file):
+  Tokenizer for **XLNet** (SentencePiece based tokenizer) (in the `tokenization_xlnet.py <./_modules/pytorch_transformers/tokenization_xlnet.html>`__ file):
 
   * ``XLNetTokenizer`` - perform SentencePiece tokenization.
 
 
 *
-  Tokenizer for **XLM** (using Byte-Pair-Encoding) (in the `tokenization_xlm.py <./_modules/pytorch_transformers/tokenization_xlm.html>`_ file):
+  Tokenizer for **XLM** (using Byte-Pair-Encoding) (in the `tokenization_xlm.py <./_modules/pytorch_transformers/tokenization_xlm.html>`__ file):
 
   * ``GPT2Tokenizer`` - perform Byte-Pair-Encoding (BPE) tokenization.
 
 
 *
-  Optimizer for **BERT** (in the `optimization.py <./_modules/pytorch_transformers/optimization.html>`_ file):
+  Optimizer for **BERT** (in the `optimization.py <./_modules/pytorch_transformers/optimization.html>`__ file):
 
 
   * ``BertAdam`` - Bert version of Adam algorithm with weight decay fix, warmup and linear decay of the learning rate.
 
 
 *
-  Optimizer for **OpenAI GPT** (in the `optimization_openai.py <./_modules/pytorch_transformers/optimization_openai.html>`_ file):
+  Optimizer for **OpenAI GPT** (in the `optimization_openai.py <./_modules/pytorch_transformers/optimization_openai.html>`__ file):
 
 
   * ``OpenAIAdam`` - OpenAI GPT version of Adam algorithm with weight decay fix, warmup and linear decay of the learning rate.
@@ -234,11 +234,11 @@ TODO Lysandre filled: I filled in XLM and XLNet. I didn't do the Tokenizers beca
 
 *
   Configuration classes for BERT, OpenAI GPT, Transformer-XL, XLM and XLNet (in the respective \
-  `modeling_bert.py <./_modules/pytorch_transformers/modeling_bert.html>`_\ , \
-  `modeling_openai.py <./_modules/pytorch_transformers/modeling_openai.html>`_\ , \
-  `modeling_transfo_xl.py <./_modules/pytorch_transformers/modeling_transfo_xl.html>`_, \
-  `modeling_xlm.py <./_modules/pytorch_transformers/modeling_xlm.html>`_, \
-  `modeling_xlnet.py <./_modules/pytorch_transformers/modeling_xlnet.html>`_ \
+  `modeling_bert.py <./_modules/pytorch_transformers/modeling_bert.html>`__\ , \
+  `modeling_openai.py <./_modules/pytorch_transformers/modeling_openai.html>`__\ , \
+  `modeling_transfo_xl.py <./_modules/pytorch_transformers/modeling_transfo_xl.html>`__, \
+  `modeling_xlm.py <./_modules/pytorch_transformers/modeling_xlm.html>`__, \
+  `modeling_xlnet.py <./_modules/pytorch_transformers/modeling_xlnet.html>`__ \
   files):
 
 
@@ -253,47 +253,47 @@ The repository further comprises:
 
 
 *
-  Five examples on how to use **BERT** (in the `examples folder <https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/examples>`_\ ):
+  Five examples on how to use **BERT** (in the `examples folder <https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/examples>`__\ ):
 
 
-  * `run_bert_extract_features.py <https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/examples/run_bert_extract_features.py>`_ - Show how to extract hidden states from an instance of ``BertModel``\ ,
-  * `run_bert_classifier.py <https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/examples/run_bert_classifier.py>`_ - Show how to fine-tune an instance of ``BertForSequenceClassification`` on GLUE's MRPC task,
-  * `run_bert_squad.py <https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/examples/run_bert_squad.py>`_ - Show how to fine-tune an instance of ``BertForQuestionAnswering`` on SQuAD v1.0 and SQuAD v2.0 tasks.
-  * `run_swag.py <https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/examples/run_swag.py>`_ - Show how to fine-tune an instance of ``BertForMultipleChoice`` on Swag task.
-  * `simple_lm_finetuning.py <https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/examples/lm_finetuning/simple_lm_finetuning.py>`_ - Show how to fine-tune an instance of ``BertForPretraining`` on a target text corpus.
+  * `run_bert_extract_features.py <https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/examples/run_bert_extract_features.py>`__ - Show how to extract hidden states from an instance of ``BertModel``\ ,
+  * `run_bert_classifier.py <https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/examples/run_bert_classifier.py>`__ - Show how to fine-tune an instance of ``BertForSequenceClassification`` on GLUE's MRPC task,
+  * `run_bert_squad.py <https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/examples/run_bert_squad.py>`__ - Show how to fine-tune an instance of ``BertForQuestionAnswering`` on SQuAD v1.0 and SQuAD v2.0 tasks.
+  * `run_swag.py <https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/examples/run_swag.py>`__ - Show how to fine-tune an instance of ``BertForMultipleChoice`` on the Swag task.
+  * `simple_lm_finetuning.py <https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/examples/lm_finetuning/simple_lm_finetuning.py>`__ - Show how to fine-tune an instance of ``BertForPreTraining`` on a target text corpus.
 
 *
-  One example on how to use **OpenAI GPT** (in the `examples folder <https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/examples>`_\ ):
+  One example on how to use **OpenAI GPT** (in the `examples folder <https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/examples>`__\ ):
 
 
-  * `run_openai_gpt.py <https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/examples/run_openai_gpt.py>`_ - Show how to fine-tune an instance of ``OpenGPTDoubleHeadsModel`` on the RocStories task.
+  * `run_openai_gpt.py <https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/examples/run_openai_gpt.py>`__ - Show how to fine-tune an instance of ``OpenAIGPTDoubleHeadsModel`` on the RocStories task.
 
 *
-  One example on how to use **Transformer-XL** (in the `examples folder <https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/examples>`_\ ):
+  One example on how to use **Transformer-XL** (in the `examples folder <https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/examples>`__\ ):
 
 
-  * `run_transfo_xl.py <https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/examples/run_transfo_xl.py>`_ - Show how to load and evaluate a pre-trained model of ``TransfoXLLMHeadModel`` on WikiText 103.
+  * `run_transfo_xl.py <https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/examples/run_transfo_xl.py>`__ - Show how to load and evaluate a pre-trained model of ``TransfoXLLMHeadModel`` on WikiText 103.
 
 *
-  One example on how to use **OpenAI GPT-2** in the unconditional and interactive mode (in the `examples folder <https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/examples>`_\ ):
+  One example on how to use **OpenAI GPT-2** in the unconditional and interactive mode (in the `examples folder <https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/examples>`__\ ):
 
 
-  * `run_gpt2.py <https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/examples/run_gpt2.py>`_ - Show how to use OpenAI GPT-2 an instance of ``GPT2LMHeadModel`` to generate text (same as the original OpenAI GPT-2 examples).
+  * `run_gpt2.py <https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/examples/run_gpt2.py>`__ - Show how to use an instance of ``GPT2LMHeadModel`` to generate text with OpenAI GPT-2 (same as the original OpenAI GPT-2 examples).
 
-  These examples are detailed in the `Examples <#examples>`_ section of this readme.
+  These examples are detailed in the `Examples <#examples>`__ section of this readme.
 
 *
-  Three notebooks that were used to check that the TensorFlow and PyTorch models behave identically (in the `notebooks folder <https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/notebooks>`_\ ):
+  Three notebooks that were used to check that the TensorFlow and PyTorch models behave identically (in the `notebooks folder <https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/notebooks>`__\ ):
 
 
-  * `Comparing-TF-and-PT-models.ipynb <https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/notebooks/Comparing-TF-and-PT-models.ipynb>`_ - Compare the hidden states predicted by ``BertModel``\ ,
-  * `Comparing-TF-and-PT-models-SQuAD.ipynb <https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/notebooks/Comparing-TF-and-PT-models-SQuAD.ipynb>`_ - Compare the spans predicted by  ``BertForQuestionAnswering`` instances,
-  * `Comparing-TF-and-PT-models-MLM-NSP.ipynb <https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/notebooks/Comparing-TF-and-PT-models-MLM-NSP.ipynb>`_ - Compare the predictions of the ``BertForPretraining`` instances.
+  * `Comparing-TF-and-PT-models.ipynb <https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/notebooks/Comparing-TF-and-PT-models.ipynb>`__ - Compare the hidden states predicted by ``BertModel``\ ,
+  * `Comparing-TF-and-PT-models-SQuAD.ipynb <https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/notebooks/Comparing-TF-and-PT-models-SQuAD.ipynb>`__ - Compare the spans predicted by ``BertForQuestionAnswering`` instances,
+  * `Comparing-TF-and-PT-models-MLM-NSP.ipynb <https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/notebooks/Comparing-TF-and-PT-models-MLM-NSP.ipynb>`__ - Compare the predictions of the ``BertForPreTraining`` instances.
 
-  These notebooks are detailed in the `Notebooks <#notebooks>`_ section of this readme.
+  These notebooks are detailed in the `Notebooks <#notebooks>`__ section of this readme.
 
 
 *
   A command-line interface to convert TensorFlow checkpoints (BERT, Transformer-XL) or NumPy checkpoint (OpenAI) in a PyTorch save of the associated PyTorch model:
 
-  This CLI is detailed in the `Command-line interface <#Command-line-interface>`_ section of this readme.
+  This CLI is detailed in the `Command-line interface <#Command-line-interface>`__ section of this readme.
diff --git a/docs/source/model_doc/overview.rst b/docs/source/model_doc/overview.rst
index d70fa3beb9..00e538e68d 100644
--- a/docs/source/model_doc/overview.rst
+++ b/docs/source/model_doc/overview.rst
@@ -14,6 +14,8 @@ Here is a detailed documentation of the classes in the package and how to use th
    * - `Serialization best-practices <#serialization-best-practices>`__
      - How to save and reload a fine-tuned model
    * - `Configurations <#configurations>`__
+     - API of the configuration classes for BERT, GPT, GPT-2 and Transformer-XL
+
 
 TODO Lysandre filled: Removed Models/Tokenizers/Optimizers as no single link can be made.
 
diff --git a/docs/source/notebooks.rst b/docs/source/notebooks.rst
index 35d54370ba..592867a862 100644
--- a/docs/source/notebooks.rst
+++ b/docs/source/notebooks.rst
@@ -11,6 +11,6 @@ We include `three Jupyter Notebooks <https://github.com/huggingface/pytorch-pret
   The second NoteBook (\ `Comparing-TF-and-PT-models-SQuAD.ipynb <https://github.com/huggingface/pytorch-pretrained-BERT/tree/master/notebooks/Comparing-TF-and-PT-models-SQuAD.ipynb>`_\ ) compares the loss computed by the TensorFlow and the PyTorch models for identical initialization of the fine-tuning layer of the ``BertForQuestionAnswering`` and computes the standard deviation between them. In the given example, we get a standard deviation of 2.5e-7 between the models.
 
 *
-  The third NoteBook (\ `Comparing-TF-and-PT-models-MLM-NSP.ipynb <https://github.com/huggingface/pytorch-pretrained-BERT/tree/mnotebooks/Comparing-TF-and-PT-models-MLM-NSP.ipynb>`_\ ) compares the predictions computed by the TensorFlow and the PyTorch models for masked token language modeling using the pre-trained masked language modeling model.
+  The third NoteBook (\ `Comparing-TF-and-PT-models-MLM-NSP.ipynb <https://github.com/huggingface/pytorch-pretrained-BERT/tree/master/notebooks/Comparing-TF-and-PT-models-MLM-NSP.ipynb>`_\ ) compares the predictions computed by the TensorFlow and the PyTorch models for masked token language modeling using the pre-trained masked language modeling model.
 
 Please follow the instructions given in the notebooks to run and modify them.