diff --git a/docs/source/quickstart.md b/docs/source/quickstart.md
index 530aff8eb0..60e2cf3fd8 100644
--- a/docs/source/quickstart.md
+++ b/docs/source/quickstart.md
@@ -219,4 +219,97 @@ sequence = tokenizer.decode(generated)
 print(sequence)
 ```
 
-The model only requires a single token as input as all the previous tokens' key/value pairs are contained in the `past`.
\ No newline at end of file
+The model only requires a single token as input as all the previous tokens' key/value pairs are contained in the `past`.
+
+### Model2Model example
+
+Encoder-decoder architectures require two tokenized inputs: one for the encoder and the other one for the decoder. Let's assume that we want to use `Model2Model` for generative question answering, and start by tokenizing the question and answer that will be fed to the model.
+
+```python
+import torch
+from transformers import BertTokenizer, Model2Model
+
+# OPTIONAL: if you want to have more information on what's happening under the hood, activate the logger as follows
+import logging
+logging.basicConfig(level=logging.INFO)
+
+# Load pre-trained model tokenizer (vocabulary)
+tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
+
+# Encode the input to the encoder (the question)
+question = "Who was Jim Henson?"
+encoded_question = tokenizer.encode(question)
+
+# Encode the input to the decoder (the answer)
+answer = "Jim Henson was a puppeteer"
+encoded_answer = tokenizer.encode(answer)
+
+# Convert inputs to PyTorch tensors
+question_tensor = torch.tensor([encoded_question])
+answer_tensor = torch.tensor([encoded_answer])
+```
+
+Let's see how we can use `Model2Model` to get the value of the loss associated with this (question, answer) pair:
+
+```python
+# In order to compute the loss we need to provide language model
+# labels (the token ids that the model should have produced) to
+# the decoder.
+lm_labels =  encoded_answer
+labels_tensor = torch.tensor([lm_labels])
+
+# Load pre-trained model (weights)
+model = Model2Model.from_pretrained('bert-base-uncased')
+
+# Set the model in evaluation mode to deactivate the DropOut modules
+# This is IMPORTANT to have reproducible results during evaluation!
+model.eval()
+
+# If you have a GPU, put everything on cuda
+question_tensor = question_tensor.to('cuda')
+answer_tensor = answer_tensor.to('cuda')
+labels_tensor = labels_tensor.to('cuda')
+model.to('cuda')
+
+# Predict hidden states features for each layer
+with torch.no_grad():
+    # See the models docstrings for the detail of the inputs
+    outputs = model(question_tensor, answer_tensor, decoder_lm_labels=labels_tensor)
+    # Transformers models always output tuples.
+    # See the models docstrings for the detail of all the outputs
+    # In our case, the first element is the value of the LM loss 
+    lm_loss = outputs[0]
+```
+
+This loss can be used to fine-tune `Model2Model` on the question answering task. Assuming that we fine-tuned the model, let us now see how to generate an answer:
+
+```python
+# Let's re-use the previous question
+question = "Who was Jim Henson?"
+encoded_question = tokenizer.encode(question)
+question_tensor = torch.tensor([encoded_question])
+
+# This time we try to generate the answer, so we start with an empty sequence
+answer = "[CLS]"
+encoded_answer = tokenizer.encode(answer, add_special_tokens=False)
+answer_tensor = torch.tensor([encoded_answer])
+
+# Load pre-trained model (weights)
+model = Model2Model.from_pretrained('fine-tuned-weights')
+model.eval()
+
+# If you have a GPU, put everything on cuda
+question_tensor = encoded_question.to('cuda')
+answer_tensor = encoded_answer.to('cuda')
+model.to('cuda')
+
+# Predict all tokens
+with torch.no_grad():
+    outputs = model(question_tensor, answer_tensor)
+    predictions = outputs[0]
+
+# confirm we were able to predict 'jim'
+predicted_index = torch.argmax(predictions[0, -1]).item()
+predicted_token = tokenizer.convert_ids_to_tokens([predicted_index])[0]
+assert predicted_token == 'jim'
+```