From ae3e84f3bab83796de06b00206c2b1e0cd70ec0e Mon Sep 17 00:00:00 2001
From: Ola Piktus <ola.piktus@gmail.com>
Date: Mon, 28 Sep 2020 09:06:39 +0100
Subject: [PATCH] [RAG] Clean Rag readme in examples (#7413)

* Improve README + consolidation script

* Reformat README

* Reformat README

Co-authored-by: Your Name <you@example.com>
---
 examples/rag/README.md                     | 121 +++++++++++++--------
 examples/rag/consolidate_rag_checkpoint.py |  99 +++++++++++++++++
 2 files changed, 173 insertions(+), 47 deletions(-)
 create mode 100644 examples/rag/consolidate_rag_checkpoint.py

diff --git a/examples/rag/README.md b/examples/rag/README.md
index e4453868bc..c35fe63005 100644
--- a/examples/rag/README.md
+++ b/examples/rag/README.md
@@ -1,24 +1,28 @@
 # Intro
-RAG is a seq2seq model which encapsulates two core components: a question encoder and a generator.
+Aimed at tackling knowledge-intensive NLP tasks (tasks a human wouldn't be expected to solve without access to external knowledge sources), RAG models are seq2seq models with access to a retrieval mechanism that provides relevant context documents at training and evaluation time.
+
+A RAG model encapsulates two core components: a question encoder and a generator.
 During a forward pass, we encode the input with the question encoder and pass it
 to the retriever to extract relevant context documents. The documents are then prepended to the input.
-Such contextualized inputs is passed to the generator.
-
-The question encoder can be any `autoencoding` model, preferably :obj:`~transformers.DPRQuestionEncoder`, and the generator can be any `seq2seq` model, preferably :obj:`~transformers.BartForConditionalGeneration`.
-
-The model can be initialized with a :obj:`~transformers.RagRetriever` for end-to-end generation or used in combination with the outputs of a retriever in multiple steps - see examples for more details.
-The model is compatible any `autoencoding` model as the ``question_encoder`` and any `seq2seq` model with language model head as the ``generator``.
-The model has been tested with :class:`~transformers.DPRQuestionEncoder` as the ``question_encoder`` and :class:`~transformers.BartForConditionalGeneration` or :class:`~transformers.T5ForConditionalGeneration` as the ``generator``.
-
-RAG models were released with the paper `Retrieval-Augmented Generation for
-Knowledge-Intensive NLP Tasks <https://arxiv.org/abs/2005.11401>`_ by Patrick Lewis, Ethan Perez, Aleksandra Piktus et al.
-
+Such contextualized inputs are passed to the generator.
 
+Read more about RAG at https://arxiv.org/abs/2005.11401.
 # Finetuning
-Our finetuning logic is based on scripts from [`examples/seq2seq`](https://github.com/huggingface/transformers/tree/master/examples/seq2seq).
-Follow instructions there regarding data preprocessing. A sample finetuning command:
 
+
+Our finetuning logic is based on scripts from [`examples/seq2seq`](https://github.com/huggingface/transformers/tree/master/examples/seq2seq). We accept training data in the same format as specified there - we expect a directory consisting of 6 text files:
+```bash
+train.source
+train.target
+val.source
+val.target
+test.source
+test.target
 ```
+
+A sample finetuning command (run `python examples/rag/finetune.py --help` to list all available options):
+
+```bash
 python examples/rag/finetune.py \
     --data_dir $DATA_DIR \
     --output_dir $OUTPUT_DIR \
@@ -27,62 +31,85 @@ python examples/rag/finetune.py \
     --fp16 \
     --gpus 8
 ```
+We publish two `base` models which can serve as a starting point for finetuning on downstream tasks (use them as `model_name_or_path`):
+- [`facebook/rag-sequence-base`](https://huggingface.co/facebook/rag-sequence-base) - a base for finetuning `RagSequenceForGeneration` models,
+- [`facebook/rag-token-base`](https://huggingface.co/facebook/rag-token-base) - a base for finetuning `RagTokenForGeneration` models.
+
+The `base` models initialize the question encoder with [`facebook/dpr-question_encoder-single-nq-base`](https://huggingface.co/facebook/dpr-question_encoder-single-nq-base) and the generator with [`facebook/bart-large`](https://huggingface.co/facebook/bart-large).
+
+If you would like to initialize finetuning with a base model using different question encoder and generator architectures, you can build it with a consolidation script, e.g.:
+```bash
+python examples/rag/consolidate_rag_checkpoint.py \
+    --model_type rag_sequence \
+    --generator_name_or_path facebook/bart-large-cnn \
+    --question_encoder_name_or_path facebook/dpr-question_encoder-single-nq-base \
+    --dest path/to/checkpoint
+```
+You will then be able to pass `path/to/checkpoint` as `model_name_or_path` to the `finetune.py` script.
 
 
 # Evaluation
-Apart from the parameters specifying the model to evaluate and some extra parameters, the evaluation script expects paths to two files:
-- `evaluation_set` - a path to a file specifying the evaluation dataset, a single datapoint per line, e.g.
-```who is the owner of reading football club```
-- `gold_data_path` - a path to a file contaning ground truth answers for datapoints from the `evaluation_set`.
+Our evaluation script supports two modes of evaluation (controlled by the `eval_mode` argument): `e2e`, end-to-end evaluation, which returns EM (exact match) and F1 scores calculated for the downstream task, and `retrieval`, which returns precision@k of the documents retrieved for the provided inputs.
 
-We expect the following formats of the gold data file:
+The evaluation script expects paths to two files:
+- `evaluation_set` - a path to a file specifying the evaluation dataset, a single input per line.
+- `gold_data_path` - a path to a file containing ground truth answers for datapoints from the `evaluation_set`, a single output per line. See below for the expected formats of the gold data files.
 
-- for e2e evaluation, we support two formats of the gold file:
-    - `qa` - where a single line in the following format: input [tab] output_list, e.g.:
-    ```
-    who is the owner of reading football club	['Xiu Li Dai', 'Dai Yongge', 'Dai Xiuli', 'Yongge Dai']
-    ```
-    - `ans` - where a single line of the gold file contains the expected output string, e.g.:
-    ```
-    Xiu Li Dai
-    ```
 
-- for retrieval evaluation, we expect a tab-separated list of Wikipedia page titles constituting positive contexts for a given query, e.g. given a question `who sings does he love me with reba`, a line with ground truth retrieval data could look as follows:
+## Retrieval evaluation
+For `retrieval` evaluation, we expect a gold data file in which each line consists of a tab-separated list of document titles constituting positive contexts for the respective datapoint from the `evaluation_set`. E.g. given the question `who sings does he love me with reba` in the `evaluation_set`, the corresponding ground truth line could look as follows:
 ```
 Does He Love You	Does He Love You	Red Sandy Spika dress of Reba McEntire	Greatest Hits Volume Two (Reba McEntire album)	Shoot for the Moon (album)
 ```
 
-## Retrieval evaluation
-
 We demonstrate how to evaluate retrieval against DPR evaluation data. You can download respective files from links listed [here](https://github.com/facebookresearch/DPR/blob/master/data/download_data.py#L39-L45).
 
 1. Download and unzip the gold data file. We use the `biencoder-nq-dev` from https://dl.fbaipublicfiles.com/dpr/data/retriever/biencoder-nq-dev.json.gz.
 2. Parse the unziped file using the `parse_dpr_relevance_data.py`
-```
-python examples/rag/parse_dpr_relevance_data.py --src_path path/to/unziped/biencoder-nq-dev.json --evaluation_set path/to/output/biencoder-nq-dev.questions --gold_data_path path/to/output/biencoder-nq-dev.pages
-```
+    ```bash
+    python examples/rag/parse_dpr_relevance_data.py \
+        --src_path path/to/unziped/biencoder-nq-dev.json \
+        --evaluation_set path/to/output/biencoder-nq-dev.questions \
+        --gold_data_path path/to/output/biencoder-nq-dev.pages
+    ```
 3. Run evaluation:
-```
-python examples/rag/eval_rag.py \
-    --model_name_or_path $MODEL_NAME_OR_PATH \ # model name or path of the model we're evaluating
-    --model_type rag_sequence \ # RAG model type (rag_token or rag_sequence)
-    --evaluation_set path/to/output/biencoder-nq-dev.questions \ # an input dataset for evaluation
-    --gold_data_path path/to/output/biencoder-nq-dev.pages \ # a dataset containing ground truth answers for samples from the evaluation_set
-    --predictions_path path/to/retrieval_preds.tsv  \ # name of file in which predictions will be stored
-    --eval_mode retrieval \ # indicates whether we're performing retrieval evaluation or e2e evaluation
-    --recalculate # if predictions_filename already exists, and this option is set - we regenerate the answers, otherwise we reuse the predicsion file to calculate metrics.
-```
+    ```bash
+    python examples/rag/eval_rag.py \
+        --model_name_or_path facebook/rag-sequence-nq \ # model name or path of the model we're evaluating
+        --model_type rag_sequence \ # RAG model type (rag_token or rag_sequence)
+        --evaluation_set path/to/output/biencoder-nq-dev.questions \ # an input dataset for evaluation
+        --gold_data_path path/to/output/biencoder-nq-dev.pages \ # a dataset containing ground truth answers for samples from the evaluation_set
+        --predictions_path path/to/retrieval_preds.tsv  \ # name of file where predictions will be stored
+        --eval_mode retrieval \ # indicates whether we're performing retrieval evaluation or e2e evaluation
+        --k 1 # parameter k for the precision@k metric
+    ```
 
 
 ## End-to-end evaluation
+
+We support two formats of the gold data file (controlled by the `gold_data_mode` parameter):
+- `qa` - where a single line has the following format: `input [tab] output_list`, e.g.:
 ```
+who is the owner of reading football club	['Xiu Li Dai', 'Dai Yongge', 'Dai Xiuli', 'Yongge Dai']
+```
+- `ans` - where a single line contains a single expected answer, e.g.:
+```
+Xiu Li Dai
+```
+
+Predictions of the model for the samples from the `evaluation_set` will be saved under the path specified by the `predictions_path` parameter. If this path already exists, the script will use the saved predictions to calculate metrics. Add the `--recalculate` parameter to force the script to perform inference from scratch.
+
+An example e2e evaluation run could look as follows:
+```bash
 python examples/rag/eval_rag.py \
-	--model_name_or_path $MODEL_NAME_OR_PATH \
+    --model_name_or_path facebook/rag-sequence-nq \
     --model_type rag_sequence \
     --evaluation_set path/to/test.source \
     --gold_data_path path/to/gold_data \
     --predictions_path path/to/e2e_preds.txt \
-    --eval_mode e2e \ # indicates whether we're performing retrieval evaluation or e2e evaluation (default)
+    --eval_mode e2e \
+    --gold_data_mode qa \
     --n_docs 5 \ # You can experiment with retrieving different number of documents at evaluation time
-    --print_predictions
+    --print_predictions \
+    --recalculate # adding this parameter will force recalculating predictions even if predictions_path already exists
 ```
diff --git a/examples/rag/consolidate_rag_checkpoint.py b/examples/rag/consolidate_rag_checkpoint.py
new file mode 100644
index 0000000000..b9ed7ec0f8
--- /dev/null
+++ b/examples/rag/consolidate_rag_checkpoint.py
@@ -0,0 +1,99 @@
+"""
+A script creating a RAG checkpoint from generator and question encoder checkpoints.
+"""
+
+import argparse
+from pathlib import Path
+
+from transformers import AutoConfig, AutoTokenizer, RagConfig, RagSequenceForGeneration, RagTokenForGeneration
+
+
+def consolidate(
+    model_type,
+    generator_name_or_path: str,
+    question_encoder_name_or_path: str,
+    dest_dir: Path,
+    config_name_or_path: str = None,
+    generator_tokenizer_name_or_path: str = None,
+    question_encoder_tokenizer_name_or_path: str = None,
+):
+
+    if config_name_or_path is None:
+        config_name_or_path = "facebook/rag-token-base" if model_type == "rag_token" else "facebook/rag-sequence-base"
+
+    if generator_tokenizer_name_or_path is None:
+        generator_tokenizer_name_or_path = generator_name_or_path
+
+    if question_encoder_tokenizer_name_or_path is None:
+        question_encoder_tokenizer_name_or_path = question_encoder_name_or_path
+
+    model_class = RagTokenForGeneration if model_type == "rag_token" else RagSequenceForGeneration
+
+    # Save model.
+    rag_config = RagConfig.from_pretrained(config_name_or_path)
+    gen_config = AutoConfig.from_pretrained(generator_name_or_path)
+    question_encoder_config = AutoConfig.from_pretrained(question_encoder_name_or_path)
+
+    rag_config.generator = gen_config
+    rag_config.question_encoder = question_encoder_config
+
+    rag_model = model_class.from_pretrained_question_encoder_generator(
+        question_encoder_name_or_path, generator_name_or_path, config=rag_config
+    )
+    rag_model.save_pretrained(dest_dir)
+
+    # Sanity check.
+    model_class.from_pretrained(dest_dir)
+
+    # Save tokenizers.
+    gen_tokenizer = AutoTokenizer.from_pretrained(generator_tokenizer_name_or_path)
+    gen_tokenizer.save_pretrained(dest_dir / "generator_tokenizer/")
+    question_encoder_tokenizer = AutoTokenizer.from_pretrained(question_encoder_tokenizer_name_or_path)
+    question_encoder_tokenizer.save_pretrained(dest_dir / "question_encoder_tokenizer/")
+
+
+if __name__ == "__main__":
+    parser = argparse.ArgumentParser()
+    parser.add_argument(
+        "--model_type",
+        choices=["rag_sequence", "rag_token"],
+        required=True,
+        type=str,
+        help="RAG model type: rag_sequence, rag_token",
+    )
+    parser.add_argument("--dest", type=str, required=True, help="Path to the output checkpoint directory.")
+    parser.add_argument("--generator_name_or_path", type=str, required=True, help="Generator model identifier")
+    parser.add_argument(
+        "--question_encoder_name_or_path", type=str, required=True, help="Question encoder model identifier"
+    )
+
+    parser.add_argument(
+        "--generator_tokenizer_name_or_path",
+        type=str,
+        help="Generator tokenizer identifier, if not specified, resolves to ``generator_name_or_path``",
+    )
+    parser.add_argument(
+        "--question_encoder_tokenizer_name_or_path",
+        type=str,
+        help="Question encoder tokenizer identifier, if not specified, resolves to ``question_encoder_name_or_path``",
+    )
+    parser.add_argument(
+        "--config_name_or_path",
+        type=str,
+        help="Identifier of the model config to use, if not provided, resolves to a base config for a given ``model_type``",
+    )
+
+    args = parser.parse_args()
+
+    dest_dir = Path(args.dest)
+    dest_dir.mkdir(parents=True, exist_ok=True)
+
+    consolidate(
+        args.model_type,
+        args.generator_name_or_path,
+        args.question_encoder_name_or_path,
+        dest_dir,
+        args.config_name_or_path,
+        args.generator_tokenizer_name_or_path,
+        args.question_encoder_tokenizer_name_or_path,
+    )
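
Note on the retrieval evaluation described in the README hunk: the gold data file holds one tab-separated list of document titles per line, and `retrieval` mode reports precision@k. As a minimal sketch of how such a metric could be computed over that format, assuming predictions are stored one tab-separated, ranked list of titles per line as well: `precision_at_k` is a hypothetical helper, not the code in `eval_rag.py`, and the predicted titles are made up for illustration.

```python
from typing import List


def precision_at_k(predicted_lines: List[str], gold_lines: List[str], k: int) -> float:
    """Average fraction of the top-k predicted titles that occur in the gold titles.

    Hypothetical helper: each line is a tab-separated list of document titles,
    with predictions ordered from most to least relevant (an assumption).
    """
    if k <= 0:
        raise ValueError("k must be a positive integer")
    if len(predicted_lines) != len(gold_lines):
        raise ValueError("predictions and gold data must have the same number of lines")
    if not gold_lines:
        return 0.0

    total = 0.0
    for predicted, gold in zip(predicted_lines, gold_lines):
        # Keep only the k highest-ranked titles; using a set means a title
        # repeated within the top k is counted once.
        top_k = set(predicted.rstrip("\n").split("\t")[:k])
        gold_titles = set(gold.rstrip("\n").split("\t"))
        total += len(top_k & gold_titles) / k
    return total / len(gold_lines)


if __name__ == "__main__":
    # Gold line shortened from the README example; the prediction is invented.
    gold = ["Does He Love You\tDoes He Love You\tRed Sandy Spika dress of Reba McEntire"]
    predicted = ["Does He Love You\tReba McEntire"]
    print(precision_at_k(predicted, gold, k=1))  # 1.0
    print(precision_at_k(predicted, gold, k=2))  # 0.5
```

The sketch returns a fraction between 0 and 1; whether the real script reports a fraction or a percentage is not stated in this patch.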