Seq2SeqDataset uses linecache to save memory by @Pradhy729 (#5792)

Co-authored-by: Pradhy729 <49659913+Pradhy729@users.noreply.github.com>
2020-07-18 13:57:33 -04:00
parent 4b506a37e3
commit 09a2f40684
6 changed files with 182 additions and 170 deletions
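
The idea named in the commit title can be sketched as follows. This is a minimal illustration, not the code from this commit: the class name `LineCachePairs` and its return format are hypothetical, and only the `{split}.source` / `{split}.target` file layout is taken from the README. Each example is fetched by line number with the standard-library `linecache` module when it is requested, so the dataset does not have to tokenize and store every example up front. Note that `linecache` keeps the raw lines of a file cached after the first access, so any saving comes from not holding processed examples, not from avoiding the raw text.

```python
import linecache
import os


class LineCachePairs:
    """Hypothetical sketch: serve (source, target) line pairs on demand."""

    def __init__(self, data_dir, split="train"):
        # File layout follows the README: <split>.source and <split>.target.
        self.src_file = os.path.join(data_dir, f"{split}.source")
        self.tgt_file = os.path.join(data_dir, f"{split}.target")
        # Count examples by streaming the file once; no lines are stored here.
        with open(self.src_file, encoding="utf-8") as f:
            self.length = sum(1 for _ in f)

    def __len__(self):
        return self.length

    def __getitem__(self, index):
        if not 0 <= index < self.length:
            raise IndexError(index)
        # linecache line numbers are 1-based, dataset indices are 0-based.
        source = linecache.getline(self.src_file, index + 1).rstrip("\n")
        target = linecache.getline(self.tgt_file, index + 1).rstrip("\n")
        return {"source": source, "target": target}
```

In a training setup, an object like this could be wrapped by a `torch.utils.data.DataLoader` whose collate function tokenizes each batch, which is an assumption about usage and not something this page shows.
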
--- a/examples/seq2seq/README.md
+++ b/examples/seq2seq/README.md
@@ -7,6 +7,15 @@ For `bertabs` instructions, see `bertabs/README.md`.


 ### Data
+XSUM Data:
+```bash
+cd examples/seq2seq
+wget https://s3.amazonaws.com/datasets.huggingface.co/summarization/xsum.tar.gz
+tar -xzvf xsum.tar.gz
+export XSUM_DIR=${PWD}/xsum
+```
+This should create a directory called xsum/ with files like `test.source`.
+To use your own data, copy that file format. Each article to be summarized is on its own line.

 CNN/DailyMail data
 ```bash
@@ -17,18 +26,6 @@ tar -xzvf cnn_dm.tgz
 export CNN_DIR=${PWD}/cnn_dm
 ```

-this should make a directory called cnn_dm/ with files like `test.source`.
-To use your own data, copy that files format. Each article to be summarized is on its own line.
-
-XSUM Data:
-```bash
-cd examples/seq2seq
-wget https://s3.amazonaws.com/datasets.huggingface.co/summarization/xsum.tar.gz
-tar -xzvf xsum.tar.gz
-export XSUM_DIR=${PWD}/xsum
-```
-
-
 WMT16 English-Romanian Translation Data:
 ```bash
 cd examples/seq2seq
@@ -40,7 +37,7 @@ export ENRO_DIR=${PWD}/wmt_en_ro
 If you are using your own data, it must be formatted as one directory with 6 files: train.source, train.target, val.source, val.target, test.source, test.target.  
 The `.source` files are the input, the `.target` files are the desired output.

-
+ 
 ### Tips and Tricks

 General Tips:
@@ -64,6 +61,10 @@ Summarization Tips:
 - If you are finetuning on your own dataset, start from `distilbart-cnn-12-6` if you want long summaries and `distilbart-xsum-12-6` if you want short summaries.
 (It rarely makes sense to start from `bart-large` unless you are researching finetuning methods).

+**Update 2020-07-18**
+Datasets: `Seq2SeqDataset` will be used for all models besides MBart, for which `MBartDataset` will be used.
+A new dataset is needed to support multilingual tasks.
+
 ### Summarization Finetuning
 Run/modify `finetune.sh`

@@ -78,8 +79,6 @@ The following command should work on a 16GB GPU:
    --model_name_or_path facebook/bart-large
 ```

-
-
 ### Translation Finetuning

 First, follow the wmt_en_ro download instructions.
@@ -124,23 +123,6 @@ from transformers import AutoModelForSeq2SeqLM
 model = AutoModelForSeq2SeqLM.from_pretrained(f'{output_dir}/best_tfmr')
 ```

-#### XSUM Shared Task
-Compare XSUM results with others by using `--logger_name wandb_shared`. This requires `wandb` registration.
-
-Here is an example command, but you can do whatever you want. Hopefully this will make debugging and collaboration easier!
-```bash
-WANDB_PROJECT='hf_xsum' ./finetune.sh \
-    --data_dir $XSUM_DIR \
-    --output_dir xsum_frozen_embs \
-    --model_name_or_path facebook/bart-large \
-    --train_batch_size 16 --eval_batch_size 16 --freeze_embeds --freeze_encoder \
-    --num_train_epochs 6 \
-    --max_target_length=60 --val_max_target_length=60 --test_max_target_length=100 \
-    --logger_name wandb
-```
-
-You can see your wandb logs [here](https://app.wandb.ai/sshleifer/hf_xsum?workspace=user-)
-
 ### Evaluation Commands

 To create summaries for each article in a dataset, we use `run_eval.py`. Here are a few commands that run eval for different tasks and models.