Seq2SeqDataset uses linecache to save memory by @Pradhy729 (#5792)
Co-authored-by: Pradhy729 <49659913+Pradhy729@users.noreply.github.com>
@@ -7,6 +7,15 @@ For `bertabs` instructions, see `bertabs/README.md`.
### Data
XSUM Data:
```bash
cd examples/seq2seq
wget https://s3.amazonaws.com/datasets.huggingface.co/summarization/xsum.tar.gz
tar -xzvf xsum.tar.gz
export XSUM_DIR=${PWD}/xsum
```
This should make a directory called `xsum/` with files like `test.source`.
To use your own data, copy that file format. Each article to be summarized is on its own line.
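
A minimal sketch of writing your own examples in this layout, assuming the (article, summary) pairs already live in a Python list. `write_split` is a hypothetical helper for illustration and is not part of `examples/seq2seq`; collapsing whitespace is one way to keep each article on a single line so that line i of `.source` lines up with line i of `.target`.

```python
from pathlib import Path


def write_split(pairs, data_dir, split):
    """Write (article, summary) pairs to <split>.source and <split>.target.

    Hypothetical helper for illustration; not part of examples/seq2seq.
    """
    data_dir = Path(data_dir)
    data_dir.mkdir(parents=True, exist_ok=True)
    src_path = data_dir / f"{split}.source"
    tgt_path = data_dir / f"{split}.target"
    with open(src_path, "w", encoding="utf-8") as src, \
            open(tgt_path, "w", encoding="utf-8") as tgt:
        for article, summary in pairs:
            # Collapse all whitespace, including newlines, so that each
            # example occupies exactly one line in each file.
            src.write(" ".join(article.split()) + "\n")
            tgt.write(" ".join(summary.split()) + "\n")
```

Calling it once per split (`train`, `val`, `test`) produces the six files described for custom data.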
CNN/DailyMail data
```bash
@@ -17,18 +26,6 @@ tar -xzvf cnn_dm.tgz
export CNN_DIR=${PWD}/cnn_dm
```

This should make a directory called `cnn_dm/` with files like `test.source`.
To use your own data, copy that file format. Each article to be summarized is on its own line.

XSUM Data:
```bash
cd examples/seq2seq
wget https://s3.amazonaws.com/datasets.huggingface.co/summarization/xsum.tar.gz
tar -xzvf xsum.tar.gz
export XSUM_DIR=${PWD}/xsum
```

WMT16 English-Romanian Translation Data:
```bash
cd examples/seq2seq
@@ -40,7 +37,7 @@ export ENRO_DIR=${PWD}/wmt_en_ro
If you are using your own data, it must be formatted as one directory with 6 files: train.source, train.target, val.source, val.target, test.source, test.target.
The `.source` files are the input, the `.target` files are the desired output.
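
The commit title says `Seq2SeqDataset` now uses `linecache`. As a rough illustration of that idea, and not the repository's actual code, the sketch below fetches example `index` from the two line-aligned files with the standard-library `linecache` module. `read_pair` is a hypothetical name; the real dataset also tokenizes the text, which is omitted here.

```python
import linecache
from pathlib import Path


def read_pair(data_dir, split, index):
    """Return (source_line, target_line) for a zero-based example index.

    Hypothetical helper for illustration only.
    """
    src_file = str(Path(data_dir) / f"{split}.source")
    tgt_file = str(Path(data_dir) / f"{split}.target")
    # linecache.getline is 1-indexed and returns "" when the line is missing.
    source = linecache.getline(src_file, index + 1).rstrip("\n")
    target = linecache.getline(tgt_file, index + 1).rstrip("\n")
    if not source or not target:
        # Treat a missing or empty line as an invalid example.
        raise IndexError(f"no example {index} in split {split!r}")
    return source, target
```

Because examples are looked up by line number, the `.source` and `.target` files must have the same number of lines in the same order.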
### Tips and Tricks

General Tips:
@@ -64,6 +61,10 @@ Summarization Tips:
- If you are finetuning on your own dataset, start from `distilbart-cnn-12-6` if you want long summaries and `distilbart-xsum-12-6` if you want short summaries.
(It rarely makes sense to start from `bart-large` unless you are researching finetuning methods).
**Update 2020-07-18**
Datasets: `Seq2SeqDataset` will be used for all models besides MBart, for which `MBartDataset` will be used.
A new dataset is needed to support multilingual tasks.
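
The rule above can be sketched as a small dispatch function. Everything here is an assumption made for illustration: the two classes are empty placeholders standing in for the real ones, and choosing by a substring of the model name is a hypothetical criterion, not necessarily how `finetune.py` decides.

```python
class Seq2SeqDataset:
    """Placeholder standing in for the real dataset class."""


class MBartDataset:
    """Placeholder standing in for the real multilingual dataset class."""


def pick_dataset_class(model_name_or_path):
    """Return the dataset class for a model name (hypothetical rule)."""
    # Assumption: MBart checkpoints can be recognized by "mbart" in their name.
    if "mbart" in model_name_or_path.lower():
        return MBartDataset
    return Seq2SeqDataset
```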
### Summarization Finetuning
Run/modify `finetune.sh`

@@ -78,8 +79,6 @@ The following command should work on a 16GB GPU:
--model_name_or_path facebook/bart-large
```
### Translation Finetuning

First, follow the wmt_en_ro download instructions.
@@ -124,23 +123,6 @@ from transformers import AutoModelForSeq2SeqLM
model = AutoModelForSeq2SeqLM.from_pretrained(f'{output_dir}/best_tfmr')
```
#### XSUM Shared Task
Compare XSUM results with others by using `--logger_name wandb_shared`. This requires `wandb` registration.

Here is an example command, but you can do whatever you want. Hopefully this will make debugging and collaboration easier!
```bash
WANDB_PROJECT='hf_xsum' ./finetune.sh \
    --data_dir $XSUM_DIR \
    --output_dir xsum_frozen_embs \
    --model_name_or_path facebook/bart-large \
    --train_batch_size 16 --eval_batch_size 16 --freeze_embeds --freeze_encoder \
    --num_train_epochs 6 \
    --max_target_length=60 --val_max_target_length=60 --test_max_target_length=100 \
    --logger_name wandb
```

You can see your wandb logs [here](https://app.wandb.ai/sshleifer/hf_xsum?workspace=user-)
### Evaluation Commands

To create summaries for each article in a dataset, use `run_eval.py`. Here are a few commands that run eval for different tasks and models.