[s2s] create doc for pegasus/fsmt replication (#7934)

Stas Bekman
2020-10-20 12:07:52 -07:00
committed by GitHub
parent 96f4828ace
commit 0e24e4c136


@@ -15,7 +15,8 @@ For `bertabs` instructions, see [`bertabs/README.md`](bertabs/README.md).
## Datasets
#### XSUM
```bash
cd examples/seq2seq
wget https://cdn-datasets.huggingface.co/summarization/xsum.tar.gz
@@ -26,6 +27,7 @@ this should make a directory called `xsum/` with files like `test.source`.
To use your own data, copy that file format. Each article to be summarized is on its own line.
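As a rough illustration of the one-article-per-line layout, the sketch below collapses each article into a single line and appends it to a `.source` file. The helper names `flatten_article` and `build_source`, and the assumption that each article lives in its own `.txt` file, are hypothetical and not part of the repository.

```bash
# Hypothetical helper: print one text file as a single line, collapsing
# every run of whitespace (including newlines) into one space.
flatten_article() {
    local text
    text=$(tr -s '[:space:]' ' ' < "$1")
    text=${text# }   # drop a leading space, if any
    text=${text% }   # drop the trailing space left by the final newline
    printf '%s\n' "$text"
}

# Hypothetical helper: write every *.txt article in a directory to an
# output file, one article per line (assumes one article per .txt file).
build_source() {
    local article_dir=$1 out_file=$2 f
    : > "$out_file"
    for f in "$article_dir"/*.txt; do
        [ -e "$f" ] || continue   # skip the literal pattern when nothing matches
        flatten_article "$f" >> "$out_file"
    done
}
```

For example, `build_source my_articles my_data/test.source` would produce one line per article; the summaries would need the same treatment, in the same order, for the matching `.target` file.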
#### CNN/DailyMail
```bash
cd examples/seq2seq
wget https://cdn-datasets.huggingface.co/summarization/cnn_dm_v2.tgz
@@ -35,7 +37,8 @@ export CNN_DIR=${PWD}/cnn_dm
```
This should make a directory called `cnn_dm/` with 6 files.
#### WMT16 English-Romanian Translation Data
Download with this command:
```bash
wget https://cdn-datasets.huggingface.co/translation/wmt_en_ro.tar.gz
@@ -44,13 +47,25 @@ export ENRO_DIR=${PWD}/wmt_en_ro
```
This should make a directory called `wmt_en_ro/` with 6 files.
#### WMT English-German
```bash
wget https://cdn-datasets.huggingface.co/translation/wmt_en_de.tgz
tar -xzvf wmt_en_de.tgz
export DATA_DIR=${PWD}/wmt_en_de
```
#### FSMT datasets (wmt)
Refer to the scripts starting with `eval_` under:
https://github.com/huggingface/transformers/tree/master/scripts/fsmt
#### Pegasus (multiple datasets)
Multiple eval datasets are available for download from:
https://github.com/stas00/porting/tree/master/datasets/pegasus
#### Private Data
If you are using your own data, it must be formatted as one directory with 6 files:
@@ -64,7 +79,6 @@ test.target
```
The `.source` files are the input, the `.target` files are the desired output.
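A quick sanity check of a private data directory might look like the sketch below. It is only an illustration: the function name `check_data_dir` is hypothetical, the `train` and `val` split names are assumed here (only `test.source` and `test.target` appear in the text above), and the requirement that each `.source` file has as many lines as its `.target` file is an assumption drawn from the one-example-per-line layout.

```bash
# Hypothetical helper: verify that a data directory holds a .source and a
# .target file for each split, and that each pair has the same line count.
check_data_dir() {
    local dir=$1 split ext n_src n_tgt
    for split in train val test; do   # train/val names are assumed
        for ext in source target; do
            if [ ! -f "$dir/$split.$ext" ]; then
                echo "missing: $dir/$split.$ext" >&2
                return 1
            fi
        done
        # $(( ... )) normalizes any padding that wc adds on some systems
        n_src=$(( $(wc -l < "$dir/$split.source") ))
        n_tgt=$(( $(wc -l < "$dir/$split.target") ))
        if [ "$n_src" -ne "$n_tgt" ]; then
            echo "$split: $n_src source lines vs $n_tgt target lines" >&2
            return 1
        fi
    done
    echo "ok: $dir"
}
```

Running `check_data_dir "$DATA_DIR"` before training or evaluation can catch a missing or misaligned file early.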
### Tips and Tricks
General Tips: