[research_projects] deal with security alerts (#15594)

* [research_projects] deal with security alerts

* add a note of the original PL ver and warning
This commit is contained in:
Stas Bekman
2022-02-11 11:31:09 -08:00
committed by GitHub
parent f15c99fabf
commit fcb0f74397
8 changed files with 55 additions and 42 deletions

View File

@@ -13,6 +13,10 @@ Author: Sam Shleifer (https://github.com/sshleifer)
- `FSMTForConditionalGeneration`
- `T5ForConditionalGeneration`
# Note
⚠️ This project should be run with pytorch-lightning==1.0.4 which has a potential security vulnerability
## Datasets
#### XSUM
@@ -62,7 +66,7 @@ https://github.com/huggingface/transformers/tree/master/scripts/fsmt
#### Pegasus (multiple datasets)
Multiple eval datasets are available for download from:
Multiple eval datasets are available for download from:
https://github.com/stas00/porting/tree/master/datasets/pegasus
@@ -210,7 +214,7 @@ model = AutoModelForSeq2SeqLM.from_pretrained(f'{output_dir}/best_tfmr')
### Converting pytorch-lightning checkpoints
pytorch lightning ``-do_predict`` often fails, after you are done training, the best way to evaluate your model is to convert it.
This should be done for you, with a file called `{save_dir}/best_tfmr`.
This should be done for you, with a file called `{save_dir}/best_tfmr`.
If that file doesn't exist but you have a lightning `.ckpt` file, you can run
```bash
@@ -219,7 +223,7 @@ python convert_pl_checkpoint_to_hf.py PATH_TO_CKPT randomly_initialized_hf_mode
Then either `run_eval` or `run_distributed_eval` with `save_dir/best_tfmr` (see previous sections)
# Experimental Features
# Experimental Features
These features are harder to use and not always useful.
### Dynamic Batch Size for MT
@@ -230,7 +234,7 @@ This feature can only be used:
- without sortish sampler
- after calling `./save_len_file.py $tok $data_dir`
For example,
For example,
```bash
./save_len_file.py Helsinki-NLP/opus-mt-en-ro wmt_en_ro
./dynamic_bs_example.sh --max_tokens_per_batch=2000 --output_dir benchmark_dynamic_bs
@@ -254,10 +258,10 @@ This section describes all code and artifacts from our [Paper](http://arxiv.org/
![DBART](https://huggingface.co/front/thumbnails/distilbart_large.png)
+ For the CNN/DailyMail dataset, (relatively longer, more extractive summaries), we found a simple technique that works, which we call "Shrink and Fine-tune", or SFT.
you just copy alternating layers from `facebook/bart-large-cnn` and fine-tune more on the cnn/dm data. `sshleifer/distill-pegasus-cnn-16-4`, `sshleifer/distilbart-cnn-12-6` and all other checkpoints under `sshleifer` that start with `distilbart-cnn` were trained this way.
you just copy alternating layers from `facebook/bart-large-cnn` and fine-tune more on the cnn/dm data. `sshleifer/distill-pegasus-cnn-16-4`, `sshleifer/distilbart-cnn-12-6` and all other checkpoints under `sshleifer` that start with `distilbart-cnn` were trained this way.
+ For the XSUM dataset, training on pseudo-labels worked best for Pegasus (`sshleifer/distill-pegasus-16-4`), while training with KD worked best for `distilbart-xsum-12-6`
+ For `sshleifer/dbart-xsum-12-3`
+ We ran 100s experiments, and didn't want to document 100s of commands. If you want a command to replicate a figure from the paper that is not documented below, feel free to ask on the [forums](https://discuss.huggingface.co/t/seq2seq-distillation-methodology-questions/1270) and tag `@sshleifer`.
+ We ran 100s experiments, and didn't want to document 100s of commands. If you want a command to replicate a figure from the paper that is not documented below, feel free to ask on the [forums](https://discuss.huggingface.co/t/seq2seq-distillation-methodology-questions/1270) and tag `@sshleifer`.
+ You can see the performance tradeoffs of model sizes [here](https://docs.google.com/spreadsheets/d/1EkhDMwVO02m8jCD1cG3RoFPLicpcL1GQHTQjfvDYgIM/edit#gid=0).
and more granular timing results [here](https://docs.google.com/spreadsheets/d/1EkhDMwVO02m8jCD1cG3RoFPLicpcL1GQHTQjfvDYgIM/edit#gid=1753259047&range=B2:I23).
@@ -303,10 +307,10 @@ deval 1 sshleifer/distill-pegasus-xsum-16-4 xsum dpx_xsum_eval
+ Find a teacher model [Pegasus](https://huggingface.co/models?search=pegasus) (slower, better ROUGE) or `facebook/bart-large-xsum`/`facebook/bart-large-cnn` (faster, slightly lower.).
Choose the checkpoint where the corresponding dataset is most similar (or identical to) your dataset.
+ Follow the sections in order below. You can stop after SFT if you are satisfied, or move on to pseudo-labeling if you want more performance.
+ student size: If you want a close to free 50% speedup, cut the decoder in half. If you want a larger speedup, cut it in 4.
+ student size: If you want a close to free 50% speedup, cut the decoder in half. If you want a larger speedup, cut it in 4.
+ If your SFT run starts at a validation ROUGE-2 that is more than 10 pts below the teacher's validation ROUGE-2, you have a bug. Switching to a more expensive technique will not help. Try setting a breakpoint and looking at generation and truncation defaults/hyper-parameters, and share your experience on the forums!
#### Initialization
We use [make_student.py](./make_student.py) to copy alternating layers from the teacher, and save the resulting model to disk
```bash
@@ -319,7 +323,7 @@ python make_student.py google/pegasus-xsum --save_path dpx_xsum_16_4 --e 16 --d
we now have an initialized student saved to `dbart_xsum_12_3`, which we will use for the following commands.
+ Extension: To replicate more complicated initialize experiments in section 6.1, or try your own. Use the `create_student_by_copying_alternating_layers` function.
#### Pegasus
#### Pegasus
+ The following commands are written for BART and will require, at minimum, the following modifications
+ reduce batch size, and increase gradient accumulation steps so that the product `gpus * batch size * gradient_accumulation_steps = 256`. We used `--learning-rate` = 1e-4 * gradient accumulation steps.
+ don't use fp16
@@ -379,7 +383,7 @@ python finetune.py \
--output_dir dbart_xsum_12_3_PL --gpus 1 --logger_name wandb
```
To combine datasets, as in Section 6.2, try something like:
```bash
@@ -413,7 +417,7 @@ The command that produced `sshleifer/distilbart-xsum-12-6` is at [./train_distil
```bibtex
@misc{shleifer2020pretrained,
title={Pre-trained Summarization Distillation},
title={Pre-trained Summarization Distillation},
author={Sam Shleifer and Alexander M. Rush},
year={2020},
eprint={2010.13002},

View File

@@ -4,7 +4,7 @@ psutil
sacrebleu
rouge-score
tensorflow_datasets
pytorch-lightning==1.0.4
pytorch-lightning
matplotlib
git-python==1.0.3
faiss-cpu