From 769948fad219b3212100f70ba0c69ccaec54b0a8 Mon Sep 17 00:00:00 2001
From: Stas Bekman
Date: Sun, 7 Feb 2021 17:51:34 -0800
Subject: [PATCH] json to jsonlines, and doc, and typo (#10043)

---
 examples/seq2seq/README.md | 45 +++++++++++++++++++-------------------
 1 file changed, 23 insertions(+), 22 deletions(-)

diff --git a/examples/seq2seq/README.md b/examples/seq2seq/README.md
index 22faf54acc..3a5bdc6757 100644
--- a/examples/seq2seq/README.md
+++ b/examples/seq2seq/README.md
@@ -33,7 +33,9 @@ This directory is in a bit of messy state and is undergoing some cleaning, pleas
 ## New script
-The new script for fine-tuning a model on a summarization or translation task is `run_seq2seq.py`. It is a lightweight example of how to download and preprocess a dataset from the [🤗 Datasets](https://github.com/huggingface/datasets) library or use your own files (json or csv), then fine-tune one of the architectures above on it.
+The new script for fine-tuning a model on a summarization or translation task is `run_seq2seq.py`. It is a lightweight example of how to download and preprocess a dataset from the [🤗 Datasets](https://github.com/huggingface/datasets) library or use your own files (jsonlines or csv), then fine-tune one of the architectures above on it.
+
+For custom datasets in `jsonlines` format please see: https://huggingface.co/docs/datasets/loading_datasets.html#json-files
 Here is an example on a summarization task:
 ```bash
@@ -50,22 +52,22 @@ python examples/seq2seq/run_seq2seq.py \
     --predict_with_generate
 ```
-And here is how you would use it on your own files (replace `path_to_csv_or_json_file`, `text_column_name` and `summary_column_name` by the relevant values):
+And here is how you would use it on your own files (replace `path_to_csv_or_jsonlines_file`, `text_column_name` and `summary_column_name` by the relevant values):
 ```bash
 python examples/seq2seq/run_seq2seq.py \
-    -model_name_or_path t5-small \
+    --model_name_or_path t5-small \
     --do_train \
     --do_eval \
     --task summarization \
-    --train_file path_to_csv_or_json_file \
-    --validation_file path_to_csv_or_json_file \
+    --train_file path_to_csv_or_jsonlines_file \
+    --validation_file path_to_csv_or_jsonlines_file \
     --output_dir ~/tmp/tst-summarization \
     --overwrite_output_dir \
     --per_device_train_batch_size=4 \
     --per_device_eval_batch_size=4 \
     --predict_with_generate \
     --text_column text_column_name \
-    --summary_column summary_column_name
+    --summary_column summary_column_name
 ```
 The training and validation files should have a column for the inputs texts and a column for the summaries.
@@ -87,7 +89,7 @@ python examples/seq2seq/run_seq2seq.py \
     --predict_with_generate
 ```
-And here is how you would use it on your own files (replace `path_to_json_file`, by the relevant values):
+And here is how you would use it on your own files (replace `path_to_jsonlines_file`, by the relevant values):
 ```bash
 python examples/seq2seq/run_seq2seq.py \
     --model_name_or_path sshleifer/student_marian_en_ro_6_1 \
@@ -98,15 +100,15 @@ python examples/seq2seq/run_seq2seq.py \
     --dataset_config_name ro-en \
     --source_lang en_XX \
     --target_lang ro_RO\
-    --train_file path_to_json_file \
-    --validation_file path_to_json_file \
+    --train_file path_to_jsonlines_file \
+    --validation_file path_to_jsonlines_file \
     --output_dir ~/tmp/tst-translation \
     --per_device_train_batch_size=4 \
     --per_device_eval_batch_size=4 \
     --overwrite_output_dir \
     --predict_with_generate
 ```
-Here the files are expected to be JSON files, with each input being a dictionary with a key `"translation"` containing one key per language (here `"en"` and `"ro"`).
+Here the files are expected to be JSONLINES files, with each input being a dictionary with a key `"translation"` containing one key per language (here `"en"` and `"ro"`).
 ## Old script
@@ -162,7 +164,7 @@ https://github.com/huggingface/transformers/tree/master/scripts/fsmt
 #### Pegasus (multiple datasets)
-Multiple eval datasets are available for download from:
+Multiple eval datasets are available for download from:
 https://github.com/stas00/porting/tree/master/datasets/pegasus
@@ -294,8 +296,8 @@ th 56 \
 ```
 ### Multi-GPU Evaluation
-here is a command to run xsum evaluation on 8 GPUS. It is more than linearly faster than run_eval.py in some cases
-because it uses SortishSampler to minimize padding. You can also use it on 1 GPU. `data_dir` must have
+here is a command to run xsum evaluation on 8 GPUS. It is more than linearly faster than run_eval.py in some cases
+because it uses SortishSampler to minimize padding. You can also use it on 1 GPU. `data_dir` must have
 `{type_path}.source` and `{type_path}.target`. Run `./run_distributed_eval.py --help` for all clargs.
 ```bash
@@ -320,17 +322,17 @@ When using `run_eval.py`, the following features can be useful:
 `--info` is an additional argument available for the same purpose of tracking the conditions of the experiment. It's useful to pass things that weren't in the argument list, e.g. a language pair `--info "lang:en-ru"`. But also if you pass `--info` without a value it will fallback to the current date/time string, e.g. `2020-09-13 18:44:43`.
 If using `--dump-args --info`, the output will be:
-
+
 ```
 {'bleu': 26.887, 'n_obs': 10, 'runtime': 1, 'seconds_per_sample': 0.1, 'num_beams': 8, 'early_stopping': True, 'info': '2020-09-13 18:44:43'}
 ```
 If using `--dump-args --info "pair:en-ru chkpt=best`, the output will be:
-
+
 ```
 {'bleu': 26.887, 'n_obs': 10, 'runtime': 1, 'seconds_per_sample': 0.1, 'num_beams': 8, 'early_stopping': True, 'info': 'pair=en-ru chkpt=best'}
 ```
-
+
 * if you need to perform a parametric search in order to find the best ones that lead to the highest BLEU score, let `run_eval_search.py` to do the searching for you.
@@ -341,14 +343,14 @@ When using `run_eval.py`, the following features can be useful:
 --search "num_beams=5:10 length_penalty=0.8:1.0:1.2 early_stopping=true:false"
 ```
 which will generate `12` `(2*3*2)` searches for a product of each hparam. For example the example that was just used will invoke `run_eval.py` repeatedly with:
-
+
 ```
 --num_beams 5 --length_penalty 0.8 --early_stopping true
 --num_beams 5 --length_penalty 0.8 --early_stopping false
 [...]
 --num_beams 10 --length_penalty 1.2 --early_stopping false
 ```
-
+
 On completion, this function prints a markdown table of the results sorted by the best BLEU score and the winning arguments.
 ```
@@ -381,7 +383,7 @@ pytest examples/seq2seq/
 ### Converting pytorch-lightning checkpoints
 pytorch lightning ``-do_predict`` often fails, after you are done training, the best way to evaluate your model is to convert it.
-This should be done for you, with a file called `{save_dir}/best_tfmr`.
+This should be done for you, with a file called `{save_dir}/best_tfmr`.
 If that file doesn't exist but you have a lightning `.ckpt` file, you can run
 ```bash
@@ -390,7 +392,7 @@ python convert_pl_checkpoint_to_hf.py PATH_TO_CKPT randomly_initialized_hf_mode
 Then either `run_eval` or `run_distributed_eval` with `save_dir/best_tfmr` (see previous sections)
-# Experimental Features
+# Experimental Features
 These features are harder to use and not always useful.
 ### Dynamic Batch Size for MT
@@ -401,7 +403,7 @@ This feature can only be used:
 - without sortish sampler
 - after calling `./save_len_file.py $tok $data_dir`
-For example,
+For example,
 ```bash
 ./save_len_file.py Helsinki-NLP/opus-mt-en-ro wmt_en_ro
 ./dynamic_bs_example.sh --max_tokens_per_batch=2000 --output_dir benchmark_dynamic_bs
@@ -417,4 +419,3 @@ uses 12,723 batches of length 48 and takes slightly more time 9.5 minutes.
 The feature is still experimental, because:
 + we can make it much more robust if we have memory mapped/preprocessed datasets.
 + The speedup over sortish sampler is not that large at the moment.
-
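
For readers unfamiliar with the `jsonlines` layout that this patch switches the README to, here is a minimal sketch, assuming one JSON object per line. The translation records follow the `"translation"` key with one key per language described in the patch. The file names, the sample sentences, the `text`/`summary` column names and the helpers `write_jsonlines`/`read_jsonlines` are hypothetical and not taken from the patch.

```python
import json
import os
import tempfile


def write_jsonlines(records, path):
    """Write each record as one JSON object on its own line."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_jsonlines(path):
    """Read a jsonlines file back into a list of dicts, skipping empty lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Hypothetical translation samples: a "translation" dict with one key per language.
translation_records = [
    {"translation": {"en": "Hello.", "ro": "Salut."}},
    {"translation": {"en": "Thank you.", "ro": "Mulțumesc."}},
]

# Hypothetical summarization samples: the column names are placeholders.
summarization_records = [
    {"text": "A long article about the weather.", "summary": "Weather report."},
]

out_dir = tempfile.mkdtemp()
translation_path = os.path.join(out_dir, "train_translation.jsonl")
summarization_path = os.path.join(out_dir, "train_summarization.jsonl")

write_jsonlines(translation_records, translation_path)
write_jsonlines(summarization_records, summarization_path)

# Round-trip check: what was written can be read back unchanged.
loaded_translation = read_jsonlines(translation_path)
loaded_summarization = read_jsonlines(summarization_path)
print(loaded_translation[0]["translation"]["ro"])
```

Files written this way could then be passed to `--train_file` and `--validation_file`. For summarization, the chosen column names would go to `--text_column` and `--summary_column`.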