From 95d1962b9c8460b4cec5a88eb9915e8e25f5bc1e Mon Sep 17 00:00:00 2001 From: Sam Shleifer Date: Tue, 21 Jul 2020 14:12:48 -0400 Subject: [PATCH] [Doc] explaining romanian postprocessing for MBART BLEU hacking (#5943) --- examples/seq2seq/romanian_postprocessing.md | 65 +++++++++++++++++++++ 1 file changed, 65 insertions(+) create mode 100644 examples/seq2seq/romanian_postprocessing.md diff --git a/examples/seq2seq/romanian_postprocessing.md b/examples/seq2seq/romanian_postprocessing.md new file mode 100644 index 0000000000..aa20829e8a --- /dev/null +++ b/examples/seq2seq/romanian_postprocessing.md @@ -0,0 +1,65 @@ +### Motivation +Without processing, english-> romanian mbart-large-en-ro gets BLEU score 26.8 on the WMT data. +With post processing, it can score 37.. +Here is the postprocessing code, stolen from @mjpost in this [issue](https://github.com/pytorch/fairseq/issues/1758) + + + +### Instructions +Note: You need to have your test_generations.txt before you start this process. +(1) Setup `mosesdecoder` and `wmt16-scripts` +```bash +cd $HOME +git clone git@github.com:moses-smt/mosesdecoder.git +cd mosesdecoder +git@github.com:rsennrich/wmt16-scripts.git +``` + +(2) define a function for post processing. + It removes diacritics and does other things I don't understand +```bash +ro_post_process () { + sys=$1 + ref=$2 + export MOSES_PATH=$HOME/mosesdecoder + REPLACE_UNICODE_PUNCT=$MOSES_PATH/scripts/tokenizer/replace-unicode-punctuation.perl + NORM_PUNC=$MOSES_PATH/scripts/tokenizer/normalize-punctuation.perl + REM_NON_PRINT_CHAR=$MOSES_PATH/scripts/tokenizer/remove-non-printing-char.perl + REMOVE_DIACRITICS=$MOSES_PATH/wmt16-scripts/preprocess/remove-diacritics.py + NORMALIZE_ROMANIAN=$MOSES_PATH/wmt16-scripts/preprocess/normalise-romanian.py + TOKENIZER=$MOSES_PATH/scripts/tokenizer/tokenizer.perl + + + + lang=ro + for file in $sys $ref; do + cat $file \ + | $REPLACE_UNICODE_PUNCT \ + | $NORM_PUNC -l $lang \ + | $REM_NON_PRINT_CHAR \ + | $NORMALIZE_ROMANIAN \ + | $REMOVE_DIACRITICS \ + | $TOKENIZER -no-escape -l $lang \ + > $(basename $file).tok + done + # compute BLEU + cat $(basename $sys).tok | sacrebleu -tok none -s none -b $(basename $ref).tok +} +``` + +(3) Call the function on test_generations.txt and test.target +For example, +```bash +ro_post_process enro_finetune/test_generations.txt wmt_en_ro/test.target +``` +This will split out a new blue score and write a new fine called `test_generations.tok` with post-processed outputs. + + + + + + + + + +```