[WIP][examples/seq2seq] move old s2s scripts to legacy (#10136)
* move old s2s scripts to legacy * add the tests back * proper rename * restore * Apply suggestions from code review Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com> Co-authored-by: Stas Bekman <stas@stason.org> Co-authored-by: Stas Bekman <stas00@users.noreply.github.com> Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
This commit is contained in:
65
examples/legacy/seq2seq/romanian_postprocessing.md
Normal file
65
examples/legacy/seq2seq/romanian_postprocessing.md
Normal file
@@ -0,0 +1,65 @@
|
||||
### Motivation
|
||||
Without processing, english-> romanian mbart-large-en-ro gets BLEU score 26.8 on the WMT data.
|
||||
With post processing, it can score 37..
|
||||
Here is the postprocessing code, stolen from @mjpost in this [issue](https://github.com/pytorch/fairseq/issues/1758)
|
||||
|
||||
|
||||
|
||||
### Instructions
|
||||
Note: You need to have your test_generations.txt before you start this process.
|
||||
(1) Setup `mosesdecoder` and `wmt16-scripts`
|
||||
```bash
|
||||
cd $HOME
|
||||
git clone git@github.com:moses-smt/mosesdecoder.git
|
||||
cd mosesdecoder
|
||||
git clone git@github.com:rsennrich/wmt16-scripts.git
|
||||
```
|
||||
|
||||
(2) define a function for post processing.
|
||||
It removes diacritics and does other things I don't understand
|
||||
```bash
|
||||
ro_post_process () {
|
||||
sys=$1
|
||||
ref=$2
|
||||
export MOSES_PATH=$HOME/mosesdecoder
|
||||
REPLACE_UNICODE_PUNCT=$MOSES_PATH/scripts/tokenizer/replace-unicode-punctuation.perl
|
||||
NORM_PUNC=$MOSES_PATH/scripts/tokenizer/normalize-punctuation.perl
|
||||
REM_NON_PRINT_CHAR=$MOSES_PATH/scripts/tokenizer/remove-non-printing-char.perl
|
||||
REMOVE_DIACRITICS=$MOSES_PATH/wmt16-scripts/preprocess/remove-diacritics.py
|
||||
NORMALIZE_ROMANIAN=$MOSES_PATH/wmt16-scripts/preprocess/normalise-romanian.py
|
||||
TOKENIZER=$MOSES_PATH/scripts/tokenizer/tokenizer.perl
|
||||
|
||||
|
||||
|
||||
lang=ro
|
||||
for file in $sys $ref; do
|
||||
cat $file \
|
||||
| $REPLACE_UNICODE_PUNCT \
|
||||
| $NORM_PUNC -l $lang \
|
||||
| $REM_NON_PRINT_CHAR \
|
||||
| $NORMALIZE_ROMANIAN \
|
||||
| $REMOVE_DIACRITICS \
|
||||
| $TOKENIZER -no-escape -l $lang \
|
||||
> $(basename $file).tok
|
||||
done
|
||||
# compute BLEU
|
||||
cat $(basename $sys).tok | sacrebleu -tok none -s none -b $(basename $ref).tok
|
||||
}
|
||||
```
|
||||
|
||||
(3) Call the function on test_generations.txt and test.target
|
||||
For example,
|
||||
```bash
|
||||
ro_post_process enro_finetune/test_generations.txt wmt_en_ro/test.target
|
||||
```
|
||||
This will split out a new blue score and write a new fine called `test_generations.tok` with post-processed outputs.
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
```
|
||||
Reference in New Issue
Block a user