# Add whole word mask support for lm fine-tune (#7925)

* ADD: add whole word mask proxy for both eng and chinese
* MOD: adjust format
* MOD: reformat code
* MOD: update import
* MOD: fix bug
* MOD: add import
* MOD: fix bug
* MOD: decouple code and update readme
* MOD: reformat code
* Update examples/language-modeling/README.md
  Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
* Update examples/language-modeling/README.md
  Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
* Update examples/language-modeling/run_language_modeling.py
  Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
* Update examples/language-modeling/run_language_modeling.py
  Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
* Update examples/language-modeling/run_language_modeling.py
  Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
* Update examples/language-modeling/run_language_modeling.py
  Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
* change wwm to whole_word_mask
* reformat code
* reformat
* format
* Code quality
* ADD: update chinese ref readme
* MOD: small changes
* MOD: small changes2
* update readme

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Co-authored-by: Sylvain Gugger <sylvain.gugger@gmail.com>
2020-10-22 21:19:00 +08:00
parent 64b4d25cf3
commit a16e568f22
8 changed files with 394 additions and 7 deletions
--- a/examples/language-modeling/README.md
+++ b/examples/language-modeling/README.md
@@ -45,6 +45,8 @@ slightly slower (over-fitting takes more epochs).

 We use the `--mlm` flag so that the script may change its loss function.

+If using whole-word masking, use both the `--mlm` and `--wwm` flags.
+
 ```bash
 export TRAIN_FILE=/path/to/dataset/wiki.train.raw
 export TEST_FILE=/path/to/dataset/wiki.test.raw
@@ -57,7 +59,55 @@ python run_language_modeling.py \
    --train_data_file=$TRAIN_FILE \
    --do_eval \
    --eval_data_file=$TEST_FILE \
-    --mlm
+    --mlm \
+    --wwm
+```
+
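As a rough illustration of what whole-word masking means for WordPiece tokens, the sketch below groups each `##` continuation piece with the token before it and masks a word only as a unit. This is a minimal sketch and not the code added by this commit: the function name `whole_word_mask`, the `mask_prob` and `seed` parameters, and the simple budget rule are all assumptions made for the example.

```python
import random


def whole_word_mask(tokens, mask_prob=0.15, seed=0):
    """Return a copy of `tokens` in which whole words are replaced by [MASK].

    Hypothetical helper for illustration only; the selection rule is simplified.
    """
    rng = random.Random(seed)

    # Group token indices into words: a '##' piece joins the previous group.
    words = []
    for i, tok in enumerate(tokens):
        if tok.startswith("##") and words:
            words[-1].append(i)
        else:
            words.append([i])

    # Visit words in random order and mask them until the budget is used up.
    rng.shuffle(words)
    budget = max(1, round(len(tokens) * mask_prob))
    masked = list(tokens)
    covered = 0
    for word in words:
        if covered + len(word) > budget:
            continue  # masking this word would exceed the budget, skip it
        for i in word:
            masked[i] = "[MASK]"
        covered += len(word)
    return masked


tokens = ["the", "phil", "##har", "##monic", "played"]
print(whole_word_mask(tokens, mask_prob=0.6))
```

Whatever the random choice, `phil`, `##har` and `##monic` are either all masked or all left intact, which is the property that distinguishes whole-word masking from per-token masking.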
+For Chinese models, the procedure is the same as for English models when using only `--mlm`. If using whole-word masking, we first need to generate a reference file, because Chinese BERT tokenizes at the character level.
+
+**Q:** Why do we need a ref file?
+
+**A:** Suppose we have a Chinese sentence like `我喜欢你`. The original Chinese BERT tokenizes it at the character level as `['我','喜','欢','你']`.
+However, `喜欢` is a single word. For whole-word masking, we need a result like `['我','喜','##欢','你']`.
+So we need a ref file to tell the model which positions of the original BERT tokens should be prefixed with `##`.
+
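The role of the ref file described in the answer above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper name `add_sub_symbol` is hypothetical, and positions are taken to be 0-based indices into the character-level token list without special tokens, which may differ from the format that `chinese_ref.py` actually writes.

```python
def add_sub_symbol(char_tokens, ref_positions):
    """Prefix '##' to every token whose index is listed in `ref_positions`.

    Hypothetical helper: the indexing convention is an assumption for this sketch.
    """
    ref = set(ref_positions)
    return [f"##{tok}" if i in ref else tok for i, tok in enumerate(char_tokens)]


# Character-level tokens for 我喜欢你; position 2 (欢) continues the word 喜欢.
tokens = ["我", "喜", "欢", "你"]
print(add_sub_symbol(tokens, [2]))  # ['我', '喜', '##欢', '你']
```

Once the continuation characters carry the `##` prefix, Chinese text can be grouped into words in the same way as English WordPiece output.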
+**Q:** Why LTP?
+
+**A:** Because the best-known Chinese WWM BERT is [Chinese-BERT-wwm](https://github.com/ymcui/Chinese-BERT-wwm) by HIT. It works well on many Chinese tasks such as CLUE (Chinese GLUE).
+They use LTP, so if we want to fine-tune their model, we need LTP.
+
+```bash
+export TRAIN_FILE=/path/to/dataset/wiki.train.raw
+export LTP_RESOURCE=/path/to/ltp/tokenizer
+export BERT_RESOURCE=/path/to/bert/tokenizer
+export SAVE_PATH=/path/to/data/ref.txt
+
+python chinese_ref.py \
+    --file_name=$TRAIN_FILE \
+    --ltp=$LTP_RESOURCE \
+    --bert=$BERT_RESOURCE \
+    --save_path=$SAVE_PATH
+```
+Currently, the Chinese ref file is only supported by the `LineByLineWithRefDataset` class, so we need to add the `--line_by_line` flag:
+
+
+```bash
+export TRAIN_FILE=/path/to/dataset/wiki.train.raw
+export TEST_FILE=/path/to/dataset/wiki.test.raw
+export REF_FILE=/path/to/ref.txt
+
+python run_language_modeling.py \
+    --output_dir=output \
+    --model_type=roberta \
+    --model_name_or_path=roberta-base \
+    --do_train \
+    --train_data_file=$TRAIN_FILE \
+    --chinese_ref_file=$REF_FILE \
+    --do_eval \
+    --eval_data_file=$TEST_FILE \
+    --mlm \
+    --line_by_line \
+    --wwm
 ```

 ### XLNet and permutation language modeling