Fix eval ref miss in Chinese WWM. (#8115)

* ADD: add whole word mask proxy for both eng and chinese * MOD: adjust format * MOD: reformat code * MOD: update import * MOD: fix bug * MOD: add import * MOD: fix bug * MOD: decouple code and update readme * MOD: reformat code * Update examples/language-modeling/README.md Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com> * Update examples/language-modeling/README.md Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com> * Update examples/language-modeling/run_language_modeling.py Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com> * Update examples/language-modeling/run_language_modeling.py Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com> * Update examples/language-modeling/run_language_modeling.py Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com> * Update examples/language-modeling/run_language_modeling.py Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com> * change wwm to whole_word_mask * reformat code * reformat * format * Code quality * ADD: update chinese ref readme * MOD: small changes * MOD: small changes2 * update readme * fix eval ref file miss bug * format file * MOD: move ref code to contrib * MOD: add delimeter check * reformat code * refomat code * Update examples/language-modeling/README.md Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com> Co-authored-by: Sylvain Gugger <sylvain.gugger@gmail.com> Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
2020-10-30 05:08:39 +08:00
parent fdf893c441
commit 9a21b50614
4 changed files with 26 additions and 16 deletions
--- a/examples/language-modeling/README.md
+++ b/examples/language-modeling/README.md
@@ -63,7 +63,7 @@ python run_language_modeling.py \
    --whole_word_mask
 ```

-For Chinese models, it's same with English model with only --mlm`. If using whole-word masking, we need to generate a reference files, case it's char level.
+For Chinese models, it's same with English model with only `--mlm`. If using whole-word masking, we need to generate a reference files, because it's char level.

 **Q :** Why ref file ?

@@ -76,15 +76,19 @@ So we need a ref file to tell model which pos of BERT original token should be a
 **A :** Cause the best known Chinese WWM BERT is [Chinese-BERT-wwm](https://github.com/ymcui/Chinese-BERT-wwm) by HIT. It works well on so many Chines Task like CLUE (Chinese GLUE).
 They use LTP, so if we want to fine-tune their model, we need LTP.

+Now LTP only only works well on `transformers==3.2.0`. So we don't add it to requirements.txt.
+You need to check to `3.2.0` for `run_chinese_ref.py`. And the code  could be found in `examples/contrib`.
+
+
 ```bash
 export TRAIN_FILE=/path/to/dataset/wiki.train.raw
 export LTP_RESOURCE=/path/to/ltp/tokenizer
 export BERT_RESOURCE=/path/to/bert/tokenizer
 export SAVE_PATH=/path/to/data/ref.txt

-python chinese_ref.py \
+python examples/contrib/run_chinese_ref.py \
    --file_name=$TRAIN_FILE \
-    --ltp=$LTP_RESOURCE
+    --ltp=$LTP_RESOURCE \
    --bert=$BERT_RESOURCE \
    --save_path=$SAVE_PATH 
 ```