Fit chinese wwm to new datasets (#9887)

* MOD: fit chinese wwm to new datasets
* MOD: move wwm to new folder
* MOD: formate code
* Styling
* MOD add param and recover trainer

Co-authored-by: Sylvain Gugger <sylvain.gugger@gmail.com>
2021-02-01 16:37:59 +08:00
parent 24881008a6
commit 1682804ebd
6 changed files with 249 additions and 67 deletions
--- a/examples/language-modeling/README.md
+++ b/examples/language-modeling/README.md
@@ -100,72 +100,7 @@ sure all your batches have the same length.

 ### Whole word masking

-The BERT authors released a new version of BERT using Whole Word Masking in May 2019. Instead of masking randomly
-selected tokens (which may be part of words), they mask randomly selected words (masking all the tokens corresponding
-to that word). This technique has been refined for Chinese in [this paper](https://arxiv.org/abs/1906.08101).
-
-To fine-tune a model using whole word masking, use the following script:
-```bash
-python run_mlm_wwm.py \
-    --model_name_or_path roberta-base \
-    --dataset_name wikitext \
-    --dataset_config_name wikitext-2-raw-v1 \
-    --do_train \
-    --do_eval \
-    --output_dir /tmp/test-mlm-wwm
-```
-
-For Chinese models, we need to generate a reference files (which requires the ltp library), because it's tokenized at
-the character level.
-
-**Q :** Why a reference file?
-
-**A :** Suppose we have a Chinese sentence like: `我喜欢你` The original Chinese-BERT will tokenize it as
-`['我','喜','欢','你']` (character level). But `喜欢` is a whole word. For whole word masking proxy, we need a result
-like `['我','喜','##欢','你']`, so we need a reference file to tell the model which position of the BERT original token
-should be added `##`.
-
-**Q :** Why LTP ?
-
-**A :** Cause the best known Chinese WWM BERT is [Chinese-BERT-wwm](https://github.com/ymcui/Chinese-BERT-wwm) by HIT.
-It works well on so many Chines Task like CLUE (Chinese GLUE). They use LTP, so if we want to fine-tune their model,
-we need LTP.
-
-Now LTP only only works well on `transformers==3.2.0`. So we don't add it to requirements.txt.
-You need to create a separate environment with this version of Transformers to run the `run_chinese_ref.py` script that
-will create the reference files. The script is in `examples/contrib`. Once in the proper environment, run the
-following:
-
-
-```bash
-export TRAIN_FILE=/path/to/dataset/wiki.train.raw
-export LTP_RESOURCE=/path/to/ltp/tokenizer
-export BERT_RESOURCE=/path/to/bert/tokenizer
-export SAVE_PATH=/path/to/data/ref.txt
-
-python examples/contrib/run_chinese_ref.py \
-    --file_name=path_to_train_or_eval_file \
-    --ltp=path_to_ltp_tokenizer \
-    --bert=path_to_bert_tokenizer \
-    --save_path=path_to_reference_file
-```
-
-Then you can run the script like this: 
-
-
-```bash
-python run_mlm_wwm.py \
-    --model_name_or_path roberta-base \
-    --train_file path_to_train_file \
-    --validation_file path_to_validation_file \
-    --train_ref_file path_to_train_chinese_ref_file \
-    --validation_ref_file path_to_validation_chinese_ref_file \
-    --do_train \
-    --do_eval \
-    --output_dir /tmp/test-mlm-wwm
-```
-
-**Note:** On TPU, you should the flag `--pad_to_max_length` to make sure all your batches have the same length.
+This part was moved to `examples/research_projects/mlm_wwm`. 

 ### XLNet and permutation language modeling