Add line by line option to mlm/plm scripts (#8240)

* Make line by line optional in run_mlm
* Add option to disable dynamic padding
* Add option to plm too and update README
* Typos
* More typos
* Even more typos
* Apply suggestions from code review

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
2020-11-02 12:27:04 -05:00
parent ebec410c71
commit e1b1b614b1
4 changed files with 181 additions and 26 deletions
--- a/examples/language-modeling/README.md
+++ b/examples/language-modeling/README.md
@@ -77,10 +77,16 @@ python run_clm.py \
    --output_dir /tmp/test-clm
 ```

+If your dataset is organized with one sample per line, you can use the `--line_by_line` flag (otherwise the script
+concatenates all texts and then splits them in blocks of the same length).
+
+**Note:** On TPU, you should use the flag `--pad_to_max_length` in conjunction with the `--line_by_line` flag to make
+sure all your batches have the same length.
+
 ### Whole word masking

 The BERT authors released a new version of BERT using Whole Word Masking in May 2019. Instead of masking randomly
-selected tokens (which may be aprt of words), they mask randomly selected words (masking all the tokens corresponding
+selected tokens (which may be part of words), they mask randomly selected words (masking all the tokens corresponding
 to that word). This technique has been refined for Chinese in [this paper](https://arxiv.org/abs/1906.08101).

 To fine-tune a model using whole word masking, use the following script:
@@ -111,8 +117,8 @@ It works well on so many Chines Task like CLUE (Chinese GLUE). They use LTP, so
 we need LTP.

 Now LTP only works well on `transformers==3.2.0`. So we don't add it to requirements.txt.
-You need to create a separate enviromnent with this version of Transformers to run the `run_chinese_ref.py` script that
-will create the reference files. The script is in `examples/contrib`. Once in the proper enviromnent, run the
+You need to create a separate environment with this version of Transformers to run the `run_chinese_ref.py` script that
+will create the reference files. The script is in `examples/contrib`. Once in the proper environment, run the
 following:


@@ -144,6 +150,8 @@ python run_mlm_wwm.py \
    --output_dir /tmp/test-mlm-wwm
 ```

+**Note:** On TPU, you should use the flag `--pad_to_max_length` to make sure all your batches have the same length.
+
 ### XLNet and permutation language modeling

 XLNet uses a different training objective, which is permutation language modeling. It is an autoregressive method 
@@ -179,3 +187,9 @@ python run_plm.py \
    --do_eval \
    --output_dir /tmp/test-plm
 ```
+
+If your dataset is organized with one sample per line, you can use the `--line_by_line` flag (otherwise the script
+concatenates all texts and then splits them in blocks of the same length).
+
+**Note:** On TPU, you should use the flag `--pad_to_max_length` in conjunction with the `--line_by_line` flag to make
+sure all your batches have the same length.
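
The difference between the default behaviour and `--line_by_line` described in the README text above can be illustrated with a small, dependency-free sketch. This is not the code from `run_mlm.py` or `run_plm.py`; the function names `group_texts` and `line_by_line`, the `pad_id` value, and the choice to drop the leftover tokens at the end of the concatenated stream are assumptions made for illustration only.

```python
from typing import List


def group_texts(tokenized: List[List[int]], block_size: int) -> List[List[int]]:
    """Default mode (sketch): concatenate all samples, then cut equal-sized blocks.

    Assumption: leftover tokens that do not fill a whole block are dropped.
    """
    flat = [tok for sample in tokenized for tok in sample]
    usable = (len(flat) // block_size) * block_size
    return [flat[i:i + block_size] for i in range(0, usable, block_size)]


def line_by_line(
    tokenized: List[List[int]],
    max_length: int,
    pad_to_max_length: bool = False,
    pad_id: int = 0,  # hypothetical padding token id
) -> List[List[int]]:
    """Line-by-line mode (sketch): each line stays its own sample.

    Samples are truncated to max_length. With pad_to_max_length=True they are
    also padded so that every sample, and hence every batch, has the same length.
    """
    out = []
    for sample in tokenized:
        sample = sample[:max_length]
        if pad_to_max_length:
            sample = sample + [pad_id] * (max_length - len(sample))
        out.append(sample)
    return out


# Three "lines" of already-tokenized text with different lengths.
samples = [[1, 2, 3], [4, 5, 6, 7, 8], [9]]

print(group_texts(samples, block_size=4))
print(line_by_line(samples, max_length=4))
print(line_by_line(samples, max_length=4, pad_to_max_length=True))
```

Without padding, the line-by-line samples keep their own lengths, so batch shapes vary from step to step. With fixed-length padding every batch has the same shape, which is what the TPU note asks for.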