[run_(clm|mlm).py examples] add streaming dataset support (#21343)

* [run_clm example] add streaming dataset support

* unrefactor kwargs

* fix

* fix

* require datasets>=2.0.0

* port to mlm
Stas Bekman
2023-01-30 14:01:35 -08:00
committed by GitHub
parent 95be242adc
commit 98d88b23f5
3 changed files with 104 additions and 43 deletions


@@ -174,6 +174,9 @@ concatenates all texts and then splits them in blocks of the same length).
**Note:** On TPU, you should use the flag `--pad_to_max_length` in conjunction with the `--line_by_line` flag to make
sure all your batches have the same length.
## Streaming
To use the streaming dataset mode, which can be very useful for large datasets, add `--streaming` to the command line. This is currently supported by `run_mlm.py` and `run_clm.py`.
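
As a rough illustration of why streaming helps with large datasets, the sketch below groups tokenized examples into fixed-size blocks lazily, so only a small buffer is held in memory at any time. It is not the code used by `run_clm.py` or `run_mlm.py`. The names `stream_blocks` and `fake_tokenized_examples` are hypothetical and exist only for this example, and the token ids are made up.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def stream_blocks(token_stream: Iterable[List[int]], block_size: int) -> Iterator[List[int]]:
    """Lazily concatenate tokenized examples and yield fixed-size blocks."""
    buffer: List[int] = []
    for tokens in token_stream:
        buffer.extend(tokens)
        # Emit every complete block currently available in the buffer.
        while len(buffer) >= block_size:
            yield buffer[:block_size]
            buffer = buffer[block_size:]
    # Any trailing remainder shorter than block_size is dropped.


def fake_tokenized_examples() -> Iterator[List[int]]:
    """Hypothetical stand-in for a streamed, already-tokenized dataset."""
    yield [1, 2, 3]
    yield [4, 5]
    yield [6, 7, 8, 9]


# Only as many examples are consumed as are needed to produce two blocks.
first_two = list(islice(stream_blocks(fake_tokenized_examples(), block_size=4), 2))
print(first_two)  # [[1, 2, 3, 4], [5, 6, 7, 8]]
```

Because the generator never materializes the full dataset, its total length is not known up front, which is the main practical difference from the non-streaming mode.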
## Creating a model on the fly