Validation split added: custom data files @sgugger, @patil-suraj (#12407)
* Validation split added: custom data files

  Validation split added in case of no validation file and loading custom data

* Updated documentation with custom file usage

  Updated documentation with custom file usage

* Update README.md
* Update README.md
* Update README.md
* Made some suggested stylistic changes
* Used logger instead of print.

  Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Made similar changes to add validation split

  In case of a missing validation file, a validation split will be used now.

* max_train_samples to be used for training only

  max_train_samples got misplaced, now corrected so that it is applied on training data only, not whole data.

* styled
* changed ordering
* Improved language of documentation

  Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Improved language of documentation

  Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Fixed styling issue
* Update run_mlm.py

  Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
committed by GitHub
parent f929462b25
commit d5b8fe3b90
@@ -49,6 +49,14 @@ python run_mlm.py \
--dataset_config_name wikitext-103-raw-v1
```
When using a custom dataset, you can pass the validation file as a separate input argument. Otherwise, a (customizable) portion of the training data is held out and used for validation.
```
python run_mlm.py \
--model_name_or_path distilbert-base-cased \
--output_dir output \
--train_file train_file_path
```
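The fallback described above (holding out part of the training data when no validation file is given) can be sketched in plain Python. This is only an illustration of the idea, not the script's actual code: the function name `split_train_validation`, the parameter `validation_percentage`, and the 5 percent default are all hypothetical.

```python
from typing import List, Tuple


def split_train_validation(
    examples: List[str], validation_percentage: int = 5
) -> Tuple[List[str], List[str]]:
    """Hold out the first `validation_percentage` percent of examples for validation.

    Hypothetical helper for illustration; the real script may split differently.
    """
    if not 0 < validation_percentage < 100:
        raise ValueError("validation_percentage must be between 1 and 99")
    # Integer arithmetic avoids floating-point rounding surprises.
    n_validation = len(examples) * validation_percentage // 100
    validation = examples[:n_validation]
    train = examples[n_validation:]
    return train, validation


# Stand-in for the lines of a custom training file.
lines = [f"sentence {i}" for i in range(200)]
train, validation = split_train_validation(lines, validation_percentage=5)
print(len(train), len(validation))  # 190 10
```

The two slices never overlap, so no example is used for both training and evaluation.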
## run_clm.py
This script trains a causal language model.
@@ -61,3 +69,12 @@ python run_clm.py \
--dataset_name wikitext \
--dataset_config_name wikitext-103-raw-v1
```
When using a custom dataset, you can pass the validation file as a separate input argument. Otherwise, a (customizable) portion of the training data is held out and used for validation.
```
python run_clm.py \
--model_name_or_path distilgpt2 \
--output_dir output \
--train_file train_file_path
```
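The commit message also notes that `max_train_samples` should limit the training data only, not the whole dataset. A minimal sketch of that ordering follows, assuming a simple list of examples. The helper `prepare_splits` and the `validation_percentage` parameter are hypothetical names used for illustration, not the script's code.

```python
from typing import List, Optional, Tuple


def prepare_splits(
    examples: List[str],
    validation_percentage: int = 5,
    max_train_samples: Optional[int] = None,
) -> Tuple[List[str], List[str]]:
    """Split first, then cap only the training portion (hypothetical helper)."""
    # Carve out the validation portion before any truncation happens.
    n_validation = len(examples) * validation_percentage // 100
    validation = examples[:n_validation]
    train = examples[n_validation:]
    # Apply the cap to the training split alone, leaving validation untouched.
    if max_train_samples is not None:
        train = train[:max_train_samples]
    return train, validation


data = [f"doc {i}" for i in range(100)]
train, validation = prepare_splits(data, validation_percentage=10, max_train_samples=20)
print(len(train), len(validation))  # 20 10
```

Truncating before the split would instead shrink the validation set as well, which appears to be the misplacement the commit message describes.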