Validation split added: custom data files @sgugger, @patil-suraj (#12407)

* Validation split added: custom data files

Validation split added in case of no validation file and loading custom data

* Updated documentation with custom file usage

Updated documentation with custom file usage

* Update README.md

* Update README.md

* Update README.md

* Made some suggested stylistic changes

* Used logger instead of print.

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Made similar changes to add validation split

In case of a missing validation file, a validation split will be used now.

* max_train_samples to be used for training only

max_train_samples got misplaced, now corrected so that it is applied on training data only, not whole data.

* styled

* changed ordering

* Improved language of documentation

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Improved language of documentation

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Fixed styling issue

* Update run_mlm.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
This commit is contained in:
Souvic Chakraborty
2021-07-01 22:52:42 +05:30
committed by GitHub
parent f929462b25
commit d5b8fe3b90
3 changed files with 46 additions and 2 deletions

View File

@@ -49,6 +49,14 @@ python run_mlm.py \
--dataset_config_name wikitext-103-raw-v1
```
When using a custom dataset, the validation file can be separately passed as an input argument. Otherwise some split (customizable) of training data is used as validation.
```
python run_mlm.py \
--model_name_or_path distilbert-base-cased \
--output_dir output \
--train_file train_file_path
```
## run_clm.py
This script trains a causal language model.
@@ -61,3 +69,12 @@ python run_clm.py \
--dataset_name wikitext \
--dataset_config_name wikitext-103-raw-v1
```
When using a custom dataset, the validation file can be separately passed as an input argument. Otherwise some split (customizable) of training data is used as validation.
```
python run_clm.py \
--model_name_or_path distilgpt2 \
--output_dir output \
--train_file train_file_path
```