* Validation split percentage now also applies to custom data files
Same issue as https://github.com/huggingface/transformers/issues/12406; fixed for run_mlm.py on the PyTorch branch
* Validation split added in the right place
* Update run_clm.py
* validation split added for custom files
* Validation split added for custom files
* Update run_plm.py
* Fixed validation split for custom input files in the PyTorch language-modeling examples
* Update run_clm_no_trainer.py
* args modified
* fix_torch_device_generate_test
* remove @
* upload
* finish dataset streaming
* adapt readme
* finish
* up
* up
* up
* up
* Apply suggestions from code review
* finish
* make style
* make style2
* finish
Co-authored-by: Patrick von Platen <patrick@huggingface.co>
* Validation split added: custom data files
A validation split is now created when custom data is loaded and no validation file is given
* Updated documentation with custom file usage
Updated documentation with custom file usage
* Update README.md
* Update README.md
* Update README.md
* Made some suggested stylistic changes
* Used logger instead of print.
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
* Made similar changes to add validation split
In case of a missing validation file, a validation split will be used now.
* max_train_samples to be used for training only
max_train_samples was misplaced; it is now applied to the training data only, not to the whole dataset.
* styled
* changed ordering
* Improved language of documentation
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
* Improved language of documentation
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
* Fixed styling issue
* Update run_mlm.py
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
* add philosophy doc
* fix typos
* update doc
* Apply suggestions from code review
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
* address Patrick's suggestions
* add a training example and fix typos
* jit the training step
* jit train step
* fix example code
* typo
* Apply suggestions from code review
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
* Tensorflow MLM example
* Add CLM example
* Style fixes, adding missing checkpoint code from the CLM example
* Fix TPU training, avoid massive dataset warnings
* Fix incorrect training length calculation for multi-GPU training
* Fix incorrect training length calculation for multi-GPU training
* Refactors and nitpicks from the review
* Style pass
* Adding README
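For the validation-split and max_train_samples commits above, a minimal sketch of the intended behaviour may help. It is not the code from the example scripts: the helper names split_specs and cap_train_only are hypothetical, and it assumes the split strings would be passed as the split argument of datasets.load_dataset, which accepts percentage slices such as "train[:5%]".

```python
from typing import Dict, List, Optional


def split_specs(validation_split_percentage: int) -> Dict[str, str]:
    """Build split strings that carve a validation set out of "train".

    Hypothetical helper: the first N percent becomes validation and
    the remainder stays as training data.
    """
    if not 0 < validation_split_percentage < 100:
        raise ValueError("validation_split_percentage must be between 1 and 99")
    pct = validation_split_percentage
    return {
        "validation": f"train[:{pct}%]",
        "train": f"train[{pct}%:]",
    }


def cap_train_only(
    splits: Dict[str, List], max_train_samples: Optional[int]
) -> Dict[str, List]:
    """Apply max_train_samples to the training split only (hypothetical helper)."""
    capped = dict(splits)
    if max_train_samples is not None:
        capped["train"] = splits["train"][:max_train_samples]
    return capped


specs = split_specs(5)
capped = cap_train_only({"train": [1, 2, 3], "validation": [4, 5]}, 2)
```

The point of the second helper is ordering: the cap is applied after the data has been separated into splits, so the validation split keeps its full size.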
Previously, the code could not be used for validation only, because this line:
extension = data_args.train_file.split(".")[-1]
assumed that the extension must be extracted from the training file. The line ran regardless of whether the user asked for training or evaluation, so an evaluation-only run (with no training file) raised an error. The extension is now extracted from the training file if the user wants to train, and from the validation file if the user wants to run evaluation. This way the code can be used for training and validation separately.
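The change described above can be sketched as follows. This is an illustration under assumptions, not the exact patch: infer_extension is a hypothetical helper, its arguments stand in for the do_train flag and the train_file and validation_file fields, and giving training priority when both files are present is a choice made for the sketch.

```python
from typing import Optional


def infer_extension(
    do_train: bool,
    train_file: Optional[str],
    validation_file: Optional[str],
) -> str:
    """Return the file extension that decides which loader to use.

    Training runs read it from the training file; evaluation-only
    runs read it from the validation file instead.
    """
    source = train_file if do_train else validation_file
    if source is None:
        raise ValueError("No data file available for the requested mode")
    return source.split(".")[-1]


train_ext = infer_extension(True, "data/train.csv", None)
eval_ext = infer_extension(False, None, "data/dev.json")
```

With this selection, an evaluation-only run no longer touches train_file, so a missing training file is no longer an error.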