60 Commits

Author SHA1 Message Date
Stas Bekman
2df34f4aba [trainer] deepspeed integration (#9211)
* deepspeed integration

* style

* add test

* ds wants to do its own backward

* fp16 assert

* Update src/transformers/training_args.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* style

* for clarity extract what args are being passed to deepspeed

* introduce the concept of self.wrapped_model

* s/self.wrapped_model/self.model_wrapped/

* complete transition to self.wrapped_model / self.model

* fix

* doc

* give ds its own init

* add custom overrides, handle bs correctly

* fix test

* clean up model_init logic, fix small bug

* complete fix

* collapse --deepspeed_config into --deepspeed

* style

* start adding doc notes

* style

* implement hf2ds optimizer and scheduler configuration remapping

* oops

* call get_num_training_steps absolutely when needed

* workaround broken auto-formatter

* deepspeed_config arg is no longer needed - fixed in deepspeed master

* use hf's fp16 args in config

* clean

* start on the docs

* rebase cleanup

* finish up --fp16

* clarify the supported stages

* big refactor thanks to discovering deepspeed.init_distributed

* cleanup

* revert fp16 part

* add checkpoint-support

* move init ds into integrations

* extend docs

* cleanup

* unfix docs

* clean up old code

* imports

* move docs

* fix logic

* make it clear which file it's referring to

* document nodes/gpus

* style

* wrong format

* style

* deepspeed handles gradient clipping

* easier to read

* major doc rewrite

* Apply suggestions from code review

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* docs

* switch to AdamW optimizer

* style

* Apply suggestions from code review

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>

* clarify doc

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
2021-01-12 19:05:18 -08:00
Stas Bekman
33b7422839 [trainer] remove --model_parallel (#9451)
* fix bad merge - dropped code

* remove --model_parallel

* Deal with TrainingArguments

* Use a private attr and fix batch sizes

* fix _n_gpu

* add is_parallel helper wrapper

* fix attribute

* introduce a new attribute is_model_parallel

* docs

* docs

* Put back init False and rearrange doc

* Ignore non-init args in HFArgumentParser

Co-authored-by: Sylvain Gugger <sylvain.gugger@gmail.com>
2021-01-11 09:39:28 -05:00
Stas Bekman
29acabd886 [trainer] group fp16 args together (#9409)
* [t5 doc] typos

a few runaway backticks

@sgugger

* style

* [trainer] put fp16 args together

this PR proposes a purely cosmetic change that puts all the fp16 args together, so they are easier to manage/read

@sgugger

* style
2021-01-05 09:39:38 -05:00
Stas Bekman
748006c0b3 [trainer] --model_parallel hasn't been implemented for most models (#9347)
* --model_parallel hasn't been implemented for most models

* make the help clear as well

* implement is_parallelizable; use it

* oops

* remove property
2021-01-05 04:01:30 -05:00
Sylvain Gugger
490b39e614 Seq2seq trainer (#9241)
* Add label smoothing in Trainer

* Add options for scheduler and Adafactor in Trainer

* Put Seq2SeqTrainer in the main lib

* Apply suggestions from code review

Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

* Address review comments and adapt scripts

* Documentation

* Move test not using script to tests folder

Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
2020-12-22 11:33:44 -05:00
Sylvain Gugger
1198ba8fba Add timing inside Trainer (#9196)
* Add timing inside Trainer

* Fix tests

* Add n_objs for train

* Sort logs
2020-12-18 15:10:39 -05:00
Sylvain Gugger
9a67185344 Experimental support for fairscale ShardedDDP (#9139)
* Experimental support for fairscale ShardedDDP

* Add import error if fairscale not available

* Address review comments

* Fix seq2seq trainer
2020-12-16 13:47:48 -05:00
Sylvain Gugger
51adb97cd6 Fix fp16_backend field 2020-12-15 17:14:37 -05:00
Sylvain Gugger
ad895af98d Add possibility to switch between APEX and AMP in Trainer (#9137)
* Add possibility to switch between APEX and AMP in Trainer

* Update src/transformers/training_args.py

Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>

* Address review comments

* Update src/transformers/training_args.py

Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>

Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>
2020-12-15 16:38:10 -05:00
lewtun
ed1845ef4c Clarify use of TrainingArguments.disable_tqdm in Jupyter Notebooks (#9076)
* Clarify impact of disable_tqdm on Jupyter Notebooks

* Add weblink to argparse

* Replace "dev set" with more common "validation set" in do_eval

* Tweak prediction_loss_only

* Tweak description of Adam hyperparameters

* Add weblink to TensorBoard

* Capitalise apex

* Tweak local_rank description

* Add weblink for wandb

* Replace nlp with datasets

* Tweak grammar in model_parallel

* Capitalise apex

* Update TensorFlow training args to match PyTorch ones

* Fix style

* Fix underscore in weblink

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Fix underscore in weblink

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Fix underscore in weblink

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Fix underscore in weblink

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Add obj to datasets.Dataset

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
2020-12-15 09:00:19 -05:00
Navjot
d6af344c9e correct var name in TrainingArguments docstring (#9096) 2020-12-14 09:02:54 -05:00
Sylvain Gugger
00aa9dbca2 Copyright (#8970)
* Add copyright everywhere missing

* Style
2020-12-07 18:36:34 -05:00
Sylvain Gugger
b08843cf4d Add a parallel_mode property to TrainingArguments (#8877)
* Add a `distributed_env` property to TrainingArguments

* Change name

* Address comment
2020-12-01 13:46:09 -05:00
Sylvain Gugger
7c10dd22ae Better support for resuming training (#8878) 2020-12-01 13:45:21 -05:00
Sylvain Gugger
49759c0cda Document new training argument 2020-11-23 15:02:59 -05:00
alexorona
1cd9be2aeb gpt2 and t5 parallel modeling (#8696)
* gpt2 and t5 parallel modeling

* model_parallel utils update

* adding missing model_parallel_utils

Adds missing model_parallel_utils and reverses the changes to code in modeling_gpt2 and modeling_t5

* training_args reformat

Reformatted training_args

* style formatting

Style formatting doc string length on training_args and model_parallel_utils

* style changes

make style && make quality for training_args and model_parallel_utils.

* adding tests

* minor change in trainer

reverts loss calculation

* Update training_args.py

* Update training_args.py

added back docstring language for adam_beta1 and adam_beta2

* Update trainer.py

* Update src/transformers/trainer.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Fix style & rebase

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Co-authored-by: LysandreJik <lysandre.debut@reseau.eseo.fr>
2020-11-23 14:41:23 -05:00
Sylvain Gugger
63e91f5fde Document adam betas TrainingArguments (#8688) 2020-11-20 09:27:25 -05:00
Sylvain Gugger
dd52804f5f Remove deprecated (#8604)
* Remove old deprecated arguments

Co-authored-by: LysandreJik <lysandre.debut@reseau.eseo.fr>

* Remove needless imports

* Fix tests

Co-authored-by: LysandreJik <lysandre.debut@reseau.eseo.fr>
2020-11-17 15:11:29 -05:00
Philip May
6a064447f2 improve documentation of training_args.py (#8270)
* improve documentation of training_args.py

- do_train
- do_eval
- do_predict

* fix line too long

* fix style with black on training_args.py

* Update src/transformers/training_args.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Update src/transformers/training_args.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Update src/transformers/training_args.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* fix line length with utils/style_doc

* black reformatting

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
2020-11-03 15:57:17 -05:00
Abi See
8f1c960ee7 Fix two bugs with --logging_first_step (#8193)
* make sure that logging_first_step evaluates

* fix bug with incorrect loss on logging_first_step

* fix style

* logging_first_step only logs, not evals
2020-10-30 16:45:38 -04:00
Santiago Castro
969859d5f6 Fix doc errors and typos across the board (#8139)
* Fix doc errors and typos across the board

* Fix a typo

* Fix the CI

* Fix more typos

* Fix CI

* More fixes

* Fix CI

* More fixes

* More fixes
2020-10-29 10:33:33 -04:00
Sylvain Gugger
c42596bc07 Doc styling fixes (#8074)
* Fix a few docstrings

* More fixes

* Styling
2020-10-27 07:54:50 -04:00
Sylvain Gugger
08f534d2da Doc styling (#8067)
* Important files

* Styling them all

* Revert "Styling them all"

This reverts commit 7d029395fdae8513b8281cbc2a6c239f8093503e.

* Styling them for realsies

* Fix syntax error

* Fix benchmark_utils

* More fixes

* Fix modeling auto and script

* Remove new line

* Fixes

* More fixes

* Fix more files

* Style

* Add FSMT

* More fixes

* More fixes

* More fixes

* More fixes

* Fixes

* More fixes

* More fixes

* Last fixes

* Make sphinx happy
2020-10-26 18:26:02 -04:00
Lysandre Debut
3a10764574 Fix TF training arguments instantiation (#8063) 2020-10-26 14:39:25 -04:00
Bram Vanroy
55bcd0cb59 Raise error when using AMP on non-CUDA device (#7869)
* Raise error when using AMP on non-CUDA device

* make style

* make style
2020-10-19 15:59:30 -04:00
Sylvain Gugger
bb9559a7f9 Don't use store_xxx on optional bools (#7786)
* Don't use `store_xxx` on optional bools

* Refine test

* Refine test
2020-10-14 12:05:02 -04:00
Sylvain Gugger
a1d1b332d0 Add predict step accumulation (#7767)
* Add eval_accumulation_step and clean distributed eval

* Add TPU test

* Add TPU stuff

* Fix arg name

* Fix Seq2SeqTrainer

* Fix total_size

* Update src/transformers/trainer_pt_utils.py

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>

* Doc and add test to TPU

* Add unit test

* Adapt name

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
2020-10-14 11:41:45 -04:00
Tiger
7e73c12805 fixed lots of typos. (#7758) 2020-10-13 10:00:20 -04:00
Sylvain Gugger
08ba4b4902 Trainer callbacks (#7596)
* Initial callback proposal

* Finish various callbacks

* Post-rebase conflicts

* Fix tests

* Don't use something that's not set

* Documentation

* Remove unwanted print.

* Document all models can work

* Add tests + small fixes

* Update docs/source/internal/trainer_utils.rst

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>

* Address review comments

* Fix TF tests

* Real fix this time

* This one should work

* Fix typo

* Really fix typo

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
2020-10-07 10:50:21 -04:00
Sylvain Gugger
ca05c2a47d Fix post_init of some TrainingArguments (#7525) 2020-10-05 09:19:16 -04:00
Sylvain Gugger
a97a73e0ee Small QOL improvements to TrainingArguments (#7475)
* Small QOL improvements to TrainingArguments

* With the self.
2020-09-30 12:12:03 -04:00
Sylvain Gugger
52e8392b7e Add automatic best model loading to Trainer (#7431)
* Add automatic best model loading to Trainer

* Some small fixes

* Formatting
2020-09-29 10:41:18 -04:00
Sylvain Gugger
f5518e5631 Formatting 2020-09-22 14:55:12 -04:00
Chady Kamar
17099ebd58 Add num workers cli arg (#7322)
* Add dataloader_num_workers to TrainingArguments

This argument is meant to be used to set the
number of workers for the PyTorch DataLoader.

* Pass num_workers argument on DataLoader init
2020-09-22 14:44:42 -04:00
Sylvain Gugger
89edf504bf Add possibility to evaluate every epoch (#7302)
* Add possibility to evaluate every epoch

* Remove multitype arg

* Remove needless import

* Use a proper enum

* Apply suggestions from @LysandreJik

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>

* One else and formatting

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
2020-09-22 09:52:29 -04:00
Sylvain Gugger
492bb6aa48 Trainer multi label (#7191)
* Trainer accepts multiple labels

* Missing import

* Fix docstrings
2020-09-17 08:15:37 -04:00
Sylvain Gugger
08de989a0a Trainer with grad accum (#6930)
* Add warning for gradient accumulation

* Formatting
2020-09-07 04:54:00 -04:00
Lysandre
a75c64d80c Black 20 release 2020-08-26 17:20:22 +02:00
Lysandre Debut
77abd1e79f Centralize logging (#6434)
* Logging

* Style

* hf_logging > utils.logging

* Address @thomwolf's comments

* Update test

* Update src/transformers/benchmark/benchmark_utils.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Revert bad change

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
2020-08-26 11:10:36 -04:00
Sylvain Gugger
3a7fdd3f52 Add hyperparameter search to Trainer (#6576)
* Add optuna hyperparameter search to Trainer

* @julien-c suggestions

Co-authored-by: Julien Chaumond <chaumond@gmail.com>

* Make compute_objective an arg function

* Formatting

* Rework to make it easier to add ray

* Formatting

* Initial support for Ray

* Formatting

* Polish and finalize

* Add trial id to checkpoint with Ray

* Smaller default

* Use GPU in ray if available

* Formatting

* Fix test

* Update install instruction

Co-authored-by: Richard Liaw <rliaw@berkeley.edu>

* Address review comments

* Formatting post-merge

Co-authored-by: Julien Chaumond <chaumond@gmail.com>
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
2020-08-24 11:48:45 -04:00
Sylvain Gugger
b30879fe0c Don't reset the dataset type + plug for rm unused columns (#6683)
* Don't reset the type of the dataset

* Formatting

* Update trainer.py

Co-authored-by: Teven <teven.lescao@gmail.com>
2020-08-24 09:22:03 -04:00
Sylvain Gugger
573bdb0a5d Add tests to Trainer (#6605)
* Add tests to Trainer

* Test if removing long breaks everything

* Remove ugly hack

* Fix distributed test

* Use float for number of epochs
2020-08-20 11:13:50 -04:00
Sylvain Gugger
34fabe1697 Move prediction_loss_only to TrainingArguments (#6426) 2020-08-12 08:03:45 -04:00
Teven
bd0eab351a Trainer + wandb quality of life logging tweaks (#6241)
* added `name` argument for wandb logging, also logging model config with trainer arguments

* Update src/transformers/training_args.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* added tf, post-review changes

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
2020-08-05 09:05:52 -04:00
Jay Mody
cedc547e7e Adds train_batch_size, eval_batch_size, and n_gpu to to_sanitized_dict output for logging. (#5331)
* Adds train_batch_size, eval_batch_size, and n_gpu to to_sanitized_dict() output

* Update wandb config logging to use to_sanitized_dict

* removed n_gpu from sanitized dict

* fix quality check errors
2020-08-03 09:00:39 -04:00
Gong Linyuan
b21993b362 Allow to set Adam beta1, beta2 in TrainingArgs (#5592)
* Add Adam beta1, beta2 to trainer

* Make style consistent
2020-07-27 05:31:37 -04:00
Alan deLevie
223bad242d fix typo in (#5893) 2020-07-20 03:53:03 -04:00
Sylvain Gugger
734a28a767 Clean up diffs in Trainer/TFTrainer (#5417)
* Cleanup and unify Trainer/TFTrainer

* Forgot to adapt TFTrainingArgs

* In tf scripts n_gpu -> n_replicas

* Update src/transformers/training_args.py

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>

* Address review comments

* Formatting

* Fix typo

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
2020-07-01 11:00:20 -04:00
Sylvain Gugger
64e3d966b1 Add support for past states (#5399)
* Add support for past states

* Style and forgotten self

* You mean, documenting is not enough? I have to actually add it too?

* Add memory support during evaluation

* Fix tests in eval and add TF support

* No need to change this line anymore
2020-07-01 08:11:55 -04:00
Sylvain Gugger
87716a6d07 Documentation for the Trainer API (#5383)
* Documentation for the Trainer API

* Address review comments

* Address comments
2020-06-30 11:43:43 -04:00