From 393739194e59c01a1f89aa1df49614ea2207dc1e Mon Sep 17 00:00:00 2001
From: Stas Bekman
Date: Wed, 17 Mar 2021 12:48:35 -0700
Subject: [PATCH] [examples] document resuming (#10776)

* document resuming in examples

* fix

* Apply suggestions from code review

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* put trainer code last, adjust notes

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
---
 examples/README.md | 19 +++++++++++++++++--
 1 file changed, 17 insertions(+), 2 deletions(-)

diff --git a/examples/README.md b/examples/README.md
index f95d76d8df..53bb8a5f6a 100644
--- a/examples/README.md
+++ b/examples/README.md
@@ -95,6 +95,21 @@ Coming soon!
 | [**`translation`**](https://github.com/huggingface/transformers/tree/master/examples/seq2seq) | WMT | ✅ | - | - | -
 
 
+
+## Resuming training
+
+You can resume training from a previous checkpoint like this:
+
+1. Pass `--output_dir previous_output_dir` without `--overwrite_output_dir` to resume training from the latest checkpoint in `output_dir` (what you would use if the training was interrupted, for instance).
+2. Pass `--model_name_or_path path_to_a_specific_checkpoint` to resume training from that checkpoint folder.
+
+Should you want to turn an example into a notebook where you'd no longer have access to the command
+line, 🤗 Trainer supports resuming from a checkpoint via `trainer.train(resume_from_checkpoint)`.
+
+1. If `resume_from_checkpoint` is `True` it will look for the last checkpoint in the value of `output_dir` passed via `TrainingArguments`.
+2. If `resume_from_checkpoint` is a path to a specific checkpoint it will use that saved checkpoint folder to resume the training from.
+
+
 ## Distributed training and mixed precision
 
 All the PyTorch scripts mentioned above work out of the box with distributed training and mixed precision, thanks to
@@ -104,7 +119,7 @@ use the following command:
 ```bash
 python -m torch.distributed.launch \
     --nproc_per_node number_of_gpu_you_have path_to_script.py \
- --all_arguments_of_the_script
+ --all_arguments_of_the_script
 ```
 
 As an example, here is how you would fine-tune the BERT large model (with whole word masking) on the text
@@ -148,7 +163,7 @@ regular training script with its arguments (this is similar to the `torch.distri
 ```bash
 python xla_spawn.py --num_cores num_tpu_you_have \
     path_to_script.py \
- --all_arguments_of_the_script
+ --all_arguments_of_the_script
 ```
 
 As an example, here is how you would fine-tune the BERT large model (with whole word masking) on the text
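
Note on the "Resuming training" text added by this patch: the two `resume_from_checkpoint` cases (`True` versus an explicit path) can be illustrated with a small standard-library sketch. This is not the Trainer's actual implementation. The helper names `find_last_checkpoint` and `resolve_resume_path` are hypothetical, and the sketch assumes checkpoint folders inside `output_dir` are named `checkpoint-<step>`.

```python
import os
import re

# Assumption: checkpoints are saved as sub-folders named "checkpoint-<step>".
_CHECKPOINT_RE = re.compile(r"^checkpoint-(\d+)$")


def find_last_checkpoint(output_dir):
    """Return the checkpoint folder with the highest step number, or None.

    Hypothetical helper for illustration only.
    """
    candidates = []
    for name in os.listdir(output_dir):
        match = _CHECKPOINT_RE.match(name)
        if match and os.path.isdir(os.path.join(output_dir, name)):
            # Compare steps numerically so that 1000 sorts after 500.
            candidates.append((int(match.group(1)), name))
    if not candidates:
        return None
    _, last_name = max(candidates)
    return os.path.join(output_dir, last_name)


def resolve_resume_path(resume_from_checkpoint, output_dir):
    """Mirror the two documented cases.

    - True: use the latest checkpoint found in output_dir.
    - str:  use that specific checkpoint folder as given.
    - anything else: do not resume (returns None).
    """
    if resume_from_checkpoint is True:
        return find_last_checkpoint(output_dir)
    if isinstance(resume_from_checkpoint, str):
        return resume_from_checkpoint
    return None
```

In a notebook, the value passed to `trainer.train(resume_from_checkpoint)` would play the role of the first argument here, and the `output_dir` given to `TrainingArguments` would play the role of the second.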