docs: replace torch.distributed.run by torchrun (#27528)
* docs: replace torch.distributed.run by torchrun

  `transformers` now officially supports pytorch >= 1.10.
  The entrypoint `torchrun` is present from 1.10 onwards.

  Signed-off-by: Peter Pan <Peter.Pan@daocloud.io>

* Update src/transformers/trainer.py with @ArthurZucker's suggestion

  Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>

---------

Signed-off-by: Peter Pan <Peter.Pan@daocloud.io>
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
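
Note for scripts launched with the commands changed in this commit: the legacy `torch.distributed.launch` launcher passes the local rank to the script as a `--local_rank` argument (spelled `--local-rank` in newer PyTorch releases), whereas `torchrun` exports it as the `LOCAL_RANK` environment variable. The sketch below is not part of this commit; it shows one way a custom script could accept both conventions. The helper name `resolve_local_rank` and the `-1` fallback for non-distributed runs are assumptions made for illustration.

```python
import argparse
import os


def resolve_local_rank(argv=None, environ=None):
    """Return the local rank from the CLI (legacy launcher) or the environment (torchrun).

    Hypothetical helper, not a transformers or PyTorch API.
    """
    environ = os.environ if environ is None else environ

    parser = argparse.ArgumentParser(add_help=False, allow_abbrev=False)
    # torch.distributed.launch appends --local_rank=N (or --local-rank=N) to the script arguments.
    parser.add_argument("--local_rank", "--local-rank", type=int, default=None)
    args, _unknown = parser.parse_known_args(argv)

    if args.local_rank is not None:
        return args.local_rank
    # torchrun sets LOCAL_RANK for every worker; -1 is an assumed marker for "not distributed".
    return int(environ.get("LOCAL_RANK", -1))


if __name__ == "__main__":
    print(f"local rank: {resolve_local_rank()}")
```

Scripts that never parse the local rank themselves likely need no change beyond the launcher swap shown in the hunks below.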
@@ -18,7 +18,7 @@ in Huang et al. [Improve Transformer Models with Better Relative Position Embedd

 ```bash
 export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
-python -m torch.distributed.launch --nproc_per_node=8 ./examples/question-answering/run_squad.py \
+torchrun --nproc_per_node=8 ./examples/question-answering/run_squad.py \
 --model_name_or_path zhiheng-huang/bert-base-uncased-embedding-relative-key-query \
 --dataset_name squad \
 --do_train \
@@ -46,7 +46,7 @@ gpu training leads to the f1 score of 90.71.

 ```bash
 export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
-python -m torch.distributed.launch --nproc_per_node=8 ./examples/question-answering/run_squad.py \
+torchrun --nproc_per_node=8 ./examples/question-answering/run_squad.py \
 --model_name_or_path zhiheng-huang/bert-large-uncased-whole-word-masking-embedding-relative-key-query \
 --dataset_name squad \
 --do_train \
@@ -68,7 +68,7 @@ Training with the above command leads to the f1 score of 93.52, which is slightl
 Here is an example using distributed training on 8 V100 GPUs and Bert Whole Word Masking uncased model to reach a F1 > 93 on SQuAD1.1:

 ```bash
-python -m torch.distributed.launch --nproc_per_node=8 ./examples/question-answering/run_squad.py \
+torchrun --nproc_per_node=8 ./examples/question-answering/run_squad.py \
 --model_name_or_path bert-large-uncased-whole-word-masking \
 --dataset_name squad \
 --do_train \
@@ -140,7 +140,7 @@ python finetune_trainer.py --help

 For multi-gpu training use `torch.distributed.launch`, e.g. with 2 gpus:
 ```bash
-python -m torch.distributed.launch --nproc_per_node=2 finetune_trainer.py ...
+torchrun --nproc_per_node=2 finetune_trainer.py ...
 ```

 **At the moment, `Seq2SeqTrainer` does not support *with teacher* distillation.**
@@ -214,7 +214,7 @@ because it uses SortishSampler to minimize padding. You can also use it on 1 GPU
 `{type_path}.source` and `{type_path}.target`. Run `./run_distributed_eval.py --help` for all clargs.

 ```bash
-python -m torch.distributed.launch --nproc_per_node=8 run_distributed_eval.py \
+torchrun --nproc_per_node=8 run_distributed_eval.py \
 --model_name sshleifer/distilbart-large-xsum-12-3 \
 --save_dir xsum_generations \
 --data_dir xsum \
@@ -98,7 +98,7 @@ the [Trainer API](https://huggingface.co/transformers/main_classes/trainer.html)
 use the following command:

 ```bash
-python -m torch.distributed.launch \
+torchrun \
 --nproc_per_node number_of_gpu_you_have path_to_script.py \
 --all_arguments_of_the_script
 ```
@@ -107,7 +107,7 @@ As an example, here is how you would fine-tune the BERT large model (with whole
 classification MNLI task using the `run_glue` script, with 8 GPUs:

 ```bash
-python -m torch.distributed.launch \
+torchrun \
 --nproc_per_node 8 pytorch/text-classification/run_glue.py \
 --model_name_or_path bert-large-uncased-whole-word-masking \
 --task_name mnli \
@@ -100,7 +100,7 @@ of **0.35**.
 The following command shows how to fine-tune [XLSR-Wav2Vec2](https://huggingface.co/transformers/main/model_doc/xlsr_wav2vec2.html) on [Common Voice](https://huggingface.co/datasets/common_voice) using 8 GPUs in half-precision.

 ```bash
-python -m torch.distributed.launch \
+torchrun \
 --nproc_per_node 8 run_speech_recognition_ctc.py \
 --dataset_name="common_voice" \
 --model_name_or_path="facebook/wav2vec2-large-xlsr-53" \
@@ -147,7 +147,7 @@ However, the `--shuffle_buffer_size` argument controls how many examples we can


 ```bash
-**python -m torch.distributed.launch \
+**torchrun \
 --nproc_per_node 4 run_speech_recognition_ctc_streaming.py \
 --dataset_name="common_voice" \
 --model_name_or_path="facebook/wav2vec2-xls-r-300m" \
@@ -404,7 +404,7 @@ If training on a different language, you should be sure to change the `language`
 #### Multi GPU Whisper Training
 The following example shows how to fine-tune the [Whisper small](https://huggingface.co/openai/whisper-small) checkpoint on the Hindi subset of [Common Voice 11](https://huggingface.co/datasets/mozilla-foundation/common_voice_11_0) using 2 GPU devices in half-precision:
 ```bash
-python -m torch.distributed.launch \
+torchrun \
 --nproc_per_node 2 run_speech_recognition_seq2seq.py \
 --model_name_or_path="openai/whisper-small" \
 --dataset_name="mozilla-foundation/common_voice_11_0" \
@@ -572,7 +572,7 @@ cross-entropy loss of **0.405** and word error rate of **0.0728**.
 The following command shows how to fine-tune [XLSR-Wav2Vec2](https://huggingface.co/transformers/main/model_doc/xlsr_wav2vec2.html) on [Common Voice](https://huggingface.co/datasets/common_voice) using 8 GPUs in half-precision.

 ```bash
-python -m torch.distributed.launch \
+torchrun \
 --nproc_per_node 8 run_speech_recognition_seq2seq.py \
 --dataset_name="librispeech_asr" \
 --model_name_or_path="./" \