feat: add support for tensor parallel training workflow with accelerate (#34194)
* feat: add support for tensor parallel flow using accelerate Signed-off-by: Mehant Kammakomati <mehant.kammakomati2@ibm.com> * fix: add tp degree to env variable Signed-off-by: Mehant Kammakomati <mehant.kammakomati2@ibm.com> * fix: add version check for accelerate to allow TP Signed-off-by: Mehant Kammakomati <mehant.kammakomati2@ibm.com> * docs: tensor parallelism Signed-off-by: Mehant Kammakomati <mehant.kammakomati2@ibm.com> * nit: rename plugin name Signed-off-by: Mehant Kammakomati <mehant.kammakomati2@ibm.com> * fix: guard accelerate version before allow tp Signed-off-by: Mehant Kammakomati <mehant.kammakomati2@ibm.com> * docs: add more docs and updates related to TP Signed-off-by: Mehant Kammakomati <mehant.kammakomati2@ibm.com> --------- Signed-off-by: Mehant Kammakomati <mehant.kammakomati2@ibm.com> Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>
This commit is contained in:
committed by
GitHub
parent
e6cc410d5b
commit
c3ba53303b
@@ -799,6 +799,29 @@ tpu_use_sudo: false
|
||||
use_cpu: false
|
||||
```
|
||||
|
||||
</hfoption>
|
||||
<hfoption id="Tensor Parallelism with PyTorch 2">
|
||||
|
||||
```yml
|
||||
compute_environment: LOCAL_MACHINE
|
||||
tp_config:
|
||||
tp_size: 4
|
||||
distributed_type: TP
|
||||
downcast_bf16: 'no'
|
||||
machine_rank: 0
|
||||
main_training_function: main
|
||||
mixed_precision: 'no'
|
||||
num_machines: 1
|
||||
num_processes: 4
|
||||
rdzv_backend: static
|
||||
same_network: true
|
||||
tpu_env: []
|
||||
tpu_use_cluster: false
|
||||
tpu_use_sudo: false
|
||||
use_cpu: false
|
||||
|
||||
```
|
||||
|
||||
</hfoption>
|
||||
</hfoptions>
|
||||
|
||||
|
||||
Reference in New Issue
Block a user