Fix pad across processes dim in trainer and not being able to set the timeout (#24775)

* dim, and rm copy

* Don't rm copy for now

* Oops

* pad index

* Should be a working test

* Trickle down ddp timeout

* Put fix back in now that testing locally is done

* Better comment specifying timeout

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

Zach Mueller
2023-07-12 10:01:51 -04:00
committed by GitHub
parent 4f85aaa6c9
commit 0284285501
3 changed files with 52 additions and 4 deletions

@@ -1714,7 +1714,9 @@ class TrainingArguments:
             del os.environ["ACCELERATE_USE_DEEPSPEED"]
             self._n_gpu = 1
         else:
-            self.distributed_state = PartialState(backend=self.ddp_backend)
+            self.distributed_state = PartialState(
+                backend=self.ddp_backend, timeout=timedelta(seconds=self.ddp_timeout)
+            )
             self._n_gpu = 1
         if not is_sagemaker_mp_enabled():
             device = self.distributed_state.device
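
The hunk above forwards `ddp_timeout` (an integer number of seconds) to `PartialState` as a `timedelta`, so the configured value is no longer ignored on this code path. The following is a minimal, standard-library-only sketch of that conversion. `DemoArgs` and `partial_state_kwargs` are hypothetical names introduced for illustration, and the 1800-second default is an assumption of this sketch. The real code passes these values straight to `PartialState`.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Optional


@dataclass
class DemoArgs:
    """Hypothetical stand-in for the two TrainingArguments fields used in the diff."""

    ddp_backend: Optional[str] = None
    ddp_timeout: int = 1800  # seconds; assumed default for this sketch


def partial_state_kwargs(args: DemoArgs) -> dict:
    """Build the keyword arguments that the patched call hands to PartialState."""
    return {
        "backend": args.ddp_backend,
        # The integer number of seconds is wrapped in a timedelta.
        "timeout": timedelta(seconds=args.ddp_timeout),
    }


kwargs = partial_state_kwargs(DemoArgs(ddp_backend="nccl", ddp_timeout=7200))
print(kwargs["timeout"])  # 2:00:00
```

Building the keyword arguments in one helper keeps the seconds-to-`timedelta` conversion in a single place, which makes it easy to check without starting a distributed process group.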