From 101186bc1f4f4ebf8bf7b5949ceffd542f4a9ea8 Mon Sep 17 00:00:00 2001
From: Stas Bekman
Date: Mon, 26 Oct 2020 05:15:05 -0700
Subject: [PATCH] [docs] [testing] distributed training (#7993)

* distributed training
* fix
* fix formatting
* wording
---
 docs/source/testing.rst | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

diff --git a/docs/source/testing.rst b/docs/source/testing.rst
index 4f294be9ea..3b1d97f573 100644
--- a/docs/source/testing.rst
+++ b/docs/source/testing.rst
@@ -451,6 +451,24 @@ Inside tests:
+Distributed training
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+``pytest`` can't deal with distributed training directly. If this is attempted, the sub-processes don't do the right thing: they end up thinking they are ``pytest`` and start running the test suite in loops. It works, however, if the test spawns a normal process that then spawns multiple workers and manages the IO pipes.
+
+This is still under development, but you can study two tests that do this successfully:
+
+* `test_seq2seq_examples_multi_gpu.py `__ - a ``pytorch-lightning``-running test (had to use PL's ``ddp`` spawning method, which is the default)
+* `test_finetune_trainer.py `__ - a normal (non-PL) test
+
+To jump right to the execution point, search for the ``execute_async_std`` function in those tests.
+
+You will need at least 2 GPUs to see these tests in action:
+
+.. code-block:: bash
+
+    CUDA_VISIBLE_DEVICES="0,1" RUN_SLOW=1 pytest -sv examples/seq2seq/test_finetune_trainer.py \
+    examples/seq2seq/test_seq2seq_examples_multi_gpu.py
 Output capture
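
A note on the "Distributed training" section added by this patch: the workaround it describes (the test spawns a normal process, which in turn spawns the workers, while the test manages the IO pipes) can be sketched roughly as below. This is only an illustration and not the ``execute_async_std`` implementation from the referenced tests. The helper name ``run_and_capture`` and the demo command are assumptions made for this example. The child here is a trivial Python one-liner so that the sketch runs without GPUs.

```python
import asyncio
import sys


async def _read_stream(stream, sink):
    # Drain one pipe line by line so the child never blocks on a full buffer.
    while True:
        line = await stream.readline()
        if not line:
            break
        sink.append(line.decode().rstrip())


async def _run(cmd):
    # Start an ordinary child process with both output pipes attached.
    proc = await asyncio.create_subprocess_exec(
        *cmd,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.PIPE,
    )
    out, err = [], []
    # Read stdout and stderr concurrently, then collect the exit code.
    await asyncio.gather(
        _read_stream(proc.stdout, out),
        _read_stream(proc.stderr, err),
    )
    returncode = await proc.wait()
    return returncode, out, err


def run_and_capture(cmd):
    """Hypothetical helper: run cmd as a child process.

    Returns (returncode, stdout_lines, stderr_lines).
    """
    return asyncio.run(_run(cmd))


# Stand-in for a real launcher command (assumption: a real test would start
# a script that itself spawns one worker per GPU).
demo_cmd = [
    sys.executable,
    "-c",
    "import sys; print('worker ok'); print('warn', file=sys.stderr)",
]
returncode, stdout_lines, stderr_lines = run_and_capture(demo_cmd)
```

Inside a test, the returned exit code and captured lines would be what the assertions check. Because the workers are descendants of the plain child process rather than of ``pytest`` itself, they do not re-enter the test suite.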