add a network debug script and document it (#15652)
* add a network debug script and document it
* doc
@@ -12,6 +12,35 @@ specific language governing permissions and limitations under the License.

# Debugging

## Multi-GPU Network Issues Debug

When training or running inference with `DistributedDataParallel` on multiple GPUs, if you run into communication issues between processes and/or nodes, you can use the following script to diagnose network issues.

```bash
wget https://raw.githubusercontent.com/huggingface/transformers/master/scripts/distributed/torch-distributed-gpu-test.py
```

For example, to test how 2 GPUs interact, run:

```bash
python -m torch.distributed.run --nproc_per_node 2 --nnodes 1 torch-distributed-gpu-test.py
```

If both processes can talk to each other and allocate GPU memory, each will print an OK status.

For more GPUs or nodes, adjust the launcher arguments (`--nproc_per_node` and `--nnodes`) accordingly.

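As a rough illustration of how those two arguments scale, the sketch below builds the same launch command for an arbitrary number of GPUs and nodes. The helper name `build_launch_command` is hypothetical and not part of the diagnostics script. Multi-node runs also need rendezvous settings, which this sketch leaves out.

```python
import shlex


def build_launch_command(nproc_per_node, nnodes=1,
                         script="torch-distributed-gpu-test.py"):
    """Return the argv list for launching the diagnostics script.

    Hypothetical helper: it only fills in the two arguments shown in
    the command above and does not start any process.
    """
    if nproc_per_node < 1 or nnodes < 1:
        raise ValueError("nproc_per_node and nnodes must both be at least 1")
    return [
        "python", "-m", "torch.distributed.run",
        "--nproc_per_node", str(nproc_per_node),
        "--nnodes", str(nnodes),
        script,
    ]


if __name__ == "__main__":
    # 4 GPUs on a single node; print the command so it can be inspected or copied.
    print(shlex.join(build_launch_command(nproc_per_node=4)))
```

Printing the command rather than executing it keeps the sketch side-effect free, so the result can be checked before it is run on a real machine.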

You will find many more details inside the diagnostics script, including a recipe for running it in a SLURM environment.

An additional level of debugging is to add the `NCCL_DEBUG=INFO` environment variable as follows:

```bash
NCCL_DEBUG=INFO python -m torch.distributed.run --nproc_per_node 2 --nnodes 1 torch-distributed-gpu-test.py
```

This will dump a lot of NCCL-related debug information, which you can then search online if you find that some problems are reported. If you're not sure how to interpret the output, you can share the log file in an Issue.
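Since the output can be long, one possible way to narrow it down is to pull out only the warning lines before searching or sharing. The sketch below assumes the command's output was redirected to a file and that NCCL marks its warnings with the text `NCCL WARN`. The function name, the script name, and the `nccl.log` file name are all hypothetical.

```python
import sys


def find_nccl_warnings(lines, marker="NCCL WARN"):
    """Return (line_number, text) pairs for lines containing the marker.

    Assumption: warnings in the debug output contain the marker text.
    """
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(lines, start=1)
        if marker in line
    ]


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage (hypothetical file names): python find_nccl_warnings.py nccl.log
    with open(sys.argv[1], encoding="utf-8", errors="replace") as log_file:
        for number, text in find_nccl_warnings(log_file):
            print(f"{number}: {text}")
```

The line numbers make it easier to find the surrounding context in the full log, which is often needed to understand what led up to a warning.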

## Underflow and Overflow Detection

<Tip>