[Deepspeed] adapt multiple models, add zero_to_fp32 tests (#12477)

* zero_to_fp32 tests

* args change

* remove unnecessary work

* use transformers.trainer_utils.get_last_checkpoint

* document the new features

* cleanup

* wip

* fix fsmt

* add bert

* cleanup

* add xlm-roberta

* electra works

* cleanup

* sync

* split off the model zoo tests

* cleanup

* cleanup

* cleanup

* cleanup

* reformat

* cleanup

* casing

* deepspeed>=0.4.3

* adjust distilbert

* Update docs/source/main_classes/deepspeed.rst

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* style

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
This commit is contained in:
Stas Bekman
2021-07-13 12:07:32 -07:00
committed by GitHub
parent 65bf05cd18
commit 78f5fe1416
10 changed files with 444 additions and 80 deletions

View File

@@ -1456,8 +1456,56 @@ won't be possible to load it back.
While the fp16 weights are fine for resuming training, if you finished finetuning your model and want to upload it to
the `models hub <https://huggingface.co/models>`__ or pass it to someone else you most likely will want to get the fp32
weights. This cannot be done during training since this is a process that requires a lot of memory, and therefore this
is performed offline.
weights. This ideally shouldn't be done during training since this is a process that requires a lot of memory, and
therefore best to be performed offline after the training is complete. But if desired and you have plenty of free CPU
memory it can be done in the same training script. The following sections will discuss both approaches.
**Live FP32 Weights Recovery:**
This approach may not work if you model is large and you have little free CPU memory left, at the end of the training.
If you have saved at least one checkpoint, and you want to use the latest one, you can do the following:
.. code-block:: python
from transformers.trainer_utils import get_last_checkpoint
from deepspeed.utils.zero_to_fp32 import load_state_dict_from_zero_checkpoint
checkpoint_dir = get_last_checkpoint(trainer.args.output_dir)
fp32_model = load_state_dict_from_zero_checkpoint(trainer.model, checkpoint_dir)
If you're using the ``--load_best_model_at_end`` class:`~transformers.TrainingArguments` argument (to track the best
checkpoint), then you can finish the training by first saving the final model explicitly and then do the same as above:
.. code-block:: python
from deepspeed.utils.zero_to_fp32 import load_state_dict_from_zero_checkpoint
checkpoint_dir = os.path.join(trainer.args.output_dir, "checkpoint-final")
trainer.deepspeed.save_checkpoint(checkpoint_dir)
fp32_model = load_state_dict_from_zero_checkpoint(trainer.model, checkpoint_dir)
.. note::
Note, that once ``load_state_dict_from_zero_checkpoint`` was run, the ``model`` will no longer be useable in the
DeepSpeed context of the same application. i.e. you will need to re-initialize the deepspeed engine, since
``model.load_state_dict(state_dict)`` will remove all the DeepSpeed magic from it. So do this only at the very end
of the training.
Of course, you don't have to use class:`~transformers.Trainer` and you can adjust the examples above to your own
trainer.
If for some reason you want more refinement, you can also extract the fp32 ``state_dict`` of the weights and apply
these yourself as is shown in the following example:
.. code-block:: python
from deepspeed.utils.zero_to_fp32 import get_fp32_state_dict_from_zero_checkpoint
state_dict = get_fp32_state_dict_from_zero_checkpoint(checkpoint_dir) # already on cpu
model = model.cpu()
model.load_state_dict(state_dict)
**Offline FP32 Weights Recovery:**
DeepSpeed creates a special conversion script ``zero_to_fp32.py`` which it places in the top-level of the checkpoint
folder. Using this script you can extract the weights at any point. The script is standalone and you no longer need to
@@ -1486,15 +1534,16 @@ weights just run:
.. code-block:: bash
python zero_to_fp32.py global_step1 pytorch_model.bin
python zero_to_fp32.py . pytorch_model.bin
The script will automatically handle either ZeRO-2 or ZeRO-3 checkpoint.
This is it. ``pytorch_model.bin`` will now contain the full fp32 model weights consolidated from multiple GPUs.
The script will automatically be able to handle either a ZeRO-2 or ZeRO-3 checkpoint.
``python zero_to_fp32.py -h`` will give you usage details.
If you have multiple DeepSpeed checkpoint sub-folders, pick the one you know to have the desired weights.
This is it. ``pytorch_model.bin`` will now contain the full fp32 model weights consolidated from multiple GPUs.
The script will auto-discover the deepspeed sub-folder using the contents of the file ``latest``, which in the current
example will contain ``global_step1``.
Note: currently the script requires 2x general RAM of the final fp32 model weights.