[Deepspeed] adapt multiple models, add zero_to_fp32 tests (#12477)
* zero_to_fp32 tests * args change * remove unnecessary work * use transformers.trainer_utils.get_last_checkpoint * document the new features * cleanup * wip * fix fsmt * add bert * cleanup * add xlm-roberta * electra works * cleanup * sync * split off the model zoo tests * cleanup * cleanup * cleanup * cleanup * reformat * cleanup * casing * deepspeed>=0.4.3 * adjust distilbert * Update docs/source/main_classes/deepspeed.rst Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com> * style Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
This commit is contained in:
@@ -1456,8 +1456,56 @@ won't be possible to load it back.
|
||||
|
||||
While the fp16 weights are fine for resuming training, if you finished finetuning your model and want to upload it to
|
||||
the `models hub <https://huggingface.co/models>`__ or pass it to someone else you most likely will want to get the fp32
|
||||
weights. This cannot be done during training since this is a process that requires a lot of memory, and therefore this
|
||||
is performed offline.
|
||||
weights. This ideally shouldn't be done during training since this is a process that requires a lot of memory, and
|
||||
therefore best to be performed offline after the training is complete. But if desired and you have plenty of free CPU
|
||||
memory it can be done in the same training script. The following sections will discuss both approaches.
|
||||
|
||||
|
||||
**Live FP32 Weights Recovery:**
|
||||
|
||||
This approach may not work if you model is large and you have little free CPU memory left, at the end of the training.
|
||||
|
||||
If you have saved at least one checkpoint, and you want to use the latest one, you can do the following:
|
||||
|
||||
.. code-block:: python
|
||||
|
||||
from transformers.trainer_utils import get_last_checkpoint
|
||||
from deepspeed.utils.zero_to_fp32 import load_state_dict_from_zero_checkpoint
|
||||
checkpoint_dir = get_last_checkpoint(trainer.args.output_dir)
|
||||
fp32_model = load_state_dict_from_zero_checkpoint(trainer.model, checkpoint_dir)
|
||||
|
||||
If you're using the ``--load_best_model_at_end`` class:`~transformers.TrainingArguments` argument (to track the best
|
||||
checkpoint), then you can finish the training by first saving the final model explicitly and then do the same as above:
|
||||
|
||||
.. code-block:: python
|
||||
|
||||
from deepspeed.utils.zero_to_fp32 import load_state_dict_from_zero_checkpoint
|
||||
checkpoint_dir = os.path.join(trainer.args.output_dir, "checkpoint-final")
|
||||
trainer.deepspeed.save_checkpoint(checkpoint_dir)
|
||||
fp32_model = load_state_dict_from_zero_checkpoint(trainer.model, checkpoint_dir)
|
||||
|
||||
.. note::
|
||||
|
||||
Note, that once ``load_state_dict_from_zero_checkpoint`` was run, the ``model`` will no longer be useable in the
|
||||
DeepSpeed context of the same application. i.e. you will need to re-initialize the deepspeed engine, since
|
||||
``model.load_state_dict(state_dict)`` will remove all the DeepSpeed magic from it. So do this only at the very end
|
||||
of the training.
|
||||
|
||||
Of course, you don't have to use class:`~transformers.Trainer` and you can adjust the examples above to your own
|
||||
trainer.
|
||||
|
||||
If for some reason you want more refinement, you can also extract the fp32 ``state_dict`` of the weights and apply
|
||||
these yourself as is shown in the following example:
|
||||
|
||||
.. code-block:: python
|
||||
|
||||
from deepspeed.utils.zero_to_fp32 import get_fp32_state_dict_from_zero_checkpoint
|
||||
state_dict = get_fp32_state_dict_from_zero_checkpoint(checkpoint_dir) # already on cpu
|
||||
model = model.cpu()
|
||||
model.load_state_dict(state_dict)
|
||||
|
||||
|
||||
**Offline FP32 Weights Recovery:**
|
||||
|
||||
DeepSpeed creates a special conversion script ``zero_to_fp32.py`` which it places in the top-level of the checkpoint
|
||||
folder. Using this script you can extract the weights at any point. The script is standalone and you no longer need to
|
||||
@@ -1486,15 +1534,16 @@ weights just run:
|
||||
|
||||
.. code-block:: bash
|
||||
|
||||
python zero_to_fp32.py global_step1 pytorch_model.bin
|
||||
python zero_to_fp32.py . pytorch_model.bin
|
||||
|
||||
The script will automatically handle either ZeRO-2 or ZeRO-3 checkpoint.
|
||||
This is it. ``pytorch_model.bin`` will now contain the full fp32 model weights consolidated from multiple GPUs.
|
||||
|
||||
The script will automatically be able to handle either a ZeRO-2 or ZeRO-3 checkpoint.
|
||||
|
||||
``python zero_to_fp32.py -h`` will give you usage details.
|
||||
|
||||
If you have multiple DeepSpeed checkpoint sub-folders, pick the one you know to have the desired weights.
|
||||
|
||||
This is it. ``pytorch_model.bin`` will now contain the full fp32 model weights consolidated from multiple GPUs.
|
||||
The script will auto-discover the deepspeed sub-folder using the contents of the file ``latest``, which in the current
|
||||
example will contain ``global_step1``.
|
||||
|
||||
Note: currently the script requires 2x general RAM of the final fp32 model weights.
|
||||
|
||||
|
||||
Reference in New Issue
Block a user