[Deepspeed] adapt multiple models, add zero_to_fp32 tests (#12477)
* zero_to_fp32 tests
* args change
* remove unnecessary work
* use transformers.trainer_utils.get_last_checkpoint
* document the new features
* cleanup
* wip
* fix fsmt
* add bert
* cleanup
* add xlm-roberta
* electra works
* cleanup
* sync
* split off the model zoo tests
* cleanup
* cleanup
* cleanup
* cleanup
* reformat
* cleanup
* casing
* deepspeed>=0.4.3
* adjust distilbert
* Update docs/source/main_classes/deepspeed.rst

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* style

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
@@ -1456,8 +1456,56 @@ won't be possible to load it back.
|
|||||||
|
|
||||||
While the fp16 weights are fine for resuming training, if you finished finetuning your model and want to upload it to
|
While the fp16 weights are fine for resuming training, if you finished finetuning your model and want to upload it to
|
||||||
the `models hub <https://huggingface.co/models>`__ or pass it to someone else you most likely will want to get the fp32
|
the `models hub <https://huggingface.co/models>`__ or pass it to someone else you most likely will want to get the fp32
|
||||||
weights. This cannot be done during training since this is a process that requires a lot of memory, and therefore this
|
weights. This ideally shouldn't be done during training since this is a process that requires a lot of memory, and
|
||||||
is performed offline.
|
is therefore best performed offline after the training is complete. But if desired and you have plenty of free CPU
|
||||||
|
memory it can be done in the same training script. The following sections will discuss both approaches.
|
||||||
|
|
||||||
|
|
||||||
|
**Live FP32 Weights Recovery:**
|
||||||
|
|
||||||
|
This approach may not work if your model is large and you have little free CPU memory left at the end of the training.
|
||||||
|
|
||||||
|
If you have saved at least one checkpoint, and you want to use the latest one, you can do the following:
|
||||||
|
|
||||||
|
.. code-block:: python
|
||||||
|
|
||||||
|
from transformers.trainer_utils import get_last_checkpoint
|
||||||
|
from deepspeed.utils.zero_to_fp32 import load_state_dict_from_zero_checkpoint
|
||||||
|
checkpoint_dir = get_last_checkpoint(trainer.args.output_dir)
|
||||||
|
fp32_model = load_state_dict_from_zero_checkpoint(trainer.model, checkpoint_dir)
|
||||||
|
|
||||||
|
If you're using the ``--load_best_model_at_end`` :class:`~transformers.TrainingArguments` argument (to track the best
|
||||||
|
checkpoint), then you can finish the training by first saving the final model explicitly and then doing the same as above:
|
||||||
|
|
||||||
|
.. code-block:: python
|
||||||
|
|
||||||
|
import os

from deepspeed.utils.zero_to_fp32 import load_state_dict_from_zero_checkpoint
|
||||||
|
checkpoint_dir = os.path.join(trainer.args.output_dir, "checkpoint-final")
|
||||||
|
trainer.deepspeed.save_checkpoint(checkpoint_dir)
|
||||||
|
fp32_model = load_state_dict_from_zero_checkpoint(trainer.model, checkpoint_dir)
|
||||||
|
|
||||||
|
.. note::
|
||||||
|
|
||||||
|
Note that once ``load_state_dict_from_zero_checkpoint`` has run, the ``model`` will no longer be usable in the
|
||||||
|
DeepSpeed context of the same application, i.e. you will need to re-initialize the DeepSpeed engine, since
|
||||||
|
``model.load_state_dict(state_dict)`` will remove all the DeepSpeed magic from it. So do this only at the very end
|
||||||
|
of the training.
|
||||||
|
|
||||||
|
Of course, you don't have to use :class:`~transformers.Trainer` and you can adjust the examples above to your own
|
||||||
|
trainer.
|
||||||
|
|
||||||
|
If for some reason you want more refinement, you can also extract the fp32 ``state_dict`` of the weights and apply
|
||||||
|
these yourself, as shown in the following example:
|
||||||
|
|
||||||
|
.. code-block:: python
|
||||||
|
|
||||||
|
from deepspeed.utils.zero_to_fp32 import get_fp32_state_dict_from_zero_checkpoint
|
||||||
|
state_dict = get_fp32_state_dict_from_zero_checkpoint(checkpoint_dir) # already on cpu
|
||||||
|
model = model.cpu()
|
||||||
|
model.load_state_dict(state_dict)
|
||||||
|
|
||||||
|
|
||||||
|
**Offline FP32 Weights Recovery:**
|
||||||
|
|
||||||
DeepSpeed creates a special conversion script ``zero_to_fp32.py`` which it places in the top-level of the checkpoint
|
DeepSpeed creates a special conversion script ``zero_to_fp32.py`` which it places in the top-level of the checkpoint
|
||||||
folder. Using this script you can extract the weights at any point. The script is standalone and you no longer need to
|
folder. Using this script you can extract the weights at any point. The script is standalone and you no longer need to
|
||||||
@@ -1486,15 +1534,16 @@ weights just run:
|
|||||||
|
|
||||||
.. code-block:: bash
|
.. code-block:: bash
|
||||||
|
|
||||||
python zero_to_fp32.py global_step1 pytorch_model.bin
|
python zero_to_fp32.py . pytorch_model.bin
|
||||||
|
|
||||||
The script will automatically handle either ZeRO-2 or ZeRO-3 checkpoint.
|
This is it. ``pytorch_model.bin`` will now contain the full fp32 model weights consolidated from multiple GPUs.
|
||||||
|
|
||||||
|
The script will automatically be able to handle either a ZeRO-2 or ZeRO-3 checkpoint.
|
||||||
|
|
||||||
``python zero_to_fp32.py -h`` will give you usage details.
|
``python zero_to_fp32.py -h`` will give you usage details.
|
||||||
|
|
||||||
If you have multiple DeepSpeed checkpoint sub-folders, pick the one you know to have the desired weights.
|
The script will auto-discover the deepspeed sub-folder using the contents of the file ``latest``, which in the current
|
||||||
|
example will contain ``global_step1``.
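The auto-discovery described above can be pictured with a short sketch. This is not the code of ``zero_to_fp32.py`` itself; the helper name ``resolve_checkpoint_subfolder`` is hypothetical, and the only things taken from the text are the file name ``latest`` and the example tag ``global_step1``.

```python
import os
import tempfile


def resolve_checkpoint_subfolder(checkpoint_dir):
    """Hypothetical helper: return the sub-folder named by the `latest` file."""
    latest_path = os.path.join(checkpoint_dir, "latest")
    with open(latest_path) as f:
        tag = f.read().strip()  # e.g. "global_step1"
    return os.path.join(checkpoint_dir, tag)


# Build a throwaway directory that mimics the layout described in the text.
with tempfile.TemporaryDirectory() as checkpoint_dir:
    os.makedirs(os.path.join(checkpoint_dir, "global_step1"))
    with open(os.path.join(checkpoint_dir, "latest"), "w") as f:
        f.write("global_step1")

    resolved = resolve_checkpoint_subfolder(checkpoint_dir)
    resolved_name = os.path.basename(resolved)
    print(resolved_name)
```

Passing ``.`` as the checkpoint folder, as in the command above, would then correspond to reading ``./latest`` and descending into the folder it names.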
|
||||||
This is it. ``pytorch_model.bin`` will now contain the full fp32 model weights consolidated from multiple GPUs.
|
|
||||||
|
|
||||||
Note: currently the script requires 2x general RAM of the final fp32 model weights.
|
Note: currently the script requires 2x general RAM of the final fp32 model weights.
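To get a feel for what the 2x requirement means in practice, here is a back-of-the-envelope estimate. It assumes 4 bytes per fp32 parameter and applies the 2x factor from the note; the function name and the parameter counts are illustrative only, and real usage may differ.

```python
def estimate_conversion_ram_gib(num_params, bytes_per_param=4, ram_multiplier=2):
    """Rough CPU RAM estimate (GiB) for consolidating fp32 weights.

    Assumes fp32 weights take `bytes_per_param` bytes each and that the
    conversion needs `ram_multiplier` times the size of the final weights.
    """
    weights_bytes = num_params * bytes_per_param
    return ram_multiplier * weights_bytes / 2**30


# Illustrative parameter counts, not tied to any specific model.
for num_params in (125_000_000, 1_500_000_000):
    estimate = estimate_conversion_ram_gib(num_params)
    print(f"{num_params:>13,} params -> ~{estimate:.1f} GiB of general RAM")
```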
|
||||||
|
|
||||||
|
|||||||
2
setup.py
@@ -91,7 +91,7 @@ _deps = [
|
|||||||
"cookiecutter==1.7.2",
|
"cookiecutter==1.7.2",
|
||||||
"dataclasses",
|
"dataclasses",
|
||||||
"datasets",
|
"datasets",
|
||||||
"deepspeed>=0.4.0",
|
"deepspeed>=0.4.3",
|
||||||
"docutils==0.16.0",
|
"docutils==0.16.0",
|
||||||
"fairscale>0.3",
|
"fairscale>0.3",
|
||||||
"faiss-cpu",
|
"faiss-cpu",
|
||||||
|
|||||||
@@ -8,7 +8,7 @@ deps = {
|
|||||||
"cookiecutter": "cookiecutter==1.7.2",
|
"cookiecutter": "cookiecutter==1.7.2",
|
||||||
"dataclasses": "dataclasses",
|
"dataclasses": "dataclasses",
|
||||||
"datasets": "datasets",
|
"datasets": "datasets",
|
||||||
"deepspeed": "deepspeed>=0.4.0",
|
"deepspeed": "deepspeed>=0.4.3",
|
||||||
"docutils": "docutils==0.16.0",
|
"docutils": "docutils==0.16.0",
|
||||||
"fairscale": "fairscale>0.3",
|
"fairscale": "fairscale>0.3",
|
||||||
"faiss-cpu": "faiss-cpu",
|
"faiss-cpu": "faiss-cpu",
|
||||||
|
|||||||
@@ -819,9 +819,17 @@ class PreTrainedModel(nn.Module, ModuleUtilsMixin, GenerationMixin, PushToHubMix
|
|||||||
if new_num_tokens is None:
|
if new_num_tokens is None:
|
||||||
return old_lm_head
|
return old_lm_head
|
||||||
|
|
||||||
old_num_tokens, old_lm_head_dim = (
|
if is_deepspeed_zero3_enabled():
|
||||||
old_lm_head.weight.size() if not transposed else old_lm_head.weight.t().size()
|
import deepspeed
|
||||||
)
|
|
||||||
|
with deepspeed.zero.GatheredParameters(old_lm_head.weight, modifier_rank=None):
|
||||||
|
old_num_tokens, old_lm_head_dim = (
|
||||||
|
old_lm_head.weight.size() if not transposed else old_lm_head.weight.t().size()
|
||||||
|
)
|
||||||
|
else:
|
||||||
|
old_num_tokens, old_lm_head_dim = (
|
||||||
|
old_lm_head.weight.size() if not transposed else old_lm_head.weight.t().size()
|
||||||
|
)
|
||||||
|
|
||||||
if old_num_tokens == new_num_tokens:
|
if old_num_tokens == new_num_tokens:
|
||||||
return old_lm_head
|
return old_lm_head
|
||||||
@@ -829,7 +837,7 @@ class PreTrainedModel(nn.Module, ModuleUtilsMixin, GenerationMixin, PushToHubMix
|
|||||||
if not isinstance(old_lm_head, nn.Linear):
|
if not isinstance(old_lm_head, nn.Linear):
|
||||||
raise TypeError(
|
raise TypeError(
|
||||||
f"Old language model head is of type {type(old_lm_head)}, which is not an instance of {nn.Linear}."
|
f"Old language model head is of type {type(old_lm_head)}, which is not an instance of {nn.Linear}."
|
||||||
f"You should either use a different resize function or make sure that `old_embeddings` are an instance of {nn.Linear}."
|
f"You should either use a different resize function or make sure that `old_lm_head` is an instance of {nn.Linear}."
|
||||||
)
|
)
|
||||||
|
|
||||||
# Build new lm head
|
# Build new lm head
|
||||||
@@ -842,15 +850,35 @@ class PreTrainedModel(nn.Module, ModuleUtilsMixin, GenerationMixin, PushToHubMix
|
|||||||
|
|
||||||
num_tokens_to_copy = min(old_num_tokens, new_num_tokens)
|
num_tokens_to_copy = min(old_num_tokens, new_num_tokens)
|
||||||
|
|
||||||
# Copy old lm head weights to new lm head
|
# XXX: put the long block of code in a wrapper
|
||||||
if not transposed:
|
if is_deepspeed_zero3_enabled():
|
||||||
new_lm_head.weight.data[:num_tokens_to_copy, :] = old_lm_head.weight.data[:num_tokens_to_copy, :]
|
import deepspeed
|
||||||
else:
|
|
||||||
new_lm_head.weight.data[:, :num_tokens_to_copy] = old_lm_head.weight.data[:, :num_tokens_to_copy]
|
|
||||||
|
|
||||||
# Copy bias weights to new lm head
|
with deepspeed.zero.GatheredParameters(old_lm_head.weight, modifier_rank=0):
|
||||||
if has_new_lm_head_bias:
|
if torch.distributed.get_rank() == 0:
|
||||||
new_lm_head.bias.data[:num_tokens_to_copy] = old_lm_head.bias.data[:num_tokens_to_copy]
|
# Copy old lm head weights to new lm head
|
||||||
|
if not transposed:
|
||||||
|
new_lm_head.weight.data[:num_tokens_to_copy, :] = old_lm_head.weight.data[
|
||||||
|
:num_tokens_to_copy, :
|
||||||
|
]
|
||||||
|
else:
|
||||||
|
new_lm_head.weight.data[:, :num_tokens_to_copy] = old_lm_head.weight.data[
|
||||||
|
:, :num_tokens_to_copy
|
||||||
|
]
|
||||||
|
|
||||||
|
# Copy bias weights to new lm head
|
||||||
|
if has_new_lm_head_bias:
|
||||||
|
new_lm_head.bias.data[:num_tokens_to_copy] = old_lm_head.bias.data[:num_tokens_to_copy]
|
||||||
|
else:
|
||||||
|
# Copy old lm head weights to new lm head
|
||||||
|
if not transposed:
|
||||||
|
new_lm_head.weight.data[:num_tokens_to_copy, :] = old_lm_head.weight.data[:num_tokens_to_copy, :]
|
||||||
|
else:
|
||||||
|
new_lm_head.weight.data[:, :num_tokens_to_copy] = old_lm_head.weight.data[:, :num_tokens_to_copy]
|
||||||
|
|
||||||
|
# Copy bias weights to new lm head
|
||||||
|
if has_new_lm_head_bias:
|
||||||
|
new_lm_head.bias.data[:num_tokens_to_copy] = old_lm_head.bias.data[:num_tokens_to_copy]
|
||||||
|
|
||||||
return new_lm_head
|
return new_lm_head
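The ``XXX`` comment in this hunk suggests moving the duplicated ZeRO-3 / non-ZeRO-3 branches into a wrapper. One possible shape for such a wrapper is sketched below. It is an assumption, not part of this commit: the helper name ``maybe_gathered`` and the ``FakeWeight`` stand-in are hypothetical, and only ``deepspeed.zero.GatheredParameters(..., modifier_rank=...)`` is taken from the diff.

```python
import contextlib


def maybe_gathered(params, zero3_enabled, modifier_rank=None):
    """Hypothetical wrapper: gather ZeRO-3 partitioned params, else do nothing."""
    if zero3_enabled:
        import deepspeed  # imported lazily, as in the diff above

        return deepspeed.zero.GatheredParameters(params, modifier_rank=modifier_rank)
    # Without ZeRO-3 the parameters are already whole, so no gathering is needed.
    return contextlib.nullcontext()


class FakeWeight:
    """Stand-in for a weight tensor so the sketch runs without torch."""

    def size(self):
        return (10, 4)


weight = FakeWeight()
with maybe_gathered(weight, zero3_enabled=False):
    old_num_tokens, old_lm_head_dim = weight.size()
print(old_num_tokens, old_lm_head_dim)
```

With a helper like this, the body of each branch would be written once. Note that the copy path in the diff also guards writes with ``torch.distributed.get_rank() == 0`` when ``modifier_rank=0``; this sketch does not cover that check.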
|
||||||
|
|
||||||
|
|||||||
@@ -18,7 +18,6 @@
|
|||||||
"""
|
"""
|
||||||
|
|
||||||
|
|
||||||
import copy
|
|
||||||
import math
|
import math
|
||||||
|
|
||||||
import numpy as np
|
import numpy as np
|
||||||
@@ -27,6 +26,7 @@ from torch import nn
|
|||||||
from torch.nn import BCEWithLogitsLoss, CrossEntropyLoss, MSELoss
|
from torch.nn import BCEWithLogitsLoss, CrossEntropyLoss, MSELoss
|
||||||
|
|
||||||
from ...activations import gelu
|
from ...activations import gelu
|
||||||
|
from ...deepspeed import is_deepspeed_zero3_enabled
|
||||||
from ...file_utils import (
|
from ...file_utils import (
|
||||||
add_code_sample_docstrings,
|
add_code_sample_docstrings,
|
||||||
add_start_docstrings,
|
add_start_docstrings,
|
||||||
@@ -85,9 +85,19 @@ class Embeddings(nn.Module):
|
|||||||
self.word_embeddings = nn.Embedding(config.vocab_size, config.dim, padding_idx=config.pad_token_id)
|
self.word_embeddings = nn.Embedding(config.vocab_size, config.dim, padding_idx=config.pad_token_id)
|
||||||
self.position_embeddings = nn.Embedding(config.max_position_embeddings, config.dim)
|
self.position_embeddings = nn.Embedding(config.max_position_embeddings, config.dim)
|
||||||
if config.sinusoidal_pos_embds:
|
if config.sinusoidal_pos_embds:
|
||||||
create_sinusoidal_embeddings(
|
|
||||||
n_pos=config.max_position_embeddings, dim=config.dim, out=self.position_embeddings.weight
|
if is_deepspeed_zero3_enabled():
|
||||||
)
|
import deepspeed
|
||||||
|
|
||||||
|
with deepspeed.zero.GatheredParameters(self.position_embeddings.weight, modifier_rank=0):
|
||||||
|
if torch.distributed.get_rank() == 0:
|
||||||
|
create_sinusoidal_embeddings(
|
||||||
|
n_pos=config.max_position_embeddings, dim=config.dim, out=self.position_embeddings.weight
|
||||||
|
)
|
||||||
|
else:
|
||||||
|
create_sinusoidal_embeddings(
|
||||||
|
n_pos=config.max_position_embeddings, dim=config.dim, out=self.position_embeddings.weight
|
||||||
|
)
|
||||||
|
|
||||||
self.LayerNorm = nn.LayerNorm(config.dim, eps=1e-12)
|
self.LayerNorm = nn.LayerNorm(config.dim, eps=1e-12)
|
||||||
self.dropout = nn.Dropout(config.dropout)
|
self.dropout = nn.Dropout(config.dropout)
|
||||||
@@ -274,9 +284,7 @@ class Transformer(nn.Module):
|
|||||||
def __init__(self, config):
|
def __init__(self, config):
|
||||||
super().__init__()
|
super().__init__()
|
||||||
self.n_layers = config.n_layers
|
self.n_layers = config.n_layers
|
||||||
|
self.layer = nn.ModuleList([TransformerBlock(config) for _ in range(config.n_layers)])
|
||||||
layer = TransformerBlock(config)
|
|
||||||
self.layer = nn.ModuleList([copy.deepcopy(layer) for _ in range(config.n_layers)])
|
|
||||||
|
|
||||||
def forward(
|
def forward(
|
||||||
self, x, attn_mask=None, head_mask=None, output_attentions=False, output_hidden_states=False, return_dict=None
|
self, x, attn_mask=None, head_mask=None, output_attentions=False, output_hidden_states=False, return_dict=None
|
||||||
|
|||||||
@@ -36,6 +36,7 @@ from torch import Tensor, nn
|
|||||||
from torch.nn import CrossEntropyLoss, LayerNorm
|
from torch.nn import CrossEntropyLoss, LayerNorm
|
||||||
|
|
||||||
from ...activations import ACT2FN
|
from ...activations import ACT2FN
|
||||||
|
from ...deepspeed import is_deepspeed_zero3_enabled
|
||||||
from ...file_utils import (
|
from ...file_utils import (
|
||||||
add_code_sample_docstrings,
|
add_code_sample_docstrings,
|
||||||
add_end_docstrings,
|
add_end_docstrings,
|
||||||
@@ -658,11 +659,14 @@ class FSMTDecoder(nn.Module):
|
|||||||
[DecoderLayer(config) for _ in range(config.decoder_layers)]
|
[DecoderLayer(config) for _ in range(config.decoder_layers)]
|
||||||
) # type: List[DecoderLayer]
|
) # type: List[DecoderLayer]
|
||||||
|
|
||||||
self.output_projection = nn.Linear(
|
if is_deepspeed_zero3_enabled():
|
||||||
self.embed_tokens.weight.shape[1],
|
import deepspeed
|
||||||
self.embed_tokens.weight.shape[0],
|
|
||||||
bias=False,
|
with deepspeed.zero.GatheredParameters(self.embed_tokens.weight, modifier_rank=None):
|
||||||
)
|
embed_tokens_weight_shape = self.embed_tokens.weight.shape
|
||||||
|
else:
|
||||||
|
embed_tokens_weight_shape = self.embed_tokens.weight.shape
|
||||||
|
self.output_projection = nn.Linear(embed_tokens_weight_shape[1], embed_tokens_weight_shape[0], bias=False)
|
||||||
self.output_projection.weight = self.embed_tokens.weight
|
self.output_projection.weight = self.embed_tokens.weight
|
||||||
|
|
||||||
def forward(
|
def forward(
|
||||||
@@ -1127,19 +1131,6 @@ class FSMTForConditionalGeneration(PretrainedFSMTModel):
|
|||||||
base_model = FSMTModel(config)
|
base_model = FSMTModel(config)
|
||||||
self.model = base_model
|
self.model = base_model
|
||||||
|
|
||||||
def resize_token_embeddings(self, new_num_tokens: int) -> nn.Embedding:
|
|
||||||
new_embeddings = super().resize_token_embeddings(new_num_tokens)
|
|
||||||
self.model.encoder.embed_tokens = new_embeddings
|
|
||||||
|
|
||||||
new_embeddings = super().resize_token_embeddings(new_num_tokens)
|
|
||||||
self.model.decoder.embed_tokens = new_embeddings
|
|
||||||
|
|
||||||
# XXX: this is not quite correct, as we have 2 different `new_embeddings`, and
|
|
||||||
# only one return value is expected. Needs to be redesigned in the core to support dual dicts
|
|
||||||
raise NotImplementedError("this method needs re-thinking for models with 2 separate dictionaries")
|
|
||||||
|
|
||||||
return new_embeddings
|
|
||||||
|
|
||||||
@add_start_docstrings_to_model_forward(FSMT_INPUTS_DOCSTRING)
|
@add_start_docstrings_to_model_forward(FSMT_INPUTS_DOCSTRING)
|
||||||
@replace_return_docstrings(output_type=Seq2SeqLMOutput, config_class=_CONFIG_FOR_DOC)
|
@replace_return_docstrings(output_type=Seq2SeqLMOutput, config_class=_CONFIG_FOR_DOC)
|
||||||
@add_end_docstrings(FSMT_GENERATION_EXAMPLE)
|
@add_end_docstrings(FSMT_GENERATION_EXAMPLE)
|
||||||
@@ -1257,6 +1248,9 @@ class FSMTForConditionalGeneration(PretrainedFSMTModel):
|
|||||||
def get_output_embeddings(self):
|
def get_output_embeddings(self):
|
||||||
return self.model.decoder.embed_tokens
|
return self.model.decoder.embed_tokens
|
||||||
|
|
||||||
|
def set_output_embeddings(self, value):
|
||||||
|
self.model.decoder.embed_tokens = value
|
||||||
|
|
||||||
|
|
||||||
class SinusoidalPositionalEmbedding(nn.Embedding):
|
class SinusoidalPositionalEmbedding(nn.Embedding):
|
||||||
"""
|
"""
|
||||||
|
|||||||
@@ -23,9 +23,8 @@ import torch
|
|||||||
import torch.utils.checkpoint
|
import torch.utils.checkpoint
|
||||||
from torch import nn
|
from torch import nn
|
||||||
|
|
||||||
from transformers.deepspeed import is_deepspeed_zero3_enabled
|
|
||||||
|
|
||||||
from ...activations import ACT2FN
|
from ...activations import ACT2FN
|
||||||
|
from ...deepspeed import is_deepspeed_zero3_enabled
|
||||||
from ...file_utils import (
|
from ...file_utils import (
|
||||||
ModelOutput,
|
ModelOutput,
|
||||||
add_start_docstrings,
|
add_start_docstrings,
|
||||||
@@ -853,17 +852,7 @@ class Wav2Vec2PreTrainedModel(PreTrainedModel):
|
|||||||
module.bias.data.zero_()
|
module.bias.data.zero_()
|
||||||
module.weight.data.fill_(1.0)
|
module.weight.data.fill_(1.0)
|
||||||
elif isinstance(module, nn.Conv1d):
|
elif isinstance(module, nn.Conv1d):
|
||||||
if is_deepspeed_zero3_enabled():
|
nn.init.kaiming_normal_(module.weight.data)
|
||||||
import deepspeed
|
|
||||||
|
|
||||||
if hasattr(module, "weight_v") and hasattr(module, "weight_g"):
|
|
||||||
with deepspeed.zero.GatheredParameters([module.weight_v, module.weight_g], modifier_rank=0):
|
|
||||||
nn.init.kaiming_normal_(module.weight.data)
|
|
||||||
else:
|
|
||||||
with deepspeed.zero.GatheredParameters(module.weight, modifier_rank=0):
|
|
||||||
nn.init.kaiming_normal_(module.weight.data)
|
|
||||||
else:
|
|
||||||
nn.init.kaiming_normal_(module.weight.data)
|
|
||||||
|
|
||||||
if isinstance(module, (nn.Linear, nn.Conv1d)) and module.bias is not None:
|
if isinstance(module, (nn.Linear, nn.Conv1d)) and module.bias is not None:
|
||||||
module.bias.data.zero_()
|
module.bias.data.zero_()
|
||||||
|
|||||||
@@ -799,10 +799,7 @@ class TrainingArguments:
|
|||||||
device = torch.device("cuda", self.local_rank)
|
device = torch.device("cuda", self.local_rank)
|
||||||
self._n_gpu = 1
|
self._n_gpu = 1
|
||||||
elif self.deepspeed:
|
elif self.deepspeed:
|
||||||
# deepspeed performs its own DDP internally, and requires the program to be started with:
|
# deepspeed inits torch.distributed internally
|
||||||
# deepspeed ./program.py
|
|
||||||
# rather than:
|
|
||||||
# python -m torch.distributed.launch --nproc_per_node=2 ./program.py
|
|
||||||
from .deepspeed import is_deepspeed_available
|
from .deepspeed import is_deepspeed_available
|
||||||
|
|
||||||
if not is_deepspeed_available():
|
if not is_deepspeed_available():
|
||||||
|
|||||||
@@ -37,11 +37,12 @@ from transformers.testing_utils import (
|
|||||||
require_torch_multi_gpu,
|
require_torch_multi_gpu,
|
||||||
slow,
|
slow,
|
||||||
)
|
)
|
||||||
from transformers.trainer_utils import set_seed
|
from transformers.trainer_utils import get_last_checkpoint, set_seed
|
||||||
|
|
||||||
|
|
||||||
bindir = os.path.abspath(os.path.dirname(__file__))
|
tests_dir = os.path.abspath(os.path.dirname(os.path.dirname(__file__)))
|
||||||
with ExtendSysPath(f"{bindir}/.."):
|
root_dir = os.path.dirname(tests_dir)
|
||||||
|
with ExtendSysPath(tests_dir):
|
||||||
from test_trainer import TrainerIntegrationCommon # noqa
|
from test_trainer import TrainerIntegrationCommon # noqa
|
||||||
|
|
||||||
if is_torch_available():
|
if is_torch_available():
|
||||||
@@ -49,9 +50,10 @@ with ExtendSysPath(f"{bindir}/.."):
|
|||||||
|
|
||||||
|
|
||||||
set_seed(42)
|
set_seed(42)
|
||||||
MBART_TINY = "sshleifer/tiny-mbart"
|
|
||||||
T5_SMALL = "t5-small"
|
T5_SMALL = "t5-small"
|
||||||
T5_TINY = "patrickvonplaten/t5-tiny-random"
|
T5_TINY = "patrickvonplaten/t5-tiny-random"
|
||||||
|
GPT2_TINY = "sshleifer/tiny-gpt2"
|
||||||
|
|
||||||
|
|
||||||
def load_json(path):
|
def load_json(path):
|
||||||
@@ -77,8 +79,19 @@ def require_deepspeed_aio(test_case):
|
|||||||
|
|
||||||
if is_deepspeed_available():
|
if is_deepspeed_available():
|
||||||
from deepspeed.utils import logger as deepspeed_logger # noqa
|
from deepspeed.utils import logger as deepspeed_logger # noqa
|
||||||
|
from deepspeed.utils.zero_to_fp32 import load_state_dict_from_zero_checkpoint
|
||||||
from transformers.deepspeed import deepspeed_config, is_deepspeed_zero3_enabled # noqa
|
from transformers.deepspeed import deepspeed_config, is_deepspeed_zero3_enabled # noqa
|
||||||
|
|
||||||
|
|
||||||
|
def get_launcher(distributed=False):
|
||||||
|
# 1. explicitly set --num_nodes=1 just in case these tests end up run on a multi-node setup
|
||||||
|
# - it won't be able to handle that
|
||||||
|
# 2. for now testing with just 2 gpus max (since some quality tests may give different
|
||||||
|
# results with more gpus because we use very little data)
|
||||||
|
num_gpus = min(2, get_gpu_count()) if distributed else 1
|
||||||
|
return f"deepspeed --num_nodes 1 --num_gpus {num_gpus}".split()
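As an illustration of how the launcher list composes with the rest of a test command, here is a self-contained variant. It is a sketch only: ``build_launcher`` is a hypothetical name, the GPU count is passed in explicitly instead of calling ``get_gpu_count()``, and the script name and arguments are placeholders.

```python
def build_launcher(gpu_count, distributed=False):
    """Hypothetical variant of get_launcher with the GPU count passed in."""
    # Cap at 2 GPUs for distributed runs, use a single GPU otherwise.
    num_gpus = min(2, gpu_count) if distributed else 1
    return f"deepspeed --num_nodes 1 --num_gpus {num_gpus}".split()


launcher = build_launcher(gpu_count=8, distributed=True)
script = ["run_clm.py"]  # placeholder script path
args = "--model_type gpt2 --do_train".split()  # placeholder arguments

# The tests concatenate lists in the same way: launcher + script + args + ds_args.
cmd = launcher + script + args
print(" ".join(cmd))
```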
|
||||||
|
|
||||||
|
|
||||||
ZERO2 = "zero2"
|
ZERO2 = "zero2"
|
||||||
ZERO3 = "zero3"
|
ZERO3 = "zero3"
|
||||||
stages = [ZERO2, ZERO3]
|
stages = [ZERO2, ZERO3]
|
||||||
@@ -568,6 +581,41 @@ class TrainerIntegrationDeepSpeed(TestCasePlus, TrainerIntegrationCommon):
|
|||||||
self.assertEqual(b, b1)
|
self.assertEqual(b, b1)
|
||||||
self.check_trainer_state_are_the_same(state, state1)
|
self.check_trainer_state_are_the_same(state, state1)
|
||||||
|
|
||||||
|
@parameterized.expand(stages)
|
||||||
|
def test_load_state_dict_from_zero_checkpoint(self, stage):
|
||||||
|
# test that we can load fp32 weights directly from the zero checkpoint into the current model
|
||||||
|
|
||||||
|
output_dir = self.get_auto_remove_tmp_dir() # "./xxx", after=False, before=False)
|
||||||
|
|
||||||
|
ds_config_dict = self.get_config_dict(stage)
|
||||||
|
|
||||||
|
kwargs = dict(
|
||||||
|
output_dir=output_dir,
|
||||||
|
train_len=4,
|
||||||
|
per_device_train_batch_size=4,
|
||||||
|
num_train_epochs=1,
|
||||||
|
save_strategy="steps",
|
||||||
|
save_steps=1,
|
||||||
|
learning_rate=0.1,
|
||||||
|
fp16=True,
|
||||||
|
deepspeed=ds_config_dict,
|
||||||
|
)
|
||||||
|
|
||||||
|
with mockenv_context(**self.dist_env_1_gpu):
|
||||||
|
trainer = get_regression_trainer(**kwargs)
|
||||||
|
trainer.train()
|
||||||
|
(a, b) = trainer.model.a.item(), trainer.model.b.item()
|
||||||
|
state = dataclasses.asdict(trainer.state)
|
||||||
|
|
||||||
|
checkpoint_dir = get_last_checkpoint(output_dir)
|
||||||
|
model = load_state_dict_from_zero_checkpoint(trainer.model, checkpoint_dir)
|
||||||
|
|
||||||
|
(a1, b1) = model.a.item(), model.b.item()
|
||||||
|
state1 = dataclasses.asdict(trainer.state)
|
||||||
|
self.assertEqual(a, a1)
|
||||||
|
self.assertEqual(b, b1)
|
||||||
|
self.check_trainer_state_are_the_same(state, state1)
|
||||||
|
|
||||||
def test_config_object(self):
|
def test_config_object(self):
|
||||||
# test that we can switch from zero2 to zero3 in the same process for example
|
# test that we can switch from zero2 to zero3 in the same process for example
|
||||||
# test is_zero, etc.
|
# test is_zero, etc.
|
||||||
@@ -809,7 +857,7 @@ class TestDeepSpeedWithLauncher(TestCasePlus):
|
|||||||
|
|
||||||
ds_args = f"--deepspeed {self.test_file_dir_str}/ds_config_{stage}.json".split()
|
ds_args = f"--deepspeed {self.test_file_dir_str}/ds_config_{stage}.json".split()
|
||||||
script = [f"{self.examples_dir_str}/pytorch/translation/run_translation.py"]
|
script = [f"{self.examples_dir_str}/pytorch/translation/run_translation.py"]
|
||||||
launcher = self.get_launcher(distributed)
|
launcher = get_launcher(distributed)
|
||||||
|
|
||||||
cmd = launcher + script + args + ds_args
|
cmd = launcher + script + args + ds_args
|
||||||
# keep for quick debug
|
# keep for quick debug
|
||||||
@@ -826,7 +874,7 @@ class TestDeepSpeedWithLauncher(TestCasePlus):
|
|||||||
data_dir = self.tests_dir / "fixtures"
|
data_dir = self.tests_dir / "fixtures"
|
||||||
output_dir = self.get_auto_remove_tmp_dir()
|
output_dir = self.get_auto_remove_tmp_dir()
|
||||||
args = f"""
|
args = f"""
|
||||||
--model_name_or_path sshleifer/tiny-gpt2
|
--model_name_or_path {GPT2_TINY}
|
||||||
--train_file {data_dir}/sample_text.txt
|
--train_file {data_dir}/sample_text.txt
|
||||||
--validation_file {data_dir}/sample_text.txt
|
--validation_file {data_dir}/sample_text.txt
|
||||||
--output_dir {output_dir}
|
--output_dir {output_dir}
|
||||||
@@ -846,7 +894,7 @@ class TestDeepSpeedWithLauncher(TestCasePlus):
|
|||||||
|
|
||||||
ds_args = f"--deepspeed {self.test_file_dir_str}/ds_config_{stage}.json".split()
|
ds_args = f"--deepspeed {self.test_file_dir_str}/ds_config_{stage}.json".split()
|
||||||
script = [f"{self.examples_dir_str}/pytorch/language-modeling/run_clm.py"]
|
script = [f"{self.examples_dir_str}/pytorch/language-modeling/run_clm.py"]
|
||||||
launcher = self.get_launcher(distributed=True)
|
launcher = get_launcher(distributed=True)
|
||||||
|
|
||||||
cmd = launcher + script + args + ds_args
|
cmd = launcher + script + args + ds_args
|
||||||
# keep for quick debug
|
# keep for quick debug
|
||||||
@@ -860,7 +908,7 @@ class TestDeepSpeedWithLauncher(TestCasePlus):
|
|||||||
output_dir = self.get_auto_remove_tmp_dir()
|
output_dir = self.get_auto_remove_tmp_dir()
|
||||||
args = f"""
|
args = f"""
|
||||||
--model_type gpt2
|
--model_type gpt2
|
||||||
--tokenizer_name sshleifer/tiny-gpt2
|
--tokenizer_name {GPT2_TINY}
|
||||||
--train_file {data_dir}/sample_text.txt
|
--train_file {data_dir}/sample_text.txt
|
||||||
--validation_file {data_dir}/sample_text.txt
|
--validation_file {data_dir}/sample_text.txt
|
||||||
--output_dir {output_dir}
|
--output_dir {output_dir}
|
||||||
@@ -877,7 +925,7 @@ class TestDeepSpeedWithLauncher(TestCasePlus):
|
|||||||
|
|
||||||
ds_args = f"--deepspeed {self.test_file_dir_str}/ds_config_zero3.json".split()
|
ds_args = f"--deepspeed {self.test_file_dir_str}/ds_config_zero3.json".split()
|
||||||
script = [f"{self.examples_dir_str}/pytorch/language-modeling/run_clm.py"]
|
script = [f"{self.examples_dir_str}/pytorch/language-modeling/run_clm.py"]
|
||||||
launcher = self.get_launcher(distributed=True)
|
launcher = get_launcher(distributed=True)
|
||||||
|
|
||||||
cmd = launcher + script + args + ds_args
|
cmd = launcher + script + args + ds_args
|
||||||
# keep for quick debug
|
# keep for quick debug
|
||||||
@@ -885,11 +933,3 @@ class TestDeepSpeedWithLauncher(TestCasePlus):
|
|||||||
with CaptureStderr() as cs:
|
with CaptureStderr() as cs:
|
||||||
execute_subprocess_async(cmd, env=self.get_env())
|
execute_subprocess_async(cmd, env=self.get_env())
|
||||||
assert "Detected DeepSpeed ZeRO-3" in cs.err
|
assert "Detected DeepSpeed ZeRO-3" in cs.err
|
||||||
|
|
||||||
def get_launcher(self, distributed=False):
|
|
||||||
# 1. explicitly set --num_nodes=1 just in case these tests end up run on a multi-node setup
|
|
||||||
# - it won't be able to handle that
|
|
||||||
# 2. for now testing with just 2 gpus max (since some quality tests may give different
|
|
||||||
# results with mode gpus because we use very little data)
|
|
||||||
num_gpus = min(2, get_gpu_count()) if distributed else 1
|
|
||||||
return f"deepspeed --num_nodes 1 --num_gpus {num_gpus}".split()
|
|
||||||
|
|||||||
259
tests/deepspeed/test_model_zoo.py
Normal file
@@ -0,0 +1,259 @@
|
|||||||
|
# Copyright 2020 The HuggingFace Team. All rights reserved.
|
||||||
|
#
|
||||||
|
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||||
|
# you may not use this file except in compliance with the License.
|
||||||
|
# You may obtain a copy of the License at
|
||||||
|
#
|
||||||
|
# http://www.apache.org/licenses/LICENSE-2.0
|
||||||
|
#
|
||||||
|
# Unless required by applicable law or agreed to in writing, software
|
||||||
|
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||||
|
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||||
|
# See the License for the specific language governing permissions and
|
||||||
|
# limitations under the License.
|
||||||
|
|
||||||
|
import itertools
|
||||||
|
import os
|
||||||
|
import subprocess
|
||||||
|
|
||||||
|
from parameterized import parameterized
|
||||||
|
from transformers import is_torch_available
|
||||||
|
from transformers.testing_utils import (
|
||||||
|
ExtendSysPath,
|
||||||
|
TestCasePlus,
|
||||||
|
execute_subprocess_async,
|
||||||
|
get_gpu_count,
|
||||||
|
require_deepspeed,
|
||||||
|
require_torch_gpu,
|
||||||
|
slow,
|
||||||
|
)
|
||||||
|
from transformers.trainer_utils import set_seed
|
||||||
|
|
||||||
|
|
||||||
|
tests_dir = os.path.abspath(os.path.dirname(os.path.dirname(__file__)))
|
||||||
|
root_dir = os.path.dirname(tests_dir)
|
||||||
|
with ExtendSysPath(tests_dir):
|
||||||
|
from test_trainer import TrainerIntegrationCommon # noqa
|
||||||
|
|
||||||
|
if is_torch_available():
|
||||||
|
from test_trainer import RegressionModelConfig, RegressionPreTrainedModel, get_regression_trainer # noqa
|
||||||
|
|
||||||
|
|
||||||
|
set_seed(42)
|
||||||
|
|
||||||
|
# translation
|
||||||
|
FSMT_TINY = "stas/tiny-wmt19-en-de"
|
||||||
|
BART_TINY = "sshleifer/bart-tiny-random"
|
||||||
|
T5_SMALL = "t5-small"
|
||||||
|
T5_TINY = "patrickvonplaten/t5-tiny-random"
|
||||||
|
MBART_TINY = "sshleifer/tiny-mbart"
|
||||||
|
MARIAN_TINY = "sshleifer/tiny-marian-en-de"
|
||||||
|
|
||||||
|
# summarization
|
||||||
|
PEGASUS_TINY = "stas/pegasus-cnn_dailymail-tiny-random"
|
||||||
|
|
||||||
|
# causal lm
|
||||||
|
GPT2_TINY = "sshleifer/tiny-gpt2"
|
||||||
|
XLM_ROBERTA_TINY = "hf-internal-testing/tiny-xlm-roberta"
|
||||||
|
|
||||||
|
# question-answering
|
||||||
|
ROBERTA_TINY = "sshleifer/tiny-distilroberta-base"
|
||||||
|
|
||||||
|
# masked lm
|
||||||
|
DISTILBERT_TINY = "sshleifer/tiny-distilbert-base-cased"
|
||||||
|
ELECTRA_TINY = "hf-internal-testing/tiny-electra"
|
||||||
|
|
||||||
|
# classification
|
||||||
|
XLNET_TINY = "sshleifer/tiny-xlnet-base-cased"
|
||||||
|
BERT_TINY = "hf-internal-testing/tiny-bert"
|
||||||
|
|
||||||
|
|
||||||
|
# TODO: to add:
|
||||||
|
# albert
|
||||||
|
# deberta
|
||||||
|
# funnel
|
||||||
|
# longformer
|
||||||
|
# dpr
|
||||||
|
# gpt_neo
|
||||||
|
# camembert
|
||||||
|
# deberta-v2
|
||||||
|
# m2m_100
|
||||||
|
# tapas
|
||||||
|
# vit
|
||||||
|
# big_bird
|
||||||
|
|
||||||
|
|
||||||
|
def get_launcher(distributed=False):
|
||||||
|
# 1. explicitly set --num_nodes=1 just in case these tests end up run on a multi-node setup
|
||||||
|
# - it won't be able to handle that
|
||||||
|
# 2. for now testing with just 2 gpus max (since some quality tests may give different
|
||||||
|
# results with more gpus because we use very little data)
|
||||||
|
num_gpus = min(2, get_gpu_count()) if distributed else 1
|
||||||
|
return f"deepspeed --num_nodes 1 --num_gpus {num_gpus}".split()
|
||||||
|
|
||||||
|
|
||||||
|
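# Illustration only (hypothetical helper, not used by the tests): a standalone sketch of the
# launcher logic in get_launcher() that takes the GPU count as an argument instead of calling
# get_gpu_count(), so the 2-GPU cap can be checked without any hardware.
def _sketch_launcher(num_available_gpus, distributed=False):
    # cap at 2 GPUs in distributed mode, otherwise always use a single GPU
    num_gpus = min(2, num_available_gpus) if distributed else 1
    return f"deepspeed --num_nodes 1 --num_gpus {num_gpus}".split()


# e.g. _sketch_launcher(8, distributed=True) gives
# ["deepspeed", "--num_nodes", "1", "--num_gpus", "2"]

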
def make_task_cmds():
    data_dir_fixtures = f"{tests_dir}/fixtures"
    data_dir_samples = f"{data_dir_fixtures}/tests_samples"
    data_dir_wmt = f"{data_dir_samples}/wmt_en_ro"
    data_dir_xsum = f"{data_dir_samples}/xsum"
    args_main = """
        --do_train
        --max_train_samples 4
        --per_device_train_batch_size 2
        --num_train_epochs 1
        --fp16
        --report_to none
        --overwrite_output_dir
        """.split()

    # XXX: try to cover as many models as possible once (it's enough to run on one task per model)
    # but need a tiny model for each
    #
    # should have T5_TINY, etc. global var defined
    tasks2models = dict(
        trans=[
            "bart",
            "fsmt",
            "marian",
            "mbart",
            "t5",
        ],
        sum=[
            "pegasus",
        ],
        clm=[
            "gpt2",
            "xlm-roberta",
        ],
        mlm=[
            "electra",
            "distilbert",
        ],
        qa=[
            "roberta",
        ],
        clas=[
            "bert",
            "xlnet",
        ],
    )

    scripts_dir = f"{root_dir}/examples/pytorch"

    tasks = dict(
        trans=f"""
        {scripts_dir}/translation/run_translation.py
        --train_file {data_dir_wmt}/train.json
        --source_lang en
        --target_lang ro
        """,
        sum=f"""
        {scripts_dir}/summarization/run_summarization.py
        --train_file {data_dir_xsum}/sample.json
        --max_source_length 12
        --max_target_length 12
        """,
        clm=f"""
        {scripts_dir}/language-modeling/run_clm.py
        --train_file {data_dir_fixtures}/sample_text.txt
        --block_size 8
        """,
        mlm=f"""
        {scripts_dir}/language-modeling/run_mlm.py
        --train_file {data_dir_fixtures}/sample_text.txt
        """,
        qa=f"""
        {scripts_dir}/question-answering/run_qa.py
        --train_file {data_dir_samples}/SQUAD/sample.json
        """,
        clas=f"""
        {scripts_dir}/text-classification/run_glue.py
        --train_file {data_dir_samples}/MRPC/train.csv
        --max_seq_length 12
        --task_name MRPC
        """,
    )

    launcher = get_launcher(distributed=True)

    cmds = {}
    for task, args in tasks.items():
        args = args.split()
        for model in tasks2models[task]:
            model_name = globals()[f"{model.upper().replace('-', '_')}_TINY"]
            args_model = f"--model_name_or_path {model_name}".split()
            cmds[f"{task}_{model}"] = launcher + args + args_model + args_main

            # # generation special case
            # if task == "gen":
            #     launcher = f"deepspeed --num_nodes 1 --num_gpus 1".split()
            #     args_model += f"--model_type {model}".split()
            #     cmds[f"{task}_{model}"] = launcher + args + args_model
            # else:

    return cmds


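# Illustration only (hypothetical helper, not used by the tests): how a model key from
# tasks2models is turned into the name of the module-level *_TINY variable that
# make_task_cmds() looks up via globals().
def _sketch_tiny_var_name(model):
    # upper-case the key and replace dashes, which are not valid in Python identifiers
    return f"{model.upper().replace('-', '_')}_TINY"


# e.g. "xlm-roberta" maps to XLM_ROBERTA_TINY and "t5" maps to T5_TINY

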
task_cmds = make_task_cmds()

ZERO2 = "zero2"
ZERO3 = "zero3"
stages = [ZERO2, ZERO3]


def parameterized_custom_name_func(func, param_num, param):
    # customize the test name generator function as we want both params to appear in the sub-test
    # name, as by default it shows only the first param
    param_based_name = parameterized.to_safe_name("_".join(str(x) for x in param.args))
    return f"{func.__name__}_{param_based_name}"


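# Illustration only (hypothetical helper, not used by the tests): a dependency-free sketch of the
# sub-test names built by parameterized_custom_name_func(). It assumes that
# parameterized.to_safe_name() replaces each run of characters other than letters, digits and
# underscores with a single underscore.
def _sketch_subtest_name(func_name, args):
    import re  # local import keeps the sketch self-contained

    joined = "_".join(str(x) for x in args)
    safe = re.sub("[^a-zA-Z0-9_]+", "_", joined)  # assumed behaviour of to_safe_name()
    return f"{func_name}_{safe}"


# e.g. ("zero2", "clm_xlm-roberta") gives "test_zero_to_fp32_zero2_clm_xlm_roberta"

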
# Cartesian-product of zero stages with models to test
params = list(itertools.product(stages, task_cmds.keys()))


@slow
@require_deepspeed
@require_torch_gpu
class TestDeepSpeedModelZoo(TestCasePlus):
    """This class is for testing via an external script - can do multiple gpus"""

    def get_task_cmd(self, task, stage):
        # return a ready to run train cmd
        if task not in task_cmds:
            raise ValueError(f"don't know of task {task}, have {task_cmds.keys()}")

        cmd = task_cmds[task]
        args_ds = f"--deepspeed {self.test_file_dir_str}/ds_config_{stage}.json".split()

        output_dir = self.get_auto_remove_tmp_dir()
        args_out = f"--output_dir {output_dir}".split()

        # build a new list: `cmd += ...` would extend the shared task_cmds[task] list in place,
        # so args would pile up across the sub-tests that reuse the same task
        cmd = cmd + args_ds + args_out

        return cmd, output_dir

    @parameterized.expand(params, name_func=parameterized_custom_name_func)
    def test_zero_to_fp32(self, stage, task):
        # testing the ability to do a run followed by recovery of full fp32 weights

        cmd, output_dir = self.get_task_cmd(task, stage)

        # 1. generate the checkpoint
        cmd += "--save_steps 1".split()
        # keep for quick debug
        # print(" ".join([f"\nPYTHONPATH={self.src_dir_str}"] + cmd)); die
        execute_subprocess_async(cmd, env=self.get_env())

        # 2. test that the fp32 weights get reconsolidated
        chkpt_dir = f"{output_dir}/checkpoint-1"
        recovered_model_path = f"{chkpt_dir}/out.bin"
        cmd = f"{chkpt_dir}/zero_to_fp32.py {chkpt_dir} {recovered_model_path}"
        # keep for quick debug
        # print(" ".join([f"\nPYTHONPATH={self.src_dir_str}"] + cmd)); die
        subprocess.check_call(cmd, shell=True)
        assert os.path.exists(recovered_model_path), f"{recovered_model_path} was not found"

        # possibly could also test that the resulting saved model is usable but given that we use
        # random models we won't know if it's any good
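

# Illustration only (hypothetical names, not used by the tests): the lists stored in task_cmds are
# shared by every parameterized sub-test, and `+=` on a list extends it in place, while
# `a = a + b` builds a new list and leaves the original untouched.
def _sketch_list_extension():
    shared = {"clm_gpt2": ["deepspeed", "run_clm.py"]}

    in_place = shared["clm_gpt2"]
    in_place += ["--output_dir", "/tmp/a"]  # also changes shared["clm_gpt2"]
    after_in_place = list(shared["clm_gpt2"])

    fresh = shared["clm_gpt2"] + ["--output_dir", "/tmp/b"]  # shared entry is not modified
    return after_in_place, list(shared["clm_gpt2"]), fresh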