(Part 2) feat: allow for tp_size attr for tplizing the model (#37054)

* feat: custom tp_size, new transformers tp interface

Signed-off-by: Mehant Kammakomati <mehant.kammakomati2@ibm.com>

* fix: review cmt - error when tp_plan not set for tp_size

Signed-off-by: Mehant Kammakomati <mehant.kammakomati2@ibm.com>

* fix: nit in docs

Signed-off-by: Mehant Kammakomati <mehant.kammakomati2@ibm.com>

---------

Signed-off-by: Mehant Kammakomati <mehant.kammakomati2@ibm.com>
Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>
Co-authored-by: Matej Sirovatka <54212263+S1ro1@users.noreply.github.com>

Committed by GitHub
Parent: dac443414e
Commit: 7d76876498

@@ -674,29 +674,7 @@ use_cpu: false
 ```

 </hfoption>
-<hfoption id="Tensor Parallelism with PyTorch 2">

-```yml
-compute_environment: LOCAL_MACHINE
-tp_config:
-  tp_size: 4
-distributed_type: TP
-downcast_bf16: 'no'
-machine_rank: 0
-main_training_function: main
-mixed_precision: 'no'
-num_machines: 1
-num_processes: 4
-rdzv_backend: static
-same_network: true
-tpu_env: []
-tpu_use_cluster: false
-tpu_use_sudo: false
-use_cpu: false
-
-```
-
-</hfoption>
 </hfoptions>
 The [`accelerate_launch`](https://huggingface.co/docs/accelerate/package_reference/cli#accelerate-launch) command is the recommended way to launch your training script on a distributed system with Accelerate and [`Trainer`], using the parameters specified in `config_file.yaml`. This file is saved to the Accelerate cache folder and loaded automatically when you run `accelerate_launch`.

@@ -341,29 +341,9 @@ use_cpu: false
 ```

 </hfoption>
-<hfoption id="Tensor parallelism with PyTorch 2">
-
-```yaml
-compute_environment: LOCAL_MACHINE
-tp_config:
-  tp_size: 4
-distributed_type: TP
-downcast_bf16: 'no'
-machine_rank: 0
-main_training_function: main
-mixed_precision: 'no'
-num_machines: 1
-num_processes: 4
-rdzv_backend: static
-same_network: true
-tpu_env: []
-tpu_use_cluster: false
-tpu_use_sudo: false
-use_cpu: false
-```
-
 </hfoptions>

+
 Run [accelerate_launch](https://hf.co/docs/accelerate/package_reference/cli#accelerate-launch) to start training with the configurations set in `config_file.yaml`. This file is saved to the Accelerate cache folder and automatically loaded when you run `accelerate_launch`.

 The example below launches the [run_glue.py](../../../examples/pytorch/text-classification/run_glue) script with the FSDP configuration shown earlier. Parameters from the `config_file.yaml` file can also be directly set in the command line.
@@ -363,29 +363,6 @@ use_cpu: false

 </hfoption>

-<hfoption id="Tensor Parallelism with PyTorch 2">
-
-```yml
-compute_environment: LOCAL_MACHINE
-tp_config:
-  tp_size: 4
-distributed_type: TP
-downcast_bf16: 'no'
-machine_rank: 0
-main_training_function: main
-mixed_precision: 'no'
-num_machines: 1
-num_processes: 4
-rdzv_backend: static
-same_network: true
-tpu_env: []
-tpu_use_cluster: false
-tpu_use_sudo: false
-use_cpu: false
-
-```
-
-</hfoption>
 </hfoptions>

 The [`accelerate_launch`](https://huggingface.co/docs/accelerate/package_reference/cli#accelerate-launch) command is the recommended way to launch your training script on a distributed system with Accelerate and [`Trainer`], using the parameters specified in `config_file.yaml`. This file is saved to the Accelerate cache folder and loaded automatically when you run `accelerate_launch`.
@@ -549,29 +549,7 @@ use_cpu: false
 ```

 </hfoption>
-<hfoption id="Tensor Parallelism with PyTorch 2">

-```yml
-compute_environment: LOCAL_MACHINE
-tp_config:
-  tp_size: 4
-distributed_type: TP
-downcast_bf16: 'no'
-machine_rank: 0
-main_training_function: main
-mixed_precision: 'no'
-num_machines: 1
-num_processes: 4
-rdzv_backend: static
-same_network: true
-tpu_env: []
-tpu_use_cluster: false
-tpu_use_sudo: false
-use_cpu: false
-
-```
-
-</hfoption>
 </hfoptions>

 The [`accelerate_launch`](https://huggingface.co/docs/accelerate/package_reference/cli#accelerate-launch) command is the recommended way to launch your training script on a distributed system with Accelerate and [`Trainer`], using the parameters specified in `config_file.yaml`. This file is saved to the Accelerate cache folder and loaded automatically when you run `accelerate_launch`.
@@ -1788,6 +1788,9 @@ class PreTrainedModel(nn.Module, ModuleUtilsMixin, PushToHubMixin, PeftAdapterMi
     # for example.
     _tp_plan = None

+    # tensor parallel degree to which model is sharded to.
+    _tp_size = None
+
     # A pipeline parallel plan specifying the layers which may not be present
     # on all ranks when PP is enabled. For top-level models, this attribute is
     # currently defined in respective model code. For base models, this
@@ -3878,6 +3881,8 @@ class PreTrainedModel(nn.Module, ModuleUtilsMixin, PushToHubMixin, PeftAdapterMi
                 A torch tensor parallel plan, see [here](https://pytorch.org/tutorials/intermediate/TP_tutorial.html). Currently, it only accepts
                 `tp_plan="auto"` to use predefined plan based on the model. Note that if you use it, you should launch your script accordingly with
                 `torchrun [args] script.py`. This will be much faster than using a `device_map`, but has limitations.
+            tp_size (`int`, *optional*):
+                A torch tensor parallel degree. If not provided, it defaults to the world size.
             offload_folder (`str` or `os.PathLike`, *optional*):
                 If the `device_map` contains any value `"disk"`, the folder where we will offload weights.
             offload_state_dict (`bool`, *optional*):
@@ -3974,6 +3979,7 @@ class PreTrainedModel(nn.Module, ModuleUtilsMixin, PushToHubMixin, PeftAdapterMi
         generation_config = kwargs.pop("generation_config", None)
         gguf_file = kwargs.pop("gguf_file", None)
         tp_plan = kwargs.pop("tp_plan", None)
+        tp_size = kwargs.pop("tp_size", None)
         key_mapping = kwargs.pop("key_mapping", None)
         # Not used anymore -- remove them from the kwargs
         _ = kwargs.pop("resume_download", None)
@@ -3986,7 +3992,8 @@ class PreTrainedModel(nn.Module, ModuleUtilsMixin, PushToHubMixin, PeftAdapterMi
             raise ValueError(
                 "`state_dict` cannot be passed together with a model name or a `gguf_file`. Use one of the two loading strategies."
             )
-
+        if tp_size is not None and tp_plan is None:
+            raise ValueError("tp_plan has to be set when tp_size is passed.")
         if tp_plan is not None and tp_plan != "auto":
             # TODO: we can relax this check when we support taking tp_plan from a json file, for example.
             raise ValueError(f"tp_plan supports 'auto' only for now but got {tp_plan}.")
@@ -4046,9 +4053,10 @@ class PreTrainedModel(nn.Module, ModuleUtilsMixin, PushToHubMixin, PeftAdapterMi
                 sys.stderr = open(os.devnull, "w")
             # This is the easiest way to dispatch to the current process device
             device_map = tp_device
-            # Assuming sharding the model onto the world
-            world_size = torch.distributed.get_world_size()
-            device_mesh = torch.distributed.init_device_mesh(tp_device.type, (world_size,))
+
+            # Assuming sharding the model onto the world when tp_size not provided
+            tp_size = tp_size if tp_size is not None else torch.distributed.get_world_size()
+            device_mesh = torch.distributed.init_device_mesh(tp_device.type, (tp_size,))

         if use_auth_token is not None:
             warnings.warn(
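Review note: taken together, the hunks above change how the tensor parallel degree is chosen inside `from_pretrained`. A minimal sketch of that resolution logic follows, pulled out into a standalone function so it can be run without a distributed setup. The function name `resolve_tp_size` and its `world_size` argument are hypothetical and exist only for illustration; in the commit the world size comes from `torch.distributed.get_world_size()`.

```python
from typing import Optional


def resolve_tp_size(tp_plan: Optional[str], tp_size: Optional[int], world_size: int) -> Optional[int]:
    """Hypothetical helper mirroring the checks shown in the diff.

    `world_size` stands in for torch.distributed.get_world_size().
    """
    # tp_size is meaningless without a plan, so the commit rejects it early.
    if tp_size is not None and tp_plan is None:
        raise ValueError("tp_plan has to be set when tp_size is passed.")
    # Only the predefined "auto" plan is accepted for now.
    if tp_plan is not None and tp_plan != "auto":
        raise ValueError(f"tp_plan supports 'auto' only for now but got {tp_plan}.")
    # No plan: the model is not sharded and the degree stays None.
    if tp_plan is None:
        return None
    # With a plan, fall back to the whole world when no degree is given.
    return tp_size if tp_size is not None else world_size
```

Under these assumptions, `resolve_tp_size("auto", None, 8)` falls back to the world size, while `resolve_tp_size("auto", 4, 8)` keeps the requested degree, which is what lets a caller shard over fewer ranks than the job was launched with.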
@@ -4415,6 +4423,9 @@ class PreTrainedModel(nn.Module, ModuleUtilsMixin, PushToHubMixin, PeftAdapterMi
                 weights_only=weights_only,
             )

+        # record tp degree the model sharded to
+        model._tp_size = tp_size
+
         # make sure token embedding weights are still tied if needed
         model.tie_weights()

@@ -4498,7 +4509,6 @@ class PreTrainedModel(nn.Module, ModuleUtilsMixin, PushToHubMixin, PeftAdapterMi
             elif from_flax:
                 loading_info = None
             return model, loading_info
-
         return model

     @staticmethod
@@ -5142,6 +5152,14 @@ class PreTrainedModel(nn.Module, ModuleUtilsMixin, PushToHubMixin, PeftAdapterMi
             return True
         return False

+    @property
+    def tp_size(self):
+        """
+        Returns the model's tensor parallelism degree.
+        """
+        # if None, the model didn't undergo tensor parallel sharding
+        return self._tp_size
+
     @property
     def supports_pp_plan(self):
         if self._pp_plan is not None:
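Review note: the `_tp_size` class attribute and the read-only `tp_size` property follow a common pattern: a class-level default of `None` that the loader overwrites on the instance once sharding has happened. A minimal sketch of that pattern is below. `ShardedModel` and `mark_sharded` are stand-in names invented for this illustration and do not exist in transformers.

```python
from typing import Optional


class ShardedModel:
    """Stand-in for a model class; not the real PreTrainedModel."""

    # Class-level default: None means the model never went through TP sharding.
    _tp_size: Optional[int] = None

    @property
    def tp_size(self) -> Optional[int]:
        # Read-only view of the degree recorded at load time.
        return self._tp_size


def mark_sharded(model: ShardedModel, tp_size: int) -> ShardedModel:
    """Hypothetical loader step: record the degree on the instance only."""
    model._tp_size = tp_size
    return model
```

Because the loader assigns to the instance, the class default stays `None`, so models created without a tensor parallel plan keep reporting that they are not sharded.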
@@ -459,7 +459,7 @@ class Trainer:
         self.hp_name = None
         self.deepspeed = None
         self.is_in_train = False
-
+        self.model = model
         self.create_accelerator_and_postprocess()

         # memory metrics - must set up as early as possible
@@ -5146,10 +5146,10 @@ class Trainer:
             args.update(accelerator_config)
         # tp is initialized at Accelerator init phase so
         # args should be prepared here
-        if self.args.tp_size > 1:
+        if hasattr(self.model, "tp_size") and self.model.tp_size is not None and self.model.tp_size > 1:
             self.is_tp_enabled = True
             if version.parse(accelerate_version) > version.parse("1.3.0"):
-                args["torch_tp_plugin"] = TorchTensorParallelPlugin(tp_size=self.args.tp_size)
+                args["torch_tp_plugin"] = TorchTensorParallelPlugin(tp_size=self.model.tp_size)
             else:
                 raise ValueError("Requires accelerate>1.3.0 to use Tensor Parallelism.")

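Review note: with this change `Trainer` reads the tensor parallel degree from the model rather than from `TrainingArguments`. The gate can be sketched as a small predicate. Writing it as a free function named `is_tp_enabled` and using `SimpleNamespace` objects as stand-in models are assumptions made for illustration; in the commit the check is inline and sets `self.is_tp_enabled`.

```python
from types import SimpleNamespace


def is_tp_enabled(model) -> bool:
    """Sketch of the Trainer gate: TP is on only for a recorded degree above 1."""
    # getattr with a default covers models that have no tp_size attribute at all,
    # which matches the hasattr(...) guard in the diff.
    tp_size = getattr(model, "tp_size", None)
    return tp_size is not None and tp_size > 1


# Stand-in objects, not real transformers models.
sharded_model = SimpleNamespace(tp_size=4)
unsharded_model = SimpleNamespace(tp_size=None)
legacy_model = object()

print(is_tp_enabled(sharded_model), is_tp_enabled(unsharded_model), is_tp_enabled(legacy_model))
```

The three guards matter in that order: objects without the attribute, models loaded without a plan (`None`), and models loaded with a degree of 1 all leave tensor parallelism disabled.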
@@ -554,10 +554,6 @@ class TrainingArguments:
                     Will use gradient checkpointing over each nested XLA FSDP wrapped layer. This setting can only be
                     used when the xla flag is set to true, and an auto wrapping policy is specified through
                     fsdp_min_num_params or fsdp_transformer_layer_cls_to_wrap.
-        tp_size (`int`, *optional*):
-            Use tp_size to enable PyTorch tensor parallelism. Tensor parallelism support is only available to models having `base_tp_plan`
-            in their respective config classes.
-            Set a value greater than 1 to activate TP. The same is used to prepare device mesh internally. Requires accelerate>1.3.0.
         deepspeed (`str` or `dict`, *optional*):
             Use [Deepspeed](https://github.com/deepspeedai/DeepSpeed). This is an experimental feature and its API may
             evolve in the future. The value is either the location of DeepSpeed json config file (e.g.,
@@ -1244,18 +1240,6 @@ class TrainingArguments:
             )
         },
     )
-    tp_size: Optional[int] = field(
-        default=0,
-        metadata={
-            "help": (
-                "Use tp_size to enable pytorch tensor parallelism."
-                "Tensor parallelism support is only available to models having `base_tp_plan` in their respective config classes."
-                "Set a value greater than 1 to activate TP."
-                "The same is used to prepare device mesh internally."
-                "Requires accelerate>1.3.0."
-            )
-        },
-    )
     fsdp_transformer_layer_cls_to_wrap: Optional[str] = field(
         default=None,
         metadata={
@@ -1941,14 +1925,6 @@ class TrainingArguments:
             if self.fsdp_config["xla_fsdp_grad_ckpt"]:
                 warnings.warn("`--xla_fsdp_grad_ckpt` is useful only when `--xla` is set to true.")

-        if self.tp_size > 1:
-            if not is_accelerate_available("1.3.1"):
-                raise NotImplementedError(
-                    "TP using PyTorch requires Accelerate version `accelerate` >= 1.3.1. "
-                    "This is not supported and we recommend you to update your version."
-                )
-            os.environ["ACCELERATE_USE_TP"] = "true"
-            os.environ["TP_SIZE"] = str(self.tp_size)
         # accelerate integration for FSDP
         if len(self.fsdp) > 0 and not self.fsdp_config["xla"]:
             os.environ["ACCELERATE_USE_FSDP"] = "true"