Patch release: v4.27.4

Revert "Error (also in original) model, scaling only q matrix not qk.T dot product (qk.T/sqrt(dim_per_head))" (#22444)
Revert "Error (also in original) model, scaling only q matrix not qk.T dot product (qk.T/sqrt(dim_per_head)) (#21627)" This reverts commit bad8300837.
5 changed files with 9 additions and 5 deletions
--- a/setup.py
+++ b/setup.py
@@ -418,7 +418,7 @@ install_requires = [
 setup(
    name="transformers",
-    version="4.27.1",  # expected format is one of x.y.z.dev0, or x.y.z.rc1 or x.y.z (no to dashes, yes to dots)
+    version="4.27.4",  # expected format is one of x.y.z.dev0, or x.y.z.rc1 or x.y.z (no to dashes, yes to dots)
    author="The Hugging Face team (past and future) with the help of all our contributors (https://github.com/huggingface/transformers/graphs/contributors)",
    author_email="transformers@huggingface.co",
    description="State-of-the-art Machine Learning for JAX, PyTorch and TensorFlow",
--- a/src/transformers/__init__.py
+++ b/src/transformers/__init__.py
@@ -18,7 +18,7 @@
 # to defer the actual importing for when the objects are requested. This way `import transformers` provides the names
 # in the namespace without actually importing anything (and especially none of the backends).
-__version__ = "4.27.1"
+__version__ = "4.27.4"
 from typing import TYPE_CHECKING
--- a/src/transformers/modeling_utils.py
+++ b/src/transformers/modeling_utils.py
@@ -2563,7 +2563,7 @@ class PreTrainedModel(nn.Module, ModuleUtilsMixin, GenerationMixin, PushToHubMix
            elif device_map in ["balanced", "balanced_low_0"] and get_balanced_memory is None:
                raise ValueError(f"`device_map={device_map}` requires a source install of Accelerate.")
-            kwargs = {"no_split_module_classes": no_split_modules, "max_memory": max_memory}
+            kwargs = {"no_split_module_classes": no_split_modules}
            if "special_dtypes" in inspect.signature(infer_auto_device_map).parameters:
                kwargs["special_dtypes"] = special_dtypes
            elif len(special_dtypes) > 0:
@@ -2576,8 +2576,10 @@ class PreTrainedModel(nn.Module, ModuleUtilsMixin, GenerationMixin, PushToHubMix
                    model,
                    dtype=torch_dtype,
                    low_zero=(device_map == "balanced_low_0"),
+                    max_memory=max_memory,
                    **kwargs,
                )
+            kwargs["max_memory"] = max_memory
            # Make sure tied weights are tied before creating the device map.
            model.tie_weights()
            device_map = infer_auto_device_map(model, dtype=torch_dtype if not load_in_8bit else torch.int8, **kwargs)
--- a/src/transformers/models/flaubert/modeling_flaubert.py
+++ b/src/transformers/models/flaubert/modeling_flaubert.py
@@ -172,7 +172,8 @@ class MultiHeadAttention(nn.Module):
                    k, v = cache[self.layer_id]
            cache[self.layer_id] = (k, v)
-        scores = torch.matmul(q, k.transpose(2, 3)) / math.sqrt(dim_per_head)  # (bs, n_heads, qlen, klen)
+        q = q / math.sqrt(dim_per_head)  # (bs, n_heads, qlen, dim_per_head)
+        scores = torch.matmul(q, k.transpose(2, 3))  # (bs, n_heads, qlen, klen)
        mask = (mask == 0).view(mask_reshape).expand_as(scores)  # (bs, n_heads, qlen, klen)
        scores.masked_fill_(mask, torch.finfo(scores.dtype).min)  # (bs, n_heads, qlen, klen)
--- a/src/transformers/models/xlm/modeling_xlm.py
+++ b/src/transformers/models/xlm/modeling_xlm.py
@@ -176,7 +176,8 @@ class MultiHeadAttention(nn.Module):
                    k, v = cache[self.layer_id]
            cache[self.layer_id] = (k, v)
-        scores = torch.matmul(q, k.transpose(2, 3)) / math.sqrt(dim_per_head)  # (bs, n_heads, qlen, klen)
+        q = q / math.sqrt(dim_per_head)  # (bs, n_heads, qlen, dim_per_head)
+        scores = torch.matmul(q, k.transpose(2, 3))  # (bs, n_heads, qlen, klen)
        mask = (mask == 0).view(mask_reshape).expand_as(scores)  # (bs, n_heads, qlen, klen)
        scores.masked_fill_(mask, torch.finfo(scores.dtype).min)  # (bs, n_heads, qlen, klen)
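
Note on the Flaubert and XLM hunks: dividing `q` by `sqrt(dim_per_head)` before the matmul and dividing the `q @ k.T` product afterwards are algebraically the same, so the two variants can differ only by floating-point rounding. The sketch below checks this in plain Python for a single head. Nested lists stand in for tensors, and the helper names (`matmul_qk_t`, `scores_scale_q_first`, `scores_scale_after`) are hypothetical, not part of transformers.

```python
import math


def matmul_qk_t(q, k):
    """Return q @ k.T for nested lists q (qlen x d) and k (klen x d)."""
    return [[sum(qi * ki for qi, ki in zip(q_row, k_row)) for k_row in k] for q_row in q]


def scores_scale_q_first(q, k, dim_per_head):
    # Variant restored by the revert: scale q, then take the dot product.
    scale = math.sqrt(dim_per_head)
    q_scaled = [[x / scale for x in row] for row in q]
    return matmul_qk_t(q_scaled, k)


def scores_scale_after(q, k, dim_per_head):
    # Variant that was reverted: take the dot product, then scale the scores.
    scale = math.sqrt(dim_per_head)
    return [[s / scale for s in row] for row in matmul_qk_t(q, k)]


# Toy inputs (assumed values): qlen=2, klen=3, dim_per_head=4.
q = [[1.0, 2.0, 3.0, 4.0], [0.5, -1.0, 2.0, 0.0]]
k = [[0.1, 0.2, 0.3, 0.4], [1.0, 0.0, -1.0, 2.0], [2.0, 2.0, 2.0, 2.0]]

a = scores_scale_q_first(q, k, dim_per_head=4)
b = scores_scale_after(q, k, dim_per_head=4)

# Largest element-wise gap between the two variants.
max_abs_diff = max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))
```

Here `max_abs_diff` stays within rounding noise. In lower precision or with other head sizes the two variants may differ in the last bits, which this sketch does not measure.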
Author	SHA1	Message	Date
Sylvain Gugger	4e9f6fc67c	Patch release: v4.27.4	2023-03-29 11:42:29 -04:00
Sylvain Gugger	4277b3dd46	Revert "Error (also in original) model, scaling only q matrix not qk.T dot product (qk.T/sqrt(dim_per_head))" (#22444) Revert "Error (also in original) model, scaling only q matrix not qk.T dot product (qk.T/sqrt(dim_per_head)) (#21627)" This reverts commit `bad8300837`.	2023-03-29 11:42:09 -04:00
Sylvain Gugger	5e3b19a805	Patch release: v4.27.3	2023-03-23 14:04:40 -04:00
Sylvain Gugger	62d9baa53c	Enforce `max_memory` for device_map strategies (#22311)	2023-03-23 14:04:10 -04:00
Sylvain Gugger	68287689f2	Patch release: v4.27.2	2023-03-20 12:02:35 -04:00
Sylvain Gugger	1e39734c4b	Fix balanced and auto device_map (#22271)	2023-03-20 12:01:08 -04:00
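
Note on the `modeling_utils.py` hunks: after the change, `max_memory` is kept out of the shared `kwargs`, passed explicitly to the balancing call, and the resulting budget is written back into `kwargs` before the device map is inferred. A minimal sketch of that flow, assuming toy stand-ins: `balance_stub`, `infer_map_stub` and `plan` are hypothetical functions invented for illustration and do not reproduce Accelerate's `get_balanced_memory` or `infer_auto_device_map`.

```python
import inspect


def balance_stub(sizes, max_memory=None, **kwargs):
    """Hypothetical balancing step: an even per-device share, capped by user limits."""
    devices = sorted(max_memory) if max_memory else [0, 1]
    share = -(-sum(sizes.values()) // len(devices))  # ceiling division
    if not max_memory:
        return {d: share for d in devices}
    return {d: min(share, max_memory[d]) for d in devices}


def infer_map_stub(sizes, max_memory=None, no_split_module_classes=None):
    """Hypothetical device-map inference: greedy first-fit, "cpu" as the fallback."""
    remaining = dict(max_memory) if max_memory else {0: float("inf")}
    placement = {}
    for name, size in sizes.items():
        for device in sorted(remaining):
            if remaining[device] >= size:
                remaining[device] -= size
                placement[name] = device
                break
        else:
            placement[name] = "cpu"
    return placement


def plan(sizes, strategy, user_max_memory=None, special_dtypes=None):
    # Shared kwargs start without max_memory, as in the patched code.
    kwargs = {"no_split_module_classes": None}
    # Forward an optional argument only if the callee's signature accepts it.
    if "special_dtypes" in inspect.signature(infer_map_stub).parameters:
        kwargs["special_dtypes"] = special_dtypes
    max_memory = user_max_memory
    if strategy != "sequential":
        # The user's limits reach the balancing step explicitly.
        max_memory = balance_stub(sizes, max_memory=max_memory, **kwargs)
    # The (possibly balanced) budget is what the inference step sees.
    kwargs["max_memory"] = max_memory
    return infer_map_stub(sizes, **kwargs)


# Toy module sizes in arbitrary units (assumed values).
sizes = {"embed": 4, "layer0": 4, "layer1": 4, "head": 4}
unlimited = plan(sizes, "balanced")
limited = plan(sizes, "balanced", user_max_memory={0: 6, 1: 16})
```

In this toy run the 6-unit limit on device 0 survives the balancing step, so only one 4-unit module lands there.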