Model debugger upgrades (#37391)

* debugging improvements * add debugging details * add more debugging details * debug more * clean up layers + output * add summary json file * cleanup * copies 👀 * remove hooks + add documentation * draft a small test, why not * respect the format (respect it) * fixup imports * nit * add tests and configurable pruning of layers
2025-04-18 16:45:54 +02:00
parent e5ac23081e
commit 4afd3f4820
5 changed files with 425 additions and 112 deletions
--- a/docs/source/en/internal/model_debugging_utils.md
+++ b/docs/source/en/internal/model_debugging_utils.md
@@ -28,7 +28,7 @@ Most of those are only useful if you are adding new models in the library.

 This context manager is a power user tool intended for model adders.
 It tracks all forward calls within a model forward and logs a slice of each input and output on a nested Json.
-To note, this context manager enforces `torch.inference_mode()`.
+To note, this context manager enforces `torch.no_grad()`.

 ### Rationale

@@ -43,6 +43,7 @@ import torch
 from PIL import Image
 import requests
 from transformers import LlavaProcessor, LlavaForConditionalGeneration
+from transformers.model_debugging_utils import model_addition_debugger_context
 torch.random.manual_seed(673)

 # load pretrained model and processor
@@ -60,12 +61,153 @@ prompt = "<image>Describe this image."
 inputs = processor(text=prompt, images=random_image, return_tensors="pt")

 # call forward method (not .generate!)
-with model_addition_debugger_context(model, "optional_path_to_your_output_file.json"):
+with model_addition_debugger_context(
+  model,
+  debug_path="optional_path_to_your_directory",
+  do_prune_layers=False # This will output ALL the layers of a model.
+  ):
    output = model.forward(**inputs)

 ```


-[[autodoc]] model_addition_debugger
+### Reading results
+
+The debugger generates two files from the forward call, both with the same base name, 
+but ending either with `_SUMMARY.json` or with `_FULL_TENSORS.json`. 
+
+The first one will contain a summary of each module's _input_ and _output_ tensor values and shapes.
+
+```json
+{
+  "module_path": "MolmoForConditionalGeneration",
+  "inputs": {
+    "args": [],
+    "kwargs": {
+      "input_ids": {
+        "shape": "torch.Size([1, 589])",
+        "dtype": "torch.int64"
+      },
+      "attention_mask": {
+        "shape": "torch.Size([1, 589])",
+        "dtype": "torch.int64"
+      },
+      "pixel_values": {
+        "shape": "torch.Size([1, 5, 576, 588])",
+        "dtype": "torch.float32",
+        "mean": "tensor(-8.9514e-01, device='cuda:0')",
+        "std": "tensor(9.2586e-01, device='cuda:0')",
+        "min": "tensor(-1.7923e+00, device='cuda:0')",
+        "max": "tensor(1.8899e+00, device='cuda:0')"
+    }
+  },
+  "children": [
+    {
+      "module_path": "MolmoForConditionalGeneration.language_model.model.embed_tokens",
+      "inputs": {
+        "args": [
+          {
+            "shape": "torch.Size([1, 589])",
+            "dtype": "torch.int64"
+          }
+        ]
+      },
+      "outputs": {
+        "shape": "torch.Size([1, 589, 3584])",
+        "dtype": "torch.float32",
+        "mean": "tensor(6.5460e-06, device='cuda:0')",
+        "std": "tensor(2.3807e-02, device='cuda:0')",
+        "min": "tensor(-3.3398e-01, device='cuda:0')",
+        "max": "tensor(3.9453e-01, device='cuda:0')"
+      }
+    },
+    {
+      "module_path": "MolmoForConditionalGeneration.vision_tower",
+      "inputs": {
+        "args": [
+          {
+            "shape": "torch.Size([5, 1, 576, 588])",
+            "dtype": "torch.float32",
+            "mean": "tensor(-8.9514e-01, device='cuda:0')",
+            "std": "tensor(9.2586e-01, device='cuda:0')",
+            "min": "tensor(-1.7923e+00, device='cuda:0')",
+            "max": "tensor(1.8899e+00, device='cuda:0')"
+          }
+        ],
+        "kwargs": {
+          "output_hidden_states": "True"
+        }
+      },
+      "children": [
+        { ... and so on
+```
+
+The `_FULL_TENSORS.json` file will display a full view of all tensors, which is useful
+for comparing two files. 
+```json
+      "pixel_values": {
+        "shape": "torch.Size([1, 5, 576, 588])",
+        "dtype": "torch.float32",
+        "value": [
+          "tensor([[[[-1.7923e+00, -1.7521e+00, -1.4802e+00,  ..., -1.7923e+00, -1.7521e+00, -1.4802e+00],",
+          "          [-1.7923e+00, -1.7521e+00, -1.4802e+00,  ..., -1.7923e+00, -1.7521e+00, -1.4802e+00],",
+          "          [-1.7923e+00, -1.7521e+00, -1.4802e+00,  ..., -1.7923e+00, -1.7521e+00, -1.4802e+00],",
+          "          ...,",
+          "          [-1.7923e+00, -1.7521e+00, -1.4802e+00,  ..., -1.7923e+00, -1.7521e+00, -1.4802e+00],",
+          "          [-1.7923e+00, -1.7521e+00, -1.4802e+00,  ..., -1.7923e+00, -1.7521e+00, -1.4802e+00],",
+          "          [-1.7923e+00, -1.7521e+00, -1.4802e+00,  ..., -1.7923e+00, -1.7521e+00, -1.4802e+00]],",
+          "",
+          "         [[-1.7923e+00, -1.7521e+00, -1.4802e+00,  ..., -1.7923e+00, -1.7521e+00, -1.4802e+00],",
+          "          [-1.7923e+00, -1.7521e+00, -1.4802e+00,  ..., -1.7923e+00, -1.7521e+00, -1.4802e+00],",
+          "          [-1.7923e+00, -1.7521e+00, -1.4802e+00,  ..., -1.7923e+00, -1.7521e+00, -1.4802e+00],",
+          "          ...,",
+          "          [-1.4857e+00, -1.4820e+00, -1.2100e+00,  ..., -6.0979e-01, -5.9650e-01, -3.8527e-01],",
+          "          [-1.6755e+00, -1.7221e+00, -1.4518e+00,  ..., -7.5577e-01, -7.4658e-01, -5.5592e-01],",
+          "          [-7.9957e-01, -8.2162e-01, -5.7014e-01,  ..., -1.3689e+00, -1.3169e+00, -1.0678e+00]],",
+          "",
+          "         [[-1.7923e+00, -1.7521e+00, -1.4802e+00,  ..., -1.7923e+00, -1.7521e+00, -1.4802e+00],",
+          "          [-1.7923e+00, -1.7521e+00, -1.4802e+00,  ..., -1.7923e+00, -1.7521e+00, -1.4802e+00],",
+          "          [-1.7923e+00, -1.7521e+00, -1.4802e+00,  ..., -1.7923e+00, -1.7521e+00, -1.4802e+00],",
+          "          ...,",
+          "          [-3.0322e-01, -5.0645e-01, -5.8436e-01,  ..., -6.2439e-01, -7.9160e-01, -8.1188e-01],",
+          "          [-4.4921e-01, -6.5653e-01, -7.2656e-01,  ..., -3.4702e-01, -5.2146e-01, -5.1326e-01],",
+          "          [-3.4702e-01, -5.3647e-01, -5.4170e-01,  ..., -1.0915e+00, -1.1968e+00, -1.0252e+00]],",
+          "",
+          "         [[-1.1207e+00, -1.2718e+00, -1.0678e+00,  ..., 1.2013e-01, -1.3126e-01, -1.7197e-01],",
+          "          [-6.9738e-01, -9.1166e-01, -8.5454e-01,  ..., -5.5050e-02, -2.8134e-01, -4.2793e-01],",
+          "          [-3.4702e-01, -5.5148e-01, -5.8436e-01,  ..., 1.9312e-01, -8.6235e-02, -2.1463e-01],",
+          "          ...,",
+          "          [-1.7923e+00, -1.7521e+00, -1.4802e+00,  ..., -1.7923e+00, -1.7521e+00, -1.4802e+00],",
+          "          [-1.7923e+00, -1.7521e+00, -1.4802e+00,  ..., -1.7923e+00, -1.7521e+00, -1.4802e+00],",
+          "          [-1.7923e+00, -1.7521e+00, -1.4802e+00,  ..., -1.7923e+00, -1.7521e+00, -1.4802e+00]],",
+          "",
+          "         [[-1.0039e+00, -9.5669e-01, -6.5546e-01,  ..., -1.4711e+00, -1.4219e+00, -1.1389e+00],",
+          "          [-1.0039e+00, -9.5669e-01, -6.5546e-01,  ..., -1.7193e+00, -1.6771e+00, -1.4091e+00],",
+          "          [-1.6317e+00, -1.6020e+00, -1.2669e+00,  ..., -1.2667e+00, -1.2268e+00, -8.9720e-01],",
+          "          ...,",
+          "          [-1.7923e+00, -1.7521e+00, -1.4802e+00,  ..., -1.7923e+00, -1.7521e+00, -1.4802e+00],",
+          "          [-1.7923e+00, -1.7521e+00, -1.4802e+00,  ..., -1.7923e+00, -1.7521e+00, -1.4802e+00],",
+          "          [-1.7923e+00, -1.7521e+00, -1.4802e+00,  ..., -1.7923e+00, -1.7521e+00, -1.4802e+00]]]], device='cuda:0')"
+        ],
+        "mean": "tensor(-8.9514e-01, device='cuda:0')",
+        "std": "tensor(9.2586e-01, device='cuda:0')",
+        "min": "tensor(-1.7923e+00, device='cuda:0')",
+        "max": "tensor(1.8899e+00, device='cuda:0')"
+      },
+```
+
+### Comparing between implementations
+
+Once the forward passes of two models have been traced by the debugger, one can compare the `json` output files. See below: we can see slight differences between these two implementations' key projection layer. Inputs are mostly identical, but not quite. Looking through the file differences makes it easier to pinpoint which layer is wrong. 
+
+
+![download-icon](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/files_difference_debugging.png)
+
+
+### Limitations and scope
+
+This feature will only work for torch-based models, and would require more work and case-by-case approach for say `jax`-based models that are usually compiled. Models relying heavily on external kernel calls may work, but trace will probably miss some things. Regardless, any python implementation that aims at mimicking another implementation can be traced once instead of reran N times with breakpoints.
+
+If you pass `do_prune_layers=False` to your model debugger, ALL the layers will be outputted to `json`. Else, only the first and last layer will be shown. This is useful when some layers (typically cross-attention) appear only after N layers. 

 [[autodoc]] model_addition_debugger_context