From 8d40ca5749dff9e0dcdc18eba4165691fe951f42 Mon Sep 17 00:00:00 2001
From: Tanuj Rai <tanujrai898@gmail.com>
Date: Mon, 14 Jul 2025 23:05:17 +0530
Subject: [PATCH] Update phi4_multimodal.md (#38830)

* Update phi4_multimodal.md

* Update docs/source/en/model_doc/phi4_multimodal.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Update docs/source/en/model_doc/phi4_multimodal.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Update docs/source/en/model_doc/phi4_multimodal.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Update docs/source/en/model_doc/phi4_multimodal.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Update docs/source/en/model_doc/phi4_multimodal.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Update phi4_multimodal.md

* Update phi4_multimodal.md

* Update phi4_multimodal.md

* Update phi4_multimodal.md

* Update phi4_multimodal.md

---------

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
---
 docs/source/en/model_doc/phi4_multimodal.md | 70 ++++++++++++++-------
 1 file changed, 48 insertions(+), 22 deletions(-)
diff --git a/docs/source/en/model_doc/phi4_multimodal.md b/docs/source/en/model_doc/phi4_multimodal.md
index 22b55792f6..f7d93d2617 100644
--- a/docs/source/en/model_doc/phi4_multimodal.md
+++ b/docs/source/en/model_doc/phi4_multimodal.md
@@ -9,44 +9,53 @@ specific language governing permissions and limitations under the License.
 rendered properly in your Markdown viewer.
 -->
 
-# Phi4 Multimodal
+<div style="float: right;">
+  <div class="flex flex-wrap space-x-1">
+    <img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-EE4C2C?logo=pytorch&logoColor=white&style=flat">
+  </div>
+</div>
 
-## Overview
+## Phi4 Multimodal
 
-Phi4 Multimodal is a lightweight open multimodal foundation model that leverages the language, vision, and speech research and datasets used for Phi-3.5 and 4.0 models. The model processes text, image, and audio inputs, generating text outputs, and comes with 128K token context length. The model underwent an enhancement process, incorporating both supervised fine-tuning, direct preference optimization and RLHF (Reinforcement Learning from Human Feedback) to support precise instruction adherence and safety measures. The languages that each modal supports are the following:
+[Phi4 Multimodal](https://huggingface.co/papers/2503.01743) is a multimodal model capable of text, image, and speech and audio inputs or any combination of these. It features a mixture of LoRA adapters for handling different inputs, and each input is routed to the appropriate encoder.
 
-- Text: Arabic, Chinese, Czech, Danish, Dutch, English, Finnish, French, German, Hebrew, Hungarian, Italian, Japanese, Korean, Norwegian, Polish, Portuguese, Russian, Spanish, Swedish, Thai, Turkish, Ukrainian
-- Vision: English
-- Audio: English, Chinese, German, French, Italian, Japanese, Spanish, Portuguese
+You can find all the original Phi4 Multimodal checkpoints under the [Phi4](https://huggingface.co/collections/microsoft/phi-4-677e9380e514feb5577a40e4) collection.
 
-This model was contributed by [Cyril Vallez](https://huggingface.co/cyrilvallez). The most recent code can be
-found [here](https://github.com/huggingface/transformers/blob/main/src/transformers/models/phi4_multimodal/modeling_phi4_multimodal.py).
+> [!TIP]
+> This model was contributed by [cyrilvallez](https://huggingface.co/cyrilvallez).
+>
+> Click on the Phi-4 Multimodal in the right sidebar for more examples of how to apply Phi-4 Multimodal to different tasks.
 
+The example below demonstrates how to generate text based on an image with [`Pipeline`] or the [`AutoModel`] class.
 
-## Usage tips
+<hfoptions id="usage">
+<hfoption id="Pipeline">
 
-`Phi4-multimodal-instruct` can be found on the [Huggingface Hub](https://huggingface.co/microsoft/Phi-4-multimodal-instruct)
+```python
+from transformers import pipeline
+generator = pipeline("text-generation", model="microsoft/Phi-4-multimodal-instruct", torch_dtype="auto", device=0)
 
-In the following, we demonstrate how to use it for inference depending on the input modalities (text, image, audio).
+prompt = "Explain the concept of multimodal AI in simple terms."
+
+result = generator(prompt, max_length=50)
+print(result[0]['generated_text'])
+```
+
+</hfoption>
+<hfoption id="AutoModel">
 
 ```python
 import torch
 from transformers import AutoModelForCausalLM, AutoProcessor, GenerationConfig
 
-
-# Define model path
 model_path = "microsoft/Phi-4-multimodal-instruct"
 device = "cuda:0"
 
-# Load model and processor
 processor = AutoProcessor.from_pretrained(model_path)
-model = AutoModelForCausalLM.from_pretrained(model_path, device_map=device,  torch_dtype=torch.float16)
+model = AutoModelForCausalLM.from_pretrained(model_path, device_map=device, torch_dtype=torch.float16)
 
-# Optional: load the adapters (note that without them, the base model will very likely not work well)
-model.load_adapter(model_path, adapter_name="speech", device_map=device, adapter_kwargs={"subfolder": 'speech-lora'})
 model.load_adapter(model_path, adapter_name="vision", device_map=device, adapter_kwargs={"subfolder": 'vision-lora'})
 
-# Part : Image Processing
 messages = [
     {
         "role": "user",
@@ -57,7 +66,7 @@ messages = [
     },
 ]
 
-model.set_adapter("vision") # if loaded, activate the vision adapter
+model.set_adapter("vision")
 inputs = processor.apply_chat_template(
     messages,
     add_generation_prompt=True,
@@ -66,7 +75,6 @@ inputs = processor.apply_chat_template(
     return_tensors="pt",
 ).to(device)
 
-# Generate response
 generate_ids = model.generate(
     **inputs,
     max_new_tokens=1000,
@@ -77,10 +85,27 @@ response = processor.batch_decode(
     generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
 )[0]
 print(f'>>> Response\n{response}')
+```
 
+</hfoption>
+</hfoptions>
 
-# Part 2: Audio Processing
-model.set_adapter("speech") # if loaded, activate the speech adapter
+## Notes
+
+The example below demonstrates inference with an audio and text input.
+
+```py
+import torch
+from transformers import AutoModelForCausalLM, AutoProcessor, GenerationConfig
+
+model_path = "microsoft/Phi-4-multimodal-instruct"
+device = "cuda:0"
+
+processor = AutoProcessor.from_pretrained(model_path)
+model = AutoModelForCausalLM.from_pretrained(model_path, device_map=device,  torch_dtype=torch.float16)
+
+model.load_adapter(model_path, adapter_name="speech", device_map=device, adapter_kwargs={"subfolder": 'speech-lora'})
+model.set_adapter("speech")
 audio_url = "https://upload.wikimedia.org/wikipedia/commons/b/b0/Barbara_Sahakian_BBC_Radio4_The_Life_Scientific_29_May_2012_b01j5j24.flac"
 messages = [
     {
@@ -110,6 +135,7 @@ response = processor.batch_decode(
     generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
 )[0]
 print(f'>>> Response\n{response}')
+
 ```
 
 ## Phi4MultimodalFeatureExtractor