Update phi4_multimodal.md (#38830)
* Update phi4_multimodal.md * Update docs/source/en/model_doc/phi4_multimodal.md Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com> * Update docs/source/en/model_doc/phi4_multimodal.md Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com> * Update docs/source/en/model_doc/phi4_multimodal.md Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com> * Update docs/source/en/model_doc/phi4_multimodal.md Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com> * Update docs/source/en/model_doc/phi4_multimodal.md Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com> * Update phi4_multimodal.md * Update phi4_multimodal.md * Update phi4_multimodal.md * Update phi4_multimodal.md * Update phi4_multimodal.md --------- Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
This commit is contained in:
@@ -9,44 +9,53 @@ specific language governing permissions and limitations under the License.
|
||||
rendered properly in your Markdown viewer.
|
||||
-->
|
||||
|
||||
# Phi4 Multimodal
|
||||
<div style="float: right;">
|
||||
<div class="flex flex-wrap space-x-1">
|
||||
<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-EE4C2C?logo=pytorch&logoColor=white&style=flat">
|
||||
</div>
|
||||
</div>
|
||||
|
||||
## Overview
|
||||
## Phi4 Multimodal
|
||||
|
||||
Phi4 Multimodal is a lightweight open multimodal foundation model that leverages the language, vision, and speech research and datasets used for Phi-3.5 and 4.0 models. The model processes text, image, and audio inputs, generating text outputs, and comes with 128K token context length. The model underwent an enhancement process, incorporating both supervised fine-tuning, direct preference optimization and RLHF (Reinforcement Learning from Human Feedback) to support precise instruction adherence and safety measures. The languages that each modal supports are the following:
|
||||
[Phi4 Multimodal](https://huggingface.co/papers/2503.01743) is a multimodal model capable of text, image, and speech and audio inputs or any combination of these. It features a mixture of LoRA adapters for handling different inputs, and each input is routed to the appropriate encoder.
|
||||
|
||||
- Text: Arabic, Chinese, Czech, Danish, Dutch, English, Finnish, French, German, Hebrew, Hungarian, Italian, Japanese, Korean, Norwegian, Polish, Portuguese, Russian, Spanish, Swedish, Thai, Turkish, Ukrainian
|
||||
- Vision: English
|
||||
- Audio: English, Chinese, German, French, Italian, Japanese, Spanish, Portuguese
|
||||
You can find all the original Phi4 Multimodal checkpoints under the [Phi4](https://huggingface.co/collections/microsoft/phi-4-677e9380e514feb5577a40e4) collection.
|
||||
|
||||
This model was contributed by [Cyril Vallez](https://huggingface.co/cyrilvallez). The most recent code can be
|
||||
found [here](https://github.com/huggingface/transformers/blob/main/src/transformers/models/phi4_multimodal/modeling_phi4_multimodal.py).
|
||||
> [!TIP]
|
||||
> This model was contributed by [cyrilvallez](https://huggingface.co/cyrilvallez).
|
||||
>
|
||||
> Click on the Phi-4 Multimodal in the right sidebar for more examples of how to apply Phi-4 Multimodal to different tasks.
|
||||
|
||||
The example below demonstrates how to generate text based on an image with [`Pipeline`] or the [`AutoModel`] class.
|
||||
|
||||
## Usage tips
|
||||
<hfoptions id="usage">
|
||||
<hfoption id="Pipeline">
|
||||
|
||||
`Phi4-multimodal-instruct` can be found on the [Huggingface Hub](https://huggingface.co/microsoft/Phi-4-multimodal-instruct)
|
||||
```python
|
||||
from transformers import pipeline
|
||||
generator = pipeline("text-generation", model="microsoft/Phi-4-multimodal-instruct", torch_dtype="auto", device=0)
|
||||
|
||||
In the following, we demonstrate how to use it for inference depending on the input modalities (text, image, audio).
|
||||
prompt = "Explain the concept of multimodal AI in simple terms."
|
||||
|
||||
result = generator(prompt, max_length=50)
|
||||
print(result[0]['generated_text'])
|
||||
```
|
||||
|
||||
</hfoption>
|
||||
<hfoption id="AutoModel">
|
||||
|
||||
```python
|
||||
import torch
|
||||
from transformers import AutoModelForCausalLM, AutoProcessor, GenerationConfig
|
||||
|
||||
|
||||
# Define model path
|
||||
model_path = "microsoft/Phi-4-multimodal-instruct"
|
||||
device = "cuda:0"
|
||||
|
||||
# Load model and processor
|
||||
processor = AutoProcessor.from_pretrained(model_path)
|
||||
model = AutoModelForCausalLM.from_pretrained(model_path, device_map=device, torch_dtype=torch.float16)
|
||||
model = AutoModelForCausalLM.from_pretrained(model_path, device_map=device, torch_dtype=torch.float16)
|
||||
|
||||
# Optional: load the adapters (note that without them, the base model will very likely not work well)
|
||||
model.load_adapter(model_path, adapter_name="speech", device_map=device, adapter_kwargs={"subfolder": 'speech-lora'})
|
||||
model.load_adapter(model_path, adapter_name="vision", device_map=device, adapter_kwargs={"subfolder": 'vision-lora'})
|
||||
|
||||
# Part : Image Processing
|
||||
messages = [
|
||||
{
|
||||
"role": "user",
|
||||
@@ -57,7 +66,7 @@ messages = [
|
||||
},
|
||||
]
|
||||
|
||||
model.set_adapter("vision") # if loaded, activate the vision adapter
|
||||
model.set_adapter("vision")
|
||||
inputs = processor.apply_chat_template(
|
||||
messages,
|
||||
add_generation_prompt=True,
|
||||
@@ -66,7 +75,6 @@ inputs = processor.apply_chat_template(
|
||||
return_tensors="pt",
|
||||
).to(device)
|
||||
|
||||
# Generate response
|
||||
generate_ids = model.generate(
|
||||
**inputs,
|
||||
max_new_tokens=1000,
|
||||
@@ -77,10 +85,27 @@ response = processor.batch_decode(
|
||||
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
|
||||
)[0]
|
||||
print(f'>>> Response\n{response}')
|
||||
```
|
||||
|
||||
</hfoption>
|
||||
</hfoptions>
|
||||
|
||||
# Part 2: Audio Processing
|
||||
model.set_adapter("speech") # if loaded, activate the speech adapter
|
||||
## Notes
|
||||
|
||||
The example below demonstrates inference with an audio and text input.
|
||||
|
||||
```py
|
||||
import torch
|
||||
from transformers import AutoModelForCausalLM, AutoProcessor, GenerationConfig
|
||||
|
||||
model_path = "microsoft/Phi-4-multimodal-instruct"
|
||||
device = "cuda:0"
|
||||
|
||||
processor = AutoProcessor.from_pretrained(model_path)
|
||||
model = AutoModelForCausalLM.from_pretrained(model_path, device_map=device, torch_dtype=torch.float16)
|
||||
|
||||
model.load_adapter(model_path, adapter_name="speech", device_map=device, adapter_kwargs={"subfolder": 'speech-lora'})
|
||||
model.set_adapter("speech")
|
||||
audio_url = "https://upload.wikimedia.org/wikipedia/commons/b/b0/Barbara_Sahakian_BBC_Radio4_The_Life_Scientific_29_May_2012_b01j5j24.flac"
|
||||
messages = [
|
||||
{
|
||||
@@ -110,6 +135,7 @@ response = processor.batch_decode(
|
||||
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
|
||||
)[0]
|
||||
print(f'>>> Response\n{response}')
|
||||
|
||||
```
|
||||
|
||||
## Phi4MultimodalFeatureExtractor
|
||||
|
||||
Reference in New Issue
Block a user