Add video modality for InstructBLIP (#30182)
* squash in single commit
* add docs
* dummy obj
* more changes in diff converter
* tiny fix
* make docs happy
* skip test
* repo consistency tests
* update docstring
* style
* fix tests
* change diff imports
* [run-slow] instructblipvideo
* [run-slow] instructblipvideo
* fix tests and remove logit check
* [run-slow] instructblipvideo
committed by GitHub
parent a958c4a801
commit fc689d75a0
@@ -776,6 +776,8 @@
  title: Idefics2
- local: model_doc/instructblip
  title: InstructBLIP
- local: model_doc/instructblipvideo
  title: InstructBlipVideo
- local: model_doc/kosmos-2
  title: KOSMOS-2
- local: model_doc/layoutlm
@@ -165,6 +165,7 @@ Flax), PyTorch, and/or TensorFlow.
| [ImageGPT](model_doc/imagegpt) | ✅ | ❌ | ❌ |
| [Informer](model_doc/informer) | ✅ | ❌ | ❌ |
| [InstructBLIP](model_doc/instructblip) | ✅ | ❌ | ❌ |
| [InstructBlipVideo](model_doc/instructblipvideo) | ✅ | ❌ | ❌ |
| [Jamba](model_doc/jamba) | ✅ | ❌ | ❌ |
| [JetMoe](model_doc/jetmoe) | ✅ | ❌ | ❌ |
| [Jukebox](model_doc/jukebox) | ✅ | ❌ | ❌ |
@@ -50,6 +50,7 @@ InstructBLIP uses the same architecture as [BLIP-2](blip2) with a tiny but impor

[[autodoc]] InstructBlipProcessor


## InstructBlipVisionModel

[[autodoc]] InstructBlipVisionModel
74
docs/source/en/model_doc/instructblipvideo.md
Normal file
@@ -0,0 +1,74 @@
<!--Copyright 2024 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->

# InstructBlipVideo

## Overview

InstructBLIPVideo is an extension of the models proposed in [InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning](https://arxiv.org/abs/2305.06500) by Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven Hoi.
InstructBLIPVideo uses the same architecture and the same checkpoints as [InstructBLIP](instructblip). The only difference is that it can also process videos.

The abstract from the paper is the following:

*General-purpose language models that can solve various language-domain tasks have emerged driven by the pre-training and instruction-tuning pipeline. However, building general-purpose vision-language models is challenging due to the increased task discrepancy introduced by the additional visual input. Although vision-language pre-training has been widely studied, vision-language instruction tuning remains relatively less explored. In this paper, we conduct a systematic and comprehensive study on vision-language instruction tuning based on the pre-trained BLIP-2 models. We gather a wide variety of 26 publicly available datasets, transform them into instruction tuning format and categorize them into two clusters for held-in instruction tuning and held-out zero-shot evaluation. Additionally, we introduce instruction-aware visual feature extraction, a crucial method that enables the model to extract informative features tailored to the given instruction. The resulting InstructBLIP models achieve state-of-the-art zero-shot performance across all 13 held-out datasets, substantially outperforming BLIP-2 and the larger Flamingo. Our models also lead to state-of-the-art performance when finetuned on individual downstream tasks (e.g., 90.7% accuracy on ScienceQA IMG). Furthermore, we qualitatively demonstrate the advantages of InstructBLIP over concurrent multimodal models.*

<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/instructblip_architecture.jpg"
alt="drawing" width="600"/>

<small> InstructBLIP architecture, which InstructBLIPVideo shares. Taken from the <a href="https://arxiv.org/abs/2305.06500">original paper</a>. </small>

This model was contributed by [RaushanTurganbay](https://huggingface.co/RaushanTurganbay).
The original code can be found [here](https://github.com/salesforce/LAVIS/tree/main/projects/instructblip).

## Usage tips

- The model was trained by sampling 4 frames per video, so it is recommended to sample 4 frames per video at inference time as well.
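
The tip above does not say how the 4 frames should be chosen. As a minimal sketch, assuming evenly spaced sampling is acceptable (an assumption, not something this document specifies), the frame indices could be computed as follows. The helper name `sample_frame_indices` is hypothetical and is not part of the library; the returned indices would then be used to pick frames from whatever video decoder you use.

```python
from typing import List


def sample_frame_indices(total_frames: int, num_samples: int = 4) -> List[int]:
    """Return `num_samples` evenly spaced frame indices in [0, total_frames).

    Hypothetical helper for illustration only; uniform spacing is an assumption.
    """
    if total_frames <= 0:
        raise ValueError("total_frames must be positive")
    if num_samples <= 0:
        raise ValueError("num_samples must be positive")
    # Split the video into `num_samples` equal segments and take the
    # middle frame of each segment, clamped to the last valid index.
    segment = total_frames / num_samples
    return [min(total_frames - 1, int(segment * (i + 0.5))) for i in range(num_samples)]


# Example: a 100-frame clip yields one index from the middle of each quarter.
indices = sample_frame_indices(100)
print(indices)
```

For clips shorter than 4 frames this sketch repeats indices rather than failing, so the number of sampled frames stays at 4.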

## InstructBlipVideoConfig

[[autodoc]] InstructBlipVideoConfig
    - from_vision_qformer_text_configs

## InstructBlipVideoVisionConfig

[[autodoc]] InstructBlipVideoVisionConfig

## InstructBlipVideoQFormerConfig

[[autodoc]] InstructBlipVideoQFormerConfig

## InstructBlipVideoProcessor

[[autodoc]] InstructBlipVideoProcessor

## InstructBlipVideoImageProcessor

[[autodoc]] InstructBlipVideoImageProcessor
    - preprocess

## InstructBlipVideoVisionModel

[[autodoc]] InstructBlipVideoVisionModel
    - forward

## InstructBlipVideoQFormerModel

[[autodoc]] InstructBlipVideoQFormerModel
    - forward

## InstructBlipVideoForConditionalGeneration

[[autodoc]] InstructBlipVideoForConditionalGeneration
    - forward
    - generate