🔴 [VLM] Add base model without head (#37033)
* i guess reverted all CdGen classes
* style
* llava onevision
* fix copies
* fix some tests
* some more tests
* dump
* skip these
* nevermind, i am dumb
* revert fix not needed
* fixup
* fixup
* another fixup
* more fixup to make ci finally happy
* fixup after rebasing
* fix qwen tests
* add internVL + typos here and there
* image token index -> id
* style
* fix init weights
* revert blip-2 not supported
* address comments
* fix copies
* revert blip2 test file as well
* as discussed internally, revert back CdGen models
* fix some tests
* fix more tests for compile
* CI red
* fix copies
* enumerate explicitly allowed models
* address comments
* fix tests
* fixup
* style again
* add tests for new model class
* another fixup ( x _ x )
* [fixup] unused attributes can be removed post-deprecation
committed by GitHub
parent 3fa8d9c20e
commit 17742bd9c8
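The diff below documents a headless base class (for example `LlavaModel`) next to each existing head class (for example `LlavaForConditionalGeneration`). As a rough illustration of that split, here is a minimal pure-Python sketch. `ToyModel`, `ToyForConditionalGeneration`, and `BaseOutput` are hypothetical names invented for this example, and the composition shown (head class wrapping a base model) is an assumption about the general pattern, not the library's actual implementation.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class BaseOutput:
    # Hidden states only: one feature vector per input token.
    last_hidden_state: List[List[float]]


class ToyModel:
    """Hypothetical stand-in for a base model without a head."""

    def __init__(self, hidden_size: int = 4):
        self.hidden_size = hidden_size

    def forward(self, input_ids: List[int]) -> BaseOutput:
        # Fake "backbone": map each token id to a deterministic feature vector.
        states = [
            [float(tok + i) for i in range(self.hidden_size)] for tok in input_ids
        ]
        return BaseOutput(last_hidden_state=states)


class ToyForConditionalGeneration:
    """Hypothetical stand-in for a head class: base model plus a vocab projection."""

    def __init__(self, vocab_size: int = 3, hidden_size: int = 4):
        self.model = ToyModel(hidden_size)  # the headless base model
        self.vocab_size = vocab_size

    def forward(self, input_ids: List[int]) -> List[List[float]]:
        hidden = self.model.forward(input_ids).last_hidden_state
        # Fake "lm_head": logit v is the feature sum scaled by (v + 1).
        return [[sum(h) * (v + 1) for v in range(self.vocab_size)] for h in hidden]


base = ToyModel()
with_head = ToyForConditionalGeneration()

hidden = base.forward([1, 2]).last_hidden_state  # shape: tokens x hidden_size
logits = with_head.forward([1, 2])               # shape: tokens x vocab_size
print(len(hidden), len(hidden[0]))
print(len(logits), len(logits[0]))
```

The point of the sketch is only that the base class stops at hidden states, while the head class reuses it and adds the final projection.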
@@ -102,6 +102,10 @@ response = processor.decode(output_ids, skip_special_tokens=True)
[[autodoc]] AriaTextModel

## AriaModel

[[autodoc]] AriaModel

## AriaTextForCausalLM

[[autodoc]] AriaTextForCausalLM

@@ -237,6 +237,10 @@ for i, output in enumerate(batch_outputs):
[[autodoc]] AyaVisionConfig

## AyaVisionModel

[[autodoc]] AyaVisionModel

## AyaVisionForConditionalGeneration

[[autodoc]] AyaVisionForConditionalGeneration

@@ -174,6 +174,10 @@ for i, image in enumerate(images['pixel_values']):
[[autodoc]] Emu3TextModel
    - forward

## Emu3Model

[[autodoc]] Emu3Model

## Emu3ForCausalLM

[[autodoc]] Emu3ForCausalLM

@@ -103,6 +103,10 @@ The `LlamaTokenizer` is used as it is a standard wrapper around sentencepiece.
[[autodoc]] FuyuConfig

## FuyuModel

[[autodoc]] FuyuModel

## FuyuForCausalLM

[[autodoc]] FuyuForCausalLM

@@ -254,6 +254,10 @@ visualizer("<img>What is shown in this image?")
[[autodoc]] Gemma3TextModel
    - forward

## Gemma3Model

[[autodoc]] Gemma3Model

## Gemma3ForCausalLM

[[autodoc]] Gemma3ForCausalLM

@@ -277,6 +277,10 @@ alt="drawing" width="600"/>
[[autodoc]] GotOcr2Processor

## GotOcr2Model

[[autodoc]] GotOcr2Model

## GotOcr2ForConditionalGeneration

[[autodoc]] GotOcr2ForConditionalGeneration

@@ -69,6 +69,10 @@ The attributes can be obtained from model config, as `model.config.num_query_tok
[[autodoc]] InstructBlipQFormerModel
    - forward

## InstructBlipModel

[[autodoc]] InstructBlipModel

## InstructBlipForConditionalGeneration

[[autodoc]] InstructBlipForConditionalGeneration

@@ -73,6 +73,10 @@ The attributes can be obtained from model config, as `model.config.num_query_tok
[[autodoc]] InstructBlipVideoQFormerModel
    - forward

## InstructBlipVideoModel
[[autodoc]] InstructBlipVideoModel
    - forward

## InstructBlipVideoForConditionalGeneration

[[autodoc]] InstructBlipVideoForConditionalGeneration

@@ -340,6 +340,11 @@ This example showcases how to handle a batch of chat conversations with interlea
[[autodoc]] InternVLVisionModel
    - forward

## InternVLModel

[[autodoc]] InternVLModel
    - forward

## InternVLForConditionalGeneration

[[autodoc]] InternVLForConditionalGeneration

@@ -256,6 +256,10 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h
[[autodoc]] LlavaProcessor

## LlavaModel

[[autodoc]] LlavaModel

## LlavaForConditionalGeneration

[[autodoc]] LlavaForConditionalGeneration

@@ -315,6 +315,10 @@ model = AutoModelForImageTextToText.from_pretrained(
[[autodoc]] LlavaNextProcessor

## LlavaNextModel

[[autodoc]] LlavaNextModel

## LlavaNextForConditionalGeneration

[[autodoc]] LlavaNextForConditionalGeneration

@@ -262,6 +262,10 @@ model = LlavaNextVideoForConditionalGeneration.from_pretrained(
[[autodoc]] LlavaNextVideoImageProcessor

## LlavaNextVideoModel

[[autodoc]] LlavaNextVideoModel

## LlavaNextVideoForConditionalGeneration

[[autodoc]] LlavaNextVideoForConditionalGeneration

@@ -313,6 +313,10 @@ model = LlavaOnevisionForConditionalGeneration.from_pretrained(
[[autodoc]] LlavaOnevisionVideoProcessor

## LlavaOnevisionModel

[[autodoc]] LlavaOnevisionModel

## LlavaOnevisionForConditionalGeneration

[[autodoc]] LlavaOnevisionForConditionalGeneration

@@ -227,6 +227,9 @@ This example also how to use `BitsAndBytes` to load the model in 4bit quantizati
[[autodoc]] Mistral3Config

## Mistral3Model

[[autodoc]] Mistral3Model

## Mistral3ForConditionalGeneration

@@ -130,6 +130,10 @@ print(processor.decode(output[0], skip_special_tokens=True))
[[autodoc]] MllamaTextModel
    - forward

## MllamaModel

[[autodoc]] MllamaModel

## MllamaForCausalLM

[[autodoc]] MllamaForCausalLM

@@ -174,6 +174,10 @@ visualizer("<img> What is in this image?")
[[autodoc]] PaliGemmaProcessor

## PaliGemmaModel

[[autodoc]] PaliGemmaModel

## PaliGemmaForConditionalGeneration

[[autodoc]] PaliGemmaForConditionalGeneration

@@ -240,6 +240,10 @@ model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
[[autodoc]] Qwen2_5_VLProcessor

## Qwen2_5_VLTextModel

[[autodoc]] Qwen2_5_VLTextModel
    - forward

## Qwen2_5_VLModel

@@ -296,6 +296,11 @@ model = Qwen2VLForConditionalGeneration.from_pretrained(
[[autodoc]] Qwen2VLProcessor

## Qwen2VLTextModel

[[autodoc]] Qwen2VLTextModel
    - forward

## Qwen2VLModel

[[autodoc]] Qwen2VLModel

@@ -215,6 +215,10 @@ model = VideoLlavaForConditionalGeneration.from_pretrained(
[[autodoc]] VideoLlavaProcessor

## VideoLlavaModel

[[autodoc]] VideoLlavaModel

## VideoLlavaForConditionalGeneration

[[autodoc]] VideoLlavaForConditionalGeneration

@@ -101,6 +101,10 @@ A chat between a curious human and an artificial intelligence assistant. The ass
[[autodoc]] VipLlavaConfig

## VipLlavaModel

[[autodoc]] VipLlavaModel

## VipLlavaForConditionalGeneration

[[autodoc]] VipLlavaForConditionalGeneration
