Add GOT-OCR 2.0 to Transformers (#34721)

* init modular got_ocr2

* Get correct got_ocr architecture

* add processing

* run modular with processing

* add working inference

* apply modular

* Refactor and fix style

* Refactor, cleanup, fix style

* fix init order

* Fix docs

* add base modeling tests

* fix style and consistency

* rename doc file

* fix repo consistency

* fix inference with box

* add image processing and support for crop_to_multi_page

* Fix batch inference

* add tests

* fixup

* fix slow test

* fix docstrings

* Add model doc

* update to new init

* fix input autocast pixel_values dtype

* update doc

* move doc to multimodal

* Reformat crop_image_to_patches and add docstrings

* Fix example in forward docstring

* Address Pablo review

* [run slow] got_ocr2

* remove defaults defined twice

* apply modular

* add torch_device to integration tests

* update modular

* follow-up Pavel review

* add device variable in doc

* fix doc multi-page

* Force eager attention for vision encoder to avoid attn implementation conflict

* revert qwen2vl doc changes

* use Qwen2ForCausalLM instead of Qwen2Model

* make fixup

* refactor gotocr2 to llava style

* uniformize function names and reduce checks

* final nits

* fix pixel_values dtype error

* change checkpoint names

* fix modular
Yoni Gozlan
2025-01-31 11:28:13 -05:00
committed by GitHub
parent 5bbee12ac9
commit 2b46943195
26 changed files with 4184 additions and 3 deletions

@@ -1650,7 +1650,7 @@ class GenerationTesterMixin:
 # checks without adding test complexity. Ditto for `pixel_values_videos` and `pixel_values_images`
 pixel_values_is_mutually_exclusive = any(
     model_name in model_class.__name__.lower()
-    for model_name in ["llava", "idefics2", "idefics3", "mllama", "paligemma", "emu3"]
+    for model_name in ["llava", "idefics2", "idefics3", "mllama", "paligemma", "emu3", "gotocr2"]
 )
 if pixel_values_is_mutually_exclusive:
     inputs_dict.pop("pixel_values", None)
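
The check in this hunk is a case-insensitive substring match on the model class name, so adding "gotocr2" to the list is enough for any class whose lowercased name contains that string to take the `pixel_values`-dropping branch. A minimal standalone sketch of that behavior follows. The helper names `is_mutually_exclusive` and `drop_pixel_values` and the two stand-in classes are hypothetical and used only for illustration; they are not part of the test mixin.

```python
# Sketch of the name-based check from the hunk, outside the test mixin.
# The list mirrors the updated line in the diff.
MUTUALLY_EXCLUSIVE_NAMES = ["llava", "idefics2", "idefics3", "mllama", "paligemma", "emu3", "gotocr2"]


def is_mutually_exclusive(model_class):
    """Return True if the lowercased class name contains any listed model name."""
    return any(name in model_class.__name__.lower() for name in MUTUALLY_EXCLUSIVE_NAMES)


def drop_pixel_values(model_class, inputs_dict):
    """Return a copy of inputs_dict, without `pixel_values` for matching classes."""
    inputs = dict(inputs_dict)  # copy so the caller's dict is left untouched
    if is_mutually_exclusive(model_class):
        inputs.pop("pixel_values", None)
    return inputs


# Hypothetical stand-in classes: only their names matter for the check.
class GotOcr2ForConditionalGeneration:
    pass


class Qwen2ForCausalLM:
    pass


example_inputs = {"input_ids": [1, 2, 3], "pixel_values": [[0.0]]}
got_ocr_inputs = drop_pixel_values(GotOcr2ForConditionalGeneration, example_inputs)
text_only_inputs = drop_pixel_values(Qwen2ForCausalLM, example_inputs)
```

Because the match is by substring, a class is covered only if its name spells the model exactly as listed: a name containing "got_ocr2" with an underscore would not match "gotocr2".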