new model: IDEFICS via HuggingFaceM4 (#24796)

* rename

* restore

* mappings

* unedited tests+docs

* docs

* fixes

* fix auto-sync breakage

* cleanup

* wip

* wip

* add fetch_images

* remove einops dependency

* update

* fix

* fix

* fix

* fix

* fix

* re-add

* add batching

* rework

* fix

* improve

* add Leo as I am extending his work

* cleanup

* fix

* cleanup

* slow-test

* fix

* fix

* fixes

* deal with warning

* rename modified llama classes

* rework fetch_images

* alternative implementation

* cleanup

* strict version

* cleanup

* [`IDEFICS`] Fix idefics ci (#25056)

* Fix IDEFICS CI

* fix test file

* fixup

* some changes to make tests pass

* fix

* fixup

* Update src/transformers/models/idefics/configuration_idefics.py

Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>

---------

Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>

* remove compat checks

* style

* explain that Idefics is not for training from scratch

* require pt>=2.0

* fix idefics vision config (#25092)

* fix idefics vision config

* fixup

* clean

* Update src/transformers/models/idefics/configuration_idefics.py

---------

Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>

* cleanup

* style

* cleanup

* Apply suggestions from code review

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* upcase

* sequence of images

* handle the case with no images

* Update src/transformers/image_processing_utils.py

Co-authored-by: Victor SANH <victorsanh@gmail.com>

* support pure lm take 2

* support tokenizer options

* parameterize num_channels

* fix upcase

* s|IdeficsForCausalLM|IdeficsForVisionText2Text|g

* manual to one line

* addressing review

* unbreak

* remove clip dependency

* fix test

* consistency

* PIL import

* Idefics prefix

* Idefics prefix

* hack to make tests work

* style

* fix

* fix

* revert

* try/finally

* cleanup

* clean up

* move

* [`IDEFICS`] Fix idefics config refactor (#25149)

* refactor config

* nuke init weights

* more refactor

* oops

* remove visual question answering pipeline support

* Update src/transformers/models/idefics/clip.py

Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>

* Update src/transformers/models/idefics/modeling_idefics.py

* cleanup

* mv clip.py vision.py

* tidyup

---------

Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>
Co-authored-by: Stas Bekman <stas@stason.org>

* fix

* license

* condition on pt

* fix

* style

* fix

* rm torchvision dependency, allow custom transforms

* address review

* rework device arg

* add_eos_token

* s/transforms/transform/

* fix top level imports

* fix return value

* cleanup

* cleanup

* fix

* style

* license

* license

* Update src/transformers/models/idefics/image_processing_idefics.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* add a wrapper to freeze vision layears

* tidyup

* use the correct std/mean settings

* parameterize values from config

* add tests/models/idefics/test_image_processing_idefics.py

* add test_processor_idefics.py

* cleanup

* cleanups

* fix

* fix

* move to the right group

* style

* Apply suggestions from code review

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* add perceiver config

* reset

* missing arg docs

* Apply suggestions from code review

Co-authored-by: Leo Tronchon <leo.tronchon@gmail.com>

* address review comments

* inject automatic end of utterance tokens (#25218)

* inject automatic end of utterance tokens

* fix

* fix

* fix

* rework to not use the config

* not end_of_utterance_token at the end

* Update src/transformers/models/idefics/processing_idefics.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* address review

* Apply suggestions from code review

Co-authored-by: Joao Gante <joaofranciscocardosogante@gmail.com>

* Update src/transformers/image_processing_utils.py

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>

* [`Idefics`] add image_embeddings option in generate-related methods (#25442)

* add image_embeddings option in generate-related methods

* style

* rename image_embeddings and allow perceiver embeddings precomputation

* compute embeddings within generate

* make is_encoder_decoder= True the default in config

* nested if else fix

* better triple check

* switch if elif order for pixel values / img embeds

* update model_kwargs perceiver only at the end

* use _prepare_model_inputs instead of encoder_decoder logic

* fix comment typo

* fix config default for is_encoder_decoder

* style

* add typehints

* precompute in forward

* doc builder

* style

* pop instead of get image hidden states

* Trigger CI

* Update src/transformers/models/idefics/modeling_idefics.py

Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>

* Update src/transformers/models/idefics/modeling_idefics.py

Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>

* fix * + indentation + style

* simplify a bit the use_resampler logic using comments

* update diocstrings

* Trigger CI

---------

Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>

* fix rebase changes

* unbreak #25237 - to be fixed in follow up PRs

* is_composition = False

* no longer needed

---------

Co-authored-by: leot13 <leo.tronchon@gmail.com>
Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Co-authored-by: Victor SANH <victorsanh@gmail.com>
Co-authored-by: Joao Gante <joaofranciscocardosogante@gmail.com>
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
This commit is contained in:
Stas Bekman
2023-08-18 14:12:28 -07:00
committed by GitHub
parent 4d64157ed3
commit 6c811a322f
33 changed files with 4195 additions and 2 deletions

View File

@@ -351,6 +351,7 @@ Número actual de puntos de control: ![](https://img.shields.io/endpoint?url=htt
1. **[GroupViT](https://huggingface.co/docs/transformers/model_doc/groupvit)** (from UCSD, NVIDIA) released with the paper [GroupViT: Semantic Segmentation Emerges from Text Supervision](https://arxiv.org/abs/2202.11094) by Jiarui Xu, Shalini De Mello, Sifei Liu, Wonmin Byeon, Thomas Breuel, Jan Kautz, Xiaolong Wang.
1. **[Hubert](https://huggingface.co/docs/transformers/model_doc/hubert)** (from Facebook) released with the paper [HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units](https://arxiv.org/abs/2106.07447) by Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, Abdelrahman Mohamed.
1. **[I-BERT](https://huggingface.co/docs/transformers/model_doc/ibert)** (from Berkeley) released with the paper [I-BERT: Integer-only BERT Quantization](https://arxiv.org/abs/2101.01321) by Sehoon Kim, Amir Gholami, Zhewei Yao, Michael W. Mahoney, Kurt Keutzer.
1. **[IDEFICS](https://huggingface.co/docs/transformers/main/model_doc/idefics)** (from HuggingFace) released with the paper [OBELICS: An Open Web-Scale Filtered Dataset of Interleaved Image-Text Documents](https://huggingface.co/papers/2306.16527) by Hugo Laurençon, Lucile Saulnier, Léo Tronchon, Stas Bekman, Amanpreet Singh, Anton Lozhkov, Thomas Wang, Siddharth Karamcheti, Alexander M. Rush, Douwe Kiela, Matthieu Cord, Victor Sanh.
1. **[ImageGPT](https://huggingface.co/docs/transformers/model_doc/imagegpt)** (from OpenAI) released with the paper [Generative Pretraining from Pixels](https://openai.com/blog/image-gpt/) by Mark Chen, Alec Radford, Rewon Child, Jeffrey Wu, Heewoo Jun, David Luan, Ilya Sutskever.
1. **[Informer](https://huggingface.co/docs/transformers/model_doc/informer)** (from Beihang University, UC Berkeley, Rutgers University, SEDD Company) released with the paper [Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting](https://arxiv.org/abs/2012.07436) by Haoyi Zhou, Shanghang Zhang, Jieqi Peng, Shuai Zhang, Jianxin Li, Hui Xiong, and Wancai Zhang.
1. **[InstructBLIP](https://huggingface.co/docs/transformers/model_doc/instructblip)** (from Salesforce) released with the paper [InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning](https://arxiv.org/abs/2305.06500) by Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, Steven Hoi.