Add SigLIP (#26522)

* Add first draft * Use appropriate gelu function * More improvements * More improvements * More improvements * Convert checkpoint * More improvements * Improve docs, remove print statements * More improvements * Add link * remove unused masking function * begin tokenizer * do_lower_case * debug * set split_special_tokens=True * Remove script * Fix style * Fix rebase * Use same design as CLIP * Add fast tokenizer * Add SiglipTokenizer to init, remove extra_ids * Improve conversion script * Use smaller inputs in conversion script * Update conversion script * More improvements * Add processor to conversion script * Add tests * Remove print statements * Add tokenizer tests * Fix more tests * More improvements related to weight initialization * More improvements * Make more tests pass * More improvements * More improvements * Add copied from * Add canonicalize_text * Enable fast tokenizer tests * More improvements * Fix most slow tokenizer tests * Address comments * Fix style * Remove script * Address some comments * Add copied from to tests * Add more copied from * Add more copied from * Add more copied from * Remove is_flax_available * More updates * Address comment * Remove SiglipTokenizerFast for now * Add caching * Remove umt5 test * Add canonicalize_text inside _tokenize, thanks Arthur * Fix image processor tests * Skip tests which are not applicable * Skip test_initialization * More improvements * Compare pixel values * Fix doc tests, add integration test * Add do_normalize * Remove causal mask and leverage ignore copy * Fix attention_mask * Fix remaining tests * Fix dummies * Rename temperature and bias * Address comments * Add copied from to tokenizer tests * Add SiglipVisionModel to auto mapping * Add copied from to image processor tests * Improve doc * Remove SiglipVisionModel from index * Address comments * Improve docs * Simplify config * Add first draft * Make it like mistral * More improvements * Fix attention_mask * Fix output_attentions * Add note in docs * Convert multilingual model * Convert large checkpoint * Convert more checkpoints * Add pipeline support, correct image_mean and image_std * Use padding=max_length by default * Make processor like llava * Add code snippet * Convert more checkpoints * Set keep_punctuation_string=None as in OpenCLIP * Set normalized=False for special tokens * Fix doc test * Update integration test * Add figure * Update organization * Happy new year * Use AutoModel everywhere --------- Co-authored-by: patil-suraj <surajp815@gmail.com>
2024-01-08 18:17:16 +01:00
parent 73c88012b7
commit 3b742ea84c
35 changed files with 4254 additions and 4 deletions
--- a/utils/check_copies.py
+++ b/utils/check_copies.py
@@ -1061,6 +1061,7 @@ MODELS_NOT_IN_README = [
    "Vision Encoder decoder",
    "VisionTextDualEncoder",
    "CLIPVisionModel",
+    "SiglipVisionModel",
 ]

 # Template for new entries to add in the main README when we have missing models.
--- a/utils/check_repo.py
+++ b/utils/check_repo.py
@@ -214,7 +214,6 @@ IGNORE_NON_AUTO_CONFIGURED = PRIVATE_MODELS.copy() + [
    "ChineseCLIPVisionModel",
    "CLIPTextModel",
    "CLIPTextModelWithProjection",
-    "CLIPVisionModel",
    "CLIPVisionModelWithProjection",
    "ClvpForCausalLM",
    "ClvpModel",
@@ -309,6 +308,8 @@ IGNORE_NON_AUTO_CONFIGURED = PRIVATE_MODELS.copy() + [
    "SeamlessM4Tv2NARTextToUnitForConditionalGeneration",
    "SeamlessM4Tv2CodeHifiGan",
    "SeamlessM4Tv2ForSpeechToSpeech",  # no auto class for speech-to-speech
+    "SiglipVisionModel",
+    "SiglipTextModel",
 ]

 # DO NOT edit this list!
--- a/utils/check_table.py
+++ b/utils/check_table.py
@@ -171,7 +171,7 @@ MODEL_NAMES_WITH_SAME_CONFIG = {
    "XLS-R": "Wav2Vec2",
    "XLSR-Wav2Vec2": "Wav2Vec2",
 }
-MODEL_NAMES_TO_IGNORE = ["CLIPVisionModel"]
+MODEL_NAMES_TO_IGNORE = ["CLIPVisionModel", "SiglipVisionModel"]


 def get_model_table_from_auto_modules() -> str: