Add AltCLIP (#20446)
* add altclip * update * fix wrong title * fix the copyright in readme * add altclip model * add altclip * fix test_gradient_checkpointing_enable_disable * code * add return class * add projection_state * "fix pretrained model bug" * delete print and fix 2 test instances. * delete token * rm xlmr * one model one file. * empty commit to trigger CI * Fix modeling_outputs.py * Fix __init__ * Fix quality * Fix modeling file docstring * Fix README.md * Fix test file * add vision model * empty commit to trigger CI * fix * fix * fix * fix * fix * fix * fix * fix * fix * del token in mdx file * fix * fix * fix * remove altrob from test list * add vision test * fix fx * fix * fix * fix * trigger CI * fix copies * fix tests * fix style * fix quality * update * recover import * recover * add , * recover * fix copies * trigger CI * fix * some of review * update * remove import * last 2 * fix * fix style * fix style * fix bug * fix uncomment * fix * update * fix * second review * empty commit to trigger CI * empty commit to trigger CI * fix position * fix * empty commit to trigger CI * empty commit to trigger CI * third comment * Update docs/source/en/model_doc/altclip.mdx Co-authored-by: Yih-Dar <2521628+ydshieh@users.noreply.github.com> * Update docs/source/en/model_doc/altclip.mdx Co-authored-by: Yih-Dar <2521628+ydshieh@users.noreply.github.com> * Update src/transformers/__init__.py Co-authored-by: Yih-Dar <2521628+ydshieh@users.noreply.github.com> * Update src/transformers/models/altclip/configuration_altclip.py Co-authored-by: Yih-Dar <2521628+ydshieh@users.noreply.github.com> * Update src/transformers/models/altclip/modeling_altclip.py Co-authored-by: Yih-Dar <2521628+ydshieh@users.noreply.github.com> * Update src/transformers/models/altclip/processing_altclip.py Co-authored-by: Yih-Dar <2521628+ydshieh@users.noreply.github.com> * Update src/transformers/models/altclip/modeling_altclip.py Co-authored-by: Yih-Dar <2521628+ydshieh@users.noreply.github.com> * fix merge * fix copies * update * update * empty commit to trigger CI * fix code example * empty commit to trigger CI * fix * empty commit to trigger CI * empty commit to trigger CI Co-authored-by: shunxing1234 <xw747777271@gmail.com> Co-authored-by: ydshieh <ydshieh@users.noreply.github.com> Co-authored-by: shunxing1234 <33774367+shunxing1234@users.noreply.github.com> Co-authored-by: Yih-Dar <2521628+ydshieh@users.noreply.github.com>
This commit is contained in:
@@ -263,6 +263,7 @@ Current number of checkpoints:  for a high-level summary of each them):
|
||||
|
||||
1. **[ALBERT](https://huggingface.co/docs/transformers/model_doc/albert)** (from Google Research and the Toyota Technological Institute at Chicago) released with the paper [ALBERT: A Lite BERT for Self-supervised Learning of Language Representations](https://arxiv.org/abs/1909.11942), by Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, Radu Soricut.
|
||||
1. **[AltCLIP](https://huggingface.co/docs/transformers/main/model_doc/altclip)** (from BAAI) released with the paper [AltCLIP: Altering the Language Encoder in CLIP for Extended Language Capabilities](https://arxiv.org/abs/2211.06679) by Chen, Zhongzhi and Liu, Guang and Zhang, Bo-Wen and Ye, Fulong and Yang, Qinghong and Wu, Ledell.
|
||||
1. **[Audio Spectrogram Transformer](https://huggingface.co/docs/transformers/model_doc/audio-spectrogram-transformer)** (from MIT) released with the paper [AST: Audio Spectrogram Transformer](https://arxiv.org/abs/2104.01778) by Yuan Gong, Yu-An Chung, James Glass.
|
||||
1. **[BART](https://huggingface.co/docs/transformers/model_doc/bart)** (from Facebook) released with the paper [BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension](https://arxiv.org/abs/1910.13461) by Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov and Luke Zettlemoyer.
|
||||
1. **[BARThez](https://huggingface.co/docs/transformers/model_doc/barthez)** (from École polytechnique) released with the paper [BARThez: a Skilled Pretrained French Sequence-to-Sequence Model](https://arxiv.org/abs/2010.12321) by Moussa Kamal Eddine, Antoine J.-P. Tixier, Michalis Vazirgiannis.
|
||||
|
||||
@@ -263,6 +263,7 @@ Número actual de puntos de control:  para un resumen de alto nivel de cada uno de ellas.):
|
||||
|
||||
1. **[ALBERT](https://huggingface.co/docs/transformers/model_doc/albert)** (from Google Research and the Toyota Technological Institute at Chicago) released with the paper [ALBERT: A Lite BERT for Self-supervised Learning of Language Representations](https://arxiv.org/abs/1909.11942), by Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, Radu Soricut.
|
||||
1. **[AltCLIP](https://huggingface.co/docs/transformers/main/model_doc/altclip)** (from BAAI) released with the paper [AltCLIP: Altering the Language Encoder in CLIP for Extended Language Capabilities](https://arxiv.org/abs/2211.06679) by Chen, Zhongzhi and Liu, Guang and Zhang, Bo-Wen and Ye, Fulong and Yang, Qinghong and Wu, Ledell.
|
||||
1. **[Audio Spectrogram Transformer](https://huggingface.co/docs/transformers/model_doc/audio-spectrogram-transformer)** (from MIT) released with the paper [AST: Audio Spectrogram Transformer](https://arxiv.org/abs/2104.01778) by Yuan Gong, Yu-An Chung, James Glass.
|
||||
1. **[BART](https://huggingface.co/docs/transformers/model_doc/bart)** (from Facebook) released with the paper [BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension](https://arxiv.org/abs/1910.13461) by Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov and Luke Zettlemoyer.
|
||||
1. **[BARThez](https://huggingface.co/docs/transformers/model_doc/barthez)** (from École polytechnique) released with the paper [BARThez: a Skilled Pretrained French Sequence-to-Sequence Model](https://arxiv.org/abs/2010.12321) by Moussa Kamal Eddine, Antoine J.-P. Tixier, Michalis Vazirgiannis.
|
||||
|
||||
@@ -236,6 +236,7 @@ conda install -c huggingface transformers
|
||||
🤗 ट्रांसफॉर्मर वर्तमान में निम्नलिखित आर्किटेक्चर का समर्थन करते हैं (मॉडल के अवलोकन के लिए [यहां] देखें (https://huggingface.co/docs/transformers/model_summary)):
|
||||
|
||||
1. **[ALBERT](https://huggingface.co/docs/transformers/model_doc/albert)** (Google Research and the Toyota Technological Institute at Chicago) साथ थीसिस [ALBERT: A Lite BERT for Self-supervised भाषा प्रतिनिधित्व सीखना](https://arxiv.org/abs/1909.11942), झेंझोंग लैन, मिंगदा चेन, सेबेस्टियन गुडमैन, केविन गिम्पेल, पीयूष शर्मा, राडू सोरिकट
|
||||
1. **[AltCLIP](https://huggingface.co/docs/transformers/main/model_doc/altclip)** (from BAAI) released with the paper [AltCLIP: Altering the Language Encoder in CLIP for Extended Language Capabilities](https://arxiv.org/abs/2211.06679) by Chen, Zhongzhi and Liu, Guang and Zhang, Bo-Wen and Ye, Fulong and Yang, Qinghong and Wu, Ledell.
|
||||
1. **[Audio Spectrogram Transformer](https://huggingface.co/docs/transformers/model_doc/audio-spectrogram-transformer)** (from MIT) released with the paper [AST: Audio Spectrogram Transformer](https://arxiv.org/abs/2104.01778) by Yuan Gong, Yu-An Chung, James Glass.
|
||||
1. **[BART](https://huggingface.co/docs/transformers/model_doc/bart)** (फेसबुक) साथ थीसिस [बार्ट: प्राकृतिक भाषा निर्माण, अनुवाद के लिए अनुक्रम-से-अनुक्रम पूर्व प्रशिक्षण , और समझ] (https://arxiv.org/pdf/1910.13461.pdf) पर निर्भर माइक लुईस, यिनहान लियू, नमन गोयल, मार्जन ग़ज़विनिनेजाद, अब्देलरहमान मोहम्मद, ओमर लेवी, वेस स्टोयानोव और ल्यूक ज़ेटलमॉयर
|
||||
1. **[BARThez](https://huggingface.co/docs/transformers/model_doc/barthez)** (से École polytechnique) साथ थीसिस [BARThez: a Skilled Pretrained French Sequence-to-Sequence Model](https://arxiv.org/abs/2010.12321) पर निर्भर Moussa Kamal Eddine, Antoine J.-P. Tixier, Michalis Vazirgiannis रिहाई।
|
||||
|
||||
@@ -298,6 +298,7 @@ Flax、PyTorch、TensorFlowをcondaでインストールする方法は、それ
|
||||
🤗Transformersは現在、以下のアーキテクチャを提供しています(それぞれのハイレベルな要約は[こちら](https://huggingface.co/docs/transformers/model_summary)を参照してください):
|
||||
|
||||
1. **[ALBERT](https://huggingface.co/docs/transformers/model_doc/albert)** (Google Research and the Toyota Technological Institute at Chicago から) Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, Radu Soricut から公開された研究論文: [ALBERT: A Lite BERT for Self-supervised Learning of Language Representations](https://arxiv.org/abs/1909.11942)
|
||||
1. **[AltCLIP](https://huggingface.co/docs/transformers/main/model_doc/altclip)** (BAAI から) Chen, Zhongzhi and Liu, Guang and Zhang, Bo-Wen and Ye, Fulong and Yang, Qinghong and Wu, Ledell から公開された研究論文: [AltCLIP: Altering the Language Encoder in CLIP for Extended Language Capabilities](https://arxiv.org/abs/2211.06679)
|
||||
1. **[Audio Spectrogram Transformer](https://huggingface.co/docs/transformers/model_doc/audio-spectrogram-transformer)** (MIT から) Yuan Gong, Yu-An Chung, James Glass から公開された研究論文: [AST: Audio Spectrogram Transformer](https://arxiv.org/abs/2104.01778)
|
||||
1. **[BART](https://huggingface.co/docs/transformers/model_doc/bart)** (Facebook から) Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov and Luke Zettlemoyer から公開された研究論文: [BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension](https://arxiv.org/abs/1910.13461)
|
||||
1. **[BARThez](https://huggingface.co/docs/transformers/model_doc/barthez)** (École polytechnique から) Moussa Kamal Eddine, Antoine J.-P. Tixier, Michalis Vazirgiannis から公開された研究論文: [BARThez: a Skilled Pretrained French Sequence-to-Sequence Model](https://arxiv.org/abs/2010.12321)
|
||||
|
||||
@@ -213,6 +213,7 @@ Flax, PyTorch, TensorFlow 설치 페이지에서 이들을 conda로 설치하는
|
||||
🤗 Transformers는 다음 모델들을 제공합니다 (각 모델의 요약은 [여기](https://huggingface.co/docs/transformers/model_summary)서 확인하세요):
|
||||
|
||||
1. **[ALBERT](https://huggingface.co/docs/transformers/model_doc/albert)** (from Google Research and the Toyota Technological Institute at Chicago) released with the paper [ALBERT: A Lite BERT for Self-supervised Learning of Language Representations](https://arxiv.org/abs/1909.11942), by Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, Radu Soricut.
|
||||
1. **[AltCLIP](https://huggingface.co/docs/transformers/main/model_doc/altclip)** (from BAAI) released with the paper [AltCLIP: Altering the Language Encoder in CLIP for Extended Language Capabilities](https://arxiv.org/abs/2211.06679) by Chen, Zhongzhi and Liu, Guang and Zhang, Bo-Wen and Ye, Fulong and Yang, Qinghong and Wu, Ledell.
|
||||
1. **[Audio Spectrogram Transformer](https://huggingface.co/docs/transformers/model_doc/audio-spectrogram-transformer)** (from MIT) released with the paper [AST: Audio Spectrogram Transformer](https://arxiv.org/abs/2104.01778) by Yuan Gong, Yu-An Chung, James Glass.
|
||||
1. **[BART](https://huggingface.co/docs/transformers/model_doc/bart)** (from Facebook) released with the paper [BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension](https://arxiv.org/pdf/1910.13461.pdf) by Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov and Luke Zettlemoyer.
|
||||
1. **[BARThez](https://huggingface.co/docs/transformers/model_doc/barthez)** (from École polytechnique) released with the paper [BARThez: a Skilled Pretrained French Sequence-to-Sequence Model](https://arxiv.org/abs/2010.12321) by Moussa Kamal Eddine, Antoine J.-P. Tixier, Michalis Vazirgiannis.
|
||||
|
||||
@@ -237,6 +237,7 @@ conda install -c huggingface transformers
|
||||
🤗 Transformers 目前支持如下的架构(模型概述请阅[这里](https://huggingface.co/docs/transformers/model_summary)):
|
||||
|
||||
1. **[ALBERT](https://huggingface.co/docs/transformers/model_doc/albert)** (来自 Google Research and the Toyota Technological Institute at Chicago) 伴随论文 [ALBERT: A Lite BERT for Self-supervised Learning of Language Representations](https://arxiv.org/abs/1909.11942), 由 Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, Radu Soricut 发布。
|
||||
1. **[AltCLIP](https://huggingface.co/docs/transformers/main/model_doc/altclip)** (来自 BAAI) 伴随论文 [AltCLIP: Altering the Language Encoder in CLIP for Extended Language Capabilities](https://arxiv.org/abs/2211.06679) 由 Chen, Zhongzhi and Liu, Guang and Zhang, Bo-Wen and Ye, Fulong and Yang, Qinghong and Wu, Ledell 发布。
|
||||
1. **[Audio Spectrogram Transformer](https://huggingface.co/docs/transformers/model_doc/audio-spectrogram-transformer)** (来自 MIT) 伴随论文 [AST: Audio Spectrogram Transformer](https://arxiv.org/abs/2104.01778) 由 Yuan Gong, Yu-An Chung, James Glass 发布。
|
||||
1. **[BART](https://huggingface.co/docs/transformers/model_doc/bart)** (来自 Facebook) 伴随论文 [BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension](https://arxiv.org/pdf/1910.13461.pdf) 由 Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov and Luke Zettlemoyer 发布。
|
||||
1. **[BARThez](https://huggingface.co/docs/transformers/model_doc/barthez)** (来自 École polytechnique) 伴随论文 [BARThez: a Skilled Pretrained French Sequence-to-Sequence Model](https://arxiv.org/abs/2010.12321) 由 Moussa Kamal Eddine, Antoine J.-P. Tixier, Michalis Vazirgiannis 发布。
|
||||
|
||||
@@ -249,6 +249,7 @@ conda install -c huggingface transformers
|
||||
🤗 Transformers 目前支援以下的架構(模型概覽請參閱[這裡](https://huggingface.co/docs/transformers/model_summary)):
|
||||
|
||||
1. **[ALBERT](https://huggingface.co/docs/transformers/model_doc/albert)** (from Google Research and the Toyota Technological Institute at Chicago) released with the paper [ALBERT: A Lite BERT for Self-supervised Learning of Language Representations](https://arxiv.org/abs/1909.11942), by Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, Radu Soricut.
|
||||
1. **[AltCLIP](https://huggingface.co/docs/transformers/main/model_doc/altclip)** (from BAAI) released with the paper [AltCLIP: Altering the Language Encoder in CLIP for Extended Language Capabilities](https://arxiv.org/abs/2211.06679) by Chen, Zhongzhi and Liu, Guang and Zhang, Bo-Wen and Ye, Fulong and Yang, Qinghong and Wu, Ledell.
|
||||
1. **[Audio Spectrogram Transformer](https://huggingface.co/docs/transformers/model_doc/audio-spectrogram-transformer)** (from MIT) released with the paper [AST: Audio Spectrogram Transformer](https://arxiv.org/abs/2104.01778) by Yuan Gong, Yu-An Chung, James Glass.
|
||||
1. **[BART](https://huggingface.co/docs/transformers/model_doc/bart)** (from Facebook) released with the paper [BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension](https://arxiv.org/pdf/1910.13461.pdf) by Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov and Luke Zettlemoyer.
|
||||
1. **[BARThez](https://huggingface.co/docs/transformers/model_doc/barthez)** (from École polytechnique) released with the paper [BARThez: a Skilled Pretrained French Sequence-to-Sequence Model](https://arxiv.org/abs/2010.12321) by Moussa Kamal Eddine, Antoine J.-P. Tixier, Michalis Vazirgiannis.
|
||||
|
||||
2
docs/source/en/_toctree.yml
Normal file → Executable file
2
docs/source/en/_toctree.yml
Normal file → Executable file
@@ -498,6 +498,8 @@
|
||||
title: Audio models
|
||||
- isExpanded: false
|
||||
sections:
|
||||
- local: model_doc/altclip
|
||||
title: AltCLIP
|
||||
- local: model_doc/blip
|
||||
title: BLIP
|
||||
- local: model_doc/chinese_clip
|
||||
|
||||
@@ -50,6 +50,7 @@ The documentation is organized into five sections:
|
||||
<!--This list is updated automatically from the README with _make fix-copies_. Do not update manually! -->
|
||||
|
||||
1. **[ALBERT](model_doc/albert)** (from Google Research and the Toyota Technological Institute at Chicago) released with the paper [ALBERT: A Lite BERT for Self-supervised Learning of Language Representations](https://arxiv.org/abs/1909.11942), by Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, Radu Soricut.
|
||||
1. **[AltCLIP](model_doc/altclip)** (from BAAI) released with the paper [AltCLIP: Altering the Language Encoder in CLIP for Extended Language Capabilities](https://arxiv.org/abs/2211.06679) by Chen, Zhongzhi and Liu, Guang and Zhang, Bo-Wen and Ye, Fulong and Yang, Qinghong and Wu, Ledell.
|
||||
1. **[Audio Spectrogram Transformer](model_doc/audio-spectrogram-transformer)** (from MIT) released with the paper [AST: Audio Spectrogram Transformer](https://arxiv.org/abs/2104.01778) by Yuan Gong, Yu-An Chung, James Glass.
|
||||
1. **[BART](model_doc/bart)** (from Facebook) released with the paper [BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension](https://arxiv.org/abs/1910.13461) by Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov and Luke Zettlemoyer.
|
||||
1. **[BARThez](model_doc/barthez)** (from École polytechnique) released with the paper [BARThez: a Skilled Pretrained French Sequence-to-Sequence Model](https://arxiv.org/abs/2010.12321) by Moussa Kamal Eddine, Antoine J.-P. Tixier, Michalis Vazirgiannis.
|
||||
@@ -230,6 +231,7 @@ Flax), PyTorch, and/or TensorFlow.
|
||||
| Model | Tokenizer slow | Tokenizer fast | PyTorch support | TensorFlow support | Flax Support |
|
||||
|:-----------------------------:|:--------------:|:--------------:|:---------------:|:------------------:|:------------:|
|
||||
| ALBERT | ✅ | ✅ | ✅ | ✅ | ✅ |
|
||||
| AltCLIP | ❌ | ❌ | ✅ | ❌ | ❌ |
|
||||
| Audio Spectrogram Transformer | ❌ | ❌ | ✅ | ❌ | ❌ |
|
||||
| BART | ✅ | ✅ | ✅ | ✅ | ✅ |
|
||||
| BEiT | ❌ | ❌ | ✅ | ❌ | ✅ |
|
||||
|
||||
107
docs/source/en/model_doc/altclip.mdx
Normal file
107
docs/source/en/model_doc/altclip.mdx
Normal file
@@ -0,0 +1,107 @@
|
||||
<!--Copyright 2022 The HuggingFace Team. All rights reserved.
|
||||
|
||||
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
|
||||
the License. You may obtain a copy of the License at
|
||||
|
||||
http://www.apache.org/licenses/LICENSE-2.0
|
||||
|
||||
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
|
||||
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
|
||||
specific language governing permissions and limitations under the License.
|
||||
-->
|
||||
|
||||
# AltCLIP
|
||||
|
||||
## Overview
|
||||
|
||||
The AltCLIP model was proposed in [AltCLIP: Altering the Language Encoder in CLIP for Extended Language Capabilities](https://arxiv.org/abs/2211.06679v2) by Zhongzhi Chen, Guang Liu, Bo-Wen Zhang, Fulong Ye, Qinghong Yang, Ledell Wu. AltCLIP
|
||||
(Altering the Language Encoder in CLIP) is a neural network trained on a variety of image-text and text-text pairs. By switching CLIP's
|
||||
text encoder with a pretrained multilingual text encoder XLM-R, we could obtain very close performances with CLIP on almost all tasks, and extended original CLIP's capabilities such as multilingual understanding.
|
||||
|
||||
The abstract from the paper is the following:
|
||||
|
||||
*In this work, we present a conceptually simple and effective method to train a strong bilingual multimodal representation model.
|
||||
Starting from the pretrained multimodal representation model CLIP released by OpenAI, we switched its text encoder with a pretrained
|
||||
multilingual text encoder XLM-R, and aligned both languages and image representations by a two-stage training schema consisting of
|
||||
teacher learning and contrastive learning. We validate our method through evaluations of a wide range of tasks. We set new state-of-the-art
|
||||
performances on a bunch of tasks including ImageNet-CN, Flicker30k- CN, and COCO-CN. Further, we obtain very close performances with
|
||||
CLIP on almost all tasks, suggesting that one can simply alter the text encoder in CLIP for extended capabilities such as multilingual understanding.*
|
||||
|
||||
## Usage
|
||||
|
||||
The usage of AltCLIP is very similar to the CLIP. the difference between CLIP is the text encoder. Note that we use bidirectional attention instead of casual attention
|
||||
and we take the [CLS] token in XLM-R to represent text embedding.
|
||||
|
||||
AltCLIP is a multi-modal vision and language model. It can be used for image-text similarity and for zero-shot image
|
||||
classification. AltCLIP uses a ViT like transformer to get visual features and a bidirectional language model to get the text
|
||||
features. Both the text and visual features are then projected to a latent space with identical dimension. The dot
|
||||
product between the projected image and text features is then used as a similar score.
|
||||
|
||||
To feed images to the Transformer encoder, each image is split into a sequence of fixed-size non-overlapping patches,
|
||||
which are then linearly embedded. A [CLS] token is added to serve as representation of an entire image. The authors
|
||||
also add absolute position embeddings, and feed the resulting sequence of vectors to a standard Transformer encoder.
|
||||
The [`CLIPImageProcessor`] can be used to resize (or rescale) and normalize images for the model.
|
||||
|
||||
The [`AltCLIPProcessor`] wraps a [`CLIPImageProcessor`] and a [`XLMRobertaTokenizer`] into a single instance to both
|
||||
encode the text and prepare the images. The following example shows how to get the image-text similarity scores using
|
||||
[`AltCLIPProcessor`] and [`AltCLIPModel`].
|
||||
|
||||
|
||||
```python
|
||||
>>> from PIL import Image
|
||||
>>> import requests
|
||||
|
||||
>>> from transformers import AltCLIPModel, AltCLIPProcessor
|
||||
|
||||
>>> model = AltCLIPModel.from_pretrained("BAAI/AltCLIP")
|
||||
>>> processor = AltCLIPProcessor.from_pretrained("BAAI/AltCLIP")
|
||||
|
||||
>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
|
||||
>>> image = Image.open(requests.get(url, stream=True).raw)
|
||||
|
||||
>>> inputs = processor(text=["a photo of a cat", "a photo of a dog"], images=image, return_tensors="pt", padding=True)
|
||||
|
||||
>>> outputs = model(**inputs)
|
||||
>>> logits_per_image = outputs.logits_per_image # this is the image-text similarity score
|
||||
>>> probs = logits_per_image.softmax(dim=1) # we can take the softmax to get the label probabilities
|
||||
```
|
||||
|
||||
Tips:
|
||||
|
||||
This model is build on `CLIPModel`, so use it like a original CLIP.
|
||||
|
||||
This model was contributed by [jongjyh](https://huggingface.co/jongjyh).
|
||||
|
||||
## AltCLIPConfig
|
||||
|
||||
[[autodoc]] AltCLIPConfig
|
||||
- from_text_vision_configs
|
||||
|
||||
## AltCLIPTextConfig
|
||||
|
||||
[[autodoc]] AltCLIPTextConfig
|
||||
|
||||
## AltCLIPVisionConfig
|
||||
|
||||
[[autodoc]] AltCLIPVisionConfig
|
||||
|
||||
## AltCLIPProcessor
|
||||
|
||||
[[autodoc]] AltCLIPProcessor
|
||||
|
||||
## AltCLIPModel
|
||||
|
||||
[[autodoc]] AltCLIPModel
|
||||
- forward
|
||||
- get_text_features
|
||||
- get_image_features
|
||||
|
||||
## AltCLIPTextModel
|
||||
|
||||
[[autodoc]] AltCLIPTextModel
|
||||
- forward
|
||||
|
||||
## AltCLIPVisionModel
|
||||
|
||||
[[autodoc]] AltCLIPVisionModel
|
||||
- forward
|
||||
31
src/transformers/__init__.py
Normal file → Executable file
31
src/transformers/__init__.py
Normal file → Executable file
@@ -124,6 +124,13 @@ _import_structure = {
|
||||
"models": [],
|
||||
# Models
|
||||
"models.albert": ["ALBERT_PRETRAINED_CONFIG_ARCHIVE_MAP", "AlbertConfig"],
|
||||
"models.altclip": [
|
||||
"ALTCLIP_PRETRAINED_CONFIG_ARCHIVE_MAP",
|
||||
"AltCLIPConfig",
|
||||
"AltCLIPProcessor",
|
||||
"AltCLIPTextConfig",
|
||||
"AltCLIPVisionConfig",
|
||||
],
|
||||
"models.audio_spectrogram_transformer": [
|
||||
"AUDIO_SPECTROGRAM_TRANSFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP",
|
||||
"ASTConfig",
|
||||
@@ -921,7 +928,15 @@ else:
|
||||
"load_tf_weights_in_albert",
|
||||
]
|
||||
)
|
||||
|
||||
_import_structure["models.altclip"].extend(
|
||||
[
|
||||
"ALTCLIP_PRETRAINED_MODEL_ARCHIVE_LIST",
|
||||
"AltCLIPModel",
|
||||
"AltCLIPPreTrainedModel",
|
||||
"AltCLIPTextModel",
|
||||
"AltCLIPVisionModel",
|
||||
]
|
||||
)
|
||||
_import_structure["models.audio_spectrogram_transformer"].extend(
|
||||
[
|
||||
"AUDIO_SPECTROGRAM_TRANSFORMER_PRETRAINED_MODEL_ARCHIVE_LIST",
|
||||
@@ -3482,6 +3497,13 @@ if TYPE_CHECKING:
|
||||
load_tf2_weights_in_pytorch_model,
|
||||
)
|
||||
from .models.albert import ALBERT_PRETRAINED_CONFIG_ARCHIVE_MAP, AlbertConfig
|
||||
from .models.altclip import (
|
||||
ALTCLIP_PRETRAINED_CONFIG_ARCHIVE_MAP,
|
||||
AltCLIPConfig,
|
||||
AltCLIPProcessor,
|
||||
AltCLIPTextConfig,
|
||||
AltCLIPVisionConfig,
|
||||
)
|
||||
from .models.audio_spectrogram_transformer import (
|
||||
AUDIO_SPECTROGRAM_TRANSFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP,
|
||||
ASTConfig,
|
||||
@@ -4181,6 +4203,13 @@ if TYPE_CHECKING:
|
||||
AlbertPreTrainedModel,
|
||||
load_tf_weights_in_albert,
|
||||
)
|
||||
from .models.altclip import (
|
||||
ALTCLIP_PRETRAINED_MODEL_ARCHIVE_LIST,
|
||||
AltCLIPModel,
|
||||
AltCLIPPreTrainedModel,
|
||||
AltCLIPTextModel,
|
||||
AltCLIPVisionModel,
|
||||
)
|
||||
from .models.audio_spectrogram_transformer import (
|
||||
AUDIO_SPECTROGRAM_TRANSFORMER_PRETRAINED_MODEL_ARCHIVE_LIST,
|
||||
ASTForAudioClassification,
|
||||
|
||||
37
src/transformers/modeling_outputs.py
Normal file → Executable file
37
src/transformers/modeling_outputs.py
Normal file → Executable file
@@ -1318,3 +1318,40 @@ class BackboneOutput(ModelOutput):
|
||||
feature_maps: Tuple[torch.FloatTensor] = None
|
||||
hidden_states: Optional[Tuple[torch.FloatTensor]] = None
|
||||
attentions: Optional[Tuple[torch.FloatTensor]] = None
|
||||
|
||||
|
||||
@dataclass
|
||||
class BaseModelOutputWithPoolingAndProjection(ModelOutput):
|
||||
"""
|
||||
Base class for model's outputs that also contains a pooling of the last hidden states.
|
||||
|
||||
Args:
|
||||
last_hidden_state (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`):
|
||||
Sequence of hidden-states at the output of the last layer of the model.
|
||||
pooler_output (`torch.FloatTensor` of shape `(batch_size, hidden_size)`):
|
||||
Last layer hidden-state of the first token of the sequence (classification token) after further processing
|
||||
through the layers used for the auxiliary pretraining task. E.g. for BERT-family of models, this returns
|
||||
the classification token after processing through a linear layer and a tanh activation function. The linear
|
||||
layer weights are trained from the next sentence prediction (classification) objective during pretraining.
|
||||
hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
|
||||
Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, +
|
||||
one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`.
|
||||
|
||||
Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
|
||||
attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
|
||||
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
|
||||
sequence_length)`.
|
||||
|
||||
Attentions weights after the attention softmax, used to compute the weighted average in the self-attention
|
||||
heads.
|
||||
projection_state (`tuple(torch.FloatTensor)`, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
|
||||
Tuple of `torch.FloatTensor` of shape `(batch_size,config.project_dim)`.
|
||||
|
||||
Text embeddings before the projection layer, used to mimic the last hidden state of the teacher encoder.
|
||||
"""
|
||||
|
||||
last_hidden_state: torch.FloatTensor = None
|
||||
pooler_output: torch.FloatTensor = None
|
||||
hidden_states: Optional[Tuple[torch.FloatTensor]] = None
|
||||
attentions: Optional[Tuple[torch.FloatTensor]] = None
|
||||
projection_state: Optional[Tuple[torch.FloatTensor]] = None
|
||||
|
||||
@@ -18,6 +18,7 @@
|
||||
|
||||
from . import (
|
||||
albert,
|
||||
altclip,
|
||||
audio_spectrogram_transformer,
|
||||
auto,
|
||||
bart,
|
||||
|
||||
76
src/transformers/models/altclip/__init__.py
Executable file
76
src/transformers/models/altclip/__init__.py
Executable file
@@ -0,0 +1,76 @@
|
||||
# flake8: noqa
|
||||
# There's no way to ignore "F401 '...' imported but unused" warnings in this
|
||||
# module, but to preserve other warnings. So, don't check this module at all.
|
||||
|
||||
# Copyright 2020 The HuggingFace Team. All rights reserved.
|
||||
#
|
||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||
# you may not use this file except in compliance with the License.
|
||||
# You may obtain a copy of the License at
|
||||
#
|
||||
# http://www.apache.org/licenses/LICENSE-2.0
|
||||
#
|
||||
# Unless required by applicable law or agreed to in writing, software
|
||||
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
from typing import TYPE_CHECKING
|
||||
|
||||
# rely on isort to merge the imports
|
||||
from ...utils import OptionalDependencyNotAvailable, _LazyModule, is_tokenizers_available, is_torch_available
|
||||
|
||||
|
||||
_import_structure = {
|
||||
"configuration_altclip": [
|
||||
"ALTCLIP_PRETRAINED_CONFIG_ARCHIVE_MAP",
|
||||
"AltCLIPConfig",
|
||||
"AltCLIPTextConfig",
|
||||
"AltCLIPVisionConfig",
|
||||
],
|
||||
"processing_altclip": ["AltCLIPProcessor"],
|
||||
}
|
||||
|
||||
try:
|
||||
if not is_torch_available():
|
||||
raise OptionalDependencyNotAvailable()
|
||||
except OptionalDependencyNotAvailable:
|
||||
pass
|
||||
else:
|
||||
_import_structure["modeling_altclip"] = [
|
||||
"ALTCLIP_PRETRAINED_MODEL_ARCHIVE_LIST",
|
||||
"AltCLIPPreTrainedModel",
|
||||
"AltCLIPModel",
|
||||
"AltCLIPTextModel",
|
||||
"AltCLIPVisionModel",
|
||||
]
|
||||
|
||||
|
||||
if TYPE_CHECKING:
|
||||
from .configuration_altclip import (
|
||||
ALTCLIP_PRETRAINED_CONFIG_ARCHIVE_MAP,
|
||||
AltCLIPConfig,
|
||||
AltCLIPTextConfig,
|
||||
AltCLIPVisionConfig,
|
||||
)
|
||||
from .processing_altclip import AltCLIPProcessor
|
||||
|
||||
try:
|
||||
if not is_torch_available():
|
||||
raise OptionalDependencyNotAvailable()
|
||||
except OptionalDependencyNotAvailable:
|
||||
pass
|
||||
else:
|
||||
from .modeling_altclip import (
|
||||
ALTCLIP_PRETRAINED_MODEL_ARCHIVE_LIST,
|
||||
AltCLIPModel,
|
||||
AltCLIPPreTrainedModel,
|
||||
AltCLIPTextModel,
|
||||
AltCLIPVisionModel,
|
||||
)
|
||||
|
||||
|
||||
else:
|
||||
import sys
|
||||
|
||||
sys.modules[__name__] = _LazyModule(__name__, globals()["__file__"], _import_structure, module_spec=__spec__)
|
||||
357
src/transformers/models/altclip/configuration_altclip.py
Executable file
357
src/transformers/models/altclip/configuration_altclip.py
Executable file
@@ -0,0 +1,357 @@
|
||||
# coding=utf-8
|
||||
# Copyright 2022 WenXiang ZhongzhiCheng LedellWu LiuGuang BoWenZhang and The HuggingFace Inc. team. All rights reserved.
|
||||
#
|
||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||
# you may not use this file except in compliance with the License.
|
||||
# You may obtain a copy of the License at
|
||||
#
|
||||
# http://www.apache.org/licenses/LICENSE-2.0
|
||||
#
|
||||
# Unless required by applicable law or agreed to in writing, software
|
||||
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
""" AltCLIP model configuration"""
|
||||
import copy
|
||||
import os
|
||||
from typing import Union
|
||||
|
||||
from ...configuration_utils import PretrainedConfig
|
||||
from ...utils import logging
|
||||
|
||||
|
||||
logger = logging.get_logger(__name__)
|
||||
|
||||
ALTCLIP_PRETRAINED_CONFIG_ARCHIVE_MAP = {
|
||||
"BAAI/AltCLIP": "https://huggingface.co/BAAI/AltCLIP/resolve/main/config.json",
|
||||
# See all AltCLIP models at https://huggingface.co/models?filter=altclip
|
||||
}
|
||||
|
||||
|
||||
class AltCLIPTextConfig(PretrainedConfig):
|
||||
r"""
|
||||
This is the configuration class to store the configuration of a [`AltCLIPTextModel`]. It is used to instantiate a
|
||||
AltCLIP text model according to the specified arguments, defining the model architecture. Instantiating a
|
||||
configuration with the defaults will yield a similar configuration to that of the AltCLIP
|
||||
[BAAI/AltCLIP](https://huggingface.co/BAAI/AltCLIP) architecture.
|
||||
|
||||
Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
|
||||
documentation from [`PretrainedConfig`] for more information.
|
||||
|
||||
|
||||
Args:
|
||||
vocab_size (`int`, *optional*, defaults to 250002):
|
||||
Vocabulary size of the AltCLIP model. Defines the number of different tokens that can be represented by the
|
||||
`inputs_ids` passed when calling [`AltCLIPTextModel`].
|
||||
hidden_size (`int`, *optional*, defaults to 1024):
|
||||
Dimensionality of the encoder layers and the pooler layer.
|
||||
num_hidden_layers (`int`, *optional*, defaults to 24):
|
||||
Number of hidden layers in the Transformer encoder.
|
||||
num_attention_heads (`int`, *optional*, defaults to 16):
|
||||
Number of attention heads for each attention layer in the Transformer encoder.
|
||||
intermediate_size (`int`, *optional*, defaults to 4096):
|
||||
Dimensionality of the "intermediate" (often named feed-forward) layer in the Transformer encoder.
|
||||
hidden_act (`str` or `Callable`, *optional*, defaults to `"gelu"`):
|
||||
The non-linear activation function (function or string) in the encoder and pooler. If string, `"gelu"`,
|
||||
`"relu"`, `"silu"` and `"gelu_new"` are supported.
|
||||
hidden_dropout_prob (`float`, *optional*, defaults to 0.1):
|
||||
The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
|
||||
attention_probs_dropout_prob (`float`, *optional*, defaults to 0.1):
|
||||
The dropout ratio for the attention probabilities.
|
||||
max_position_embeddings (`int`, *optional*, defaults to 514):
|
||||
The maximum sequence length that this model might ever be used with. Typically set this to something large
|
||||
just in case (e.g., 512 or 1024 or 2048).
|
||||
type_vocab_size (`int`, *optional*, defaults to 2):
|
||||
The vocabulary size of the `token_type_ids` passed when calling [`AltCLIPTextModel`]
|
||||
initializer_range (`float`, *optional*, defaults to 0.02):
|
||||
The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
|
||||
layer_norm_eps (`float`, *optional*, defaults to 1e-5):
|
||||
The epsilon used by the layer normalization layers.
|
||||
position_embedding_type (`str`, *optional*, defaults to `"absolute"`):
|
||||
Type of position embedding. Choose one of `"absolute"`, `"relative_key"`, `"relative_key_query"`. For
|
||||
positional embeddings use `"absolute"`. For more information on `"relative_key"`, please refer to
|
||||
[Self-Attention with Relative Position Representations (Shaw et al.)](https://arxiv.org/abs/1803.02155).
|
||||
For more information on `"relative_key_query"`, please refer to *Method 4* in [Improve Transformer Models
|
||||
with Better Relative Position Embeddings (Huang et al.)](https://arxiv.org/abs/2009.13658).
|
||||
use_cache (`bool`, *optional*, defaults to `True`):
|
||||
Whether or not the model should return the last key/values attentions (not used by all models). Only
|
||||
relevant if `config.is_decoder=True`.
|
||||
classifier_dropout (`float`, *optional*):
|
||||
The dropout ratio for the classification head.
|
||||
project_dim (`int`, *optional*, defaults to 768):
|
||||
The dimentions of the teacher model before the mapping layer.
|
||||
pooler_fn (`str`, *optional*, defaults to `"cls"`):
|
||||
Type of pooler we use. We take the first token as pooled output.
|
||||
|
||||
Examples:
|
||||
|
||||
```python
|
||||
>>> from transformers import AltCLIPTextModel, AltCLIPTextConfig
|
||||
|
||||
>>> # Initializing a AltCLIPTextConfig with BAAI/AltCLIP style configuration
|
||||
>>> configuration = AltCLIPTextConfig()
|
||||
|
||||
>>> # Initializing a AltCLIPTextModel (with random weights) from the BAAI/AltCLIP style configuration
|
||||
>>> model = AltCLIPTextModel(configuration)
|
||||
|
||||
>>> # Accessing the model configuration
|
||||
>>> configuration = model.config
|
||||
```"""
|
||||
model_type = "altclip_text_model"
|
||||
|
||||
def __init__(
|
||||
self,
|
||||
vocab_size=250002,
|
||||
hidden_size=1024,
|
||||
num_hidden_layers=24,
|
||||
num_attention_heads=16,
|
||||
intermediate_size=4096,
|
||||
hidden_act="gelu",
|
||||
hidden_dropout_prob=0.1,
|
||||
attention_probs_dropout_prob=0.1,
|
||||
max_position_embeddings=514,
|
||||
type_vocab_size=1,
|
||||
initializer_range=0.02,
|
||||
initializer_factor=0.02,
|
||||
layer_norm_eps=1e-05,
|
||||
pad_token_id=1,
|
||||
bos_token_id=0,
|
||||
eos_token_id=2,
|
||||
position_embedding_type="absolute",
|
||||
use_cache=True,
|
||||
classifier_dropout=None,
|
||||
project_dim=768,
|
||||
pooler_fn="cls",
|
||||
**kwargs
|
||||
):
|
||||
super().__init__(pad_token_id=pad_token_id, bos_token_id=bos_token_id, eos_token_id=eos_token_id, **kwargs)
|
||||
|
||||
self.vocab_size = vocab_size
|
||||
self.hidden_size = hidden_size
|
||||
self.num_hidden_layers = num_hidden_layers
|
||||
self.num_attention_heads = num_attention_heads
|
||||
self.hidden_act = hidden_act
|
||||
self.intermediate_size = intermediate_size
|
||||
self.hidden_dropout_prob = hidden_dropout_prob
|
||||
self.attention_probs_dropout_prob = attention_probs_dropout_prob
|
||||
self.max_position_embeddings = max_position_embeddings
|
||||
self.type_vocab_size = type_vocab_size
|
||||
self.initializer_range = initializer_range
|
||||
self.initializer_factor = initializer_factor
|
||||
self.layer_norm_eps = layer_norm_eps
|
||||
self.position_embedding_type = position_embedding_type
|
||||
self.use_cache = use_cache
|
||||
self.classifier_dropout = classifier_dropout
|
||||
self.project_dim = project_dim
|
||||
self.pooler_fn = pooler_fn
|
||||
|
||||
|
||||
class AltCLIPVisionConfig(PretrainedConfig):
|
||||
r"""
|
||||
This is the configuration class to store the configuration of a [`AltCLIPModel`]. It is used to instantiate an
|
||||
AltCLIP model according to the specified arguments, defining the model architecture. Instantiating a configuration
|
||||
with the defaults will yield a similar configuration to that of the AltCLIP
|
||||
[BAAI/AltCLIP](https://huggingface.co/BAAI/AltCLIP) architecture.
|
||||
|
||||
Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
|
||||
documentation from [`PretrainedConfig`] for more information.
|
||||
|
||||
|
||||
Args:
|
||||
hidden_size (`int`, *optional*, defaults to 768):
|
||||
Dimensionality of the encoder layers and the pooler layer.
|
||||
intermediate_size (`int`, *optional*, defaults to 3072):
|
||||
Dimensionality of the "intermediate" (i.e., feed-forward) layer in the Transformer encoder.
|
||||
num_hidden_layers (`int`, *optional*, defaults to 12):
|
||||
Number of hidden layers in the Transformer encoder.
|
||||
num_attention_heads (`int`, *optional*, defaults to 12):
|
||||
Number of attention heads for each attention layer in the Transformer encoder.
|
||||
image_size (`int`, *optional*, defaults to 224):
|
||||
The size (resolution) of each image.
|
||||
patch_size (`int`, *optional*, defaults to 32):
|
||||
The size (resolution) of each patch.
|
||||
hidden_act (`str` or `function`, *optional*, defaults to `"quick_gelu"`):
|
||||
The non-linear activation function (function or string) in the encoder and pooler. If string, `"gelu"`,
|
||||
`"relu"`, `"selu"` and `"gelu_new"` ``"quick_gelu"` are supported. layer_norm_eps (`float`, *optional*,
|
||||
defaults to 1e-5): The epsilon used by the layer normalization layers.
|
||||
dropout (`float`, *optional*, defaults to 0.0):
|
||||
The dropout probabilitiy for all fully connected layers in the embeddings, encoder, and pooler.
|
||||
attention_dropout (`float`, *optional*, defaults to 0.0):
|
||||
The dropout ratio for the attention probabilities.
|
||||
initializer_range (`float`, *optional*, defaults to 0.02):
|
||||
The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
|
||||
initializer_factor (`float``, *optional*, defaults to 1):
|
||||
A factor for initializing all weight matrices (should be kept to 1, used internally for initialization
|
||||
testing).
|
||||
|
||||
Example:
|
||||
|
||||
```python
|
||||
>>> from transformers import AltCLIPVisionConfig, AltCLIPVisionModel
|
||||
|
||||
>>> # Initializing a AltCLIPVisionConfig with BAAI/AltCLIP style configuration
|
||||
>>> configuration = AltCLIPVisionConfig()
|
||||
|
||||
>>> # Initializing a AltCLIPVisionModel (with random weights) from the BAAI/AltCLIP style configuration
|
||||
>>> model = AltCLIPVisionModel(configuration)
|
||||
|
||||
>>> # Accessing the model configuration
|
||||
>>> configuration = model.config
|
||||
```"""
|
||||
|
||||
model_type = "altclip_vision_model"
|
||||
|
||||
def __init__(
|
||||
self,
|
||||
hidden_size=768,
|
||||
intermediate_size=3072,
|
||||
projection_dim=512,
|
||||
num_hidden_layers=12,
|
||||
num_attention_heads=12,
|
||||
num_channels=3,
|
||||
image_size=224,
|
||||
patch_size=32,
|
||||
hidden_act="quick_gelu",
|
||||
layer_norm_eps=0.00001,
|
||||
dropout=0.0,
|
||||
attention_dropout=0.0,
|
||||
initializer_range=0.02,
|
||||
initializer_factor=1.0,
|
||||
**kwargs
|
||||
):
|
||||
super().__init__(**kwargs)
|
||||
|
||||
self.hidden_size = hidden_size
|
||||
self.intermediate_size = intermediate_size
|
||||
self.projection_dim = projection_dim
|
||||
self.dropout = dropout
|
||||
self.num_hidden_layers = num_hidden_layers
|
||||
self.num_attention_heads = num_attention_heads
|
||||
self.num_channels = num_channels
|
||||
self.patch_size = patch_size
|
||||
self.image_size = image_size
|
||||
self.initializer_range = initializer_range
|
||||
self.initializer_factor = initializer_factor
|
||||
self.attention_dropout = attention_dropout
|
||||
self.layer_norm_eps = layer_norm_eps
|
||||
self.hidden_act = hidden_act
|
||||
|
||||
@classmethod
|
||||
def from_pretrained(cls, pretrained_model_name_or_path: Union[str, os.PathLike], **kwargs) -> "PretrainedConfig":
|
||||
|
||||
config_dict, kwargs = cls.get_config_dict(pretrained_model_name_or_path, **kwargs)
|
||||
|
||||
# get the vision config dict if we are loading from AltCLIPConfig
|
||||
if config_dict.get("model_type") == "altclip":
|
||||
config_dict = config_dict["vision_config"]
|
||||
|
||||
if "model_type" in config_dict and hasattr(cls, "model_type") and config_dict["model_type"] != cls.model_type:
|
||||
logger.warning(
|
||||
f"You are using a model of type {config_dict['model_type']} to instantiate a model of type "
|
||||
f"{cls.model_type}. This is not supported for all configurations of models and can yield errors."
|
||||
)
|
||||
|
||||
return cls.from_dict(config_dict, **kwargs)
|
||||
|
||||
|
||||
class AltCLIPConfig(PretrainedConfig):
|
||||
r"""
|
||||
This is the configuration class to store the configuration of a [`AltCLIPModel`]. It is used to instantiate an
|
||||
AltCLIP model according to the specified arguments, defining the model architecture. Instantiating a configuration
|
||||
with the defaults will yield a similar configuration to that of the AltCLIP
|
||||
[BAAI/AltCLIP](https://huggingface.co/BAAI/AltCLIP) architecture.
|
||||
|
||||
Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
|
||||
documentation from [`PretrainedConfig`] for more information.
|
||||
|
||||
Args:
|
||||
text_config (`dict`, *optional*):
|
||||
Dictionary of configuration options used to initialize [`AltCLIPTextConfig`].
|
||||
vision_config (`dict`, *optional*):
|
||||
Dictionary of configuration options used to initialize [`AltCLIPVisionConfig`].
|
||||
projection_dim (`int`, *optional*, defaults to 512):
|
||||
Dimentionality of text and vision projection layers.
|
||||
logit_scale_init_value (`float`, *optional*, defaults to 2.6592):
|
||||
The inital value of the *logit_scale* paramter. Default is used as per the original CLIP implementation.
|
||||
kwargs (*optional*):
|
||||
Dictionary of keyword arguments.
|
||||
|
||||
Example:
|
||||
|
||||
```python
|
||||
>>> from transformers import AltCLIPConfig, AltCLIPModel
|
||||
|
||||
>>> # Initializing a AltCLIPConfig with BAAI/AltCLIP style configuration
|
||||
>>> configuration = AltCLIPConfig()
|
||||
|
||||
>>> # Initializing a AltCLIPModel (with random weights) from the BAAI/AltCLIP style configuration
|
||||
>>> model = AltCLIPModel(configuration)
|
||||
|
||||
>>> # Accessing the model configuration
|
||||
>>> configuration = model.config
|
||||
|
||||
>>> # We can also initialize a AltCLIPConfig from a AltCLIPTextConfig and a AltCLIPVisionConfig
|
||||
|
||||
>>> # Initializing a AltCLIPText and AltCLIPVision configuration
|
||||
>>> config_text = AltCLIPTextConfig()
|
||||
>>> config_vision = AltCLIPVisionConfig()
|
||||
|
||||
>>> config = AltCLIPConfig.from_text_vision_configs(config_text, config_vision)
|
||||
```"""
|
||||
|
||||
model_type = "altclip"
|
||||
is_composition = True
|
||||
|
||||
def __init__(
|
||||
self, text_config=None, vision_config=None, projection_dim=768, logit_scale_init_value=2.6592, **kwargs
|
||||
):
|
||||
super().__init__(**kwargs)
|
||||
|
||||
# If `_config_dict` exist, we use them for the backward compatibility.
|
||||
text_config_dict = kwargs.pop("text_config_dict", None)
|
||||
vision_config_dict = kwargs.pop("vision_config_dict", None)
|
||||
if text_config_dict is not None:
|
||||
text_config = text_config_dict
|
||||
if vision_config_dict is not None:
|
||||
vision_config = vision_config_dict
|
||||
|
||||
if text_config is None:
|
||||
text_config = {}
|
||||
logger.info("text_config is None. Initializing the AltCLIPTextConfig with default values.")
|
||||
|
||||
if vision_config is None:
|
||||
vision_config = {}
|
||||
logger.info("vision_config is None. initializing the AltCLIPVisionConfig with default values.")
|
||||
|
||||
self.text_config = AltCLIPTextConfig(**text_config)
|
||||
self.vision_config = AltCLIPVisionConfig(**vision_config)
|
||||
|
||||
self.projection_dim = projection_dim
|
||||
self.logit_scale_init_value = logit_scale_init_value
|
||||
self.initializer_factor = 1.0
|
||||
|
||||
@classmethod
|
||||
def from_text_vision_configs(cls, text_config: AltCLIPTextConfig, vision_config: AltCLIPVisionConfig, **kwargs):
|
||||
r"""
|
||||
Instantiate a [`AltCLIPConfig`] (or a derived class) from altclip text model configuration and altclip vision
|
||||
model configuration.
|
||||
|
||||
Returns:
|
||||
[`AltCLIPConfig`]: An instance of a configuration object
|
||||
"""
|
||||
|
||||
return cls(text_config=text_config.to_dict(), vision_config=vision_config.to_dict(), **kwargs)
|
||||
|
||||
def to_dict(self):
|
||||
"""
|
||||
Serializes this instance to a Python dictionary. Override the default [`~PretrainedConfig.to_dict`].
|
||||
|
||||
Returns:
|
||||
`Dict[str, any]`: Dictionary of all the attributes that make up this configuration instance,
|
||||
"""
|
||||
output = copy.deepcopy(self.__dict__)
|
||||
output["text_config"] = self.text_config.to_dict()
|
||||
output["vision_config"] = self.vision_config.to_dict()
|
||||
output["model_type"] = self.__class__.model_type
|
||||
return output
|
||||
1710
src/transformers/models/altclip/modeling_altclip.py
Executable file
1710
src/transformers/models/altclip/modeling_altclip.py
Executable file
File diff suppressed because it is too large
Load Diff
130
src/transformers/models/altclip/processing_altclip.py
Normal file
130
src/transformers/models/altclip/processing_altclip.py
Normal file
@@ -0,0 +1,130 @@
|
||||
# coding=utf-8
|
||||
# Copyright 2022 WenXiang ZhongzhiCheng LedellWu LiuGuang BoWenZhang The HuggingFace Inc. team. All rights reserved.
|
||||
#
|
||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||
# you may not use this file except in compliance with the License.
|
||||
# You may obtain a copy of the License at
|
||||
#
|
||||
# http://www.apache.org/licenses/LICENSE-2.0
|
||||
#
|
||||
# Unless required by applicable law or agreed to in writing, software
|
||||
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
"""
|
||||
Image/Text processor class for AltCLIP
|
||||
"""
|
||||
import warnings
|
||||
|
||||
from ...processing_utils import ProcessorMixin
|
||||
from ...tokenization_utils_base import BatchEncoding
|
||||
|
||||
|
||||
class AltCLIPProcessor(ProcessorMixin):
|
||||
r"""
|
||||
Constructs a AltCLIP processor which wraps a CLIP image processor and a XLM-Roberta tokenizer into a single
|
||||
processor.
|
||||
|
||||
[`AltCLIPProcessor`] offers all the functionalities of [`CLIPImageProcessor`] and [`XLMRobertaTokenizerFast`]. See
|
||||
the [`~AltCLIPProcessor.__call__`] and [`~AltCLIPProcessor.decode`] for more information.
|
||||
|
||||
Args:
|
||||
image_processor ([`CLIPImageProcessor`]):
|
||||
The image processor is a required input.
|
||||
tokenizer ([`XLMRobertaTokenizerFast`]):
|
||||
The tokenizer is a required input.
|
||||
"""
|
||||
attributes = ["image_processor", "tokenizer"]
|
||||
image_processor_class = "CLIPImageProcessor"
|
||||
tokenizer_class = ("XLMRobertaTokenizer", "XLMRobertaTokenizerFast")
|
||||
|
||||
def __init__(self, image_processor=None, tokenizer=None, **kwargs):
|
||||
if "feature_extractor" in kwargs:
|
||||
warnings.warn(
|
||||
"The `feature_extractor` argument is deprecated and will be removed in v5, use `image_processor`"
|
||||
" instead.",
|
||||
FutureWarning,
|
||||
)
|
||||
feature_extractor = kwargs.pop("feature_extractor")
|
||||
|
||||
image_processor = image_processor if image_processor is not None else feature_extractor
|
||||
if image_processor is None:
|
||||
raise ValueError("You need to specify an `image_processor`.")
|
||||
if tokenizer is None:
|
||||
raise ValueError("You need to specify a `tokenizer`.")
|
||||
|
||||
super().__init__(image_processor, tokenizer)
|
||||
|
||||
def __call__(self, text=None, images=None, return_tensors=None, **kwargs):
|
||||
"""
|
||||
Main method to prepare for the model one or several sequences(s) and image(s). This method forwards the `text`
|
||||
and `kwargs` arguments to XLMRobertaTokenizerFast's [`~XLMRobertaTokenizerFast.__call__`] if `text` is not
|
||||
`None` to encode the text. To prepare the image(s), this method forwards the `images` and `kwrags` arguments to
|
||||
CLIPImageProcessor's [`~CLIPImageProcessor.__call__`] if `images` is not `None`. Please refer to the doctsring
|
||||
of the above two methods for more information.
|
||||
|
||||
Args:
|
||||
text (`str`, `List[str]`, `List[List[str]]`):
|
||||
The sequence or batch of sequences to be encoded. Each sequence can be a string or a list of strings
|
||||
(pretokenized string). If the sequences are provided as list of strings (pretokenized), you must set
|
||||
`is_split_into_words=True` (to lift the ambiguity with a batch of sequences).
|
||||
images (`PIL.Image.Image`, `np.ndarray`, `torch.Tensor`, `List[PIL.Image.Image]`, `List[np.ndarray]`, `List[torch.Tensor]`):
|
||||
The image or batch of images to be prepared. Each image can be a PIL image, NumPy array or PyTorch
|
||||
tensor. In case of a NumPy array/PyTorch tensor, each image should be of shape (C, H, W), where C is a
|
||||
number of channels, H and W are image height and width.
|
||||
|
||||
return_tensors (`str` or [`~utils.TensorType`], *optional*):
|
||||
If set, will return tensors of a particular framework. Acceptable values are:
|
||||
|
||||
- `'tf'`: Return TensorFlow `tf.constant` objects.
|
||||
- `'pt'`: Return PyTorch `torch.Tensor` objects.
|
||||
- `'np'`: Return NumPy `np.ndarray` objects.
|
||||
- `'jax'`: Return JAX `jnp.ndarray` objects.
|
||||
|
||||
Returns:
|
||||
[`BatchEncoding`]: A [`BatchEncoding`] with the following fields:
|
||||
|
||||
- **input_ids** -- List of token ids to be fed to a model. Returned when `text` is not `None`.
|
||||
- **attention_mask** -- List of indices specifying which tokens should be attended to by the model (when
|
||||
`return_attention_mask=True` or if *"attention_mask"* is in `self.model_input_names` and if `text` is not
|
||||
`None`).
|
||||
- **pixel_values** -- Pixel values to be fed to a model. Returned when `images` is not `None`.
|
||||
"""
|
||||
|
||||
if text is None and images is None:
|
||||
raise ValueError("You have to specify either text or images. Both cannot be none.")
|
||||
|
||||
if text is not None:
|
||||
encoding = self.tokenizer(text, return_tensors=return_tensors, **kwargs)
|
||||
|
||||
if images is not None:
|
||||
image_features = self.image_processor(images, return_tensors=return_tensors, **kwargs)
|
||||
|
||||
if text is not None and images is not None:
|
||||
encoding["pixel_values"] = image_features.pixel_values
|
||||
return encoding
|
||||
elif text is not None:
|
||||
return encoding
|
||||
else:
|
||||
return BatchEncoding(data=dict(**image_features), tensor_type=return_tensors)
|
||||
|
||||
def batch_decode(self, *args, **kwargs):
|
||||
"""
|
||||
This method forwards all its arguments to XLMRobertaTokenizerFast's [`~PreTrainedTokenizer.batch_decode`].
|
||||
Please refer to the docstring of this method for more information.
|
||||
"""
|
||||
return self.tokenizer.batch_decode(*args, **kwargs)
|
||||
|
||||
def decode(self, *args, **kwargs):
|
||||
"""
|
||||
This method forwards all its arguments to XLMRobertaTokenizerFast's [`~PreTrainedTokenizer.decode`]. Please
|
||||
refer to the docstring of this method for more information.
|
||||
"""
|
||||
return self.tokenizer.decode(*args, **kwargs)
|
||||
|
||||
@property
|
||||
def model_input_names(self):
|
||||
tokenizer_input_names = self.tokenizer.model_input_names
|
||||
image_processor_input_names = self.image_processor.model_input_names
|
||||
return list(dict.fromkeys(tokenizer_input_names + image_processor_input_names))
|
||||
3
src/transformers/models/auto/configuration_auto.py
Normal file → Executable file
3
src/transformers/models/auto/configuration_auto.py
Normal file → Executable file
@@ -30,6 +30,7 @@ CONFIG_MAPPING_NAMES = OrderedDict(
|
||||
[
|
||||
# Add configs here
|
||||
("albert", "AlbertConfig"),
|
||||
("altclip", "AltCLIPConfig"),
|
||||
("audio-spectrogram-transformer", "ASTConfig"),
|
||||
("bart", "BartConfig"),
|
||||
("beit", "BeitConfig"),
|
||||
@@ -191,6 +192,7 @@ CONFIG_ARCHIVE_MAP_MAPPING_NAMES = OrderedDict(
|
||||
[
|
||||
# Add archive maps here)
|
||||
("albert", "ALBERT_PRETRAINED_CONFIG_ARCHIVE_MAP"),
|
||||
("altclip", "ALTCLIP_PRETRAINED_CONFIG_ARCHIVE_MAP"),
|
||||
("audio-spectrogram-transformer", "AUDIO_SPECTROGRAM_TRANSFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP"),
|
||||
("bart", "BART_PRETRAINED_CONFIG_ARCHIVE_MAP"),
|
||||
("beit", "BEIT_PRETRAINED_CONFIG_ARCHIVE_MAP"),
|
||||
@@ -335,6 +337,7 @@ MODEL_NAMES_MAPPING = OrderedDict(
|
||||
[
|
||||
# Add full (and cased) model names here
|
||||
("albert", "ALBERT"),
|
||||
("altclip", "AltCLIP"),
|
||||
("audio-spectrogram-transformer", "Audio Spectrogram Transformer"),
|
||||
("bart", "BART"),
|
||||
("barthez", "BARThez"),
|
||||
|
||||
2
src/transformers/models/auto/modeling_auto.py
Normal file → Executable file
2
src/transformers/models/auto/modeling_auto.py
Normal file → Executable file
@@ -29,6 +29,7 @@ MODEL_MAPPING_NAMES = OrderedDict(
|
||||
[
|
||||
# Base model mapping
|
||||
("albert", "AlbertModel"),
|
||||
("altclip", "AltCLIPModel"),
|
||||
("audio-spectrogram-transformer", "ASTModel"),
|
||||
("bart", "BartModel"),
|
||||
("beit", "BeitModel"),
|
||||
@@ -878,6 +879,7 @@ MODEL_FOR_AUDIO_XVECTOR_MAPPING_NAMES = OrderedDict(
|
||||
_MODEL_FOR_ZERO_SHOT_IMAGE_CLASSIFICATION_MAPPING_NAMES = OrderedDict(
|
||||
[
|
||||
# Model for Zero Shot Image Classification mapping
|
||||
("altclip", "AltCLIPModel"),
|
||||
("blip", "BlipModel"),
|
||||
("chinese_clip", "ChineseCLIPModel"),
|
||||
("clip", "CLIPModel"),
|
||||
|
||||
@@ -41,6 +41,7 @@ logger = logging.get_logger(__name__)
|
||||
|
||||
PROCESSOR_MAPPING_NAMES = OrderedDict(
|
||||
[
|
||||
("altclip", "AltCLIPProcessor"),
|
||||
("blip", "BLIPProcessor"),
|
||||
("chinese_clip", "ChineseCLIPProcessor"),
|
||||
("clip", "CLIPProcessor"),
|
||||
|
||||
@@ -357,6 +357,37 @@ def load_tf_weights_in_albert(*args, **kwargs):
|
||||
requires_backends(load_tf_weights_in_albert, ["torch"])
|
||||
|
||||
|
||||
ALTCLIP_PRETRAINED_MODEL_ARCHIVE_LIST = None
|
||||
|
||||
|
||||
class AltCLIPModel(metaclass=DummyObject):
|
||||
_backends = ["torch"]
|
||||
|
||||
def __init__(self, *args, **kwargs):
|
||||
requires_backends(self, ["torch"])
|
||||
|
||||
|
||||
class AltCLIPPreTrainedModel(metaclass=DummyObject):
|
||||
_backends = ["torch"]
|
||||
|
||||
def __init__(self, *args, **kwargs):
|
||||
requires_backends(self, ["torch"])
|
||||
|
||||
|
||||
class AltCLIPTextModel(metaclass=DummyObject):
|
||||
_backends = ["torch"]
|
||||
|
||||
def __init__(self, *args, **kwargs):
|
||||
requires_backends(self, ["torch"])
|
||||
|
||||
|
||||
class AltCLIPVisionModel(metaclass=DummyObject):
|
||||
_backends = ["torch"]
|
||||
|
||||
def __init__(self, *args, **kwargs):
|
||||
requires_backends(self, ["torch"])
|
||||
|
||||
|
||||
AUDIO_SPECTROGRAM_TRANSFORMER_PRETRAINED_MODEL_ARCHIVE_LIST = None
|
||||
|
||||
|
||||
|
||||
3
src/transformers/utils/fx.py
Normal file → Executable file
3
src/transformers/utils/fx.py
Normal file → Executable file
@@ -101,6 +101,7 @@ def _generate_supported_model_class_names(
|
||||
|
||||
|
||||
_REGULAR_SUPPORTED_MODEL_NAMES_AND_TASKS = [
|
||||
"altclip",
|
||||
"albert",
|
||||
"bart",
|
||||
"bert",
|
||||
@@ -156,6 +157,8 @@ _SPECIAL_SUPPORTED_MODELS = [
|
||||
"CLIPTextModelWithProjection",
|
||||
"CLIPVisionModel",
|
||||
"CLIPVisionModelWithProjection",
|
||||
"AltCLIPTextModel",
|
||||
"AltCLIPVisionModel",
|
||||
"GitVisionModel",
|
||||
"GPT2DoubleHeadsModel",
|
||||
"Speech2Text2Decoder",
|
||||
|
||||
0
tests/models/altclip/__init__.py
Executable file
0
tests/models/altclip/__init__.py
Executable file
539
tests/models/altclip/test_modeling_altclip.py
Executable file
539
tests/models/altclip/test_modeling_altclip.py
Executable file
@@ -0,0 +1,539 @@
|
||||
# coding=utf-8
|
||||
# Copyright 2022 The HuggingFace Inc. team. All rights reserved.
|
||||
#
|
||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||
# you may not use this file except in compliance with the License.
|
||||
# You may obtain a copy of the License at
|
||||
#
|
||||
# http://www.apache.org/licenses/LICENSE-2.0
|
||||
#
|
||||
# Unless required by applicable law or agreed to in writing, software
|
||||
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
""" Testing suite for the PyTorch AltCLIP model. """
|
||||
|
||||
|
||||
import inspect
|
||||
import os
|
||||
import tempfile
|
||||
import unittest
|
||||
|
||||
import numpy as np
|
||||
|
||||
import requests
|
||||
from transformers import AltCLIPConfig, AltCLIPProcessor, AltCLIPTextConfig, AltCLIPVisionConfig
|
||||
from transformers.testing_utils import require_torch, require_vision, slow, torch_device
|
||||
from transformers.utils import is_torch_available, is_vision_available
|
||||
|
||||
from ...test_configuration_common import ConfigTester
|
||||
from ...test_modeling_common import (
|
||||
ModelTesterMixin,
|
||||
_config_zero_init,
|
||||
floats_tensor,
|
||||
ids_tensor,
|
||||
random_attention_mask,
|
||||
)
|
||||
|
||||
|
||||
if is_torch_available():
|
||||
import torch
|
||||
import torch.nn as nn
|
||||
|
||||
from transformers import AltCLIPModel, AltCLIPTextModel, AltCLIPVisionModel
|
||||
from transformers.models.altclip.modeling_altclip import ALTCLIP_PRETRAINED_MODEL_ARCHIVE_LIST
|
||||
|
||||
if is_vision_available():
|
||||
from PIL import Image
|
||||
|
||||
|
||||
class AltCLIPVisionModelTester:
|
||||
def __init__(
|
||||
self,
|
||||
parent,
|
||||
batch_size=12,
|
||||
image_size=30,
|
||||
patch_size=2,
|
||||
num_channels=3,
|
||||
is_training=True,
|
||||
hidden_size=32,
|
||||
projection_dim=32,
|
||||
num_hidden_layers=5,
|
||||
num_attention_heads=4,
|
||||
intermediate_size=37,
|
||||
dropout=0.1,
|
||||
attention_dropout=0.1,
|
||||
initializer_range=0.02,
|
||||
scope=None,
|
||||
):
|
||||
self.parent = parent
|
||||
self.batch_size = batch_size
|
||||
self.image_size = image_size
|
||||
self.patch_size = patch_size
|
||||
self.num_channels = num_channels
|
||||
self.is_training = is_training
|
||||
self.hidden_size = hidden_size
|
||||
self.projection_dim = projection_dim
|
||||
self.num_hidden_layers = num_hidden_layers
|
||||
self.num_attention_heads = num_attention_heads
|
||||
self.intermediate_size = intermediate_size
|
||||
self.dropout = dropout
|
||||
self.attention_dropout = attention_dropout
|
||||
self.initializer_range = initializer_range
|
||||
self.scope = scope
|
||||
|
||||
# in ViT, the seq length equals the number of patches + 1 (we add 1 for the [CLS] token)
|
||||
num_patches = (image_size // patch_size) ** 2
|
||||
self.seq_length = num_patches + 1
|
||||
|
||||
def prepare_config_and_inputs(self):
|
||||
pixel_values = floats_tensor([self.batch_size, self.num_channels, self.image_size, self.image_size])
|
||||
config = self.get_config()
|
||||
|
||||
return config, pixel_values
|
||||
|
||||
def get_config(self):
|
||||
return AltCLIPVisionConfig(
|
||||
image_size=self.image_size,
|
||||
patch_size=self.patch_size,
|
||||
num_channels=self.num_channels,
|
||||
hidden_size=self.hidden_size,
|
||||
projection_dim=self.projection_dim,
|
||||
num_hidden_layers=self.num_hidden_layers,
|
||||
num_attention_heads=self.num_attention_heads,
|
||||
intermediate_size=self.intermediate_size,
|
||||
dropout=self.dropout,
|
||||
attention_dropout=self.attention_dropout,
|
||||
initializer_range=self.initializer_range,
|
||||
)
|
||||
|
||||
def create_and_check_model(self, config, pixel_values):
|
||||
model = AltCLIPVisionModel(config=config)
|
||||
model.to(torch_device)
|
||||
model.eval()
|
||||
with torch.no_grad():
|
||||
result = model(pixel_values)
|
||||
# expected sequence length = num_patches + 1 (we add 1 for the [CLS] token)
|
||||
image_size = (self.image_size, self.image_size)
|
||||
patch_size = (self.patch_size, self.patch_size)
|
||||
num_patches = (image_size[1] // patch_size[1]) * (image_size[0] // patch_size[0])
|
||||
self.parent.assertEqual(result.last_hidden_state.shape, (self.batch_size, num_patches + 1, self.hidden_size))
|
||||
self.parent.assertEqual(result.pooler_output.shape, (self.batch_size, self.hidden_size))
|
||||
|
||||
def prepare_config_and_inputs_for_common(self):
|
||||
config_and_inputs = self.prepare_config_and_inputs()
|
||||
config, pixel_values = config_and_inputs
|
||||
inputs_dict = {"pixel_values": pixel_values}
|
||||
return config, inputs_dict
|
||||
|
||||
|
||||
@require_torch
|
||||
class AltCLIPVisionModelTest(ModelTesterMixin, unittest.TestCase):
|
||||
"""
|
||||
Here we also overwrite some of the tests of test_modeling_common.py, as CLIP does not use input_ids, inputs_embeds,
|
||||
attention_mask and seq_length.
|
||||
"""
|
||||
|
||||
all_model_classes = (AltCLIPVisionModel,) if is_torch_available() else ()
|
||||
fx_compatible = False
|
||||
test_pruning = False
|
||||
test_resize_embeddings = False
|
||||
test_head_masking = False
|
||||
|
||||
def setUp(self):
|
||||
self.model_tester = AltCLIPVisionModelTester(self)
|
||||
self.config_tester = ConfigTester(
|
||||
self, config_class=AltCLIPVisionConfig, has_text_modality=False, hidden_size=37
|
||||
)
|
||||
|
||||
def test_config(self):
|
||||
self.config_tester.run_common_tests()
|
||||
|
||||
@unittest.skip(reason="CLIP does not use inputs_embeds")
|
||||
def test_inputs_embeds(self):
|
||||
pass
|
||||
|
||||
def test_model_common_attributes(self):
|
||||
config, _ = self.model_tester.prepare_config_and_inputs_for_common()
|
||||
|
||||
for model_class in self.all_model_classes:
|
||||
model = model_class(config)
|
||||
self.assertIsInstance(model.get_input_embeddings(), (nn.Module))
|
||||
x = model.get_output_embeddings()
|
||||
self.assertTrue(x is None or isinstance(x, nn.Linear))
|
||||
|
||||
def test_forward_signature(self):
|
||||
config, _ = self.model_tester.prepare_config_and_inputs_for_common()
|
||||
|
||||
for model_class in self.all_model_classes:
|
||||
model = model_class(config)
|
||||
signature = inspect.signature(model.forward)
|
||||
# signature.parameters is an OrderedDict => so arg_names order is deterministic
|
||||
arg_names = [*signature.parameters.keys()]
|
||||
|
||||
expected_arg_names = ["pixel_values"]
|
||||
self.assertListEqual(arg_names[:1], expected_arg_names)
|
||||
|
||||
def test_model(self):
|
||||
config_and_inputs = self.model_tester.prepare_config_and_inputs()
|
||||
self.model_tester.create_and_check_model(*config_and_inputs)
|
||||
|
||||
def test_training(self):
|
||||
pass
|
||||
|
||||
def test_training_gradient_checkpointing(self):
|
||||
pass
|
||||
|
||||
@unittest.skip(reason="AltCLIPVisionModel has no base class and is not available in MODEL_MAPPING")
|
||||
def test_save_load_fast_init_from_base(self):
|
||||
pass
|
||||
|
||||
@unittest.skip(reason="AltCLIPVisionModel has no base class and is not available in MODEL_MAPPING")
|
||||
def test_save_load_fast_init_to_base(self):
|
||||
pass
|
||||
|
||||
@unittest.skip(reason="AltCLIPVisionModel use the same cv backbone with CLIP model.")
|
||||
def test_model_from_pretrained(self):
|
||||
pass
|
||||
|
||||
|
||||
class AltCLIPTextModelTester:
|
||||
def __init__(
|
||||
self,
|
||||
parent,
|
||||
batch_size=12,
|
||||
seq_length=7,
|
||||
is_training=True,
|
||||
use_input_mask=True,
|
||||
use_labels=True,
|
||||
vocab_size=99,
|
||||
hidden_size=32,
|
||||
projection_dim=32,
|
||||
project_dim=32,
|
||||
num_hidden_layers=5,
|
||||
num_attention_heads=4,
|
||||
intermediate_size=37,
|
||||
dropout=0.1,
|
||||
attention_dropout=0.1,
|
||||
max_position_embeddings=512,
|
||||
initializer_range=0.02,
|
||||
scope=None,
|
||||
):
|
||||
self.parent = parent
|
||||
self.batch_size = batch_size
|
||||
self.seq_length = seq_length
|
||||
self.is_training = is_training
|
||||
self.use_input_mask = use_input_mask
|
||||
self.use_labels = use_labels
|
||||
self.vocab_size = vocab_size
|
||||
self.hidden_size = hidden_size
|
||||
self.projection_dim = projection_dim
|
||||
self.project_dim = project_dim
|
||||
self.num_hidden_layers = num_hidden_layers
|
||||
self.num_attention_heads = num_attention_heads
|
||||
self.intermediate_size = intermediate_size
|
||||
self.dropout = dropout
|
||||
self.attention_dropout = attention_dropout
|
||||
self.max_position_embeddings = max_position_embeddings
|
||||
self.initializer_range = initializer_range
|
||||
self.scope = scope
|
||||
|
||||
def prepare_config_and_inputs(self):
|
||||
input_ids = ids_tensor([self.batch_size, self.seq_length], self.vocab_size)
|
||||
|
||||
input_mask = None
|
||||
if self.use_input_mask:
|
||||
input_mask = random_attention_mask([self.batch_size, self.seq_length])
|
||||
|
||||
if input_mask is not None:
|
||||
batch_size, seq_length = input_mask.shape
|
||||
rnd_start_indices = np.random.randint(1, seq_length - 1, size=(batch_size,))
|
||||
for batch_idx, start_index in enumerate(rnd_start_indices):
|
||||
input_mask[batch_idx, :start_index] = 1
|
||||
input_mask[batch_idx, start_index:] = 0
|
||||
|
||||
config = self.get_config()
|
||||
|
||||
return config, input_ids, input_mask
|
||||
|
||||
def get_config(self):
|
||||
return AltCLIPTextConfig(
|
||||
vocab_size=self.vocab_size,
|
||||
hidden_size=self.hidden_size,
|
||||
projection_dim=self.projection_dim,
|
||||
project_dim=self.project_dim,
|
||||
num_hidden_layers=self.num_hidden_layers,
|
||||
num_attention_heads=self.num_attention_heads,
|
||||
intermediate_size=self.intermediate_size,
|
||||
dropout=self.dropout,
|
||||
attention_dropout=self.attention_dropout,
|
||||
max_position_embeddings=self.max_position_embeddings,
|
||||
initializer_range=self.initializer_range,
|
||||
pad_token_id=1,
|
||||
)
|
||||
|
||||
def create_and_check_model(self, config, input_ids, input_mask):
|
||||
model = AltCLIPTextModel(config=config)
|
||||
model.to(torch_device)
|
||||
model.eval()
|
||||
with torch.no_grad():
|
||||
result = model(input_ids, attention_mask=input_mask)
|
||||
result = model(input_ids)
|
||||
self.parent.assertEqual(result.last_hidden_state.shape, (self.batch_size, self.seq_length, self.hidden_size))
|
||||
self.parent.assertEqual(result.pooler_output.shape, (self.batch_size, self.projection_dim))
|
||||
|
||||
def prepare_config_and_inputs_for_common(self):
|
||||
config_and_inputs = self.prepare_config_and_inputs()
|
||||
config, input_ids, input_mask = config_and_inputs
|
||||
inputs_dict = {"input_ids": input_ids, "attention_mask": input_mask}
|
||||
return config, inputs_dict
|
||||
|
||||
|
||||
@require_torch
|
||||
class AltCLIPTextModelTest(ModelTesterMixin, unittest.TestCase):
|
||||
|
||||
all_model_classes = (AltCLIPTextModel,) if is_torch_available() else ()
|
||||
fx_compatible = True
|
||||
test_pruning = False
|
||||
test_head_masking = False
|
||||
|
||||
def setUp(self):
|
||||
self.model_tester = AltCLIPTextModelTester(self)
|
||||
self.config_tester = ConfigTester(self, config_class=AltCLIPTextConfig, hidden_size=37)
|
||||
|
||||
def test_config(self):
|
||||
self.config_tester.run_common_tests()
|
||||
|
||||
def test_model(self):
|
||||
config_and_inputs = self.model_tester.prepare_config_and_inputs()
|
||||
self.model_tester.create_and_check_model(*config_and_inputs)
|
||||
|
||||
def test_training(self):
|
||||
pass
|
||||
|
||||
def test_training_gradient_checkpointing(self):
|
||||
pass
|
||||
|
||||
def test_model_outputs_equivalence(self):
|
||||
pass
|
||||
|
||||
@unittest.skip(reason="Result of the model is a dict")
|
||||
def test_hidden_states_output(self):
|
||||
pass
|
||||
|
||||
@unittest.skip(reason="AltCLIP does not use inputs_embeds")
|
||||
def test_inputs_embeds(self):
|
||||
pass
|
||||
|
||||
@unittest.skip(reason="AltCLIPTextModel has no base class and is not available in MODEL_MAPPING")
|
||||
def test_save_load_fast_init_from_base(self):
|
||||
pass
|
||||
|
||||
@unittest.skip(reason="AltCLIPTextModel has no base class and is not available in MODEL_MAPPING")
|
||||
def test_save_load_fast_init_to_base(self):
|
||||
pass
|
||||
|
||||
@slow
|
||||
def test_model_from_pretrained(self):
|
||||
for model_name in ALTCLIP_PRETRAINED_MODEL_ARCHIVE_LIST[:1]:
|
||||
model = AltCLIPTextModel.from_pretrained(model_name)
|
||||
self.assertIsNotNone(model)
|
||||
|
||||
|
||||
class AltCLIPModelTester:
|
||||
def __init__(self, parent, text_kwargs=None, vision_kwargs=None, is_training=True):
|
||||
|
||||
if text_kwargs is None:
|
||||
text_kwargs = {}
|
||||
if vision_kwargs is None:
|
||||
vision_kwargs = {}
|
||||
|
||||
self.parent = parent
|
||||
self.text_model_tester = AltCLIPTextModelTester(parent, **text_kwargs)
|
||||
self.vision_model_tester = AltCLIPVisionModelTester(parent, **vision_kwargs)
|
||||
self.is_training = is_training
|
||||
|
||||
def prepare_config_and_inputs(self):
|
||||
text_config, input_ids, attention_mask = self.text_model_tester.prepare_config_and_inputs()
|
||||
vision_config, pixel_values = self.vision_model_tester.prepare_config_and_inputs()
|
||||
|
||||
config = self.get_config()
|
||||
return config, input_ids, attention_mask, pixel_values
|
||||
|
||||
def get_config(self):
|
||||
return AltCLIPConfig.from_text_vision_configs(
|
||||
self.text_model_tester.get_config(), self.vision_model_tester.get_config(), projection_dim=64
|
||||
)
|
||||
|
||||
def create_and_check_model(self, config, input_ids, attention_mask, pixel_values):
|
||||
model = AltCLIPModel(config=config)
|
||||
model.to(torch_device)
|
||||
model.eval()
|
||||
|
||||
with torch.no_grad():
|
||||
model(input_ids, pixel_values, attention_mask)
|
||||
|
||||
def prepare_config_and_inputs_for_common(self):
|
||||
config_and_inputs = self.prepare_config_and_inputs()
|
||||
config, input_ids, attention_mask, pixel_values = config_and_inputs
|
||||
inputs_dict = {
|
||||
"input_ids": input_ids,
|
||||
"attention_mask": attention_mask,
|
||||
"pixel_values": pixel_values,
|
||||
"return_loss": True,
|
||||
}
|
||||
return config, inputs_dict
|
||||
|
||||
|
||||
# We will verify our results on an image of cute cats
|
||||
def prepare_img():
|
||||
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
|
||||
im = Image.open(requests.get(url, stream=True).raw)
|
||||
return im
|
||||
|
||||
|
||||
@require_torch
|
||||
class AltCLIPModelTest(ModelTesterMixin, unittest.TestCase):
|
||||
|
||||
all_model_classes = (AltCLIPModel,) if is_torch_available() else ()
|
||||
fx_compatible = True
|
||||
test_head_masking = False
|
||||
test_pruning = False
|
||||
test_resize_embeddings = False
|
||||
test_attention_outputs = False
|
||||
|
||||
def setUp(self):
|
||||
self.model_tester = AltCLIPModelTester(self)
|
||||
|
||||
def test_model(self):
|
||||
config_and_inputs = self.model_tester.prepare_config_and_inputs()
|
||||
self.model_tester.create_and_check_model(*config_and_inputs)
|
||||
|
||||
@unittest.skip(reason="Hidden_states is tested in individual model tests")
|
||||
def test_hidden_states_output(self):
|
||||
pass
|
||||
|
||||
@unittest.skip(reason="Inputs_embeds is tested in individual model tests")
|
||||
def test_inputs_embeds(self):
|
||||
pass
|
||||
|
||||
@unittest.skip(reason="Retain_grad is tested in individual model tests")
|
||||
def test_retain_grad_hidden_states_attentions(self):
|
||||
pass
|
||||
|
||||
@unittest.skip(reason="CLIPModel does not have input/output embeddings")
|
||||
def test_model_common_attributes(self):
|
||||
pass
|
||||
|
||||
# override as the `logit_scale` parameter initilization is different for AltCLIP
|
||||
def test_initialization(self):
|
||||
config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()
|
||||
configs_no_init = _config_zero_init(config)
|
||||
for model_class in self.all_model_classes:
|
||||
model = model_class(config=configs_no_init)
|
||||
for name, param in model.named_parameters():
|
||||
if param.requires_grad:
|
||||
# check if `logit_scale` is initilized as per the original implementation
|
||||
if name == "logit_scale":
|
||||
self.assertAlmostEqual(
|
||||
param.data.item(),
|
||||
np.log(1 / 0.07),
|
||||
delta=1e-3,
|
||||
msg=f"Parameter {name} of model {model_class} seems not properly initialized",
|
||||
)
|
||||
else:
|
||||
self.assertIn(
|
||||
((param.data.mean() * 1e9).round() / 1e9).item(),
|
||||
[0.0, 1.0],
|
||||
msg=f"Parameter {name} of model {model_class} seems not properly initialized",
|
||||
)
|
||||
|
||||
def _create_and_check_torchscript(self, config, inputs_dict):
|
||||
if not self.test_torchscript:
|
||||
return
|
||||
|
||||
configs_no_init = _config_zero_init(config) # To be sure we have no Nan
|
||||
configs_no_init.torchscript = True
|
||||
configs_no_init.return_dict = False
|
||||
for model_class in self.all_model_classes:
|
||||
model = model_class(config=configs_no_init)
|
||||
model.to(torch_device)
|
||||
model.eval()
|
||||
|
||||
try:
|
||||
input_ids = inputs_dict["input_ids"]
|
||||
pixel_values = inputs_dict["pixel_values"] # CLIP needs pixel_values
|
||||
traced_model = torch.jit.trace(model, (input_ids, pixel_values))
|
||||
except RuntimeError:
|
||||
self.fail("Couldn't trace module.")
|
||||
|
||||
with tempfile.TemporaryDirectory() as tmp_dir_name:
|
||||
pt_file_name = os.path.join(tmp_dir_name, "traced_model.pt")
|
||||
|
||||
try:
|
||||
torch.jit.save(traced_model, pt_file_name)
|
||||
except Exception:
|
||||
self.fail("Couldn't save module.")
|
||||
|
||||
try:
|
||||
loaded_model = torch.jit.load(pt_file_name)
|
||||
except Exception:
|
||||
self.fail("Couldn't load module.")
|
||||
|
||||
model.to(torch_device)
|
||||
model.eval()
|
||||
|
||||
loaded_model.to(torch_device)
|
||||
loaded_model.eval()
|
||||
|
||||
model_state_dict = model.state_dict()
|
||||
loaded_model_state_dict = loaded_model.state_dict()
|
||||
|
||||
self.assertEqual(set(model_state_dict.keys()), set(loaded_model_state_dict.keys()))
|
||||
|
||||
models_equal = True
|
||||
for layer_name, p1 in model_state_dict.items():
|
||||
p2 = loaded_model_state_dict[layer_name]
|
||||
if p1.data.ne(p2.data).sum() > 0:
|
||||
models_equal = False
|
||||
|
||||
self.assertTrue(models_equal)
|
||||
|
||||
@slow
|
||||
def test_model_from_pretrained(self):
|
||||
for model_name in ALTCLIP_PRETRAINED_MODEL_ARCHIVE_LIST[:1]:
|
||||
model = AltCLIPModel.from_pretrained(model_name)
|
||||
self.assertIsNotNone(model)
|
||||
|
||||
|
||||
@require_vision
|
||||
@require_torch
|
||||
class AltCLIPModelIntegrationTest(unittest.TestCase):
|
||||
@slow
|
||||
def test_inference(self):
|
||||
model_name = "BAAI/AltCLIP"
|
||||
model = AltCLIPModel.from_pretrained(model_name).to(torch_device)
|
||||
processor = AltCLIPProcessor.from_pretrained(model_name)
|
||||
|
||||
image = prepare_img()
|
||||
inputs = processor(text=["一张猫的照片", "一张狗的照片"], images=image, padding=True, return_tensors="pt").to(torch_device)
|
||||
|
||||
# forward pass
|
||||
with torch.no_grad():
|
||||
outputs = model(**inputs)
|
||||
|
||||
# verify the logits
|
||||
self.assertEqual(
|
||||
outputs.logits_per_image.shape,
|
||||
torch.Size((inputs.pixel_values.shape[0], inputs.input_ids.shape[0])),
|
||||
)
|
||||
self.assertEqual(
|
||||
outputs.logits_per_text.shape,
|
||||
torch.Size((inputs.input_ids.shape[0], inputs.pixel_values.shape[0])),
|
||||
)
|
||||
|
||||
probs = outputs.logits_per_image.softmax(dim=1)
|
||||
expected_probs = torch.tensor([[9.9942e-01, 5.7805e-04]], device=torch_device)
|
||||
|
||||
self.assertTrue(torch.allclose(probs, expected_probs, atol=5e-3))
|
||||
6
utils/check_repo.py
Normal file → Executable file
6
utils/check_repo.py
Normal file → Executable file
@@ -35,6 +35,7 @@ PATH_TO_DOC = "docs/source/en"
|
||||
|
||||
# Update this list with models that are supposed to be private.
|
||||
PRIVATE_MODELS = [
|
||||
"AltRobertaModel",
|
||||
"DPRSpanPredictor",
|
||||
"LongT5Stack",
|
||||
"RealmBertModel",
|
||||
@@ -121,6 +122,7 @@ IGNORE_NON_TESTED = PRIVATE_MODELS.copy() + [
|
||||
"FlaxBertForCausalLM", # Building part of bigger (tested) model. Tested implicitly through FlaxRobertaForCausalLM.
|
||||
"OPTDecoderWrapper",
|
||||
"TFSegformerDecodeHead", # Not a regular model.
|
||||
"AltRobertaModel", # Building part of bigger (tested) model.
|
||||
"BlipTextLMHeadModel", # No need to test it as it is tested by BlipTextVision models
|
||||
]
|
||||
|
||||
@@ -251,6 +253,9 @@ IGNORE_NON_AUTO_CONFIGURED = PRIVATE_MODELS.copy() + [
|
||||
"TFHubertForCTC",
|
||||
"XCLIPVisionModel",
|
||||
"XCLIPTextModel",
|
||||
"AltCLIPTextModel",
|
||||
"AltCLIPVisionModel",
|
||||
"AltRobertaModel",
|
||||
]
|
||||
|
||||
# Update this list for models that have multiple model types for the same
|
||||
@@ -673,6 +678,7 @@ UNDOCUMENTED_OBJECTS = [
|
||||
"logger", # Internal logger
|
||||
"logging", # External module
|
||||
"requires_backends", # Internal function
|
||||
"AltRobertaModel", # Internal module
|
||||
]
|
||||
|
||||
# This list should be empty. Objects in it should get their own doc page.
|
||||
|
||||
Reference in New Issue
Block a user