MSN (Masked Siamese Networks) for ViT (#18815)
* feat: modeling and conversion scripts for msn. * chore: change license year. * chore: remove unneeded modules. * feat: direct loading of state_dict from remote url. * fix: import paths. * add: rest of the files. * add and fix rest of the files. Co-authored-by: Niels <niels.rogge1@gmail.com> * chore: formatting. * code quality fix. * chore: remove pooler. * feat: add classification top. * fix: configuration object. * add: initial test cases (one failing). * fix: basemodeloutput. * add: caution on using the classification head. * add: rest of the model related files. * add: vit msn readme. * fix: copied from statement. * fix: dummy objects. * add: ViTMSNPreTrainedModel to inits. * fix: repo consistency. * minor change in the model doc. * fix: tests. * Empty-Commit * Update src/transformers/models/vit_msn/configuration_vit_msn.py Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com> * address PR comments. * Update src/transformers/models/vit_msn/modeling_vit_msn.py Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com> * chore: put model in no_grad() and formatting. Co-authored-by: Niels <niels.rogge1@gmail.com> Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com> Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
This commit is contained in:
@@ -408,6 +408,8 @@
|
||||
title: Vision Transformer (ViT)
|
||||
- local: model_doc/vit_mae
|
||||
title: ViTMAE
|
||||
- local: model_doc/vit_msn
|
||||
title: ViTMSN
|
||||
- local: model_doc/yolos
|
||||
title: YOLOS
|
||||
title: Vision models
|
||||
|
||||
@@ -175,6 +175,7 @@ The documentation is organized into five sections:
|
||||
1. **[Vision Transformer (ViT)](model_doc/vit)** (from Google AI) released with the paper [An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale](https://arxiv.org/abs/2010.11929) by Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby.
|
||||
1. **[VisualBERT](model_doc/visual_bert)** (from UCLA NLP) released with the paper [VisualBERT: A Simple and Performant Baseline for Vision and Language](https://arxiv.org/pdf/1908.03557) by Liunian Harold Li, Mark Yatskar, Da Yin, Cho-Jui Hsieh, Kai-Wei Chang.
|
||||
1. **[ViTMAE](model_doc/vit_mae)** (from Meta AI) released with the paper [Masked Autoencoders Are Scalable Vision Learners](https://arxiv.org/abs/2111.06377) by Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, Ross Girshick.
|
||||
1. **[ViTMSN](model_doc/vit_msn)** (from Meta AI) released with the paper [Masked Siamese Networks for Label-Efficient Learning](https://arxiv.org/abs/2204.07141) by Mahmoud Assran, Mathilde Caron, Ishan Misra, Piotr Bojanowski, Florian Bordes, Pascal Vincent, Armand Joulin, Michael Rabbat, Nicolas Ballas.
|
||||
1. **[Wav2Vec2](model_doc/wav2vec2)** (from Facebook AI) released with the paper [wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations](https://arxiv.org/abs/2006.11477) by Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, Michael Auli.
|
||||
1. **[Wav2Vec2-Conformer](model_doc/wav2vec2-conformer)** (from Facebook AI) released with the paper [FAIRSEQ S2T: Fast Speech-to-Text Modeling with FAIRSEQ](https://arxiv.org/abs/2010.05171) by Changhan Wang, Yun Tang, Xutai Ma, Anne Wu, Sravya Popuri, Dmytro Okhonko, Juan Pino.
|
||||
1. **[Wav2Vec2Phoneme](model_doc/wav2vec2_phoneme)** (from Facebook AI) released with the paper [Simple and Effective Zero-shot Cross-lingual Phoneme Recognition](https://arxiv.org/abs/2109.11680) by Qiantong Xu, Alexei Baevski, Michael Auli.
|
||||
@@ -318,6 +319,7 @@ Flax), PyTorch, and/or TensorFlow.
|
||||
| VisualBERT | ❌ | ❌ | ✅ | ❌ | ❌ |
|
||||
| ViT | ❌ | ❌ | ✅ | ✅ | ✅ |
|
||||
| ViTMAE | ❌ | ❌ | ✅ | ✅ | ❌ |
|
||||
| ViTMSN | ❌ | ❌ | ✅ | ❌ | ❌ |
|
||||
| Wav2Vec2 | ✅ | ❌ | ✅ | ✅ | ✅ |
|
||||
| Wav2Vec2-Conformer | ❌ | ❌ | ✅ | ❌ | ❌ |
|
||||
| WavLM | ❌ | ❌ | ✅ | ❌ | ❌ |
|
||||
|
||||
64
docs/source/en/model_doc/vit_msn.mdx
Normal file
64
docs/source/en/model_doc/vit_msn.mdx
Normal file
@@ -0,0 +1,64 @@
|
||||
<!--Copyright 2022 The HuggingFace Team. All rights reserved.
|
||||
|
||||
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
|
||||
the License. You may obtain a copy of the License at
|
||||
|
||||
http://www.apache.org/licenses/LICENSE-2.0
|
||||
|
||||
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
|
||||
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
|
||||
specific language governing permissions and limitations under the License.
|
||||
-->
|
||||
|
||||
# ViTMSN
|
||||
|
||||
## Overview
|
||||
|
||||
The ViTMSN model was proposed in [Masked Siamese Networks for Label-Efficient Learning](https://arxiv.org/abs/2204.07141) by Mahmoud Assran, Mathilde Caron, Ishan Misra, Piotr Bojanowski, Florian Bordes,
|
||||
Pascal Vincent, Armand Joulin, Michael Rabbat, Nicolas Ballas. The paper presents a joint-embedding architecture to match the prototypes
|
||||
of masked patches with that of the unmasked patches. With this setup, their method yields excellent performance in the low-shot and extreme low-shot
|
||||
regimes.
|
||||
|
||||
The abstract from the paper is the following:
|
||||
|
||||
*We propose Masked Siamese Networks (MSN), a self-supervised learning framework for learning image representations. Our
|
||||
approach matches the representation of an image view containing randomly masked patches to the representation of the original
|
||||
unmasked image. This self-supervised pre-training strategy is particularly scalable when applied to Vision Transformers since only the
|
||||
unmasked patches are processed by the network. As a result, MSNs improve the scalability of joint-embedding architectures,
|
||||
while producing representations of a high semantic level that perform competitively on low-shot image classification. For instance,
|
||||
on ImageNet-1K, with only 5,000 annotated images, our base MSN model achieves 72.4% top-1 accuracy,
|
||||
and with 1% of ImageNet-1K labels, we achieve 75.7% top-1 accuracy, setting a new state-of-the-art for self-supervised learning on this benchmark.*
|
||||
|
||||
Tips:
|
||||
|
||||
- MSN (masked siamese networks) is a method for self-supervised pre-training of Vision Transformers (ViTs). The pre-training
|
||||
objective is to match the prototypes assigned to the unmasked views of the images to that of the masked views of the same images.
|
||||
- The authors have only released pre-trained weights of the backbone (ImageNet-1k pre-training). So, to use that on your own image classification dataset,
|
||||
use the [`ViTMSNForImageClassification`] class which is initialized from [`ViTMSNModel`]. Follow
|
||||
[this notebook](https://github.com/huggingface/notebooks/blob/main/examples/image_classification.ipynb) for a detailed tutorial on fine-tuning.
|
||||
- MSN is particularly useful in the low-shot and extreme low-shot regimes. Notably, it achieves 75.7% top-1 accuracy with only 1% of ImageNet-1K
|
||||
labels when fine-tuned.
|
||||
|
||||
|
||||
<img src="https://i.ibb.co/W6PQMdC/Screenshot-2022-09-13-at-9-08-40-AM.png" alt="drawing" width="600"/>
|
||||
|
||||
<small> MSN architecture. Taken from the <a href="https://arxiv.org/abs/2204.07141">original paper.</a> </small>
|
||||
|
||||
This model was contributed by [sayakpaul](https://huggingface.co/sayakpaul). The original code can be found [here](https://github.com/facebookresearch/msn).
|
||||
|
||||
|
||||
## ViTMSNConfig
|
||||
|
||||
[[autodoc]] ViTMSNConfig
|
||||
|
||||
|
||||
## ViTMSNModel
|
||||
|
||||
[[autodoc]] ViTMSNModel
|
||||
- forward
|
||||
|
||||
|
||||
## ViTMSNForImageClassification
|
||||
|
||||
[[autodoc]] ViTMSNForImageClassification
|
||||
- forward
|
||||
Reference in New Issue
Block a user