Add TF ViT MAE (#16255)

* ported TFViTMAEIntermediate and TFViTMAEOutput.
* added TFViTMAEModel and TFViTMAEDecoder.
* feat: added a noise argument in the implementation for reproducibility.
* feat: vit mae models with an additional noise argument for reproducibility.

Co-authored-by: ariG23498 <aritra.born2fly@gmail.com>
Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
2022-03-29 22:54:15 +05:30
parent 7a9ef8181c
commit 5b40a37bc4
10 changed files with 2034 additions and 27 deletions
--- a/docs/source/model_doc/vit_mae.mdx
+++ b/docs/source/model_doc/vit_mae.mdx
@@ -41,13 +41,16 @@ fine-tuning, one can directly plug in the weights into a [`ViTForImageClassifica
 - Note that the encoder of MAE is only used to encode the visual patches. The encoded patches are then concatenated with mask tokens, which the decoder (which also
 consists of Transformer blocks) takes as input. Each mask token is a shared, learned vector that indicates the presence of a missing patch to be predicted. Fixed
 sin/cos position embeddings are added both to the input of the encoder and the decoder.
+- For a visual understanding of how MAEs work you can check out this [post](https://keras.io/examples/vision/masked_image_modeling/).

 <img src="https://user-images.githubusercontent.com/11435359/146857310-f258c86c-fde6-48e8-9cee-badd2b21bd2c.png"
 alt="drawing" width="600"/> 

 <small> MAE architecture. Taken from the <a href="https://arxiv.org/abs/2111.06377">original paper.</a> </small>

-This model was contributed by [nielsr](https://huggingface.co/nielsr). The original code can be found [here](https://github.com/facebookresearch/mae).
+This model was contributed by [nielsr](https://huggingface.co/nielsr). The TensorFlow version of the model was contributed by [sayakpaul](https://github.com/sayakpaul) and
+[ariG23498](https://github.com/ariG23498) (equal contribution). The original code can be found [here](https://github.com/facebookresearch/mae).
+

 ## ViTMAEConfig

@@ -64,3 +67,15 @@ This model was contributed by [nielsr](https://huggingface.co/nielsr). The origi

 [[autodoc]] transformers.ViTMAEForPreTraining
    - forward
+
+
+## TFViTMAEModel
+
+[[autodoc]] TFViTMAEModel
+    - call
+
+
+## TFViTMAEForPreTraining
+
+[[autodoc]] transformers.TFViTMAEForPreTraining
+    - call
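
The commit message mentions a `noise` argument added for reproducibility. The sketch below shows, in plain Python, how noise-ranked patch masking of this kind can work, and why passing fixed noise makes the mask repeatable. It is an illustration only and not the code added in this commit: the function name `random_masking`, its signature, and the return values are assumptions made for this example.

```python
import random
from typing import List, Optional, Sequence, Tuple


def random_masking(
    num_patches: int,
    mask_ratio: float = 0.75,
    noise: Optional[Sequence[float]] = None,
) -> Tuple[List[int], List[int], List[int]]:
    """Hypothetical helper: choose visible patches by ranking per-patch noise.

    Returns (ids_keep, mask, ids_restore), where mask[i] == 1 means patch i is hidden.
    """
    if noise is None:
        # No noise supplied: draw fresh values, so the mask differs between calls.
        noise = [random.random() for _ in range(num_patches)]
    if len(noise) != num_patches:
        raise ValueError("noise must contain one value per patch")

    len_keep = int(num_patches * (1 - mask_ratio))

    # Patches with the smallest noise values are kept visible.
    ids_shuffle = sorted(range(num_patches), key=lambda i: noise[i])
    ids_keep = ids_shuffle[:len_keep]

    # ids_restore[i] is the rank of patch i in the shuffled order; a decoder
    # could use it to put mask tokens back at the original positions.
    ids_restore = [0] * num_patches
    for rank, patch_index in enumerate(ids_shuffle):
        ids_restore[patch_index] = rank

    mask = [0 if ids_restore[i] < len_keep else 1 for i in range(num_patches)]
    return ids_keep, mask, ids_restore


# Supplying the same noise twice yields the same mask.
fixed_noise = [0.9, 0.1, 0.5, 0.3, 0.7, 0.2, 0.8, 0.4]
ids_keep, mask, ids_restore = random_masking(8, mask_ratio=0.75, noise=fixed_noise)
print(ids_keep)  # [1, 5]
print(mask)      # [1, 0, 1, 1, 1, 0, 1, 1]
```

With a 0.75 mask ratio and 8 patches, only the two patches with the lowest noise stay visible; these are the ones an MAE-style encoder would see, while the remaining positions are filled with mask tokens before decoding.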