[feat] Add FLAVA model (#16654)
* [WIP] Add FLAVA model

This PR aims to add the [FLAVA](https://arxiv.org/abs/2112.04482) model to the transformers repo.

The following checklist tracks what needs to be done for this PR to be complete:

- [x] Flava init
- [x] Flava base models
- [x] Flava layers
- [x] Flava Configs
- [x] Flava encoders
- [x] Flava pretraining models
- [ ] Flava classification/retrieval models (To be added in a separate PR)
- [x] Documentation updates
- [x] Imports updates
- [x] Argstring updates
- [x] Flava pretrained checkpoints
- [x] Flava tests
- [x] Flava processors
- [x] Sanity check
- [x] Lint
@@ -265,6 +265,7 @@ Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih.
 1. **[EncoderDecoder](https://huggingface.co/docs/transformers/model_doc/encoder-decoder)** (from Google Research) released with the paper [Leveraging Pre-trained Checkpoints for Sequence Generation Tasks](https://arxiv.org/abs/1907.12461) by Sascha Rothe, Shashi Narayan, Aliaksei Severyn.
 1. **[ELECTRA](https://huggingface.co/docs/transformers/model_doc/electra)** (from Google Research/Stanford University) released with the paper [ELECTRA: Pre-training text encoders as discriminators rather than generators](https://arxiv.org/abs/2003.10555) by Kevin Clark, Minh-Thang Luong, Quoc V. Le, Christopher D. Manning.
 1. **[FlauBERT](https://huggingface.co/docs/transformers/model_doc/flaubert)** (from CNRS) released with the paper [FlauBERT: Unsupervised Language Model Pre-training for French](https://arxiv.org/abs/1912.05372) by Hang Le, Loïc Vial, Jibril Frej, Vincent Segonne, Maximin Coavoux, Benjamin Lecouteux, Alexandre Allauzen, Benoît Crabbé, Laurent Besacier, Didier Schwab.
+1. **[FLAVA](https://huggingface.co/docs/transformers/main/model_doc/flava)** (from Facebook AI) released with the paper [FLAVA: A Foundational Language And Vision Alignment Model](https://arxiv.org/abs/2112.04482) by Amanpreet Singh, Ronghang Hu, Vedanuj Goswami, Guillaume Couairon, Wojciech Galuba, Marcus Rohrbach, and Douwe Kiela.
 1. **[FNet](https://huggingface.co/docs/transformers/model_doc/fnet)** (from Google Research) released with the paper [FNet: Mixing Tokens with Fourier Transforms](https://arxiv.org/abs/2105.03824) by James Lee-Thorp, Joshua Ainslie, Ilya Eckstein, Santiago Ontanon.
 1. **[Funnel Transformer](https://huggingface.co/docs/transformers/model_doc/funnel)** (from CMU/Google Brain) released with the paper [Funnel-Transformer: Filtering out Sequential Redundancy for Efficient Language Processing](https://arxiv.org/abs/2006.03236) by Zihang Dai, Guokun Lai, Yiming Yang, Quoc V. Le.
 1. **[GLPN](https://huggingface.co/docs/transformers/main/model_doc/glpn)** (from KAIST) released with the paper [Global-Local Path Networks for Monocular Depth Estimation with Vertical CutDepth](https://arxiv.org/abs/2201.07436) by Doyeon Kim, Woonghyun Ga, Pyungwhan Ahn, Donggyu Joo, Sehwan Chun, Junmo Kim.
@@ -244,6 +244,7 @@ Flax, PyTorch, TensorFlow 설치 페이지에서 이들을 conda로 설치하는
 1. **[ELECTRA](https://huggingface.co/docs/transformers/model_doc/electra)** (from Google Research/Stanford University) released with the paper [ELECTRA: Pre-training text encoders as discriminators rather than generators](https://arxiv.org/abs/2003.10555) by Kevin Clark, Minh-Thang Luong, Quoc V. Le, Christopher D. Manning.
 1. **[EncoderDecoder](https://huggingface.co/docs/transformers/model_doc/encoder-decoder)** (from Google Research) released with the paper [Leveraging Pre-trained Checkpoints for Sequence Generation Tasks](https://arxiv.org/abs/1907.12461) by Sascha Rothe, Shashi Narayan, Aliaksei Severyn.
 1. **[FlauBERT](https://huggingface.co/docs/transformers/model_doc/flaubert)** (from CNRS) released with the paper [FlauBERT: Unsupervised Language Model Pre-training for French](https://arxiv.org/abs/1912.05372) by Hang Le, Loïc Vial, Jibril Frej, Vincent Segonne, Maximin Coavoux, Benjamin Lecouteux, Alexandre Allauzen, Benoît Crabbé, Laurent Besacier, Didier Schwab.
+1. **[FLAVA](https://huggingface.co/docs/transformers/main/model_doc/flava)** (from Facebook AI) released with the paper [FLAVA: A Foundational Language And Vision Alignment Model](https://arxiv.org/abs/2112.04482) by Amanpreet Singh, Ronghang Hu, Vedanuj Goswami, Guillaume Couairon, Wojciech Galuba, Marcus Rohrbach, and Douwe Kiela.
 1. **[FNet](https://huggingface.co/docs/transformers/model_doc/fnet)** (from Google Research) released with the paper [FNet: Mixing Tokens with Fourier Transforms](https://arxiv.org/abs/2105.03824) by James Lee-Thorp, Joshua Ainslie, Ilya Eckstein, Santiago Ontanon.
 1. **[Funnel Transformer](https://huggingface.co/docs/transformers/model_doc/funnel)** (from CMU/Google Brain) released with the paper [Funnel-Transformer: Filtering out Sequential Redundancy for Efficient Language Processing](https://arxiv.org/abs/2006.03236) by Zihang Dai, Guokun Lai, Yiming Yang, Quoc V. Le.
 1. **[GLPN](https://huggingface.co/docs/transformers/main/model_doc/glpn)** (from KAIST) released with the paper [Global-Local Path Networks for Monocular Depth Estimation with Vertical CutDepth](https://arxiv.org/abs/2201.07436) by Doyeon Kim, Woonghyun Ga, Pyungwhan Ahn, Donggyu Joo, Sehwan Chun, Junmo Kim.
@@ -268,6 +268,7 @@ conda install -c huggingface transformers
 1. **[ELECTRA](https://huggingface.co/docs/transformers/model_doc/electra)** (来自 Google Research/Stanford University) 伴随论文 [ELECTRA: Pre-training text encoders as discriminators rather than generators](https://arxiv.org/abs/2003.10555) 由 Kevin Clark, Minh-Thang Luong, Quoc V. Le, Christopher D. Manning 发布。
 1. **[EncoderDecoder](https://huggingface.co/docs/transformers/model_doc/encoder-decoder)** (来自 Google Research) 伴随论文 [Leveraging Pre-trained Checkpoints for Sequence Generation Tasks](https://arxiv.org/abs/1907.12461) 由 Sascha Rothe, Shashi Narayan, Aliaksei Severyn 发布。
 1. **[FlauBERT](https://huggingface.co/docs/transformers/model_doc/flaubert)** (来自 CNRS) 伴随论文 [FlauBERT: Unsupervised Language Model Pre-training for French](https://arxiv.org/abs/1912.05372) 由 Hang Le, Loïc Vial, Jibril Frej, Vincent Segonne, Maximin Coavoux, Benjamin Lecouteux, Alexandre Allauzen, Benoît Crabbé, Laurent Besacier, Didier Schwab 发布。
+1. **[FLAVA](https://huggingface.co/docs/transformers/main/model_doc/flava)** (来自 Facebook AI) 伴随论文 [FLAVA: A Foundational Language And Vision Alignment Model](https://arxiv.org/abs/2112.04482) 由 Amanpreet Singh, Ronghang Hu, Vedanuj Goswami, Guillaume Couairon, Wojciech Galuba, Marcus Rohrbach, and Douwe Kiela 发布。
 1. **[FNet](https://huggingface.co/docs/transformers/model_doc/fnet)** (来自 Google Research) 伴随论文 [FNet: Mixing Tokens with Fourier Transforms](https://arxiv.org/abs/2105.03824) 由 James Lee-Thorp, Joshua Ainslie, Ilya Eckstein, Santiago Ontanon 发布。
 1. **[Funnel Transformer](https://huggingface.co/docs/transformers/model_doc/funnel)** (来自 CMU/Google Brain) 伴随论文 [Funnel-Transformer: Filtering out Sequential Redundancy for Efficient Language Processing](https://arxiv.org/abs/2006.03236) 由 Zihang Dai, Guokun Lai, Yiming Yang, Quoc V. Le 发布。
 1. **[GLPN](https://huggingface.co/docs/transformers/main/model_doc/glpn)** (来自 KAIST) 伴随论文 [Global-Local Path Networks for Monocular Depth Estimation with Vertical CutDepth](https://arxiv.org/abs/2201.07436) 由 Doyeon Kim, Woonghyun Ga, Pyungwhan Ahn, Donggyu Joo, Sehwan Chun, Junmo Kim 发布。
@@ -280,6 +280,7 @@ conda install -c huggingface transformers
 1. **[ELECTRA](https://huggingface.co/docs/transformers/model_doc/electra)** (from Google Research/Stanford University) released with the paper [ELECTRA: Pre-training text encoders as discriminators rather than generators](https://arxiv.org/abs/2003.10555) by Kevin Clark, Minh-Thang Luong, Quoc V. Le, Christopher D. Manning.
 1. **[EncoderDecoder](https://huggingface.co/docs/transformers/model_doc/encoder-decoder)** (from Google Research) released with the paper [Leveraging Pre-trained Checkpoints for Sequence Generation Tasks](https://arxiv.org/abs/1907.12461) by Sascha Rothe, Shashi Narayan, Aliaksei Severyn.
 1. **[FlauBERT](https://huggingface.co/docs/transformers/model_doc/flaubert)** (from CNRS) released with the paper [FlauBERT: Unsupervised Language Model Pre-training for French](https://arxiv.org/abs/1912.05372) by Hang Le, Loïc Vial, Jibril Frej, Vincent Segonne, Maximin Coavoux, Benjamin Lecouteux, Alexandre Allauzen, Benoît Crabbé, Laurent Besacier, Didier Schwab.
+1. **[FLAVA](https://huggingface.co/docs/transformers/main/model_doc/flava)** (from Facebook AI) released with the paper [FLAVA: A Foundational Language And Vision Alignment Model](https://arxiv.org/abs/2112.04482) by Amanpreet Singh, Ronghang Hu, Vedanuj Goswami, Guillaume Couairon, Wojciech Galuba, Marcus Rohrbach, and Douwe Kiela.
 1. **[FNet](https://huggingface.co/docs/transformers/model_doc/fnet)** (from Google Research) released with the paper [FNet: Mixing Tokens with Fourier Transforms](https://arxiv.org/abs/2105.03824) by James Lee-Thorp, Joshua Ainslie, Ilya Eckstein, Santiago Ontanon.
 1. **[Funnel Transformer](https://huggingface.co/docs/transformers/model_doc/funnel)** (from CMU/Google Brain) released with the paper [Funnel-Transformer: Filtering out Sequential Redundancy for Efficient Language Processing](https://arxiv.org/abs/2006.03236) by Zihang Dai, Guokun Lai, Yiming Yang, Quoc V. Le.
 1. **[GLPN](https://huggingface.co/docs/transformers/main/model_doc/glpn)** (from KAIST) released with the paper [Global-Local Path Networks for Monocular Depth Estimation with Vertical CutDepth](https://arxiv.org/abs/2201.07436) by Doyeon Kim, Woonghyun Ga, Pyungwhan Ahn, Donggyu Joo, Sehwan Chun, Junmo Kim.
@@ -216,6 +216,8 @@
       title: Encoder Decoder Models
     - local: model_doc/flaubert
       title: FlauBERT
+    - local: model_doc/flava
+      title: FLAVA
     - local: model_doc/fnet
       title: FNet
     - local: model_doc/fsmt
@@ -86,6 +86,7 @@ The library currently contains JAX, PyTorch and TensorFlow implementations, pret
 1. **[EncoderDecoder](model_doc/encoder-decoder)** (from Google Research) released with the paper [Leveraging Pre-trained Checkpoints for Sequence Generation Tasks](https://arxiv.org/abs/1907.12461) by Sascha Rothe, Shashi Narayan, Aliaksei Severyn.
 1. **[ELECTRA](model_doc/electra)** (from Google Research/Stanford University) released with the paper [ELECTRA: Pre-training text encoders as discriminators rather than generators](https://arxiv.org/abs/2003.10555) by Kevin Clark, Minh-Thang Luong, Quoc V. Le, Christopher D. Manning.
 1. **[FlauBERT](model_doc/flaubert)** (from CNRS) released with the paper [FlauBERT: Unsupervised Language Model Pre-training for French](https://arxiv.org/abs/1912.05372) by Hang Le, Loïc Vial, Jibril Frej, Vincent Segonne, Maximin Coavoux, Benjamin Lecouteux, Alexandre Allauzen, Benoît Crabbé, Laurent Besacier, Didier Schwab.
+1. **[FLAVA](model_doc/flava)** (from Facebook AI) released with the paper [FLAVA: A Foundational Language And Vision Alignment Model](https://arxiv.org/abs/2112.04482) by Amanpreet Singh, Ronghang Hu, Vedanuj Goswami, Guillaume Couairon, Wojciech Galuba, Marcus Rohrbach, and Douwe Kiela.
 1. **[FNet](model_doc/fnet)** (from Google Research) released with the paper [FNet: Mixing Tokens with Fourier Transforms](https://arxiv.org/abs/2105.03824) by James Lee-Thorp, Joshua Ainslie, Ilya Eckstein, Santiago Ontanon.
 1. **[Funnel Transformer](model_doc/funnel)** (from CMU/Google Brain) released with the paper [Funnel-Transformer: Filtering out Sequential Redundancy for Efficient Language Processing](https://arxiv.org/abs/2006.03236) by Zihang Dai, Guokun Lai, Yiming Yang, Quoc V. Le.
 1. **[GLPN](model_doc/glpn)** (from KAIST) released with the paper [Global-Local Path Networks for Monocular Depth Estimation with Vertical CutDepth](https://arxiv.org/abs/2201.07436) by Doyeon Kim, Woonghyun Ga, Pyungwhan Ahn, Donggyu Joo, Sehwan Chun, Junmo Kim.
@@ -204,6 +205,7 @@ Flax), PyTorch, and/or TensorFlow.
 | Encoder decoder | ❌ | ❌ | ✅ | ✅ | ✅ |
 | FairSeq Machine-Translation | ✅ | ❌ | ✅ | ❌ | ❌ |
 | FlauBERT | ✅ | ❌ | ✅ | ✅ | ❌ |
+| Flava | ❌ | ❌ | ✅ | ❌ | ❌ |
 | FNet | ✅ | ✅ | ✅ | ❌ | ❌ |
 | Funnel Transformer | ✅ | ✅ | ✅ | ✅ | ❌ |
 | GLPN | ❌ | ❌ | ✅ | ❌ | ❌ |
96
docs/source/en/model_doc/flava.mdx
Normal file
@@ -0,0 +1,96 @@
+<!--Copyright 2022 The HuggingFace Team. All rights reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
+an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
+specific language governing permissions and limitations under the License.
+-->
+
+# FLAVA
+
+## Overview
+
+The FLAVA model was proposed in [FLAVA: A Foundational Language And Vision Alignment Model](https://arxiv.org/abs/2112.04482) by Amanpreet Singh, Ronghang Hu, Vedanuj Goswami, Guillaume Couairon, Wojciech Galuba, Marcus Rohrbach, and Douwe Kiela and is accepted at CVPR 2022.
+
+The paper aims at creating a single unified foundation model which can work across vision, language
+as well as vision-and-language multimodal tasks.
+
+The abstract from the paper is the following:
+
+*State-of-the-art vision and vision-and-language models rely on large-scale visio-linguistic pretraining for obtaining good performance on a variety
+of downstream tasks. Generally, such models are often either cross-modal (contrastive) or multi-modal
+(with earlier fusion) but not both; and they often only target specific modalities or tasks. A promising
+direction would be to use a single holistic universal model, as a "foundation", that targets all modalities
+at once -- a true vision and language foundation model should be good at vision tasks, language tasks, and
+cross- and multi-modal vision and language tasks. We introduce FLAVA as such a model and demonstrate
+impressive performance on a wide range of 35 tasks spanning these target modalities.*
+
+
+This model was contributed by [aps](https://huggingface.co/aps). The original code can be found [here](https://github.com/facebookresearch/multimodal/tree/main/examples/flava).
+
+
+## FlavaConfig
+
+[[autodoc]] FlavaConfig
+
+## FlavaTextConfig
+
+[[autodoc]] FlavaTextConfig
+
+## FlavaImageConfig
+
+[[autodoc]] FlavaImageConfig
+
+## FlavaMultimodalConfig
+
+[[autodoc]] FlavaMultimodalConfig
+
+## FlavaImageCodebookConfig
+
+[[autodoc]] FlavaImageCodebookConfig
+
+## FlavaProcessor
+
+[[autodoc]] FlavaProcessor
+
+## FlavaFeatureExtractor
+
+[[autodoc]] FlavaFeatureExtractor
+
+## FlavaForPreTraining
+
+[[autodoc]] FlavaForPreTraining
+    - forward
+
+## FlavaModel
+
+[[autodoc]] FlavaModel
+    - forward
+    - get_text_features
+    - get_image_features
+
+## FlavaImageCodebook
+
+[[autodoc]] FlavaImageCodebook
+    - forward
+    - get_codebook_indices
+    - get_codebook_probs
+
+## FlavaTextModel
+
+[[autodoc]] FlavaTextModel
+    - forward
+
+## FlavaImageModel
+
+[[autodoc]] FlavaImageModel
+    - forward
+
+## FlavaMultimodalModel
+
+[[autodoc]] FlavaMultimodalModel
+    - forward
@@ -198,6 +198,14 @@ _import_structure = {
     "models.electra": ["ELECTRA_PRETRAINED_CONFIG_ARCHIVE_MAP", "ElectraConfig", "ElectraTokenizer"],
     "models.encoder_decoder": ["EncoderDecoderConfig"],
     "models.flaubert": ["FLAUBERT_PRETRAINED_CONFIG_ARCHIVE_MAP", "FlaubertConfig", "FlaubertTokenizer"],
+    "models.flava": [
+        "FLAVA_PRETRAINED_CONFIG_ARCHIVE_MAP",
+        "FlavaConfig",
+        "FlavaImageCodebookConfig",
+        "FlavaImageConfig",
+        "FlavaMultimodalConfig",
+        "FlavaTextConfig",
+    ],
     "models.fnet": ["FNET_PRETRAINED_CONFIG_ARCHIVE_MAP", "FNetConfig"],
     "models.fsmt": ["FSMT_PRETRAINED_CONFIG_ARCHIVE_MAP", "FSMTConfig", "FSMTTokenizer"],
     "models.funnel": ["FUNNEL_PRETRAINED_CONFIG_ARCHIVE_MAP", "FunnelConfig", "FunnelTokenizer"],
@@ -568,6 +576,7 @@ else:
     _import_structure["models.deit"].append("DeiTFeatureExtractor")
     _import_structure["models.detr"].append("DetrFeatureExtractor")
     _import_structure["models.dpt"].append("DPTFeatureExtractor")
+    _import_structure["models.flava"].extend(["FlavaFeatureExtractor", "FlavaProcessor"])
     _import_structure["models.glpn"].append("GLPNFeatureExtractor")
     _import_structure["models.imagegpt"].append("ImageGPTFeatureExtractor")
     _import_structure["models.layoutlmv2"].append("LayoutLMv2FeatureExtractor")
@@ -1038,6 +1047,18 @@ else:
             "FlaubertWithLMHeadModel",
         ]
     )
+    _import_structure["models.flava"].extend(
+        [
+            "FLAVA_PRETRAINED_MODEL_ARCHIVE_LIST",
+            "FlavaForPreTraining",
+            "FlavaImageCodebook",
+            "FlavaImageModel",
+            "FlavaModel",
+            "FlavaMultimodalModel",
+            "FlavaPreTrainedModel",
+            "FlavaTextModel",
+        ]
+    )
     _import_structure["models.fnet"].extend(
         [
             "FNET_PRETRAINED_MODEL_ARCHIVE_LIST",
@@ -2654,6 +2675,14 @@ if TYPE_CHECKING:
     from .models.electra import ELECTRA_PRETRAINED_CONFIG_ARCHIVE_MAP, ElectraConfig, ElectraTokenizer
     from .models.encoder_decoder import EncoderDecoderConfig
     from .models.flaubert import FLAUBERT_PRETRAINED_CONFIG_ARCHIVE_MAP, FlaubertConfig, FlaubertTokenizer
+    from .models.flava import (
+        FLAVA_PRETRAINED_CONFIG_ARCHIVE_MAP,
+        FlavaConfig,
+        FlavaImageCodebookConfig,
+        FlavaImageConfig,
+        FlavaMultimodalConfig,
+        FlavaTextConfig,
+    )
     from .models.fnet import FNET_PRETRAINED_CONFIG_ARCHIVE_MAP, FNetConfig
     from .models.fsmt import FSMT_PRETRAINED_CONFIG_ARCHIVE_MAP, FSMTConfig, FSMTTokenizer
     from .models.funnel import FUNNEL_PRETRAINED_CONFIG_ARCHIVE_MAP, FunnelConfig, FunnelTokenizer
@@ -2974,6 +3003,7 @@ if TYPE_CHECKING:
         from .models.deit import DeiTFeatureExtractor
         from .models.detr import DetrFeatureExtractor
         from .models.dpt import DPTFeatureExtractor
+        from .models.flava import FlavaFeatureExtractor, FlavaProcessor
         from .models.glpn import GLPNFeatureExtractor
         from .models.imagegpt import ImageGPTFeatureExtractor
         from .models.layoutlmv2 import LayoutLMv2FeatureExtractor, LayoutLMv2Processor
@@ -3372,6 +3402,16 @@ if TYPE_CHECKING:
             FlaubertModel,
             FlaubertWithLMHeadModel,
         )
+        from .models.flava import (
+            FLAVA_PRETRAINED_MODEL_ARCHIVE_LIST,
+            FlavaForPreTraining,
+            FlavaImageCodebook,
+            FlavaImageModel,
+            FlavaModel,
+            FlavaMultimodalModel,
+            FlavaPreTrainedModel,
+            FlavaTextModel,
+        )
         from .models.fnet import (
             FNET_PRETRAINED_MODEL_ARCHIVE_LIST,
             FNetForMaskedLM,
@@ -54,6 +54,7 @@ from . import (
|
|||||||
electra,
|
electra,
|
||||||
encoder_decoder,
|
encoder_decoder,
|
||||||
flaubert,
|
flaubert,
|
||||||
|
flava,
|
||||||
fnet,
|
fnet,
|
||||||
fsmt,
|
fsmt,
|
||||||
funnel,
|
funnel,
|
||||||
|
|||||||
@@ -66,6 +66,7 @@ CONFIG_MAPPING_NAMES = OrderedDict(
|
|||||||
("canine", "CanineConfig"),
|
("canine", "CanineConfig"),
|
||||||
("roformer", "RoFormerConfig"),
|
("roformer", "RoFormerConfig"),
|
||||||
("clip", "CLIPConfig"),
|
("clip", "CLIPConfig"),
|
||||||
|
("flava", "FlavaConfig"),
|
||||||
("bigbird_pegasus", "BigBirdPegasusConfig"),
|
("bigbird_pegasus", "BigBirdPegasusConfig"),
|
||||||
("deit", "DeiTConfig"),
|
("deit", "DeiTConfig"),
|
||||||
("luke", "LukeConfig"),
|
("luke", "LukeConfig"),
|
||||||
@@ -171,6 +172,7 @@ CONFIG_ARCHIVE_MAP_MAPPING_NAMES = OrderedDict(
|
|||||||
("canine", "CANINE_PRETRAINED_CONFIG_ARCHIVE_MAP"),
|
("canine", "CANINE_PRETRAINED_CONFIG_ARCHIVE_MAP"),
|
||||||
("roformer", "ROFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP"),
|
("roformer", "ROFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP"),
|
||||||
("clip", "CLIP_PRETRAINED_CONFIG_ARCHIVE_MAP"),
|
("clip", "CLIP_PRETRAINED_CONFIG_ARCHIVE_MAP"),
|
||||||
|
("flava", "FLAVA_PRETRAINED_CONFIG_ARCHIVE_MAP"),
|
||||||
("bigbird_pegasus", "BIGBIRD_PEGASUS_PRETRAINED_CONFIG_ARCHIVE_MAP"),
|
("bigbird_pegasus", "BIGBIRD_PEGASUS_PRETRAINED_CONFIG_ARCHIVE_MAP"),
|
||||||
("deit", "DEIT_PRETRAINED_CONFIG_ARCHIVE_MAP"),
|
("deit", "DEIT_PRETRAINED_CONFIG_ARCHIVE_MAP"),
|
||||||
("luke", "LUKE_PRETRAINED_CONFIG_ARCHIVE_MAP"),
|
("luke", "LUKE_PRETRAINED_CONFIG_ARCHIVE_MAP"),
|
||||||
@@ -268,6 +270,7 @@ MODEL_NAMES_MAPPING = OrderedDict(
|
|||||||
("canine", "Canine"),
|
("canine", "Canine"),
|
||||||
("roformer", "RoFormer"),
|
("roformer", "RoFormer"),
|
||||||
("clip", "CLIP"),
|
("clip", "CLIP"),
|
||||||
|
("flava", "Flava"),
|
||||||
("bigbird_pegasus", "BigBirdPegasus"),
|
("bigbird_pegasus", "BigBirdPegasus"),
|
||||||
("deit", "DeiT"),
|
("deit", "DeiT"),
|
||||||
("luke", "LUKE"),
|
("luke", "LUKE"),
|
||||||
|
|||||||
@@ -47,6 +47,7 @@ FEATURE_EXTRACTOR_MAPPING_NAMES = OrderedDict(
|
|||||||
("detr", "DetrFeatureExtractor"),
|
("detr", "DetrFeatureExtractor"),
|
||||||
("layoutlmv2", "LayoutLMv2FeatureExtractor"),
|
("layoutlmv2", "LayoutLMv2FeatureExtractor"),
|
||||||
("clip", "CLIPFeatureExtractor"),
|
("clip", "CLIPFeatureExtractor"),
|
||||||
|
("flava", "FlavaFeatureExtractor"),
|
||||||
("perceiver", "PerceiverFeatureExtractor"),
|
("perceiver", "PerceiverFeatureExtractor"),
|
||||||
("swin", "ViTFeatureExtractor"),
|
("swin", "ViTFeatureExtractor"),
|
||||||
("vit_mae", "ViTFeatureExtractor"),
|
("vit_mae", "ViTFeatureExtractor"),
|
||||||
|
|||||||
@@ -62,6 +62,7 @@ MODEL_MAPPING_NAMES = OrderedDict(
|
|||||||
("canine", "CanineModel"),
|
("canine", "CanineModel"),
|
||||||
("roformer", "RoFormerModel"),
|
("roformer", "RoFormerModel"),
|
||||||
("clip", "CLIPModel"),
|
("clip", "CLIPModel"),
|
||||||
|
("flava", "FlavaModel"),
|
||||||
("bigbird_pegasus", "BigBirdPegasusModel"),
|
("bigbird_pegasus", "BigBirdPegasusModel"),
|
||||||
("deit", "DeiTModel"),
|
("deit", "DeiTModel"),
|
||||||
("luke", "LukeModel"),
|
("luke", "LukeModel"),
|
||||||
@@ -131,6 +132,7 @@ MODEL_MAPPING_NAMES = OrderedDict(
|
|||||||
MODEL_FOR_PRETRAINING_MAPPING_NAMES = OrderedDict(
|
MODEL_FOR_PRETRAINING_MAPPING_NAMES = OrderedDict(
|
||||||
[
|
[
|
||||||
# Model for pre-training mapping
|
# Model for pre-training mapping
|
||||||
|
("flava", "FlavaForPreTraining"),
|
||||||
("vit_mae", "ViTMAEForPreTraining"),
|
("vit_mae", "ViTMAEForPreTraining"),
|
||||||
("fnet", "FNetForPreTraining"),
|
("fnet", "FNetForPreTraining"),
|
||||||
("visual_bert", "VisualBertForPreTraining"),
|
("visual_bert", "VisualBertForPreTraining"),
|
||||||
|
|||||||
@@ -38,6 +38,7 @@ logger = logging.get_logger(__name__)
|
|||||||
PROCESSOR_MAPPING_NAMES = OrderedDict(
|
PROCESSOR_MAPPING_NAMES = OrderedDict(
|
||||||
[
|
[
|
||||||
("clip", "CLIPProcessor"),
|
("clip", "CLIPProcessor"),
|
||||||
|
("flava", "FlavaProcessor"),
|
||||||
("layoutlmv2", "LayoutLMv2Processor"),
|
("layoutlmv2", "LayoutLMv2Processor"),
|
||||||
("layoutxlm", "LayoutXLMProcessor"),
|
("layoutxlm", "LayoutXLMProcessor"),
|
||||||
("speech_to_text", "Speech2TextProcessor"),
|
("speech_to_text", "Speech2TextProcessor"),
|
||||||
|
|||||||
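The `*_MAPPING_NAMES` tables in the hunks above store class names as plain strings keyed by model type. The following is a minimal sketch of how such a table could be resolved to a class at lookup time. It assumes a simple `getattr`-based lookup; `resolve_class`, `fake_module`, and the empty stand-in classes are hypothetical and are not the library's actual resolution code. It illustrates why each string has to match the exported class name exactly.

```python
from collections import OrderedDict
from types import SimpleNamespace


# Hypothetical stand-ins for the real processor classes.
class CLIPProcessor:
    pass


class FlavaProcessor:
    pass


# Hypothetical stand-in for the package namespace that exports the classes.
fake_module = SimpleNamespace(CLIPProcessor=CLIPProcessor, FlavaProcessor=FlavaProcessor)

PROCESSOR_NAMES = OrderedDict(
    [
        ("clip", "CLIPProcessor"),
        ("flava", "FlavaProcessor"),
    ]
)


def resolve_class(model_type, names, module):
    """Look up the class name for `model_type` and fetch it from `module`."""
    class_name = names[model_type]
    try:
        return getattr(module, class_name)
    except AttributeError as err:
        # A misspelled or miscapitalized name only fails here, at lookup time.
        raise ValueError(f"{class_name!r} is not exported by the module") from err


resolved = resolve_class("flava", PROCESSOR_NAMES, fake_module)
```

Because the names are strings, a typo in a table entry is not caught at import time in this sketch; it only surfaces when the entry is first resolved.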
99
src/transformers/models/flava/__init__.py
Normal file
@@ -0,0 +1,99 @@
|
|||||||
|
# flake8: noqa
|
||||||
|
# There's no way to ignore "F401 '...' imported but unused" warnings in this
|
||||||
|
# module, but to preserve other warnings. So, don't check this module at all.
|
||||||
|
|
||||||
|
# Copyright 2022 Meta Platforms authors and The HuggingFace Team. All rights reserved.
|
||||||
|
#
|
||||||
|
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||||
|
# you may not use this file except in compliance with the License.
|
||||||
|
# You may obtain a copy of the License at
|
||||||
|
#
|
||||||
|
# http://www.apache.org/licenses/LICENSE-2.0
|
||||||
|
#
|
||||||
|
# Unless required by applicable law or agreed to in writing, software
|
||||||
|
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||||
|
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||||
|
# See the License for the specific language governing permissions and
|
||||||
|
# limitations under the License.
|
||||||
|
from typing import TYPE_CHECKING
|
||||||
|
|
||||||
|
from ...utils import OptionalDependencyNotAvailable, _LazyModule, is_torch_available, is_vision_available
|
||||||
|
|
||||||
|
|
||||||
|
_import_structure = {
|
||||||
|
"configuration_flava": [
|
||||||
|
"FLAVA_PRETRAINED_CONFIG_ARCHIVE_MAP",
|
||||||
|
"FlavaConfig",
|
||||||
|
"FlavaImageCodebookConfig",
|
||||||
|
"FlavaImageConfig",
|
||||||
|
"FlavaMultimodalConfig",
|
||||||
|
"FlavaTextConfig",
|
||||||
|
],
|
||||||
|
}
|
||||||
|
|
||||||
|
try:
|
||||||
|
if not is_vision_available():
|
||||||
|
raise OptionalDependencyNotAvailable()
|
||||||
|
except OptionalDependencyNotAvailable:
|
||||||
|
pass
|
||||||
|
else:
|
||||||
|
_import_structure["feature_extraction_flava"] = ["FlavaFeatureExtractor"]
|
||||||
|
_import_structure["processing_flava"] = ["FlavaProcessor"]
|
||||||
|
|
||||||
|
try:
|
||||||
|
if not is_torch_available():
|
||||||
|
raise OptionalDependencyNotAvailable()
|
||||||
|
except OptionalDependencyNotAvailable:
|
||||||
|
pass
|
||||||
|
else:
|
||||||
|
_import_structure["modeling_flava"] = [
|
||||||
|
"FLAVA_PRETRAINED_MODEL_ARCHIVE_LIST",
|
||||||
|
"FlavaForPreTraining",
|
||||||
|
"FlavaImageCodebook",
|
||||||
|
"FlavaImageModel",
|
||||||
|
"FlavaModel",
|
||||||
|
"FlavaMultimodalModel",
|
||||||
|
"FlavaPreTrainedModel",
|
||||||
|
"FlavaTextModel",
|
||||||
|
]
|
||||||
|
|
||||||
|
if TYPE_CHECKING:
|
||||||
|
from .configuration_flava import (
|
||||||
|
FLAVA_PRETRAINED_CONFIG_ARCHIVE_MAP,
|
||||||
|
FlavaConfig,
|
||||||
|
FlavaImageCodebookConfig,
|
||||||
|
FlavaImageConfig,
|
||||||
|
FlavaMultimodalConfig,
|
||||||
|
FlavaTextConfig,
|
||||||
|
)
|
||||||
|
|
||||||
|
try:
|
||||||
|
if not is_vision_available():
|
||||||
|
raise OptionalDependencyNotAvailable()
|
||||||
|
except OptionalDependencyNotAvailable:
|
||||||
|
pass
|
||||||
|
else:
|
||||||
|
from .feature_extraction_flava import FlavaFeatureExtractor
|
||||||
|
from .processing_flava import FlavaProcessor
|
||||||
|
|
||||||
|
try:
|
||||||
|
if not is_torch_available():
|
||||||
|
raise OptionalDependencyNotAvailable()
|
||||||
|
except OptionalDependencyNotAvailable:
|
||||||
|
pass
|
||||||
|
else:
|
||||||
|
from .modeling_flava import (
|
||||||
|
FLAVA_PRETRAINED_MODEL_ARCHIVE_LIST,
|
||||||
|
FlavaForPreTraining,
|
||||||
|
FlavaImageCodebook,
|
||||||
|
FlavaImageModel,
|
||||||
|
FlavaModel,
|
||||||
|
FlavaMultimodalModel,
|
||||||
|
FlavaPreTrainedModel,
|
||||||
|
FlavaTextModel,
|
||||||
|
)
|
||||||
|
|
||||||
|
else:
|
||||||
|
import sys
|
||||||
|
|
||||||
|
sys.modules[__name__] = _LazyModule(__name__, globals()["__file__"], _import_structure, module_spec=__spec__)
|
||||||
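The `__init__.py` above registers submodules only when their optional dependencies are present, using a `try` / `except` / `else` guard per dependency. The sketch below reproduces that control flow in a self-contained form. It is an illustration under stated assumptions, not the library's code: `is_available`, `build_import_structure`, and the local `OptionalDependencyNotAvailable` class are hypothetical stand-ins, and the lists of names are shortened.

```python
import importlib.util


class OptionalDependencyNotAvailable(Exception):
    """Local stand-in for the exception of the same name used in the diff."""


def is_available(module_name):
    # Hypothetical helper: True if a top-level module can be located without importing it.
    return importlib.util.find_spec(module_name) is not None


def build_import_structure(torch_ok, vision_ok):
    # Configuration classes need no optional dependency, so they are always registered.
    structure = {"configuration_flava": ["FlavaConfig", "FlavaTextConfig"]}

    try:
        if not vision_ok:
            raise OptionalDependencyNotAvailable()
    except OptionalDependencyNotAvailable:
        pass  # Skip the vision-dependent entries silently.
    else:
        structure["feature_extraction_flava"] = ["FlavaFeatureExtractor"]
        structure["processing_flava"] = ["FlavaProcessor"]

    try:
        if not torch_ok:
            raise OptionalDependencyNotAvailable()
    except OptionalDependencyNotAvailable:
        pass  # Skip the torch-dependent entries silently.
    else:
        structure["modeling_flava"] = ["FlavaModel", "FlavaForPreTraining"]

    return structure


minimal = build_import_structure(torch_ok=False, vision_ok=False)
full = build_import_structure(torch_ok=True, vision_ok=True)
```

With both flags off, only the configuration entry is registered. Each flag adds its own group of entries independently of the other.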
646
src/transformers/models/flava/configuration_flava.py
Normal file
@@ -0,0 +1,646 @@
|
|||||||
|
# coding=utf-8
|
||||||
|
# Copyright 2022 Meta Platforms authors and The HuggingFace Team. All rights reserved.
|
||||||
|
#
|
||||||
|
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||||
|
# you may not use this file except in compliance with the License.
|
||||||
|
# You may obtain a copy of the License at
|
||||||
|
#
|
||||||
|
# http://www.apache.org/licenses/LICENSE-2.0
|
||||||
|
#
|
||||||
|
# Unless required by applicable law or agreed to in writing, software
|
||||||
|
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||||
|
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||||
|
# See the License for the specific language governing permissions and
|
||||||
|
# limitations under the License.
|
||||||
|
""" FLAVA model configurations"""
|
||||||
|
|
||||||
|
import copy
|
||||||
|
import os
|
||||||
|
from typing import Any, Dict, Union
|
||||||
|
|
||||||
|
from ...configuration_utils import PretrainedConfig
|
||||||
|
from ...utils import logging
|
||||||
|
|
||||||
|
|
||||||
|
logger = logging.get_logger(__name__)
|
||||||
|
|
||||||
|
FLAVA_PRETRAINED_CONFIG_ARCHIVE_MAP = {
|
||||||
|
"facebook/flava-full": "https://huggingface.co/facebook/flava-full/resolve/main/config.json",
|
||||||
|
}
|
||||||
|
|
||||||
|
|
||||||
|
class FlavaImageConfig(PretrainedConfig):
|
||||||
|
r"""
|
||||||
|
This is the configuration class to store the configuration of a [`FlavaImageModel`]. It is used to instantiate a
|
||||||
|
FLAVA model according to the specified arguments, defining the model architecture.
|
||||||
|
|
||||||
|
Instantiating a configuration with the defaults will yield a similar configuration to that of the FLAVA
|
||||||
|
[facebook/flava-full](https://huggingface.co/facebook/flava-full) architecture.
|
||||||
|
|
||||||
|
Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
|
||||||
|
documentation from [`PretrainedConfig`] for more information.
|
||||||
|
|
||||||
|
|
||||||
|
Args:
|
||||||
|
hidden_size (`int`, *optional*, defaults to 768):
|
||||||
|
Dimensionality of the encoder layers and the pooler layer.
|
||||||
|
num_hidden_layers (`int`, *optional*, defaults to 12):
|
||||||
|
Number of hidden layers in the Transformer encoder.
|
||||||
|
num_attention_heads (`int`, *optional*, defaults to 12):
|
||||||
|
Number of attention heads for each attention layer in the Transformer encoder.
|
||||||
|
intermediate_size (`int`, *optional*, defaults to 3072):
|
||||||
|
Dimensionality of the "intermediate" (i.e., feed-forward) layer in the Transformer encoder.
|
||||||
|
hidden_act (`str` or `function`, *optional*, defaults to `"gelu"`):
|
||||||
|
The non-linear activation function (function or string) in the encoder and pooler. If string, `"gelu"`,
|
||||||
|
`"relu"`, `"selu"` and `"gelu_new"` are supported.
|
||||||
|
hidden_dropout_prob (`float`, *optional*, defaults to 0.0):
|
||||||
|
The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
|
||||||
|
attention_probs_dropout_prob (`float`, *optional*, defaults to 0.0):
|
||||||
|
The dropout ratio for the attention probabilities.
|
||||||
|
initializer_range (`float`, *optional*, defaults to 0.02):
|
||||||
|
The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
|
||||||
|
layer_norm_eps (`float`, *optional*, defaults to 1e-12):
|
||||||
|
The epsilon used by the layer normalization layers.
|
||||||
|
image_size (`int`, *optional*, defaults to 224):
|
||||||
|
The size (resolution) of each image.
|
||||||
|
patch_size (`int`, *optional*, defaults to 16):
|
||||||
|
The size (resolution) of each patch.
|
||||||
|
num_channels (`int`, *optional*, defaults to 3):
|
||||||
|
The number of input channels.
|
||||||
|
qkv_bias (`bool`, *optional*, defaults to `True`):
|
||||||
|
Whether to add a bias to the queries, keys and values.
|
||||||
|
mask_token (`bool`, *optional*, defaults to `True`):
|
||||||
|
Whether to use a mask token or not. Used in MIM (Masked Image Modeling) loss for FLAVA.
|
||||||
|
vocab_size (`int`, *optional*, defaults to 8192):
|
||||||
|
Vocabulary size of the [`FlavaImageCodebook`] used in conjunction with [`FlavaImageModel`] for MIM (Masked
|
||||||
|
Image Modeling) loss for FLAVA.
|
||||||
|
|
||||||
|
Example:
|
||||||
|
|
||||||
|
```python
|
||||||
|
>>> from transformers import FlavaImageModel, FlavaImageConfig
|
||||||
|
|
||||||
|
>>> # Initializing a FlavaImageConfig with a facebook/flava-full style configuration
|
||||||
|
>>> configuration = FlavaImageConfig()
|
||||||
|
|
||||||
|
>>> # Initializing a FlavaImageModel (with random weights) from the facebook/flava-full style configuration
|
||||||
|
>>> model = FlavaImageModel(configuration)
|
||||||
|
|
||||||
|
>>> # Accessing the model configuration
|
||||||
|
>>> configuration = model.config
|
||||||
|
```"""
|
||||||
|
|
||||||
|
model_type = "flava_image_model"
|
||||||
|
|
||||||
|
def __init__(
|
||||||
|
self,
|
||||||
|
hidden_size: int = 768,
|
||||||
|
num_hidden_layers: int = 12,
|
||||||
|
num_attention_heads: int = 12,
|
||||||
|
intermediate_size: int = 3072,
|
||||||
|
hidden_act: str = "gelu",
|
||||||
|
hidden_dropout_prob: float = 0.0,
|
||||||
|
attention_probs_dropout_prob: float = 0.0,
|
||||||
|
initializer_range: float = 0.02,
|
||||||
|
layer_norm_eps: float = 1e-12,
|
||||||
|
image_size: int = 224,
|
||||||
|
patch_size: int = 16,
|
||||||
|
num_channels: int = 3,
|
||||||
|
qkv_bias: bool = True,
|
||||||
|
mask_token: bool = True,
|
||||||
|
vocab_size: int = 8192,
|
||||||
|
**kwargs
|
||||||
|
):
|
||||||
|
super().__init__(**kwargs)
|
||||||
|
|
||||||
|
self.hidden_size = hidden_size
|
||||||
|
self.num_hidden_layers = num_hidden_layers
|
||||||
|
self.num_attention_heads = num_attention_heads
|
||||||
|
self.intermediate_size = intermediate_size
|
||||||
|
self.hidden_act = hidden_act
|
||||||
|
self.hidden_dropout_prob = hidden_dropout_prob
|
||||||
|
self.attention_probs_dropout_prob = attention_probs_dropout_prob
|
||||||
|
self.initializer_range = initializer_range
|
||||||
|
self.layer_norm_eps = layer_norm_eps
|
||||||
|
self.image_size = image_size
|
||||||
|
self.patch_size = patch_size
|
||||||
|
self.num_channels = num_channels
|
||||||
|
self.qkv_bias = qkv_bias
|
||||||
|
self.mask_token = mask_token
|
||||||
|
self.vocab_size = vocab_size
|
||||||
|
|
||||||
|
@classmethod
|
||||||
|
def from_pretrained(cls, pretrained_model_name_or_path: Union[str, os.PathLike], **kwargs) -> "PretrainedConfig":
|
||||||
|
|
||||||
|
config_dict, kwargs = cls.get_config_dict(pretrained_model_name_or_path, **kwargs)
|
||||||
|
|
||||||
|
# get the image config dict if we are loading from FlavaConfig
|
||||||
|
if config_dict.get("model_type") == "flava":
|
||||||
|
config_dict = config_dict["image_config"]
|
||||||
|
|
||||||
|
if "model_type" in config_dict and hasattr(cls, "model_type") and config_dict["model_type"] != cls.model_type:
|
||||||
|
logger.warning(
|
||||||
|
f"You are using a model of type {config_dict['model_type']} to instantiate a model of type "
|
||||||
|
f"{cls.model_type}. This is not supported for all configurations of models and can yield errors."
|
||||||
|
)
|
||||||
|
|
||||||
|
return cls.from_dict(config_dict, **kwargs)
|
||||||
|
|
||||||
|
|
||||||
|
class FlavaTextConfig(PretrainedConfig):
|
||||||
|
r"""
|
||||||
|
This is the configuration class to store the configuration of a [`FlavaTextModel`]. It is used to instantiate a
|
||||||
|
FLAVA model according to the specified arguments, defining the model architecture.
|
||||||
|
|
||||||
|
Instantiating a configuration with the defaults will yield a similar configuration to that of the FLAVA
|
||||||
|
[facebook/flava-full](https://huggingface.co/facebook/flava-full) architecture.
|
||||||
|
|
||||||
|
Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
|
||||||
|
documentation from [`PretrainedConfig`] for more information.
|
||||||
|
|
||||||
|
|
||||||
|
Args:
|
||||||
|
vocab_size (`int`, *optional*, defaults to 30522):
|
||||||
|
Vocabulary size of the FLAVA text model. Defines the number of different tokens that can be represented by the
|
||||||
|
`input_ids` passed when calling [`FlavaTextModel`].
|
||||||
|
type_vocab_size (`int`, *optional*, defaults to 2):
|
||||||
|
The vocabulary size of the `token_type_ids` passed when calling [`FlavaTextModel`]. Note that even though
|
||||||
|
text encoder allows `token_type_ids`'s value as 2, for text-only pretraining and fine-tuning, only 1 is
|
||||||
|
used similar to RoBERTa.
|
||||||
|
max_position_embeddings (`int`, *optional*, defaults to 512):
|
||||||
|
The maximum sequence length that this model might ever be used with. Typically set this to something large
|
||||||
|
just in case (e.g., 512 or 1024 or 2048). For VL, max_length passed to model is 77.
|
||||||
|
position_embedding_type (`str`, *optional*, defaults to `"absolute"`):
|
||||||
|
Type of position embedding. Choose one of `"absolute"`, `"relative_key"`, `"relative_key_query"`. For
|
||||||
|
positional embeddings use `"absolute"`. For more information on `"relative_key"`, please refer to
|
||||||
|
[Self-Attention with Relative Position Representations (Shaw et al.)](https://arxiv.org/abs/1803.02155).
|
||||||
|
For more information on `"relative_key_query"`, please refer to *Method 4* in [Improve Transformer Models
|
||||||
|
with Better Relative Position Embeddings (Huang et al.)](https://arxiv.org/abs/2009.13658).
|
||||||
|
hidden_size (`int`, *optional*, defaults to 768):
|
||||||
|
Dimensionality of the encoder layers and the pooler layer.
|
||||||
|
num_hidden_layers (`int`, *optional*, defaults to 12):
|
||||||
|
Number of hidden layers in the Transformer encoder.
|
||||||
|
num_attention_heads (`int`, *optional*, defaults to 12):
|
||||||
|
Number of attention heads for each attention layer in the Transformer encoder.
|
||||||
|
intermediate_size (`int`, *optional*, defaults to 3072):
|
||||||
|
Dimensionality of the "intermediate" (i.e., feed-forward) layer in the Transformer encoder.
|
||||||
|
hidden_act (`str` or `function`, *optional*, defaults to `"gelu"`):
|
||||||
|
The non-linear activation function (function or string) in the encoder and pooler. If string, `"gelu"`,
|
||||||
|
`"relu"`, `"selu"` and `"gelu_new"` are supported.
|
||||||
|
hidden_dropout_prob (`float`, *optional*, defaults to 0.0):
|
||||||
|
The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
|
||||||
|
attention_probs_dropout_prob (`float`, *optional*, defaults to 0.0):
|
||||||
|
The dropout ratio for the attention probabilities.
|
||||||
|
initializer_range (`float`, *optional*, defaults to 0.02):
|
||||||
|
The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
|
||||||
|
layer_norm_eps (`float`, *optional*, defaults to 1e-12):
|
||||||
|
The epsilon used by the layer normalization layers.
|
||||||
|
pad_token_id (`int`, *optional*, defaults to 0):
The id of the padding token in the text vocabulary.
|
||||||
|
qkv_bias (`bool`, *optional*, defaults to `True`):
|
||||||
|
Whether to add a bias to the queries, keys and values.
|
||||||
|
|
||||||
|
Example:
|
||||||
|
|
||||||
|
```python
|
||||||
|
>>> from transformers import FlavaTextModel, FlavaTextConfig
|
||||||
|
|
||||||
|
>>> # Initializing a FlavaTextConfig with a facebook/flava-full style configuration
|
||||||
|
>>> configuration = FlavaTextConfig()
|
||||||
|
|
||||||
|
>>> # Initializing a FlavaTextModel (with random weights) from the facebook/flava-full style configuration
|
||||||
|
>>> model = FlavaTextModel(configuration)
|
||||||
|
|
||||||
|
>>> # Accessing the model configuration
|
||||||
|
>>> configuration = model.config
|
||||||
|
```"""
|
||||||
|
model_type = "flava_text_model"
|
||||||
|
|
||||||
|
def __init__(
|
||||||
|
self,
|
||||||
|
vocab_size: int = 30522,
|
||||||
|
type_vocab_size: int = 2,
|
||||||
|
max_position_embeddings: int = 512,
|
||||||
|
position_embedding_type: str = "absolute",
|
||||||
|
hidden_size: int = 768,
|
||||||
|
num_hidden_layers: int = 12,
|
||||||
|
num_attention_heads: int = 12,
|
||||||
|
intermediate_size: int = 3072,
|
||||||
|
hidden_act: str = "gelu",
|
||||||
|
hidden_dropout_prob: float = 0.0,
|
||||||
|
attention_probs_dropout_prob: float = 0.0,
|
||||||
|
initializer_range: float = 0.02,
|
||||||
|
layer_norm_eps: float = 1e-12,
|
||||||
|
pad_token_id: int = 0,
|
||||||
|
qkv_bias: bool = True,
|
||||||
|
**kwargs
|
||||||
|
):
|
||||||
|
super().__init__(**kwargs)
|
||||||
|
|
||||||
|
self.vocab_size = vocab_size
|
||||||
|
self.type_vocab_size = type_vocab_size
|
||||||
|
self.max_position_embeddings = max_position_embeddings
|
||||||
|
self.position_embedding_type = position_embedding_type
|
||||||
|
self.hidden_size = hidden_size
|
||||||
|
self.num_hidden_layers = num_hidden_layers
|
||||||
|
self.num_attention_heads = num_attention_heads
|
||||||
|
self.intermediate_size = intermediate_size
|
||||||
|
self.hidden_act = hidden_act
|
||||||
|
self.hidden_dropout_prob = hidden_dropout_prob
|
||||||
|
self.attention_probs_dropout_prob = attention_probs_dropout_prob
|
||||||
|
self.initializer_range = initializer_range
|
||||||
|
self.layer_norm_eps = layer_norm_eps
|
||||||
|
self.qkv_bias = qkv_bias
|
||||||
|
self.pad_token_id = pad_token_id
|
||||||
|
|
||||||
|
@classmethod
|
||||||
|
def from_pretrained(cls, pretrained_model_name_or_path: Union[str, os.PathLike], **kwargs) -> "PretrainedConfig":
|
||||||
|
|
||||||
|
config_dict, kwargs = cls.get_config_dict(pretrained_model_name_or_path, **kwargs)
|
||||||
|
|
||||||
|
# get the text config dict if we are loading from FlavaConfig
|
||||||
|
if config_dict.get("model_type") == "flava":
|
||||||
|
config_dict = config_dict["text_config"]
|
||||||
|
|
||||||
|
if "model_type" in config_dict and hasattr(cls, "model_type") and config_dict["model_type"] != cls.model_type:
|
||||||
|
logger.warning(
|
||||||
|
f"You are using a model of type {config_dict['model_type']} to instantiate a model of type "
|
||||||
|
f"{cls.model_type}. This is not supported for all configurations of models and can yield errors."
|
||||||
|
)
|
||||||
|
|
||||||
|
return cls.from_dict(config_dict, **kwargs)
|
||||||
|
|
||||||
|
|
||||||
|
class FlavaMultimodalConfig(PretrainedConfig):
|
||||||
|
r"""
|
||||||
|
This is the configuration class to store the configuration of a [`FlavaMultimodalModel`]. It is used to instantiate
|
||||||
|
a FLAVA model according to the specified arguments, defining the model architecture.
|
||||||
|
|
||||||
|
Instantiating a configuration with the defaults will yield a similar configuration to that of the FLAVA
|
||||||
|
[facebook/flava-full](https://huggingface.co/facebook/flava-full) architecture.
|
||||||
|
|
||||||
|
Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
|
||||||
|
documentation from [`PretrainedConfig`] for more information.
|
||||||
|
|
||||||
|
|
||||||
|
Args:
|
||||||
|
hidden_size (`int`, *optional*, defaults to 768):
|
||||||
|
Dimensionality of the encoder layers and the pooler layer.
|
||||||
|
num_hidden_layers (`int`, *optional*, defaults to 6):
|
||||||
|
Number of hidden layers in the Transformer encoder.
|
||||||
|
num_attention_heads (`int`, *optional*, defaults to 12):
|
||||||
|
Number of attention heads for each attention layer in the Transformer encoder.
|
||||||
|
intermediate_size (`int`, *optional*, defaults to 3072):
|
||||||
|
Dimensionality of the "intermediate" (i.e., feed-forward) layer in the Transformer encoder.
|
||||||
|
hidden_act (`str` or `function`, *optional*, defaults to `"gelu"`):
|
||||||
|
The non-linear activation function (function or string) in the encoder and pooler. If string, `"gelu"`,
|
||||||
|
`"relu"`, `"selu"` and `"gelu_new"` are supported.
|
||||||
|
hidden_dropout_prob (`float`, *optional*, defaults to 0.0):
|
||||||
|
The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
|
||||||
|
attention_probs_dropout_prob (`float`, *optional*, defaults to 0.0):
|
||||||
|
The dropout ratio for the attention probabilities.
|
||||||
|
initializer_range (`float`, *optional*, defaults to 0.02):
|
||||||
|
The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
|
||||||
|
layer_norm_eps (`float`, *optional*, defaults to 1e-12):
|
||||||
|
The epsilon used by the layer normalization layers.
|
||||||
|
qkv_bias (`bool`, *optional*, defaults to `True`):
|
||||||
|
Whether to add a bias to the queries, keys and values.
|
||||||
|
use_cls_token (`bool`, *optional*, defaults to `True`):
|
||||||
|
Whether to use an extra CLS token for multimodal settings. Usually needed by the FLAVA model.
|
||||||
|
|
||||||
|
|
||||||
|
Example:
|
||||||
|
|
||||||
|
```python
|
||||||
|
>>> from transformers import FlavaMultimodalModel, FlavaMultimodalConfig
|
||||||
|
|
||||||
|
>>> # Initializing a FlavaMultimodalConfig with a facebook/flava-full style configuration
|
||||||
|
>>> configuration = FlavaMultimodalConfig()
|
||||||
|
|
||||||
|
>>> # Initializing a FlavaMultimodalModel (with random weights) from the facebook/flava-full style configuration
|
||||||
|
>>> model = FlavaMultimodalModel(configuration)
|
||||||
|
|
||||||
|
>>> # Accessing the model configuration
|
||||||
|
>>> configuration = model.config
|
||||||
|
```"""
|
||||||
|
|
||||||
|
model_type = "flava_multimodal_model"
|
||||||
|
|
||||||
|
def __init__(
|
||||||
|
self,
|
||||||
|
hidden_size: int = 768,
|
||||||
|
num_hidden_layers: int = 6,
|
||||||
|
num_attention_heads: int = 12,
|
||||||
|
intermediate_size: int = 3072,
|
||||||
|
hidden_act: str = "gelu",
|
||||||
|
hidden_dropout_prob: float = 0.0,
|
||||||
|
attention_probs_dropout_prob: float = 0.0,
|
||||||
|
initializer_range: float = 0.02,
|
||||||
|
layer_norm_eps: float = 1e-12,
|
||||||
|
qkv_bias: bool = True,
|
||||||
|
use_cls_token: bool = True,
|
||||||
|
**kwargs
|
||||||
|
):
|
||||||
|
super().__init__(**kwargs)
|
||||||
|
|
||||||
|
self.hidden_size = hidden_size
|
||||||
|
self.num_hidden_layers = num_hidden_layers
|
||||||
|
self.num_attention_heads = num_attention_heads
|
||||||
|
self.intermediate_size = intermediate_size
|
||||||
|
self.hidden_act = hidden_act
|
||||||
|
self.hidden_dropout_prob = hidden_dropout_prob
|
||||||
|
self.attention_probs_dropout_prob = attention_probs_dropout_prob
|
||||||
|
self.initializer_range = initializer_range
|
||||||
|
self.layer_norm_eps = layer_norm_eps
|
||||||
|
self.qkv_bias = qkv_bias
|
||||||
|
self.use_cls_token = use_cls_token
|
||||||
|
|
||||||
|
@classmethod
|
||||||
|
def from_pretrained(cls, pretrained_model_name_or_path: Union[str, os.PathLike], **kwargs) -> "PretrainedConfig":
|
||||||
|
config_dict, kwargs = cls.get_config_dict(pretrained_model_name_or_path, **kwargs)
|
||||||
|
|
||||||
|
# get the multimodal config dict if we are loading from FlavaConfig
|
||||||
|
if config_dict.get("model_type") == "flava":
|
||||||
|
config_dict = config_dict["multimodal_config"]
|
||||||
|
|
||||||
|
if "model_type" in config_dict and hasattr(cls, "model_type") and config_dict["model_type"] != cls.model_type:
|
||||||
|
logger.warning(
|
||||||
|
f"You are using a model of type {config_dict['model_type']} to instantiate a model of type "
|
||||||
|
f"{cls.model_type}. This is not supported for all configurations of models and can yield errors."
|
||||||
|
)
|
||||||
|
|
||||||
|
return cls.from_dict(config_dict, **kwargs)
|
||||||
|
|
||||||
|
|
||||||
|
class FlavaImageCodebookConfig(PretrainedConfig):
|
||||||
|
model_type = "flava_image_codebook"
|
||||||
|
|
||||||
|
r"""
|
||||||
|
[`FlavaImageCodebookConfig`] is the configuration class to store the configuration of a [`FlavaImageCodebook`]. It
|
||||||
|
is used to instantiate a FLAVA model according to the specified arguments, defining the model architecture.
|
||||||
|
Instantiating a configuration with the defaults will yield a similar configuration to that of the FLAVA
|
||||||
|
[facebook/flava-image-codebook](https://huggingface.co/facebook/flava-image-codebook) architecture.
|
||||||
|
|
||||||
|
Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
|
||||||
|
documentation from [`PretrainedConfig`] for more information.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
num_groups (`int`, defaults to 4):
|
||||||
|
Number of groups to be created. This parameter as of now doesn't affect the model and is used for some
|
||||||
|
internal calculation and estimations.
|
||||||
|
input_channels (`int`, defaults to 3):
|
||||||
|
Number of channels in the image to be passed.
|
||||||
|
num_blocks_per_group (`int`, defaults to 2):
|
||||||
|
Number of conv-based blocks per group.
|
||||||
|
hidden_size (`int`, defaults to 256):
|
||||||
|
Size of hidden dim for the blocks.
|
||||||
|
vocab_size (`int`, defaults to 8192):
|
||||||
|
Size of the output vocabulary for the codebook.
|
||||||
|
freeze (`bool`, defaults to `True`):
|
||||||
|
Whether to freeze the weights of the model.
|
||||||
|
initializer_range (`float`, *optional*, defaults to 0.02):
|
||||||
|
The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
|
||||||
|
kwargs (*optional*):
|
||||||
|
Dictionary of keyword arguments.
|
||||||
|
|
||||||
|
Example:
|
||||||
|
|
||||||
|
```python
|
||||||
|
>>> from transformers import FlavaImageCodebook, FlavaImageCodebookConfig
|
||||||
|
|
||||||
|
>>> # Initializing a FlavaImageCodebookConfig with a facebook/flava-image-codebook style configuration
|
||||||
|
>>> configuration = FlavaImageCodebookConfig()
|
||||||
|
|
||||||
|
>>> # Initializing a FlavaImageCodebook (with random weights) from the facebook/flava-image-codebook style configuration
|
||||||
|
>>> model = FlavaImageCodebook(configuration)
|
||||||
|
>>> # Accessing the model configuration
|
||||||
|
>>> configuration = model.config
|
||||||
|
```
|
||||||
|
"""
|
||||||
|
|
||||||
|
def __init__(
|
||||||
|
self,
|
||||||
|
num_groups: int = 4,
|
||||||
|
input_channels: int = 3,
|
||||||
|
num_blocks_per_group: int = 2,
|
||||||
|
hidden_size: int = 256,
|
||||||
|
vocab_size: int = 8192,
|
||||||
|
freeze: bool = True,
|
||||||
|
initializer_range: float = 0.02,
|
||||||
|
**kwargs,
|
||||||
|
):
|
||||||
|
super().__init__(**kwargs)
|
||||||
|
self.num_groups = num_groups
|
||||||
|
self.input_channels = input_channels
|
||||||
|
self.num_blocks_per_group = num_blocks_per_group
|
||||||
|
self.hidden_size = hidden_size
|
||||||
|
self.vocab_size = vocab_size
|
||||||
|
self.freeze = freeze
|
||||||
|
self.initializer_range = initializer_range
|
||||||
|
|
||||||
|
@classmethod
|
||||||
|
def from_pretrained(cls, pretrained_model_name_or_path: Union[str, os.PathLike], **kwargs) -> "PretrainedConfig":
|
||||||
|
|
||||||
|
config_dict, kwargs = cls.get_config_dict(pretrained_model_name_or_path, **kwargs)
|
||||||
|
|
||||||
|
# get the image codebook config dict if we are loading from FlavaConfig
|
||||||
|
if config_dict.get("model_type") == "flava":
|
||||||
|
config_dict = config_dict["image_codebook_config"]
|
||||||
|
|
||||||
|
if "model_type" in config_dict and hasattr(cls, "model_type") and config_dict["model_type"] != cls.model_type:
|
||||||
|
logger.warning(
|
||||||
|
f"You are using a model of type {config_dict['model_type']} to instantiate a model of type "
|
||||||
|
f"{cls.model_type}. This is not supported for all configurations of models and can yield errors."
|
||||||
|
)
|
||||||
|
|
||||||
|
return cls.from_dict(config_dict, **kwargs)
|
||||||
|
|
||||||
|
|
||||||
|
class FlavaConfig(PretrainedConfig):
|
||||||
|
r"""
|
||||||
|
[`FlavaConfig`] is the configuration class to store the configuration of a [`FlavaModel`]. It is used to
|
||||||
|
instantiate a FLAVA model according to the specified arguments, defining the text model, image model, image codebook
|
||||||
|
and multimodal model configs. Instantiating a configuration with the defaults will yield a similar configuration to
|
||||||
|
that of the FLAVA [facebook/flava-full](https://huggingface.co/facebook/flava-full) architecture.
|
||||||
|
|
||||||
|
Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
|
||||||
|
documentation from [`PretrainedConfig`] for more information.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
text_config_dict (`dict`, *optional*):
|
||||||
|
Dictionary of configuration options used to initialize [`FlavaTextConfig`].
|
||||||
|
image_config_dict (`dict`, *optional*):
|
||||||
|
Dictionary of configuration options used to initialize [`FlavaImageConfig`].
|
||||||
|
multimodal_config_dict (`dict`, *optional*):
|
||||||
|
Dictionary of configuration options used to initialize [`FlavaMultimodalConfig`].
|
||||||
|
hidden_size (`int`, *optional*, defaults to 768):
|
||||||
|
Dimensionality of the encoder layers and the pooler layer.
|
||||||
|
layer_norm_eps (`float`, *optional*, defaults to 1e-12):
|
||||||
|
The epsilon used by the layer normalization layers.
|
||||||
|
projection_dim (`int`, *optional*, defaults to 768):
|
||||||
|
Dimensionality of text and image projection layers.
|
||||||
|
logit_scale_init_value (`float`, *optional*, defaults to 2.6592):
|
||||||
|
The initial value of the *logit_scale* parameter. Default is used as per the original FLAVA/CLIP
|
||||||
|
implementation.
|
||||||
|
initializer_range (`float`, *optional*, defaults to 0.02):
|
||||||
|
The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
|
||||||
|
ce_ignore_index (`int`, *optional*, defaults to -100):
|
||||||
|
Cross entropy index to ignore.
|
||||||
|
mim_weight (`float`, *optional*, defaults to 1.0):
|
||||||
|
Weight to be assigned to MIM (Masked Image Modeling) unimodal loss.
|
||||||
|
mlm_weight (`float`, *optional*, defaults to 1.0):
|
||||||
|
Weight to be assigned to MLM (Masked Language Modeling) unimodal loss.
|
||||||
|
global_contrastive_weight (`float`, *optional*, defaults to 1.0):
|
||||||
|
Weight to be assigned to global contrastive cross-alignment loss.
|
||||||
|
itm_weight (`float`, *optional*, defaults to 1.0):
|
||||||
|
Weight to be assigned to image-text matching multimodal loss.
|
||||||
|
mmm_image_weight (`float`, *optional*, defaults to 1.0):
|
||||||
|
Weight to be assigned to MMM loss's image part.
|
||||||
|
mmm_text_weight (`float`, *optional*, defaults to 1.0):
|
||||||
|
Weight to be assigned to MMM loss's text part.
|
||||||
|
global_backprop_contrastive (`bool`, *optional*, defaults to `True`):
|
||||||
|
Whether to use global backpropagation through all workers in contrastive loss.
|
||||||
|
skip_unmasked_multimodal_encoder (`bool`, *optional*, defaults to `True`):
|
||||||
|
Whether to skip running unmasked multimodal encoder whose outputs are not used by FLAVA losses.
|
||||||
|
return_loss (`bool`, *optional*, defaults to `True`):
|
||||||
|
Whether to return loss or not
|
||||||
|
|
||||||
|
kwargs (*optional*):
|
||||||
|
Dictionary of keyword arguments.
|
||||||
|
|
||||||
|
Example:
|
||||||
|
|
||||||
|
```python
|
||||||
|
>>> from transformers import FlavaModel, FlavaForPreTraining, FlavaConfig
|
||||||
|
|
||||||
|
>>> # Initializing a FlavaConfig with default values
|
||||||
|
>>> configuration = FlavaConfig()
|
||||||
|
|
||||||
|
>>> # Initializing a FlavaModel and a FlavaForPreTraining model from the configuration
|
||||||
|
>>> model = FlavaModel(configuration)
|
||||||
|
>>> model_pre = FlavaForPreTraining(configuration)
|
||||||
|
|
||||||
|
>>> # Accessing the model configuration
|
||||||
|
>>> configuration = model.config
|
||||||
|
>>> configuration_pre = model_pre.config
|
||||||
|
```
|
||||||
|
"""
|
||||||
|
|
||||||
|
model_type = "flava"
|
||||||
|
is_composition = True
|
||||||
|
|
||||||
|
def __init__(
|
||||||
|
self,
|
||||||
|
image_config_dict: Dict[str, Any] = None,
|
||||||
|
text_config_dict: Dict[str, Any] = None,
|
||||||
|
multimodal_config_dict: Dict[str, Any] = None,
|
||||||
|
image_codebook_config_dict: Dict[str, Any] = None,
|
||||||
|
hidden_size: int = 768,
|
||||||
|
layer_norm_eps: float = 1e-12,
|
||||||
|
projection_dim: int = 768,
|
||||||
|
init_codebook: bool = True,
|
||||||
|
logit_scale_init_value: float = 2.6592,
|
||||||
|
initializer_range: float = 0.02,
|
||||||
|
ce_ignore_index: int = -100,
|
||||||
|
mim_weight: float = 1.0,
|
||||||
|
mlm_weight: float = 1.0,
|
||||||
|
global_contrastive_weight: float = 1.0,
|
||||||
|
itm_weight: float = 1.0,
|
||||||
|
mmm_image_weight: float = 1.0,
|
||||||
|
mmm_text_weight: float = 1.0,
|
||||||
|
global_backprop_contrastive: bool = True,
|
||||||
|
skip_unmasked_multimodal_encoder: bool = True,
|
||||||
|
return_loss: bool = True,
|
||||||
|
**kwargs
|
||||||
|
):
|
||||||
|
super().__init__(**kwargs)
|
||||||
|
|
||||||
|
if image_config_dict is None:
|
||||||
|
image_config_dict = {}
|
||||||
|
logger.info("image_config_dict is None. Initializing the FlavaImageConfig with default values.")
|
||||||
|
|
||||||
|
if text_config_dict is None:
|
||||||
|
text_config_dict = {}
|
||||||
|
logger.info("text_config_dict is None. Initializing the FlavaTextConfig with default values.")
|
||||||
|
|
||||||
|
if multimodal_config_dict is None:
|
||||||
|
multimodal_config_dict = {}
|
||||||
|
logger.info("multimodal_config_dict is None. Initializing the FlavaMultimodalConfig with default values.")
|
||||||
|
|
||||||
|
if image_codebook_config_dict is None:
|
||||||
|
image_codebook_config_dict = {}
|
||||||
|
logger.info(
|
||||||
|
"image_codebook_config_dict is None. Initializing the FlavaImageCodebookConfig with default values."
|
||||||
|
)
|
||||||
|
|
||||||
|
self.image_config_dict = image_config_dict
|
||||||
|
self.text_config_dict = text_config_dict
|
||||||
|
self.multimodal_config_dict = multimodal_config_dict
|
||||||
|
self.image_codebook_config_dict = image_codebook_config_dict
|
||||||
|
|
||||||
|
self.image_config = FlavaImageConfig(**self.image_config_dict)
|
||||||
|
self.text_config = FlavaTextConfig(**self.text_config_dict)
|
||||||
|
self.multimodal_config = FlavaMultimodalConfig(**self.multimodal_config_dict)
|
||||||
|
self.image_codebook_config = FlavaImageCodebookConfig(**self.image_codebook_config_dict)
|
||||||
|
self.projection_dim = projection_dim
|
||||||
|
self.init_codebook = init_codebook
|
||||||
|
|
||||||
|
self.hidden_size = hidden_size
|
||||||
|
self.layer_norm_eps = layer_norm_eps
|
||||||
|
self.initializer_range = initializer_range
|
||||||
|
self.logit_scale_init_value = logit_scale_init_value
|
||||||
|
self.initializer_factor = 1.0
|
||||||
|
self.ce_ignore_index = ce_ignore_index
|
||||||
|
self.mim_weight = mim_weight
|
||||||
|
self.mlm_weight = mlm_weight
|
||||||
|
self.global_contrastive_weight = global_contrastive_weight
|
||||||
|
self.itm_weight = itm_weight
|
||||||
|
self.mmm_image_weight = mmm_image_weight
|
||||||
|
self.mmm_text_weight = mmm_text_weight
|
||||||
|
self.global_backprop_contrastive = global_backprop_contrastive
|
||||||
|
self.skip_unmasked_multimodal_encoder = skip_unmasked_multimodal_encoder
|
||||||
|
self.return_loss = return_loss
|
||||||
|
|
||||||
|
@classmethod
|
||||||
|
def from_configs(
|
||||||
|
cls,
|
||||||
|
image_config: FlavaImageConfig,
|
||||||
|
text_config: FlavaTextConfig,
|
||||||
|
multimodal_config: FlavaMultimodalConfig,
|
||||||
|
image_codebook_config: FlavaImageCodebookConfig,
|
||||||
|
**kwargs
|
||||||
|
):
|
||||||
|
r"""
|
||||||
|
Instantiate a [`FlavaConfig`] (or a derived class) from a FLAVA text model configuration, a FLAVA image model
|
||||||
|
configuration, a FLAVA multimodal model configuration and a FLAVA image codebook configuration.
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
[`FlavaConfig`]: An instance of a configuration object
|
||||||
|
"""
|
||||||
|
|
||||||
|
return cls(
|
||||||
|
image_config_dict=image_config.to_dict(),
|
||||||
|
text_config_dict=text_config.to_dict(),
|
||||||
|
multimodal_config_dict=multimodal_config.to_dict(),
|
||||||
|
image_codebook_config_dict=image_codebook_config.to_dict(),
|
||||||
|
**kwargs,
|
||||||
|
)
|
||||||
|
|
||||||
|
def to_dict(self):
|
||||||
|
"""
|
||||||
|
Serializes this instance to a Python dictionary. Overrides the default [`~PretrainedConfig.to_dict`].
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
`Dict[str, Any]`: Dictionary of all the attributes that make up this configuration instance.
|
||||||
|
"""
|
||||||
|
output = copy.deepcopy(self.__dict__)
|
||||||
|
output["image_config"] = self.image_config.to_dict()
|
||||||
|
output["text_config"] = self.text_config.to_dict()
|
||||||
|
output["multimodal_config"] = self.multimodal_config.to_dict()
|
||||||
|
output["image_codebook_config"] = self.image_codebook_config.to_dict()
|
||||||
|
output["model_type"] = self.__class__.model_type
|
||||||
|
return output
|
||||||
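Reviewer note: the composite pattern used by `FlavaConfig` (sub-configs built from dictionaries, a `from_configs` constructor, and a `to_dict` override that flattens the sub-configs back into plain dictionaries) can be illustrated without `transformers`. The sketch below is a simplified stand-in, not the library code: `SubConfig` and `CompositeConfig` are hypothetical names, and only two sub-configs are modelled.

```python
import copy


class SubConfig:
    """Hypothetical stand-in for a sub-config such as FlavaTextConfig."""

    def __init__(self, hidden_size=768):
        self.hidden_size = hidden_size

    def to_dict(self):
        return copy.deepcopy(self.__dict__)


class CompositeConfig:
    """Hypothetical stand-in for a composite config such as FlavaConfig."""

    model_type = "composite"

    def __init__(self, text_config_dict=None, image_config_dict=None, projection_dim=768):
        # A missing dictionary falls back to the sub-config defaults.
        self.text_config_dict = text_config_dict if text_config_dict is not None else {}
        self.image_config_dict = image_config_dict if image_config_dict is not None else {}
        self.text_config = SubConfig(**self.text_config_dict)
        self.image_config = SubConfig(**self.image_config_dict)
        self.projection_dim = projection_dim

    @classmethod
    def from_configs(cls, text_config, image_config, **kwargs):
        # Build the composite from already-instantiated sub-configs.
        return cls(
            text_config_dict=text_config.to_dict(),
            image_config_dict=image_config.to_dict(),
            **kwargs,
        )

    def to_dict(self):
        output = copy.deepcopy(self.__dict__)
        # Replace the sub-config objects with plain dictionaries so the
        # result contains only serializable values.
        output["text_config"] = self.text_config.to_dict()
        output["image_config"] = self.image_config.to_dict()
        output["model_type"] = self.__class__.model_type
        return output


config = CompositeConfig.from_configs(SubConfig(hidden_size=512), SubConfig(), projection_dim=256)
serialized = config.to_dict()
```

Keeping both the raw `*_config_dict` inputs and the instantiated sub-configs, as the diff does, means the serialized output carries each sub-config twice; the sketch reproduces that choice rather than changing it.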
102
src/transformers/models/flava/convert_dalle_to_flava_codebook.py
Normal file
@@ -0,0 +1,102 @@
|
|||||||
|
# coding=utf-8
|
||||||
|
# Copyright 2022 Meta Platforms authors and The HuggingFace Team. All rights reserved.
|
||||||
|
#
|
||||||
|
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||||
|
# you may not use this file except in compliance with the License.
|
||||||
|
# You may obtain a copy of the License at
|
||||||
|
#
|
||||||
|
# http://www.apache.org/licenses/LICENSE-2.0
|
||||||
|
#
|
||||||
|
# Unless required by applicable law or agreed to in writing, software
|
||||||
|
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||||
|
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||||
|
# See the License for the specific language governing permissions and
|
||||||
|
# limitations under the License.
|
||||||
|
|
||||||
|
import argparse
|
||||||
|
import os
|
||||||
|
|
||||||
|
import torch
|
||||||
|
|
||||||
|
from transformers import FlavaImageCodebook, FlavaImageCodebookConfig
|
||||||
|
|
||||||
|
|
||||||
|
def rreplace(s, old, new, occurrence):
|
||||||
|
li = s.rsplit(old, occurrence)
|
||||||
|
return new.join(li)
|
||||||
|
|
||||||
|
|
||||||
|
def count_parameters(state_dict):
|
||||||
|
# encoder.embeddings are double copied in original FLAVA
|
||||||
|
return sum(param.float().sum() if "encoder.embeddings" not in key else 0 for key, param in state_dict.items())
|
||||||
|
|
||||||
|
|
||||||
|
def upgrade_state_dict(state_dict):
|
||||||
|
upgrade = {}
|
||||||
|
|
||||||
|
group_keys = ["group_1", "group_2", "group_3", "group_4"]
|
||||||
|
for key, value in state_dict.items():
|
||||||
|
for group_key in group_keys:
|
||||||
|
if group_key in key:
|
||||||
|
key = key.replace(f"{group_key}.", f"{group_key}.group.")
|
||||||
|
|
||||||
|
if "res_path" in key:
|
||||||
|
key = key.replace("res_path.", "res_path.path.")
|
||||||
|
|
||||||
|
if key.endswith(".w"):
|
||||||
|
key = rreplace(key, ".w", ".weight", 1)
|
||||||
|
if key.endswith(".b"):
|
||||||
|
key = rreplace(key, ".b", ".bias", 1)
|
||||||
|
|
||||||
|
upgrade[key] = value.float()
|
||||||
|
|
||||||
|
return upgrade
|
||||||
|
|
||||||
|
|
||||||
|
@torch.no_grad()
|
||||||
|
def convert_dalle_checkpoint(checkpoint_path, pytorch_dump_folder_path, config_path=None, save_checkpoint=True):
|
||||||
|
"""
|
||||||
|
Copy/paste/tweak model's weights to transformers design.
|
||||||
|
"""
|
||||||
|
from dall_e import Encoder
|
||||||
|
|
||||||
|
encoder = Encoder()
|
||||||
|
if os.path.exists(checkpoint_path):
|
||||||
|
ckpt = torch.load(checkpoint_path)
|
||||||
|
else:
|
||||||
|
ckpt = torch.hub.load_state_dict_from_url(checkpoint_path)
|
||||||
|
|
||||||
|
if isinstance(ckpt, Encoder):
|
||||||
|
ckpt = ckpt.state_dict()
|
||||||
|
encoder.load_state_dict(ckpt)
|
||||||
|
|
||||||
|
if config_path is not None:
|
||||||
|
config = FlavaImageCodebookConfig.from_pretrained(config_path)
|
||||||
|
else:
|
||||||
|
config = FlavaImageCodebookConfig()
|
||||||
|
|
||||||
|
hf_model = FlavaImageCodebook(config).eval()
|
||||||
|
state_dict = encoder.state_dict()
|
||||||
|
|
||||||
|
hf_state_dict = upgrade_state_dict(state_dict)
|
||||||
|
hf_model.load_state_dict(hf_state_dict)
|
||||||
|
hf_state_dict = hf_model.state_dict()
|
||||||
|
hf_count = count_parameters(hf_state_dict)
|
||||||
|
state_dict_count = count_parameters(state_dict)
|
||||||
|
|
||||||
|
assert torch.allclose(hf_count, state_dict_count, atol=1e-3)
|
||||||
|
|
||||||
|
if save_checkpoint:
|
||||||
|
hf_model.save_pretrained(pytorch_dump_folder_path)
|
||||||
|
else:
|
||||||
|
return hf_state_dict
|
||||||
|
|
||||||
|
|
||||||
|
if __name__ == "__main__":
|
||||||
|
parser = argparse.ArgumentParser()
|
||||||
|
parser.add_argument("--pytorch_dump_folder_path", default=None, type=str, help="Path to the output PyTorch model.")
|
||||||
|
parser.add_argument("--checkpoint_path", default=None, type=str, help="Path to DALL-E encoder checkpoint")
|
||||||
|
parser.add_argument("--config_path", default=None, type=str, help="Path to hf config.json of model to convert")
|
||||||
|
args = parser.parse_args()
|
||||||
|
|
||||||
|
convert_dalle_checkpoint(args.checkpoint_path, args.pytorch_dump_folder_path, args.config_path)
|
||||||
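Reviewer note: the key renaming in `upgrade_state_dict` above can be exercised on plain strings, without `torch` or the `dall_e` package. The sketch below applies the same rules to a single key; `rename_key` is a hypothetical helper name and the sample keys are made up for illustration, not taken from a real DALL-E checkpoint.

```python
def rreplace(s, old, new, occurrence):
    # Replace the last `occurrence` matches of `old`, scanning from the right.
    parts = s.rsplit(old, occurrence)
    return new.join(parts)


def rename_key(key):
    # Same rules as `upgrade_state_dict`, applied to one key.
    for group_key in ("group_1", "group_2", "group_3", "group_4"):
        if group_key in key:
            key = key.replace(f"{group_key}.", f"{group_key}.group.")
    if "res_path" in key:
        key = key.replace("res_path.", "res_path.path.")
    # Only the trailing suffix is rewritten, hence the right-anchored replace.
    if key.endswith(".w"):
        key = rreplace(key, ".w", ".weight", 1)
    if key.endswith(".b"):
        key = rreplace(key, ".b", ".bias", 1)
    return key


# Hypothetical keys, chosen to trigger each rule.
old_keys = ["blocks.group_1.block_1.res_path.conv_1.w", "blocks.input.b"]
new_keys = [rename_key(k) for k in old_keys]
```

The right-anchored `rreplace` matters when `.w` also appears earlier in a key: `rreplace("enc.w1.w", ".w", ".weight", 1)` rewrites only the final suffix, whereas `str.replace` would rewrite both occurrences.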
@@ -0,0 +1,99 @@
|
|||||||
|
# coding=utf-8
|
||||||
|
# Copyright 2022 Meta Platforms authors and The HuggingFace Team. All rights reserved.
|
||||||
|
#
|
||||||
|
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||||
|
# you may not use this file except in compliance with the License.
|
||||||
|
# You may obtain a copy of the License at
|
||||||
|
#
|
||||||
|
# http://www.apache.org/licenses/LICENSE-2.0
|
||||||
|
#
|
||||||
|
# Unless required by applicable law or agreed to in writing, software
|
||||||
|
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||||
|
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||||
|
# See the License for the specific language governing permissions and
|
||||||
|
# limitations under the License.
|
||||||
|
|
||||||
|
import argparse
|
||||||
|
import os
|
||||||
|
|
||||||
|
import torch
|
||||||
|
|
||||||
|
from transformers import FlavaConfig, FlavaForPreTraining
|
||||||
|
from transformers.models.flava.convert_dalle_to_flava_codebook import convert_dalle_checkpoint
|
||||||
|
|
||||||
|
|
||||||
|
def count_parameters(state_dict):
|
||||||
|
# encoder.embeddings are double copied in original FLAVA
|
||||||
|
return sum(param.float().sum() if "encoder.embeddings" not in key else 0 for key, param in state_dict.items())
|
||||||
|
|
||||||
|
|
||||||
|
def upgrade_state_dict(state_dict, codebook_state_dict):
|
||||||
|
upgrade = {}
|
||||||
|
|
||||||
|
for key, value in state_dict.items():
|
||||||
|
if "text_encoder.embeddings" in key or "image_encoder.embeddings" in key:
|
||||||
|
continue
|
||||||
|
|
||||||
|
key = key.replace("heads.cmd.mim_head.cls.predictions", "mmm_image_head")
|
||||||
|
key = key.replace("heads.cmd.mlm_head.cls.predictions", "mmm_text_head")
|
||||||
|
key = key.replace("heads.cmd.itm_head.cls", "itm_head")
|
||||||
|
key = key.replace("heads.cmd.itm_head.pooler", "itm_head.pooler")
|
||||||
|
key = key.replace("heads.cmd.clip_head.logit_scale", "flava.logit_scale")
|
||||||
|
key = key.replace("heads.fairseq_mlm.cls.predictions", "mlm_head")
|
||||||
|
key = key.replace("heads.imagenet.mim_head.cls.predictions", "mim_head")
|
||||||
|
key = key.replace("mm_text_projection", "flava.text_to_mm_projection")
|
||||||
|
key = key.replace("mm_image_projection", "flava.image_to_mm_projection")
|
||||||
|
key = key.replace("image_encoder.module", "flava.image_model")
|
||||||
|
key = key.replace("text_encoder.module", "flava.text_model")
|
||||||
|
key = key.replace("mm_encoder.module.encoder.cls_token", "flava.multimodal_model.cls_token")
|
||||||
|
key = key.replace("mm_encoder.module", "flava.multimodal_model")
|
||||||
|
key = key.replace("text_projection", "flava.text_projection")
|
||||||
|
key = key.replace("image_projection", "flava.image_projection")
|
||||||
|
|
||||||
|
upgrade[key] = value.float()
|
||||||
|
|
||||||
|
for key, value in codebook_state_dict.items():
|
||||||
|
upgrade[f"image_codebook.{key}"] = value
|
||||||
|
|
||||||
|
return upgrade
|
||||||
|
|
||||||
|
|
||||||
|
@torch.no_grad()
|
||||||
|
def convert_flava_checkpoint(checkpoint_path, codebook_path, pytorch_dump_folder_path, config_path=None):
|
||||||
|
"""
|
||||||
|
Copy/paste/tweak model's weights to transformers design.
|
||||||
|
"""
|
||||||
|
if config_path is not None:
|
||||||
|
config = FlavaConfig.from_pretrained(config_path)
|
||||||
|
else:
|
||||||
|
config = FlavaConfig()
|
||||||
|
|
||||||
|
hf_model = FlavaForPreTraining(config).eval()
|
||||||
|
|
||||||
|
codebook_state_dict = convert_dalle_checkpoint(codebook_path, None, save_checkpoint=False)
|
||||||
|
|
||||||
|
if os.path.exists(checkpoint_path):
|
||||||
|
state_dict = torch.load(checkpoint_path, map_location="cpu")
|
||||||
|
else:
|
||||||
|
state_dict = torch.hub.load_state_dict_from_url(checkpoint_path, map_location="cpu")
|
||||||
|
|
||||||
|
hf_state_dict = upgrade_state_dict(state_dict, codebook_state_dict)
|
||||||
|
hf_model.load_state_dict(hf_state_dict)
|
||||||
|
hf_state_dict = hf_model.state_dict()
|
||||||
|
hf_count = count_parameters(hf_state_dict)
|
||||||
|
state_dict_count = count_parameters(state_dict) + count_parameters(codebook_state_dict)
|
||||||
|
|
||||||
|
assert torch.allclose(hf_count, state_dict_count, atol=1e-3)
|
||||||
|
|
||||||
|
hf_model.save_pretrained(pytorch_dump_folder_path)
|
||||||
|
|
||||||
|
|
||||||
|
if __name__ == "__main__":
|
||||||
|
parser = argparse.ArgumentParser()
|
||||||
|
parser.add_argument("--pytorch_dump_folder_path", default=None, type=str, help="Path to the output PyTorch model.")
|
||||||
|
parser.add_argument("--checkpoint_path", default=None, type=str, help="Path to flava checkpoint")
|
||||||
|
parser.add_argument("--codebook_path", default=None, type=str, help="Path to flava codebook checkpoint")
|
||||||
|
parser.add_argument("--config_path", default=None, type=str, help="Path to hf config.json of model to convert")
|
||||||
|
args = parser.parse_args()
|
||||||
|
|
||||||
|
convert_flava_checkpoint(args.checkpoint_path, args.codebook_path, args.pytorch_dump_folder_path, args.config_path)
|
||||||
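Reviewer note: the chain of `key.replace(...)` calls in `upgrade_state_dict` is order-sensitive, because `text_projection` is a substring of `mm_text_projection` (and likewise for the image variant). The sketch below uses a subset of the rules from the diff to show why the more specific patterns must run first; `RULES` and `rename_key` are hypothetical names introduced for illustration.

```python
# Ordered (old, new) rules, a subset of those in `upgrade_state_dict`.
RULES = [
    ("mm_text_projection", "flava.text_to_mm_projection"),
    ("mm_image_projection", "flava.image_to_mm_projection"),
    ("text_projection", "flava.text_projection"),
    ("image_projection", "flava.image_projection"),
]


def rename_key(key, rules=RULES):
    # Apply every rule in sequence, as the conversion script does.
    for old, new in rules:
        key = key.replace(old, new)
    return key


good = rename_key("mm_text_projection.weight")
plain = rename_key("text_projection.weight")
# Applying the generic rules first corrupts the multimodal projection key.
bad = rename_key("mm_text_projection.weight", list(reversed(RULES)))
```

With the order used in the diff, the renamed multimodal keys no longer contain `text_projection` or `image_projection`, so the later generic rules leave them untouched.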
351
src/transformers/models/flava/feature_extraction_flava.py
Normal file
@@ -0,0 +1,351 @@
|
|||||||
|
# coding=utf-8
|
||||||
|
# Copyright 2022 Meta Platforms authors and The HuggingFace Team. All rights reserved.
|
||||||
|
#
|
||||||
|
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||||
|
# you may not use this file except in compliance with the License.
|
||||||
|
# You may obtain a copy of the License at
|
||||||
|
#
|
||||||
|
# http://www.apache.org/licenses/LICENSE-2.0
|
||||||
|
#
|
||||||
|
# Unless required by applicable law or agreed to in writing, software
|
||||||
|
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||||
|
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||||
|
# See the License for the specific language governing permissions and
|
||||||
|
# limitations under the License.
|
||||||
|
"""Feature extractor class for FLAVA."""
|
||||||
|
|
||||||
|
import math
|
||||||
|
import random
|
||||||
|
from functools import lru_cache
|
||||||
|
from typing import Any, List, Optional, Tuple, Union
|
||||||
|
|
||||||
|
import numpy as np
|
||||||
|
from PIL import Image
|
||||||
|
|
||||||
|
from ...feature_extraction_utils import BatchFeature, FeatureExtractionMixin
|
||||||
|
from ...image_utils import ImageFeatureExtractionMixin, is_torch_tensor
|
||||||
|
from ...utils import TensorType, logging
|
||||||
|
|
||||||
|
|
||||||
|
logger = logging.get_logger(__name__)
|
||||||
|
|
||||||
|
|
||||||
|
# These values are taken from CLIP
|
||||||
|
FLAVA_IMAGE_MEAN = [0.48145466, 0.4578275, 0.40821073]
|
||||||
|
FLAVA_IMAGE_STD = [0.26862954, 0.26130258, 0.27577711]
|
||||||
|
FLAVA_CODEBOOK_MEAN = [0.0, 0.0, 0.0]
|
||||||
|
FLAVA_CODEBOOK_STD = [1.0, 1.0, 1.0]
|
||||||
|
LOGIT_LAPLACE_EPS: float = 0.1
|
||||||
|
|
||||||
|
|
||||||
|
# Inspired from https://github.com/microsoft/unilm/blob/master/beit/masking_generator.py
|
||||||
|
class FlavaMaskingGenerator:
|
||||||
|
def __init__(
|
||||||
|
self,
|
||||||
|
input_size: Union[int, Tuple[int, int]] = 14,
|
||||||
|
total_mask_patches: int = 75,
|
||||||
|
mask_group_max_patches: Optional[int] = None,
|
||||||
|
mask_group_min_patches: int = 16,
|
||||||
|
mask_group_min_aspect_ratio: float = 0.3,
|
||||||
|
mask_group_max_aspect_ratio: Optional[float] = None,
|
||||||
|
):
|
||||||
|
if not isinstance(input_size, tuple):
|
||||||
|
input_size = (input_size,) * 2
|
||||||
|
self.height, self.width = input_size
|
||||||
|
|
||||||
|
self.num_patches = self.height * self.width
|
||||||
|
self.total_mask_patches = total_mask_patches
|
||||||
|
|
||||||
|
self.mask_group_min_patches = mask_group_min_patches
|
||||||
|
self.mask_group_max_patches = total_mask_patches if mask_group_max_patches is None else mask_group_max_patches
|
||||||
|
|
||||||
|
mask_group_max_aspect_ratio = mask_group_max_aspect_ratio or 1 / mask_group_min_aspect_ratio
|
||||||
|
self.log_aspect_ratio = (math.log(mask_group_min_aspect_ratio), math.log(mask_group_max_aspect_ratio))
|
||||||
|
|
||||||
|
def __repr__(self):
|
||||||
|
repr_str = "MaskingGenerator(%d, %d -> [%d ~ %d], max = %d, %.3f ~ %.3f)" % (
|
||||||
|
self.height,
|
||||||
|
self.width,
|
||||||
|
self.mask_group_min_patches,
|
||||||
|
self.mask_group_max_patches,
|
||||||
|
self.total_mask_patches,
|
||||||
|
self.log_aspect_ratio[0],
|
||||||
|
self.log_aspect_ratio[1],
|
||||||
|
)
|
||||||
|
return repr_str
|
||||||
|
|
||||||
|
def get_shape(self):
|
||||||
|
return self.height, self.width
|
||||||
|
|
||||||
|
def _mask(self, mask, max_mask_patches):
|
||||||
|
delta = 0
|
||||||
|
for _attempt in range(10):
|
||||||
|
target_area = random.uniform(self.mask_group_min_patches, max_mask_patches)
|
||||||
|
aspect_ratio = math.exp(random.uniform(*self.log_aspect_ratio))
|
||||||
|
height = int(round(math.sqrt(target_area * aspect_ratio)))
|
||||||
|
width = int(round(math.sqrt(target_area / aspect_ratio)))
|
||||||
|
if width < self.width and height < self.height:
|
||||||
|
top = random.randint(0, self.height - height)
|
||||||
|
left = random.randint(0, self.width - width)
|
||||||
|
|
||||||
|
num_masked = mask[top : top + height, left : left + width].sum()
|
||||||
|
# Overlap
|
||||||
|
if 0 < height * width - num_masked <= max_mask_patches:
|
||||||
|
for i in range(top, top + height):
|
||||||
|
for j in range(left, left + width):
|
||||||
|
if mask[i, j] == 0:
|
||||||
|
mask[i, j] = 1
|
||||||
|
delta += 1
|
||||||
|
|
||||||
|
if delta > 0:
|
||||||
|
break
|
||||||
|
return delta
|
||||||
|
|
||||||
|
def __call__(self):
|
||||||
|
mask = np.zeros(shape=self.get_shape(), dtype=int)
|
||||||
|
mask_count = 0
|
||||||
|
while mask_count < self.total_mask_patches:
|
||||||
|
max_mask_patches = self.total_mask_patches - mask_count
|
||||||
|
max_mask_patches = min(max_mask_patches, self.mask_group_max_patches)
|
||||||
|
|
||||||
|
delta = self._mask(mask, max_mask_patches)
|
||||||
|
if delta == 0:
|
||||||
|
break
|
||||||
|
else:
|
||||||
|
mask_count += delta
|
||||||
|
|
||||||
|
return mask
|
||||||
|
|
||||||
|
|
||||||
|
class FlavaFeatureExtractor(FeatureExtractionMixin, ImageFeatureExtractionMixin):
|
||||||
|
r"""
|
||||||
|
Constructs a FLAVA feature extractor.
|
||||||
|
|
||||||
|
This feature extractor inherits from [`FeatureExtractionMixin`] which contains most of the main methods. Users
|
||||||
|
should refer to this superclass for more information regarding those methods.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
do_resize (`bool`, *optional*, defaults to `True`):
|
||||||
|
Whether to resize the input to a certain `size`.
|
||||||
|
size (`int`, *optional*, defaults to 224):
|
||||||
|
Resize the input to the given size. Only has an effect if `do_resize` is set to `True`.
|
||||||
|
resample (`int`, *optional*, defaults to `PIL.Image.BICUBIC`):
|
||||||
|
An optional resampling filter. This can be one of `PIL.Image.NEAREST`, `PIL.Image.BOX`,
|
||||||
|
`PIL.Image.BILINEAR`, `PIL.Image.HAMMING`, `PIL.Image.BICUBIC` or `PIL.Image.LANCZOS`. Only has an effect if `do_resize` is set to `True`.
|
||||||
|
do_center_crop (`bool`, *optional*, defaults to `True`):
|
||||||
|
Whether to crop the input at the center. If the input size is smaller than `crop_size` along any edge, the
|
||||||
|
image is padded with 0's and then center cropped.
|
||||||
|
crop_size (`int`, *optional*, defaults to 224):
|
||||||
|
Desired output size when applying center-cropping. Only has an effect if `do_center_crop` is set to `True`.
|
||||||
|
do_normalize (`bool`, *optional*, defaults to `True`):
|
||||||
|
Whether or not to normalize the input with `image_mean` and `image_std`.
|
||||||
|
image_mean (`Tuple[float, float, float]`, *optional*, defaults to `[0.48145466, 0.4578275, 0.40821073]`):
|
||||||
|
The sequence of means for each channel, to be used when normalizing images.
|
||||||
|
image_std (`Tuple[float, float, float]`, *optional*, defaults to `[0.26862954, 0.26130258, 0.27577711]`):
|
||||||
|
The sequence of standard deviations for each channel, to be used when normalizing images.
|
||||||
|
input_size_patches (`int`, *optional*, defaults to 14):
|
||||||
|
Number of patches in the image in height and width direction. 14x14 = 196 total patches.
|
||||||
|
total_mask_patches (`int`, *optional*, defaults to 75):
|
||||||
|
Total number of patches that should be masked.
|
||||||
|
mask_group_min_patches (`int`, *optional*, defaults to 16):
|
||||||
|
Minimum number of patches that should be masked in a single mask group.
|
||||||
|
mask_group_max_patches (`int`, *optional*, defaults to None):
|
||||||
|
Maximum number of patches that should be masked in a single mask group. If `None`, `total_mask_patches` is used.
|
||||||
|
mask_group_min_aspect_ratio (`float`, *optional*, defaults to 0.3):
|
||||||
|
Minimum aspect ratio of the mask window.
|
||||||
|
mask_group_max_aspect_ratio (`float`, *optional*, defaults to None):
|
||||||
|
Maximum aspect ratio of the mask window. If `None`, the reciprocal of `mask_group_min_aspect_ratio` is used.
|
||||||
|
codebook_do_resize (`bool`, *optional*, defaults to `True`):
|
||||||
|
Whether to resize the input for codebook to a certain `codebook_size`.
|
||||||
|
codebook_size (`int`, *optional*, defaults to 112):
|
||||||
|
Resize the input for codebook to the given size. Only has an effect if `codebook_do_resize` is set to
|
||||||
|
`True`.
|
||||||
|
codebook_resample (`int`, *optional*, defaults to `PIL.Image.LANCZOS`):
|
||||||
|
An optional resampling filter. This can be one of `PIL.Image.NEAREST`, `PIL.Image.BOX`,
|
||||||
|
`PIL.Image.BILINEAR`, `PIL.Image.HAMMING`, `PIL.Image.BICUBIC` or `PIL.Image.LANCZOS`. Only has an effect if `codebook_do_resize` is set to `True`.
|
||||||
|
codebook_do_center_crop (`bool`, *optional*, defaults to `True`):
|
||||||
|
Whether to crop the input for codebook at the center. If the input size is smaller than
|
||||||
|
`codebook_crop_size` along any edge, the image is padded with 0's and then center cropped.
|
||||||
|
codebook_crop_size (`int`, *optional*, defaults to 112):
|
||||||
|
Desired output size for codebook input when applying center-cropping. Only has an effect if
|
||||||
|
`codebook_do_center_crop` is set to `True`.
|
||||||
|
codebook_do_normalize (`bool`, *optional*, defaults to `True`):
|
||||||
|
Whether or not to normalize the input for codebook with `codebook_image_mean` and `codebook_image_std`.
|
||||||
|
codebook_image_mean (`Tuple[float, float, float]`, *optional*, defaults to `[0, 0, 0]`):
|
||||||
|
The sequence of means for each channel, to be used when normalizing images for codebook.
|
||||||
|
codebook_image_std (`Tuple[float, float, float]`, *optional*, defaults to `[1.0, 1.0, 1.0]`):
|
||||||
|
The sequence of standard deviations for each channel, to be used when normalizing images for codebook.
|
||||||
|
|
||||||
|
"""
|
||||||
|
|
||||||
|
model_input_names = ["pixel_values"]
|
||||||
|
|
||||||
|
def __init__(
|
||||||
|
self,
|
||||||
|
do_resize: bool = True,
|
||||||
|
size: Union[int, Tuple[int, int]] = 224,
|
||||||
|
resample: int = Image.BICUBIC,
|
||||||
|
do_center_crop: bool = True,
|
||||||
|
crop_size: Union[int, Tuple[int, int]] = 224,
|
||||||
|
do_normalize: bool = True,
|
||||||
|
image_mean: Tuple[float, float, float] = FLAVA_IMAGE_MEAN,
|
||||||
|
image_std: Tuple[float, float, float] = FLAVA_IMAGE_STD,
|
||||||
|
# Mask related params
|
||||||
|
input_size_patches: int = 14,
|
||||||
|
total_mask_patches: int = 75,
|
||||||
|
mask_group_min_patches: int = 16,
|
||||||
|
mask_group_max_patches: Optional[int] = None,
|
||||||
|
mask_group_min_aspect_ratio: float = 0.3,
|
||||||
|
mask_group_max_aspect_ratio: Optional[float] = None,
|
||||||
|
# Codebook related params
|
||||||
|
codebook_do_resize: bool = True,
|
||||||
|
codebook_size: int = 112,
|
||||||
|
codebook_resample: int = Image.LANCZOS,
|
||||||
|
codebook_do_center_crop: bool = True,
|
||||||
|
codebook_crop_size: int = 112,
|
||||||
|
codebook_do_map_pixels: bool = True,
|
||||||
|
codebook_do_normalize: bool = True,
|
||||||
|
codebook_image_mean: Tuple[float, float, float] = FLAVA_CODEBOOK_MEAN,
|
||||||
|
codebook_image_std: Tuple[float, float, float] = FLAVA_CODEBOOK_STD,
|
||||||
|
**kwargs: Any,
|
||||||
|
):
|
||||||
|
super().__init__(**kwargs)
|
||||||
|
self.do_resize = do_resize
|
||||||
|
self.size = size
|
||||||
|
self.resample = resample
|
||||||
|
self.do_center_crop = do_center_crop
|
||||||
|
self.crop_size = crop_size
|
||||||
|
self.do_normalize = do_normalize
|
||||||
|
self.image_mean = image_mean
|
||||||
|
self.image_std = image_std
|
||||||
|
|
||||||
|
self.input_size_patches = input_size_patches
|
||||||
|
self.total_mask_patches = total_mask_patches
|
||||||
|
self.mask_group_min_patches = mask_group_min_patches
|
||||||
|
self.mask_group_max_patches = mask_group_max_patches
|
||||||
|
self.mask_group_min_aspect_ratio = mask_group_min_aspect_ratio
|
||||||
|
self.mask_group_max_aspect_ratio = mask_group_max_aspect_ratio
|
||||||
|
|
||||||
|
self.codebook_do_resize = codebook_do_resize
|
||||||
|
self.codebook_size = codebook_size
|
||||||
|
self.codebook_resample = codebook_resample
|
||||||
|
self.codebook_do_center_crop = codebook_do_center_crop
|
||||||
|
self.codebook_crop_size = codebook_crop_size
|
||||||
|
self.codebook_do_map_pixels = codebook_do_map_pixels
|
||||||
|
self.codebook_do_normalize = codebook_do_normalize
|
||||||
|
self.codebook_image_mean = codebook_image_mean
|
||||||
|
self.codebook_image_std = codebook_image_std
|
||||||
|
|
||||||
|
@property
|
||||||
|
@lru_cache()
|
||||||
|
def masking_generator(self):
|
||||||
|
return FlavaMaskingGenerator(
|
||||||
|
input_size=self.input_size_patches,
|
||||||
|
total_mask_patches=self.total_mask_patches,
|
||||||
|
mask_group_min_patches=self.mask_group_min_patches,
|
||||||
|
mask_group_max_patches=self.mask_group_max_patches,
|
||||||
|
mask_group_min_aspect_ratio=self.mask_group_min_aspect_ratio,
|
||||||
|
mask_group_max_aspect_ratio=self.mask_group_max_aspect_ratio,
|
||||||
|
)
|
||||||
|
|
||||||
|
def map_pixels(self, x):
|
||||||
|
return (1 - 2 * LOGIT_LAPLACE_EPS) * x + LOGIT_LAPLACE_EPS
|
||||||
|
|
||||||
|
def __call__(
|
||||||
|
self,
|
||||||
|
images: Union[
|
||||||
|
Image.Image, np.ndarray, "torch.Tensor", List[Image.Image], List[np.ndarray], List["torch.Tensor"] # noqa
|
||||||
|
],
|
||||||
|
return_image_mask: Optional[bool] = None,
|
||||||
|
return_codebook_pixels: Optional[bool] = None,
|
||||||
|
return_tensors: Optional[Union[str, TensorType]] = None,
|
||||||
|
**kwargs: Any
|
||||||
|
) -> BatchFeature:
|
||||||
|
"""
|
||||||
|
Main method to prepare for the model one or several image(s).
|
||||||
|
|
||||||
|
<Tip warning={true}>
|
||||||
|
|
||||||
|
NumPy arrays and PyTorch tensors are converted to PIL images when resizing, so passing PIL images directly is
|
||||||
|
the most efficient option.
|
||||||
|
|
||||||
|
</Tip>
|
||||||
|
|
||||||
|
Args:
|
||||||
|
images (`PIL.Image.Image`, `np.ndarray`, `torch.Tensor`, `List[PIL.Image.Image]`, `List[np.ndarray]`, `List[torch.Tensor]`):
|
||||||
|
The image or batch of images to be prepared. Each image can be a PIL image, NumPy array or PyTorch
|
||||||
|
tensor. In case of a NumPy array/PyTorch tensor, each image should be of shape (C, H, W), where C is a
|
||||||
|
number of channels, H and W are image height and width.
|
||||||
|
|
||||||
|
return_image_mask (`bool`, *optional*, defaults to None):
|
||||||
|
If True, the processor will also return `bool_masked_pos`, a mask over the image's patches marking which patches are masked.
|
||||||
|
|
||||||
|
return_codebook_pixels (`bool`, *optional*, defaults to None):
|
||||||
|
If True, the processor will return `codebook_pixel_values` providing image pixels to be used with the
|
||||||
|
default FLAVA codebook. Used in pretraining by Masked Image Modeling (MIM) loss.
|
||||||
|
|
||||||
|
return_tensors (`str` or [`~utils.TensorType`], *optional*, defaults to None):
|
||||||
|
If set, will return tensors of a particular framework. Acceptable values are:
|
||||||
|
|
||||||
|
- `'tf'`: Return TensorFlow `tf.constant` objects.
|
||||||
|
- `'pt'`: Return PyTorch `torch.Tensor` objects.
|
||||||
|
- `'np'`: Return NumPy `np.ndarray` objects.
|
||||||
|
- `'jax'`: Return JAX `jnp.ndarray` objects.
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
[`BatchFeature`]: A [`BatchFeature`] with the following fields:
|
||||||
|
|
||||||
|
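- **pixel_values** -- Pixel values to be fed to a model.
|
||||||
|
- **codebook_pixel_values** -- Pixel values prepared for the image codebook (returned when `return_codebook_pixels` is True).
|
||||||
|
- **bool_masked_pos** -- Patch masks, one per image (returned when `return_image_mask` is True).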
|
||||||
|
"""
|
||||||
|
# Input type checking for clearer error messages
|
||||||
|
if isinstance(images, (list, tuple)) and len(images) != 0:
|
||||||
|
self._ensure_format_supported(images[0])
|
||||||
|
else:
|
||||||
|
self._ensure_format_supported(images)
|
||||||
|
|
||||||
|
is_batched = bool(
|
||||||
|
isinstance(images, (list, tuple))
|
||||||
|
and (isinstance(images[0], (Image.Image, np.ndarray)) or is_torch_tensor(images[0]))
|
||||||
|
)
|
||||||
|
|
||||||
|
if not is_batched:
|
||||||
|
images = [images]
|
||||||
|
|
||||||
|
images_for_codebook = images
|
||||||
|
|
||||||
|
# transformations (resizing + center cropping + normalization)
|
||||||
|
if self.do_resize and self.size is not None and self.resample is not None:
|
||||||
|
images = [self.resize(image=image, size=self.size, resample=self.resample) for image in images]
|
||||||
|
if self.do_center_crop and self.crop_size is not None:
|
||||||
|
images = [self.center_crop(image, self.crop_size) for image in images]
|
||||||
|
if self.do_normalize:
|
||||||
|
images = [self.normalize(image=image, mean=self.image_mean, std=self.image_std) for image in images]
|
||||||
|
# return as BatchFeature
|
||||||
|
data = {"pixel_values": images}
|
||||||
|
|
||||||
|
if return_codebook_pixels:
|
||||||
|
images = images_for_codebook
|
||||||
|
if self.codebook_do_resize and self.codebook_size is not None and self.codebook_resample is not None:
|
||||||
|
images = [
|
||||||
|
self.resize(image=image, size=self.codebook_size, resample=self.codebook_resample)
|
||||||
|
for image in images
|
||||||
|
]
|
||||||
|
if self.codebook_do_center_crop and self.codebook_crop_size is not None:
|
||||||
|
images = [self.center_crop(image, self.codebook_crop_size) for image in images]
|
||||||
|
if self.codebook_do_normalize:
|
||||||
|
images = [
|
||||||
|
self.normalize(image=image, mean=self.codebook_image_mean, std=self.codebook_image_std)
|
||||||
|
for image in images
|
||||||
|
]
|
||||||
|
if self.codebook_do_map_pixels:
|
||||||
|
images = [self.map_pixels(image) for image in images]
|
||||||
|
|
||||||
|
data["codebook_pixel_values"] = images
|
||||||
|
|
||||||
|
if return_image_mask:
|
||||||
|
masks = [self.masking_generator() for _ in images]
|
||||||
|
data["bool_masked_pos"] = masks
|
||||||
|
|
||||||
|
encoded_inputs = BatchFeature(data=data, tensor_type=return_tensors)
|
||||||
|
|
||||||
|
return encoded_inputs
|
||||||
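The `map_pixels` step in the feature extractor above is a single affine rescaling. A minimal sketch of its effect, assuming the inputs already lie in [0, 1] and assuming `LOGIT_LAPLACE_EPS = 0.1` (the constant's definition is not shown in this part of the diff, so the value here is only illustrative):

```python
import numpy as np

# Assumed value for illustration; the real constant is defined elsewhere in the file.
LOGIT_LAPLACE_EPS = 0.1


def map_pixels(x: np.ndarray) -> np.ndarray:
    """Affinely squeeze values from [0, 1] into [eps, 1 - eps]."""
    return (1 - 2 * LOGIT_LAPLACE_EPS) * x + LOGIT_LAPLACE_EPS


# The endpoints 0 and 1 move inward by eps; the midpoint 0.5 stays fixed.
pixels = np.array([0.0, 0.5, 1.0])
mapped = map_pixels(pixels)
```

With these assumptions the output never reaches exactly 0 or 1.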
2095
src/transformers/models/flava/modeling_flava.py
Normal file
File diff suppressed because it is too large
124
src/transformers/models/flava/processing_flava.py
Normal file
@@ -0,0 +1,124 @@
|
|||||||
|
# coding=utf-8
|
||||||
|
# Copyright 2022 Meta Platforms authors and The HuggingFace Team. All rights reserved.
|
||||||
|
#
|
||||||
|
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||||
|
# you may not use this file except in compliance with the License.
|
||||||
|
# You may obtain a copy of the License at
|
||||||
|
#
|
||||||
|
# http://www.apache.org/licenses/LICENSE-2.0
|
||||||
|
#
|
||||||
|
# Unless required by applicable law or agreed to in writing, software
|
||||||
|
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||||
|
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||||
|
# See the License for the specific language governing permissions and
|
||||||
|
# limitations under the License.
|
||||||
|
"""
|
||||||
|
Image/Text processor class for FLAVA
|
||||||
|
"""
|
||||||
|
from typing import List, Optional, Union
|
||||||
|
|
||||||
|
from ...image_utils import ImageInput
|
||||||
|
from ...processing_utils import ProcessorMixin
|
||||||
|
from ...tokenization_utils_base import BatchEncoding, PaddingStrategy, PreTokenizedInput, TextInput, TruncationStrategy
|
||||||
|
from ...utils import TensorType
|
||||||
|
|
||||||
|
|
||||||
|
class FlavaProcessor(ProcessorMixin):
|
||||||
|
r"""
|
||||||
|
Constructs a FLAVA processor which wraps a FLAVA feature extractor and a FLAVA tokenizer into a single processor.
|
||||||
|
|
||||||
|
[`FlavaProcessor`] offers all the functionalities of [`FlavaFeatureExtractor`] and [`BertTokenizerFast`]. See the
|
||||||
|
[`~FlavaProcessor.__call__`] and [`~FlavaProcessor.decode`] for more information.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
feature_extractor ([`FlavaFeatureExtractor`]): The feature extractor is a required input.
|
||||||
|
tokenizer ([`BertTokenizerFast`]): The tokenizer is a required input.
|
||||||
|
"""
|
||||||
|
feature_extractor_class = "FlavaFeatureExtractor"
|
||||||
|
tokenizer_class = ("BertTokenizer", "BertTokenizerFast")
|
||||||
|
|
||||||
|
def __init__(self, feature_extractor, tokenizer):
|
||||||
|
super().__init__(feature_extractor, tokenizer)
|
||||||
|
self.current_processor = self.feature_extractor
|
||||||
|
|
||||||
|
def __call__(
|
||||||
|
self,
|
||||||
|
images: Optional[ImageInput] = None,
|
||||||
|
text: Optional[Union[TextInput, PreTokenizedInput, List[TextInput], List[PreTokenizedInput]]] = None,
|
||||||
|
add_special_tokens: bool = True,
|
||||||
|
padding: Union[bool, str, PaddingStrategy] = False,
|
||||||
|
truncation: Union[bool, str, TruncationStrategy] = False,
|
||||||
|
max_length: Optional[int] = None,
|
||||||
|
stride: int = 0,
|
||||||
|
pad_to_multiple_of: Optional[int] = None,
|
||||||
|
return_image_mask: Optional[bool] = None,
|
||||||
|
return_codebook_pixels: Optional[bool] = None,
|
||||||
|
return_token_type_ids: Optional[bool] = None,
|
||||||
|
return_attention_mask: Optional[bool] = None,
|
||||||
|
return_overflowing_tokens: bool = False,
|
||||||
|
return_special_tokens_mask: bool = False,
|
||||||
|
return_offsets_mapping: bool = False,
|
||||||
|
return_length: bool = False,
|
||||||
|
verbose: bool = True,
|
||||||
|
return_tensors: Optional[Union[str, TensorType]] = None,
|
||||||
|
**kwargs
|
||||||
|
):
|
||||||
|
"""
|
||||||
|
This method uses the [`FlavaFeatureExtractor.__call__`] method to prepare image(s) for the model, and
|
||||||
|
[`BertTokenizerFast.__call__`] to prepare text for the model.
|
||||||
|
|
||||||
|
Please refer to the docstring of the above two methods for more information.
|
||||||
|
"""
|
||||||
|
|
||||||
|
if text is None and images is None:
|
||||||
|
raise ValueError("You have to specify either text or images. Both cannot be none.")
|
||||||
|
|
||||||
|
if text is not None:
|
||||||
|
encoding = self.tokenizer(
|
||||||
|
text=text,
|
||||||
|
add_special_tokens=add_special_tokens,
|
||||||
|
padding=padding,
|
||||||
|
truncation=truncation,
|
||||||
|
max_length=max_length,
|
||||||
|
stride=stride,
|
||||||
|
pad_to_multiple_of=pad_to_multiple_of,
|
||||||
|
return_token_type_ids=return_token_type_ids,
|
||||||
|
return_attention_mask=return_attention_mask,
|
||||||
|
return_overflowing_tokens=return_overflowing_tokens,
|
||||||
|
return_special_tokens_mask=return_special_tokens_mask,
|
||||||
|
return_offsets_mapping=return_offsets_mapping,
|
||||||
|
return_length=return_length,
|
||||||
|
verbose=verbose,
|
||||||
|
return_tensors=return_tensors,
|
||||||
|
**kwargs,
|
||||||
|
)
|
||||||
|
if images is not None:
|
||||||
|
image_features = self.feature_extractor(
|
||||||
|
images,
|
||||||
|
return_image_mask=return_image_mask,
|
||||||
|
return_codebook_pixels=return_codebook_pixels,
|
||||||
|
return_tensors=return_tensors,
|
||||||
|
**kwargs,
|
||||||
|
)
|
||||||
|
|
||||||
|
if text is not None and images is not None:
|
||||||
|
encoding.update(image_features)
|
||||||
|
return encoding
|
||||||
|
elif text is not None:
|
||||||
|
return encoding
|
||||||
|
else:
|
||||||
|
return BatchEncoding(data=dict(**image_features), tensor_type=return_tensors)
|
||||||
|
|
||||||
|
def batch_decode(self, *args, **kwargs):
|
||||||
|
"""
|
||||||
|
This method forwards all its arguments to BertTokenizerFast's [`~PreTrainedTokenizer.batch_decode`]. Please
|
||||||
|
refer to the docstring of this method for more information.
|
||||||
|
"""
|
||||||
|
return self.tokenizer.batch_decode(*args, **kwargs)
|
||||||
|
|
||||||
|
def decode(self, *args, **kwargs):
|
||||||
|
"""
|
||||||
|
This method forwards all its arguments to BertTokenizerFast's [`~PreTrainedTokenizer.decode`]. Please refer to
|
||||||
|
the docstring of this method for more information.
|
||||||
|
"""
|
||||||
|
return self.tokenizer.decode(*args, **kwargs)
|
||||||
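The return logic of `FlavaProcessor.__call__` above can be sketched with plain dictionaries. `combine_outputs` is a hypothetical helper, not part of the library; it only mirrors the three branches (text and images, text only, images only) and the error case.

```python
from typing import Any, Dict, Optional


def combine_outputs(
    text_encoding: Optional[Dict[str, Any]],
    image_features: Optional[Dict[str, Any]],
) -> Dict[str, Any]:
    """Hypothetical stand-in for the branch logic at the end of __call__."""
    if text_encoding is None and image_features is None:
        raise ValueError("You have to specify either text or images. Both cannot be none.")
    if text_encoding is not None and image_features is not None:
        # Both modalities: image keys are added on top of the text encoding.
        merged = dict(text_encoding)
        merged.update(image_features)
        return merged
    if text_encoding is not None:
        return dict(text_encoding)
    return dict(image_features)


both = combine_outputs({"input_ids": [1, 2]}, {"pixel_values": [[0.5]]})
text_only = combine_outputs({"input_ids": [1, 2]}, None)
image_only = combine_outputs(None, {"pixel_values": [[0.5]]})
```

In the real method the merged result is a `BatchEncoding` rather than a plain `dict`; the sketch drops that detail to stay self-contained.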
@@ -1787,6 +1787,58 @@ class FlaubertWithLMHeadModel(metaclass=DummyObject):
|
|||||||
requires_backends(self, ["torch"])
|
||||||
|
|
||||||
|
|
||||||
|
FLAVA_PRETRAINED_MODEL_ARCHIVE_LIST = None
|
||||||
|
|
||||||
|
|
||||||
|
class FlavaForPreTraining(metaclass=DummyObject):
|
||||||
|
_backends = ["torch"]
|
||||||
|
|
||||||
|
def __init__(self, *args, **kwargs):
|
||||||
|
requires_backends(self, ["torch"])
|
||||||
|
|
||||||
|
|
||||||
|
class FlavaImageCodebook(metaclass=DummyObject):
|
||||||
|
_backends = ["torch"]
|
||||||
|
|
||||||
|
def __init__(self, *args, **kwargs):
|
||||||
|
requires_backends(self, ["torch"])
|
||||||
|
|
||||||
|
|
||||||
|
class FlavaImageModel(metaclass=DummyObject):
|
||||||
|
_backends = ["torch"]
|
||||||
|
|
||||||
|
def __init__(self, *args, **kwargs):
|
||||||
|
requires_backends(self, ["torch"])
|
||||||
|
|
||||||
|
|
||||||
|
class FlavaModel(metaclass=DummyObject):
|
||||||
|
_backends = ["torch"]
|
||||||
|
|
||||||
|
def __init__(self, *args, **kwargs):
|
||||||
|
requires_backends(self, ["torch"])
|
||||||
|
|
||||||
|
|
||||||
|
class FlavaMultimodalModel(metaclass=DummyObject):
|
||||||
|
_backends = ["torch"]
|
||||||
|
|
||||||
|
def __init__(self, *args, **kwargs):
|
||||||
|
requires_backends(self, ["torch"])
|
||||||
|
|
||||||
|
|
||||||
|
class FlavaPreTrainedModel(metaclass=DummyObject):
|
||||||
|
_backends = ["torch"]
|
||||||
|
|
||||||
|
def __init__(self, *args, **kwargs):
|
||||||
|
requires_backends(self, ["torch"])
|
||||||
|
|
||||||
|
|
||||||
|
class FlavaTextModel(metaclass=DummyObject):
|
||||||
|
_backends = ["torch"]
|
||||||
|
|
||||||
|
def __init__(self, *args, **kwargs):
|
||||||
|
requires_backends(self, ["torch"])
|
||||||
|
|
||||||
|
|
||||||
FNET_PRETRAINED_MODEL_ARCHIVE_LIST = None
|
||||||
|
|
||||||
|
|
||||||
|
|||||||
@@ -59,6 +59,20 @@ class DPTFeatureExtractor(metaclass=DummyObject):
|
|||||||
requires_backends(self, ["vision"])
|
||||||
|
|
||||||
|
|
||||||
|
class FlavaFeatureExtractor(metaclass=DummyObject):
|
||||||
|
_backends = ["vision"]
|
||||||
|
|
||||||
|
def __init__(self, *args, **kwargs):
|
||||||
|
requires_backends(self, ["vision"])
|
||||||
|
|
||||||
|
|
||||||
|
class FlavaProcessor(metaclass=DummyObject):
|
||||||
|
_backends = ["vision"]
|
||||||
|
|
||||||
|
def __init__(self, *args, **kwargs):
|
||||||
|
requires_backends(self, ["vision"])
|
||||||
|
|
||||||
|
|
||||||
class GLPNFeatureExtractor(metaclass=DummyObject):
|
||||||
_backends = ["vision"]
|
||||||
|
|
||||||
|
|||||||
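The dummy classes in the two hunks above all follow one pattern: a placeholder that fails with an import error as soon as it is used without its backend. A rough sketch of that idea follows. `PlaceholderMeta`, `require_backends_sketch` and `AVAILABLE_BACKENDS` are hypothetical stand-ins written for illustration; they are not the library's `DummyObject` or `requires_backends`, whose implementations are not shown in this diff.

```python
# Hypothetical registry of installed backends, for illustration only.
AVAILABLE_BACKENDS = {"torch"}


def require_backends_sketch(obj, backends):
    """Raise ImportError naming the missing backends, if any."""
    name = obj.__name__ if hasattr(obj, "__name__") else obj.__class__.__name__
    missing = [b for b in backends if b not in AVAILABLE_BACKENDS]
    if missing:
        raise ImportError(f"{name} requires the following backends: {', '.join(missing)}")


class PlaceholderMeta(type):
    """Make class-level attribute access fail the same way as instantiation."""

    def __getattr__(cls, key):
        # __getattr__ only runs when normal attribute lookup has already failed.
        if key.startswith("_"):
            raise AttributeError(key)
        require_backends_sketch(cls, cls._backends)
        raise AttributeError(key)


class PlaceholderFeatureExtractor(metaclass=PlaceholderMeta):
    _backends = ["vision"]

    def __init__(self, *args, **kwargs):
        require_backends_sketch(self, self._backends)


class PlaceholderModel(metaclass=PlaceholderMeta):
    _backends = ["torch"]

    def __init__(self, *args, **kwargs):
        require_backends_sketch(self, self._backends)
```

Because `"vision"` is absent from the assumed registry, both `PlaceholderFeatureExtractor()` and `PlaceholderFeatureExtractor.from_pretrained` raise `ImportError`, while `PlaceholderModel()` constructs normally.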
0
tests/models/flava/__init__.py
Normal file
347
tests/models/flava/test_feature_extraction_flava.py
Normal file
@@ -0,0 +1,347 @@
|
|||||||
|
# coding=utf-8
|
||||||
|
# Copyright 2022 Meta Platforms authors and HuggingFace Inc.
|
||||||
|
#
|
||||||
|
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||||
|
# you may not use this file except in compliance with the License.
|
||||||
|
# You may obtain a copy of the License at
|
||||||
|
#
|
||||||
|
# http://www.apache.org/licenses/LICENSE-2.0
|
||||||
|
#
|
||||||
|
# Unless required by applicable law or agreed to in writing, software
|
||||||
|
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||||
|
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||||
|
# See the License for the specific language governing permissions and
|
||||||
|
# limitations under the License.
|
||||||
|
|
||||||
|
import random
|
||||||
|
import unittest
|
||||||
|
|
||||||
|
import numpy as np
|
||||||
|
|
||||||
|
from transformers.testing_utils import require_torch, require_vision
|
||||||
|
from transformers.utils import is_torch_available, is_vision_available
|
||||||
|
|
||||||
|
from ...test_feature_extraction_common import FeatureExtractionSavingTestMixin, prepare_image_inputs
|
||||||
|
|
||||||
|
|
||||||
|
if is_torch_available():
|
||||||
|
import torch
|
||||||
|
|
||||||
|
if is_vision_available():
|
||||||
|
from PIL import Image
|
||||||
|
|
||||||
|
from transformers import FlavaFeatureExtractor
|
||||||
|
from transformers.models.flava.feature_extraction_flava import (
|
||||||
|
FLAVA_CODEBOOK_MEAN,
|
||||||
|
FLAVA_CODEBOOK_STD,
|
||||||
|
FLAVA_IMAGE_MEAN,
|
||||||
|
FLAVA_IMAGE_STD,
|
||||||
|
)
|
||||||
|
else:
|
||||||
|
FLAVA_IMAGE_MEAN = FLAVA_IMAGE_STD = FLAVA_CODEBOOK_MEAN = FLAVA_CODEBOOK_STD = None
|
||||||
|
|
||||||
|
|
||||||
|
class FlavaFeatureExtractionTester(unittest.TestCase):
|
||||||
|
def __init__(
|
||||||
|
self,
|
||||||
|
parent,
|
||||||
|
batch_size=7,
|
||||||
|
num_channels=3,
|
||||||
|
min_resolution=30,
|
||||||
|
max_resolution=400,
|
||||||
|
do_resize=True,
|
||||||
|
size=224,
|
||||||
|
do_center_crop=True,
|
||||||
|
crop_size=224,
|
||||||
|
resample=None,
|
||||||
|
do_normalize=True,
|
||||||
|
image_mean=FLAVA_IMAGE_MEAN,
|
||||||
|
image_std=FLAVA_IMAGE_STD,
|
||||||
|
input_size_patches=14,
|
||||||
|
total_mask_patches=75,
|
||||||
|
mask_group_max_patches=None,
|
||||||
|
mask_group_min_patches=16,
|
||||||
|
mask_group_min_aspect_ratio=0.3,
|
||||||
|
mask_group_max_aspect_ratio=None,
|
||||||
|
codebook_do_resize=True,
|
||||||
|
codebook_size=112,
|
||||||
|
codebook_resample=None,
|
||||||
|
codebook_do_center_crop=True,
|
||||||
|
codebook_crop_size=112,
|
||||||
|
codebook_do_map_pixels=True,
|
||||||
|
codebook_do_normalize=True,
|
||||||
|
codebook_image_mean=FLAVA_CODEBOOK_MEAN,
|
||||||
|
codebook_image_std=FLAVA_CODEBOOK_STD,
|
||||||
|
):
|
||||||
|
self.parent = parent
|
||||||
|
self.batch_size = batch_size
|
||||||
|
self.num_channels = num_channels
|
||||||
|
self.do_resize = do_resize
|
||||||
|
self.min_resolution = min_resolution
|
||||||
|
self.max_resolution = max_resolution
|
||||||
|
self.size = size
|
||||||
|
self.resample = resample if resample is not None else Image.BICUBIC
|
||||||
|
self.do_normalize = do_normalize
|
||||||
|
self.image_mean = image_mean
|
||||||
|
self.image_std = image_std
|
||||||
|
self.do_center_crop = do_center_crop
|
||||||
|
self.crop_size = crop_size
|
||||||
|
|
||||||
|
self.input_size_patches = input_size_patches
|
||||||
|
self.total_mask_patches = total_mask_patches
|
||||||
|
self.mask_group_max_patches = mask_group_max_patches
|
||||||
|
self.mask_group_min_patches = mask_group_min_patches
|
||||||
|
self.mask_group_min_aspect_ratio = mask_group_min_aspect_ratio
|
||||||
|
self.mask_group_max_aspect_ratio = mask_group_max_aspect_ratio
|
||||||
|
|
||||||
|
self.codebook_do_resize = codebook_do_resize
|
||||||
|
self.codebook_size = codebook_size
|
||||||
|
self.codebook_resample = codebook_resample if codebook_resample is not None else Image.LANCZOS
|
||||||
|
self.codebook_do_center_crop = codebook_do_center_crop
|
||||||
|
self.codebook_crop_size = codebook_crop_size
|
||||||
|
self.codebook_do_map_pixels = codebook_do_map_pixels
|
||||||
|
self.codebook_do_normalize = codebook_do_normalize
|
||||||
|
self.codebook_image_mean = codebook_image_mean
|
||||||
|
self.codebook_image_std = codebook_image_std
|
||||||
|
|
||||||
|
def prepare_feat_extract_dict(self):
|
||||||
|
return {
|
||||||
|
"image_mean": self.image_mean,
|
||||||
|
"image_std": self.image_std,
|
||||||
|
"do_normalize": self.do_normalize,
|
||||||
|
"do_resize": self.do_resize,
|
||||||
|
"size": self.size,
|
||||||
|
"resample": self.resample,
|
||||||
|
"do_center_crop": self.do_center_crop,
|
||||||
|
"crop_size": self.crop_size,
|
||||||
|
"input_size_patches": self.input_size_patches,
|
||||||
|
"total_mask_patches": self.total_mask_patches,
|
||||||
|
"mask_group_max_patches": self.mask_group_max_patches,
|
||||||
|
"mask_group_min_patches": self.mask_group_min_patches,
|
||||||
|
"mask_group_min_aspect_ratio": self.mask_group_min_aspect_ratio,
|
||||||
|
"mask_group_max_aspect_ratio": self.mask_group_max_aspect_ratio,
|
||||||
|
"codebook_do_resize": self.codebook_do_resize,
|
||||||
|
"codebook_size": self.codebook_size,
|
||||||
|
"codebook_resample": self.codebook_resample,
|
||||||
|
"codebook_do_center_crop": self.codebook_do_center_crop,
|
||||||
|
"codebook_crop_size": self.codebook_crop_size,
|
||||||
|
"codebook_do_map_pixels": self.codebook_do_map_pixels,
|
||||||
|
"codebook_do_normalize": self.codebook_do_normalize,
|
||||||
|
"codebook_image_mean": self.codebook_image_mean,
|
||||||
|
"codebook_image_std": self.codebook_image_std,
|
||||||
|
}
|
||||||
|
|
||||||
|
def get_expected_image_size(self):
|
||||||
|
return (self.size, self.size) if not isinstance(self.size, tuple) else self.size
|
||||||
|
|
||||||
|
def get_expected_mask_size(self):
|
||||||
|
return (
|
||||||
|
(self.input_size_patches, self.input_size_patches)
|
||||||
|
if not isinstance(self.input_size_patches, tuple)
|
||||||
|
else self.input_size_patches
|
||||||
|
)
|
||||||
|
|
||||||
|
def get_expected_codebook_image_size(self):
|
||||||
|
if not isinstance(self.codebook_size, tuple):
|
||||||
|
return (self.codebook_size, self.codebook_size)
|
||||||
|
else:
|
||||||
|
return self.codebook_size
|
||||||
|
|
||||||
|
|
||||||
|
@require_torch
|
||||||
|
@require_vision
|
||||||
|
class FlavaFeatureExtractionTest(FeatureExtractionSavingTestMixin, unittest.TestCase):
|
||||||
|
|
||||||
|
feature_extraction_class = FlavaFeatureExtractor if is_vision_available() else None
|
||||||
|
maxDiff = None
|
||||||
|
|
||||||
|
def setUp(self):
|
||||||
|
self.feature_extract_tester = FlavaFeatureExtractionTester(self)
|
||||||
|
|
||||||
|
@property
|
||||||
|
def feat_extract_dict(self):
|
||||||
|
return self.feature_extract_tester.prepare_feat_extract_dict()
|
||||||
|
|
||||||
|
def test_feat_extract_properties(self):
|
||||||
|
feature_extractor = self.feature_extraction_class(**self.feat_extract_dict)
|
||||||
|
self.assertTrue(hasattr(feature_extractor, "image_mean"))
|
||||||
|
self.assertTrue(hasattr(feature_extractor, "image_std"))
|
||||||
|
self.assertTrue(hasattr(feature_extractor, "do_normalize"))
|
||||||
|
self.assertTrue(hasattr(feature_extractor, "do_resize"))
|
||||||
|
self.assertTrue(hasattr(feature_extractor, "resample"))
|
||||||
|
self.assertTrue(hasattr(feature_extractor, "crop_size"))
|
||||||
|
self.assertTrue(hasattr(feature_extractor, "do_center_crop"))
|
||||||
|
self.assertTrue(hasattr(feature_extractor, "masking_generator"))
|
||||||
|
self.assertTrue(hasattr(feature_extractor, "codebook_do_resize"))
|
||||||
|
self.assertTrue(hasattr(feature_extractor, "codebook_size"))
|
||||||
|
self.assertTrue(hasattr(feature_extractor, "codebook_resample"))
|
||||||
|
self.assertTrue(hasattr(feature_extractor, "codebook_do_center_crop"))
|
||||||
|
self.assertTrue(hasattr(feature_extractor, "codebook_crop_size"))
|
||||||
|
self.assertTrue(hasattr(feature_extractor, "codebook_do_map_pixels"))
|
||||||
|
self.assertTrue(hasattr(feature_extractor, "codebook_do_normalize"))
|
||||||
|
self.assertTrue(hasattr(feature_extractor, "codebook_image_mean"))
|
||||||
|
self.assertTrue(hasattr(feature_extractor, "codebook_image_std"))
|
||||||
|
|
||||||
|
def test_batch_feature(self):
|
||||||
|
pass
|
||||||
|
|
||||||
|
def test_call_pil(self):
|
||||||
|
# Initialize feature_extractor
|
||||||
|
feature_extractor = self.feature_extraction_class(**self.feat_extract_dict)
|
||||||
|
# create random PIL images
|
||||||
|
image_inputs = prepare_image_inputs(self.feature_extract_tester, equal_resolution=False)
|
||||||
|
for image in image_inputs:
|
||||||
|
self.assertIsInstance(image, Image.Image)
|
||||||
|
|
||||||
|
# Test not batched input
|
||||||
|
encoded_images = feature_extractor(image_inputs[0], return_tensors="pt")
|
||||||
|
|
||||||
|
# Test no bool masked pos
|
||||||
|
self.assertFalse("bool_masked_pos" in encoded_images)
|
||||||
|
|
||||||
|
expected_height, expected_width = self.feature_extract_tester.get_expected_image_size()
|
||||||
|
|
||||||
|
self.assertEqual(
|
||||||
|
encoded_images.pixel_values.shape,
|
||||||
|
(1, self.feature_extract_tester.num_channels, expected_height, expected_width),
|
||||||
|
)
|
||||||
|
|
||||||
|
# Test batched
|
||||||
|
encoded_images = feature_extractor(image_inputs, return_tensors="pt")
|
||||||
|
expected_height, expected_width = self.feature_extract_tester.get_expected_image_size()
|
||||||
|
|
||||||
|
# Test no bool masked pos
|
||||||
|
self.assertFalse("bool_masked_pos" in encoded_images)
|
||||||
|
|
||||||
|
self.assertEqual(
|
||||||
|
encoded_images.pixel_values.shape,
|
||||||
|
(
|
||||||
|
self.feature_extract_tester.batch_size,
|
||||||
|
self.feature_extract_tester.num_channels,
|
||||||
|
expected_height,
|
||||||
|
expected_width,
|
||||||
|
),
|
||||||
|
)
|
||||||
|
|
||||||
|
def _test_call_framework(self, instance_class, prepare_kwargs):
|
||||||
|
# Initialize feature_extractor
|
||||||
|
feature_extractor = self.feature_extraction_class(**self.feat_extract_dict)
|
||||||
|
# create random tensors
|
||||||
|
image_inputs = prepare_image_inputs(self.feature_extract_tester, equal_resolution=False, **prepare_kwargs)
|
||||||
|
for image in image_inputs:
|
||||||
|
self.assertIsInstance(image, instance_class)
|
||||||
|
|
||||||
|
# Test not batched input
|
||||||
|
encoded_images = feature_extractor(image_inputs[0], return_tensors="pt")
|
||||||
|
|
||||||
|
expected_height, expected_width = self.feature_extract_tester.get_expected_image_size()
|
||||||
|
self.assertEqual(
|
||||||
|
encoded_images.pixel_values.shape,
|
||||||
|
(1, self.feature_extract_tester.num_channels, expected_height, expected_width),
|
||||||
|
)
|
||||||
|
|
||||||
|
encoded_images = feature_extractor(image_inputs, return_image_mask=True, return_tensors="pt")
|
||||||
|
|
||||||
|
expected_height, expected_width = self.feature_extract_tester.get_expected_image_size()
|
||||||
|
self.assertEqual(
|
||||||
|
encoded_images.pixel_values.shape,
|
||||||
|
(
|
||||||
|
self.feature_extract_tester.batch_size,
|
||||||
|
self.feature_extract_tester.num_channels,
|
||||||
|
expected_height,
|
||||||
|
expected_width,
|
||||||
|
),
|
||||||
|
)
|
||||||
|
|
||||||
|
expected_height, expected_width = self.feature_extract_tester.get_expected_mask_size()
|
||||||
|
self.assertEqual(
|
||||||
|
encoded_images.bool_masked_pos.shape,
|
||||||
|
(
|
||||||
|
self.feature_extract_tester.batch_size,
|
||||||
|
expected_height,
|
||||||
|
expected_width,
|
||||||
|
),
|
||||||
|
)
|
||||||
|
|
||||||
|
# Test batched
|
||||||
|
encoded_images = feature_extractor(image_inputs, return_tensors="pt").pixel_values
|
||||||
|
|
||||||
|
expected_height, expected_width = self.feature_extract_tester.get_expected_image_size()
|
||||||
|
self.assertEqual(
|
||||||
|
encoded_images.shape,
|
||||||
|
(
|
||||||
|
self.feature_extract_tester.batch_size,
|
||||||
|
self.feature_extract_tester.num_channels,
|
||||||
|
expected_height,
|
||||||
|
expected_width,
|
||||||
|
),
|
||||||
|
)
|
||||||
|
|
||||||
|
# Test masking
|
||||||
|
encoded_images = feature_extractor(image_inputs, return_image_mask=True, return_tensors="pt")
|
||||||
|
|
||||||
|
expected_height, expected_width = self.feature_extract_tester.get_expected_image_size()
|
||||||
|
self.assertEqual(
|
||||||
|
encoded_images.pixel_values.shape,
|
||||||
|
(
|
||||||
|
self.feature_extract_tester.batch_size,
|
||||||
|
self.feature_extract_tester.num_channels,
|
||||||
|
expected_height,
|
||||||
|
expected_width,
|
||||||
|
),
|
||||||
|
)
|
||||||
|
|
||||||
|
expected_height, expected_width = self.feature_extract_tester.get_expected_mask_size()
|
||||||
|
self.assertEqual(
|
||||||
|
encoded_images.bool_masked_pos.shape,
|
||||||
|
(
|
||||||
|
self.feature_extract_tester.batch_size,
|
||||||
|
expected_height,
|
||||||
|
expected_width,
|
||||||
|
),
|
||||||
|
)
|
||||||
|
|
||||||
|
def test_call_numpy(self):
|
||||||
|
self._test_call_framework(np.ndarray, prepare_kwargs={"numpify": True})
|
||||||
|
|
||||||
|
def test_call_pytorch(self):
|
||||||
|
self._test_call_framework(torch.Tensor, prepare_kwargs={"torchify": True})
|
||||||
|
|
||||||
|
def test_masking(self):
|
||||||
|
# Initialize feature_extractor
|
||||||
|
random.seed(1234)
|
||||||
|
feature_extractor = self.feature_extraction_class(**self.feat_extract_dict)
|
||||||
|
image_inputs = prepare_image_inputs(self.feature_extract_tester, equal_resolution=False, torchify=True)
|
||||||
|
|
||||||
|
# Test not batched input
|
||||||
|
encoded_images = feature_extractor(image_inputs[0], return_image_mask=True, return_tensors="pt")
|
||||||
|
self.assertEqual(encoded_images.bool_masked_pos.sum().item(), 75)
|
||||||
|
|
||||||
|
def test_codebook_pixels(self):
|
||||||
|
# Initialize feature_extractor
|
||||||
|
feature_extractor = self.feature_extraction_class(**self.feat_extract_dict)
|
||||||
|
# create random PIL images
|
||||||
|
image_inputs = prepare_image_inputs(self.feature_extract_tester, equal_resolution=False)
|
||||||
|
for image in image_inputs:
|
||||||
|
self.assertIsInstance(image, Image.Image)
|
||||||
|
|
||||||
|
# Test not batched input
|
||||||
|
encoded_images = feature_extractor(image_inputs[0], return_codebook_pixels=True, return_tensors="pt")
|
||||||
|
expected_height, expected_width = self.feature_extract_tester.get_expected_codebook_image_size()
|
||||||
|
self.assertEqual(
|
||||||
|
encoded_images.codebook_pixel_values.shape,
|
||||||
|
(1, self.feature_extract_tester.num_channels, expected_height, expected_width),
|
||||||
|
)
|
||||||
|
|
||||||
|
# Test batched
|
||||||
|
encoded_images = feature_extractor(image_inputs, return_codebook_pixels=True, return_tensors="pt")
|
||||||
|
expected_height, expected_width = self.feature_extract_tester.get_expected_codebook_image_size()
|
||||||
|
self.assertEqual(
|
||||||
|
encoded_images.codebook_pixel_values.shape,
|
||||||
|
(
|
||||||
|
self.feature_extract_tester.batch_size,
|
||||||
|
self.feature_extract_tester.num_channels,
|
||||||
|
expected_height,
|
||||||
|
expected_width,
|
||||||
|
),
|
||||||
|
)
|
||||||
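The tests above check only the contract of the returned mask: its shape matches the patch grid and, in `test_masking`, exactly `total_mask_patches` entries are set. A simplified stand-in that satisfies the same contract is sketched below. It samples positions uniformly at random, which is an assumption made for brevity and is not how `FlavaMaskingGenerator` groups patches; `random_patch_mask` is a hypothetical name.

```python
import random

import numpy as np


def random_patch_mask(input_size=14, total_mask_patches=75, seed=None):
    """Return a boolean (height, width) mask with exactly total_mask_patches True entries."""
    rng = random.Random(seed)
    # Accept either a single int or a (height, width) tuple, as the test helpers do.
    height, width = input_size if isinstance(input_size, tuple) else (input_size, input_size)
    # Sampling without replacement guarantees the exact number of masked patches.
    chosen = rng.sample(range(height * width), total_mask_patches)
    mask = np.zeros(height * width, dtype=bool)
    mask[chosen] = True
    return mask.reshape(height, width)


mask = random_patch_mask(seed=1234)
```

The defaults (a 14 x 14 grid with 75 masked patches) match the values used by the tester class above.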
1224
tests/models/flava/test_modeling_flava.py
Normal file
File diff suppressed because it is too large
234
tests/models/flava/test_processor_flava.py
Normal file
@@ -0,0 +1,234 @@
|
|||||||
|
# Copyright 2022 Meta Platforms authors and The HuggingFace Team. All rights reserved.
|
||||||
|
#
|
||||||
|
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||||
|
# you may not use this file except in compliance with the License.
|
||||||
|
# You may obtain a copy of the License at
|
||||||
|
#
|
||||||
|
# http://www.apache.org/licenses/LICENSE-2.0
|
||||||
|
#
|
||||||
|
# Unless required by applicable law or agreed to in writing, software
|
||||||
|
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||||
|
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||||
|
# See the License for the specific language governing permissions and
|
||||||
|
# limitations under the License.
|
||||||
|
|
||||||
|
import json
|
||||||
|
import os
|
||||||
|
import random
|
||||||
|
import shutil
|
||||||
|
import tempfile
|
||||||
|
import unittest
|
||||||
|
|
||||||
|
import numpy as np
|
||||||
|
import pytest
|
||||||
|
|
||||||
|
from transformers import BertTokenizer, BertTokenizerFast
|
||||||
|
from transformers.models.bert.tokenization_bert import VOCAB_FILES_NAMES
|
||||||
|
from transformers.testing_utils import require_vision
|
||||||
|
from transformers.utils import FEATURE_EXTRACTOR_NAME, is_vision_available
|
||||||
|
|
||||||
|
|
||||||
|
if is_vision_available():
|
||||||
|
from PIL import Image
|
||||||
|
|
||||||
|
from transformers import FlavaFeatureExtractor, FlavaProcessor
|
||||||
|
from transformers.models.flava.feature_extraction_flava import (
|
||||||
|
FLAVA_CODEBOOK_MEAN,
|
||||||
|
FLAVA_CODEBOOK_STD,
|
||||||
|
FLAVA_IMAGE_MEAN,
|
||||||
|
FLAVA_IMAGE_STD,
|
||||||
|
)
|
||||||
|
|
||||||
|
|
||||||
|
@require_vision
class FlavaProcessorTest(unittest.TestCase):
    def setUp(self):
        self.tmpdirname = tempfile.mkdtemp()

        # fmt: off
        vocab_tokens = ["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]", "want", "##want", "##ed", "wa", "un", "runn", "##ing", ",", "low", "lowest"]
        # fmt: on
        self.vocab_file = os.path.join(self.tmpdirname, VOCAB_FILES_NAMES["vocab_file"])

        with open(self.vocab_file, "w", encoding="utf-8") as fp:
            fp.write("".join([x + "\n" for x in vocab_tokens]))

        feature_extractor_map = {
            "image_mean": FLAVA_IMAGE_MEAN,
            "image_std": FLAVA_IMAGE_STD,
            "do_normalize": True,
            "do_resize": True,
            "size": 224,
            "do_center_crop": True,
            "crop_size": 224,
            "input_size_patches": 14,
            "total_mask_patches": 75,
            "mask_group_max_patches": None,
            "mask_group_min_patches": 16,
            "mask_group_min_aspect_ratio": 0.3,
            "mask_group_max_aspect_ratio": None,
            "codebook_do_resize": True,
            "codebook_size": 112,
            "codebook_resample": None,
            "codebook_do_center_crop": True,
            "codebook_crop_size": 112,
            "codebook_do_map_pixels": True,
            "codebook_do_normalize": True,
            "codebook_image_mean": FLAVA_CODEBOOK_MEAN,
            "codebook_image_std": FLAVA_CODEBOOK_STD,
        }

        self.feature_extractor_file = os.path.join(self.tmpdirname, FEATURE_EXTRACTOR_NAME)
        with open(self.feature_extractor_file, "w", encoding="utf-8") as fp:
            json.dump(feature_extractor_map, fp)

    def get_tokenizer(self, **kwargs):
        return BertTokenizer.from_pretrained(self.tmpdirname, **kwargs)

    def get_rust_tokenizer(self, **kwargs):
        return BertTokenizerFast.from_pretrained(self.tmpdirname, **kwargs)

    def get_feature_extractor(self, **kwargs):
        return FlavaFeatureExtractor.from_pretrained(self.tmpdirname, **kwargs)

    def tearDown(self):
        shutil.rmtree(self.tmpdirname)

    def prepare_image_inputs(self):
        """Prepare a list containing a single random PIL image (30 pixels high, 400 pixels wide)."""

        image_inputs = [np.random.randint(255, size=(3, 30, 400), dtype=np.uint8)]

        image_inputs = [Image.fromarray(np.moveaxis(x, 0, -1)) for x in image_inputs]

        return image_inputs

    def test_save_load_pretrained_default(self):
        tokenizer_slow = self.get_tokenizer()
        tokenizer_fast = self.get_rust_tokenizer()
        feature_extractor = self.get_feature_extractor()

        processor_slow = FlavaProcessor(tokenizer=tokenizer_slow, feature_extractor=feature_extractor)
        processor_slow.save_pretrained(self.tmpdirname)
        processor_slow = FlavaProcessor.from_pretrained(self.tmpdirname, use_fast=False)

        processor_fast = FlavaProcessor(tokenizer=tokenizer_fast, feature_extractor=feature_extractor)
        processor_fast.save_pretrained(self.tmpdirname)
        processor_fast = FlavaProcessor.from_pretrained(self.tmpdirname)

        self.assertEqual(processor_slow.tokenizer.get_vocab(), tokenizer_slow.get_vocab())
        self.assertEqual(processor_fast.tokenizer.get_vocab(), tokenizer_fast.get_vocab())
        self.assertEqual(tokenizer_slow.get_vocab(), tokenizer_fast.get_vocab())
        self.assertIsInstance(processor_slow.tokenizer, BertTokenizer)
        self.assertIsInstance(processor_fast.tokenizer, BertTokenizerFast)

        self.assertEqual(processor_slow.feature_extractor.to_json_string(), feature_extractor.to_json_string())
        self.assertEqual(processor_fast.feature_extractor.to_json_string(), feature_extractor.to_json_string())
        self.assertIsInstance(processor_slow.feature_extractor, FlavaFeatureExtractor)
        self.assertIsInstance(processor_fast.feature_extractor, FlavaFeatureExtractor)

    def test_save_load_pretrained_additional_features(self):
        processor = FlavaProcessor(tokenizer=self.get_tokenizer(), feature_extractor=self.get_feature_extractor())
        processor.save_pretrained(self.tmpdirname)

        tokenizer_add_kwargs = self.get_tokenizer(bos_token="(BOS)", eos_token="(EOS)")
        feature_extractor_add_kwargs = self.get_feature_extractor(do_normalize=False, padding_value=1.0)

        processor = FlavaProcessor.from_pretrained(
            self.tmpdirname, bos_token="(BOS)", eos_token="(EOS)", do_normalize=False, padding_value=1.0
        )

        self.assertEqual(processor.tokenizer.get_vocab(), tokenizer_add_kwargs.get_vocab())
        self.assertIsInstance(processor.tokenizer, BertTokenizerFast)

        self.assertEqual(processor.feature_extractor.to_json_string(), feature_extractor_add_kwargs.to_json_string())
        self.assertIsInstance(processor.feature_extractor, FlavaFeatureExtractor)

    def test_feature_extractor(self):
        feature_extractor = self.get_feature_extractor()
        tokenizer = self.get_tokenizer()

        processor = FlavaProcessor(tokenizer=tokenizer, feature_extractor=feature_extractor)

        image_input = self.prepare_image_inputs()

        input_feat_extract = feature_extractor(image_input, return_tensors="np")
        input_processor = processor(images=image_input, return_tensors="np")

        for key in input_feat_extract.keys():
            self.assertAlmostEqual(input_feat_extract[key].sum(), input_processor[key].sum(), delta=1e-2)

        # With rest of the args
        random.seed(1234)
        input_feat_extract = feature_extractor(
            image_input, return_image_mask=True, return_codebook_pixels=True, return_tensors="np"
        )
        random.seed(1234)
        input_processor = processor(
            images=image_input, return_image_mask=True, return_codebook_pixels=True, return_tensors="np"
        )

        for key in input_feat_extract.keys():
            self.assertAlmostEqual(input_feat_extract[key].sum(), input_processor[key].sum(), delta=1e-2)

    def test_tokenizer(self):
        feature_extractor = self.get_feature_extractor()
        tokenizer = self.get_tokenizer()

        processor = FlavaProcessor(tokenizer=tokenizer, feature_extractor=feature_extractor)

        input_str = "lower newer"

        encoded_processor = processor(text=input_str)

        encoded_tok = tokenizer(input_str)

        for key in encoded_tok.keys():
            self.assertListEqual(encoded_tok[key], encoded_processor[key])

    def test_processor(self):
        feature_extractor = self.get_feature_extractor()
        tokenizer = self.get_tokenizer()

        processor = FlavaProcessor(tokenizer=tokenizer, feature_extractor=feature_extractor)

        input_str = "lower newer"
        image_input = self.prepare_image_inputs()

        inputs = processor(text=input_str, images=image_input)

        self.assertListEqual(list(inputs.keys()), ["input_ids", "token_type_ids", "attention_mask", "pixel_values"])

        # add extra args
        inputs = processor(text=input_str, images=image_input, return_codebook_pixels=True, return_image_mask=True)

        self.assertListEqual(
            list(inputs.keys()),
            [
                "input_ids",
                "token_type_ids",
                "attention_mask",
                "pixel_values",
                "codebook_pixel_values",
                "bool_masked_pos",
            ],
        )

        # test if it raises when no input is passed
        with pytest.raises(ValueError):
            processor()

    def test_tokenizer_decode(self):
        feature_extractor = self.get_feature_extractor()
        tokenizer = self.get_tokenizer()

        processor = FlavaProcessor(tokenizer=tokenizer, feature_extractor=feature_extractor)

        predicted_ids = [[1, 4, 5, 8, 1, 0, 8], [3, 4, 3, 1, 1, 8, 9]]

        decoded_processor = processor.batch_decode(predicted_ids)
        decoded_tok = tokenizer.batch_decode(predicted_ids)

        self.assertListEqual(decoded_tok, decoded_processor)
@@ -146,6 +146,10 @@ IGNORE_NON_AUTO_CONFIGURED = PRIVATE_MODELS.copy() + [
     "DetrForSegmentation",
     "DPRReader",
     "FlaubertForQuestionAnswering",
+    "FlavaImageCodebook",
+    "FlavaTextModel",
+    "FlavaImageModel",
+    "FlavaMultimodalModel",
     "GPT2DoubleHeadsModel",
     "LukeForMaskedLM",
     "LukeForEntityClassification",