Add GIT (GenerativeImage2Text) (#20295)
* First draft * Make model instantiation work * Fix copied from statement * More fixes * Add correct output head * Improve configuration * Add conversion script * Improve conversion script * Remove token_type_ids * Fix conversion of projection layers * Convert all weights * Use cats image * Make logits match * Generate caption on cats image * Add GITProcessor * Update conversion script * Add support for more checkpoints * Fix conversion script * Add initial tests * Remove cross-attention * More improvements * Remove is_decoder * Improve model tests * Improve tests * Improve model outputs * Fix model outputs equivalence * Fix more tests * Remove unused code * Use generate to generate text, no use of cache for now * Use generate more appropriately * Fix config tests * Fix style * Add support for use_cache Co-authored-by: Joao Gante <joaofranciscocardosogante@gmail.com> * Fix style * Fix GIT vision encoder * Update README * Fix integration test * Set bos and eos token ids * Improve docs * Improve code * Add support for provided attention_mask * Add copied from statement * Fix gradient checkpointing test * Set model_input_names * Investigate model_input_names * Remove script * Fix model inputs * Fix docstring * Rename GIT to Git * Support more models * Add support for textvqa model * Add video support * Extend conversion script for video * Add support for large variant * Add support for more models * Fix config archive map * Update integration test * Fix README * Fix CLIP mean and std * Update processor * Fix use_cache for video, thanks @gante * Remove print statements * Remove assertion * Add processor tests * Fix model_input_names * Use Auto API for processor * Fix processor tests * Fix integration test * Fix pipeline test * Make tests faster * Update conversion script * Update conversion script * Convert more checkpoints * Update conversion script * Fix typo * Update docstrings * Improve code snippets * Fix doc tests * Add more code examplesé * Fix doc tests * Add integration tests * Fix unused variable * revert * Add GIT to Japanese README Co-authored-by: Niels Rogge <nielsrogge@Nielss-MacBook-Pro.local> Co-authored-by: Joao Gante <joaofranciscocardosogante@gmail.com> Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
This commit is contained in:
@@ -316,6 +316,7 @@ Current number of checkpoints: ** (from Facebook AI) released with the paper [FLAVA: A Foundational Language And Vision Alignment Model](https://arxiv.org/abs/2112.04482) by Amanpreet Singh, Ronghang Hu, Vedanuj Goswami, Guillaume Couairon, Wojciech Galuba, Marcus Rohrbach, and Douwe Kiela.
|
||||
1. **[FNet](https://huggingface.co/docs/transformers/model_doc/fnet)** (from Google Research) released with the paper [FNet: Mixing Tokens with Fourier Transforms](https://arxiv.org/abs/2105.03824) by James Lee-Thorp, Joshua Ainslie, Ilya Eckstein, Santiago Ontanon.
|
||||
1. **[Funnel Transformer](https://huggingface.co/docs/transformers/model_doc/funnel)** (from CMU/Google Brain) released with the paper [Funnel-Transformer: Filtering out Sequential Redundancy for Efficient Language Processing](https://arxiv.org/abs/2006.03236) by Zihang Dai, Guokun Lai, Yiming Yang, Quoc V. Le.
|
||||
1. **[GIT](https://huggingface.co/docs/transformers/main/model_doc/git)** (from Microsoft Research) released with the paper [GIT: A Generative Image-to-text Transformer for Vision and Language](https://arxiv.org/abs/2205.14100) by Jianfeng Wang, Zhengyuan Yang, Xiaowei Hu, Linjie Li, Kevin Lin, Zhe Gan, Zicheng Liu, Ce Liu, Lijuan Wang.
|
||||
1. **[GLPN](https://huggingface.co/docs/transformers/model_doc/glpn)** (from KAIST) released with the paper [Global-Local Path Networks for Monocular Depth Estimation with Vertical CutDepth](https://arxiv.org/abs/2201.07436) by Doyeon Kim, Woonghyun Ga, Pyungwhan Ahn, Donggyu Joo, Sehwan Chun, Junmo Kim.
|
||||
1. **[GPT](https://huggingface.co/docs/transformers/model_doc/openai-gpt)** (from OpenAI) released with the paper [Improving Language Understanding by Generative Pre-Training](https://blog.openai.com/language-unsupervised/) by Alec Radford, Karthik Narasimhan, Tim Salimans and Ilya Sutskever.
|
||||
1. **[GPT Neo](https://huggingface.co/docs/transformers/model_doc/gpt_neo)** (from EleutherAI) released in the repository [EleutherAI/gpt-neo](https://github.com/EleutherAI/gpt-neo) by Sid Black, Stella Biderman, Leo Gao, Phil Wang and Connor Leahy.
|
||||
|
||||
@@ -316,6 +316,7 @@ Número actual de puntos de control: ** (from Facebook AI) released with the paper [FLAVA: A Foundational Language And Vision Alignment Model](https://arxiv.org/abs/2112.04482) by Amanpreet Singh, Ronghang Hu, Vedanuj Goswami, Guillaume Couairon, Wojciech Galuba, Marcus Rohrbach, and Douwe Kiela.
|
||||
1. **[FNet](https://huggingface.co/docs/transformers/model_doc/fnet)** (from Google Research) released with the paper [FNet: Mixing Tokens with Fourier Transforms](https://arxiv.org/abs/2105.03824) by James Lee-Thorp, Joshua Ainslie, Ilya Eckstein, Santiago Ontanon.
|
||||
1. **[Funnel Transformer](https://huggingface.co/docs/transformers/model_doc/funnel)** (from CMU/Google Brain) released with the paper [Funnel-Transformer: Filtering out Sequential Redundancy for Efficient Language Processing](https://arxiv.org/abs/2006.03236) by Zihang Dai, Guokun Lai, Yiming Yang, Quoc V. Le.
|
||||
1. **[GIT](https://huggingface.co/docs/transformers/main/model_doc/git)** (from Microsoft Research) released with the paper [GIT: A Generative Image-to-text Transformer for Vision and Language](https://arxiv.org/abs/2205.14100) by Jianfeng Wang, Zhengyuan Yang, Xiaowei Hu, Linjie Li, Kevin Lin, Zhe Gan, Zicheng Liu, Ce Liu, Lijuan Wang.
|
||||
1. **[GLPN](https://huggingface.co/docs/transformers/model_doc/glpn)** (from KAIST) released with the paper [Global-Local Path Networks for Monocular Depth Estimation with Vertical CutDepth](https://arxiv.org/abs/2201.07436) by Doyeon Kim, Woonghyun Ga, Pyungwhan Ahn, Donggyu Joo, Sehwan Chun, Junmo Kim.
|
||||
1. **[GPT](https://huggingface.co/docs/transformers/model_doc/openai-gpt)** (from OpenAI) released with the paper [Improving Language Understanding by Generative Pre-Training](https://blog.openai.com/language-unsupervised/) by Alec Radford, Karthik Narasimhan, Tim Salimans and Ilya Sutskever.
|
||||
1. **[GPT Neo](https://huggingface.co/docs/transformers/model_doc/gpt_neo)** (from EleutherAI) released in the repository [EleutherAI/gpt-neo](https://github.com/EleutherAI/gpt-neo) by Sid Black, Stella Biderman, Leo Gao, Phil Wang and Connor Leahy.
|
||||
|
||||
@@ -289,6 +289,7 @@ conda install -c huggingface transformers
|
||||
1. **[FLAVA](https://huggingface.co/docs/transformers/model_doc/flava)** (FLAVA: A फाउंडेशनल लैंग्वेज एंड विजन अलाइनमेंट मॉडल) (https://arxiv) साथ वाला पेपर .org/abs/2112.04482) अमनप्रीत सिंह, रोंगहांग हू, वेदानुज गोस्वामी, गुइल्यूम कुएरॉन, वोज्शिएक गालुबा, मार्कस रोहरबैक, और डौवे कीला द्वारा।
|
||||
1. **[FNet](https://huggingface.co/docs/transformers/model_doc/fnet)** (गूगल रिसर्च से) साथ वाला पेपर [FNet: मिक्सिंग टोकन विद फूरियर ट्रांसफॉर्म्स](https://arxiv.org /abs/2105.03824) जेम्स ली-थॉर्प, जोशुआ आइंस्ली, इल्या एकस्टीन, सैंटियागो ओंटानन द्वारा।
|
||||
1. **[Funnel Transformer](https://huggingface.co/docs/transformers/model_doc/funnel)** (सीएमयू/गूगल ब्रेन से) साथ में कागज [फ़नल-ट्रांसफॉर्मर: कुशल भाषा प्रसंस्करण के लिए अनुक्रमिक अतिरेक को छानना](https://arxiv.org/abs/2006.03236) जिहांग दाई, गुओकुन लाई, यिमिंग यांग, क्वोक वी. ले द्वारा रिहाई।
|
||||
1. **[GIT](https://huggingface.co/docs/transformers/main/model_doc/git)** (from Microsoft Research) released with the paper [GIT: A Generative Image-to-text Transformer for Vision and Language](https://arxiv.org/abs/2205.14100) by Jianfeng Wang, Zhengyuan Yang, Xiaowei Hu, Linjie Li, Kevin Lin, Zhe Gan, Zicheng Liu, Ce Liu, Lijuan Wang.
|
||||
1. **[GLPN](https://huggingface.co/docs/transformers/model_doc/glpn)** (KAIST से) साथ वाला पेपर [वर्टिकल कटडेप्थ के साथ मोनोकुलर डेप्थ एस्टीमेशन के लिए ग्लोबल-लोकल पाथ नेटवर्क्स](https:/ /arxiv.org/abs/2201.07436) डोयोन किम, वूंगह्युन गा, प्युंगवान आह, डोंगग्यू जू, सेहवान चुन, जुनमो किम द्वारा।
|
||||
1. **[GPT](https://huggingface.co/docs/transformers/model_doc/openai-gpt)** (OpenAI से) साथ में दिया गया पेपर [जेनरेटिव प्री-ट्रेनिंग द्वारा भाषा की समझ में सुधार](https://blog .openai.com/language-unsupervised/) एलेक रैडफोर्ड, कार्तिक नरसिम्हन, टिम सालिमन्स और इल्या सुत्स्केवर द्वारा।
|
||||
1. **[GPT Neo](https://huggingface.co/docs/transformers/model_doc/gpt_neo)** (EleutherAI से) रिपॉजिटरी के साथ [EleutherAI/gpt-neo](https://github.com/ EleutherAI /gpt-neo) रिलीज। सिड ब्लैक, स्टेला बिडरमैन, लियो गाओ, फिल वांग और कॉनर लेही द्वारा पोस्ट किया गया।
|
||||
|
||||
@@ -351,6 +351,7 @@ Flax、PyTorch、TensorFlowをcondaでインストールする方法は、それ
|
||||
1. **[FLAVA](https://huggingface.co/docs/transformers/model_doc/flava)** (Facebook AI から) Amanpreet Singh, Ronghang Hu, Vedanuj Goswami, Guillaume Couairon, Wojciech Galuba, Marcus Rohrbach, and Douwe Kiela から公開された研究論文: [FLAVA: A Foundational Language And Vision Alignment Model](https://arxiv.org/abs/2112.04482)
|
||||
1. **[FNet](https://huggingface.co/docs/transformers/model_doc/fnet)** (Google Research から) James Lee-Thorp, Joshua Ainslie, Ilya Eckstein, Santiago Ontanon から公開された研究論文: [FNet: Mixing Tokens with Fourier Transforms](https://arxiv.org/abs/2105.03824)
|
||||
1. **[Funnel Transformer](https://huggingface.co/docs/transformers/model_doc/funnel)** (CMU/Google Brain から) Zihang Dai, Guokun Lai, Yiming Yang, Quoc V. Le から公開された研究論文: [Funnel-Transformer: Filtering out Sequential Redundancy for Efficient Language Processing](https://arxiv.org/abs/2006.03236)
|
||||
1. **[GIT](https://huggingface.co/docs/transformers/main/model_doc/git)** (Microsoft Research から) Jianfeng Wang, Zhengyuan Yang, Xiaowei Hu, Linjie Li, Kevin Lin, Zhe Gan, Zicheng Liu, Ce Liu, Lijuan Wang. から公開された研究論文 [GIT: A Generative Image-to-text Transformer for Vision and Language](https://arxiv.org/abs/2205.14100)
|
||||
1. **[GLPN](https://huggingface.co/docs/transformers/model_doc/glpn)** (KAIST から) Doyeon Kim, Woonghyun Ga, Pyungwhan Ahn, Donggyu Joo, Sehwan Chun, Junmo Kim から公開された研究論文: [Global-Local Path Networks for Monocular Depth Estimation with Vertical CutDepth](https://arxiv.org/abs/2201.07436)
|
||||
1. **[GPT](https://huggingface.co/docs/transformers/model_doc/openai-gpt)** (OpenAI から) Alec Radford, Karthik Narasimhan, Tim Salimans and Ilya Sutskever から公開された研究論文: [Improving Language Understanding by Generative Pre-Training](https://blog.openai.com/language-unsupervised/)
|
||||
1. **[GPT Neo](https://huggingface.co/docs/transformers/model_doc/gpt_neo)** (EleutherAI から) Sid Black, Stella Biderman, Leo Gao, Phil Wang and Connor Leahy から公開されたレポジトリー : [EleutherAI/gpt-neo](https://github.com/EleutherAI/gpt-neo)
|
||||
|
||||
@@ -266,6 +266,7 @@ Flax, PyTorch, TensorFlow 설치 페이지에서 이들을 conda로 설치하는
|
||||
1. **[FLAVA](https://huggingface.co/docs/transformers/model_doc/flava)** (from Facebook AI) released with the paper [FLAVA: A Foundational Language And Vision Alignment Model](https://arxiv.org/abs/2112.04482) by Amanpreet Singh, Ronghang Hu, Vedanuj Goswami, Guillaume Couairon, Wojciech Galuba, Marcus Rohrbach, and Douwe Kiela.
|
||||
1. **[FNet](https://huggingface.co/docs/transformers/model_doc/fnet)** (from Google Research) released with the paper [FNet: Mixing Tokens with Fourier Transforms](https://arxiv.org/abs/2105.03824) by James Lee-Thorp, Joshua Ainslie, Ilya Eckstein, Santiago Ontanon.
|
||||
1. **[Funnel Transformer](https://huggingface.co/docs/transformers/model_doc/funnel)** (from CMU/Google Brain) released with the paper [Funnel-Transformer: Filtering out Sequential Redundancy for Efficient Language Processing](https://arxiv.org/abs/2006.03236) by Zihang Dai, Guokun Lai, Yiming Yang, Quoc V. Le.
|
||||
1. **[GIT](https://huggingface.co/docs/transformers/main/model_doc/git)** (from Microsoft Research) released with the paper [GIT: A Generative Image-to-text Transformer for Vision and Language](https://arxiv.org/abs/2205.14100) by Jianfeng Wang, Zhengyuan Yang, Xiaowei Hu, Linjie Li, Kevin Lin, Zhe Gan, Zicheng Liu, Ce Liu, Lijuan Wang.
|
||||
1. **[GLPN](https://huggingface.co/docs/transformers/model_doc/glpn)** (from KAIST) released with the paper [Global-Local Path Networks for Monocular Depth Estimation with Vertical CutDepth](https://arxiv.org/abs/2201.07436) by Doyeon Kim, Woonghyun Ga, Pyungwhan Ahn, Donggyu Joo, Sehwan Chun, Junmo Kim.
|
||||
1. **[GPT](https://huggingface.co/docs/transformers/model_doc/openai-gpt)** (from OpenAI) released with the paper [Improving Language Understanding by Generative Pre-Training](https://blog.openai.com/language-unsupervised/) by Alec Radford, Karthik Narasimhan, Tim Salimans and Ilya Sutskever.
|
||||
1. **[GPT Neo](https://huggingface.co/docs/transformers/model_doc/gpt_neo)** (from EleutherAI) released in the repository [EleutherAI/gpt-neo](https://github.com/EleutherAI/gpt-neo) by Sid Black, Stella Biderman, Leo Gao, Phil Wang and Connor Leahy.
|
||||
|
||||
@@ -290,6 +290,7 @@ conda install -c huggingface transformers
|
||||
1. **[FLAVA](https://huggingface.co/docs/transformers/model_doc/flava)** (来自 Facebook AI) 伴随论文 [FLAVA: A Foundational Language And Vision Alignment Model](https://arxiv.org/abs/2112.04482) 由 Amanpreet Singh, Ronghang Hu, Vedanuj Goswami, Guillaume Couairon, Wojciech Galuba, Marcus Rohrbach, and Douwe Kiela 发布。
|
||||
1. **[FNet](https://huggingface.co/docs/transformers/model_doc/fnet)** (来自 Google Research) 伴随论文 [FNet: Mixing Tokens with Fourier Transforms](https://arxiv.org/abs/2105.03824) 由 James Lee-Thorp, Joshua Ainslie, Ilya Eckstein, Santiago Ontanon 发布。
|
||||
1. **[Funnel Transformer](https://huggingface.co/docs/transformers/model_doc/funnel)** (来自 CMU/Google Brain) 伴随论文 [Funnel-Transformer: Filtering out Sequential Redundancy for Efficient Language Processing](https://arxiv.org/abs/2006.03236) 由 Zihang Dai, Guokun Lai, Yiming Yang, Quoc V. Le 发布。
|
||||
1. **[GIT](https://huggingface.co/docs/transformers/main/model_doc/git)** (来自 Microsoft Research) 伴随论文 [GIT: A Generative Image-to-text Transformer for Vision and Language](https://arxiv.org/abs/2205.14100) 由 Jianfeng Wang, Zhengyuan Yang, Xiaowei Hu, Linjie Li, Kevin Lin, Zhe Gan, Zicheng Liu, Ce Liu, Lijuan Wang 发布。
|
||||
1. **[GLPN](https://huggingface.co/docs/transformers/model_doc/glpn)** (来自 KAIST) 伴随论文 [Global-Local Path Networks for Monocular Depth Estimation with Vertical CutDepth](https://arxiv.org/abs/2201.07436) 由 Doyeon Kim, Woonghyun Ga, Pyungwhan Ahn, Donggyu Joo, Sehwan Chun, Junmo Kim 发布。
|
||||
1. **[GPT](https://huggingface.co/docs/transformers/model_doc/openai-gpt)** (来自 OpenAI) 伴随论文 [Improving Language Understanding by Generative Pre-Training](https://blog.openai.com/language-unsupervised/) 由 Alec Radford, Karthik Narasimhan, Tim Salimans and Ilya Sutskever 发布。
|
||||
1. **[GPT Neo](https://huggingface.co/docs/transformers/model_doc/gpt_neo)** (来自 EleutherAI) 随仓库 [EleutherAI/gpt-neo](https://github.com/EleutherAI/gpt-neo) 发布。作者为 Sid Black, Stella Biderman, Leo Gao, Phil Wang and Connor Leahy 发布。
|
||||
|
||||
@@ -302,6 +302,7 @@ conda install -c huggingface transformers
|
||||
1. **[FLAVA](https://huggingface.co/docs/transformers/model_doc/flava)** (from Facebook AI) released with the paper [FLAVA: A Foundational Language And Vision Alignment Model](https://arxiv.org/abs/2112.04482) by Amanpreet Singh, Ronghang Hu, Vedanuj Goswami, Guillaume Couairon, Wojciech Galuba, Marcus Rohrbach, and Douwe Kiela.
|
||||
1. **[FNet](https://huggingface.co/docs/transformers/model_doc/fnet)** (from Google Research) released with the paper [FNet: Mixing Tokens with Fourier Transforms](https://arxiv.org/abs/2105.03824) by James Lee-Thorp, Joshua Ainslie, Ilya Eckstein, Santiago Ontanon.
|
||||
1. **[Funnel Transformer](https://huggingface.co/docs/transformers/model_doc/funnel)** (from CMU/Google Brain) released with the paper [Funnel-Transformer: Filtering out Sequential Redundancy for Efficient Language Processing](https://arxiv.org/abs/2006.03236) by Zihang Dai, Guokun Lai, Yiming Yang, Quoc V. Le.
|
||||
1. **[GIT](https://huggingface.co/docs/transformers/main/model_doc/git)** (from Microsoft Research) released with the paper [GIT: A Generative Image-to-text Transformer for Vision and Language](https://arxiv.org/abs/2205.14100) by Jianfeng Wang, Zhengyuan Yang, Xiaowei Hu, Linjie Li, Kevin Lin, Zhe Gan, Zicheng Liu, Ce Liu, Lijuan Wang.
|
||||
1. **[GLPN](https://huggingface.co/docs/transformers/model_doc/glpn)** (from KAIST) released with the paper [Global-Local Path Networks for Monocular Depth Estimation with Vertical CutDepth](https://arxiv.org/abs/2201.07436) by Doyeon Kim, Woonghyun Ga, Pyungwhan Ahn, Donggyu Joo, Sehwan Chun, Junmo Kim.
|
||||
1. **[GPT](https://huggingface.co/docs/transformers/model_doc/openai-gpt)** (from OpenAI) released with the paper [Improving Language Understanding by Generative Pre-Training](https://blog.openai.com/language-unsupervised/) by Alec Radford, Karthik Narasimhan, Tim Salimans and Ilya Sutskever.
|
||||
1. **[GPT Neo](https://huggingface.co/docs/transformers/model_doc/gpt_neo)** (from EleutherAI) released in the repository [EleutherAI/gpt-neo](https://github.com/EleutherAI/gpt-neo) by Sid Black, Stella Biderman, Leo Gao, Phil Wang and Connor Leahy.
|
||||
|
||||
@@ -512,6 +512,8 @@
|
||||
title: Donut
|
||||
- local: model_doc/flava
|
||||
title: FLAVA
|
||||
- local: model_doc/git
|
||||
title: GIT
|
||||
- local: model_doc/groupvit
|
||||
title: GroupViT
|
||||
- local: model_doc/layoutlmv2
|
||||
|
||||
@@ -103,6 +103,7 @@ The documentation is organized into five sections:
|
||||
1. **[FLAVA](model_doc/flava)** (from Facebook AI) released with the paper [FLAVA: A Foundational Language And Vision Alignment Model](https://arxiv.org/abs/2112.04482) by Amanpreet Singh, Ronghang Hu, Vedanuj Goswami, Guillaume Couairon, Wojciech Galuba, Marcus Rohrbach, and Douwe Kiela.
|
||||
1. **[FNet](model_doc/fnet)** (from Google Research) released with the paper [FNet: Mixing Tokens with Fourier Transforms](https://arxiv.org/abs/2105.03824) by James Lee-Thorp, Joshua Ainslie, Ilya Eckstein, Santiago Ontanon.
|
||||
1. **[Funnel Transformer](model_doc/funnel)** (from CMU/Google Brain) released with the paper [Funnel-Transformer: Filtering out Sequential Redundancy for Efficient Language Processing](https://arxiv.org/abs/2006.03236) by Zihang Dai, Guokun Lai, Yiming Yang, Quoc V. Le.
|
||||
1. **[GIT](model_doc/git)** (from Microsoft Research) released with the paper [GIT: A Generative Image-to-text Transformer for Vision and Language](https://arxiv.org/abs/2205.14100) by Jianfeng Wang, Zhengyuan Yang, Xiaowei Hu, Linjie Li, Kevin Lin, Zhe Gan, Zicheng Liu, Ce Liu, Lijuan Wang.
|
||||
1. **[GLPN](model_doc/glpn)** (from KAIST) released with the paper [Global-Local Path Networks for Monocular Depth Estimation with Vertical CutDepth](https://arxiv.org/abs/2201.07436) by Doyeon Kim, Woonghyun Ga, Pyungwhan Ahn, Donggyu Joo, Sehwan Chun, Junmo Kim.
|
||||
1. **[GPT](model_doc/openai-gpt)** (from OpenAI) released with the paper [Improving Language Understanding by Generative Pre-Training](https://blog.openai.com/language-unsupervised/) by Alec Radford, Karthik Narasimhan, Tim Salimans and Ilya Sutskever.
|
||||
1. **[GPT Neo](model_doc/gpt_neo)** (from EleutherAI) released in the repository [EleutherAI/gpt-neo](https://github.com/EleutherAI/gpt-neo) by Sid Black, Stella Biderman, Leo Gao, Phil Wang and Connor Leahy.
|
||||
@@ -276,6 +277,7 @@ Flax), PyTorch, and/or TensorFlow.
|
||||
| FLAVA | ❌ | ❌ | ✅ | ❌ | ❌ |
|
||||
| FNet | ✅ | ✅ | ✅ | ❌ | ❌ |
|
||||
| Funnel Transformer | ✅ | ✅ | ✅ | ✅ | ❌ |
|
||||
| GIT | ❌ | ❌ | ✅ | ❌ | ❌ |
|
||||
| GLPN | ❌ | ❌ | ✅ | ❌ | ❌ |
|
||||
| GPT Neo | ❌ | ❌ | ✅ | ❌ | ✅ |
|
||||
| GPT NeoX | ❌ | ✅ | ✅ | ❌ | ❌ |
|
||||
|
||||
66
docs/source/en/model_doc/git.mdx
Normal file
66
docs/source/en/model_doc/git.mdx
Normal file
@@ -0,0 +1,66 @@
|
||||
<!--Copyright 2022 The HuggingFace Team. All rights reserved.
|
||||
|
||||
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
|
||||
the License. You may obtain a copy of the License at
|
||||
|
||||
http://www.apache.org/licenses/LICENSE-2.0
|
||||
|
||||
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
|
||||
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
|
||||
specific language governing permissions and limitations under the License.
|
||||
-->
|
||||
|
||||
# GIT
|
||||
|
||||
## Overview
|
||||
|
||||
The GIT model was proposed in [GIT: A Generative Image-to-text Transformer for Vision and Language](https://arxiv.org/abs/2205.14100) by
|
||||
Jianfeng Wang, Zhengyuan Yang, Xiaowei Hu, Linjie Li, Kevin Lin, Zhe Gan, Zicheng Liu, Ce Liu, Lijuan Wang. GIT is a decoder-only Transformer
|
||||
that leverages [CLIP](clip)'s vision encoder to condition the model on vision inputs besides text. The model obtains state-of-the-art results on
|
||||
image captioning and visual question answering benchmarks.
|
||||
|
||||
The abstract from the paper is the following:
|
||||
|
||||
*In this paper, we design and train a Generative Image-to-text Transformer, GIT, to unify vision-language tasks such as image/video captioning and question answering. While generative models provide a consistent network architecture between pre-training and fine-tuning, existing work typically contains complex structures (uni/multi-modal encoder/decoder) and depends on external modules such as object detectors/taggers and optical character recognition (OCR). In GIT, we simplify the architecture as one image encoder and one text decoder under a single language modeling task. We also scale up the pre-training data and the model size to boost the model performance. Without bells and whistles, our GIT establishes new state of the arts on 12 challenging benchmarks with a large margin. For instance, our model surpasses the human performance for the first time on TextCaps (138.2 vs. 125.5 in CIDEr). Furthermore, we present a new scheme of generation-based image classification and scene text recognition, achieving decent performance on standard benchmarks.*
|
||||
|
||||
Tips:
|
||||
|
||||
- GIT is implemented in a very similar way to GPT-2, the only difference being that the model is also conditioned on `pixel_values`.
|
||||
- One can use [`GitProcessor`] to prepare images for the model, and the `generate` method for autoregressive generation.
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/git_architecture.jpg"
|
||||
alt="drawing" width="600"/>
|
||||
|
||||
<small> GIT architecture. Taken from the <a href="https://arxiv.org/abs/2205.14100" target="_blank">original paper</a>. </small>
|
||||
|
||||
This model was contributed by [nielsr](https://huggingface.co/nielsr).
|
||||
The original code can be found [here](https://github.com/microsoft/GenerativeImage2Text).
|
||||
|
||||
## GitVisionConfig
|
||||
|
||||
[[autodoc]] GitVisionConfig
|
||||
|
||||
## GitVisionModel
|
||||
|
||||
[[autodoc]] GitVisionModel
|
||||
- forward
|
||||
|
||||
## GitConfig
|
||||
|
||||
[[autodoc]] GitConfig
|
||||
- all
|
||||
|
||||
## GitProcessor
|
||||
|
||||
[[autodoc]] GitProcessor
|
||||
- __call__
|
||||
|
||||
## GitModel
|
||||
|
||||
[[autodoc]] GitModel
|
||||
- forward
|
||||
|
||||
## GitForCausalLM
|
||||
|
||||
[[autodoc]] GitForCausalLM
|
||||
- forward
|
||||
@@ -256,6 +256,7 @@ _import_structure = {
|
||||
"models.fnet": ["FNET_PRETRAINED_CONFIG_ARCHIVE_MAP", "FNetConfig"],
|
||||
"models.fsmt": ["FSMT_PRETRAINED_CONFIG_ARCHIVE_MAP", "FSMTConfig", "FSMTTokenizer"],
|
||||
"models.funnel": ["FUNNEL_PRETRAINED_CONFIG_ARCHIVE_MAP", "FunnelConfig", "FunnelTokenizer"],
|
||||
"models.git": ["GIT_PRETRAINED_CONFIG_ARCHIVE_MAP", "GitConfig", "GitProcessor", "GitVisionConfig"],
|
||||
"models.glpn": ["GLPN_PRETRAINED_CONFIG_ARCHIVE_MAP", "GLPNConfig"],
|
||||
"models.gpt2": ["GPT2_PRETRAINED_CONFIG_ARCHIVE_MAP", "GPT2Config", "GPT2Tokenizer"],
|
||||
"models.gpt_neo": ["GPT_NEO_PRETRAINED_CONFIG_ARCHIVE_MAP", "GPTNeoConfig"],
|
||||
@@ -1447,6 +1448,15 @@ else:
|
||||
"load_tf_weights_in_funnel",
|
||||
]
|
||||
)
|
||||
_import_structure["models.git"].extend(
|
||||
[
|
||||
"GIT_PRETRAINED_MODEL_ARCHIVE_LIST",
|
||||
"GitForCausalLM",
|
||||
"GitModel",
|
||||
"GitPreTrainedModel",
|
||||
"GitVisionModel",
|
||||
]
|
||||
)
|
||||
_import_structure["models.glpn"].extend(
|
||||
[
|
||||
"GLPN_PRETRAINED_MODEL_ARCHIVE_LIST",
|
||||
@@ -3595,6 +3605,7 @@ if TYPE_CHECKING:
|
||||
from .models.fnet import FNET_PRETRAINED_CONFIG_ARCHIVE_MAP, FNetConfig
|
||||
from .models.fsmt import FSMT_PRETRAINED_CONFIG_ARCHIVE_MAP, FSMTConfig, FSMTTokenizer
|
||||
from .models.funnel import FUNNEL_PRETRAINED_CONFIG_ARCHIVE_MAP, FunnelConfig, FunnelTokenizer
|
||||
from .models.git import GIT_PRETRAINED_CONFIG_ARCHIVE_MAP, GitConfig, GitProcessor, GitVisionConfig
|
||||
from .models.glpn import GLPN_PRETRAINED_CONFIG_ARCHIVE_MAP, GLPNConfig
|
||||
from .models.gpt2 import GPT2_PRETRAINED_CONFIG_ARCHIVE_MAP, GPT2Config, GPT2Tokenizer
|
||||
from .models.gpt_neo import GPT_NEO_PRETRAINED_CONFIG_ARCHIVE_MAP, GPTNeoConfig
|
||||
@@ -4610,6 +4621,13 @@ if TYPE_CHECKING:
|
||||
FunnelPreTrainedModel,
|
||||
load_tf_weights_in_funnel,
|
||||
)
|
||||
from .models.git import (
|
||||
GIT_PRETRAINED_MODEL_ARCHIVE_LIST,
|
||||
GitForCausalLM,
|
||||
GitModel,
|
||||
GitPreTrainedModel,
|
||||
GitVisionModel,
|
||||
)
|
||||
from .models.glpn import (
|
||||
GLPN_PRETRAINED_MODEL_ARCHIVE_LIST,
|
||||
GLPNForDepthEstimation,
|
||||
|
||||
@@ -73,6 +73,7 @@ from . import (
|
||||
fnet,
|
||||
fsmt,
|
||||
funnel,
|
||||
git,
|
||||
glpn,
|
||||
gpt2,
|
||||
gpt_neo,
|
||||
|
||||
@@ -77,6 +77,7 @@ CONFIG_MAPPING_NAMES = OrderedDict(
|
||||
("fnet", "FNetConfig"),
|
||||
("fsmt", "FSMTConfig"),
|
||||
("funnel", "FunnelConfig"),
|
||||
("git", "GitConfig"),
|
||||
("glpn", "GLPNConfig"),
|
||||
("gpt-sw3", "GPT2Config"),
|
||||
("gpt2", "GPT2Config"),
|
||||
@@ -234,6 +235,7 @@ CONFIG_ARCHIVE_MAP_MAPPING_NAMES = OrderedDict(
|
||||
("fnet", "FNET_PRETRAINED_CONFIG_ARCHIVE_MAP"),
|
||||
("fsmt", "FSMT_PRETRAINED_CONFIG_ARCHIVE_MAP"),
|
||||
("funnel", "FUNNEL_PRETRAINED_CONFIG_ARCHIVE_MAP"),
|
||||
("git", "GIT_PRETRAINED_CONFIG_ARCHIVE_MAP"),
|
||||
("glpn", "GLPN_PRETRAINED_CONFIG_ARCHIVE_MAP"),
|
||||
("gpt2", "GPT2_PRETRAINED_CONFIG_ARCHIVE_MAP"),
|
||||
("gpt_neo", "GPT_NEO_PRETRAINED_CONFIG_ARCHIVE_MAP"),
|
||||
@@ -390,6 +392,7 @@ MODEL_NAMES_MAPPING = OrderedDict(
|
||||
("fnet", "FNet"),
|
||||
("fsmt", "FairSeq Machine-Translation"),
|
||||
("funnel", "Funnel Transformer"),
|
||||
("git", "GIT"),
|
||||
("glpn", "GLPN"),
|
||||
("gpt-sw3", "GPT-Sw3"),
|
||||
("gpt2", "OpenAI GPT-2"),
|
||||
|
||||
@@ -54,6 +54,7 @@ IMAGE_PROCESSOR_MAPPING_NAMES = OrderedDict(
|
||||
("donut-swin", "DonutImageProcessor"),
|
||||
("dpt", "DPTImageProcessor"),
|
||||
("flava", "FlavaImageProcessor"),
|
||||
("git", ("CLIPImageProcessor", "VideoMAEImageProcessor")),
|
||||
("glpn", "GLPNImageProcessor"),
|
||||
("groupvit", "CLIPImageProcessor"),
|
||||
("imagegpt", "ImageGPTImageProcessor"),
|
||||
|
||||
@@ -76,6 +76,7 @@ MODEL_MAPPING_NAMES = OrderedDict(
|
||||
("fnet", "FNetModel"),
|
||||
("fsmt", "FSMTModel"),
|
||||
("funnel", ("FunnelModel", "FunnelBaseModel")),
|
||||
("git", "GitModel"),
|
||||
("glpn", "GLPNModel"),
|
||||
("gpt-sw3", "GPT2Model"),
|
||||
("gpt2", "GPT2Model"),
|
||||
@@ -264,6 +265,7 @@ MODEL_WITH_LM_HEAD_MAPPING_NAMES = OrderedDict(
|
||||
("fnet", "FNetForMaskedLM"),
|
||||
("fsmt", "FSMTForConditionalGeneration"),
|
||||
("funnel", "FunnelForMaskedLM"),
|
||||
("git", "GitForCausalLM"),
|
||||
("gpt-sw3", "GPT2LMHeadModel"),
|
||||
("gpt2", "GPT2LMHeadModel"),
|
||||
("gpt_neo", "GPTNeoForCausalLM"),
|
||||
@@ -329,6 +331,7 @@ MODEL_FOR_CAUSAL_LM_MAPPING_NAMES = OrderedDict(
|
||||
("data2vec-text", "Data2VecTextForCausalLM"),
|
||||
("electra", "ElectraForCausalLM"),
|
||||
("ernie", "ErnieForCausalLM"),
|
||||
("git", "GitForCausalLM"),
|
||||
("gpt-sw3", "GPT2LMHeadModel"),
|
||||
("gpt2", "GPT2LMHeadModel"),
|
||||
("gpt_neo", "GPTNeoForCausalLM"),
|
||||
|
||||
@@ -46,6 +46,7 @@ PROCESSOR_MAPPING_NAMES = OrderedDict(
|
||||
("clip", "CLIPProcessor"),
|
||||
("clipseg", "CLIPSegProcessor"),
|
||||
("flava", "FlavaProcessor"),
|
||||
("git", "GITProcessor"),
|
||||
("groupvit", "CLIPProcessor"),
|
||||
("layoutlmv2", "LayoutLMv2Processor"),
|
||||
("layoutlmv3", "LayoutLMv3Processor"),
|
||||
|
||||
@@ -137,6 +137,7 @@ else:
|
||||
("fnet", ("FNetTokenizer", "FNetTokenizerFast" if is_tokenizers_available() else None)),
|
||||
("fsmt", ("FSMTTokenizer", None)),
|
||||
("funnel", ("FunnelTokenizer", "FunnelTokenizerFast" if is_tokenizers_available() else None)),
|
||||
("git", ("BertTokenizer", "BertTokenizerFast" if is_tokenizers_available() else None)),
|
||||
("gpt-sw3", ("GPTSw3Tokenizer" if is_sentencepiece_available() else None, None)),
|
||||
("gpt2", ("GPT2Tokenizer", "GPT2TokenizerFast" if is_tokenizers_available() else None)),
|
||||
("gpt_neo", ("GPT2Tokenizer", "GPT2TokenizerFast" if is_tokenizers_available() else None)),
|
||||
|
||||
@@ -39,15 +39,14 @@ CLIP_PRETRAINED_CONFIG_ARCHIVE_MAP = {
|
||||
|
||||
class CLIPTextConfig(PretrainedConfig):
|
||||
r"""
|
||||
This is the configuration class to store the configuration of a [`CLIPModel`]. It is used to instantiate an CLIP
|
||||
model according to the specified arguments, defining the model architecture. Instantiating a configuration with the
|
||||
defaults will yield a similar configuration to that of the CLIP
|
||||
This is the configuration class to store the configuration of a [`CLIPTextModel`]. It is used to instantiate a CLIP
|
||||
text encoder according to the specified arguments, defining the model architecture. Instantiating a configuration
|
||||
with the defaults will yield a similar configuration to that of the text encoder of the CLIP
|
||||
[openai/clip-vit-base-patch32](https://huggingface.co/openai/clip-vit-base-patch32) architecture.
|
||||
|
||||
Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
|
||||
documentation from [`PretrainedConfig`] for more information.
|
||||
|
||||
|
||||
Args:
|
||||
vocab_size (`int`, *optional*, defaults to 49408):
|
||||
Vocabulary size of the CLIP text model. Defines the number of different tokens that can be represented by
|
||||
@@ -65,7 +64,7 @@ class CLIPTextConfig(PretrainedConfig):
|
||||
just in case (e.g., 512 or 1024 or 2048).
|
||||
hidden_act (`str` or `function`, *optional*, defaults to `"quick_gelu"`):
|
||||
The non-linear activation function (function or string) in the encoder and pooler. If string, `"gelu"`,
|
||||
`"relu"`, `"selu"` and `"gelu_new"` ``"quick_gelu"` are supported. layer_norm_eps (`float`, *optional*,
|
||||
`"relu"`, `"selu"` and `"gelu_new"` `"quick_gelu"` are supported. layer_norm_eps (`float`, *optional*,
|
||||
defaults to 1e-5): The epsilon used by the layer normalization layers.
|
||||
attention_dropout (`float`, *optional*, defaults to 0.0):
|
||||
The dropout ratio for the attention probabilities.
|
||||
@@ -73,7 +72,7 @@ class CLIPTextConfig(PretrainedConfig):
|
||||
The dropout probabilitiy for all fully connected layers in the embeddings, encoder, and pooler.
|
||||
initializer_range (`float`, *optional*, defaults to 0.02):
|
||||
The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
|
||||
initializer_factor (`float``, *optional*, defaults to 1):
|
||||
initializer_factor (`float`, *optional*, defaults to 1):
|
||||
A factor for initializing all weight matrices (should be kept to 1, used internally for initialization
|
||||
testing).
|
||||
|
||||
@@ -149,15 +148,14 @@ class CLIPTextConfig(PretrainedConfig):
|
||||
|
||||
class CLIPVisionConfig(PretrainedConfig):
|
||||
r"""
|
||||
This is the configuration class to store the configuration of a [`CLIPModel`]. It is used to instantiate an CLIP
|
||||
model according to the specified arguments, defining the model architecture. Instantiating a configuration with the
|
||||
defaults will yield a similar configuration to that of the CLIP
|
||||
This is the configuration class to store the configuration of a [`CLIPVisionModel`]. It is used to instantiate a
|
||||
CLIP vision encoder according to the specified arguments, defining the model architecture. Instantiating a
|
||||
configuration with the defaults will yield a similar configuration to that of the vision encoder of the CLIP
|
||||
[openai/clip-vit-base-patch32](https://huggingface.co/openai/clip-vit-base-patch32) architecture.
|
||||
|
||||
Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
|
||||
documentation from [`PretrainedConfig`] for more information.
|
||||
|
||||
|
||||
Args:
|
||||
hidden_size (`int`, *optional*, defaults to 768):
|
||||
Dimensionality of the encoder layers and the pooler layer.
|
||||
@@ -181,7 +179,7 @@ class CLIPVisionConfig(PretrainedConfig):
|
||||
The dropout ratio for the attention probabilities.
|
||||
initializer_range (`float`, *optional*, defaults to 0.02):
|
||||
The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
|
||||
initializer_factor (`float``, *optional*, defaults to 1):
|
||||
initializer_factor (`float`, *optional*, defaults to 1):
|
||||
A factor for initializing all weight matrices (should be kept to 1, used internally for initialization
|
||||
testing).
|
||||
|
||||
@@ -258,8 +256,8 @@ class CLIPVisionConfig(PretrainedConfig):
|
||||
class CLIPConfig(PretrainedConfig):
|
||||
r"""
|
||||
[`CLIPConfig`] is the configuration class to store the configuration of a [`CLIPModel`]. It is used to instantiate
|
||||
CLIP model according to the specified arguments, defining the text model and vision model configs. Instantiating a
|
||||
configuration with the defaults will yield a similar configuration to that of the CLIP
|
||||
a CLIP model according to the specified arguments, defining the text model and vision model configs. Instantiating
|
||||
a configuration with the defaults will yield a similar configuration to that of the CLIP
|
||||
[openai/clip-vit-base-patch32](https://huggingface.co/openai/clip-vit-base-patch32) architecture.
|
||||
|
||||
Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
|
||||
|
||||
@@ -70,10 +70,10 @@ class CLIPImageProcessor(BaseImageProcessor):
|
||||
method.
|
||||
do_normalize:
|
||||
Whether to normalize the image. Can be overridden by `do_normalize` in the `preprocess` method.
|
||||
image_mean (`float` or `List[float]`, *optional*, defaults to `IMAGENET_STANDARD_MEAN`):
|
||||
image_mean (`float` or `List[float]`, *optional*, defaults to `[0.48145466, 0.4578275, 0.40821073]`):
|
||||
Mean to use if normalizing the image. This is a float or list of floats the length of the number of
|
||||
channels in the image. Can be overridden by the `image_mean` parameter in the `preprocess` method.
|
||||
image_std (`float` or `List[float]`, *optional*, defaults to `IMAGENET_STANDARD_STD`):
|
||||
image_std (`float` or `List[float]`, *optional*, defaults to `[0.26862954, 0.26130258, 0.27577711]`):
|
||||
Image standard deviation.
|
||||
do_convert_rgb (`bool`, *optional*, defaults to `True`):
|
||||
Standard deviation to use if normalizing the image. This is a float or list of floats the length of the
|
||||
|
||||
64
src/transformers/models/git/__init__.py
Normal file
64
src/transformers/models/git/__init__.py
Normal file
@@ -0,0 +1,64 @@
|
||||
# flake8: noqa
|
||||
# There's no way to ignore "F401 '...' imported but unused" warnings in this
|
||||
# module, but to preserve other warnings. So, don't check this module at all.
|
||||
|
||||
# Copyright 2022 The HuggingFace Team. All rights reserved.
|
||||
#
|
||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||
# you may not use this file except in compliance with the License.
|
||||
# You may obtain a copy of the License at
|
||||
#
|
||||
# http://www.apache.org/licenses/LICENSE-2.0
|
||||
#
|
||||
# Unless required by applicable law or agreed to in writing, software
|
||||
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
|
||||
from typing import TYPE_CHECKING
|
||||
|
||||
from ...utils import OptionalDependencyNotAvailable, _LazyModule, is_torch_available
|
||||
|
||||
|
||||
_import_structure = {
|
||||
"configuration_git": ["GIT_PRETRAINED_CONFIG_ARCHIVE_MAP", "GitConfig", "GitVisionConfig"],
|
||||
"processing_git": ["GitProcessor"],
|
||||
}
|
||||
|
||||
try:
|
||||
if not is_torch_available():
|
||||
raise OptionalDependencyNotAvailable()
|
||||
except OptionalDependencyNotAvailable:
|
||||
pass
|
||||
else:
|
||||
_import_structure["modeling_git"] = [
|
||||
"GIT_PRETRAINED_MODEL_ARCHIVE_LIST",
|
||||
"GitForCausalLM",
|
||||
"GitModel",
|
||||
"GitPreTrainedModel",
|
||||
"GitVisionModel",
|
||||
]
|
||||
|
||||
if TYPE_CHECKING:
|
||||
from .configuration_git import GIT_PRETRAINED_CONFIG_ARCHIVE_MAP, GitConfig, GitVisionConfig
|
||||
from .processing_git import GitProcessor
|
||||
|
||||
try:
|
||||
if not is_torch_available():
|
||||
raise OptionalDependencyNotAvailable()
|
||||
except OptionalDependencyNotAvailable:
|
||||
pass
|
||||
else:
|
||||
from .modeling_git import (
|
||||
GIT_PRETRAINED_MODEL_ARCHIVE_LIST,
|
||||
GitForCausalLM,
|
||||
GitModel,
|
||||
GitPreTrainedModel,
|
||||
GitVisionModel,
|
||||
)
|
||||
|
||||
else:
|
||||
import sys
|
||||
|
||||
sys.modules[__name__] = _LazyModule(__name__, globals()["__file__"], _import_structure, module_spec=__spec__)
|
||||
265
src/transformers/models/git/configuration_git.py
Normal file
265
src/transformers/models/git/configuration_git.py
Normal file
@@ -0,0 +1,265 @@
|
||||
# coding=utf-8
|
||||
# Copyright 2022 The HuggingFace Inc. team. All rights reserved.
|
||||
#
|
||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||
# you may not use this file except in compliance with the License.
|
||||
# You may obtain a copy of the License at
|
||||
#
|
||||
# http://www.apache.org/licenses/LICENSE-2.0
|
||||
#
|
||||
# Unless required by applicable law or agreed to in writing, software
|
||||
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
|
||||
import copy
|
||||
import os
|
||||
from typing import Union
|
||||
|
||||
from ...configuration_utils import PretrainedConfig
|
||||
from ...utils import logging
|
||||
|
||||
|
||||
logger = logging.get_logger(__name__)
|
||||
|
||||
GIT_PRETRAINED_CONFIG_ARCHIVE_MAP = {
|
||||
"microsoft/git-base": "https://huggingface.co/microsoft/git-base/resolve/main/config.json",
|
||||
}
|
||||
|
||||
|
||||
# Copied from transformers.models.clip.configuration_clip.CLIPVisionConfig with CLIPVision->GitVision, CLIP->GIT, clip->git, openai/git-vit-base-patch32->microsoft/git-base, 32->16
|
||||
class GitVisionConfig(PretrainedConfig):
|
||||
r"""
|
||||
This is the configuration class to store the configuration of a [`GitVisionModel`]. It is used to instantiate a GIT
|
||||
vision encoder according to the specified arguments, defining the model architecture. Instantiating a configuration
|
||||
with the defaults will yield a similar configuration to that of the vision encoder of the GIT
|
||||
[microsoft/git-base](https://huggingface.co/microsoft/git-base) architecture.
|
||||
|
||||
Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
|
||||
documentation from [`PretrainedConfig`] for more information.
|
||||
|
||||
Args:
|
||||
hidden_size (`int`, *optional*, defaults to 768):
|
||||
Dimensionality of the encoder layers and the pooler layer.
|
||||
intermediate_size (`int`, *optional*, defaults to 3072):
|
||||
Dimensionality of the "intermediate" (i.e., feed-forward) layer in the Transformer encoder.
|
||||
num_hidden_layers (`int`, *optional*, defaults to 12):
|
||||
Number of hidden layers in the Transformer encoder.
|
||||
num_attention_heads (`int`, *optional*, defaults to 12):
|
||||
Number of attention heads for each attention layer in the Transformer encoder.
|
||||
image_size (`int`, *optional*, defaults to 224):
|
||||
The size (resolution) of each image.
|
||||
patch_size (`int`, *optional*, defaults to 16):
|
||||
The size (resolution) of each patch.
|
||||
hidden_act (`str` or `function`, *optional*, defaults to `"quick_gelu"`):
|
||||
The non-linear activation function (function or string) in the encoder and pooler. If string, `"gelu"`,
|
||||
`"relu"`, `"selu"` and `"gelu_new"` ``"quick_gelu"` are supported. layer_norm_eps (`float`, *optional*,
|
||||
defaults to 1e-5): The epsilon used by the layer normalization layers.
|
||||
dropout (`float`, *optional*, defaults to 0.0):
|
||||
The dropout probabilitiy for all fully connected layers in the embeddings, encoder, and pooler.
|
||||
attention_dropout (`float`, *optional*, defaults to 0.0):
|
||||
The dropout ratio for the attention probabilities.
|
||||
initializer_range (`float`, *optional*, defaults to 0.02):
|
||||
The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
|
||||
initializer_factor (`float`, *optional*, defaults to 1):
|
||||
A factor for initializing all weight matrices (should be kept to 1, used internally for initialization
|
||||
testing).
|
||||
|
||||
Example:
|
||||
|
||||
```python
|
||||
>>> from transformers import GitVisionConfig, GitVisionModel
|
||||
|
||||
>>> # Initializing a GitVisionConfig with microsoft/git-base style configuration
|
||||
>>> configuration = GitVisionConfig()
|
||||
|
||||
>>> # Initializing a GitVisionModel (with random weights) from the microsoft/git-base style configuration
|
||||
>>> model = GitVisionModel(configuration)
|
||||
|
||||
>>> # Accessing the model configuration
|
||||
>>> configuration = model.config
|
||||
```"""
|
||||
|
||||
model_type = "git_vision_model"
|
||||
|
||||
def __init__(
|
||||
self,
|
||||
hidden_size=768,
|
||||
intermediate_size=3072,
|
||||
projection_dim=512,
|
||||
num_hidden_layers=12,
|
||||
num_attention_heads=12,
|
||||
num_channels=3,
|
||||
image_size=224,
|
||||
patch_size=16,
|
||||
hidden_act="quick_gelu",
|
||||
layer_norm_eps=0.00001,
|
||||
dropout=0.0,
|
||||
attention_dropout=0.0,
|
||||
initializer_range=0.02,
|
||||
initializer_factor=1.0,
|
||||
**kwargs
|
||||
):
|
||||
super().__init__(**kwargs)
|
||||
|
||||
self.hidden_size = hidden_size
|
||||
self.intermediate_size = intermediate_size
|
||||
self.projection_dim = projection_dim
|
||||
self.dropout = dropout
|
||||
self.num_hidden_layers = num_hidden_layers
|
||||
self.num_attention_heads = num_attention_heads
|
||||
self.num_channels = num_channels
|
||||
self.patch_size = patch_size
|
||||
self.image_size = image_size
|
||||
self.initializer_range = initializer_range
|
||||
self.initializer_factor = initializer_factor
|
||||
self.attention_dropout = attention_dropout
|
||||
self.layer_norm_eps = layer_norm_eps
|
||||
self.hidden_act = hidden_act
|
||||
|
||||
@classmethod
|
||||
def from_pretrained(cls, pretrained_model_name_or_path: Union[str, os.PathLike], **kwargs) -> "PretrainedConfig":
|
||||
|
||||
config_dict, kwargs = cls.get_config_dict(pretrained_model_name_or_path, **kwargs)
|
||||
|
||||
# get the vision config dict if we are loading from GITConfig
|
||||
if config_dict.get("model_type") == "git":
|
||||
config_dict = config_dict["vision_config"]
|
||||
|
||||
if "model_type" in config_dict and hasattr(cls, "model_type") and config_dict["model_type"] != cls.model_type:
|
||||
logger.warning(
|
||||
f"You are using a model of type {config_dict['model_type']} to instantiate a model of type "
|
||||
f"{cls.model_type}. This is not supported for all configurations of models and can yield errors."
|
||||
)
|
||||
|
||||
return cls.from_dict(config_dict, **kwargs)
|
||||
|
||||
|
||||
class GitConfig(PretrainedConfig):
|
||||
r"""
|
||||
This is the configuration class to store the configuration of a [`GitModel`]. It is used to instantiate a GIT model
|
||||
according to the specified arguments, defining the model architecture. Instantiating a configuration with the
|
||||
defaults will yield a similar configuration to that of the GIT
|
||||
[microsoft/git-base](https://huggingface.co/microsoft/git-base) architecture.
|
||||
|
||||
Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
|
||||
documentation from [`PretrainedConfig`] for more information.
|
||||
|
||||
Args:
|
||||
vision_config (`dict`, *optional*):
|
||||
Dictionary of configuration options used to initialize [`GitVisionConfig`].
|
||||
vocab_size (`int`, *optional*, defaults to 30522):
|
||||
Vocabulary size of the GIT model. Defines the number of different tokens that can be represented by the
|
||||
`inputs_ids` passed when calling [`GitModel`].
|
||||
hidden_size (`int`, *optional*, defaults to 768):
|
||||
Dimensionality of the encoder layers and the pooler layer.
|
||||
num_hidden_layers (`int`, *optional*, defaults to 6):
|
||||
Number of hidden layers in the Transformer encoder.
|
||||
num_attention_heads (`int`, *optional*, defaults to 12):
|
||||
Number of attention heads for each attention layer in the Transformer encoder.
|
||||
intermediate_size (`int`, *optional*, defaults to 3072):
|
||||
Dimensionality of the "intermediate" (often named feed-forward) layer in the Transformer encoder.
|
||||
hidden_act (`str` or `Callable`, *optional*, defaults to `"gelu"`):
|
||||
The non-linear activation function (function or string) in the encoder and pooler. If string, `"gelu"`,
|
||||
`"relu"`, `"silu"` and `"gelu_new"` are supported.
|
||||
hidden_dropout_prob (`float`, *optional*, defaults to 0.1):
|
||||
The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
|
||||
attention_probs_dropout_prob (`float`, *optional*, defaults to 0.1):
|
||||
The dropout ratio for the attention probabilities.
|
||||
max_position_embeddings (`int`, *optional*, defaults to 1024):
|
||||
The maximum sequence length that this model might ever be used with. Typically set this to something large
|
||||
just in case (e.g., 512 or 1024 or 2048).
|
||||
initializer_range (`float`, *optional*, defaults to 0.02):
|
||||
The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
|
||||
layer_norm_eps (`float`, *optional*, defaults to 1e-12):
|
||||
The epsilon used by the layer normalization layers.
|
||||
position_embedding_type (`str`, *optional*, defaults to `"absolute"`):
|
||||
Type of position embedding. Choose one of `"absolute"`, `"relative_key"`, `"relative_key_query"`. For
|
||||
positional embeddings use `"absolute"`. For more information on `"relative_key"`, please refer to
|
||||
[Self-Attention with Relative Position Representations (Shaw et al.)](https://arxiv.org/abs/1803.02155).
|
||||
For more information on `"relative_key_query"`, please refer to *Method 4* in [Improve Transformer Models
|
||||
with Better Relative Position Embeddings (Huang et al.)](https://arxiv.org/abs/2009.13658).
|
||||
use_cache (`bool`, *optional*, defaults to `True`):
|
||||
Whether or not the model should return the last key/values attentions (not used by all models).
|
||||
classifier_dropout (`float`, *optional*):
|
||||
The dropout ratio for the classification head.
|
||||
num_image_with_embedding (`int`, *optional*):
|
||||
The number of temporal embeddings to add, in case the model is used for video captioning/VQA.
|
||||
|
||||
Examples:
|
||||
|
||||
```python
|
||||
>>> from transformers import GitConfig, GitModel
|
||||
|
||||
>>> # Initializing a GIT microsoft/git-base style configuration
|
||||
>>> configuration = GitConfig()
|
||||
|
||||
>>> # Initializing a model (with random weights) from the microsoft/git-base style configuration
|
||||
>>> model = GitModel(configuration)
|
||||
|
||||
>>> # Accessing the model configuration
|
||||
>>> configuration = model.config
|
||||
```"""
|
||||
model_type = "git"
|
||||
|
||||
def __init__(
|
||||
self,
|
||||
vision_config=None,
|
||||
vocab_size=30522,
|
||||
hidden_size=768,
|
||||
num_hidden_layers=6,
|
||||
num_attention_heads=12,
|
||||
intermediate_size=3072,
|
||||
hidden_act="gelu",
|
||||
hidden_dropout_prob=0.1,
|
||||
attention_probs_dropout_prob=0.1,
|
||||
max_position_embeddings=1024,
|
||||
initializer_range=0.02,
|
||||
layer_norm_eps=1e-12,
|
||||
pad_token_id=0,
|
||||
position_embedding_type="absolute",
|
||||
use_cache=True,
|
||||
classifier_dropout=None,
|
||||
tie_word_embeddings=False,
|
||||
bos_token_id=101,
|
||||
eos_token_id=102,
|
||||
num_image_with_embedding=None,
|
||||
**kwargs
|
||||
):
|
||||
super().__init__(bos_token_id=bos_token_id, eos_token_id=eos_token_id, pad_token_id=pad_token_id, **kwargs)
|
||||
|
||||
if vision_config is None:
|
||||
vision_config = {}
|
||||
logger.info("vision_config is None. initializing the GitVisionConfig with default values.")
|
||||
|
||||
self.vision_config = GitVisionConfig(**vision_config)
|
||||
self.vocab_size = vocab_size
|
||||
self.hidden_size = hidden_size
|
||||
self.num_hidden_layers = num_hidden_layers
|
||||
self.num_attention_heads = num_attention_heads
|
||||
self.hidden_act = hidden_act
|
||||
self.intermediate_size = intermediate_size
|
||||
self.hidden_dropout_prob = hidden_dropout_prob
|
||||
self.attention_probs_dropout_prob = attention_probs_dropout_prob
|
||||
self.max_position_embeddings = max_position_embeddings
|
||||
self.initializer_range = initializer_range
|
||||
self.layer_norm_eps = layer_norm_eps
|
||||
self.position_embedding_type = position_embedding_type
|
||||
self.use_cache = use_cache
|
||||
self.classifier_dropout = classifier_dropout
|
||||
self.tie_word_embeddings = tie_word_embeddings
|
||||
self.num_image_with_embedding = num_image_with_embedding
|
||||
|
||||
self.bos_token_id = bos_token_id
|
||||
self.eos_token_id = eos_token_id
|
||||
|
||||
def to_dict(self):
|
||||
"""
|
||||
Serializes this instance to a Python dictionary. Override the default [`~PretrainedConfig.to_dict`]. Returns:
|
||||
`Dict[str, any]`: Dictionary of all the attributes that make up this configuration instance,
|
||||
"""
|
||||
output = copy.deepcopy(self.__dict__)
|
||||
output["vision_config"] = self.vision_config.to_dict()
|
||||
output["model_type"] = self.__class__.model_type
|
||||
return output
|
||||
404
src/transformers/models/git/convert_git_to_pytorch.py
Normal file
404
src/transformers/models/git/convert_git_to_pytorch.py
Normal file
@@ -0,0 +1,404 @@
|
||||
# coding=utf-8
|
||||
# Copyright 2022 The HuggingFace Inc. team.
|
||||
#
|
||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||
# you may not use this file except in compliance with the License.
|
||||
# You may obtain a copy of the License at
|
||||
#
|
||||
# http://www.apache.org/licenses/LICENSE-2.0
|
||||
#
|
||||
# Unless required by applicable law or agreed to in writing, software
|
||||
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
"""Convert GIT checkpoints from the original repository.
|
||||
|
||||
URL: https://github.com/microsoft/GenerativeImage2Text/tree/main"""
|
||||
|
||||
|
||||
import argparse
|
||||
from pathlib import Path
|
||||
|
||||
import numpy as np
|
||||
import torch
|
||||
from PIL import Image
|
||||
from torchvision.transforms import CenterCrop, Compose, Normalize, Resize, ToTensor
|
||||
|
||||
import requests
|
||||
from huggingface_hub import hf_hub_download
|
||||
from transformers import (
|
||||
AutoTokenizer,
|
||||
CLIPImageProcessor,
|
||||
GitConfig,
|
||||
GitForCausalLM,
|
||||
GitProcessor,
|
||||
GitVisionConfig,
|
||||
VideoMAEImageProcessor,
|
||||
)
|
||||
from transformers.utils import logging
|
||||
|
||||
|
||||
logging.set_verbosity_info()
|
||||
logger = logging.get_logger(__name__)
|
||||
|
||||
|
||||
def get_git_config(model_name):
|
||||
if "base" in model_name and "vqa" in model_name:
|
||||
image_size = 480
|
||||
elif "large" in model_name and "vqa" in model_name:
|
||||
image_size = 420
|
||||
else:
|
||||
image_size = 224
|
||||
|
||||
vision_config = GitVisionConfig(image_size=image_size)
|
||||
|
||||
if "large" in model_name:
|
||||
vision_config.patch_size = 14
|
||||
vision_config.hidden_size = 1024
|
||||
vision_config.intermediate_size = 4096
|
||||
vision_config.num_hidden_layers = 24
|
||||
vision_config.num_attention_heads = 16
|
||||
|
||||
is_video = "vatex" in model_name or "msrvtt" in model_name
|
||||
num_image_with_embedding = 6 if is_video else None
|
||||
config = GitConfig(vision_config=vision_config.to_dict(), num_image_with_embedding=num_image_with_embedding)
|
||||
|
||||
return config, image_size, is_video
|
||||
|
||||
|
||||
# here we list all keys to be renamed (original name on the left, our name on the right)
|
||||
def create_rename_keys(config, prefix=""):
|
||||
rename_keys = []
|
||||
|
||||
# image encoder
|
||||
# ftm: off
|
||||
rename_keys.append(
|
||||
(f"{prefix}image_encoder.class_embedding", "git.image_encoder.vision_model.embeddings.class_embedding")
|
||||
)
|
||||
rename_keys.append(
|
||||
(
|
||||
f"{prefix}image_encoder.positional_embedding",
|
||||
"git.image_encoder.vision_model.embeddings.position_embedding.weight",
|
||||
)
|
||||
)
|
||||
rename_keys.append(
|
||||
(f"{prefix}image_encoder.conv1.weight", "git.image_encoder.vision_model.embeddings.patch_embedding.weight")
|
||||
)
|
||||
rename_keys.append((f"{prefix}image_encoder.ln_pre.weight", "git.image_encoder.vision_model.pre_layrnorm.weight"))
|
||||
rename_keys.append((f"{prefix}image_encoder.ln_pre.bias", "git.image_encoder.vision_model.pre_layrnorm.bias"))
|
||||
rename_keys.append(
|
||||
(f"{prefix}image_encoder.ln_post.weight", "git.image_encoder.vision_model.post_layernorm.weight")
|
||||
)
|
||||
rename_keys.append((f"{prefix}image_encoder.ln_post.bias", "git.image_encoder.vision_model.post_layernorm.bias"))
|
||||
# fmt: on
|
||||
rename_keys.append((f"{prefix}image_encoder.proj", "git.image_encoder.visual_projection.weight"))
|
||||
|
||||
# fmt: off
|
||||
for i in range(config.vision_config.num_hidden_layers):
|
||||
# image encoder layers: output projection, 2 feedforward neural networks and 2 layernorms
|
||||
rename_keys.append((f"{prefix}image_encoder.transformer.resblocks.{i}.attn.out_proj.weight", f"git.image_encoder.vision_model.encoder.layers.{i}.self_attn.out_proj.weight"))
|
||||
rename_keys.append((f"{prefix}image_encoder.transformer.resblocks.{i}.attn.out_proj.bias", f"git.image_encoder.vision_model.encoder.layers.{i}.self_attn.out_proj.bias"))
|
||||
rename_keys.append((f"{prefix}image_encoder.transformer.resblocks.{i}.ln_1.weight", f"git.image_encoder.vision_model.encoder.layers.{i}.layer_norm1.weight"))
|
||||
rename_keys.append((f"{prefix}image_encoder.transformer.resblocks.{i}.ln_1.bias", f"git.image_encoder.vision_model.encoder.layers.{i}.layer_norm1.bias"))
|
||||
rename_keys.append((f"{prefix}image_encoder.transformer.resblocks.{i}.mlp.c_fc.weight", f"git.image_encoder.vision_model.encoder.layers.{i}.mlp.fc1.weight"))
|
||||
rename_keys.append((f"{prefix}image_encoder.transformer.resblocks.{i}.mlp.c_fc.bias", f"git.image_encoder.vision_model.encoder.layers.{i}.mlp.fc1.bias"))
|
||||
rename_keys.append((f"{prefix}image_encoder.transformer.resblocks.{i}.mlp.c_proj.weight", f"git.image_encoder.vision_model.encoder.layers.{i}.mlp.fc2.weight"))
|
||||
rename_keys.append((f"{prefix}image_encoder.transformer.resblocks.{i}.mlp.c_proj.bias", f"git.image_encoder.vision_model.encoder.layers.{i}.mlp.fc2.bias"))
|
||||
rename_keys.append((f"{prefix}image_encoder.transformer.resblocks.{i}.ln_2.weight", f"git.image_encoder.vision_model.encoder.layers.{i}.layer_norm2.weight"))
|
||||
rename_keys.append((f"{prefix}image_encoder.transformer.resblocks.{i}.ln_2.bias", f"git.image_encoder.vision_model.encoder.layers.{i}.layer_norm2.bias"))
|
||||
# fmt: on
|
||||
|
||||
# text decoder
|
||||
# fmt: off
|
||||
rename_keys.append((f"{prefix}textual.embedding.words.weight", "git.embeddings.word_embeddings.weight"))
|
||||
rename_keys.append((f"{prefix}textual.embedding.positions.weight", "git.embeddings.position_embeddings.weight"))
|
||||
rename_keys.append((f"{prefix}textual.visual_projection.0.weight", "git.visual_projection.visual_projection.0.weight"))
|
||||
rename_keys.append((f"{prefix}textual.visual_projection.0.bias", "git.visual_projection.visual_projection.0.bias"))
|
||||
rename_keys.append((f"{prefix}textual.visual_projection.1.weight", "git.visual_projection.visual_projection.1.weight"))
|
||||
rename_keys.append((f"{prefix}textual.visual_projection.1.bias", "git.visual_projection.visual_projection.1.bias"))
|
||||
|
||||
rename_keys.append((f"{prefix}textual.embedding.layer_norm.weight", "git.embeddings.LayerNorm.weight"))
|
||||
rename_keys.append((f"{prefix}textual.embedding.layer_norm.bias", "git.embeddings.LayerNorm.bias"))
|
||||
rename_keys.append((f"{prefix}textual.output.weight", "output.weight"))
|
||||
rename_keys.append((f"{prefix}textual.output.bias", "output.bias"))
|
||||
for i in range(config.num_hidden_layers):
|
||||
rename_keys.append((f"{prefix}textual.transformer.encoder.layer.{i}.attention.self.query.weight", f"git.encoder.layer.{i}.attention.self.query.weight"))
|
||||
rename_keys.append((f"{prefix}textual.transformer.encoder.layer.{i}.attention.self.query.bias", f"git.encoder.layer.{i}.attention.self.query.bias"))
|
||||
rename_keys.append((f"{prefix}textual.transformer.encoder.layer.{i}.attention.self.key.weight", f"git.encoder.layer.{i}.attention.self.key.weight"))
|
||||
rename_keys.append((f"{prefix}textual.transformer.encoder.layer.{i}.attention.self.key.bias", f"git.encoder.layer.{i}.attention.self.key.bias"))
|
||||
rename_keys.append((f"{prefix}textual.transformer.encoder.layer.{i}.attention.self.value.weight", f"git.encoder.layer.{i}.attention.self.value.weight"))
|
||||
rename_keys.append((f"{prefix}textual.transformer.encoder.layer.{i}.attention.self.value.bias", f"git.encoder.layer.{i}.attention.self.value.bias"))
|
||||
rename_keys.append((f"{prefix}textual.transformer.encoder.layer.{i}.attention.output.dense.weight", f"git.encoder.layer.{i}.attention.output.dense.weight"))
|
||||
rename_keys.append((f"{prefix}textual.transformer.encoder.layer.{i}.attention.output.dense.bias", f"git.encoder.layer.{i}.attention.output.dense.bias"))
|
||||
rename_keys.append((f"{prefix}textual.transformer.encoder.layer.{i}.attention.output.LayerNorm.weight", f"git.encoder.layer.{i}.attention.output.LayerNorm.weight"))
|
||||
rename_keys.append((f"{prefix}textual.transformer.encoder.layer.{i}.attention.output.LayerNorm.bias", f"git.encoder.layer.{i}.attention.output.LayerNorm.bias"))
|
||||
rename_keys.append((f"{prefix}textual.transformer.encoder.layer.{i}.intermediate.dense.weight", f"git.encoder.layer.{i}.intermediate.dense.weight"))
|
||||
rename_keys.append((f"{prefix}textual.transformer.encoder.layer.{i}.intermediate.dense.bias", f"git.encoder.layer.{i}.intermediate.dense.bias"))
|
||||
rename_keys.append((f"{prefix}textual.transformer.encoder.layer.{i}.output.dense.weight", f"git.encoder.layer.{i}.output.dense.weight"))
|
||||
rename_keys.append((f"{prefix}textual.transformer.encoder.layer.{i}.output.dense.bias", f"git.encoder.layer.{i}.output.dense.bias"))
|
||||
rename_keys.append((f"{prefix}textual.transformer.encoder.layer.{i}.output.LayerNorm.weight", f"git.encoder.layer.{i}.output.LayerNorm.weight"))
|
||||
rename_keys.append((f"{prefix}textual.transformer.encoder.layer.{i}.output.LayerNorm.bias", f"git.encoder.layer.{i}.output.LayerNorm.bias"))
|
||||
# fmt: on
|
||||
|
||||
if config.num_image_with_embedding is not None:
|
||||
rename_keys.append(("img_temperal_embedding.0", "git.img_temperal_embedding.0"))
|
||||
rename_keys.append(("img_temperal_embedding.1", "git.img_temperal_embedding.1"))
|
||||
rename_keys.append(("img_temperal_embedding.2", "git.img_temperal_embedding.2"))
|
||||
rename_keys.append(("img_temperal_embedding.3", "git.img_temperal_embedding.3"))
|
||||
rename_keys.append(("img_temperal_embedding.4", "git.img_temperal_embedding.4"))
|
||||
rename_keys.append(("img_temperal_embedding.5", "git.img_temperal_embedding.5"))
|
||||
|
||||
return rename_keys
|
||||
|
||||
|
||||
def rename_key(dct, old, new):
|
||||
val = dct.pop(old)
|
||||
dct[new] = val.T if "image_encoder.visual_projection" in new else val
|
||||
|
||||
|
||||
# we split up the matrix of each CLIP encoder layer into queries, keys and values
|
||||
def read_in_q_k_v(state_dict, config, prefix=""):
|
||||
dim = config.vision_config.hidden_size
|
||||
for i in range(config.vision_config.num_hidden_layers):
|
||||
# read in weights + bias of input projection layer (in the original implementation, this is a single matrix + bias)
|
||||
in_proj_weight = state_dict.pop(f"{prefix}image_encoder.transformer.resblocks.{i}.attn.in_proj_weight")
|
||||
in_proj_bias = state_dict.pop(f"{prefix}image_encoder.transformer.resblocks.{i}.attn.in_proj_bias")
|
||||
# next, add query, keys and values (in that order) to the state dict
|
||||
state_dict[f"git.image_encoder.vision_model.encoder.layers.{i}.self_attn.q_proj.weight"] = in_proj_weight[
|
||||
:dim, :
|
||||
]
|
||||
state_dict[f"git.image_encoder.vision_model.encoder.layers.{i}.self_attn.q_proj.bias"] = in_proj_bias[:dim]
|
||||
state_dict[f"git.image_encoder.vision_model.encoder.layers.{i}.self_attn.k_proj.weight"] = in_proj_weight[
|
||||
dim : dim * 2, :
|
||||
]
|
||||
state_dict[f"git.image_encoder.vision_model.encoder.layers.{i}.self_attn.k_proj.bias"] = in_proj_bias[
|
||||
dim : dim * 2
|
||||
]
|
||||
state_dict[f"git.image_encoder.vision_model.encoder.layers.{i}.self_attn.v_proj.weight"] = in_proj_weight[
|
||||
-dim:, :
|
||||
]
|
||||
state_dict[f"git.image_encoder.vision_model.encoder.layers.{i}.self_attn.v_proj.bias"] = in_proj_bias[-dim:]
|
||||
|
||||
|
||||
# We will verify our results on an image
|
||||
def prepare_img(model_name):
|
||||
if "textvqa" in model_name:
|
||||
filepath = hf_hub_download(repo_id="nielsr/textvqa-sample", filename="bus.png", repo_type="dataset")
|
||||
image = Image.open(filepath).convert("RGB")
|
||||
else:
|
||||
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
|
||||
image = Image.open(requests.get(url, stream=True).raw)
|
||||
|
||||
return image
|
||||
|
||||
|
||||
def prepare_video():
|
||||
from decord import VideoReader, cpu
|
||||
|
||||
# set seed for reproducability
|
||||
np.random.seed(0)
|
||||
|
||||
def sample_frame_indices(clip_len, frame_sample_rate, seg_len):
|
||||
converted_len = int(clip_len * frame_sample_rate)
|
||||
end_idx = np.random.randint(converted_len, seg_len)
|
||||
start_idx = end_idx - converted_len
|
||||
indices = np.linspace(start_idx, end_idx, num=clip_len)
|
||||
indices = np.clip(indices, start_idx, end_idx - 1).astype(np.int64)
|
||||
return indices
|
||||
|
||||
# video clip consists of 300 frames (10 seconds at 30 FPS)
|
||||
file_path = hf_hub_download(repo_id="nielsr/video-demo", filename="eating_spaghetti.mp4", repo_type="dataset")
|
||||
videoreader = VideoReader(file_path, num_threads=1, ctx=cpu(0))
|
||||
|
||||
# sample 6 frames
|
||||
videoreader.seek(0)
|
||||
indices = sample_frame_indices(clip_len=6, frame_sample_rate=4, seg_len=len(videoreader))
|
||||
video = videoreader.get_batch(indices).asnumpy()
|
||||
|
||||
return video
|
||||
|
||||
|
||||
@torch.no_grad()
|
||||
def convert_git_checkpoint(model_name, pytorch_dump_folder_path, push_to_hub=False):
|
||||
"""
|
||||
Copy/paste/tweak model's weights to our GIT structure.
|
||||
"""
|
||||
|
||||
model_name_to_url = {
|
||||
"git-base": "https://publicgit.blob.core.windows.net/data/output/GIT_BASE/snapshot/model.pt",
|
||||
"git-base-coco": "https://publicgit.blob.core.windows.net/data/output/GIT_BASE_COCO/snapshot/model.pt",
|
||||
"git-base-textcaps": "https://publicgit.blob.core.windows.net/data/output/GIT_BASE_TEXTCAPS/snapshot/model.pt",
|
||||
"git-base-vqav2": "https://publicgit.blob.core.windows.net/data/output/GIT_BASE_VQAv2/snapshot/model.pt",
|
||||
"git-base-textvqa": "https://publicgit.blob.core.windows.net/data/output/GIT_BASE_TEXTVQA/snapshot/model.pt", # todo
|
||||
"git-base-vatex": "https://publicgit.blob.core.windows.net/data/output/GIT_BASE_VATEX/snapshot/model.pt",
|
||||
"git-base-msrvtt-qa": (
|
||||
"https://publicgit.blob.core.windows.net/data/output/GIT_BASE_MSRVTT_QA/snapshot/model.pt"
|
||||
),
|
||||
"git-large": "https://publicgit.blob.core.windows.net/data/output/GIT_LARGE/snapshot/model.pt",
|
||||
"git-large-coco": "https://publicgit.blob.core.windows.net/data/output/GIT_LARGE_COCO/snapshot/model.pt",
|
||||
"git-large-textcaps": (
|
||||
"https://publicgit.blob.core.windows.net/data/output/GIT_LARGE_TEXTCAPS/snapshot/model.pt"
|
||||
),
|
||||
"git-large-vqav2": "https://publicgit.blob.core.windows.net/data/output/GIT_LARGE_VQAv2/snapshot/model.pt",
|
||||
"git-large-textvqa": "https://publicgit.blob.core.windows.net/data/output/GIT_LARGE_TEXTVQA/snapshot/model.pt",
|
||||
"git-large-vatex": "https://publicgit.blob.core.windows.net/data/output/GIT_LARGE_VATEX/snapshot/model.pt",
|
||||
"git-large-msrvtt-qa": (
|
||||
"https://publicgit.blob.core.windows.net/data/output/GIT_LARGE_MSRVTT_QA/snapshot/model.pt"
|
||||
),
|
||||
}
|
||||
|
||||
model_name_to_path = {
|
||||
"git-large": "/Users/nielsrogge/Documents/GIT/git_large_model.pt",
|
||||
"git-large-coco": "/Users/nielsrogge/Documents/GIT/git_large_coco_model.pt",
|
||||
"git-large-textcaps": "/Users/nielsrogge/Documents/GIT/git_large_textcaps_model.pt",
|
||||
"git-large-vqav2": "/Users/nielsrogge/Documents/GIT/git_large_vqav2_model.pt",
|
||||
"git-large-textvqa": "/Users/nielsrogge/Documents/GIT/git_large_textvqa_model.pt",
|
||||
}
|
||||
|
||||
# define GIT configuration based on model name
|
||||
config, image_size, is_video = get_git_config(model_name)
|
||||
if "large" in model_name and not is_video:
|
||||
# large checkpoints take way too long to download
|
||||
checkpoint_path = model_name_to_path[model_name]
|
||||
state_dict = torch.load(checkpoint_path, map_location="cpu")["model"]
|
||||
else:
|
||||
checkpoint_url = model_name_to_url[model_name]
|
||||
state_dict = torch.hub.load_state_dict_from_url(checkpoint_url, map_location="cpu", file_name=model_name)[
|
||||
"model"
|
||||
]
|
||||
# rename keys
|
||||
prefix = "module." if model_name == "git-base" else ""
|
||||
rename_keys = create_rename_keys(config, prefix=prefix)
|
||||
for src, dest in rename_keys:
|
||||
rename_key(state_dict, src, dest)
|
||||
read_in_q_k_v(state_dict, config, prefix=prefix)
|
||||
|
||||
# load HuggingFace model
|
||||
model = GitForCausalLM(config)
|
||||
missing_keys, unexpected_keys = model.load_state_dict(state_dict, strict=False)
|
||||
model.eval()
|
||||
|
||||
print("Missing keys:", missing_keys)
|
||||
print("Unexpected keys:", unexpected_keys)
|
||||
|
||||
assert missing_keys == ["git.embeddings.position_ids", "git.image_encoder.vision_model.embeddings.position_ids"]
|
||||
assert unexpected_keys == ["git.image_encoder.visual_projection.weight"]
|
||||
|
||||
# verify results
|
||||
image_processor = (
|
||||
VideoMAEImageProcessor(
|
||||
size={"shortest_edge": image_size}, crop_size={"height": image_size, "width": image_size}
|
||||
)
|
||||
if is_video
|
||||
else CLIPImageProcessor(
|
||||
size={"shortest_edge": image_size}, crop_size={"height": image_size, "width": image_size}
|
||||
)
|
||||
)
|
||||
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased", model_input_names=["input_ids", "attention_mask"])
|
||||
processor = GitProcessor(tokenizer=tokenizer, image_processor=image_processor)
|
||||
|
||||
if is_video:
|
||||
video = prepare_video()
|
||||
pixel_values = processor(images=list(video), return_tensors="pt").pixel_values
|
||||
else:
|
||||
image = prepare_img(model_name)
|
||||
image_transforms = Compose(
|
||||
[
|
||||
Resize(image_size, interpolation=Image.BICUBIC),
|
||||
CenterCrop(image_size),
|
||||
ToTensor(),
|
||||
Normalize((0.48145466, 0.4578275, 0.40821073), (0.26862954, 0.26130258, 0.27577711)),
|
||||
]
|
||||
)
|
||||
original_pixel_values = image_transforms(image).unsqueeze(0)
|
||||
pixel_values = processor(images=image, return_tensors="pt").pixel_values
|
||||
|
||||
assert torch.allclose(pixel_values, original_pixel_values)
|
||||
|
||||
input_ids = torch.tensor([[101]])
|
||||
outputs = model(input_ids, pixel_values=pixel_values)
|
||||
logits = outputs.logits
|
||||
print("Logits:", logits[0, -1, :3])
|
||||
|
||||
if model_name == "git-base":
|
||||
expected_slice_logits = torch.tensor([-1.2832, -1.2835, -1.2840])
|
||||
elif model_name == "git-base-coco":
|
||||
expected_slice_logits = torch.tensor([-0.9925, -0.9930, -0.9935])
|
||||
elif model_name == "git-base-textcaps":
|
||||
expected_slice_logits = torch.tensor([-1.2980, -1.2983, -1.2985])
|
||||
elif model_name == "git-base-vqav2":
|
||||
expected_slice_logits = torch.tensor([-0.8570, -0.8568, -0.8561])
|
||||
elif model_name == "git-base-textvqa":
|
||||
expected_slice_logits = torch.tensor([-1.4085, -1.4083, -1.4082])
|
||||
elif model_name == "git-base-vatex":
|
||||
expected_slice_logits = torch.tensor([-1.3451, -1.3447, -1.3447])
|
||||
elif model_name == "git-base-msrvtt-qa":
|
||||
expected_slice_logits = torch.tensor([-0.8554, -0.8550, -0.8540])
|
||||
elif model_name == "git-large":
|
||||
expected_slice_logits = torch.tensor([-1.1708, -1.1707, -1.1705])
|
||||
elif model_name == "git-large-coco":
|
||||
expected_slice_logits = torch.tensor([-1.0425, -1.0423, -1.0422])
|
||||
elif model_name == "git-large-textcaps":
|
||||
expected_slice_logits = torch.tensor([-1.2705, -1.2708, -1.2706])
|
||||
elif model_name == "git-large-vqav2":
|
||||
expected_slice_logits = torch.tensor([-0.7042, -0.7043, -0.7043])
|
||||
elif model_name == "git-large-textvqa":
|
||||
expected_slice_logits = torch.tensor([-0.8590, -0.8592, -0.8590])
|
||||
elif model_name == "git-large-vatex":
|
||||
expected_slice_logits = torch.tensor([-1.0113, -1.0114, -1.0113])
|
||||
elif model_name == "git-large-msrvtt-qa":
|
||||
expected_slice_logits = torch.tensor([0.0130, 0.0134, 0.0131])
|
||||
|
||||
assert torch.allclose(logits[0, -1, :3], expected_slice_logits, atol=1e-4)
|
||||
print("Looks ok!")
|
||||
|
||||
prompt = ""
|
||||
if "textvqa" in model_name:
|
||||
prompt = "what does the front of the bus say at the top?"
|
||||
elif "msrvtt-qa" in model_name:
|
||||
prompt = "what does the woman eat?"
|
||||
elif "vqa" in model_name:
|
||||
prompt = "what are the cats doing?"
|
||||
input_ids = tokenizer(prompt, add_special_tokens=False).input_ids
|
||||
input_ids = [processor.tokenizer.cls_token_id] + input_ids
|
||||
input_ids = torch.tensor(input_ids).unsqueeze(0)
|
||||
print("Generating caption...")
|
||||
generated_ids = model.generate(pixel_values=pixel_values, input_ids=input_ids, max_length=50)
|
||||
print("Generated caption:", processor.batch_decode(generated_ids, skip_special_tokens=True))
|
||||
|
||||
if pytorch_dump_folder_path is not None:
|
||||
Path(pytorch_dump_folder_path).mkdir(exist_ok=True)
|
||||
print(f"Saving model and processor of {model_name} to {pytorch_dump_folder_path}")
|
||||
model.save_pretrained(pytorch_dump_folder_path)
|
||||
processor.save_pretrained(pytorch_dump_folder_path)
|
||||
|
||||
if push_to_hub:
|
||||
print(f"Pushing model and processor of {model_name} to the hub...")
|
||||
model.push_to_hub(f"microsoft/{model_name}")
|
||||
processor.push_to_hub(f"microsoft/{model_name}")
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
parser = argparse.ArgumentParser()
|
||||
# Required parameters
|
||||
parser.add_argument(
|
||||
"--model_name",
|
||||
default="git-base",
|
||||
type=str,
|
||||
help="Name of the model you'd like to convert.",
|
||||
)
|
||||
parser.add_argument(
|
||||
"--pytorch_dump_folder_path",
|
||||
default=None,
|
||||
type=str,
|
||||
help="Path to the output PyTorch model directory.",
|
||||
)
|
||||
parser.add_argument(
|
||||
"--push_to_hub",
|
||||
action="store_true",
|
||||
help="Whether to push the model to the hub.",
|
||||
)
|
||||
|
||||
args = parser.parse_args()
|
||||
convert_git_checkpoint(args.model_name, args.pytorch_dump_folder_path, args.push_to_hub)
|
||||
1532
src/transformers/models/git/modeling_git.py
Normal file
1532
src/transformers/models/git/modeling_git.py
Normal file
File diff suppressed because it is too large
Load Diff
113
src/transformers/models/git/processing_git.py
Normal file
113
src/transformers/models/git/processing_git.py
Normal file
@@ -0,0 +1,113 @@
|
||||
# coding=utf-8
|
||||
# Copyright 2022 The HuggingFace Inc. team.
|
||||
#
|
||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||
# you may not use this file except in compliance with the License.
|
||||
# You may obtain a copy of the License at
|
||||
#
|
||||
# http://www.apache.org/licenses/LICENSE-2.0
|
||||
#
|
||||
# Unless required by applicable law or agreed to in writing, software
|
||||
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
"""
|
||||
Image/Text processor class for GIT
|
||||
"""
|
||||
|
||||
from ...processing_utils import ProcessorMixin
|
||||
from ...tokenization_utils_base import BatchEncoding
|
||||
|
||||
|
||||
class GitProcessor(ProcessorMixin):
|
||||
r"""
|
||||
Constructs a GIT processor which wraps a CLIP image processor and a BERT tokenizer into a single processor.
|
||||
|
||||
[`GitProcessor`] offers all the functionalities of [`CLIPImageProcessor`] and [`BertTokenizerFast`]. See the
|
||||
[`~GitProcessor.__call__`] and [`~GitProcessor.decode`] for more information.
|
||||
|
||||
Args:
|
||||
image_processor ([`AutoImageProcessor`]):
|
||||
The image processor is a required input.
|
||||
tokenizer ([`AutoTokenizer`]):
|
||||
The tokenizer is a required input.
|
||||
"""
|
||||
attributes = ["image_processor", "tokenizer"]
|
||||
image_processor_class = "AutoImageProcessor"
|
||||
tokenizer_class = "AutoTokenizer"
|
||||
|
||||
def __init__(self, image_processor, tokenizer):
|
||||
super().__init__(image_processor, tokenizer)
|
||||
self.current_processor = self.image_processor
|
||||
|
||||
def __call__(self, text=None, images=None, return_tensors=None, **kwargs):
|
||||
"""
|
||||
Main method to prepare for the model one or several sequences(s) and image(s). This method forwards the `text`
|
||||
and `kwargs` arguments to BertTokenizerFast's [`~BertTokenizerFast.__call__`] if `text` is not `None` to encode
|
||||
the text. To prepare the image(s), this method forwards the `images` and `kwrags` arguments to
|
||||
CLIPImageProcessor's [`~CLIPImageProcessor.__call__`] if `images` is not `None`. Please refer to the doctsring
|
||||
of the above two methods for more information.
|
||||
|
||||
Args:
|
||||
text (`str`, `List[str]`, `List[List[str]]`):
|
||||
The sequence or batch of sequences to be encoded. Each sequence can be a string or a list of strings
|
||||
(pretokenized string). If the sequences are provided as list of strings (pretokenized), you must set
|
||||
`is_split_into_words=True` (to lift the ambiguity with a batch of sequences).
|
||||
images (`PIL.Image.Image`, `np.ndarray`, `torch.Tensor`, `List[PIL.Image.Image]`, `List[np.ndarray]`, `List[torch.Tensor]`):
|
||||
The image or batch of images to be prepared. Each image can be a PIL image, NumPy array or PyTorch
|
||||
tensor. In case of a NumPy array/PyTorch tensor, each image should be of shape (C, H, W), where C is a
|
||||
number of channels, H and W are image height and width.
|
||||
|
||||
return_tensors (`str` or [`~utils.TensorType`], *optional*):
|
||||
If set, will return tensors of a particular framework. Acceptable values are:
|
||||
|
||||
- `'tf'`: Return TensorFlow `tf.constant` objects.
|
||||
- `'pt'`: Return PyTorch `torch.Tensor` objects.
|
||||
- `'np'`: Return NumPy `np.ndarray` objects.
|
||||
- `'jax'`: Return JAX `jnp.ndarray` objects.
|
||||
|
||||
Returns:
|
||||
[`BatchEncoding`]: A [`BatchEncoding`] with the following fields:
|
||||
|
||||
- **input_ids** -- List of token ids to be fed to a model. Returned when `text` is not `None`.
|
||||
- **attention_mask** -- List of indices specifying which tokens should be attended to by the model (when
|
||||
`return_attention_mask=True` or if *"attention_mask"* is in `self.model_input_names` and if `text` is not
|
||||
`None`).
|
||||
- **pixel_values** -- Pixel values to be fed to a model. Returned when `images` is not `None`.
|
||||
"""
|
||||
|
||||
if text is None and images is None:
|
||||
raise ValueError("You have to specify either text or images. Both cannot be none.")
|
||||
|
||||
if text is not None:
|
||||
encoding = self.tokenizer(text, return_tensors=return_tensors, **kwargs)
|
||||
|
||||
if images is not None:
|
||||
image_features = self.image_processor(images, return_tensors=return_tensors, **kwargs)
|
||||
|
||||
if text is not None and images is not None:
|
||||
encoding["pixel_values"] = image_features.pixel_values
|
||||
return encoding
|
||||
elif text is not None:
|
||||
return encoding
|
||||
else:
|
||||
return BatchEncoding(data=dict(**image_features), tensor_type=return_tensors)
|
||||
|
||||
def batch_decode(self, *args, **kwargs):
|
||||
"""
|
||||
This method forwards all its arguments to BertTokenizerFast's [`~PreTrainedTokenizer.batch_decode`]. Please
|
||||
refer to the docstring of this method for more information.
|
||||
"""
|
||||
return self.tokenizer.batch_decode(*args, **kwargs)
|
||||
|
||||
def decode(self, *args, **kwargs):
|
||||
"""
|
||||
This method forwards all its arguments to BertTokenizerFast's [`~PreTrainedTokenizer.decode`]. Please refer to
|
||||
the docstring of this method for more information.
|
||||
"""
|
||||
return self.tokenizer.decode(*args, **kwargs)
|
||||
|
||||
@property
|
||||
def model_input_names(self):
|
||||
return ["input_ids", "attention_mask", "pixel_values"]
|
||||
@@ -2588,6 +2588,37 @@ def load_tf_weights_in_funnel(*args, **kwargs):
|
||||
requires_backends(load_tf_weights_in_funnel, ["torch"])
|
||||
|
||||
|
||||
GIT_PRETRAINED_MODEL_ARCHIVE_LIST = None
|
||||
|
||||
|
||||
class GitForCausalLM(metaclass=DummyObject):
|
||||
_backends = ["torch"]
|
||||
|
||||
def __init__(self, *args, **kwargs):
|
||||
requires_backends(self, ["torch"])
|
||||
|
||||
|
||||
class GitModel(metaclass=DummyObject):
|
||||
_backends = ["torch"]
|
||||
|
||||
def __init__(self, *args, **kwargs):
|
||||
requires_backends(self, ["torch"])
|
||||
|
||||
|
||||
class GitPreTrainedModel(metaclass=DummyObject):
|
||||
_backends = ["torch"]
|
||||
|
||||
def __init__(self, *args, **kwargs):
|
||||
requires_backends(self, ["torch"])
|
||||
|
||||
|
||||
class GitVisionModel(metaclass=DummyObject):
|
||||
_backends = ["torch"]
|
||||
|
||||
def __init__(self, *args, **kwargs):
|
||||
requires_backends(self, ["torch"])
|
||||
|
||||
|
||||
GLPN_PRETRAINED_MODEL_ARCHIVE_LIST = None
|
||||
|
||||
|
||||
|
||||
@@ -156,6 +156,7 @@ _SPECIAL_SUPPORTED_MODELS = [
|
||||
"CLIPTextModelWithProjection",
|
||||
"CLIPVisionModel",
|
||||
"CLIPVisionModelWithProjection",
|
||||
"GitVisionModel",
|
||||
"GPT2DoubleHeadsModel",
|
||||
"Speech2Text2Decoder",
|
||||
"TrOCRDecoder",
|
||||
|
||||
0
tests/models/git/__init__.py
Normal file
0
tests/models/git/__init__.py
Normal file
462
tests/models/git/test_modeling_git.py
Normal file
462
tests/models/git/test_modeling_git.py
Normal file
@@ -0,0 +1,462 @@
|
||||
# coding=utf-8
|
||||
# Copyright 2022 The HuggingFace Team. All rights reserved.
|
||||
#
|
||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||
# you may not use this file except in compliance with the License.
|
||||
# You may obtain a copy of the License at
|
||||
#
|
||||
# http://www.apache.org/licenses/LICENSE-2.0
|
||||
#
|
||||
# Unless required by applicable law or agreed to in writing, software
|
||||
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
|
||||
import inspect
|
||||
import unittest
|
||||
|
||||
from huggingface_hub import hf_hub_download
|
||||
from transformers import GitConfig, GitProcessor, GitVisionConfig, is_torch_available, is_vision_available
|
||||
from transformers.models.auto import get_values
|
||||
from transformers.testing_utils import require_torch, require_vision, slow, torch_device
|
||||
|
||||
from ...test_configuration_common import ConfigTester
|
||||
from ...test_modeling_common import ModelTesterMixin, floats_tensor, ids_tensor, random_attention_mask
|
||||
|
||||
|
||||
if is_torch_available():
|
||||
import torch
|
||||
from torch import nn
|
||||
|
||||
from transformers import MODEL_FOR_PRETRAINING_MAPPING, GitForCausalLM, GitModel, GitVisionModel
|
||||
from transformers.models.git.modeling_git import GIT_PRETRAINED_MODEL_ARCHIVE_LIST
|
||||
|
||||
|
||||
if is_vision_available():
|
||||
from PIL import Image
|
||||
|
||||
|
||||
class GitVisionModelTester:
|
||||
def __init__(
|
||||
self,
|
||||
parent,
|
||||
batch_size=12,
|
||||
image_size=32,
|
||||
patch_size=16,
|
||||
num_channels=3,
|
||||
is_training=True,
|
||||
hidden_size=32,
|
||||
projection_dim=32,
|
||||
num_hidden_layers=5,
|
||||
num_attention_heads=4,
|
||||
intermediate_size=37,
|
||||
dropout=0.1,
|
||||
attention_dropout=0.1,
|
||||
initializer_range=0.02,
|
||||
scope=None,
|
||||
):
|
||||
self.parent = parent
|
||||
self.batch_size = batch_size
|
||||
self.image_size = image_size
|
||||
self.patch_size = patch_size
|
||||
self.num_channels = num_channels
|
||||
self.is_training = is_training
|
||||
self.hidden_size = hidden_size
|
||||
self.projection_dim = projection_dim
|
||||
self.num_hidden_layers = num_hidden_layers
|
||||
self.num_attention_heads = num_attention_heads
|
||||
self.intermediate_size = intermediate_size
|
||||
self.dropout = dropout
|
||||
self.attention_dropout = attention_dropout
|
||||
self.initializer_range = initializer_range
|
||||
self.scope = scope
|
||||
|
||||
# in ViT, the seq length equals the number of patches + 1 (we add 1 for the [CLS] token)
|
||||
num_patches = (image_size // patch_size) ** 2
|
||||
self.seq_length = num_patches + 1
|
||||
|
||||
def prepare_config_and_inputs(self):
|
||||
pixel_values = floats_tensor([self.batch_size, self.num_channels, self.image_size, self.image_size])
|
||||
config = self.get_config()
|
||||
|
||||
return config, pixel_values
|
||||
|
||||
def get_config(self):
|
||||
return GitVisionConfig(
|
||||
image_size=self.image_size,
|
||||
patch_size=self.patch_size,
|
||||
num_channels=self.num_channels,
|
||||
hidden_size=self.hidden_size,
|
||||
projection_dim=self.projection_dim,
|
||||
num_hidden_layers=self.num_hidden_layers,
|
||||
num_attention_heads=self.num_attention_heads,
|
||||
intermediate_size=self.intermediate_size,
|
||||
dropout=self.dropout,
|
||||
attention_dropout=self.attention_dropout,
|
||||
initializer_range=self.initializer_range,
|
||||
)
|
||||
|
||||
def create_and_check_model(self, config, pixel_values):
|
||||
model = GitVisionModel(config=config)
|
||||
model.to(torch_device)
|
||||
model.eval()
|
||||
with torch.no_grad():
|
||||
result = model(pixel_values)
|
||||
# expected sequence length = num_patches + 1 (we add 1 for the [CLS] token)
|
||||
image_size = (self.image_size, self.image_size)
|
||||
patch_size = (self.patch_size, self.patch_size)
|
||||
num_patches = (image_size[1] // patch_size[1]) * (image_size[0] // patch_size[0])
|
||||
self.parent.assertEqual(result.last_hidden_state.shape, (self.batch_size, num_patches + 1, self.hidden_size))
|
||||
|
||||
def prepare_config_and_inputs_for_common(self):
|
||||
config_and_inputs = self.prepare_config_and_inputs()
|
||||
config, pixel_values = config_and_inputs
|
||||
inputs_dict = {"pixel_values": pixel_values}
|
||||
return config, inputs_dict
|
||||
|
||||
|
||||
@require_torch
|
||||
class GitVisionModelTest(ModelTesterMixin, unittest.TestCase):
|
||||
"""
|
||||
Here we also overwrite some of the tests of test_modeling_common.py, as GIT does not use input_ids, inputs_embeds,
|
||||
attention_mask and seq_length.
|
||||
"""
|
||||
|
||||
all_model_classes = (GitVisionModel,) if is_torch_available() else ()
|
||||
fx_compatible = True
|
||||
test_pruning = False
|
||||
test_resize_embeddings = False
|
||||
test_head_masking = False
|
||||
|
||||
def setUp(self):
|
||||
self.model_tester = GitVisionModelTester(self)
|
||||
self.config_tester = ConfigTester(self, config_class=GitVisionConfig, has_text_modality=False, hidden_size=37)
|
||||
|
||||
def test_config(self):
|
||||
self.config_tester.run_common_tests()
|
||||
|
||||
@unittest.skip(reason="GIT does not use inputs_embeds")
|
||||
def test_inputs_embeds(self):
|
||||
pass
|
||||
|
||||
def test_model_common_attributes(self):
|
||||
config, _ = self.model_tester.prepare_config_and_inputs_for_common()
|
||||
|
||||
for model_class in self.all_model_classes:
|
||||
model = model_class(config)
|
||||
self.assertIsInstance(model.get_input_embeddings(), (nn.Module))
|
||||
x = model.get_output_embeddings()
|
||||
self.assertTrue(x is None or isinstance(x, nn.Linear))
|
||||
|
||||
def test_forward_signature(self):
|
||||
config, _ = self.model_tester.prepare_config_and_inputs_for_common()
|
||||
|
||||
for model_class in self.all_model_classes:
|
||||
model = model_class(config)
|
||||
signature = inspect.signature(model.forward)
|
||||
# signature.parameters is an OrderedDict => so arg_names order is deterministic
|
||||
arg_names = [*signature.parameters.keys()]
|
||||
|
||||
expected_arg_names = ["pixel_values"]
|
||||
self.assertListEqual(arg_names[:1], expected_arg_names)
|
||||
|
||||
def test_model(self):
|
||||
config_and_inputs = self.model_tester.prepare_config_and_inputs()
|
||||
self.model_tester.create_and_check_model(*config_and_inputs)
|
||||
|
||||
def test_training(self):
|
||||
pass
|
||||
|
||||
def test_training_gradient_checkpointing(self):
|
||||
pass
|
||||
|
||||
@unittest.skip(reason="GitVisionModel has no base class and is not available in MODEL_MAPPING")
|
||||
def test_save_load_fast_init_from_base(self):
|
||||
pass
|
||||
|
||||
@unittest.skip(reason="GitVisionModel has no base class and is not available in MODEL_MAPPING")
|
||||
def test_save_load_fast_init_to_base(self):
|
||||
pass
|
||||
|
||||
@slow
|
||||
def test_model_from_pretrained(self):
|
||||
for model_name in GIT_PRETRAINED_MODEL_ARCHIVE_LIST[:1]:
|
||||
model = GitVisionModel.from_pretrained(model_name)
|
||||
self.assertIsNotNone(model)
|
||||
|
||||
|
||||
class GitModelTester:
|
||||
def __init__(
|
||||
self,
|
||||
parent,
|
||||
num_channels=3,
|
||||
image_size=32,
|
||||
patch_size=16,
|
||||
batch_size=13,
|
||||
text_seq_length=7,
|
||||
is_training=True,
|
||||
use_input_mask=True,
|
||||
use_labels=True,
|
||||
vocab_size=99,
|
||||
hidden_size=32,
|
||||
num_hidden_layers=5,
|
||||
num_attention_heads=4,
|
||||
intermediate_size=37,
|
||||
hidden_act="gelu",
|
||||
hidden_dropout_prob=0.1,
|
||||
attention_probs_dropout_prob=0.1,
|
||||
max_position_embeddings=512,
|
||||
type_vocab_size=16,
|
||||
initializer_range=0.02,
|
||||
num_labels=3,
|
||||
scope=None,
|
||||
):
|
||||
self.parent = parent
|
||||
self.num_channels = num_channels
|
||||
self.image_size = image_size
|
||||
self.patch_size = patch_size
|
||||
self.batch_size = batch_size
|
||||
self.text_seq_length = text_seq_length
|
||||
self.is_training = is_training
|
||||
self.use_input_mask = use_input_mask
|
||||
self.use_labels = use_labels
|
||||
self.vocab_size = vocab_size
|
||||
self.hidden_size = hidden_size
|
||||
self.num_hidden_layers = num_hidden_layers
|
||||
self.num_attention_heads = num_attention_heads
|
||||
self.intermediate_size = intermediate_size
|
||||
self.hidden_act = hidden_act
|
||||
self.hidden_dropout_prob = hidden_dropout_prob
|
||||
self.attention_probs_dropout_prob = attention_probs_dropout_prob
|
||||
self.max_position_embeddings = max_position_embeddings
|
||||
self.type_vocab_size = type_vocab_size
|
||||
self.initializer_range = initializer_range
|
||||
self.num_labels = num_labels
|
||||
self.scope = scope
|
||||
|
||||
# make sure the BOS, EOS and PAD tokens are within the vocab
|
||||
self.bos_token_id = vocab_size - 1
|
||||
self.eos_token_id = vocab_size - 1
|
||||
self.pad_token_id = vocab_size - 1
|
||||
|
||||
# for GIT, the sequence length is the sum of the text and patch tokens, + 1 due to the CLS token
|
||||
self.seq_length = self.text_seq_length + int((self.image_size / self.patch_size) ** 2) + 1
|
||||
|
||||
def prepare_config_and_inputs(self):
|
||||
input_ids = ids_tensor([self.batch_size, self.text_seq_length], self.vocab_size)
|
||||
|
||||
input_mask = None
|
||||
if self.use_input_mask:
|
||||
input_mask = random_attention_mask([self.batch_size, self.text_seq_length])
|
||||
|
||||
pixel_values = floats_tensor([self.batch_size, self.num_channels, self.image_size, self.image_size])
|
||||
|
||||
token_labels = None
|
||||
if self.use_labels:
|
||||
token_labels = ids_tensor([self.batch_size, self.text_seq_length], self.num_labels)
|
||||
|
||||
config = self.get_config()
|
||||
|
||||
return config, input_ids, input_mask, pixel_values, token_labels
|
||||
|
||||
def get_config(self):
|
||||
"""
|
||||
Returns a tiny configuration by default.
|
||||
"""
|
||||
return GitConfig(
|
||||
vision_config={
|
||||
"num_channels": self.num_channels,
|
||||
"image_size": self.image_size,
|
||||
"patch_size": self.patch_size,
|
||||
},
|
||||
vocab_size=self.vocab_size,
|
||||
hidden_size=self.hidden_size,
|
||||
num_hidden_layers=self.num_hidden_layers,
|
||||
num_attention_heads=self.num_attention_heads,
|
||||
intermediate_size=self.intermediate_size,
|
||||
hidden_act=self.hidden_act,
|
||||
hidden_dropout_prob=self.hidden_dropout_prob,
|
||||
attention_probs_dropout_prob=self.attention_probs_dropout_prob,
|
||||
max_position_embeddings=self.max_position_embeddings,
|
||||
initializer_range=self.initializer_range,
|
||||
bos_token_id=self.bos_token_id,
|
||||
eos_token_id=self.eos_token_id,
|
||||
pad_token_id=self.pad_token_id,
|
||||
)
|
||||
|
||||
def create_and_check_model(self, config, input_ids, input_mask, pixel_values, token_labels):
|
||||
model = GitModel(config=config)
|
||||
model.to(torch_device)
|
||||
model.eval()
|
||||
|
||||
# inference with pixel values
|
||||
result = model(input_ids, attention_mask=input_mask, pixel_values=pixel_values)
|
||||
|
||||
self.parent.assertEqual(result.last_hidden_state.shape, (self.batch_size, self.seq_length, self.hidden_size))
|
||||
|
||||
# inference without pixel values
|
||||
result = model(input_ids, attention_mask=input_mask)
|
||||
result = model(input_ids)
|
||||
|
||||
self.parent.assertEqual(
|
||||
result.last_hidden_state.shape, (self.batch_size, self.text_seq_length, self.hidden_size)
|
||||
)
|
||||
|
||||
def create_and_check_for_causal_lm(self, config, input_ids, input_mask, pixel_values, token_labels):
|
||||
model = GitForCausalLM(config=config)
|
||||
model.to(torch_device)
|
||||
model.eval()
|
||||
|
||||
# inference with pixel values
|
||||
result = model(input_ids, attention_mask=input_mask, pixel_values=pixel_values)
|
||||
self.parent.assertEqual(result.logits.shape, (self.batch_size, self.seq_length, self.vocab_size))
|
||||
|
||||
# inference without pixel values
|
||||
result = model(input_ids, attention_mask=input_mask)
|
||||
result = model(input_ids)
|
||||
self.parent.assertEqual(result.logits.shape, (self.batch_size, self.text_seq_length, self.vocab_size))
|
||||
|
||||
# TODO training
|
||||
# result = model(input_ids, attention_mask=input_mask, pixel_values=pixel_values)
|
||||
# self.parent.assertEqual(result.logits.shape, (self.batch_size, self.seq_length, self.vocab_size))
|
||||
# self.parent.assertEqual(result.logits.shape, (self.batch_size, self.seq_length, self.vocab_size))
|
||||
|
||||
def prepare_config_and_inputs_for_common(self):
|
||||
config_and_inputs = self.prepare_config_and_inputs()
|
||||
|
||||
(
|
||||
config,
|
||||
input_ids,
|
||||
input_mask,
|
||||
pixel_values,
|
||||
token_labels,
|
||||
) = config_and_inputs
|
||||
|
||||
inputs_dict = {
|
||||
"input_ids": input_ids,
|
||||
"attention_mask": input_mask,
|
||||
"pixel_values": pixel_values,
|
||||
}
|
||||
|
||||
return config, inputs_dict
|
||||
|
||||
|
||||
@require_torch
|
||||
class GitModelTest(ModelTesterMixin, unittest.TestCase):
|
||||
|
||||
all_model_classes = (GitModel, GitForCausalLM) if is_torch_available() else ()
|
||||
all_generative_model_classes = (GitForCausalLM,) if is_torch_available() else ()
|
||||
fx_compatible = False
|
||||
test_torchscript = False
|
||||
|
||||
# special case for ForPreTraining model
|
||||
def _prepare_for_class(self, inputs_dict, model_class, return_labels=False):
|
||||
inputs_dict = super()._prepare_for_class(inputs_dict, model_class, return_labels=return_labels)
|
||||
|
||||
if return_labels:
|
||||
if model_class in get_values(MODEL_FOR_PRETRAINING_MAPPING):
|
||||
inputs_dict["labels"] = torch.zeros(
|
||||
(self.model_tester.batch_size, self.model_tester.seq_length), dtype=torch.long, device=torch_device
|
||||
)
|
||||
inputs_dict["next_sentence_label"] = torch.zeros(
|
||||
self.model_tester.batch_size, dtype=torch.long, device=torch_device
|
||||
)
|
||||
return inputs_dict
|
||||
|
||||
def setUp(self):
|
||||
self.model_tester = GitModelTester(self)
|
||||
self.config_tester = ConfigTester(self, config_class=GitConfig, hidden_size=37)
|
||||
|
||||
def test_config(self):
|
||||
self.config_tester.run_common_tests()
|
||||
|
||||
def test_model(self):
|
||||
config_and_inputs = self.model_tester.prepare_config_and_inputs()
|
||||
self.model_tester.create_and_check_model(*config_and_inputs)
|
||||
|
||||
def test_model_various_embeddings(self):
|
||||
config_and_inputs = self.model_tester.prepare_config_and_inputs()
|
||||
for type in ["absolute", "relative_key", "relative_key_query"]:
|
||||
config_and_inputs[0].position_embedding_type = type
|
||||
self.model_tester.create_and_check_model(*config_and_inputs)
|
||||
|
||||
def test_for_causal_lm(self):
|
||||
config_and_inputs = self.model_tester.prepare_config_and_inputs()
|
||||
self.model_tester.create_and_check_for_causal_lm(*config_and_inputs)
|
||||
|
||||
@slow
|
||||
def test_model_from_pretrained(self):
|
||||
for model_name in GIT_PRETRAINED_MODEL_ARCHIVE_LIST[:1]:
|
||||
model = GitModel.from_pretrained(model_name)
|
||||
self.assertIsNotNone(model)
|
||||
|
||||
|
||||
@require_torch
|
||||
@require_vision
|
||||
@slow
|
||||
class GitModelIntegrationTest(unittest.TestCase):
|
||||
def test_forward_pass(self):
|
||||
processor = GitProcessor.from_pretrained("microsoft/git-base")
|
||||
model = GitForCausalLM.from_pretrained("microsoft/git-base")
|
||||
|
||||
model.to(torch_device)
|
||||
|
||||
image = Image.open("./tests/fixtures/tests_samples/COCO/000000039769.png")
|
||||
inputs = processor(images=image, text="hello world", return_tensors="pt").to(torch_device)
|
||||
|
||||
with torch.no_grad():
|
||||
outputs = model(**inputs)
|
||||
|
||||
expected_shape = torch.Size((1, 201, 30522))
|
||||
self.assertEqual(outputs.logits.shape, expected_shape)
|
||||
expected_slice = torch.tensor(
|
||||
[[-0.9514, -0.9512, -0.9507], [-0.5454, -0.5453, -0.5453], [-0.8862, -0.8857, -0.8848]],
|
||||
device=torch_device,
|
||||
)
|
||||
self.assertTrue(torch.allclose(outputs.logits[0, :3, :3], expected_slice, atol=1e-4))
|
||||
|
||||
def test_inference_image_captioning(self):
|
||||
processor = GitProcessor.from_pretrained("microsoft/git-base")
|
||||
model = GitForCausalLM.from_pretrained("microsoft/git-base")
|
||||
model.to(torch_device)
|
||||
|
||||
image = Image.open("./tests/fixtures/tests_samples/COCO/000000039769.png")
|
||||
inputs = processor(images=image, return_tensors="pt")
|
||||
pixel_values = inputs.pixel_values.to(torch_device)
|
||||
|
||||
outputs = model.generate(
|
||||
pixel_values=pixel_values, max_length=20, output_scores=True, return_dict_in_generate=True
|
||||
)
|
||||
generated_caption = processor.batch_decode(outputs.sequences, skip_special_tokens=True)[0]
|
||||
|
||||
expected_shape = torch.Size((1, 9))
|
||||
self.assertEqual(outputs.sequences.shape, expected_shape)
|
||||
self.assertEquals(generated_caption, "two cats laying on a pink blanket")
|
||||
self.assertTrue(outputs.scores[-1].shape, expected_shape)
|
||||
expected_slice = torch.tensor([[-0.8805, -0.8803, -0.8799]], device=torch_device)
|
||||
self.assertTrue(torch.allclose(outputs.scores[-1][0, :3], expected_slice, atol=1e-4))
|
||||
|
||||
def test_visual_question_answering(self):
|
||||
processor = GitProcessor.from_pretrained("microsoft/git-base-textvqa")
|
||||
model = GitForCausalLM.from_pretrained("microsoft/git-base-textvqa")
|
||||
model.to(torch_device)
|
||||
|
||||
# prepare image
|
||||
file_path = hf_hub_download(repo_id="nielsr/textvqa-sample", filename="bus.png", repo_type="dataset")
|
||||
image = Image.open(file_path).convert("RGB")
|
||||
inputs = processor(images=image, return_tensors="pt")
|
||||
pixel_values = inputs.pixel_values.to(torch_device)
|
||||
|
||||
# prepare question
|
||||
question = "what does the front of the bus say at the top?"
|
||||
input_ids = processor(text=question, add_special_tokens=False).input_ids
|
||||
input_ids = [processor.tokenizer.cls_token_id] + input_ids
|
||||
input_ids = torch.tensor(input_ids).unsqueeze(0).to(torch_device)
|
||||
|
||||
generated_ids = model.generate(pixel_values=pixel_values, input_ids=input_ids, max_length=20)
|
||||
generated_caption = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
|
||||
|
||||
expected_shape = torch.Size((1, 15))
|
||||
self.assertEqual(generated_ids.shape, expected_shape)
|
||||
self.assertEquals(generated_caption, "what does the front of the bus say at the top? special")
|
||||
153
tests/models/git/test_processor_git.py
Normal file
153
tests/models/git/test_processor_git.py
Normal file
@@ -0,0 +1,153 @@
|
||||
# Copyright 2022 The HuggingFace Team. All rights reserved.
|
||||
#
|
||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||
# you may not use this file except in compliance with the License.
|
||||
# You may obtain a copy of the License at
|
||||
#
|
||||
# http://www.apache.org/licenses/LICENSE-2.0
|
||||
#
|
||||
# Unless required by applicable law or agreed to in writing, software
|
||||
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
import shutil
|
||||
import tempfile
|
||||
import unittest
|
||||
|
||||
import numpy as np
|
||||
import pytest
|
||||
|
||||
from transformers.testing_utils import require_vision
|
||||
from transformers.utils import is_vision_available
|
||||
|
||||
|
||||
if is_vision_available():
|
||||
from PIL import Image
|
||||
|
||||
from transformers import AutoProcessor, BertTokenizer, CLIPImageProcessor, GitProcessor, PreTrainedTokenizerFast
|
||||
|
||||
|
||||
@require_vision
|
||||
class GitProcessorTest(unittest.TestCase):
|
||||
def setUp(self):
|
||||
self.tmpdirname = tempfile.mkdtemp()
|
||||
|
||||
image_processor = CLIPImageProcessor()
|
||||
tokenizer = BertTokenizer.from_pretrained(
|
||||
"hf-internal-testing/tiny-random-BertModel", model_input_names=["input_ids", "attention_mask"]
|
||||
)
|
||||
|
||||
processor = GitProcessor(image_processor, tokenizer)
|
||||
|
||||
processor.save_pretrained(self.tmpdirname)
|
||||
|
||||
def get_tokenizer(self, **kwargs):
|
||||
return AutoProcessor.from_pretrained(self.tmpdirname, **kwargs).tokenizer
|
||||
|
||||
def get_image_processor(self, **kwargs):
|
||||
return AutoProcessor.from_pretrained(self.tmpdirname, **kwargs).image_processor
|
||||
|
||||
def tearDown(self):
|
||||
shutil.rmtree(self.tmpdirname)
|
||||
|
||||
def prepare_image_inputs(self):
|
||||
"""This function prepares a list of PIL images, or a list of numpy arrays if one specifies numpify=True,
|
||||
or a list of PyTorch tensors if one specifies torchify=True.
|
||||
"""
|
||||
|
||||
image_inputs = [np.random.randint(255, size=(3, 30, 400), dtype=np.uint8)]
|
||||
|
||||
image_inputs = [Image.fromarray(np.moveaxis(x, 0, -1)) for x in image_inputs]
|
||||
|
||||
return image_inputs
|
||||
|
||||
def test_save_load_pretrained_additional_features(self):
|
||||
processor = GitProcessor(tokenizer=self.get_tokenizer(), image_processor=self.get_image_processor())
|
||||
processor.save_pretrained(self.tmpdirname)
|
||||
|
||||
tokenizer_add_kwargs = self.get_tokenizer(bos_token="(BOS)", eos_token="(EOS)")
|
||||
image_processor_add_kwargs = self.get_image_processor(do_normalize=False, padding_value=1.0)
|
||||
|
||||
processor = GitProcessor.from_pretrained(
|
||||
self.tmpdirname, bos_token="(BOS)", eos_token="(EOS)", do_normalize=False, padding_value=1.0
|
||||
)
|
||||
|
||||
self.assertEqual(processor.tokenizer.get_vocab(), tokenizer_add_kwargs.get_vocab())
|
||||
self.assertIsInstance(processor.tokenizer, PreTrainedTokenizerFast)
|
||||
|
||||
self.assertEqual(processor.image_processor.to_json_string(), image_processor_add_kwargs.to_json_string())
|
||||
self.assertIsInstance(processor.image_processor, CLIPImageProcessor)
|
||||
|
||||
def test_image_processor(self):
|
||||
image_processor = self.get_image_processor()
|
||||
tokenizer = self.get_tokenizer()
|
||||
|
||||
processor = GitProcessor(tokenizer=tokenizer, image_processor=image_processor)
|
||||
|
||||
image_input = self.prepare_image_inputs()
|
||||
|
||||
input_feat_extract = image_processor(image_input, return_tensors="np")
|
||||
input_processor = processor(images=image_input, return_tensors="np")
|
||||
|
||||
for key in input_feat_extract.keys():
|
||||
self.assertAlmostEqual(input_feat_extract[key].sum(), input_processor[key].sum(), delta=1e-2)
|
||||
|
||||
def test_tokenizer(self):
|
||||
image_processor = self.get_image_processor()
|
||||
tokenizer = self.get_tokenizer()
|
||||
|
||||
processor = GitProcessor(tokenizer=tokenizer, image_processor=image_processor)
|
||||
|
||||
input_str = "lower newer"
|
||||
|
||||
encoded_processor = processor(text=input_str)
|
||||
|
||||
encoded_tok = tokenizer(input_str, return_token_type_ids=False)
|
||||
|
||||
for key in encoded_tok.keys():
|
||||
self.assertListEqual(encoded_tok[key], encoded_processor[key])
|
||||
|
||||
def test_processor(self):
|
||||
image_processor = self.get_image_processor()
|
||||
tokenizer = self.get_tokenizer()
|
||||
|
||||
processor = GitProcessor(tokenizer=tokenizer, image_processor=image_processor)
|
||||
|
||||
input_str = "lower newer"
|
||||
image_input = self.prepare_image_inputs()
|
||||
|
||||
inputs = processor(text=input_str, images=image_input)
|
||||
|
||||
self.assertListEqual(list(inputs.keys()), ["input_ids", "attention_mask", "pixel_values"])
|
||||
|
||||
# test if it raises when no input is passed
|
||||
with pytest.raises(ValueError):
|
||||
processor()
|
||||
|
||||
def test_tokenizer_decode(self):
|
||||
image_processor = self.get_image_processor()
|
||||
tokenizer = self.get_tokenizer()
|
||||
|
||||
processor = GitProcessor(tokenizer=tokenizer, image_processor=image_processor)
|
||||
|
||||
predicted_ids = [[1, 4, 5, 8, 1, 0, 8], [3, 4, 3, 1, 1, 8, 9]]
|
||||
|
||||
decoded_processor = processor.batch_decode(predicted_ids)
|
||||
decoded_tok = tokenizer.batch_decode(predicted_ids)
|
||||
|
||||
self.assertListEqual(decoded_tok, decoded_processor)
|
||||
|
||||
def test_model_input_names(self):
|
||||
image_processor = self.get_image_processor()
|
||||
tokenizer = self.get_tokenizer()
|
||||
|
||||
processor = GitProcessor(tokenizer=tokenizer, image_processor=image_processor)
|
||||
|
||||
input_str = "lower newer"
|
||||
image_input = self.prepare_image_inputs()
|
||||
|
||||
inputs = processor(text=input_str, images=image_input)
|
||||
|
||||
# For now the processor supports only ['input_ids', 'attention_mask', 'pixel_values']
|
||||
self.assertListEqual(list(inputs.keys()), ["input_ids", "attention_mask", "pixel_values"])
|
||||
@@ -212,7 +212,11 @@ class TextGenerationPipelineTests(unittest.TestCase, metaclass=PipelineTestCaseM
|
||||
# it requires BOS token to exist.
|
||||
# Special case for Pegasus which will always append EOS so will
|
||||
# work even without BOS.
|
||||
if text_generator.tokenizer.bos_token_id is not None or "Pegasus" in tokenizer.__class__.__name__:
|
||||
if (
|
||||
text_generator.tokenizer.bos_token_id is not None
|
||||
or "Pegasus" in tokenizer.__class__.__name__
|
||||
or "Git" in model.__class__.__name__
|
||||
):
|
||||
outputs = text_generator("")
|
||||
self.assertEqual(outputs, [{"generated_text": ANY(str)}])
|
||||
else:
|
||||
|
||||
@@ -148,6 +148,7 @@ TEST_FILES_WITH_NO_COMMON_TESTS = [
|
||||
# should **not** be the rule.
|
||||
IGNORE_NON_AUTO_CONFIGURED = PRIVATE_MODELS.copy() + [
|
||||
# models to ignore for model xxx mapping
|
||||
"GitVisionModel",
|
||||
"BlipForConditionalGeneration",
|
||||
"BlipForImageTextRetrieval",
|
||||
"BlipForQuestionAnswering",
|
||||
|
||||
@@ -81,6 +81,7 @@ src/transformers/models/ernie/configuration_ernie.py
|
||||
src/transformers/models/flava/configuration_flava.py
|
||||
src/transformers/models/fnet/configuration_fnet.py
|
||||
src/transformers/models/fsmt/configuration_fsmt.py
|
||||
src/transformers/models/git/modeling_git.py
|
||||
src/transformers/models/glpn/modeling_glpn.py
|
||||
src/transformers/models/gpt2/configuration_gpt2.py
|
||||
src/transformers/models/gpt2/modeling_gpt2.py
|
||||
|
||||
Reference in New Issue
Block a user