Add Segment Anything Model (SAM) (#22654)

* initial commit

* keys match

* update, fix conversion

* fixes, inference working

* fix

* more fixes

* more fixes

* clean up

* more clean up

* fix copies and add convext copied layer norm

* stash

* pretty big upfate

* cleaning

* more cleaning

* fixup stuffs

* fix copies

* fix iinit

* update test removing tokenizer

* nits

* add pretrained

* more nits

* remove tracking of pipeline

* few fixes

* update san and conversion script

* fix mask decoder and prompt encoder conversion

* fixes

* small update

* fix order

* fix

* fix image embeddings

* nites

* few fixes

* fix logits

* clean up

* fixes boxes inference

* v1 AMG

* clean up

* some clean up

* multi points support

* amg working

* fixup

* clean up

* readme

* update toctree

* fix type hint

* multiple fixes

* fixup

* fixes

* updates

* updates

* more tests

* few fixes

* change to `SamForMaskGeneration`

* doc

* fixup

* fix more tests

* multiple fixes

* fix CI tests

* refactor processor

* renamings

* draft the pipeline

* refactor

* fix tests

* fix test

* few cleanings

* fix test

* edit pipelien support chunking

* udate

* add slow tests

* fix nit

* fixup

* fix nit

* current chunk pipleine

* cast boxes in fp32

* nit

* current updates

* piepleine works

* fixup

* clean up config

* fix slow tests

* fix slow tests

* clean up

* update doc and pipeline

* adds more slow tests

* fix slow tests

* cleaning

* tests pass

* add docstring

* fix copies

* clean up

* support batch of images

* style

* dummy is needed, add tests

* fix slow tests

* fix CI

* update

* adds more tests

* fixes

* fixes

* fixup

* fixes

* few fixes

* filter

* few fixes

* some refactor

* touches finales

* fix

* style

* remove pipeline files

* fixes nits

* revert pipeline changes

* fix test

* fixup

* remove automodel for automatic mask generation

* fix failing torch tests

* update mdx

* revert removal of `MODEL_FOR_AUTOMATIC_MASK_GENERATION_MAPPING`

* update sam config based on review

Co-authored-by: amyeroberts <aeroberts4444@gmail.com>
Co-authored-by: sgugger <sylvain.gugger@gmail.com>

* update low_resolution_masks -> pred_masks
inti ln with layer_norm_eps
add_decomposed_rel_pos doc
forward doc of SamForMaskGeneration

* update processor docstring

* remove image processor import empty

* update for testing

* output vision hidden states + clean recomm
also test all iou values

* fixup

* fixup

* remove unused

* Update src/transformers/models/sam/modeling_sam.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Update src/transformers/models/sam/image_processing_sam.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* nits

* fix

* fix CI tests and slow tests

* replace with Amy's processor

* clearer docstring

* add `SamVisionNeck`

* refactor - all CI tests should pass

* fix broken import on Gcolab

* few fixes here and there

* fix another bug

* fix more bugs

* update and merge

* correct ckpt

* address comments

* add tips

* revert

* fix docstring

* replace with `SamModel`

* make fixup

* add support for bathed images and batch ed points

* make fixup this time, really

* make fixup again and again

* few fixes here and there, this should be the touche finale

* Update docs/source/en/model_doc/sam.mdx

* fixup

* correct checkpoints

* correct name

* rm unneeded file

* add notebook

---------

Co-authored-by: younesbelkada <younesbelkada@gmail.com>
Co-authored-by: amyeroberts <aeroberts4444@gmail.com>
Co-authored-by: sgugger <sylvain.gugger@gmail.com>
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>
This commit is contained in:
Arthur
2023-04-19 21:01:49 +02:00
committed by GitHub
parent 898efca72a
commit 474bf508df
30 changed files with 3645 additions and 1 deletions

View File

@@ -421,6 +421,7 @@ Current number of checkpoints: ![](https://img.shields.io/endpoint?url=https://h
1. **[RoCBert](https://huggingface.co/docs/transformers/model_doc/roc_bert)** (from WeChatAI) released with the paper [RoCBert: Robust Chinese Bert with Multimodal Contrastive Pretraining](https://aclanthology.org/2022.acl-long.65.pdf) by HuiSu, WeiweiShi, XiaoyuShen, XiaoZhou, TuoJi, JiaruiFang, JieZhou.
1. **[RoFormer](https://huggingface.co/docs/transformers/model_doc/roformer)** (from ZhuiyiTechnology), released together with the paper [RoFormer: Enhanced Transformer with Rotary Position Embedding](https://arxiv.org/abs/2104.09864) by Jianlin Su and Yu Lu and Shengfeng Pan and Bo Wen and Yunfeng Liu.
1. **[SegFormer](https://huggingface.co/docs/transformers/model_doc/segformer)** (from NVIDIA) released with the paper [SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers](https://arxiv.org/abs/2105.15203) by Enze Xie, Wenhai Wang, Zhiding Yu, Anima Anandkumar, Jose M. Alvarez, Ping Luo.
1. **[Segment Anything](https://huggingface.co/docs/transformers/main/model_doc/sam)** (from Meta AI) released with the paper [Segment Anything](https://ai.facebook.com/research/publications/segment-anything/) by Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alex Berg, Wan-Yen Lo, Piotr Dollar, Ross Girshick.
1. **[SEW](https://huggingface.co/docs/transformers/model_doc/sew)** (from ASAPP) released with the paper [Performance-Efficiency Trade-offs in Unsupervised Pre-training for Speech Recognition](https://arxiv.org/abs/2109.06870) by Felix Wu, Kwangyoun Kim, Jing Pan, Kyu Han, Kilian Q. Weinberger, Yoav Artzi.
1. **[SEW-D](https://huggingface.co/docs/transformers/model_doc/sew_d)** (from ASAPP) released with the paper [Performance-Efficiency Trade-offs in Unsupervised Pre-training for Speech Recognition](https://arxiv.org/abs/2109.06870) by Felix Wu, Kwangyoun Kim, Jing Pan, Kyu Han, Kilian Q. Weinberger, Yoav Artzi.
1. **[SpeechT5](https://huggingface.co/docs/transformers/model_doc/speecht5)** (from Microsoft Research) released with the paper [SpeechT5: Unified-Modal Encoder-Decoder Pre-Training for Spoken Language Processing](https://arxiv.org/abs/2110.07205) by Junyi Ao, Rui Wang, Long Zhou, Chengyi Wang, Shuo Ren, Yu Wu, Shujie Liu, Tom Ko, Qing Li, Yu Zhang, Zhihua Wei, Yao Qian, Jinyu Li, Furu Wei.

View File

@@ -409,6 +409,7 @@ Número actual de puntos de control: ![](https://img.shields.io/endpoint?url=htt
1. **[RoCBert](https://huggingface.co/docs/transformers/model_doc/roc_bert)** (from WeChatAI) released with the paper [RoCBert: Robust Chinese Bert with Multimodal Contrastive Pretraining](https://aclanthology.org/2022.acl-long.65.pdf) by HuiSu, WeiweiShi, XiaoyuShen, XiaoZhou, TuoJi, JiaruiFang, JieZhou.
1. **[RoFormer](https://huggingface.co/docs/transformers/model_doc/roformer)** (from ZhuiyiTechnology), released together with the paper [RoFormer: Enhanced Transformer with Rotary Position Embedding](https://arxiv.org/abs/2104.09864) by Jianlin Su and Yu Lu and Shengfeng Pan and Bo Wen and Yunfeng Liu.
1. **[SegFormer](https://huggingface.co/docs/transformers/model_doc/segformer)** (from NVIDIA) released with the paper [SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers](https://arxiv.org/abs/2105.15203) by Enze Xie, Wenhai Wang, Zhiding Yu, Anima Anandkumar, Jose M. Alvarez, Ping Luo.
1. **[Segment Anything](https://huggingface.co/docs/transformers/main/model_doc/sam)** (from Meta AI) released with the paper [Segment Anything](https://ai.facebook.com/research/publications/segment-anything/) by Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alex Berg, Wan-Yen Lo, Piotr Dollar, Ross Girshick.
1. **[SEW](https://huggingface.co/docs/transformers/model_doc/sew)** (from ASAPP) released with the paper [Performance-Efficiency Trade-offs in Unsupervised Pre-training for Speech Recognition](https://arxiv.org/abs/2109.06870) by Felix Wu, Kwangyoun Kim, Jing Pan, Kyu Han, Kilian Q. Weinberger, Yoav Artzi.
1. **[SEW-D](https://huggingface.co/docs/transformers/model_doc/sew_d)** (from ASAPP) released with the paper [Performance-Efficiency Trade-offs in Unsupervised Pre-training for Speech Recognition](https://arxiv.org/abs/2109.06870) by Felix Wu, Kwangyoun Kim, Jing Pan, Kyu Han, Kilian Q. Weinberger, Yoav Artzi.
1. **[SpeechT5](https://huggingface.co/docs/transformers/model_doc/speecht5)** (from Microsoft Research) released with the paper [SpeechT5: Unified-Modal Encoder-Decoder Pre-Training for Spoken Language Processing](https://arxiv.org/abs/2110.07205) by Junyi Ao, Rui Wang, Long Zhou, Chengyi Wang, Shuo Ren, Yu Wu, Shujie Liu, Tom Ko, Qing Li, Yu Zhang, Zhihua Wei, Yao Qian, Jinyu Li, Furu Wei.

View File

@@ -381,6 +381,7 @@ conda install -c huggingface transformers
1. **[RoCBert](https://huggingface.co/docs/transformers/model_doc/roc_bert)** (from WeChatAI) released with the paper [RoCBert: Robust Chinese Bert with Multimodal Contrastive Pretraining](https://aclanthology.org/2022.acl-long.65.pdf) by HuiSu, WeiweiShi, XiaoyuShen, XiaoZhou, TuoJi, JiaruiFang, JieZhou.
1. **[RoFormer](https://huggingface.co/docs/transformers/model_doc/roformer)** (झुईई टेक्नोलॉजी से), साथ में पेपर [रोफॉर्मर: रोटरी पोजिशन एंबेडिंग के साथ एन्हांस्ड ट्रांसफॉर्मर] (https://arxiv.org/pdf/2104.09864v1.pdf) जियानलिन सु और यू लू और शेंगफेंग पैन और बो वेन और युनफेंग लियू द्वारा प्रकाशित।
1. **[SegFormer](https://huggingface.co/docs/transformers/model_doc/segformer)** (from NVIDIA) released with the paper [SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers](https://arxiv.org/abs/2105.15203) by Enze Xie, Wenhai Wang, Zhiding Yu, Anima Anandkumar, Jose M. Alvarez, Ping Luo.
1. **[Segment Anything](https://huggingface.co/docs/transformers/main/model_doc/sam)** (Meta AI से) Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alex Berg, Wan-Yen Lo, Piotr Dollar, Ross Girshick. द्वाराअनुसंधान पत्र [Segment Anything](https://ai.facebook.com/research/publications/segment-anything/) के साथ जारी किया गया
1. **[SEW](https://huggingface.co/docs/transformers/model_doc/sew)** (ASAPP से) साथ देने वाला पेपर [भाषण पहचान के लिए अनसुपरवाइज्ड प्री-ट्रेनिंग में परफॉर्मेंस-एफिशिएंसी ट्रेड-ऑफ्स](https ://arxiv.org/abs/2109.06870) फेलिक्स वू, क्वांगयुन किम, जिंग पैन, क्यू हान, किलियन क्यू. वेनबर्गर, योव आर्टज़ी द्वारा।
1. **[SEW-D](https://huggingface.co/docs/transformers/model_doc/sew_d)** (ASAPP से) साथ में पेपर [भाषण पहचान के लिए अनसुपरवाइज्ड प्री-ट्रेनिंग में परफॉर्मेंस-एफिशिएंसी ट्रेड-ऑफ्स] (https://arxiv.org/abs/2109.06870) फेलिक्स वू, क्वांगयुन किम, जिंग पैन, क्यू हान, किलियन क्यू. वेनबर्गर, योआव आर्टज़ी द्वारा पोस्ट किया गया।
1. **[SpeechT5](https://huggingface.co/docs/transformers/model_doc/speecht5)** (from Microsoft Research) released with the paper [SpeechT5: Unified-Modal Encoder-Decoder Pre-Training for Spoken Language Processing](https://arxiv.org/abs/2110.07205) by Junyi Ao, Rui Wang, Long Zhou, Chengyi Wang, Shuo Ren, Yu Wu, Shujie Liu, Tom Ko, Qing Li, Yu Zhang, Zhihua Wei, Yao Qian, Jinyu Li, Furu Wei.

View File

@@ -443,6 +443,7 @@ Flax、PyTorch、TensorFlowをcondaでインストールする方法は、それ
1. **[RoCBert](https://huggingface.co/docs/transformers/model_doc/roc_bert)** (WeChatAI から) HuiSu, WeiweiShi, XiaoyuShen, XiaoZhou, TuoJi, JiaruiFang, JieZhou から公開された研究論文: [RoCBert: Robust Chinese Bert with Multimodal Contrastive Pretraining](https://aclanthology.org/2022.acl-long.65.pdf)
1. **[RoFormer](https://huggingface.co/docs/transformers/model_doc/roformer)** (ZhuiyiTechnology から), Jianlin Su and Yu Lu and Shengfeng Pan and Bo Wen and Yunfeng Liu から公開された研究論文: [RoFormer: Enhanced Transformer with Rotary Position Embedding](https://arxiv.org/abs/2104.09864)
1. **[SegFormer](https://huggingface.co/docs/transformers/model_doc/segformer)** (NVIDIA から) Enze Xie, Wenhai Wang, Zhiding Yu, Anima Anandkumar, Jose M. Alvarez, Ping Luo から公開された研究論文: [SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers](https://arxiv.org/abs/2105.15203)
1. **[Segment Anything](https://huggingface.co/docs/transformers/main/model_doc/sam)** (Meta AI から) Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alex Berg, Wan-Yen Lo, Piotr Dollar, Ross Girshick. から公開された研究論文 [Segment Anything](https://ai.facebook.com/research/publications/segment-anything/)
1. **[SEW](https://huggingface.co/docs/transformers/model_doc/sew)** (ASAPP から) Felix Wu, Kwangyoun Kim, Jing Pan, Kyu Han, Kilian Q. Weinberger, Yoav Artzi から公開された研究論文: [Performance-Efficiency Trade-offs in Unsupervised Pre-training for Speech Recognition](https://arxiv.org/abs/2109.06870)
1. **[SEW-D](https://huggingface.co/docs/transformers/model_doc/sew_d)** (ASAPP から) Felix Wu, Kwangyoun Kim, Jing Pan, Kyu Han, Kilian Q. Weinberger, Yoav Artzi から公開された研究論文: [Performance-Efficiency Trade-offs in Unsupervised Pre-training for Speech Recognition](https://arxiv.org/abs/2109.06870)
1. **[SpeechT5](https://huggingface.co/docs/transformers/model_doc/speecht5)** (Microsoft Research から) Junyi Ao, Rui Wang, Long Zhou, Chengyi Wang, Shuo Ren, Yu Wu, Shujie Liu, Tom Ko, Qing Li, Yu Zhang, Zhihua Wei, Yao Qian, Jinyu Li, Furu Wei. から公開された研究論文 [SpeechT5: Unified-Modal Encoder-Decoder Pre-Training for Spoken Language Processing](https://arxiv.org/abs/2110.07205)

View File

@@ -358,6 +358,7 @@ Flax, PyTorch, TensorFlow 설치 페이지에서 이들을 conda로 설치하는
1. **[RoCBert](https://huggingface.co/docs/transformers/model_doc/roc_bert)** (WeChatAI 에서) HuiSu, WeiweiShi, XiaoyuShen, XiaoZhou, TuoJi, JiaruiFang, JieZhou 의 [RoCBert: Robust Chinese Bert with Multimodal Contrastive Pretraining](https://aclanthology.org/2022.acl-long.65.pdf) 논문과 함께 발표했습니다.
1. **[RoFormer](https://huggingface.co/docs/transformers/model_doc/roformer)** (ZhuiyiTechnology 에서) Jianlin Su and Yu Lu and Shengfeng Pan and Bo Wen and Yunfeng Liu 의 a [RoFormer: Enhanced Transformer with Rotary Position Embedding](https://arxiv.org/pdf/2104.09864v1.pdf) 논문과 함께 발표했습니다.
1. **[SegFormer](https://huggingface.co/docs/transformers/model_doc/segformer)** (NVIDIA 에서) Enze Xie, Wenhai Wang, Zhiding Yu, Anima Anandkumar, Jose M. Alvarez, Ping Luo 의 [SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers](https://arxiv.org/abs/2105.15203) 논문과 함께 발표했습니다.
1. **[Segment Anything](https://huggingface.co/docs/transformers/main/model_doc/sam)** (Meta AI 에서 제공)은 Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alex Berg, Wan-Yen Lo, Piotr Dollar, Ross Girshick.의 [Segment Anything](https://ai.facebook.com/research/publications/segment-anything/)논문과 함께 발표했습니다.
1. **[SEW](https://huggingface.co/docs/transformers/model_doc/sew)** (ASAPP 에서) Felix Wu, Kwangyoun Kim, Jing Pan, Kyu Han, Kilian Q. Weinberger, Yoav Artzi 의 [Performance-Efficiency Trade-offs in Unsupervised Pre-training for Speech Recognition](https://arxiv.org/abs/2109.06870) 논문과 함께 발표했습니다.
1. **[SEW-D](https://huggingface.co/docs/transformers/model_doc/sew_d)** (ASAPP 에서) Felix Wu, Kwangyoun Kim, Jing Pan, Kyu Han, Kilian Q. Weinberger, Yoav Artzi 의 [Performance-Efficiency Trade-offs in Unsupervised Pre-training for Speech Recognition](https://arxiv.org/abs/2109.06870) 논문과 함께 발표했습니다.
1. **[SpeechT5](https://huggingface.co/docs/transformers/model_doc/speecht5)** (Microsoft Research 에서 제공)은 Junyi Ao, Rui Wang, Long Zhou, Chengyi Wang, Shuo Ren, Yu Wu, Shujie Liu, Tom Ko, Qing Li, Yu Zhang, Zhihua Wei, Yao Qian, Jinyu Li, Furu Wei.의 [SpeechT5: Unified-Modal Encoder-Decoder Pre-Training for Spoken Language Processing](https://arxiv.org/abs/2110.07205)논문과 함께 발표했습니다.

View File

@@ -382,6 +382,7 @@ conda install -c huggingface transformers
1. **[RoCBert](https://huggingface.co/docs/transformers/model_doc/roc_bert)** (来自 WeChatAI), 伴随论文 [RoCBert: Robust Chinese Bert with Multimodal Contrastive Pretraining](https://aclanthology.org/2022.acl-long.65.pdf) 由 HuiSu, WeiweiShi, XiaoyuShen, XiaoZhou, TuoJi, JiaruiFang, JieZhou 发布。
1. **[RoFormer](https://huggingface.co/docs/transformers/model_doc/roformer)** (来自 ZhuiyiTechnology), 伴随论文 [RoFormer: Enhanced Transformer with Rotary Position Embedding](https://arxiv.org/pdf/2104.09864v1.pdf) 由 Jianlin Su and Yu Lu and Shengfeng Pan and Bo Wen and Yunfeng Liu 发布。
1. **[SegFormer](https://huggingface.co/docs/transformers/model_doc/segformer)** (来自 NVIDIA) 伴随论文 [SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers](https://arxiv.org/abs/2105.15203) 由 Enze Xie, Wenhai Wang, Zhiding Yu, Anima Anandkumar, Jose M. Alvarez, Ping Luo 发布。
1. **[Segment Anything](https://huggingface.co/docs/transformers/main/model_doc/sam)** (来自 Meta AI) 伴随论文 [Segment Anything](https://ai.facebook.com/research/publications/segment-anything/) 由 Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alex Berg, Wan-Yen Lo, Piotr Dollar, Ross Girshick 发布。
1. **[SEW](https://huggingface.co/docs/transformers/model_doc/sew)** (来自 ASAPP) 伴随论文 [Performance-Efficiency Trade-offs in Unsupervised Pre-training for Speech Recognition](https://arxiv.org/abs/2109.06870) 由 Felix Wu, Kwangyoun Kim, Jing Pan, Kyu Han, Kilian Q. Weinberger, Yoav Artzi 发布。
1. **[SEW-D](https://huggingface.co/docs/transformers/model_doc/sew_d)** (来自 ASAPP) 伴随论文 [Performance-Efficiency Trade-offs in Unsupervised Pre-training for Speech Recognition](https://arxiv.org/abs/2109.06870) 由 Felix Wu, Kwangyoun Kim, Jing Pan, Kyu Han, Kilian Q. Weinberger, Yoav Artzi 发布。
1. **[SpeechT5](https://huggingface.co/docs/transformers/model_doc/speecht5)** (来自 Microsoft Research) 伴随论文 [SpeechT5: Unified-Modal Encoder-Decoder Pre-Training for Spoken Language Processing](https://arxiv.org/abs/2110.07205) 由 Junyi Ao, Rui Wang, Long Zhou, Chengyi Wang, Shuo Ren, Yu Wu, Shujie Liu, Tom Ko, Qing Li, Yu Zhang, Zhihua Wei, Yao Qian, Jinyu Li, Furu Wei 发布。

View File

@@ -394,6 +394,7 @@ conda install -c huggingface transformers
1. **[RoCBert](https://huggingface.co/docs/transformers/model_doc/roc_bert)** (from WeChatAI) released with the paper [RoCBert: Robust Chinese Bert with Multimodal Contrastive Pretraining](https://aclanthology.org/2022.acl-long.65.pdf) by HuiSu, WeiweiShi, XiaoyuShen, XiaoZhou, TuoJi, JiaruiFang, JieZhou.
1. **[RoFormer](https://huggingface.co/docs/transformers/model_doc/roformer)** (from ZhuiyiTechnology), released together with the paper a [RoFormer: Enhanced Transformer with Rotary Position Embedding](https://arxiv.org/pdf/2104.09864v1.pdf) by Jianlin Su and Yu Lu and Shengfeng Pan and Bo Wen and Yunfeng Liu.
1. **[SegFormer](https://huggingface.co/docs/transformers/model_doc/segformer)** (from NVIDIA) released with the paper [SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers](https://arxiv.org/abs/2105.15203) by Enze Xie, Wenhai Wang, Zhiding Yu, Anima Anandkumar, Jose M. Alvarez, Ping Luo.
1. **[Segment Anything](https://huggingface.co/docs/transformers/main/model_doc/sam)** (from Meta AI) released with the paper [Segment Anything](https://ai.facebook.com/research/publications/segment-anything/) by Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alex Berg, Wan-Yen Lo, Piotr Dollar, Ross Girshick.
1. **[SEW](https://huggingface.co/docs/transformers/model_doc/sew)** (from ASAPP) released with the paper [Performance-Efficiency Trade-offs in Unsupervised Pre-training for Speech Recognition](https://arxiv.org/abs/2109.06870) by Felix Wu, Kwangyoun Kim, Jing Pan, Kyu Han, Kilian Q. Weinberger, Yoav Artzi.
1. **[SEW-D](https://huggingface.co/docs/transformers/model_doc/sew_d)** (from ASAPP) released with the paper [Performance-Efficiency Trade-offs in Unsupervised Pre-training for Speech Recognition](https://arxiv.org/abs/2109.06870) by Felix Wu, Kwangyoun Kim, Jing Pan, Kyu Han, Kilian Q. Weinberger, Yoav Artzi.
1. **[SpeechT5](https://huggingface.co/docs/transformers/model_doc/speecht5)** (from Microsoft Research) released with the paper [SpeechT5: Unified-Modal Encoder-Decoder Pre-Training for Spoken Language Processing](https://arxiv.org/abs/2110.07205) by Junyi Ao, Rui Wang, Long Zhou, Chengyi Wang, Shuo Ren, Yu Wu, Shujie Liu, Tom Ko, Qing Li, Yu Zhang, Zhihua Wei, Yao Qian, Jinyu Li, Furu Wei.

View File

@@ -610,6 +610,8 @@
title: Perceiver
- local: model_doc/pix2struct
title: Pix2Struct
- local: model_doc/sam
title: Segment Anything
- local: model_doc/speech-encoder-decoder
title: Speech Encoder Decoder Models
- local: model_doc/tapas

View File

@@ -195,6 +195,7 @@ The documentation is organized into five sections:
1. **[RoCBert](model_doc/roc_bert)** (from WeChatAI) released with the paper [RoCBert: Robust Chinese Bert with Multimodal Contrastive Pretraining](https://aclanthology.org/2022.acl-long.65.pdf) by HuiSu, WeiweiShi, XiaoyuShen, XiaoZhou, TuoJi, JiaruiFang, JieZhou.
1. **[RoFormer](model_doc/roformer)** (from ZhuiyiTechnology), released together with the paper [RoFormer: Enhanced Transformer with Rotary Position Embedding](https://arxiv.org/abs/2104.09864) by Jianlin Su and Yu Lu and Shengfeng Pan and Bo Wen and Yunfeng Liu.
1. **[SegFormer](model_doc/segformer)** (from NVIDIA) released with the paper [SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers](https://arxiv.org/abs/2105.15203) by Enze Xie, Wenhai Wang, Zhiding Yu, Anima Anandkumar, Jose M. Alvarez, Ping Luo.
1. **[Segment Anything](model_doc/sam)** (from Meta AI) released with the paper [Segment Anything](https://ai.facebook.com/research/publications/segment-anything/) by Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alex Berg, Wan-Yen Lo, Piotr Dollar, Ross Girshick.
1. **[SEW](model_doc/sew)** (from ASAPP) released with the paper [Performance-Efficiency Trade-offs in Unsupervised Pre-training for Speech Recognition](https://arxiv.org/abs/2109.06870) by Felix Wu, Kwangyoun Kim, Jing Pan, Kyu Han, Kilian Q. Weinberger, Yoav Artzi.
1. **[SEW-D](model_doc/sew_d)** (from ASAPP) released with the paper [Performance-Efficiency Trade-offs in Unsupervised Pre-training for Speech Recognition](https://arxiv.org/abs/2109.06870) by Felix Wu, Kwangyoun Kim, Jing Pan, Kyu Han, Kilian Q. Weinberger, Yoav Artzi.
1. **[SpeechT5](model_doc/speecht5)** (from Microsoft Research) released with the paper [SpeechT5: Unified-Modal Encoder-Decoder Pre-Training for Spoken Language Processing](https://arxiv.org/abs/2110.07205) by Junyi Ao, Rui Wang, Long Zhou, Chengyi Wang, Shuo Ren, Yu Wu, Shujie Liu, Tom Ko, Qing Li, Yu Zhang, Zhihua Wei, Yao Qian, Jinyu Li, Furu Wei.
@@ -391,6 +392,7 @@ Flax), PyTorch, and/or TensorFlow.
| RoBERTa-PreLayerNorm | ❌ | ❌ | ✅ | ✅ | ✅ |
| RoCBert | ✅ | ❌ | ✅ | ❌ | ❌ |
| RoFormer | ✅ | ✅ | ✅ | ✅ | ✅ |
| SAM | ❌ | ❌ | ✅ | ❌ | ❌ |
| SegFormer | ❌ | ❌ | ✅ | ✅ | ❌ |
| SEW | ❌ | ❌ | ✅ | ❌ | ❌ |
| SEW-D | ❌ | ❌ | ✅ | ❌ | ❌ |

View File

@@ -0,0 +1,96 @@
<!--Copyright 2023 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# SAM
## Overview
SAM (Segment Anything Model) was proposed in [Segment Anything](https://ai.facebook.com/research/publications/segment-anything/) by Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alex Berg, Wan-Yen Lo, Piotr Dollar, Ross Girshick.
The model can be used to predict segmentation masks of any object of interest given an input image.
![example image](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/sam-output.png)
The abstract from the paper is the following:
*We introduce the Segment Anything (SA) project: a new task, model, and dataset for image segmentation. Using our efficient model in a data collection loop, we built the largest segmentation dataset to date (by far), with over 1 billion masks on 11M licensed and privacy respecting images. The model is designed and trained to be promptable, so it can transfer zero-shot to new image distributions and tasks. We evaluate its capabilities on numerous tasks and find that its zero-shot performance is impressive -- often competitive with or even superior to prior fully supervised results. We are releasing the Segment Anything Model (SAM) and corresponding dataset (SA-1B) of 1B masks and 11M images at \href{https://segment-anything.com}{https://segment-anything.com} to foster research into foundation models for computer vision.*
Tips:
- The model predicts binary masks that states the presence or not of the object of interest given an image.
- The model predicts much better results if input 2D points and/or input bounding boxes are provided
- You can prompt multiple points for the same image, and predict a single mask.
- Fine-tuning the model is not supported yet
- According to the paper, textual input should be also supported. However, at this time of writing this seems to be not supported according to [the official repository](https://github.com/facebookresearch/segment-anything/issues/4#issuecomment-1497626844).
This model was contributed by [ybelkada](https://huggingface.co/ybelkada) and [ArthurZ](https://huggingface.co/ArthurZ).
The original code can be found [here](https://github.com/facebookresearch/segment-anything).
Below is an example on how to run mask generation given an image and a 2D point:
```python
from PIL import Image
import requests
from transformers import SamModelForMaskedGeneration, SamProcessor
model = SamModelForMaskedGeneration.from_pretrained("facebook/sam-vit-huge")
processsor = SamProcessor.from_pretrained("facebook/sam-vit-huge")
img_url = "https://huggingface.co/ybelkada/segment-anything/resolve/main/assets/car.png"
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert("RGB")
input_points = [[[450, 600]]] # 2D location of a window in the image
inputs = processor(raw_image, input_points=input_points, return_tensors="pt").to(device)
outputs = model(**inputs)
masks = processor.image_processor.post_process_masks(
outputs.pred_masks.cpu(), inputs["original_sizes"].cpu(), inputs["reshaped_input_sizes"].cpu()
)
scores = outputs.iou_scores
```
Resources:
- [Demo notebook](https://github.com/huggingface/notebooks/blob/main/examples/segment_anything.ipynb) for using the model
## SamConfig
[[autodoc]] SamConfig
## SamVisionConfig
[[autodoc]] SamVisionConfig
## SamMaskDecoderConfig
[[autodoc]] SamMaskDecoderConfig
## SamPromptEncoderConfig
[[autodoc]] SamPromptEncoderConfig
## SamProcessor
[[autodoc]] SamProcessor
## SamImageProcessor
[[autodoc]] SamImageProcessor
## SamModel
[[autodoc]] SamModel
- forward

View File

@@ -429,6 +429,14 @@ _import_structure = {
"models.roberta_prelayernorm": ["ROBERTA_PRELAYERNORM_PRETRAINED_CONFIG_ARCHIVE_MAP", "RobertaPreLayerNormConfig"],
"models.roc_bert": ["ROC_BERT_PRETRAINED_CONFIG_ARCHIVE_MAP", "RoCBertConfig", "RoCBertTokenizer"],
"models.roformer": ["ROFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP", "RoFormerConfig", "RoFormerTokenizer"],
"models.sam": [
"SAM_PRETRAINED_CONFIG_ARCHIVE_MAP",
"SamConfig",
"SamMaskDecoderConfig",
"SamProcessor",
"SamPromptEncoderConfig",
"SamVisionConfig",
],
"models.segformer": ["SEGFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP", "SegformerConfig"],
"models.sew": ["SEW_PRETRAINED_CONFIG_ARCHIVE_MAP", "SEWConfig"],
"models.sew_d": ["SEW_D_PRETRAINED_CONFIG_ARCHIVE_MAP", "SEWDConfig"],
@@ -875,6 +883,7 @@ else:
_import_structure["models.perceiver"].extend(["PerceiverFeatureExtractor", "PerceiverImageProcessor"])
_import_structure["models.pix2struct"].extend(["Pix2StructImageProcessor"])
_import_structure["models.poolformer"].extend(["PoolFormerFeatureExtractor", "PoolFormerImageProcessor"])
_import_structure["models.sam"].extend(["SamImageProcessor"])
_import_structure["models.segformer"].extend(["SegformerFeatureExtractor", "SegformerImageProcessor"])
_import_structure["models.swin2sr"].append("Swin2SRImageProcessor")
_import_structure["models.tvlt"].append("TvltImageProcessor")
@@ -2332,6 +2341,13 @@ else:
"load_tf_weights_in_roformer",
]
)
_import_structure["models.sam"].extend(
[
"SAM_PRETRAINED_MODEL_ARCHIVE_LIST",
"SamModel",
"SamPreTrainedModel",
]
)
_import_structure["models.segformer"].extend(
[
"SEGFORMER_PRETRAINED_MODEL_ARCHIVE_LIST",
@@ -4126,6 +4142,14 @@ if TYPE_CHECKING:
)
from .models.roc_bert import ROC_BERT_PRETRAINED_CONFIG_ARCHIVE_MAP, RoCBertConfig, RoCBertTokenizer
from .models.roformer import ROFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP, RoFormerConfig, RoFormerTokenizer
from .models.sam import (
SAM_PRETRAINED_CONFIG_ARCHIVE_MAP,
SamConfig,
SamMaskDecoderConfig,
SamProcessor,
SamPromptEncoderConfig,
SamVisionConfig,
)
from .models.segformer import SEGFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP, SegformerConfig
from .models.sew import SEW_PRETRAINED_CONFIG_ARCHIVE_MAP, SEWConfig
from .models.sew_d import SEW_D_PRETRAINED_CONFIG_ARCHIVE_MAP, SEWDConfig
@@ -4516,6 +4540,7 @@ if TYPE_CHECKING:
from .models.perceiver import PerceiverFeatureExtractor, PerceiverImageProcessor
from .models.pix2struct import Pix2StructImageProcessor
from .models.poolformer import PoolFormerFeatureExtractor, PoolFormerImageProcessor
from .models.sam import SamImageProcessor
from .models.segformer import SegformerFeatureExtractor, SegformerImageProcessor
from .models.swin2sr import Swin2SRImageProcessor
from .models.tvlt import TvltImageProcessor
@@ -5709,6 +5734,11 @@ if TYPE_CHECKING:
RoFormerPreTrainedModel,
load_tf_weights_in_roformer,
)
from .models.sam import (
SAM_PRETRAINED_MODEL_ARCHIVE_LIST,
SamModel,
SamPreTrainedModel,
)
from .models.segformer import (
SEGFORMER_PRETRAINED_MODEL_ARCHIVE_LIST,
SegformerDecodeHead,

View File

@@ -467,7 +467,7 @@ class BaseImageProcessor(ImageProcessingMixin):
raise NotImplementedError("Each image processor must implement its own preprocess method")
VALID_SIZE_DICT_KEYS = ({"height", "width"}, {"shortest_edge"}, {"shortest_edge", "longest_edge"})
VALID_SIZE_DICT_KEYS = ({"height", "width"}, {"shortest_edge"}, {"shortest_edge", "longest_edge"}, {"longest_edge"})
def is_valid_size_dict(size_dict):
@@ -501,6 +501,10 @@ def convert_to_size_dict(
return {"height": size[0], "width": size[1]}
elif isinstance(size, (tuple, list)) and not height_width_order:
return {"height": size[1], "width": size[0]}
elif size is None and max_size is not None:
if default_to_square:
raise ValueError("Cannot specify both default_to_square=True and max_size")
return {"longest_edge": max_size}
raise ValueError(f"Could not convert size input to size dict: {size}")

View File

@@ -160,6 +160,7 @@ from . import (
roberta_prelayernorm,
roc_bert,
roformer,
sam,
segformer,
sew,
sew_d,

View File

@@ -161,6 +161,7 @@ CONFIG_MAPPING_NAMES = OrderedDict(
("roberta-prelayernorm", "RobertaPreLayerNormConfig"),
("roc_bert", "RoCBertConfig"),
("roformer", "RoFormerConfig"),
("sam", "SamConfig"),
("segformer", "SegformerConfig"),
("sew", "SEWConfig"),
("sew-d", "SEWDConfig"),
@@ -338,6 +339,7 @@ CONFIG_ARCHIVE_MAP_MAPPING_NAMES = OrderedDict(
("roberta-prelayernorm", "ROBERTA_PRELAYERNORM_PRETRAINED_CONFIG_ARCHIVE_MAP"),
("roc_bert", "ROC_BERT_PRETRAINED_CONFIG_ARCHIVE_MAP"),
("roformer", "ROFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP"),
("sam", "SAM_PRETRAINED_CONFIG_ARCHIVE_MAP"),
("segformer", "SEGFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP"),
("sew", "SEW_PRETRAINED_CONFIG_ARCHIVE_MAP"),
("sew-d", "SEW_D_PRETRAINED_CONFIG_ARCHIVE_MAP"),
@@ -537,6 +539,7 @@ MODEL_NAMES_MAPPING = OrderedDict(
("roberta-prelayernorm", "RoBERTa-PreLayerNorm"),
("roc_bert", "RoCBert"),
("roformer", "RoFormer"),
("sam", "SAM"),
("segformer", "SegFormer"),
("sew", "SEW"),
("sew-d", "SEW-D"),

View File

@@ -84,6 +84,7 @@ IMAGE_PROCESSOR_MAPPING_NAMES = OrderedDict(
("poolformer", "PoolFormerImageProcessor"),
("regnet", "ConvNextImageProcessor"),
("resnet", "ConvNextImageProcessor"),
("sam", "SamImageProcessor"),
("segformer", "SegformerImageProcessor"),
("swin", "ViTImageProcessor"),
("swin2sr", "Swin2SRImageProcessor"),

View File

@@ -156,6 +156,7 @@ MODEL_MAPPING_NAMES = OrderedDict(
("roberta-prelayernorm", "RobertaPreLayerNormModel"),
("roc_bert", "RoCBertModel"),
("roformer", "RoFormerModel"),
("sam", "SamModel"),
("segformer", "SegformerModel"),
("sew", "SEWModel"),
("sew-d", "SEWDModel"),
@@ -976,6 +977,12 @@ MODEL_FOR_BACKBONE_MAPPING_NAMES = OrderedDict(
]
)
MODEL_FOR_AUTOMATIC_MASK_GENERATION_MAPPING_NAMES = OrderedDict(
[
("sam", "SamModel"),
]
)
MODEL_MAPPING = _LazyAutoMapping(CONFIG_MAPPING_NAMES, MODEL_MAPPING_NAMES)
MODEL_FOR_PRETRAINING_MAPPING = _LazyAutoMapping(CONFIG_MAPPING_NAMES, MODEL_FOR_PRETRAINING_MAPPING_NAMES)
MODEL_WITH_LM_HEAD_MAPPING = _LazyAutoMapping(CONFIG_MAPPING_NAMES, MODEL_WITH_LM_HEAD_MAPPING_NAMES)
@@ -1051,6 +1058,10 @@ MODEL_FOR_AUDIO_XVECTOR_MAPPING = _LazyAutoMapping(CONFIG_MAPPING_NAMES, MODEL_F
MODEL_FOR_BACKBONE_MAPPING = _LazyAutoMapping(CONFIG_MAPPING_NAMES, MODEL_FOR_BACKBONE_MAPPING_NAMES)
MODEL_FOR_AUTOMATIC_MASK_GENERATION_MAPPING = _LazyAutoMapping(
CONFIG_MAPPING_NAMES, MODEL_FOR_AUTOMATIC_MASK_GENERATION_MAPPING_NAMES
)
class AutoModel(_BaseAutoModelClass):
_model_mapping = MODEL_MAPPING

View File

@@ -61,6 +61,7 @@ PROCESSOR_MAPPING_NAMES = OrderedDict(
("oneformer", "OneFormerProcessor"),
("owlvit", "OwlViTProcessor"),
("pix2struct", "Pix2StructProcessor"),
("sam", "SamProcessor"),
("sew", "Wav2Vec2Processor"),
("sew-d", "Wav2Vec2Processor"),
("speech_to_text", "Speech2TextProcessor"),

View File

@@ -0,0 +1,80 @@
# Copyright 2023 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from typing import TYPE_CHECKING
from ...utils import OptionalDependencyNotAvailable, _LazyModule, is_torch_available, is_vision_available
_import_structure = {
"configuration_sam": [
"SAM_PRETRAINED_CONFIG_ARCHIVE_MAP",
"SamConfig",
"SamMaskDecoderConfig",
"SamPromptEncoderConfig",
"SamVisionConfig",
],
"processing_sam": ["SamProcessor"],
}
try:
if not is_torch_available():
raise OptionalDependencyNotAvailable()
except OptionalDependencyNotAvailable:
pass
else:
_import_structure["modeling_sam"] = [
"SAM_PRETRAINED_MODEL_ARCHIVE_LIST",
"SamModel",
"SamPreTrainedModel",
]
try:
if not is_vision_available():
raise OptionalDependencyNotAvailable()
except OptionalDependencyNotAvailable:
pass
else:
_import_structure["image_processing_sam"] = ["SamImageProcessor"]
if TYPE_CHECKING:
from .configuration_sam import (
SAM_PRETRAINED_CONFIG_ARCHIVE_MAP,
SamConfig,
SamMaskDecoderConfig,
SamPromptEncoderConfig,
SamVisionConfig,
)
from .processing_sam import SamProcessor
try:
if not is_torch_available():
raise OptionalDependencyNotAvailable()
except OptionalDependencyNotAvailable:
pass
else:
from .modeling_sam import SAM_PRETRAINED_MODEL_ARCHIVE_LIST, SamModel, SamPreTrainedModel
try:
if not is_vision_available():
raise OptionalDependencyNotAvailable()
except OptionalDependencyNotAvailable:
pass
else:
from .image_processing_sam import SamImageProcessor
else:
import sys
sys.modules[__name__] = _LazyModule(__name__, globals()["__file__"], _import_structure, module_spec=__spec__)

View File

@@ -0,0 +1,344 @@
# coding=utf-8
# Copyright 2023 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" SAM model configuration"""
import copy
from ...configuration_utils import PretrainedConfig
from ...utils import logging
logger = logging.get_logger(__name__)
SAM_PRETRAINED_CONFIG_ARCHIVE_MAP = {
"facebook/sam-vit-huge": "https://huggingface.co/facebook/sam-vit-huge/resolve/main/config.json",
"facebook/sam-vit-large": "https://huggingface.co/facebook/sam-vit-large/resolve/main/config.json",
"facebook/sam-vit-big": "https://huggingface.co/facebook/sam-vit-big/resolve/main/config.json",
}
class SamPromptEncoderConfig(PretrainedConfig):
r"""
This is the configuration class to store the configuration of a [`SamPromptEncoder`]. The [`SamPromptEncoder`]
module is used to encode the input 2D points and bounding boxes. Instantiating a configuration defaults will yield
a similar configuration to that of the SAM-vit-h
[facebook/sam-vit-huge](https://huggingface.co/facebook/sam-vit-huge) architecture.
Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
documentation from [`PretrainedConfig`] for more information.
Args:
hidden_size (`int`, *optional*, defaults to 256):
Dimensionality of the hidden states.
image_size (`int`, *optional*, defaults to 1024):
The expected output resolution of the image.
patch_size (`int`, *optional*, defaults to 16):
The size (resolution) of each patch.
mask_input_channels (`int`, *optional*, defaults to 16):
The number of channels to be fed to the `MaskDecoder` module.
num_point_embeddings (`int`, *optional*, defaults to 4):
The number of point embeddings to be used.
hidden_act (`str`, *optional*, defaults to `"gelu"`):
The non-linear activation function in the encoder and pooler.
"""
def __init__(
self,
hidden_size=256,
image_size=1024,
patch_size=16,
mask_input_channels=16,
num_point_embeddings=4,
hidden_act="gelu",
layer_norm_eps=1e-6,
**kwargs,
):
super().__init__(**kwargs)
self.hidden_size = hidden_size
self.image_size = image_size
self.patch_size = patch_size
self.image_embedding_size = image_size // patch_size
self.mask_input_channels = mask_input_channels
self.num_point_embeddings = num_point_embeddings
self.hidden_act = hidden_act
self.layer_norm_eps = layer_norm_eps
class SamMaskDecoderConfig(PretrainedConfig):
r"""
This is the configuration class to store the configuration of a [`SamMaskDecoder`]. It is used to instantiate a SAM
mask decoder to the specified arguments, defining the model architecture. Instantiating a configuration defaults
will yield a similar configuration to that of the SAM-vit-h
[facebook/sam-vit-huge](https://huggingface.co/facebook/sam-vit-huge) architecture.
Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
documentation from [`PretrainedConfig`] for more information.
Args:
hidden_size (`int`, *optional*, defaults to 256):
Dimensionality of the hidden states.
hidden_act (`str`, *optional*, defaults to `"relu"`):
The non-linear activation function used inside the `SamMaskDecoder` module.
mlp_dim (`int`, *optional*, defaults to 2048):
Dimensionality of the "intermediate" (i.e., feed-forward) layer in the Transformer encoder.
num_hidden_layers (`int`, *optional*, defaults to 2):
Number of hidden layers in the Transformer encoder.
num_attention_heads (`int`, *optional*, defaults to 8):
Number of attention heads for each attention layer in the Transformer encoder.
attention_downsample_rate (`int`, *optional*, defaults to 2):
The downsampling rate of the attention layer.
num_multimask_outputs (`int`, *optional*, defaults to 3):
The number of outputs from the `SamMaskDecoder` module. In the Segment Anything paper, this is set to 3.
iou_head_depth (`int`, *optional*, defaults to 3):
The number of layers in the IoU head module.
iou_head_hidden_dim (`int`, *optional*, defaults to 256):
The dimensionality of the hidden states in the IoU head module.
layer_norm_eps (`float`, *optional*, defaults to 1e-6):
The epsilon used by the layer normalization layers.
"""
def __init__(
self,
hidden_size=256,
hidden_act="relu",
mlp_dim=2048,
num_hidden_layers=2,
num_attention_heads=8,
attention_downsample_rate=2,
num_multimask_outputs=3,
iou_head_depth=3,
iou_head_hidden_dim=256,
layer_norm_eps=1e-6,
**kwargs,
):
super().__init__(**kwargs)
self.hidden_size = hidden_size
self.hidden_act = hidden_act
self.mlp_dim = mlp_dim
self.num_hidden_layers = num_hidden_layers
self.num_attention_heads = num_attention_heads
self.attention_downsample_rate = attention_downsample_rate
self.num_multimask_outputs = num_multimask_outputs
self.iou_head_depth = iou_head_depth
self.iou_head_hidden_dim = iou_head_hidden_dim
self.layer_norm_eps = layer_norm_eps
class SamVisionConfig(PretrainedConfig):
r"""
This is the configuration class to store the configuration of a [`SamVisionModel`]. It is used to instantiate a SAM
vision encoder according to the specified arguments, defining the model architecture. Instantiating a configuration
defaults will yield a similar configuration to that of the SAM ViT-h
[facebook/sam-vit-huge](https://huggingface.co/facebook/sam-vit-huge) architecture.
Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
documentation from [`PretrainedConfig`] for more information.
Args:
hidden_size (`int`, *optional*, defaults to 768):
Dimensionality of the encoder layers and the pooler layer.
intermediate_size (`int`, *optional*, defaults to 6144):
Dimensionality of the "intermediate" (i.e., feed-forward) layer in the Transformer encoder.
projection_dim (`int`, *optional*, defaults to 512):
Dimensionality of the projection layer in the Transformer encoder.
output_channels (`int`, *optional*, defaults to 256):
Dimensionality of the output channels in the Patch Encoder.
num_hidden_layers (`int`, *optional*, defaults to 12):
Number of hidden layers in the Transformer encoder.
num_attention_heads (`int`, *optional*, defaults to 12):
Number of attention heads for each attention layer in the Transformer encoder.
num_channels (`int`, *optional*, defaults to 3):
Number of channels in the input image.
image_size (`int`, *optional*, defaults to 1024):
Expected resolution. Target size of the resized input image.
patch_size (`int`, *optional*, defaults to 16):
Size of the patches to be extracted from the input image.
hidden_act (`str`, *optional*, defaults to `"gelu"`):
The non-linear activation function (function or string)
layer_norm_eps (`float`, *optional*, defaults to 1e-6):
The epsilon used by the layer normalization layers.
dropout (`float`, *optional*, defaults to 0.0):
The dropout probability.
attention_dropout (`float`, *optional*, defaults to 0.0):
The dropout ratio for the attention probabilities.
initializer_range (`float`, *optional*, defaults to 1e-10):
The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
initializer_factor (`float`, *optional*, defaults to 1.0):
A factor for multiplying the initializer range.
qkv_bias (`bool`, *optional*, defaults to `True`):
Whether to add a bias to query, key, value projections.
mlp_ratio (`float`, *optional*, defaults to 4.0):
Ratio of mlp hidden dim to embedding dim.
use_abs_pos (`bool`, *optional*, defaults to True):
Whether to use absolute position embedding.
use_rel_pos (`bool`, *optional*, defaults to True):
Whether to use relative position embedding.
window_size (`int`, *optional*, defaults to 14):
Window size for relative position.
global_attn_indexes (`List[int]`, *optional*, defaults to `[2, 5, 8, 11]`):
The indexes of the global attention layers.
num_pos_feats (`int`, *optional*, defaults to 128):
The dimensionality of the position embedding.
mlp_dim (`int`, *optional*, defaults to None):
The dimensionality of the MLP layer in the Transformer encoder. If `None`, defaults to `mlp_ratio *
hidden_size`.
"""
def __init__(
self,
hidden_size=768,
intermediate_size=6144,
projection_dim=512,
output_channels=256,
num_hidden_layers=12,
num_attention_heads=12,
num_channels=3,
image_size=1024,
patch_size=16,
hidden_act="gelu",
layer_norm_eps=1e-06,
dropout=0.0,
attention_dropout=0.0,
initializer_range=1e-10,
initializer_factor=1.0,
qkv_bias=True,
mlp_ratio=4.0,
use_abs_pos=True,
use_rel_pos=True,
window_size=14,
global_attn_indexes=[2, 5, 8, 11],
num_pos_feats=128,
mlp_dim=None,
**kwargs,
):
super().__init__(**kwargs)
self.hidden_size = hidden_size
self.intermediate_size = intermediate_size
self.projection_dim = projection_dim
self.output_channels = output_channels
self.num_hidden_layers = num_hidden_layers
self.num_attention_heads = num_attention_heads
self.num_channels = num_channels
self.image_size = image_size
self.patch_size = patch_size
self.hidden_act = hidden_act
self.layer_norm_eps = layer_norm_eps
self.dropout = dropout
self.attention_dropout = attention_dropout
self.initializer_range = initializer_range
self.initializer_factor = initializer_factor
self.qkv_bias = qkv_bias
self.mlp_ratio = mlp_ratio
self.use_abs_pos = use_abs_pos
self.use_rel_pos = use_rel_pos
self.window_size = window_size
self.global_attn_indexes = global_attn_indexes
self.num_pos_feats = num_pos_feats
self.mlp_dim = int(hidden_size * mlp_ratio) if mlp_dim is None else mlp_dim
class SamConfig(PretrainedConfig):
r"""
[`SamConfig`] is the configuration class to store the configuration of a [`SamModel`]. It is used to instantiate a
SAM model according to the specified arguments, defining the vision model, prompt-encoder model and mask decoder
configs. Instantiating a configuration with the defaults will yield a similar configuration to that of the
SAM-ViT-H [facebook/sam-vit-huge](https://huggingface.co/facebook/sam-vit-huge) architecture.
Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
documentation from [`PretrainedConfig`] for more information.
Args:
vision_config (Union[`dict`, `SamVisionConfig`], *optional*):
Dictionary of configuration options used to initialize [`SamVisionConfig`].
prompt_encoder_config (Union[`dict`, `SamPromptEncoderConfig`], *optional*):
Dictionary of configuration options used to initialize [`SamPromptEncoderConfig`].
mask_decoder_config (Union[`dict`, `SamMaskDecoderConfig`], *optional*):
Dictionary of configuration options used to initialize [`SamMaskDecoderConfig`].
kwargs (*optional*):
Dictionary of keyword arguments.
Example:
```python
>>> from transformers import (
... SamVisionConfig,
... SamPromptEncoderConfig,
... SamMaskDecoderConfig,
... SamModel,
... )
>>> # Initializing a SamConfig with `"facebook/sam-vit-huge"` style configuration
>>> configuration = SamConfig()
>>> # Initializing a SamModel (with random weights) from the `"facebook/sam-vit-huge"` style configuration
>>> model = SamModel(configuration)
>>> # Accessing the model configuration
>>> configuration = model.config
>>> # We can also initialize a SamConfig from a SamVisionConfig, SamPromptEncoderConfig, and SamMaskDecoderConfig
>>> # Initializing SAM vision, SAM Q-Former and language model configurations
>>> vision_config = SamVisionConfig()
>>> prompt_encoder_config = SamPromptEncoderConfig()
>>> mask_decoder_config = SamMaskDecoderConfig()
>>> config = SamConfig(vision_config, prompt_encoder_config, mask_decoder_config)
```"""
model_type = "sam"
is_composition = True
def __init__(
self,
vision_config=None,
prompt_encoder_config=None,
mask_decoder_config=None,
initializer_range=0.02,
**kwargs,
):
super().__init__(**kwargs)
vision_config = vision_config if vision_config is not None else {}
prompt_encoder_config = prompt_encoder_config if prompt_encoder_config is not None else {}
mask_decoder_config = mask_decoder_config if mask_decoder_config is not None else {}
if isinstance(vision_config, SamVisionConfig):
vision_config = vision_config.to_dict()
if isinstance(prompt_encoder_config, SamPromptEncoderConfig):
prompt_encoder_config = prompt_encoder_config.to_dict()
if isinstance(mask_decoder_config, SamMaskDecoderConfig):
mask_decoder_config = mask_decoder_config.to_dict()
self.vision_config = SamVisionConfig(**vision_config)
self.prompt_encoder_config = SamPromptEncoderConfig(**prompt_encoder_config)
self.mask_decoder_config = SamMaskDecoderConfig(**mask_decoder_config)
self.initializer_range = initializer_range
def to_dict(self):
"""
Serializes this instance to a Python dictionary. Override the default [`~PretrainedConfig.to_dict`].
Returns:
`Dict[str, any]`: Dictionary of all the attributes that make up this configuration instance,
"""
output = copy.deepcopy(self.__dict__)
output["vision_config"] = self.vision_config.to_dict()
output["prompt_encoder_config"] = self.prompt_encoder_config.to_dict()
output["mask_decoder_config"] = self.mask_decoder_config.to_dict()
output["model_type"] = self.__class__.model_type
return output

View File

@@ -0,0 +1,206 @@
# coding=utf-8
# Copyright 2023 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Convert SAM checkpoints from the original repository.
"""
import argparse
import re
import numpy as np
import requests
import torch
from huggingface_hub import hf_hub_download
from PIL import Image
from transformers import (
SamConfig,
SamImageProcessor,
SamModel,
SamProcessor,
SamVisionConfig,
)
KEYS_TO_MODIFY_MAPPING = {
"iou_prediction_head.layers.0": "iou_prediction_head.proj_in",
"iou_prediction_head.layers.1": "iou_prediction_head.layers.0",
"iou_prediction_head.layers.2": "iou_prediction_head.proj_out",
"mask_decoder.output_upscaling.0": "mask_decoder.upscale_conv1",
"mask_decoder.output_upscaling.1": "mask_decoder.upscale_layer_norm",
"mask_decoder.output_upscaling.3": "mask_decoder.upscale_conv2",
"mask_downscaling.0": "mask_embed.conv1",
"mask_downscaling.1": "mask_embed.layer_norm1",
"mask_downscaling.3": "mask_embed.conv2",
"mask_downscaling.4": "mask_embed.layer_norm2",
"mask_downscaling.6": "mask_embed.conv3",
"point_embeddings": "point_embed",
"pe_layer.positional_encoding_gaussian_matrix": "shared_embedding.positional_embedding",
"image_encoder": "vision_encoder",
"neck.0": "neck.conv1",
"neck.1": "neck.layer_norm1",
"neck.2": "neck.conv2",
"neck.3": "neck.layer_norm2",
"patch_embed.proj": "patch_embed.projection",
".norm": ".layer_norm",
"blocks": "layers",
}
def replace_keys(state_dict):
model_state_dict = {}
state_dict.pop("pixel_mean", None)
state_dict.pop("pixel_std", None)
output_hypernetworks_mlps_pattern = r".*.output_hypernetworks_mlps.(\d+).layers.(\d+).*"
for key, value in state_dict.items():
for key_to_modify, new_key in KEYS_TO_MODIFY_MAPPING.items():
if key_to_modify in key:
key = key.replace(key_to_modify, new_key)
if re.match(output_hypernetworks_mlps_pattern, key):
layer_nb = int(re.match(output_hypernetworks_mlps_pattern, key).group(2))
if layer_nb == 0:
key = key.replace("layers.0", "proj_in")
elif layer_nb == 1:
key = key.replace("layers.1", "layers.0")
elif layer_nb == 2:
key = key.replace("layers.2", "proj_out")
model_state_dict[key] = value
model_state_dict["shared_image_embedding.positional_embedding"] = model_state_dict[
"prompt_encoder.shared_embedding.positional_embedding"
]
return model_state_dict
def convert_sam_checkpoint(model_name, pytorch_dump_folder, push_to_hub, model_hub_id="ybelkada/segment-anything"):
checkpoint_path = hf_hub_download(model_hub_id, f"checkpoints/{model_name}.pth")
if "sam_vit_b" in model_name:
config = SamConfig()
elif "sam_vit_l" in model_name:
vision_config = SamVisionConfig(
hidden_size=1024,
num_hidden_layers=24,
num_attention_heads=16,
global_attn_indexes=[5, 11, 17, 23],
)
config = SamConfig(
vision_config=vision_config,
)
elif "sam_vit_h" in model_name:
vision_config = SamVisionConfig(
hidden_size=1280,
num_hidden_layers=32,
num_attention_heads=16,
global_attn_indexes=[7, 15, 23, 31],
)
config = SamConfig(
vision_config=vision_config,
)
state_dict = torch.load(checkpoint_path, map_location="cpu")
state_dict = replace_keys(state_dict)
image_processor = SamImageProcessor()
processor = SamProcessor(image_processor=image_processor)
hf_model = SamModel(config)
hf_model.load_state_dict(state_dict)
hf_model = hf_model.to("cuda")
img_url = "https://huggingface.co/ybelkada/segment-anything/resolve/main/assets/car.png"
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert("RGB")
input_points = [[[400, 650]]]
input_labels = [[1]]
inputs = processor(images=np.array(raw_image), return_tensors="pt").to("cuda")
with torch.no_grad():
output = hf_model(**inputs)
scores = output.iou_scores.squeeze()
if model_name == "sam_vit_h_4b8939":
assert scores[-1].item() == 0.579890251159668
inputs = processor(
images=np.array(raw_image), input_points=input_points, input_labels=input_labels, return_tensors="pt"
).to("cuda")
with torch.no_grad():
output = hf_model(**inputs)
scores = output.iou_scores.squeeze()
assert scores[-1].item() == 0.9712603092193604
input_boxes = ((75, 275, 1725, 850),)
inputs = processor(images=np.array(raw_image), input_boxes=input_boxes, return_tensors="pt").to("cuda")
with torch.no_grad():
output = hf_model(**inputs)
scores = output.iou_scores.squeeze()
assert scores[-1].item() == 0.8686015605926514
# Test with 2 points and 1 image.
input_points = [[[400, 650], [800, 650]]]
input_labels = [[1, 1]]
inputs = processor(
images=np.array(raw_image), input_points=input_points, input_labels=input_labels, return_tensors="pt"
).to("cuda")
with torch.no_grad():
output = hf_model(**inputs)
scores = output.iou_scores.squeeze()
assert scores[-1].item() == 0.9936047792434692
if __name__ == "__main__":
parser = argparse.ArgumentParser()
choices = ["sam_vit_b_01ec64", "sam_vit_h_4b8939", "sam_vit_l_0b3195"]
parser.add_argument(
"--model_name",
default="sam_vit_h_4b8939",
choices=choices,
type=str,
help="Path to hf config.json of model to convert",
)
parser.add_argument("--pytorch_dump_folder_path", default=None, type=str, help="Path to the output PyTorch model.")
parser.add_argument(
"--push_to_hub",
action="store_true",
help="Whether to push the model and processor to the hub after converting",
)
parser.add_argument(
"--model_hub_id",
default="ybelkada/segment-anything",
choices=choices,
type=str,
help="Path to hf config.json of model to convert",
)
args = parser.parse_args()
convert_sam_checkpoint(args.model_name, args.pytorch_dump_folder_path, args.push_to_hub, args.model_hub_id)

View File

@@ -0,0 +1,402 @@
# coding=utf-8
# Copyright 2023 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Image processor class for SAM."""
from typing import Dict, List, Optional, Tuple, Union
import numpy as np
from ...image_processing_utils import BaseImageProcessor, BatchFeature, get_size_dict
from ...image_transforms import convert_to_rgb, normalize, pad, rescale, resize, to_channel_dimension_format
from ...image_utils import (
IMAGENET_DEFAULT_MEAN,
IMAGENET_DEFAULT_STD,
ChannelDimension,
ImageInput,
PILImageResampling,
get_image_size,
make_list_of_images,
to_numpy_array,
valid_images,
)
from ...utils import TensorType, is_torch_available, logging, requires_backends
if is_torch_available():
import torch.nn.functional as F
logger = logging.get_logger(__name__)
class SamImageProcessor(BaseImageProcessor):
r"""
Constructs a SAM image processor.
Args:
do_resize (`bool`, *optional*, defaults to `True`):
Whether to resize the image's (height, width) dimensions to the specified `size`. Can be overridden by the
`do_resize` parameter in the `preprocess` method.
size (`dict`, *optional*, defaults to `{"longest_edge": 1024}`):
Size of the output image after resizing. Resizes the longest edge of the image to match
`size["longest_edge"]` while maintaining the aspect ratio. Can be overridden by the `size` parameter in the
`preprocess` method.
resample (`PILImageResampling`, *optional*, defaults to `PILImageResampling.BICUBIC`):
Resampling filter to use if resizing the image. Can be overridden by the `resample` parameter in the
`preprocess` method.
do_rescale (`bool`, *optional*, defaults to `True`):
Wwhether to rescale the image by the specified scale `rescale_factor`. Can be overridden by the
`do_rescale` parameter in the `preprocess` method.
rescale_factor (`int` or `float`, *optional*, defaults to `1/255`):
Scale factor to use if rescaling the image. Only has an effect if `do_rescale` is set to `True`. Can be
overridden by the `rescale_factor` parameter in the `preprocess` method.
do_normalize (`bool`, *optional*, defaults to `True`):
Whether to normalize the image. Can be overridden by the `do_normalize` parameter in the `preprocess`
method. Can be overridden by the `do_normalize` parameter in the `preprocess` method.
image_mean (`float` or `List[float]`, *optional*, defaults to `IMAGENET_DEFAULT_MEAN`):
Mean to use if normalizing the image. This is a float or list of floats the length of the number of
channels in the image. Can be overridden by the `image_mean` parameter in the `preprocess` method. Can be
overridden by the `image_mean` parameter in the `preprocess` method.
image_std (`float` or `List[float]`, *optional*, defaults to `IMAGENET_DEFAULT_STD`):
Standard deviation to use if normalizing the image. This is a float or list of floats the length of the
number of channels in the image. Can be overridden by the `image_std` parameter in the `preprocess` method.
Can be overridden by the `image_std` parameter in the `preprocess` method.
do_pad (`bool`, *optional*, defaults to `True`):
Whether to pad the image to the specified `pad_size`. Can be overridden by the `do_pad` parameter in the
`preprocess` method.
pad_size (`dict`, *optional*, defaults to `{"height": 1024, "width": 1024}`):
Size of the output image after padding. Can be overridden by the `pad_size` parameter in the `preprocess`
method.
do_convert_rgb (`bool`, *optional*, defaults to `True`):
Whether to convert the image to RGB.
"""
model_input_names = ["pixel_values"]
def __init__(
self,
do_resize: bool = True,
size: Dict[str, int] = None,
resample: PILImageResampling = PILImageResampling.BILINEAR,
do_rescale: bool = True,
rescale_factor: Union[int, float] = 1 / 255,
do_normalize: bool = True,
image_mean: Optional[Union[float, List[float]]] = None,
image_std: Optional[Union[float, List[float]]] = None,
do_pad: bool = True,
pad_size: int = None,
do_convert_rgb: bool = True,
**kwargs,
) -> None:
super().__init__(**kwargs)
size = size if size is not None else {"longest_edge": 1024}
size = get_size_dict(max_size=size, default_to_square=False) if not isinstance(size, dict) else size
pad_size = pad_size if pad_size is not None else {"height": 1024, "width": 1024}
pad_size = get_size_dict(pad_size, default_to_square=True)
self.do_resize = do_resize
self.size = size
self.resample = resample
self.do_rescale = do_rescale
self.rescale_factor = rescale_factor
self.do_normalize = do_normalize
self.image_mean = image_mean if image_mean is not None else IMAGENET_DEFAULT_MEAN
self.image_std = image_std if image_std is not None else IMAGENET_DEFAULT_STD
self.do_pad = do_pad
self.pad_size = pad_size
self.do_convert_rgb = do_convert_rgb
def pad_image(
self,
image: np.ndarray,
pad_size: Dict[str, int],
data_format: Optional[Union[str, ChannelDimension]] = None,
**kwargs,
) -> np.ndarray:
"""
Pad an image to `(pad_size["height"], pad_size["width"])` with zeros to the right and bottom.
Args:
image (`np.ndarray`):
Image to pad.
pad_size (`Dict[str, int]`):
Size of the output image after padding.
data_format (`str` or `ChannelDimension`, *optional*):
The data format of the image. Can be either "channels_first" or "channels_last". If `None`, the
`data_format` of the `image` will be used.
"""
output_height, output_width = pad_size["height"], pad_size["width"]
input_height, input_width = get_image_size(image)
pad_width = output_width - input_width
pad_height = output_height - input_height
padded_image = pad(image, ((0, pad_height), (0, pad_width)), data_format=data_format, **kwargs)
return padded_image
def _get_preprocess_shape(self, old_shape: Tuple[int, int], longest_edge: int):
"""
Compute the output size given input size and target long side length.
"""
oldh, oldw = old_shape
scale = longest_edge * 1.0 / max(oldh, oldw)
newh, neww = oldh * scale, oldw * scale
newh = int(newh + 0.5)
neww = int(neww + 0.5)
return (newh, neww)
def resize(
self,
image: np.ndarray,
size: Dict[str, int],
resample: PILImageResampling = PILImageResampling.BICUBIC,
data_format: Optional[Union[str, ChannelDimension]] = None,
**kwargs,
) -> np.ndarray:
"""
Resize an image to `(size["height"], size["width"])`.
Args:
image (`np.ndarray`):
Image to resize.
size (`Dict[str, int]`):
Dictionary in the format `{"longest_edge": int}` specifying the size of the output image. The longest
edge of the image will be resized to the specified size, while the other edge will be resized to
maintain the aspect ratio.
resample:
`PILImageResampling` filter to use when resizing the image e.g. `PILImageResampling.BILINEAR`.
data_format (`ChannelDimension` or `str`, *optional*):
The channel dimension format for the output image. If unset, the channel dimension format of the input
image is used. Can be one of:
- `"channels_first"` or `ChannelDimension.FIRST`: image in (num_channels, height, width) format.
- `"channels_last"` or `ChannelDimension.LAST`: image in (height, width, num_channels) format.
Returns:
`np.ndarray`: The resized image.
"""
size = get_size_dict(size)
if "longest_edge" not in size:
raise ValueError(f"The `size` dictionary must contain the key `longest_edge`. Got {size.keys()}")
input_size = get_image_size(image)
output_height, output_width = self._get_preprocess_shape(input_size, size["longest_edge"])
return resize(image, size=(output_height, output_width), resample=resample, data_format=data_format, **kwargs)
def rescale(
self,
image: np.ndarray,
scale: Union[int, float],
data_format: Optional[Union[str, ChannelDimension]] = None,
**kwargs,
):
"""
Rescale an image by a scale factor. image = image * scale.
Args:
image (`np.ndarray`):
Image to rescale.
scale (`int` or `float`):
Scale to apply to the image.
data_format (`str` or `ChannelDimension`, *optional*):
The channel dimension format of the image. If not provided, it will be the same as the input image.
"""
return rescale(image, scale=scale, data_format=data_format, **kwargs)
def normalize(
self,
image: np.ndarray,
mean: Union[float, List[float]],
std: Union[float, List[float]],
data_format: Optional[Union[str, ChannelDimension]] = None,
**kwargs,
) -> np.ndarray:
"""
Normalize an image. image = (image - image_mean) / image_std.
Args:
image (`np.ndarray`):
Image to normalize.
mean (`float` or `List[float]`):
Image mean.
std (`float` or `List[float]`):
Image standard deviation.
data_format (`str` or `ChannelDimension`, *optional*):
The channel dimension format of the image. If not provided, it will be the same as the input image.
"""
return normalize(image, mean=mean, std=std, data_format=data_format, **kwargs)
def preprocess(
self,
images: ImageInput,
do_resize: Optional[bool] = None,
size: Optional[Dict[str, int]] = None,
resample: Optional["PILImageResampling"] = None,
do_rescale: Optional[bool] = None,
rescale_factor: Optional[Union[int, float]] = None,
do_normalize: Optional[bool] = None,
image_mean: Optional[Union[float, List[float]]] = None,
image_std: Optional[Union[float, List[float]]] = None,
do_pad: Optional[bool] = None,
pad_size: Optional[Dict[str, int]] = None,
do_convert_rgb: bool = None,
return_tensors: Optional[Union[str, TensorType]] = None,
data_format: ChannelDimension = ChannelDimension.FIRST,
**kwargs,
):
"""
Preprocess an image or batch of images.
Args:
images (`ImageInput`):
Image to preprocess.
do_resize (`bool`, *optional*, defaults to `self.do_resize`):
Whether to resize the image.
size (`Dict[str, int]`, *optional*, defaults to `self.size`):
Controls the size of the image after `resize`. The longest edge of the image is resized to
`size["longest_edge"]` whilst preserving the aspect ratio.
resample (`PILImageResampling`, *optional*, defaults to `self.resample`):
`PILImageResampling` filter to use when resizing the image e.g. `PILImageResampling.BILINEAR`.
do_rescale (`bool`, *optional*, defaults to `self.do_rescale`):
Whether to rescale the image pixel values by rescaling factor.
rescale_factor (`int` or `float`, *optional*, defaults to `self.rescale_factor`):
Rescale factor to apply to the image pixel values.
do_normalize (`bool`, *optional*, defaults to `self.do_normalize`):
Whether to normalize the image.
image_mean (`float` or `List[float]`, *optional*, defaults to `self.image_mean`):
Image mean to normalize the image by if `do_normalize` is set to `True`.
image_std (`float` or `List[float]`, *optional*, defaults to `self.image_std`):
Image standard deviation to normalize the image by if `do_normalize` is set to `True`.
do_pad (`bool`, *optional*, defaults to `self.do_pad`):
Whether to pad the image.
pad_size (`Dict[str, int]`, *optional*, defaults to `self.pad_size`):
Controls the size of the padding applied to the image. The image is padded to `pad_size["height"]` and
`pad_size["width"]` if `do_pad` is set to `True`.
do_convert_rgb (`bool`, *optional*, defaults to `self.do_convert_rgb`):
Whether to convert the image to RGB.
return_tensors (`str` or `TensorType`, *optional*):
The type of tensors to return. Can be one of:
- Unset: Return a list of `np.ndarray`.
- `TensorType.TENSORFLOW` or `'tf'`: Return a batch of type `tf.Tensor`.
- `TensorType.PYTORCH` or `'pt'`: Return a batch of type `torch.Tensor`.
- `TensorType.NUMPY` or `'np'`: Return a batch of type `np.ndarray`.
- `TensorType.JAX` or `'jax'`: Return a batch of type `jax.numpy.ndarray`.
data_format (`ChannelDimension` or `str`, *optional*, defaults to `ChannelDimension.FIRST`):
The channel dimension format for the output image. Can be one of:
- `"channels_first"` or `ChannelDimension.FIRST`: image in (num_channels, height, width) format.
- `"channels_last"` or `ChannelDimension.LAST`: image in (height, width, num_channels) format.
- Unset: Use the channel dimension format of the input image.
"""
do_resize = do_resize if do_resize is not None else self.do_resize
size = size if size is not None else self.size
size = get_size_dict(max_size=size, default_to_square=False) if not isinstance(size, dict) else size
resample = resample if resample is not None else self.resample
do_rescale = do_rescale if do_rescale is not None else self.do_rescale
rescale_factor = rescale_factor if rescale_factor is not None else self.rescale_factor
do_normalize = do_normalize if do_normalize is not None else self.do_normalize
image_mean = image_mean if image_mean is not None else self.image_mean
image_std = image_std if image_std is not None else self.image_std
do_pad = do_pad if do_pad is not None else self.do_pad
pad_size = pad_size if pad_size is not None else self.pad_size
pad_size = get_size_dict(pad_size, default_to_square=True)
do_convert_rgb = do_convert_rgb if do_convert_rgb is not None else self.do_convert_rgb
images = make_list_of_images(images)
if not valid_images(images):
raise ValueError(
"Invalid image type. Must be of type PIL.Image.Image, numpy.ndarray, "
"torch.Tensor, tf.Tensor or jax.ndarray."
)
if do_resize and (size is None or resample is None):
raise ValueError("Size and resample must be specified if do_resize is True.")
if do_rescale and rescale_factor is None:
raise ValueError("Rescale factor must be specified if do_rescale is True.")
if do_normalize and (image_mean is None or image_std is None):
raise ValueError("Image mean and std must be specified if do_normalize is True.")
if do_pad and pad_size is None:
raise ValueError("Pad size must be specified if do_pad is True.")
# PIL RGBA images are converted to RGB
if do_convert_rgb:
images = [convert_to_rgb(image) for image in images]
# All transformations expect numpy arrays.
images = [to_numpy_array(image) for image in images]
original_sizes = [get_image_size(image) for image in images]
if do_resize:
images = [self.resize(image=image, size=size, resample=resample) for image in images]
reshaped_input_sizes = [get_image_size(image) for image in images]
if do_rescale:
images = [self.rescale(image=image, scale=rescale_factor) for image in images]
if do_normalize:
images = [self.normalize(image=image, mean=image_mean, std=image_std) for image in images]
if do_pad:
images = [self.pad_image(image=image, pad_size=pad_size) for image in images]
images = [to_channel_dimension_format(image, data_format) for image in images]
data = {"pixel_values": images, "original_sizes": original_sizes, "reshaped_input_sizes": reshaped_input_sizes}
encoded_outputs = BatchFeature(data=data, tensor_type=return_tensors)
return encoded_outputs
def post_process_masks(
self, masks, original_sizes, reshaped_input_sizes, mask_threshold=0.0, binarize=True, pad_size=None
):
"""
Remove padding and upscale masks to the original image size.
Args:
masks (`torch.Tensor`):
Batched masks from the mask_decoder in (batch_size, num_channels, height, width) format.
original_sizes (`torch.Tensor`):
The original size of the images before resizing for input to the model, in (height, width) format.
reshaped_input_sizes (`torch.Tensor`):
The size of the image input to the model, in (height, width) format. Used to remove padding.
mask_threshold (`float`, *optional*, defaults to 0.0):
The threshold to use for binarizing the masks.
binarize (`bool`, *optional*, defaults to `True`):
Whether to binarize the masks.
pad_size (`int`, *optional*, defaults to `self.pad_size`):
The target size the images were padded to before being passed to the model. If None, the target size is
assumed to be the processor's `pad_size`.
Returns:
(`torch.Tensor`): Batched masks in batch_size, num_channels, height, width) format, where (height, width)
is given by original_size.
"""
requires_backends(self, ["torch"])
pad_size = self.pad_size if pad_size is None else pad_size
target_image_size = (pad_size["height"], pad_size["width"])
output_masks = []
for i, original_size in enumerate(original_sizes):
interpolated_mask = F.interpolate(masks[i], target_image_size, mode="bilinear", align_corners=False)
interpolated_mask = interpolated_mask[..., : reshaped_input_sizes[i][0], : reshaped_input_sizes[i][1]]
interpolated_mask = F.interpolate(
interpolated_mask, [*original_size.numpy()], mode="bilinear", align_corners=False
)
if binarize:
interpolated_mask = interpolated_mask > mask_threshold
output_masks.append(interpolated_mask)
return output_masks

File diff suppressed because it is too large Load Diff

View File

@@ -0,0 +1,248 @@
# coding=utf-8
# Copyright 2023 The HuggingFace Inc. team.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Processor class for SAM.
"""
from copy import deepcopy
from typing import Optional, Union
import numpy as np
from ...processing_utils import ProcessorMixin
from ...tokenization_utils_base import BatchEncoding
from ...utils import TensorType, is_torch_available
if is_torch_available():
import torch
class SamProcessor(ProcessorMixin):
r"""
Constructs a SAM processor which wraps a SAM image processor and an 2D points & Bounding boxes processor into a
single processor.
[`SamProcessor`] offers all the functionalities of [`SamImageProcessor`]. See the docstring of
[`~SamImageProcessor.__call__`] for more information.
Args:
image_processor (`SamImageProcessor`):
An instance of [`SamImageProcessor`]. The image processor is a required input.
"""
attributes = ["image_processor"]
image_processor_class = "SamImageProcessor"
def __init__(self, image_processor):
super().__init__(image_processor)
self.current_processor = self.image_processor
self.point_pad_value = -10
self.target_size = self.image_processor.size["longest_edge"]
def __call__(
self,
images=None,
input_points=None,
input_labels=None,
input_boxes=None,
return_tensors: Optional[Union[str, TensorType]] = None,
**kwargs,
) -> BatchEncoding:
"""
This method uses [`SamImageProcessor.__call__`] method to prepare image(s) for the model. It also prepares 2D
points and bounding boxes for the model if they are provided.
"""
encoding_image_processor = self.image_processor(
images,
return_tensors=return_tensors,
**kwargs,
)
# pop arguments that are not used in the foward but used nevertheless
original_sizes = encoding_image_processor["original_sizes"]
if isinstance(original_sizes, torch.Tensor):
original_sizes = original_sizes.numpy()
input_points, input_labels, input_boxes = self._check_and_preprocess_points(
input_points=input_points,
input_labels=input_labels,
input_boxes=input_boxes,
)
encoding_image_processor = self._normalize_and_convert(
encoding_image_processor,
original_sizes,
input_points=input_points,
input_labels=input_labels,
input_boxes=input_boxes,
return_tensors=return_tensors,
)
return encoding_image_processor
def _normalize_and_convert(
self,
encoding_image_processor,
original_sizes,
input_points=None,
input_labels=None,
input_boxes=None,
return_tensors="pt",
):
if input_points is not None:
if len(original_sizes) != len(input_points):
input_points = [
self._normalize_coordinates(self.target_size, point, original_sizes[0]) for point in input_points
]
else:
input_points = [
self._normalize_coordinates(self.target_size, point, original_size)
for point, original_size in zip(input_points, original_sizes)
]
# check that all arrays have the same shape
if not all([point.shape == input_points[0].shape for point in input_points]):
if input_labels is not None:
input_points, input_labels = self._pad_points_and_labels(input_points, input_labels)
input_points = np.array(input_points)
if input_labels is not None:
input_labels = np.array(input_labels)
if input_boxes is not None:
if len(original_sizes) != len(input_boxes):
input_boxes = [
self._normalize_coordinates(self.target_size, box, original_sizes[0], is_bounding_box=True)
for box in input_boxes
]
else:
input_boxes = [
self._normalize_coordinates(self.target_size, box, original_size, is_bounding_box=True)
for box, original_size in zip(input_boxes, original_sizes)
]
input_boxes = np.array(input_boxes)
if input_boxes is not None:
if return_tensors == "pt":
input_boxes = torch.from_numpy(input_boxes)
# boxes batch size of 1 by default
input_boxes = input_boxes.unsqueeze(1) if len(input_boxes.shape) != 3 else input_boxes
encoding_image_processor.update({"input_boxes": input_boxes})
if input_points is not None:
if return_tensors == "pt":
input_points = torch.from_numpy(input_points)
# point batch size of 1 by default
input_points = input_points.unsqueeze(1) if len(input_points.shape) != 4 else input_points
encoding_image_processor.update({"input_points": input_points})
if input_labels is not None:
if return_tensors == "pt":
input_labels = torch.from_numpy(input_labels)
# point batch size of 1 by default
input_labels = input_labels.unsqueeze(1) if len(input_labels.shape) != 3 else input_labels
encoding_image_processor.update({"input_labels": input_labels})
return encoding_image_processor
def _pad_points_and_labels(self, input_points, input_labels):
r"""
The method pads the 2D points and labels to the maximum number of points in the batch.
"""
expected_nb_points = max([point.shape[0] for point in input_points])
processed_input_points = []
for i, point in enumerate(input_points):
if point.shape[0] != expected_nb_points:
point = np.concatenate(
[point, np.zeros((expected_nb_points - point.shape[0], 2)) + self.point_pad_value], axis=0
)
input_labels[i] = np.append(input_labels[i], [self.point_pad_value])
processed_input_points.append(point)
input_points = processed_input_points
return input_points, input_labels
def _normalize_coordinates(
self, target_size: int, coords: np.ndarray, original_size, is_bounding_box=False
) -> np.ndarray:
"""
Expects a numpy array of length 2 in the final dimension. Requires the original image size in (H, W) format.
"""
old_h, old_w = original_size
new_h, new_w = self.image_processor._get_preprocess_shape(original_size, longest_edge=target_size)
coords = deepcopy(coords).astype(float)
if is_bounding_box:
coords = coords.reshape(-1, 2, 2)
coords[..., 0] = coords[..., 0] * (new_w / old_w)
coords[..., 1] = coords[..., 1] * (new_h / old_h)
if is_bounding_box:
coords = coords.reshape(-1, 4)
return coords
def _check_and_preprocess_points(
self,
input_points=None,
input_labels=None,
input_boxes=None,
):
r"""
Check and preprocesses the 2D points, labels and bounding boxes. It checks if the input is valid and if they
are, it converts the coordinates of the points and bounding boxes. If a user passes directly a `torch.Tensor`,
it is converted to a `numpy.ndarray` and then to a `list`.
"""
if input_points is not None:
if isinstance(input_points, torch.Tensor):
input_points = input_points.numpy().tolist()
if not isinstance(input_points, list) and not isinstance(input_points[0], list):
raise ValueError("Input points must be a list of list of floating integers.")
input_points = [np.array(input_point) for input_point in input_points]
else:
input_points = None
if input_labels is not None:
if isinstance(input_labels, torch.Tensor):
input_labels = input_labels.numpy().tolist()
if not isinstance(input_labels, list) and not isinstance(input_labels[0], list):
raise ValueError("Input labels must be a list of list integers.")
input_labels = [np.array(label) for label in input_labels]
else:
input_labels = None
if input_boxes is not None:
if isinstance(input_boxes, torch.Tensor):
input_boxes = input_boxes.numpy().tolist()
if (
not isinstance(input_boxes, list)
and not isinstance(input_boxes[0], list)
and not isinstance(input_boxes[0][0], list)
):
raise ValueError("Input boxes must be a list of list of list of floating integers.")
input_boxes = [np.array(box).astype(np.float32) for box in input_boxes]
else:
input_boxes = None
return input_points, input_labels, input_boxes
@property
def model_input_names(self):
image_processor_input_names = self.image_processor.model_input_names
return list(dict.fromkeys(image_processor_input_names))
def post_process_masks(self, *args, **kwargs):
return self.image_processor.post_process_masks(*args, **kwargs)

View File

@@ -5987,6 +5987,23 @@ def load_tf_weights_in_roformer(*args, **kwargs):
requires_backends(load_tf_weights_in_roformer, ["torch"])
SAM_PRETRAINED_MODEL_ARCHIVE_LIST = None
class SamModel(metaclass=DummyObject):
_backends = ["torch"]
def __init__(self, *args, **kwargs):
requires_backends(self, ["torch"])
class SamPreTrainedModel(metaclass=DummyObject):
_backends = ["torch"]
def __init__(self, *args, **kwargs):
requires_backends(self, ["torch"])
SEGFORMER_PRETRAINED_MODEL_ARCHIVE_LIST = None

View File

@@ -408,6 +408,13 @@ class PoolFormerImageProcessor(metaclass=DummyObject):
requires_backends(self, ["vision"])
class SamImageProcessor(metaclass=DummyObject):
_backends = ["vision"]
def __init__(self, *args, **kwargs):
requires_backends(self, ["vision"])
class SegformerFeatureExtractor(metaclass=DummyObject):
_backends = ["vision"]

View File

View File

@@ -0,0 +1,735 @@
# coding=utf-8
# Copyright 2023 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" Testing suite for the PyTorch SAM model. """
import inspect
import unittest
import requests
from transformers import SamConfig, SamMaskDecoderConfig, SamPromptEncoderConfig, SamVisionConfig
from transformers.testing_utils import require_torch, slow, torch_device
from transformers.utils import is_torch_available, is_vision_available
from ...test_configuration_common import ConfigTester
from ...test_modeling_common import ModelTesterMixin, floats_tensor
if is_torch_available():
import torch
from torch import nn
from transformers import SamModel, SamProcessor
from transformers.models.sam.modeling_sam import SAM_PRETRAINED_MODEL_ARCHIVE_LIST
if is_vision_available():
from PIL import Image
class SamPromptEncoderTester:
def __init__(
self,
hidden_size=32,
input_image_size=24,
patch_size=2,
mask_input_channels=4,
num_point_embeddings=4,
hidden_act="gelu",
):
self.hidden_size = hidden_size
self.input_image_size = input_image_size
self.patch_size = patch_size
self.mask_input_channels = mask_input_channels
self.num_point_embeddings = num_point_embeddings
self.hidden_act = hidden_act
def get_config(self):
return SamPromptEncoderConfig(
image_size=self.input_image_size,
patch_size=self.patch_size,
mask_input_channels=self.mask_input_channels,
hidden_size=self.hidden_size,
num_point_embeddings=self.num_point_embeddings,
hidden_act=self.hidden_act,
)
def prepare_config_and_inputs(self):
dummy_points = floats_tensor([self.batch_size, 3, 2])
config = self.get_config()
return config, dummy_points
class SamMaskDecoderTester:
def __init__(
self,
hidden_size=32,
hidden_act="relu",
mlp_dim=64,
num_hidden_layers=2,
num_attention_heads=4,
attention_downsample_rate=2,
num_multimask_outputs=3,
iou_head_depth=3,
iou_head_hidden_dim=32,
layer_norm_eps=1e-6,
):
self.hidden_size = hidden_size
self.hidden_act = hidden_act
self.mlp_dim = mlp_dim
self.num_hidden_layers = num_hidden_layers
self.num_attention_heads = num_attention_heads
self.attention_downsample_rate = attention_downsample_rate
self.num_multimask_outputs = num_multimask_outputs
self.iou_head_depth = iou_head_depth
self.iou_head_hidden_dim = iou_head_hidden_dim
self.layer_norm_eps = layer_norm_eps
def get_config(self):
return SamMaskDecoderConfig(
hidden_size=self.hidden_size,
hidden_act=self.hidden_act,
mlp_dim=self.mlp_dim,
num_hidden_layers=self.num_hidden_layers,
num_attention_heads=self.num_attention_heads,
attention_downsample_rate=self.attention_downsample_rate,
num_multimask_outputs=self.num_multimask_outputs,
iou_head_depth=self.iou_head_depth,
iou_head_hidden_dim=self.iou_head_hidden_dim,
layer_norm_eps=self.layer_norm_eps,
)
def prepare_config_and_inputs(self):
config = self.get_config()
dummy_inputs = {
"image_embedding": floats_tensor([self.batch_size, self.hidden_size]),
}
return config, dummy_inputs
class SamModelTester:
def __init__(
self,
parent,
hidden_size=36,
intermediate_size=72,
projection_dim=62,
output_channels=32,
num_hidden_layers=2,
num_attention_heads=4,
num_channels=3,
image_size=24,
patch_size=2,
hidden_act="gelu",
layer_norm_eps=1e-06,
dropout=0.0,
attention_dropout=0.0,
initializer_range=0.02,
initializer_factor=1.0,
qkv_bias=True,
mlp_ratio=4.0,
use_abs_pos=True,
use_rel_pos=True,
rel_pos_zero_init=False,
window_size=14,
global_attn_indexes=[2, 5, 8, 11],
num_pos_feats=16,
mlp_dim=None,
batch_size=2,
):
self.parent = parent
self.image_size = image_size
self.patch_size = patch_size
self.output_channels = output_channels
self.num_channels = num_channels
self.hidden_size = hidden_size
self.projection_dim = projection_dim
self.num_hidden_layers = num_hidden_layers
self.num_attention_heads = num_attention_heads
self.intermediate_size = intermediate_size
self.dropout = dropout
self.attention_dropout = attention_dropout
self.initializer_range = initializer_range
self.initializer_factor = initializer_factor
self.hidden_act = hidden_act
self.layer_norm_eps = layer_norm_eps
self.qkv_bias = qkv_bias
self.mlp_ratio = mlp_ratio
self.use_abs_pos = use_abs_pos
self.use_rel_pos = use_rel_pos
self.rel_pos_zero_init = rel_pos_zero_init
self.window_size = window_size
self.global_attn_indexes = global_attn_indexes
self.num_pos_feats = num_pos_feats
self.mlp_dim = mlp_dim
self.batch_size = batch_size
# in ViT, the seq length equals the number of patches + 1 (we add 1 for the [CLS] token)
num_patches = (image_size // patch_size) ** 2
self.seq_length = num_patches + 1
self.prompt_encoder_tester = SamPromptEncoderTester()
self.mask_decoder_tester = SamMaskDecoderTester()
def prepare_config_and_inputs(self):
pixel_values = floats_tensor([self.batch_size, self.num_channels, self.image_size, self.image_size])
config = self.get_config()
return config, pixel_values
def get_config(self):
vision_config = SamVisionConfig(
image_size=self.image_size,
patch_size=self.patch_size,
num_channels=self.num_channels,
hidden_size=self.hidden_size,
projection_dim=self.projection_dim,
num_hidden_layers=self.num_hidden_layers,
num_attention_heads=self.num_attention_heads,
intermediate_size=self.intermediate_size,
dropout=self.dropout,
attention_dropout=self.attention_dropout,
initializer_range=self.initializer_range,
initializer_factor=self.initializer_factor,
output_channels=self.output_channels,
qkv_bias=self.qkv_bias,
mlp_ratio=self.mlp_ratio,
use_abs_pos=self.use_abs_pos,
use_rel_pos=self.use_rel_pos,
rel_pos_zero_init=self.rel_pos_zero_init,
window_size=self.window_size,
global_attn_indexes=self.global_attn_indexes,
num_pos_feats=self.num_pos_feats,
mlp_dim=self.mlp_dim,
)
prompt_encoder_config = self.prompt_encoder_tester.get_config()
mask_decoder_config = self.mask_decoder_tester.get_config()
return SamConfig(
vision_config=vision_config,
prompt_encoder_config=prompt_encoder_config,
mask_decoder_config=mask_decoder_config,
)
def create_and_check_model(self, config, pixel_values):
model = SamModel(config=config)
model.to(torch_device)
model.eval()
with torch.no_grad():
result = model(pixel_values)
self.parent.assertEqual(result.iou_scores.shape, (self.batch_size, 1, 3))
self.parent.assertEqual(result.pred_masks.shape[:3], (self.batch_size, 1, 3))
def create_and_check_get_image_features(self, config, pixel_values):
model = SamModel(config=config)
model.to(torch_device)
model.eval()
with torch.no_grad():
result = model.get_image_embeddings(pixel_values)
self.parent.assertEqual(result[0].shape, (self.output_channels, 12, 12))
def create_and_check_get_image_hidden_states(self, config, pixel_values):
model = SamModel(config=config)
model.to(torch_device)
model.eval()
with torch.no_grad():
result = model.vision_encoder(
pixel_values,
output_hidden_states=True,
return_dict=True,
)
# after computing the convolutional features
expected_hidden_states_shape = (self.batch_size, 12, 12, 36)
self.parent.assertEqual(len(result[1]), self.num_hidden_layers + 1)
self.parent.assertEqual(result[1][0].shape, expected_hidden_states_shape)
with torch.no_grad():
result = model.vision_encoder(
pixel_values,
output_hidden_states=True,
return_dict=False,
)
# after computing the convolutional features
expected_hidden_states_shape = (self.batch_size, 12, 12, 36)
self.parent.assertEqual(len(result[1]), self.num_hidden_layers + 1)
self.parent.assertEqual(result[1][0].shape, expected_hidden_states_shape)
def prepare_config_and_inputs_for_common(self):
config_and_inputs = self.prepare_config_and_inputs()
config, pixel_values = config_and_inputs
inputs_dict = {"pixel_values": pixel_values}
return config, inputs_dict
@require_torch
class SamModelTest(ModelTesterMixin, unittest.TestCase):
"""
Here we also overwrite some of the tests of test_modeling_common.py, as SAM's vision encoder does not use input_ids, inputs_embeds,
attention_mask and seq_length.
"""
all_model_classes = (SamModel,) if is_torch_available() else ()
fx_compatible = False
test_pruning = False
test_resize_embeddings = False
test_head_masking = False
test_torchscript = False
def setUp(self):
self.model_tester = SamModelTester(self)
self.vision_config_tester = ConfigTester(self, config_class=SamVisionConfig, has_text_modality=False)
self.prompt_encoder_config_tester = ConfigTester(
self,
config_class=SamPromptEncoderConfig,
has_text_modality=False,
num_attention_heads=12,
num_hidden_layers=2,
)
self.mask_decoder_config_tester = ConfigTester(
self, config_class=SamMaskDecoderConfig, has_text_modality=False
)
def test_config(self):
self.vision_config_tester.run_common_tests()
self.prompt_encoder_config_tester.run_common_tests()
self.mask_decoder_config_tester.run_common_tests()
@unittest.skip(reason="SAM's vision encoder does not use inputs_embeds")
def test_inputs_embeds(self):
pass
def test_model_common_attributes(self):
config, _ = self.model_tester.prepare_config_and_inputs_for_common()
for model_class in self.all_model_classes:
model = model_class(config)
self.assertIsInstance(model.get_input_embeddings(), (nn.Module))
x = model.get_output_embeddings()
self.assertTrue(x is None or isinstance(x, nn.Linear))
def test_forward_signature(self):
config, _ = self.model_tester.prepare_config_and_inputs_for_common()
for model_class in self.all_model_classes:
model = model_class(config)
signature = inspect.signature(model.forward)
# signature.parameters is an OrderedDict => so arg_names order is deterministic
arg_names = [*signature.parameters.keys()]
expected_arg_names = ["pixel_values"]
self.assertListEqual(arg_names[:1], expected_arg_names)
def test_model(self):
config_and_inputs = self.model_tester.prepare_config_and_inputs()
self.model_tester.create_and_check_model(*config_and_inputs)
def test_get_image_features(self):
config_and_inputs = self.model_tester.prepare_config_and_inputs()
self.model_tester.create_and_check_get_image_features(*config_and_inputs)
def test_image_hidden_states(self):
config_and_inputs = self.model_tester.prepare_config_and_inputs()
self.model_tester.create_and_check_get_image_hidden_states(*config_and_inputs)
def test_attention_outputs(self):
config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()
config.return_dict = True
expected_vision_attention_shape = (
self.model_tester.batch_size * self.model_tester.num_attention_heads,
196,
196,
)
expected_mask_decoder_attention_shape = (self.model_tester.batch_size, 1, 144, 32)
for model_class in self.all_model_classes:
inputs_dict["output_attentions"] = True
inputs_dict["output_hidden_states"] = False
config.return_dict = True
model = model_class(config)
model.to(torch_device)
model.eval()
with torch.no_grad():
outputs = model(**self._prepare_for_class(inputs_dict, model_class))
vision_attentions = outputs.vision_attentions
self.assertEqual(len(vision_attentions), self.model_tester.num_hidden_layers)
mask_decoder_attentions = outputs.mask_decoder_attentions
self.assertEqual(len(mask_decoder_attentions), self.model_tester.mask_decoder_tester.num_hidden_layers)
# check that output_attentions also work using config
del inputs_dict["output_attentions"]
config.output_attentions = True
model = model_class(config)
model.to(torch_device)
model.eval()
with torch.no_grad():
outputs = model(**self._prepare_for_class(inputs_dict, model_class))
vision_attentions = outputs.vision_attentions
self.assertEqual(len(vision_attentions), self.model_tester.num_hidden_layers)
mask_decoder_attentions = outputs.mask_decoder_attentions
self.assertEqual(len(mask_decoder_attentions), self.model_tester.mask_decoder_tester.num_hidden_layers)
self.assertListEqual(
list(vision_attentions[0].shape[-4:]),
list(expected_vision_attention_shape),
)
self.assertListEqual(
list(mask_decoder_attentions[0].shape[-4:]),
list(expected_mask_decoder_attention_shape),
)
@unittest.skip(reason="SamModel does not support training")
def test_training(self):
pass
@unittest.skip(reason="SamModel does not support training")
def test_training_gradient_checkpointing(self):
pass
@unittest.skip(reason="SamModel has no base class and is not available in MODEL_MAPPING")
def test_save_load_fast_init_from_base(self):
pass
@unittest.skip(reason="SamModel has no base class and is not available in MODEL_MAPPING")
def test_save_load_fast_init_to_base(self):
pass
@unittest.skip(reason="SamModel does not support training")
def test_retain_grad_hidden_states_attentions(self):
pass
@unittest.skip(reason="Hidden_states is tested in create_and_check_model tests")
def test_hidden_states_output(self):
pass
@slow
def test_model_from_pretrained(self):
for model_name in SAM_PRETRAINED_MODEL_ARCHIVE_LIST[:1]:
model = SamModel.from_pretrained(model_name)
self.assertIsNotNone(model)
def prepare_image():
img_url = "https://huggingface.co/ybelkada/segment-anything/resolve/main/assets/car.png"
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert("RGB")
return raw_image
def prepare_dog_img():
img_url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/dog-sam.png"
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert("RGB")
return raw_image
@slow
class SamModelIntegrationTest(unittest.TestCase):
def test_inference_mask_generation_no_point(self):
model = SamModel.from_pretrained("facebook/sam-vit-huge")
processor = SamProcessor.from_pretrained("facebook/sam-vit-huge")
model.to(torch_device)
model.eval()
raw_image = prepare_image()
inputs = processor(images=raw_image, return_tensors="pt").to(torch_device)
with torch.no_grad():
outputs = model(**inputs)
scores = outputs.iou_scores.squeeze()
self.assertTrue(torch.allclose(scores[-1], torch.tensor(0.5798), atol=1e-4))
def test_inference_mask_generation_one_point_one_bb(self):
model = SamModel.from_pretrained("facebook/sam-vit-h")
processor = SamProcessor.from_pretrained("facebook/sam-vit-h")
model.to(torch_device)
model.eval()
raw_image = prepare_image()
input_boxes = [[650, 900, 1000, 1250]]
input_points = [[[820, 1080]]]
inputs = processor(
images=raw_image, input_boxes=input_boxes, input_points=input_points, return_tensors="pt"
).to(torch_device)
with torch.no_grad():
outputs = model(**inputs)
scores = outputs.iou_scores.squeeze()
self.assertTrue(torch.allclose(scores[-1], torch.tensor(0.9935), atol=1e-4))
def test_inference_mask_generation_batched_points_batched_images(self):
model = SamModel.from_pretrained("facebook/sam-vit-huge")
processor = SamProcessor.from_pretrained("facebook/sam-vit-huge")
model.to(torch_device)
model.eval()
raw_image = prepare_image()
input_points = [
[[[820, 1080]], [[820, 1080]], [[820, 1080]], [[820, 1080]]],
[[[510, 1080]], [[820, 1080]], [[820, 1080]], [[820, 1080]]],
]
inputs = processor(images=[raw_image, raw_image], input_points=input_points, return_tensors="pt").to(
torch_device
)
with torch.no_grad():
outputs = model(**inputs)
scores = outputs.iou_scores.squeeze().cpu()
EXPECTED_SCORES = torch.tensor(
[
[
[0.9673, 0.9441, 0.9084],
[0.9673, 0.9441, 0.9084],
[0.9673, 0.9441, 0.9084],
[0.9673, 0.9441, 0.9084],
],
[
[0.8405, 0.6292, 0.3840],
[0.9673, 0.9441, 0.9084],
[0.9673, 0.9441, 0.9084],
[0.9673, 0.9441, 0.9084],
],
]
)
self.assertTrue(torch.allclose(scores, EXPECTED_SCORES, atol=1e-3))
def test_inference_mask_generation_one_point_one_bb_zero(self):
model = SamModel.from_pretrained("facebook/sam-vit-huge")
processor = SamProcessor.from_pretrained("facebook/sam-vit-huge")
model.to(torch_device)
model.eval()
raw_image = prepare_image()
input_boxes = [[620, 900, 1000, 1255]]
input_points = [[[820, 1080]]]
labels = [[0]]
inputs = processor(
images=raw_image,
input_boxes=input_boxes,
input_points=input_points,
input_labels=labels,
return_tensors="pt",
).to(torch_device)
with torch.no_grad():
outputs = model(**inputs)
scores = outputs.iou_scores.squeeze()
self.assertTrue(torch.allclose(scores[-1], torch.tensor(0.9689), atol=1e-4))
def test_inference_mask_generation_one_point(self):
model = SamModel.from_pretrained("facebook/sam-vit-huge")
processor = SamProcessor.from_pretrained("facebook/sam-vit-huge")
model.to(torch_device)
model.eval()
raw_image = prepare_image()
input_points = [[[400, 650]]]
input_labels = [[1]]
inputs = processor(
images=raw_image, input_points=input_points, input_labels=input_labels, return_tensors="pt"
).to(torch_device)
with torch.no_grad():
outputs = model(**inputs)
scores = outputs.iou_scores.squeeze()
self.assertTrue(torch.allclose(scores[-1], torch.tensor(0.9712), atol=1e-4))
# With no label
input_points = [[[400, 650]]]
inputs = processor(images=raw_image, input_points=input_points, return_tensors="pt").to(torch_device)
with torch.no_grad():
outputs = model(**inputs)
scores = outputs.iou_scores.squeeze()
self.assertTrue(torch.allclose(scores[-1], torch.tensor(0.9712), atol=1e-4))
def test_inference_mask_generation_two_points(self):
model = SamModel.from_pretrained("facebook/sam-vit-huge")
processor = SamProcessor.from_pretrained("facebook/sam-vit-huge")
model.to(torch_device)
model.eval()
raw_image = prepare_image()
input_points = [[[400, 650], [800, 650]]]
input_labels = [[1, 1]]
inputs = processor(
images=raw_image, input_points=input_points, input_labels=input_labels, return_tensors="pt"
).to(torch_device)
with torch.no_grad():
outputs = model(**inputs)
scores = outputs.iou_scores.squeeze()
self.assertTrue(torch.allclose(scores[-1], torch.tensor(0.9936), atol=1e-4))
# no labels
inputs = processor(images=raw_image, input_points=input_points, return_tensors="pt").to(torch_device)
with torch.no_grad():
outputs = model(**inputs)
scores = outputs.iou_scores.squeeze()
self.assertTrue(torch.allclose(scores[-1], torch.tensor(0.9936), atol=1e-4))
def test_inference_mask_generation_two_points_batched(self):
model = SamModel.from_pretrained("facebook/sam-vit-huge")
processor = SamProcessor.from_pretrained("facebook/sam-vit-huge")
model.to(torch_device)
model.eval()
raw_image = prepare_image()
input_points = [[[400, 650], [800, 650]], [[400, 650]]]
input_labels = [[1, 1], [1]]
inputs = processor(
images=[raw_image, raw_image], input_points=input_points, input_labels=input_labels, return_tensors="pt"
).to(torch_device)
with torch.no_grad():
outputs = model(**inputs)
scores = outputs.iou_scores.squeeze()
self.assertTrue(torch.allclose(scores[0][-1], torch.tensor(0.9936), atol=1e-4))
self.assertTrue(torch.allclose(scores[1][-1], torch.tensor(0.9716), atol=1e-4))
def test_inference_mask_generation_one_box(self):
model = SamModel.from_pretrained("facebook/sam-vit-huge")
processor = SamProcessor.from_pretrained("facebook/sam-vit-huge")
model.to(torch_device)
model.eval()
raw_image = prepare_image()
input_boxes = [[[75, 275, 1725, 850]]]
inputs = processor(images=raw_image, input_boxes=input_boxes, return_tensors="pt").to(torch_device)
with torch.no_grad():
outputs = model(**inputs)
scores = outputs.iou_scores.squeeze()
self.assertTrue(torch.allclose(scores[-1], torch.tensor(0.8686), atol=1e-4))
def test_inference_mask_generation_batched_image_one_point(self):
model = SamModel.from_pretrained("facebook/sam-vit-huge")
processor = SamProcessor.from_pretrained("facebook/sam-vit-huge")
model.to(torch_device)
model.eval()
raw_image = prepare_image()
raw_dog_image = prepare_dog_img()
input_points = [[[820, 1080]], [[220, 470]]]
inputs = processor(images=[raw_image, raw_dog_image], input_points=input_points, return_tensors="pt").to(
torch_device
)
with torch.no_grad():
outputs = model(**inputs)
scores_batched = outputs.iou_scores.squeeze()
input_points = [[[220, 470]]]
inputs = processor(images=raw_dog_image, input_points=input_points, return_tensors="pt").to(torch_device)
with torch.no_grad():
outputs = model(**inputs)
scores_single = outputs.iou_scores.squeeze()
self.assertTrue(torch.allclose(scores_batched[1, :], scores_single, atol=1e-4))
def test_inference_mask_generation_two_points_point_batch(self):
model = SamModel.from_pretrained("facebook/sam-vit-huge")
processor = SamProcessor.from_pretrained("facebook/sam-vit-huge")
model.to(torch_device)
model.eval()
raw_image = prepare_image()
# fmt: off
input_points = torch.Tensor([[[400, 650]], [[220, 470]]]).cpu()
# fmt: on
input_points = input_points.unsqueeze(0)
inputs = processor(raw_image, input_points=input_points, return_tensors="pt").to(torch_device)
with torch.no_grad():
outputs = model(**inputs)
iou_scores = outputs.iou_scores.cpu()
self.assertTrue(iou_scores.shape == (1, 2, 3))
torch.testing.assert_allclose(
iou_scores, torch.tensor([[[0.9848, 0.9788, 0.9713], [0.9211, 0.9128, 0.7427]]]), atol=1e-4, rtol=1e-4
)
def test_inference_mask_generation_three_boxes_point_batch(self):
model = SamModel.from_pretrained("facebook/sam-vit-huge")
processor = SamProcessor.from_pretrained("facebook/sam-vit-huge")
model.to(torch_device)
model.eval()
raw_image = prepare_image()
# fmt: off
input_boxes = torch.Tensor([[[620, 900, 1000, 1255]], [[75, 275, 1725, 850]], [[75, 275, 1725, 850]]]).cpu()
EXPECTED_IOU = torch.tensor([[[1.0071, 1.0032, 0.9946], [0.4962, 0.8770, 0.8686], [0.4962, 0.8770, 0.8686]]])
# fmt: on
input_boxes = input_boxes.unsqueeze(0)
inputs = processor(raw_image, input_boxes=input_boxes, return_tensors="pt").to(torch_device)
with torch.no_grad():
outputs = model(**inputs)
iou_scores = outputs.iou_scores.cpu()
self.assertTrue(iou_scores.shape == (1, 3, 3))
torch.testing.assert_allclose(iou_scores, EXPECTED_IOU, atol=1e-4, rtol=1e-4)

View File

@@ -0,0 +1,81 @@
# Copyright 2023 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import shutil
import tempfile
import unittest
import numpy as np
from transformers.testing_utils import require_torchvision, require_vision
from transformers.utils import is_vision_available
if is_vision_available():
from PIL import Image
from transformers import AutoProcessor, SamImageProcessor, SamProcessor
@require_vision
@require_torchvision
class SamProcessorTest(unittest.TestCase):
def setUp(self):
self.tmpdirname = tempfile.mkdtemp()
image_processor = SamImageProcessor()
processor = SamProcessor(image_processor)
processor.save_pretrained(self.tmpdirname)
def get_image_processor(self, **kwargs):
return AutoProcessor.from_pretrained(self.tmpdirname, **kwargs).image_processor
def tearDown(self):
shutil.rmtree(self.tmpdirname)
def prepare_image_inputs(self):
"""This function prepares a list of PIL images, or a list of numpy arrays if one specifies numpify=True,
or a list of PyTorch tensors if one specifies torchify=True.
"""
image_inputs = [np.random.randint(255, size=(3, 30, 400), dtype=np.uint8)]
image_inputs = [Image.fromarray(np.moveaxis(x, 0, -1)) for x in image_inputs]
return image_inputs
def test_save_load_pretrained_additional_features(self):
processor = SamProcessor(image_processor=self.get_image_processor())
processor.save_pretrained(self.tmpdirname)
image_processor_add_kwargs = self.get_image_processor(do_normalize=False, padding_value=1.0)
processor = SamProcessor.from_pretrained(self.tmpdirname, do_normalize=False, padding_value=1.0)
self.assertEqual(processor.image_processor.to_json_string(), image_processor_add_kwargs.to_json_string())
self.assertIsInstance(processor.image_processor, SamImageProcessor)
def test_image_processor(self):
image_processor = self.get_image_processor()
processor = SamProcessor(image_processor=image_processor)
image_input = self.prepare_image_inputs()
input_feat_extract = image_processor(image_input, return_tensors="np")
input_processor = processor(images=image_input, return_tensors="np")
input_feat_extract.pop("original_sizes") # pop original_sizes as it is popped in the processor
input_feat_extract.pop("reshaped_input_sizes") # pop original_sizes as it is popped in the processor
for key in input_feat_extract.keys():
self.assertAlmostEqual(input_feat_extract[key].sum(), input_processor[key].sum(), delta=1e-2)

View File

@@ -502,6 +502,7 @@ SPECIAL_MODEL_NAMES = {
"OpenAI GPT-2": "GPT-2",
"OpenAI GPT": "GPT",
"Perceiver": "Perceiver IO",
"SAM": "Segment Anything",
"ViT": "Vision Transformer (ViT)",
}

View File

@@ -231,6 +231,7 @@ IGNORE_NON_AUTO_CONFIGURED = PRIVATE_MODELS.copy() + [
"PegasusXEncoder",
"PegasusXDecoder",
"PegasusXDecoderWrapper",
"SamModel",
"DPTForDepthEstimation",
"DecisionTransformerGPT2Model",
"GLPNForDepthEstimation",