Add table transformer [v2] (#19614)
* First draft
* Add conversion script
* Make conversion work
* Upload checkpoints
* Add final fixes
* Revert changes of conditional and deformable detr
* Fix toctree, add and remove copied from
* Use model type
* Improve docs
* Improve code example
* Update copies
* Add copied from
* Don't update conditional detr
* Don't update deformable detr
@@ -375,6 +375,7 @@ Current number of checkpoints:
 1. **[Swin Transformer V2](https://huggingface.co/docs/transformers/model_doc/swinv2)** (from Microsoft) released with the paper [Swin Transformer V2: Scaling Up Capacity and Resolution](https://arxiv.org/abs/2111.09883) by Ze Liu, Han Hu, Yutong Lin, Zhuliang Yao, Zhenda Xie, Yixuan Wei, Jia Ning, Yue Cao, Zheng Zhang, Li Dong, Furu Wei, Baining Guo.
 1. **[T5](https://huggingface.co/docs/transformers/model_doc/t5)** (from Google AI) released with the paper [Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer](https://arxiv.org/abs/1910.10683) by Colin Raffel and Noam Shazeer and Adam Roberts and Katherine Lee and Sharan Narang and Michael Matena and Yanqi Zhou and Wei Li and Peter J. Liu.
 1. **[T5v1.1](https://huggingface.co/docs/transformers/model_doc/t5v1.1)** (from Google AI) released in the repository [google-research/text-to-text-transfer-transformer](https://github.com/google-research/text-to-text-transfer-transformer/blob/main/released_checkpoints.md#t511) by Colin Raffel and Noam Shazeer and Adam Roberts and Katherine Lee and Sharan Narang and Michael Matena and Yanqi Zhou and Wei Li and Peter J. Liu.
+1. **[Table Transformer](https://huggingface.co/docs/transformers/main/model_doc/table-transformer)** (from Microsoft Research) released with the paper [PubTables-1M: Towards Comprehensive Table Extraction From Unstructured Documents](https://arxiv.org/abs/2110.00061) by Brandon Smock, Rohith Pesala, Robin Abraham.
 1. **[TAPAS](https://huggingface.co/docs/transformers/model_doc/tapas)** (from Google AI) released with the paper [TAPAS: Weakly Supervised Table Parsing via Pre-training](https://arxiv.org/abs/2004.02349) by Jonathan Herzig, Paweł Krzysztof Nowak, Thomas Müller, Francesco Piccinno and Julian Martin Eisenschlos.
 1. **[TAPEX](https://huggingface.co/docs/transformers/model_doc/tapex)** (from Microsoft Research) released with the paper [TAPEX: Table Pre-training via Learning a Neural SQL Executor](https://arxiv.org/abs/2107.07653) by Qian Liu, Bei Chen, Jiaqi Guo, Morteza Ziyadi, Zeqi Lin, Weizhu Chen, Jian-Guang Lou.
 1. **[Time Series Transformer](https://huggingface.co/docs/transformers/model_doc/time_series_transformer)** (from HuggingFace).
@@ -375,6 +375,7 @@ Número actual de puntos de control:
 1. **[Swin Transformer V2](https://huggingface.co/docs/transformers/model_doc/swinv2)** (from Microsoft) released with the paper [Swin Transformer V2: Scaling Up Capacity and Resolution](https://arxiv.org/abs/2111.09883) by Ze Liu, Han Hu, Yutong Lin, Zhuliang Yao, Zhenda Xie, Yixuan Wei, Jia Ning, Yue Cao, Zheng Zhang, Li Dong, Furu Wei, Baining Guo.
 1. **[T5](https://huggingface.co/docs/transformers/model_doc/t5)** (from Google AI) released with the paper [Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer](https://arxiv.org/abs/1910.10683) by Colin Raffel and Noam Shazeer and Adam Roberts and Katherine Lee and Sharan Narang and Michael Matena and Yanqi Zhou and Wei Li and Peter J. Liu.
 1. **[T5v1.1](https://huggingface.co/docs/transformers/model_doc/t5v1.1)** (from Google AI) released in the repository [google-research/text-to-text-transfer-transformer](https://github.com/google-research/text-to-text-transfer-transformer/blob/main/released_checkpoints.md#t511) by Colin Raffel and Noam Shazeer and Adam Roberts and Katherine Lee and Sharan Narang and Michael Matena and Yanqi Zhou and Wei Li and Peter J. Liu.
+1. **[Table Transformer](https://huggingface.co/docs/transformers/main/model_doc/table-transformer)** (from Microsoft Research) released with the paper [PubTables-1M: Towards Comprehensive Table Extraction From Unstructured Documents](https://arxiv.org/abs/2110.00061) by Brandon Smock, Rohith Pesala, Robin Abraham.
 1. **[TAPAS](https://huggingface.co/docs/transformers/model_doc/tapas)** (from Google AI) released with the paper [TAPAS: Weakly Supervised Table Parsing via Pre-training](https://arxiv.org/abs/2004.02349) by Jonathan Herzig, Paweł Krzysztof Nowak, Thomas Müller, Francesco Piccinno and Julian Martin Eisenschlos.
 1. **[TAPEX](https://huggingface.co/docs/transformers/model_doc/tapex)** (from Microsoft Research) released with the paper [TAPEX: Table Pre-training via Learning a Neural SQL Executor](https://arxiv.org/abs/2107.07653) by Qian Liu, Bei Chen, Jiaqi Guo, Morteza Ziyadi, Zeqi Lin, Weizhu Chen, Jian-Guang Lou.
 1. **[Time Series Transformer](https://huggingface.co/docs/transformers/model_doc/time_series_transformer)** (from HuggingFace).
@@ -325,6 +325,7 @@ Flax, PyTorch, TensorFlow 설치 페이지에서 이들을 conda로 설치하는
 1. **[Swin Transformer V2](https://huggingface.co/docs/transformers/model_doc/swinv2)** (from Microsoft) released with the paper [Swin Transformer V2: Scaling Up Capacity and Resolution](https://arxiv.org/abs/2111.09883) by Ze Liu, Han Hu, Yutong Lin, Zhuliang Yao, Zhenda Xie, Yixuan Wei, Jia Ning, Yue Cao, Zheng Zhang, Li Dong, Furu Wei, Baining Guo.
 1. **[T5](https://huggingface.co/docs/transformers/model_doc/t5)** (from Google AI) released with the paper [Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer](https://arxiv.org/abs/1910.10683) by Colin Raffel and Noam Shazeer and Adam Roberts and Katherine Lee and Sharan Narang and Michael Matena and Yanqi Zhou and Wei Li and Peter J. Liu.
 1. **[T5v1.1](https://huggingface.co/docs/transformers/model_doc/t5v1.1)** (from Google AI) released in the repository [google-research/text-to-text-transfer-transformer](https://github.com/google-research/text-to-text-transfer-transformer/blob/main/released_checkpoints.md#t511) by Colin Raffel and Noam Shazeer and Adam Roberts and Katherine Lee and Sharan Narang and Michael Matena and Yanqi Zhou and Wei Li and Peter J. Liu.
+1. **[Table Transformer](https://huggingface.co/docs/transformers/main/model_doc/table-transformer)** (from Microsoft Research) released with the paper [PubTables-1M: Towards Comprehensive Table Extraction From Unstructured Documents](https://arxiv.org/abs/2110.00061) by Brandon Smock, Rohith Pesala, Robin Abraham.
 1. **[TAPAS](https://huggingface.co/docs/transformers/model_doc/tapas)** (from Google AI) released with the paper [TAPAS: Weakly Supervised Table Parsing via Pre-training](https://arxiv.org/abs/2004.02349) by Jonathan Herzig, Paweł Krzysztof Nowak, Thomas Müller, Francesco Piccinno and Julian Martin Eisenschlos.
 1. **[TAPEX](https://huggingface.co/docs/transformers/model_doc/tapex)** (from Microsoft Research) released with the paper [TAPEX: Table Pre-training via Learning a Neural SQL Executor](https://arxiv.org/abs/2107.07653) by Qian Liu, Bei Chen, Jiaqi Guo, Morteza Ziyadi, Zeqi Lin, Weizhu Chen, Jian-Guang Lou.
 1. **[Time Series Transformer](https://huggingface.co/docs/transformers/model_doc/time_series_transformer)** (from HuggingFace).
@@ -349,6 +349,7 @@ conda install -c huggingface transformers
 1. **[Swin Transformer V2](https://huggingface.co/docs/transformers/model_doc/swinv2)** (来自 Microsoft) 伴随论文 [Swin Transformer V2: Scaling Up Capacity and Resolution](https://arxiv.org/abs/2111.09883) 由 Ze Liu, Han Hu, Yutong Lin, Zhuliang Yao, Zhenda Xie, Yixuan Wei, Jia Ning, Yue Cao, Zheng Zhang, Li Dong, Furu Wei, Baining Guo 发布。
 1. **[T5](https://huggingface.co/docs/transformers/model_doc/t5)** (来自 Google AI) 伴随论文 [Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer](https://arxiv.org/abs/1910.10683) 由 Colin Raffel and Noam Shazeer and Adam Roberts and Katherine Lee and Sharan Narang and Michael Matena and Yanqi Zhou and Wei Li and Peter J. Liu 发布。
 1. **[T5v1.1](https://huggingface.co/docs/transformers/model_doc/t5v1.1)** (来自 Google AI) 伴随论文 [google-research/text-to-text-transfer-transformer](https://github.com/google-research/text-to-text-transfer-transformer/blob/main/released_checkpoints.md#t511) 由 Colin Raffel and Noam Shazeer and Adam Roberts and Katherine Lee and Sharan Narang and Michael Matena and Yanqi Zhou and Wei Li and Peter J. Liu 发布。
+1. **[Table Transformer](https://huggingface.co/docs/transformers/main/model_doc/table-transformer)** (来自 Microsoft Research) 伴随论文 [PubTables-1M: Towards Comprehensive Table Extraction From Unstructured Documents](https://arxiv.org/abs/2110.00061) 由 Brandon Smock, Rohith Pesala, Robin Abraham 发布。
 1. **[TAPAS](https://huggingface.co/docs/transformers/model_doc/tapas)** (来自 Google AI) 伴随论文 [TAPAS: Weakly Supervised Table Parsing via Pre-training](https://arxiv.org/abs/2004.02349) 由 Jonathan Herzig, Paweł Krzysztof Nowak, Thomas Müller, Francesco Piccinno and Julian Martin Eisenschlos 发布。
 1. **[TAPEX](https://huggingface.co/docs/transformers/model_doc/tapex)** (来自 Microsoft Research) 伴随论文 [TAPEX: Table Pre-training via Learning a Neural SQL Executor](https://arxiv.org/abs/2107.07653) 由 Qian Liu, Bei Chen, Jiaqi Guo, Morteza Ziyadi, Zeqi Lin, Weizhu Chen, Jian-Guang Lou 发布。
 1. **[Time Series Transformer](https://huggingface.co/docs/transformers/model_doc/time_series_transformer)** (from HuggingFace).
@@ -361,6 +361,7 @@ conda install -c huggingface transformers
 1. **[Swin Transformer V2](https://huggingface.co/docs/transformers/model_doc/swinv2)** (from Microsoft) released with the paper [Swin Transformer V2: Scaling Up Capacity and Resolution](https://arxiv.org/abs/2111.09883) by Ze Liu, Han Hu, Yutong Lin, Zhuliang Yao, Zhenda Xie, Yixuan Wei, Jia Ning, Yue Cao, Zheng Zhang, Li Dong, Furu Wei, Baining Guo.
 1. **[T5](https://huggingface.co/docs/transformers/model_doc/t5)** (from Google AI) released with the paper [Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer](https://arxiv.org/abs/1910.10683) by Colin Raffel and Noam Shazeer and Adam Roberts and Katherine Lee and Sharan Narang and Michael Matena and Yanqi Zhou and Wei Li and Peter J. Liu.
 1. **[T5v1.1](https://huggingface.co/docs/transformers/model_doc/t5v1.1)** (from Google AI) released with the paper [google-research/text-to-text-transfer-transformer](https://github.com/google-research/text-to-text-transfer-transformer/blob/main/released_checkpoints.md#t511) by Colin Raffel and Noam Shazeer and Adam Roberts and Katherine Lee and Sharan Narang and Michael Matena and Yanqi Zhou and Wei Li and Peter J. Liu.
+1. **[Table Transformer](https://huggingface.co/docs/transformers/main/model_doc/table-transformer)** (from Microsoft Research) released with the paper [PubTables-1M: Towards Comprehensive Table Extraction From Unstructured Documents](https://arxiv.org/abs/2110.00061) by Brandon Smock, Rohith Pesala, Robin Abraham.
 1. **[TAPAS](https://huggingface.co/docs/transformers/model_doc/tapas)** (from Google AI) released with the paper [TAPAS: Weakly Supervised Table Parsing via Pre-training](https://arxiv.org/abs/2004.02349) by Jonathan Herzig, Paweł Krzysztof Nowak, Thomas Müller, Francesco Piccinno and Julian Martin Eisenschlos.
 1. **[TAPEX](https://huggingface.co/docs/transformers/model_doc/tapex)** (from Microsoft Research) released with the paper [TAPEX: Table Pre-training via Learning a Neural SQL Executor](https://arxiv.org/abs/2107.07653) by Qian Liu, Bei Chen, Jiaqi Guo, Morteza Ziyadi, Zeqi Lin, Weizhu Chen, Jian-Guang Lou.
 1. **[Time Series Transformer](https://huggingface.co/docs/transformers/model_doc/time_series_transformer)** (from HuggingFace).
@@ -412,6 +412,8 @@
   title: Swin Transformer
 - local: model_doc/swinv2
   title: Swin Transformer V2
+- local: model_doc/table-transformer
+  title: Table Transformer
 - local: model_doc/van
   title: VAN
 - local: model_doc/videomae
@@ -164,6 +164,7 @@ The documentation is organized into five sections:
 1. **[Swin Transformer V2](model_doc/swinv2)** (from Microsoft) released with the paper [Swin Transformer V2: Scaling Up Capacity and Resolution](https://arxiv.org/abs/2111.09883) by Ze Liu, Han Hu, Yutong Lin, Zhuliang Yao, Zhenda Xie, Yixuan Wei, Jia Ning, Yue Cao, Zheng Zhang, Li Dong, Furu Wei, Baining Guo.
 1. **[T5](model_doc/t5)** (from Google AI) released with the paper [Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer](https://arxiv.org/abs/1910.10683) by Colin Raffel and Noam Shazeer and Adam Roberts and Katherine Lee and Sharan Narang and Michael Matena and Yanqi Zhou and Wei Li and Peter J. Liu.
 1. **[T5v1.1](model_doc/t5v1.1)** (from Google AI) released in the repository [google-research/text-to-text-transfer-transformer](https://github.com/google-research/text-to-text-transfer-transformer/blob/main/released_checkpoints.md#t511) by Colin Raffel and Noam Shazeer and Adam Roberts and Katherine Lee and Sharan Narang and Michael Matena and Yanqi Zhou and Wei Li and Peter J. Liu.
+1. **[Table Transformer](model_doc/table-transformer)** (from Microsoft Research) released with the paper [PubTables-1M: Towards Comprehensive Table Extraction From Unstructured Documents](https://arxiv.org/abs/2110.00061) by Brandon Smock, Rohith Pesala, Robin Abraham.
 1. **[TAPAS](model_doc/tapas)** (from Google AI) released with the paper [TAPAS: Weakly Supervised Table Parsing via Pre-training](https://arxiv.org/abs/2004.02349) by Jonathan Herzig, Paweł Krzysztof Nowak, Thomas Müller, Francesco Piccinno and Julian Martin Eisenschlos.
 1. **[TAPEX](model_doc/tapex)** (from Microsoft Research) released with the paper [TAPEX: Table Pre-training via Learning a Neural SQL Executor](https://arxiv.org/abs/2107.07653) by Qian Liu, Bei Chen, Jiaqi Guo, Morteza Ziyadi, Zeqi Lin, Weizhu Chen, Jian-Guang Lou.
 1. **[Time Series Transformer](model_doc/time_series_transformer)** (from HuggingFace).
@@ -313,6 +314,7 @@ Flax), PyTorch, and/or TensorFlow.
 | Swin Transformer | ❌ | ❌ | ✅ | ✅ | ❌ |
 | Swin Transformer V2 | ❌ | ❌ | ✅ | ❌ | ❌ |
 | T5 | ✅ | ✅ | ✅ | ✅ | ✅ |
+| Table Transformer | ❌ | ❌ | ✅ | ❌ | ❌ |
 | TAPAS | ✅ | ❌ | ✅ | ✅ | ❌ |
 | Time Series Transformer | ❌ | ❌ | ✅ | ❌ | ❌ |
 | Trajectory Transformer | ❌ | ❌ | ✅ | ❌ | ❌ |
54
docs/source/en/model_doc/table-transformer.mdx
Normal file
@@ -0,0 +1,54 @@
+<!--Copyright 2022 The HuggingFace Team. All rights reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
+an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
+specific language governing permissions and limitations under the License.
+-->
+
+# Table Transformer
+
+## Overview
+
+The Table Transformer model was proposed in [PubTables-1M: Towards comprehensive table extraction from unstructured documents](https://arxiv.org/abs/2110.00061) by
+Brandon Smock, Rohith Pesala, Robin Abraham. The authors introduce a new dataset, PubTables-1M, to benchmark progress in table extraction from unstructured documents,
+as well as table structure recognition and functional analysis. The authors train 2 [DETR](detr) models, one for table detection and one for table structure recognition, dubbed Table Transformers.
+
+The abstract from the paper is the following:
+
+*Recently, significant progress has been made applying machine learning to the problem of table structure inference and extraction from unstructured documents.
+However, one of the greatest challenges remains the creation of datasets with complete, unambiguous ground truth at scale. To address this, we develop a new, more
+comprehensive dataset for table extraction, called PubTables-1M. PubTables-1M contains nearly one million tables from scientific articles, supports multiple input
+modalities, and contains detailed header and location information for table structures, making it useful for a wide variety of modeling approaches. It also addresses a significant
+source of ground truth inconsistency observed in prior datasets called oversegmentation, using a novel canonicalization procedure. We demonstrate that these improvements lead to a
+significant increase in training performance and a more reliable estimate of model performance at evaluation for table structure recognition. Further, we show that transformer-based
+object detection models trained on PubTables-1M produce excellent results for all three tasks of detection, structure recognition, and functional analysis without the need for any
+special customization for these tasks.*
+
+Tips:
+
+- The authors released 2 models, one for [table detection](https://huggingface.co/microsoft/table-transformer-detection) in documents, one for [table structure recognition](https://huggingface.co/microsoft/table-transformer-structure-recognition) (the task of recognizing the individual rows, columns etc. in a table).
+- One can use the [`AutoFeatureExtractor`] API to prepare images and optional targets for the model. This will load a [`DetrFeatureExtractor`] behind the scenes.
+- A demo notebook for the Table Transformer can be found [here](https://github.com/NielsRogge/Transformers-Tutorials/tree/master/Table%20Transformer).
+
+This model was contributed by [nielsr](https://huggingface.co/nielsr). The original code can be
+found [here](https://github.com/microsoft/table-transformer).
+
+
+## TableTransformerConfig
+
+[[autodoc]] TableTransformerConfig
+
+## TableTransformerModel
+
+[[autodoc]] TableTransformerModel
+    - forward
+
+## TableTransformerForObjectDetection
+
+[[autodoc]] TableTransformerForObjectDetection
+    - forward
@@ -96,6 +96,7 @@ Ready-made configurations include the following architectures:
 - SqueezeBERT
 - Swin Transformer
 - T5
+- Table Transformer
 - Vision Encoder decoder
 - ViT
 - XLM
@@ -335,6 +335,7 @@ _import_structure = {
     "models.swin": ["SWIN_PRETRAINED_CONFIG_ARCHIVE_MAP", "SwinConfig"],
     "models.swinv2": ["SWINV2_PRETRAINED_CONFIG_ARCHIVE_MAP", "Swinv2Config"],
     "models.t5": ["T5_PRETRAINED_CONFIG_ARCHIVE_MAP", "T5Config"],
+    "models.table_transformer": ["TABLE_TRANSFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP", "TableTransformerConfig"],
     "models.tapas": ["TAPAS_PRETRAINED_CONFIG_ARCHIVE_MAP", "TapasConfig", "TapasTokenizer"],
     "models.tapex": ["TapexTokenizer"],
     "models.time_series_transformer": [
@@ -738,6 +739,14 @@ else:
             "DetrPreTrainedModel",
         ]
     )
+    _import_structure["models.table_transformer"].extend(
+        [
+            "TABLE_TRANSFORMER_PRETRAINED_MODEL_ARCHIVE_LIST",
+            "TableTransformerForObjectDetection",
+            "TableTransformerModel",
+            "TableTransformerPreTrainedModel",
+        ]
+    )
     _import_structure["models.conditional_detr"].extend(
         [
             "CONDITIONAL_DETR_PRETRAINED_MODEL_ARCHIVE_LIST",
@@ -3365,6 +3374,7 @@ if TYPE_CHECKING:
    from .models.swin import SWIN_PRETRAINED_CONFIG_ARCHIVE_MAP, SwinConfig
    from .models.swinv2 import SWINV2_PRETRAINED_CONFIG_ARCHIVE_MAP, Swinv2Config
    from .models.t5 import T5_PRETRAINED_CONFIG_ARCHIVE_MAP, T5Config
    from .models.table_transformer import TABLE_TRANSFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP, TableTransformerConfig
    from .models.tapas import TAPAS_PRETRAINED_CONFIG_ARCHIVE_MAP, TapasConfig, TapasTokenizer
    from .models.tapex import TapexTokenizer
    from .models.time_series_transformer import (
@@ -3717,6 +3727,12 @@ if TYPE_CHECKING:
            DetrModel,
            DetrPreTrainedModel,
        )
        from .models.table_transformer import (
            TABLE_TRANSFORMER_PRETRAINED_MODEL_ARCHIVE_LIST,
            TableTransformerForObjectDetection,
            TableTransformerModel,
            TableTransformerPreTrainedModel,
        )

    try:
        if not is_scatter_available():
@@ -138,6 +138,7 @@ from . import (
    swin,
    swinv2,
    t5,
    table_transformer,
    tapas,
    tapex,
    time_series_transformer,
@@ -134,6 +134,7 @@ CONFIG_MAPPING_NAMES = OrderedDict(
        ("swin", "SwinConfig"),
        ("swinv2", "Swinv2Config"),
        ("t5", "T5Config"),
        ("table-transformer", "TableTransformerConfig"),
        ("tapas", "TapasConfig"),
        ("time_series_transformer", "TimeSeriesTransformerConfig"),
        ("trajectory_transformer", "TrajectoryTransformerConfig"),
@@ -265,6 +266,7 @@ CONFIG_ARCHIVE_MAP_MAPPING_NAMES = OrderedDict(
        ("swin", "SWIN_PRETRAINED_CONFIG_ARCHIVE_MAP"),
        ("swinv2", "SWINV2_PRETRAINED_CONFIG_ARCHIVE_MAP"),
        ("t5", "T5_PRETRAINED_CONFIG_ARCHIVE_MAP"),
        ("table-transformer", "TABLE_TRANSFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP"),
        ("tapas", "TAPAS_PRETRAINED_CONFIG_ARCHIVE_MAP"),
        ("time_series_transformer", "TIME_SERIES_TRANSFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP"),
        ("transfo-xl", "TRANSFO_XL_PRETRAINED_CONFIG_ARCHIVE_MAP"),
@@ -417,6 +419,7 @@ MODEL_NAMES_MAPPING = OrderedDict(
        ("swinv2", "Swin Transformer V2"),
        ("t5", "T5"),
        ("t5v1.1", "T5v1.1"),
        ("table-transformer", "Table Transformer"),
        ("tapas", "TAPAS"),
        ("tapex", "TAPEX"),
        ("time_series_transformer", "Time Series Transformer"),
@@ -69,6 +69,7 @@ FEATURE_EXTRACTOR_MAPPING_NAMES = OrderedDict(
        ("speech_to_text", "Speech2TextFeatureExtractor"),
        ("swin", "ViTFeatureExtractor"),
        ("swinv2", "ViTFeatureExtractor"),
        ("table-transformer", "DetrFeatureExtractor"),
        ("van", "ConvNextFeatureExtractor"),
        ("videomae", "VideoMAEFeatureExtractor"),
        ("vilt", "ViltFeatureExtractor"),
@@ -130,6 +130,7 @@ MODEL_MAPPING_NAMES = OrderedDict(
        ("swin", "SwinModel"),
        ("swinv2", "Swinv2Model"),
        ("t5", "T5Model"),
        ("table-transformer", "TableTransformerModel"),
        ("tapas", "TapasModel"),
        ("time_series_transformer", "TimeSeriesTransformerModel"),
        ("trajectory_transformer", "TrajectoryTransformerModel"),
@@ -468,6 +469,7 @@ MODEL_FOR_OBJECT_DETECTION_MAPPING_NAMES = OrderedDict(
        ("conditional_detr", "ConditionalDetrForObjectDetection"),
        ("deformable_detr", "DeformableDetrForObjectDetection"),
        ("detr", "DetrForObjectDetection"),
        ("table-transformer", "TableTransformerForObjectDetection"),
        ("yolos", "YolosForObjectDetection"),
    ]
)
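The hunks above register the `"table-transformer"` model type in several `OrderedDict` name mappings, while the package directory is named `table_transformer`. As a rough sketch of how such a mapping can be consulted, the snippet below resolves a model type to a class name and derives a module name by swapping hyphens for underscores. `resolve_class_name` and `module_name_for` are hypothetical helpers written for illustration; they are not the library's actual lookup code.

```python
from collections import OrderedDict

# A two-entry excerpt in the style of MODEL_FOR_OBJECT_DETECTION_MAPPING_NAMES.
OBJECT_DETECTION_NAMES = OrderedDict(
    [
        ("detr", "DetrForObjectDetection"),
        ("table-transformer", "TableTransformerForObjectDetection"),
    ]
)


def resolve_class_name(model_type, mapping):
    """Hypothetical helper: return the class name registered for a model type."""
    try:
        return mapping[model_type]
    except KeyError:
        raise ValueError(f"Unknown model type: {model_type!r}") from None


def module_name_for(model_type):
    """Hypothetical helper: assume the package directory uses underscores."""
    return model_type.replace("-", "_")


print(resolve_class_name("table-transformer", OBJECT_DETECTION_NAMES))
print(module_name_for("table-transformer"))
```

Keeping names as strings rather than class objects means the mapping can be built without importing any model code.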
69
src/transformers/models/table_transformer/__init__.py
Normal file
@@ -0,0 +1,69 @@
# flake8: noqa
# There's no way to ignore "F401 '...' imported but unused" warnings in this
# module, but to preserve other warnings. So, don't check this module at all.

# Copyright 2022 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

from typing import TYPE_CHECKING

from ...utils import OptionalDependencyNotAvailable, _LazyModule, is_timm_available


_import_structure = {
    "configuration_table_transformer": [
        "TABLE_TRANSFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP",
        "TableTransformerConfig",
        "TableTransformerOnnxConfig",
    ]
}

try:
    if not is_timm_available():
        raise OptionalDependencyNotAvailable()
except OptionalDependencyNotAvailable:
    pass
else:
    _import_structure["modeling_table_transformer"] = [
        "TABLE_TRANSFORMER_PRETRAINED_MODEL_ARCHIVE_LIST",
        "TableTransformerForObjectDetection",
        "TableTransformerModel",
        "TableTransformerPreTrainedModel",
    ]


if TYPE_CHECKING:
    from .configuration_table_transformer import (
        TABLE_TRANSFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP,
        TableTransformerConfig,
        TableTransformerOnnxConfig,
    )

    try:
        if not is_timm_available():
            raise OptionalDependencyNotAvailable()
    except OptionalDependencyNotAvailable:
        pass
    else:
        from .modeling_table_transformer import (
            TABLE_TRANSFORMER_PRETRAINED_MODEL_ARCHIVE_LIST,
            TableTransformerForObjectDetection,
            TableTransformerModel,
            TableTransformerPreTrainedModel,
        )

else:
    import sys

    sys.modules[__name__] = _LazyModule(__name__, globals()["__file__"], _import_structure, module_spec=__spec__)
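The `__init__.py` above only advertises the modeling classes when `timm` is installed. A minimal sketch of that gating idea, using only the standard library, is shown below. `is_package_available` and `build_import_structure` are hypothetical names introduced here; the real module relies on `is_timm_available` and `_LazyModule`, whose internals are not reproduced.

```python
import importlib.util


def is_package_available(name):
    """Return True if the top-level package `name` can be found, without importing it."""
    return importlib.util.find_spec(name) is not None


def build_import_structure(backend):
    """Hypothetical helper: list public names per submodule, gated on a backend."""
    structure = {"configuration_table_transformer": ["TableTransformerConfig"]}
    if is_package_available(backend):
        # Modeling code is only advertised when its backend is installed.
        structure["modeling_table_transformer"] = ["TableTransformerModel"]
    return structure


# `json` ships with Python, so the modeling entry is present.
print(sorted(build_import_structure("json")))
# A package name that should not exist, so only the configuration entry remains.
print(sorted(build_import_structure("surely_not_an_installed_package_123")))
```

The configuration entry is unconditional, which matches the file above: the config classes stay importable even without the optional backend.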
@@ -0,0 +1,241 @@
# coding=utf-8
# Copyright The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" Table Transformer model configuration"""

from collections import OrderedDict
from typing import Mapping

from packaging import version

from ...configuration_utils import PretrainedConfig
from ...onnx import OnnxConfig
from ...utils import logging


logger = logging.get_logger(__name__)

TABLE_TRANSFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP = {
    "microsoft/table-transformer-table-detection": (
        "https://huggingface.co/microsoft/table-transformer-table-detection/resolve/main/config.json"
    ),
}

class TableTransformerConfig(PretrainedConfig):
    r"""
    This is the configuration class to store the configuration of a [`TableTransformerModel`]. It is used to
    instantiate a Table Transformer model according to the specified arguments, defining the model architecture.
    Instantiating a configuration with the defaults will yield a similar configuration to that of the Table Transformer
    [microsoft/table-transformer-table-detection](https://huggingface.co/microsoft/table-transformer-table-detection)
    architecture.

    Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
    documentation from [`PretrainedConfig`] for more information.

    Args:
        num_channels (`int`, *optional*, defaults to 3):
            The number of input channels.
        num_queries (`int`, *optional*, defaults to 100):
            Number of object queries, i.e. detection slots. This is the maximal number of objects
            [`TableTransformerModel`] can detect in a single image. For COCO, we recommend 100 queries.
        d_model (`int`, *optional*, defaults to 256):
            Dimension of the layers.
        encoder_layers (`int`, *optional*, defaults to 6):
            Number of encoder layers.
        decoder_layers (`int`, *optional*, defaults to 6):
            Number of decoder layers.
        encoder_attention_heads (`int`, *optional*, defaults to 8):
            Number of attention heads for each attention layer in the Transformer encoder.
        decoder_attention_heads (`int`, *optional*, defaults to 8):
            Number of attention heads for each attention layer in the Transformer decoder.
        decoder_ffn_dim (`int`, *optional*, defaults to 2048):
            Dimension of the "intermediate" (often named feed-forward) layer in the decoder.
        encoder_ffn_dim (`int`, *optional*, defaults to 2048):
            Dimension of the "intermediate" (often named feed-forward) layer in the encoder.
        activation_function (`str` or `function`, *optional*, defaults to `"relu"`):
            The non-linear activation function (function or string) in the encoder and pooler. If string, `"gelu"`,
            `"relu"`, `"silu"` and `"gelu_new"` are supported.
        dropout (`float`, *optional*, defaults to 0.1):
            The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
        attention_dropout (`float`, *optional*, defaults to 0.0):
            The dropout ratio for the attention probabilities.
        activation_dropout (`float`, *optional*, defaults to 0.0):
            The dropout ratio for activations inside the fully connected layer.
        init_std (`float`, *optional*, defaults to 0.02):
            The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
        init_xavier_std (`float`, *optional*, defaults to 1):
            The scaling factor used for the Xavier initialization gain in the HM Attention map module.
        encoder_layerdrop (`float`, *optional*, defaults to 0.0):
            The LayerDrop probability for the encoder. See the [LayerDrop paper](https://arxiv.org/abs/1909.11556)
            for more details.
        decoder_layerdrop (`float`, *optional*, defaults to 0.0):
            The LayerDrop probability for the decoder. See the [LayerDrop paper](https://arxiv.org/abs/1909.11556)
            for more details.
        auxiliary_loss (`bool`, *optional*, defaults to `False`):
            Whether auxiliary decoding losses (loss at each decoder layer) are to be used.
        position_embedding_type (`str`, *optional*, defaults to `"sine"`):
            Type of position embeddings to be used on top of the image features. One of `"sine"` or `"learned"`.
        backbone (`str`, *optional*, defaults to `"resnet50"`):
            Name of convolutional backbone to use. Supports any convolutional backbone from the timm package. For a
            list of all available models, see [this
            page](https://rwightman.github.io/pytorch-image-models/#load-a-pretrained-model).
        use_pretrained_backbone (`bool`, *optional*, defaults to `True`):
            Whether to use pretrained weights for the backbone.
        dilation (`bool`, *optional*, defaults to `False`):
            Whether to replace stride with dilation in the last convolutional block (DC5).
        class_cost (`float`, *optional*, defaults to 1):
            Relative weight of the classification error in the Hungarian matching cost.
        bbox_cost (`float`, *optional*, defaults to 5):
            Relative weight of the L1 error of the bounding box coordinates in the Hungarian matching cost.
        giou_cost (`float`, *optional*, defaults to 2):
            Relative weight of the generalized IoU loss of the bounding box in the Hungarian matching cost.
        mask_loss_coefficient (`float`, *optional*, defaults to 1):
            Relative weight of the Focal loss in the panoptic segmentation loss.
        dice_loss_coefficient (`float`, *optional*, defaults to 1):
            Relative weight of the DICE/F-1 loss in the panoptic segmentation loss.
        bbox_loss_coefficient (`float`, *optional*, defaults to 5):
            Relative weight of the L1 bounding box loss in the object detection loss.
        giou_loss_coefficient (`float`, *optional*, defaults to 2):
            Relative weight of the generalized IoU loss in the object detection loss.
        eos_coefficient (`float`, *optional*, defaults to 0.1):
            Relative classification weight of the 'no-object' class in the object detection loss.

    Examples:

    ```python
    >>> from transformers import TableTransformerModel, TableTransformerConfig

    >>> # Initializing a Table Transformer microsoft/table-transformer-table-detection style configuration
    >>> configuration = TableTransformerConfig()

    >>> # Initializing a model from the microsoft/table-transformer-table-detection style configuration
    >>> model = TableTransformerModel(configuration)

    >>> # Accessing the model configuration
    >>> configuration = model.config
    ```"""
    model_type = "table-transformer"
    keys_to_ignore_at_inference = ["past_key_values"]
    attribute_map = {
        "hidden_size": "d_model",
        "num_attention_heads": "encoder_attention_heads",
    }

    # Copied from transformers.models.detr.configuration_detr.DetrConfig.__init__
    def __init__(
        self,
        num_channels=3,
        num_queries=100,
        max_position_embeddings=1024,
        encoder_layers=6,
        encoder_ffn_dim=2048,
        encoder_attention_heads=8,
        decoder_layers=6,
        decoder_ffn_dim=2048,
        decoder_attention_heads=8,
        encoder_layerdrop=0.0,
        decoder_layerdrop=0.0,
        is_encoder_decoder=True,
        activation_function="relu",
        d_model=256,
        dropout=0.1,
        attention_dropout=0.0,
        activation_dropout=0.0,
        init_std=0.02,
        init_xavier_std=1.0,
        classifier_dropout=0.0,
        scale_embedding=False,
        auxiliary_loss=False,
        position_embedding_type="sine",
        backbone="resnet50",
        use_pretrained_backbone=True,
        dilation=False,
        class_cost=1,
        bbox_cost=5,
        giou_cost=2,
        mask_loss_coefficient=1,
        dice_loss_coefficient=1,
        bbox_loss_coefficient=5,
        giou_loss_coefficient=2,
        eos_coefficient=0.1,
        **kwargs
    ):
        self.num_channels = num_channels
        self.num_queries = num_queries
        self.max_position_embeddings = max_position_embeddings
        self.d_model = d_model
        self.encoder_ffn_dim = encoder_ffn_dim
        self.encoder_layers = encoder_layers
        self.encoder_attention_heads = encoder_attention_heads
        self.decoder_ffn_dim = decoder_ffn_dim
        self.decoder_layers = decoder_layers
        self.decoder_attention_heads = decoder_attention_heads
        self.dropout = dropout
        self.attention_dropout = attention_dropout
        self.activation_dropout = activation_dropout
        self.activation_function = activation_function
        self.init_std = init_std
        self.init_xavier_std = init_xavier_std
        self.encoder_layerdrop = encoder_layerdrop
        self.decoder_layerdrop = decoder_layerdrop
        self.num_hidden_layers = encoder_layers
        self.scale_embedding = scale_embedding  # scale factor will be sqrt(d_model) if True
        self.auxiliary_loss = auxiliary_loss
        self.position_embedding_type = position_embedding_type
        self.backbone = backbone
        self.use_pretrained_backbone = use_pretrained_backbone
        self.dilation = dilation
        # Hungarian matcher
        self.class_cost = class_cost
        self.bbox_cost = bbox_cost
        self.giou_cost = giou_cost
        # Loss coefficients
        self.mask_loss_coefficient = mask_loss_coefficient
        self.dice_loss_coefficient = dice_loss_coefficient
        self.bbox_loss_coefficient = bbox_loss_coefficient
        self.giou_loss_coefficient = giou_loss_coefficient
        self.eos_coefficient = eos_coefficient
        super().__init__(is_encoder_decoder=is_encoder_decoder, **kwargs)

    @property
    def num_attention_heads(self) -> int:
        return self.encoder_attention_heads

    @property
    def hidden_size(self) -> int:
        return self.d_model


# Copied from transformers.models.detr.configuration_detr.DetrOnnxConfig
class TableTransformerOnnxConfig(OnnxConfig):

    torch_onnx_minimum_version = version.parse("1.11")

    @property
    def inputs(self) -> Mapping[str, Mapping[int, str]]:
        return OrderedDict(
            [
                ("pixel_values", {0: "batch", 1: "num_channels", 2: "height", 3: "width"}),
                ("pixel_mask", {0: "batch"}),
            ]
        )

    @property
    def atol_for_validation(self) -> float:
        return 1e-5

    @property
    def default_onnx_opset(self) -> int:
        return 12
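The `class_cost`, `bbox_cost` and `giou_cost` arguments documented in `TableTransformerConfig` weight three terms of the Hungarian matching cost. The sketch below shows how such a weighted cost could be computed for a single prediction/target pair. It is a simplification under stated assumptions: boxes are plain `(x0, y0, x1, y1)` tuples with positive area, the classification term is taken as the negated probability of the target class, and `box_area`, `generalized_iou` and `matching_cost` are hypothetical helpers, not the model's own matcher.

```python
def box_area(box):
    """Area of an (x0, y0, x1, y1) box."""
    x0, y0, x1, y1 = box
    return max(0.0, x1 - x0) * max(0.0, y1 - y0)


def generalized_iou(a, b):
    """Generalized IoU of two (x0, y0, x1, y1) boxes with positive area."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = box_area(a) + box_area(b) - inter
    # Smallest axis-aligned box enclosing both inputs.
    hull = (max(a[2], b[2]) - min(a[0], b[0])) * (max(a[3], b[3]) - min(a[1], b[1]))
    return inter / union - (hull - union) / hull


def matching_cost(class_prob, pred_box, target_box, class_cost=1.0, bbox_cost=5.0, giou_cost=2.0):
    """Weighted cost of assigning one prediction to one target (lower is better)."""
    l1 = sum(abs(p - t) for p, t in zip(pred_box, target_box))
    giou = generalized_iou(pred_box, target_box)
    return class_cost * -class_prob + bbox_cost * l1 + giou_cost * -giou


target = (0.1, 0.1, 0.5, 0.5)
print(matching_cost(0.9, target, target))                # exact box match
print(matching_cost(0.9, (0.2, 0.1, 0.6, 0.5), target))  # shifted box costs more
```

A full matcher would evaluate this cost for every prediction/target pair and then solve the resulting assignment problem; that step is left out here.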
@@ -0,0 +1,318 @@
# coding=utf-8
# Copyright 2022 The HuggingFace Inc. team.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Convert Table Transformer checkpoints.

URL: https://github.com/microsoft/table-transformer
"""


import argparse
from collections import OrderedDict
from pathlib import Path

import torch
from PIL import Image
from torchvision.transforms import functional as F

from huggingface_hub import hf_hub_download
from transformers import DetrFeatureExtractor, TableTransformerConfig, TableTransformerForObjectDetection
from transformers.utils import logging


logging.set_verbosity_info()
logger = logging.get_logger(__name__)

# here we list all keys to be renamed (original name on the left, our name on the right)
rename_keys = []
for i in range(6):
    # encoder layers: output projection, 2 feedforward neural networks and 2 layernorms
    rename_keys.append(
        (f"transformer.encoder.layers.{i}.self_attn.out_proj.weight", f"encoder.layers.{i}.self_attn.out_proj.weight")
    )
    rename_keys.append(
        (f"transformer.encoder.layers.{i}.self_attn.out_proj.bias", f"encoder.layers.{i}.self_attn.out_proj.bias")
    )
    rename_keys.append((f"transformer.encoder.layers.{i}.linear1.weight", f"encoder.layers.{i}.fc1.weight"))
    rename_keys.append((f"transformer.encoder.layers.{i}.linear1.bias", f"encoder.layers.{i}.fc1.bias"))
    rename_keys.append((f"transformer.encoder.layers.{i}.linear2.weight", f"encoder.layers.{i}.fc2.weight"))
    rename_keys.append((f"transformer.encoder.layers.{i}.linear2.bias", f"encoder.layers.{i}.fc2.bias"))
    rename_keys.append(
        (f"transformer.encoder.layers.{i}.norm1.weight", f"encoder.layers.{i}.self_attn_layer_norm.weight")
    )
    rename_keys.append((f"transformer.encoder.layers.{i}.norm1.bias", f"encoder.layers.{i}.self_attn_layer_norm.bias"))
    rename_keys.append((f"transformer.encoder.layers.{i}.norm2.weight", f"encoder.layers.{i}.final_layer_norm.weight"))
    rename_keys.append((f"transformer.encoder.layers.{i}.norm2.bias", f"encoder.layers.{i}.final_layer_norm.bias"))
    # decoder layers: 2 times output projection, 2 feedforward neural networks and 3 layernorms
    rename_keys.append(
        (f"transformer.decoder.layers.{i}.self_attn.out_proj.weight", f"decoder.layers.{i}.self_attn.out_proj.weight")
    )
    rename_keys.append(
        (f"transformer.decoder.layers.{i}.self_attn.out_proj.bias", f"decoder.layers.{i}.self_attn.out_proj.bias")
    )
    rename_keys.append(
        (
            f"transformer.decoder.layers.{i}.multihead_attn.out_proj.weight",
            f"decoder.layers.{i}.encoder_attn.out_proj.weight",
        )
    )
    rename_keys.append(
        (
            f"transformer.decoder.layers.{i}.multihead_attn.out_proj.bias",
            f"decoder.layers.{i}.encoder_attn.out_proj.bias",
        )
    )
    rename_keys.append((f"transformer.decoder.layers.{i}.linear1.weight", f"decoder.layers.{i}.fc1.weight"))
    rename_keys.append((f"transformer.decoder.layers.{i}.linear1.bias", f"decoder.layers.{i}.fc1.bias"))
    rename_keys.append((f"transformer.decoder.layers.{i}.linear2.weight", f"decoder.layers.{i}.fc2.weight"))
    rename_keys.append((f"transformer.decoder.layers.{i}.linear2.bias", f"decoder.layers.{i}.fc2.bias"))
    rename_keys.append(
        (f"transformer.decoder.layers.{i}.norm1.weight", f"decoder.layers.{i}.self_attn_layer_norm.weight")
    )
    rename_keys.append((f"transformer.decoder.layers.{i}.norm1.bias", f"decoder.layers.{i}.self_attn_layer_norm.bias"))
    rename_keys.append(
        (f"transformer.decoder.layers.{i}.norm2.weight", f"decoder.layers.{i}.encoder_attn_layer_norm.weight")
    )
    rename_keys.append(
        (f"transformer.decoder.layers.{i}.norm2.bias", f"decoder.layers.{i}.encoder_attn_layer_norm.bias")
    )
    rename_keys.append((f"transformer.decoder.layers.{i}.norm3.weight", f"decoder.layers.{i}.final_layer_norm.weight"))
    rename_keys.append((f"transformer.decoder.layers.{i}.norm3.bias", f"decoder.layers.{i}.final_layer_norm.bias"))

# convolutional projection + query embeddings + layernorm of encoder + layernorm of decoder + class and bounding box heads
rename_keys.extend(
    [
        ("input_proj.weight", "input_projection.weight"),
        ("input_proj.bias", "input_projection.bias"),
        ("query_embed.weight", "query_position_embeddings.weight"),
        ("transformer.encoder.norm.weight", "encoder.layernorm.weight"),
        ("transformer.encoder.norm.bias", "encoder.layernorm.bias"),
        ("transformer.decoder.norm.weight", "decoder.layernorm.weight"),
        ("transformer.decoder.norm.bias", "decoder.layernorm.bias"),
        ("class_embed.weight", "class_labels_classifier.weight"),
        ("class_embed.bias", "class_labels_classifier.bias"),
        ("bbox_embed.layers.0.weight", "bbox_predictor.layers.0.weight"),
        ("bbox_embed.layers.0.bias", "bbox_predictor.layers.0.bias"),
        ("bbox_embed.layers.1.weight", "bbox_predictor.layers.1.weight"),
        ("bbox_embed.layers.1.bias", "bbox_predictor.layers.1.bias"),
        ("bbox_embed.layers.2.weight", "bbox_predictor.layers.2.weight"),
        ("bbox_embed.layers.2.bias", "bbox_predictor.layers.2.bias"),
    ]
)

def rename_key(state_dict, old, new):
|
||||||
|
val = state_dict.pop(old)
|
||||||
|
state_dict[new] = val
|
||||||
|
|
||||||
|
|
||||||
|
def rename_backbone_keys(state_dict):
|
||||||
|
new_state_dict = OrderedDict()
|
||||||
|
for key, value in state_dict.items():
|
||||||
|
if "backbone.0.body" in key:
|
||||||
|
new_key = key.replace("backbone.0.body", "backbone.conv_encoder.model")
|
||||||
|
new_state_dict[new_key] = value
|
||||||
|
else:
|
||||||
|
new_state_dict[key] = value
|
||||||
|
|
||||||
|
return new_state_dict
|
||||||
|
|
||||||
|
|
||||||
|
def read_in_q_k_v(state_dict):
|
||||||
|
prefix = ""
|
||||||
|
|
||||||
|
# first: transformer encoder
|
||||||
|
for i in range(6):
|
||||||
|
# read in weights + bias of input projection layer (in PyTorch's MultiHeadAttention, this is a single matrix + bias)
|
||||||
|
in_proj_weight = state_dict.pop(f"{prefix}transformer.encoder.layers.{i}.self_attn.in_proj_weight")
|
||||||
|
in_proj_bias = state_dict.pop(f"{prefix}transformer.encoder.layers.{i}.self_attn.in_proj_bias")
|
||||||
|
# next, add query, keys and values (in that order) to the state dict
|
||||||
|
state_dict[f"encoder.layers.{i}.self_attn.q_proj.weight"] = in_proj_weight[:256, :]
|
||||||
|
state_dict[f"encoder.layers.{i}.self_attn.q_proj.bias"] = in_proj_bias[:256]
|
||||||
|
state_dict[f"encoder.layers.{i}.self_attn.k_proj.weight"] = in_proj_weight[256:512, :]
|
||||||
|
state_dict[f"encoder.layers.{i}.self_attn.k_proj.bias"] = in_proj_bias[256:512]
|
||||||
|
state_dict[f"encoder.layers.{i}.self_attn.v_proj.weight"] = in_proj_weight[-256:, :]
|
||||||
|
state_dict[f"encoder.layers.{i}.self_attn.v_proj.bias"] = in_proj_bias[-256:]
|
||||||
|
# next: transformer decoder (which is a bit more complex because it also includes cross-attention)
|
||||||
|
for i in range(6):
|
||||||
|
        # read in weights + bias of input projection layer of self-attention
        in_proj_weight = state_dict.pop(f"{prefix}transformer.decoder.layers.{i}.self_attn.in_proj_weight")
        in_proj_bias = state_dict.pop(f"{prefix}transformer.decoder.layers.{i}.self_attn.in_proj_bias")
        # next, add query, keys and values (in that order) to the state dict
        state_dict[f"decoder.layers.{i}.self_attn.q_proj.weight"] = in_proj_weight[:256, :]
        state_dict[f"decoder.layers.{i}.self_attn.q_proj.bias"] = in_proj_bias[:256]
        state_dict[f"decoder.layers.{i}.self_attn.k_proj.weight"] = in_proj_weight[256:512, :]
        state_dict[f"decoder.layers.{i}.self_attn.k_proj.bias"] = in_proj_bias[256:512]
        state_dict[f"decoder.layers.{i}.self_attn.v_proj.weight"] = in_proj_weight[-256:, :]
        state_dict[f"decoder.layers.{i}.self_attn.v_proj.bias"] = in_proj_bias[-256:]
        # read in weights + bias of input projection layer of cross-attention
        in_proj_weight_cross_attn = state_dict.pop(
            f"{prefix}transformer.decoder.layers.{i}.multihead_attn.in_proj_weight"
        )
        in_proj_bias_cross_attn = state_dict.pop(f"{prefix}transformer.decoder.layers.{i}.multihead_attn.in_proj_bias")
        # next, add query, keys and values (in that order) of cross-attention to the state dict
        state_dict[f"decoder.layers.{i}.encoder_attn.q_proj.weight"] = in_proj_weight_cross_attn[:256, :]
        state_dict[f"decoder.layers.{i}.encoder_attn.q_proj.bias"] = in_proj_bias_cross_attn[:256]
        state_dict[f"decoder.layers.{i}.encoder_attn.k_proj.weight"] = in_proj_weight_cross_attn[256:512, :]
        state_dict[f"decoder.layers.{i}.encoder_attn.k_proj.bias"] = in_proj_bias_cross_attn[256:512]
        state_dict[f"decoder.layers.{i}.encoder_attn.v_proj.weight"] = in_proj_weight_cross_attn[-256:, :]
        state_dict[f"decoder.layers.{i}.encoder_attn.v_proj.bias"] = in_proj_bias_cross_attn[-256:]


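The slicing in `read_in_q_k_v` relies on the fused `in_proj_weight` stacking the query, key and value projections along the first axis, 256 rows each. A minimal sketch of the same split on plain Python lists, assuming a hypothetical helper `split_in_proj` and a toy `hidden_size` of 2 for readability (neither is part of the conversion script):

```python
def split_in_proj(rows, hidden_size):
    """Split a fused projection with 3 * hidden_size rows into (q, k, v) parts."""
    if len(rows) != 3 * hidden_size:
        raise ValueError("expected exactly 3 * hidden_size rows")
    q = rows[:hidden_size]                   # first block: queries
    k = rows[hidden_size : 2 * hidden_size]  # middle block: keys
    v = rows[-hidden_size:]                  # last block: values
    return q, k, v


# Toy fused matrix with hidden_size=2: rows 0-1 are q, rows 2-3 are k, rows 4-5 are v.
fused = [[i, i] for i in range(6)]
q, k, v = split_in_proj(fused, hidden_size=2)
```

With `hidden_size=256` the three slices correspond to `[:256]`, `[256:512]` and `[-256:]` as used in the script.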
def resize(image, checkpoint_url):
    width, height = image.size
    current_max_size = max(width, height)
    target_max_size = 800 if "detection" in checkpoint_url else 1000
    scale = target_max_size / current_max_size
    resized_image = image.resize((int(round(scale * width)), int(round(scale * height))))

    return resized_image


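As a worked example of the arithmetic in `resize`, the sketch below computes only the target dimensions, without touching any image. The helper name `scaled_size` and the 1200x900 page are illustrative assumptions, not part of the script:

```python
def scaled_size(width, height, target_max_size):
    """Return the (width, height) that makes the longer side equal target_max_size."""
    scale = target_max_size / max(width, height)
    return int(round(scale * width)), int(round(scale * height))


# A 1200x900 page resized for the detection checkpoint (longer side 800).
detection_size = scaled_size(1200, 900, 800)
# The same page resized for the structure checkpoint (longer side 1000).
structure_size = scaled_size(1200, 900, 1000)
```

The aspect ratio is preserved up to rounding, since both sides are multiplied by the same `scale`.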
def normalize(image):
    image = F.to_tensor(image)
    image = F.normalize(image, mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
    return image


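The `normalize` step applies `(value - mean) / std` per channel, using the standard ImageNet statistics. A minimal pure-Python sketch for a single RGB pixel, assuming the pixel has already been scaled to the range [0, 1] (which is what `to_tensor` does for 8-bit images if `F` is `torchvision.transforms.functional`); the helper `normalize_pixel` is hypothetical:

```python
IMAGENET_MEAN = [0.485, 0.456, 0.406]
IMAGENET_STD = [0.229, 0.224, 0.225]


def normalize_pixel(rgb, mean=IMAGENET_MEAN, std=IMAGENET_STD):
    """Apply (value - mean) / std to one RGB pixel already scaled to [0, 1]."""
    return [(value - m) / s for value, m, s in zip(rgb, mean, std)]


# A pixel equal to the channel means maps to zero in every channel.
centered = normalize_pixel([0.485, 0.456, 0.406])
# A pure white pixel (1.0 per channel) maps to positive values.
white = normalize_pixel([1.0, 1.0, 1.0])
```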
@torch.no_grad()
def convert_table_transformer_checkpoint(checkpoint_url, pytorch_dump_folder_path, push_to_hub):
    """
    Copy/paste/tweak model's weights to our DETR structure.
    """

    logger.info("Converting model...")

    # load original state dict
    state_dict = torch.hub.load_state_dict_from_url(checkpoint_url, map_location="cpu")
    # rename keys
    for src, dest in rename_keys:
        rename_key(state_dict, src, dest)
    state_dict = rename_backbone_keys(state_dict)
    # query, key and value matrices need special treatment
    read_in_q_k_v(state_dict)
    # important: we need to prepend a prefix to each of the base model keys as the head models use different attributes for them
    prefix = "model."
    for key in state_dict.copy().keys():
        if not key.startswith("class_labels_classifier") and not key.startswith("bbox_predictor"):
            val = state_dict.pop(key)
            state_dict[prefix + key] = val
    # create HuggingFace model and load state dict
    config = TableTransformerConfig(
        backbone="resnet18",
        mask_loss_coefficient=1,
        dice_loss_coefficient=1,
        ce_loss_coefficient=1,
        bbox_loss_coefficient=5,
        giou_loss_coefficient=2,
        eos_coefficient=0.4,
        class_cost=1,
        bbox_cost=5,
        giou_cost=2,
    )

    if "detection" in checkpoint_url:
        config.num_queries = 15
        config.num_labels = 2
        id2label = {0: "table", 1: "table rotated"}
        config.id2label = id2label
        config.label2id = {v: k for k, v in id2label.items()}
    else:
        config.num_queries = 125
        config.num_labels = 6
        id2label = {
            0: "table",
            1: "table column",
            2: "table row",
            3: "table column header",
            4: "table projected row header",
            5: "table spanning cell",
        }
        config.id2label = id2label
        config.label2id = {v: k for k, v in id2label.items()}

    feature_extractor = DetrFeatureExtractor(
        format="coco_detection", max_size=800 if "detection" in checkpoint_url else 1000
    )
    model = TableTransformerForObjectDetection(config)
    model.load_state_dict(state_dict)
    model.eval()

    # verify our conversion
    filename = "example_pdf.png" if "detection" in checkpoint_url else "example_table.png"
    file_path = hf_hub_download(repo_id="nielsr/example-pdf", repo_type="dataset", filename=filename)
    image = Image.open(file_path).convert("RGB")
    pixel_values = normalize(resize(image, checkpoint_url)).unsqueeze(0)

    outputs = model(pixel_values)

    if "detection" in checkpoint_url:
        expected_shape = (1, 15, 3)
        expected_logits = torch.tensor(
            [[-6.7897, -16.9985, 6.7937], [-8.0186, -22.2192, 6.9677], [-7.3117, -21.0708, 7.4055]]
        )
        expected_boxes = torch.tensor([[0.4867, 0.1767, 0.6732], [0.6718, 0.4479, 0.3830], [0.4716, 0.1760, 0.6364]])

    else:
        expected_shape = (1, 125, 7)
        expected_logits = torch.tensor(
            [[-18.1430, -8.3214, 4.8274], [-18.4685, -7.1361, -4.2667], [-26.3693, -9.3429, -4.9962]]
        )
        expected_boxes = torch.tensor([[0.4983, 0.5595, 0.9440], [0.4916, 0.6315, 0.5954], [0.6108, 0.8637, 0.1135]])

    assert outputs.logits.shape == expected_shape
    assert torch.allclose(outputs.logits[0, :3, :3], expected_logits, atol=1e-4)
    assert torch.allclose(outputs.pred_boxes[0, :3, :3], expected_boxes, atol=1e-4)
    print("Looks ok!")

    if pytorch_dump_folder_path is not None:
        # Save model and feature extractor
        logger.info(f"Saving PyTorch model and feature extractor to {pytorch_dump_folder_path}...")
        Path(pytorch_dump_folder_path).mkdir(exist_ok=True)
        model.save_pretrained(pytorch_dump_folder_path)
        feature_extractor.save_pretrained(pytorch_dump_folder_path)

    if push_to_hub:
        # Push model to HF hub
        logger.info("Pushing model to the hub...")
        model_name = (
            "microsoft/table-transformer-detection"
            if "detection" in checkpoint_url
            else "microsoft/table-transformer-structure-recognition"
        )
        model.push_to_hub(model_name)
        feature_extractor.push_to_hub(model_name)


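The key-prefixing step in `convert_table_transformer_checkpoint` mutates the state dict while looping, which is why it iterates over a copy of the keys. A small sketch of that pattern on a toy dictionary, assuming a hypothetical helper `add_prefix` and made-up key names:

```python
def add_prefix(state_dict, prefix, keep_prefixes):
    """Prepend prefix to every key unless it starts with one of keep_prefixes."""
    # Iterate over a snapshot of the keys because the dict is mutated in the loop.
    for key in list(state_dict.keys()):
        if not key.startswith(tuple(keep_prefixes)):
            state_dict[prefix + key] = state_dict.pop(key)
    return state_dict


# Toy state dict: only the base-model key should receive the "model." prefix.
toy = {"encoder.weight": 1, "class_labels_classifier.bias": 2, "bbox_predictor.w": 3}
renamed = add_prefix(toy, "model.", ["class_labels_classifier", "bbox_predictor"])
```

Head keys (`class_labels_classifier`, `bbox_predictor`) keep their names, while everything else moves under `model.`, mirroring the loop in the script.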
if __name__ == "__main__":
    parser = argparse.ArgumentParser()

    parser.add_argument(
        "--checkpoint_url",
        default="https://pubtables1m.blob.core.windows.net/model/pubtables1m_detection_detr_r18.pth",
        type=str,
        choices=[
            "https://pubtables1m.blob.core.windows.net/model/pubtables1m_detection_detr_r18.pth",
            "https://pubtables1m.blob.core.windows.net/model/pubtables1m_structure_detr_r18.pth",
        ],
        help="URL of the Table Transformer checkpoint you'd like to convert.",
    )
    parser.add_argument(
        "--pytorch_dump_folder_path", default=None, type=str, help="Path to the folder to output PyTorch model."
    )
    parser.add_argument(
        "--push_to_hub", action="store_true", help="Whether or not to push the converted model to the 🤗 hub."
    )
    args = parser.parse_args()
    convert_table_transformer_checkpoint(args.checkpoint_url, args.pytorch_dump_folder_path, args.push_to_hub)
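To see how the command-line options of the conversion script behave, the sketch below rebuilds an equivalent parser and parses an explicit argument list instead of `sys.argv`. The `build_parser` wrapper, the `CHECKPOINTS` constant and the `./dump` path are assumptions made for illustration; the option names and URLs are taken from the script:

```python
import argparse

CHECKPOINTS = [
    "https://pubtables1m.blob.core.windows.net/model/pubtables1m_detection_detr_r18.pth",
    "https://pubtables1m.blob.core.windows.net/model/pubtables1m_structure_detr_r18.pth",
]


def build_parser():
    """Build a parser with the same three options as the conversion script."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--checkpoint_url", default=CHECKPOINTS[0], type=str, choices=CHECKPOINTS)
    parser.add_argument("--pytorch_dump_folder_path", default=None, type=str)
    parser.add_argument("--push_to_hub", action="store_true")
    return parser


# Select the structure-recognition checkpoint and a placeholder output folder.
args = build_parser().parse_args(["--checkpoint_url", CHECKPOINTS[1], "--pytorch_dump_folder_path", "./dump"])
# The script branches on whether "detection" appears in the checkpoint URL.
task = "detection" if "detection" in args.checkpoint_url else "structure recognition"
```

Because `--push_to_hub` uses `action="store_true"`, it stays `False` unless the flag is passed explicitly.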
File diff suppressed because it is too large
@@ -87,3 +87,27 @@ class DetrPreTrainedModel(metaclass=DummyObject):
     def __init__(self, *args, **kwargs):
         requires_backends(self, ["timm", "vision"])
+
+
+TABLE_TRANSFORMER_PRETRAINED_MODEL_ARCHIVE_LIST = None
+
+
+class TableTransformerForObjectDetection(metaclass=DummyObject):
+    _backends = ["timm", "vision"]
+
+    def __init__(self, *args, **kwargs):
+        requires_backends(self, ["timm", "vision"])
+
+
+class TableTransformerModel(metaclass=DummyObject):
+    _backends = ["timm", "vision"]
+
+    def __init__(self, *args, **kwargs):
+        requires_backends(self, ["timm", "vision"])
+
+
+class TableTransformerPreTrainedModel(metaclass=DummyObject):
+    _backends = ["timm", "vision"]
+
+    def __init__(self, *args, **kwargs):
+        requires_backends(self, ["timm", "vision"])
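The dummy classes added in this hunk exist so that the class names can still be imported when `timm` or the vision dependencies are missing, with the failure deferred until the class is actually used. The sketch below illustrates that general pattern only; it is not the library's `DummyObject` or `requires_backends` implementation, and `require_backends_sketch`, `PlaceholderModel` and the `available` argument are hypothetical:

```python
def require_backends_sketch(obj, backends, available):
    """Raise ImportError naming the backends that are not in `available`."""
    name = obj.__name__ if isinstance(obj, type) else type(obj).__name__
    missing = [backend for backend in backends if backend not in available]
    if missing:
        raise ImportError(f"{name} requires the following backends: {', '.join(missing)}")


class PlaceholderModel:
    """Stand-in class that can be imported but not instantiated without its backends."""

    _backends = ["timm", "vision"]

    def __init__(self, *args, available=(), **kwargs):
        require_backends_sketch(self, self._backends, available)


# Simulate an environment where only the vision backend is installed.
try:
    PlaceholderModel(available=["vision"])
    message = ""
except ImportError as error:
    message = str(error)
```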
0
tests/models/table_transformer/__init__.py
Normal file
@@ -0,0 +1,498 @@
# coding=utf-8
# Copyright 2022 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" Testing suite for the PyTorch Table Transformer model. """


import inspect
import math
import unittest

from huggingface_hub import hf_hub_download
from transformers import TableTransformerConfig, is_timm_available, is_vision_available
from transformers.testing_utils import require_timm, require_vision, slow, torch_device

from ...generation.test_generation_utils import GenerationTesterMixin
from ...test_configuration_common import ConfigTester
from ...test_modeling_common import ModelTesterMixin, _config_zero_init, floats_tensor


if is_timm_available():
    import torch

    from transformers import TableTransformerForObjectDetection, TableTransformerModel


if is_vision_available():
    from PIL import Image

    from transformers import AutoFeatureExtractor


class TableTransformerModelTester:
    def __init__(
        self,
        parent,
        batch_size=8,
        is_training=True,
        use_labels=True,
        hidden_size=256,
        num_hidden_layers=2,
        num_attention_heads=8,
        intermediate_size=4,
        hidden_act="gelu",
        hidden_dropout_prob=0.1,
        attention_probs_dropout_prob=0.1,
        num_queries=12,
        num_channels=3,
        min_size=200,
        max_size=200,
        n_targets=8,
        num_labels=91,
    ):
        self.parent = parent
        self.batch_size = batch_size
        self.is_training = is_training
        self.use_labels = use_labels
        self.hidden_size = hidden_size
        self.num_hidden_layers = num_hidden_layers
        self.num_attention_heads = num_attention_heads
        self.intermediate_size = intermediate_size
        self.hidden_act = hidden_act
        self.hidden_dropout_prob = hidden_dropout_prob
        self.attention_probs_dropout_prob = attention_probs_dropout_prob
        self.num_queries = num_queries
        self.num_channels = num_channels
        self.min_size = min_size
        self.max_size = max_size
        self.n_targets = n_targets
        self.num_labels = num_labels

        # we also set the expected seq length for both encoder and decoder
        self.encoder_seq_length = math.ceil(self.min_size / 32) * math.ceil(self.max_size / 32)
        self.decoder_seq_length = self.num_queries

    def prepare_config_and_inputs(self):
        pixel_values = floats_tensor([self.batch_size, self.num_channels, self.min_size, self.max_size])

        pixel_mask = torch.ones([self.batch_size, self.min_size, self.max_size], device=torch_device)

        labels = None
        if self.use_labels:
            # labels is a list of Dict (each Dict being the labels for a given example in the batch)
            labels = []
            for i in range(self.batch_size):
                target = {}
                target["class_labels"] = torch.randint(
                    high=self.num_labels, size=(self.n_targets,), device=torch_device
                )
                target["boxes"] = torch.rand(self.n_targets, 4, device=torch_device)
                target["masks"] = torch.rand(self.n_targets, self.min_size, self.max_size, device=torch_device)
                labels.append(target)

        config = self.get_config()
        return config, pixel_values, pixel_mask, labels

    def get_config(self):
        return TableTransformerConfig(
            d_model=self.hidden_size,
            encoder_layers=self.num_hidden_layers,
            decoder_layers=self.num_hidden_layers,
            encoder_attention_heads=self.num_attention_heads,
            decoder_attention_heads=self.num_attention_heads,
            encoder_ffn_dim=self.intermediate_size,
            decoder_ffn_dim=self.intermediate_size,
            dropout=self.hidden_dropout_prob,
            attention_dropout=self.attention_probs_dropout_prob,
            num_queries=self.num_queries,
            num_labels=self.num_labels,
        )

    def prepare_config_and_inputs_for_common(self):
        config, pixel_values, pixel_mask, labels = self.prepare_config_and_inputs()
        inputs_dict = {"pixel_values": pixel_values, "pixel_mask": pixel_mask}
        return config, inputs_dict

    def create_and_check_table_transformer_model(self, config, pixel_values, pixel_mask, labels):
        model = TableTransformerModel(config=config)
        model.to(torch_device)
        model.eval()

        result = model(pixel_values=pixel_values, pixel_mask=pixel_mask)
        result = model(pixel_values)

        self.parent.assertEqual(
            result.last_hidden_state.shape, (self.batch_size, self.decoder_seq_length, self.hidden_size)
        )

    def create_and_check_table_transformer_object_detection_head_model(self, config, pixel_values, pixel_mask, labels):
        model = TableTransformerForObjectDetection(config=config)
        model.to(torch_device)
        model.eval()

        result = model(pixel_values=pixel_values, pixel_mask=pixel_mask)
        result = model(pixel_values)

        self.parent.assertEqual(result.logits.shape, (self.batch_size, self.num_queries, self.num_labels + 1))
        self.parent.assertEqual(result.pred_boxes.shape, (self.batch_size, self.num_queries, 4))

        result = model(pixel_values=pixel_values, pixel_mask=pixel_mask, labels=labels)

        self.parent.assertEqual(result.loss.shape, ())
        self.parent.assertEqual(result.logits.shape, (self.batch_size, self.num_queries, self.num_labels + 1))
        self.parent.assertEqual(result.pred_boxes.shape, (self.batch_size, self.num_queries, 4))


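The tester derives `encoder_seq_length` by dividing each image side by 32 and rounding up, which is consistent with a backbone whose final feature map is downsampled by a factor of 32. A small worked sketch of that arithmetic, where the helper `encoder_seq_length` and its `stride` parameter are illustrative names rather than part of the test suite:

```python
import math


def encoder_seq_length(height, width, stride=32):
    """Number of feature-map positions when each side is downsampled by `stride`."""
    return math.ceil(height / stride) * math.ceil(width / stride)


# The tester uses 200x200 inputs: ceil(200 / 32) = 7 per side, so 7 * 7 positions.
tester_length = encoder_seq_length(200, 200)
```

The decoder sequence length, by contrast, is simply the number of object queries (`num_queries=12` in the tester).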
@require_timm
class TableTransformerModelTest(ModelTesterMixin, GenerationTesterMixin, unittest.TestCase):
    all_model_classes = (
        (
            TableTransformerModel,
            TableTransformerForObjectDetection,
        )
        if is_timm_available()
        else ()
    )
    is_encoder_decoder = True
    test_torchscript = False
    test_pruning = False
    test_head_masking = False
    test_missing_keys = False

    # special case for head models
    def _prepare_for_class(self, inputs_dict, model_class, return_labels=False):
        inputs_dict = super()._prepare_for_class(inputs_dict, model_class, return_labels=return_labels)

        if return_labels:
            if model_class.__name__ in ["TableTransformerForObjectDetection"]:
                labels = []
                for i in range(self.model_tester.batch_size):
                    target = {}
                    target["class_labels"] = torch.ones(
                        size=(self.model_tester.n_targets,), device=torch_device, dtype=torch.long
                    )
                    target["boxes"] = torch.ones(
                        self.model_tester.n_targets, 4, device=torch_device, dtype=torch.float
                    )
                    target["masks"] = torch.ones(
                        self.model_tester.n_targets,
                        self.model_tester.min_size,
                        self.model_tester.max_size,
                        device=torch_device,
                        dtype=torch.float,
                    )
                    labels.append(target)
                inputs_dict["labels"] = labels

        return inputs_dict

    def setUp(self):
        self.model_tester = TableTransformerModelTester(self)
        self.config_tester = ConfigTester(self, config_class=TableTransformerConfig, has_text_modality=False)

    def test_config(self):
        self.config_tester.run_common_tests()

    def test_table_transformer_model(self):
        config_and_inputs = self.model_tester.prepare_config_and_inputs()
        self.model_tester.create_and_check_table_transformer_model(*config_and_inputs)

    def test_table_transformer_object_detection_head_model(self):
        config_and_inputs = self.model_tester.prepare_config_and_inputs()
        self.model_tester.create_and_check_table_transformer_object_detection_head_model(*config_and_inputs)

    @unittest.skip(reason="Table Transformer does not use inputs_embeds")
    def test_inputs_embeds(self):
        pass

    @unittest.skip(reason="Table Transformer does not have a get_input_embeddings method")
    def test_model_common_attributes(self):
        pass

    @unittest.skip(reason="Table Transformer is not a generative model")
    def test_generate_without_input_ids(self):
        pass

    @unittest.skip(reason="Table Transformer does not use token embeddings")
    def test_resize_tokens_embeddings(self):
        pass

    @slow
    def test_model_outputs_equivalence(self):
        # TODO Niels: fix me!
        pass

    def test_attention_outputs(self):
        config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()
        config.return_dict = True

        decoder_seq_length = self.model_tester.decoder_seq_length
        encoder_seq_length = self.model_tester.encoder_seq_length
        decoder_key_length = self.model_tester.decoder_seq_length
        encoder_key_length = self.model_tester.encoder_seq_length

        for model_class in self.all_model_classes:
            inputs_dict["output_attentions"] = True
            inputs_dict["output_hidden_states"] = False
            config.return_dict = True
            model = model_class(config)
            model.to(torch_device)
            model.eval()
            with torch.no_grad():
                outputs = model(**self._prepare_for_class(inputs_dict, model_class))
            attentions = outputs.encoder_attentions if config.is_encoder_decoder else outputs.attentions
            self.assertEqual(len(attentions), self.model_tester.num_hidden_layers)

            # check that output_attentions also work using config
            del inputs_dict["output_attentions"]
            config.output_attentions = True
            model = model_class(config)
            model.to(torch_device)
            model.eval()
            with torch.no_grad():
                outputs = model(**self._prepare_for_class(inputs_dict, model_class))
            attentions = outputs.encoder_attentions if config.is_encoder_decoder else outputs.attentions
            self.assertEqual(len(attentions), self.model_tester.num_hidden_layers)

            self.assertListEqual(
                list(attentions[0].shape[-3:]),
                [self.model_tester.num_attention_heads, encoder_seq_length, encoder_key_length],
            )
            out_len = len(outputs)

            if self.is_encoder_decoder:
                correct_outlen = 5

                # loss is at first position
                if "labels" in inputs_dict:
                    correct_outlen += 1  # loss is added to beginning
                # Object Detection model returns pred_logits and pred_boxes
                if model_class.__name__ == "TableTransformerForObjectDetection":
                    correct_outlen += 2

                if "past_key_values" in outputs:
                    correct_outlen += 1  # past_key_values have been returned

                self.assertEqual(out_len, correct_outlen)

                # decoder attentions
                decoder_attentions = outputs.decoder_attentions
                self.assertIsInstance(decoder_attentions, (list, tuple))
                self.assertEqual(len(decoder_attentions), self.model_tester.num_hidden_layers)
                self.assertListEqual(
                    list(decoder_attentions[0].shape[-3:]),
                    [self.model_tester.num_attention_heads, decoder_seq_length, decoder_key_length],
                )

                # cross attentions
                cross_attentions = outputs.cross_attentions
                self.assertIsInstance(cross_attentions, (list, tuple))
                self.assertEqual(len(cross_attentions), self.model_tester.num_hidden_layers)
                self.assertListEqual(
                    list(cross_attentions[0].shape[-3:]),
                    [
                        self.model_tester.num_attention_heads,
                        decoder_seq_length,
                        encoder_key_length,
                    ],
                )

            # Check attention is always last and order is fine
            inputs_dict["output_attentions"] = True
            inputs_dict["output_hidden_states"] = True
            model = model_class(config)
            model.to(torch_device)
            model.eval()
            with torch.no_grad():
                outputs = model(**self._prepare_for_class(inputs_dict, model_class))

            if hasattr(self.model_tester, "num_hidden_states_types"):
                added_hidden_states = self.model_tester.num_hidden_states_types
            elif self.is_encoder_decoder:
                added_hidden_states = 2
            else:
                added_hidden_states = 1
            self.assertEqual(out_len + added_hidden_states, len(outputs))

            self_attentions = outputs.encoder_attentions if config.is_encoder_decoder else outputs.attentions

            self.assertEqual(len(self_attentions), self.model_tester.num_hidden_layers)
            self.assertListEqual(
                list(self_attentions[0].shape[-3:]),
                [self.model_tester.num_attention_heads, encoder_seq_length, encoder_key_length],
            )

    def test_retain_grad_hidden_states_attentions(self):
        # removed retain_grad and grad on decoder_hidden_states, as queries don't require grad

        config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()
        config.output_hidden_states = True
        config.output_attentions = True

        # no need to test all models as different heads yield the same functionality
        model_class = self.all_model_classes[0]
        model = model_class(config)
        model.to(torch_device)

        inputs = self._prepare_for_class(inputs_dict, model_class)

        outputs = model(**inputs)

        output = outputs[0]

        encoder_hidden_states = outputs.encoder_hidden_states[0]
        encoder_attentions = outputs.encoder_attentions[0]
        encoder_hidden_states.retain_grad()
        encoder_attentions.retain_grad()

        decoder_attentions = outputs.decoder_attentions[0]
        decoder_attentions.retain_grad()

        cross_attentions = outputs.cross_attentions[0]
        cross_attentions.retain_grad()

        output.flatten()[0].backward(retain_graph=True)

        self.assertIsNotNone(encoder_hidden_states.grad)
        self.assertIsNotNone(encoder_attentions.grad)
        self.assertIsNotNone(decoder_attentions.grad)
        self.assertIsNotNone(cross_attentions.grad)

    def test_forward_signature(self):
        config, _ = self.model_tester.prepare_config_and_inputs_for_common()

        for model_class in self.all_model_classes:
            model = model_class(config)
            signature = inspect.signature(model.forward)
            # signature.parameters is an OrderedDict => so arg_names order is deterministic
            arg_names = [*signature.parameters.keys()]

            if model.config.is_encoder_decoder:
                expected_arg_names = ["pixel_values", "pixel_mask"]
                # check both names explicitly: `"a" and "b" in x` only tests "b"
                expected_arg_names.extend(
                    ["head_mask", "decoder_head_mask", "encoder_outputs"]
                    if "head_mask" in arg_names and "decoder_head_mask" in arg_names
                    else []
                )
                self.assertListEqual(arg_names[: len(expected_arg_names)], expected_arg_names)
            else:
                expected_arg_names = ["pixel_values", "pixel_mask"]
                # compare as many leading names as are expected (two here, not one)
                self.assertListEqual(arg_names[: len(expected_arg_names)], expected_arg_names)

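`test_forward_signature` reads the parameter names of `forward` with `inspect.signature` and compares the leading ones against an expected list. The sketch below runs the same kind of check on a hypothetical stand-in `forward` function (its parameter list is an assumption, not the model's real signature) and shows why membership of several names should be tested one name at a time:

```python
import inspect


def forward(pixel_values, pixel_mask=None, decoder_attention_mask=None, encoder_outputs=None):
    """Hypothetical stand-in for a model's forward method."""


# Parameter names come back in declaration order.
arg_names = list(inspect.signature(forward).parameters)

# `"a" and "b" in names` parses as `"a" and ("b" in names)`, so only "b" is tested.
only_last_checked = bool("head_mask" and "encoder_outputs" in arg_names)
# Testing each name explicitly gives the intended answer.
both_present = all(name in arg_names for name in ("head_mask", "encoder_outputs"))
```

Here `only_last_checked` is truthy even though `head_mask` is absent, while `both_present` correctly reports that not all names are in the signature.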
    def test_different_timm_backbone(self):
        config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()

        # let's pick a random timm backbone
        config.backbone = "tf_mobilenetv3_small_075"

        for model_class in self.all_model_classes:
            model = model_class(config)
            model.to(torch_device)
            model.eval()
            with torch.no_grad():
                outputs = model(**self._prepare_for_class(inputs_dict, model_class))

            if model_class.__name__ == "TableTransformerForObjectDetection":
                expected_shape = (
                    self.model_tester.batch_size,
                    self.model_tester.num_queries,
                    self.model_tester.num_labels + 1,
                )
                self.assertEqual(outputs.logits.shape, expected_shape)

            self.assertTrue(outputs)

    def test_greyscale_images(self):
        config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()

        # use greyscale pixel values
        inputs_dict["pixel_values"] = floats_tensor(
            [self.model_tester.batch_size, 1, self.model_tester.min_size, self.model_tester.max_size]
        )

        # let's set num_channels to 1
        config.num_channels = 1

        for model_class in self.all_model_classes:
            model = model_class(config)
            model.to(torch_device)
            model.eval()
            with torch.no_grad():
                outputs = model(**self._prepare_for_class(inputs_dict, model_class))

            self.assertTrue(outputs)

    def test_initialization(self):
        config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()

        configs_no_init = _config_zero_init(config)
        configs_no_init.init_xavier_std = 1e9

        for model_class in self.all_model_classes:
            model = model_class(config=configs_no_init)
            for name, param in model.named_parameters():
                if param.requires_grad:
                    if "bbox_attention" in name and "bias" not in name:
                        self.assertLess(
                            100000,
                            abs(param.data.max().item()),
                            msg=f"Parameter {name} of model {model_class} seems not properly initialized",
                        )
                    else:
                        self.assertIn(
                            ((param.data.mean() * 1e9).round() / 1e9).item(),
                            [0.0, 1.0],
                            msg=f"Parameter {name} of model {model_class} seems not properly initialized",
                        )


TOLERANCE = 1e-4


# We will verify our results on an image of cute cats
def prepare_img():
    image = Image.open("./tests/fixtures/tests_samples/COCO/000000039769.png")
    return image


@require_timm
@require_vision
@slow
class TableTransformerModelIntegrationTests(unittest.TestCase):
    def test_table_detection(self):
        feature_extractor = AutoFeatureExtractor.from_pretrained("microsoft/table-transformer-detection")
        model = TableTransformerForObjectDetection.from_pretrained("microsoft/table-transformer-detection")
        model.to(torch_device)

        file_path = hf_hub_download(repo_id="nielsr/example-pdf", repo_type="dataset", filename="example_pdf.png")
        image = Image.open(file_path).convert("RGB")
        inputs = feature_extractor(image, return_tensors="pt").to(torch_device)

        # forward pass
        with torch.no_grad():
            outputs = model(**inputs)

        expected_shape = (1, 15, 3)
        self.assertEqual(outputs.logits.shape, expected_shape)

        expected_logits = torch.tensor(
            [[-6.7329, -16.9590, 6.7447], [-8.0038, -22.3071, 6.9288], [-7.2445, -20.9855, 7.3465]],
            device=torch_device,
        )
        self.assertTrue(torch.allclose(outputs.logits[0, :3, :3], expected_logits, atol=1e-4))

        expected_boxes = torch.tensor(
            [[0.4868, 0.1764, 0.6729], [0.6674, 0.4621, 0.3864], [0.4720, 0.1757, 0.6362]], device=torch_device
        )
        self.assertTrue(torch.allclose(outputs.pred_boxes[0, :3, :3], expected_boxes, atol=1e-3))
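The integration test compares model outputs with reference values through `torch.allclose`, which accepts a pair of values when `|actual - expected| <= atol + rtol * |expected|` (with `rtol` defaulting to `1e-5`). A pure-Python sketch of that rule on flat lists, where `all_close` is a hypothetical helper and the "actual" numbers are made up for illustration:

```python
def all_close(actual, expected, atol=1e-4, rtol=1e-5):
    """Element-wise |a - e| <= atol + rtol * |e| over two flat lists."""
    return all(abs(a - e) <= atol + rtol * abs(e) for a, e in zip(actual, expected))


# Reference values: the first expected box in the detection test.
expected_box = [0.4868, 0.1764, 0.6729]
# Deviations of about 1e-5 pass with atol=1e-3.
within = all_close([0.48682, 0.17641, 0.67288], expected_box, atol=1e-3)
# A deviation of about 3e-3 in the first coordinate does not.
outside = all_close([0.4900, 0.1764, 0.6729], expected_box, atol=1e-3)
```

This is why the box comparison in the test, which uses `atol=1e-3`, is more permissive than the logits comparison at `atol=1e-4`.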
@@ -46,6 +46,8 @@ PRIVATE_MODELS = [
 # Being in this list is an exception and should **not** be the rule.
 IGNORE_NON_TESTED = PRIVATE_MODELS.copy() + [
     # models to ignore for not tested
+    "TableTransformerEncoder",  # Building part of bigger (tested) model.
+    "TableTransformerDecoder",  # Building part of bigger (tested) model.
     "TimeSeriesTransformerEncoder",  # Building part of bigger (tested) model.
     "TimeSeriesTransformerDecoder",  # Building part of bigger (tested) model.
     "DeformableDetrEncoder",  # Building part of bigger (tested) model.
@@ -116,6 +116,7 @@ src/transformers/models/segformer/modeling_tf_segformer.py
 src/transformers/models/swin/configuration_swin.py
 src/transformers/models/swin/modeling_swin.py
 src/transformers/models/swinv2/configuration_swinv2.py
+src/transformers/models/table_transformer/modeling_table_transformer.py
 src/transformers/models/time_series_transformer/configuration_time_series_transformer.py
 src/transformers/models/time_series_transformer/modeling_time_series_transformer.py
 src/transformers/models/trajectory_transformer/configuration_trajectory_transformer.py