Add the SEW and SEW-D speech models (#13962)
* Working encoder
* SEW-D and tests
* Further conv fixes
* Automodels and conv inits
* Update integration tests, add docs
* Docs cleanup, resolve todos
* Conf fix
* Fix docs
* Fix tests, apply suggestions
* Update src/transformers/models/sew/modeling_sew.py
  Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
* Model conversion and updated no-mask tests
* Remove copy of feature_proj
* Style
* Update src/transformers/models/auto/feature_extraction_auto.py
  Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
* Update src/transformers/models/auto/feature_extraction_auto.py
  Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
* Move orgs
  Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
@@ -267,6 +267,8 @@ Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih.
|
|||||||
1. **[RemBERT](https://huggingface.co/transformers/model_doc/rembert.html)** (from Google Research) released with the paper [Rethinking embedding coupling in pre-trained language models](https://arxiv.org/pdf/2010.12821.pdf) by Hyung Won Chung, Thibault Févry, Henry Tsai, M. Johnson, Sebastian Ruder.
|
1. **[RemBERT](https://huggingface.co/transformers/model_doc/rembert.html)** (from Google Research) released with the paper [Rethinking embedding coupling in pre-trained language models](https://arxiv.org/pdf/2010.12821.pdf) by Hyung Won Chung, Thibault Févry, Henry Tsai, M. Johnson, Sebastian Ruder.
|
||||||
1. **[RoBERTa](https://huggingface.co/transformers/model_doc/roberta.html)** (from Facebook), released together with the paper a [Robustly Optimized BERT Pretraining Approach](https://arxiv.org/abs/1907.11692) by Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, Veselin Stoyanov.
|
1. **[RoBERTa](https://huggingface.co/transformers/model_doc/roberta.html)** (from Facebook), released together with the paper a [Robustly Optimized BERT Pretraining Approach](https://arxiv.org/abs/1907.11692) by Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, Veselin Stoyanov.
|
||||||
1. **[RoFormer](https://huggingface.co/transformers/model_doc/roformer.html)** (from ZhuiyiTechnology), released together with the paper a [RoFormer: Enhanced Transformer with Rotary Position Embedding](https://arxiv.org/pdf/2104.09864v1.pdf) by Jianlin Su and Yu Lu and Shengfeng Pan and Bo Wen and Yunfeng Liu.
|
1. **[RoFormer](https://huggingface.co/transformers/model_doc/roformer.html)** (from ZhuiyiTechnology), released together with the paper a [RoFormer: Enhanced Transformer with Rotary Position Embedding](https://arxiv.org/pdf/2104.09864v1.pdf) by Jianlin Su and Yu Lu and Shengfeng Pan and Bo Wen and Yunfeng Liu.
|
||||||
|
1. **[SEW](https://huggingface.co/transformers/model_doc/sew.html)** (from ASAPP) released with the paper [Performance-Efficiency Trade-offs in Unsupervised Pre-training for Speech Recognition](https://arxiv.org/abs/2109.06870) by Felix Wu, Kwangyoun Kim, Jing Pan, Kyu Han, Kilian Q. Weinberger, Yoav Artzi.
|
||||||
|
1. **[SEW-D](https://huggingface.co/transformers/model_doc/sew_d.html)** (from ASAPP) released with the paper [Performance-Efficiency Trade-offs in Unsupervised Pre-training for Speech Recognition](https://arxiv.org/abs/2109.06870) by Felix Wu, Kwangyoun Kim, Jing Pan, Kyu Han, Kilian Q. Weinberger, Yoav Artzi.
|
||||||
1. **[SpeechToTextTransformer](https://huggingface.co/transformers/model_doc/speech_to_text.html)** (from Facebook), released together with the paper [fairseq S2T: Fast Speech-to-Text Modeling with fairseq](https://arxiv.org/abs/2010.05171) by Changhan Wang, Yun Tang, Xutai Ma, Anne Wu, Dmytro Okhonko, Juan Pino.
|
1. **[SpeechToTextTransformer](https://huggingface.co/transformers/model_doc/speech_to_text.html)** (from Facebook), released together with the paper [fairseq S2T: Fast Speech-to-Text Modeling with fairseq](https://arxiv.org/abs/2010.05171) by Changhan Wang, Yun Tang, Xutai Ma, Anne Wu, Dmytro Okhonko, Juan Pino.
|
||||||
1. **[SpeechToTextTransformer2](https://huggingface.co/transformers/model_doc/speech_to_text_2.html)** (from Facebook), released together with the paper [Large-Scale Self- and Semi-Supervised Learning for Speech Translation](https://arxiv.org/abs/2104.06678) by Changhan Wang, Anne Wu, Juan Pino, Alexei Baevski, Michael Auli, Alexis Conneau.
|
1. **[SpeechToTextTransformer2](https://huggingface.co/transformers/model_doc/speech_to_text_2.html)** (from Facebook), released together with the paper [Large-Scale Self- and Semi-Supervised Learning for Speech Translation](https://arxiv.org/abs/2104.06678) by Changhan Wang, Anne Wu, Juan Pino, Alexei Baevski, Michael Auli, Alexis Conneau.
|
||||||
1. **[Splinter](https://huggingface.co/transformers/model_doc/splinter.html)** (from Tel Aviv University), released together with the paper [Few-Shot Question Answering by Pretraining Span Selection](https://arxiv.org/abs/2101.00438) by Ori Ram, Yuval Kirstain, Jonathan Berant, Amir Globerson, Omer Levy.
|
1. **[Splinter](https://huggingface.co/transformers/model_doc/splinter.html)** (from Tel Aviv University), released together with the paper [Few-Shot Question Answering by Pretraining Span Selection](https://arxiv.org/abs/2101.00438) by Ori Ram, Yuval Kirstain, Jonathan Berant, Amir Globerson, Omer Levy.
|
||||||
|
|||||||
@@ -289,6 +289,8 @@ conda install -c huggingface transformers
|
|||||||
1. **[RemBERT](https://huggingface.co/transformers/model_doc/rembert.html)** (来自 Google Research) 伴随论文 [Rethinking embedding coupling in pre-trained language models](https://arxiv.org/pdf/2010.12821.pdf) 由 Hyung Won Chung, Thibault Févry, Henry Tsai, M. Johnson, Sebastian Ruder 发布。
|
1. **[RemBERT](https://huggingface.co/transformers/model_doc/rembert.html)** (来自 Google Research) 伴随论文 [Rethinking embedding coupling in pre-trained language models](https://arxiv.org/pdf/2010.12821.pdf) 由 Hyung Won Chung, Thibault Févry, Henry Tsai, M. Johnson, Sebastian Ruder 发布。
|
||||||
1. **[RoBERTa](https://huggingface.co/transformers/model_doc/roberta.html)** (来自 Facebook), 伴随论文 [Robustly Optimized BERT Pretraining Approach](https://arxiv.org/abs/1907.11692) 由 Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, Veselin Stoyanov 发布。
|
1. **[RoBERTa](https://huggingface.co/transformers/model_doc/roberta.html)** (来自 Facebook), 伴随论文 [Robustly Optimized BERT Pretraining Approach](https://arxiv.org/abs/1907.11692) 由 Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, Veselin Stoyanov 发布。
|
||||||
1. **[RoFormer](https://huggingface.co/transformers/model_doc/roformer.html)** (来自 ZhuiyiTechnology), 伴随论文 [RoFormer: Enhanced Transformer with Rotary Position Embedding](https://arxiv.org/pdf/2104.09864v1.pdf) 由 Jianlin Su and Yu Lu and Shengfeng Pan and Bo Wen and Yunfeng Liu 发布。
|
1. **[RoFormer](https://huggingface.co/transformers/model_doc/roformer.html)** (来自 ZhuiyiTechnology), 伴随论文 [RoFormer: Enhanced Transformer with Rotary Position Embedding](https://arxiv.org/pdf/2104.09864v1.pdf) 由 Jianlin Su and Yu Lu and Shengfeng Pan and Bo Wen and Yunfeng Liu 发布。
|
||||||
|
1. **[SEW](https://huggingface.co/transformers/model_doc/sew.html)** (来自 ASAPP) 伴随论文 [Performance-Efficiency Trade-offs in Unsupervised Pre-training for Speech Recognition](https://arxiv.org/abs/2109.06870) 由 Felix Wu, Kwangyoun Kim, Jing Pan, Kyu Han, Kilian Q. Weinberger, Yoav Artzi 发布。
|
||||||
|
1. **[SEW-D](https://huggingface.co/transformers/model_doc/sew_d.html)** (来自 ASAPP) 伴随论文 [Performance-Efficiency Trade-offs in Unsupervised Pre-training for Speech Recognition](https://arxiv.org/abs/2109.06870) 由 Felix Wu, Kwangyoun Kim, Jing Pan, Kyu Han, Kilian Q. Weinberger, Yoav Artzi 发布。
|
||||||
1. **[SpeechEncoderDecoder](https://huggingface.co/transformers/model_doc/speechencoderdecoder.html)**
|
1. **[SpeechEncoderDecoder](https://huggingface.co/transformers/model_doc/speechencoderdecoder.html)**
|
||||||
1. **[SpeechToTextTransformer](https://huggingface.co/transformers/model_doc/speech_to_text.html)** (来自 Facebook), 伴随论文 [fairseq S2T: Fast Speech-to-Text Modeling with fairseq](https://arxiv.org/abs/2010.05171) 由 Changhan Wang, Yun Tang, Xutai Ma, Anne Wu, Dmytro Okhonko, Juan Pino 发布。
|
1. **[SpeechToTextTransformer](https://huggingface.co/transformers/model_doc/speech_to_text.html)** (来自 Facebook), 伴随论文 [fairseq S2T: Fast Speech-to-Text Modeling with fairseq](https://arxiv.org/abs/2010.05171) 由 Changhan Wang, Yun Tang, Xutai Ma, Anne Wu, Dmytro Okhonko, Juan Pino 发布。
|
||||||
1. **[SpeechToTextTransformer2](https://huggingface.co/transformers/model_doc/speech_to_text_2.html)** (来自 Facebook) 伴随论文 [Large-Scale Self- and Semi-Supervised Learning for Speech Translation](https://arxiv.org/abs/2104.06678) 由 Changhan Wang, Anne Wu, Juan Pino, Alexei Baevski, Michael Auli, Alexis Conneau 发布。
|
1. **[SpeechToTextTransformer2](https://huggingface.co/transformers/model_doc/speech_to_text_2.html)** (来自 Facebook) 伴随论文 [Large-Scale Self- and Semi-Supervised Learning for Speech Translation](https://arxiv.org/abs/2104.06678) 由 Changhan Wang, Anne Wu, Juan Pino, Alexei Baevski, Michael Auli, Alexis Conneau 发布。
|
||||||
|
|||||||
@@ -301,6 +301,8 @@ conda install -c huggingface transformers
|
|||||||
1. **[RemBERT](https://huggingface.co/transformers/model_doc/rembert.html)** (from Google Research) released with the paper [Rethinking embedding coupling in pre-trained language models](https://arxiv.org/pdf/2010.12821.pdf) by Hyung Won Chung, Thibault Févry, Henry Tsai, M. Johnson, Sebastian Ruder.
|
1. **[RemBERT](https://huggingface.co/transformers/model_doc/rembert.html)** (from Google Research) released with the paper [Rethinking embedding coupling in pre-trained language models](https://arxiv.org/pdf/2010.12821.pdf) by Hyung Won Chung, Thibault Févry, Henry Tsai, M. Johnson, Sebastian Ruder.
|
||||||
1. **[RoBERTa](https://huggingface.co/transformers/model_doc/roberta.html)** (from Facebook), released together with the paper a [Robustly Optimized BERT Pretraining Approach](https://arxiv.org/abs/1907.11692) by Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, Veselin Stoyanov.
|
1. **[RoBERTa](https://huggingface.co/transformers/model_doc/roberta.html)** (from Facebook), released together with the paper a [Robustly Optimized BERT Pretraining Approach](https://arxiv.org/abs/1907.11692) by Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, Veselin Stoyanov.
|
||||||
1. **[RoFormer](https://huggingface.co/transformers/model_doc/roformer.html)** (from ZhuiyiTechnology), released together with the paper a [RoFormer: Enhanced Transformer with Rotary Position Embedding](https://arxiv.org/pdf/2104.09864v1.pdf) by Jianlin Su and Yu Lu and Shengfeng Pan and Bo Wen and Yunfeng Liu.
|
1. **[RoFormer](https://huggingface.co/transformers/model_doc/roformer.html)** (from ZhuiyiTechnology), released together with the paper a [RoFormer: Enhanced Transformer with Rotary Position Embedding](https://arxiv.org/pdf/2104.09864v1.pdf) by Jianlin Su and Yu Lu and Shengfeng Pan and Bo Wen and Yunfeng Liu.
|
||||||
|
1. **[SEW](https://huggingface.co/transformers/model_doc/sew.html)** (from ASAPP) released with the paper [Performance-Efficiency Trade-offs in Unsupervised Pre-training for Speech Recognition](https://arxiv.org/abs/2109.06870) by Felix Wu, Kwangyoun Kim, Jing Pan, Kyu Han, Kilian Q. Weinberger, Yoav Artzi.
|
||||||
|
1. **[SEW-D](https://huggingface.co/transformers/model_doc/sew_d.html)** (from ASAPP) released with the paper [Performance-Efficiency Trade-offs in Unsupervised Pre-training for Speech Recognition](https://arxiv.org/abs/2109.06870) by Felix Wu, Kwangyoun Kim, Jing Pan, Kyu Han, Kilian Q. Weinberger, Yoav Artzi.
|
||||||
1. **[SpeechEncoderDecoder](https://huggingface.co/transformers/model_doc/speechencoderdecoder.html)**
|
1. **[SpeechEncoderDecoder](https://huggingface.co/transformers/model_doc/speechencoderdecoder.html)**
|
||||||
1. **[SpeechToTextTransformer](https://huggingface.co/transformers/model_doc/speech_to_text.html)** (from Facebook), released together with the paper [fairseq S2T: Fast Speech-to-Text Modeling with fairseq](https://arxiv.org/abs/2010.05171) by Changhan Wang, Yun Tang, Xutai Ma, Anne Wu, Dmytro Okhonko, Juan Pino.
|
1. **[SpeechToTextTransformer](https://huggingface.co/transformers/model_doc/speech_to_text.html)** (from Facebook), released together with the paper [fairseq S2T: Fast Speech-to-Text Modeling with fairseq](https://arxiv.org/abs/2010.05171) by Changhan Wang, Yun Tang, Xutai Ma, Anne Wu, Dmytro Okhonko, Juan Pino.
|
||||||
1. **[SpeechToTextTransformer2](https://huggingface.co/transformers/model_doc/speech_to_text_2.html)** (from Facebook) released with the paper [Large-Scale Self- and Semi-Supervised Learning for Speech Translation](https://arxiv.org/abs/2104.06678) by Changhan Wang, Anne Wu, Juan Pino, Alexei Baevski, Michael Auli, Alexis Conneau.
|
1. **[SpeechToTextTransformer2](https://huggingface.co/transformers/model_doc/speech_to_text_2.html)** (from Facebook) released with the paper [Large-Scale Self- and Semi-Supervised Learning for Speech Translation](https://arxiv.org/abs/2104.06678) by Changhan Wang, Anne Wu, Juan Pino, Alexei Baevski, Michael Auli, Alexis Conneau.
|
||||||
|
|||||||
@@ -268,59 +268,65 @@ Supported models
|
|||||||
57. :doc:`RoFormer <model_doc/roformer>` (from ZhuiyiTechnology), released together with the paper a `RoFormer:
|
57. :doc:`RoFormer <model_doc/roformer>` (from ZhuiyiTechnology), released together with the paper a `RoFormer:
|
||||||
Enhanced Transformer with Rotary Position Embedding <https://arxiv.org/pdf/2104.09864v1.pdf>`__ by Jianlin Su and
|
Enhanced Transformer with Rotary Position Embedding <https://arxiv.org/pdf/2104.09864v1.pdf>`__ by Jianlin Su and
|
||||||
Yu Lu and Shengfeng Pan and Bo Wen and Yunfeng Liu.
|
Yu Lu and Shengfeng Pan and Bo Wen and Yunfeng Liu.
|
||||||
58. :doc:`SpeechToTextTransformer <model_doc/speech_to_text>` (from Facebook), released together with the paper
|
58. :doc:`SEW <model_doc/sew>` (from ASAPP) released with the paper `Performance-Efficiency Trade-offs in Unsupervised
|
||||||
|
Pre-training for Speech Recognition <https://arxiv.org/abs/2109.06870>`__ by Felix Wu, Kwangyoun Kim, Jing Pan, Kyu
|
||||||
|
Han, Kilian Q. Weinberger, Yoav Artzi.
|
||||||
|
59. :doc:`SEW-D <model_doc/sew_d>` (from ASAPP) released with the paper `Performance-Efficiency Trade-offs in
|
||||||
|
Unsupervised Pre-training for Speech Recognition <https://arxiv.org/abs/2109.06870>`__ by Felix Wu, Kwangyoun Kim,
|
||||||
|
Jing Pan, Kyu Han, Kilian Q. Weinberger, Yoav Artzi.
|
||||||
|
60. :doc:`SpeechToTextTransformer <model_doc/speech_to_text>` (from Facebook), released together with the paper
|
||||||
`fairseq S2T: Fast Speech-to-Text Modeling with fairseq <https://arxiv.org/abs/2010.05171>`__ by Changhan Wang, Yun
|
`fairseq S2T: Fast Speech-to-Text Modeling with fairseq <https://arxiv.org/abs/2010.05171>`__ by Changhan Wang, Yun
|
||||||
Tang, Xutai Ma, Anne Wu, Dmytro Okhonko, Juan Pino.
|
Tang, Xutai Ma, Anne Wu, Dmytro Okhonko, Juan Pino.
|
||||||
59. :doc:`SpeechToTextTransformer2 <model_doc/speech_to_text_2>` (from Facebook), released together with the paper
|
61. :doc:`SpeechToTextTransformer2 <model_doc/speech_to_text_2>` (from Facebook), released together with the paper
|
||||||
`Large-Scale Self- and Semi-Supervised Learning for Speech Translation <https://arxiv.org/abs/2104.06678>`__ by
|
`Large-Scale Self- and Semi-Supervised Learning for Speech Translation <https://arxiv.org/abs/2104.06678>`__ by
|
||||||
Changhan Wang, Anne Wu, Juan Pino, Alexei Baevski, Michael Auli, Alexis Conneau.
|
Changhan Wang, Anne Wu, Juan Pino, Alexei Baevski, Michael Auli, Alexis Conneau.
|
||||||
60. :doc:`Splinter <model_doc/splinter>` (from Tel Aviv University), released together with the paper `Few-Shot
|
62. :doc:`Splinter <model_doc/splinter>` (from Tel Aviv University), released together with the paper `Few-Shot
|
||||||
Question Answering by Pretraining Span Selection <https://arxiv.org/abs/2101.00438>`__ by Ori Ram, Yuval Kirstain,
|
Question Answering by Pretraining Span Selection <https://arxiv.org/abs/2101.00438>`__ by Ori Ram, Yuval Kirstain,
|
||||||
Jonathan Berant, Amir Globerson, Omer Levy.
|
Jonathan Berant, Amir Globerson, Omer Levy.
|
||||||
61. :doc:`SqueezeBert <model_doc/squeezebert>` (from Berkeley) released with the paper `SqueezeBERT: What can computer
|
63. :doc:`SqueezeBert <model_doc/squeezebert>` (from Berkeley) released with the paper `SqueezeBERT: What can computer
|
||||||
vision teach NLP about efficient neural networks? <https://arxiv.org/abs/2006.11316>`__ by Forrest N. Iandola,
|
vision teach NLP about efficient neural networks? <https://arxiv.org/abs/2006.11316>`__ by Forrest N. Iandola,
|
||||||
Albert E. Shaw, Ravi Krishna, and Kurt W. Keutzer.
|
Albert E. Shaw, Ravi Krishna, and Kurt W. Keutzer.
|
||||||
62. :doc:`T5 <model_doc/t5>` (from Google AI) released with the paper `Exploring the Limits of Transfer Learning with a
|
64. :doc:`T5 <model_doc/t5>` (from Google AI) released with the paper `Exploring the Limits of Transfer Learning with a
|
||||||
Unified Text-to-Text Transformer <https://arxiv.org/abs/1910.10683>`__ by Colin Raffel and Noam Shazeer and Adam
|
Unified Text-to-Text Transformer <https://arxiv.org/abs/1910.10683>`__ by Colin Raffel and Noam Shazeer and Adam
|
||||||
Roberts and Katherine Lee and Sharan Narang and Michael Matena and Yanqi Zhou and Wei Li and Peter J. Liu.
|
Roberts and Katherine Lee and Sharan Narang and Michael Matena and Yanqi Zhou and Wei Li and Peter J. Liu.
|
||||||
63. :doc:`T5v1.1 <model_doc/t5v1.1>` (from Google AI) released in the repository
|
65. :doc:`T5v1.1 <model_doc/t5v1.1>` (from Google AI) released in the repository
|
||||||
`google-research/text-to-text-transfer-transformer
|
`google-research/text-to-text-transfer-transformer
|
||||||
<https://github.com/google-research/text-to-text-transfer-transformer/blob/main/released_checkpoints.md#t511>`__ by
|
<https://github.com/google-research/text-to-text-transfer-transformer/blob/main/released_checkpoints.md#t511>`__ by
|
||||||
Colin Raffel and Noam Shazeer and Adam Roberts and Katherine Lee and Sharan Narang and Michael Matena and Yanqi
|
Colin Raffel and Noam Shazeer and Adam Roberts and Katherine Lee and Sharan Narang and Michael Matena and Yanqi
|
||||||
Zhou and Wei Li and Peter J. Liu.
|
Zhou and Wei Li and Peter J. Liu.
|
||||||
64. :doc:`TAPAS <model_doc/tapas>` (from Google AI) released with the paper `TAPAS: Weakly Supervised Table Parsing via
|
66. :doc:`TAPAS <model_doc/tapas>` (from Google AI) released with the paper `TAPAS: Weakly Supervised Table Parsing via
|
||||||
Pre-training <https://arxiv.org/abs/2004.02349>`__ by Jonathan Herzig, Paweł Krzysztof Nowak, Thomas Müller,
|
Pre-training <https://arxiv.org/abs/2004.02349>`__ by Jonathan Herzig, Paweł Krzysztof Nowak, Thomas Müller,
|
||||||
Francesco Piccinno and Julian Martin Eisenschlos.
|
Francesco Piccinno and Julian Martin Eisenschlos.
|
||||||
65. :doc:`Transformer-XL <model_doc/transformerxl>` (from Google/CMU) released with the paper `Transformer-XL:
|
67. :doc:`Transformer-XL <model_doc/transformerxl>` (from Google/CMU) released with the paper `Transformer-XL:
|
||||||
Attentive Language Models Beyond a Fixed-Length Context <https://arxiv.org/abs/1901.02860>`__ by Zihang Dai*,
|
Attentive Language Models Beyond a Fixed-Length Context <https://arxiv.org/abs/1901.02860>`__ by Zihang Dai*,
|
||||||
Zhilin Yang*, Yiming Yang, Jaime Carbonell, Quoc V. Le, Ruslan Salakhutdinov.
|
Zhilin Yang*, Yiming Yang, Jaime Carbonell, Quoc V. Le, Ruslan Salakhutdinov.
|
||||||
66. `TrOCR <https://huggingface.co/transformers/master/model_doc/trocr.html>`__ (from Microsoft), released together
|
68. `TrOCR <https://huggingface.co/transformers/master/model_doc/trocr.html>`__ (from Microsoft), released together
|
||||||
with the paper `TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models
|
with the paper `TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models
|
||||||
<https://arxiv.org/abs/2109.10282>`__ by Minghao Li, Tengchao Lv, Lei Cui, Yijuan Lu, Dinei Florencio, Cha Zhang,
|
<https://arxiv.org/abs/2109.10282>`__ by Minghao Li, Tengchao Lv, Lei Cui, Yijuan Lu, Dinei Florencio, Cha Zhang,
|
||||||
Zhoujun Li, Furu Wei.
|
Zhoujun Li, Furu Wei.
|
||||||
67. :doc:`Vision Transformer (ViT) <model_doc/vit>` (from Google AI) released with the paper `An Image is Worth 16x16
|
69. :doc:`Vision Transformer (ViT) <model_doc/vit>` (from Google AI) released with the paper `An Image is Worth 16x16
|
||||||
Words: Transformers for Image Recognition at Scale <https://arxiv.org/abs/2010.11929>`__ by Alexey Dosovitskiy,
|
Words: Transformers for Image Recognition at Scale <https://arxiv.org/abs/2010.11929>`__ by Alexey Dosovitskiy,
|
||||||
Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias
|
Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias
|
||||||
Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby.
|
Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby.
|
||||||
68. :doc:`VisualBERT <model_doc/visual_bert>` (from UCLA NLP) released with the paper `VisualBERT: A Simple and
|
70. :doc:`VisualBERT <model_doc/visual_bert>` (from UCLA NLP) released with the paper `VisualBERT: A Simple and
|
||||||
Performant Baseline for Vision and Language <https://arxiv.org/pdf/1908.03557>`__ by Liunian Harold Li, Mark
|
Performant Baseline for Vision and Language <https://arxiv.org/pdf/1908.03557>`__ by Liunian Harold Li, Mark
|
||||||
Yatskar, Da Yin, Cho-Jui Hsieh, Kai-Wei Chang.
|
Yatskar, Da Yin, Cho-Jui Hsieh, Kai-Wei Chang.
|
||||||
69. :doc:`Wav2Vec2 <model_doc/wav2vec2>` (from Facebook AI) released with the paper `wav2vec 2.0: A Framework for
|
71. :doc:`Wav2Vec2 <model_doc/wav2vec2>` (from Facebook AI) released with the paper `wav2vec 2.0: A Framework for
|
||||||
Self-Supervised Learning of Speech Representations <https://arxiv.org/abs/2006.11477>`__ by Alexei Baevski, Henry
|
Self-Supervised Learning of Speech Representations <https://arxiv.org/abs/2006.11477>`__ by Alexei Baevski, Henry
|
||||||
Zhou, Abdelrahman Mohamed, Michael Auli.
|
Zhou, Abdelrahman Mohamed, Michael Auli.
|
||||||
70. :doc:`XLM <model_doc/xlm>` (from Facebook) released together with the paper `Cross-lingual Language Model
|
72. :doc:`XLM <model_doc/xlm>` (from Facebook) released together with the paper `Cross-lingual Language Model
|
||||||
Pretraining <https://arxiv.org/abs/1901.07291>`__ by Guillaume Lample and Alexis Conneau.
|
Pretraining <https://arxiv.org/abs/1901.07291>`__ by Guillaume Lample and Alexis Conneau.
|
||||||
71. :doc:`XLM-ProphetNet <model_doc/xlmprophetnet>` (from Microsoft Research) released with the paper `ProphetNet:
|
73. :doc:`XLM-ProphetNet <model_doc/xlmprophetnet>` (from Microsoft Research) released with the paper `ProphetNet:
|
||||||
Predicting Future N-gram for Sequence-to-Sequence Pre-training <https://arxiv.org/abs/2001.04063>`__ by Yu Yan,
|
Predicting Future N-gram for Sequence-to-Sequence Pre-training <https://arxiv.org/abs/2001.04063>`__ by Yu Yan,
|
||||||
Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang and Ming Zhou.
|
Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang and Ming Zhou.
|
||||||
72. :doc:`XLM-RoBERTa <model_doc/xlmroberta>` (from Facebook AI), released together with the paper `Unsupervised
|
74. :doc:`XLM-RoBERTa <model_doc/xlmroberta>` (from Facebook AI), released together with the paper `Unsupervised
|
||||||
Cross-lingual Representation Learning at Scale <https://arxiv.org/abs/1911.02116>`__ by Alexis Conneau*, Kartikay
|
Cross-lingual Representation Learning at Scale <https://arxiv.org/abs/1911.02116>`__ by Alexis Conneau*, Kartikay
|
||||||
Khandelwal*, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke
|
Khandelwal*, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke
|
||||||
Zettlemoyer and Veselin Stoyanov.
|
Zettlemoyer and Veselin Stoyanov.
|
||||||
73. :doc:`XLNet <model_doc/xlnet>` (from Google/CMU) released with the paper `XLNet: Generalized Autoregressive
|
75. :doc:`XLNet <model_doc/xlnet>` (from Google/CMU) released with the paper `XLNet: Generalized Autoregressive
|
||||||
Pretraining for Language Understanding <https://arxiv.org/abs/1906.08237>`__ by Zhilin Yang*, Zihang Dai*, Yiming
|
Pretraining for Language Understanding <https://arxiv.org/abs/1906.08237>`__ by Zhilin Yang*, Zihang Dai*, Yiming
|
||||||
Yang, Jaime Carbonell, Ruslan Salakhutdinov, Quoc V. Le.
|
Yang, Jaime Carbonell, Ruslan Salakhutdinov, Quoc V. Le.
|
||||||
74. :doc:`XLSR-Wav2Vec2 <model_doc/xlsr_wav2vec2>` (from Facebook AI) released with the paper `Unsupervised
|
76. :doc:`XLSR-Wav2Vec2 <model_doc/xlsr_wav2vec2>` (from Facebook AI) released with the paper `Unsupervised
|
||||||
Cross-Lingual Representation Learning For Speech Recognition <https://arxiv.org/abs/2006.13979>`__ by Alexis
|
Cross-Lingual Representation Learning For Speech Recognition <https://arxiv.org/abs/2006.13979>`__ by Alexis
|
||||||
Conneau, Alexei Baevski, Ronan Collobert, Abdelrahman Mohamed, Michael Auli.
|
Conneau, Alexei Baevski, Ronan Collobert, Abdelrahman Mohamed, Michael Auli.
|
||||||
|
|
||||||
@@ -446,6 +452,10 @@ Flax), PyTorch, and/or TensorFlow.
|
|||||||
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
|
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
|
||||||
| RoFormer | ✅ | ✅ | ✅ | ✅ | ❌ |
|
| RoFormer | ✅ | ✅ | ✅ | ✅ | ❌ |
|
||||||
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
|
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
|
||||||
|
| SEW | ❌ | ❌ | ✅ | ❌ | ❌ |
|
||||||
|
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
|
||||||
|
| SEW-D | ❌ | ❌ | ✅ | ❌ | ❌ |
|
||||||
|
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
|
||||||
| Speech Encoder decoder | ❌ | ❌ | ✅ | ❌ | ❌ |
|
| Speech Encoder decoder | ❌ | ❌ | ✅ | ❌ | ❌ |
|
||||||
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
|
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
|
||||||
| Speech2Text | ✅ | ❌ | ✅ | ❌ | ❌ |
|
| Speech2Text | ✅ | ❌ | ✅ | ❌ | ❌ |
|
||||||
@@ -621,6 +631,8 @@ Flax), PyTorch, and/or TensorFlow.
|
|||||||
model_doc/retribert
|
model_doc/retribert
|
||||||
model_doc/roberta
|
model_doc/roberta
|
||||||
model_doc/roformer
|
model_doc/roformer
|
||||||
|
model_doc/sew
|
||||||
|
model_doc/sew_d
|
||||||
model_doc/speechencoderdecoder
|
model_doc/speechencoderdecoder
|
||||||
model_doc/speech_to_text
|
model_doc/speech_to_text
|
||||||
model_doc/speech_to_text_2
|
model_doc/speech_to_text_2
|
||||||
|
|||||||
61
docs/source/model_doc/sew.rst
Normal file
@@ -0,0 +1,61 @@
|
|||||||
|
..
|
||||||
|
Copyright 2021 The HuggingFace Team. All rights reserved.
|
||||||
|
|
||||||
|
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
|
||||||
|
the License. You may obtain a copy of the License at
|
||||||
|
|
||||||
|
http://www.apache.org/licenses/LICENSE-2.0
|
||||||
|
|
||||||
|
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
|
||||||
|
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
|
||||||
|
specific language governing permissions and limitations under the License.
|
||||||
|
|
||||||
|
SEW
|
||||||
|
-----------------------------------------------------------------------------------------------------------------------
|
||||||
|
|
||||||
|
Overview
|
||||||
|
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||||
|
|
||||||
|
SEW (Squeezed and Efficient Wav2Vec) was proposed in `Performance-Efficiency Trade-offs in Unsupervised Pre-training
|
||||||
|
for Speech Recognition <https://arxiv.org/abs/2109.06870>`__ by Felix Wu, Kwangyoun Kim, Jing Pan, Kyu Han, Kilian Q.
|
||||||
|
Weinberger, Yoav Artzi.
|
||||||
|
|
||||||
|
The abstract from the paper is the following:
|
||||||
|
|
||||||
|
*This paper is a study of performance-efficiency trade-offs in pre-trained models for automatic speech recognition
|
||||||
|
(ASR). We focus on wav2vec 2.0, and formalize several architecture designs that influence both the model performance
|
||||||
|
and its efficiency. Putting together all our observations, we introduce SEW (Squeezed and Efficient Wav2vec), a
|
||||||
|
pre-trained model architecture with significant improvements along both performance and efficiency dimensions across a
|
||||||
|
variety of training setups. For example, under the 100h-960h semi-supervised setup on LibriSpeech, SEW achieves a 1.9x
|
||||||
|
inference speedup compared to wav2vec 2.0, with a 13.5% relative reduction in word error rate. With a similar inference
|
||||||
|
time, SEW reduces word error rate by 25-50% across different model sizes.*
|
||||||
|
|
||||||
|
Tips:
|
||||||
|
|
||||||
|
- SEW is a speech model that accepts a float array corresponding to the raw waveform of the speech signal.
|
||||||
|
- SEWForCTC is fine-tuned using connectionist temporal classification (CTC) so the model output has to be decoded using
|
||||||
|
:class:`~transformers.Wav2Vec2CTCTokenizer`.
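
To illustrate what "decoded" means in the second tip, here is a minimal sketch of greedy CTC decoding in plain Python. It is not the implementation used by :class:`~transformers.Wav2Vec2CTCTokenizer`; the toy vocabulary, the choice of index 0 as the blank token, the ``|`` word delimiter, and the function name ``greedy_ctc_decode`` are all assumptions made for this example. The input stands in for the per-frame argmax ids that a CTC head would produce from the raw waveform.

```python
from itertools import groupby

# Hypothetical toy vocabulary; index 0 plays the role of the CTC blank token.
VOCAB = ["<pad>", "h", "e", "l", "o", "|"]
BLANK_ID = 0


def greedy_ctc_decode(frame_ids, vocab=VOCAB, blank_id=BLANK_ID, word_delimiter="|"):
    """Collapse repeated frame predictions, drop blanks, and join characters."""
    # 1. Merge each run of identical consecutive ids into a single id.
    collapsed = [token_id for token_id, _ in groupby(frame_ids)]
    # 2. Drop the blank token, whose job is to separate genuine repeats.
    kept = [token_id for token_id in collapsed if token_id != blank_id]
    # 3. Map ids to characters and turn the word delimiter into a space.
    text = "".join(vocab[token_id] for token_id in kept)
    return text.replace(word_delimiter, " ").strip()


# Per-frame argmax ids, as a CTC head might emit them for the word "hello".
frames = [1, 1, 0, 2, 2, 3, 0, 3, 3, 4, 0, 0]
print(greedy_ctc_decode(frames))
```

The blank between the two runs of ``l`` is what keeps the double letter: without it, step 1 would merge them into one character.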
|
||||||
|
|
||||||
|
This model was contributed by `anton-l <https://huggingface.co/anton-l>`__.
|
||||||
|
|
||||||
|
|
||||||
|
SEWConfig
|
||||||
|
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||||
|
|
||||||
|
.. autoclass:: transformers.SEWConfig
|
||||||
|
:members:
|
||||||
|
|
||||||
|
|
||||||
|
SEWModel
|
||||||
|
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||||
|
|
||||||
|
.. autoclass:: transformers.SEWModel
|
||||||
|
:members: forward
|
||||||
|
|
||||||
|
|
||||||
|
SEWForCTC
|
||||||
|
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||||
|
|
||||||
|
.. autoclass:: transformers.SEWForCTC
|
||||||
|
:members: forward
|
||||||
|
|
||||||
61
docs/source/model_doc/sew_d.rst
Normal file
@@ -0,0 +1,61 @@
..
    Copyright 2021 The HuggingFace Team. All rights reserved.

    Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
    the License. You may obtain a copy of the License at

        http://www.apache.org/licenses/LICENSE-2.0

    Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
    an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
    specific language governing permissions and limitations under the License.

SEW-D
-----------------------------------------------------------------------------------------------------------------------

Overview
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

SEW-D (Squeezed and Efficient Wav2Vec with Disentangled attention) was proposed in `Performance-Efficiency Trade-offs
in Unsupervised Pre-training for Speech Recognition <https://arxiv.org/abs/2109.06870>`__ by Felix Wu, Kwangyoun Kim,
Jing Pan, Kyu Han, Kilian Q. Weinberger, Yoav Artzi.

The abstract from the paper is the following:

*This paper is a study of performance-efficiency trade-offs in pre-trained models for automatic speech recognition
(ASR). We focus on wav2vec 2.0, and formalize several architecture designs that influence both the model performance
and its efficiency. Putting together all our observations, we introduce SEW (Squeezed and Efficient Wav2vec), a
pre-trained model architecture with significant improvements along both performance and efficiency dimensions across a
variety of training setups. For example, under the 100h-960h semi-supervised setup on LibriSpeech, SEW achieves a 1.9x
inference speedup compared to wav2vec 2.0, with a 13.5% relative reduction in word error rate. With a similar inference
time, SEW reduces word error rate by 25-50% across different model sizes.*

Tips:

- SEW-D is a speech model that accepts a float array corresponding to the raw waveform of the speech signal.
- SEWDForCTC is fine-tuned using connectionist temporal classification (CTC) so the model output has to be decoded
  using :class:`~transformers.Wav2Vec2CTCTokenizer`.

This model was contributed by `anton-l <https://huggingface.co/anton-l>`__.


SEWDConfig
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.SEWDConfig
    :members:


SEWDModel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.SEWDModel
    :members: forward


SEWDForCTC
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.SEWDForCTC
    :members: forward

@@ -251,6 +251,8 @@ _import_structure = {
     "models.retribert": ["RETRIBERT_PRETRAINED_CONFIG_ARCHIVE_MAP", "RetriBertConfig", "RetriBertTokenizer"],
     "models.roberta": ["ROBERTA_PRETRAINED_CONFIG_ARCHIVE_MAP", "RobertaConfig", "RobertaTokenizer"],
     "models.roformer": ["ROFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP", "RoFormerConfig", "RoFormerTokenizer"],
+    "models.sew": ["SEW_PRETRAINED_CONFIG_ARCHIVE_MAP", "SEWConfig"],
+    "models.sew_d": ["SEW_D_PRETRAINED_CONFIG_ARCHIVE_MAP", "SEWDConfig"],
     "models.speech_encoder_decoder": ["SpeechEncoderDecoderConfig"],
     "models.speech_to_text": [
         "SPEECH_TO_TEXT_PRETRAINED_CONFIG_ARCHIVE_MAP",
@@ -1135,6 +1137,22 @@ if is_torch_available():
             "load_tf_weights_in_roformer",
         ]
     )
+    _import_structure["models.sew"].extend(
+        [
+            "SEW_PRETRAINED_MODEL_ARCHIVE_LIST",
+            "SEWForCTC",
+            "SEWModel",
+            "SEWPreTrainedModel",
+        ]
+    )
+    _import_structure["models.sew_d"].extend(
+        [
+            "SEW_D_PRETRAINED_MODEL_ARCHIVE_LIST",
+            "SEWDForCTC",
+            "SEWDModel",
+            "SEWDPreTrainedModel",
+        ]
+    )
     _import_structure["models.speech_encoder_decoder"].extend(["SpeechEncoderDecoderModel"])
     _import_structure["models.speech_to_text"].extend(
         [
@@ -2095,6 +2113,8 @@ if TYPE_CHECKING:
     from .models.retribert import RETRIBERT_PRETRAINED_CONFIG_ARCHIVE_MAP, RetriBertConfig, RetriBertTokenizer
     from .models.roberta import ROBERTA_PRETRAINED_CONFIG_ARCHIVE_MAP, RobertaConfig, RobertaTokenizer
     from .models.roformer import ROFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP, RoFormerConfig, RoFormerTokenizer
+    from .models.sew import SEW_PRETRAINED_CONFIG_ARCHIVE_MAP, SEWConfig
+    from .models.sew_d import SEW_D_PRETRAINED_CONFIG_ARCHIVE_MAP, SEWDConfig
     from .models.speech_encoder_decoder import SpeechEncoderDecoderConfig
     from .models.speech_to_text import SPEECH_TO_TEXT_PRETRAINED_CONFIG_ARCHIVE_MAP, Speech2TextConfig
     from .models.speech_to_text_2 import (
@@ -2835,6 +2855,8 @@ if TYPE_CHECKING:
             RoFormerPreTrainedModel,
             load_tf_weights_in_roformer,
         )
+        from .models.sew import SEW_PRETRAINED_MODEL_ARCHIVE_LIST, SEWForCTC, SEWModel, SEWPreTrainedModel
+        from .models.sew_d import SEW_D_PRETRAINED_MODEL_ARCHIVE_LIST, SEWDForCTC, SEWDModel, SEWDPreTrainedModel
         from .models.speech_encoder_decoder import SpeechEncoderDecoderModel
         from .models.speech_to_text import (
             SPEECH_TO_TEXT_PRETRAINED_MODEL_ARCHIVE_LIST,
@@ -80,6 +80,8 @@ from . import (
     retribert,
     roberta,
     roformer,
+    sew,
+    sew_d,
     speech_to_text,
     splinter,
     squeezebert,
@@ -96,6 +96,8 @@ CONFIG_MAPPING_NAMES = OrderedDict(
         ("rag", "RagConfig"),
         ("tapas", "TapasConfig"),
         ("splinter", "SplinterConfig"),
+        ("sew-d", "SEWDConfig"),
+        ("sew", "SEWConfig"),
     ]
 )

@@ -162,6 +164,8 @@ CONFIG_ARCHIVE_MAP_MAPPING_NAMES = OrderedDict(
         ("ibert", "IBERT_PRETRAINED_CONFIG_ARCHIVE_MAP"),
         ("hubert", "HUBERT_PRETRAINED_CONFIG_ARCHIVE_MAP"),
         ("splinter", "SPLINTER_PRETRAINED_CONFIG_ARCHIVE_MAP"),
+        ("sew-d", "SEW_D_PRETRAINED_CONFIG_ARCHIVE_MAP"),
+        ("sew", "SEW_PRETRAINED_CONFIG_ARCHIVE_MAP"),
     ]
 )

@@ -245,6 +249,8 @@ MODEL_NAMES_MAPPING = OrderedDict(
         ("byt5", "ByT5"),
         ("mbart50", "mBART-50"),
         ("splinter", "Splinter"),
+        ("sew-d", "SEW-D"),
+        ("sew", "SEW"),
     ]
 )

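
The config mappings above list `"sew-d"` ahead of `"sew"`. The diff does not say why. One plausible reason, assuming some lookup falls back to matching these keys as substrings of a checkpoint name in insertion order, is that `"sew"` is itself a substring of `"sew-d"`, so the more specific key has to be tried first. The helper `first_substring_match` and the checkpoint name below are hypothetical and only illustrate that ordering effect.

```python
from collections import OrderedDict


def first_substring_match(name, mapping):
    """Return the value of the first key that occurs inside `name`, else None."""
    for pattern, value in mapping.items():
        if pattern in name:
            return value
    return None


# Same two entries, in the two possible orders.
specific_first = OrderedDict([("sew-d", "SEWDConfig"), ("sew", "SEWConfig")])
generic_first = OrderedDict([("sew", "SEWConfig"), ("sew-d", "SEWDConfig")])

checkpoint = "asapp/sew-d-tiny-100k"  # hypothetical checkpoint name

good = first_substring_match(checkpoint, specific_first)
bad = first_substring_match(checkpoint, generic_first)
```

With the specific key first the SEW-D name resolves to `SEWDConfig`; with the generic key first it is captured by the plain `"sew"` entry.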
@@ -92,6 +92,8 @@ MODEL_MAPPING_NAMES = OrderedDict(
         ("tapas", "TapasModel"),
         ("ibert", "IBertModel"),
         ("splinter", "SplinterModel"),
+        ("sew", "SEWModel"),
+        ("sew-d", "SEWDModel"),
     ]
 )

@@ -482,6 +484,8 @@ MODEL_FOR_CTC_MAPPING_NAMES = OrderedDict(
         # Model for Connectionist temporal classification (CTC) mapping
         ("wav2vec2", "Wav2Vec2ForCTC"),
         ("hubert", "HubertForCTC"),
+        ("sew", "SEWForCTC"),
+        ("sew-d", "SEWDForCTC"),
     ]
 )

@@ -141,7 +141,7 @@ def _compute_mask_indices(
 class HubertNoLayerNormConvLayer(nn.Module):
     def __init__(self, config, layer_id=0):
         super().__init__()
-        self.in_conv_dim = config.conv_dim[layer_id] if layer_id > 0 else 1
+        self.in_conv_dim = config.conv_dim[layer_id - 1] if layer_id > 0 else 1
         self.out_conv_dim = config.conv_dim[layer_id]

         self.conv = nn.Conv1d(
@@ -163,7 +163,7 @@ class HubertNoLayerNormConvLayer(nn.Module):
 class HubertLayerNormConvLayer(nn.Module):
     def __init__(self, config, layer_id=0):
         super().__init__()
-        self.in_conv_dim = config.conv_dim[layer_id] if layer_id > 0 else 1
+        self.in_conv_dim = config.conv_dim[layer_id - 1] if layer_id > 0 else 1
         self.out_conv_dim = config.conv_dim[layer_id]

         self.conv = nn.Conv1d(
@@ -191,7 +191,7 @@ class HubertLayerNormConvLayer(nn.Module):
 class HubertGroupNormConvLayer(nn.Module):
     def __init__(self, config, layer_id=0):
         super().__init__()
-        self.in_conv_dim = config.conv_dim[layer_id] if layer_id > 0 else 1
+        self.in_conv_dim = config.conv_dim[layer_id - 1] if layer_id > 0 else 1
         self.out_conv_dim = config.conv_dim[layer_id]

         self.conv = nn.Conv1d(
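
The three Hubert hunks above change the input width of each conv layer from `conv_dim[layer_id]` to `conv_dim[layer_id - 1]`. A small sketch of why that matters: a layer's input channels have to equal the previous layer's output channels. The two expressions agree when every entry of `conv_dim` is the same and diverge as soon as the widths change between layers. The helper names `conv_channels` and `is_chained` and the example widths are illustrative assumptions, not code from the commit.

```python
def conv_channels(conv_dim, use_previous_layer=True):
    """Return (in_channels, out_channels) per conv layer."""
    pairs = []
    for layer_id in range(len(conv_dim)):
        if layer_id == 0:
            in_dim = 1  # the raw waveform has a single channel
        elif use_previous_layer:
            in_dim = conv_dim[layer_id - 1]  # expression after the change
        else:
            in_dim = conv_dim[layer_id]  # expression before the change
        pairs.append((in_dim, conv_dim[layer_id]))
    return pairs


def is_chained(pairs):
    """True if every layer's input width equals the previous layer's output width."""
    return all(prev_out == next_in for (_, prev_out), (next_in, _) in zip(pairs, pairs[1:]))


varying = (64, 128, 128, 256)  # hypothetical widths that change between layers
uniform = (512, 512, 512)      # hypothetical constant widths

after_change = conv_channels(varying)
before_change = conv_channels(varying, use_previous_layer=False)
```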
45
src/transformers/models/sew/__init__.py
Normal file
@@ -0,0 +1,45 @@
# flake8: noqa
# There's no way to ignore "F401 '...' imported but unused" warnings in this
# module, but to preserve other warnings. So, don't check this module at all.

# Copyright 2021 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from typing import TYPE_CHECKING

from ...file_utils import _LazyModule, is_torch_available


_import_structure = {
    "configuration_sew": ["SEW_PRETRAINED_CONFIG_ARCHIVE_MAP", "SEWConfig"],
}

if is_torch_available():
    _import_structure["modeling_sew"] = [
        "SEW_PRETRAINED_MODEL_ARCHIVE_LIST",
        "SEWForCTC",
        "SEWModel",
        "SEWPreTrainedModel",
    ]

if TYPE_CHECKING:
    from .configuration_sew import SEW_PRETRAINED_CONFIG_ARCHIVE_MAP, SEWConfig

    if is_torch_available():
        from .modeling_sew import SEW_PRETRAINED_MODEL_ARCHIVE_LIST, SEWForCTC, SEWModel, SEWPreTrainedModel


else:
    import sys

    sys.modules[__name__] = _LazyModule(__name__, globals()["__file__"], _import_structure)
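
The `_import_structure` dictionary above maps submodule names to the public names they export, so that a submodule is imported only when one of its names is first accessed. The following is a minimal sketch of that general idea using only the standard library. `LazyNamespace` is a hypothetical class written for this illustration; it is not the implementation of `_LazyModule`, and it resolves absolute stdlib module names rather than package-relative ones.

```python
import importlib
import types


class LazyNamespace(types.ModuleType):
    """Module-like object that imports a submodule on first attribute access."""

    def __init__(self, name, import_structure):
        super().__init__(name)
        # Invert {module: [names]} into {name: module} for quick lookup.
        self._name_to_module = {
            attr: module for module, attrs in import_structure.items() for attr in attrs
        }
        self.loaded = []  # modules imported so far, in order

    def __getattr__(self, attr):
        # Only reached when normal attribute lookup fails.
        module_name = self.__dict__.get("_name_to_module", {}).get(attr)
        if module_name is None:
            raise AttributeError(f"module {self.__name__!r} has no attribute {attr!r}")
        value = getattr(importlib.import_module(module_name), attr)
        self.loaded.append(module_name)
        setattr(self, attr, value)  # cache so later lookups skip __getattr__
        return value


# Hypothetical structure using stdlib modules in place of model files.
ns = LazyNamespace("demo", {"math": ["sqrt"], "json": ["dumps"]})
```

Accessing `ns.sqrt` triggers the import of `math`; `json` stays untouched until `ns.dumps` is requested.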
216
src/transformers/models/sew/configuration_sew.py
Normal file
@@ -0,0 +1,216 @@
# coding=utf-8
# Copyright 2021 ASAPP Inc. and The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" SEW model configuration """

from ...configuration_utils import PretrainedConfig
from ...utils import logging


logger = logging.get_logger(__name__)

SEW_PRETRAINED_CONFIG_ARCHIVE_MAP = {
    "asapp/sew-tiny-100k": "https://huggingface.co/asapp/sew-tiny-100k/resolve/main/config.json",
    # See all SEW models at https://huggingface.co/models?filter=sew
}


class SEWConfig(PretrainedConfig):
    r"""
    This is the configuration class to store the configuration of a :class:`~transformers.SEWModel`. It is used to
    instantiate a SEW model according to the specified arguments, defining the model architecture. Instantiating a
    configuration with the defaults will yield a similar configuration to that of the SEW `asapp/sew-tiny-100k
    <https://huggingface.co/asapp/sew-tiny-100k>`__ architecture.

    Configuration objects inherit from :class:`~transformers.PretrainedConfig` and can be used to control the model
    outputs. Read the documentation from :class:`~transformers.PretrainedConfig` for more information.


    Args:
        vocab_size (:obj:`int`, `optional`, defaults to 32):
            Vocabulary size of the SEW model. Defines the number of different tokens that can be represented by the
            :obj:`inputs_ids` passed when calling :class:`~transformers.SEWModel`.
        hidden_size (:obj:`int`, `optional`, defaults to 768):
            Dimensionality of the encoder layers and the pooler layer.
        num_hidden_layers (:obj:`int`, `optional`, defaults to 12):
            Number of hidden layers in the Transformer encoder.
        num_attention_heads (:obj:`int`, `optional`, defaults to 12):
            Number of attention heads for each attention layer in the Transformer encoder.
        intermediate_size (:obj:`int`, `optional`, defaults to 3072):
            Dimensionality of the "intermediate" (i.e., feed-forward) layer in the Transformer encoder.
        squeeze_factor (:obj:`int`, `optional`, defaults to 2):
            Sequence length downsampling factor after the encoder and upsampling factor after the transformer.
        hidden_act (:obj:`str` or :obj:`function`, `optional`, defaults to :obj:`"gelu"`):
            The non-linear activation function (function or string) in the encoder and pooler. If string,
            :obj:`"gelu"`, :obj:`"relu"`, :obj:`"selu"` and :obj:`"gelu_new"` are supported.
        hidden_dropout (:obj:`float`, `optional`, defaults to 0.1):
            The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
        attention_dropout (:obj:`float`, `optional`, defaults to 0.1):
            The dropout ratio for the attention probabilities.
        final_dropout (:obj:`float`, `optional`, defaults to 0.1):
            The dropout probability for the final projection layer of :class:`SEWForCTC`.
        initializer_range (:obj:`float`, `optional`, defaults to 0.02):
            The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
        layer_norm_eps (:obj:`float`, `optional`, defaults to 1e-5):
            The epsilon used by the layer normalization layers.
        feat_extract_norm (:obj:`str`, `optional`, defaults to :obj:`"group"`):
            The norm to be applied to 1D convolutional layers in feature extractor. One of :obj:`"group"` for group
            normalization of only the first 1D convolutional layer or :obj:`"layer"` for layer normalization of all 1D
            convolutional layers.
        feat_proj_dropout (:obj:`float`, `optional`, defaults to 0.0):
            The dropout probability for output of the feature extractor.
        feat_extract_activation (:obj:`str`, `optional`, defaults to :obj:`"gelu"`):
            The non-linear activation function (function or string) in the 1D convolutional layers of the feature
            extractor. If string, :obj:`"gelu"`, :obj:`"relu"`, :obj:`"selu"` and :obj:`"gelu_new"` are supported.
        conv_dim (:obj:`Tuple[int]`, `optional`, defaults to :obj:`(64, 128, 128, 128, 128, 256, 256, 256, 256, 512, 512, 512, 512)`):
            A tuple of integers defining the number of input and output channels of each 1D convolutional layer in the
            feature extractor. The length of `conv_dim` defines the number of 1D convolutional layers.
        conv_stride (:obj:`Tuple[int]`, `optional`, defaults to :obj:`(5, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1)`):
            A tuple of integers defining the stride of each 1D convolutional layer in the feature extractor. The length
            of `conv_stride` defines the number of convolutional layers and has to match the length of `conv_dim`.
        conv_kernel (:obj:`Tuple[int]`, `optional`, defaults to :obj:`(10, 3, 1, 3, 1, 3, 1, 3, 1, 2, 1, 2, 1)`):
            A tuple of integers defining the kernel size of each 1D convolutional layer in the feature extractor. The
            length of `conv_kernel` defines the number of convolutional layers and has to match the length of
            `conv_dim`.
        conv_bias (:obj:`bool`, `optional`, defaults to :obj:`False`):
            Whether the 1D convolutional layers have a bias.
        num_conv_pos_embeddings (:obj:`int`, `optional`, defaults to 128):
            Number of convolutional positional embeddings. Defines the kernel size of 1D convolutional positional
            embeddings layer.
        num_conv_pos_embedding_groups (:obj:`int`, `optional`, defaults to 16):
            Number of groups of 1D convolutional positional embeddings layer.
        apply_spec_augment (:obj:`bool`, `optional`, defaults to :obj:`True`):
            Whether to apply *SpecAugment* data augmentation to the outputs of the feature extractor. For reference see
            `SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition
            <https://arxiv.org/abs/1904.08779>`__.
        mask_time_prob (:obj:`float`, `optional`, defaults to 0.05):
            Probability of each feature vector along the time axis to be chosen as the start of the vector span to be
            masked. Approximately ``mask_time_prob * sequence_length // mask_time_length`` feature vectors will be
            masked along the time axis. This is only relevant if ``apply_spec_augment is True``.
        mask_time_length (:obj:`int`, `optional`, defaults to 10):
            Length of vector span along the time axis.
        mask_feature_prob (:obj:`float`, `optional`, defaults to 0.0):
            Probability of each feature vector along the feature axis to be chosen as the start of the vector span to
            be masked. Approximately ``mask_feature_prob * hidden_size // mask_feature_length`` feature vectors will
            be masked along the feature axis. This is only relevant if ``apply_spec_augment is True``.
        mask_feature_length (:obj:`int`, `optional`, defaults to 10):
            Length of vector span along the feature axis.
        ctc_loss_reduction (:obj:`str`, `optional`, defaults to :obj:`"sum"`):
            Specifies the reduction to apply to the output of ``torch.nn.CTCLoss``. Only relevant when training an
            instance of :class:`~transformers.SEWForCTC`.
        ctc_zero_infinity (:obj:`bool`, `optional`, defaults to :obj:`False`):
            Whether to zero infinite losses and the associated gradients of ``torch.nn.CTCLoss``. Infinite losses
            mainly occur when the inputs are too short to be aligned to the targets. Only relevant when training an
            instance of :class:`~transformers.SEWForCTC`.

    Example::

        >>> from transformers import SEWModel, SEWConfig

        >>> # Initializing a SEW asapp/sew-tiny-100k style configuration
        >>> configuration = SEWConfig()

        >>> # Initializing a model from the asapp/sew-tiny-100k style configuration
        >>> model = SEWModel(configuration)

        >>> # Accessing the model configuration
        >>> configuration = model.config
    """
    model_type = "sew"

    def __init__(
        self,
        vocab_size=32,
        hidden_size=768,
        num_hidden_layers=12,
        num_attention_heads=12,
        intermediate_size=3072,
        squeeze_factor=2,
        hidden_act="gelu",
        hidden_dropout=0.1,
        activation_dropout=0.1,
        attention_dropout=0.1,
        feat_proj_dropout=0.0,
        final_dropout=0.1,
        layerdrop=0.1,
        initializer_range=0.02,
        layer_norm_eps=1e-5,
        feat_extract_norm="group",
        feat_extract_activation="gelu",
        conv_dim=(64, 128, 128, 128, 128, 256, 256, 256, 256, 512, 512, 512, 512),
        conv_stride=(5, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1),
        conv_kernel=(10, 3, 1, 3, 1, 3, 1, 3, 1, 2, 1, 2, 1),
        conv_bias=False,
        num_conv_pos_embeddings=128,
        num_conv_pos_embedding_groups=16,
        apply_spec_augment=True,
        mask_time_prob=0.05,
        mask_time_length=10,
        mask_feature_prob=0.0,
        mask_feature_length=10,
        ctc_loss_reduction="sum",
        ctc_zero_infinity=False,
        pad_token_id=0,
        bos_token_id=1,
        eos_token_id=2,
        **kwargs
    ):
        super().__init__(**kwargs, pad_token_id=pad_token_id, bos_token_id=bos_token_id, eos_token_id=eos_token_id)
        self.hidden_size = hidden_size
        self.feat_extract_norm = feat_extract_norm
        self.feat_extract_activation = feat_extract_activation
        self.conv_dim = list(conv_dim)
        self.conv_stride = list(conv_stride)
        self.conv_kernel = list(conv_kernel)
        self.conv_bias = conv_bias
        self.num_conv_pos_embeddings = num_conv_pos_embeddings
        self.num_conv_pos_embedding_groups = num_conv_pos_embedding_groups
        self.num_feat_extract_layers = len(self.conv_dim)
        self.num_hidden_layers = num_hidden_layers
        self.intermediate_size = intermediate_size
        self.squeeze_factor = squeeze_factor
        self.hidden_act = hidden_act
        self.num_attention_heads = num_attention_heads
        self.hidden_dropout = hidden_dropout
        self.attention_dropout = attention_dropout
        self.activation_dropout = activation_dropout
        self.feat_proj_dropout = feat_proj_dropout
        self.final_dropout = final_dropout
        self.layerdrop = layerdrop
        self.layer_norm_eps = layer_norm_eps
        self.initializer_range = initializer_range
        self.vocab_size = vocab_size

        if (
            (len(self.conv_stride) != self.num_feat_extract_layers)
            or (len(self.conv_kernel) != self.num_feat_extract_layers)
            or (len(self.conv_dim) != self.num_feat_extract_layers)
        ):
            raise ValueError(
                "Configuration for convolutional layers is incorrect. "
                "It is required that `len(config.conv_dim)` == `len(config.conv_stride)` == `len(config.conv_kernel)`, "
                f"but is `len(config.conv_dim) = {len(self.conv_dim)}`, `len(config.conv_stride) "
                f"= {len(self.conv_stride)}`, `len(config.conv_kernel) = {len(self.conv_kernel)}`."
            )

        # fine-tuning config parameters for SpecAugment: https://arxiv.org/abs/1904.08779
        self.apply_spec_augment = apply_spec_augment
        self.mask_time_prob = mask_time_prob
        self.mask_time_length = mask_time_length
        self.mask_feature_prob = mask_feature_prob
        self.mask_feature_length = mask_feature_length

        # ctc loss
        self.ctc_loss_reduction = ctc_loss_reduction
        self.ctc_zero_infinity = ctc_zero_infinity
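
To make the default convolution settings in the configuration above more concrete, here is a small arithmetic sketch. It checks that the three tuples have matching lengths, multiplies the strides to get the overall downsampling factor, and counts the frames left after the conv stack. It assumes unpadded 1D convolutions with dilation 1, so each layer maps a length `L` to `floor((L - kernel) / stride) + 1`; that assumption, the helper names, and the 16000-sample input are illustrative and not taken from the commit.

```python
from math import prod

# Defaults copied from the SEWConfig signature.
CONV_DIM = (64, 128, 128, 128, 128, 256, 256, 256, 256, 512, 512, 512, 512)
CONV_STRIDE = (5, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1)
CONV_KERNEL = (10, 3, 1, 3, 1, 3, 1, 3, 1, 2, 1, 2, 1)


def check_conv_config(conv_dim, conv_stride, conv_kernel):
    """Raise ValueError unless all three tuples describe the same number of layers."""
    if not (len(conv_dim) == len(conv_stride) == len(conv_kernel)):
        raise ValueError(
            f"Mismatched lengths: conv_dim={len(conv_dim)}, "
            f"conv_stride={len(conv_stride)}, conv_kernel={len(conv_kernel)}"
        )


def num_output_frames(num_samples, conv_kernel, conv_stride):
    """Frames left after the conv stack, assuming no padding and dilation 1."""
    length = num_samples
    for kernel, stride in zip(conv_kernel, conv_stride):
        length = (length - kernel) // stride + 1
    return length


check_conv_config(CONV_DIM, CONV_STRIDE, CONV_KERNEL)
total_stride = prod(CONV_STRIDE)  # 320
frames = num_output_frames(16000, CONV_KERNEL, CONV_STRIDE)  # 49
```

Under these assumptions, 16000 input samples shrink to 49 frames before `squeeze_factor` is applied.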
@@ -0,0 +1,286 @@
# coding=utf-8
# Copyright 2021 The HuggingFace Inc. team.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Convert SEW checkpoint."""


import argparse
import json
import os

import fairseq
import torch
from fairseq.data import Dictionary

# Register SEW's fairseq modules
from sew_asapp import tasks  # noqa: F401
from transformers import (
    SEWConfig,
    SEWForCTC,
    SEWModel,
    Wav2Vec2CTCTokenizer,
    Wav2Vec2FeatureExtractor,
    Wav2Vec2Processor,
    logging,
)


logging.set_verbosity_info()
logger = logging.get_logger(__name__)

MAPPING = {
    "post_extract_proj": "feature_projection",
    "encoder.pos_conv.0": "encoder.pos_conv_embed.conv",
    "self_attn.k_proj": "encoder.layers.*.attention.k_proj",
    "self_attn.v_proj": "encoder.layers.*.attention.v_proj",
    "self_attn.q_proj": "encoder.layers.*.attention.q_proj",
    "self_attn.out_proj": "encoder.layers.*.attention.out_proj",
    "self_attn_layer_norm": "encoder.layers.*.layer_norm",
    "fc1": "encoder.layers.*.feed_forward.intermediate_dense",
    "fc2": "encoder.layers.*.feed_forward.output_dense",
    "final_layer_norm": "encoder.layers.*.final_layer_norm",
    "encoder.upsample.0": "encoder.upsample.projection",
    "encoder.layer_norm": "encoder.layer_norm",
    "w2v_encoder.layer_norm": "layer_norm",
    "w2v_encoder.proj": "lm_head",
    "mask_emb": "masked_spec_embed",
}


def set_recursively(hf_pointer, key, value, full_name, weight_type):
    for attribute in key.split("."):
        hf_pointer = getattr(hf_pointer, attribute)

    if weight_type is not None:
        hf_shape = getattr(hf_pointer, weight_type).shape
    else:
        hf_shape = hf_pointer.shape

    assert (
        hf_shape == value.shape
    ), f"Shape of hf {key + ('.' + weight_type if weight_type is not None else '')} is {hf_shape}, but should be {value.shape} for {full_name}"

    if weight_type == "weight":
        hf_pointer.weight.data = value
    elif weight_type == "weight_g":
        hf_pointer.weight_g.data = value
    elif weight_type == "weight_v":
        hf_pointer.weight_v.data = value
    elif weight_type == "bias":
        hf_pointer.bias.data = value
    else:
        hf_pointer.data = value

    logger.info(f"{key + ('.' + weight_type if weight_type is not None else '')} was initialized from {full_name}.")
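
The first loop of `set_recursively` walks a dotted key such as `encoder.layers.3.attention` one attribute at a time. The sketch below isolates that traversal using `SimpleNamespace` objects as a stand-in for a model, so it runs without PyTorch. The `resolve` helper and the toy object tree are assumptions for illustration only.

```python
from types import SimpleNamespace


def resolve(root, dotted_key):
    """Follow `dotted_key` from `root`, one getattr per path component."""
    obj = root
    for attribute in dotted_key.split("."):
        obj = getattr(obj, attribute)
    return obj


# Hypothetical stand-in for a model: numeric components are plain attribute
# names here, stored under the string "3".
layers = SimpleNamespace()
setattr(layers, "3", SimpleNamespace(bias="bias-of-layer-3"))
model = SimpleNamespace(
    encoder=SimpleNamespace(layers=layers, layer_norm=SimpleNamespace(weight=[1.0, 2.0]))
)

norm = resolve(model, "encoder.layer_norm")
layer_bias = resolve(model, "encoder.layers.3").bias
```

A misspelled component raises `AttributeError` at the point where the path stops matching, which makes a bad mapping entry fail early.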


def recursively_load_weights(fairseq_model, hf_model, is_finetuned):
    unused_weights = []
    fairseq_dict = fairseq_model.state_dict()

    feature_extractor = hf_model.sew.feature_extractor if is_finetuned else hf_model.feature_extractor

    for name, value in fairseq_dict.items():
        is_used = False
        if "conv_layers" in name:
            load_conv_layer(
                name,
                value,
                feature_extractor,
                unused_weights,
                hf_model.config.feat_extract_norm == "group",
            )
            is_used = True
        else:
            for key, mapped_key in MAPPING.items():
                mapped_key = "sew." + mapped_key if (is_finetuned and mapped_key != "lm_head") else mapped_key

                if key in name or key.split("w2v_encoder.")[-1] == name.split(".")[0]:
                    is_used = True
                    if "*" in mapped_key:
                        layer_index = name.split(key)[0].split(".")[-2]
                        mapped_key = mapped_key.replace("*", layer_index)
                    if "weight_g" in name:
                        weight_type = "weight_g"
                    elif "weight_v" in name:
                        weight_type = "weight_v"
                    elif "weight" in name:
                        weight_type = "weight"
                    elif "bias" in name:
                        weight_type = "bias"
                    else:
                        weight_type = None
                    set_recursively(hf_model, mapped_key, value, name, weight_type)
                continue
        if not is_used:
            unused_weights.append(name)

    logger.warning(f"Unused weights: {unused_weights}")
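
The name handling inside `recursively_load_weights` can be traced on a single parameter name. The sketch below repeats the two string operations from the loop: filling the `*` wildcard with the layer index taken from the source name, and picking the weight type by substring. The helper names and the example parameter names are hypothetical; real fairseq checkpoints may use different prefixes.

```python
def map_name(name, key, mapped_key):
    """Fill the '*' in `mapped_key` with the layer index found just before `key`."""
    if "*" in mapped_key:
        # "encoder.layers.3." -> ["encoder", "layers", "3", ""] -> "3"
        layer_index = name.split(key)[0].split(".")[-2]
        mapped_key = mapped_key.replace("*", layer_index)
    return mapped_key


def weight_type_of(name):
    """Same priority order as the if/elif chain: weight_g, weight_v, weight, bias."""
    for suffix in ("weight_g", "weight_v", "weight", "bias"):
        if suffix in name:
            return suffix
    return None


source_name = "encoder.layers.3.self_attn.k_proj.weight"  # hypothetical name
target = map_name(source_name, "self_attn.k_proj", "encoder.layers.*.attention.k_proj")
kind = weight_type_of(source_name)
```

Checking `weight_g` and `weight_v` before `weight` matters because `"weight"` is a substring of both.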
|
||||||
|
|
||||||
|
|
||||||
|
def load_conv_layer(full_name, value, feature_extractor, unused_weights, use_group_norm):
|
||||||
|
name = full_name.split("conv_layers.")[-1]
|
||||||
|
items = name.split(".")
|
||||||
|
layer_id = int(items[0])
|
||||||
|
type_id = int(items[1])
|
||||||
|
|
||||||
|
if type_id == 0:
|
||||||
|
if "bias" in name:
|
||||||
|
assert (
|
||||||
|
value.shape == feature_extractor.conv_layers[layer_id].conv.bias.data.shape
|
||||||
|
), f"{full_name} has size {value.shape}, but {feature_extractor.conv_layers[layer_id].conv.bias.data.shape} was found."
|
||||||
|
feature_extractor.conv_layers[layer_id].conv.bias.data = value
|
||||||
|
logger.info(f"Feat extract conv layer {layer_id} was initialized from {full_name}.")
|
||||||
|
elif "weight" in name:
|
||||||
|
assert (
|
||||||
|
value.shape == feature_extractor.conv_layers[layer_id].conv.weight.data.shape
|
||||||
|
), f"{full_name} has size {value.shape}, but {feature_extractor.conv_layers[layer_id].conv.weight.data.shape} was found."
|
||||||
|
feature_extractor.conv_layers[layer_id].conv.weight.data = value
|
||||||
|
logger.info(f"Feat extract conv layer {layer_id} was initialized from {full_name}.")
|
||||||
|
elif (type_id == 2 and not use_group_norm) or (type_id == 2 and layer_id == 0 and use_group_norm):
|
||||||
|
if "bias" in name:
|
||||||
|
assert (
|
||||||
|
value.shape == feature_extractor.conv_layers[layer_id].layer_norm.bias.data.shape
|
||||||
|
), f"{full_name} has size {value.shape}, but {feature_extractor.conv_layers[layer_id].layer_norm.bias.data.shape} was found."
|
||||||
|
feature_extractor.conv_layers[layer_id].layer_norm.bias.data = value
|
||||||
|
logger.info(f"Feat extract layer norm bias of layer {layer_id} was initialized from {full_name}.")
|
||||||
|
elif "weight" in name:
|
||||||
|
assert (
|
||||||
|
value.shape == feature_extractor.conv_layers[layer_id].layer_norm.weight.data.shape
|
||||||
|
), f"{full_name} has size {value.shape}, but {feature_extractor.conv_layers[layer_id].layer_norm.weight.data.shape} was found."
|
||||||
|
feature_extractor.conv_layers[layer_id].layer_norm.weight.data = value
|
||||||
|
logger.info(f"Feat extract layer norm weight of layer {layer_id} was initialized from {full_name}.")
|
||||||
|
else:
|
||||||
|
unused_weights.append(full_name)
|
||||||
|
|
||||||
|
|
||||||
|
def convert_config(model):
|
||||||
|
config = SEWConfig()
|
||||||
|
fs_config = model.cfg
|
||||||
|
|
||||||
|
config.activation_dropout = fs_config.activation_dropout
|
||||||
|
config.apply_spec_augment = fs_config.mask_prob > 0 or fs_config.mask_channel_prob > 0
|
||||||
|
config.attention_dropout = fs_config.attention_dropout
|
||||||
|
config.conv_bias = fs_config.conv_bias
|
||||||
|
conv_layers = eval(fs_config.conv_feature_layers)
|
||||||
|
config.conv_dim = [x[0] for x in conv_layers]
|
||||||
|
config.conv_kernel = [x[1] for x in conv_layers]
|
||||||
|
config.conv_stride = [x[2] for x in conv_layers]
|
||||||
|
config.feat_extract_activation = "gelu"
|
||||||
|
config.feat_extract_norm = "layer" if fs_config.extractor_mode == "layer_norm" else "group"
|
||||||
|
config.feat_proj_dropout = fs_config.dropout_input
|
||||||
|
config.final_dropout = 0.0
|
||||||
|
config.hidden_act = fs_config.activation_fn.name
|
||||||
|
config.hidden_dropout = fs_config.dropout
|
||||||
|
config.hidden_size = fs_config.encoder_embed_dim
|
||||||
|
config.initializer_range = 0.02
|
||||||
|
config.intermediate_size = fs_config.encoder_ffn_embed_dim
|
||||||
|
config.layer_norm_eps = 1e-5
|
||||||
|
config.layerdrop = fs_config.encoder_layerdrop
|
||||||
|
config.mask_feature_length = fs_config.mask_channel_length
|
||||||
|
config.mask_feature_prob = fs_config.mask_channel_prob
|
||||||
|
config.mask_time_length = fs_config.mask_length
|
||||||
|
config.mask_time_prob = fs_config.mask_prob
|
||||||
|
config.num_attention_heads = fs_config.encoder_attention_heads
|
||||||
|
config.num_conv_pos_embedding_groups = fs_config.conv_pos_groups
|
||||||
|
config.num_conv_pos_embeddings = fs_config.conv_pos
|
||||||
|
config.num_feat_extract_layers = len(conv_layers)
|
||||||
|
config.num_hidden_layers = fs_config.encoder_layers
|
||||||
|
config.squeeze_factor = fs_config.squeeze_factor
|
||||||
|
|
||||||
|
return config
|
||||||
|
|
||||||
|
|
||||||
|
@torch.no_grad()
|
||||||
|
def convert_sew_checkpoint(
|
||||||
|
checkpoint_path, pytorch_dump_folder_path, config_path=None, dict_path=None, is_finetuned=True
|
||||||
|
):
|
||||||
|
"""
|
||||||
|
Copy/paste/tweak model's weights to transformers design.
|
||||||
|
"""
|
||||||
|
|
||||||
|
if is_finetuned:
|
||||||
|
model, _, _ = fairseq.checkpoint_utils.load_model_ensemble_and_task(
|
||||||
|
[checkpoint_path], arg_overrides={"data": "/".join(dict_path.split("/")[:-1])}
|
||||||
|
)
|
||||||
|
else:
|
||||||
|
model, _, _ = fairseq.checkpoint_utils.load_model_ensemble_and_task([checkpoint_path])
|
||||||
|
|
||||||
|
if config_path is not None:
|
||||||
|
config = SEWConfig.from_pretrained(config_path)
|
||||||
|
else:
|
||||||
|
config = convert_config(model[0])
|
||||||
|
model = model[0].eval()
|
||||||
|
|
||||||
|
return_attention_mask = config.feat_extract_norm == "layer"
|
||||||
|
feature_extractor = Wav2Vec2FeatureExtractor(
|
||||||
|
feature_size=1,
|
||||||
|
sampling_rate=16000,
|
||||||
|
padding_value=0,
|
||||||
|
do_normalize=True,
|
||||||
|
return_attention_mask=return_attention_mask,
|
||||||
|
)
|
||||||
|
|
||||||
|
if is_finetuned:
|
||||||
|
if dict_path:
|
||||||
|
target_dict = Dictionary.load(dict_path)
|
||||||
|
|
||||||
|
# Important: swap the bos and pad token ids, since the CTC blank symbol is <pad> and
|
||||||
|
# not <s> as in fairseq.
|
||||||
|
config.bos_token_id = target_dict.pad_index
|
||||||
|
config.pad_token_id = target_dict.bos_index
|
||||||
|
config.eos_token_id = target_dict.eos_index
|
||||||
|
config.vocab_size = len(target_dict.symbols)
|
||||||
|
vocab_path = os.path.join(pytorch_dump_folder_path, "vocab.json")
|
||||||
|
if not os.path.isdir(pytorch_dump_folder_path):
|
||||||
|
logger.error("--pytorch_dump_folder_path ({}) should be a directory".format(pytorch_dump_folder_path))
|
||||||
|
return
|
||||||
|
os.makedirs(pytorch_dump_folder_path, exist_ok=True)
|
||||||
|
with open(vocab_path, "w", encoding="utf-8") as vocab_handle:
|
||||||
|
json.dump(target_dict.indices, vocab_handle)
|
||||||
|
tokenizer = Wav2Vec2CTCTokenizer(
|
||||||
|
vocab_path,
|
||||||
|
unk_token=target_dict.unk_word,
|
||||||
|
pad_token=target_dict.pad_word,
|
||||||
|
bos_token=target_dict.bos_word,
|
||||||
|
eos_token=target_dict.eos_word,
|
||||||
|
word_delimiter_token="|",
|
||||||
|
do_lower_case=False,
|
||||||
|
)
|
||||||
|
processor = Wav2Vec2Processor(feature_extractor=feature_extractor, tokenizer=tokenizer)
|
||||||
|
processor.save_pretrained(pytorch_dump_folder_path)
|
||||||
|
|
||||||
|
hf_model = SEWForCTC(config)
|
||||||
|
else:
|
||||||
|
hf_model = SEWModel(config)
|
||||||
|
feature_extractor.save_pretrained(pytorch_dump_folder_path)
|
||||||
|
|
||||||
|
recursively_load_weights(model, hf_model, is_finetuned)
|
||||||
|
|
||||||
|
hf_model.save_pretrained(pytorch_dump_folder_path)
|
||||||
|
|
||||||
|
|
||||||
|
if __name__ == "__main__":
|
||||||
|
parser = argparse.ArgumentParser()
|
||||||
|
parser.add_argument("--pytorch_dump_folder_path", default=None, type=str, help="Path to the output PyTorch model.")
|
||||||
|
parser.add_argument("--checkpoint_path", default=None, type=str, help="Path to fairseq checkpoint")
|
||||||
|
parser.add_argument("--dict_path", default=None, type=str, help="Path to dict of fine-tuned model")
|
||||||
|
parser.add_argument("--config_path", default=None, type=str, help="Path to hf config.json of model to convert")
|
||||||
|
parser.add_argument(
|
||||||
|
"--is_finetuned", action="store_true", help="Whether the model to convert is a fine-tuned model or not"
|
||||||
|
)
|
||||||
|
args = parser.parse_args()
|
||||||
|
convert_sew_checkpoint(
|
||||||
|
args.checkpoint_path, args.pytorch_dump_folder_path, args.config_path, args.dict_path, args.is_finetuned
|
||||||
|
)
|
||||||
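A minimal sketch of the name-matching step that `recursively_load_weights` performs in the conversion script above: filling the `*` placeholder with the layer index and choosing which tensor (`weight_g`, `weight_v`, `weight`, `bias`) a checkpoint entry targets. The helper names `resolve_mapped_key` and `detect_weight_type` and the sample checkpoint entry are hypothetical and exist only for illustration; the string operations mirror the ones in the script.

```python
def resolve_mapped_key(name, key, mapped_key):
    """Fill the "*" placeholder in mapped_key with the layer index found in name."""
    if "*" in mapped_key:
        # The text before the matched key ends with "<layer_index>.", so the
        # second-to-last dot-separated piece is the layer index.
        layer_index = name.split(key)[0].split(".")[-2]
        mapped_key = mapped_key.replace("*", layer_index)
    return mapped_key


def detect_weight_type(name):
    """Return which tensor of a module a checkpoint entry refers to, or None."""
    # Order matters: "weight_g" and "weight_v" both contain "weight".
    for weight_type in ("weight_g", "weight_v", "weight", "bias"):
        if weight_type in name:
            return weight_type
    return None


# Hypothetical checkpoint entry and mapping pair, for illustration only.
name = "encoder.layers.3.self_attn.k_proj.weight"
key = "self_attn.k_proj"
mapped_key = "encoder.layers.*.attention.k_proj"

resolved = resolve_mapped_key(name, key, mapped_key)
weight_type = detect_weight_type(name)
print(resolved, weight_type)
```

Because the index is taken positionally, a key that matches a name without a numeric layer component would yield a non-numeric "index"; the SEW-D variant of the script guards against this with an `isnumeric()` check.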
1035
src/transformers/models/sew/modeling_sew.py
Normal file
File diff suppressed because it is too large
45
src/transformers/models/sew_d/__init__.py
Normal file
@@ -0,0 +1,45 @@
|
|||||||
|
# flake8: noqa
|
||||||
|
# There's no way to ignore "F401 '...' imported but unused" warnings in this
|
||||||
|
# module, but to preserve other warnings. So, don't check this module at all.
|
||||||
|
|
||||||
|
# Copyright 2021 The HuggingFace Team. All rights reserved.
|
||||||
|
#
|
||||||
|
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||||
|
# you may not use this file except in compliance with the License.
|
||||||
|
# You may obtain a copy of the License at
|
||||||
|
#
|
||||||
|
# http://www.apache.org/licenses/LICENSE-2.0
|
||||||
|
#
|
||||||
|
# Unless required by applicable law or agreed to in writing, software
|
||||||
|
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||||
|
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||||
|
# See the License for the specific language governing permissions and
|
||||||
|
# limitations under the License.
|
||||||
|
from typing import TYPE_CHECKING
|
||||||
|
|
||||||
|
from ...file_utils import _LazyModule, is_torch_available
|
||||||
|
|
||||||
|
|
||||||
|
_import_structure = {
|
||||||
|
"configuration_sew_d": ["SEW_D_PRETRAINED_CONFIG_ARCHIVE_MAP", "SEWDConfig"],
|
||||||
|
}
|
||||||
|
|
||||||
|
if is_torch_available():
|
||||||
|
_import_structure["modeling_sew_d"] = [
|
||||||
|
"SEW_D_PRETRAINED_MODEL_ARCHIVE_LIST",
|
||||||
|
"SEWDForCTC",
|
||||||
|
"SEWDModel",
|
||||||
|
"SEWDPreTrainedModel",
|
||||||
|
]
|
||||||
|
|
||||||
|
if TYPE_CHECKING:
|
||||||
|
from .configuration_sew_d import SEW_D_PRETRAINED_CONFIG_ARCHIVE_MAP, SEWDConfig
|
||||||
|
|
||||||
|
if is_torch_available():
|
||||||
|
from .modeling_sew_d import SEW_D_PRETRAINED_MODEL_ARCHIVE_LIST, SEWDForCTC, SEWDModel, SEWDPreTrainedModel
|
||||||
|
|
||||||
|
|
||||||
|
else:
|
||||||
|
import sys
|
||||||
|
|
||||||
|
sys.modules[__name__] = _LazyModule(__name__, globals()["__file__"], _import_structure)
|
||||||
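The `_import_structure` dictionary above maps each submodule to the names it exports, so that heavy submodules are imported only when one of those names is first accessed. The following is a toy sketch of that idea, not the actual `_LazyModule` implementation from `file_utils`; the class name `LazyModule` is hypothetical, and standard-library modules stand in for `configuration_sew_d` and `modeling_sew_d`.

```python
import importlib
import types


class LazyModule(types.ModuleType):
    """Toy stand-in for a lazy module: import a submodule on first attribute access."""

    def __init__(self, name, import_structure):
        super().__init__(name)
        # Invert {"module": ["attr", ...]} into {"attr": "module"}.
        self._attr_to_module = {
            attr: module for module, attrs in import_structure.items() for attr in attrs
        }
        self.__all__ = list(self._attr_to_module)

    def __getattr__(self, attr):
        # Only called when normal attribute lookup fails.
        if attr not in self._attr_to_module:
            raise AttributeError(f"module {self.__name__!r} has no attribute {attr!r}")
        module = importlib.import_module(self._attr_to_module[attr])
        value = getattr(module, attr)
        setattr(self, attr, value)  # cache so later lookups skip __getattr__
        return value


# Standard-library modules stand in for the real submodules.
lazy = LazyModule("lazy_demo", {"json": ["dumps"], "math": ["sqrt"]})
print(lazy.sqrt(16.0))
```

The `TYPE_CHECKING` branch in the file serves static analyzers, which need ordinary import statements to see the exported names; at runtime the `else` branch swaps the module for the lazy wrapper.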
248
src/transformers/models/sew_d/configuration_sew_d.py
Normal file
@@ -0,0 +1,248 @@
|
|||||||
|
# coding=utf-8
|
||||||
|
# Copyright 2021 ASAPP Inc. and The HuggingFace Inc. team. All rights reserved.
|
||||||
|
#
|
||||||
|
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||||
|
# you may not use this file except in compliance with the License.
|
||||||
|
# You may obtain a copy of the License at
|
||||||
|
#
|
||||||
|
# http://www.apache.org/licenses/LICENSE-2.0
|
||||||
|
#
|
||||||
|
# Unless required by applicable law or agreed to in writing, software
|
||||||
|
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||||
|
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||||
|
# See the License for the specific language governing permissions and
|
||||||
|
# limitations under the License.
|
||||||
|
""" SEW-D model configuration """
|
||||||
|
|
||||||
|
from ...configuration_utils import PretrainedConfig
|
||||||
|
from ...utils import logging
|
||||||
|
|
||||||
|
|
||||||
|
logger = logging.get_logger(__name__)
|
||||||
|
|
||||||
|
SEW_D_PRETRAINED_CONFIG_ARCHIVE_MAP = {
|
||||||
|
"asapp/sew-d-tiny-100k": "https://huggingface.co/asapp/sew-d-tiny-100k/resolve/main/config.json",
|
||||||
|
# See all SEW-D models at https://huggingface.co/models?filter=sew-d
|
||||||
|
}
|
||||||
|
|
||||||
|
|
||||||
|
class SEWDConfig(PretrainedConfig):
|
||||||
|
r"""
|
||||||
|
This is the configuration class to store the configuration of a :class:`~transformers.SEWDModel`. It is used to
|
||||||
|
instantiate a SEW-D model according to the specified arguments, defining the model architecture. Instantiating a
|
||||||
|
configuration with the defaults will yield a similar configuration to that of the SEW-D `asapp/sew-d-tiny-100k
|
||||||
|
<https://huggingface.co/asapp/sew-d-tiny-100k>`__ architecture.
|
||||||
|
|
||||||
|
Configuration objects inherit from :class:`~transformers.PretrainedConfig` and can be used to control the model
|
||||||
|
outputs. Read the documentation from :class:`~transformers.PretrainedConfig` for more information.
|
||||||
|
|
||||||
|
|
||||||
|
Args:
|
||||||
|
vocab_size (:obj:`int`, `optional`, defaults to 32):
|
||||||
|
Vocabulary size of the SEW-D model. Defines the number of different tokens that can be represented by the
|
||||||
|
:obj:`inputs_ids` passed when calling :class:`~transformers.SEWDModel`.
|
||||||
|
hidden_size (:obj:`int`, `optional`, defaults to 768):
|
||||||
|
Dimensionality of the encoder layers and the pooler layer.
|
||||||
|
num_hidden_layers (:obj:`int`, `optional`, defaults to 12):
|
||||||
|
Number of hidden layers in the Transformer encoder.
|
||||||
|
num_attention_heads (:obj:`int`, `optional`, defaults to 12):
|
||||||
|
Number of attention heads for each attention layer in the Transformer encoder.
|
||||||
|
intermediate_size (:obj:`int`, `optional`, defaults to 3072):
|
||||||
|
Dimensionality of the "intermediate" (i.e., feed-forward) layer in the Transformer encoder.
|
||||||
|
squeeze_factor (:obj:`int`, `optional`, defaults to 2):
|
||||||
|
Sequence length downsampling factor after the encoder and upsampling factor after the transformer.
|
||||||
|
max_position_embeddings (:obj:`int`, `optional`, defaults to 512):
|
||||||
|
The maximum sequence length that this model might ever be used with. Typically set this to something large
|
||||||
|
just in case (e.g., 512 or 1024 or 2048).
|
||||||
|
position_buckets (:obj:`int`, `optional`, defaults to 256):
|
||||||
|
The maximum size of relative position embeddings.
|
||||||
|
share_att_key (:obj:`bool`, `optional`, defaults to :obj:`True`):
|
||||||
|
Whether to share attention key with c2p and p2c.
|
||||||
|
relative_attention (:obj:`bool`, `optional`, defaults to :obj:`True`):
|
||||||
|
Whether to use relative position encoding.
|
||||||
|
position_biased_input (:obj:`bool`, `optional`, defaults to :obj:`False`):
|
||||||
|
Whether to add absolute position embedding to content embedding.
|
||||||
|
pos_att_type (:obj:`Tuple[str]`, `optional`, defaults to :obj:`("p2c", "c2p")`):
|
||||||
|
The type of relative position attention. It can be a combination of :obj:`("p2c", "c2p", "p2p")`, e.g.
|
||||||
|
:obj:`("p2c")`, :obj:`("p2c", "c2p")`, :obj:`("p2c", "c2p", "p2p")`.
|
||||||
|
norm_rel_ebd (:obj:`str`, `optional`, defaults to :obj:`"layer_norm"`):
|
||||||
|
Whether to use layer norm in relative embedding (:obj:`"layer_norm"` if yes)
|
||||||
|
hidden_act (:obj:`str` or :obj:`function`, `optional`, defaults to :obj:`"gelu"`):
|
||||||
|
The non-linear activation function (function or string) in the encoder and pooler. If string,
|
||||||
|
:obj:`"gelu"`, :obj:`"relu"`, :obj:`"selu"` and :obj:`"gelu_new"` are supported.
|
||||||
|
hidden_dropout (:obj:`float`, `optional`, defaults to 0.1):
|
||||||
|
The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
|
||||||
|
attention_dropout (:obj:`float`, `optional`, defaults to 0.1):
|
||||||
|
The dropout ratio for the attention probabilities.
|
||||||
|
final_dropout (:obj:`float`, `optional`, defaults to 0.1):
|
||||||
|
The dropout probability for the final projection layer of :class:`SEWDForCTC`.
|
||||||
|
initializer_range (:obj:`float`, `optional`, defaults to 0.02):
|
||||||
|
The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
|
||||||
|
layer_norm_eps (:obj:`float`, `optional`, defaults to 1e-5):
|
||||||
|
The epsilon used by the layer normalization layers.
|
||||||
|
feat_extract_norm (:obj:`str`, `optional`, defaults to :obj:`"group"`):
|
||||||
|
The norm to be applied to 1D convolutional layers in feature extractor. One of :obj:`"group"` for group
|
||||||
|
normalization of only the first 1D convolutional layer or :obj:`"layer"` for layer normalization of all 1D
|
||||||
|
convolutional layers.
|
||||||
|
feat_proj_dropout (:obj:`float`, `optional`, defaults to 0.0):
|
||||||
|
The dropout probability for output of the feature extractor.
|
||||||
|
feat_extract_activation (:obj:`str`, `optional`, defaults to :obj:`"gelu"`):
|
||||||
|
The non-linear activation function (function or string) in the 1D convolutional layers of the feature
|
||||||
|
extractor. If string, :obj:`"gelu"`, :obj:`"relu"`, :obj:`"selu"` and :obj:`"gelu_new"` are supported.
|
||||||
|
conv_dim (:obj:`Tuple[int]`, `optional`, defaults to :obj:`(64, 128, 128, 128, 128, 256, 256, 256, 256, 512, 512, 512, 512)`):
|
||||||
|
A tuple of integers defining the number of input and output channels of each 1D convolutional layer in the
|
||||||
|
feature extractor. The length of `conv_dim` defines the number of 1D convolutional layers.
|
||||||
|
conv_stride (:obj:`Tuple[int]`, `optional`, defaults to :obj:`(5, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1)`):
|
||||||
|
A tuple of integers defining the stride of each 1D convolutional layer in the feature extractor. The length
|
||||||
|
of `conv_stride` defines the number of convolutional layers and has to match the length of `conv_dim`.
|
||||||
|
conv_kernel (:obj:`Tuple[int]`, `optional`, defaults to :obj:`(10, 3, 1, 3, 1, 3, 1, 3, 1, 2, 1, 2, 1)`):
|
||||||
|
A tuple of integers defining the kernel size of each 1D convolutional layer in the feature extractor. The
|
||||||
|
length of `conv_kernel` defines the number of convolutional layers and has to match the length of
|
||||||
|
`conv_dim`.
|
||||||
|
conv_bias (:obj:`bool`, `optional`, defaults to :obj:`False`):
|
||||||
|
Whether the 1D convolutional layers have a bias.
|
||||||
|
num_conv_pos_embeddings (:obj:`int`, `optional`, defaults to 128):
|
||||||
|
Number of convolutional positional embeddings. Defines the kernel size of 1D convolutional positional
|
||||||
|
embeddings layer.
|
||||||
|
num_conv_pos_embedding_groups (:obj:`int`, `optional`, defaults to 16):
|
||||||
|
Number of groups of 1D convolutional positional embeddings layer.
|
||||||
|
apply_spec_augment (:obj:`bool`, `optional`, defaults to :obj:`True`):
|
||||||
|
Whether to apply *SpecAugment* data augmentation to the outputs of the feature extractor. For reference see
|
||||||
|
`SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition
|
||||||
|
<https://arxiv.org/abs/1904.08779>`__.
|
||||||
|
mask_time_prob (:obj:`float`, `optional`, defaults to 0.05):
|
||||||
|
Probability of each feature vector along the time axis to be chosen as the start of the vector span to be
|
||||||
|
masked. Approximately ``mask_time_prob * sequence_length // mask_time_length`` feature vectors will be
|
||||||
|
masked along the time axis. This is only relevant if ``apply_spec_augment is True``.
|
||||||
|
mask_time_length (:obj:`int`, `optional`, defaults to 10):
|
||||||
|
Length of vector span along the time axis.
|
||||||
|
mask_feature_prob (:obj:`float`, `optional`, defaults to 0.0):
|
||||||
|
Probability of each feature vector along the feature axis to be chosen as the start of the vector span to
|
||||||
|
be masked. Approximately ``mask_feature_prob * hidden_size // mask_feature_length`` feature vectors will be
|
||||||
|
masked along the feature axis. This is only relevant if ``apply_spec_augment is True``.
|
||||||
|
mask_feature_length (:obj:`int`, `optional`, defaults to 10):
|
||||||
|
Length of vector span along the feature axis.
|
||||||
|
diversity_loss_weight (:obj:`int`, `optional`, defaults to 0.1):
|
||||||
|
The weight of the codebook diversity loss component.
|
||||||
|
ctc_loss_reduction (:obj:`str`, `optional`, defaults to :obj:`"sum"`):
|
||||||
|
Specifies the reduction to apply to the output of ``torch.nn.CTCLoss``. Only relevant when training an
|
||||||
|
instance of :class:`~transformers.SEWDForCTC`.
|
||||||
|
ctc_zero_infinity (:obj:`bool`, `optional`, defaults to :obj:`False`):
|
||||||
|
Whether to zero infinite losses and the associated gradients of ``torch.nn.CTCLoss``. Infinite losses
|
||||||
|
mainly occur when the inputs are too short to be aligned to the targets. Only relevant when training an
|
||||||
|
instance of :class:`~transformers.SEWDForCTC`.
|
||||||
|
|
||||||
|
Example::
|
||||||
|
|
||||||
|
>>> from transformers import SEWDModel, SEWDConfig
|
||||||
|
|
||||||
|
>>> # Initializing a SEW-D asapp/sew-d-tiny-100k style configuration
|
||||||
|
>>> configuration = SEWDConfig()
|
||||||
|
|
||||||
|
>>> # Initializing a model from the asapp/sew-d-tiny-100k style configuration
|
||||||
|
>>> model = SEWDModel(configuration)
|
||||||
|
|
||||||
|
>>> # Accessing the model configuration
|
||||||
|
>>> configuration = model.config
|
||||||
|
"""
|
||||||
|
model_type = "sew-d"
|
||||||
|
|
||||||
|
def __init__(
|
||||||
|
self,
|
||||||
|
vocab_size=32,
|
||||||
|
hidden_size=768,
|
||||||
|
num_hidden_layers=12,
|
||||||
|
num_attention_heads=12,
|
||||||
|
intermediate_size=3072,
|
||||||
|
squeeze_factor=2,
|
||||||
|
max_position_embeddings=512,
|
||||||
|
position_buckets=256,
|
||||||
|
share_att_key=True,
|
||||||
|
relative_attention=True,
|
||||||
|
position_biased_input=False,
|
||||||
|
pos_att_type=("p2c", "c2p"),
|
||||||
|
norm_rel_ebd="layer_norm",
|
||||||
|
hidden_act="gelu",
|
||||||
|
hidden_dropout=0.1,
|
||||||
|
activation_dropout=0.1,
|
||||||
|
attention_dropout=0.1,
|
||||||
|
feat_proj_dropout=0.0,
|
||||||
|
final_dropout=0.1,
|
||||||
|
layerdrop=0.1,
|
||||||
|
initializer_range=0.02,
|
||||||
|
layer_norm_eps=1e-5,
|
||||||
|
feat_extract_norm="group",
|
||||||
|
feat_extract_activation="gelu",
|
||||||
|
conv_dim=(64, 128, 128, 128, 128, 256, 256, 256, 256, 512, 512, 512, 512),
|
||||||
|
conv_stride=(5, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1),
|
||||||
|
conv_kernel=(10, 3, 1, 3, 1, 3, 1, 3, 1, 2, 1, 2, 1),
|
||||||
|
conv_bias=False,
|
||||||
|
num_conv_pos_embeddings=128,
|
||||||
|
num_conv_pos_embedding_groups=16,
|
||||||
|
apply_spec_augment=True,
|
||||||
|
mask_time_prob=0.05,
|
||||||
|
mask_time_length=10,
|
||||||
|
mask_feature_prob=0.0,
|
||||||
|
mask_feature_length=10,
|
||||||
|
ctc_loss_reduction="sum",
|
||||||
|
ctc_zero_infinity=False,
|
||||||
|
pad_token_id=0,
|
||||||
|
bos_token_id=1,
|
||||||
|
eos_token_id=2,
|
||||||
|
**kwargs
|
||||||
|
):
|
||||||
|
super().__init__(**kwargs, pad_token_id=pad_token_id, bos_token_id=bos_token_id, eos_token_id=eos_token_id)
|
||||||
|
self.hidden_size = hidden_size
|
||||||
|
self.feat_extract_norm = feat_extract_norm
|
||||||
|
self.feat_extract_activation = feat_extract_activation
|
||||||
|
self.conv_dim = list(conv_dim)
|
||||||
|
self.conv_stride = list(conv_stride)
|
||||||
|
self.conv_kernel = list(conv_kernel)
|
||||||
|
self.conv_bias = conv_bias
|
||||||
|
self.num_conv_pos_embeddings = num_conv_pos_embeddings
|
||||||
|
self.num_conv_pos_embedding_groups = num_conv_pos_embedding_groups
|
||||||
|
self.num_feat_extract_layers = len(self.conv_dim)
|
||||||
|
self.num_hidden_layers = num_hidden_layers
|
||||||
|
self.intermediate_size = intermediate_size
|
||||||
|
self.squeeze_factor = squeeze_factor
|
||||||
|
self.max_position_embeddings = max_position_embeddings
|
||||||
|
self.position_buckets = position_buckets
|
||||||
|
self.share_att_key = share_att_key
|
||||||
|
self.relative_attention = relative_attention
|
||||||
|
self.norm_rel_ebd = norm_rel_ebd
|
||||||
|
self.position_biased_input = position_biased_input
|
||||||
|
self.pos_att_type = list(pos_att_type)
|
||||||
|
self.hidden_act = hidden_act
|
||||||
|
self.num_attention_heads = num_attention_heads
|
||||||
|
self.hidden_dropout = hidden_dropout
|
||||||
|
self.attention_dropout = attention_dropout
|
||||||
|
self.activation_dropout = activation_dropout
|
||||||
|
self.feat_proj_dropout = feat_proj_dropout
|
||||||
|
self.final_dropout = final_dropout
|
||||||
|
self.layerdrop = layerdrop
|
||||||
|
self.layer_norm_eps = layer_norm_eps
|
||||||
|
self.initializer_range = initializer_range
|
||||||
|
self.vocab_size = vocab_size
|
||||||
|
|
||||||
|
if (
|
||||||
|
(len(self.conv_stride) != self.num_feat_extract_layers)
|
||||||
|
or (len(self.conv_kernel) != self.num_feat_extract_layers)
|
||||||
|
or (len(self.conv_dim) != self.num_feat_extract_layers)
|
||||||
|
):
|
||||||
|
raise ValueError(
|
||||||
|
"Configuration for convolutional layers is incorrect. "
|
||||||
|
"It is required that `len(config.conv_dim)` == `len(config.conv_stride)` == `len(config.conv_kernel)`, "
|
||||||
|
f"but is `len(config.conv_dim) = {len(self.conv_dim)}`, `len(config.conv_stride) "
|
||||||
|
f"= {len(self.conv_stride)}`, `len(config.conv_kernel) = {len(self.conv_kernel)}`."
|
||||||
|
)
|
||||||
|
|
||||||
|
# fine-tuning config parameters for SpecAugment: https://arxiv.org/abs/1904.08779
|
||||||
|
self.apply_spec_augment = apply_spec_augment
|
||||||
|
self.mask_time_prob = mask_time_prob
|
||||||
|
self.mask_time_length = mask_time_length
|
||||||
|
self.mask_feature_prob = mask_feature_prob
|
||||||
|
self.mask_feature_length = mask_feature_length
|
||||||
|
|
||||||
|
# ctc loss
|
||||||
|
self.ctc_loss_reduction = ctc_loss_reduction
|
||||||
|
self.ctc_zero_infinity = ctc_zero_infinity
|
||||||
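A small sketch of what the default `conv_stride` and `conv_kernel` tuples in `SEWDConfig` imply for sequence length. It assumes each 1D convolution in the feature extractor is unpadded with dilation 1, so each layer maps a length `L` to `(L - kernel) // stride + 1`; that assumption is not stated in the configuration file itself. The function names are hypothetical, and the further reduction by `squeeze_factor` inside the encoder is not modeled.

```python
import math

# Defaults copied from the SEWDConfig signature.
CONV_STRIDE = (5, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1)
CONV_KERNEL = (10, 3, 1, 3, 1, 3, 1, 3, 1, 2, 1, 2, 1)


def total_stride(strides):
    """Overall downsampling factor of the stacked convolutions."""
    return math.prod(strides)


def feat_extract_output_length(num_samples, kernels, strides):
    """Number of frames after the conv stack (assumes no padding, dilation 1)."""
    if len(kernels) != len(strides):
        # Same consistency requirement that SEWDConfig enforces on its tuples.
        raise ValueError("kernels and strides must have the same length")
    length = num_samples
    for kernel, stride in zip(kernels, strides):
        length = (length - kernel) // stride + 1
    return length


# One second of 16 kHz audio.
print(total_stride(CONV_STRIDE))
print(feat_extract_output_length(16000, CONV_KERNEL, CONV_STRIDE))
```

With the defaults this gives a total stride of 320 samples, i.e. 49 frames for one second of 16 kHz audio under the stated assumption.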
@@ -0,0 +1,298 @@
|
|||||||
|
# coding=utf-8
|
||||||
|
# Copyright 2021 The HuggingFace Inc. team.
|
||||||
|
#
|
||||||
|
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||||
|
# you may not use this file except in compliance with the License.
|
||||||
|
# You may obtain a copy of the License at
|
||||||
|
#
|
||||||
|
# http://www.apache.org/licenses/LICENSE-2.0
|
||||||
|
#
|
||||||
|
# Unless required by applicable law or agreed to in writing, software
|
||||||
|
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||||
|
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||||
|
# See the License for the specific language governing permissions and
|
||||||
|
# limitations under the License.
|
||||||
|
"""Convert SEW checkpoint."""
|
||||||
|
|
||||||
|
|
||||||
|
import argparse
|
||||||
|
import json
|
||||||
|
import os
|
||||||
|
|
||||||
|
import fairseq
|
||||||
|
import torch
|
||||||
|
from fairseq.data import Dictionary
|
||||||
|
|
||||||
|
# Register SEW's fairseq modules
|
||||||
|
from sew_asapp import tasks # noqa: F401
|
||||||
|
from transformers import (
|
||||||
|
SEWDConfig,
|
||||||
|
SEWDForCTC,
|
||||||
|
SEWDModel,
|
||||||
|
Wav2Vec2CTCTokenizer,
|
||||||
|
Wav2Vec2FeatureExtractor,
|
||||||
|
Wav2Vec2Processor,
|
||||||
|
logging,
|
||||||
|
)
|
||||||
|
|
||||||
|
|
||||||
|
logging.set_verbosity_info()
|
||||||
|
logger = logging.get_logger(__name__)
|
||||||
|
|
||||||
|
MAPPING = {
|
||||||
|
"post_extract_proj": "feature_projection",
|
||||||
|
"encoder.pos_conv.0": "encoder.pos_conv_embed.conv",
|
||||||
|
"attention.self.query_proj": "encoder.encoder.layer.*.attention.self.query_proj",
|
||||||
|
"attention.self.key_proj": "encoder.encoder.layer.*.attention.self.key_proj",
|
||||||
|
"attention.self.value_proj": "encoder.encoder.layer.*.attention.self.value_proj",
|
||||||
|
"attention.output.dense": "encoder.encoder.layer.*.attention.output.dense",
|
||||||
|
"attention.output.LayerNorm": "encoder.encoder.layer.*.attention.output.LayerNorm",
|
||||||
|
"intermediate.dense": "encoder.encoder.layer.*.intermediate.dense",
|
||||||
|
"output.dense": "encoder.encoder.layer.*.output.dense",
|
||||||
|
"output.LayerNorm": "encoder.encoder.layer.*.output.LayerNorm",
|
||||||
|
"encoder.encoder.rel_embeddings": "encoder.encoder.rel_embeddings",
|
||||||
|
"encoder.encoder.LayerNorm": "encoder.encoder.LayerNorm",
|
||||||
|
"encoder.upsample.0": "encoder.upsample.projection",
|
||||||
|
"encoder.layer_norm": "encoder.layer_norm",
|
||||||
|
"w2v_encoder.layer_norm": "layer_norm",
|
||||||
|
"w2v_encoder.proj": "lm_head",
|
||||||
|
"mask_emb": "masked_spec_embed",
|
||||||
|
}
|
||||||
|
|
||||||
|
|
||||||
|
def set_recursively(hf_pointer, key, value, full_name, weight_type):
|
||||||
|
for attribute in key.split("."):
|
||||||
|
hf_pointer = getattr(hf_pointer, attribute)
|
||||||
|
|
||||||
|
if weight_type is not None:
|
||||||
|
hf_shape = getattr(hf_pointer, weight_type).shape
|
||||||
|
else:
|
||||||
|
hf_shape = hf_pointer.shape
|
||||||
|
|
||||||
|
assert (
|
||||||
|
hf_shape == value.shape
|
||||||
|
), f"Shape of hf {key + '.' + weight_type if weight_type is not None else ''} is {hf_shape}, but should be {value.shape} for {full_name}"
|
||||||
|
|
||||||
|
if weight_type == "weight":
|
||||||
|
hf_pointer.weight.data = value
|
||||||
|
elif weight_type == "weight_g":
|
||||||
|
hf_pointer.weight_g.data = value
|
||||||
|
elif weight_type == "weight_v":
|
||||||
|
hf_pointer.weight_v.data = value
|
||||||
|
elif weight_type == "bias":
|
||||||
|
hf_pointer.bias.data = value
|
||||||
|
else:
|
||||||
|
hf_pointer.data = value
|
||||||
|
|
||||||
|
logger.info(f"{key + '.' + weight_type if weight_type is not None else ''} was initialized from {full_name}.")
|
||||||
|
|
||||||
|
|
||||||
|
def recursively_load_weights(fairseq_model, hf_model, is_finetuned):
|
||||||
|
unused_weights = []
|
||||||
|
fairseq_dict = fairseq_model.state_dict()
|
||||||
|
|
||||||
|
feature_extractor = hf_model.sew_d.feature_extractor if is_finetuned else hf_model.feature_extractor
|
||||||
|
|
||||||
|
for name, value in fairseq_dict.items():
|
||||||
|
is_used = False
|
||||||
|
if "conv_layers" in name:
|
||||||
|
load_conv_layer(
|
||||||
|
name,
|
||||||
|
value,
|
||||||
|
feature_extractor,
|
||||||
|
unused_weights,
|
||||||
|
hf_model.config.feat_extract_norm == "group",
|
||||||
|
)
|
||||||
|
is_used = True
|
||||||
|
else:
|
||||||
|
for key, mapped_key in MAPPING.items():
|
||||||
|
mapped_key = "sew_d." + mapped_key if (is_finetuned and mapped_key != "lm_head") else mapped_key
|
||||||
|
|
||||||
|
if key in name or key.split("w2v_encoder.")[-1] == name.split(".")[0]:
|
||||||
|
is_used = True
|
||||||
|
if "*" in mapped_key:
|
||||||
|
layer_index = name.split(key)[0].split(".")[-2]
|
||||||
|
if not layer_index.isnumeric():
|
||||||
|
continue
|
||||||
|
mapped_key = mapped_key.replace("*", layer_index)
|
||||||
|
if "weight_g" in name:
|
||||||
|
weight_type = "weight_g"
|
||||||
|
elif "weight_v" in name:
|
||||||
|
weight_type = "weight_v"
|
||||||
|
elif "weight" in name:
|
||||||
|
weight_type = "weight"
|
||||||
|
elif "bias" in name:
|
||||||
|
weight_type = "bias"
|
||||||
|
else:
|
||||||
|
weight_type = None
|
||||||
|
set_recursively(hf_model, mapped_key, value, name, weight_type)
|
||||||
|
continue
|
||||||
|
if not is_used:
|
||||||
|
unused_weights.append(name)
|
||||||
|
|
||||||
|
logger.warning(f"Unused weights: {unused_weights}")
|
||||||
|
|
||||||
|
|
||||||
|
def load_conv_layer(full_name, value, feature_extractor, unused_weights, use_group_norm):
|
||||||
|
name = full_name.split("conv_layers.")[-1]
|
||||||
|
items = name.split(".")
|
||||||
|
layer_id = int(items[0])
|
||||||
|
type_id = int(items[1])
|
||||||
|
|
||||||
|
if type_id == 0:
|
||||||
|
if "bias" in name:
|
||||||
|
assert (
|
||||||
|
value.shape == feature_extractor.conv_layers[layer_id].conv.bias.data.shape
|
||||||
|
), f"{full_name} has size {value.shape}, but {feature_extractor.conv_layers[layer_id].conv.bias.data.shape} was found."
|
||||||
|
feature_extractor.conv_layers[layer_id].conv.bias.data = value
|
||||||
|
logger.info(f"Feat extract conv layer {layer_id} was initialized from {full_name}.")
|
||||||
|
elif "weight" in name:
|
||||||
|
assert (
|
||||||
|
value.shape == feature_extractor.conv_layers[layer_id].conv.weight.data.shape
|
||||||
|
), f"{full_name} has size {value.shape}, but {feature_extractor.conv_layers[layer_id].conv.weight.data.shape} was found."
|
||||||
|
feature_extractor.conv_layers[layer_id].conv.weight.data = value
|
||||||
|
logger.info(f"Feat extract conv layer {layer_id} was initialized from {full_name}.")
|
||||||
|
elif (type_id == 2 and not use_group_norm) or (type_id == 2 and layer_id == 0 and use_group_norm):
|
||||||
|
if "bias" in name:
|
||||||
|
assert (
|
||||||
|
value.shape == feature_extractor.conv_layers[layer_id].layer_norm.bias.data.shape
|
||||||
|
), f"{full_name} has size {value.shape}, but {feature_extractor.conv_layers[layer_id].layer_norm.bias.data.shape} was found."
|
||||||
|
feature_extractor.conv_layers[layer_id].layer_norm.bias.data = value
|
||||||
|
logger.info(f"Feat extract layer norm bias of layer {layer_id} was initialized from {full_name}.")
|
||||||
|
elif "weight" in name:
|
||||||
|
assert (
|
||||||
|
value.shape == feature_extractor.conv_layers[layer_id].layer_norm.weight.data.shape
|
||||||
|
), f"{full_name} has size {value.shape}, but {feature_extractor.conv_layers[layer_id].layer_norm.weight.data.shape} was found."
|
||||||
|
feature_extractor.conv_layers[layer_id].layer_norm.weight.data = value
|
||||||
|
logger.info(f"Feat extract layer norm weight of layer {layer_id} was initialized from {full_name}.")
|
||||||
|
else:
|
||||||
|
unused_weights.append(full_name)
|
||||||
|
|
||||||
|
|
||||||
|
def convert_config(model):
|
||||||
|
config = SEWDConfig()
|
||||||
|
fs_config = model.cfg
|
||||||
|
|
||||||
|
config.activation_dropout = fs_config.activation_dropout
|
||||||
|
config.apply_spec_augment = fs_config.mask_prob > 0 or fs_config.mask_channel_prob > 0
|
||||||
|
config.attention_dropout = fs_config.attention_dropout
|
||||||
|
config.conv_bias = fs_config.conv_bias
|
||||||
|
conv_layers = eval(fs_config.conv_feature_layers)
|
||||||
|
config.conv_dim = [x[0] for x in conv_layers]
|
||||||
|
config.conv_kernel = [x[1] for x in conv_layers]
|
||||||
|
config.conv_stride = [x[2] for x in conv_layers]
|
||||||
|
config.feat_extract_activation = "gelu"
|
||||||
|
config.feat_extract_norm = "layer" if fs_config.extractor_mode == "layer_norm" else "group"
|
||||||
|
config.feat_proj_dropout = fs_config.dropout_input
|
||||||
|
config.final_dropout = 0.0
|
||||||
|
config.hidden_act = fs_config.activation_fn.name
|
||||||
|
config.hidden_dropout = fs_config.dropout
|
||||||
|
config.hidden_size = fs_config.encoder_embed_dim
|
||||||
|
config.initializer_range = 0.02
|
||||||
|
config.intermediate_size = fs_config.encoder_ffn_embed_dim
|
||||||
|
config.layer_norm_eps = 1e-5
|
||||||
|
config.layerdrop = fs_config.encoder_layerdrop
|
||||||
|
config.mask_feature_length = fs_config.mask_channel_length
|
||||||
|
config.mask_feature_prob = fs_config.mask_channel_prob
|
||||||
|
config.mask_time_length = fs_config.mask_length
|
||||||
|
config.mask_time_prob = fs_config.mask_prob
|
||||||
|
config.num_attention_heads = fs_config.encoder_attention_heads
|
||||||
|
config.num_conv_pos_embedding_groups = fs_config.conv_pos_groups
|
||||||
|
config.num_conv_pos_embeddings = fs_config.conv_pos
|
||||||
|
config.num_feat_extract_layers = len(conv_layers)
|
||||||
|
config.num_hidden_layers = fs_config.encoder_layers
|
||||||
|
config.squeeze_factor = fs_config.squeeze_factor
|
||||||
|
# DeBERTa-specific parameters:
|
||||||
|
config.max_position_embeddings = fs_config.max_position_embeddings
|
||||||
|
config.position_buckets = fs_config.position_buckets
|
||||||
|
config.share_att_key = fs_config.share_att_key
|
||||||
|
config.relative_attention = fs_config.relative_attention
|
||||||
|
config.position_biased_input = fs_config.position_biased_input
|
||||||
|
config.pos_att_type = tuple(fs_config.pos_att_type.split("|"))
|
||||||
|
config.norm_rel_ebd = fs_config.norm_rel_ebd
|
||||||
|
|
||||||
|
return config
|
||||||
|
|
||||||
|
|
||||||
|
@torch.no_grad()
|
||||||
|
def convert_sew_checkpoint(
|
||||||
|
checkpoint_path, pytorch_dump_folder_path, config_path=None, dict_path=None, is_finetuned=True
|
||||||
|
):
|
||||||
|
"""
|
||||||
|
Copy/paste/tweak model's weights to transformers design.
|
||||||
|
"""
|
||||||
|
|
||||||
|
if is_finetuned:
|
||||||
|
model, _, _ = fairseq.checkpoint_utils.load_model_ensemble_and_task(
|
||||||
|
[checkpoint_path], arg_overrides={"data": "/".join(dict_path.split("/")[:-1])}
|
||||||
|
)
|
||||||
|
else:
|
||||||
|
model, _, _ = fairseq.checkpoint_utils.load_model_ensemble_and_task([checkpoint_path])
|
||||||
|
|
||||||
|
if config_path is not None:
|
||||||
|
config = SEWDConfig.from_pretrained(config_path)
|
||||||
|
else:
|
||||||
|
config = convert_config(model[0])
|
||||||
|
model = model[0].eval()
|
||||||
|
|
||||||
|
return_attention_mask = config.feat_extract_norm == "layer"
|
||||||
|
feature_extractor = Wav2Vec2FeatureExtractor(
|
||||||
|
feature_size=1,
|
||||||
|
sampling_rate=16000,
|
||||||
|
padding_value=0,
|
||||||
|
do_normalize=True,
|
||||||
|
return_attention_mask=return_attention_mask,
|
||||||
|
)
|
||||||
|
|
||||||
|
if is_finetuned:
|
||||||
|
if dict_path:
|
||||||
|
target_dict = Dictionary.load(dict_path)
|
||||||
|
|
||||||
|
# important: swap the bos & pad token ids, since the CTC symbol is <pad> and
|
||||||
|
# not <s> as in fairseq
|
||||||
|
config.bos_token_id = target_dict.pad_index
|
||||||
|
config.pad_token_id = target_dict.bos_index
|
||||||
|
config.eos_token_id = target_dict.eos_index
|
||||||
|
config.vocab_size = len(target_dict.symbols)
|
||||||
|
vocab_path = os.path.join(pytorch_dump_folder_path, "vocab.json")
|
||||||
|
if not os.path.isdir(pytorch_dump_folder_path):
|
||||||
|
logger.error(f"--pytorch_dump_folder_path ({pytorch_dump_folder_path}) should be a directory")
|
||||||
|
return
|
||||||
|
os.makedirs(pytorch_dump_folder_path, exist_ok=True)
|
||||||
|
with open(vocab_path, "w", encoding="utf-8") as vocab_handle:
|
||||||
|
json.dump(target_dict.indices, vocab_handle)
|
||||||
|
tokenizer = Wav2Vec2CTCTokenizer(
|
||||||
|
vocab_path,
|
||||||
|
unk_token=target_dict.unk_word,
|
||||||
|
pad_token=target_dict.pad_word,
|
||||||
|
bos_token=target_dict.bos_word,
|
||||||
|
eos_token=target_dict.eos_word,
|
||||||
|
word_delimiter_token="|",
|
||||||
|
do_lower_case=False,
|
||||||
|
)
|
||||||
|
processor = Wav2Vec2Processor(feature_extractor=feature_extractor, tokenizer=tokenizer)
|
||||||
|
processor.save_pretrained(pytorch_dump_folder_path)
|
||||||
|
|
||||||
|
hf_model = SEWDForCTC(config)
|
||||||
|
else:
|
||||||
|
hf_model = SEWDModel(config)
|
||||||
|
feature_extractor.save_pretrained(pytorch_dump_folder_path)
|
||||||
|
|
||||||
|
recursively_load_weights(model, hf_model, is_finetuned)
|
||||||
|
|
||||||
|
hf_model.save_pretrained(pytorch_dump_folder_path)
|
||||||
|
|
||||||
|
|
||||||
|
if __name__ == "__main__":
|
||||||
|
parser = argparse.ArgumentParser()
|
||||||
|
parser.add_argument("--pytorch_dump_folder_path", default=None, type=str, help="Path to the output PyTorch model.")
|
||||||
|
parser.add_argument("--checkpoint_path", default=None, type=str, help="Path to fairseq checkpoint")
|
||||||
|
parser.add_argument("--dict_path", default=None, type=str, help="Path to dict of fine-tuned model")
|
||||||
|
parser.add_argument("--config_path", default=None, type=str, help="Path to hf config.json of model to convert")
|
||||||
|
parser.add_argument(
|
||||||
|
"--is_finetuned", action="store_true", help="Whether the model to convert is a fine-tuned model or not"
|
||||||
|
)
|
||||||
|
args = parser.parse_args()
|
||||||
|
convert_sew_checkpoint(
|
||||||
|
args.checkpoint_path, args.pytorch_dump_folder_path, args.config_path, args.dict_path, args.is_finetuned
|
||||||
|
)
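`convert_config` above reads `fs_config.conv_feature_layers` as a string and evaluates it into a list of `(dim, kernel, stride)` tuples. The following is a minimal sketch of that step in isolation. The helper name `split_conv_layers` and the example spec string are hypothetical and are not taken from a released SEW checkpoint.

```python
from typing import List, Tuple


def split_conv_layers(spec: str) -> Tuple[List[int], List[int], List[int]]:
    """Split a conv layer spec string into per-layer dims, kernels and strides."""
    # The spec is a Python expression (it may use list `+` and `*`), so it is
    # evaluated the same way the script does. Only do this for trusted checkpoints.
    conv_layers = eval(spec)
    conv_dim = [layer[0] for layer in conv_layers]
    conv_kernel = [layer[1] for layer in conv_layers]
    conv_stride = [layer[2] for layer in conv_layers]
    return conv_dim, conv_kernel, conv_stride


# Hypothetical spec string, used only for illustration.
dims, kernels, strides = split_conv_layers("[(64, 10, 5)] + [(32, 3, 2)] * 2")
```

Because the spec can contain list arithmetic, a plain literal parser would presumably reject it, which may be why the script relies on `eval`.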
|
||||||
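The wildcard handling in `recursively_load_weights` can also be looked at on its own. This is a sketch under assumptions, not the script's code: the helper names `resolve_mapped_key` and `detect_weight_type` and the example parameter names are hypothetical. It shows how the layer index is read from the fairseq parameter name and substituted for `*`, and why `weight_g` and `weight_v` have to be checked before the plain `weight` suffix.

```python
from typing import Optional


def resolve_mapped_key(name: str, key: str, mapped_key: str) -> Optional[str]:
    """Fill the `*` placeholder in `mapped_key` with the layer index found in `name`.

    Returns None when the component in front of `key` is not numeric,
    mirroring the `continue` in the conversion loop.
    """
    if "*" not in mapped_key:
        return mapped_key
    # The text before `key` looks like "encoder.layers.3."; the index is the
    # second-to-last dot-separated piece (the last piece is an empty string).
    layer_index = name.split(key)[0].split(".")[-2]
    if not layer_index.isnumeric():
        return None
    return mapped_key.replace("*", layer_index)


def detect_weight_type(name: str) -> Optional[str]:
    """Return the parameter kind; longer suffixes are tested first because
    "weight" is a substring of "weight_g" and "weight_v"."""
    for suffix in ("weight_g", "weight_v", "weight", "bias"):
        if suffix in name:
            return suffix
    return None
```

For example, a hypothetical fairseq name `encoder.layers.3.self_attn.k_proj.weight` with key `self_attn.k_proj` yields layer index `3` and weight type `weight`.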
1540
src/transformers/models/sew_d/modeling_sew_d.py
Normal file
File diff suppressed because it is too large
@@ -256,7 +256,7 @@ def _sample_negative_indices(
|
|||||||
class Wav2Vec2NoLayerNormConvLayer(nn.Module):
|
class Wav2Vec2NoLayerNormConvLayer(nn.Module):
|
||||||
def __init__(self, config, layer_id=0):
|
def __init__(self, config, layer_id=0):
|
||||||
super().__init__()
|
super().__init__()
|
||||||
self.in_conv_dim = config.conv_dim[layer_id] if layer_id > 0 else 1
|
self.in_conv_dim = config.conv_dim[layer_id - 1] if layer_id > 0 else 1
|
||||||
self.out_conv_dim = config.conv_dim[layer_id]
|
self.out_conv_dim = config.conv_dim[layer_id]
|
||||||
|
|
||||||
self.conv = nn.Conv1d(
|
self.conv = nn.Conv1d(
|
||||||
@@ -277,7 +277,7 @@ class Wav2Vec2NoLayerNormConvLayer(nn.Module):
|
|||||||
class Wav2Vec2LayerNormConvLayer(nn.Module):
|
class Wav2Vec2LayerNormConvLayer(nn.Module):
|
||||||
def __init__(self, config, layer_id=0):
|
def __init__(self, config, layer_id=0):
|
||||||
super().__init__()
|
super().__init__()
|
||||||
self.in_conv_dim = config.conv_dim[layer_id] if layer_id > 0 else 1
|
self.in_conv_dim = config.conv_dim[layer_id - 1] if layer_id > 0 else 1
|
||||||
self.out_conv_dim = config.conv_dim[layer_id]
|
self.out_conv_dim = config.conv_dim[layer_id]
|
||||||
|
|
||||||
self.conv = nn.Conv1d(
|
self.conv = nn.Conv1d(
|
||||||
@@ -304,7 +304,7 @@ class Wav2Vec2LayerNormConvLayer(nn.Module):
|
|||||||
class Wav2Vec2GroupNormConvLayer(nn.Module):
|
class Wav2Vec2GroupNormConvLayer(nn.Module):
|
||||||
def __init__(self, config, layer_id=0):
|
def __init__(self, config, layer_id=0):
|
||||||
super().__init__()
|
super().__init__()
|
||||||
self.in_conv_dim = config.conv_dim[layer_id] if layer_id > 0 else 1
|
self.in_conv_dim = config.conv_dim[layer_id - 1] if layer_id > 0 else 1
|
||||||
self.out_conv_dim = config.conv_dim[layer_id]
|
self.out_conv_dim = config.conv_dim[layer_id]
|
||||||
|
|
||||||
self.conv = nn.Conv1d(
|
self.conv = nn.Conv1d(
|
||||||
|
|||||||
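The three Wav2Vec2 hunks above change the input channel count of each feature-extractor conv layer from `config.conv_dim[layer_id]` to `config.conv_dim[layer_id - 1]`. A minimal sketch of why this matters, in plain Python and using the `conv_dim=(64, 32, 32)` values from the SEW test configuration in this commit. The helper names `conv_channel_pairs` and `is_consistent` are hypothetical.

```python
from typing import List, Sequence, Tuple


def conv_channel_pairs(conv_dim: Sequence[int], fixed: bool = True) -> List[Tuple[int, int]]:
    """Return (in_channels, out_channels) per conv layer.

    fixed=True uses conv_dim[layer_id - 1] as input size (new behaviour),
    fixed=False uses conv_dim[layer_id] (old behaviour).
    """
    pairs = []
    for layer_id, out_dim in enumerate(conv_dim):
        if layer_id == 0:
            in_dim = 1  # the raw waveform has a single channel
        elif fixed:
            in_dim = conv_dim[layer_id - 1]
        else:
            in_dim = conv_dim[layer_id]
        pairs.append((in_dim, out_dim))
    return pairs


def is_consistent(pairs: List[Tuple[int, int]]) -> bool:
    """Each layer must accept exactly what the previous layer emits."""
    return all(pairs[i][1] == pairs[i + 1][0] for i in range(len(pairs) - 1))
```

When every entry of `conv_dim` is equal, both expressions give the same result, so the difference only shows up for configurations with varying channel sizes such as the one above.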
@@ -3281,6 +3281,58 @@ def load_tf_weights_in_roformer(*args, **kwargs):
|
|||||||
requires_backends(load_tf_weights_in_roformer, ["torch"])
|
requires_backends(load_tf_weights_in_roformer, ["torch"])
|
||||||
|
|
||||||
|
|
||||||
|
SEW_PRETRAINED_MODEL_ARCHIVE_LIST = None
|
||||||
|
|
||||||
|
|
||||||
|
class SEWForCTC:
|
||||||
|
def __init__(self, *args, **kwargs):
|
||||||
|
requires_backends(self, ["torch"])
|
||||||
|
|
||||||
|
|
||||||
|
class SEWModel:
|
||||||
|
def __init__(self, *args, **kwargs):
|
||||||
|
requires_backends(self, ["torch"])
|
||||||
|
|
||||||
|
@classmethod
|
||||||
|
def from_pretrained(cls, *args, **kwargs):
|
||||||
|
requires_backends(cls, ["torch"])
|
||||||
|
|
||||||
|
|
||||||
|
class SEWPreTrainedModel:
|
||||||
|
def __init__(self, *args, **kwargs):
|
||||||
|
requires_backends(self, ["torch"])
|
||||||
|
|
||||||
|
@classmethod
|
||||||
|
def from_pretrained(cls, *args, **kwargs):
|
||||||
|
requires_backends(cls, ["torch"])
|
||||||
|
|
||||||
|
|
||||||
|
SEW_D_PRETRAINED_MODEL_ARCHIVE_LIST = None
|
||||||
|
|
||||||
|
|
||||||
|
class SEWDForCTC:
|
||||||
|
def __init__(self, *args, **kwargs):
|
||||||
|
requires_backends(self, ["torch"])
|
||||||
|
|
||||||
|
|
||||||
|
class SEWDModel:
|
||||||
|
def __init__(self, *args, **kwargs):
|
||||||
|
requires_backends(self, ["torch"])
|
||||||
|
|
||||||
|
@classmethod
|
||||||
|
def from_pretrained(cls, *args, **kwargs):
|
||||||
|
requires_backends(cls, ["torch"])
|
||||||
|
|
||||||
|
|
||||||
|
class SEWDPreTrainedModel:
|
||||||
|
def __init__(self, *args, **kwargs):
|
||||||
|
requires_backends(self, ["torch"])
|
||||||
|
|
||||||
|
@classmethod
|
||||||
|
def from_pretrained(cls, *args, **kwargs):
|
||||||
|
requires_backends(cls, ["torch"])
|
||||||
|
|
||||||
|
|
||||||
class SpeechEncoderDecoderModel:
|
class SpeechEncoderDecoderModel:
|
||||||
def __init__(self, *args, **kwargs):
|
def __init__(self, *args, **kwargs):
|
||||||
requires_backends(self, ["torch"])
|
requires_backends(self, ["torch"])
|
||||||
|
|||||||
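The dummy classes added above all follow one pattern: the class can be imported without PyTorch, and the missing backend is only reported when the class is instantiated or `from_pretrained` is called. The following sketch illustrates that idea with a stand-in function. `requires_backends_sketch` and `DummyModel` are hypothetical names, and the body is an assumption about the general approach, not the library's implementation.

```python
import importlib.util


def requires_backends_sketch(obj, backends):
    """Raise ImportError if any of the named top-level modules cannot be found."""
    name = obj.__name__ if hasattr(obj, "__name__") else obj.__class__.__name__
    missing = [b for b in backends if importlib.util.find_spec(b) is None]
    if missing:
        raise ImportError(f"{name} requires the following backends: {', '.join(missing)}")


class DummyModel:
    # Importing this class always works; using it checks for the backend.
    def __init__(self, *args, **kwargs):
        requires_backends_sketch(self, ["surely_not_an_installed_backend"])

    @classmethod
    def from_pretrained(cls, *args, **kwargs):
        requires_backends_sketch(cls, ["surely_not_an_installed_backend"])
```

The error is deferred to first use, so code that merely imports the name keeps working in environments without the backend.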
503
tests/test_modeling_sew.py
Normal file
@@ -0,0 +1,503 @@
|
|||||||
|
# coding=utf-8
|
||||||
|
# Copyright 2021 The HuggingFace Inc. team. All rights reserved.
|
||||||
|
#
|
||||||
|
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||||
|
# you may not use this file except in compliance with the License.
|
||||||
|
# You may obtain a copy of the License at
|
||||||
|
#
|
||||||
|
# http://www.apache.org/licenses/LICENSE-2.0
|
||||||
|
#
|
||||||
|
# Unless required by applicable law or agreed to in writing, software
|
||||||
|
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||||
|
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||||
|
# See the License for the specific language governing permissions and
|
||||||
|
# limitations under the License.
|
||||||
|
""" Testing suite for the PyTorch SEW model. """
|
||||||
|
|
||||||
|
|
||||||
|
import math
|
||||||
|
import unittest
|
||||||
|
|
||||||
|
import pytest
|
||||||
|
|
||||||
|
from tests.test_modeling_common import floats_tensor, ids_tensor, random_attention_mask
|
||||||
|
from transformers import SEWConfig, is_torch_available
|
||||||
|
from transformers.testing_utils import require_datasets, require_soundfile, require_torch, slow, tooslow, torch_device
|
||||||
|
|
||||||
|
from .test_configuration_common import ConfigTester
|
||||||
|
from .test_modeling_common import ModelTesterMixin, _config_zero_init
|
||||||
|
|
||||||
|
|
||||||
|
if is_torch_available():
|
||||||
|
import torch
|
||||||
|
|
||||||
|
from transformers import SEWForCTC, SEWModel, Wav2Vec2FeatureExtractor, Wav2Vec2Processor
|
||||||
|
from transformers.models.hubert.modeling_hubert import _compute_mask_indices
|
||||||
|
|
||||||
|
|
||||||
|
class SEWModelTester:
|
||||||
|
def __init__(
|
||||||
|
self,
|
||||||
|
parent,
|
||||||
|
batch_size=13,
|
||||||
|
seq_length=1024, # speech is longer
|
||||||
|
is_training=False,
|
||||||
|
hidden_size=32,
|
||||||
|
feat_extract_norm="group",
|
||||||
|
feat_extract_dropout=0.0,
|
||||||
|
feat_extract_activation="gelu",
|
||||||
|
conv_dim=(64, 32, 32),
|
||||||
|
conv_stride=(5, 2, 1),
|
||||||
|
conv_kernel=(10, 3, 1),
|
||||||
|
conv_bias=False,
|
||||||
|
num_conv_pos_embeddings=31,
|
||||||
|
num_conv_pos_embedding_groups=2,
|
||||||
|
squeeze_factor=2,
|
||||||
|
num_hidden_layers=4,
|
||||||
|
num_attention_heads=2,
|
||||||
|
hidden_dropout=0.1,
|
||||||
|
intermediate_size=20,
|
||||||
|
layer_norm_eps=1e-5,
|
||||||
|
hidden_act="gelu",
|
||||||
|
initializer_range=0.02,
|
||||||
|
vocab_size=32,
|
||||||
|
do_stable_layer_norm=False,
|
||||||
|
scope=None,
|
||||||
|
):
|
||||||
|
self.parent = parent
|
||||||
|
self.batch_size = batch_size
|
||||||
|
self.seq_length = seq_length
|
||||||
|
self.is_training = is_training
|
||||||
|
self.hidden_size = hidden_size
|
||||||
|
self.feat_extract_norm = feat_extract_norm
|
||||||
|
self.feat_extract_dropout = feat_extract_dropout
|
||||||
|
self.feat_extract_activation = feat_extract_activation
|
||||||
|
self.conv_dim = conv_dim
|
||||||
|
self.conv_stride = conv_stride
|
||||||
|
self.conv_kernel = conv_kernel
|
||||||
|
self.conv_bias = conv_bias
|
||||||
|
self.num_conv_pos_embeddings = num_conv_pos_embeddings
|
||||||
|
self.num_conv_pos_embedding_groups = num_conv_pos_embedding_groups
|
||||||
|
self.squeeze_factor = squeeze_factor
|
||||||
|
self.num_hidden_layers = num_hidden_layers
|
||||||
|
self.num_attention_heads = num_attention_heads
|
||||||
|
self.hidden_dropout = hidden_dropout
|
||||||
|
self.intermediate_size = intermediate_size
|
||||||
|
self.layer_norm_eps = layer_norm_eps
|
||||||
|
self.hidden_act = hidden_act
|
||||||
|
self.initializer_range = initializer_range
|
||||||
|
self.vocab_size = vocab_size
|
||||||
|
self.do_stable_layer_norm = do_stable_layer_norm
|
||||||
|
self.scope = scope
|
||||||
|
|
||||||
|
output_seq_length = self.seq_length
|
||||||
|
for kernel, stride in zip(self.conv_kernel, self.conv_stride):
|
||||||
|
output_seq_length = (output_seq_length - (kernel - 1)) / stride
|
||||||
|
self.output_seq_length = int(math.ceil(output_seq_length))
|
||||||
|
self.encoder_seq_length = self.output_seq_length // self.squeeze_factor
|
||||||
|
|
||||||
|
def prepare_config_and_inputs(self):
|
||||||
|
input_values = floats_tensor([self.batch_size, self.seq_length], self.vocab_size)
|
||||||
|
attention_mask = random_attention_mask([self.batch_size, self.seq_length])
|
||||||
|
|
||||||
|
config = self.get_config()
|
||||||
|
|
||||||
|
return config, input_values, attention_mask
|
||||||
|
|
||||||
|
def get_config(self):
|
||||||
|
return SEWConfig(
|
||||||
|
hidden_size=self.hidden_size,
|
||||||
|
feat_extract_norm=self.feat_extract_norm,
|
||||||
|
feat_extract_dropout=self.feat_extract_dropout,
|
||||||
|
feat_extract_activation=self.feat_extract_activation,
|
||||||
|
conv_dim=self.conv_dim,
|
||||||
|
conv_stride=self.conv_stride,
|
||||||
|
conv_kernel=self.conv_kernel,
|
||||||
|
conv_bias=self.conv_bias,
|
||||||
|
num_conv_pos_embeddings=self.num_conv_pos_embeddings,
|
||||||
|
num_conv_pos_embedding_groups=self.num_conv_pos_embedding_groups,
|
||||||
|
squeeze_factor=self.squeeze_factor,
|
||||||
|
num_hidden_layers=self.num_hidden_layers,
|
||||||
|
num_attention_heads=self.num_attention_heads,
|
||||||
|
hidden_dropout=self.hidden_dropout,
|
||||||
|
intermediate_size=self.intermediate_size,
|
||||||
|
layer_norm_eps=self.layer_norm_eps,
|
||||||
|
hidden_act=self.hidden_act,
|
||||||
|
initializer_range=self.initializer_range,
|
||||||
|
vocab_size=self.vocab_size,
|
||||||
|
)
|
||||||
|
|
||||||
|
def create_and_check_model(self, config, input_values, attention_mask):
|
||||||
|
model = SEWModel(config=config)
|
||||||
|
model.to(torch_device)
|
||||||
|
model.eval()
|
||||||
|
result = model(input_values, attention_mask=attention_mask)
|
||||||
|
self.parent.assertEqual(
|
||||||
|
result.last_hidden_state.shape, (self.batch_size, self.output_seq_length, self.hidden_size)
|
||||||
|
)
|
||||||
|
|
||||||
|
def create_and_check_batch_inference(self, config, input_values, *args):
|
||||||
|
# test does not pass for models making use of `group_norm`
|
||||||
|
# check: https://github.com/pytorch/fairseq/issues/3227
|
||||||
|
model = SEWModel(config=config)
|
||||||
|
model.to(torch_device)
|
||||||
|
model.eval()
|
||||||
|
|
||||||
|
input_values = input_values[:3]
|
||||||
|
attention_mask = torch.ones(input_values.shape, device=torch_device, dtype=torch.bool)
|
||||||
|
|
||||||
|
input_lengths = [input_values.shape[-1] // i for i in [4, 2, 1]]
|
||||||
|
|
||||||
|
# pad input
|
||||||
|
for i in range(len(input_lengths)):
|
||||||
|
input_values[i, input_lengths[i] :] = 0.0
|
||||||
|
attention_mask[i, input_lengths[i] :] = 0.0
|
||||||
|
|
||||||
|
batch_outputs = model(input_values, attention_mask=attention_mask).last_hidden_state
|
||||||
|
|
||||||
|
for i in range(input_values.shape[0]):
|
||||||
|
input_slice = input_values[i : i + 1, : input_lengths[i]]
|
||||||
|
output = model(input_slice).last_hidden_state
|
||||||
|
|
||||||
|
batch_output = batch_outputs[i : i + 1, : output.shape[1]]
|
||||||
|
self.parent.assertTrue(torch.allclose(output, batch_output, atol=1e-3))
|
||||||
|
|
||||||
|
def check_ctc_loss(self, config, input_values, *args):
|
||||||
|
model = SEWForCTC(config=config)
|
||||||
|
model.to(torch_device)
|
||||||
|
|
||||||
|
# make sure that dropout is disabled
|
||||||
|
model.eval()
|
||||||
|
|
||||||
|
input_values = input_values[:3]
|
||||||
|
attention_mask = torch.ones(input_values.shape, device=torch_device, dtype=torch.long)
|
||||||
|
|
||||||
|
input_lengths = [input_values.shape[-1] // i for i in [4, 2, 1]]
|
||||||
|
max_length_labels = model._get_feat_extract_output_lengths(torch.tensor(input_lengths))
|
||||||
|
labels = ids_tensor((input_values.shape[0], min(max_length_labels) - 1), model.config.vocab_size)
|
||||||
|
|
||||||
|
# pad input
|
||||||
|
for i in range(len(input_lengths)):
|
||||||
|
input_values[i, input_lengths[i] :] = 0.0
|
||||||
|
attention_mask[i, input_lengths[i] :] = 0
|
||||||
|
|
||||||
|
model.config.ctc_loss_reduction = "sum"
|
||||||
|
sum_loss = model(input_values, attention_mask=attention_mask, labels=labels).loss.item()
|
||||||
|
|
||||||
|
model.config.ctc_loss_reduction = "mean"
|
||||||
|
mean_loss = model(input_values, attention_mask=attention_mask, labels=labels).loss.item()
|
||||||
|
|
||||||
|
self.parent.assertTrue(isinstance(sum_loss, float))
|
||||||
|
self.parent.assertTrue(isinstance(mean_loss, float))
|
||||||
|
|
||||||
|
def check_ctc_training(self, config, input_values, *args):
|
||||||
|
config.ctc_zero_infinity = True
|
||||||
|
model = SEWForCTC(config=config)
|
||||||
|
model.to(torch_device)
|
||||||
|
model.train()
|
||||||
|
|
||||||
|
# freeze feature encoder
|
||||||
|
model.freeze_feature_extractor()
|
||||||
|
|
||||||
|
input_values = input_values[:3]
|
||||||
|
|
||||||
|
input_lengths = [input_values.shape[-1] // i for i in [4, 2, 1]]
|
||||||
|
max_length_labels = model._get_feat_extract_output_lengths(torch.tensor(input_lengths))
|
||||||
|
labels = ids_tensor((input_values.shape[0], max(max_length_labels) - 2), model.config.vocab_size)
|
||||||
|
|
||||||
|
# pad input
|
||||||
|
for i in range(len(input_lengths)):
|
||||||
|
input_values[i, input_lengths[i] :] = 0.0
|
||||||
|
|
||||||
|
if max_length_labels[i] < labels.shape[-1]:
|
||||||
|
# it's important that we make sure that target lengths are at least
|
||||||
|
# one shorter than logit lengths to prevent -inf
|
||||||
|
labels[i, max_length_labels[i] - 1 :] = -100
|
||||||
|
|
||||||
|
loss = model(input_values, labels=labels).loss
|
||||||
|
self.parent.assertFalse(torch.isinf(loss).item())
|
||||||
|
|
||||||
|
loss.backward()
|
||||||
|
|
||||||
|
def check_labels_out_of_vocab(self, config, input_values, *args):
|
||||||
|
model = SEWForCTC(config)
|
||||||
|
model.to(torch_device)
|
||||||
|
model.train()
|
||||||
|
|
||||||
|
input_values = input_values[:3]
|
||||||
|
|
||||||
|
input_lengths = [input_values.shape[-1] // i for i in [4, 2, 1]]
|
||||||
|
max_length_labels = model._get_feat_extract_output_lengths(torch.tensor(input_lengths))
|
||||||
|
labels = ids_tensor((input_values.shape[0], max(max_length_labels) - 2), model.config.vocab_size + 100)
|
||||||
|
|
||||||
|
with pytest.raises(ValueError):
|
||||||
|
model(input_values, labels=labels)
|
||||||
|
|
||||||
|
def prepare_config_and_inputs_for_common(self):
|
||||||
|
config, input_values, attention_mask = self.prepare_config_and_inputs()
|
||||||
|
inputs_dict = {"input_values": input_values, "attention_mask": attention_mask}
|
||||||
|
return config, inputs_dict
|
||||||
|
|
||||||
|
|
||||||
|
@require_torch
|
||||||
|
class SEWModelTest(ModelTesterMixin, unittest.TestCase):
|
||||||
|
all_model_classes = (SEWForCTC, SEWModel) if is_torch_available() else ()
|
||||||
|
test_pruning = False
|
||||||
|
test_headmasking = False
|
||||||
|
test_torchscript = False
|
||||||
|
|
||||||
|
def setUp(self):
|
||||||
|
self.model_tester = SEWModelTester(self)
|
||||||
|
self.config_tester = ConfigTester(self, config_class=SEWConfig, hidden_size=37)
|
||||||
|
|
||||||
|
def test_config(self):
|
||||||
|
self.config_tester.run_common_tests()
|
||||||
|
|
||||||
|
def test_model(self):
|
||||||
|
config_and_inputs = self.model_tester.prepare_config_and_inputs()
|
||||||
|
self.model_tester.create_and_check_model(*config_and_inputs)
|
||||||
|
|
||||||
|
def test_ctc_loss_inference(self):
|
||||||
|
config_and_inputs = self.model_tester.prepare_config_and_inputs()
|
||||||
|
self.model_tester.check_ctc_loss(*config_and_inputs)
|
||||||
|
|
||||||
|
def test_ctc_train(self):
|
||||||
|
config_and_inputs = self.model_tester.prepare_config_and_inputs()
|
||||||
|
self.model_tester.check_ctc_training(*config_and_inputs)
|
||||||
|
|
||||||
|
def test_labels_out_of_vocab(self):
|
||||||
|
config_and_inputs = self.model_tester.prepare_config_and_inputs()
|
||||||
|
self.model_tester.check_labels_out_of_vocab(*config_and_inputs)
|
||||||
|
|
||||||
|
# SEW has no inputs_embeds
|
||||||
|
def test_inputs_embeds(self):
|
||||||
|
pass
|
||||||
|
|
||||||
|
# `input_ids` is renamed to `input_values`
|
||||||
|
def test_forward_signature(self):
|
||||||
|
pass
|
||||||
|
|
||||||
|
# SEW cannot resize token embeddings
|
||||||
|
# since it has no token embeddings
|
||||||
|
def test_resize_tokens_embeddings(self):
|
||||||
|
pass
|
||||||
|
|
||||||
|
# SEW has no inputs_embeds
|
||||||
|
# and thus the `get_input_embeddings` fn
|
||||||
|
# is not implemented
|
||||||
|
def test_model_common_attributes(self):
|
||||||
|
pass
|
||||||
|
|
||||||
|
def test_retain_grad_hidden_states_attentions(self):
|
||||||
|
config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()
|
||||||
|
config.output_hidden_states = True
|
||||||
|
config.output_attentions = True
|
||||||
|
|
||||||
|
# no need to test all models as different heads yield the same functionality
|
||||||
|
model_class = self.all_model_classes[0]
|
||||||
|
model = model_class(config)
|
||||||
|
model.to(torch_device)
|
||||||
|
|
||||||
|
# set layer drop to 0
|
||||||
|
model.config.layerdrop = 0.0
|
||||||
|
|
||||||
|
input_values = inputs_dict["input_values"]
|
||||||
|
|
||||||
|
input_lengths = torch.tensor(
|
||||||
|
[input_values.shape[1] for _ in range(input_values.shape[0])], dtype=torch.long, device=torch_device
|
||||||
|
)
|
||||||
|
output_lengths = model._get_feat_extract_output_lengths(input_lengths)
|
||||||
|
|
||||||
|
labels = ids_tensor((input_values.shape[0], output_lengths[0] - 2), self.model_tester.vocab_size)
|
||||||
|
inputs_dict["attention_mask"] = torch.ones_like(inputs_dict["attention_mask"])
|
||||||
|
inputs_dict["labels"] = labels
|
||||||
|
|
||||||
|
outputs = model(**inputs_dict)
|
||||||
|
|
||||||
|
output = outputs[0]
|
||||||
|
|
||||||
|
# Encoder-/Decoder-only models
|
||||||
|
hidden_states = outputs.hidden_states[0]
|
||||||
|
attentions = outputs.attentions[0]
|
||||||
|
|
||||||
|
hidden_states.retain_grad()
|
||||||
|
attentions.retain_grad()
|
||||||
|
|
||||||
|
output.flatten()[0].backward(retain_graph=True)
|
||||||
|
|
||||||
|
self.assertIsNotNone(hidden_states.grad)
|
||||||
|
self.assertIsNotNone(attentions.grad)
|
||||||
|
|
||||||
|
def test_initialization(self):
|
||||||
|
config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()
|
||||||
|
|
||||||
|
configs_no_init = _config_zero_init(config)
|
||||||
|
for model_class in self.all_model_classes:
|
||||||
|
model = model_class(config=configs_no_init)
|
||||||
|
for name, param in model.named_parameters():
|
||||||
|
uniform_init_parms = [
|
||||||
|
"conv.weight",
|
||||||
|
"masked_spec_embed",
|
||||||
|
"quantizer.weight_proj.weight",
|
||||||
|
]
|
||||||
|
if param.requires_grad:
|
||||||
|
if any([x in name for x in uniform_init_parms]):
|
||||||
|
self.assertTrue(
|
||||||
|
-1.0 <= ((param.data.mean() * 1e9).round() / 1e9).item() <= 1.0,
|
||||||
|
msg=f"Parameter {name} of model {model_class} seems not properly initialized",
|
||||||
|
)
|
||||||
|
else:
|
||||||
|
self.assertIn(
|
||||||
|
((param.data.mean() * 1e9).round() / 1e9).item(),
|
||||||
|
[0.0, 1.0],
|
||||||
|
msg=f"Parameter {name} of model {model_class} seems not properly initialized",
|
||||||
|
)
|
||||||
|
|
||||||
|
# overwrite from test_modeling_common
|
||||||
|
def _mock_init_weights(self, module):
|
||||||
|
if hasattr(module, "weight") and module.weight is not None:
|
||||||
|
module.weight.data.fill_(3)
|
||||||
|
if hasattr(module, "weight_g") and module.weight_g is not None:
|
||||||
|
module.weight_g.data.fill_(3)
|
||||||
|
if hasattr(module, "weight_v") and module.weight_v is not None:
|
||||||
|
module.weight_v.data.fill_(3)
|
||||||
|
if hasattr(module, "bias") and module.bias is not None:
|
||||||
|
module.bias.data.fill_(3)
|
||||||
|
if hasattr(module, "masked_spec_embed") and module.masked_spec_embed is not None:
|
||||||
|
module.masked_spec_embed.data.fill_(3)
|
||||||
|
|
||||||
|
@slow
|
||||||
|
def test_model_from_pretrained(self):
|
||||||
|
model = SEWModel.from_pretrained("asapp/sew-tiny-100k")
|
||||||
|
self.assertIsNotNone(model)
|
||||||
|
|
||||||
|
|
||||||
|
@require_torch
|
||||||
|
class SEWUtilsTest(unittest.TestCase):
|
||||||
|
def test_compute_mask_indices(self):
|
||||||
|
batch_size = 4
|
||||||
|
sequence_length = 60
|
||||||
|
mask_prob = 0.5
|
||||||
|
mask_length = 1
|
||||||
|
|
||||||
|
mask = _compute_mask_indices((batch_size, sequence_length), mask_prob, mask_length)
|
||||||
|
mask = torch.from_numpy(mask).to(torch_device)
|
||||||
|
|
||||||
|
self.assertListEqual(mask.sum(axis=-1).tolist(), [mask_prob * sequence_length for _ in range(batch_size)])
|
||||||
|
|
||||||
|
def test_compute_mask_indices_overlap(self):
|
||||||
|
batch_size = 4
|
||||||
|
sequence_length = 80
|
||||||
|
mask_prob = 0.5
|
||||||
|
mask_length = 4
|
||||||
|
|
||||||
|
mask = _compute_mask_indices((batch_size, sequence_length), mask_prob, mask_length)
|
||||||
|
mask = torch.from_numpy(mask).to(torch_device)
|
||||||
|
|
||||||
|
# because of overlap, the mask sums don't have to add up exactly to `mask_prob * sequence_length`, but they have to be smaller or equal
|
||||||
|
for batch_sum in mask.sum(axis=-1):
|
||||||
|
self.assertTrue(int(batch_sum) <= mask_prob * sequence_length)
|
||||||
|
|
||||||
|
|
||||||
|
@require_torch
|
||||||
|
@require_datasets
|
||||||
|
@require_soundfile
|
||||||
|
@slow
|
||||||
|
class SEWModelIntegrationTest(unittest.TestCase):
|
||||||
|
def _load_datasamples(self, num_samples):
|
||||||
|
from datasets import load_dataset
|
||||||
|
|
||||||
|
import soundfile as sf
|
||||||
|
|
||||||
|
ids = [f"1272-141231-000{i}" for i in range(num_samples)]
|
||||||
|
|
||||||
|
# map audio files to raw waveform arrays
|
||||||
|
def map_to_array(batch):
|
||||||
|
speech, _ = sf.read(batch["file"])
|
||||||
|
batch["speech"] = speech
|
||||||
|
return batch
|
||||||
|
|
||||||
|
ds = load_dataset("patrickvonplaten/librispeech_asr_dummy", "clean", split="validation")
|
||||||
|
|
||||||
|
ds = ds.filter(lambda x: x["id"] in ids).sort("id").map(map_to_array)
|
||||||
|
|
||||||
|
return ds["speech"][:num_samples]
|
||||||
|
|
||||||
|
    def test_inference_pretrained_batched(self):
        model = SEWModel.from_pretrained("asapp/sew-tiny-100k").to(torch_device)
        processor = Wav2Vec2FeatureExtractor.from_pretrained("asapp/sew-tiny-100k")

        input_speech = self._load_datasamples(2)

        inputs = processor(input_speech, return_tensors="pt", padding=True)

        input_values = inputs.input_values.to(torch_device)

        with torch.no_grad():
            outputs = model(input_values).last_hidden_state

        # expected outputs taken from the original SEW implementation
        expected_outputs_first = torch.tensor(
            [
                [
                    [0.1509, 0.5372, 0.3061, -0.1694],
                    [-0.1700, 0.5764, 0.2753, -0.1299],
                    [0.1281, 0.7949, 0.2342, -0.1624],
                    [-0.1627, 0.6710, 0.2215, -0.1317],
                ],
                [
                    [0.0408, 1.4355, 0.8605, -0.0968],
                    [0.0393, 1.2368, 0.6826, 0.0364],
                    [-0.1269, 1.9215, 1.1677, -0.1297],
                    [-0.1654, 1.6524, 0.6877, -0.0196],
                ],
            ],
            device=torch_device,
        )
        expected_outputs_last = torch.tensor(
            [
                [
                    [1.3379, -0.1450, -0.1500, -0.0515],
                    [0.8364, -0.1680, -0.1248, -0.0689],
                    [1.2791, -0.1507, -0.1523, -0.0564],
                    [0.8208, -0.1690, -0.1199, -0.0751],
                ],
                [
                    [0.6959, -0.0861, -0.1235, -0.0861],
                    [0.4700, -0.1686, -0.1141, -0.1199],
                    [1.0776, -0.1137, -0.0124, -0.0472],
                    [0.5774, -0.1675, -0.0376, -0.0823],
                ],
            ],
            device=torch_device,
        )
        expected_output_sum = 62146.7422

        self.assertTrue(torch.allclose(outputs[:, :4, :4], expected_outputs_first, atol=5e-3))
        self.assertTrue(torch.allclose(outputs[:, -4:, -4:], expected_outputs_last, atol=5e-3))
        self.assertTrue(abs(outputs.sum() - expected_output_sum) < 2)

    @tooslow
    def test_inference_ctc_batched(self):
        # TODO: enable this test once the finetuned models are available
        model = SEWForCTC.from_pretrained("asapp/sew-tiny-100k-ft-100h").to(torch_device)
        processor = Wav2Vec2Processor.from_pretrained("asapp/sew-tiny-100k-ft-100h", do_lower_case=True)

        input_speech = self._load_datasamples(2)

        inputs = processor(input_speech, return_tensors="pt", padding=True)

        input_values = inputs.input_values.to(torch_device)
        attention_mask = inputs.attention_mask.to(torch_device)

        with torch.no_grad():
            logits = model(input_values, attention_mask=attention_mask).logits

        predicted_ids = torch.argmax(logits, dim=-1)
        predicted_trans = processor.batch_decode(predicted_ids)

        EXPECTED_TRANSCRIPTIONS = [
            "a man said to the universe sir i exist",
            "sweat covered brion's body trickling into the tight loin cloth that was the only garment he wore",
        ]
        self.assertListEqual(predicted_trans, EXPECTED_TRANSCRIPTIONS)
524
tests/test_modeling_sew_d.py
Normal file
@@ -0,0 +1,524 @@
# coding=utf-8
# Copyright 2021 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" Testing suite for the PyTorch SEW-D model. """


import math
import unittest

import pytest

from tests.test_modeling_common import floats_tensor, ids_tensor, random_attention_mask
from transformers import SEWDConfig, is_torch_available
from transformers.testing_utils import require_datasets, require_soundfile, require_torch, slow, tooslow, torch_device

from .test_configuration_common import ConfigTester
from .test_modeling_common import ModelTesterMixin, _config_zero_init


if is_torch_available():
    import torch

    from transformers import SEWDForCTC, SEWDModel, Wav2Vec2FeatureExtractor, Wav2Vec2Processor
    from transformers.models.hubert.modeling_hubert import _compute_mask_indices


class SEWDModelTester:
    def __init__(
        self,
        parent,
        batch_size=13,
        seq_length=1024,  # speech is longer
        is_training=False,
        hidden_size=32,
        feat_extract_norm="group",
        feat_extract_dropout=0.0,
        feat_extract_activation="gelu",
        conv_dim=(64, 32, 32),
        conv_stride=(5, 2, 1),
        conv_kernel=(10, 3, 1),
        conv_bias=False,
        num_conv_pos_embeddings=31,
        num_conv_pos_embedding_groups=2,
        squeeze_factor=2,
        max_position_embeddings=512,
        position_buckets=256,
        share_att_key=True,
        relative_attention=True,
        position_biased_input=False,
        pos_att_type=("p2c", "c2p"),
        norm_rel_ebd="layer_norm",
        num_hidden_layers=4,
        num_attention_heads=2,
        hidden_dropout=0.1,
        intermediate_size=20,
        layer_norm_eps=1e-5,
        hidden_act="gelu",
        initializer_range=0.02,
        vocab_size=32,
        do_stable_layer_norm=False,
        scope=None,
    ):
        self.parent = parent
        self.batch_size = batch_size
        self.seq_length = seq_length
        self.is_training = is_training
        self.hidden_size = hidden_size
        self.feat_extract_norm = feat_extract_norm
        self.feat_extract_dropout = feat_extract_dropout
        self.feat_extract_activation = feat_extract_activation
        self.conv_dim = conv_dim
        self.conv_stride = conv_stride
        self.conv_kernel = conv_kernel
        self.conv_bias = conv_bias
        self.num_conv_pos_embeddings = num_conv_pos_embeddings
        self.num_conv_pos_embedding_groups = num_conv_pos_embedding_groups
        self.squeeze_factor = squeeze_factor
        self.max_position_embeddings = max_position_embeddings
        self.position_buckets = position_buckets
        self.share_att_key = share_att_key
        self.relative_attention = relative_attention
        self.position_biased_input = position_biased_input
        self.pos_att_type = pos_att_type
        self.norm_rel_ebd = norm_rel_ebd
        self.num_hidden_layers = num_hidden_layers
        self.num_attention_heads = num_attention_heads
        self.hidden_dropout = hidden_dropout
        self.intermediate_size = intermediate_size
        self.layer_norm_eps = layer_norm_eps
        self.hidden_act = hidden_act
        self.initializer_range = initializer_range
        self.vocab_size = vocab_size
        self.do_stable_layer_norm = do_stable_layer_norm
        self.scope = scope

        output_seq_length = self.seq_length
        for kernel, stride in zip(self.conv_kernel, self.conv_stride):
            output_seq_length = (output_seq_length - (kernel - 1)) / stride
        self.output_seq_length = int(math.ceil(output_seq_length))
        self.encoder_seq_length = self.output_seq_length // self.squeeze_factor

    def prepare_config_and_inputs(self):
        input_values = floats_tensor([self.batch_size, self.seq_length], self.vocab_size)
        attention_mask = random_attention_mask([self.batch_size, self.seq_length])

        config = self.get_config()

        return config, input_values, attention_mask

    def get_config(self):
        return SEWDConfig(
            hidden_size=self.hidden_size,
            feat_extract_norm=self.feat_extract_norm,
            feat_extract_dropout=self.feat_extract_dropout,
            feat_extract_activation=self.feat_extract_activation,
            conv_dim=self.conv_dim,
            conv_stride=self.conv_stride,
            conv_kernel=self.conv_kernel,
            conv_bias=self.conv_bias,
            num_conv_pos_embeddings=self.num_conv_pos_embeddings,
            num_conv_pos_embedding_groups=self.num_conv_pos_embedding_groups,
            squeeze_factor=self.squeeze_factor,
            max_position_embeddings=self.max_position_embeddings,
            position_buckets=self.position_buckets,
            share_att_key=self.share_att_key,
            relative_attention=self.relative_attention,
            position_biased_input=self.position_biased_input,
            pos_att_type=self.pos_att_type,
            norm_rel_ebd=self.norm_rel_ebd,
            num_hidden_layers=self.num_hidden_layers,
            num_attention_heads=self.num_attention_heads,
            hidden_dropout=self.hidden_dropout,
            intermediate_size=self.intermediate_size,
            layer_norm_eps=self.layer_norm_eps,
            hidden_act=self.hidden_act,
            initializer_range=self.initializer_range,
            vocab_size=self.vocab_size,
        )

    def create_and_check_model(self, config, input_values, attention_mask):
        model = SEWDModel(config=config)
        model.to(torch_device)
        model.eval()
        result = model(input_values, attention_mask=attention_mask)
        self.parent.assertEqual(
            result.last_hidden_state.shape, (self.batch_size, self.output_seq_length, self.hidden_size)
        )

    def create_and_check_batch_inference(self, config, input_values, *args):
        # test does not pass for models making use of `group_norm`
        # check: https://github.com/pytorch/fairseq/issues/3227
        model = SEWDModel(config=config)
        model.to(torch_device)
        model.eval()

        input_values = input_values[:3]
        attention_mask = torch.ones(input_values.shape, device=torch_device, dtype=torch.bool)

        input_lengths = [input_values.shape[-1] // i for i in [4, 2, 1]]

        # pad input
        for i in range(len(input_lengths)):
            input_values[i, input_lengths[i] :] = 0.0
            attention_mask[i, input_lengths[i] :] = 0.0

        batch_outputs = model(input_values, attention_mask=attention_mask).last_hidden_state

        for i in range(input_values.shape[0]):
            input_slice = input_values[i : i + 1, : input_lengths[i]]
            output = model(input_slice).last_hidden_state

            batch_output = batch_outputs[i : i + 1, : output.shape[1]]
            self.parent.assertTrue(torch.allclose(output, batch_output, atol=1e-3))

    def check_ctc_loss(self, config, input_values, *args):
        model = SEWDForCTC(config=config)
        model.to(torch_device)

        # make sure that dropout is disabled
        model.eval()

        input_values = input_values[:3]
        attention_mask = torch.ones(input_values.shape, device=torch_device, dtype=torch.long)

        input_lengths = [input_values.shape[-1] // i for i in [4, 2, 1]]
        max_length_labels = model._get_feat_extract_output_lengths(torch.tensor(input_lengths))
        labels = ids_tensor((input_values.shape[0], min(max_length_labels) - 1), model.config.vocab_size)

        # pad input
        for i in range(len(input_lengths)):
            input_values[i, input_lengths[i] :] = 0.0
            attention_mask[i, input_lengths[i] :] = 0

        model.config.ctc_loss_reduction = "sum"
        sum_loss = model(input_values, attention_mask=attention_mask, labels=labels).loss.item()

        model.config.ctc_loss_reduction = "mean"
        mean_loss = model(input_values, attention_mask=attention_mask, labels=labels).loss.item()

        self.parent.assertTrue(isinstance(sum_loss, float))
        self.parent.assertTrue(isinstance(mean_loss, float))

    def check_ctc_training(self, config, input_values, *args):
        config.ctc_zero_infinity = True
        model = SEWDForCTC(config=config)
        model.to(torch_device)
        model.train()

        # freeze feature encoder
        model.freeze_feature_extractor()

        input_values = input_values[:3]

        input_lengths = [input_values.shape[-1] // i for i in [4, 2, 1]]
        max_length_labels = model._get_feat_extract_output_lengths(torch.tensor(input_lengths))
        labels = ids_tensor((input_values.shape[0], max(max_length_labels) - 2), model.config.vocab_size)

        # pad input
        for i in range(len(input_lengths)):
            input_values[i, input_lengths[i] :] = 0.0

            if max_length_labels[i] < labels.shape[-1]:
                # it's important that we make sure that target lengths are at least
                # one shorter than logit lengths to prevent -inf
                labels[i, max_length_labels[i] - 1 :] = -100

        loss = model(input_values, labels=labels).loss
        self.parent.assertFalse(torch.isinf(loss).item())

        loss.backward()

    def check_labels_out_of_vocab(self, config, input_values, *args):
        model = SEWDForCTC(config)
        model.to(torch_device)
        model.train()

        input_values = input_values[:3]

        input_lengths = [input_values.shape[-1] // i for i in [4, 2, 1]]
        max_length_labels = model._get_feat_extract_output_lengths(torch.tensor(input_lengths))
        labels = ids_tensor((input_values.shape[0], max(max_length_labels) - 2), model.config.vocab_size + 100)

        with pytest.raises(ValueError):
            model(input_values, labels=labels)

    def prepare_config_and_inputs_for_common(self):
        config, input_values, attention_mask = self.prepare_config_and_inputs()
        inputs_dict = {"input_values": input_values, "attention_mask": attention_mask}
        return config, inputs_dict


@require_torch
class SEWDModelTest(ModelTesterMixin, unittest.TestCase):
    all_model_classes = (SEWDForCTC, SEWDModel) if is_torch_available() else ()
    test_pruning = False
    test_headmasking = False
    test_torchscript = False

    def setUp(self):
        self.model_tester = SEWDModelTester(self)
        self.config_tester = ConfigTester(self, config_class=SEWDConfig, hidden_size=37)

    def test_config(self):
        self.config_tester.run_common_tests()

    def test_model(self):
        config_and_inputs = self.model_tester.prepare_config_and_inputs()
        self.model_tester.create_and_check_model(*config_and_inputs)

    def test_ctc_loss_inference(self):
        config_and_inputs = self.model_tester.prepare_config_and_inputs()
        self.model_tester.check_ctc_loss(*config_and_inputs)

    def test_ctc_train(self):
        config_and_inputs = self.model_tester.prepare_config_and_inputs()
        self.model_tester.check_ctc_training(*config_and_inputs)

    def test_labels_out_of_vocab(self):
        config_and_inputs = self.model_tester.prepare_config_and_inputs()
        self.model_tester.check_labels_out_of_vocab(*config_and_inputs)

    # SEW-D has no inputs_embeds
    def test_inputs_embeds(self):
        pass

    # `input_ids` is renamed to `input_values`
    def test_forward_signature(self):
        pass

    # SEW-D cannot resize token embeddings
    # since it has no token embeddings
    def test_resize_tokens_embeddings(self):
        pass

    # SEW-D has no inputs_embeds
    # and thus the `get_input_embeddings` fn
    # is not implemented
    def test_model_common_attributes(self):
        pass

    def test_retain_grad_hidden_states_attentions(self):
        config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()
        config.output_hidden_states = True
        config.output_attentions = True

        # no need to test all models as different heads yield the same functionality
        model_class = self.all_model_classes[0]
        model = model_class(config)
        model.to(torch_device)

        # set layer drop to 0
        model.config.layerdrop = 0.0

        input_values = inputs_dict["input_values"]

        input_lengths = torch.tensor(
            [input_values.shape[1] for _ in range(input_values.shape[0])], dtype=torch.long, device=torch_device
        )
        output_lengths = model._get_feat_extract_output_lengths(input_lengths)

        labels = ids_tensor((input_values.shape[0], output_lengths[0] - 2), self.model_tester.vocab_size)
        inputs_dict["attention_mask"] = torch.ones_like(inputs_dict["attention_mask"])
        inputs_dict["labels"] = labels

        outputs = model(**inputs_dict)

        output = outputs[0]

        # Encoder-/Decoder-only models
        hidden_states = outputs.hidden_states[0]
        attentions = outputs.attentions[0]

        hidden_states.retain_grad()
        attentions.retain_grad()

        output.flatten()[0].backward(retain_graph=True)

        self.assertIsNotNone(hidden_states.grad)
        self.assertIsNotNone(attentions.grad)

    def test_initialization(self):
        config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()

        configs_no_init = _config_zero_init(config)
        for model_class in self.all_model_classes:
            model = model_class(config=configs_no_init)
            for name, param in model.named_parameters():
                uniform_init_parms = [
                    "conv.weight",
                    "masked_spec_embed",
                    "quantizer.weight_proj.weight",
                ]
                if param.requires_grad:
                    if any([x in name for x in uniform_init_parms]):
                        self.assertTrue(
                            -1.0 <= ((param.data.mean() * 1e9).round() / 1e9).item() <= 1.0,
                            msg=f"Parameter {name} of model {model_class} seems not properly initialized",
                        )
                    else:
                        self.assertIn(
                            ((param.data.mean() * 1e9).round() / 1e9).item(),
                            [0.0, 1.0],
                            msg=f"Parameter {name} of model {model_class} seems not properly initialized",
                        )

    # overwrite from test_modeling_common
    def _mock_init_weights(self, module):
        if hasattr(module, "weight") and module.weight is not None:
            module.weight.data.fill_(3)
        if hasattr(module, "weight_g") and module.weight_g is not None:
            module.weight_g.data.fill_(3)
        if hasattr(module, "weight_v") and module.weight_v is not None:
            module.weight_v.data.fill_(3)
        if hasattr(module, "bias") and module.bias is not None:
            module.bias.data.fill_(3)
        if hasattr(module, "masked_spec_embed") and module.masked_spec_embed is not None:
            module.masked_spec_embed.data.fill_(3)

    @slow
    def test_model_from_pretrained(self):
        model = SEWDModel.from_pretrained("asapp/sew-d-tiny-100k")
        self.assertIsNotNone(model)


@require_torch
class SEWDUtilsTest(unittest.TestCase):
    def test_compute_mask_indices(self):
        batch_size = 4
        sequence_length = 60
        mask_prob = 0.5
        mask_length = 1

        mask = _compute_mask_indices((batch_size, sequence_length), mask_prob, mask_length)
        mask = torch.from_numpy(mask).to(torch_device)

        self.assertListEqual(mask.sum(axis=-1).tolist(), [mask_prob * sequence_length for _ in range(batch_size)])

    def test_compute_mask_indices_overlap(self):
        batch_size = 4
        sequence_length = 80
        mask_prob = 0.5
        mask_length = 4

        mask = _compute_mask_indices((batch_size, sequence_length), mask_prob, mask_length)
        mask = torch.from_numpy(mask).to(torch_device)

        # because of overlap, the masked positions don't have to add up exactly to `mask_prob * sequence_length`, but their count has to be smaller or equal
        for batch_sum in mask.sum(axis=-1):
            self.assertTrue(int(batch_sum) <= mask_prob * sequence_length)


@require_torch
@require_datasets
@require_soundfile
@slow
class SEWDModelIntegrationTest(unittest.TestCase):
    def _load_datasamples(self, num_samples):
        from datasets import load_dataset

        import soundfile as sf

        ids = [f"1272-141231-000{i}" for i in range(num_samples)]

        # map files to raw
        def map_to_array(batch):
            speech, _ = sf.read(batch["file"])
            batch["speech"] = speech
            return batch

        ds = load_dataset("patrickvonplaten/librispeech_asr_dummy", "clean", split="validation")

        ds = ds.filter(lambda x: x["id"] in ids).sort("id").map(map_to_array)

        return ds["speech"][:num_samples]

    def test_inference_pretrained_batched(self):
        model = SEWDModel.from_pretrained("asapp/sew-d-tiny-100k").to(torch_device)
        processor = Wav2Vec2FeatureExtractor.from_pretrained("asapp/sew-d-tiny-100k")

        input_speech = self._load_datasamples(2)

        inputs = processor(input_speech, return_tensors="pt", padding=True)

        input_values = inputs.input_values.to(torch_device)

        with torch.no_grad():
            outputs = model(input_values).last_hidden_state

        # expected outputs taken from the original SEW-D implementation
        expected_outputs_first = torch.tensor(
            [
                [
                    [-0.1619, 0.6995, 0.4062, -0.1014],
                    [-0.1364, 0.5960, 0.0952, -0.0873],
                    [-0.1572, 0.5718, 0.4228, -0.0864],
                    [-0.1325, 0.6823, 0.1387, -0.0871],
                ],
                [
                    [-0.1296, 0.4008, 0.4952, -0.1450],
                    [-0.1152, 0.3693, 0.3037, -0.1290],
                    [-0.1194, 0.6074, 0.3531, -0.1466],
                    [-0.1113, 0.3135, 0.2224, -0.1338],
                ],
            ],
            device=torch_device,
        )
        expected_outputs_last = torch.tensor(
            [
                [
                    [-0.1577, 0.5108, 0.8553, 0.2550],
                    [-0.1530, 0.3580, 0.6143, 0.2672],
                    [-0.1535, 0.4954, 0.8503, 0.1387],
                    [-0.1572, 0.3363, 0.6217, 0.1490],
                ],
                [
                    [-0.1338, 0.5459, 0.9607, -0.1133],
                    [-0.1502, 0.3738, 0.7313, -0.0986],
                    [-0.0953, 0.4708, 1.0821, -0.0944],
                    [-0.1474, 0.3598, 0.7248, -0.0748],
                ],
            ],
            device=torch_device,
        )
        expected_output_sum = 54201.0469

        self.assertTrue(torch.allclose(outputs[:, :4, :4], expected_outputs_first, atol=5e-3))
        self.assertTrue(torch.allclose(outputs[:, -4:, -4:], expected_outputs_last, atol=5e-3))
        self.assertTrue(abs(outputs.sum() - expected_output_sum) < 5)

    @tooslow
    def test_inference_ctc_batched(self):
        # TODO: enable this test once the finetuned models are available
        model = SEWDForCTC.from_pretrained("asapp/sew-d-tiny-100k-ft-100h").to(torch_device)
        processor = Wav2Vec2Processor.from_pretrained("asapp/sew-d-tiny-100k-ft-100h", do_lower_case=True)

        input_speech = self._load_datasamples(2)

        inputs = processor(input_speech, return_tensors="pt", padding=True)

        input_values = inputs.input_values.to(torch_device)
        attention_mask = inputs.attention_mask.to(torch_device)

        with torch.no_grad():
            logits = model(input_values, attention_mask=attention_mask).logits

        predicted_ids = torch.argmax(logits, dim=-1)
        predicted_trans = processor.batch_decode(predicted_ids)

        EXPECTED_TRANSCRIPTIONS = [
            "a man said to the universe sir i exist",
            "sweat covered brion's body trickling into the tight loin cloth that was the only garment he wore",
        ]
        self.assertListEqual(predicted_trans, EXPECTED_TRANSCRIPTIONS)
@@ -123,6 +123,8 @@ IGNORE_NON_AUTO_CONFIGURED = PRIVATE_MODELS.copy() + [
    "TFRagTokenForGeneration",
    "Wav2Vec2ForCTC",
    "HubertForCTC",
    "SEWForCTC",
    "SEWDForCTC",
    "XLMForQuestionAnswering",
    "XLNetForQuestionAnswering",
    "SeparableConv1D",