Add the SEW and SEW-D speech models (#13962)
* Working encoder * SEW-D and tests * Further conv fixes * Automodels and conv inits * Update integration tests, add docs * Docs cleanup, resolve todos * Conf fix * Fix docs * Fix tests, apply suggestions * Update src/transformers/models/sew/modeling_sew.py Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com> * Model conversion and updated no-mask tests * Remove copy of feature_proj * Style * Update src/transformers/models/auto/feature_extraction_auto.py Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com> * Update src/transformers/models/auto/feature_extraction_auto.py Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com> * Move orgs Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
This commit is contained in:
@@ -267,6 +267,8 @@ Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih.
|
||||
1. **[RemBERT](https://huggingface.co/transformers/model_doc/rembert.html)** (from Google Research) released with the paper [Rethinking embedding coupling in pre-trained language models](https://arxiv.org/pdf/2010.12821.pdf) by Hyung Won Chung, Thibault Févry, Henry Tsai, M. Johnson, Sebastian Ruder.
|
||||
1. **[RoBERTa](https://huggingface.co/transformers/model_doc/roberta.html)** (from Facebook), released together with the paper a [Robustly Optimized BERT Pretraining Approach](https://arxiv.org/abs/1907.11692) by Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, Veselin Stoyanov.
|
||||
1. **[RoFormer](https://huggingface.co/transformers/model_doc/roformer.html)** (from ZhuiyiTechnology), released together with the paper a [RoFormer: Enhanced Transformer with Rotary Position Embedding](https://arxiv.org/pdf/2104.09864v1.pdf) by Jianlin Su and Yu Lu and Shengfeng Pan and Bo Wen and Yunfeng Liu.
|
||||
1. **[SEW](https://huggingface.co/transformers/model_doc/sew.html)** (from ASAPP) released with the paper [Performance-Efficiency Trade-offs in Unsupervised Pre-training for Speech Recognition](https://arxiv.org/abs/2109.06870) by Felix Wu, Kwangyoun Kim, Jing Pan, Kyu Han, Kilian Q. Weinberger, Yoav Artzi.
|
||||
1. **[SEW-D](https://huggingface.co/transformers/model_doc/sew_d.html)** (from ASAPP) released with the paper [Performance-Efficiency Trade-offs in Unsupervised Pre-training for Speech Recognition](https://arxiv.org/abs/2109.06870) by Felix Wu, Kwangyoun Kim, Jing Pan, Kyu Han, Kilian Q. Weinberger, Yoav Artzi.
|
||||
1. **[SpeechToTextTransformer](https://huggingface.co/transformers/model_doc/speech_to_text.html)** (from Facebook), released together with the paper [fairseq S2T: Fast Speech-to-Text Modeling with fairseq](https://arxiv.org/abs/2010.05171) by Changhan Wang, Yun Tang, Xutai Ma, Anne Wu, Dmytro Okhonko, Juan Pino.
|
||||
1. **[SpeechToTextTransformer2](https://huggingface.co/transformers/model_doc/speech_to_text_2.html)** (from Facebook), released together with the paper [Large-Scale Self- and Semi-Supervised Learning for Speech Translation](https://arxiv.org/abs/2104.06678) by Changhan Wang, Anne Wu, Juan Pino, Alexei Baevski, Michael Auli, Alexis Conneau.
|
||||
1. **[Splinter](https://huggingface.co/transformers/model_doc/splinter.html)** (from Tel Aviv University), released together with the paper [Few-Shot Question Answering by Pretraining Span Selection](https://arxiv.org/abs/2101.00438) by Ori Ram, Yuval Kirstain, Jonathan Berant, Amir Globerson, Omer Levy.
|
||||
|
||||
@@ -289,6 +289,8 @@ conda install -c huggingface transformers
|
||||
1. **[RemBERT](https://huggingface.co/transformers/model_doc/rembert.html)** (来自 Google Research) 伴随论文 [Rethinking embedding coupling in pre-trained language models](https://arxiv.org/pdf/2010.12821.pdf) 由 Hyung Won Chung, Thibault Févry, Henry Tsai, M. Johnson, Sebastian Ruder 发布。
|
||||
1. **[RoBERTa](https://huggingface.co/transformers/model_doc/roberta.html)** (来自 Facebook), 伴随论文 [Robustly Optimized BERT Pretraining Approach](https://arxiv.org/abs/1907.11692) 由 Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, Veselin Stoyanov 发布。
|
||||
1. **[RoFormer](https://huggingface.co/transformers/model_doc/roformer.html)** (来自 ZhuiyiTechnology), 伴随论文 [RoFormer: Enhanced Transformer with Rotary Position Embedding](https://arxiv.org/pdf/2104.09864v1.pdf) 由 Jianlin Su and Yu Lu and Shengfeng Pan and Bo Wen and Yunfeng Liu 发布。
|
||||
1. **[SEW](https://huggingface.co/transformers/model_doc/sew.html)** (来自 ASAPP) 伴随论文 [Performance-Efficiency Trade-offs in Unsupervised Pre-training for Speech Recognition](https://arxiv.org/abs/2109.06870) 由 Felix Wu, Kwangyoun Kim, Jing Pan, Kyu Han, Kilian Q. Weinberger, Yoav Artzi 发布。
|
||||
1. **[SEW-D](https://huggingface.co/transformers/model_doc/sew_d.html)** (来自 ASAPP) 伴随论文 [Performance-Efficiency Trade-offs in Unsupervised Pre-training for Speech Recognition](https://arxiv.org/abs/2109.06870) 由 Felix Wu, Kwangyoun Kim, Jing Pan, Kyu Han, Kilian Q. Weinberger, Yoav Artzi 发布。
|
||||
1. **[SpeechEncoderDecoder](https://huggingface.co/transformers/model_doc/speechencoderdecoder.html)**
|
||||
1. **[SpeechToTextTransformer](https://huggingface.co/transformers/model_doc/speech_to_text.html)** (来自 Facebook), 伴随论文 [fairseq S2T: Fast Speech-to-Text Modeling with fairseq](https://arxiv.org/abs/2010.05171) 由 Changhan Wang, Yun Tang, Xutai Ma, Anne Wu, Dmytro Okhonko, Juan Pino 发布。
|
||||
1. **[SpeechToTextTransformer2](https://huggingface.co/transformers/model_doc/speech_to_text_2.html)** (来自 Facebook) 伴随论文 [Large-Scale Self- and Semi-Supervised Learning for Speech Translation](https://arxiv.org/abs/2104.06678) 由 Changhan Wang, Anne Wu, Juan Pino, Alexei Baevski, Michael Auli, Alexis Conneau 发布。
|
||||
|
||||
@@ -301,6 +301,8 @@ conda install -c huggingface transformers
|
||||
1. **[RemBERT](https://huggingface.co/transformers/model_doc/rembert.html)** (from Google Research) released with the paper [Rethinking embedding coupling in pre-trained language models](https://arxiv.org/pdf/2010.12821.pdf) by Hyung Won Chung, Thibault Févry, Henry Tsai, M. Johnson, Sebastian Ruder.
|
||||
1. **[RoBERTa](https://huggingface.co/transformers/model_doc/roberta.html)** (from Facebook), released together with the paper a [Robustly Optimized BERT Pretraining Approach](https://arxiv.org/abs/1907.11692) by Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, Veselin Stoyanov.
|
||||
1. **[RoFormer](https://huggingface.co/transformers/model_doc/roformer.html)** (from ZhuiyiTechnology), released together with the paper a [RoFormer: Enhanced Transformer with Rotary Position Embedding](https://arxiv.org/pdf/2104.09864v1.pdf) by Jianlin Su and Yu Lu and Shengfeng Pan and Bo Wen and Yunfeng Liu.
|
||||
1. **[SEW](https://huggingface.co/transformers/model_doc/sew.html)** (from ASAPP) released with the paper [Performance-Efficiency Trade-offs in Unsupervised Pre-training for Speech Recognition](https://arxiv.org/abs/2109.06870) by Felix Wu, Kwangyoun Kim, Jing Pan, Kyu Han, Kilian Q. Weinberger, Yoav Artzi.
|
||||
1. **[SEW-D](https://huggingface.co/transformers/model_doc/sew_d.html)** (from ASAPP) released with the paper [Performance-Efficiency Trade-offs in Unsupervised Pre-training for Speech Recognition](https://arxiv.org/abs/2109.06870) by Felix Wu, Kwangyoun Kim, Jing Pan, Kyu Han, Kilian Q. Weinberger, Yoav Artzi.
|
||||
1. **[SpeechEncoderDecoder](https://huggingface.co/transformers/model_doc/speechencoderdecoder.html)**
|
||||
1. **[SpeechToTextTransformer](https://huggingface.co/transformers/model_doc/speech_to_text.html)** (from Facebook), released together with the paper [fairseq S2T: Fast Speech-to-Text Modeling with fairseq](https://arxiv.org/abs/2010.05171) by Changhan Wang, Yun Tang, Xutai Ma, Anne Wu, Dmytro Okhonko, Juan Pino.
|
||||
1. **[SpeechToTextTransformer2](https://huggingface.co/transformers/model_doc/speech_to_text_2.html)** (from Facebook) released with the paper [Large-Scale Self- and Semi-Supervised Learning for Speech Translation](https://arxiv.org/abs/2104.06678) by Changhan Wang, Anne Wu, Juan Pino, Alexei Baevski, Michael Auli, Alexis Conneau.
|
||||
|
||||
@@ -268,59 +268,65 @@ Supported models
|
||||
57. :doc:`RoFormer <model_doc/roformer>` (from ZhuiyiTechnology), released together with the paper a `RoFormer:
|
||||
Enhanced Transformer with Rotary Position Embedding <https://arxiv.org/pdf/2104.09864v1.pdf>`__ by Jianlin Su and
|
||||
Yu Lu and Shengfeng Pan and Bo Wen and Yunfeng Liu.
|
||||
58. :doc:`SpeechToTextTransformer <model_doc/speech_to_text>` (from Facebook), released together with the paper
|
||||
58. :doc:`SEW <model_doc/sew>` (from ASAPP) released with the paper `Performance-Efficiency Trade-offs in Unsupervised
|
||||
Pre-training for Speech Recognition <https://arxiv.org/abs/2109.06870>`__ by Felix Wu, Kwangyoun Kim, Jing Pan, Kyu
|
||||
Han, Kilian Q. Weinberger, Yoav Artzi.
|
||||
59. :doc:`SEW-D <model_doc/sew_d>` (from ASAPP) released with the paper `Performance-Efficiency Trade-offs in
|
||||
Unsupervised Pre-training for Speech Recognition <https://arxiv.org/abs/2109.06870>`__ by Felix Wu, Kwangyoun Kim,
|
||||
Jing Pan, Kyu Han, Kilian Q. Weinberger, Yoav Artzi.
|
||||
60. :doc:`SpeechToTextTransformer <model_doc/speech_to_text>` (from Facebook), released together with the paper
|
||||
`fairseq S2T: Fast Speech-to-Text Modeling with fairseq <https://arxiv.org/abs/2010.05171>`__ by Changhan Wang, Yun
|
||||
Tang, Xutai Ma, Anne Wu, Dmytro Okhonko, Juan Pino.
|
||||
59. :doc:`SpeechToTextTransformer2 <model_doc/speech_to_text_2>` (from Facebook), released together with the paper
|
||||
61. :doc:`SpeechToTextTransformer2 <model_doc/speech_to_text_2>` (from Facebook), released together with the paper
|
||||
`Large-Scale Self- and Semi-Supervised Learning for Speech Translation <https://arxiv.org/abs/2104.06678>`__ by
|
||||
Changhan Wang, Anne Wu, Juan Pino, Alexei Baevski, Michael Auli, Alexis Conneau.
|
||||
60. :doc:`Splinter <model_doc/splinter>` (from Tel Aviv University), released together with the paper `Few-Shot
|
||||
62. :doc:`Splinter <model_doc/splinter>` (from Tel Aviv University), released together with the paper `Few-Shot
|
||||
Question Answering by Pretraining Span Selection <https://arxiv.org/abs/2101.00438>`__ by Ori Ram, Yuval Kirstain,
|
||||
Jonathan Berant, Amir Globerson, Omer Levy.
|
||||
61. :doc:`SqueezeBert <model_doc/squeezebert>` (from Berkeley) released with the paper `SqueezeBERT: What can computer
|
||||
63. :doc:`SqueezeBert <model_doc/squeezebert>` (from Berkeley) released with the paper `SqueezeBERT: What can computer
|
||||
vision teach NLP about efficient neural networks? <https://arxiv.org/abs/2006.11316>`__ by Forrest N. Iandola,
|
||||
Albert E. Shaw, Ravi Krishna, and Kurt W. Keutzer.
|
||||
62. :doc:`T5 <model_doc/t5>` (from Google AI) released with the paper `Exploring the Limits of Transfer Learning with a
|
||||
64. :doc:`T5 <model_doc/t5>` (from Google AI) released with the paper `Exploring the Limits of Transfer Learning with a
|
||||
Unified Text-to-Text Transformer <https://arxiv.org/abs/1910.10683>`__ by Colin Raffel and Noam Shazeer and Adam
|
||||
Roberts and Katherine Lee and Sharan Narang and Michael Matena and Yanqi Zhou and Wei Li and Peter J. Liu.
|
||||
63. :doc:`T5v1.1 <model_doc/t5v1.1>` (from Google AI) released in the repository
|
||||
65. :doc:`T5v1.1 <model_doc/t5v1.1>` (from Google AI) released in the repository
|
||||
`google-research/text-to-text-transfer-transformer
|
||||
<https://github.com/google-research/text-to-text-transfer-transformer/blob/main/released_checkpoints.md#t511>`__ by
|
||||
Colin Raffel and Noam Shazeer and Adam Roberts and Katherine Lee and Sharan Narang and Michael Matena and Yanqi
|
||||
Zhou and Wei Li and Peter J. Liu.
|
||||
64. :doc:`TAPAS <model_doc/tapas>` (from Google AI) released with the paper `TAPAS: Weakly Supervised Table Parsing via
|
||||
66. :doc:`TAPAS <model_doc/tapas>` (from Google AI) released with the paper `TAPAS: Weakly Supervised Table Parsing via
|
||||
Pre-training <https://arxiv.org/abs/2004.02349>`__ by Jonathan Herzig, Paweł Krzysztof Nowak, Thomas Müller,
|
||||
Francesco Piccinno and Julian Martin Eisenschlos.
|
||||
65. :doc:`Transformer-XL <model_doc/transformerxl>` (from Google/CMU) released with the paper `Transformer-XL:
|
||||
67. :doc:`Transformer-XL <model_doc/transformerxl>` (from Google/CMU) released with the paper `Transformer-XL:
|
||||
Attentive Language Models Beyond a Fixed-Length Context <https://arxiv.org/abs/1901.02860>`__ by Zihang Dai*,
|
||||
Zhilin Yang*, Yiming Yang, Jaime Carbonell, Quoc V. Le, Ruslan Salakhutdinov.
|
||||
66. `TrOCR <https://huggingface.co/transformers/master/model_doc/trocr.html>`__ (from Microsoft), released together
|
||||
68. `TrOCR <https://huggingface.co/transformers/master/model_doc/trocr.html>`__ (from Microsoft), released together
|
||||
with the paper `TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models
|
||||
<https://arxiv.org/abs/2109.10282>`__ by Minghao Li, Tengchao Lv, Lei Cui, Yijuan Lu, Dinei Florencio, Cha Zhang,
|
||||
Zhoujun Li, Furu Wei.
|
||||
67. :doc:`Vision Transformer (ViT) <model_doc/vit>` (from Google AI) released with the paper `An Image is Worth 16x16
|
||||
69. :doc:`Vision Transformer (ViT) <model_doc/vit>` (from Google AI) released with the paper `An Image is Worth 16x16
|
||||
Words: Transformers for Image Recognition at Scale <https://arxiv.org/abs/2010.11929>`__ by Alexey Dosovitskiy,
|
||||
Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias
|
||||
Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby.
|
||||
68. :doc:`VisualBERT <model_doc/visual_bert>` (from UCLA NLP) released with the paper `VisualBERT: A Simple and
|
||||
70. :doc:`VisualBERT <model_doc/visual_bert>` (from UCLA NLP) released with the paper `VisualBERT: A Simple and
|
||||
Performant Baseline for Vision and Language <https://arxiv.org/pdf/1908.03557>`__ by Liunian Harold Li, Mark
|
||||
Yatskar, Da Yin, Cho-Jui Hsieh, Kai-Wei Chang.
|
||||
69. :doc:`Wav2Vec2 <model_doc/wav2vec2>` (from Facebook AI) released with the paper `wav2vec 2.0: A Framework for
|
||||
71. :doc:`Wav2Vec2 <model_doc/wav2vec2>` (from Facebook AI) released with the paper `wav2vec 2.0: A Framework for
|
||||
Self-Supervised Learning of Speech Representations <https://arxiv.org/abs/2006.11477>`__ by Alexei Baevski, Henry
|
||||
Zhou, Abdelrahman Mohamed, Michael Auli.
|
||||
70. :doc:`XLM <model_doc/xlm>` (from Facebook) released together with the paper `Cross-lingual Language Model
|
||||
72. :doc:`XLM <model_doc/xlm>` (from Facebook) released together with the paper `Cross-lingual Language Model
|
||||
Pretraining <https://arxiv.org/abs/1901.07291>`__ by Guillaume Lample and Alexis Conneau.
|
||||
71. :doc:`XLM-ProphetNet <model_doc/xlmprophetnet>` (from Microsoft Research) released with the paper `ProphetNet:
|
||||
73. :doc:`XLM-ProphetNet <model_doc/xlmprophetnet>` (from Microsoft Research) released with the paper `ProphetNet:
|
||||
Predicting Future N-gram for Sequence-to-Sequence Pre-training <https://arxiv.org/abs/2001.04063>`__ by Yu Yan,
|
||||
Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang and Ming Zhou.
|
||||
72. :doc:`XLM-RoBERTa <model_doc/xlmroberta>` (from Facebook AI), released together with the paper `Unsupervised
|
||||
74. :doc:`XLM-RoBERTa <model_doc/xlmroberta>` (from Facebook AI), released together with the paper `Unsupervised
|
||||
Cross-lingual Representation Learning at Scale <https://arxiv.org/abs/1911.02116>`__ by Alexis Conneau*, Kartikay
|
||||
Khandelwal*, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke
|
||||
Zettlemoyer and Veselin Stoyanov.
|
||||
73. :doc:`XLNet <model_doc/xlnet>` (from Google/CMU) released with the paper `XLNet: Generalized Autoregressive
|
||||
75. :doc:`XLNet <model_doc/xlnet>` (from Google/CMU) released with the paper `XLNet: Generalized Autoregressive
|
||||
Pretraining for Language Understanding <https://arxiv.org/abs/1906.08237>`__ by Zhilin Yang*, Zihang Dai*, Yiming
|
||||
Yang, Jaime Carbonell, Ruslan Salakhutdinov, Quoc V. Le.
|
||||
74. :doc:`XLSR-Wav2Vec2 <model_doc/xlsr_wav2vec2>` (from Facebook AI) released with the paper `Unsupervised
|
||||
76. :doc:`XLSR-Wav2Vec2 <model_doc/xlsr_wav2vec2>` (from Facebook AI) released with the paper `Unsupervised
|
||||
Cross-Lingual Representation Learning For Speech Recognition <https://arxiv.org/abs/2006.13979>`__ by Alexis
|
||||
Conneau, Alexei Baevski, Ronan Collobert, Abdelrahman Mohamed, Michael Auli.
|
||||
|
||||
@@ -446,6 +452,10 @@ Flax), PyTorch, and/or TensorFlow.
|
||||
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
|
||||
| RoFormer | ✅ | ✅ | ✅ | ✅ | ❌ |
|
||||
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
|
||||
| SEW | ❌ | ❌ | ✅ | ❌ | ❌ |
|
||||
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
|
||||
| SEW-D | ❌ | ❌ | ✅ | ❌ | ❌ |
|
||||
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
|
||||
| Speech Encoder decoder | ❌ | ❌ | ✅ | ❌ | ❌ |
|
||||
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
|
||||
| Speech2Text | ✅ | ❌ | ✅ | ❌ | ❌ |
|
||||
@@ -621,6 +631,8 @@ Flax), PyTorch, and/or TensorFlow.
|
||||
model_doc/retribert
|
||||
model_doc/roberta
|
||||
model_doc/roformer
|
||||
model_doc/sew
|
||||
model_doc/sew_d
|
||||
model_doc/speechencoderdecoder
|
||||
model_doc/speech_to_text
|
||||
model_doc/speech_to_text_2
|
||||
|
||||
61
docs/source/model_doc/sew.rst
Normal file
61
docs/source/model_doc/sew.rst
Normal file
@@ -0,0 +1,61 @@
|
||||
..
|
||||
Copyright 2021 The HuggingFace Team. All rights reserved.
|
||||
|
||||
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
|
||||
the License. You may obtain a copy of the License at
|
||||
|
||||
http://www.apache.org/licenses/LICENSE-2.0
|
||||
|
||||
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
|
||||
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
|
||||
specific language governing permissions and limitations under the License.
|
||||
|
||||
SEW
|
||||
-----------------------------------------------------------------------------------------------------------------------
|
||||
|
||||
Overview
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
SEW (Squeezed and Efficient Wav2Vec) was proposed in `Performance-Efficiency Trade-offs in Unsupervised Pre-training
|
||||
for Speech Recognition <https://arxiv.org/abs/2109.06870>`__ by Felix Wu, Kwangyoun Kim, Jing Pan, Kyu Han, Kilian Q.
|
||||
Weinberger, Yoav Artzi.
|
||||
|
||||
The abstract from the paper is the following:
|
||||
|
||||
*This paper is a study of performance-efficiency trade-offs in pre-trained models for automatic speech recognition
|
||||
(ASR). We focus on wav2vec 2.0, and formalize several architecture designs that influence both the model performance
|
||||
and its efficiency. Putting together all our observations, we introduce SEW (Squeezed and Efficient Wav2vec), a
|
||||
pre-trained model architecture with significant improvements along both performance and efficiency dimensions across a
|
||||
variety of training setups. For example, under the 100h-960h semi-supervised setup on LibriSpeech, SEW achieves a 1.9x
|
||||
inference speedup compared to wav2vec 2.0, with a 13.5% relative reduction in word error rate. With a similar inference
|
||||
time, SEW reduces word error rate by 25-50% across different model sizes.*
|
||||
|
||||
Tips:
|
||||
|
||||
- SEW is a speech model that accepts a float array corresponding to the raw waveform of the speech signal.
|
||||
- SEWForCTC is fine-tuned using connectionist temporal classification (CTC) so the model output has to be decoded using
|
||||
:class:`~transformers.Wav2Vec2CTCTokenizer`.
|
||||
|
||||
This model was contributed by `anton-l <https://huggingface.co/anton-l>`__.
|
||||
|
||||
|
||||
SEWConfig
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
.. autoclass:: transformers.SEWConfig
|
||||
:members:
|
||||
|
||||
|
||||
SEWModel
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
.. autoclass:: transformers.SEWModel
|
||||
:members: forward
|
||||
|
||||
|
||||
SEWForCTC
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
.. autoclass:: transformers.SEWForCTC
|
||||
:members: forward
|
||||
|
||||
61
docs/source/model_doc/sew_d.rst
Normal file
61
docs/source/model_doc/sew_d.rst
Normal file
@@ -0,0 +1,61 @@
|
||||
..
|
||||
Copyright 2021 The HuggingFace Team. All rights reserved.
|
||||
|
||||
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
|
||||
the License. You may obtain a copy of the License at
|
||||
|
||||
http://www.apache.org/licenses/LICENSE-2.0
|
||||
|
||||
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
|
||||
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
|
||||
specific language governing permissions and limitations under the License.
|
||||
|
||||
SEW-D
|
||||
-----------------------------------------------------------------------------------------------------------------------
|
||||
|
||||
Overview
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
SEW-D (Squeezed and Efficient Wav2Vec with Disentangled attention) was proposed in `Performance-Efficiency Trade-offs
|
||||
in Unsupervised Pre-training for Speech Recognition <https://arxiv.org/abs/2109.06870>`__ by Felix Wu, Kwangyoun Kim,
|
||||
Jing Pan, Kyu Han, Kilian Q. Weinberger, Yoav Artzi.
|
||||
|
||||
The abstract from the paper is the following:
|
||||
|
||||
*This paper is a study of performance-efficiency trade-offs in pre-trained models for automatic speech recognition
|
||||
(ASR). We focus on wav2vec 2.0, and formalize several architecture designs that influence both the model performance
|
||||
and its efficiency. Putting together all our observations, we introduce SEW (Squeezed and Efficient Wav2vec), a
|
||||
pre-trained model architecture with significant improvements along both performance and efficiency dimensions across a
|
||||
variety of training setups. For example, under the 100h-960h semi-supervised setup on LibriSpeech, SEW achieves a 1.9x
|
||||
inference speedup compared to wav2vec 2.0, with a 13.5% relative reduction in word error rate. With a similar inference
|
||||
time, SEW reduces word error rate by 25-50% across different model sizes.*
|
||||
|
||||
Tips:
|
||||
|
||||
- SEW-D is a speech model that accepts a float array corresponding to the raw waveform of the speech signal.
|
||||
- SEWDForCTC is fine-tuned using connectionist temporal classification (CTC) so the model output has to be decoded
|
||||
using :class:`~transformers.Wav2Vec2CTCTokenizer`.
|
||||
|
||||
This model was contributed by `anton-l <https://huggingface.co/anton-l>`__.
|
||||
|
||||
|
||||
SEWDConfig
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
.. autoclass:: transformers.SEWDConfig
|
||||
:members:
|
||||
|
||||
|
||||
SEWDModel
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
.. autoclass:: transformers.SEWDModel
|
||||
:members: forward
|
||||
|
||||
|
||||
SEWDForCTC
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
.. autoclass:: transformers.SEWDForCTC
|
||||
:members: forward
|
||||
|
||||
@@ -251,6 +251,8 @@ _import_structure = {
|
||||
"models.retribert": ["RETRIBERT_PRETRAINED_CONFIG_ARCHIVE_MAP", "RetriBertConfig", "RetriBertTokenizer"],
|
||||
"models.roberta": ["ROBERTA_PRETRAINED_CONFIG_ARCHIVE_MAP", "RobertaConfig", "RobertaTokenizer"],
|
||||
"models.roformer": ["ROFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP", "RoFormerConfig", "RoFormerTokenizer"],
|
||||
"models.sew": ["SEW_PRETRAINED_CONFIG_ARCHIVE_MAP", "SEWConfig"],
|
||||
"models.sew_d": ["SEW_D_PRETRAINED_CONFIG_ARCHIVE_MAP", "SEWDConfig"],
|
||||
"models.speech_encoder_decoder": ["SpeechEncoderDecoderConfig"],
|
||||
"models.speech_to_text": [
|
||||
"SPEECH_TO_TEXT_PRETRAINED_CONFIG_ARCHIVE_MAP",
|
||||
@@ -1135,6 +1137,22 @@ if is_torch_available():
|
||||
"load_tf_weights_in_roformer",
|
||||
]
|
||||
)
|
||||
_import_structure["models.sew"].extend(
|
||||
[
|
||||
"SEW_PRETRAINED_MODEL_ARCHIVE_LIST",
|
||||
"SEWForCTC",
|
||||
"SEWModel",
|
||||
"SEWPreTrainedModel",
|
||||
]
|
||||
)
|
||||
_import_structure["models.sew_d"].extend(
|
||||
[
|
||||
"SEW_D_PRETRAINED_MODEL_ARCHIVE_LIST",
|
||||
"SEWDForCTC",
|
||||
"SEWDModel",
|
||||
"SEWDPreTrainedModel",
|
||||
]
|
||||
)
|
||||
_import_structure["models.speech_encoder_decoder"].extend(["SpeechEncoderDecoderModel"])
|
||||
_import_structure["models.speech_to_text"].extend(
|
||||
[
|
||||
@@ -2095,6 +2113,8 @@ if TYPE_CHECKING:
|
||||
from .models.retribert import RETRIBERT_PRETRAINED_CONFIG_ARCHIVE_MAP, RetriBertConfig, RetriBertTokenizer
|
||||
from .models.roberta import ROBERTA_PRETRAINED_CONFIG_ARCHIVE_MAP, RobertaConfig, RobertaTokenizer
|
||||
from .models.roformer import ROFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP, RoFormerConfig, RoFormerTokenizer
|
||||
from .models.sew import SEW_PRETRAINED_CONFIG_ARCHIVE_MAP, SEWConfig
|
||||
from .models.sew_d import SEW_D_PRETRAINED_CONFIG_ARCHIVE_MAP, SEWDConfig
|
||||
from .models.speech_encoder_decoder import SpeechEncoderDecoderConfig
|
||||
from .models.speech_to_text import SPEECH_TO_TEXT_PRETRAINED_CONFIG_ARCHIVE_MAP, Speech2TextConfig
|
||||
from .models.speech_to_text_2 import (
|
||||
@@ -2835,6 +2855,8 @@ if TYPE_CHECKING:
|
||||
RoFormerPreTrainedModel,
|
||||
load_tf_weights_in_roformer,
|
||||
)
|
||||
from .models.sew import SEW_PRETRAINED_MODEL_ARCHIVE_LIST, SEWForCTC, SEWModel, SEWPreTrainedModel
|
||||
from .models.sew_d import SEW_D_PRETRAINED_MODEL_ARCHIVE_LIST, SEWDForCTC, SEWDModel, SEWDPreTrainedModel
|
||||
from .models.speech_encoder_decoder import SpeechEncoderDecoderModel
|
||||
from .models.speech_to_text import (
|
||||
SPEECH_TO_TEXT_PRETRAINED_MODEL_ARCHIVE_LIST,
|
||||
|
||||
@@ -80,6 +80,8 @@ from . import (
|
||||
retribert,
|
||||
roberta,
|
||||
roformer,
|
||||
sew,
|
||||
sew_d,
|
||||
speech_to_text,
|
||||
splinter,
|
||||
squeezebert,
|
||||
|
||||
@@ -96,6 +96,8 @@ CONFIG_MAPPING_NAMES = OrderedDict(
|
||||
("rag", "RagConfig"),
|
||||
("tapas", "TapasConfig"),
|
||||
("splinter", "SplinterConfig"),
|
||||
("sew-d", "SEWDConfig"),
|
||||
("sew", "SEWConfig"),
|
||||
]
|
||||
)
|
||||
|
||||
@@ -162,6 +164,8 @@ CONFIG_ARCHIVE_MAP_MAPPING_NAMES = OrderedDict(
|
||||
("ibert", "IBERT_PRETRAINED_CONFIG_ARCHIVE_MAP"),
|
||||
("hubert", "HUBERT_PRETRAINED_CONFIG_ARCHIVE_MAP"),
|
||||
("splinter", "SPLINTER_PRETRAINED_CONFIG_ARCHIVE_MAP"),
|
||||
("sew-d", "SEW_D_PRETRAINED_CONFIG_ARCHIVE_MAP"),
|
||||
("sew", "SEW_PRETRAINED_CONFIG_ARCHIVE_MAP"),
|
||||
]
|
||||
)
|
||||
|
||||
@@ -245,6 +249,8 @@ MODEL_NAMES_MAPPING = OrderedDict(
|
||||
("byt5", "ByT5"),
|
||||
("mbart50", "mBART-50"),
|
||||
("splinter", "Splinter"),
|
||||
("sew-d", "SEW-D"),
|
||||
("sew", "SEW"),
|
||||
]
|
||||
)
|
||||
|
||||
|
||||
@@ -92,6 +92,8 @@ MODEL_MAPPING_NAMES = OrderedDict(
|
||||
("tapas", "TapasModel"),
|
||||
("ibert", "IBertModel"),
|
||||
("splinter", "SplinterModel"),
|
||||
("sew", "SEWModel"),
|
||||
("sew-d", "SEWDModel"),
|
||||
]
|
||||
)
|
||||
|
||||
@@ -482,6 +484,8 @@ MODEL_FOR_CTC_MAPPING_NAMES = OrderedDict(
|
||||
# Model for Connectionist temporal classification (CTC) mapping
|
||||
("wav2vec2", "Wav2Vec2ForCTC"),
|
||||
("hubert", "HubertForCTC"),
|
||||
("sew", "SEWForCTC"),
|
||||
("sew-d", "SEWDForCTC"),
|
||||
]
|
||||
)
|
||||
|
||||
|
||||
@@ -141,7 +141,7 @@ def _compute_mask_indices(
|
||||
class HubertNoLayerNormConvLayer(nn.Module):
|
||||
def __init__(self, config, layer_id=0):
|
||||
super().__init__()
|
||||
self.in_conv_dim = config.conv_dim[layer_id] if layer_id > 0 else 1
|
||||
self.in_conv_dim = config.conv_dim[layer_id - 1] if layer_id > 0 else 1
|
||||
self.out_conv_dim = config.conv_dim[layer_id]
|
||||
|
||||
self.conv = nn.Conv1d(
|
||||
@@ -163,7 +163,7 @@ class HubertNoLayerNormConvLayer(nn.Module):
|
||||
class HubertLayerNormConvLayer(nn.Module):
|
||||
def __init__(self, config, layer_id=0):
|
||||
super().__init__()
|
||||
self.in_conv_dim = config.conv_dim[layer_id] if layer_id > 0 else 1
|
||||
self.in_conv_dim = config.conv_dim[layer_id - 1] if layer_id > 0 else 1
|
||||
self.out_conv_dim = config.conv_dim[layer_id]
|
||||
|
||||
self.conv = nn.Conv1d(
|
||||
@@ -191,7 +191,7 @@ class HubertLayerNormConvLayer(nn.Module):
|
||||
class HubertGroupNormConvLayer(nn.Module):
|
||||
def __init__(self, config, layer_id=0):
|
||||
super().__init__()
|
||||
self.in_conv_dim = config.conv_dim[layer_id] if layer_id > 0 else 1
|
||||
self.in_conv_dim = config.conv_dim[layer_id - 1] if layer_id > 0 else 1
|
||||
self.out_conv_dim = config.conv_dim[layer_id]
|
||||
|
||||
self.conv = nn.Conv1d(
|
||||
|
||||
45
src/transformers/models/sew/__init__.py
Normal file
45
src/transformers/models/sew/__init__.py
Normal file
@@ -0,0 +1,45 @@
|
||||
# flake8: noqa
|
||||
# There's no way to ignore "F401 '...' imported but unused" warnings in this
|
||||
# module, but to preserve other warnings. So, don't check this module at all.
|
||||
|
||||
# Copyright 2021 The HuggingFace Team. All rights reserved.
|
||||
#
|
||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||
# you may not use this file except in compliance with the License.
|
||||
# You may obtain a copy of the License at
|
||||
#
|
||||
# http://www.apache.org/licenses/LICENSE-2.0
|
||||
#
|
||||
# Unless required by applicable law or agreed to in writing, software
|
||||
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
from typing import TYPE_CHECKING
|
||||
|
||||
from ...file_utils import _LazyModule, is_torch_available
|
||||
|
||||
|
||||
_import_structure = {
|
||||
"configuration_sew": ["SEW_PRETRAINED_CONFIG_ARCHIVE_MAP", "SEWConfig"],
|
||||
}
|
||||
|
||||
if is_torch_available():
|
||||
_import_structure["modeling_sew"] = [
|
||||
"SEW_PRETRAINED_MODEL_ARCHIVE_LIST",
|
||||
"SEWForCTC",
|
||||
"SEWModel",
|
||||
"SEWPreTrainedModel",
|
||||
]
|
||||
|
||||
if TYPE_CHECKING:
|
||||
from .configuration_sew import SEW_PRETRAINED_CONFIG_ARCHIVE_MAP, SEWConfig
|
||||
|
||||
if is_torch_available():
|
||||
from .modeling_sew import SEW_PRETRAINED_MODEL_ARCHIVE_LIST, SEWForCTC, SEWModel, SEWPreTrainedModel
|
||||
|
||||
|
||||
else:
|
||||
import sys
|
||||
|
||||
sys.modules[__name__] = _LazyModule(__name__, globals()["__file__"], _import_structure)
|
||||
216
src/transformers/models/sew/configuration_sew.py
Normal file
216
src/transformers/models/sew/configuration_sew.py
Normal file
@@ -0,0 +1,216 @@
|
||||
# coding=utf-8
|
||||
# Copyright 2021 ASAPP Inc. and The HuggingFace Inc. team. All rights reserved.
|
||||
#
|
||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||
# you may not use this file except in compliance with the License.
|
||||
# You may obtain a copy of the License at
|
||||
#
|
||||
# http://www.apache.org/licenses/LICENSE-2.0
|
||||
#
|
||||
# Unless required by applicable law or agreed to in writing, software
|
||||
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
""" SEW model configuration """
|
||||
|
||||
from ...configuration_utils import PretrainedConfig
|
||||
from ...utils import logging
|
||||
|
||||
|
||||
logger = logging.get_logger(__name__)
|
||||
|
||||
SEW_PRETRAINED_CONFIG_ARCHIVE_MAP = {
|
||||
"asapp/sew-tiny-100k": "https://huggingface.co/asapp/sew-tiny-100k/resolve/main/config.json",
|
||||
# See all SEW models at https://huggingface.co/models?filter=sew
|
||||
}
|
||||
|
||||
|
||||
class SEWConfig(PretrainedConfig):
|
||||
r"""
|
||||
This is the configuration class to store the configuration of a :class:`~transformers.SEWModel`. It is used to
|
||||
instantiate a SEW model according to the specified arguments, defining the model architecture. Instantiating a
|
||||
configuration with the defaults will yield a similar configuration to that of the SEW `asapp/sew-tiny-100k
|
||||
<https://huggingface.co/asapp/sew-tiny-100k>`__ architecture.
|
||||
|
||||
Configuration objects inherit from :class:`~transformers.PretrainedConfig` and can be used to control the model
|
||||
outputs. Read the documentation from :class:`~transformers.PretrainedConfig` for more information.
|
||||
|
||||
|
||||
Args:
|
||||
vocab_size (:obj:`int`, `optional`, defaults to 32):
|
||||
Vocabulary size of the SEW model. Defines the number of different tokens that can be represented by the
|
||||
:obj:`inputs_ids` passed when calling :class:`~transformers.SEW`.
|
||||
hidden_size (:obj:`int`, `optional`, defaults to 768):
|
||||
Dimensionality of the encoder layers and the pooler layer.
|
||||
num_hidden_layers (:obj:`int`, `optional`, defaults to 12):
|
||||
Number of hidden layers in the Transformer encoder.
|
||||
num_attention_heads (:obj:`int`, `optional`, defaults to 12):
|
||||
Number of attention heads for each attention layer in the Transformer encoder.
|
||||
intermediate_size (:obj:`int`, `optional`, defaults to 3072):
|
||||
Dimensionality of the "intermediate" (i.e., feed-forward) layer in the Transformer encoder.
|
||||
squeeze_factor (:obj:`int`, `optional`, defaults to 2):
|
||||
Sequence length downsampling factor after the encoder and upsampling factor after the transformer.
|
||||
hidden_act (:obj:`str` or :obj:`function`, `optional`, defaults to :obj:`"gelu"`):
|
||||
The non-linear activation function (function or string) in the encoder and pooler. If string,
|
||||
:obj:`"gelu"`, :obj:`"relu"`, :obj:`"selu"` and :obj:`"gelu_new"` are supported.
|
||||
hidden_dropout (:obj:`float`, `optional`, defaults to 0.1):
|
||||
The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
|
||||
attention_dropout (:obj:`float`, `optional`, defaults to 0.1):
|
||||
The dropout ratio for the attention probabilities.
|
||||
final_dropout (:obj:`float`, `optional`, defaults to 0.1):
|
||||
The dropout probability for the final projection layer of :class:`SEWForCTC`.
|
||||
initializer_range (:obj:`float`, `optional`, defaults to 0.02):
|
||||
The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
|
||||
layer_norm_eps (:obj:`float`, `optional`, defaults to 1e-12):
|
||||
The epsilon used by the layer normalization layers.
|
||||
feat_extract_norm (:obj:`str`, `optional`, defaults to :obj:`"group"`):
|
||||
The norm to be applied to 1D convolutional layers in feature extractor. One of :obj:`"group"` for group
|
||||
normalization of only the first 1D convolutional layer or :obj:`"layer"` for layer normalization of all 1D
|
||||
convolutional layers.
|
||||
feat_proj_dropout (:obj:`float`, `optional`, defaults to 0.0):
|
||||
The dropout probability for output of the feature extractor.
|
||||
feat_extract_activation (:obj:`str, `optional`, defaults to :obj:`"gelu"`):
|
||||
The non-linear activation function (function or string) in the 1D convolutional layers of the feature
|
||||
extractor. If string, :obj:`"gelu"`, :obj:`"relu"`, :obj:`"selu"` and :obj:`"gelu_new"` are supported.
|
||||
conv_dim (:obj:`Tuple[int]`, `optional`, defaults to :obj:`(64, 128, 128, 128, 128, 256, 256, 256, 256, 512, 512, 512, 512)`):
|
||||
A tuple of integers defining the number of input and output channels of each 1D convolutional layer in the
|
||||
feature extractor. The length of `conv_dim` defines the number of 1D convolutional layers.
|
||||
conv_stride (:obj:`Tuple[int]`, `optional`, defaults to :obj:`(5, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1)`):
|
||||
A tuple of integers defining the stride of each 1D convolutional layer in the feature extractor. The length
|
||||
of `conv_stride` defines the number of convolutional layers and has to match the the length of `conv_dim`.
|
||||
conv_kernel (:obj:`Tuple[int]`, `optional`, defaults to :obj:`(10, 3, 1, 3, 1, 3, 1, 3, 1, 2, 1, 2, 1)`):
|
||||
A tuple of integers defining the kernel size of each 1D convolutional layer in the feature extractor. The
|
||||
length of `conv_kernel` defines the number of convolutional layers and has to match the the length of
|
||||
`conv_dim`.
|
||||
conv_bias (:obj:`bool`, `optional`, defaults to :obj:`False`):
|
||||
Whether the 1D convolutional layers have a bias.
|
||||
num_conv_pos_embeddings (:obj:`int`, `optional`, defaults to 128):
|
||||
Number of convolutional positional embeddings. Defines the kernel size of 1D convolutional positional
|
||||
embeddings layer.
|
||||
num_conv_pos_embedding_groups (:obj:`int`, `optional`, defaults to 16):
|
||||
Number of groups of 1D convolutional positional embeddings layer.
|
||||
apply_spec_augment (:obj:`bool`, `optional`, defaults to :obj:`True`):
|
||||
Whether to apply *SpecAugment* data augmentation to the outputs of the feature extractor. For reference see
|
||||
`SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition
|
||||
<https://arxiv.org/abs/1904.08779>`__.
|
||||
mask_time_prob (:obj:`float`, `optional`, defaults to 0.05):
|
||||
Propability of each feature vector along the time axis to be chosen as the start of the vector span to be
|
||||
masked. Approximately ``mask_time_prob * sequence_length // mask_time_length`` feature vectors will be
|
||||
masked along the time axis. This is only relevant if ``apply_spec_augment is True``.
|
||||
mask_time_length (:obj:`int`, `optional`, defaults to 10):
|
||||
Length of vector span along the time axis.
|
||||
mask_feature_prob (:obj:`float`, `optional`, defaults to 0.0):
|
||||
Propability of each feature vector along the feature axis to be chosen as the start of the vector span to
|
||||
be masked. Approximately ``mask_time_prob * hidden_size // mask_time_length`` feature vectors will be
|
||||
masked along the time axis. This is only relevant if ``apply_spec_augment is True``.
|
||||
mask_feature_length (:obj:`int`, `optional`, defaults to 10):
|
||||
Length of vector span along the feature axis.
|
||||
ctc_loss_reduction (:obj:`str`, `optional`, defaults to :obj:`"sum"`):
|
||||
Specifies the reduction to apply to the output of ``torch.nn.CTCLoss``. Only relevant when training an
|
||||
instance of :class:`~transformers.SEWForCTC`.
|
||||
ctc_zero_infinity (:obj:`bool`, `optional`, defaults to :obj:`False`):
|
||||
Whether to zero infinite losses and the associated gradients of ``torch.nn.CTCLoss``. Infinite losses
|
||||
mainly occur when the inputs are too short to be aligned to the targets. Only relevant when training an
|
||||
instance of :class:`~transformers.SEWForCTC`.
|
||||
|
||||
Example::
|
||||
|
||||
>>> from transformers import SEWModel, SEWConfig
|
||||
|
||||
>>> # Initializing a SEW asapp/sew-tiny-100k style configuration
|
||||
>>> configuration = SEWConfig()
|
||||
|
||||
>>> # Initializing a model from the asapp/sew-tiny-100k style configuration
|
||||
>>> model = SEWModel(configuration)
|
||||
|
||||
>>> # Accessing the model configuration
|
||||
>>> configuration = model.config
|
||||
"""
|
||||
model_type = "sew"
|
||||
|
||||
def __init__(
|
||||
self,
|
||||
vocab_size=32,
|
||||
hidden_size=768,
|
||||
num_hidden_layers=12,
|
||||
num_attention_heads=12,
|
||||
intermediate_size=3072,
|
||||
squeeze_factor=2,
|
||||
hidden_act="gelu",
|
||||
hidden_dropout=0.1,
|
||||
activation_dropout=0.1,
|
||||
attention_dropout=0.1,
|
||||
feat_proj_dropout=0.0,
|
||||
final_dropout=0.1,
|
||||
layerdrop=0.1,
|
||||
initializer_range=0.02,
|
||||
layer_norm_eps=1e-5,
|
||||
feat_extract_norm="group",
|
||||
feat_extract_activation="gelu",
|
||||
conv_dim=(64, 128, 128, 128, 128, 256, 256, 256, 256, 512, 512, 512, 512),
|
||||
conv_stride=(5, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1),
|
||||
conv_kernel=(10, 3, 1, 3, 1, 3, 1, 3, 1, 2, 1, 2, 1),
|
||||
conv_bias=False,
|
||||
num_conv_pos_embeddings=128,
|
||||
num_conv_pos_embedding_groups=16,
|
||||
apply_spec_augment=True,
|
||||
mask_time_prob=0.05,
|
||||
mask_time_length=10,
|
||||
mask_feature_prob=0.0,
|
||||
mask_feature_length=10,
|
||||
ctc_loss_reduction="sum",
|
||||
ctc_zero_infinity=False,
|
||||
pad_token_id=0,
|
||||
bos_token_id=1,
|
||||
eos_token_id=2,
|
||||
**kwargs
|
||||
):
|
||||
super().__init__(**kwargs, pad_token_id=pad_token_id, bos_token_id=bos_token_id, eos_token_id=eos_token_id)
|
||||
self.hidden_size = hidden_size
|
||||
self.feat_extract_norm = feat_extract_norm
|
||||
self.feat_extract_activation = feat_extract_activation
|
||||
self.conv_dim = list(conv_dim)
|
||||
self.conv_stride = list(conv_stride)
|
||||
self.conv_kernel = list(conv_kernel)
|
||||
self.conv_bias = conv_bias
|
||||
self.num_conv_pos_embeddings = num_conv_pos_embeddings
|
||||
self.num_conv_pos_embedding_groups = num_conv_pos_embedding_groups
|
||||
self.num_feat_extract_layers = len(self.conv_dim)
|
||||
self.num_hidden_layers = num_hidden_layers
|
||||
self.intermediate_size = intermediate_size
|
||||
self.squeeze_factor = squeeze_factor
|
||||
self.hidden_act = hidden_act
|
||||
self.num_attention_heads = num_attention_heads
|
||||
self.hidden_dropout = hidden_dropout
|
||||
self.attention_dropout = attention_dropout
|
||||
self.activation_dropout = activation_dropout
|
||||
self.feat_proj_dropout = feat_proj_dropout
|
||||
self.final_dropout = final_dropout
|
||||
self.layerdrop = layerdrop
|
||||
self.layer_norm_eps = layer_norm_eps
|
||||
self.initializer_range = initializer_range
|
||||
self.vocab_size = vocab_size
|
||||
|
||||
if (
|
||||
(len(self.conv_stride) != self.num_feat_extract_layers)
|
||||
or (len(self.conv_kernel) != self.num_feat_extract_layers)
|
||||
or (len(self.conv_dim) != self.num_feat_extract_layers)
|
||||
):
|
||||
raise ValueError(
|
||||
"Configuration for convolutional layers is incorrect."
|
||||
"It is required that `len(config.conv_dim)` == `len(config.conv_stride)` == `len(config.conv_kernel)`,"
|
||||
f"but is `len(config.conv_dim) = {len(self.conv_dim)}`, `len(config.conv_stride)"
|
||||
f"= {len(self.conv_stride)}`, `len(config.conv_kernel) = {len(self.conv_kernel)}`."
|
||||
)
|
||||
|
||||
# fine-tuning config parameters for SpecAugment: https://arxiv.org/abs/1904.08779
|
||||
self.apply_spec_augment = apply_spec_augment
|
||||
self.mask_time_prob = mask_time_prob
|
||||
self.mask_time_length = mask_time_length
|
||||
self.mask_feature_prob = mask_feature_prob
|
||||
self.mask_feature_length = mask_feature_length
|
||||
|
||||
# ctc loss
|
||||
self.ctc_loss_reduction = ctc_loss_reduction
|
||||
self.ctc_zero_infinity = ctc_zero_infinity
|
||||
@@ -0,0 +1,286 @@
|
||||
# coding=utf-8
|
||||
# Copyright 2021 The HuggingFace Inc. team.
|
||||
#
|
||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||
# you may not use this file except in compliance with the License.
|
||||
# You may obtain a copy of the License at
|
||||
#
|
||||
# http://www.apache.org/licenses/LICENSE-2.0
|
||||
#
|
||||
# Unless required by applicable law or agreed to in writing, software
|
||||
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
"""Convert SEW checkpoint."""
|
||||
|
||||
|
||||
import argparse
|
||||
import json
|
||||
import os
|
||||
|
||||
import fairseq
|
||||
import torch
|
||||
from fairseq.data import Dictionary
|
||||
|
||||
# Register SEW's fairseq modules
|
||||
from sew_asapp import tasks # noqa: F401
|
||||
from transformers import (
|
||||
SEWConfig,
|
||||
SEWForCTC,
|
||||
SEWModel,
|
||||
Wav2Vec2CTCTokenizer,
|
||||
Wav2Vec2FeatureExtractor,
|
||||
Wav2Vec2Processor,
|
||||
logging,
|
||||
)
|
||||
|
||||
|
||||
logging.set_verbosity_info()
|
||||
logger = logging.get_logger(__name__)
|
||||
|
||||
MAPPING = {
|
||||
"post_extract_proj": "feature_projection",
|
||||
"encoder.pos_conv.0": "encoder.pos_conv_embed.conv",
|
||||
"self_attn.k_proj": "encoder.layers.*.attention.k_proj",
|
||||
"self_attn.v_proj": "encoder.layers.*.attention.v_proj",
|
||||
"self_attn.q_proj": "encoder.layers.*.attention.q_proj",
|
||||
"self_attn.out_proj": "encoder.layers.*.attention.out_proj",
|
||||
"self_attn_layer_norm": "encoder.layers.*.layer_norm",
|
||||
"fc1": "encoder.layers.*.feed_forward.intermediate_dense",
|
||||
"fc2": "encoder.layers.*.feed_forward.output_dense",
|
||||
"final_layer_norm": "encoder.layers.*.final_layer_norm",
|
||||
"encoder.upsample.0": "encoder.upsample.projection",
|
||||
"encoder.layer_norm": "encoder.layer_norm",
|
||||
"w2v_encoder.layer_norm": "layer_norm",
|
||||
"w2v_encoder.proj": "lm_head",
|
||||
"mask_emb": "masked_spec_embed",
|
||||
}
|
||||
|
||||
|
||||
def set_recursively(hf_pointer, key, value, full_name, weight_type):
|
||||
for attribute in key.split("."):
|
||||
hf_pointer = getattr(hf_pointer, attribute)
|
||||
|
||||
if weight_type is not None:
|
||||
hf_shape = getattr(hf_pointer, weight_type).shape
|
||||
else:
|
||||
hf_shape = hf_pointer.shape
|
||||
|
||||
assert (
|
||||
hf_shape == value.shape
|
||||
), f"Shape of hf {key + '.' + weight_type if weight_type is not None else ''} is {hf_shape}, but should be {value.shape} for {full_name}"
|
||||
|
||||
if weight_type == "weight":
|
||||
hf_pointer.weight.data = value
|
||||
elif weight_type == "weight_g":
|
||||
hf_pointer.weight_g.data = value
|
||||
elif weight_type == "weight_v":
|
||||
hf_pointer.weight_v.data = value
|
||||
elif weight_type == "bias":
|
||||
hf_pointer.bias.data = value
|
||||
else:
|
||||
hf_pointer.data = value
|
||||
|
||||
logger.info(f"{key + '.' + weight_type if weight_type is not None else ''} was initialized from {full_name}.")
|
||||
|
||||
|
||||
def recursively_load_weights(fairseq_model, hf_model, is_finetuned):
|
||||
unused_weights = []
|
||||
fairseq_dict = fairseq_model.state_dict()
|
||||
|
||||
feature_extractor = hf_model.sew.feature_extractor if is_finetuned else hf_model.feature_extractor
|
||||
|
||||
for name, value in fairseq_dict.items():
|
||||
is_used = False
|
||||
if "conv_layers" in name:
|
||||
load_conv_layer(
|
||||
name,
|
||||
value,
|
||||
feature_extractor,
|
||||
unused_weights,
|
||||
hf_model.config.feat_extract_norm == "group",
|
||||
)
|
||||
is_used = True
|
||||
else:
|
||||
for key, mapped_key in MAPPING.items():
|
||||
mapped_key = "sew." + mapped_key if (is_finetuned and mapped_key != "lm_head") else mapped_key
|
||||
|
||||
if key in name or key.split("w2v_encoder.")[-1] == name.split(".")[0]:
|
||||
is_used = True
|
||||
if "*" in mapped_key:
|
||||
layer_index = name.split(key)[0].split(".")[-2]
|
||||
mapped_key = mapped_key.replace("*", layer_index)
|
||||
if "weight_g" in name:
|
||||
weight_type = "weight_g"
|
||||
elif "weight_v" in name:
|
||||
weight_type = "weight_v"
|
||||
elif "weight" in name:
|
||||
weight_type = "weight"
|
||||
elif "bias" in name:
|
||||
weight_type = "bias"
|
||||
else:
|
||||
weight_type = None
|
||||
set_recursively(hf_model, mapped_key, value, name, weight_type)
|
||||
continue
|
||||
if not is_used:
|
||||
unused_weights.append(name)
|
||||
|
||||
logger.warning(f"Unused weights: {unused_weights}")
|
||||
|
||||
|
||||
def load_conv_layer(full_name, value, feature_extractor, unused_weights, use_group_norm):
|
||||
name = full_name.split("conv_layers.")[-1]
|
||||
items = name.split(".")
|
||||
layer_id = int(items[0])
|
||||
type_id = int(items[1])
|
||||
|
||||
if type_id == 0:
|
||||
if "bias" in name:
|
||||
assert (
|
||||
value.shape == feature_extractor.conv_layers[layer_id].conv.bias.data.shape
|
||||
), f"{full_name} has size {value.shape}, but {feature_extractor.conv_layers[layer_id].conv.bias.data.shape} was found."
|
||||
feature_extractor.conv_layers[layer_id].conv.bias.data = value
|
||||
logger.info(f"Feat extract conv layer {layer_id} was initialized from {full_name}.")
|
||||
elif "weight" in name:
|
||||
assert (
|
||||
value.shape == feature_extractor.conv_layers[layer_id].conv.weight.data.shape
|
||||
), f"{full_name} has size {value.shape}, but {feature_extractor.conv_layers[layer_id].conv.weight.data.shape} was found."
|
||||
feature_extractor.conv_layers[layer_id].conv.weight.data = value
|
||||
logger.info(f"Feat extract conv layer {layer_id} was initialized from {full_name}.")
|
||||
elif (type_id == 2 and not use_group_norm) or (type_id == 2 and layer_id == 0 and use_group_norm):
|
||||
if "bias" in name:
|
||||
assert (
|
||||
value.shape == feature_extractor.conv_layers[layer_id].layer_norm.bias.data.shape
|
||||
), f"{full_name} has size {value.shape}, but {feature_extractor[layer_id].layer_norm.bias.data.shape} was found."
|
||||
feature_extractor.conv_layers[layer_id].layer_norm.bias.data = value
|
||||
logger.info(f"Feat extract layer norm weight of layer {layer_id} was initialized from {full_name}.")
|
||||
elif "weight" in name:
|
||||
assert (
|
||||
value.shape == feature_extractor.conv_layers[layer_id].layer_norm.weight.data.shape
|
||||
), f"{full_name} has size {value.shape}, but {feature_extractor[layer_id].layer_norm.weight.data.shape} was found."
|
||||
feature_extractor.conv_layers[layer_id].layer_norm.weight.data = value
|
||||
logger.info(f"Feat extract layer norm weight of layer {layer_id} was initialized from {full_name}.")
|
||||
else:
|
||||
unused_weights.append(full_name)
|
||||
|
||||
|
||||
def convert_config(model):
|
||||
config = SEWConfig()
|
||||
fs_config = model.cfg
|
||||
|
||||
config.activation_dropout = fs_config.activation_dropout
|
||||
config.apply_spec_augment = fs_config.mask_prob > 0 or fs_config.mask_channel_prob > 0
|
||||
config.attention_dropout = fs_config.attention_dropout
|
||||
config.conv_bias = fs_config.conv_bias
|
||||
conv_layers = eval(fs_config.conv_feature_layers)
|
||||
config.conv_dim = [x[0] for x in conv_layers]
|
||||
config.conv_kernel = [x[1] for x in conv_layers]
|
||||
config.conv_stride = [x[2] for x in conv_layers]
|
||||
config.feat_extract_activation = "gelu"
|
||||
config.feat_extract_norm = "layer" if fs_config.extractor_mode == "layer_norm" else "group"
|
||||
config.feat_proj_dropout = fs_config.dropout_input
|
||||
config.final_dropout = 0.0
|
||||
config.hidden_act = fs_config.activation_fn.name
|
||||
config.hidden_dropout = fs_config.dropout
|
||||
config.hidden_size = fs_config.encoder_embed_dim
|
||||
config.initializer_range = 0.02
|
||||
config.intermediate_size = fs_config.encoder_ffn_embed_dim
|
||||
config.layer_norm_eps = 1e-5
|
||||
config.layerdrop = fs_config.encoder_layerdrop
|
||||
config.mask_feature_length = fs_config.mask_channel_length
|
||||
config.mask_feature_prob = fs_config.mask_channel_prob
|
||||
config.mask_time_length = fs_config.mask_length
|
||||
config.mask_time_prob = fs_config.mask_prob
|
||||
config.num_attention_heads = fs_config.encoder_attention_heads
|
||||
config.num_conv_pos_embedding_groups = fs_config.conv_pos_groups
|
||||
config.num_conv_pos_embeddings = fs_config.conv_pos
|
||||
config.num_feat_extract_layers = len(conv_layers)
|
||||
config.num_hidden_layers = fs_config.encoder_layers
|
||||
config.squeeze_factor = fs_config.squeeze_factor
|
||||
|
||||
return config
|
||||
|
||||
|
||||
@torch.no_grad()
|
||||
def convert_sew_checkpoint(
|
||||
checkpoint_path, pytorch_dump_folder_path, config_path=None, dict_path=None, is_finetuned=True
|
||||
):
|
||||
"""
|
||||
Copy/paste/tweak model's weights to transformers design.
|
||||
"""
|
||||
|
||||
if is_finetuned:
|
||||
model, _, _ = fairseq.checkpoint_utils.load_model_ensemble_and_task(
|
||||
[checkpoint_path], arg_overrides={"data": "/".join(dict_path.split("/")[:-1])}
|
||||
)
|
||||
else:
|
||||
model, _, _ = fairseq.checkpoint_utils.load_model_ensemble_and_task([checkpoint_path])
|
||||
|
||||
if config_path is not None:
|
||||
config = SEWConfig.from_pretrained(config_path)
|
||||
else:
|
||||
config = convert_config(model[0])
|
||||
model = model[0].eval()
|
||||
|
||||
return_attention_mask = True if config.feat_extract_norm == "layer" else False
|
||||
feature_extractor = Wav2Vec2FeatureExtractor(
|
||||
feature_size=1,
|
||||
sampling_rate=16000,
|
||||
padding_value=0,
|
||||
do_normalize=True,
|
||||
return_attention_mask=return_attention_mask,
|
||||
)
|
||||
|
||||
if is_finetuned:
|
||||
if dict_path:
|
||||
target_dict = Dictionary.load(dict_path)
|
||||
|
||||
# important change bos & pad token id since CTC symbol is <pad> and
|
||||
# not <s> as in fairseq
|
||||
config.bos_token_id = target_dict.pad_index
|
||||
config.pad_token_id = target_dict.bos_index
|
||||
config.eos_token_id = target_dict.eos_index
|
||||
config.vocab_size = len(target_dict.symbols)
|
||||
vocab_path = os.path.join(pytorch_dump_folder_path, "vocab.json")
|
||||
if not os.path.isdir(pytorch_dump_folder_path):
|
||||
logger.error("--pytorch_dump_folder_path ({}) should be a directory".format(pytorch_dump_folder_path))
|
||||
return
|
||||
os.makedirs(pytorch_dump_folder_path, exist_ok=True)
|
||||
with open(vocab_path, "w", encoding="utf-8") as vocab_handle:
|
||||
json.dump(target_dict.indices, vocab_handle)
|
||||
tokenizer = Wav2Vec2CTCTokenizer(
|
||||
vocab_path,
|
||||
unk_token=target_dict.unk_word,
|
||||
pad_token=target_dict.pad_word,
|
||||
bos_token=target_dict.bos_word,
|
||||
eos_token=target_dict.eos_word,
|
||||
word_delimiter_token="|",
|
||||
do_lower_case=False,
|
||||
)
|
||||
processor = Wav2Vec2Processor(feature_extractor=feature_extractor, tokenizer=tokenizer)
|
||||
processor.save_pretrained(pytorch_dump_folder_path)
|
||||
|
||||
hf_model = SEWForCTC(config)
|
||||
else:
|
||||
hf_model = SEWModel(config)
|
||||
feature_extractor.save_pretrained(pytorch_dump_folder_path)
|
||||
|
||||
recursively_load_weights(model, hf_model, is_finetuned)
|
||||
|
||||
hf_model.save_pretrained(pytorch_dump_folder_path)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
parser = argparse.ArgumentParser()
|
||||
parser.add_argument("--pytorch_dump_folder_path", default=None, type=str, help="Path to the output PyTorch model.")
|
||||
parser.add_argument("--checkpoint_path", default=None, type=str, help="Path to fairseq checkpoint")
|
||||
parser.add_argument("--dict_path", default=None, type=str, help="Path to dict of fine-tuned model")
|
||||
parser.add_argument("--config_path", default=None, type=str, help="Path to hf config.json of model to convert")
|
||||
parser.add_argument(
|
||||
"--is_finetuned", action="store_true", help="Whether the model to convert is a fine-tuned model or not"
|
||||
)
|
||||
args = parser.parse_args()
|
||||
convert_sew_checkpoint(
|
||||
args.checkpoint_path, args.pytorch_dump_folder_path, args.config_path, args.dict_path, args.is_finetuned
|
||||
)
|
||||
1035
src/transformers/models/sew/modeling_sew.py
Normal file
1035
src/transformers/models/sew/modeling_sew.py
Normal file
File diff suppressed because it is too large
Load Diff
45
src/transformers/models/sew_d/__init__.py
Normal file
45
src/transformers/models/sew_d/__init__.py
Normal file
@@ -0,0 +1,45 @@
|
||||
# flake8: noqa
|
||||
# There's no way to ignore "F401 '...' imported but unused" warnings in this
|
||||
# module, but to preserve other warnings. So, don't check this module at all.
|
||||
|
||||
# Copyright 2021 The HuggingFace Team. All rights reserved.
|
||||
#
|
||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||
# you may not use this file except in compliance with the License.
|
||||
# You may obtain a copy of the License at
|
||||
#
|
||||
# http://www.apache.org/licenses/LICENSE-2.0
|
||||
#
|
||||
# Unless required by applicable law or agreed to in writing, software
|
||||
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
from typing import TYPE_CHECKING
|
||||
|
||||
from ...file_utils import _LazyModule, is_torch_available
|
||||
|
||||
|
||||
_import_structure = {
|
||||
"configuration_sew_d": ["SEW_D_PRETRAINED_CONFIG_ARCHIVE_MAP", "SEWDConfig"],
|
||||
}
|
||||
|
||||
if is_torch_available():
|
||||
_import_structure["modeling_sew_d"] = [
|
||||
"SEW_D_PRETRAINED_MODEL_ARCHIVE_LIST",
|
||||
"SEWDForCTC",
|
||||
"SEWDModel",
|
||||
"SEWDPreTrainedModel",
|
||||
]
|
||||
|
||||
if TYPE_CHECKING:
|
||||
from .configuration_sew_d import SEW_D_PRETRAINED_CONFIG_ARCHIVE_MAP, SEWDConfig
|
||||
|
||||
if is_torch_available():
|
||||
from .modeling_sew_d import SEW_D_PRETRAINED_MODEL_ARCHIVE_LIST, SEWDForCTC, SEWDModel, SEWDPreTrainedModel
|
||||
|
||||
|
||||
else:
|
||||
import sys
|
||||
|
||||
sys.modules[__name__] = _LazyModule(__name__, globals()["__file__"], _import_structure)
|
||||
248
src/transformers/models/sew_d/configuration_sew_d.py
Normal file
248
src/transformers/models/sew_d/configuration_sew_d.py
Normal file
@@ -0,0 +1,248 @@
|
||||
# coding=utf-8
|
||||
# Copyright 2021 ASAPP Inc. and The HuggingFace Inc. team. All rights reserved.
|
||||
#
|
||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||
# you may not use this file except in compliance with the License.
|
||||
# You may obtain a copy of the License at
|
||||
#
|
||||
# http://www.apache.org/licenses/LICENSE-2.0
|
||||
#
|
||||
# Unless required by applicable law or agreed to in writing, software
|
||||
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
""" SEW-D model configuration """
|
||||
|
||||
from ...configuration_utils import PretrainedConfig
|
||||
from ...utils import logging
|
||||
|
||||
|
||||
logger = logging.get_logger(__name__)
|
||||
|
||||
SEW_D_PRETRAINED_CONFIG_ARCHIVE_MAP = {
|
||||
"asapp/sew-d-tiny-100k": "https://huggingface.co/asapp/sew-d-tiny-100k/resolve/main/config.json",
|
||||
# See all SEW-D models at https://huggingface.co/models?filter=sew-d
|
||||
}
|
||||
|
||||
|
||||
class SEWDConfig(PretrainedConfig):
|
||||
r"""
|
||||
This is the configuration class to store the configuration of a :class:`~transformers.SEWDModel`. It is used to
|
||||
instantiate a SEW-D model according to the specified arguments, defining the model architecture. Instantiating a
|
||||
configuration with the defaults will yield a similar configuration to that of the SEW-D `asapp/sew-d-tiny-100k
|
||||
<https://huggingface.co/asapp/sew-d-tiny-100k>`__ architecture.
|
||||
|
||||
Configuration objects inherit from :class:`~transformers.PretrainedConfig` and can be used to control the model
|
||||
outputs. Read the documentation from :class:`~transformers.PretrainedConfig` for more information.
|
||||
|
||||
|
||||
Args:
|
||||
vocab_size (:obj:`int`, `optional`, defaults to 32):
|
||||
Vocabulary size of the SEW-D model. Defines the number of different tokens that can be represented by the
|
||||
:obj:`inputs_ids` passed when calling :class:`~transformers.SEWD`.
|
||||
hidden_size (:obj:`int`, `optional`, defaults to 768):
|
||||
Dimensionality of the encoder layers and the pooler layer.
|
||||
num_hidden_layers (:obj:`int`, `optional`, defaults to 12):
|
||||
Number of hidden layers in the Transformer encoder.
|
||||
num_attention_heads (:obj:`int`, `optional`, defaults to 12):
|
||||
Number of attention heads for each attention layer in the Transformer encoder.
|
||||
intermediate_size (:obj:`int`, `optional`, defaults to 3072):
|
||||
Dimensionality of the "intermediate" (i.e., feed-forward) layer in the Transformer encoder.
|
||||
squeeze_factor (:obj:`int`, `optional`, defaults to 2):
|
||||
Sequence length downsampling factor after the encoder and upsampling factor after the transformer.
|
||||
max_position_embeddings (:obj:`int`, `optional`, defaults to 512):
|
||||
The maximum sequence length that this model might ever be used with. Typically set this to something large
|
||||
just in case (e.g., 512 or 1024 or 2048).
|
||||
position_buckets (:obj:`int`, `optional`, defaults to 256):
|
||||
The maximum size of relative position embeddings.
|
||||
share_att_key (:obj:`bool`, `optional`, defaults to :obj:`True`):
|
||||
Whether to share attention key with c2p and p2c.
|
||||
relative_attention (:obj:`bool`, `optional`, defaults to :obj:`True`):
|
||||
Whether to use relative position encoding.
|
||||
position_biased_input (:obj:`bool`, `optional`, defaults to :obj:`False`):
|
||||
Whether to add absolute position embedding to content embedding.
|
||||
pos_att_type (:obj:`Tuple[str]`, `optional`, defaults to :obj:`("p2c", "c2p")`):
|
||||
The type of relative position attention, it can be a combination of :obj:`("p2c", "c2p", "p2p")`, e.g.
|
||||
:obj:`("p2c")`, :obj:`("p2c", "c2p")`, :obj:`("p2c", "c2p", 'p2p")`.
|
||||
norm_rel_ebd (:obj:`str`, `optional`, defaults to :obj:`"layer_norm"`):
|
||||
Whether to use layer norm in relative embedding (:obj:`"layer_norm"` if yes)
|
||||
hidden_act (:obj:`str` or :obj:`function`, `optional`, defaults to :obj:`"gelu"`):
|
||||
The non-linear activation function (function or string) in the encoder and pooler. If string,
|
||||
:obj:`"gelu"`, :obj:`"relu"`, :obj:`"selu"` and :obj:`"gelu_new"` are supported.
|
||||
hidden_dropout (:obj:`float`, `optional`, defaults to 0.1):
|
||||
The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
|
||||
attention_dropout (:obj:`float`, `optional`, defaults to 0.1):
|
||||
The dropout ratio for the attention probabilities.
|
||||
final_dropout (:obj:`float`, `optional`, defaults to 0.1):
|
||||
The dropout probability for the final projection layer of :class:`SEWDForCTC`.
|
||||
initializer_range (:obj:`float`, `optional`, defaults to 0.02):
|
||||
The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
|
||||
layer_norm_eps (:obj:`float`, `optional`, defaults to 1e-12):
|
||||
The epsilon used by the layer normalization layers.
|
||||
feat_extract_norm (:obj:`str`, `optional`, defaults to :obj:`"group"`):
|
||||
The norm to be applied to 1D convolutional layers in feature extractor. One of :obj:`"group"` for group
|
||||
normalization of only the first 1D convolutional layer or :obj:`"layer"` for layer normalization of all 1D
|
||||
convolutional layers.
|
||||
feat_proj_dropout (:obj:`float`, `optional`, defaults to 0.0):
|
||||
The dropout probability for output of the feature extractor.
|
||||
feat_extract_activation (:obj:`str, `optional`, defaults to :obj:`"gelu"`):
|
||||
The non-linear activation function (function or string) in the 1D convolutional layers of the feature
|
||||
extractor. If string, :obj:`"gelu"`, :obj:`"relu"`, :obj:`"selu"` and :obj:`"gelu_new"` are supported.
|
||||
conv_dim (:obj:`Tuple[int]`, `optional`, defaults to :obj:`(64, 128, 128, 128, 128, 256, 256, 256, 256, 512, 512, 512, 512)`):
|
||||
A tuple of integers defining the number of input and output channels of each 1D convolutional layer in the
|
||||
feature extractor. The length of `conv_dim` defines the number of 1D convolutional layers.
|
||||
conv_stride (:obj:`Tuple[int]`, `optional`, defaults to :obj:`(5, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1)`):
|
||||
A tuple of integers defining the stride of each 1D convolutional layer in the feature extractor. The length
|
||||
of `conv_stride` defines the number of convolutional layers and has to match the the length of `conv_dim`.
|
||||
conv_kernel (:obj:`Tuple[int]`, `optional`, defaults to :obj:`(10, 3, 1, 3, 1, 3, 1, 3, 1, 2, 1, 2, 1)`):
|
||||
A tuple of integers defining the kernel size of each 1D convolutional layer in the feature extractor. The
|
||||
length of `conv_kernel` defines the number of convolutional layers and has to match the the length of
|
||||
`conv_dim`.
|
||||
conv_bias (:obj:`bool`, `optional`, defaults to :obj:`False`):
|
||||
Whether the 1D convolutional layers have a bias.
|
||||
num_conv_pos_embeddings (:obj:`int`, `optional`, defaults to 128):
|
||||
Number of convolutional positional embeddings. Defines the kernel size of 1D convolutional positional
|
||||
embeddings layer.
|
||||
num_conv_pos_embedding_groups (:obj:`int`, `optional`, defaults to 16):
|
||||
Number of groups of 1D convolutional positional embeddings layer.
|
||||
apply_spec_augment (:obj:`bool`, `optional`, defaults to :obj:`True`):
|
||||
Whether to apply *SpecAugment* data augmentation to the outputs of the feature extractor. For reference see
|
||||
`SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition
|
||||
<https://arxiv.org/abs/1904.08779>`__.
|
||||
mask_time_prob (:obj:`float`, `optional`, defaults to 0.05):
|
||||
Propability of each feature vector along the time axis to be chosen as the start of the vector span to be
|
||||
masked. Approximately ``mask_time_prob * sequence_length // mask_time_length`` feature vectors will be
|
||||
masked along the time axis. This is only relevant if ``apply_spec_augment is True``.
|
||||
mask_time_length (:obj:`int`, `optional`, defaults to 10):
|
||||
Length of vector span along the time axis.
|
||||
mask_feature_prob (:obj:`float`, `optional`, defaults to 0.0):
|
||||
Propability of each feature vector along the feature axis to be chosen as the start of the vector span to
|
||||
be masked. Approximately ``mask_time_prob * hidden_size // mask_time_length`` feature vectors will be
|
||||
masked along the time axis. This is only relevant if ``apply_spec_augment is True``.
|
||||
mask_feature_length (:obj:`int`, `optional`, defaults to 10):
|
||||
Length of vector span along the feature axis.
|
||||
diversity_loss_weight (:obj:`int`, `optional`, defaults to 0.1):
|
||||
The weight of the codebook diversity loss component.
|
||||
ctc_loss_reduction (:obj:`str`, `optional`, defaults to :obj:`"sum"`):
|
||||
Specifies the reduction to apply to the output of ``torch.nn.CTCLoss``. Only relevant when training an
|
||||
instance of :class:`~transformers.SEWDForCTC`.
|
||||
ctc_zero_infinity (:obj:`bool`, `optional`, defaults to :obj:`False`):
|
||||
Whether to zero infinite losses and the associated gradients of ``torch.nn.CTCLoss``. Infinite losses
|
||||
mainly occur when the inputs are too short to be aligned to the targets. Only relevant when training an
|
||||
instance of :class:`~transformers.SEWDForCTC`.
|
||||
|
||||
Example::
|
||||
|
||||
>>> from transformers import SEWDModel, SEWDConfig
|
||||
|
||||
>>> # Initializing a SEW-D asapp/sew-d-tiny-100k style configuration
|
||||
>>> configuration = SEWDConfig()
|
||||
|
||||
>>> # Initializing a model from the asapp/sew-d-tiny-100k style configuration
|
||||
>>> model = SEWDModel(configuration)
|
||||
|
||||
>>> # Accessing the model configuration
|
||||
>>> configuration = model.config
|
||||
"""
|
||||
model_type = "sew-d"
|
||||
|
||||
def __init__(
|
||||
self,
|
||||
vocab_size=32,
|
||||
hidden_size=768,
|
||||
num_hidden_layers=12,
|
||||
num_attention_heads=12,
|
||||
intermediate_size=3072,
|
||||
squeeze_factor=2,
|
||||
max_position_embeddings=512,
|
||||
position_buckets=256,
|
||||
share_att_key=True,
|
||||
relative_attention=True,
|
||||
position_biased_input=False,
|
||||
pos_att_type=("p2c", "c2p"),
|
||||
norm_rel_ebd="layer_norm",
|
||||
hidden_act="gelu",
|
||||
hidden_dropout=0.1,
|
||||
activation_dropout=0.1,
|
||||
attention_dropout=0.1,
|
||||
feat_proj_dropout=0.0,
|
||||
final_dropout=0.1,
|
||||
layerdrop=0.1,
|
||||
initializer_range=0.02,
|
||||
layer_norm_eps=1e-5,
|
||||
feat_extract_norm="group",
|
||||
feat_extract_activation="gelu",
|
||||
conv_dim=(64, 128, 128, 128, 128, 256, 256, 256, 256, 512, 512, 512, 512),
|
||||
conv_stride=(5, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1),
|
||||
conv_kernel=(10, 3, 1, 3, 1, 3, 1, 3, 1, 2, 1, 2, 1),
|
||||
conv_bias=False,
|
||||
num_conv_pos_embeddings=128,
|
||||
num_conv_pos_embedding_groups=16,
|
||||
apply_spec_augment=True,
|
||||
mask_time_prob=0.05,
|
||||
mask_time_length=10,
|
||||
mask_feature_prob=0.0,
|
||||
mask_feature_length=10,
|
||||
ctc_loss_reduction="sum",
|
||||
ctc_zero_infinity=False,
|
||||
pad_token_id=0,
|
||||
bos_token_id=1,
|
||||
eos_token_id=2,
|
||||
**kwargs
|
||||
):
|
||||
super().__init__(**kwargs, pad_token_id=pad_token_id, bos_token_id=bos_token_id, eos_token_id=eos_token_id)
|
||||
self.hidden_size = hidden_size
|
||||
self.feat_extract_norm = feat_extract_norm
|
||||
self.feat_extract_activation = feat_extract_activation
|
||||
self.conv_dim = list(conv_dim)
|
||||
self.conv_stride = list(conv_stride)
|
||||
self.conv_kernel = list(conv_kernel)
|
||||
self.conv_bias = conv_bias
|
||||
self.num_conv_pos_embeddings = num_conv_pos_embeddings
|
||||
self.num_conv_pos_embedding_groups = num_conv_pos_embedding_groups
|
||||
self.num_feat_extract_layers = len(self.conv_dim)
|
||||
self.num_hidden_layers = num_hidden_layers
|
||||
self.intermediate_size = intermediate_size
|
||||
self.squeeze_factor = squeeze_factor
|
||||
self.max_position_embeddings = max_position_embeddings
|
||||
self.position_buckets = position_buckets
|
||||
self.share_att_key = share_att_key
|
||||
self.relative_attention = relative_attention
|
||||
self.norm_rel_ebd = norm_rel_ebd
|
||||
self.position_biased_input = position_biased_input
|
||||
self.pos_att_type = list(pos_att_type)
|
||||
self.hidden_act = hidden_act
|
||||
self.num_attention_heads = num_attention_heads
|
||||
self.hidden_dropout = hidden_dropout
|
||||
self.attention_dropout = attention_dropout
|
||||
self.activation_dropout = activation_dropout
|
||||
self.feat_proj_dropout = feat_proj_dropout
|
||||
self.final_dropout = final_dropout
|
||||
self.layerdrop = layerdrop
|
||||
self.layer_norm_eps = layer_norm_eps
|
||||
self.initializer_range = initializer_range
|
||||
self.vocab_size = vocab_size
|
||||
|
||||
if (
|
||||
(len(self.conv_stride) != self.num_feat_extract_layers)
|
||||
or (len(self.conv_kernel) != self.num_feat_extract_layers)
|
||||
or (len(self.conv_dim) != self.num_feat_extract_layers)
|
||||
):
|
||||
raise ValueError(
|
||||
"Configuration for convolutional layers is incorrect."
|
||||
"It is required that `len(config.conv_dim)` == `len(config.conv_stride)` == `len(config.conv_kernel)`,"
|
||||
f"but is `len(config.conv_dim) = {len(self.conv_dim)}`, `len(config.conv_stride)"
|
||||
f"= {len(self.conv_stride)}`, `len(config.conv_kernel) = {len(self.conv_kernel)}`."
|
||||
)
|
||||
|
||||
# fine-tuning config parameters for SpecAugment: https://arxiv.org/abs/1904.08779
|
||||
self.apply_spec_augment = apply_spec_augment
|
||||
self.mask_time_prob = mask_time_prob
|
||||
self.mask_time_length = mask_time_length
|
||||
self.mask_feature_prob = mask_feature_prob
|
||||
self.mask_feature_length = mask_feature_length
|
||||
|
||||
# ctc loss
|
||||
self.ctc_loss_reduction = ctc_loss_reduction
|
||||
self.ctc_zero_infinity = ctc_zero_infinity
|
||||
@@ -0,0 +1,298 @@
|
||||
# coding=utf-8
|
||||
# Copyright 2021 The HuggingFace Inc. team.
|
||||
#
|
||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||
# you may not use this file except in compliance with the License.
|
||||
# You may obtain a copy of the License at
|
||||
#
|
||||
# http://www.apache.org/licenses/LICENSE-2.0
|
||||
#
|
||||
# Unless required by applicable law or agreed to in writing, software
|
||||
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
"""Convert SEW checkpoint."""
|
||||
|
||||
|
||||
import argparse
|
||||
import json
|
||||
import os
|
||||
|
||||
import fairseq
|
||||
import torch
|
||||
from fairseq.data import Dictionary
|
||||
|
||||
# Register SEW's fairseq modules
|
||||
from sew_asapp import tasks # noqa: F401
|
||||
from transformers import (
|
||||
SEWDConfig,
|
||||
SEWDForCTC,
|
||||
SEWDModel,
|
||||
Wav2Vec2CTCTokenizer,
|
||||
Wav2Vec2FeatureExtractor,
|
||||
Wav2Vec2Processor,
|
||||
logging,
|
||||
)
|
||||
|
||||
|
||||
logging.set_verbosity_info()
|
||||
logger = logging.get_logger(__name__)
|
||||
|
||||
MAPPING = {
|
||||
"post_extract_proj": "feature_projection",
|
||||
"encoder.pos_conv.0": "encoder.pos_conv_embed.conv",
|
||||
"attention.self.query_proj": "encoder.encoder.layer.*.attention.self.query_proj",
|
||||
"attention.self.key_proj": "encoder.encoder.layer.*.attention.self.key_proj",
|
||||
"attention.self.value_proj": "encoder.encoder.layer.*.attention.self.value_proj",
|
||||
"attention.output.dense": "encoder.encoder.layer.*.attention.output.dense",
|
||||
"attention.output.LayerNorm": "encoder.encoder.layer.*.attention.output.LayerNorm",
|
||||
"intermediate.dense": "encoder.encoder.layer.*.intermediate.dense",
|
||||
"output.dense": "encoder.encoder.layer.*.output.dense",
|
||||
"output.LayerNorm": "encoder.encoder.layer.*.output.LayerNorm",
|
||||
"encoder.encoder.rel_embeddings": "encoder.encoder.rel_embeddings",
|
||||
"encoder.encoder.LayerNorm": "encoder.encoder.LayerNorm",
|
||||
"encoder.upsample.0": "encoder.upsample.projection",
|
||||
"encoder.layer_norm": "encoder.layer_norm",
|
||||
"w2v_encoder.layer_norm": "layer_norm",
|
||||
"w2v_encoder.proj": "lm_head",
|
||||
"mask_emb": "masked_spec_embed",
|
||||
}
|
||||
|
||||
|
||||
def set_recursively(hf_pointer, key, value, full_name, weight_type):
|
||||
for attribute in key.split("."):
|
||||
hf_pointer = getattr(hf_pointer, attribute)
|
||||
|
||||
if weight_type is not None:
|
||||
hf_shape = getattr(hf_pointer, weight_type).shape
|
||||
else:
|
||||
hf_shape = hf_pointer.shape
|
||||
|
||||
assert (
|
||||
hf_shape == value.shape
|
||||
), f"Shape of hf {key + '.' + weight_type if weight_type is not None else ''} is {hf_shape}, but should be {value.shape} for {full_name}"
|
||||
|
||||
if weight_type == "weight":
|
||||
hf_pointer.weight.data = value
|
||||
elif weight_type == "weight_g":
|
||||
hf_pointer.weight_g.data = value
|
||||
elif weight_type == "weight_v":
|
||||
hf_pointer.weight_v.data = value
|
||||
elif weight_type == "bias":
|
||||
hf_pointer.bias.data = value
|
||||
else:
|
||||
hf_pointer.data = value
|
||||
|
||||
logger.info(f"{key + '.' + weight_type if weight_type is not None else ''} was initialized from {full_name}.")
|
||||
|
||||
|
||||
def recursively_load_weights(fairseq_model, hf_model, is_finetuned):
|
||||
unused_weights = []
|
||||
fairseq_dict = fairseq_model.state_dict()
|
||||
|
||||
feature_extractor = hf_model.sew.feature_extractor if is_finetuned else hf_model.feature_extractor
|
||||
|
||||
for name, value in fairseq_dict.items():
|
||||
is_used = False
|
||||
if "conv_layers" in name:
|
||||
load_conv_layer(
|
||||
name,
|
||||
value,
|
||||
feature_extractor,
|
||||
unused_weights,
|
||||
hf_model.config.feat_extract_norm == "group",
|
||||
)
|
||||
is_used = True
|
||||
else:
|
||||
for key, mapped_key in MAPPING.items():
|
||||
mapped_key = "sew_d." + mapped_key if (is_finetuned and mapped_key != "lm_head") else mapped_key
|
||||
|
||||
if key in name or key.split("w2v_encoder.")[-1] == name.split(".")[0]:
|
||||
is_used = True
|
||||
if "*" in mapped_key:
|
||||
layer_index = name.split(key)[0].split(".")[-2]
|
||||
if not layer_index.isnumeric():
|
||||
continue
|
||||
mapped_key = mapped_key.replace("*", layer_index)
|
||||
if "weight_g" in name:
|
||||
weight_type = "weight_g"
|
||||
elif "weight_v" in name:
|
||||
weight_type = "weight_v"
|
||||
elif "weight" in name:
|
||||
weight_type = "weight"
|
||||
elif "bias" in name:
|
||||
weight_type = "bias"
|
||||
else:
|
||||
weight_type = None
|
||||
set_recursively(hf_model, mapped_key, value, name, weight_type)
|
||||
continue
|
||||
if not is_used:
|
||||
unused_weights.append(name)
|
||||
|
||||
logger.warning(f"Unused weights: {unused_weights}")
|
||||
|
||||
|
||||
def load_conv_layer(full_name, value, feature_extractor, unused_weights, use_group_norm):
|
||||
name = full_name.split("conv_layers.")[-1]
|
||||
items = name.split(".")
|
||||
layer_id = int(items[0])
|
||||
type_id = int(items[1])
|
||||
|
||||
if type_id == 0:
|
||||
if "bias" in name:
|
||||
assert (
|
||||
value.shape == feature_extractor.conv_layers[layer_id].conv.bias.data.shape
|
||||
), f"{full_name} has size {value.shape}, but {feature_extractor.conv_layers[layer_id].conv.bias.data.shape} was found."
|
||||
feature_extractor.conv_layers[layer_id].conv.bias.data = value
|
||||
logger.info(f"Feat extract conv layer {layer_id} was initialized from {full_name}.")
|
||||
elif "weight" in name:
|
||||
assert (
|
||||
value.shape == feature_extractor.conv_layers[layer_id].conv.weight.data.shape
|
||||
), f"{full_name} has size {value.shape}, but {feature_extractor.conv_layers[layer_id].conv.weight.data.shape} was found."
|
||||
feature_extractor.conv_layers[layer_id].conv.weight.data = value
|
||||
logger.info(f"Feat extract conv layer {layer_id} was initialized from {full_name}.")
|
||||
elif (type_id == 2 and not use_group_norm) or (type_id == 2 and layer_id == 0 and use_group_norm):
|
||||
if "bias" in name:
|
||||
assert (
|
||||
value.shape == feature_extractor.conv_layers[layer_id].layer_norm.bias.data.shape
|
||||
), f"{full_name} has size {value.shape}, but {feature_extractor[layer_id].layer_norm.bias.data.shape} was found."
|
||||
feature_extractor.conv_layers[layer_id].layer_norm.bias.data = value
|
||||
logger.info(f"Feat extract layer norm weight of layer {layer_id} was initialized from {full_name}.")
|
||||
elif "weight" in name:
|
||||
assert (
|
||||
value.shape == feature_extractor.conv_layers[layer_id].layer_norm.weight.data.shape
|
||||
), f"{full_name} has size {value.shape}, but {feature_extractor[layer_id].layer_norm.weight.data.shape} was found."
|
||||
feature_extractor.conv_layers[layer_id].layer_norm.weight.data = value
|
||||
logger.info(f"Feat extract layer norm weight of layer {layer_id} was initialized from {full_name}.")
|
||||
else:
|
||||
unused_weights.append(full_name)
|
||||
|
||||
|
||||
def convert_config(model):
|
||||
config = SEWDConfig()
|
||||
fs_config = model.cfg
|
||||
|
||||
config.activation_dropout = fs_config.activation_dropout
|
||||
config.apply_spec_augment = fs_config.mask_prob > 0 or fs_config.mask_channel_prob > 0
|
||||
config.attention_dropout = fs_config.attention_dropout
|
||||
config.conv_bias = fs_config.conv_bias
|
||||
conv_layers = eval(fs_config.conv_feature_layers)
|
||||
config.conv_dim = [x[0] for x in conv_layers]
|
||||
config.conv_kernel = [x[1] for x in conv_layers]
|
||||
config.conv_stride = [x[2] for x in conv_layers]
|
||||
config.feat_extract_activation = "gelu"
|
||||
config.feat_extract_norm = "layer" if fs_config.extractor_mode == "layer_norm" else "group"
|
||||
config.feat_proj_dropout = fs_config.dropout_input
|
||||
config.final_dropout = 0.0
|
||||
config.hidden_act = fs_config.activation_fn.name
|
||||
config.hidden_dropout = fs_config.dropout
|
||||
config.hidden_size = fs_config.encoder_embed_dim
|
||||
config.initializer_range = 0.02
|
||||
config.intermediate_size = fs_config.encoder_ffn_embed_dim
|
||||
config.layer_norm_eps = 1e-5
|
||||
config.layerdrop = fs_config.encoder_layerdrop
|
||||
config.mask_feature_length = fs_config.mask_channel_length
|
||||
config.mask_feature_prob = fs_config.mask_channel_prob
|
||||
config.mask_time_length = fs_config.mask_length
|
||||
config.mask_time_prob = fs_config.mask_prob
|
||||
config.num_attention_heads = fs_config.encoder_attention_heads
|
||||
config.num_conv_pos_embedding_groups = fs_config.conv_pos_groups
|
||||
config.num_conv_pos_embeddings = fs_config.conv_pos
|
||||
config.num_feat_extract_layers = len(conv_layers)
|
||||
config.num_hidden_layers = fs_config.encoder_layers
|
||||
config.squeeze_factor = fs_config.squeeze_factor
|
||||
# DeBERTa-specific parameters:
|
||||
config.max_position_embeddings = fs_config.max_position_embeddings
|
||||
config.position_buckets = fs_config.position_buckets
|
||||
config.share_att_key = fs_config.share_att_key
|
||||
config.relative_attention = fs_config.relative_attention
|
||||
config.position_biased_input = fs_config.position_biased_input
|
||||
config.pos_att_type = tuple(fs_config.pos_att_type.split("|"))
|
||||
config.norm_rel_ebd = fs_config.norm_rel_ebd
|
||||
|
||||
return config
|
||||
|
||||
|
||||
@torch.no_grad()
|
||||
def convert_sew_checkpoint(
|
||||
checkpoint_path, pytorch_dump_folder_path, config_path=None, dict_path=None, is_finetuned=True
|
||||
):
|
||||
"""
|
||||
Copy/paste/tweak model's weights to transformers design.
|
||||
"""
|
||||
|
||||
if is_finetuned:
|
||||
model, _, _ = fairseq.checkpoint_utils.load_model_ensemble_and_task(
|
||||
[checkpoint_path], arg_overrides={"data": "/".join(dict_path.split("/")[:-1])}
|
||||
)
|
||||
else:
|
||||
model, _, _ = fairseq.checkpoint_utils.load_model_ensemble_and_task([checkpoint_path])
|
||||
|
||||
if config_path is not None:
|
||||
config = SEWDConfig.from_pretrained(config_path)
|
||||
else:
|
||||
config = convert_config(model[0])
|
||||
model = model[0].eval()
|
||||
|
||||
return_attention_mask = True if config.feat_extract_norm == "layer" else False
|
||||
feature_extractor = Wav2Vec2FeatureExtractor(
|
||||
feature_size=1,
|
||||
sampling_rate=16000,
|
||||
padding_value=0,
|
||||
do_normalize=True,
|
||||
return_attention_mask=return_attention_mask,
|
||||
)
|
||||
|
||||
if is_finetuned:
|
||||
if dict_path:
|
||||
target_dict = Dictionary.load(dict_path)
|
||||
|
||||
# important change bos & pad token id since CTC symbol is <pad> and
|
||||
# not <s> as in fairseq
|
||||
config.bos_token_id = target_dict.pad_index
|
||||
config.pad_token_id = target_dict.bos_index
|
||||
config.eos_token_id = target_dict.eos_index
|
||||
config.vocab_size = len(target_dict.symbols)
|
||||
vocab_path = os.path.join(pytorch_dump_folder_path, "vocab.json")
|
||||
if not os.path.isdir(pytorch_dump_folder_path):
|
||||
logger.error("--pytorch_dump_folder_path ({}) should be a directory".format(pytorch_dump_folder_path))
|
||||
return
|
||||
os.makedirs(pytorch_dump_folder_path, exist_ok=True)
|
||||
with open(vocab_path, "w", encoding="utf-8") as vocab_handle:
|
||||
json.dump(target_dict.indices, vocab_handle)
|
||||
tokenizer = Wav2Vec2CTCTokenizer(
|
||||
vocab_path,
|
||||
unk_token=target_dict.unk_word,
|
||||
pad_token=target_dict.pad_word,
|
||||
bos_token=target_dict.bos_word,
|
||||
eos_token=target_dict.eos_word,
|
||||
word_delimiter_token="|",
|
||||
do_lower_case=False,
|
||||
)
|
||||
processor = Wav2Vec2Processor(feature_extractor=feature_extractor, tokenizer=tokenizer)
|
||||
processor.save_pretrained(pytorch_dump_folder_path)
|
||||
|
||||
hf_model = SEWDForCTC(config)
|
||||
else:
|
||||
hf_model = SEWDModel(config)
|
||||
feature_extractor.save_pretrained(pytorch_dump_folder_path)
|
||||
|
||||
recursively_load_weights(model, hf_model, is_finetuned)
|
||||
|
||||
hf_model.save_pretrained(pytorch_dump_folder_path)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
parser = argparse.ArgumentParser()
|
||||
parser.add_argument("--pytorch_dump_folder_path", default=None, type=str, help="Path to the output PyTorch model.")
|
||||
parser.add_argument("--checkpoint_path", default=None, type=str, help="Path to fairseq checkpoint")
|
||||
parser.add_argument("--dict_path", default=None, type=str, help="Path to dict of fine-tuned model")
|
||||
parser.add_argument("--config_path", default=None, type=str, help="Path to hf config.json of model to convert")
|
||||
parser.add_argument(
|
||||
"--is_finetuned", action="store_true", help="Whether the model to convert is a fine-tuned model or not"
|
||||
)
|
||||
args = parser.parse_args()
|
||||
convert_sew_checkpoint(
|
||||
args.checkpoint_path, args.pytorch_dump_folder_path, args.config_path, args.dict_path, args.is_finetuned
|
||||
)
|
||||
1540
src/transformers/models/sew_d/modeling_sew_d.py
Normal file
1540
src/transformers/models/sew_d/modeling_sew_d.py
Normal file
File diff suppressed because it is too large
Load Diff
@@ -256,7 +256,7 @@ def _sample_negative_indices(
|
||||
class Wav2Vec2NoLayerNormConvLayer(nn.Module):
|
||||
def __init__(self, config, layer_id=0):
|
||||
super().__init__()
|
||||
self.in_conv_dim = config.conv_dim[layer_id] if layer_id > 0 else 1
|
||||
self.in_conv_dim = config.conv_dim[layer_id - 1] if layer_id > 0 else 1
|
||||
self.out_conv_dim = config.conv_dim[layer_id]
|
||||
|
||||
self.conv = nn.Conv1d(
|
||||
@@ -277,7 +277,7 @@ class Wav2Vec2NoLayerNormConvLayer(nn.Module):
|
||||
class Wav2Vec2LayerNormConvLayer(nn.Module):
|
||||
def __init__(self, config, layer_id=0):
|
||||
super().__init__()
|
||||
self.in_conv_dim = config.conv_dim[layer_id] if layer_id > 0 else 1
|
||||
self.in_conv_dim = config.conv_dim[layer_id - 1] if layer_id > 0 else 1
|
||||
self.out_conv_dim = config.conv_dim[layer_id]
|
||||
|
||||
self.conv = nn.Conv1d(
|
||||
@@ -304,7 +304,7 @@ class Wav2Vec2LayerNormConvLayer(nn.Module):
|
||||
class Wav2Vec2GroupNormConvLayer(nn.Module):
|
||||
def __init__(self, config, layer_id=0):
|
||||
super().__init__()
|
||||
self.in_conv_dim = config.conv_dim[layer_id] if layer_id > 0 else 1
|
||||
self.in_conv_dim = config.conv_dim[layer_id - 1] if layer_id > 0 else 1
|
||||
self.out_conv_dim = config.conv_dim[layer_id]
|
||||
|
||||
self.conv = nn.Conv1d(
|
||||
|
||||
@@ -3281,6 +3281,58 @@ def load_tf_weights_in_roformer(*args, **kwargs):
|
||||
requires_backends(load_tf_weights_in_roformer, ["torch"])
|
||||
|
||||
|
||||
SEW_PRETRAINED_MODEL_ARCHIVE_LIST = None
|
||||
|
||||
|
||||
class SEWForCTC:
|
||||
def __init__(self, *args, **kwargs):
|
||||
requires_backends(self, ["torch"])
|
||||
|
||||
|
||||
class SEWModel:
|
||||
def __init__(self, *args, **kwargs):
|
||||
requires_backends(self, ["torch"])
|
||||
|
||||
@classmethod
|
||||
def from_pretrained(cls, *args, **kwargs):
|
||||
requires_backends(cls, ["torch"])
|
||||
|
||||
|
||||
class SEWPreTrainedModel:
|
||||
def __init__(self, *args, **kwargs):
|
||||
requires_backends(self, ["torch"])
|
||||
|
||||
@classmethod
|
||||
def from_pretrained(cls, *args, **kwargs):
|
||||
requires_backends(cls, ["torch"])
|
||||
|
||||
|
||||
SEW_D_PRETRAINED_MODEL_ARCHIVE_LIST = None
|
||||
|
||||
|
||||
class SEWDForCTC:
|
||||
def __init__(self, *args, **kwargs):
|
||||
requires_backends(self, ["torch"])
|
||||
|
||||
|
||||
class SEWDModel:
|
||||
def __init__(self, *args, **kwargs):
|
||||
requires_backends(self, ["torch"])
|
||||
|
||||
@classmethod
|
||||
def from_pretrained(cls, *args, **kwargs):
|
||||
requires_backends(cls, ["torch"])
|
||||
|
||||
|
||||
class SEWDPreTrainedModel:
|
||||
def __init__(self, *args, **kwargs):
|
||||
requires_backends(self, ["torch"])
|
||||
|
||||
@classmethod
|
||||
def from_pretrained(cls, *args, **kwargs):
|
||||
requires_backends(cls, ["torch"])
|
||||
|
||||
|
||||
class SpeechEncoderDecoderModel:
|
||||
def __init__(self, *args, **kwargs):
|
||||
requires_backends(self, ["torch"])
|
||||
|
||||
503
tests/test_modeling_sew.py
Normal file
503
tests/test_modeling_sew.py
Normal file
@@ -0,0 +1,503 @@
|
||||
# coding=utf-8
|
||||
# Copyright 2021 The HuggingFace Inc. team. All rights reserved.
|
||||
#
|
||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||
# you may not use this file except in compliance with the License.
|
||||
# You may obtain a copy of the License at
|
||||
#
|
||||
# http://www.apache.org/licenses/LICENSE-2.0
|
||||
#
|
||||
# Unless required by applicable law or agreed to in writing, software
|
||||
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
""" Testing suite for the PyTorch Hubert model. """
|
||||
|
||||
|
||||
import math
|
||||
import unittest
|
||||
|
||||
import pytest
|
||||
|
||||
from tests.test_modeling_common import floats_tensor, ids_tensor, random_attention_mask
|
||||
from transformers import SEWConfig, is_torch_available
|
||||
from transformers.testing_utils import require_datasets, require_soundfile, require_torch, slow, tooslow, torch_device
|
||||
|
||||
from .test_configuration_common import ConfigTester
|
||||
from .test_modeling_common import ModelTesterMixin, _config_zero_init
|
||||
|
||||
|
||||
if is_torch_available():
|
||||
import torch
|
||||
|
||||
from transformers import SEWForCTC, SEWModel, Wav2Vec2FeatureExtractor, Wav2Vec2Processor
|
||||
from transformers.models.hubert.modeling_hubert import _compute_mask_indices
|
||||
|
||||
|
||||
class SEWModelTester:
|
||||
def __init__(
|
||||
self,
|
||||
parent,
|
||||
batch_size=13,
|
||||
seq_length=1024, # speech is longer
|
||||
is_training=False,
|
||||
hidden_size=32,
|
||||
feat_extract_norm="group",
|
||||
feat_extract_dropout=0.0,
|
||||
feat_extract_activation="gelu",
|
||||
conv_dim=(64, 32, 32),
|
||||
conv_stride=(5, 2, 1),
|
||||
conv_kernel=(10, 3, 1),
|
||||
conv_bias=False,
|
||||
num_conv_pos_embeddings=31,
|
||||
num_conv_pos_embedding_groups=2,
|
||||
squeeze_factor=2,
|
||||
num_hidden_layers=4,
|
||||
num_attention_heads=2,
|
||||
hidden_dropout=0.1,
|
||||
intermediate_size=20,
|
||||
layer_norm_eps=1e-5,
|
||||
hidden_act="gelu",
|
||||
initializer_range=0.02,
|
||||
vocab_size=32,
|
||||
do_stable_layer_norm=False,
|
||||
scope=None,
|
||||
):
|
||||
self.parent = parent
|
||||
self.batch_size = batch_size
|
||||
self.seq_length = seq_length
|
||||
self.is_training = is_training
|
||||
self.hidden_size = hidden_size
|
||||
self.feat_extract_norm = feat_extract_norm
|
||||
self.feat_extract_dropout = feat_extract_dropout
|
||||
self.feat_extract_activation = feat_extract_activation
|
||||
self.conv_dim = conv_dim
|
||||
self.conv_stride = conv_stride
|
||||
self.conv_kernel = conv_kernel
|
||||
self.conv_bias = conv_bias
|
||||
self.num_conv_pos_embeddings = num_conv_pos_embeddings
|
||||
self.num_conv_pos_embedding_groups = num_conv_pos_embedding_groups
|
||||
self.squeeze_factor = squeeze_factor
|
||||
self.num_hidden_layers = num_hidden_layers
|
||||
self.num_attention_heads = num_attention_heads
|
||||
self.hidden_dropout = hidden_dropout
|
||||
self.intermediate_size = intermediate_size
|
||||
self.layer_norm_eps = layer_norm_eps
|
||||
self.hidden_act = hidden_act
|
||||
self.initializer_range = initializer_range
|
||||
self.vocab_size = vocab_size
|
||||
self.do_stable_layer_norm = do_stable_layer_norm
|
||||
self.scope = scope
|
||||
|
||||
output_seq_length = self.seq_length
|
||||
for kernel, stride in zip(self.conv_kernel, self.conv_stride):
|
||||
output_seq_length = (output_seq_length - (kernel - 1)) / stride
|
||||
self.output_seq_length = int(math.ceil(output_seq_length))
|
||||
self.encoder_seq_length = self.output_seq_length // self.squeeze_factor
|
||||
|
||||
def prepare_config_and_inputs(self):
|
||||
input_values = floats_tensor([self.batch_size, self.seq_length], self.vocab_size)
|
||||
attention_mask = random_attention_mask([self.batch_size, self.seq_length])
|
||||
|
||||
config = self.get_config()
|
||||
|
||||
return config, input_values, attention_mask
|
||||
|
||||
def get_config(self):
|
||||
return SEWConfig(
|
||||
hidden_size=self.hidden_size,
|
||||
feat_extract_norm=self.feat_extract_norm,
|
||||
feat_extract_dropout=self.feat_extract_dropout,
|
||||
feat_extract_activation=self.feat_extract_activation,
|
||||
conv_dim=self.conv_dim,
|
||||
conv_stride=self.conv_stride,
|
||||
conv_kernel=self.conv_kernel,
|
||||
conv_bias=self.conv_bias,
|
||||
num_conv_pos_embeddings=self.num_conv_pos_embeddings,
|
||||
num_conv_pos_embedding_groups=self.num_conv_pos_embedding_groups,
|
||||
squeeze_factor=self.squeeze_factor,
|
||||
num_hidden_layers=self.num_hidden_layers,
|
||||
num_attention_heads=self.num_attention_heads,
|
||||
hidden_dropout=self.hidden_dropout,
|
||||
intermediate_size=self.intermediate_size,
|
||||
layer_norm_eps=self.layer_norm_eps,
|
||||
hidden_act=self.hidden_act,
|
||||
initializer_range=self.initializer_range,
|
||||
vocab_size=self.vocab_size,
|
||||
)
|
||||
|
||||
def create_and_check_model(self, config, input_values, attention_mask):
|
||||
model = SEWModel(config=config)
|
||||
model.to(torch_device)
|
||||
model.eval()
|
||||
result = model(input_values, attention_mask=attention_mask)
|
||||
self.parent.assertEqual(
|
||||
result.last_hidden_state.shape, (self.batch_size, self.output_seq_length, self.hidden_size)
|
||||
)
|
||||
|
||||
def create_and_check_batch_inference(self, config, input_values, *args):
|
||||
# test does not pass for models making use of `group_norm`
|
||||
# check: https://github.com/pytorch/fairseq/issues/3227
|
||||
model = SEWModel(config=config)
|
||||
model.to(torch_device)
|
||||
model.eval()
|
||||
|
||||
input_values = input_values[:3]
|
||||
attention_mask = torch.ones(input_values.shape, device=torch_device, dtype=torch.bool)
|
||||
|
||||
input_lengths = [input_values.shape[-1] // i for i in [4, 2, 1]]
|
||||
|
||||
# pad input
|
||||
for i in range(len(input_lengths)):
|
||||
input_values[i, input_lengths[i] :] = 0.0
|
||||
attention_mask[i, input_lengths[i] :] = 0.0
|
||||
|
||||
batch_outputs = model(input_values, attention_mask=attention_mask).last_hidden_state
|
||||
|
||||
for i in range(input_values.shape[0]):
|
||||
input_slice = input_values[i : i + 1, : input_lengths[i]]
|
||||
output = model(input_slice).last_hidden_state
|
||||
|
||||
batch_output = batch_outputs[i : i + 1, : output.shape[1]]
|
||||
self.parent.assertTrue(torch.allclose(output, batch_output, atol=1e-3))
|
||||
|
||||
def check_ctc_loss(self, config, input_values, *args):
|
||||
model = SEWForCTC(config=config)
|
||||
model.to(torch_device)
|
||||
|
||||
# make sure that dropout is disabled
|
||||
model.eval()
|
||||
|
||||
input_values = input_values[:3]
|
||||
attention_mask = torch.ones(input_values.shape, device=torch_device, dtype=torch.long)
|
||||
|
||||
input_lengths = [input_values.shape[-1] // i for i in [4, 2, 1]]
|
||||
max_length_labels = model._get_feat_extract_output_lengths(torch.tensor(input_lengths))
|
||||
labels = ids_tensor((input_values.shape[0], min(max_length_labels) - 1), model.config.vocab_size)
|
||||
|
||||
# pad input
|
||||
for i in range(len(input_lengths)):
|
||||
input_values[i, input_lengths[i] :] = 0.0
|
||||
attention_mask[i, input_lengths[i] :] = 0
|
||||
|
||||
model.config.ctc_loss_reduction = "sum"
|
||||
sum_loss = model(input_values, attention_mask=attention_mask, labels=labels).loss.item()
|
||||
|
||||
model.config.ctc_loss_reduction = "mean"
|
||||
mean_loss = model(input_values, attention_mask=attention_mask, labels=labels).loss.item()
|
||||
|
||||
self.parent.assertTrue(isinstance(sum_loss, float))
|
||||
self.parent.assertTrue(isinstance(mean_loss, float))
|
||||
|
||||
def check_ctc_training(self, config, input_values, *args):
|
||||
config.ctc_zero_infinity = True
|
||||
model = SEWForCTC(config=config)
|
||||
model.to(torch_device)
|
||||
model.train()
|
||||
|
||||
# freeze feature encoder
|
||||
model.freeze_feature_extractor()
|
||||
|
||||
input_values = input_values[:3]
|
||||
|
||||
input_lengths = [input_values.shape[-1] // i for i in [4, 2, 1]]
|
||||
max_length_labels = model._get_feat_extract_output_lengths(torch.tensor(input_lengths))
|
||||
labels = ids_tensor((input_values.shape[0], max(max_length_labels) - 2), model.config.vocab_size)
|
||||
|
||||
# pad input
|
||||
for i in range(len(input_lengths)):
|
||||
input_values[i, input_lengths[i] :] = 0.0
|
||||
|
||||
if max_length_labels[i] < labels.shape[-1]:
|
||||
# it's important that we make sure that target lenghts are at least
|
||||
# one shorter than logit lenghts to prevent -inf
|
||||
labels[i, max_length_labels[i] - 1 :] = -100
|
||||
|
||||
loss = model(input_values, labels=labels).loss
|
||||
self.parent.assertFalse(torch.isinf(loss).item())
|
||||
|
||||
loss.backward()
|
||||
|
||||
def check_labels_out_of_vocab(self, config, input_values, *args):
|
||||
model = SEWForCTC(config)
|
||||
model.to(torch_device)
|
||||
model.train()
|
||||
|
||||
input_values = input_values[:3]
|
||||
|
||||
input_lengths = [input_values.shape[-1] // i for i in [4, 2, 1]]
|
||||
max_length_labels = model._get_feat_extract_output_lengths(torch.tensor(input_lengths))
|
||||
labels = ids_tensor((input_values.shape[0], max(max_length_labels) - 2), model.config.vocab_size + 100)
|
||||
|
||||
with pytest.raises(ValueError):
|
||||
model(input_values, labels=labels)
|
||||
|
||||
def prepare_config_and_inputs_for_common(self):
|
||||
config, input_values, attention_mask = self.prepare_config_and_inputs()
|
||||
inputs_dict = {"input_values": input_values, "attention_mask": attention_mask}
|
||||
return config, inputs_dict
|
||||
|
||||
|
||||
@require_torch
|
||||
class SEWModelTest(ModelTesterMixin, unittest.TestCase):
|
||||
all_model_classes = (SEWForCTC, SEWModel) if is_torch_available() else ()
|
||||
test_pruning = False
|
||||
test_headmasking = False
|
||||
test_torchscript = False
|
||||
|
||||
def setUp(self):
|
||||
self.model_tester = SEWModelTester(self)
|
||||
self.config_tester = ConfigTester(self, config_class=SEWConfig, hidden_size=37)
|
||||
|
||||
def test_config(self):
|
||||
self.config_tester.run_common_tests()
|
||||
|
||||
def test_model(self):
|
||||
config_and_inputs = self.model_tester.prepare_config_and_inputs()
|
||||
self.model_tester.create_and_check_model(*config_and_inputs)
|
||||
|
||||
def test_ctc_loss_inference(self):
|
||||
config_and_inputs = self.model_tester.prepare_config_and_inputs()
|
||||
self.model_tester.check_ctc_loss(*config_and_inputs)
|
||||
|
||||
def test_ctc_train(self):
|
||||
config_and_inputs = self.model_tester.prepare_config_and_inputs()
|
||||
self.model_tester.check_ctc_training(*config_and_inputs)
|
||||
|
||||
def test_labels_out_of_vocab(self):
|
||||
config_and_inputs = self.model_tester.prepare_config_and_inputs()
|
||||
self.model_tester.check_labels_out_of_vocab(*config_and_inputs)
|
||||
|
||||
# Hubert has no inputs_embeds
|
||||
def test_inputs_embeds(self):
|
||||
pass
|
||||
|
||||
# `input_ids` is renamed to `input_values`
|
||||
def test_forward_signature(self):
|
||||
pass
|
||||
|
||||
# SEW cannot resize token embeddings
|
||||
# since it has no tokens embeddings
|
||||
def test_resize_tokens_embeddings(self):
|
||||
pass
|
||||
|
||||
# SEW has no inputs_embeds
|
||||
# and thus the `get_input_embeddings` fn
|
||||
# is not implemented
|
||||
def test_model_common_attributes(self):
|
||||
pass
|
||||
|
||||
def test_retain_grad_hidden_states_attentions(self):
|
||||
config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()
|
||||
config.output_hidden_states = True
|
||||
config.output_attentions = True
|
||||
|
||||
# no need to test all models as different heads yield the same functionality
|
||||
model_class = self.all_model_classes[0]
|
||||
model = model_class(config)
|
||||
model.to(torch_device)
|
||||
|
||||
# set layer drop to 0
|
||||
model.config.layerdrop = 0.0
|
||||
|
||||
input_values = inputs_dict["input_values"]
|
||||
|
||||
input_lengths = torch.tensor(
|
||||
[input_values.shape[1] for _ in range(input_values.shape[0])], dtype=torch.long, device=torch_device
|
||||
)
|
||||
output_lengths = model._get_feat_extract_output_lengths(input_lengths)
|
||||
|
||||
labels = ids_tensor((input_values.shape[0], output_lengths[0] - 2), self.model_tester.vocab_size)
|
||||
inputs_dict["attention_mask"] = torch.ones_like(inputs_dict["attention_mask"])
|
||||
inputs_dict["labels"] = labels
|
||||
|
||||
outputs = model(**inputs_dict)
|
||||
|
||||
output = outputs[0]
|
||||
|
||||
# Encoder-/Decoder-only models
|
||||
hidden_states = outputs.hidden_states[0]
|
||||
attentions = outputs.attentions[0]
|
||||
|
||||
hidden_states.retain_grad()
|
||||
attentions.retain_grad()
|
||||
|
||||
output.flatten()[0].backward(retain_graph=True)
|
||||
|
||||
self.assertIsNotNone(hidden_states.grad)
|
||||
self.assertIsNotNone(attentions.grad)
|
||||
|
||||
def test_initialization(self):
|
||||
config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()
|
||||
|
||||
configs_no_init = _config_zero_init(config)
|
||||
for model_class in self.all_model_classes:
|
||||
model = model_class(config=configs_no_init)
|
||||
for name, param in model.named_parameters():
|
||||
uniform_init_parms = [
|
||||
"conv.weight",
|
||||
"masked_spec_embed",
|
||||
"quantizer.weight_proj.weight",
|
||||
]
|
||||
if param.requires_grad:
|
||||
if any([x in name for x in uniform_init_parms]):
|
||||
self.assertTrue(
|
||||
-1.0 <= ((param.data.mean() * 1e9).round() / 1e9).item() <= 1.0,
|
||||
msg=f"Parameter {name} of model {model_class} seems not properly initialized",
|
||||
)
|
||||
else:
|
||||
self.assertIn(
|
||||
((param.data.mean() * 1e9).round() / 1e9).item(),
|
||||
[0.0, 1.0],
|
||||
msg=f"Parameter {name} of model {model_class} seems not properly initialized",
|
||||
)
|
||||
|
||||
# overwrite from test_modeling_common
|
||||
def _mock_init_weights(self, module):
|
||||
if hasattr(module, "weight") and module.weight is not None:
|
||||
module.weight.data.fill_(3)
|
||||
if hasattr(module, "weight_g") and module.weight_g is not None:
|
||||
module.weight_g.data.fill_(3)
|
||||
if hasattr(module, "weight_v") and module.weight_v is not None:
|
||||
module.weight_v.data.fill_(3)
|
||||
if hasattr(module, "bias") and module.bias is not None:
|
||||
module.bias.data.fill_(3)
|
||||
if hasattr(module, "masked_spec_embed") and module.masked_spec_embed is not None:
|
||||
module.masked_spec_embed.data.fill_(3)
|
||||
|
||||
@slow
|
||||
def test_model_from_pretrained(self):
|
||||
model = SEWModel.from_pretrained("asapp/sew-tiny-100k")
|
||||
self.assertIsNotNone(model)
|
||||
|
||||
|
||||
@require_torch
|
||||
class SEWUtilsTest(unittest.TestCase):
|
||||
def test_compute_mask_indices(self):
|
||||
batch_size = 4
|
||||
sequence_length = 60
|
||||
mask_prob = 0.5
|
||||
mask_length = 1
|
||||
|
||||
mask = _compute_mask_indices((batch_size, sequence_length), mask_prob, mask_length)
|
||||
mask = torch.from_numpy(mask).to(torch_device)
|
||||
|
||||
self.assertListEqual(mask.sum(axis=-1).tolist(), [mask_prob * sequence_length for _ in range(batch_size)])
|
||||
|
||||
def test_compute_mask_indices_overlap(self):
|
||||
batch_size = 4
|
||||
sequence_length = 80
|
||||
mask_prob = 0.5
|
||||
mask_length = 4
|
||||
|
||||
mask = _compute_mask_indices((batch_size, sequence_length), mask_prob, mask_length)
|
||||
mask = torch.from_numpy(mask).to(torch_device)
|
||||
|
||||
# because of overlap mask don't have to add up exactly to `mask_prob * sequence_length`, but have to be smaller or equal
|
||||
for batch_sum in mask.sum(axis=-1):
|
||||
self.assertTrue(int(batch_sum) <= mask_prob * sequence_length)
|
||||
|
||||
|
||||
@require_torch
|
||||
@require_datasets
|
||||
@require_soundfile
|
||||
@slow
|
||||
class SEWModelIntegrationTest(unittest.TestCase):
|
||||
def _load_datasamples(self, num_samples):
|
||||
from datasets import load_dataset
|
||||
|
||||
import soundfile as sf
|
||||
|
||||
ids = [f"1272-141231-000{i}" for i in range(num_samples)]
|
||||
|
||||
# map files to raw
|
||||
def map_to_array(batch):
|
||||
speech, _ = sf.read(batch["file"])
|
||||
batch["speech"] = speech
|
||||
return batch
|
||||
|
||||
ds = load_dataset("patrickvonplaten/librispeech_asr_dummy", "clean", split="validation")
|
||||
|
||||
ds = ds.filter(lambda x: x["id"] in ids).sort("id").map(map_to_array)
|
||||
|
||||
return ds["speech"][:num_samples]
|
||||
|
||||
def test_inference_pretrained_batched(self):
|
||||
model = SEWModel.from_pretrained("asapp/sew-tiny-100k").to(torch_device)
|
||||
processor = Wav2Vec2FeatureExtractor.from_pretrained("asapp/sew-tiny-100k")
|
||||
|
||||
input_speech = self._load_datasamples(2)
|
||||
|
||||
inputs = processor(input_speech, return_tensors="pt", padding=True)
|
||||
|
||||
input_values = inputs.input_values.to(torch_device)
|
||||
|
||||
with torch.no_grad():
|
||||
outputs = model(input_values).last_hidden_state
|
||||
|
||||
# expected outputs taken from the original SEW implementation
|
||||
expected_outputs_first = torch.tensor(
|
||||
[
|
||||
[
|
||||
[0.1509, 0.5372, 0.3061, -0.1694],
|
||||
[-0.1700, 0.5764, 0.2753, -0.1299],
|
||||
[0.1281, 0.7949, 0.2342, -0.1624],
|
||||
[-0.1627, 0.6710, 0.2215, -0.1317],
|
||||
],
|
||||
[
|
||||
[0.0408, 1.4355, 0.8605, -0.0968],
|
||||
[0.0393, 1.2368, 0.6826, 0.0364],
|
||||
[-0.1269, 1.9215, 1.1677, -0.1297],
|
||||
[-0.1654, 1.6524, 0.6877, -0.0196],
|
||||
],
|
||||
],
|
||||
device=torch_device,
|
||||
)
|
||||
expected_outputs_last = torch.tensor(
|
||||
[
|
||||
[
|
||||
[1.3379, -0.1450, -0.1500, -0.0515],
|
||||
[0.8364, -0.1680, -0.1248, -0.0689],
|
||||
[1.2791, -0.1507, -0.1523, -0.0564],
|
||||
[0.8208, -0.1690, -0.1199, -0.0751],
|
||||
],
|
||||
[
|
||||
[0.6959, -0.0861, -0.1235, -0.0861],
|
||||
[0.4700, -0.1686, -0.1141, -0.1199],
|
||||
[1.0776, -0.1137, -0.0124, -0.0472],
|
||||
[0.5774, -0.1675, -0.0376, -0.0823],
|
||||
],
|
||||
],
|
||||
device=torch_device,
|
||||
)
|
||||
expected_output_sum = 62146.7422
|
||||
|
||||
self.assertTrue(torch.allclose(outputs[:, :4, :4], expected_outputs_first, atol=5e-3))
|
||||
self.assertTrue(torch.allclose(outputs[:, -4:, -4:], expected_outputs_last, atol=5e-3))
|
||||
self.assertTrue(abs(outputs.sum() - expected_output_sum) < 2)
|
||||
|
||||
@tooslow
|
||||
def test_inference_ctc_batched(self):
|
||||
# TODO: enable this test once the finetuned models are available
|
||||
model = SEWForCTC.from_pretrained("asapp/sew-tiny-100k-ft-100h").to(torch_device)
|
||||
processor = Wav2Vec2Processor.from_pretrained("asapp/sew-tiny-100k-ft-100h", do_lower_case=True)
|
||||
|
||||
input_speech = self._load_datasamples(2)
|
||||
|
||||
inputs = processor(input_speech, return_tensors="pt", padding=True)
|
||||
|
||||
input_values = inputs.input_values.to(torch_device)
|
||||
attention_mask = inputs.attention_mask.to(torch_device)
|
||||
|
||||
with torch.no_grad():
|
||||
logits = model(input_values, attention_mask=attention_mask).logits
|
||||
|
||||
predicted_ids = torch.argmax(logits, dim=-1)
|
||||
predicted_trans = processor.batch_decode(predicted_ids)
|
||||
|
||||
EXPECTED_TRANSCRIPTIONS = [
|
||||
"a man said to the universe sir i exist",
|
||||
"sweat covered brion's body trickling into the tight loin cloth that was the only garment he wore",
|
||||
]
|
||||
self.assertListEqual(predicted_trans, EXPECTED_TRANSCRIPTIONS)
|
||||
524
tests/test_modeling_sew_d.py
Normal file
524
tests/test_modeling_sew_d.py
Normal file
@@ -0,0 +1,524 @@
|
||||
# coding=utf-8
|
||||
# Copyright 2021 The HuggingFace Inc. team. All rights reserved.
|
||||
#
|
||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||
# you may not use this file except in compliance with the License.
|
||||
# You may obtain a copy of the License at
|
||||
#
|
||||
# http://www.apache.org/licenses/LICENSE-2.0
|
||||
#
|
||||
# Unless required by applicable law or agreed to in writing, software
|
||||
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
""" Testing suite for the PyTorch Hubert model. """
|
||||
|
||||
|
||||
import math
|
||||
import unittest
|
||||
|
||||
import pytest
|
||||
|
||||
from tests.test_modeling_common import floats_tensor, ids_tensor, random_attention_mask
|
||||
from transformers import SEWDConfig, is_torch_available
|
||||
from transformers.testing_utils import require_datasets, require_soundfile, require_torch, slow, tooslow, torch_device
|
||||
|
||||
from .test_configuration_common import ConfigTester
|
||||
from .test_modeling_common import ModelTesterMixin, _config_zero_init
|
||||
|
||||
|
||||
if is_torch_available():
|
||||
import torch
|
||||
|
||||
from transformers import SEWDForCTC, SEWDModel, Wav2Vec2FeatureExtractor, Wav2Vec2Processor
|
||||
from transformers.models.hubert.modeling_hubert import _compute_mask_indices
|
||||
|
||||
|
||||
class SEWDModelTester:
|
||||
def __init__(
|
||||
self,
|
||||
parent,
|
||||
batch_size=13,
|
||||
seq_length=1024, # speech is longer
|
||||
is_training=False,
|
||||
hidden_size=32,
|
||||
feat_extract_norm="group",
|
||||
feat_extract_dropout=0.0,
|
||||
feat_extract_activation="gelu",
|
||||
conv_dim=(64, 32, 32),
|
||||
conv_stride=(5, 2, 1),
|
||||
conv_kernel=(10, 3, 1),
|
||||
conv_bias=False,
|
||||
num_conv_pos_embeddings=31,
|
||||
num_conv_pos_embedding_groups=2,
|
||||
squeeze_factor=2,
|
||||
max_position_embeddings=512,
|
||||
position_buckets=256,
|
||||
share_att_key=True,
|
||||
relative_attention=True,
|
||||
position_biased_input=False,
|
||||
pos_att_type=("p2c", "c2p"),
|
||||
norm_rel_ebd="layer_norm",
|
||||
num_hidden_layers=4,
|
||||
num_attention_heads=2,
|
||||
hidden_dropout=0.1,
|
||||
intermediate_size=20,
|
||||
layer_norm_eps=1e-5,
|
||||
hidden_act="gelu",
|
||||
initializer_range=0.02,
|
||||
vocab_size=32,
|
||||
do_stable_layer_norm=False,
|
||||
scope=None,
|
||||
):
|
||||
self.parent = parent
|
||||
self.batch_size = batch_size
|
||||
self.seq_length = seq_length
|
||||
self.is_training = is_training
|
||||
self.hidden_size = hidden_size
|
||||
self.feat_extract_norm = feat_extract_norm
|
||||
self.feat_extract_dropout = feat_extract_dropout
|
||||
self.feat_extract_activation = feat_extract_activation
|
||||
self.conv_dim = conv_dim
|
||||
self.conv_stride = conv_stride
|
||||
self.conv_kernel = conv_kernel
|
||||
self.conv_bias = conv_bias
|
||||
self.num_conv_pos_embeddings = num_conv_pos_embeddings
|
||||
self.num_conv_pos_embedding_groups = num_conv_pos_embedding_groups
|
||||
self.squeeze_factor = squeeze_factor
|
||||
self.max_position_embeddings = max_position_embeddings
|
||||
self.position_buckets = position_buckets
|
||||
self.share_att_key = share_att_key
|
||||
self.relative_attention = relative_attention
|
||||
self.position_biased_input = position_biased_input
|
||||
self.pos_att_type = pos_att_type
|
||||
self.norm_rel_ebd = norm_rel_ebd
|
||||
self.num_hidden_layers = num_hidden_layers
|
||||
self.num_attention_heads = num_attention_heads
|
||||
self.hidden_dropout = hidden_dropout
|
||||
self.intermediate_size = intermediate_size
|
||||
self.layer_norm_eps = layer_norm_eps
|
||||
self.hidden_act = hidden_act
|
||||
self.initializer_range = initializer_range
|
||||
self.vocab_size = vocab_size
|
||||
self.do_stable_layer_norm = do_stable_layer_norm
|
||||
self.scope = scope
|
||||
|
||||
output_seq_length = self.seq_length
|
||||
for kernel, stride in zip(self.conv_kernel, self.conv_stride):
|
||||
output_seq_length = (output_seq_length - (kernel - 1)) / stride
|
||||
self.output_seq_length = int(math.ceil(output_seq_length))
|
||||
self.encoder_seq_length = self.output_seq_length // self.squeeze_factor
|
||||
|
||||
def prepare_config_and_inputs(self):
|
||||
input_values = floats_tensor([self.batch_size, self.seq_length], self.vocab_size)
|
||||
attention_mask = random_attention_mask([self.batch_size, self.seq_length])
|
||||
|
||||
config = self.get_config()
|
||||
|
||||
return config, input_values, attention_mask
|
||||
|
||||
def get_config(self):
|
||||
return SEWDConfig(
|
||||
hidden_size=self.hidden_size,
|
||||
feat_extract_norm=self.feat_extract_norm,
|
||||
feat_extract_dropout=self.feat_extract_dropout,
|
||||
feat_extract_activation=self.feat_extract_activation,
|
||||
conv_dim=self.conv_dim,
|
||||
conv_stride=self.conv_stride,
|
||||
conv_kernel=self.conv_kernel,
|
||||
conv_bias=self.conv_bias,
|
||||
num_conv_pos_embeddings=self.num_conv_pos_embeddings,
|
||||
num_conv_pos_embedding_groups=self.num_conv_pos_embedding_groups,
|
||||
squeeze_factor=self.squeeze_factor,
|
||||
max_position_embeddings=self.max_position_embeddings,
|
||||
position_buckets=self.position_buckets,
|
||||
share_att_key=self.share_att_key,
|
||||
relative_attention=self.relative_attention,
|
||||
position_biased_input=self.position_biased_input,
|
||||
pos_att_type=self.pos_att_type,
|
||||
norm_rel_ebd=self.norm_rel_ebd,
|
||||
num_hidden_layers=self.num_hidden_layers,
|
||||
num_attention_heads=self.num_attention_heads,
|
||||
hidden_dropout=self.hidden_dropout,
|
||||
intermediate_size=self.intermediate_size,
|
||||
layer_norm_eps=self.layer_norm_eps,
|
||||
hidden_act=self.hidden_act,
|
||||
initializer_range=self.initializer_range,
|
||||
vocab_size=self.vocab_size,
|
||||
)
|
||||
|
||||
def create_and_check_model(self, config, input_values, attention_mask):
|
||||
model = SEWDModel(config=config)
|
||||
model.to(torch_device)
|
||||
model.eval()
|
||||
result = model(input_values, attention_mask=attention_mask)
|
||||
self.parent.assertEqual(
|
||||
result.last_hidden_state.shape, (self.batch_size, self.output_seq_length, self.hidden_size)
|
||||
)
|
||||
|
||||
def create_and_check_batch_inference(self, config, input_values, *args):
|
||||
# test does not pass for models making use of `group_norm`
|
||||
# check: https://github.com/pytorch/fairseq/issues/3227
|
||||
model = SEWDModel(config=config)
|
||||
model.to(torch_device)
|
||||
model.eval()
|
||||
|
||||
input_values = input_values[:3]
|
||||
attention_mask = torch.ones(input_values.shape, device=torch_device, dtype=torch.bool)
|
||||
|
||||
input_lengths = [input_values.shape[-1] // i for i in [4, 2, 1]]
|
||||
|
||||
# pad input
|
||||
for i in range(len(input_lengths)):
|
||||
input_values[i, input_lengths[i] :] = 0.0
|
||||
attention_mask[i, input_lengths[i] :] = 0.0
|
||||
|
||||
batch_outputs = model(input_values, attention_mask=attention_mask).last_hidden_state
|
||||
|
||||
for i in range(input_values.shape[0]):
|
||||
input_slice = input_values[i : i + 1, : input_lengths[i]]
|
||||
output = model(input_slice).last_hidden_state
|
||||
|
||||
batch_output = batch_outputs[i : i + 1, : output.shape[1]]
|
||||
self.parent.assertTrue(torch.allclose(output, batch_output, atol=1e-3))
|
||||
|
||||
def check_ctc_loss(self, config, input_values, *args):
|
||||
model = SEWDForCTC(config=config)
|
||||
model.to(torch_device)
|
||||
|
||||
# make sure that dropout is disabled
|
||||
model.eval()
|
||||
|
||||
input_values = input_values[:3]
|
||||
attention_mask = torch.ones(input_values.shape, device=torch_device, dtype=torch.long)
|
||||
|
||||
input_lengths = [input_values.shape[-1] // i for i in [4, 2, 1]]
|
||||
max_length_labels = model._get_feat_extract_output_lengths(torch.tensor(input_lengths))
|
||||
labels = ids_tensor((input_values.shape[0], min(max_length_labels) - 1), model.config.vocab_size)
|
||||
|
||||
# pad input
|
||||
for i in range(len(input_lengths)):
|
||||
input_values[i, input_lengths[i] :] = 0.0
|
||||
attention_mask[i, input_lengths[i] :] = 0
|
||||
|
||||
model.config.ctc_loss_reduction = "sum"
|
||||
sum_loss = model(input_values, attention_mask=attention_mask, labels=labels).loss.item()
|
||||
|
||||
model.config.ctc_loss_reduction = "mean"
|
||||
mean_loss = model(input_values, attention_mask=attention_mask, labels=labels).loss.item()
|
||||
|
||||
self.parent.assertTrue(isinstance(sum_loss, float))
|
||||
self.parent.assertTrue(isinstance(mean_loss, float))
|
||||
|
||||
def check_ctc_training(self, config, input_values, *args):
|
||||
config.ctc_zero_infinity = True
|
||||
model = SEWDForCTC(config=config)
|
||||
model.to(torch_device)
|
||||
model.train()
|
||||
|
||||
# freeze feature encoder
|
||||
model.freeze_feature_extractor()
|
||||
|
||||
input_values = input_values[:3]
|
||||
|
||||
input_lengths = [input_values.shape[-1] // i for i in [4, 2, 1]]
|
||||
max_length_labels = model._get_feat_extract_output_lengths(torch.tensor(input_lengths))
|
||||
labels = ids_tensor((input_values.shape[0], max(max_length_labels) - 2), model.config.vocab_size)
|
||||
|
||||
# pad input
|
||||
for i in range(len(input_lengths)):
|
||||
input_values[i, input_lengths[i] :] = 0.0
|
||||
|
||||
if max_length_labels[i] < labels.shape[-1]:
|
||||
# it's important that we make sure that target lenghts are at least
|
||||
# one shorter than logit lenghts to prevent -inf
|
||||
labels[i, max_length_labels[i] - 1 :] = -100
|
||||
|
||||
loss = model(input_values, labels=labels).loss
|
||||
self.parent.assertFalse(torch.isinf(loss).item())
|
||||
|
||||
loss.backward()
|
||||
|
||||
def check_labels_out_of_vocab(self, config, input_values, *args):
|
||||
model = SEWDForCTC(config)
|
||||
model.to(torch_device)
|
||||
model.train()
|
||||
|
||||
input_values = input_values[:3]
|
||||
|
||||
input_lengths = [input_values.shape[-1] // i for i in [4, 2, 1]]
|
||||
max_length_labels = model._get_feat_extract_output_lengths(torch.tensor(input_lengths))
|
||||
labels = ids_tensor((input_values.shape[0], max(max_length_labels) - 2), model.config.vocab_size + 100)
|
||||
|
||||
with pytest.raises(ValueError):
|
||||
model(input_values, labels=labels)
|
||||
|
||||
def prepare_config_and_inputs_for_common(self):
|
||||
config, input_values, attention_mask = self.prepare_config_and_inputs()
|
||||
inputs_dict = {"input_values": input_values, "attention_mask": attention_mask}
|
||||
return config, inputs_dict
|
||||
|
||||
|
||||
@require_torch
|
||||
class SEWDModelTest(ModelTesterMixin, unittest.TestCase):
|
||||
all_model_classes = (SEWDForCTC, SEWDModel) if is_torch_available() else ()
|
||||
test_pruning = False
|
||||
test_headmasking = False
|
||||
test_torchscript = False
|
||||
|
||||
def setUp(self):
|
||||
self.model_tester = SEWDModelTester(self)
|
||||
self.config_tester = ConfigTester(self, config_class=SEWDConfig, hidden_size=37)
|
||||
|
||||
def test_config(self):
|
||||
self.config_tester.run_common_tests()
|
||||
|
||||
def test_model(self):
|
||||
config_and_inputs = self.model_tester.prepare_config_and_inputs()
|
||||
self.model_tester.create_and_check_model(*config_and_inputs)
|
||||
|
||||
def test_ctc_loss_inference(self):
|
||||
config_and_inputs = self.model_tester.prepare_config_and_inputs()
|
||||
self.model_tester.check_ctc_loss(*config_and_inputs)
|
||||
|
||||
def test_ctc_train(self):
|
||||
config_and_inputs = self.model_tester.prepare_config_and_inputs()
|
||||
self.model_tester.check_ctc_training(*config_and_inputs)
|
||||
|
||||
def test_labels_out_of_vocab(self):
|
||||
config_and_inputs = self.model_tester.prepare_config_and_inputs()
|
||||
self.model_tester.check_labels_out_of_vocab(*config_and_inputs)
|
||||
|
||||
# Hubert has no inputs_embeds
|
||||
def test_inputs_embeds(self):
|
||||
pass
|
||||
|
||||
# `input_ids` is renamed to `input_values`
|
||||
def test_forward_signature(self):
|
||||
pass
|
||||
|
||||
# SEW cannot resize token embeddings
|
||||
# since it has no tokens embeddings
|
||||
def test_resize_tokens_embeddings(self):
|
||||
pass
|
||||
|
||||
# SEW has no inputs_embeds
|
||||
# and thus the `get_input_embeddings` fn
|
||||
# is not implemented
|
||||
def test_model_common_attributes(self):
|
||||
pass
|
||||
|
||||
def test_retain_grad_hidden_states_attentions(self):
|
||||
config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()
|
||||
config.output_hidden_states = True
|
||||
config.output_attentions = True
|
||||
|
||||
# no need to test all models as different heads yield the same functionality
|
||||
model_class = self.all_model_classes[0]
|
||||
model = model_class(config)
|
||||
model.to(torch_device)
|
||||
|
||||
# set layer drop to 0
|
||||
model.config.layerdrop = 0.0
|
||||
|
||||
input_values = inputs_dict["input_values"]
|
||||
|
||||
input_lengths = torch.tensor(
|
||||
[input_values.shape[1] for _ in range(input_values.shape[0])], dtype=torch.long, device=torch_device
|
||||
)
|
||||
output_lengths = model._get_feat_extract_output_lengths(input_lengths)
|
||||
|
||||
labels = ids_tensor((input_values.shape[0], output_lengths[0] - 2), self.model_tester.vocab_size)
|
||||
inputs_dict["attention_mask"] = torch.ones_like(inputs_dict["attention_mask"])
|
||||
inputs_dict["labels"] = labels
|
||||
|
||||
outputs = model(**inputs_dict)
|
||||
|
||||
output = outputs[0]
|
||||
|
||||
# Encoder-/Decoder-only models
|
||||
hidden_states = outputs.hidden_states[0]
|
||||
attentions = outputs.attentions[0]
|
||||
|
||||
hidden_states.retain_grad()
|
||||
attentions.retain_grad()
|
||||
|
||||
output.flatten()[0].backward(retain_graph=True)
|
||||
|
||||
self.assertIsNotNone(hidden_states.grad)
|
||||
self.assertIsNotNone(attentions.grad)
|
||||
|
||||
def test_initialization(self):
|
||||
config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()
|
||||
|
||||
configs_no_init = _config_zero_init(config)
|
||||
for model_class in self.all_model_classes:
|
||||
model = model_class(config=configs_no_init)
|
||||
for name, param in model.named_parameters():
|
||||
uniform_init_parms = [
|
||||
"conv.weight",
|
||||
"masked_spec_embed",
|
||||
"quantizer.weight_proj.weight",
|
||||
]
|
||||
if param.requires_grad:
|
||||
if any([x in name for x in uniform_init_parms]):
|
||||
self.assertTrue(
|
||||
-1.0 <= ((param.data.mean() * 1e9).round() / 1e9).item() <= 1.0,
|
||||
msg=f"Parameter {name} of model {model_class} seems not properly initialized",
|
||||
)
|
||||
else:
|
||||
self.assertIn(
|
||||
((param.data.mean() * 1e9).round() / 1e9).item(),
|
||||
[0.0, 1.0],
|
||||
msg=f"Parameter {name} of model {model_class} seems not properly initialized",
|
||||
)
|
||||
|
||||
# overwrite from test_modeling_common
|
||||
def _mock_init_weights(self, module):
|
||||
if hasattr(module, "weight") and module.weight is not None:
|
||||
module.weight.data.fill_(3)
|
||||
if hasattr(module, "weight_g") and module.weight_g is not None:
|
||||
module.weight_g.data.fill_(3)
|
||||
if hasattr(module, "weight_v") and module.weight_v is not None:
|
||||
module.weight_v.data.fill_(3)
|
||||
if hasattr(module, "bias") and module.bias is not None:
|
||||
module.bias.data.fill_(3)
|
||||
if hasattr(module, "masked_spec_embed") and module.masked_spec_embed is not None:
|
||||
module.masked_spec_embed.data.fill_(3)
|
||||
|
||||
@slow
|
||||
def test_model_from_pretrained(self):
|
||||
model = SEWDModel.from_pretrained("asapp/sew-d-tiny-100k")
|
||||
self.assertIsNotNone(model)
|
||||
|
||||
|
||||
@require_torch
|
||||
class SEWDUtilsTest(unittest.TestCase):
|
||||
def test_compute_mask_indices(self):
|
||||
batch_size = 4
|
||||
sequence_length = 60
|
||||
mask_prob = 0.5
|
||||
mask_length = 1
|
||||
|
||||
mask = _compute_mask_indices((batch_size, sequence_length), mask_prob, mask_length)
|
||||
mask = torch.from_numpy(mask).to(torch_device)
|
||||
|
||||
self.assertListEqual(mask.sum(axis=-1).tolist(), [mask_prob * sequence_length for _ in range(batch_size)])
|
||||
|
||||
def test_compute_mask_indices_overlap(self):
|
||||
batch_size = 4
|
||||
sequence_length = 80
|
||||
mask_prob = 0.5
|
||||
mask_length = 4
|
||||
|
||||
mask = _compute_mask_indices((batch_size, sequence_length), mask_prob, mask_length)
|
||||
mask = torch.from_numpy(mask).to(torch_device)
|
||||
|
||||
# because of overlap mask don't have to add up exactly to `mask_prob * sequence_length`, but have to be smaller or equal
|
||||
for batch_sum in mask.sum(axis=-1):
|
||||
self.assertTrue(int(batch_sum) <= mask_prob * sequence_length)
|
||||
|
||||
|
||||
@require_torch
|
||||
@require_datasets
|
||||
@require_soundfile
|
||||
@slow
|
||||
class SEWDModelIntegrationTest(unittest.TestCase):
|
||||
def _load_datasamples(self, num_samples):
|
||||
from datasets import load_dataset
|
||||
|
||||
import soundfile as sf
|
||||
|
||||
ids = [f"1272-141231-000{i}" for i in range(num_samples)]
|
||||
|
||||
# map files to raw
|
||||
def map_to_array(batch):
|
||||
speech, _ = sf.read(batch["file"])
|
||||
batch["speech"] = speech
|
||||
return batch
|
||||
|
||||
ds = load_dataset("patrickvonplaten/librispeech_asr_dummy", "clean", split="validation")
|
||||
|
||||
ds = ds.filter(lambda x: x["id"] in ids).sort("id").map(map_to_array)
|
||||
|
||||
return ds["speech"][:num_samples]
|
||||
|
||||
def test_inference_pretrained_batched(self):
|
||||
model = SEWDModel.from_pretrained("asapp/sew-d-tiny-100k").to(torch_device)
|
||||
processor = Wav2Vec2FeatureExtractor.from_pretrained("asapp/sew-d-tiny-100k")
|
||||
|
||||
input_speech = self._load_datasamples(2)
|
||||
|
||||
inputs = processor(input_speech, return_tensors="pt", padding=True)
|
||||
|
||||
input_values = inputs.input_values.to(torch_device)
|
||||
|
||||
with torch.no_grad():
|
||||
outputs = model(input_values).last_hidden_state
|
||||
|
||||
# expected outputs taken from the original SEW-D implementation
|
||||
expected_outputs_first = torch.tensor(
|
||||
[
|
||||
[
|
||||
[-0.1619, 0.6995, 0.4062, -0.1014],
|
||||
[-0.1364, 0.5960, 0.0952, -0.0873],
|
||||
[-0.1572, 0.5718, 0.4228, -0.0864],
|
||||
[-0.1325, 0.6823, 0.1387, -0.0871],
|
||||
],
|
||||
[
|
||||
[-0.1296, 0.4008, 0.4952, -0.1450],
|
||||
[-0.1152, 0.3693, 0.3037, -0.1290],
|
||||
[-0.1194, 0.6074, 0.3531, -0.1466],
|
||||
[-0.1113, 0.3135, 0.2224, -0.1338],
|
||||
],
|
||||
],
|
||||
device=torch_device,
|
||||
)
|
||||
expected_outputs_last = torch.tensor(
|
||||
[
|
||||
[
|
||||
[-0.1577, 0.5108, 0.8553, 0.2550],
|
||||
[-0.1530, 0.3580, 0.6143, 0.2672],
|
||||
[-0.1535, 0.4954, 0.8503, 0.1387],
|
||||
[-0.1572, 0.3363, 0.6217, 0.1490],
|
||||
],
|
||||
[
|
||||
[-0.1338, 0.5459, 0.9607, -0.1133],
|
||||
[-0.1502, 0.3738, 0.7313, -0.0986],
|
||||
[-0.0953, 0.4708, 1.0821, -0.0944],
|
||||
[-0.1474, 0.3598, 0.7248, -0.0748],
|
||||
],
|
||||
],
|
||||
device=torch_device,
|
||||
)
|
||||
expected_output_sum = 54201.0469
|
||||
|
||||
self.assertTrue(torch.allclose(outputs[:, :4, :4], expected_outputs_first, atol=5e-3))
|
||||
self.assertTrue(torch.allclose(outputs[:, -4:, -4:], expected_outputs_last, atol=5e-3))
|
||||
self.assertTrue(abs(outputs.sum() - expected_output_sum) < 5)
|
||||
|
||||
@tooslow
|
||||
def test_inference_ctc_batched(self):
|
||||
# TODO: enable this test once the finetuned models are available
|
||||
model = SEWDForCTC.from_pretrained("asapp/sew-d-tiny-100k-ft-100h").to(torch_device)
|
||||
processor = Wav2Vec2Processor.from_pretrained("asapp/sew-d-tiny-100k-ft-100h", do_lower_case=True)
|
||||
|
||||
input_speech = self._load_datasamples(2)
|
||||
|
||||
inputs = processor(input_speech, return_tensors="pt", padding=True)
|
||||
|
||||
input_values = inputs.input_values.to(torch_device)
|
||||
attention_mask = inputs.attention_mask.to(torch_device)
|
||||
|
||||
with torch.no_grad():
|
||||
logits = model(input_values, attention_mask=attention_mask).logits
|
||||
|
||||
predicted_ids = torch.argmax(logits, dim=-1)
|
||||
predicted_trans = processor.batch_decode(predicted_ids)
|
||||
|
||||
EXPECTED_TRANSCRIPTIONS = [
|
||||
"a man said to the universe sir i exist",
|
||||
"sweat covered brion's body trickling into the tight loin cloth that was the only garment he wore",
|
||||
]
|
||||
self.assertListEqual(predicted_trans, EXPECTED_TRANSCRIPTIONS)
|
||||
@@ -123,6 +123,8 @@ IGNORE_NON_AUTO_CONFIGURED = PRIVATE_MODELS.copy() + [
|
||||
"TFRagTokenForGeneration",
|
||||
"Wav2Vec2ForCTC",
|
||||
"HubertForCTC",
|
||||
"SEWForCTC",
|
||||
"SEWDForCTC",
|
||||
"XLMForQuestionAnswering",
|
||||
"XLNetForQuestionAnswering",
|
||||
"SeparableConv1D",
|
||||
|
||||
Reference in New Issue
Block a user