diff --git a/README.md b/README.md index 543a13108b..5253b491ba 100644 --- a/README.md +++ b/README.md @@ -489,6 +489,7 @@ Current number of checkpoints: ![](https://img.shields.io/endpoint?url=https://h 1. **[VitDet](https://huggingface.co/docs/transformers/main/model_doc/vitdet)** (from Meta AI) released with the paper [Exploring Plain Vision Transformer Backbones for Object Detection](https://arxiv.org/abs/2203.16527) by Yanghao Li, Hanzi Mao, Ross Girshick, Kaiming He. 1. **[ViTMAE](https://huggingface.co/docs/transformers/model_doc/vit_mae)** (from Meta AI) released with the paper [Masked Autoencoders Are Scalable Vision Learners](https://arxiv.org/abs/2111.06377) by Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, Ross Girshick. 1. **[ViTMSN](https://huggingface.co/docs/transformers/model_doc/vit_msn)** (from Meta AI) released with the paper [Masked Siamese Networks for Label-Efficient Learning](https://arxiv.org/abs/2204.07141) by Mahmoud Assran, Mathilde Caron, Ishan Misra, Piotr Bojanowski, Florian Bordes, Pascal Vincent, Armand Joulin, Michael Rabbat, Nicolas Ballas. +1. **[VITS](https://huggingface.co/docs/transformers/main/model_doc/vits)** (from Kakao Enterprise) released with the paper [Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech](https://arxiv.org/abs/2106.06103) by Jaehyeon Kim, Jungil Kong, Juhee Son. 1. **[ViViT](https://huggingface.co/docs/transformers/model_doc/vivit)** (from Google Research) released with the paper [ViViT: A Video Vision Transformer](https://arxiv.org/abs/2103.15691) by Anurag Arnab, Mostafa Dehghani, Georg Heigold, Chen Sun, Mario Lučić, Cordelia Schmid. 1. **[Wav2Vec2](https://huggingface.co/docs/transformers/model_doc/wav2vec2)** (from Facebook AI) released with the paper [wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations](https://arxiv.org/abs/2006.11477) by Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, Michael Auli. 1. 
**[Wav2Vec2-Conformer](https://huggingface.co/docs/transformers/model_doc/wav2vec2-conformer)** (from Facebook AI) released with the paper [FAIRSEQ S2T: Fast Speech-to-Text Modeling with FAIRSEQ](https://arxiv.org/abs/2010.05171) by Changhan Wang, Yun Tang, Xutai Ma, Anne Wu, Sravya Popuri, Dmytro Okhonko, Juan Pino. diff --git a/README_es.md b/README_es.md index e0f07ca8c5..bcd84333ef 100644 --- a/README_es.md +++ b/README_es.md @@ -466,6 +466,7 @@ Número actual de puntos de control: ![](https://img.shields.io/endpoint?url=htt 1. **[VitDet](https://huggingface.co/docs/transformers/main/model_doc/vitdet)** (from Meta AI) released with the paper [Exploring Plain Vision Transformer Backbones for Object Detection](https://arxiv.org/abs/2203.16527) by Yanghao Li, Hanzi Mao, Ross Girshick, Kaiming He. 1. **[ViTMAE](https://huggingface.co/docs/transformers/model_doc/vit_mae)** (from Meta AI) released with the paper [Masked Autoencoders Are Scalable Vision Learners](https://arxiv.org/abs/2111.06377) by Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, Ross Girshick. 1. **[ViTMSN](https://huggingface.co/docs/transformers/model_doc/vit_msn)** (from Meta AI) released with the paper [Masked Siamese Networks for Label-Efficient Learning](https://arxiv.org/abs/2204.07141) by Mahmoud Assran, Mathilde Caron, Ishan Misra, Piotr Bojanowski, Florian Bordes, Pascal Vincent, Armand Joulin, Michael Rabbat, Nicolas Ballas. +1. **[VITS](https://huggingface.co/docs/transformers/main/model_doc/vits)** (from Kakao Enterprise) released with the paper [Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech](https://arxiv.org/abs/2106.06103) by Jaehyeon Kim, Jungil Kong, Juhee Son. 1. 
**[ViViT](https://huggingface.co/docs/transformers/model_doc/vivit)** (from Google Research) released with the paper [ViViT: A Video Vision Transformer](https://arxiv.org/abs/2103.15691) by Anurag Arnab, Mostafa Dehghani, Georg Heigold, Chen Sun, Mario Lučić, Cordelia Schmid. 1. **[Wav2Vec2](https://huggingface.co/docs/transformers/model_doc/wav2vec2)** (from Facebook AI) released with the paper [wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations](https://arxiv.org/abs/2006.11477) by Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, Michael Auli. 1. **[Wav2Vec2-Conformer](https://huggingface.co/docs/transformers/model_doc/wav2vec2-conformer)** (from Facebook AI) released with the paper [FAIRSEQ S2T: Fast Speech-to-Text Modeling with FAIRSEQ](https://arxiv.org/abs/2010.05171) by Changhan Wang, Yun Tang, Xutai Ma, Anne Wu, Sravya Popuri, Dmytro Okhonko, Juan Pino. diff --git a/README_hd.md b/README_hd.md index 1b02e6f8b1..d87ef37e8b 100644 --- a/README_hd.md +++ b/README_hd.md @@ -438,6 +438,7 @@ conda install -c huggingface transformers 1. **[VitDet](https://huggingface.co/docs/transformers/main/model_doc/vitdet)** (Meta AI से) Yanghao Li, Hanzi Mao, Ross Girshick, Kaiming He. द्वाराअनुसंधान पत्र [Exploring Plain Vision Transformer Backbones for Object Detection](https://arxiv.org/abs/2203.16527) के साथ जारी किया गया 1. **[ViTMAE](https://huggingface.co/docs/transformers/model_doc/vit_mae)** (मेटा एआई से) साथ में कागज [मास्कड ऑटोएन्कोडर स्केलेबल विजन लर्नर्स हैं](https://arxiv.org/ एब्स/2111.06377) कैमिंग हे, ज़िनेली चेन, सेनिंग ज़ी, यांगहो ली, पिओट्र डॉलर, रॉस गिर्शिक द्वारा। 1. **[ViTMSN](https://huggingface.co/docs/transformers/model_doc/vit_msn)** (मेटा एआई से) साथ में कागज [लेबल-कुशल सीखने के लिए मास्क्ड स्याम देश के नेटवर्क](https://arxiv. org/abs/2204.07141) महमूद असरान, मथिल्डे कैरन, ईशान मिश्रा, पियोट्र बोजानोवस्की, फ्लोरियन बोर्डेस, पास्कल विंसेंट, आर्मंड जौलिन, माइकल रब्बत, निकोलस बल्लास द्वारा। +1. 
**[VITS](https://huggingface.co/docs/transformers/main/model_doc/vits)** (Kakao Enterprise से) Jaehyeon Kim, Jungil Kong, Juhee Son. द्वाराअनुसंधान पत्र [Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech](https://arxiv.org/abs/2106.06103) के साथ जारी किया गया 1. **[ViViT](https://huggingface.co/docs/transformers/model_doc/vivit)** (from Google Research) released with the paper [ViViT: A Video Vision Transformer](https://arxiv.org/abs/2103.15691) by Anurag Arnab, Mostafa Dehghani, Georg Heigold, Chen Sun, Mario Lučić, Cordelia Schmid. 1. **[Wav2Vec2](https://huggingface.co/docs/transformers/model_doc/wav2vec2)** (फेसबुक एआई से) साथ में पेपर [wav2vec 2.0: ए फ्रेमवर्क फॉर सेल्फ-सुपरवाइज्ड लर्निंग ऑफ स्पीच रिप्रेजेंटेशन] (https://arxiv.org/abs/2006.11477) एलेक्सी बेवस्की, हेनरी झोउ, अब्देलरहमान मोहम्मद, माइकल औली द्वारा। 1. **[Wav2Vec2-Conformer](https://huggingface.co/docs/transformers/model_doc/wav2vec2-conformer)** (Facebook AI से) साथ वाला पेपर [FAIRSEQ S2T: FAIRSEQ के साथ फास्ट स्पीच-टू-टेक्स्ट मॉडलिंग ](https://arxiv.org/abs/2010.05171) चांगहान वांग, यूं तांग, जुताई मा, ऐनी वू, सरव्या पोपुरी, दिमित्रो ओखोनको, जुआन पिनो द्वारा पोस्ट किया गया। diff --git a/README_ja.md b/README_ja.md index 0c9a264898..c6b9fb0d79 100644 --- a/README_ja.md +++ b/README_ja.md @@ -500,6 +500,7 @@ Flax、PyTorch、TensorFlowをcondaでインストールする方法は、それ 1. **[VitDet](https://huggingface.co/docs/transformers/main/model_doc/vitdet)** (Meta AI から) Yanghao Li, Hanzi Mao, Ross Girshick, Kaiming He. から公開された研究論文 [Exploring Plain Vision Transformer Backbones for Object Detection](https://arxiv.org/abs/2203.16527) 1. **[ViTMAE](https://huggingface.co/docs/transformers/model_doc/vit_mae)** (Meta AI から) Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, Ross Girshick から公開された研究論文: [Masked Autoencoders Are Scalable Vision Learners](https://arxiv.org/abs/2111.06377) 1. 
**[ViTMSN](https://huggingface.co/docs/transformers/model_doc/vit_msn)** (Meta AI から) Mahmoud Assran, Mathilde Caron, Ishan Misra, Piotr Bojanowski, Florian Bordes, Pascal Vincent, Armand Joulin, Michael Rabbat, Nicolas Ballas から公開された研究論文: [Masked Siamese Networks for Label-Efficient Learning](https://arxiv.org/abs/2204.07141) +1. **[VITS](https://huggingface.co/docs/transformers/main/model_doc/vits)** (Kakao Enterprise から) Jaehyeon Kim, Jungil Kong, Juhee Son. から公開された研究論文 [Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech](https://arxiv.org/abs/2106.06103) 1. **[ViViT](https://huggingface.co/docs/transformers/model_doc/vivit)** (from Google Research) released with the paper [ViViT: A Video Vision Transformer](https://arxiv.org/abs/2103.15691) by Anurag Arnab, Mostafa Dehghani, Georg Heigold, Chen Sun, Mario Lučić, Cordelia Schmid. 1. **[Wav2Vec2](https://huggingface.co/docs/transformers/model_doc/wav2vec2)** (Facebook AI から) Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, Michael Auli から公開された研究論文: [wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations](https://arxiv.org/abs/2006.11477) 1. **[Wav2Vec2-Conformer](https://huggingface.co/docs/transformers/model_doc/wav2vec2-conformer)** (Facebook AI から) Changhan Wang, Yun Tang, Xutai Ma, Anne Wu, Sravya Popuri, Dmytro Okhonko, Juan Pino から公開された研究論文: [FAIRSEQ S2T: Fast Speech-to-Text Modeling with FAIRSEQ](https://arxiv.org/abs/2010.05171) diff --git a/README_ko.md b/README_ko.md index 3d50ec4ae0..5d2056e4f7 100644 --- a/README_ko.md +++ b/README_ko.md @@ -415,6 +415,7 @@ Flax, PyTorch, TensorFlow 설치 페이지에서 이들을 conda로 설치하는 1. **[VitDet](https://huggingface.co/docs/transformers/main/model_doc/vitdet)** (Meta AI 에서 제공)은 Yanghao Li, Hanzi Mao, Ross Girshick, Kaiming He.의 [Exploring Plain Vision Transformer Backbones for Object Detection](https://arxiv.org/abs/2203.16527)논문과 함께 발표했습니다. 1. 
**[ViTMAE](https://huggingface.co/docs/transformers/model_doc/vit_mae)** (Meta AI 에서) Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, Ross Girshick 의 [Masked Autoencoders Are Scalable Vision Learners](https://arxiv.org/abs/2111.06377) 논문과 함께 발표했습니다. 1. **[ViTMSN](https://huggingface.co/docs/transformers/model_doc/vit_msn)** (Meta AI 에서) Mahmoud Assran, Mathilde Caron, Ishan Misra, Piotr Bojanowski, Florian Bordes, Pascal Vincent, Armand Joulin, Michael Rabbat, Nicolas Ballas 의 [Masked Siamese Networks for Label-Efficient Learning](https://arxiv.org/abs/2204.07141) 논문과 함께 발표했습니다. +1. **[VITS](https://huggingface.co/docs/transformers/main/model_doc/vits)** (Kakao Enterprise 에서 제공)은 Jaehyeon Kim, Jungil Kong, Juhee Son.의 [Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech](https://arxiv.org/abs/2106.06103)논문과 함께 발표했습니다. 1. **[ViViT](https://huggingface.co/docs/transformers/model_doc/vivit)** (from Google Research) released with the paper [ViViT: A Video Vision Transformer](https://arxiv.org/abs/2103.15691) by Anurag Arnab, Mostafa Dehghani, Georg Heigold, Chen Sun, Mario Lučić, Cordelia Schmid. 1. **[Wav2Vec2](https://huggingface.co/docs/transformers/model_doc/wav2vec2)** (Facebook AI 에서) Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, Michael Auli 의 [wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations](https://arxiv.org/abs/2006.11477) 논문과 함께 발표했습니다. 1. **[Wav2Vec2-Conformer](https://huggingface.co/docs/transformers/model_doc/wav2vec2-conformer)** (Facebook AI 에서) Changhan Wang, Yun Tang, Xutai Ma, Anne Wu, Sravya Popuri, Dmytro Okhonko, Juan Pino 의 [FAIRSEQ S2T: Fast Speech-to-Text Modeling with FAIRSEQ](https://arxiv.org/abs/2010.05171) 논문과 함께 발표했습니다. diff --git a/README_zh-hans.md b/README_zh-hans.md index 105dbae7de..23ecd11c23 100644 --- a/README_zh-hans.md +++ b/README_zh-hans.md @@ -439,6 +439,7 @@ conda install -c huggingface transformers 1. 
**[VitDet](https://huggingface.co/docs/transformers/main/model_doc/vitdet)** (来自 Meta AI) 伴随论文 [Exploring Plain Vision Transformer Backbones for Object Detection](https://arxiv.org/abs/2203.16527) 由 Yanghao Li, Hanzi Mao, Ross Girshick, Kaiming He 发布。 1. **[ViTMAE](https://huggingface.co/docs/transformers/model_doc/vit_mae)** (来自 Meta AI) 伴随论文 [Masked Autoencoders Are Scalable Vision Learners](https://arxiv.org/abs/2111.06377) 由 Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, Ross Girshick 发布。 1. **[ViTMSN](https://huggingface.co/docs/transformers/model_doc/vit_msn)** (来自 Meta AI) 伴随论文 [Masked Siamese Networks for Label-Efficient Learning](https://arxiv.org/abs/2204.07141) by Mahmoud Assran, Mathilde Caron, Ishan Misra, Piotr Bojanowski, Florian Bordes, Pascal Vincent, Armand Joulin, Michael Rabbat, Nicolas Ballas 发布. +1. **[VITS](https://huggingface.co/docs/transformers/main/model_doc/vits)** (来自 Kakao Enterprise) 伴随论文 [Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech](https://arxiv.org/abs/2106.06103) 由 Jaehyeon Kim, Jungil Kong, Juhee Son 发布。 1. **[ViViT](https://huggingface.co/docs/transformers/model_doc/vivit)** (来自 Google Research) released with the paper [ViViT: A Video Vision Transformer](https://arxiv.org/abs/2103.15691) 由 Anurag Arnab, Mostafa Dehghani, Georg Heigold, Chen Sun, Mario Lučić, Cordelia Schmid. 1. **[Wav2Vec2](https://huggingface.co/docs/transformers/model_doc/wav2vec2)** (来自 Facebook AI) 伴随论文 [wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations](https://arxiv.org/abs/2006.11477) 由 Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, Michael Auli 发布。 1. 
**[Wav2Vec2-Conformer](https://huggingface.co/docs/transformers/model_doc/wav2vec2-conformer)** (来自 Facebook AI) 伴随论文 [FAIRSEQ S2T: Fast Speech-to-Text Modeling with FAIRSEQ](https://arxiv.org/abs/2010.05171) 由 Changhan Wang, Yun Tang, Xutai Ma, Anne Wu, Sravya Popuri, Dmytro Okhonko, Juan Pino 发布。 diff --git a/README_zh-hant.md b/README_zh-hant.md index e5ebcf10fd..3c05c1962f 100644 --- a/README_zh-hant.md +++ b/README_zh-hant.md @@ -451,6 +451,7 @@ conda install -c huggingface transformers 1. **[VitDet](https://huggingface.co/docs/transformers/main/model_doc/vitdet)** (from Meta AI) released with the paper [Exploring Plain Vision Transformer Backbones for Object Detection](https://arxiv.org/abs/2203.16527) by Yanghao Li, Hanzi Mao, Ross Girshick, Kaiming He. 1. **[ViTMAE](https://huggingface.co/docs/transformers/model_doc/vit_mae)** (from Meta AI) released with the paper [Masked Autoencoders Are Scalable Vision Learners](https://arxiv.org/abs/2111.06377) by Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, Ross Girshick. 1. **[ViTMSN](https://huggingface.co/docs/transformers/model_doc/vit_msn)** (from Meta AI) released with the paper [Masked Siamese Networks for Label-Efficient Learning](https://arxiv.org/abs/2204.07141) by Mahmoud Assran, Mathilde Caron, Ishan Misra, Piotr Bojanowski, Florian Bordes, Pascal Vincent, Armand Joulin, Michael Rabbat, Nicolas Ballas. +1. **[VITS](https://huggingface.co/docs/transformers/main/model_doc/vits)** (from Kakao Enterprise) released with the paper [Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech](https://arxiv.org/abs/2106.06103) by Jaehyeon Kim, Jungil Kong, Juhee Son. 1. **[ViViT](https://huggingface.co/docs/transformers/model_doc/vivit)** (from Google Research) released with the paper [ViViT: A Video Vision Transformer](https://arxiv.org/abs/2103.15691) by Anurag Arnab, Mostafa Dehghani, Georg Heigold, Chen Sun, Mario Lučić, Cordelia Schmid. 1. 
**[Wav2Vec2](https://huggingface.co/docs/transformers/model_doc/wav2vec2)** (from Facebook AI) released with the paper [wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations](https://arxiv.org/abs/2006.11477) by Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, Michael Auli. 1. **[Wav2Vec2-Conformer](https://huggingface.co/docs/transformers/model_doc/wav2vec2-conformer)** (from Facebook AI) released with the paper [FAIRSEQ S2T: Fast Speech-to-Text Modeling with FAIRSEQ](https://arxiv.org/abs/2010.05171) by Changhan Wang, Yun Tang, Xutai Ma, Anne Wu, Sravya Popuri, Dmytro Okhonko, Juan Pino. diff --git a/docs/source/en/_toctree.yml b/docs/source/en/_toctree.yml index 72a8ee1dbf..fd55a47cd8 100644 --- a/docs/source/en/_toctree.yml +++ b/docs/source/en/_toctree.yml @@ -603,6 +603,8 @@ title: UniSpeech - local: model_doc/unispeech-sat title: UniSpeech-SAT + - local: model_doc/vits + title: VITS - local: model_doc/wav2vec2 title: Wav2Vec2 - local: model_doc/wav2vec2-conformer diff --git a/docs/source/en/index.md b/docs/source/en/index.md index 39376f369b..ee2d984c98 100644 --- a/docs/source/en/index.md +++ b/docs/source/en/index.md @@ -255,6 +255,7 @@ The documentation is organized into five sections: 1. **[VitDet](model_doc/vitdet)** (from Meta AI) released with the paper [Exploring Plain Vision Transformer Backbones for Object Detection](https://arxiv.org/abs/2203.16527) by Yanghao Li, Hanzi Mao, Ross Girshick, Kaiming He. 1. **[ViTMAE](model_doc/vit_mae)** (from Meta AI) released with the paper [Masked Autoencoders Are Scalable Vision Learners](https://arxiv.org/abs/2111.06377) by Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, Ross Girshick. 1. 
**[ViTMSN](model_doc/vit_msn)** (from Meta AI) released with the paper [Masked Siamese Networks for Label-Efficient Learning](https://arxiv.org/abs/2204.07141) by Mahmoud Assran, Mathilde Caron, Ishan Misra, Piotr Bojanowski, Florian Bordes, Pascal Vincent, Armand Joulin, Michael Rabbat, Nicolas Ballas. +1. **[VITS](model_doc/vits)** (from Kakao Enterprise) released with the paper [Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech](https://arxiv.org/abs/2106.06103) by Jaehyeon Kim, Jungil Kong, Juhee Son. 1. **[ViViT](model_doc/vivit)** (from Google Research) released with the paper [ViViT: A Video Vision Transformer](https://arxiv.org/abs/2103.15691) by Anurag Arnab, Mostafa Dehghani, Georg Heigold, Chen Sun, Mario Lučić, Cordelia Schmid. 1. **[Wav2Vec2](model_doc/wav2vec2)** (from Facebook AI) released with the paper [wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations](https://arxiv.org/abs/2006.11477) by Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, Michael Auli. 1. **[Wav2Vec2-Conformer](model_doc/wav2vec2-conformer)** (from Facebook AI) released with the paper [FAIRSEQ S2T: Fast Speech-to-Text Modeling with FAIRSEQ](https://arxiv.org/abs/2010.05171) by Changhan Wang, Yun Tang, Xutai Ma, Anne Wu, Sravya Popuri, Dmytro Okhonko, Juan Pino. @@ -475,6 +476,7 @@ Flax), PyTorch, and/or TensorFlow. | VitDet | ✅ | ❌ | ❌ | | ViTMAE | ✅ | ✅ | ❌ | | ViTMSN | ✅ | ❌ | ❌ | +| VITS | ✅ | ❌ | ❌ | | ViViT | ✅ | ❌ | ❌ | | Wav2Vec2 | ✅ | ✅ | ✅ | | Wav2Vec2-Conformer | ✅ | ❌ | ❌ | diff --git a/docs/source/en/model_doc/vits.md b/docs/source/en/model_doc/vits.md new file mode 100644 index 0000000000..84450c4276 --- /dev/null +++ b/docs/source/en/model_doc/vits.md @@ -0,0 +1,114 @@ + + +# VITS + +## Overview + +The VITS model was proposed in [Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech](https://arxiv.org/abs/2106.06103) by Jaehyeon Kim, Jungil Kong, Juhee Son. 
+ + +VITS (**V**ariational **I**nference with adversarial learning for end-to-end **T**ext-to-**S**peech) is an end-to-end +speech synthesis model that predicts a speech waveform conditional on an input text sequence. It is a conditional variational +autoencoder (VAE) comprised of a posterior encoder, decoder, and conditional prior. + +A set of spectrogram-based acoustic features are predicted by the flow-based module, which is formed of a Transformer-based +text encoder and multiple coupling layers. The spectrogram is decoded using a stack of transposed convolutional layers, +much in the same style as the HiFi-GAN vocoder. Motivated by the one-to-many nature of the TTS problem, where the same text +input can be spoken in multiple ways, the model also includes a stochastic duration predictor, which allows the model to +synthesise speech with different rhythms from the same input text. + +The model is trained end-to-end with a combination of losses derived from variational lower bound and adversarial training. +To improve the expressiveness of the model, normalizing flows are applied to the conditional prior distribution. During +inference, the text encodings are up-sampled based on the duration prediction module, and then mapped into the +waveform using a cascade of the flow module and HiFi-GAN decoder. Due to the stochastic nature of the duration predictor, +the model is non-deterministic, and thus requires a fixed seed to generate the same speech waveform. + +The abstract from the paper is the following: + +*Several recent end-to-end text-to-speech (TTS) models enabling single-stage training and parallel sampling have been proposed, but their sample quality does not match that of two-stage TTS systems. In this work, we present a parallel end-to-end TTS method that generates more natural sounding audio than current two-stage models. 
Our method adopts variational inference augmented with normalizing flows and an adversarial training process, which improves the expressive power of generative modeling. We also propose a stochastic duration predictor to synthesize speech with diverse rhythms from input text. With the uncertainty modeling over latent variables and the stochastic duration predictor, our method expresses the natural one-to-many relationship in which a text input can be spoken in multiple ways with different pitches and rhythms. A subjective human evaluation (mean opinion score, or MOS) on the LJ Speech, a single speaker dataset, shows that our method outperforms the best publicly available TTS systems and achieves a MOS comparable to ground truth.* + +This model can also be used with TTS checkpoints from [Massively Multilingual Speech (MMS)](https://arxiv.org/abs/2305.13516) +as these checkpoints use the same architecture and a slightly modified tokenizer. + +This model was contributed by [Matthijs](https://huggingface.co/Matthijs) and [sanchit-gandhi](https://huggingface.co/sanchit-gandhi). The original code can be found [here](https://github.com/jaywalnut310/vits). + +## Model Usage + +Both the VITS and MMS-TTS checkpoints can be used with the same API. Since the flow-based model is non-deterministic, it +is good practice to set a seed to ensure reproducibility of the outputs. For languages with a Roman alphabet, +such as English or French, the tokenizer can be used directly to pre-process the text inputs. 
The following code example +runs a forward pass using the MMS-TTS English checkpoint: + +```python +import torch +from transformers import VitsTokenizer, VitsModel, set_seed + +tokenizer = VitsTokenizer.from_pretrained("facebook/mms-tts-eng") +model = VitsModel.from_pretrained("facebook/mms-tts-eng") + +inputs = tokenizer(text="Hello - my dog is cute", return_tensors="pt") + +set_seed(555) # make deterministic + +with torch.no_grad(): + outputs = model(**inputs) + +waveform = outputs.waveform[0] +``` + +The resulting waveform can be saved as a `.wav` file: + +```python +import scipy.io.wavfile + +scipy.io.wavfile.write("techno.wav", rate=model.config.sampling_rate, data=waveform.numpy()) +``` + +Or displayed in a Jupyter Notebook / Google Colab: + +```python +from IPython.display import Audio + +Audio(waveform, rate=model.config.sampling_rate) +``` + +For certain languages with a non-Roman alphabet, such as Arabic, Mandarin or Hindi, the [`uroman`](https://github.com/isi-nlp/uroman) +Perl package is required to convert the text inputs to the Roman alphabet. + +You can check whether you require the `uroman` package for your language by inspecting the `is_uroman` attribute of +the pre-trained `tokenizer`: + +```python +from transformers import VitsTokenizer + +tokenizer = VitsTokenizer.from_pretrained("facebook/mms-tts-eng") +print(tokenizer.is_uroman) +``` + +If required, you should apply the uroman package to your text inputs **prior** to passing them to the `VitsTokenizer`, +since currently the tokenizer does not support performing the pre-processing itself. 
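As a rough illustration of that pre-processing step, the sketch below pipes text through the `uroman` Perl script before tokenization. It is not part of the `transformers` API: the helper names `run_text_filter` and `uromanize`, the `uroman_path` argument, and the `bin/uroman.pl` location inside a local clone of the uroman repository are all assumptions made for this example.

```python
import os
import subprocess


def run_text_filter(command, text):
    """Pipe `text` through an external command and return its stdout as a string."""
    result = subprocess.run(
        command,
        input=text.encode("utf-8"),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
    if result.returncode != 0:
        error_text = result.stderr.decode("utf-8", "replace")
        raise RuntimeError(f"Command failed with code {result.returncode}: {error_text}")
    # Strip any trailing newline from the command's output.
    return result.stdout.decode("utf-8").rstrip("\n")


def uromanize(text, uroman_path):
    """Romanize `text` with uroman.

    Hypothetical helper: assumes `perl` is on the PATH and that `uroman_path`
    is a local clone of the uroman repository containing bin/uroman.pl.
    """
    script_path = os.path.join(uroman_path, "bin", "uroman.pl")
    return run_text_filter(["perl", script_path], text)


# Intended use (commented out because it needs perl, a uroman clone and a
# checkpoint whose tokenizer reports `is_uroman == True`):
# text = uromanize(text, uroman_path="/path/to/uroman")
# inputs = tokenizer(text=text, return_tensors="pt")
```

Keeping the subprocess call in its own function means the romanization step can be swapped out or mocked without touching the tokenizer call.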
+ +## VitsConfig + +[[autodoc]] VitsConfig + +## VitsTokenizer + +[[autodoc]] VitsTokenizer + - __call__ + - save_vocabulary + +## VitsModel + +[[autodoc]] VitsModel + - forward diff --git a/src/transformers/__init__.py b/src/transformers/__init__.py index a6232787f4..b978757d1f 100644 --- a/src/transformers/__init__.py +++ b/src/transformers/__init__.py @@ -587,6 +587,11 @@ _import_structure = { "models.vit_mae": ["VIT_MAE_PRETRAINED_CONFIG_ARCHIVE_MAP", "ViTMAEConfig"], "models.vit_msn": ["VIT_MSN_PRETRAINED_CONFIG_ARCHIVE_MAP", "ViTMSNConfig"], "models.vitdet": ["VITDET_PRETRAINED_CONFIG_ARCHIVE_MAP", "VitDetConfig"], + "models.vits": [ + "VITS_PRETRAINED_CONFIG_ARCHIVE_MAP", + "VitsConfig", + "VitsTokenizer", + ], "models.vivit": [ "VIVIT_PRETRAINED_CONFIG_ARCHIVE_MAP", "VivitConfig", @@ -2935,6 +2940,13 @@ else: "VitDetPreTrainedModel", ] ) + _import_structure["models.vits"].extend( + [ + "VITS_PRETRAINED_MODEL_ARCHIVE_LIST", + "VitsModel", + "VitsPreTrainedModel", + ] + ) _import_structure["models.vivit"].extend( [ "VIVIT_PRETRAINED_MODEL_ARCHIVE_LIST", @@ -4643,6 +4655,11 @@ if TYPE_CHECKING: from .models.vit_mae import VIT_MAE_PRETRAINED_CONFIG_ARCHIVE_MAP, ViTMAEConfig from .models.vit_msn import VIT_MSN_PRETRAINED_CONFIG_ARCHIVE_MAP, ViTMSNConfig from .models.vitdet import VITDET_PRETRAINED_CONFIG_ARCHIVE_MAP, VitDetConfig + from .models.vits import ( + VITS_PRETRAINED_CONFIG_ARCHIVE_MAP, + VitsConfig, + VitsTokenizer, + ) from .models.vivit import VIVIT_PRETRAINED_CONFIG_ARCHIVE_MAP, VivitConfig from .models.wav2vec2 import ( WAV_2_VEC_2_PRETRAINED_CONFIG_ARCHIVE_MAP, @@ -6595,6 +6612,11 @@ if TYPE_CHECKING: VitDetModel, VitDetPreTrainedModel, ) + from .models.vits import ( + VITS_PRETRAINED_MODEL_ARCHIVE_LIST, + VitsModel, + VitsPreTrainedModel, + ) from .models.vivit import ( VIVIT_PRETRAINED_MODEL_ARCHIVE_LIST, VivitForVideoClassification, diff --git a/src/transformers/models/__init__.py b/src/transformers/models/__init__.py index 
da35fbea60..3241a41257 100644 --- a/src/transformers/models/__init__.py +++ b/src/transformers/models/__init__.py @@ -211,6 +211,7 @@ from . import ( vit_mae, vit_msn, vitdet, + vits, vivit, wav2vec2, wav2vec2_conformer, diff --git a/src/transformers/models/auto/configuration_auto.py b/src/transformers/models/auto/configuration_auto.py index 76ec62eed7..0a3effd795 100755 --- a/src/transformers/models/auto/configuration_auto.py +++ b/src/transformers/models/auto/configuration_auto.py @@ -219,6 +219,7 @@ CONFIG_MAPPING_NAMES = OrderedDict( ("vit_mae", "ViTMAEConfig"), ("vit_msn", "ViTMSNConfig"), ("vitdet", "VitDetConfig"), + ("vits", "VitsConfig"), ("vivit", "VivitConfig"), ("wav2vec2", "Wav2Vec2Config"), ("wav2vec2-conformer", "Wav2Vec2ConformerConfig"), @@ -410,6 +411,7 @@ CONFIG_ARCHIVE_MAP_MAPPING_NAMES = OrderedDict( ("vit_mae", "VIT_MAE_PRETRAINED_CONFIG_ARCHIVE_MAP"), ("vit_msn", "VIT_MSN_PRETRAINED_CONFIG_ARCHIVE_MAP"), ("vitdet", "VITDET_PRETRAINED_CONFIG_ARCHIVE_MAP"), + ("vits", "VITS_PRETRAINED_CONFIG_ARCHIVE_MAP"), ("vivit", "VIVIT_PRETRAINED_CONFIG_ARCHIVE_MAP"), ("wav2vec2", "WAV_2_VEC_2_PRETRAINED_CONFIG_ARCHIVE_MAP"), ("wav2vec2-conformer", "WAV2VEC2_CONFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP"), @@ -643,6 +645,7 @@ MODEL_NAMES_MAPPING = OrderedDict( ("vit_mae", "ViTMAE"), ("vit_msn", "ViTMSN"), ("vitdet", "VitDet"), + ("vits", "VITS"), ("vivit", "ViViT"), ("wav2vec2", "Wav2Vec2"), ("wav2vec2-conformer", "Wav2Vec2-Conformer"), diff --git a/src/transformers/models/auto/modeling_auto.py b/src/transformers/models/auto/modeling_auto.py index 229251d4ef..823f564894 100755 --- a/src/transformers/models/auto/modeling_auto.py +++ b/src/transformers/models/auto/modeling_auto.py @@ -205,6 +205,7 @@ MODEL_MAPPING_NAMES = OrderedDict( ("vit_mae", "ViTMAEModel"), ("vit_msn", "ViTMSNModel"), ("vitdet", "VitDetModel"), + ("vits", "VitsModel"), ("vivit", "VivitModel"), ("wav2vec2", "Wav2Vec2Model"), ("wav2vec2-conformer", "Wav2Vec2ConformerModel"), diff --git 
a/src/transformers/models/auto/tokenization_auto.py b/src/transformers/models/auto/tokenization_auto.py index 70102940f0..2d50cafc80 100644 --- a/src/transformers/models/auto/tokenization_auto.py +++ b/src/transformers/models/auto/tokenization_auto.py @@ -347,6 +347,7 @@ else: ), ("vilt", ("BertTokenizer", "BertTokenizerFast" if is_tokenizers_available() else None)), ("visual_bert", ("BertTokenizer", "BertTokenizerFast" if is_tokenizers_available() else None)), + ("vits", ("VitsTokenizer", None)), ("wav2vec2", ("Wav2Vec2CTCTokenizer", None)), ("wav2vec2-conformer", ("Wav2Vec2CTCTokenizer", None)), ("wav2vec2_phoneme", ("Wav2Vec2PhonemeCTCTokenizer", None)), diff --git a/src/transformers/models/vits/__init__.py b/src/transformers/models/vits/__init__.py new file mode 100644 index 0000000000..79c18048e7 --- /dev/null +++ b/src/transformers/models/vits/__init__.py @@ -0,0 +1,67 @@ +# Copyright 2023 The HuggingFace Team. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+from typing import TYPE_CHECKING + +from ...utils import ( + OptionalDependencyNotAvailable, + _LazyModule, + is_sentencepiece_available, + is_speech_available, + is_torch_available, +) + + +_import_structure = { + "configuration_vits": [ + "VITS_PRETRAINED_CONFIG_ARCHIVE_MAP", + "VitsConfig", + ], + "tokenization_vits": ["VitsTokenizer"], +} + +try: + if not is_torch_available(): + raise OptionalDependencyNotAvailable() +except OptionalDependencyNotAvailable: + pass +else: + _import_structure["modeling_vits"] = [ + "VITS_PRETRAINED_MODEL_ARCHIVE_LIST", + "VitsModel", + "VitsPreTrainedModel", + ] + +if TYPE_CHECKING: + from .configuration_vits import ( + VITS_PRETRAINED_CONFIG_ARCHIVE_MAP, + VitsConfig, + ) + from .tokenization_vits import VitsTokenizer + + try: + if not is_torch_available(): + raise OptionalDependencyNotAvailable() + except OptionalDependencyNotAvailable: + pass + else: + from .modeling_vits import ( + VITS_PRETRAINED_MODEL_ARCHIVE_LIST, + VitsModel, + VitsPreTrainedModel, + ) + +else: + import sys + + sys.modules[__name__] = _LazyModule(__name__, globals()["__file__"], _import_structure, module_spec=__spec__) diff --git a/src/transformers/models/vits/configuration_vits.py b/src/transformers/models/vits/configuration_vits.py new file mode 100644 index 0000000000..689fe8a77c --- /dev/null +++ b/src/transformers/models/vits/configuration_vits.py @@ -0,0 +1,254 @@ +# coding=utf-8 +# Copyright 2023 The Kakao Enterprise Authors and the HuggingFace Inc. team. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+# See the License for the specific language governing permissions and +# limitations under the License. +""" VITS model configuration""" + + +from ...configuration_utils import PretrainedConfig +from ...utils import logging + + +logger = logging.get_logger(__name__) + +VITS_PRETRAINED_CONFIG_ARCHIVE_MAP = { + "facebook/mms-tts-eng": "https://huggingface.co/facebook/mms-tts-eng/resolve/main/config.json", +} + + +class VitsConfig(PretrainedConfig): + r""" + This is the configuration class to store the configuration of a [`VitsModel`]. It is used to instantiate a VITS + model according to the specified arguments, defining the model architecture. Instantiating a configuration with the + defaults will yield a similar configuration to that of the VITS + [facebook/mms-tts-eng](https://huggingface.co/facebook/mms-tts-eng) architecture. + + Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the + documentation from [`PretrainedConfig`] for more information. + + Args: + vocab_size (`int`, *optional*, defaults to 38): + Vocabulary size of the VITS model. Defines the number of different tokens that can be represented by the + `input_ids` passed to the forward method of [`VitsModel`]. + hidden_size (`int`, *optional*, defaults to 192): + Dimensionality of the text encoder layers. + num_hidden_layers (`int`, *optional*, defaults to 6): + Number of hidden layers in the Transformer encoder. + num_attention_heads (`int`, *optional*, defaults to 2): + Number of attention heads for each attention layer in the Transformer encoder. + window_size (`int`, *optional*, defaults to 4): + Window size for the relative positional embeddings in the attention layers of the Transformer encoder. + use_bias (`bool`, *optional*, defaults to `True`): + Whether to use bias in the key, query, value projection layers in the Transformer encoder. 
+ ffn_dim (`int`, *optional*, defaults to 768): + Dimensionality of the "intermediate" (i.e., feed-forward) layer in the Transformer encoder. + layerdrop (`float`, *optional*, defaults to 0.1): + The LayerDrop probability for the encoder. See the [LayerDrop paper](https://arxiv.org/abs/1909.11556) + for more details. + ffn_kernel_size (`int`, *optional*, defaults to 3): + Kernel size of the 1D convolution layers used by the feed-forward network in the Transformer encoder. + flow_size (`int`, *optional*, defaults to 192): + Dimensionality of the flow layers. + spectrogram_bins (`int`, *optional*, defaults to 513): + Number of frequency bins in the target spectrogram. + hidden_act (`str` or `function`, *optional*, defaults to `"relu"`): + The non-linear activation function (function or string) in the encoder and pooler. If string, `"gelu"`, + `"relu"`, `"selu"` and `"gelu_new"` are supported. + hidden_dropout (`float`, *optional*, defaults to 0.1): + The dropout probability for all fully connected layers in the embeddings and encoder. + attention_dropout (`float`, *optional*, defaults to 0.1): + The dropout ratio for the attention probabilities. + activation_dropout (`float`, *optional*, defaults to 0.1): + The dropout ratio for activations inside the fully connected layer. + initializer_range (`float`, *optional*, defaults to 0.02): + The standard deviation of the truncated_normal_initializer for initializing all weight matrices. + layer_norm_eps (`float`, *optional*, defaults to 1e-5): + The epsilon used by the layer normalization layers. + use_stochastic_duration_prediction (`bool`, *optional*, defaults to `True`): + Whether to use the stochastic duration prediction module or the regular duration predictor. + num_speakers (`int`, *optional*, defaults to 1): + Number of speakers if this is a multi-speaker model. + speaker_embedding_size (`int`, *optional*, defaults to 0): + Number of channels used by the speaker embeddings. Is zero for single-speaker models.
+ upsample_initial_channel (`int`, *optional*, defaults to 512): + The number of input channels into the HiFi-GAN upsampling network. + upsample_rates (`Tuple[int]` or `List[int]`, *optional*, defaults to `[8, 8, 2, 2]`): + A tuple of integers defining the stride of each 1D convolutional layer in the HiFi-GAN upsampling network. + The length of `upsample_rates` defines the number of convolutional layers and has to match the length of + `upsample_kernel_sizes`. + upsample_kernel_sizes (`Tuple[int]` or `List[int]`, *optional*, defaults to `[16, 16, 4, 4]`): + A tuple of integers defining the kernel size of each 1D convolutional layer in the HiFi-GAN upsampling + network. The length of `upsample_kernel_sizes` defines the number of convolutional layers and has to match + the length of `upsample_rates`. + resblock_kernel_sizes (`Tuple[int]` or `List[int]`, *optional*, defaults to `[3, 7, 11]`): + A tuple of integers defining the kernel sizes of the 1D convolutional layers in the HiFi-GAN + multi-receptive field fusion (MRF) module. + resblock_dilation_sizes (`Tuple[Tuple[int]]` or `List[List[int]]`, *optional*, defaults to `[[1, 3, 5], [1, 3, 5], [1, 3, 5]]`): + A nested tuple of integers defining the dilation rates of the dilated 1D convolutional layers in the + HiFi-GAN multi-receptive field fusion (MRF) module. + leaky_relu_slope (`float`, *optional*, defaults to 0.1): + The angle of the negative slope used by the leaky ReLU activation. + depth_separable_channels (`int`, *optional*, defaults to 2): + Number of channels to use in each depth-separable block. + depth_separable_num_layers (`int`, *optional*, defaults to 3): + Number of convolutional layers to use in each depth-separable block. + duration_predictor_flow_bins (`int`, *optional*, defaults to 10): + Number of channels to map using the unconstrained rational spline in the duration predictor model.
+ duration_predictor_tail_bound (`float`, *optional*, defaults to 5.0): + Value of the tail bin boundary when computing the unconstrained rational spline in the duration predictor + model. + duration_predictor_kernel_size (`int`, *optional*, defaults to 3): + Kernel size of the 1D convolution layers used in the duration predictor model. + duration_predictor_dropout (`float`, *optional*, defaults to 0.5): + The dropout ratio for the duration predictor model. + duration_predictor_num_flows (`int`, *optional*, defaults to 4): + Number of flow stages used by the duration predictor model. + duration_predictor_filter_channels (`int`, *optional*, defaults to 256): + Number of channels for the convolution layers used in the duration predictor model. + prior_encoder_num_flows (`int`, *optional*, defaults to 4): + Number of flow stages used by the prior encoder flow model. + prior_encoder_num_wavenet_layers (`int`, *optional*, defaults to 4): + Number of WaveNet layers used by the prior encoder flow model. + posterior_encoder_num_wavenet_layers (`int`, *optional*, defaults to 16): + Number of WaveNet layers used by the posterior encoder model. + wavenet_kernel_size (`int`, *optional*, defaults to 5): + Kernel size of the 1D convolution layers used in the WaveNet model. + wavenet_dilation_rate (`int`, *optional*, defaults to 1): + Dilation rates of the dilated 1D convolutional layers used in the WaveNet model. + wavenet_dropout (`float`, *optional*, defaults to 0.0): + The dropout ratio for the WaveNet layers. + speaking_rate (`float`, *optional*, defaults to 1.0): + Speaking rate. Larger values give faster synthesised speech. + noise_scale (`float`, *optional*, defaults to 0.667): + How random the speech prediction is. Larger values create more variation in the predicted speech. + noise_scale_duration (`float`, *optional*, defaults to 0.8): + How random the duration prediction is. Larger values create more variation in the predicted durations. 
+ sampling_rate (`int`, *optional*, defaults to 16000): + The sampling rate at which the output audio waveform is digitized, expressed in hertz (Hz). + + Example: + + ```python + >>> from transformers import VitsModel, VitsConfig + + >>> # Initializing a "facebook/mms-tts-eng" style configuration + >>> configuration = VitsConfig() + + >>> # Initializing a model (with random weights) from the "facebook/mms-tts-eng" style configuration + >>> model = VitsModel(configuration) + + >>> # Accessing the model configuration + >>> configuration = model.config + ```""" + model_type = "vits" + + def __init__( + self, + vocab_size=38, + hidden_size=192, + num_hidden_layers=6, + num_attention_heads=2, + window_size=4, + use_bias=True, + ffn_dim=768, + layerdrop=0.1, + ffn_kernel_size=3, + flow_size=192, + spectrogram_bins=513, + hidden_act="relu", + hidden_dropout=0.1, + attention_dropout=0.1, + activation_dropout=0.1, + initializer_range=0.02, + layer_norm_eps=1e-5, + use_stochastic_duration_prediction=True, + num_speakers=1, + speaker_embedding_size=0, + upsample_initial_channel=512, + upsample_rates=[8, 8, 2, 2], + upsample_kernel_sizes=[16, 16, 4, 4], + resblock_kernel_sizes=[3, 7, 11], + resblock_dilation_sizes=[[1, 3, 5], [1, 3, 5], [1, 3, 5]], + leaky_relu_slope=0.1, + depth_separable_channels=2, + depth_separable_num_layers=3, + duration_predictor_flow_bins=10, + duration_predictor_tail_bound=5.0, + duration_predictor_kernel_size=3, + duration_predictor_dropout=0.5, + duration_predictor_num_flows=4, + duration_predictor_filter_channels=256, + prior_encoder_num_flows=4, + prior_encoder_num_wavenet_layers=4, + posterior_encoder_num_wavenet_layers=16, + wavenet_kernel_size=5, + wavenet_dilation_rate=1, + wavenet_dropout=0.0, + speaking_rate=1.0, + noise_scale=0.667, + noise_scale_duration=0.8, + sampling_rate=16_000, + **kwargs, + ): + self.vocab_size = vocab_size + self.hidden_size = hidden_size + self.num_hidden_layers = num_hidden_layers + self.num_attention_heads =
num_attention_heads + self.window_size = window_size + self.use_bias = use_bias + self.ffn_dim = ffn_dim + self.layerdrop = layerdrop + self.ffn_kernel_size = ffn_kernel_size + self.flow_size = flow_size + self.spectrogram_bins = spectrogram_bins + self.hidden_act = hidden_act + self.hidden_dropout = hidden_dropout + self.attention_dropout = attention_dropout + self.activation_dropout = activation_dropout + self.initializer_range = initializer_range + self.layer_norm_eps = layer_norm_eps + self.use_stochastic_duration_prediction = use_stochastic_duration_prediction + self.num_speakers = num_speakers + self.speaker_embedding_size = speaker_embedding_size + self.upsample_initial_channel = upsample_initial_channel + self.upsample_rates = upsample_rates + self.upsample_kernel_sizes = upsample_kernel_sizes + self.resblock_kernel_sizes = resblock_kernel_sizes + self.resblock_dilation_sizes = resblock_dilation_sizes + self.leaky_relu_slope = leaky_relu_slope + self.depth_separable_channels = depth_separable_channels + self.depth_separable_num_layers = depth_separable_num_layers + self.duration_predictor_flow_bins = duration_predictor_flow_bins + self.duration_predictor_tail_bound = duration_predictor_tail_bound + self.duration_predictor_kernel_size = duration_predictor_kernel_size + self.duration_predictor_dropout = duration_predictor_dropout + self.duration_predictor_num_flows = duration_predictor_num_flows + self.duration_predictor_filter_channels = duration_predictor_filter_channels + self.prior_encoder_num_flows = prior_encoder_num_flows + self.prior_encoder_num_wavenet_layers = prior_encoder_num_wavenet_layers + self.posterior_encoder_num_wavenet_layers = posterior_encoder_num_wavenet_layers + self.wavenet_kernel_size = wavenet_kernel_size + self.wavenet_dilation_rate = wavenet_dilation_rate + self.wavenet_dropout = wavenet_dropout + self.speaking_rate = speaking_rate + self.noise_scale = noise_scale + self.noise_scale_duration = noise_scale_duration + 
self.sampling_rate = sampling_rate + + if len(upsample_kernel_sizes) != len(upsample_rates): + raise ValueError( + f"The length of `upsample_kernel_sizes` ({len(upsample_kernel_sizes)}) must match the length of " + f"`upsample_rates` ({len(upsample_rates)})" + ) + + super().__init__(**kwargs) diff --git a/src/transformers/models/vits/convert_original_checkpoint.py b/src/transformers/models/vits/convert_original_checkpoint.py new file mode 100644 index 0000000000..267f72ccd0 --- /dev/null +++ b/src/transformers/models/vits/convert_original_checkpoint.py @@ -0,0 +1,390 @@ +# coding=utf-8 +# Copyright 2023 The HuggingFace Inc. team. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+"""Convert VITS checkpoint.""" + +import argparse +import json +import tempfile + +import torch +from huggingface_hub import hf_hub_download + +from transformers import VitsConfig, VitsModel, VitsTokenizer, logging + + +logging.set_verbosity_info() +logger = logging.get_logger("transformers.models.vits") + +MAPPING_TEXT_ENCODER = { + "enc_p.emb": "text_encoder.embed_tokens", + "enc_p.encoder.attn_layers.*.conv_k": "text_encoder.encoder.layers.*.attention.k_proj", + "enc_p.encoder.attn_layers.*.conv_v": "text_encoder.encoder.layers.*.attention.v_proj", + "enc_p.encoder.attn_layers.*.conv_q": "text_encoder.encoder.layers.*.attention.q_proj", + "enc_p.encoder.attn_layers.*.conv_o": "text_encoder.encoder.layers.*.attention.out_proj", + "enc_p.encoder.attn_layers.*.emb_rel_k": "text_encoder.encoder.layers.*.attention.emb_rel_k", + "enc_p.encoder.attn_layers.*.emb_rel_v": "text_encoder.encoder.layers.*.attention.emb_rel_v", + "enc_p.encoder.norm_layers_1.*.gamma": "text_encoder.encoder.layers.*.layer_norm.weight", + "enc_p.encoder.norm_layers_1.*.beta": "text_encoder.encoder.layers.*.layer_norm.bias", + "enc_p.encoder.ffn_layers.*.conv_1": "text_encoder.encoder.layers.*.feed_forward.conv_1", + "enc_p.encoder.ffn_layers.*.conv_2": "text_encoder.encoder.layers.*.feed_forward.conv_2", + "enc_p.encoder.norm_layers_2.*.gamma": "text_encoder.encoder.layers.*.final_layer_norm.weight", + "enc_p.encoder.norm_layers_2.*.beta": "text_encoder.encoder.layers.*.final_layer_norm.bias", + "enc_p.proj": "text_encoder.project", +} +MAPPING_STOCHASTIC_DURATION_PREDICTOR = { + "dp.pre": "duration_predictor.conv_pre", + "dp.proj": "duration_predictor.conv_proj", + "dp.convs.convs_sep.*": "duration_predictor.conv_dds.convs_dilated.*", + "dp.convs.convs_1x1.*": "duration_predictor.conv_dds.convs_pointwise.*", + "dp.convs.norms_1.*.gamma": "duration_predictor.conv_dds.norms_1.*.weight", + "dp.convs.norms_1.*.beta": "duration_predictor.conv_dds.norms_1.*.bias", + "dp.convs.norms_2.*.gamma": 
"duration_predictor.conv_dds.norms_2.*.weight", + "dp.convs.norms_2.*.beta": "duration_predictor.conv_dds.norms_2.*.bias", + "dp.flows.0.logs": "duration_predictor.flows.0.log_scale", + "dp.flows.0.m": "duration_predictor.flows.0.translate", + "dp.flows.*.pre": "duration_predictor.flows.*.conv_pre", + "dp.flows.*.proj": "duration_predictor.flows.*.conv_proj", + "dp.flows.*.convs.convs_1x1.0": "duration_predictor.flows.*.conv_dds.convs_pointwise.0", + "dp.flows.*.convs.convs_1x1.1": "duration_predictor.flows.*.conv_dds.convs_pointwise.1", + "dp.flows.*.convs.convs_1x1.2": "duration_predictor.flows.*.conv_dds.convs_pointwise.2", + "dp.flows.*.convs.convs_sep.0": "duration_predictor.flows.*.conv_dds.convs_dilated.0", + "dp.flows.*.convs.convs_sep.1": "duration_predictor.flows.*.conv_dds.convs_dilated.1", + "dp.flows.*.convs.convs_sep.2": "duration_predictor.flows.*.conv_dds.convs_dilated.2", + "dp.flows.*.convs.norms_1.0.gamma": "duration_predictor.flows.*.conv_dds.norms_1.0.weight", + "dp.flows.*.convs.norms_1.0.beta": "duration_predictor.flows.*.conv_dds.norms_1.0.bias", + "dp.flows.*.convs.norms_1.1.gamma": "duration_predictor.flows.*.conv_dds.norms_1.1.weight", + "dp.flows.*.convs.norms_1.1.beta": "duration_predictor.flows.*.conv_dds.norms_1.1.bias", + "dp.flows.*.convs.norms_1.2.gamma": "duration_predictor.flows.*.conv_dds.norms_1.2.weight", + "dp.flows.*.convs.norms_1.2.beta": "duration_predictor.flows.*.conv_dds.norms_1.2.bias", + "dp.flows.*.convs.norms_2.0.gamma": "duration_predictor.flows.*.conv_dds.norms_2.0.weight", + "dp.flows.*.convs.norms_2.0.beta": "duration_predictor.flows.*.conv_dds.norms_2.0.bias", + "dp.flows.*.convs.norms_2.1.gamma": "duration_predictor.flows.*.conv_dds.norms_2.1.weight", + "dp.flows.*.convs.norms_2.1.beta": "duration_predictor.flows.*.conv_dds.norms_2.1.bias", + "dp.flows.*.convs.norms_2.2.gamma": "duration_predictor.flows.*.conv_dds.norms_2.2.weight", + "dp.flows.*.convs.norms_2.2.beta": 
"duration_predictor.flows.*.conv_dds.norms_2.2.bias", + "dp.post_pre": "duration_predictor.post_conv_pre", + "dp.post_proj": "duration_predictor.post_conv_proj", + "dp.post_convs.convs_sep.*": "duration_predictor.post_conv_dds.convs_dilated.*", + "dp.post_convs.convs_1x1.*": "duration_predictor.post_conv_dds.convs_pointwise.*", + "dp.post_convs.norms_1.*.gamma": "duration_predictor.post_conv_dds.norms_1.*.weight", + "dp.post_convs.norms_1.*.beta": "duration_predictor.post_conv_dds.norms_1.*.bias", + "dp.post_convs.norms_2.*.gamma": "duration_predictor.post_conv_dds.norms_2.*.weight", + "dp.post_convs.norms_2.*.beta": "duration_predictor.post_conv_dds.norms_2.*.bias", + "dp.post_flows.0.logs": "duration_predictor.post_flows.0.log_scale", + "dp.post_flows.0.m": "duration_predictor.post_flows.0.translate", + "dp.post_flows.*.pre": "duration_predictor.post_flows.*.conv_pre", + "dp.post_flows.*.proj": "duration_predictor.post_flows.*.conv_proj", + "dp.post_flows.*.convs.convs_1x1.0": "duration_predictor.post_flows.*.conv_dds.convs_pointwise.0", + "dp.post_flows.*.convs.convs_1x1.1": "duration_predictor.post_flows.*.conv_dds.convs_pointwise.1", + "dp.post_flows.*.convs.convs_1x1.2": "duration_predictor.post_flows.*.conv_dds.convs_pointwise.2", + "dp.post_flows.*.convs.convs_sep.0": "duration_predictor.post_flows.*.conv_dds.convs_dilated.0", + "dp.post_flows.*.convs.convs_sep.1": "duration_predictor.post_flows.*.conv_dds.convs_dilated.1", + "dp.post_flows.*.convs.convs_sep.2": "duration_predictor.post_flows.*.conv_dds.convs_dilated.2", + "dp.post_flows.*.convs.norms_1.0.gamma": "duration_predictor.post_flows.*.conv_dds.norms_1.0.weight", + "dp.post_flows.*.convs.norms_1.0.beta": "duration_predictor.post_flows.*.conv_dds.norms_1.0.bias", + "dp.post_flows.*.convs.norms_1.1.gamma": "duration_predictor.post_flows.*.conv_dds.norms_1.1.weight", + "dp.post_flows.*.convs.norms_1.1.beta": "duration_predictor.post_flows.*.conv_dds.norms_1.1.bias", + 
"dp.post_flows.*.convs.norms_1.2.gamma": "duration_predictor.post_flows.*.conv_dds.norms_1.2.weight", + "dp.post_flows.*.convs.norms_1.2.beta": "duration_predictor.post_flows.*.conv_dds.norms_1.2.bias", + "dp.post_flows.*.convs.norms_2.0.gamma": "duration_predictor.post_flows.*.conv_dds.norms_2.0.weight", + "dp.post_flows.*.convs.norms_2.0.beta": "duration_predictor.post_flows.*.conv_dds.norms_2.0.bias", + "dp.post_flows.*.convs.norms_2.1.gamma": "duration_predictor.post_flows.*.conv_dds.norms_2.1.weight", + "dp.post_flows.*.convs.norms_2.1.beta": "duration_predictor.post_flows.*.conv_dds.norms_2.1.bias", + "dp.post_flows.*.convs.norms_2.2.gamma": "duration_predictor.post_flows.*.conv_dds.norms_2.2.weight", + "dp.post_flows.*.convs.norms_2.2.beta": "duration_predictor.post_flows.*.conv_dds.norms_2.2.bias", + "dp.cond": "duration_predictor.cond", # num_speakers > 1 +} +MAPPING_FLOW = { + "flow.flows.*.pre": "flow.flows.*.conv_pre", + "flow.flows.*.enc.in_layers.0": "flow.flows.*.wavenet.in_layers.0", + "flow.flows.*.enc.in_layers.1": "flow.flows.*.wavenet.in_layers.1", + "flow.flows.*.enc.in_layers.2": "flow.flows.*.wavenet.in_layers.2", + "flow.flows.*.enc.in_layers.3": "flow.flows.*.wavenet.in_layers.3", + "flow.flows.*.enc.res_skip_layers.0": "flow.flows.*.wavenet.res_skip_layers.0", + "flow.flows.*.enc.res_skip_layers.1": "flow.flows.*.wavenet.res_skip_layers.1", + "flow.flows.*.enc.res_skip_layers.2": "flow.flows.*.wavenet.res_skip_layers.2", + "flow.flows.*.enc.res_skip_layers.3": "flow.flows.*.wavenet.res_skip_layers.3", + "flow.flows.*.enc.cond_layer": "flow.flows.*.wavenet.cond_layer", # num_speakers > 1 + "flow.flows.*.post": "flow.flows.*.conv_post", +} +MAPPING_GENERATOR = { + "dec.conv_pre": "decoder.conv_pre", + "dec.ups.0": "decoder.upsampler.0", + "dec.ups.1": "decoder.upsampler.1", + "dec.ups.2": "decoder.upsampler.2", + "dec.ups.3": "decoder.upsampler.3", + "dec.resblocks.*.convs1.0": "decoder.resblocks.*.convs1.0", + "dec.resblocks.*.convs1.1": 
"decoder.resblocks.*.convs1.1", + "dec.resblocks.*.convs1.2": "decoder.resblocks.*.convs1.2", + "dec.resblocks.*.convs2.0": "decoder.resblocks.*.convs2.0", + "dec.resblocks.*.convs2.1": "decoder.resblocks.*.convs2.1", + "dec.resblocks.*.convs2.2": "decoder.resblocks.*.convs2.2", + "dec.conv_post": "decoder.conv_post", + "dec.cond": "decoder.cond", # num_speakers > 1 +} +MAPPING_POSTERIOR_ENCODER = { + "enc_q.pre": "posterior_encoder.conv_pre", + "enc_q.enc.in_layers.*": "posterior_encoder.wavenet.in_layers.*", + "enc_q.enc.res_skip_layers.*": "posterior_encoder.wavenet.res_skip_layers.*", + "enc_q.enc.cond_layer": "posterior_encoder.wavenet.cond_layer", # num_speakers > 1 + "enc_q.proj": "posterior_encoder.conv_proj", +} +MAPPING = { + **MAPPING_TEXT_ENCODER, + **MAPPING_STOCHASTIC_DURATION_PREDICTOR, + **MAPPING_FLOW, + **MAPPING_GENERATOR, + **MAPPING_POSTERIOR_ENCODER, + "emb_g": "embed_speaker", # num_speakers > 1 +} +TOP_LEVEL_KEYS = [] +IGNORE_KEYS = [] + + +def set_recursively(hf_pointer, key, value, full_name, weight_type): + for attribute in key.split("."): + hf_pointer = getattr(hf_pointer, attribute) + + if weight_type is not None: + hf_shape = getattr(hf_pointer, weight_type).shape + else: + hf_shape = hf_pointer.shape + + # strip off the kernel dimension at the end (original weights are Conv1d) + if key.endswith(".k_proj") or key.endswith(".v_proj") or key.endswith(".q_proj") or key.endswith(".out_proj"): + value = value.squeeze(-1) + + if hf_shape != value.shape: + raise ValueError( + f"Shape of hf {key + '.' 
+ weight_type if weight_type is not None else ''} is {hf_shape}, but should be" + f" {value.shape} for {full_name}" + ) + + if weight_type == "weight": + hf_pointer.weight.data = value + elif weight_type == "weight_g": + hf_pointer.weight_g.data = value + elif weight_type == "weight_v": + hf_pointer.weight_v.data = value + elif weight_type == "bias": + hf_pointer.bias.data = value + elif weight_type == "running_mean": + hf_pointer.running_mean.data = value + elif weight_type == "running_var": + hf_pointer.running_var.data = value + elif weight_type == "num_batches_tracked": + hf_pointer.num_batches_tracked.data = value + else: + hf_pointer.data = value + + logger.info(f"{key + ('.' + weight_type if weight_type is not None else '')} was initialized from {full_name}.") + + +def should_ignore(name, ignore_keys): + for key in ignore_keys: + if key.endswith(".*"): + if name.startswith(key[:-1]): + return True + elif ".*." in key: + prefix, suffix = key.split(".*.") + if prefix in name and suffix in name: + return True + elif key in name: + return True + return False + + +def recursively_load_weights(fairseq_dict, hf_model): + unused_weights = [] + + for name, value in fairseq_dict.items(): + if should_ignore(name, IGNORE_KEYS): + logger.info(f"{name} was ignored") + continue + + is_used = False + for key, mapped_key in MAPPING.items(): + if key.endswith(".*"): + key = key[:-1] + elif "*" in key: + prefix, suffix = key.split(".*.") + if prefix in name and suffix in name: + key = suffix + + if key in name: + is_used = True + if mapped_key.endswith(".*"): + layer_index = name.split(key)[-1].split(".")[0] + mapped_key = mapped_key.replace("*", layer_index) + elif "*" in mapped_key: + layer_index = name.split(key)[0].split(".")[-2] + + # remap the layer index since we removed the Flip layers + if "flow.flows" in mapped_key: + layer_index = str(int(layer_index) // 2) + if "duration_predictor.flows" in mapped_key or "duration_predictor.post_flows" in mapped_key: + layer_index 
= str(int(layer_index) // 2 + 1) + + mapped_key = mapped_key.replace("*", layer_index) + if "weight_g" in name: + weight_type = "weight_g" + elif "weight_v" in name: + weight_type = "weight_v" + elif "bias" in name: + weight_type = "bias" + elif "weight" in name: + weight_type = "weight" + elif "running_mean" in name: + weight_type = "running_mean" + elif "running_var" in name: + weight_type = "running_var" + elif "num_batches_tracked" in name: + weight_type = "num_batches_tracked" + else: + weight_type = None + set_recursively(hf_model, mapped_key, value, name, weight_type) + continue + if not is_used: + unused_weights.append(name) + + logger.warning(f"Unused weights: {unused_weights}") + + +@torch.no_grad() +def convert_checkpoint( + pytorch_dump_folder_path, + checkpoint_path=None, + config_path=None, + vocab_path=None, + language=None, + num_speakers=None, + sampling_rate=None, + repo_id=None, +): + """ + Copy/paste/tweak model's weights to transformers design. + """ + if config_path is not None: + config = VitsConfig.from_pretrained(config_path) + else: + config = VitsConfig() + + if num_speakers: + config.num_speakers = num_speakers + config.speaker_embedding_size = 256 + + if sampling_rate: + config.sampling_rate = sampling_rate + + if checkpoint_path is None: + logger.info(f"***Converting model: facebook/mms-tts {language}***") + + vocab_path = hf_hub_download( + repo_id="facebook/mms-tts", + filename="vocab.txt", + subfolder=f"models/{language}", + ) + config_file = hf_hub_download( + repo_id="facebook/mms-tts", + filename="config.json", + subfolder=f"models/{language}", + ) + checkpoint_path = hf_hub_download( + repo_id="facebook/mms-tts", + filename="G_100000.pth", + subfolder=f"models/{language}", + ) + + with open(config_file, "r") as f: + data = f.read() + hps = json.loads(data) + + is_uroman = hps["data"]["training_files"].split(".")[-1] == "uroman" + if is_uroman: + logger.warning("For this checkpoint, you should use `uroman` to convert input text 
before tokenizing it!") + else: + logger.info(f"***Converting model: {checkpoint_path}***") + is_uroman = False + + # original VITS checkpoint + if vocab_path is None: + _pad = "_" + _punctuation = ';:,.!?¡¿—…"«»“” ' + _letters = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz" + _letters_ipa = "ɑɐɒæɓʙβɔɕçɗɖðʤəɘɚɛɜɝɞɟʄɡɠɢʛɦɧħɥʜɨɪʝɭɬɫɮʟɱɯɰŋɳɲɴøɵɸθœɶʘɹɺɾɻʀʁɽʂʃʈʧʉʊʋⱱʌɣɤʍχʎʏʑʐʒʔʡʕʢǀǁǂǃˈˌːˑʼʴʰʱʲʷˠˤ˞↓↑→↗↘'̩'ᵻ" + symbols = _pad + _punctuation + _letters + _letters_ipa + symbol_to_id = {s: i for i, s in enumerate(symbols)} + phonemize = True + else: + # Save vocab as temporary json file + symbols = [line.replace("\n", "") for line in open(vocab_path, encoding="utf-8").readlines()] + symbol_to_id = {s: i for i, s in enumerate(symbols)} + # MMS-TTS does not use a <pad> token, so we set it to the token used to space characters + _pad = symbols[0] + phonemize = False + + with tempfile.NamedTemporaryFile() as tf: + with open(tf.name, "w", encoding="utf-8") as f: + f.write(json.dumps(symbol_to_id, indent=2, sort_keys=True, ensure_ascii=False) + "\n") + + tokenizer = VitsTokenizer(tf.name, language=language, phonemize=phonemize, is_uroman=is_uroman, pad_token=_pad) + + config.vocab_size = len(symbols) + model = VitsModel(config) + + model.decoder.apply_weight_norm() + + orig_checkpoint = torch.load(checkpoint_path, map_location=torch.device("cpu")) + recursively_load_weights(orig_checkpoint["model"], model) + + model.decoder.remove_weight_norm() + + model.save_pretrained(pytorch_dump_folder_path) + tokenizer.save_pretrained(pytorch_dump_folder_path) + + if repo_id: + print("Pushing to the hub...") + tokenizer.push_to_hub(repo_id) + model.push_to_hub(repo_id) + + +if __name__ == "__main__": + parser = argparse.ArgumentParser() + parser.add_argument("--checkpoint_path", default=None, type=str, help="Local path to original checkpoint") + parser.add_argument("--vocab_path", default=None, type=str, help="Path to vocab.txt") + parser.add_argument("--config_path", default=None, type=str,
help="Path to hf config.json of model to convert") + parser.add_argument("--language", default=None, type=str, help="Tokenizer language (three-letter code)") + parser.add_argument("--num_speakers", default=None, type=int, help="Number of speakers") + parser.add_argument( + "--sampling_rate", default=None, type=int, help="Sampling rate on which the model was trained." + ) + parser.add_argument( + "--pytorch_dump_folder_path", required=True, default=None, type=str, help="Path to the output PyTorch model." + ) + parser.add_argument( + "--push_to_hub", default=None, type=str, help="Where to upload the converted model on the 🤗 hub." + ) + + args = parser.parse_args() + convert_checkpoint( + args.pytorch_dump_folder_path, + args.checkpoint_path, + args.config_path, + args.vocab_path, + args.language, + args.num_speakers, + args.sampling_rate, + args.push_to_hub, + ) diff --git a/src/transformers/models/vits/modeling_vits.py b/src/transformers/models/vits/modeling_vits.py new file mode 100644 index 0000000000..876abc8704 --- /dev/null +++ b/src/transformers/models/vits/modeling_vits.py @@ -0,0 +1,1506 @@ +# coding=utf-8 +# Copyright 2023 The Kakao Enterprise Authors and the HuggingFace Inc. team. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+""" PyTorch VITS model.""" + +import math +from dataclasses import dataclass +from typing import Any, Optional, Tuple, Union + +import numpy as np +import torch +import torch.utils.checkpoint +from torch import nn + +from ...activations import ACT2FN +from ...deepspeed import is_deepspeed_zero3_enabled +from ...modeling_outputs import ( + BaseModelOutput, + ModelOutput, +) +from ...modeling_utils import PreTrainedModel +from ...utils import add_start_docstrings, add_start_docstrings_to_model_forward, logging, replace_return_docstrings +from .configuration_vits import VitsConfig + + +logger = logging.get_logger(__name__) + + +# General docstring +_CONFIG_FOR_DOC = "VitsConfig" + + +VITS_PRETRAINED_MODEL_ARCHIVE_LIST = [ + "facebook/mms-tts-eng", + # See all VITS models at https://huggingface.co/models?filter=vits + # and all MMS models at https://huggingface.co/models?sort=trending&search=facebook%2Fmms-tts +] + + +@dataclass +class VitsModelOutput(ModelOutput): + """ + Describes the outputs for the VITS model, with potential hidden states and attentions. + + Args: + waveform (`torch.FloatTensor` of shape `(batch_size, sequence_length)`): + The final audio waveform predicted by the model. + sequence_lengths (`torch.FloatTensor` of shape `(batch_size,)`): + The length in samples of each element in the `waveform` batch. + spectrogram (`torch.FloatTensor` of shape `(batch_size, sequence_length, num_bins)`): + The log-mel spectrogram predicted at the output of the flow model. This spectrogram is passed to the Hi-Fi + GAN decoder model to obtain the final audio waveform. + hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`): + Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, + + one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`. 
+ + Hidden-states of the model at the output of each layer plus the optional initial embedding outputs. + attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`): + Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length, + sequence_length)`. + + Attention weights after the attention softmax, used to compute the weighted average in the self-attention + heads. + """ + + waveform: torch.FloatTensor = None + sequence_lengths: torch.FloatTensor = None + spectrogram: Optional[Tuple[torch.FloatTensor]] = None + hidden_states: Optional[Tuple[torch.FloatTensor]] = None + attentions: Optional[Tuple[torch.FloatTensor]] = None + + +@dataclass +class VitsTextEncoderOutput(ModelOutput): + """ + Describes the outputs for the VITS text encoder model, with potential hidden states and attentions. + + Args: + last_hidden_state (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`): + Sequence of hidden-states at the output of the last layer of the model. + prior_means (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`): + The predicted mean values of the prior distribution for the latent text variables. + prior_log_variances (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`): + The predicted log-variance values of the prior distribution for the latent text variables. + hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`): + Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, + + one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`. + + Hidden-states of the model at the output of each layer plus the optional initial embedding outputs. 
+ attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`): + Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length, + sequence_length)`. + + Attention weights after the attention softmax, used to compute the weighted average in the self-attention + heads. + """ + + last_hidden_state: torch.FloatTensor = None + prior_means: torch.FloatTensor = None + prior_log_variances: torch.FloatTensor = None + hidden_states: Optional[Tuple[torch.FloatTensor]] = None + attentions: Optional[Tuple[torch.FloatTensor]] = None + + +# Copied from transformers.models.bart.modeling_bart._expand_mask +def _expand_mask(mask: torch.Tensor, dtype: torch.dtype, tgt_len: Optional[int] = None): + """ + Expands attention_mask from `[bsz, seq_len]` to `[bsz, 1, tgt_seq_len, src_seq_len]`. + """ + bsz, src_len = mask.size() + tgt_len = tgt_len if tgt_len is not None else src_len + + expanded_mask = mask[:, None, None, :].expand(bsz, 1, tgt_len, src_len).to(dtype) + + inverted_mask = 1.0 - expanded_mask + + return inverted_mask.masked_fill(inverted_mask.to(torch.bool), torch.finfo(dtype).min) + + +@torch.jit.script +def fused_add_tanh_sigmoid_multiply(input_a, input_b, num_channels): + in_act = input_a + input_b + t_act = torch.tanh(in_act[:, :num_channels, :]) + s_act = torch.sigmoid(in_act[:, num_channels:, :]) + acts = t_act * s_act + return acts + + +def _unconstrained_rational_quadratic_spline( + inputs, + unnormalized_widths, + unnormalized_heights, + unnormalized_derivatives, + reverse=False, + tail_bound=5.0, + min_bin_width=1e-3, + min_bin_height=1e-3, + min_derivative=1e-3, +): + """ + This transformation represents a monotonically increasing piecewise rational quadratic function. Outside of the + `tail_bound`, the transform behaves as an identity function. 
+ + Args: + inputs (`torch.FloatTensor` of shape `(batch_size, channels, seq_len)`): + Second half of the hidden-states input to the Vits convolutional flow module. + unnormalized_widths (`torch.FloatTensor` of shape `(batch_size, channels, seq_len, duration_predictor_flow_bins)`): + First `duration_predictor_flow_bins` of the hidden-states from the output of the convolution projection + layer in the convolutional flow module. + unnormalized_heights (`torch.FloatTensor` of shape `(batch_size, channels, seq_len, duration_predictor_flow_bins)`): + Second `duration_predictor_flow_bins` of the hidden-states from the output of the convolution projection + layer in the convolutional flow module. + unnormalized_derivatives (`torch.FloatTensor` of shape `(batch_size, channels, seq_len, duration_predictor_flow_bins)`): + Third `duration_predictor_flow_bins` of the hidden-states from the output of the convolution projection + layer in the convolutional flow module. + reverse (`bool`, *optional*, defaults to `False`): + Whether the model is being run in reverse mode. + tail_bound (`float`, *optional*, defaults to 5): + Upper and lower limit bound for the rational quadratic function. Outside of this `tail_bound`, the + transform behaves as an identity function. + min_bin_width (`float`, *optional*, defaults to 1e-3): + Minimum bin value across the width dimension for the piecewise rational quadratic function. + min_bin_height (`float`, *optional*, defaults to 1e-3): + Minimum bin value across the height dimension for the piecewise rational quadratic function. + min_derivative (`float`, *optional*, defaults to 1e-3): + Minimum bin value across the derivatives for the piecewise rational quadratic function. + Returns: + outputs (`torch.FloatTensor` of shape `(batch_size, channels, seq_len)`): + Hidden-states as transformed by the piecewise rational quadratic function with the `tail_bound` limits + applied.
+ log_abs_det (`torch.FloatTensor` of shape `(batch_size, channels, seq_len)`): + Logarithm of the absolute value of the determinants corresponding to the `outputs` with the `tail_bound` + limits applied. + """ + inside_interval_mask = (inputs >= -tail_bound) & (inputs <= tail_bound) + outside_interval_mask = ~inside_interval_mask + + outputs = torch.zeros_like(inputs) + log_abs_det = torch.zeros_like(inputs) + constant = np.log(np.exp(1 - min_derivative) - 1) + + unnormalized_derivatives = nn.functional.pad(unnormalized_derivatives, pad=(1, 1)) + unnormalized_derivatives[..., 0] = constant + unnormalized_derivatives[..., -1] = constant + + outputs[outside_interval_mask] = inputs[outside_interval_mask] + log_abs_det[outside_interval_mask] = 0.0 + + outputs[inside_interval_mask], log_abs_det[inside_interval_mask] = _rational_quadratic_spline( + inputs=inputs[inside_interval_mask], + unnormalized_widths=unnormalized_widths[inside_interval_mask, :], + unnormalized_heights=unnormalized_heights[inside_interval_mask, :], + unnormalized_derivatives=unnormalized_derivatives[inside_interval_mask, :], + reverse=reverse, + tail_bound=tail_bound, + min_bin_width=min_bin_width, + min_bin_height=min_bin_height, + min_derivative=min_derivative, + ) + return outputs, log_abs_det + + +def _rational_quadratic_spline( + inputs, + unnormalized_widths, + unnormalized_heights, + unnormalized_derivatives, + reverse, + tail_bound, + min_bin_width, + min_bin_height, + min_derivative, +): + """ + This transformation represents a monotonically increasing piecewise rational quadratic function. Unlike the + function `_unconstrained_rational_quadratic_spline`, the function behaves the same across the `tail_bound`. + + Args: + inputs (`torch.FloatTensor` of shape `(batch_size, channels, seq_len)`): + Second half of the hidden-states input to the Vits convolutional flow module.
+ unnormalized_widths (`torch.FloatTensor` of shape `(batch_size, channels, seq_len, duration_predictor_flow_bins)`): + First `duration_predictor_flow_bins` of the hidden-states from the output of the convolution projection + layer in the convolutional flow module. + unnormalized_heights (`torch.FloatTensor` of shape `(batch_size, channels, seq_len, duration_predictor_flow_bins)`): + Second `duration_predictor_flow_bins` of the hidden-states from the output of the convolution projection + layer in the convolutional flow module. + unnormalized_derivatives (`torch.FloatTensor` of shape `(batch_size, channels, seq_len, duration_predictor_flow_bins)`): + Third `duration_predictor_flow_bins` of the hidden-states from the output of the convolution projection + layer in the convolutional flow module. + reverse (`bool`): + Whether the model is being run in reverse mode. + tail_bound (`float`): + Upper and lower limit bound for the rational quadratic function. Inputs outside of this `tail_bound` are + rejected with a `ValueError`. + min_bin_width (`float`): + Minimum bin value across the width dimension for the piecewise rational quadratic function. + min_bin_height (`float`): + Minimum bin value across the height dimension for the piecewise rational quadratic function. + min_derivative (`float`): + Minimum bin value across the derivatives for the piecewise rational quadratic function. + Returns: + outputs (`torch.FloatTensor` of shape `(batch_size, channels, seq_len)`): + Hidden-states as transformed by the piecewise rational quadratic function. + log_abs_det (`torch.FloatTensor` of shape `(batch_size, channels, seq_len)`): + Logarithm of the absolute value of the determinants corresponding to the `outputs`.
+ """ + upper_bound = tail_bound + lower_bound = -tail_bound + + if torch.min(inputs) < lower_bound or torch.max(inputs) > upper_bound: + raise ValueError("Input to a transform is not within its domain") + + num_bins = unnormalized_widths.shape[-1] + + if min_bin_width * num_bins > 1.0: + raise ValueError(f"Minimal bin width {min_bin_width} too large for the number of bins {num_bins}") + if min_bin_height * num_bins > 1.0: + raise ValueError(f"Minimal bin height {min_bin_height} too large for the number of bins {num_bins}") + + widths = nn.functional.softmax(unnormalized_widths, dim=-1) + widths = min_bin_width + (1 - min_bin_width * num_bins) * widths + cumwidths = torch.cumsum(widths, dim=-1) + cumwidths = nn.functional.pad(cumwidths, pad=(1, 0), mode="constant", value=0.0) + cumwidths = (upper_bound - lower_bound) * cumwidths + lower_bound + cumwidths[..., 0] = lower_bound + cumwidths[..., -1] = upper_bound + widths = cumwidths[..., 1:] - cumwidths[..., :-1] + + derivatives = min_derivative + nn.functional.softplus(unnormalized_derivatives) + + heights = nn.functional.softmax(unnormalized_heights, dim=-1) + heights = min_bin_height + (1 - min_bin_height * num_bins) * heights + cumheights = torch.cumsum(heights, dim=-1) + cumheights = nn.functional.pad(cumheights, pad=(1, 0), mode="constant", value=0.0) + cumheights = (upper_bound - lower_bound) * cumheights + lower_bound + cumheights[..., 0] = lower_bound + cumheights[..., -1] = upper_bound + heights = cumheights[..., 1:] - cumheights[..., :-1] + + bin_locations = cumheights if reverse else cumwidths + bin_locations[..., -1] += 1e-6 + bin_idx = torch.sum(inputs[..., None] >= bin_locations, dim=-1) - 1 + bin_idx = bin_idx[..., None] + + input_cumwidths = cumwidths.gather(-1, bin_idx)[..., 0] + input_bin_widths = widths.gather(-1, bin_idx)[..., 0] + + input_cumheights = cumheights.gather(-1, bin_idx)[..., 0] + delta = heights / widths + input_delta = delta.gather(-1, bin_idx)[..., 0] + + input_derivatives = 
derivatives.gather(-1, bin_idx)[..., 0] + input_derivatives_plus_one = derivatives[..., 1:].gather(-1, bin_idx)[..., 0] + + input_heights = heights.gather(-1, bin_idx)[..., 0] + + intermediate1 = input_derivatives + input_derivatives_plus_one - 2 * input_delta + if not reverse: + theta = (inputs - input_cumwidths) / input_bin_widths + theta_one_minus_theta = theta * (1 - theta) + + numerator = input_heights * (input_delta * theta.pow(2) + input_derivatives * theta_one_minus_theta) + denominator = input_delta + intermediate1 * theta_one_minus_theta + outputs = input_cumheights + numerator / denominator + + derivative_numerator = input_delta.pow(2) * ( + input_derivatives_plus_one * theta.pow(2) + + 2 * input_delta * theta_one_minus_theta + + input_derivatives * (1 - theta).pow(2) + ) + log_abs_det = torch.log(derivative_numerator) - 2 * torch.log(denominator) + return outputs, log_abs_det + else: + # find the roots of a quadratic equation + intermediate2 = inputs - input_cumheights + intermediate3 = intermediate2 * intermediate1 + a = input_heights * (input_delta - input_derivatives) + intermediate3 + b = input_heights * input_derivatives - intermediate3 + c = -input_delta * intermediate2 + + discriminant = b.pow(2) - 4 * a * c + if not (discriminant >= 0).all(): + raise RuntimeError(f"invalid discriminant {discriminant}") + + root = (2 * c) / (-b - torch.sqrt(discriminant)) + outputs = root * input_bin_widths + input_cumwidths + + theta_one_minus_theta = root * (1 - root) + denominator = input_delta + intermediate1 * theta_one_minus_theta + derivative_numerator = input_delta.pow(2) * ( + input_derivatives_plus_one * root.pow(2) + + 2 * input_delta * theta_one_minus_theta + + input_derivatives * (1 - root).pow(2) + ) + log_abs_det = torch.log(derivative_numerator) - 2 * torch.log(denominator) + return outputs, -log_abs_det + + +class VitsWaveNet(torch.nn.Module): + def __init__(self, config: VitsConfig, num_layers: int): + super().__init__() + self.hidden_size = 
config.hidden_size + self.num_layers = num_layers + + self.in_layers = torch.nn.ModuleList() + self.res_skip_layers = torch.nn.ModuleList() + self.dropout = nn.Dropout(config.wavenet_dropout) + + if config.speaker_embedding_size != 0: + cond_layer = torch.nn.Conv1d(config.speaker_embedding_size, 2 * config.hidden_size * num_layers, 1) + self.cond_layer = torch.nn.utils.weight_norm(cond_layer, name="weight") + + for i in range(num_layers): + dilation = config.wavenet_dilation_rate**i + padding = (config.wavenet_kernel_size * dilation - dilation) // 2 + in_layer = torch.nn.Conv1d( + in_channels=config.hidden_size, + out_channels=2 * config.hidden_size, + kernel_size=config.wavenet_kernel_size, + dilation=dilation, + padding=padding, + ) + in_layer = torch.nn.utils.weight_norm(in_layer, name="weight") + self.in_layers.append(in_layer) + + # last one is not necessary + if i < num_layers - 1: + res_skip_channels = 2 * config.hidden_size + else: + res_skip_channels = config.hidden_size + + res_skip_layer = torch.nn.Conv1d(config.hidden_size, res_skip_channels, 1) + res_skip_layer = torch.nn.utils.weight_norm(res_skip_layer, name="weight") + self.res_skip_layers.append(res_skip_layer) + + def forward(self, inputs, padding_mask, global_conditioning=None): + outputs = torch.zeros_like(inputs) + num_channels_tensor = torch.IntTensor([self.hidden_size]) + + if global_conditioning is not None: + global_conditioning = self.cond_layer(global_conditioning) + + for i in range(self.num_layers): + hidden_states = self.in_layers[i](inputs) + + if global_conditioning is not None: + cond_offset = i * 2 * self.hidden_size + global_states = global_conditioning[:, cond_offset : cond_offset + 2 * self.hidden_size, :] + else: + global_states = torch.zeros_like(hidden_states) + + acts = fused_add_tanh_sigmoid_multiply(hidden_states, global_states, num_channels_tensor[0]) + acts = self.dropout(acts) + + res_skip_acts = self.res_skip_layers[i](acts) + if i < self.num_layers - 1: + res_acts = 
res_skip_acts[:, : self.hidden_size, :] + inputs = (inputs + res_acts) * padding_mask + outputs = outputs + res_skip_acts[:, self.hidden_size :, :] + else: + outputs = outputs + res_skip_acts + + return outputs * padding_mask + + def remove_weight_norm(self): + # `cond_layer` is only created when `config.speaker_embedding_size != 0` + if hasattr(self, "cond_layer"): + torch.nn.utils.remove_weight_norm(self.cond_layer) + for layer in self.in_layers: + torch.nn.utils.remove_weight_norm(layer) + for layer in self.res_skip_layers: + torch.nn.utils.remove_weight_norm(layer) + + +class VitsPosteriorEncoder(nn.Module): + def __init__(self, config: VitsConfig): + super().__init__() + self.out_channels = config.flow_size + + self.conv_pre = nn.Conv1d(config.spectrogram_bins, config.hidden_size, 1) + self.wavenet = VitsWaveNet(config, num_layers=config.posterior_encoder_num_wavenet_layers) + self.conv_proj = nn.Conv1d(config.hidden_size, self.out_channels * 2, 1) + + def forward(self, inputs, padding_mask, global_conditioning=None): + inputs = self.conv_pre(inputs) * padding_mask + inputs = self.wavenet(inputs, padding_mask, global_conditioning) + stats = self.conv_proj(inputs) * padding_mask + mean, log_stddev = torch.split(stats, self.out_channels, dim=1) + sampled = (mean + torch.randn_like(mean) * torch.exp(log_stddev)) * padding_mask + return sampled, mean, log_stddev + + +# Copied from transformers.models.speecht5.modeling_speecht5.HifiGanResidualBlock +class HifiGanResidualBlock(nn.Module): + def __init__(self, channels, kernel_size=3, dilation=(1, 3, 5), leaky_relu_slope=0.1): + super().__init__() + self.leaky_relu_slope = leaky_relu_slope + + self.convs1 = nn.ModuleList( + [ + nn.Conv1d( + channels, + channels, + kernel_size, + stride=1, + dilation=dilation[i], + padding=self.get_padding(kernel_size, dilation[i]), + ) + for i in range(len(dilation)) + ] + ) + self.convs2 = nn.ModuleList( + [ + nn.Conv1d( + channels, + channels, + kernel_size, + stride=1, + dilation=1, + padding=self.get_padding(kernel_size, 1), + ) + for _ in
range(len(dilation)) + ] + ) + + def get_padding(self, kernel_size, dilation=1): + return (kernel_size * dilation - dilation) // 2 + + def apply_weight_norm(self): + for layer in self.convs1: + nn.utils.weight_norm(layer) + for layer in self.convs2: + nn.utils.weight_norm(layer) + + def remove_weight_norm(self): + for layer in self.convs1: + nn.utils.remove_weight_norm(layer) + for layer in self.convs2: + nn.utils.remove_weight_norm(layer) + + def forward(self, hidden_states): + for conv1, conv2 in zip(self.convs1, self.convs2): + residual = hidden_states + hidden_states = nn.functional.leaky_relu(hidden_states, self.leaky_relu_slope) + hidden_states = conv1(hidden_states) + hidden_states = nn.functional.leaky_relu(hidden_states, self.leaky_relu_slope) + hidden_states = conv2(hidden_states) + hidden_states = hidden_states + residual + return hidden_states + + +class VitsHifiGan(nn.Module): + def __init__(self, config: VitsConfig): + super().__init__() + self.config = config + self.num_kernels = len(config.resblock_kernel_sizes) + self.num_upsamples = len(config.upsample_rates) + self.conv_pre = nn.Conv1d( + config.flow_size, + config.upsample_initial_channel, + kernel_size=7, + stride=1, + padding=3, + ) + + self.upsampler = nn.ModuleList() + for i, (upsample_rate, kernel_size) in enumerate(zip(config.upsample_rates, config.upsample_kernel_sizes)): + self.upsampler.append( + nn.ConvTranspose1d( + config.upsample_initial_channel // (2**i), + config.upsample_initial_channel // (2 ** (i + 1)), + kernel_size=kernel_size, + stride=upsample_rate, + padding=(kernel_size - upsample_rate) // 2, + ) + ) + + self.resblocks = nn.ModuleList() + for i in range(len(self.upsampler)): + channels = config.upsample_initial_channel // (2 ** (i + 1)) + for kernel_size, dilation in zip(config.resblock_kernel_sizes, config.resblock_dilation_sizes): + self.resblocks.append(HifiGanResidualBlock(channels, kernel_size, dilation, config.leaky_relu_slope)) + + self.conv_post = 
nn.Conv1d(channels, 1, kernel_size=7, stride=1, padding=3, bias=False) + + if config.speaker_embedding_size != 0: + self.cond = nn.Conv1d(config.speaker_embedding_size, config.upsample_initial_channel, 1) + + def apply_weight_norm(self): + for layer in self.upsampler: + nn.utils.weight_norm(layer) + for layer in self.resblocks: + layer.apply_weight_norm() + + def remove_weight_norm(self): + for layer in self.upsampler: + nn.utils.remove_weight_norm(layer) + for layer in self.resblocks: + layer.remove_weight_norm() + + def forward( + self, spectrogram: torch.FloatTensor, global_conditioning: Optional[torch.FloatTensor] = None + ) -> torch.FloatTensor: + r""" + Converts a spectrogram into a speech waveform. + + Args: + spectrogram (`torch.FloatTensor` of shape `(batch_size, config.spectrogram_bins, sequence_length)`): + Tensor containing the spectrograms. + global_conditioning (`torch.FloatTensor` of shape `(batch_size, config.speaker_embedding_size, 1)`, *optional*): + Tensor containing speaker embeddings, for multispeaker models. + + Returns: + `torch.FloatTensor`: Tensor of shape `(batch_size, 1, num_frames)` containing the speech waveform.
+ """ + hidden_states = self.conv_pre(spectrogram) + + if global_conditioning is not None: + hidden_states = hidden_states + self.cond(global_conditioning) + + for i in range(self.num_upsamples): + hidden_states = nn.functional.leaky_relu(hidden_states, self.config.leaky_relu_slope) + hidden_states = self.upsampler[i](hidden_states) + + res_state = self.resblocks[i * self.num_kernels](hidden_states) + for j in range(1, self.num_kernels): + res_state += self.resblocks[i * self.num_kernels + j](hidden_states) + hidden_states = res_state / self.num_kernels + + hidden_states = nn.functional.leaky_relu(hidden_states) + hidden_states = self.conv_post(hidden_states) + waveform = torch.tanh(hidden_states) + return waveform + + +class VitsResidualCouplingLayer(nn.Module): + def __init__(self, config: VitsConfig): + super().__init__() + self.half_channels = config.flow_size // 2 + + self.conv_pre = nn.Conv1d(self.half_channels, config.hidden_size, 1) + self.wavenet = VitsWaveNet(config, num_layers=config.prior_encoder_num_wavenet_layers) + self.conv_post = nn.Conv1d(config.hidden_size, self.half_channels, 1) + + def forward(self, inputs, padding_mask, global_conditioning=None, reverse=False): + first_half, second_half = torch.split(inputs, [self.half_channels] * 2, dim=1) + hidden_states = self.conv_pre(first_half) * padding_mask + hidden_states = self.wavenet(hidden_states, padding_mask, global_conditioning) + mean = self.conv_post(hidden_states) * padding_mask + log_stddev = torch.zeros_like(mean) + + if not reverse: + second_half = mean + second_half * torch.exp(log_stddev) * padding_mask + outputs = torch.cat([first_half, second_half], dim=1) + log_determinant = torch.sum(log_stddev, [1, 2]) + return outputs, log_determinant + else: + second_half = (second_half - mean) * torch.exp(-log_stddev) * padding_mask + outputs = torch.cat([first_half, second_half], dim=1) + return outputs, None + + +class VitsResidualCouplingBlock(nn.Module): + def __init__(self, config: 
VitsConfig): + super().__init__() + self.flows = nn.ModuleList() + for _ in range(config.prior_encoder_num_flows): + self.flows.append(VitsResidualCouplingLayer(config)) + + def forward(self, inputs, padding_mask, global_conditioning=None, reverse=False): + if not reverse: + for flow in self.flows: + inputs, _ = flow(inputs, padding_mask, global_conditioning) + inputs = torch.flip(inputs, [1]) + else: + for flow in reversed(self.flows): + inputs = torch.flip(inputs, [1]) + inputs, _ = flow(inputs, padding_mask, global_conditioning, reverse=True) + return inputs + + +class VitsDilatedDepthSeparableConv(nn.Module): + def __init__(self, config: VitsConfig, dropout_rate=0.0): + super().__init__() + kernel_size = config.duration_predictor_kernel_size + channels = config.hidden_size + self.num_layers = config.depth_separable_num_layers + + self.dropout = nn.Dropout(dropout_rate) + self.convs_dilated = nn.ModuleList() + self.convs_pointwise = nn.ModuleList() + self.norms_1 = nn.ModuleList() + self.norms_2 = nn.ModuleList() + for i in range(self.num_layers): + dilation = kernel_size**i + padding = (kernel_size * dilation - dilation) // 2 + self.convs_dilated.append( + nn.Conv1d( + in_channels=channels, + out_channels=channels, + kernel_size=kernel_size, + groups=channels, + dilation=dilation, + padding=padding, + ) + ) + self.convs_pointwise.append(nn.Conv1d(channels, channels, 1)) + self.norms_1.append(nn.LayerNorm(channels)) + self.norms_2.append(nn.LayerNorm(channels)) + + def forward(self, inputs, padding_mask, global_conditioning=None): + if global_conditioning is not None: + inputs = inputs + global_conditioning + + for i in range(self.num_layers): + hidden_states = self.convs_dilated[i](inputs * padding_mask) + hidden_states = self.norms_1[i](hidden_states.transpose(1, -1)).transpose(1, -1) + hidden_states = nn.functional.gelu(hidden_states) + hidden_states = self.convs_pointwise[i](hidden_states) + hidden_states = self.norms_2[i](hidden_states.transpose(1, 
-1)).transpose(1, -1) + hidden_states = nn.functional.gelu(hidden_states) + hidden_states = self.dropout(hidden_states) + inputs = inputs + hidden_states + + return inputs * padding_mask + + +class VitsConvFlow(nn.Module): + def __init__(self, config: VitsConfig): + super().__init__() + self.filter_channels = config.hidden_size + self.half_channels = config.depth_separable_channels // 2 + self.num_bins = config.duration_predictor_flow_bins + self.tail_bound = config.duration_predictor_tail_bound + + self.conv_pre = nn.Conv1d(self.half_channels, self.filter_channels, 1) + self.conv_dds = VitsDilatedDepthSeparableConv(config) + self.conv_proj = nn.Conv1d(self.filter_channels, self.half_channels * (self.num_bins * 3 - 1), 1) + + def forward(self, inputs, padding_mask, global_conditioning=None, reverse=False): + first_half, second_half = torch.split(inputs, [self.half_channels] * 2, dim=1) + + hidden_states = self.conv_pre(first_half) + hidden_states = self.conv_dds(hidden_states, padding_mask, global_conditioning) + hidden_states = self.conv_proj(hidden_states) * padding_mask + + batch_size, channels, length = first_half.shape + hidden_states = hidden_states.reshape(batch_size, channels, -1, length).permute(0, 1, 3, 2) + + unnormalized_widths = hidden_states[..., : self.num_bins] / math.sqrt(self.filter_channels) + unnormalized_heights = hidden_states[..., self.num_bins : 2 * self.num_bins] / math.sqrt(self.filter_channels) + unnormalized_derivatives = hidden_states[..., 2 * self.num_bins :] + + second_half, log_abs_det = _unconstrained_rational_quadratic_spline( + second_half, + unnormalized_widths, + unnormalized_heights, + unnormalized_derivatives, + reverse=reverse, + tail_bound=self.tail_bound, + ) + + outputs = torch.cat([first_half, second_half], dim=1) * padding_mask + if not reverse: + log_determinant = torch.sum(log_abs_det * padding_mask, [1, 2]) + return outputs, log_determinant + else: + return outputs, None + + +class VitsElementwiseAffine(nn.Module): + 
def __init__(self, config: VitsConfig): + super().__init__() + self.channels = config.depth_separable_channels + self.translate = nn.Parameter(torch.zeros(self.channels, 1)) + self.log_scale = nn.Parameter(torch.zeros(self.channels, 1)) + + def forward(self, inputs, padding_mask, global_conditioning=None, reverse=False): + if not reverse: + outputs = self.translate + torch.exp(self.log_scale) * inputs + outputs = outputs * padding_mask + log_determinant = torch.sum(self.log_scale * padding_mask, [1, 2]) + return outputs, log_determinant + else: + outputs = (inputs - self.translate) * torch.exp(-self.log_scale) * padding_mask + return outputs, None + + +class VitsStochasticDurationPredictor(nn.Module): + def __init__(self, config): + super().__init__() + embed_dim = config.speaker_embedding_size + filter_channels = config.hidden_size + + self.conv_pre = nn.Conv1d(filter_channels, filter_channels, 1) + self.conv_proj = nn.Conv1d(filter_channels, filter_channels, 1) + self.conv_dds = VitsDilatedDepthSeparableConv( + config, + dropout_rate=config.duration_predictor_dropout, + ) + + if embed_dim != 0: + self.cond = nn.Conv1d(embed_dim, filter_channels, 1) + + self.flows = nn.ModuleList() + self.flows.append(VitsElementwiseAffine(config)) + for _ in range(config.duration_predictor_num_flows): + self.flows.append(VitsConvFlow(config)) + + self.post_conv_pre = nn.Conv1d(1, filter_channels, 1) + self.post_conv_proj = nn.Conv1d(filter_channels, filter_channels, 1) + self.post_conv_dds = VitsDilatedDepthSeparableConv( + config, + dropout_rate=config.duration_predictor_dropout, + ) + + self.post_flows = nn.ModuleList() + self.post_flows.append(VitsElementwiseAffine(config)) + for _ in range(config.duration_predictor_num_flows): + self.post_flows.append(VitsConvFlow(config)) + + def forward(self, inputs, padding_mask, global_conditioning=None, durations=None, reverse=False, noise_scale=1.0): + inputs = torch.detach(inputs) + inputs = self.conv_pre(inputs) + + if 
global_conditioning is not None: + global_conditioning = torch.detach(global_conditioning) + inputs = inputs + self.cond(global_conditioning) + + inputs = self.conv_dds(inputs, padding_mask) + inputs = self.conv_proj(inputs) * padding_mask + + if not reverse: + hidden_states = self.post_conv_pre(durations) + hidden_states = self.post_conv_dds(hidden_states, padding_mask) + hidden_states = self.post_conv_proj(hidden_states) * padding_mask + + random_posterior = ( + torch.randn(durations.size(0), 2, durations.size(2)).to(device=inputs.device, dtype=inputs.dtype) + * padding_mask + ) + log_determinant_posterior_sum = 0 + latents_posterior = random_posterior + for flow in self.post_flows: + latents_posterior, log_determinant = flow( + latents_posterior, padding_mask, global_conditioning=inputs + hidden_states + ) + latents_posterior = torch.flip(latents_posterior, [1]) + log_determinant_posterior_sum += log_determinant + + first_half, second_half = torch.split(latents_posterior, [1, 1], dim=1) + + log_determinant_posterior_sum += torch.sum( + (nn.functional.logsigmoid(first_half) + nn.functional.logsigmoid(-first_half)) * padding_mask, [1, 2] + ) + logq = ( + torch.sum(-0.5 * (math.log(2 * math.pi) + (random_posterior**2)) * padding_mask, [1, 2]) + - log_determinant_posterior_sum + ) + + first_half = (durations - torch.sigmoid(first_half)) * padding_mask + first_half = torch.log(torch.clamp_min(first_half, 1e-5)) * padding_mask + log_determinant_sum = torch.sum(-first_half, [1, 2]) + + latents = torch.cat([first_half, second_half], dim=1) + for flow in self.flows: + latents, log_determinant = flow(latents, padding_mask, global_conditioning=inputs) + latents = torch.flip(latents, [1]) + log_determinant_sum += log_determinant + + nll = ( + torch.sum(0.5 * (math.log(2 * math.pi) + (latents**2)) * padding_mask, [1, 2]) - log_determinant_sum + ) + return nll + logq + else: + flows = list(reversed(self.flows)) + flows = flows[:-2] + [flows[-1]] # remove a useless vflow + + 
latents = ( + torch.randn(inputs.size(0), 2, inputs.size(2)).to(device=inputs.device, dtype=inputs.dtype) + * noise_scale + ) + for flow in flows: + latents = torch.flip(latents, [1]) + latents, _ = flow(latents, padding_mask, global_conditioning=inputs, reverse=True) + + log_duration, _ = torch.split(latents, [1, 1], dim=1) + return log_duration + + +class VitsDurationPredictor(nn.Module): + def __init__(self, config): + super().__init__() + kernel_size = config.duration_predictor_kernel_size + filter_channels = config.duration_predictor_filter_channels + + self.dropout = nn.Dropout(config.duration_predictor_dropout) + self.conv_1 = nn.Conv1d(config.hidden_size, filter_channels, kernel_size, padding=kernel_size // 2) + self.norm_1 = nn.LayerNorm(filter_channels, eps=config.layer_norm_eps) + self.conv_2 = nn.Conv1d(filter_channels, filter_channels, kernel_size, padding=kernel_size // 2) + self.norm_2 = nn.LayerNorm(filter_channels, eps=config.layer_norm_eps) + self.proj = nn.Conv1d(filter_channels, 1, 1) + + if config.speaker_embedding_size != 0: + self.cond = nn.Conv1d(config.speaker_embedding_size, config.hidden_size, 1) + + def forward(self, inputs, padding_mask, global_conditioning=None): + inputs = torch.detach(inputs) + + if global_conditioning is not None: + global_conditioning = torch.detach(global_conditioning) + inputs = inputs + self.cond(global_conditioning) + + inputs = self.conv_1(inputs * padding_mask) + inputs = torch.relu(inputs) + inputs = self.norm_1(inputs.transpose(1, -1)).transpose(1, -1) + inputs = self.dropout(inputs) + + inputs = self.conv_2(inputs * padding_mask) + inputs = torch.relu(inputs) + inputs = self.norm_2(inputs.transpose(1, -1)).transpose(1, -1) + inputs = self.dropout(inputs) + + inputs = self.proj(inputs * padding_mask) + return inputs * padding_mask + + +class VitsAttention(nn.Module): + """Multi-headed attention with relative positional representation.""" + + def __init__(self, config: VitsConfig): + super().__init__() + 
self.embed_dim = config.hidden_size + self.num_heads = config.num_attention_heads + self.dropout = config.attention_dropout + self.window_size = config.window_size + + self.head_dim = self.embed_dim // self.num_heads + self.scaling = self.head_dim**-0.5 + + if (self.head_dim * self.num_heads) != self.embed_dim: + raise ValueError( + f"hidden_size must be divisible by num_attention_heads (got `hidden_size`: {self.embed_dim}" + f" and `num_attention_heads`: {self.num_heads})." + ) + + self.k_proj = nn.Linear(self.embed_dim, self.embed_dim, bias=config.use_bias) + self.v_proj = nn.Linear(self.embed_dim, self.embed_dim, bias=config.use_bias) + self.q_proj = nn.Linear(self.embed_dim, self.embed_dim, bias=config.use_bias) + self.out_proj = nn.Linear(self.embed_dim, self.embed_dim, bias=config.use_bias) + + if self.window_size: + self.emb_rel_k = nn.Parameter(torch.randn(1, self.window_size * 2 + 1, self.head_dim) * self.scaling) + self.emb_rel_v = nn.Parameter(torch.randn(1, self.window_size * 2 + 1, self.head_dim) * self.scaling) + + def _shape(self, tensor: torch.Tensor, seq_len: int, bsz: int): + return tensor.view(bsz, seq_len, self.num_heads, self.head_dim).transpose(1, 2).contiguous() + + def forward( + self, + hidden_states: torch.Tensor, + key_value_states: Optional[torch.Tensor] = None, + attention_mask: Optional[torch.Tensor] = None, + layer_head_mask: Optional[torch.Tensor] = None, + output_attentions: bool = False, + ) -> Tuple[torch.Tensor, Optional[torch.Tensor]]: + """Input shape: Batch x Time x Channel""" + + # if key_value_states are provided this layer is used as a cross-attention layer + # for the decoder + + bsz, tgt_len, _ = hidden_states.size() + + # get query proj + query_states = self.q_proj(hidden_states) * self.scaling + + # self_attention + key_states = self._shape(self.k_proj(hidden_states), -1, bsz) + value_states = self._shape(self.v_proj(hidden_states), -1, bsz) + + proj_shape = (bsz * self.num_heads, -1, self.head_dim) + query_states = 
self._shape(query_states, tgt_len, bsz).view(*proj_shape) + key_states = key_states.view(*proj_shape) + value_states = value_states.view(*proj_shape) + + src_len = key_states.size(1) + attn_weights = torch.bmm(query_states, key_states.transpose(1, 2)) + + if attn_weights.size() != (bsz * self.num_heads, tgt_len, src_len): + raise ValueError( + f"Attention weights should be of size {(bsz * self.num_heads, tgt_len, src_len)}, but is" + f" {attn_weights.size()}" + ) + + if self.window_size is not None: + key_relative_embeddings = self._get_relative_embeddings(self.emb_rel_k, src_len) + relative_logits = torch.matmul(query_states, key_relative_embeddings.transpose(-2, -1)) + rel_pos_bias = self._relative_position_to_absolute_position(relative_logits) + attn_weights += rel_pos_bias + + if attention_mask is not None: + if attention_mask.size() != (bsz, 1, tgt_len, src_len): + raise ValueError( + f"Attention mask should be of size {(bsz, 1, tgt_len, src_len)}, but is {attention_mask.size()}" + ) + attn_weights = attn_weights.view(bsz, self.num_heads, tgt_len, src_len) + attention_mask + attn_weights = attn_weights.view(bsz * self.num_heads, tgt_len, src_len) + + attn_weights = nn.functional.softmax(attn_weights, dim=-1) + + if layer_head_mask is not None: + if layer_head_mask.size() != (self.num_heads,): + raise ValueError( + f"Head mask for a single layer should be of size {(self.num_heads,)}, but is" + f" {layer_head_mask.size()}" + ) + attn_weights = layer_head_mask.view(1, -1, 1, 1) * attn_weights.view(bsz, self.num_heads, tgt_len, src_len) + attn_weights = attn_weights.view(bsz * self.num_heads, tgt_len, src_len) + + if output_attentions: + # this operation is a bit awkward, but it's required to + # make sure that attn_weights keeps its gradient. 
+ # In order to do so, attn_weights have to be reshaped + # twice and have to be reused in the following + attn_weights_reshaped = attn_weights.view(bsz, self.num_heads, tgt_len, src_len) + attn_weights = attn_weights_reshaped.view(bsz * self.num_heads, tgt_len, src_len) + else: + attn_weights_reshaped = None + + attn_probs = nn.functional.dropout(attn_weights, p=self.dropout, training=self.training) + + attn_output = torch.bmm(attn_probs, value_states) + + if attn_output.size() != (bsz * self.num_heads, tgt_len, self.head_dim): + raise ValueError( + f"`attn_output` should be of size {(bsz * self.num_heads, tgt_len, self.head_dim)}, but is" + f" {attn_output.size()}" + ) + + if self.window_size is not None: + value_relative_embeddings = self._get_relative_embeddings(self.emb_rel_v, src_len) + relative_weights = self._absolute_position_to_relative_position(attn_probs) + rel_pos_bias = torch.matmul(relative_weights, value_relative_embeddings) + attn_output += rel_pos_bias + + attn_output = attn_output.view(bsz, self.num_heads, tgt_len, self.head_dim) + attn_output = attn_output.transpose(1, 2) + + # Use the `embed_dim` from the config (stored in the class) rather than `hidden_states` because `attn_output` can be + # partitioned across GPUs when using tensor-parallelism.
+ attn_output = attn_output.reshape(bsz, tgt_len, self.embed_dim) + + attn_output = self.out_proj(attn_output) + + return attn_output, attn_weights_reshaped + + def _get_relative_embeddings(self, relative_embeddings, length): + pad_length = max(length - (self.window_size + 1), 0) + if pad_length > 0: + relative_embeddings = nn.functional.pad(relative_embeddings, [0, 0, pad_length, pad_length, 0, 0]) + + slice_start_position = max((self.window_size + 1) - length, 0) + slice_end_position = slice_start_position + 2 * length - 1 + return relative_embeddings[:, slice_start_position:slice_end_position] + + def _relative_position_to_absolute_position(self, x): + batch_heads, length, _ = x.size() + + # Concat columns of pad to shift from relative to absolute indexing. + x = nn.functional.pad(x, [0, 1, 0, 0, 0, 0]) + + # Concat extra elements so to add up to shape (len+1, 2*len-1). + x_flat = x.view([batch_heads, length * 2 * length]) + x_flat = nn.functional.pad(x_flat, [0, length - 1, 0, 0]) + + # Reshape and slice out the padded elements. 
+ x_final = x_flat.view([batch_heads, length + 1, 2 * length - 1]) + x_final = x_final[:, :length, length - 1 :] + return x_final + + def _absolute_position_to_relative_position(self, x): + batch_heads, length, _ = x.size() + + # Pad along column + x = nn.functional.pad(x, [0, length - 1, 0, 0, 0, 0]) + x_flat = x.view([batch_heads, length**2 + length * (length - 1)]) + + # Add 0's in the beginning that will skew the elements after reshape + x_flat = nn.functional.pad(x_flat, [length, 0, 0, 0]) + x_final = x_flat.view([batch_heads, length, 2 * length])[:, :, 1:] + return x_final + + +class VitsFeedForward(nn.Module): + def __init__(self, config): + super().__init__() + self.conv_1 = nn.Conv1d(config.hidden_size, config.ffn_dim, config.ffn_kernel_size) + self.conv_2 = nn.Conv1d(config.ffn_dim, config.hidden_size, config.ffn_kernel_size) + self.dropout = nn.Dropout(config.activation_dropout) + + if isinstance(config.hidden_act, str): + self.act_fn = ACT2FN[config.hidden_act] + else: + self.act_fn = config.hidden_act + + if config.ffn_kernel_size > 1: + pad_left = (config.ffn_kernel_size - 1) // 2 + pad_right = config.ffn_kernel_size // 2 + self.padding = [pad_left, pad_right, 0, 0, 0, 0] + else: + self.padding = None + + def forward(self, hidden_states, padding_mask): + hidden_states = hidden_states.permute(0, 2, 1) + padding_mask = padding_mask.permute(0, 2, 1) + + hidden_states = hidden_states * padding_mask + if self.padding is not None: + hidden_states = nn.functional.pad(hidden_states, self.padding) + + hidden_states = self.conv_1(hidden_states) + hidden_states = self.act_fn(hidden_states) + hidden_states = self.dropout(hidden_states) + + hidden_states = hidden_states * padding_mask + if self.padding is not None: + hidden_states = nn.functional.pad(hidden_states, self.padding) + + hidden_states = self.conv_2(hidden_states) + hidden_states = hidden_states * padding_mask + + hidden_states = hidden_states.permute(0, 2, 1) + return hidden_states + + +class 
VitsEncoderLayer(nn.Module): + def __init__(self, config: VitsConfig): + super().__init__() + self.attention = VitsAttention(config) + self.dropout = nn.Dropout(config.hidden_dropout) + self.layer_norm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps) + self.feed_forward = VitsFeedForward(config) + self.final_layer_norm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps) + + def forward( + self, + hidden_states: torch.Tensor, + padding_mask: torch.FloatTensor, + attention_mask: Optional[torch.Tensor] = None, + output_attentions: bool = False, + ): + residual = hidden_states + hidden_states, attn_weights = self.attention( + hidden_states=hidden_states, + attention_mask=attention_mask, + output_attentions=output_attentions, + ) + + hidden_states = self.dropout(hidden_states) + hidden_states = self.layer_norm(residual + hidden_states) + + residual = hidden_states + hidden_states = self.feed_forward(hidden_states, padding_mask) + hidden_states = self.dropout(hidden_states) + hidden_states = self.final_layer_norm(residual + hidden_states) + + outputs = (hidden_states,) + + if output_attentions: + outputs += (attn_weights,) + + return outputs + + +class VitsEncoder(nn.Module): + def __init__(self, config: VitsConfig): + super().__init__() + self.config = config + self.layers = nn.ModuleList([VitsEncoderLayer(config) for _ in range(config.num_hidden_layers)]) + self.gradient_checkpointing = False + self.layerdrop = config.layerdrop + + def forward( + self, + hidden_states: torch.FloatTensor, + padding_mask: torch.FloatTensor, + attention_mask: Optional[torch.Tensor] = None, + output_attentions: Optional[bool] = None, + output_hidden_states: Optional[bool] = None, + return_dict: Optional[bool] = None, + ) -> Union[Tuple, BaseModelOutput]: + all_hidden_states = () if output_hidden_states else None + all_self_attentions = () if output_attentions else None + + # expand attention_mask + if attention_mask is not None: + # [bsz, seq_len] -> [bsz, 1, 
tgt_seq_len, src_seq_len] + attention_mask = _expand_mask(attention_mask, hidden_states.dtype) + + hidden_states = hidden_states * padding_mask + + deepspeed_zero3_is_enabled = is_deepspeed_zero3_enabled() + + for encoder_layer in self.layers: + if output_hidden_states: + all_hidden_states = all_hidden_states + (hidden_states,) + + # add LayerDrop (see https://arxiv.org/abs/1909.11556 for description) + dropout_probability = np.random.uniform(0, 1) + + skip_the_layer = self.training and (dropout_probability < self.layerdrop) + if not skip_the_layer or deepspeed_zero3_is_enabled: + # under deepspeed zero3 all gpus must run in sync + if self.gradient_checkpointing and self.training: + # create gradient checkpointing function + def create_custom_forward(module): + def custom_forward(*inputs): + return module(*inputs, output_attentions) + + return custom_forward + + layer_outputs = torch.utils.checkpoint.checkpoint( + create_custom_forward(encoder_layer), + hidden_states, + padding_mask, + attention_mask, + ) + else: + layer_outputs = encoder_layer( + hidden_states, + attention_mask=attention_mask, + padding_mask=padding_mask, + output_attentions=output_attentions, + ) + hidden_states = layer_outputs[0] + + if skip_the_layer: + layer_outputs = (None, None) + + if output_attentions: + all_self_attentions = all_self_attentions + (layer_outputs[1],) + + hidden_states = hidden_states * padding_mask + + if output_hidden_states: + all_hidden_states = all_hidden_states + (hidden_states,) + + if not return_dict: + return tuple(v for v in [hidden_states, all_hidden_states, all_self_attentions] if v is not None) + + return BaseModelOutput( + last_hidden_state=hidden_states, + hidden_states=all_hidden_states, + attentions=all_self_attentions, + ) + + +class VitsTextEncoder(nn.Module): + """ + Transformer encoder that uses relative positional representation instead of absolute positional encoding. 
+ """ + + def __init__(self, config: VitsConfig): + super().__init__() + self.config = config + self.embed_tokens = nn.Embedding(config.vocab_size, config.hidden_size, config.pad_token_id) + self.encoder = VitsEncoder(config) + self.project = nn.Conv1d(config.hidden_size, config.flow_size * 2, kernel_size=1) + + def get_input_embeddings(self): + return self.embed_tokens + + def set_input_embeddings(self, value): + self.embed_tokens = value + + def forward( + self, + input_ids: torch.Tensor, + padding_mask: torch.FloatTensor, + attention_mask: Optional[torch.Tensor] = None, + output_attentions: Optional[bool] = None, + output_hidden_states: Optional[bool] = None, + return_dict: Optional[bool] = True, + ) -> Union[Tuple[torch.Tensor], VitsTextEncoderOutput]: + hidden_states = self.embed_tokens(input_ids) * math.sqrt(self.config.hidden_size) + + encoder_outputs = self.encoder( + hidden_states=hidden_states, + padding_mask=padding_mask, + attention_mask=attention_mask, + output_attentions=output_attentions, + output_hidden_states=output_hidden_states, + return_dict=return_dict, + ) + + last_hidden_state = encoder_outputs[0] if not return_dict else encoder_outputs.last_hidden_state + + stats = self.project(last_hidden_state.transpose(1, 2)).transpose(1, 2) * padding_mask + prior_means, prior_log_variances = torch.split(stats, self.config.flow_size, dim=2) + + if not return_dict: + outputs = (last_hidden_state, prior_means, prior_log_variances) + encoder_outputs[1:] + return outputs + + return VitsTextEncoderOutput( + last_hidden_state=last_hidden_state, + prior_means=prior_means, + prior_log_variances=prior_log_variances, + hidden_states=encoder_outputs.hidden_states, + attentions=encoder_outputs.attentions, + ) + + +class VitsPreTrainedModel(PreTrainedModel): + """ + An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained + models. 
+ """ + + config_class = VitsConfig + base_model_prefix = "vits" + main_input_name = "input_ids" + supports_gradient_checkpointing = True + + def _init_weights(self, module): + """Initialize the weights""" + if isinstance(module, nn.Linear): + module.weight.data.normal_(mean=0.0, std=self.config.initializer_range) + if module.bias is not None: + module.bias.data.zero_() + elif isinstance(module, nn.LayerNorm): + module.bias.data.zero_() + module.weight.data.fill_(1.0) + elif isinstance(module, nn.Conv1d): + nn.init.kaiming_normal_(module.weight) + if module.bias is not None: + k = math.sqrt(module.groups / (module.in_channels * module.kernel_size[0])) + nn.init.uniform_(module.bias, a=-k, b=k) + elif isinstance(module, nn.Embedding): + module.weight.data.normal_(mean=0.0, std=self.config.initializer_range) + if module.padding_idx is not None: + module.weight.data[module.padding_idx].zero_() + + def _set_gradient_checkpointing(self, module, value=False): + # `gradient_checkpointing` is defined and read on `VitsEncoder`, so the flag must be set there + if isinstance(module, VitsEncoder): + module.gradient_checkpointing = value + + +VITS_START_DOCSTRING = r""" + This model inherits from [`PreTrainedModel`]. Check the superclass documentation for the generic methods the + library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, + etc.) + + This model is also a PyTorch [torch.nn.Module](https://pytorch.org/docs/stable/nn.html#torch.nn.Module) subclass. + Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage + and behavior. + + Parameters: + config ([`VitsConfig`]): + Model configuration class with all the parameters of the model. Initializing with a config file does not + load the weights associated with the model, only the configuration. Check out the + [`~PreTrainedModel.from_pretrained`] method to load the model weights.
+""" + + +VITS_INPUTS_DOCSTRING = r""" + Args: + input_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`): + Indices of input sequence tokens in the vocabulary. Padding will be ignored by default should you provide + it. + + Indices can be obtained using [`AutoTokenizer`]. See [`PreTrainedTokenizer.encode`] and + [`PreTrainedTokenizer.__call__`] for details. + + [What are input IDs?](../glossary#input-ids) + attention_mask (`torch.Tensor` of shape `(batch_size, sequence_length)`, *optional*): + Mask to avoid performing convolution and attention on padding token indices. Mask values selected in `[0, + 1]`: + + - 1 for tokens that are **not masked**, + - 0 for tokens that are **masked**. + + [What are attention masks?](../glossary#attention-mask) + speaker_id (`int`, *optional*): + Which speaker embedding to use. Only used for multispeaker models. + output_attentions (`bool`, *optional*): + Whether or not to return the attentions tensors of all attention layers. See `attentions` under returned + tensors for more detail. + output_hidden_states (`bool`, *optional*): + Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors for + more detail. + return_dict (`bool`, *optional*): + Whether or not to return a [`~utils.ModelOutput`] instead of a plain tuple. 
+""" + + +@add_start_docstrings( + "The complete VITS model, for text-to-speech synthesis.", + VITS_START_DOCSTRING, +) +class VitsModel(VitsPreTrainedModel): + def __init__(self, config: VitsConfig): + super().__init__(config) + self.config = config + self.text_encoder = VitsTextEncoder(config) + self.flow = VitsResidualCouplingBlock(config) + self.decoder = VitsHifiGan(config) + + if config.use_stochastic_duration_prediction: + self.duration_predictor = VitsStochasticDurationPredictor(config) + else: + self.duration_predictor = VitsDurationPredictor(config) + + if config.num_speakers > 1: + self.embed_speaker = nn.Embedding(config.num_speakers, config.speaker_embedding_size) + + # This is used only for training. + self.posterior_encoder = VitsPosteriorEncoder(config) + + # These parameters control the synthesised speech properties + self.speaking_rate = config.speaking_rate + self.noise_scale = config.noise_scale + self.noise_scale_duration = config.noise_scale_duration + + # Initialize weights and apply final processing + self.post_init() + + def get_encoder(self): + return self.text_encoder + + @add_start_docstrings_to_model_forward(VITS_INPUTS_DOCSTRING) + @replace_return_docstrings(output_type=VitsModelOutput, config_class=_CONFIG_FOR_DOC) + def forward( + self, + input_ids: Optional[torch.Tensor] = None, + attention_mask: Optional[torch.Tensor] = None, + speaker_id: Optional[int] = None, + output_attentions: Optional[bool] = None, + output_hidden_states: Optional[bool] = None, + return_dict: Optional[bool] = None, + labels: Optional[torch.FloatTensor] = None, + ) -> Union[Tuple[Any], VitsModelOutput]: + r""" + labels (`torch.FloatTensor` of shape `(batch_size, config.spectrogram_bins, sequence_length)`, *optional*): + Float values of target spectrogram. Timesteps set to `-100.0` are ignored (masked) for the loss + computation. 
+ + Returns: + + Example: + + ```python + >>> from transformers import VitsTokenizer, VitsModel, set_seed + >>> import torch + + >>> tokenizer = VitsTokenizer.from_pretrained("facebook/mms-tts-eng") + >>> model = VitsModel.from_pretrained("facebook/mms-tts-eng") + + >>> inputs = tokenizer(text="Hello - my dog is cute", return_tensors="pt") + + >>> set_seed(555) # make deterministic + + >>> with torch.no_grad(): + ... outputs = model(inputs["input_ids"]) + >>> outputs.waveform.shape + torch.Size([1, 45824]) + ``` + """ + output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions + output_hidden_states = ( + output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states + ) + return_dict = return_dict if return_dict is not None else self.config.use_return_dict + + if attention_mask is not None: + input_padding_mask = attention_mask.unsqueeze(-1).float() + else: + input_padding_mask = torch.ones_like(input_ids).unsqueeze(-1).float() + + if self.config.num_speakers > 1 and speaker_id is not None: + if not 0 <= speaker_id < self.config.num_speakers: + raise ValueError(f"Set `speaker_id` in the range 0-{self.config.num_speakers - 1}.") + # create the index tensor on the same device as the inputs so the embedding lookup also works off-CPU + speaker_embeddings = self.embed_speaker(torch.tensor([speaker_id], device=input_ids.device)).unsqueeze(-1) + else: + speaker_embeddings = None + + if labels is not None: + raise NotImplementedError("Training of VITS is not supported yet.") + + text_encoder_output = self.text_encoder( + input_ids=input_ids, + padding_mask=input_padding_mask, + attention_mask=attention_mask, + output_attentions=output_attentions, + output_hidden_states=output_hidden_states, + return_dict=return_dict, + ) + hidden_states = text_encoder_output[0] if not return_dict else text_encoder_output.last_hidden_state + hidden_states = hidden_states.transpose(1, 2) + input_padding_mask = input_padding_mask.transpose(1, 2) + prior_means = text_encoder_output[1] if not return_dict else text_encoder_output.prior_means
+ prior_log_variances = text_encoder_output[2] if not return_dict else text_encoder_output.prior_log_variances + + if self.config.use_stochastic_duration_prediction: + log_duration = self.duration_predictor( + hidden_states, + input_padding_mask, + speaker_embeddings, + reverse=True, + noise_scale=self.noise_scale_duration, + ) + else: + log_duration = self.duration_predictor(hidden_states, input_padding_mask, speaker_embeddings) + + length_scale = 1.0 / self.speaking_rate + duration = torch.ceil(torch.exp(log_duration) * input_padding_mask * length_scale) + predicted_lengths = torch.clamp_min(torch.sum(duration, [1, 2]), 1).long() + + # Create a padding mask for the output lengths of shape (batch, 1, max_output_length) + indices = torch.arange(predicted_lengths.max(), dtype=predicted_lengths.dtype, device=predicted_lengths.device) + output_padding_mask = indices.unsqueeze(0) < predicted_lengths.unsqueeze(1) + output_padding_mask = output_padding_mask.unsqueeze(1).to(input_padding_mask.dtype) + + # Reconstruct an attention tensor of shape (batch, 1, out_length, in_length) + attn_mask = torch.unsqueeze(input_padding_mask, 2) * torch.unsqueeze(output_padding_mask, -1) + batch_size, _, output_length, input_length = attn_mask.shape + cum_duration = torch.cumsum(duration, -1).view(batch_size * input_length, 1) + indices = torch.arange(output_length, dtype=duration.dtype, device=duration.device) + valid_indices = indices.unsqueeze(0) < cum_duration + valid_indices = valid_indices.to(attn_mask.dtype).view(batch_size, input_length, output_length) + padded_indices = valid_indices - nn.functional.pad(valid_indices, [0, 0, 1, 0, 0, 0])[:, :-1] + attn = padded_indices.unsqueeze(1).transpose(2, 3) * attn_mask + + # Expand prior distribution + prior_means = torch.matmul(attn.squeeze(1), prior_means).transpose(1, 2) + prior_log_variances = torch.matmul(attn.squeeze(1), prior_log_variances).transpose(1, 2) + + prior_latents = prior_means + torch.randn_like(prior_means) * 
torch.exp(prior_log_variances) * self.noise_scale + latents = self.flow(prior_latents, output_padding_mask, speaker_embeddings, reverse=True) + + spectrogram = latents * output_padding_mask + waveform = self.decoder(spectrogram, speaker_embeddings) + waveform = waveform.squeeze(1) + sequence_lengths = predicted_lengths * np.prod(self.config.upsample_rates) + + if not return_dict: + outputs = (waveform, sequence_lengths, spectrogram) + text_encoder_output[3:] + return outputs + + return VitsModelOutput( + waveform=waveform, + sequence_lengths=sequence_lengths, + spectrogram=spectrogram, + hidden_states=text_encoder_output.hidden_states, + attentions=text_encoder_output.attentions, + ) diff --git a/src/transformers/models/vits/tokenization_vits.py b/src/transformers/models/vits/tokenization_vits.py new file mode 100644 index 0000000000..eb973ec7d0 --- /dev/null +++ b/src/transformers/models/vits/tokenization_vits.py @@ -0,0 +1,249 @@ +# coding=utf-8 +# Copyright 2023 The Kakao Enterprise Authors, the MMS-TTS Authors and the HuggingFace Inc. team. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+"""Tokenization class for VITS.""" + + +import json +import os +import re +from typing import Any, Dict, List, Optional, Tuple, Union + +from ...tokenization_utils import PreTrainedTokenizer +from ...utils import is_phonemizer_available, logging + + +if is_phonemizer_available(): + import phonemizer + + +logger = logging.get_logger(__name__) + +VOCAB_FILES_NAMES = {"vocab_file": "vocab.json"} + +PRETRAINED_VOCAB_FILES_MAP = { + "vocab_file": { + "facebook/mms-tts-eng": "https://huggingface.co/facebook/mms-tts-eng/resolve/main/vocab.json", + } +} + +PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = { + # This model does not have a maximum input length. + "facebook/mms-tts-eng": 4096, +} + + +def has_non_roman_characters(input_string): + # Find any character outside the ASCII range + non_roman_pattern = re.compile(r"[^\x00-\x7F]") + + # Search the input string for non-Roman characters + match = non_roman_pattern.search(input_string) + has_non_roman = match is not None + return has_non_roman + + +class VitsTokenizer(PreTrainedTokenizer): + """ + Construct a VITS tokenizer. Also supports MMS-TTS. + + This tokenizer inherits from [`PreTrainedTokenizer`] which contains most of the main methods. Users should refer to + this superclass for more information regarding those methods. + + Args: + vocab_file (`str`): + Path to the vocabulary file. + language (`str`, *optional*): + Language identifier. + add_blank (`bool`, *optional*, defaults to `True`): + Whether to insert token id 0 in between the other tokens. + normalize (`bool`, *optional*, defaults to `True`): + Whether to normalize the input text by removing all casing and punctuation. + phonemize (`bool`, *optional*, defaults to `True`): + Whether to convert the input text into phonemes. + is_uroman (`bool`, *optional*, defaults to `False`): + Whether the `uroman` Romanizer needs to be applied to the input text prior to tokenizing. 
+ """ + + vocab_files_names = VOCAB_FILES_NAMES + pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP + max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES + model_input_names = ["input_ids", "attention_mask"] + + def __init__( + self, + vocab_file, + pad_token="<pad>", + unk_token="<unk>", + language=None, + add_blank=True, + normalize=True, + phonemize=True, + is_uroman=False, + **kwargs, + ) -> None: + super().__init__( + pad_token=pad_token, + unk_token=unk_token, + language=language, + add_blank=add_blank, + normalize=normalize, + phonemize=phonemize, + is_uroman=is_uroman, + **kwargs, + ) + + with open(vocab_file, encoding="utf-8") as vocab_handle: + self.encoder = json.load(vocab_handle) + + self.decoder = {v: k for k, v in self.encoder.items()} + self.language = language + self.add_blank = add_blank + self.normalize = normalize + self.phonemize = phonemize + + self.is_uroman = is_uroman + + @property + def vocab_size(self): + return len(self.encoder) + + def get_vocab(self): + vocab = {self.convert_ids_to_tokens(i): i for i in range(self.vocab_size)} + return vocab + + def normalize_text(self, input_string): + """Lowercase the input string, respecting any special tokens that may be partly or entirely upper-cased.""" + all_vocabulary = list(self.encoder.keys()) + list(self.added_tokens_encoder.keys()) + filtered_text = "" + + i = 0 + while i < len(input_string): + found_match = False + for word in all_vocabulary: + if input_string[i : i + len(word)] == word: + filtered_text += word + i += len(word) + found_match = True + break + + if not found_match: + filtered_text += input_string[i].lower() + i += 1 + + return filtered_text + + def _preprocess_char(self, text): + """Special treatment of characters in certain languages""" + if self.language == "ron": + text = text.replace("ț", "ţ") + return text + + def prepare_for_tokenization( + self, text: str, is_split_into_words: bool = False, normalize: Optional[bool] = None, **kwargs + ) -> Tuple[str, Dict[str,
Any]]: + """ + Performs any necessary transformations before tokenization. + + This method should pop the arguments from kwargs and return the remaining `kwargs` as well. We test the + `kwargs` at the end of the encoding process to be sure all the arguments have been used. + + Args: + text (`str`): + The text to prepare. + is_split_into_words (`bool`, *optional*, defaults to `False`): + Whether or not the input is already pre-tokenized (e.g., split into words). If set to `True`, the + tokenizer assumes the input is already split into words (for instance, by splitting it on whitespace) + which it will tokenize. + normalize (`bool`, *optional*, defaults to `None`): + Whether or not to apply punctuation and casing normalization to the text inputs. Typically, VITS is + trained on lower-cased and un-punctuated text. Hence, normalization is used to ensure that the input + text consists only of lower-case characters. + kwargs (`Dict[str, Any]`, *optional*): + Keyword arguments to use for the tokenization. + + Returns: + `Tuple[str, Dict[str, Any]]`: The prepared text and the unused kwargs. + """ + normalize = normalize if normalize is not None else self.normalize + + if normalize: + # normalise for casing + text = self.normalize_text(text) + + filtered_text = self._preprocess_char(text) + + if has_non_roman_characters(filtered_text): + logger.warning( + "Text to the tokenizer contains non-Roman characters. Ensure the `uroman` Romanizer is " + "applied to the text prior to passing it to the tokenizer. See " + "`https://github.com/isi-nlp/uroman` for details." 
+ ) + + if self.phonemize: + if not is_phonemizer_available(): + raise ImportError("Please install the `phonemizer` Python package to use this tokenizer.") + + filtered_text = phonemizer.phonemize( + filtered_text, + language="en-us", + backend="espeak", + strip=True, + preserve_punctuation=True, + with_stress=True, + ) + filtered_text = re.sub(r"\s+", " ", filtered_text) + elif normalize: + # strip any chars outside of the vocab (punctuation) + filtered_text = "".join(list(filter(lambda char: char in self.encoder, filtered_text))).strip() + + return filtered_text, kwargs + + def _tokenize(self, text: str) -> List[str]: + """Tokenize a string by inserting the `<pad>` token at the boundary between adjacent characters.""" + tokens = list(text) + + if self.add_blank: + interspersed = [self._convert_id_to_token(0)] * (len(tokens) * 2 + 1) + interspersed[1::2] = tokens + tokens = interspersed + + return tokens + + def convert_tokens_to_string(self, tokens: List[str]) -> str: + if self.add_blank and len(tokens) > 1: + tokens = tokens[1::2] + return "".join(tokens) + + def _convert_token_to_id(self, token): + """Converts a token (str) to an id using the vocab.""" + return self.encoder.get(token, self.encoder.get(self.unk_token)) + + def _convert_id_to_token(self, index): + """Converts an index (integer) to a token (str) using the vocab.""" + return self.decoder.get(index) + + def save_vocabulary(self, save_directory: str, filename_prefix: Optional[str] = None) -> Union[Tuple[str], None]: + if not os.path.isdir(save_directory): + logger.error(f"Vocabulary path ({save_directory}) should be a directory") + return + + vocab_file = os.path.join( + save_directory, (filename_prefix + "-" if filename_prefix else "") + VOCAB_FILES_NAMES["vocab_file"] + ) + + with open(vocab_file, "w", encoding="utf-8") as f: + f.write(json.dumps(self.encoder, indent=2, sort_keys=True, ensure_ascii=False) + "\n") + + return (vocab_file,) diff --git a/src/transformers/utils/dummy_pt_objects.py
b/src/transformers/utils/dummy_pt_objects.py index 1b26285702..ab7b7c18d6 100644 --- a/src/transformers/utils/dummy_pt_objects.py +++ b/src/transformers/utils/dummy_pt_objects.py @@ -7929,6 +7929,23 @@ class VitDetPreTrainedModel(metaclass=DummyObject): requires_backends(self, ["torch"]) +VITS_PRETRAINED_MODEL_ARCHIVE_LIST = None + + +class VitsModel(metaclass=DummyObject): + _backends = ["torch"] + + def __init__(self, *args, **kwargs): + requires_backends(self, ["torch"]) + + +class VitsPreTrainedModel(metaclass=DummyObject): + _backends = ["torch"] + + def __init__(self, *args, **kwargs): + requires_backends(self, ["torch"]) + + VIVIT_PRETRAINED_MODEL_ARCHIVE_LIST = None diff --git a/tests/models/vits/__init__.py b/tests/models/vits/__init__.py new file mode 100644 index 0000000000..e69de29bb2 diff --git a/tests/models/vits/test_modeling_vits.py b/tests/models/vits/test_modeling_vits.py new file mode 100644 index 0000000000..130814765a --- /dev/null +++ b/tests/models/vits/test_modeling_vits.py @@ -0,0 +1,377 @@ +# coding=utf-8 +# Copyright 2023 The HuggingFace Inc. team. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +""" Testing suite for the PyTorch VITS model. 
""" + +import copy +import os +import tempfile +import unittest +from typing import Dict, List, Tuple + +import numpy as np + +from transformers import PretrainedConfig, VitsConfig +from transformers.testing_utils import ( + is_torch_available, + require_torch, + slow, + torch_device, +) +from transformers.trainer_utils import set_seed + +from ...test_configuration_common import ConfigTester +from ...test_modeling_common import ( + ModelTesterMixin, + global_rng, + ids_tensor, + random_attention_mask, +) + + +if is_torch_available(): + import torch + + from transformers import VitsModel, VitsTokenizer + + +CONFIG_NAME = "config.json" +GENERATION_CONFIG_NAME = "generation_config.json" + + +def _config_zero_init(config): + configs_no_init = copy.deepcopy(config) + for key in configs_no_init.__dict__.keys(): + if "_range" in key or "_std" in key or "initializer_factor" in key or "layer_scale" in key: + setattr(configs_no_init, key, 1e-10) + if isinstance(getattr(configs_no_init, key, None), PretrainedConfig): + no_init_subconfig = _config_zero_init(getattr(configs_no_init, key)) + setattr(configs_no_init, key, no_init_subconfig) + return configs_no_init + + +@require_torch +class VitsModelTester: + def __init__( + self, + parent, + batch_size=2, + seq_length=7, + is_training=False, + hidden_size=16, + num_hidden_layers=2, + num_attention_heads=2, + intermediate_size=64, + flow_size=16, + vocab_size=38, + spectrogram_bins=8, + duration_predictor_num_flows=2, + duration_predictor_filter_channels=16, + prior_encoder_num_flows=2, + upsample_initial_channel=16, + ): + self.parent = parent + self.batch_size = batch_size + self.seq_length = seq_length + self.is_training = is_training + self.hidden_size = hidden_size + self.num_hidden_layers = num_hidden_layers + self.num_attention_heads = num_attention_heads + self.intermediate_size = intermediate_size + self.flow_size = flow_size + self.vocab_size = vocab_size + self.spectrogram_bins = spectrogram_bins + 
self.duration_predictor_num_flows = duration_predictor_num_flows + self.duration_predictor_filter_channels = duration_predictor_filter_channels + self.prior_encoder_num_flows = prior_encoder_num_flows + self.upsample_initial_channel = upsample_initial_channel + + def prepare_config_and_inputs(self): + input_ids = ids_tensor([self.batch_size, self.seq_length], self.vocab_size).clamp(2) + attention_mask = random_attention_mask([self.batch_size, self.seq_length]) + + config = self.get_config() + inputs_dict = { + "input_ids": input_ids, + "attention_mask": attention_mask, + } + return config, inputs_dict + + def prepare_config_and_inputs_for_common(self): + config, inputs_dict = self.prepare_config_and_inputs() + return config, inputs_dict + + def get_config(self): + return VitsConfig( + hidden_size=self.hidden_size, + num_hidden_layers=self.num_hidden_layers, + num_attention_heads=self.num_attention_heads, + ffn_dim=self.intermediate_size, + flow_size=self.flow_size, + vocab_size=self.vocab_size, + spectrogram_bins=self.spectrogram_bins, + duration_predictor_num_flows=self.duration_predictor_num_flows, + prior_encoder_num_flows=self.prior_encoder_num_flows, + duration_predictor_filter_channels=self.duration_predictor_filter_channels, + posterior_encoder_num_wavenet_layers=self.num_hidden_layers, + upsample_initial_channel=self.upsample_initial_channel, + ) + + def create_and_check_model_forward(self, config, inputs_dict): + model = VitsModel(config=config).to(torch_device).eval() + + input_ids = inputs_dict["input_ids"] + attention_mask = inputs_dict["attention_mask"] + + result = model(input_ids, attention_mask=attention_mask) + self.parent.assertEqual(result.waveform.shape, (self.batch_size, 11008)) + + +@require_torch +class VitsModelTest(ModelTesterMixin, unittest.TestCase): + all_model_classes = (VitsModel,) if is_torch_available() else () + is_encoder_decoder = False + test_pruning = False + test_headmasking = False + test_resize_embeddings = False + 
test_head_masking = False + test_torchscript = False + has_attentions = False + + input_name = "input_ids" + + def setUp(self): + self.model_tester = VitsModelTester(self) + self.config_tester = ConfigTester(self, config_class=VitsConfig, hidden_size=37) + + def test_config(self): + self.config_tester.run_common_tests() + + def test_model_forward(self): + set_seed(12345) + global_rng.seed(12345) + config_and_inputs = self.model_tester.prepare_config_and_inputs() + self.model_tester.create_and_check_model_forward(*config_and_inputs) + + @unittest.skip("VITS is not deterministic") + def test_determinism(self): + pass + + def test_initialization(self): + config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common() + + configs_no_init = _config_zero_init(config) + for model_class in self.all_model_classes: + model = model_class(config=configs_no_init) + for name, param in model.named_parameters(): + uniform_init_parms = [ + "emb_rel_k", + "emb_rel_v", + "conv_1", + "conv_2", + "conv_pre", + "conv_post", + "conv_proj", + "conv_dds", + "project", + "wavenet.in_layers", + "wavenet.res_skip_layers", + "upsampler", + "resblocks", + ] + if param.requires_grad: + if any(x in name for x in uniform_init_parms): + self.assertTrue( + -1.0 <= ((param.data.mean() * 1e9).round() / 1e9).item() <= 1.0, + msg=f"Parameter {name} of model {model_class} seems not properly initialized", + ) + else: + self.assertIn( + ((param.data.mean() * 1e9).round() / 1e9).item(), + [0.0, 1.0], + msg=f"Parameter {name} of model {model_class} seems not properly initialized", + ) + + @unittest.skip("VITS has no inputs_embeds") + def test_inputs_embeds(self): + pass + + @unittest.skip("VITS has no input embeddings") + def test_model_common_attributes(self): + pass + + # override since the model is not deterministic, so we need to set the seed for each forward pass + def test_model_outputs_equivalence(self): + config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common() + 
+ def set_nan_tensor_to_zero(t): + t[t != t] = 0 + return t + + def check_equivalence(model, tuple_inputs, dict_inputs, additional_kwargs={}): + with torch.no_grad(): + set_seed(0) + tuple_output = model(**tuple_inputs, return_dict=False, **additional_kwargs) + set_seed(0) + dict_output = model(**dict_inputs, return_dict=True, **additional_kwargs).to_tuple() + + def recursive_check(tuple_object, dict_object): + if isinstance(tuple_object, (List, Tuple)): + for tuple_iterable_value, dict_iterable_value in zip(tuple_object, dict_object): + recursive_check(tuple_iterable_value, dict_iterable_value) + elif isinstance(tuple_object, Dict): + for tuple_iterable_value, dict_iterable_value in zip( + tuple_object.values(), dict_object.values() + ): + recursive_check(tuple_iterable_value, dict_iterable_value) + elif tuple_object is None: + return + else: + self.assertTrue( + torch.allclose( + set_nan_tensor_to_zero(tuple_object), set_nan_tensor_to_zero(dict_object), atol=1e-5 + ), + msg=( + "Tuple and dict output are not equal. Difference:" + f" {torch.max(torch.abs(tuple_object - dict_object))}. Tuple has `nan`:" + f" {torch.isnan(tuple_object).any()} and `inf`: {torch.isinf(tuple_object)}. Dict has" + f" `nan`: {torch.isnan(dict_object).any()} and `inf`: {torch.isinf(dict_object)}." 
+ ), + ) + + recursive_check(tuple_output, dict_output) + + for model_class in self.all_model_classes: + model = model_class(config) + model.to(torch_device) + model.eval() + + tuple_inputs = self._prepare_for_class(inputs_dict, model_class) + dict_inputs = self._prepare_for_class(inputs_dict, model_class) + check_equivalence(model, tuple_inputs, dict_inputs) + + tuple_inputs = self._prepare_for_class(inputs_dict, model_class, return_labels=True) + dict_inputs = self._prepare_for_class(inputs_dict, model_class, return_labels=True) + check_equivalence(model, tuple_inputs, dict_inputs) + + tuple_inputs = self._prepare_for_class(inputs_dict, model_class) + dict_inputs = self._prepare_for_class(inputs_dict, model_class) + check_equivalence(model, tuple_inputs, dict_inputs, {"output_hidden_states": True}) + + tuple_inputs = self._prepare_for_class(inputs_dict, model_class, return_labels=True) + dict_inputs = self._prepare_for_class(inputs_dict, model_class, return_labels=True) + check_equivalence(model, tuple_inputs, dict_inputs, {"output_hidden_states": True}) + + if self.has_attentions: + tuple_inputs = self._prepare_for_class(inputs_dict, model_class) + dict_inputs = self._prepare_for_class(inputs_dict, model_class) + check_equivalence(model, tuple_inputs, dict_inputs, {"output_attentions": True}) + + tuple_inputs = self._prepare_for_class(inputs_dict, model_class, return_labels=True) + dict_inputs = self._prepare_for_class(inputs_dict, model_class, return_labels=True) + check_equivalence(model, tuple_inputs, dict_inputs, {"output_attentions": True}) + + tuple_inputs = self._prepare_for_class(inputs_dict, model_class, return_labels=True) + dict_inputs = self._prepare_for_class(inputs_dict, model_class, return_labels=True) + check_equivalence( + model, tuple_inputs, dict_inputs, {"output_hidden_states": True, "output_attentions": True} + ) + + # override since the model is not deterministic, so we need to set the seed for each forward pass + def test_save_load(self): 
+ config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common() + + def check_save_load(out1, out2): + # make sure we don't have nans + out_2 = out2.cpu().numpy() + out_2[np.isnan(out_2)] = 0 + + out_1 = out1.cpu().numpy() + out_1[np.isnan(out_1)] = 0 + max_diff = np.amax(np.abs(out_1 - out_2)) + self.assertLessEqual(max_diff, 1e-5) + + for model_class in self.all_model_classes: + model = model_class(config) + model.to(torch_device) + model.eval() + with torch.no_grad(): + set_seed(0) + first = model(**self._prepare_for_class(inputs_dict, model_class))[0] + + with tempfile.TemporaryDirectory() as tmpdirname: + model.save_pretrained(tmpdirname) + + # the config file (and the generation config file, if it can generate) should be saved + self.assertTrue(os.path.exists(os.path.join(tmpdirname, CONFIG_NAME))) + self.assertEqual( + model.can_generate(), os.path.exists(os.path.join(tmpdirname, GENERATION_CONFIG_NAME)) + ) + + model = model_class.from_pretrained(tmpdirname) + model.to(torch_device) + with torch.no_grad(): + set_seed(0) + second = model(**self._prepare_for_class(inputs_dict, model_class))[0] + + if isinstance(first, tuple) and isinstance(second, tuple): + for tensor1, tensor2 in zip(first, second): + check_save_load(tensor1, tensor2) + else: + check_save_load(first, second) + + # overwrite from test_modeling_common + def _mock_init_weights(self, module): + if hasattr(module, "weight") and module.weight is not None: + module.weight.data.fill_(3) + if hasattr(module, "weight_g") and module.weight_g is not None: + module.weight_g.data.fill_(3) + if hasattr(module, "weight_v") and module.weight_v is not None: + module.weight_v.data.fill_(3) + if hasattr(module, "bias") and module.bias is not None: + module.bias.data.fill_(3) + + +@require_torch +@slow +class VitsModelIntegrationTests(unittest.TestCase): + def test_forward(self): + # GPU gives different results than CPU + torch_device = "cpu" + + model = 
VitsModel.from_pretrained("facebook/mms-tts-eng") + model.to(torch_device) + + tokenizer = VitsTokenizer.from_pretrained("facebook/mms-tts-eng") + + set_seed(555) # make deterministic + + input_text = "Mister quilter is the apostle of the middle classes and we are glad to welcome his gospel!" + input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to(torch_device) + + with torch.no_grad(): + outputs = model(input_ids) + + self.assertEqual(outputs.waveform.shape, (1, 87040)) + # fmt: off + EXPECTED_LOGITS = torch.tensor( + [ + -0.0042, 0.0176, 0.0354, 0.0504, 0.0621, 0.0777, 0.0980, 0.1224, + 0.1475, 0.1679, 0.1817, 0.1832, 0.1713, 0.1542, 0.1384, 0.1256, + 0.1147, 0.1066, 0.1026, 0.0958, 0.0823, 0.0610, 0.0340, 0.0022, + -0.0337, -0.0677, -0.0969, -0.1178, -0.1311, -0.1363 + ] + ) + # fmt: on + self.assertTrue(torch.allclose(outputs.waveform[0, 10000:10030].cpu(), EXPECTED_LOGITS, atol=1e-4)) diff --git a/tests/models/vits/test_tokenization_vits.py b/tests/models/vits/test_tokenization_vits.py new file mode 100644 index 0000000000..53a42cd36c --- /dev/null +++ b/tests/models/vits/test_tokenization_vits.py @@ -0,0 +1,187 @@ +# coding=utf-8 +# Copyright 2023 The HuggingFace Team. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+"""Tests for the VITS tokenizer.""" +import json +import os +import shutil +import tempfile +import unittest + +from transformers import VitsTokenizer +from transformers.models.vits.tokenization_vits import VOCAB_FILES_NAMES +from transformers.testing_utils import slow + +from ...test_tokenization_common import TokenizerTesterMixin + + +class VitsTokenizerTest(TokenizerTesterMixin, unittest.TestCase): + tokenizer_class = VitsTokenizer + test_rust_tokenizer = False + + def setUp(self): + super().setUp() + + vocab = ( + "k ' z y u d h e s w – 3 c p - 1 j m i X f l o 0 b r a 4 2 n _ x v t q 5 6 g ț ţ < > | <pad> <unk>".split( + " " + ) + ) + vocab_tokens = dict(zip(vocab, range(len(vocab)))) + vocab_tokens[" "] = vocab_tokens["X"] + del vocab_tokens["X"] + + self.special_tokens_map = {"pad_token": "<pad>", "unk_token": "<unk>"} + + self.tmpdirname = tempfile.mkdtemp() + self.vocab_file = os.path.join(self.tmpdirname, VOCAB_FILES_NAMES["vocab_file"]) + with open(self.vocab_file, "w", encoding="utf-8") as fp: + fp.write(json.dumps(vocab_tokens) + "\n") + + def get_tokenizer(self, **kwargs): + kwargs.update(self.special_tokens_map) + kwargs["phonemize"] = False + kwargs["normalize"] = False + return VitsTokenizer.from_pretrained(self.tmpdirname, **kwargs) + + def get_clean_sequence(self, tokenizer, with_prefix_space=False, max_length=20, min_length=5): + txt = "beyonce lives in los angeles" + ids = tokenizer.encode(txt, add_special_tokens=False) + return txt, ids + + @unittest.skip("Adding multicharacter tokens does not work with the VITS tokenizer") + def test_add_tokens_tokenizer(self): + pass + + @unittest.skip("Adding multicharacter tokens does not work with the VITS tokenizer") + def test_encode_decode_with_spaces(self): + pass + + @unittest.skip("The VITS tokenizer does not support `is_split_into_words`") + def test_pretokenized_inputs(self): + pass + + def test_save_and_load_tokenizer(self): + # safety check on max_len default value so we are sure the test works + tokenizers = 
self.get_tokenizers() + for tokenizer in tokenizers: + with self.subTest(f"{tokenizer.__class__.__name__}"): + self.assertNotEqual(tokenizer.model_max_length, 42) + + # Now let's start the test + tokenizers = self.get_tokenizers() + for tokenizer in tokenizers: + with self.subTest(f"{tokenizer.__class__.__name__}"): + # Isolate this from the other tests because we save additional tokens/etc + tmpdirname = tempfile.mkdtemp() + + sample_text = " He is very happy, UNwant\u00E9d,running" + before_tokens = tokenizer.encode(sample_text, add_special_tokens=False) + before_vocab = tokenizer.get_vocab() + tokenizer.save_pretrained(tmpdirname) + + after_tokenizer = tokenizer.__class__.from_pretrained(tmpdirname) + after_tokens = after_tokenizer.encode(sample_text, add_special_tokens=False) + after_vocab = after_tokenizer.get_vocab() + self.assertListEqual(before_tokens, after_tokens) + self.assertDictEqual(before_vocab, after_vocab) + + shutil.rmtree(tmpdirname) + + @unittest.skip("Adding multicharacter tokens does not work with the VITS tokenizer") + def test_special_tokens_initialization_with_non_empty_additional_special_tokens(self): + pass + + def test_ron_normalization(self): + tokenizer = self.get_tokenizer() + tokenizer.language = "ron" + + sequences = ["vițs"] + normalized_sequences = ["viţs"] + + encoded_ids = tokenizer(sequences, normalize=True)["input_ids"] + decoded_sequences = tokenizer.batch_decode(encoded_ids) + self.assertEqual(normalized_sequences, decoded_sequences) + + def test_normalization(self): + tokenizer = self.get_tokenizer() + + sequences = ["VITS; is a model for t-t-s!"] + normalized_sequences = ["vits is a model for t-t-s"] + unnormalized_sequences = [ + "<unk><unk><unk><unk><unk> is a model for t-t-s<unk>" + ] # can't handle upper-case or certain punctuations + + encoded_normalized_ids = tokenizer(sequences, normalize=True) + encoded_unnormalized_ids = tokenizer(sequences, normalize=False) + + decoded_normalized_sequences = [ + tokenizer.decode(seq, skip_special_tokens=False) for 
seq in encoded_normalized_ids["input_ids"] + ] + decoded_unnormalized_sequences = [ + tokenizer.decode(seq, skip_special_tokens=False) for seq in encoded_unnormalized_ids["input_ids"] + ] + + self.assertEqual(decoded_normalized_sequences, normalized_sequences) + self.assertEqual(decoded_unnormalized_sequences, unnormalized_sequences) + + @slow + def test_tokenizer_integration(self): + sequences = [ + "BERT is designed to pre-train deep bidirectional representations from unlabeled text by jointly " + "conditioning on both left and right context in all layers.", + "The quick brown fox! Jumps over the lazy dog...", + "We use k as our padding token", + ] + + normalized_sequences = [ + "bert is designed to pre-train deep bidirectional representations from unlabeled text by jointly " + "conditioning on both left and right context in all layers", + "the quick brown fox jumps over the lazy dog", + "we use k as our padding token", + ] + + # fmt: off + expected_encoding = { + 'input_ids': [ + [0, 24, 0, 7, 0, 25, 0, 33, 0, 19, 0, 18, 0, 8, 0, 19, 0, 5, 0, 7, 0, 8, 0, 18, 0, 37, 0, 29, 0, 7, 0, 5, 0, 19, 0, 33, 0, 22, 0, 19, 0, 13, 0, 25, 0, 7, 0, 14, 0, 33, 0, 25, 0, 26, 0, 18, 0, 29, 0, 19, 0, 5, 0, 7, 0, 7, 0, 13, 0, 19, 0, 24, 0, 18, 0, 5, 0, 18, 0, 25, 0, 7, 0, 12, 0, 33, 0, 18, 0, 22, 0, 29, 0, 26, 0, 21, 0, 19, 0, 25, 0, 7, 0, 13, 0, 25, 0, 7, 0, 8, 0, 7, 0, 29, 0, 33, 0, 26, 0, 33, 0, 18, 0, 22, 0, 29, 0, 8, 0, 19, 0, 20, 0, 25, 0, 22, 0, 17, 0, 19, 0, 4, 0, 29, 0, 21, 0, 26, 0, 24, 0, 7, 0, 21, 0, 7, 0, 5, 0, 19, 0, 33, 0, 7, 0, 31, 0, 33, 0, 19, 0, 24, 0, 3, 0, 19, 0, 16, 0, 22, 0, 18, 0, 29, 0, 33, 0, 21, 0, 3, 0, 19, 0, 12, 0, 22, 0, 29, 0, 5, 0, 18, 0, 33, 0, 18, 0, 22, 0, 29, 0, 18, 0, 29, 0, 37, 0, 19, 0, 22, 0, 29, 0, 19, 0, 24, 0, 22, 0, 33, 0, 6, 0, 19, 0, 21, 0, 7, 0, 20, 0, 33, 0, 19, 0, 26, 0, 29, 0, 5, 0, 19, 0, 25, 0, 18, 0, 37, 0, 6, 0, 33, 0, 19, 0, 12, 0, 22, 0, 29, 0, 33, 0, 7, 0, 31, 0, 33, 0, 19, 0, 18, 0, 29, 0, 19, 0, 26, 0, 21, 0, 21, 0, 19, 0, 
21, 0, 26, 0, 3, 0, 7, 0, 25, 0, 8, 0], + [0, 33, 0, 6, 0, 7, 0, 19, 0, 34, 0, 4, 0, 18, 0, 12, 0, 0, 0, 19, 0, 24, 0, 25, 0, 22, 0, 9, 0, 29, 0, 19, 0, 20, 0, 22, 0, 31, 0, 19, 0, 16, 0, 4, 0, 17, 0, 13, 0, 8, 0, 19, 0, 22, 0, 32, 0, 7, 0, 25, 0, 19, 0, 33, 0, 6, 0, 7, 0, 19, 0, 21, 0, 26, 0, 2, 0, 3, 0, 19, 0, 5, 0, 22, 0, 37, 0, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38], + [0, 9, 0, 7, 0, 19, 0, 4, 0, 8, 0, 7, 0, 19, 0, 0, 0, 19, 0, 26, 0, 8, 0, 19, 0, 22, 0, 4, 0, 25, 0, 19, 0, 13, 0, 26, 0, 5, 0, 5, 0, 18, 0, 29, 0, 37, 0, 19, 0, 33, 0, 22, 0, 0, 0, 7, 0, 29, 0, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 
38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38], + ], + 'attention_mask': [ + [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], + [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], + [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], + ] + } + # fmt: on + + tokenizer_classes = [self.tokenizer_class] + if self.test_rust_tokenizer: + tokenizer_classes.append(self.rust_tokenizer_class) + + for tokenizer_class in tokenizer_classes: + tokenizer = tokenizer_class.from_pretrained( + "facebook/mms-tts-eng", + revision="d188a254c84ae6cfd24deb7a8f5c0c1d349d7d9f", # to pin the tokenizer version + ) + + encoding = tokenizer(sequences, padding=True, normalize=True) + decoded_sequences = [tokenizer.decode(seq, skip_special_tokens=True) for seq in encoding["input_ids"]] + + encoding_data = encoding.data + self.assertDictEqual(encoding_data, expected_encoding) + + for expected, decoded in zip(normalized_sequences, decoded_sequences): + self.assertEqual(expected, decoded) diff --git a/utils/check_config_attributes.py 
b/utils/check_config_attributes.py index f542dae5f4..0f0c5b41e4 100644 --- a/utils/check_config_attributes.py +++ b/utils/check_config_attributes.py @@ -190,6 +190,7 @@ def check_attribute_being_used(config_class, attributes, default_value, source_s "use_cache", "out_features", "out_indices", + "sampling_rate", ] attributes_used_in_generation = ["encoder_no_repeat_ngram_size"] diff --git a/utils/documentation_tests.txt b/utils/documentation_tests.txt new file mode 100644 index 0000000000..a916a2d5c9 --- /dev/null +++ b/utils/documentation_tests.txt @@ -0,0 +1,493 @@ +docs/source/en/autoclass_tutorial.md +docs/source/en/model_doc/byt5.md +docs/source/en/model_doc/donut.md +docs/source/en/model_doc/encoder-decoder.md +docs/source/en/model_doc/markuplm.md +docs/source/en/model_doc/speech_to_text.md +docs/source/en/model_doc/switch_transformers.md +docs/source/en/model_doc/t5.md +docs/source/en/model_doc/t5v1.1.md +docs/source/en/model_doc/tapex.md +docs/source/en/pipeline_tutorial.md +docs/source/en/quicktour.md +docs/source/en/task_summary.md +docs/source/es/quicktour.md +src/transformers/generation/configuration_utils.py +src/transformers/generation/tf_utils.py +src/transformers/generation/utils.py +src/transformers/models/albert/configuration_albert.py +src/transformers/models/albert/modeling_albert.py +src/transformers/models/albert/modeling_tf_albert.py +src/transformers/models/albert/tokenization_albert.py +src/transformers/models/albert/tokenization_albert_fast.py +src/transformers/models/align/processing_align.py +src/transformers/models/altclip/processing_altclip.py +src/transformers/models/audio_spectrogram_transformer/feature_extraction_audio_spectrogram_transformer.py +src/transformers/models/audio_spectrogram_transformer/modeling_audio_spectrogram_transformer.py +src/transformers/models/auto/feature_extraction_auto.py +src/transformers/models/auto/image_processing_auto.py +src/transformers/models/auto/processing_auto.py 
+src/transformers/models/auto/tokenization_auto.py +src/transformers/models/bark/configuration_bark.py +src/transformers/models/bark/modeling_bark.py +src/transformers/models/bark/processing_bark.py +src/transformers/models/bart/configuration_bart.py +src/transformers/models/bart/modeling_bart.py +src/transformers/models/bart/tokenization_bart.py +src/transformers/models/bart/tokenization_bart_fast.py +src/transformers/models/barthez/tokenization_barthez.py +src/transformers/models/barthez/tokenization_barthez_fast.py +src/transformers/models/bartpho/tokenization_bartpho.py +src/transformers/models/beit/configuration_beit.py +src/transformers/models/beit/feature_extraction_beit.py +src/transformers/models/beit/image_processing_beit.py +src/transformers/models/beit/modeling_beit.py +src/transformers/models/bert/configuration_bert.py +src/transformers/models/bert/modeling_bert.py +src/transformers/models/bert/modeling_tf_bert.py +src/transformers/models/bert/tokenization_bert.py +src/transformers/models/bert/tokenization_bert_fast.py +src/transformers/models/bert/tokenization_bert_tf.py +src/transformers/models/bert_generation/configuration_bert_generation.py +src/transformers/models/bert_generation/tokenization_bert_generation.py +src/transformers/models/bert_japanese/tokenization_bert_japanese.py +src/transformers/models/bertweet/tokenization_bertweet.py +src/transformers/models/big_bird/configuration_big_bird.py +src/transformers/models/big_bird/modeling_big_bird.py +src/transformers/models/big_bird/tokenization_big_bird.py +src/transformers/models/big_bird/tokenization_big_bird_fast.py +src/transformers/models/bigbird_pegasus/configuration_bigbird_pegasus.py +src/transformers/models/bigbird_pegasus/modeling_bigbird_pegasus.py +src/transformers/models/biogpt/tokenization_biogpt.py +src/transformers/models/bit/image_processing_bit.py +src/transformers/models/blenderbot/configuration_blenderbot.py +src/transformers/models/blenderbot/modeling_blenderbot.py 
+src/transformers/models/blenderbot/tokenization_blenderbot.py +src/transformers/models/blenderbot/tokenization_blenderbot_fast.py +src/transformers/models/blenderbot_small/configuration_blenderbot_small.py +src/transformers/models/blenderbot_small/modeling_blenderbot_small.py +src/transformers/models/blenderbot_small/tokenization_blenderbot_small.py +src/transformers/models/blenderbot_small/tokenization_blenderbot_small_fast.py +src/transformers/models/blip/image_processing_blip.py +src/transformers/models/blip/modeling_blip.py +src/transformers/models/blip/modeling_tf_blip.py +src/transformers/models/blip/processing_blip.py +src/transformers/models/blip_2/processing_blip_2.py +src/transformers/models/bloom/configuration_bloom.py +src/transformers/models/bloom/tokenization_bloom_fast.py +src/transformers/models/bridgetower/image_processing_bridgetower.py +src/transformers/models/bridgetower/processing_bridgetower.py +src/transformers/models/byt5/tokenization_byt5.py +src/transformers/models/camembert/configuration_camembert.py +src/transformers/models/camembert/tokenization_camembert.py +src/transformers/models/camembert/tokenization_camembert_fast.py +src/transformers/models/canine/configuration_canine.py +src/transformers/models/canine/modeling_canine.py +src/transformers/models/canine/tokenization_canine.py +src/transformers/models/chinese_clip/feature_extraction_chinese_clip.py +src/transformers/models/chinese_clip/image_processing_chinese_clip.py +src/transformers/models/chinese_clip/processing_chinese_clip.py +src/transformers/models/clap/configuration_clap.py +src/transformers/models/clap/feature_extraction_clap.py +src/transformers/models/clap/modeling_clap.py +src/transformers/models/clap/processing_clap.py +src/transformers/models/clip/configuration_clip.py +src/transformers/models/clip/feature_extraction_clip.py +src/transformers/models/clip/image_processing_clip.py +src/transformers/models/clip/processing_clip.py 
+src/transformers/models/clip/tokenization_clip.py +src/transformers/models/clip/tokenization_clip_fast.py +src/transformers/models/clipseg/modeling_clipseg.py +src/transformers/models/clipseg/processing_clipseg.py +src/transformers/models/codegen/configuration_codegen.py +src/transformers/models/codegen/tokenization_codegen.py +src/transformers/models/codegen/tokenization_codegen_fast.py +src/transformers/models/conditional_detr/configuration_conditional_detr.py +src/transformers/models/conditional_detr/feature_extraction_conditional_detr.py +src/transformers/models/conditional_detr/image_processing_conditional_detr.py +src/transformers/models/conditional_detr/modeling_conditional_detr.py +src/transformers/models/convbert/configuration_convbert.py +src/transformers/models/convbert/tokenization_convbert.py +src/transformers/models/convbert/tokenization_convbert_fast.py +src/transformers/models/convnext/configuration_convnext.py +src/transformers/models/convnext/feature_extraction_convnext.py +src/transformers/models/convnext/image_processing_convnext.py +src/transformers/models/convnext/modeling_convnext.py +src/transformers/models/cpm/tokenization_cpm.py +src/transformers/models/cpm/tokenization_cpm_fast.py +src/transformers/models/ctrl/configuration_ctrl.py +src/transformers/models/ctrl/modeling_ctrl.py +src/transformers/models/ctrl/tokenization_ctrl.py +src/transformers/models/cvt/configuration_cvt.py +src/transformers/models/cvt/modeling_cvt.py +src/transformers/models/data2vec/configuration_data2vec_audio.py +src/transformers/models/data2vec/configuration_data2vec_text.py +src/transformers/models/data2vec/configuration_data2vec_vision.py +src/transformers/models/data2vec/modeling_data2vec_audio.py +src/transformers/models/data2vec/modeling_data2vec_vision.py +src/transformers/models/deberta/configuration_deberta.py +src/transformers/models/deberta/modeling_deberta.py +src/transformers/models/deberta/tokenization_deberta.py 
+src/transformers/models/deberta/tokenization_deberta_fast.py +src/transformers/models/deberta_v2/configuration_deberta_v2.py +src/transformers/models/deberta_v2/modeling_deberta_v2.py +src/transformers/models/deberta_v2/tokenization_deberta_v2.py +src/transformers/models/deberta_v2/tokenization_deberta_v2_fast.py +src/transformers/models/decision_transformer/configuration_decision_transformer.py +src/transformers/models/deformable_detr/configuration_deformable_detr.py +src/transformers/models/deformable_detr/feature_extraction_deformable_detr.py +src/transformers/models/deformable_detr/image_processing_deformable_detr.py +src/transformers/models/deformable_detr/modeling_deformable_detr.py +src/transformers/models/deit/configuration_deit.py +src/transformers/models/deit/feature_extraction_deit.py +src/transformers/models/deit/image_processing_deit.py +src/transformers/models/deit/modeling_deit.py +src/transformers/models/deit/modeling_tf_deit.py +src/transformers/models/deta/configuration_deta.py +src/transformers/models/deta/image_processing_deta.py +src/transformers/models/deta/modeling_deta.py +src/transformers/models/detr/configuration_detr.py +src/transformers/models/detr/feature_extraction_detr.py +src/transformers/models/detr/image_processing_detr.py +src/transformers/models/detr/modeling_detr.py +src/transformers/models/dinat/configuration_dinat.py +src/transformers/models/dinat/modeling_dinat.py +src/transformers/models/distilbert/configuration_distilbert.py +src/transformers/models/distilbert/tokenization_distilbert.py +src/transformers/models/distilbert/tokenization_distilbert_fast.py +src/transformers/models/donut/feature_extraction_donut.py +src/transformers/models/donut/image_processing_donut.py +src/transformers/models/donut/processing_donut.py +src/transformers/models/dpr/configuration_dpr.py +src/transformers/models/dpr/tokenization_dpr.py +src/transformers/models/dpr/tokenization_dpr_fast.py +src/transformers/models/dpt/feature_extraction_dpt.py 
+src/transformers/models/dpt/image_processing_dpt.py +src/transformers/models/dpt/modeling_dpt.py +src/transformers/models/efficientformer/image_processing_efficientformer.py +src/transformers/models/efficientformer/modeling_tf_efficientformer.py +src/transformers/models/efficientnet/image_processing_efficientnet.py +src/transformers/models/electra/configuration_electra.py +src/transformers/models/electra/modeling_electra.py +src/transformers/models/electra/modeling_tf_electra.py +src/transformers/models/electra/tokenization_electra.py +src/transformers/models/electra/tokenization_electra_fast.py +src/transformers/models/encodec/feature_extraction_encodec.py +src/transformers/models/encodec/modeling_encodec.py +src/transformers/models/ernie/configuration_ernie.py +src/transformers/models/ernie_m/configuration_ernie_m.py +src/transformers/models/ernie_m/modeling_ernie_m.py +src/transformers/models/ernie_m/tokenization_ernie_m.py +src/transformers/models/esm/tokenization_esm.py +src/transformers/models/flaubert/tokenization_flaubert.py +src/transformers/models/flava/configuration_flava.py +src/transformers/models/flava/feature_extraction_flava.py +src/transformers/models/flava/image_processing_flava.py +src/transformers/models/flava/processing_flava.py +src/transformers/models/fnet/configuration_fnet.py +src/transformers/models/fnet/tokenization_fnet.py +src/transformers/models/fnet/tokenization_fnet_fast.py +src/transformers/models/fsmt/configuration_fsmt.py +src/transformers/models/fsmt/tokenization_fsmt.py +src/transformers/models/funnel/tokenization_funnel.py +src/transformers/models/funnel/tokenization_funnel_fast.py +src/transformers/models/git/modeling_git.py +src/transformers/models/git/processing_git.py +src/transformers/models/glpn/feature_extraction_glpn.py +src/transformers/models/glpn/image_processing_glpn.py +src/transformers/models/glpn/modeling_glpn.py +src/transformers/models/gpt2/configuration_gpt2.py +src/transformers/models/gpt2/modeling_gpt2.py 
+src/transformers/models/gpt2/tokenization_gpt2.py +src/transformers/models/gpt2/tokenization_gpt2_fast.py +src/transformers/models/gpt2/tokenization_gpt2_tf.py +src/transformers/models/gpt_neo/configuration_gpt_neo.py +src/transformers/models/gpt_neox/configuration_gpt_neox.py +src/transformers/models/gpt_neox/tokenization_gpt_neox_fast.py +src/transformers/models/gpt_neox_japanese/configuration_gpt_neox_japanese.py +src/transformers/models/gpt_neox_japanese/tokenization_gpt_neox_japanese.py +src/transformers/models/gpt_sw3/tokenization_gpt_sw3.py +src/transformers/models/gptj/modeling_gptj.py +src/transformers/models/gptsan_japanese/tokenization_gptsan_japanese.py +src/transformers/models/groupvit/modeling_groupvit.py +src/transformers/models/groupvit/modeling_tf_groupvit.py +src/transformers/models/herbert/tokenization_herbert.py +src/transformers/models/herbert/tokenization_herbert_fast.py +src/transformers/models/hubert/modeling_hubert.py +src/transformers/models/imagegpt/configuration_imagegpt.py +src/transformers/models/imagegpt/feature_extraction_imagegpt.py +src/transformers/models/imagegpt/image_processing_imagegpt.py +src/transformers/models/imagegpt/modeling_imagegpt.py +src/transformers/models/jukebox/tokenization_jukebox.py +src/transformers/models/jukebox/tokenization_jukebox.py +src/transformers/models/layoutlm/configuration_layoutlm.py +src/transformers/models/layoutlm/modeling_layoutlm.py +src/transformers/models/layoutlm/modeling_tf_layoutlm.py +src/transformers/models/layoutlm/tokenization_layoutlm.py +src/transformers/models/layoutlm/tokenization_layoutlm_fast.py +src/transformers/models/layoutlmv2/configuration_layoutlmv2.py +src/transformers/models/layoutlmv2/feature_extraction_layoutlmv2.py +src/transformers/models/layoutlmv2/image_processing_layoutlmv2.py +src/transformers/models/layoutlmv2/modeling_layoutlmv2.py +src/transformers/models/layoutlmv2/processing_layoutlmv2.py +src/transformers/models/layoutlmv2/tokenization_layoutlmv2.py 
+src/transformers/models/layoutlmv2/tokenization_layoutlmv2_fast.py +src/transformers/models/layoutlmv3/configuration_layoutlmv3.py +src/transformers/models/layoutlmv3/feature_extraction_layoutlmv3.py +src/transformers/models/layoutlmv3/image_processing_layoutlmv3.py +src/transformers/models/layoutlmv3/modeling_layoutlmv3.py +src/transformers/models/layoutlmv3/modeling_tf_layoutlmv3.py +src/transformers/models/layoutlmv3/processing_layoutlmv3.py +src/transformers/models/layoutlmv3/tokenization_layoutlmv3.py +src/transformers/models/layoutlmv3/tokenization_layoutlmv3_fast.py +src/transformers/models/layoutxlm/processing_layoutxlm.py +src/transformers/models/layoutxlm/tokenization_layoutxlm.py +src/transformers/models/layoutxlm/tokenization_layoutxlm_fast.py +src/transformers/models/led/tokenization_led.py +src/transformers/models/led/tokenization_led_fast.py +src/transformers/models/levit/configuration_levit.py +src/transformers/models/levit/feature_extraction_levit.py +src/transformers/models/levit/image_processing_levit.py +src/transformers/models/lilt/modeling_lilt.py +src/transformers/models/llama/tokenization_llama.py +src/transformers/models/longformer/modeling_longformer.py +src/transformers/models/longformer/modeling_tf_longformer.py +src/transformers/models/longformer/tokenization_longformer.py +src/transformers/models/longformer/tokenization_longformer_fast.py +src/transformers/models/longt5/modeling_longt5.py +src/transformers/models/luke/tokenization_luke.py +src/transformers/models/lxmert/tokenization_lxmert.py +src/transformers/models/lxmert/tokenization_lxmert_fast.py +src/transformers/models/m2m_100/configuration_m2m_100.py +src/transformers/models/m2m_100/tokenization_m2m_100.py +src/transformers/models/marian/modeling_marian.py +src/transformers/models/marian/tokenization_marian.py +src/transformers/models/markuplm/modeling_markuplm.py +src/transformers/models/markuplm/processing_markuplm.py 
+src/transformers/models/markuplm/tokenization_markuplm.py +src/transformers/models/markuplm/tokenization_markuplm_fast.py +src/transformers/models/mask2former/configuration_mask2former.py +src/transformers/models/mask2former/image_processing_mask2former.py +src/transformers/models/mask2former/modeling_mask2former.py +src/transformers/models/maskformer/configuration_maskformer.py +src/transformers/models/maskformer/feature_extraction_maskformer.py +src/transformers/models/maskformer/image_processing_maskformer.py +src/transformers/models/maskformer/modeling_maskformer.py +src/transformers/models/mbart/configuration_mbart.py +src/transformers/models/mbart/modeling_mbart.py +src/transformers/models/mbart/modeling_tf_mbart.py +src/transformers/models/mbart/tokenization_mbart.py +src/transformers/models/mbart/tokenization_mbart_fast.py +src/transformers/models/mbart50/tokenization_mbart50.py +src/transformers/models/mbart50/tokenization_mbart50_fast.py +src/transformers/models/megatron_bert/configuration_megatron_bert.py +src/transformers/models/mgp_str/processing_mgp_str.py +src/transformers/models/mgp_str/tokenization_mgp_str.py +src/transformers/models/mluke/tokenization_mluke.py +src/transformers/models/mobilebert/configuration_mobilebert.py +src/transformers/models/mobilebert/modeling_mobilebert.py +src/transformers/models/mobilebert/modeling_tf_mobilebert.py +src/transformers/models/mobilebert/tokenization_mobilebert.py +src/transformers/models/mobilebert/tokenization_mobilebert_fast.py +src/transformers/models/mobilenet_v1/feature_extraction_mobilenet_v1.py +src/transformers/models/mobilenet_v1/image_processing_mobilenet_v1.py +src/transformers/models/mobilenet_v1/modeling_mobilenet_v1.py +src/transformers/models/mobilenet_v2/feature_extraction_mobilenet_v2.py +src/transformers/models/mobilenet_v2/image_processing_mobilenet_v2.py +src/transformers/models/mobilenet_v2/modeling_mobilenet_v2.py +src/transformers/models/mobilevit/feature_extraction_mobilevit.py 
+src/transformers/models/mobilevit/image_processing_mobilevit.py +src/transformers/models/mobilevit/modeling_mobilevit.py +src/transformers/models/mobilevit/modeling_tf_mobilevit.py +src/transformers/models/mobilevitv2/configuration_mobilevitv2.py +src/transformers/models/mobilevitv2/modeling_mobilevitv2.py +src/transformers/models/mpnet/tokenization_mpnet.py +src/transformers/models/mpnet/tokenization_mpnet_fast.py +src/transformers/models/musicgen/configuration_musicgen.py +src/transformers/models/musicgen/modeling_musicgen.py +src/transformers/models/musicgen/processing_musicgen.py +src/transformers/models/mvp/configuration_mvp.py +src/transformers/models/mvp/tokenization_mvp.py +src/transformers/models/mvp/tokenization_mvp_fast.py +src/transformers/models/nat/configuration_nat.py +src/transformers/models/nat/modeling_nat.py +src/transformers/models/nezha/configuration_nezha.py +src/transformers/models/nllb/tokenization_nllb.py +src/transformers/models/nllb/tokenization_nllb_fast.py +src/transformers/models/oneformer/configuration_oneformer.py +src/transformers/models/oneformer/image_processing_oneformer.py +src/transformers/models/oneformer/modeling_oneformer.py +src/transformers/models/oneformer/processing_oneformer.py +src/transformers/models/openai/configuration_openai.py +src/transformers/models/openai/tokenization_openai.py +src/transformers/models/openai/tokenization_openai_fast.py +src/transformers/models/opt/configuration_opt.py +src/transformers/models/opt/modeling_opt.py +src/transformers/models/opt/modeling_tf_opt.py +src/transformers/models/owlvit/feature_extraction_owlvit.py +src/transformers/models/owlvit/image_processing_owlvit.py +src/transformers/models/owlvit/modeling_owlvit.py +src/transformers/models/owlvit/processing_owlvit.py +src/transformers/models/pegasus/configuration_pegasus.py +src/transformers/models/pegasus/modeling_pegasus.py +src/transformers/models/pegasus/tokenization_pegasus.py 
+src/transformers/models/pegasus/tokenization_pegasus_fast.py +src/transformers/models/pegasus_x/configuration_pegasus_x.py +src/transformers/models/perceiver/feature_extraction_perceiver.py +src/transformers/models/perceiver/image_processing_perceiver.py +src/transformers/models/perceiver/modeling_perceiver.py +src/transformers/models/perceiver/tokenization_perceiver.py +src/transformers/models/phobert/tokenization_phobert.py +src/transformers/models/pix2struct/modeling_pix2struct.py +src/transformers/models/plbart/configuration_plbart.py +src/transformers/models/plbart/modeling_plbart.py +src/transformers/models/plbart/tokenization_plbart.py +src/transformers/models/poolformer/configuration_poolformer.py +src/transformers/models/poolformer/feature_extraction_poolformer.py +src/transformers/models/poolformer/image_processing_poolformer.py +src/transformers/models/poolformer/modeling_poolformer.py +src/transformers/models/prophetnet/tokenization_prophetnet.py +src/transformers/models/rag/tokenization_rag.py +src/transformers/models/realm/configuration_realm.py +src/transformers/models/realm/tokenization_realm.py +src/transformers/models/realm/tokenization_realm_fast.py +src/transformers/models/reformer/configuration_reformer.py +src/transformers/models/reformer/modeling_reformer.py +src/transformers/models/reformer/tokenization_reformer.py +src/transformers/models/reformer/tokenization_reformer_fast.py +src/transformers/models/regnet/modeling_regnet.py +src/transformers/models/regnet/modeling_tf_regnet.py +src/transformers/models/rembert/tokenization_rembert.py +src/transformers/models/rembert/tokenization_rembert_fast.py +src/transformers/models/resnet/configuration_resnet.py +src/transformers/models/resnet/modeling_resnet.py +src/transformers/models/resnet/modeling_tf_resnet.py +src/transformers/models/roberta/configuration_roberta.py +src/transformers/models/roberta/modeling_roberta.py +src/transformers/models/roberta/modeling_tf_roberta.py 
+src/transformers/models/roberta/tokenization_roberta.py +src/transformers/models/roberta/tokenization_roberta_fast.py +src/transformers/models/roberta_prelayernorm/configuration_roberta_prelayernorm.py +src/transformers/models/roberta_prelayernorm/modeling_roberta_prelayernorm.py +src/transformers/models/roberta_prelayernorm/modeling_tf_roberta_prelayernorm.py +src/transformers/models/roc_bert/modeling_roc_bert.py +src/transformers/models/roc_bert/tokenization_roc_bert.py +src/transformers/models/roformer/tokenization_roformer.py +src/transformers/models/roformer/tokenization_roformer_fast.py +src/transformers/models/roformer/tokenization_utils.py +src/transformers/models/segformer/feature_extraction_segformer.py +src/transformers/models/segformer/image_processing_segformer.py +src/transformers/models/segformer/modeling_segformer.py +src/transformers/models/segformer/modeling_tf_segformer.py +src/transformers/models/sew/configuration_sew.py +src/transformers/models/sew/modeling_sew.py +src/transformers/models/sew_d/configuration_sew_d.py +src/transformers/models/sew_d/modeling_sew_d.py +src/transformers/models/speech_encoder_decoder/modeling_speech_encoder_decoder.py +src/transformers/models/speech_to_text/configuration_speech_to_text.py +src/transformers/models/speech_to_text/feature_extraction_speech_to_text.py +src/transformers/models/speech_to_text/modeling_speech_to_text.py +src/transformers/models/speech_to_text/processing_speech_to_text.py +src/transformers/models/speech_to_text/tokenization_speech_to_text.py +src/transformers/models/speech_to_text_2/configuration_speech_to_text_2.py +src/transformers/models/speech_to_text_2/modeling_speech_to_text_2.py +src/transformers/models/speech_to_text_2/processing_speech_to_text_2.py +src/transformers/models/speech_to_text_2/tokenization_speech_to_text_2.py +src/transformers/models/speecht5/feature_extraction_speecht5.py +src/transformers/models/speecht5/modeling_speecht5.py 
+src/transformers/models/speecht5/processing_speecht5.py +src/transformers/models/speecht5/tokenization_speecht5.py +src/transformers/models/splinter/tokenization_splinter.py +src/transformers/models/splinter/tokenization_splinter_fast.py +src/transformers/models/squeezebert/configuration_squeezebert.py +src/transformers/models/squeezebert/tokenization_squeezebert.py +src/transformers/models/squeezebert/tokenization_squeezebert_fast.py +src/transformers/models/swin/configuration_swin.py +src/transformers/models/swin/modeling_swin.py +src/transformers/models/swin2sr/image_processing_swin2sr.py +src/transformers/models/swin2sr/modeling_swin2sr.py +src/transformers/models/swinv2/configuration_swinv2.py +src/transformers/models/t5/tokenization_t5.py +src/transformers/models/t5/tokenization_t5_fast.py +src/transformers/models/table_transformer/modeling_table_transformer.py +src/transformers/models/tapas/tokenization_tapas.py +src/transformers/models/time_series_transformer/configuration_time_series_transformer.py +src/transformers/models/time_series_transformer/modeling_time_series_transformer.py +src/transformers/models/timesformer/configuration_timesformer.py +src/transformers/models/timesformer/modeling_timesformer.py +src/transformers/models/transfo_xl/configuration_transfo_xl.py +src/transformers/models/transfo_xl/tokenization_transfo_xl.py +src/transformers/models/trocr/configuration_trocr.py +src/transformers/models/trocr/modeling_trocr.py +src/transformers/models/trocr/processing_trocr.py +src/transformers/models/tvlt/feature_extraction_tvlt.py +src/transformers/models/tvlt/image_processing_tvlt.py +src/transformers/models/tvlt/processing_tvlt.py +src/transformers/models/unispeech/configuration_unispeech.py +src/transformers/models/unispeech/modeling_unispeech.py +src/transformers/models/unispeech_sat/modeling_unispeech_sat.py +src/transformers/models/upernet/modeling_upernet.py +src/transformers/models/videomae/feature_extraction_videomae.py 
+src/transformers/models/videomae/image_processing_videomae.py +src/transformers/models/videomae/modeling_videomae.py +src/transformers/models/vilt/feature_extraction_vilt.py +src/transformers/models/vilt/image_processing_vilt.py +src/transformers/models/vilt/modeling_vilt.py +src/transformers/models/vilt/processing_vilt.py +src/transformers/models/vision_encoder_decoder/configuration_vision_encoder_decoder.py +src/transformers/models/vision_encoder_decoder/modeling_vision_encoder_decoder.py +src/transformers/models/vision_text_dual_encoder/configuration_vision_text_dual_encoder.py +src/transformers/models/vision_text_dual_encoder/modeling_tf_vision_text_dual_encoder.py +src/transformers/models/vision_text_dual_encoder/processing_vision_text_dual_encoder.py +src/transformers/models/visual_bert/configuration_visual_bert.py +src/transformers/models/vit/configuration_vit.py +src/transformers/models/vit/feature_extraction_vit.py +src/transformers/models/vit/image_processing_vit.py +src/transformers/models/vit/modeling_tf_vit.py +src/transformers/models/vit/modeling_vit.py +src/transformers/models/vit_hybrid/image_processing_vit_hybrid.py +src/transformers/models/vit_mae/configuration_vit_mae.py +src/transformers/models/vit_mae/modeling_vit_mae.py +src/transformers/models/vit_msn/modeling_vit_msn.py +src/transformers/models/vits/modeling_vits.py +src/transformers/models/vits/tokenization_vits.py +src/transformers/models/wav2vec2/configuration_wav2vec2.py +src/transformers/models/wav2vec2/feature_extraction_wav2vec2.py +src/transformers/models/wav2vec2/modeling_wav2vec2.py +src/transformers/models/wav2vec2/processing_wav2vec2.py +src/transformers/models/wav2vec2/tokenization_wav2vec2.py +src/transformers/models/wav2vec2_conformer/configuration_wav2vec2_conformer.py +src/transformers/models/wav2vec2_conformer/modeling_wav2vec2_conformer.py +src/transformers/models/wav2vec2_phoneme/tokenization_wav2vec2_phoneme.py 
+src/transformers/models/wav2vec2_with_lm/processing_wav2vec2_with_lm.py +src/transformers/models/wavlm/configuration_wavlm.py +src/transformers/models/wavlm/modeling_wavlm.py +src/transformers/models/whisper/configuration_whisper.py +src/transformers/models/whisper/feature_extraction_whisper.py +src/transformers/models/whisper/modeling_tf_whisper.py +src/transformers/models/whisper/modeling_whisper.py +src/transformers/models/whisper/processing_whisper.py +src/transformers/models/whisper/tokenization_whisper.py +src/transformers/models/whisper/tokenization_whisper_fast.py +src/transformers/models/x_clip/modeling_x_clip.py +src/transformers/models/x_clip/processing_x_clip.py +src/transformers/models/xglm/tokenization_xglm.py +src/transformers/models/xglm/tokenization_xglm_fast.py +src/transformers/models/xlm/configuration_xlm.py +src/transformers/models/xlm/tokenization_xlm.py +src/transformers/models/xlm_prophetnet/tokenization_xlm_prophetnet.py +src/transformers/models/xlm_roberta/configuration_xlm_roberta.py +src/transformers/models/xlm_roberta/tokenization_xlm_roberta.py +src/transformers/models/xlm_roberta/tokenization_xlm_roberta_fast.py +src/transformers/models/xlm_roberta_xl/configuration_xlm_roberta_xl.py +src/transformers/models/xlnet/configuration_xlnet.py +src/transformers/models/xlnet/tokenization_xlnet.py +src/transformers/models/xlnet/tokenization_xlnet_fast.py +src/transformers/models/xmod/configuration_xmod.py +src/transformers/models/xmod/modeling_xmod.py +src/transformers/models/yolos/configuration_yolos.py +src/transformers/models/yolos/feature_extraction_yolos.py +src/transformers/models/yolos/image_processing_yolos.py +src/transformers/models/yolos/modeling_yolos.py +src/transformers/models/yoso/configuration_yoso.py +src/transformers/pipelines/