add VITS model (#24085)
* add VITS model
* let's vits
* finish TextEncoder (mostly)
* rename VITS to Vits
* add StochasticDurationPredictor
* ads flow model
* add generator
* correctly set vocab size
* add tokenizer
* remove processor & feature extractor
* add PosteriorEncoder
* add missing weights to SDP
* also convert LJSpeech and VCTK checkpoints
* add training stuff in forward
* add placeholder tests for tokenizer
* add placeholder tests for model
* starting cleanup
* let the great renaming begin!
* use config
* global_conditioning
* more cleaning
* renaming variables
* more renaming
* more renaming
* it never ends
* reticulating the splines
* more renaming
* HiFi-GAN
* doc strings for main model
* fixup
* fix-copies
* don't make it a PreTrainedModel
* fixup
* rename config options
* remove training logic from forward pass
* simplify relative position
* use actual checkpoint
* style
* PR review fixes
* more review changes
* fixup
* more unit tests
* fixup
* fix doc test
* add integration test
* improve tokenizer tests
* add tokenizer integration test
* fix tests on GPU (gave OOM)
* conversion script can handle repos from hub
* add conversion script for all MMS-TTS checkpoints
* automatically create a README for the converted checkpoint
* small changes to config
* push README to hub
* only show uroman note for checkpoints that need it
* remove conversion script because code formatting breaks the readme
* make WaveNet layers configurable
* rename variables
* simplifying the math
* output attentions and hidden states
* remove VitsFlip in flow model
* also got rid of the other flip
* fix tests
* rename more variables
* rename tokenizer, add phonemization
* raise error when phonemizer missing
* re-order config docstrings to match method
* change config naming
* remove redundant str -> list
* fix copyright: vits authors -> kakao enterprise
* (mean, log_variances) -> (prior_mean, prior_log_variances)
* if return dict -> if not return dict
* speed -> speaking rate
* Apply suggestions from code review
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
* update fused tanh sigmoid
* reduce dims in tester
* audio -> output_values
* audio -> output_values in tuple out
* fix return type
* fix return type
* make _unconstrained_rational_quadratic_spline a function
* all nn's to accept a config
* add spectro to output
* move {speaking rate, noise scale, noise scale duration} to config
* path -> attn_path
* idxs -> valid idxs -> padded idxs
* output values -> waveform
* use config for attention
* make generation work
* harden integration test
* add spectrogram to dict output
* tokenizer refactor
* make style
* remove 'fake' padding token
* harden tokenizer tests
* ron norm test
* fprop / save tests deterministic
* move uroman to tokenizer as much as possible
* better logger message
* fix vivit imports
* add uroman integration test
* make style
* up
* matthijs -> sanchit-gandhi
* fix tokenizer test
* make fix-copies
* fix dict comprehension
* fix config tests
* fix model tests
* make outputs consistent with reverse/not reverse
* fix key concat
* more model details
* add author
* return dict
* speaker error
* labels error
* Apply suggestions from code review
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
* Update src/transformers/models/vits/convert_original_checkpoint.py
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
* remove uromanize
* add docstrings
* add docstrings for tokenizer
* upper-case skip messages
* fix return dict
* style
* finish tests
* update checkpoints
* make style
* remove doctest file
* revert
* fix docstring
* fix tokenizer
* remove uroman integration test
* add sampling rate
* fix docs / docstrings
* style
* add sr to model output
* fix outputs
* style / copies
* fix docstring
* fix copies
* remove sr from model outputs
* Update utils/documentation_tests.txt
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
* add sr as allowed attr
---------
Co-authored-by: sanchit-gandhi <sanchit@huggingface.co>
Co-authored-by: Sanchit Gandhi <93869735+sanchit-gandhi@users.noreply.github.com>
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
Committed by GitHub
Parent: ef10dbce5c
Commit: 4ece3b9433
@@ -489,6 +489,7 @@ Current number of checkpoints:
 1. **[VitDet](https://huggingface.co/docs/transformers/main/model_doc/vitdet)** (from Meta AI) released with the paper [Exploring Plain Vision Transformer Backbones for Object Detection](https://arxiv.org/abs/2203.16527) by Yanghao Li, Hanzi Mao, Ross Girshick, Kaiming He.
 1. **[ViTMAE](https://huggingface.co/docs/transformers/model_doc/vit_mae)** (from Meta AI) released with the paper [Masked Autoencoders Are Scalable Vision Learners](https://arxiv.org/abs/2111.06377) by Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, Ross Girshick.
 1. **[ViTMSN](https://huggingface.co/docs/transformers/model_doc/vit_msn)** (from Meta AI) released with the paper [Masked Siamese Networks for Label-Efficient Learning](https://arxiv.org/abs/2204.07141) by Mahmoud Assran, Mathilde Caron, Ishan Misra, Piotr Bojanowski, Florian Bordes, Pascal Vincent, Armand Joulin, Michael Rabbat, Nicolas Ballas.
+1. **[VITS](https://huggingface.co/docs/transformers/main/model_doc/vits)** (from Kakao Enterprise) released with the paper [Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech](https://arxiv.org/abs/2106.06103) by Jaehyeon Kim, Jungil Kong, Juhee Son.
 1. **[ViViT](https://huggingface.co/docs/transformers/model_doc/vivit)** (from Google Research) released with the paper [ViViT: A Video Vision Transformer](https://arxiv.org/abs/2103.15691) by Anurag Arnab, Mostafa Dehghani, Georg Heigold, Chen Sun, Mario Lučić, Cordelia Schmid.
 1. **[Wav2Vec2](https://huggingface.co/docs/transformers/model_doc/wav2vec2)** (from Facebook AI) released with the paper [wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations](https://arxiv.org/abs/2006.11477) by Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, Michael Auli.
 1. **[Wav2Vec2-Conformer](https://huggingface.co/docs/transformers/model_doc/wav2vec2-conformer)** (from Facebook AI) released with the paper [FAIRSEQ S2T: Fast Speech-to-Text Modeling with FAIRSEQ](https://arxiv.org/abs/2010.05171) by Changhan Wang, Yun Tang, Xutai Ma, Anne Wu, Sravya Popuri, Dmytro Okhonko, Juan Pino.
@@ -466,6 +466,7 @@ Número actual de puntos de control:
 1. **[VitDet](https://huggingface.co/docs/transformers/main/model_doc/vitdet)** (from Meta AI) released with the paper [Exploring Plain Vision Transformer Backbones for Object Detection](https://arxiv.org/abs/2203.16527) by Yanghao Li, Hanzi Mao, Ross Girshick, Kaiming He.
 1. **[ViTMAE](https://huggingface.co/docs/transformers/model_doc/vit_mae)** (from Meta AI) released with the paper [Masked Autoencoders Are Scalable Vision Learners](https://arxiv.org/abs/2111.06377) by Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, Ross Girshick.
 1. **[ViTMSN](https://huggingface.co/docs/transformers/model_doc/vit_msn)** (from Meta AI) released with the paper [Masked Siamese Networks for Label-Efficient Learning](https://arxiv.org/abs/2204.07141) by Mahmoud Assran, Mathilde Caron, Ishan Misra, Piotr Bojanowski, Florian Bordes, Pascal Vincent, Armand Joulin, Michael Rabbat, Nicolas Ballas.
+1. **[VITS](https://huggingface.co/docs/transformers/main/model_doc/vits)** (from Kakao Enterprise) released with the paper [Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech](https://arxiv.org/abs/2106.06103) by Jaehyeon Kim, Jungil Kong, Juhee Son.
 1. **[ViViT](https://huggingface.co/docs/transformers/model_doc/vivit)** (from Google Research) released with the paper [ViViT: A Video Vision Transformer](https://arxiv.org/abs/2103.15691) by Anurag Arnab, Mostafa Dehghani, Georg Heigold, Chen Sun, Mario Lučić, Cordelia Schmid.
 1. **[Wav2Vec2](https://huggingface.co/docs/transformers/model_doc/wav2vec2)** (from Facebook AI) released with the paper [wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations](https://arxiv.org/abs/2006.11477) by Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, Michael Auli.
 1. **[Wav2Vec2-Conformer](https://huggingface.co/docs/transformers/model_doc/wav2vec2-conformer)** (from Facebook AI) released with the paper [FAIRSEQ S2T: Fast Speech-to-Text Modeling with FAIRSEQ](https://arxiv.org/abs/2010.05171) by Changhan Wang, Yun Tang, Xutai Ma, Anne Wu, Sravya Popuri, Dmytro Okhonko, Juan Pino.
@@ -438,6 +438,7 @@ conda install -c huggingface transformers
 1. **[VitDet](https://huggingface.co/docs/transformers/main/model_doc/vitdet)** (Meta AI से) Yanghao Li, Hanzi Mao, Ross Girshick, Kaiming He. द्वाराअनुसंधान पत्र [Exploring Plain Vision Transformer Backbones for Object Detection](https://arxiv.org/abs/2203.16527) के साथ जारी किया गया
 1. **[ViTMAE](https://huggingface.co/docs/transformers/model_doc/vit_mae)** (मेटा एआई से) साथ में कागज [मास्कड ऑटोएन्कोडर स्केलेबल विजन लर्नर्स हैं](https://arxiv.org/ एब्स/2111.06377) कैमिंग हे, ज़िनेली चेन, सेनिंग ज़ी, यांगहो ली, पिओट्र डॉलर, रॉस गिर्शिक द्वारा।
 1. **[ViTMSN](https://huggingface.co/docs/transformers/model_doc/vit_msn)** (मेटा एआई से) साथ में कागज [लेबल-कुशल सीखने के लिए मास्क्ड स्याम देश के नेटवर्क](https://arxiv. org/abs/2204.07141) महमूद असरान, मथिल्डे कैरन, ईशान मिश्रा, पियोट्र बोजानोवस्की, फ्लोरियन बोर्डेस, पास्कल विंसेंट, आर्मंड जौलिन, माइकल रब्बत, निकोलस बल्लास द्वारा।
+1. **[VITS](https://huggingface.co/docs/transformers/main/model_doc/vits)** (Kakao Enterprise से) Jaehyeon Kim, Jungil Kong, Juhee Son. द्वाराअनुसंधान पत्र [Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech](https://arxiv.org/abs/2106.06103) के साथ जारी किया गया
 1. **[ViViT](https://huggingface.co/docs/transformers/model_doc/vivit)** (from Google Research) released with the paper [ViViT: A Video Vision Transformer](https://arxiv.org/abs/2103.15691) by Anurag Arnab, Mostafa Dehghani, Georg Heigold, Chen Sun, Mario Lučić, Cordelia Schmid.
 1. **[Wav2Vec2](https://huggingface.co/docs/transformers/model_doc/wav2vec2)** (फेसबुक एआई से) साथ में पेपर [wav2vec 2.0: ए फ्रेमवर्क फॉर सेल्फ-सुपरवाइज्ड लर्निंग ऑफ स्पीच रिप्रेजेंटेशन] (https://arxiv.org/abs/2006.11477) एलेक्सी बेवस्की, हेनरी झोउ, अब्देलरहमान मोहम्मद, माइकल औली द्वारा।
 1. **[Wav2Vec2-Conformer](https://huggingface.co/docs/transformers/model_doc/wav2vec2-conformer)** (Facebook AI से) साथ वाला पेपर [FAIRSEQ S2T: FAIRSEQ के साथ फास्ट स्पीच-टू-टेक्स्ट मॉडलिंग ](https://arxiv.org/abs/2010.05171) चांगहान वांग, यूं तांग, जुताई मा, ऐनी वू, सरव्या पोपुरी, दिमित्रो ओखोनको, जुआन पिनो द्वारा पोस्ट किया गया।
@@ -500,6 +500,7 @@ Flax、PyTorch、TensorFlowをcondaでインストールする方法は、それ
 1. **[VitDet](https://huggingface.co/docs/transformers/main/model_doc/vitdet)** (Meta AI から) Yanghao Li, Hanzi Mao, Ross Girshick, Kaiming He. から公開された研究論文 [Exploring Plain Vision Transformer Backbones for Object Detection](https://arxiv.org/abs/2203.16527)
 1. **[ViTMAE](https://huggingface.co/docs/transformers/model_doc/vit_mae)** (Meta AI から) Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, Ross Girshick から公開された研究論文: [Masked Autoencoders Are Scalable Vision Learners](https://arxiv.org/abs/2111.06377)
 1. **[ViTMSN](https://huggingface.co/docs/transformers/model_doc/vit_msn)** (Meta AI から) Mahmoud Assran, Mathilde Caron, Ishan Misra, Piotr Bojanowski, Florian Bordes, Pascal Vincent, Armand Joulin, Michael Rabbat, Nicolas Ballas から公開された研究論文: [Masked Siamese Networks for Label-Efficient Learning](https://arxiv.org/abs/2204.07141)
+1. **[VITS](https://huggingface.co/docs/transformers/main/model_doc/vits)** (Kakao Enterprise から) Jaehyeon Kim, Jungil Kong, Juhee Son. から公開された研究論文 [Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech](https://arxiv.org/abs/2106.06103)
 1. **[ViViT](https://huggingface.co/docs/transformers/model_doc/vivit)** (from Google Research) released with the paper [ViViT: A Video Vision Transformer](https://arxiv.org/abs/2103.15691) by Anurag Arnab, Mostafa Dehghani, Georg Heigold, Chen Sun, Mario Lučić, Cordelia Schmid.
 1. **[Wav2Vec2](https://huggingface.co/docs/transformers/model_doc/wav2vec2)** (Facebook AI から) Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, Michael Auli から公開された研究論文: [wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations](https://arxiv.org/abs/2006.11477)
 1. **[Wav2Vec2-Conformer](https://huggingface.co/docs/transformers/model_doc/wav2vec2-conformer)** (Facebook AI から) Changhan Wang, Yun Tang, Xutai Ma, Anne Wu, Sravya Popuri, Dmytro Okhonko, Juan Pino から公開された研究論文: [FAIRSEQ S2T: Fast Speech-to-Text Modeling with FAIRSEQ](https://arxiv.org/abs/2010.05171)
@@ -415,6 +415,7 @@ Flax, PyTorch, TensorFlow 설치 페이지에서 이들을 conda로 설치하는
 1. **[VitDet](https://huggingface.co/docs/transformers/main/model_doc/vitdet)** (Meta AI 에서 제공)은 Yanghao Li, Hanzi Mao, Ross Girshick, Kaiming He.의 [Exploring Plain Vision Transformer Backbones for Object Detection](https://arxiv.org/abs/2203.16527)논문과 함께 발표했습니다.
 1. **[ViTMAE](https://huggingface.co/docs/transformers/model_doc/vit_mae)** (Meta AI 에서) Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, Ross Girshick 의 [Masked Autoencoders Are Scalable Vision Learners](https://arxiv.org/abs/2111.06377) 논문과 함께 발표했습니다.
 1. **[ViTMSN](https://huggingface.co/docs/transformers/model_doc/vit_msn)** (Meta AI 에서) Mahmoud Assran, Mathilde Caron, Ishan Misra, Piotr Bojanowski, Florian Bordes, Pascal Vincent, Armand Joulin, Michael Rabbat, Nicolas Ballas 의 [Masked Siamese Networks for Label-Efficient Learning](https://arxiv.org/abs/2204.07141) 논문과 함께 발표했습니다.
+1. **[VITS](https://huggingface.co/docs/transformers/main/model_doc/vits)** (Kakao Enterprise 에서 제공)은 Jaehyeon Kim, Jungil Kong, Juhee Son.의 [Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech](https://arxiv.org/abs/2106.06103)논문과 함께 발표했습니다.
 1. **[ViViT](https://huggingface.co/docs/transformers/model_doc/vivit)** (from Google Research) released with the paper [ViViT: A Video Vision Transformer](https://arxiv.org/abs/2103.15691) by Anurag Arnab, Mostafa Dehghani, Georg Heigold, Chen Sun, Mario Lučić, Cordelia Schmid.
 1. **[Wav2Vec2](https://huggingface.co/docs/transformers/model_doc/wav2vec2)** (Facebook AI 에서) Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, Michael Auli 의 [wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations](https://arxiv.org/abs/2006.11477) 논문과 함께 발표했습니다.
 1. **[Wav2Vec2-Conformer](https://huggingface.co/docs/transformers/model_doc/wav2vec2-conformer)** (Facebook AI 에서) Changhan Wang, Yun Tang, Xutai Ma, Anne Wu, Sravya Popuri, Dmytro Okhonko, Juan Pino 의 [FAIRSEQ S2T: Fast Speech-to-Text Modeling with FAIRSEQ](https://arxiv.org/abs/2010.05171) 논문과 함께 발표했습니다.
@@ -439,6 +439,7 @@ conda install -c huggingface transformers
 1. **[VitDet](https://huggingface.co/docs/transformers/main/model_doc/vitdet)** (来自 Meta AI) 伴随论文 [Exploring Plain Vision Transformer Backbones for Object Detection](https://arxiv.org/abs/2203.16527) 由 Yanghao Li, Hanzi Mao, Ross Girshick, Kaiming He 发布。
 1. **[ViTMAE](https://huggingface.co/docs/transformers/model_doc/vit_mae)** (来自 Meta AI) 伴随论文 [Masked Autoencoders Are Scalable Vision Learners](https://arxiv.org/abs/2111.06377) 由 Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, Ross Girshick 发布。
 1. **[ViTMSN](https://huggingface.co/docs/transformers/model_doc/vit_msn)** (来自 Meta AI) 伴随论文 [Masked Siamese Networks for Label-Efficient Learning](https://arxiv.org/abs/2204.07141) by Mahmoud Assran, Mathilde Caron, Ishan Misra, Piotr Bojanowski, Florian Bordes, Pascal Vincent, Armand Joulin, Michael Rabbat, Nicolas Ballas 发布.
+1. **[VITS](https://huggingface.co/docs/transformers/main/model_doc/vits)** (来自 Kakao Enterprise) 伴随论文 [Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech](https://arxiv.org/abs/2106.06103) 由 Jaehyeon Kim, Jungil Kong, Juhee Son 发布。
 1. **[ViViT](https://huggingface.co/docs/transformers/model_doc/vivit)** (来自 Google Research) released with the paper [ViViT: A Video Vision Transformer](https://arxiv.org/abs/2103.15691) 由 Anurag Arnab, Mostafa Dehghani, Georg Heigold, Chen Sun, Mario Lučić, Cordelia Schmid.
 1. **[Wav2Vec2](https://huggingface.co/docs/transformers/model_doc/wav2vec2)** (来自 Facebook AI) 伴随论文 [wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations](https://arxiv.org/abs/2006.11477) 由 Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, Michael Auli 发布。
 1. **[Wav2Vec2-Conformer](https://huggingface.co/docs/transformers/model_doc/wav2vec2-conformer)** (来自 Facebook AI) 伴随论文 [FAIRSEQ S2T: Fast Speech-to-Text Modeling with FAIRSEQ](https://arxiv.org/abs/2010.05171) 由 Changhan Wang, Yun Tang, Xutai Ma, Anne Wu, Sravya Popuri, Dmytro Okhonko, Juan Pino 发布。
@@ -451,6 +451,7 @@ conda install -c huggingface transformers
 1. **[VitDet](https://huggingface.co/docs/transformers/main/model_doc/vitdet)** (from Meta AI) released with the paper [Exploring Plain Vision Transformer Backbones for Object Detection](https://arxiv.org/abs/2203.16527) by Yanghao Li, Hanzi Mao, Ross Girshick, Kaiming He.
 1. **[ViTMAE](https://huggingface.co/docs/transformers/model_doc/vit_mae)** (from Meta AI) released with the paper [Masked Autoencoders Are Scalable Vision Learners](https://arxiv.org/abs/2111.06377) by Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, Ross Girshick.
 1. **[ViTMSN](https://huggingface.co/docs/transformers/model_doc/vit_msn)** (from Meta AI) released with the paper [Masked Siamese Networks for Label-Efficient Learning](https://arxiv.org/abs/2204.07141) by Mahmoud Assran, Mathilde Caron, Ishan Misra, Piotr Bojanowski, Florian Bordes, Pascal Vincent, Armand Joulin, Michael Rabbat, Nicolas Ballas.
+1. **[VITS](https://huggingface.co/docs/transformers/main/model_doc/vits)** (from Kakao Enterprise) released with the paper [Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech](https://arxiv.org/abs/2106.06103) by Jaehyeon Kim, Jungil Kong, Juhee Son.
 1. **[ViViT](https://huggingface.co/docs/transformers/model_doc/vivit)** (from Google Research) released with the paper [ViViT: A Video Vision Transformer](https://arxiv.org/abs/2103.15691) by Anurag Arnab, Mostafa Dehghani, Georg Heigold, Chen Sun, Mario Lučić, Cordelia Schmid.
 1. **[Wav2Vec2](https://huggingface.co/docs/transformers/model_doc/wav2vec2)** (from Facebook AI) released with the paper [wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations](https://arxiv.org/abs/2006.11477) by Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, Michael Auli.
 1. **[Wav2Vec2-Conformer](https://huggingface.co/docs/transformers/model_doc/wav2vec2-conformer)** (from Facebook AI) released with the paper [FAIRSEQ S2T: Fast Speech-to-Text Modeling with FAIRSEQ](https://arxiv.org/abs/2010.05171) by Changhan Wang, Yun Tang, Xutai Ma, Anne Wu, Sravya Popuri, Dmytro Okhonko, Juan Pino.
@@ -603,6 +603,8 @@
|
|||||||
  title: UniSpeech
- local: model_doc/unispeech-sat
  title: UniSpeech-SAT
- local: model_doc/vits
  title: VITS
- local: model_doc/wav2vec2
  title: Wav2Vec2
- local: model_doc/wav2vec2-conformer
||||||
|
|||||||
@@ -255,6 +255,7 @@ The documentation is organized into five sections:
|
|||||||
1. **[VitDet](model_doc/vitdet)** (from Meta AI) released with the paper [Exploring Plain Vision Transformer Backbones for Object Detection](https://arxiv.org/abs/2203.16527) by Yanghao Li, Hanzi Mao, Ross Girshick, Kaiming He.
1. **[ViTMAE](model_doc/vit_mae)** (from Meta AI) released with the paper [Masked Autoencoders Are Scalable Vision Learners](https://arxiv.org/abs/2111.06377) by Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, Ross Girshick.
1. **[ViTMSN](model_doc/vit_msn)** (from Meta AI) released with the paper [Masked Siamese Networks for Label-Efficient Learning](https://arxiv.org/abs/2204.07141) by Mahmoud Assran, Mathilde Caron, Ishan Misra, Piotr Bojanowski, Florian Bordes, Pascal Vincent, Armand Joulin, Michael Rabbat, Nicolas Ballas.
1. **[VITS](model_doc/vits)** (from Kakao Enterprise) released with the paper [Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech](https://arxiv.org/abs/2106.06103) by Jaehyeon Kim, Jungil Kong, Juhee Son.
1. **[ViViT](model_doc/vivit)** (from Google Research) released with the paper [ViViT: A Video Vision Transformer](https://arxiv.org/abs/2103.15691) by Anurag Arnab, Mostafa Dehghani, Georg Heigold, Chen Sun, Mario Lučić, Cordelia Schmid.
1. **[Wav2Vec2](model_doc/wav2vec2)** (from Facebook AI) released with the paper [wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations](https://arxiv.org/abs/2006.11477) by Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, Michael Auli.
1. **[Wav2Vec2-Conformer](model_doc/wav2vec2-conformer)** (from Facebook AI) released with the paper [FAIRSEQ S2T: Fast Speech-to-Text Modeling with FAIRSEQ](https://arxiv.org/abs/2010.05171) by Changhan Wang, Yun Tang, Xutai Ma, Anne Wu, Sravya Popuri, Dmytro Okhonko, Juan Pino.
||||||
@@ -475,6 +476,7 @@ Flax), PyTorch, and/or TensorFlow.
|
|||||||
| VitDet | ✅ | ❌ | ❌ |
| ViTMAE | ✅ | ✅ | ❌ |
| ViTMSN | ✅ | ❌ | ❌ |
| VITS | ✅ | ❌ | ❌ |
| ViViT | ✅ | ❌ | ❌ |
| Wav2Vec2 | ✅ | ✅ | ✅ |
| Wav2Vec2-Conformer | ✅ | ❌ | ❌ |
||||||
|
|||||||
114
docs/source/en/model_doc/vits.md
Normal file
@@ -0,0 +1,114 @@
|
|||||||
<!--Copyright 2023 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
|
||||||
|
|
||||||
|
# VITS
|
||||||
|
|
||||||
|
## Overview
|
||||||
|
|
||||||
|
The VITS model was proposed in [Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech](https://arxiv.org/abs/2106.06103) by Jaehyeon Kim, Jungil Kong, Juhee Son.
|
||||||
|
|
||||||
|
|
||||||
|
VITS (**V**ariational **I**nference with adversarial learning for end-to-end **T**ext-to-**S**peech) is an end-to-end
speech synthesis model that predicts a speech waveform conditional on an input text sequence. It is a conditional variational
autoencoder (VAE) composed of a posterior encoder, a decoder, and a conditional prior.
|
||||||
|
|
||||||
|
A set of spectrogram-based acoustic features is predicted by the flow-based module, which consists of a Transformer-based
text encoder and multiple coupling layers. The spectrogram is decoded using a stack of transposed convolutional layers,
much in the same style as the HiFi-GAN vocoder. Motivated by the one-to-many nature of the TTS problem, where the same text
input can be spoken in multiple ways, the model also includes a stochastic duration predictor, which allows the model to
synthesise speech with different rhythms from the same input text.
|
||||||
|
|
||||||
|
The model is trained end-to-end with a combination of losses derived from the variational lower bound and adversarial training.
To improve the expressiveness of the model, normalizing flows are applied to the conditional prior distribution. During
inference, the text encodings are up-sampled based on the output of the duration prediction module, and then mapped into the
waveform using a cascade of the flow module and HiFi-GAN decoder. Due to the stochastic nature of the duration predictor,
the model is non-deterministic, and thus requires a fixed seed to generate the same speech waveform.
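
The duration-based up-sampling step described above can be illustrated with a small, framework-free sketch. It is not the model's actual code: the function name `expand_by_duration`, the rounding rule (round up) and the `1 / speaking_rate` scaling are assumptions made for illustration only.

```python
import math


def expand_by_duration(encodings, durations, speaking_rate=1.0):
    """Repeat each text encoding for as many frames as its predicted duration.

    `durations` holds one predicted duration (in frames) per input token.
    The scaling and rounding rules here are illustrative assumptions.
    """
    length_scale = 1.0 / speaking_rate  # faster speech -> fewer frames per token
    frames = []
    for encoding, duration in zip(encodings, durations):
        # Round up so that a token with a positive duration gets at least one frame.
        num_frames = math.ceil(duration * length_scale)
        frames.extend([encoding] * num_frames)
    return frames


# Two tokens with predicted durations of 1.6 and 2.2 frames.
frames = expand_by_duration(["h", "i"], [1.6, 2.2])
print(frames)  # ['h', 'h', 'i', 'i', 'i']
```

Because the predicted durations are sampled rather than fixed, two runs with different seeds can expand the same text to different lengths, which is where the variation in rhythm comes from.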
|
||||||
|
|
||||||
|
The abstract from the paper is the following:
|
||||||
|
|
||||||
|
*Several recent end-to-end text-to-speech (TTS) models enabling single-stage training and parallel sampling have been proposed, but their sample quality does not match that of two-stage TTS systems. In this work, we present a parallel end-to-end TTS method that generates more natural sounding audio than current two-stage models. Our method adopts variational inference augmented with normalizing flows and an adversarial training process, which improves the expressive power of generative modeling. We also propose a stochastic duration predictor to synthesize speech with diverse rhythms from input text. With the uncertainty modeling over latent variables and the stochastic duration predictor, our method expresses the natural one-to-many relationship in which a text input can be spoken in multiple ways with different pitches and rhythms. A subjective human evaluation (mean opinion score, or MOS) on the LJ Speech, a single speaker dataset, shows that our method outperforms the best publicly available TTS systems and achieves a MOS comparable to ground truth.*
|
||||||
|
|
||||||
|
This model can also be used with TTS checkpoints from [Massively Multilingual Speech (MMS)](https://arxiv.org/abs/2305.13516),
as these checkpoints use the same architecture and a slightly modified tokenizer.
|
||||||
|
|
||||||
|
This model was contributed by [Matthijs](https://huggingface.co/Matthijs) and [sanchit-gandhi](https://huggingface.co/sanchit-gandhi). The original code can be found [here](https://github.com/jaywalnut310/vits).
|
||||||
|
|
||||||
|
## Model Usage
|
||||||
|
|
||||||
|
Both the VITS and MMS-TTS checkpoints can be used with the same API. Since the flow-based model is non-deterministic, it
is good practice to set a seed to ensure reproducibility of the outputs. For languages with a Roman alphabet,
such as English or French, the tokenizer can be used directly to pre-process the text inputs. The following code example
runs a forward pass using the MMS-TTS English checkpoint:
|
||||||
|
|
||||||
|
```python
import torch
from transformers import VitsTokenizer, VitsModel, set_seed

tokenizer = VitsTokenizer.from_pretrained("facebook/mms-tts-eng")
model = VitsModel.from_pretrained("facebook/mms-tts-eng")

inputs = tokenizer(text="Hello - my dog is cute", return_tensors="pt")

set_seed(555)  # make deterministic

with torch.no_grad():
    outputs = model(**inputs)

waveform = outputs.waveform[0]
```
|
||||||
|
|
||||||
|
The resulting waveform can be saved as a `.wav` file:
|
||||||
|
|
||||||
|
```python
import scipy.io.wavfile

# SciPy expects a NumPy array, so convert the torch tensor first
scipy.io.wavfile.write("techno.wav", rate=model.config.sampling_rate, data=waveform.numpy())
```
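
If SciPy is not installed, the same result can be approximated with Python's standard library. The sketch below is only an illustration: the helper name `write_wav` is hypothetical, and it assumes a mono waveform whose samples are floats in the range [-1.0, 1.0], which it converts to 16-bit PCM.

```python
import struct
import wave


def write_wav(path, samples, sampling_rate):
    """Write mono float samples (assumed to lie in [-1.0, 1.0]) as 16-bit PCM."""
    pcm = bytearray()
    for sample in samples:
        # Clip to the valid range, then scale to the signed 16-bit range.
        clipped = max(-1.0, min(1.0, float(sample)))
        pcm += struct.pack("<h", int(round(clipped * 32767)))
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)  # mono
        wav_file.setsampwidth(2)  # 2 bytes per sample = 16 bit
        wav_file.setframerate(sampling_rate)
        wav_file.writeframes(bytes(pcm))
```

With the variables from the forward-pass example, this could be called as `write_wav("techno.wav", waveform.tolist(), model.config.sampling_rate)`.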
|
||||||
|
|
||||||
|
Or displayed in a Jupyter Notebook / Google Colab:
|
||||||
|
|
||||||
|
```python
from IPython.display import Audio

Audio(waveform, rate=model.config.sampling_rate)
```
|
||||||
|
|
||||||
|
For certain languages with a non-Roman alphabet, such as Arabic, Mandarin or Hindi, the [`uroman`](https://github.com/isi-nlp/uroman)
Perl package is required to convert the text inputs to the Roman alphabet.
|
||||||
|
|
||||||
|
You can check whether you require the `uroman` package for your language by inspecting the `is_uroman` attribute of
the pre-trained `tokenizer`:
|
||||||
|
|
||||||
|
```python
from transformers import VitsTokenizer

tokenizer = VitsTokenizer.from_pretrained("facebook/mms-tts-eng")
print(tokenizer.is_uroman)
```
|
||||||
|
|
||||||
|
If required, you should apply the uroman package to your text inputs **prior** to passing them to the `VitsTokenizer`,
since the tokenizer does not currently support performing the pre-processing itself.
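
One way to run this pre-processing step from Python is to pipe the text through the Perl script with `subprocess`. The following is a minimal sketch under stated assumptions: the helper name `romanize` is hypothetical, and the script location `uroman/bin/uroman.pl` assumes a local checkout of the uroman repository, so adjust it to your own setup.

```python
import subprocess


def romanize(text, command):
    """Pipe `text` through an external romanizer command and return its output."""
    result = subprocess.run(
        command,
        input=text.encode("utf-8"),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        check=True,  # raise CalledProcessError if the command fails
    )
    return result.stdout.decode("utf-8").strip()


# Assumed path to the uroman script in a local checkout (hypothetical location).
UROMAN_COMMAND = ["perl", "uroman/bin/uroman.pl"]
```

The romanized string returned by `romanize(text, UROMAN_COMMAND)` can then be passed to the tokenizer in the same way as Roman-alphabet text.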
|
||||||
|
|
||||||
|
## VitsConfig

[[autodoc]] VitsConfig

## VitsTokenizer

[[autodoc]] VitsTokenizer
    - __call__
    - save_vocabulary

## VitsModel

[[autodoc]] VitsModel
    - forward
|
||||||
@@ -587,6 +587,11 @@ _import_structure = {
    "models.vit_mae": ["VIT_MAE_PRETRAINED_CONFIG_ARCHIVE_MAP", "ViTMAEConfig"],
    "models.vit_msn": ["VIT_MSN_PRETRAINED_CONFIG_ARCHIVE_MAP", "ViTMSNConfig"],
    "models.vitdet": ["VITDET_PRETRAINED_CONFIG_ARCHIVE_MAP", "VitDetConfig"],
    "models.vits": [
        "VITS_PRETRAINED_CONFIG_ARCHIVE_MAP",
        "VitsConfig",
        "VitsTokenizer",
    ],
    "models.vivit": [
        "VIVIT_PRETRAINED_CONFIG_ARCHIVE_MAP",
        "VivitConfig",
|
||||||
@@ -2935,6 +2940,13 @@ else:
            "VitDetPreTrainedModel",
        ]
    )
    _import_structure["models.vits"].extend(
        [
            "VITS_PRETRAINED_MODEL_ARCHIVE_LIST",
            "VitsModel",
            "VitsPreTrainedModel",
        ]
    )
    _import_structure["models.vivit"].extend(
        [
            "VIVIT_PRETRAINED_MODEL_ARCHIVE_LIST",
|
||||||
@@ -4643,6 +4655,11 @@ if TYPE_CHECKING:
    from .models.vit_mae import VIT_MAE_PRETRAINED_CONFIG_ARCHIVE_MAP, ViTMAEConfig
    from .models.vit_msn import VIT_MSN_PRETRAINED_CONFIG_ARCHIVE_MAP, ViTMSNConfig
    from .models.vitdet import VITDET_PRETRAINED_CONFIG_ARCHIVE_MAP, VitDetConfig
    from .models.vits import (
        VITS_PRETRAINED_CONFIG_ARCHIVE_MAP,
        VitsConfig,
        VitsTokenizer,
    )
    from .models.vivit import VIVIT_PRETRAINED_CONFIG_ARCHIVE_MAP, VivitConfig
    from .models.wav2vec2 import (
        WAV_2_VEC_2_PRETRAINED_CONFIG_ARCHIVE_MAP,
|
||||||
@@ -6595,6 +6612,11 @@ if TYPE_CHECKING:
            VitDetModel,
            VitDetPreTrainedModel,
        )
        from .models.vits import (
            VITS_PRETRAINED_MODEL_ARCHIVE_LIST,
            VitsModel,
            VitsPreTrainedModel,
        )
        from .models.vivit import (
            VIVIT_PRETRAINED_MODEL_ARCHIVE_LIST,
            VivitForVideoClassification,
|
||||||
|
|||||||
@@ -211,6 +211,7 @@ from . import (
    vit_mae,
    vit_msn,
    vitdet,
    vits,
    vivit,
    wav2vec2,
    wav2vec2_conformer,
|
||||||
|
|||||||
@@ -219,6 +219,7 @@ CONFIG_MAPPING_NAMES = OrderedDict(
        ("vit_mae", "ViTMAEConfig"),
        ("vit_msn", "ViTMSNConfig"),
        ("vitdet", "VitDetConfig"),
        ("vits", "VitsConfig"),
        ("vivit", "VivitConfig"),
        ("wav2vec2", "Wav2Vec2Config"),
        ("wav2vec2-conformer", "Wav2Vec2ConformerConfig"),
|
||||||
@@ -410,6 +411,7 @@ CONFIG_ARCHIVE_MAP_MAPPING_NAMES = OrderedDict(
        ("vit_mae", "VIT_MAE_PRETRAINED_CONFIG_ARCHIVE_MAP"),
        ("vit_msn", "VIT_MSN_PRETRAINED_CONFIG_ARCHIVE_MAP"),
        ("vitdet", "VITDET_PRETRAINED_CONFIG_ARCHIVE_MAP"),
        ("vits", "VITS_PRETRAINED_CONFIG_ARCHIVE_MAP"),
        ("vivit", "VIVIT_PRETRAINED_CONFIG_ARCHIVE_MAP"),
        ("wav2vec2", "WAV_2_VEC_2_PRETRAINED_CONFIG_ARCHIVE_MAP"),
        ("wav2vec2-conformer", "WAV2VEC2_CONFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP"),
|
||||||
@@ -643,6 +645,7 @@ MODEL_NAMES_MAPPING = OrderedDict(
        ("vit_mae", "ViTMAE"),
        ("vit_msn", "ViTMSN"),
        ("vitdet", "VitDet"),
        ("vits", "VITS"),
        ("vivit", "ViViT"),
        ("wav2vec2", "Wav2Vec2"),
        ("wav2vec2-conformer", "Wav2Vec2-Conformer"),
|
||||||
|
|||||||
@@ -205,6 +205,7 @@ MODEL_MAPPING_NAMES = OrderedDict(
        ("vit_mae", "ViTMAEModel"),
        ("vit_msn", "ViTMSNModel"),
        ("vitdet", "VitDetModel"),
        ("vits", "VitsModel"),
        ("vivit", "VivitModel"),
        ("wav2vec2", "Wav2Vec2Model"),
        ("wav2vec2-conformer", "Wav2Vec2ConformerModel"),
|
||||||
|
|||||||
@@ -347,6 +347,7 @@ else:
            ),
            ("vilt", ("BertTokenizer", "BertTokenizerFast" if is_tokenizers_available() else None)),
            ("visual_bert", ("BertTokenizer", "BertTokenizerFast" if is_tokenizers_available() else None)),
            ("vits", ("VitsTokenizer", None)),
            ("wav2vec2", ("Wav2Vec2CTCTokenizer", None)),
            ("wav2vec2-conformer", ("Wav2Vec2CTCTokenizer", None)),
            ("wav2vec2_phoneme", ("Wav2Vec2PhonemeCTCTokenizer", None)),
|
||||||
|
|||||||
67
src/transformers/models/vits/__init__.py
Normal file
@@ -0,0 +1,67 @@
# Copyright 2023 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from typing import TYPE_CHECKING

from ...utils import (
    OptionalDependencyNotAvailable,
    _LazyModule,
    is_sentencepiece_available,
    is_speech_available,
    is_torch_available,
)


_import_structure = {
    "configuration_vits": [
        "VITS_PRETRAINED_CONFIG_ARCHIVE_MAP",
        "VitsConfig",
    ],
    "tokenization_vits": ["VitsTokenizer"],
}

try:
    if not is_torch_available():
        raise OptionalDependencyNotAvailable()
except OptionalDependencyNotAvailable:
    pass
else:
    _import_structure["modeling_vits"] = [
        "VITS_PRETRAINED_MODEL_ARCHIVE_LIST",
        "VitsModel",
        "VitsPreTrainedModel",
    ]

if TYPE_CHECKING:
    from .configuration_vits import (
        VITS_PRETRAINED_CONFIG_ARCHIVE_MAP,
        VitsConfig,
    )
    from .tokenization_vits import VitsTokenizer

    try:
        if not is_torch_available():
            raise OptionalDependencyNotAvailable()
    except OptionalDependencyNotAvailable:
        pass
    else:
        from .modeling_vits import (
            VITS_PRETRAINED_MODEL_ARCHIVE_LIST,
            VitsModel,
            VitsPreTrainedModel,
        )

else:
    import sys

    sys.modules[__name__] = _LazyModule(__name__, globals()["__file__"], _import_structure, module_spec=__spec__)
254
src/transformers/models/vits/configuration_vits.py
Normal file
@@ -0,0 +1,254 @@
|
|||||||
|
# coding=utf-8
|
||||||
|
# Copyright 2023 The Kakao Enterprise Authors and the HuggingFace Inc. team. All rights reserved.
|
||||||
|
#
|
||||||
|
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||||
|
# you may not use this file except in compliance with the License.
|
||||||
|
# You may obtain a copy of the License at
|
||||||
|
#
|
||||||
|
# http://www.apache.org/licenses/LICENSE-2.0
|
||||||
|
#
|
||||||
|
# Unless required by applicable law or agreed to in writing, software
|
||||||
|
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||||
|
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||||
|
# See the License for the specific language governing permissions and
|
||||||
|
# limitations under the License.
|
||||||
|
""" VITS model configuration"""
|
||||||
|
|
||||||
|
|
||||||
|
from ...configuration_utils import PretrainedConfig
|
||||||
|
from ...utils import logging
|
||||||
|
|
||||||
|
|
||||||
|
logger = logging.get_logger(__name__)
|
||||||
|
|
||||||
|
VITS_PRETRAINED_CONFIG_ARCHIVE_MAP = {
|
||||||
|
"facebook/mms-tts-eng": "https://huggingface.co/facebook/mms-tts-eng/resolve/main/config.json",
|
||||||
|
}
|
||||||
|
|
||||||
|
|
||||||
|
class VitsConfig(PretrainedConfig):
|
||||||
|
r"""
|
||||||
|
This is the configuration class to store the configuration of a [`VitsModel`]. It is used to instantiate a VITS
|
||||||
|
model according to the specified arguments, defining the model architecture. Instantiating a configuration with the
|
||||||
|
defaults will yield a similar configuration to that of the VITS
|
||||||
|
[facebook/mms-tts-eng](https://huggingface.co/facebook/mms-tts-eng) architecture.
|
||||||
|
|
||||||
|
Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
|
||||||
|
documentation from [`PretrainedConfig`] for more information.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
vocab_size (`int`, *optional*, defaults to 38):
|
||||||
|
Vocabulary size of the VITS model. Defines the number of different tokens that can be represented by the
|
||||||
|
`input_ids` passed to the forward method of [`VitsModel`].
|
||||||
|
hidden_size (`int`, *optional*, defaults to 192):
|
||||||
|
Dimensionality of the text encoder layers.
|
||||||
|
num_hidden_layers (`int`, *optional*, defaults to 6):
|
||||||
|
Number of hidden layers in the Transformer encoder.
|
||||||
|
num_attention_heads (`int`, *optional*, defaults to 2):
|
||||||
|
Number of attention heads for each attention layer in the Transformer encoder.
|
||||||
|
window_size (`int`, *optional*, defaults to 4):
|
||||||
|
Window size for the relative positional embeddings in the attention layers of the Transformer encoder.
|
||||||
|
use_bias (`bool`, *optional*, defaults to `True`):
|
||||||
|
Whether to use bias in the key, query, value projection layers in the Transformer encoder.
|
||||||
|
ffn_dim (`int`, *optional*, defaults to 768):
|
||||||
|
Dimensionality of the "intermediate" (i.e., feed-forward) layer in the Transformer encoder.
|
||||||
|
layerdrop (`float`, *optional*, defaults to 0.1):
|
||||||
|
The LayerDrop probability for the encoder. See the [LayerDrop paper](https://arxiv.org/abs/1909.11556)
|
||||||
|
for more details.
|
||||||
|
ffn_kernel_size (`int`, *optional*, defaults to 3):
|
||||||
|
Kernel size of the 1D convolution layers used by the feed-forward network in the Transformer encoder.
|
||||||
|
flow_size (`int`, *optional*, defaults to 192):
|
||||||
|
Dimensionality of the flow layers.
|
||||||
|
spectrogram_bins (`int`, *optional*, defaults to 513):
|
||||||
|
Number of frequency bins in the target spectrogram.
|
||||||
|
hidden_act (`str` or `function`, *optional*, defaults to `"relu"`):
|
||||||
|
The non-linear activation function (function or string) in the encoder and pooler. If string, `"gelu"`,
|
||||||
|
`"relu"`, `"selu"` and `"gelu_new"` are supported.
|
||||||
|
hidden_dropout (`float`, *optional*, defaults to 0.1):
|
||||||
|
The dropout probability for all fully connected layers in the embeddings and encoder.
|
||||||
|
attention_dropout (`float`, *optional*, defaults to 0.1):
|
||||||
|
The dropout ratio for the attention probabilities.
|
||||||
|
activation_dropout (`float`, *optional*, defaults to 0.1):
|
||||||
|
The dropout ratio for activations inside the fully connected layer.
|
||||||
|
initializer_range (`float`, *optional*, defaults to 0.02):
|
||||||
|
The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
|
||||||
|
layer_norm_eps (`float`, *optional*, defaults to 1e-5):
|
||||||
|
The epsilon used by the layer normalization layers.
|
||||||
|
use_stochastic_duration_prediction (`bool`, *optional*, defaults to `True`):
|
||||||
|
Whether to use the stochastic duration prediction module or the regular duration predictor.
|
||||||
|
num_speakers (`int`, *optional*, defaults to 1):
|
||||||
|
Number of speakers if this is a multi-speaker model.
|
||||||
|
speaker_embedding_size (`int`, *optional*, defaults to 0):
|
||||||
|
Number of channels used by the speaker embeddings. Is zero for single-speaker models.
|
||||||
|
upsample_initial_channel (`int`, *optional*, defaults to 512):
|
||||||
|
The number of input channels into the HiFi-GAN upsampling network.
|
||||||
|
upsample_rates (`Tuple[int]` or `List[int]`, *optional*, defaults to `[8, 8, 2, 2]`):
|
||||||
|
A tuple of integers defining the stride of each 1D convolutional layer in the HiFi-GAN upsampling network.
|
||||||
|
The length of `upsample_rates` defines the number of convolutional layers and has to match the length of
|
||||||
|
`upsample_kernel_sizes`.
|
||||||
|
upsample_kernel_sizes (`Tuple[int]` or `List[int]`, *optional*, defaults to `[16, 16, 4, 4]`):
|
||||||
|
A tuple of integers defining the kernel size of each 1D convolutional layer in the HiFi-GAN upsampling
|
||||||
|
network. The length of `upsample_kernel_sizes` defines the number of convolutional layers and has to match
|
||||||
|
the length of `upsample_rates`.
|
||||||
|
resblock_kernel_sizes (`Tuple[int]` or `List[int]`, *optional*, defaults to `[3, 7, 11]`):
|
||||||
|
A tuple of integers defining the kernel sizes of the 1D convolutional layers in the HiFi-GAN
|
||||||
|
multi-receptive field fusion (MRF) module.
|
||||||
|
resblock_dilation_sizes (`Tuple[Tuple[int]]` or `List[List[int]]`, *optional*, defaults to `[[1, 3, 5], [1, 3, 5], [1, 3, 5]]`):
|
||||||
|
A nested tuple of integers defining the dilation rates of the dilated 1D convolutional layers in the
|
||||||
|
HiFi-GAN multi-receptive field fusion (MRF) module.
|
||||||
|
leaky_relu_slope (`float`, *optional*, defaults to 0.1):
|
||||||
|
The angle of the negative slope used by the leaky ReLU activation.
|
||||||
|
depth_separable_channels (`int`, *optional*, defaults to 2):
|
||||||
|
Number of channels to use in each depth-separable block.
|
||||||
|
depth_separable_num_layers (`int`, *optional*, defaults to 3):
|
||||||
|
Number of convolutional layers to use in each depth-separable block.
|
||||||
|
duration_predictor_flow_bins (`int`, *optional*, defaults to 10):
|
||||||
|
Number of channels to map using the unconstrained rational spline in the duration predictor model.
|
||||||
|
duration_predictor_tail_bound (`float`, *optional*, defaults to 5.0):
|
||||||
|
Value of the tail bin boundary when computing the unconstrained rational spline in the duration predictor
|
||||||
|
model.
|
||||||
|
duration_predictor_kernel_size (`int`, *optional*, defaults to 3):
|
||||||
|
Kernel size of the 1D convolution layers used in the duration predictor model.
|
||||||
|
duration_predictor_dropout (`float`, *optional*, defaults to 0.5):
|
||||||
|
The dropout ratio for the duration predictor model.
|
||||||
|
duration_predictor_num_flows (`int`, *optional*, defaults to 4):
|
||||||
|
Number of flow stages used by the duration predictor model.
|
||||||
|
duration_predictor_filter_channels (`int`, *optional*, defaults to 256):
|
||||||
|
Number of channels for the convolution layers used in the duration predictor model.
|
||||||
|
prior_encoder_num_flows (`int`, *optional*, defaults to 4):
|
||||||
|
Number of flow stages used by the prior encoder flow model.
|
||||||
|
prior_encoder_num_wavenet_layers (`int`, *optional*, defaults to 4):
|
||||||
|
Number of WaveNet layers used by the prior encoder flow model.
|
||||||
|
posterior_encoder_num_wavenet_layers (`int`, *optional*, defaults to 16):
|
||||||
|
Number of WaveNet layers used by the posterior encoder model.
|
||||||
|
wavenet_kernel_size (`int`, *optional*, defaults to 5):
|
||||||
|
Kernel size of the 1D convolution layers used in the WaveNet model.
|
||||||
|
wavenet_dilation_rate (`int`, *optional*, defaults to 1):
|
||||||
|
Dilation rates of the dilated 1D convolutional layers used in the WaveNet model.
|
||||||
|
wavenet_dropout (`float`, *optional*, defaults to 0.0):
|
||||||
|
The dropout ratio for the WaveNet layers.
|
||||||
|
speaking_rate (`float`, *optional*, defaults to 1.0):
|
||||||
|
Speaking rate. Larger values give faster synthesised speech.
|
||||||
|
noise_scale (`float`, *optional*, defaults to 0.667):
|
||||||
|
How random the speech prediction is. Larger values create more variation in the predicted speech.
|
||||||
|
noise_scale_duration (`float`, *optional*, defaults to 0.8):
|
||||||
|
How random the duration prediction is. Larger values create more variation in the predicted durations.
|
||||||
|
sampling_rate (`int`, *optional*, defaults to 16000):
|
||||||
|
The sampling rate at which the output audio waveform is digitized, expressed in hertz (Hz).
|
||||||
|
|
||||||
|
Example:
|
||||||
|
|
||||||
|
```python
|
||||||
|
>>> from transformers import VitsModel, VitsConfig
|
||||||
|
|
||||||
|
>>> # Initializing a "facebook/mms-tts-eng" style configuration
|
||||||
|
>>> configuration = VitsConfig()
|
||||||
|
|
||||||
|
>>> # Initializing a model (with random weights) from the "facebook/mms-tts-eng" style configuration
|
||||||
|
>>> model = VitsModel(configuration)
|
||||||
|
|
||||||
|
>>> # Accessing the model configuration
|
||||||
|
>>> configuration = model.config
|
||||||
|
```"""
|
||||||
|
    model_type = "vits"

    def __init__(
        self,
        vocab_size=38,
        hidden_size=192,
        num_hidden_layers=6,
        num_attention_heads=2,
        window_size=4,
        use_bias=True,
        ffn_dim=768,
        layerdrop=0.1,
        ffn_kernel_size=3,
        flow_size=192,
        spectrogram_bins=513,
        hidden_act="relu",
        hidden_dropout=0.1,
        attention_dropout=0.1,
        activation_dropout=0.1,
        initializer_range=0.02,
        layer_norm_eps=1e-5,
        use_stochastic_duration_prediction=True,
        num_speakers=1,
        speaker_embedding_size=0,
        upsample_initial_channel=512,
        upsample_rates=[8, 8, 2, 2],
        upsample_kernel_sizes=[16, 16, 4, 4],
        resblock_kernel_sizes=[3, 7, 11],
        resblock_dilation_sizes=[[1, 3, 5], [1, 3, 5], [1, 3, 5]],
        leaky_relu_slope=0.1,
        depth_separable_channels=2,
        depth_separable_num_layers=3,
        duration_predictor_flow_bins=10,
        duration_predictor_tail_bound=5.0,
        duration_predictor_kernel_size=3,
        duration_predictor_dropout=0.5,
        duration_predictor_num_flows=4,
        duration_predictor_filter_channels=256,
        prior_encoder_num_flows=4,
        prior_encoder_num_wavenet_layers=4,
        posterior_encoder_num_wavenet_layers=16,
        wavenet_kernel_size=5,
        wavenet_dilation_rate=1,
        wavenet_dropout=0.0,
        speaking_rate=1.0,
        noise_scale=0.667,
        noise_scale_duration=0.8,
        sampling_rate=16_000,
        **kwargs,
    ):
        self.vocab_size = vocab_size
        self.hidden_size = hidden_size
        self.num_hidden_layers = num_hidden_layers
        self.num_attention_heads = num_attention_heads
        self.window_size = window_size
        self.use_bias = use_bias
        self.ffn_dim = ffn_dim
        self.layerdrop = layerdrop
        self.ffn_kernel_size = ffn_kernel_size
        self.flow_size = flow_size
        self.spectrogram_bins = spectrogram_bins
        self.hidden_act = hidden_act
        self.hidden_dropout = hidden_dropout
        self.attention_dropout = attention_dropout
        self.activation_dropout = activation_dropout
        self.initializer_range = initializer_range
        self.layer_norm_eps = layer_norm_eps
        self.use_stochastic_duration_prediction = use_stochastic_duration_prediction
        self.num_speakers = num_speakers
        self.speaker_embedding_size = speaker_embedding_size
        self.upsample_initial_channel = upsample_initial_channel
        self.upsample_rates = upsample_rates
        self.upsample_kernel_sizes = upsample_kernel_sizes
        self.resblock_kernel_sizes = resblock_kernel_sizes
        self.resblock_dilation_sizes = resblock_dilation_sizes
        self.leaky_relu_slope = leaky_relu_slope
        self.depth_separable_channels = depth_separable_channels
        self.depth_separable_num_layers = depth_separable_num_layers
        self.duration_predictor_flow_bins = duration_predictor_flow_bins
        self.duration_predictor_tail_bound = duration_predictor_tail_bound
        self.duration_predictor_kernel_size = duration_predictor_kernel_size
        self.duration_predictor_dropout = duration_predictor_dropout
        self.duration_predictor_num_flows = duration_predictor_num_flows
        self.duration_predictor_filter_channels = duration_predictor_filter_channels
        self.prior_encoder_num_flows = prior_encoder_num_flows
        self.prior_encoder_num_wavenet_layers = prior_encoder_num_wavenet_layers
        self.posterior_encoder_num_wavenet_layers = posterior_encoder_num_wavenet_layers
        self.wavenet_kernel_size = wavenet_kernel_size
        self.wavenet_dilation_rate = wavenet_dilation_rate
        self.wavenet_dropout = wavenet_dropout
        self.speaking_rate = speaking_rate
        self.noise_scale = noise_scale
        self.noise_scale_duration = noise_scale_duration
        self.sampling_rate = sampling_rate

        if len(upsample_kernel_sizes) != len(upsample_rates):
            raise ValueError(
                f"The length of `upsample_kernel_sizes` ({len(upsample_kernel_sizes)}) must match the length of "
                f"`upsample_rates` ({len(upsample_rates)})"
            )

        super().__init__(**kwargs)
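
Note on the decoder defaults: the length check at the end of `__init__` exists because each upsampling stage pairs one rate with one kernel size. The sketch below is not part of the commit; it repeats that check in a hypothetical standalone helper (`check_upsampling`) and, under the assumption that every stage stretches the time axis by exactly its rate, derives how many audio samples one latent frame expands to with the default values.

```python
from math import prod

# Defaults copied from the VitsConfig signature above.
upsample_rates = [8, 8, 2, 2]
upsample_kernel_sizes = [16, 16, 4, 4]
sampling_rate = 16_000


def check_upsampling(rates, kernel_sizes):
    """Hypothetical helper mirroring the length check in VitsConfig.__init__."""
    if len(kernel_sizes) != len(rates):
        raise ValueError(
            f"The length of `upsample_kernel_sizes` ({len(kernel_sizes)}) must match the length of "
            f"`upsample_rates` ({len(rates)})"
        )
    # Assumption: each decoder stage stretches the time axis by exactly its rate,
    # so the total factor is the product of all rates.
    return prod(rates)


samples_per_frame = check_upsampling(upsample_rates, upsample_kernel_sizes)
frames_per_second = sampling_rate / samples_per_frame
print(samples_per_frame, frames_per_second)
```

With the defaults this gives 256 samples per frame, i.e. 62.5 frames per second at 16 kHz, which is a quick sanity check when changing `upsample_rates` for a checkpoint trained at a different sampling rate.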
390
src/transformers/models/vits/convert_original_checkpoint.py
Normal file
@@ -0,0 +1,390 @@
# coding=utf-8
# Copyright 2023 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Convert VITS checkpoint."""

import argparse
import json
import tempfile

import torch
from huggingface_hub import hf_hub_download

from transformers import VitsConfig, VitsModel, VitsTokenizer, logging


logging.set_verbosity_info()
logger = logging.get_logger("transformers.models.vits")

MAPPING_TEXT_ENCODER = {
    "enc_p.emb": "text_encoder.embed_tokens",
    "enc_p.encoder.attn_layers.*.conv_k": "text_encoder.encoder.layers.*.attention.k_proj",
    "enc_p.encoder.attn_layers.*.conv_v": "text_encoder.encoder.layers.*.attention.v_proj",
    "enc_p.encoder.attn_layers.*.conv_q": "text_encoder.encoder.layers.*.attention.q_proj",
    "enc_p.encoder.attn_layers.*.conv_o": "text_encoder.encoder.layers.*.attention.out_proj",
    "enc_p.encoder.attn_layers.*.emb_rel_k": "text_encoder.encoder.layers.*.attention.emb_rel_k",
    "enc_p.encoder.attn_layers.*.emb_rel_v": "text_encoder.encoder.layers.*.attention.emb_rel_v",
    "enc_p.encoder.norm_layers_1.*.gamma": "text_encoder.encoder.layers.*.layer_norm.weight",
    "enc_p.encoder.norm_layers_1.*.beta": "text_encoder.encoder.layers.*.layer_norm.bias",
    "enc_p.encoder.ffn_layers.*.conv_1": "text_encoder.encoder.layers.*.feed_forward.conv_1",
    "enc_p.encoder.ffn_layers.*.conv_2": "text_encoder.encoder.layers.*.feed_forward.conv_2",
    "enc_p.encoder.norm_layers_2.*.gamma": "text_encoder.encoder.layers.*.final_layer_norm.weight",
    "enc_p.encoder.norm_layers_2.*.beta": "text_encoder.encoder.layers.*.final_layer_norm.bias",
    "enc_p.proj": "text_encoder.project",
}
MAPPING_STOCHASTIC_DURATION_PREDICTOR = {
    "dp.pre": "duration_predictor.conv_pre",
    "dp.proj": "duration_predictor.conv_proj",
    "dp.convs.convs_sep.*": "duration_predictor.conv_dds.convs_dilated.*",
    "dp.convs.convs_1x1.*": "duration_predictor.conv_dds.convs_pointwise.*",
    "dp.convs.norms_1.*.gamma": "duration_predictor.conv_dds.norms_1.*.weight",
    "dp.convs.norms_1.*.beta": "duration_predictor.conv_dds.norms_1.*.bias",
    "dp.convs.norms_2.*.gamma": "duration_predictor.conv_dds.norms_2.*.weight",
    "dp.convs.norms_2.*.beta": "duration_predictor.conv_dds.norms_2.*.bias",
    "dp.flows.0.logs": "duration_predictor.flows.0.log_scale",
    "dp.flows.0.m": "duration_predictor.flows.0.translate",
    "dp.flows.*.pre": "duration_predictor.flows.*.conv_pre",
    "dp.flows.*.proj": "duration_predictor.flows.*.conv_proj",
    "dp.flows.*.convs.convs_1x1.0": "duration_predictor.flows.*.conv_dds.convs_pointwise.0",
    "dp.flows.*.convs.convs_1x1.1": "duration_predictor.flows.*.conv_dds.convs_pointwise.1",
    "dp.flows.*.convs.convs_1x1.2": "duration_predictor.flows.*.conv_dds.convs_pointwise.2",
    "dp.flows.*.convs.convs_sep.0": "duration_predictor.flows.*.conv_dds.convs_dilated.0",
    "dp.flows.*.convs.convs_sep.1": "duration_predictor.flows.*.conv_dds.convs_dilated.1",
    "dp.flows.*.convs.convs_sep.2": "duration_predictor.flows.*.conv_dds.convs_dilated.2",
    "dp.flows.*.convs.norms_1.0.gamma": "duration_predictor.flows.*.conv_dds.norms_1.0.weight",
    "dp.flows.*.convs.norms_1.0.beta": "duration_predictor.flows.*.conv_dds.norms_1.0.bias",
    "dp.flows.*.convs.norms_1.1.gamma": "duration_predictor.flows.*.conv_dds.norms_1.1.weight",
    "dp.flows.*.convs.norms_1.1.beta": "duration_predictor.flows.*.conv_dds.norms_1.1.bias",
    "dp.flows.*.convs.norms_1.2.gamma": "duration_predictor.flows.*.conv_dds.norms_1.2.weight",
    "dp.flows.*.convs.norms_1.2.beta": "duration_predictor.flows.*.conv_dds.norms_1.2.bias",
    "dp.flows.*.convs.norms_2.0.gamma": "duration_predictor.flows.*.conv_dds.norms_2.0.weight",
    "dp.flows.*.convs.norms_2.0.beta": "duration_predictor.flows.*.conv_dds.norms_2.0.bias",
    "dp.flows.*.convs.norms_2.1.gamma": "duration_predictor.flows.*.conv_dds.norms_2.1.weight",
    "dp.flows.*.convs.norms_2.1.beta": "duration_predictor.flows.*.conv_dds.norms_2.1.bias",
    "dp.flows.*.convs.norms_2.2.gamma": "duration_predictor.flows.*.conv_dds.norms_2.2.weight",
    "dp.flows.*.convs.norms_2.2.beta": "duration_predictor.flows.*.conv_dds.norms_2.2.bias",
    "dp.post_pre": "duration_predictor.post_conv_pre",
    "dp.post_proj": "duration_predictor.post_conv_proj",
    "dp.post_convs.convs_sep.*": "duration_predictor.post_conv_dds.convs_dilated.*",
    "dp.post_convs.convs_1x1.*": "duration_predictor.post_conv_dds.convs_pointwise.*",
    "dp.post_convs.norms_1.*.gamma": "duration_predictor.post_conv_dds.norms_1.*.weight",
    "dp.post_convs.norms_1.*.beta": "duration_predictor.post_conv_dds.norms_1.*.bias",
    "dp.post_convs.norms_2.*.gamma": "duration_predictor.post_conv_dds.norms_2.*.weight",
    "dp.post_convs.norms_2.*.beta": "duration_predictor.post_conv_dds.norms_2.*.bias",
    "dp.post_flows.0.logs": "duration_predictor.post_flows.0.log_scale",
    "dp.post_flows.0.m": "duration_predictor.post_flows.0.translate",
    "dp.post_flows.*.pre": "duration_predictor.post_flows.*.conv_pre",
    "dp.post_flows.*.proj": "duration_predictor.post_flows.*.conv_proj",
    "dp.post_flows.*.convs.convs_1x1.0": "duration_predictor.post_flows.*.conv_dds.convs_pointwise.0",
    "dp.post_flows.*.convs.convs_1x1.1": "duration_predictor.post_flows.*.conv_dds.convs_pointwise.1",
    "dp.post_flows.*.convs.convs_1x1.2": "duration_predictor.post_flows.*.conv_dds.convs_pointwise.2",
    "dp.post_flows.*.convs.convs_sep.0": "duration_predictor.post_flows.*.conv_dds.convs_dilated.0",
    "dp.post_flows.*.convs.convs_sep.1": "duration_predictor.post_flows.*.conv_dds.convs_dilated.1",
    "dp.post_flows.*.convs.convs_sep.2": "duration_predictor.post_flows.*.conv_dds.convs_dilated.2",
    "dp.post_flows.*.convs.norms_1.0.gamma": "duration_predictor.post_flows.*.conv_dds.norms_1.0.weight",
    "dp.post_flows.*.convs.norms_1.0.beta": "duration_predictor.post_flows.*.conv_dds.norms_1.0.bias",
    "dp.post_flows.*.convs.norms_1.1.gamma": "duration_predictor.post_flows.*.conv_dds.norms_1.1.weight",
    "dp.post_flows.*.convs.norms_1.1.beta": "duration_predictor.post_flows.*.conv_dds.norms_1.1.bias",
    "dp.post_flows.*.convs.norms_1.2.gamma": "duration_predictor.post_flows.*.conv_dds.norms_1.2.weight",
    "dp.post_flows.*.convs.norms_1.2.beta": "duration_predictor.post_flows.*.conv_dds.norms_1.2.bias",
    "dp.post_flows.*.convs.norms_2.0.gamma": "duration_predictor.post_flows.*.conv_dds.norms_2.0.weight",
    "dp.post_flows.*.convs.norms_2.0.beta": "duration_predictor.post_flows.*.conv_dds.norms_2.0.bias",
    "dp.post_flows.*.convs.norms_2.1.gamma": "duration_predictor.post_flows.*.conv_dds.norms_2.1.weight",
    "dp.post_flows.*.convs.norms_2.1.beta": "duration_predictor.post_flows.*.conv_dds.norms_2.1.bias",
    "dp.post_flows.*.convs.norms_2.2.gamma": "duration_predictor.post_flows.*.conv_dds.norms_2.2.weight",
    "dp.post_flows.*.convs.norms_2.2.beta": "duration_predictor.post_flows.*.conv_dds.norms_2.2.bias",
    "dp.cond": "duration_predictor.cond",  # num_speakers > 1
}
MAPPING_FLOW = {
    "flow.flows.*.pre": "flow.flows.*.conv_pre",
    "flow.flows.*.enc.in_layers.0": "flow.flows.*.wavenet.in_layers.0",
    "flow.flows.*.enc.in_layers.1": "flow.flows.*.wavenet.in_layers.1",
    "flow.flows.*.enc.in_layers.2": "flow.flows.*.wavenet.in_layers.2",
    "flow.flows.*.enc.in_layers.3": "flow.flows.*.wavenet.in_layers.3",
    "flow.flows.*.enc.res_skip_layers.0": "flow.flows.*.wavenet.res_skip_layers.0",
    "flow.flows.*.enc.res_skip_layers.1": "flow.flows.*.wavenet.res_skip_layers.1",
    "flow.flows.*.enc.res_skip_layers.2": "flow.flows.*.wavenet.res_skip_layers.2",
    "flow.flows.*.enc.res_skip_layers.3": "flow.flows.*.wavenet.res_skip_layers.3",
    "flow.flows.*.enc.cond_layer": "flow.flows.*.wavenet.cond_layer",  # num_speakers > 1
    "flow.flows.*.post": "flow.flows.*.conv_post",
}
MAPPING_GENERATOR = {
    "dec.conv_pre": "decoder.conv_pre",
    "dec.ups.0": "decoder.upsampler.0",
    "dec.ups.1": "decoder.upsampler.1",
    "dec.ups.2": "decoder.upsampler.2",
    "dec.ups.3": "decoder.upsampler.3",
    "dec.resblocks.*.convs1.0": "decoder.resblocks.*.convs1.0",
    "dec.resblocks.*.convs1.1": "decoder.resblocks.*.convs1.1",
    "dec.resblocks.*.convs1.2": "decoder.resblocks.*.convs1.2",
    "dec.resblocks.*.convs2.0": "decoder.resblocks.*.convs2.0",
    "dec.resblocks.*.convs2.1": "decoder.resblocks.*.convs2.1",
    "dec.resblocks.*.convs2.2": "decoder.resblocks.*.convs2.2",
    "dec.conv_post": "decoder.conv_post",
    "dec.cond": "decoder.cond",  # num_speakers > 1
}
MAPPING_POSTERIOR_ENCODER = {
    "enc_q.pre": "posterior_encoder.conv_pre",
    "enc_q.enc.in_layers.*": "posterior_encoder.wavenet.in_layers.*",
    "enc_q.enc.res_skip_layers.*": "posterior_encoder.wavenet.res_skip_layers.*",
    "enc_q.enc.cond_layer": "posterior_encoder.wavenet.cond_layer",  # num_speakers > 1
    "enc_q.proj": "posterior_encoder.conv_proj",
}
MAPPING = {
    **MAPPING_TEXT_ENCODER,
    **MAPPING_STOCHASTIC_DURATION_PREDICTOR,
    **MAPPING_FLOW,
    **MAPPING_GENERATOR,
    **MAPPING_POSTERIOR_ENCODER,
    "emb_g": "embed_speaker",  # num_speakers > 1
}
TOP_LEVEL_KEYS = []
IGNORE_KEYS = []


def set_recursively(hf_pointer, key, value, full_name, weight_type):
    for attribute in key.split("."):
        hf_pointer = getattr(hf_pointer, attribute)

    if weight_type is not None:
        hf_shape = getattr(hf_pointer, weight_type).shape
    else:
        hf_shape = hf_pointer.shape

    # strip off the kernel dimension at the end (original weights are Conv1d)
    if key.endswith(".k_proj") or key.endswith(".v_proj") or key.endswith(".q_proj") or key.endswith(".out_proj"):
        value = value.squeeze(-1)

    if hf_shape != value.shape:
        raise ValueError(
            f"Shape of hf {key + ('.' + weight_type if weight_type is not None else '')} is {hf_shape}, but should be"
            f" {value.shape} for {full_name}"
        )

    if weight_type == "weight":
        hf_pointer.weight.data = value
    elif weight_type == "weight_g":
        hf_pointer.weight_g.data = value
    elif weight_type == "weight_v":
        hf_pointer.weight_v.data = value
    elif weight_type == "bias":
        hf_pointer.bias.data = value
    elif weight_type == "running_mean":
        hf_pointer.running_mean.data = value
    elif weight_type == "running_var":
        hf_pointer.running_var.data = value
    elif weight_type == "num_batches_tracked":
        hf_pointer.num_batches_tracked.data = value
    else:
        hf_pointer.data = value

    logger.info(f"{key + ('.' + weight_type if weight_type is not None else '')} was initialized from {full_name}.")


def should_ignore(name, ignore_keys):
    for key in ignore_keys:
        if key.endswith(".*"):
            if name.startswith(key[:-1]):
                return True
        elif ".*." in key:
            prefix, suffix = key.split(".*.")
            if prefix in name and suffix in name:
                return True
        elif key in name:
            return True
    return False
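
Note on `should_ignore`: it supports three pattern shapes, namely a trailing `.*` (prefix match), an inner `.*.` (prefix and suffix both present as substrings), and a plain substring. The sketch below is illustrative only. It repeats the function so the block runs on its own, and the patterns in `example_patterns` are hypothetical, since the script itself ships with `IGNORE_KEYS = []`.

```python
def should_ignore(name, ignore_keys):
    # Same logic as the helper in the conversion script, repeated for a standalone run.
    for key in ignore_keys:
        if key.endswith(".*"):
            # "enc_q.*" ignores every weight whose name starts with "enc_q."
            if name.startswith(key[:-1]):
                return True
        elif ".*." in key:
            # "dec.*.bias" ignores names containing both "dec" and "bias"
            prefix, suffix = key.split(".*.")
            if prefix in name and suffix in name:
                return True
        elif key in name:
            # plain substring match
            return True
    return False


# Hypothetical patterns, chosen only to exercise each branch.
example_patterns = ["enc_q.*", "dec.*.bias", "emb_g"]

results = {
    name: should_ignore(name, example_patterns)
    for name in [
        "enc_q.pre.weight",
        "dec.ups.0.bias",
        "emb_g.weight",
        "flow.flows.0.pre.weight",
    ]
}
print(results)
```

The first three names are ignored and the last one is kept. The inner-wildcard branch uses substring tests rather than a positional match, so a pattern such as `dec.*.bias` is looser than it looks.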


def recursively_load_weights(fairseq_dict, hf_model):
    unused_weights = []

    for name, value in fairseq_dict.items():
        if should_ignore(name, IGNORE_KEYS):
            logger.info(f"{name} was ignored")
            continue

        is_used = False
        for key, mapped_key in MAPPING.items():
            if key.endswith(".*"):
                key = key[:-1]
            elif "*" in key:
                prefix, suffix = key.split(".*.")
                if prefix in name and suffix in name:
                    key = suffix

            if key in name:
                is_used = True
                if mapped_key.endswith(".*"):
                    layer_index = name.split(key)[-1].split(".")[0]
                    mapped_key = mapped_key.replace("*", layer_index)
                elif "*" in mapped_key:
                    layer_index = name.split(key)[0].split(".")[-2]

                    # remap the layer index since we removed the Flip layers
                    if "flow.flows" in mapped_key:
                        layer_index = str(int(layer_index) // 2)
                    if "duration_predictor.flows" in mapped_key or "duration_predictor.post_flows" in mapped_key:
                        layer_index = str(int(layer_index) // 2 + 1)

                    mapped_key = mapped_key.replace("*", layer_index)
                if "weight_g" in name:
                    weight_type = "weight_g"
                elif "weight_v" in name:
                    weight_type = "weight_v"
                elif "bias" in name:
                    weight_type = "bias"
                elif "weight" in name:
                    weight_type = "weight"
                elif "running_mean" in name:
                    weight_type = "running_mean"
                elif "running_var" in name:
                    weight_type = "running_var"
                elif "num_batches_tracked" in name:
                    weight_type = "num_batches_tracked"
                else:
                    weight_type = None
                set_recursively(hf_model, mapped_key, value, name, weight_type)
            continue
        if not is_used:
            unused_weights.append(name)

    logger.warning(f"Unused weights: {unused_weights}")
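
Note on the layer-index remapping: the "we removed the Flip layers" comment implies that flow indices in the original checkpoint are sparser than in the converted model. The sketch below isolates just that arithmetic in a hypothetical helper, `remap_layer_index`. It assumes, as the mapping tables suggest but do not state outright, that the original `flow.flows` list alternates a coupling layer with a Flip layer, and that the duration predictor keeps one extra element at index 0 (the `logs` / `m` entries) ahead of its alternating layers.

```python
def remap_layer_index(mapped_key, layer_index):
    """Hypothetical helper isolating the index arithmetic of recursively_load_weights."""
    if "flow.flows" in mapped_key:
        # original indices 0, 2, 4, 6 -> converted indices 0, 1, 2, 3
        layer_index = str(int(layer_index) // 2)
    if "duration_predictor.flows" in mapped_key or "duration_predictor.post_flows" in mapped_key:
        # original indices 1, 3, 5, 7 -> converted indices 1, 2, 3, 4 (index 0 stays in place)
        layer_index = str(int(layer_index) // 2 + 1)
    return mapped_key.replace("*", layer_index)


flow_keys = [remap_layer_index("flow.flows.*.conv_pre", str(i)) for i in (0, 2, 4, 6)]
dp_keys = [remap_layer_index("duration_predictor.flows.*.conv_pre", str(i)) for i in (1, 3, 5, 7)]
print(flow_keys)
print(dp_keys)
```

Integer division by two collapses each (layer, Flip) pair onto one index. The `+ 1` for the duration predictor keeps slot 0 free for the element that is mapped explicitly by the `dp.flows.0.*` entries.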


@torch.no_grad()
def convert_checkpoint(
    pytorch_dump_folder_path,
    checkpoint_path=None,
    config_path=None,
    vocab_path=None,
    language=None,
    num_speakers=None,
    sampling_rate=None,
    repo_id=None,
):
    """
    Copy/paste/tweak model's weights to transformers design.
    """
    if config_path is not None:
        config = VitsConfig.from_pretrained(config_path)
    else:
        config = VitsConfig()

    if num_speakers:
        config.num_speakers = num_speakers
        config.speaker_embedding_size = 256

    if sampling_rate:
        config.sampling_rate = sampling_rate

    if checkpoint_path is None:
        logger.info(f"***Converting model: facebook/mms-tts {language}***")

        vocab_path = hf_hub_download(
            repo_id="facebook/mms-tts",
            filename="vocab.txt",
            subfolder=f"models/{language}",
        )
        config_file = hf_hub_download(
            repo_id="facebook/mms-tts",
            filename="config.json",
            subfolder=f"models/{language}",
        )
        checkpoint_path = hf_hub_download(
            repo_id="facebook/mms-tts",
            filename="G_100000.pth",
            subfolder=f"models/{language}",
        )

        with open(config_file, "r") as f:
            data = f.read()
            hps = json.loads(data)

        is_uroman = hps["data"]["training_files"].split(".")[-1] == "uroman"
        if is_uroman:
            logger.warning("For this checkpoint, you should use `uroman` to convert input text before tokenizing it!")
    else:
        logger.info(f"***Converting model: {checkpoint_path}***")
        is_uroman = False

    # original VITS checkpoint
    if vocab_path is None:
        _pad = "_"
        _punctuation = ';:,.!?¡¿—…"«»“” '
        _letters = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"
        _letters_ipa = "ɑɐɒæɓʙβɔɕçɗɖðʤəɘɚɛɜɝɞɟʄɡɠɢʛɦɧħɥʜɨɪʝɭɬɫɮʟɱɯɰŋɳɲɴøɵɸθœɶʘɹɺɾɻʀʁɽʂʃʈʧʉʊʋⱱʌɣɤʍχʎʏʑʐʒʔʡʕʢǀǁǂǃˈˌːˑʼʴʰʱʲʷˠˤ˞↓↑→↗↘'̩'ᵻ"
        symbols = _pad + _punctuation + _letters + _letters_ipa
        symbol_to_id = {s: i for i, s in enumerate(symbols)}
        phonemize = True
    else:
        # Save vocab as temporary json file
        symbols = [line.replace("\n", "") for line in open(vocab_path, encoding="utf-8").readlines()]
        symbol_to_id = {s: i for i, s in enumerate(symbols)}
        # MMS-TTS does not use a <pad> token, so we set to the token used to space characters
        _pad = symbols[0]
        phonemize = False

    with tempfile.NamedTemporaryFile() as tf:
        with open(tf.name, "w", encoding="utf-8") as f:
            f.write(json.dumps(symbol_to_id, indent=2, sort_keys=True, ensure_ascii=False) + "\n")

        tokenizer = VitsTokenizer(tf.name, language=language, phonemize=phonemize, is_uroman=is_uroman, pad_token=_pad)

    config.vocab_size = len(symbols)
    model = VitsModel(config)

    model.decoder.apply_weight_norm()

    orig_checkpoint = torch.load(checkpoint_path, map_location=torch.device("cpu"))
    recursively_load_weights(orig_checkpoint["model"], model)

    model.decoder.remove_weight_norm()

    model.save_pretrained(pytorch_dump_folder_path)
    tokenizer.save_pretrained(pytorch_dump_folder_path)

    if repo_id:
        print("Pushing to the hub...")
        tokenizer.push_to_hub(repo_id)
        model.push_to_hub(repo_id)


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--checkpoint_path", default=None, type=str, help="Local path to original checkpoint")
    parser.add_argument("--vocab_path", default=None, type=str, help="Path to vocab.txt")
    parser.add_argument("--config_path", default=None, type=str, help="Path to hf config.json of model to convert")
    parser.add_argument("--language", default=None, type=str, help="Tokenizer language (three-letter code)")
    parser.add_argument("--num_speakers", default=None, type=int, help="Number of speakers")
    parser.add_argument(
        "--sampling_rate", default=None, type=int, help="Sampling rate on which the model was trained."
    )
    parser.add_argument(
        "--pytorch_dump_folder_path", required=True, default=None, type=str, help="Path to the output PyTorch model."
    )
    parser.add_argument(
        "--push_to_hub", default=None, type=str, help="Where to upload the converted model on the 🤗 hub."
    )

    args = parser.parse_args()
    convert_checkpoint(
        args.pytorch_dump_folder_path,
        args.checkpoint_path,
        args.config_path,
        args.vocab_path,
        args.language,
        args.num_speakers,
        args.sampling_rate,
        args.push_to_hub,
    )
1506
src/transformers/models/vits/modeling_vits.py
Normal file
File diff suppressed because it is too large
249
src/transformers/models/vits/tokenization_vits.py
Normal file
@@ -0,0 +1,249 @@
# coding=utf-8
# Copyright 2023 The Kakao Enterprise Authors, the MMS-TTS Authors and the HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Tokenization class for VITS."""


import json
import os
import re
from typing import Any, Dict, List, Optional, Tuple, Union

from ...tokenization_utils import PreTrainedTokenizer
from ...utils import is_phonemizer_available, logging


if is_phonemizer_available():
    import phonemizer


logger = logging.get_logger(__name__)

VOCAB_FILES_NAMES = {"vocab_file": "vocab.json"}

PRETRAINED_VOCAB_FILES_MAP = {
    "vocab_file": {
        "facebook/mms-tts-eng": "https://huggingface.co/facebook/mms-tts-eng/resolve/main/vocab.json",
    }
}

PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = {
    # This model does not have a maximum input length.
    "facebook/mms-tts-eng": 4096,
}


def has_non_roman_characters(input_string):
    # Find any character outside the ASCII range
    non_roman_pattern = re.compile(r"[^\x00-\x7F]")

    # Search the input string for non-Roman characters
    match = non_roman_pattern.search(input_string)
    has_non_roman = match is not None
    return has_non_roman
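
Note on `has_non_roman_characters`: despite its name, the pattern `[^\x00-\x7F]` flags any character outside the ASCII range, so accented Latin letters trigger it as well as non-Latin scripts. The sketch below is illustrative and not part of the commit. It repeats the check in a standalone form, and the sample strings are arbitrary.

```python
import re

# Same pattern as in the tokenizer module: any character outside the ASCII range.
NON_ASCII_PATTERN = re.compile(r"[^\x00-\x7F]")


def has_non_roman_characters(input_string):
    return NON_ASCII_PATTERN.search(input_string) is not None


# Arbitrary sample inputs: plain ASCII, accented Latin, Cyrillic, Romanian.
samples = ["hello world", "héllo", "привет", "ţară"]
flags = {sample: has_non_roman_characters(sample) for sample in samples}
print(flags)
```

Only the first sample passes without a match. This matters for the warning in `prepare_for_tokenization`, which in this version fires for any non-ASCII input, including languages whose vocabulary legitimately contains accented Latin characters.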


class VitsTokenizer(PreTrainedTokenizer):
    """
    Construct a VITS tokenizer. Also supports MMS-TTS.

    This tokenizer inherits from [`PreTrainedTokenizer`] which contains most of the main methods. Users should refer to
    this superclass for more information regarding those methods.

    Args:
        vocab_file (`str`):
            Path to the vocabulary file.
        language (`str`, *optional*):
            Language identifier.
        add_blank (`bool`, *optional*, defaults to `True`):
            Whether to insert token id 0 in between the other tokens.
        normalize (`bool`, *optional*, defaults to `True`):
            Whether to normalize the input text by removing all casing and punctuation.
        phonemize (`bool`, *optional*, defaults to `True`):
            Whether to convert the input text into phonemes.
        is_uroman (`bool`, *optional*, defaults to `False`):
            Whether the `uroman` Romanizer needs to be applied to the input text prior to tokenizing.
    """

    vocab_files_names = VOCAB_FILES_NAMES
    pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP
    max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES
    model_input_names = ["input_ids", "attention_mask"]

    def __init__(
        self,
        vocab_file,
        pad_token="<pad>",
        unk_token="<unk>",
        language=None,
        add_blank=True,
        normalize=True,
        phonemize=True,
        is_uroman=False,
        **kwargs,
    ) -> None:
        super().__init__(
            pad_token=pad_token,
            unk_token=unk_token,
            language=language,
            add_blank=add_blank,
            normalize=normalize,
            phonemize=phonemize,
            is_uroman=is_uroman,
            **kwargs,
        )

        with open(vocab_file, encoding="utf-8") as vocab_handle:
            self.encoder = json.load(vocab_handle)

        self.decoder = {v: k for k, v in self.encoder.items()}
        self.language = language
        self.add_blank = add_blank
        self.normalize = normalize
        self.phonemize = phonemize

        self.is_uroman = is_uroman

    @property
    def vocab_size(self):
        return len(self.encoder)

    def get_vocab(self):
        vocab = {self.convert_ids_to_tokens(i): i for i in range(self.vocab_size)}
        return vocab

    def normalize_text(self, input_string):
        """Lowercase the input string, respecting any special tokens that may be partly or entirely upper-cased."""
        all_vocabulary = list(self.encoder.keys()) + list(self.added_tokens_encoder.keys())
        filtered_text = ""

        i = 0
        while i < len(input_string):
            found_match = False
            for word in all_vocabulary:
                if input_string[i : i + len(word)] == word:
                    filtered_text += word
                    i += len(word)
                    found_match = True
                    break

            if not found_match:
                filtered_text += input_string[i].lower()
                i += 1

        return filtered_text

    def _preprocess_char(self, text):
        """Special treatment of characters in certain languages"""
        if self.language == "ron":
            text = text.replace("ț", "ţ")
        return text
|
||||||
|
|
||||||
|
    def prepare_for_tokenization(
        self, text: str, is_split_into_words: bool = False, normalize: Optional[bool] = None, **kwargs
    ) -> Tuple[str, Dict[str, Any]]:
        """
        Performs any necessary transformations before tokenization.

        This method should pop the arguments from kwargs and return the remaining `kwargs` as well. We test the
        `kwargs` at the end of the encoding process to be sure all the arguments have been used.

        Args:
            text (`str`):
                The text to prepare.
            is_split_into_words (`bool`, *optional*, defaults to `False`):
                Whether or not the input is already pre-tokenized (e.g., split into words). If set to `True`, the
                tokenizer assumes the input is already split into words (for instance, by splitting it on whitespace)
                which it will tokenize.
            normalize (`bool`, *optional*, defaults to `None`):
                Whether or not to apply punctuation and casing normalization to the text inputs. Typically, VITS is
                trained on lower-cased and un-punctuated text. Hence, normalization is used to ensure that the input
                text consists only of lower-case characters.
            kwargs (`Dict[str, Any]`, *optional*):
                Keyword arguments to use for the tokenization.

        Returns:
            `Tuple[str, Dict[str, Any]]`: The prepared text and the unused kwargs.
        """
        normalize = normalize if normalize is not None else self.normalize

        if normalize:
            # normalise for casing
            text = self.normalize_text(text)

        filtered_text = self._preprocess_char(text)

        if has_non_roman_characters(filtered_text):
            logger.warning(
                "Text to the tokenizer contains non-Roman characters. Ensure the `uroman` Romanizer is "
                "applied to the text prior to passing it to the tokenizer. See "
                "`https://github.com/isi-nlp/uroman` for details."
            )

        if self.phonemize:
            if not is_phonemizer_available():
                raise ImportError("Please install the `phonemizer` Python package to use this tokenizer.")

            filtered_text = phonemizer.phonemize(
                filtered_text,
                language="en-us",
                backend="espeak",
                strip=True,
                preserve_punctuation=True,
                with_stress=True,
            )
            filtered_text = re.sub(r"\s+", " ", filtered_text)
        elif normalize:
            # strip any chars outside of the vocab (punctuation)
            filtered_text = "".join(list(filter(lambda char: char in self.encoder, filtered_text))).strip()

        return filtered_text, kwargs

    def _tokenize(self, text: str) -> List[str]:
        """Tokenize a string by inserting the `<pad>` token at the boundary between adjacent characters."""
        tokens = list(text)

        if self.add_blank:
            interspersed = [self._convert_id_to_token(0)] * (len(tokens) * 2 + 1)
            interspersed[1::2] = tokens
            tokens = interspersed

        return tokens

    def convert_tokens_to_string(self, tokens: List[str]) -> str:
        if self.add_blank and len(tokens) > 1:
            tokens = tokens[1::2]
        return "".join(tokens)

    def _convert_token_to_id(self, token):
        """Converts a token (str) to an id using the vocab."""
        return self.encoder.get(token, self.encoder.get(self.unk_token))

    def _convert_id_to_token(self, index):
        """Converts an index (integer) to a token (str) using the vocab."""
        return self.decoder.get(index)

    def save_vocabulary(self, save_directory: str, filename_prefix: Optional[str] = None) -> Union[Tuple[str], None]:
        if not os.path.isdir(save_directory):
            logger.error(f"Vocabulary path ({save_directory}) should be a directory")
            return

        vocab_file = os.path.join(
            save_directory, (filename_prefix + "-" if filename_prefix else "") + VOCAB_FILES_NAMES["vocab_file"]
        )

        with open(vocab_file, "w", encoding="utf-8") as f:
            f.write(json.dumps(self.encoder, indent=2, sort_keys=True, ensure_ascii=False) + "\n")

        return (vocab_file,)
|
||||||
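Review note: `_tokenize` and `convert_tokens_to_string` in the tokenizer file are inverses when `add_blank` is set. The first places a blank token before, between, and after every character, and the second keeps only the odd positions. The sketch below reproduces that round trip as two free functions. The names `intersperse_blank` and `remove_blanks` are hypothetical, and the blank is assumed to be the string `"<pad>"`, whereas the tokenizer takes it from `_convert_id_to_token(0)`, so the actual blank depends on the vocabulary.

```python
from typing import List


def intersperse_blank(tokens: List[str], blank: str = "<pad>") -> List[str]:
    """Return `tokens` with `blank` before, between, and after every token."""
    # n tokens need n + 1 blanks, hence 2n + 1 slots in total.
    interspersed = [blank] * (len(tokens) * 2 + 1)
    # The odd positions (1, 3, 5, ...) receive the original tokens.
    interspersed[1::2] = tokens
    return interspersed


def remove_blanks(tokens: List[str]) -> List[str]:
    """Undo `intersperse_blank` by keeping only the odd positions."""
    return tokens[1::2] if len(tokens) > 1 else tokens


if __name__ == "__main__":
    padded = intersperse_blank(list("hi"))
    print(padded)
    print("".join(remove_blanks(padded)))
```

The `len(tokens) > 1` check mirrors the guard in `convert_tokens_to_string`, which leaves a single-token sequence untouched.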
@@ -7929,6 +7929,23 @@ class VitDetPreTrainedModel(metaclass=DummyObject):
        requires_backends(self, ["torch"])


VITS_PRETRAINED_MODEL_ARCHIVE_LIST = None


class VitsModel(metaclass=DummyObject):
    _backends = ["torch"]

    def __init__(self, *args, **kwargs):
        requires_backends(self, ["torch"])


class VitsPreTrainedModel(metaclass=DummyObject):
    _backends = ["torch"]

    def __init__(self, *args, **kwargs):
        requires_backends(self, ["torch"])


VIVIT_PRETRAINED_MODEL_ARCHIVE_LIST = None

0
tests/models/vits/__init__.py
Normal file
377
tests/models/vits/test_modeling_vits.py
Normal file
@@ -0,0 +1,377 @@
|
|||||||
|
# coding=utf-8
|
||||||
|
# Copyright 2023 The HuggingFace Inc. team. All rights reserved.
|
||||||
|
#
|
||||||
|
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||||
|
# you may not use this file except in compliance with the License.
|
||||||
|
# You may obtain a copy of the License at
|
||||||
|
#
|
||||||
|
# http://www.apache.org/licenses/LICENSE-2.0
|
||||||
|
#
|
||||||
|
# Unless required by applicable law or agreed to in writing, software
|
||||||
|
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||||
|
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||||
|
# See the License for the specific language governing permissions and
|
||||||
|
# limitations under the License.
|
||||||
|
""" Testing suite for the PyTorch VITS model. """
|
||||||
|
|
||||||
|
import copy
|
||||||
|
import os
|
||||||
|
import tempfile
|
||||||
|
import unittest
|
||||||
|
from typing import Dict, List, Tuple
|
||||||
|
|
||||||
|
import numpy as np
|
||||||
|
|
||||||
|
from transformers import PretrainedConfig, VitsConfig
|
||||||
|
from transformers.testing_utils import (
|
||||||
|
is_torch_available,
|
||||||
|
require_torch,
|
||||||
|
slow,
|
||||||
|
torch_device,
|
||||||
|
)
|
||||||
|
from transformers.trainer_utils import set_seed
|
||||||
|
|
||||||
|
from ...test_configuration_common import ConfigTester
|
||||||
|
from ...test_modeling_common import (
|
||||||
|
ModelTesterMixin,
|
||||||
|
global_rng,
|
||||||
|
ids_tensor,
|
||||||
|
random_attention_mask,
|
||||||
|
)
|
||||||
|
|
||||||
|
|
||||||
|
if is_torch_available():
|
||||||
|
import torch
|
||||||
|
|
||||||
|
from transformers import VitsModel, VitsTokenizer
|
||||||
|
|
||||||
|
|
||||||
|
CONFIG_NAME = "config.json"
|
||||||
|
GENERATION_CONFIG_NAME = "generation_config.json"
|
||||||
|
|
||||||
|
|
||||||
|
def _config_zero_init(config):
|
||||||
|
configs_no_init = copy.deepcopy(config)
|
||||||
|
for key in configs_no_init.__dict__.keys():
|
||||||
|
if "_range" in key or "_std" in key or "initializer_factor" in key or "layer_scale" in key:
|
||||||
|
setattr(configs_no_init, key, 1e-10)
|
||||||
|
if isinstance(getattr(configs_no_init, key, None), PretrainedConfig):
|
||||||
|
no_init_subconfig = _config_zero_init(getattr(configs_no_init, key))
|
||||||
|
setattr(configs_no_init, key, no_init_subconfig)
|
||||||
|
return configs_no_init
|
||||||
|
|
||||||
|
|
||||||
|
@require_torch
|
||||||
|
class VitsModelTester:
|
||||||
|
def __init__(
|
||||||
|
self,
|
||||||
|
parent,
|
||||||
|
batch_size=2,
|
||||||
|
seq_length=7,
|
||||||
|
is_training=False,
|
||||||
|
hidden_size=16,
|
||||||
|
num_hidden_layers=2,
|
||||||
|
num_attention_heads=2,
|
||||||
|
intermediate_size=64,
|
||||||
|
flow_size=16,
|
||||||
|
vocab_size=38,
|
||||||
|
spectrogram_bins=8,
|
||||||
|
duration_predictor_num_flows=2,
|
||||||
|
duration_predictor_filter_channels=16,
|
||||||
|
prior_encoder_num_flows=2,
|
||||||
|
upsample_initial_channel=16,
|
||||||
|
):
|
||||||
|
self.parent = parent
|
||||||
|
self.batch_size = batch_size
|
||||||
|
self.seq_length = seq_length
|
||||||
|
self.is_training = is_training
|
||||||
|
self.hidden_size = hidden_size
|
||||||
|
self.num_hidden_layers = num_hidden_layers
|
||||||
|
self.num_attention_heads = num_attention_heads
|
||||||
|
self.intermediate_size = intermediate_size
|
||||||
|
self.flow_size = flow_size
|
||||||
|
self.vocab_size = vocab_size
|
||||||
|
self.spectrogram_bins = spectrogram_bins
|
||||||
|
self.duration_predictor_num_flows = duration_predictor_num_flows
|
||||||
|
self.duration_predictor_filter_channels = duration_predictor_filter_channels
|
||||||
|
self.prior_encoder_num_flows = prior_encoder_num_flows
|
||||||
|
self.upsample_initial_channel = upsample_initial_channel
|
||||||
|
|
||||||
|
def prepare_config_and_inputs(self):
|
||||||
|
input_ids = ids_tensor([self.batch_size, self.seq_length], self.vocab_size).clamp(2)
|
||||||
|
attention_mask = random_attention_mask([self.batch_size, self.seq_length])
|
||||||
|
|
||||||
|
config = self.get_config()
|
||||||
|
inputs_dict = {
|
||||||
|
"input_ids": input_ids,
|
||||||
|
"attention_mask": attention_mask,
|
||||||
|
}
|
||||||
|
return config, inputs_dict
|
||||||
|
|
||||||
|
def prepare_config_and_inputs_for_common(self):
|
||||||
|
config, inputs_dict = self.prepare_config_and_inputs()
|
||||||
|
return config, inputs_dict
|
||||||
|
|
||||||
|
def get_config(self):
|
||||||
|
return VitsConfig(
|
||||||
|
hidden_size=self.hidden_size,
|
||||||
|
num_hidden_layers=self.num_hidden_layers,
|
||||||
|
num_attention_heads=self.num_attention_heads,
|
||||||
|
ffn_dim=self.intermediate_size,
|
||||||
|
flow_size=self.flow_size,
|
||||||
|
vocab_size=self.vocab_size,
|
||||||
|
spectrogram_bins=self.spectrogram_bins,
|
||||||
|
duration_predictor_num_flows=self.duration_predictor_num_flows,
|
||||||
|
prior_encoder_num_flows=self.prior_encoder_num_flows,
|
||||||
|
duration_predictor_filter_channels=self.duration_predictor_filter_channels,
|
||||||
|
posterior_encoder_num_wavenet_layers=self.num_hidden_layers,
|
||||||
|
upsample_initial_channel=self.upsample_initial_channel,
|
||||||
|
)
|
||||||
|
|
||||||
|
def create_and_check_model_forward(self, config, inputs_dict):
|
||||||
|
model = VitsModel(config=config).to(torch_device).eval()
|
||||||
|
|
||||||
|
input_ids = inputs_dict["input_ids"]
|
||||||
|
attention_mask = inputs_dict["attention_mask"]
|
||||||
|
|
||||||
|
result = model(input_ids, attention_mask=attention_mask)
|
||||||
|
self.parent.assertEqual(result.waveform.shape, (self.batch_size, 11008))
|
||||||
|
|
||||||
|
|
||||||
|
@require_torch
|
||||||
|
class VitsModelTest(ModelTesterMixin, unittest.TestCase):
|
||||||
|
all_model_classes = (VitsModel,) if is_torch_available() else ()
|
||||||
|
is_encoder_decoder = False
|
||||||
|
test_pruning = False
|
||||||
|
test_headmasking = False
|
||||||
|
test_resize_embeddings = False
|
||||||
|
test_head_masking = False
|
||||||
|
test_torchscript = False
|
||||||
|
has_attentions = False
|
||||||
|
|
||||||
|
input_name = "input_ids"
|
||||||
|
|
||||||
|
def setUp(self):
|
||||||
|
self.model_tester = VitsModelTester(self)
|
||||||
|
self.config_tester = ConfigTester(self, config_class=VitsConfig, hidden_size=37)
|
||||||
|
|
||||||
|
def test_config(self):
|
||||||
|
self.config_tester.run_common_tests()
|
||||||
|
|
||||||
|
def test_model_forward(self):
|
||||||
|
set_seed(12345)
|
||||||
|
global_rng.seed(12345)
|
||||||
|
config_and_inputs = self.model_tester.prepare_config_and_inputs()
|
||||||
|
self.model_tester.create_and_check_model_forward(*config_and_inputs)
|
||||||
|
|
||||||
|
@unittest.skip("VITS is not deterministic")
|
||||||
|
def test_determinism(self):
|
||||||
|
pass
|
||||||
|
|
||||||
|
def test_initialization(self):
|
||||||
|
config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()
|
||||||
|
|
||||||
|
configs_no_init = _config_zero_init(config)
|
||||||
|
for model_class in self.all_model_classes:
|
||||||
|
model = model_class(config=configs_no_init)
|
||||||
|
for name, param in model.named_parameters():
|
||||||
|
uniform_init_parms = [
|
||||||
|
"emb_rel_k",
|
||||||
|
"emb_rel_v",
|
||||||
|
"conv_1",
|
||||||
|
"conv_2",
|
||||||
|
"conv_pre",
|
||||||
|
"conv_post",
|
||||||
|
"conv_proj",
|
||||||
|
"conv_dds",
|
||||||
|
"project",
|
||||||
|
"wavenet.in_layers",
|
||||||
|
"wavenet.res_skip_layers",
|
||||||
|
"upsampler",
|
||||||
|
"resblocks",
|
||||||
|
]
|
||||||
|
if param.requires_grad:
|
||||||
|
if any(x in name for x in uniform_init_parms):
|
||||||
|
self.assertTrue(
|
||||||
|
-1.0 <= ((param.data.mean() * 1e9).round() / 1e9).item() <= 1.0,
|
||||||
|
msg=f"Parameter {name} of model {model_class} seems not properly initialized",
|
||||||
|
)
|
||||||
|
else:
|
||||||
|
self.assertIn(
|
||||||
|
((param.data.mean() * 1e9).round() / 1e9).item(),
|
||||||
|
[0.0, 1.0],
|
||||||
|
msg=f"Parameter {name} of model {model_class} seems not properly initialized",
|
||||||
|
)
|
||||||
|
|
||||||
|
@unittest.skip("VITS has no inputs_embeds")
|
||||||
|
def test_inputs_embeds(self):
|
||||||
|
pass
|
||||||
|
|
||||||
|
@unittest.skip("VITS has no input embeddings")
|
||||||
|
def test_model_common_attributes(self):
|
||||||
|
pass
|
||||||
|
|
||||||
|
# override since the model is not deterministic, so we need to set the seed for each forward pass
|
||||||
|
def test_model_outputs_equivalence(self):
|
||||||
|
config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()
|
||||||
|
|
||||||
|
def set_nan_tensor_to_zero(t):
|
||||||
|
t[t != t] = 0
|
||||||
|
return t
|
||||||
|
|
||||||
|
def check_equivalence(model, tuple_inputs, dict_inputs, additional_kwargs={}):
|
||||||
|
with torch.no_grad():
|
||||||
|
set_seed(0)
|
||||||
|
tuple_output = model(**tuple_inputs, return_dict=False, **additional_kwargs)
|
||||||
|
set_seed(0)
|
||||||
|
dict_output = model(**dict_inputs, return_dict=True, **additional_kwargs).to_tuple()
|
||||||
|
|
||||||
|
def recursive_check(tuple_object, dict_object):
|
||||||
|
if isinstance(tuple_object, (List, Tuple)):
|
||||||
|
for tuple_iterable_value, dict_iterable_value in zip(tuple_object, dict_object):
|
||||||
|
recursive_check(tuple_iterable_value, dict_iterable_value)
|
||||||
|
elif isinstance(tuple_object, Dict):
|
||||||
|
for tuple_iterable_value, dict_iterable_value in zip(
|
||||||
|
tuple_object.values(), dict_object.values()
|
||||||
|
):
|
||||||
|
recursive_check(tuple_iterable_value, dict_iterable_value)
|
||||||
|
elif tuple_object is None:
|
||||||
|
return
|
||||||
|
else:
|
||||||
|
self.assertTrue(
|
||||||
|
torch.allclose(
|
||||||
|
set_nan_tensor_to_zero(tuple_object), set_nan_tensor_to_zero(dict_object), atol=1e-5
|
||||||
|
),
|
||||||
|
msg=(
|
||||||
|
"Tuple and dict output are not equal. Difference:"
|
||||||
|
f" {torch.max(torch.abs(tuple_object - dict_object))}. Tuple has `nan`:"
|
||||||
|
f" {torch.isnan(tuple_object).any()} and `inf`: {torch.isinf(tuple_object).any()}. Dict has"
f" `nan`: {torch.isnan(dict_object).any()} and `inf`: {torch.isinf(dict_object).any()}."
|
||||||
|
),
|
||||||
|
)
|
||||||
|
|
||||||
|
recursive_check(tuple_output, dict_output)
|
||||||
|
|
||||||
|
for model_class in self.all_model_classes:
|
||||||
|
model = model_class(config)
|
||||||
|
model.to(torch_device)
|
||||||
|
model.eval()
|
||||||
|
|
||||||
|
tuple_inputs = self._prepare_for_class(inputs_dict, model_class)
|
||||||
|
dict_inputs = self._prepare_for_class(inputs_dict, model_class)
|
||||||
|
check_equivalence(model, tuple_inputs, dict_inputs)
|
||||||
|
|
||||||
|
tuple_inputs = self._prepare_for_class(inputs_dict, model_class, return_labels=True)
|
||||||
|
dict_inputs = self._prepare_for_class(inputs_dict, model_class, return_labels=True)
|
||||||
|
check_equivalence(model, tuple_inputs, dict_inputs)
|
||||||
|
|
||||||
|
tuple_inputs = self._prepare_for_class(inputs_dict, model_class)
|
||||||
|
dict_inputs = self._prepare_for_class(inputs_dict, model_class)
|
||||||
|
check_equivalence(model, tuple_inputs, dict_inputs, {"output_hidden_states": True})
|
||||||
|
|
||||||
|
tuple_inputs = self._prepare_for_class(inputs_dict, model_class, return_labels=True)
|
||||||
|
dict_inputs = self._prepare_for_class(inputs_dict, model_class, return_labels=True)
|
||||||
|
check_equivalence(model, tuple_inputs, dict_inputs, {"output_hidden_states": True})
|
||||||
|
|
||||||
|
if self.has_attentions:
|
||||||
|
tuple_inputs = self._prepare_for_class(inputs_dict, model_class)
|
||||||
|
dict_inputs = self._prepare_for_class(inputs_dict, model_class)
|
||||||
|
check_equivalence(model, tuple_inputs, dict_inputs, {"output_attentions": True})
|
||||||
|
|
||||||
|
tuple_inputs = self._prepare_for_class(inputs_dict, model_class, return_labels=True)
|
||||||
|
dict_inputs = self._prepare_for_class(inputs_dict, model_class, return_labels=True)
|
||||||
|
check_equivalence(model, tuple_inputs, dict_inputs, {"output_attentions": True})
|
||||||
|
|
||||||
|
tuple_inputs = self._prepare_for_class(inputs_dict, model_class, return_labels=True)
|
||||||
|
dict_inputs = self._prepare_for_class(inputs_dict, model_class, return_labels=True)
|
||||||
|
check_equivalence(
|
||||||
|
model, tuple_inputs, dict_inputs, {"output_hidden_states": True, "output_attentions": True}
|
||||||
|
)
|
||||||
|
|
||||||
|
# override since the model is not deterministic, so we need to set the seed for each forward pass
|
||||||
|
def test_save_load(self):
|
||||||
|
config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()
|
||||||
|
|
||||||
|
def check_save_load(out1, out2):
|
||||||
|
# make sure we don't have nans
|
||||||
|
out_2 = out2.cpu().numpy()
|
||||||
|
out_2[np.isnan(out_2)] = 0
|
||||||
|
|
||||||
|
out_1 = out1.cpu().numpy()
|
||||||
|
out_1[np.isnan(out_1)] = 0
|
||||||
|
max_diff = np.amax(np.abs(out_1 - out_2))
|
||||||
|
self.assertLessEqual(max_diff, 1e-5)
|
||||||
|
|
||||||
|
for model_class in self.all_model_classes:
|
||||||
|
model = model_class(config)
|
||||||
|
model.to(torch_device)
|
||||||
|
model.eval()
|
||||||
|
with torch.no_grad():
|
||||||
|
set_seed(0)
|
||||||
|
first = model(**self._prepare_for_class(inputs_dict, model_class))[0]
|
||||||
|
|
||||||
|
with tempfile.TemporaryDirectory() as tmpdirname:
|
||||||
|
model.save_pretrained(tmpdirname)
|
||||||
|
|
||||||
|
# the config file (and the generation config file, if it can generate) should be saved
|
||||||
|
self.assertTrue(os.path.exists(os.path.join(tmpdirname, CONFIG_NAME)))
|
||||||
|
self.assertEqual(
|
||||||
|
model.can_generate(), os.path.exists(os.path.join(tmpdirname, GENERATION_CONFIG_NAME))
|
||||||
|
)
|
||||||
|
|
||||||
|
model = model_class.from_pretrained(tmpdirname)
|
||||||
|
model.to(torch_device)
|
||||||
|
with torch.no_grad():
|
||||||
|
set_seed(0)
|
||||||
|
second = model(**self._prepare_for_class(inputs_dict, model_class))[0]
|
||||||
|
|
||||||
|
if isinstance(first, tuple) and isinstance(second, tuple):
|
||||||
|
for tensor1, tensor2 in zip(first, second):
|
||||||
|
check_save_load(tensor1, tensor2)
|
||||||
|
else:
|
||||||
|
check_save_load(first, second)
|
||||||
|
|
||||||
|
# overwrite from test_modeling_common
|
||||||
|
def _mock_init_weights(self, module):
|
||||||
|
if hasattr(module, "weight") and module.weight is not None:
|
||||||
|
module.weight.data.fill_(3)
|
||||||
|
if hasattr(module, "weight_g") and module.weight_g is not None:
|
||||||
|
module.weight_g.data.fill_(3)
|
||||||
|
if hasattr(module, "weight_v") and module.weight_v is not None:
|
||||||
|
module.weight_v.data.fill_(3)
|
||||||
|
if hasattr(module, "bias") and module.bias is not None:
|
||||||
|
module.bias.data.fill_(3)
|
||||||
|
|
||||||
|
|
||||||
|
@require_torch
@slow
class VitsModelIntegrationTests(unittest.TestCase):
    def test_forward(self):
        # GPU gives different results than CPU
        torch_device = "cpu"

        model = VitsModel.from_pretrained("facebook/mms-tts-eng")
        model.to(torch_device)

        tokenizer = VitsTokenizer.from_pretrained("facebook/mms-tts-eng")

        set_seed(555)  # make deterministic

        input_text = "Mister quilter is the apostle of the middle classes and we are glad to welcome his gospel!"
        input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to(torch_device)

        with torch.no_grad():
            outputs = model(input_ids)

        self.assertEqual(outputs.waveform.shape, (1, 87040))
        # fmt: off
        EXPECTED_LOGITS = torch.tensor(
            [
                -0.0042, 0.0176, 0.0354, 0.0504, 0.0621, 0.0777, 0.0980, 0.1224,
                0.1475, 0.1679, 0.1817, 0.1832, 0.1713, 0.1542, 0.1384, 0.1256,
                0.1147, 0.1066, 0.1026, 0.0958, 0.0823, 0.0610, 0.0340, 0.0022,
                -0.0337, -0.0677, -0.0969, -0.1178, -0.1311, -0.1363
            ]
        )
        # fmt: on
        self.assertTrue(torch.allclose(outputs.waveform[0, 10000:10030].cpu(), EXPECTED_LOGITS, atol=1e-4))
|
||||||
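Review note: the overridden tests in this file call `set_seed` before every forward pass because the model samples noise internally, so two calls only agree when they start from the same RNG state. The sketch below shows the same idea without PyTorch. It uses Python's `random` module as a stand-in, and `noisy_forward` and `max_abs_diff` are hypothetical helpers, not part of the VITS code.

```python
import random
from typing import List


def noisy_forward(inputs: List[float]) -> List[float]:
    """Stand-in for a stochastic model: adds Gaussian noise to each input."""
    return [x + random.gauss(0.0, 1.0) for x in inputs]


def max_abs_diff(a: List[float], b: List[float]) -> float:
    """Largest element-wise absolute difference between two equal-length lists."""
    return max(abs(x - y) for x, y in zip(a, b))


if __name__ == "__main__":
    inputs = [0.1, 0.2, 0.3]

    random.seed(0)
    first = noisy_forward(inputs)
    random.seed(0)  # reseed so the second call draws the same noise
    second = noisy_forward(inputs)
    print("reseeded outputs match:", max_abs_diff(first, second) <= 1e-5)

    third = noisy_forward(inputs)  # no reseed: the RNG state has advanced
    print("unseeded difference:", max_abs_diff(first, third))
```

Seeding once at the start of a test is not enough for this kind of comparison, since the first call consumes random numbers and leaves the generator in a different state for the second.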
187
tests/models/vits/test_tokenization_vits.py
Normal file
@@ -0,0 +1,187 @@
|
|||||||
|
# coding=utf-8
|
||||||
|
# Copyright 2023 The HuggingFace Team. All rights reserved.
|
||||||
|
#
|
||||||
|
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||||
|
# you may not use this file except in compliance with the License.
|
||||||
|
# You may obtain a copy of the License at
|
||||||
|
#
|
||||||
|
# http://www.apache.org/licenses/LICENSE-2.0
|
||||||
|
#
|
||||||
|
# Unless required by applicable law or agreed to in writing, software
|
||||||
|
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||||
|
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||||
|
# See the License for the specific language governing permissions and
|
||||||
|
# limitations under the License.
|
||||||
|
"""Tests for the VITS tokenizer."""
|
||||||
|
import json
|
||||||
|
import os
|
||||||
|
import shutil
|
||||||
|
import tempfile
|
||||||
|
import unittest
|
||||||
|
|
||||||
|
from transformers import VitsTokenizer
|
||||||
|
from transformers.models.vits.tokenization_vits import VOCAB_FILES_NAMES
|
||||||
|
from transformers.testing_utils import slow
|
||||||
|
|
||||||
|
from ...test_tokenization_common import TokenizerTesterMixin
|
||||||
|
|
||||||
|
|
||||||
|
class VitsTokenizerTest(TokenizerTesterMixin, unittest.TestCase):
|
||||||
|
tokenizer_class = VitsTokenizer
|
||||||
|
test_rust_tokenizer = False
|
||||||
|
|
||||||
|
def setUp(self):
|
||||||
|
super().setUp()
|
||||||
|
|
||||||
|
vocab = (
|
||||||
|
"k ' z y u d h e s w – 3 c p - 1 j m i X f l o 0 b r a 4 2 n _ x v t q 5 6 g ț ţ < > | <pad> <unk>".split(
|
||||||
|
" "
|
||||||
|
)
|
||||||
|
)
|
||||||
|
vocab_tokens = dict(zip(vocab, range(len(vocab))))
|
||||||
|
vocab_tokens[" "] = vocab_tokens["X"]
|
||||||
|
del vocab_tokens["X"]
|
||||||
|
|
||||||
|
self.special_tokens_map = {"pad_token": "<pad>", "unk_token": "<unk>"}
|
||||||
|
|
||||||
|
self.tmpdirname = tempfile.mkdtemp()
|
||||||
|
self.vocab_file = os.path.join(self.tmpdirname, VOCAB_FILES_NAMES["vocab_file"])
|
||||||
|
with open(self.vocab_file, "w", encoding="utf-8") as fp:
|
||||||
|
fp.write(json.dumps(vocab_tokens) + "\n")
|
||||||
|
|
||||||
|
def get_tokenizer(self, **kwargs):
|
||||||
|
kwargs.update(self.special_tokens_map)
|
||||||
|
kwargs["phonemize"] = False
|
||||||
|
kwargs["normalize"] = False
|
||||||
|
return VitsTokenizer.from_pretrained(self.tmpdirname, **kwargs)
|
||||||
|
|
||||||
|
def get_clean_sequence(self, tokenizer, with_prefix_space=False, max_length=20, min_length=5):
|
||||||
|
txt = "beyonce lives in los angeles"
|
||||||
|
ids = tokenizer.encode(txt, add_special_tokens=False)
|
||||||
|
return txt, ids
|
||||||
|
|
||||||
|
@unittest.skip("Adding multicharacter tokens does not work with the VITS tokenizer")
|
||||||
|
def test_add_tokens_tokenizer(self):
|
||||||
|
pass
|
||||||
|
|
||||||
|
@unittest.skip("Adding multicharacter tokens does not work with the VITS tokenizer")
|
||||||
|
def test_encode_decode_with_spaces(self):
|
||||||
|
pass
|
||||||
|
|
||||||
|
@unittest.skip("The VITS tokenizer does not support `is_split_into_words`")
|
||||||
|
def test_pretokenized_inputs(self):
|
||||||
|
pass
|
||||||
|
|
||||||
|
def test_save_and_load_tokenizer(self):
|
||||||
|
# safety check on max_len default value so we are sure the test works
|
||||||
|
tokenizers = self.get_tokenizers()
|
||||||
|
for tokenizer in tokenizers:
|
||||||
|
with self.subTest(f"{tokenizer.__class__.__name__}"):
|
||||||
|
self.assertNotEqual(tokenizer.model_max_length, 42)
|
||||||
|
|
||||||
|
# Now let's start the test
|
||||||
|
tokenizers = self.get_tokenizers()
|
||||||
|
for tokenizer in tokenizers:
|
||||||
|
with self.subTest(f"{tokenizer.__class__.__name__}"):
|
||||||
|
# Isolate this from the other tests because we save additional tokens/etc
|
||||||
|
tmpdirname = tempfile.mkdtemp()
|
||||||
|
|
||||||
|
sample_text = " He is very happy, UNwant\u00E9d,running"
|
||||||
|
before_tokens = tokenizer.encode(sample_text, add_special_tokens=False)
|
||||||
|
before_vocab = tokenizer.get_vocab()
|
||||||
|
tokenizer.save_pretrained(tmpdirname)
|
||||||
|
|
||||||
|
after_tokenizer = tokenizer.__class__.from_pretrained(tmpdirname)
|
||||||
|
after_tokens = after_tokenizer.encode(sample_text, add_special_tokens=False)
|
||||||
|
after_vocab = after_tokenizer.get_vocab()
|
||||||
|
self.assertListEqual(before_tokens, after_tokens)
|
||||||
|
self.assertDictEqual(before_vocab, after_vocab)
|
||||||
|
|
||||||
|
shutil.rmtree(tmpdirname)
|
||||||
|
|
||||||
|
@unittest.skip("Adding multicharacter tokens does not work with the VITS tokenizer")
|
||||||
|
def test_special_tokens_initialization_with_non_empty_additional_special_tokens(self):
|
||||||
|
pass
|
||||||
|
|
||||||
|
def test_ron_normalization(self):
|
||||||
|
tokenizer = self.get_tokenizer()
|
||||||
|
tokenizer.language = "ron"
|
||||||
|
|
||||||
|
sequences = ["vițs"]
|
||||||
|
normalized_sequences = ["viţs"]
|
||||||
|
|
||||||
|
encoded_ids = tokenizer(sequences, normalize=True)["input_ids"]
|
||||||
|
decoded_sequences = tokenizer.batch_decode(encoded_ids)
|
||||||
|
self.assertEqual(normalized_sequences, decoded_sequences)
|
||||||
|
|
||||||
|
def test_normalization(self):
|
||||||
|
tokenizer = self.get_tokenizer()
|
||||||
|
|
||||||
|
sequences = ["VITS; is a model for t-t-s!"]
|
||||||
|
normalized_sequences = ["vits is a model for t-t-s"]
|
||||||
|
unnormalized_sequences = [
|
||||||
|
"<unk><unk><unk><unk><unk> is a model for t-t-s<unk>"
|
||||||
|
] # can't handle upper-case letters or certain punctuation marks
|
||||||
|
|
||||||
|
encoded_normalized_ids = tokenizer(sequences, normalize=True)
|
||||||
|
encoded_unnormalized_ids = tokenizer(sequences, normalize=False)
|
||||||
|
|
||||||
|
decoded_normalized_sequences = [
|
||||||
|
tokenizer.decode(seq, skip_special_tokens=False) for seq in encoded_normalized_ids["input_ids"]
|
||||||
|
]
|
||||||
|
decoded_unnormalized_sequences = [
|
||||||
|
tokenizer.decode(seq, skip_special_tokens=False) for seq in encoded_unnormalized_ids["input_ids"]
|
||||||
|
]
|
||||||
|
|
||||||
|
self.assertEqual(decoded_normalized_sequences, normalized_sequences)
|
||||||
|
self.assertEqual(decoded_unnormalized_sequences, unnormalized_sequences)
|
||||||
|
|
||||||
|
@slow
|
||||||
|
def test_tokenizer_integration(self):
|
||||||
|
sequences = [
|
||||||
|
"BERT is designed to pre-train deep bidirectional representations from unlabeled text by jointly "
|
||||||
|
"conditioning on both left and right context in all layers.",
|
||||||
|
"The quick brown fox! Jumps over the lazy dog...",
|
||||||
|
"We use k as our padding token",
|
||||||
|
]
|
||||||
|
|
||||||
|
normalized_sequences = [
|
||||||
|
"bert is designed to pre-train deep bidirectional representations from unlabeled text by jointly "
|
||||||
|
"conditioning on both left and right context in all layers",
|
||||||
|
"the quick brown fox jumps over the lazy dog",
|
||||||
|
"we use k as our padding token",
|
||||||
|
]
|
||||||
|
|
||||||
|
# fmt: off
|
||||||
|
expected_encoding = {
|
||||||
|
'input_ids': [
|
||||||
|
[0, 24, 0, 7, 0, 25, 0, 33, 0, 19, 0, 18, 0, 8, 0, 19, 0, 5, 0, 7, 0, 8, 0, 18, 0, 37, 0, 29, 0, 7, 0, 5, 0, 19, 0, 33, 0, 22, 0, 19, 0, 13, 0, 25, 0, 7, 0, 14, 0, 33, 0, 25, 0, 26, 0, 18, 0, 29, 0, 19, 0, 5, 0, 7, 0, 7, 0, 13, 0, 19, 0, 24, 0, 18, 0, 5, 0, 18, 0, 25, 0, 7, 0, 12, 0, 33, 0, 18, 0, 22, 0, 29, 0, 26, 0, 21, 0, 19, 0, 25, 0, 7, 0, 13, 0, 25, 0, 7, 0, 8, 0, 7, 0, 29, 0, 33, 0, 26, 0, 33, 0, 18, 0, 22, 0, 29, 0, 8, 0, 19, 0, 20, 0, 25, 0, 22, 0, 17, 0, 19, 0, 4, 0, 29, 0, 21, 0, 26, 0, 24, 0, 7, 0, 21, 0, 7, 0, 5, 0, 19, 0, 33, 0, 7, 0, 31, 0, 33, 0, 19, 0, 24, 0, 3, 0, 19, 0, 16, 0, 22, 0, 18, 0, 29, 0, 33, 0, 21, 0, 3, 0, 19, 0, 12, 0, 22, 0, 29, 0, 5, 0, 18, 0, 33, 0, 18, 0, 22, 0, 29, 0, 18, 0, 29, 0, 37, 0, 19, 0, 22, 0, 29, 0, 19, 0, 24, 0, 22, 0, 33, 0, 6, 0, 19, 0, 21, 0, 7, 0, 20, 0, 33, 0, 19, 0, 26, 0, 29, 0, 5, 0, 19, 0, 25, 0, 18, 0, 37, 0, 6, 0, 33, 0, 19, 0, 12, 0, 22, 0, 29, 0, 33, 0, 7, 0, 31, 0, 33, 0, 19, 0, 18, 0, 29, 0, 19, 0, 26, 0, 21, 0, 21, 0, 19, 0, 21, 0, 26, 0, 3, 0, 7, 0, 25, 0, 8, 0],
|
||||||
|
[0, 33, 0, 6, 0, 7, 0, 19, 0, 34, 0, 4, 0, 18, 0, 12, 0, 0, 0, 19, 0, 24, 0, 25, 0, 22, 0, 9, 0, 29, 0, 19, 0, 20, 0, 22, 0, 31, 0, 19, 0, 16, 0, 4, 0, 17, 0, 13, 0, 8, 0, 19, 0, 22, 0, 32, 0, 7, 0, 25, 0, 19, 0, 33, 0, 6, 0, 7, 0, 19, 0, 21, 0, 26, 0, 2, 0, 3, 0, 19, 0, 5, 0, 22, 0, 37, 0, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38],
|
||||||
|
[0, 9, 0, 7, 0, 19, 0, 4, 0, 8, 0, 7, 0, 19, 0, 0, 0, 19, 0, 26, 0, 8, 0, 19, 0, 22, 0, 4, 0, 25, 0, 19, 0, 13, 0, 26, 0, 5, 0, 5, 0, 18, 0, 29, 0, 37, 0, 19, 0, 33, 0, 22, 0, 0, 0, 7, 0, 29, 0, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38],
|
||||||
|
],
|
||||||
|
'attention_mask': [
|
||||||
|
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
            ]
        }
        # fmt: on

        tokenizer_classes = [self.tokenizer_class]
        if self.test_rust_tokenizer:
            tokenizer_classes.append(self.rust_tokenizer_class)

        for tokenizer_class in tokenizer_classes:
            tokenizer = tokenizer_class.from_pretrained(
                "facebook/mms-tts-eng",
                revision="d188a254c84ae6cfd24deb7a8f5c0c1d349d7d9f",  # to pin the tokenizer version
            )

            encoding = tokenizer(sequences, padding=True, normalize=True)
            decoded_sequences = [tokenizer.decode(seq, skip_special_tokens=True) for seq in encoding["input_ids"]]

            encoding_data = encoding.data
            self.assertDictEqual(encoding_data, expected_encoding)

            for expected, decoded in zip(normalized_sequences, decoded_sequences):
                self.assertEqual(expected, decoded)
@@ -190,6 +190,7 @@ def check_attribute_being_used(config_class, attributes, default_value, source_s
         "use_cache",
         "out_features",
         "out_indices",
+        "sampling_rate",
     ]
     attributes_used_in_generation = ["encoder_no_repeat_ngram_size"]
493
utils/documentation_tests.txt
Normal file
@@ -0,0 +1,493 @@
docs/source/en/autoclass_tutorial.md
docs/source/en/model_doc/byt5.md
docs/source/en/model_doc/donut.md
docs/source/en/model_doc/encoder-decoder.md
docs/source/en/model_doc/markuplm.md
docs/source/en/model_doc/speech_to_text.md
docs/source/en/model_doc/switch_transformers.md
docs/source/en/model_doc/t5.md
docs/source/en/model_doc/t5v1.1.md
docs/source/en/model_doc/tapex.md
docs/source/en/pipeline_tutorial.md
docs/source/en/quicktour.md
docs/source/en/task_summary.md
docs/source/es/quicktour.md
src/transformers/generation/configuration_utils.py
src/transformers/generation/tf_utils.py
src/transformers/generation/utils.py
src/transformers/models/albert/configuration_albert.py
src/transformers/models/albert/modeling_albert.py
src/transformers/models/albert/modeling_tf_albert.py
src/transformers/models/albert/tokenization_albert.py
src/transformers/models/albert/tokenization_albert_fast.py
src/transformers/models/align/processing_align.py
src/transformers/models/altclip/processing_altclip.py
src/transformers/models/audio_spectrogram_transformer/feature_extraction_audio_spectrogram_transformer.py
src/transformers/models/audio_spectrogram_transformer/modeling_audio_spectrogram_transformer.py
src/transformers/models/auto/feature_extraction_auto.py
src/transformers/models/auto/image_processing_auto.py
src/transformers/models/auto/processing_auto.py
src/transformers/models/auto/tokenization_auto.py
src/transformers/models/bark/configuration_bark.py
src/transformers/models/bark/modeling_bark.py
src/transformers/models/bark/processing_bark.py
src/transformers/models/bart/configuration_bart.py
src/transformers/models/bart/modeling_bart.py
src/transformers/models/bart/tokenization_bart.py
src/transformers/models/bart/tokenization_bart_fast.py
src/transformers/models/barthez/tokenization_barthez.py
src/transformers/models/barthez/tokenization_barthez_fast.py
src/transformers/models/bartpho/tokenization_bartpho.py
src/transformers/models/beit/configuration_beit.py
src/transformers/models/beit/feature_extraction_beit.py
src/transformers/models/beit/image_processing_beit.py
src/transformers/models/beit/modeling_beit.py
src/transformers/models/bert/configuration_bert.py
src/transformers/models/bert/modeling_bert.py
src/transformers/models/bert/modeling_tf_bert.py
src/transformers/models/bert/tokenization_bert.py
src/transformers/models/bert/tokenization_bert_fast.py
src/transformers/models/bert/tokenization_bert_tf.py
src/transformers/models/bert_generation/configuration_bert_generation.py
src/transformers/models/bert_generation/tokenization_bert_generation.py
src/transformers/models/bert_japanese/tokenization_bert_japanese.py
src/transformers/models/bertweet/tokenization_bertweet.py
src/transformers/models/big_bird/configuration_big_bird.py
src/transformers/models/big_bird/modeling_big_bird.py
src/transformers/models/big_bird/tokenization_big_bird.py
src/transformers/models/big_bird/tokenization_big_bird_fast.py
src/transformers/models/bigbird_pegasus/configuration_bigbird_pegasus.py
src/transformers/models/bigbird_pegasus/modeling_bigbird_pegasus.py
src/transformers/models/biogpt/tokenization_biogpt.py
src/transformers/models/bit/image_processing_bit.py
src/transformers/models/blenderbot/configuration_blenderbot.py
src/transformers/models/blenderbot/modeling_blenderbot.py
src/transformers/models/blenderbot/tokenization_blenderbot.py
src/transformers/models/blenderbot/tokenization_blenderbot_fast.py
src/transformers/models/blenderbot_small/configuration_blenderbot_small.py
src/transformers/models/blenderbot_small/modeling_blenderbot_small.py
src/transformers/models/blenderbot_small/tokenization_blenderbot_small.py
src/transformers/models/blenderbot_small/tokenization_blenderbot_small_fast.py
src/transformers/models/blip/image_processing_blip.py
src/transformers/models/blip/modeling_blip.py
src/transformers/models/blip/modeling_tf_blip.py
src/transformers/models/blip/processing_blip.py
src/transformers/models/blip_2/processing_blip_2.py
src/transformers/models/bloom/configuration_bloom.py
src/transformers/models/bloom/tokenization_bloom_fast.py
src/transformers/models/bridgetower/image_processing_bridgetower.py
src/transformers/models/bridgetower/processing_bridgetower.py
src/transformers/models/byt5/tokenization_byt5.py
src/transformers/models/camembert/configuration_camembert.py
src/transformers/models/camembert/tokenization_camembert.py
src/transformers/models/camembert/tokenization_camembert_fast.py
src/transformers/models/canine/configuration_canine.py
src/transformers/models/canine/modeling_canine.py
src/transformers/models/canine/tokenization_canine.py
src/transformers/models/chinese_clip/feature_extraction_chinese_clip.py
src/transformers/models/chinese_clip/image_processing_chinese_clip.py
src/transformers/models/chinese_clip/processing_chinese_clip.py
src/transformers/models/clap/configuration_clap.py
src/transformers/models/clap/feature_extraction_clap.py
src/transformers/models/clap/modeling_clap.py
src/transformers/models/clap/processing_clap.py
src/transformers/models/clip/configuration_clip.py
src/transformers/models/clip/feature_extraction_clip.py
src/transformers/models/clip/image_processing_clip.py
src/transformers/models/clip/processing_clip.py
src/transformers/models/clip/tokenization_clip.py
src/transformers/models/clip/tokenization_clip_fast.py
src/transformers/models/clipseg/modeling_clipseg.py
src/transformers/models/clipseg/processing_clipseg.py
src/transformers/models/codegen/configuration_codegen.py
src/transformers/models/codegen/tokenization_codegen.py
src/transformers/models/codegen/tokenization_codegen_fast.py
src/transformers/models/conditional_detr/configuration_conditional_detr.py
src/transformers/models/conditional_detr/feature_extraction_conditional_detr.py
src/transformers/models/conditional_detr/image_processing_conditional_detr.py
src/transformers/models/conditional_detr/modeling_conditional_detr.py
src/transformers/models/convbert/configuration_convbert.py
src/transformers/models/convbert/tokenization_convbert.py
src/transformers/models/convbert/tokenization_convbert_fast.py
src/transformers/models/convnext/configuration_convnext.py
src/transformers/models/convnext/feature_extraction_convnext.py
src/transformers/models/convnext/image_processing_convnext.py
src/transformers/models/convnext/modeling_convnext.py
src/transformers/models/cpm/tokenization_cpm.py
src/transformers/models/cpm/tokenization_cpm_fast.py
src/transformers/models/ctrl/configuration_ctrl.py
src/transformers/models/ctrl/modeling_ctrl.py
src/transformers/models/ctrl/tokenization_ctrl.py
src/transformers/models/cvt/configuration_cvt.py
src/transformers/models/cvt/modeling_cvt.py
src/transformers/models/data2vec/configuration_data2vec_audio.py
src/transformers/models/data2vec/configuration_data2vec_text.py
src/transformers/models/data2vec/configuration_data2vec_vision.py
src/transformers/models/data2vec/modeling_data2vec_audio.py
src/transformers/models/data2vec/modeling_data2vec_vision.py
src/transformers/models/deberta/configuration_deberta.py
src/transformers/models/deberta/modeling_deberta.py
src/transformers/models/deberta/tokenization_deberta.py
src/transformers/models/deberta/tokenization_deberta_fast.py
src/transformers/models/deberta_v2/configuration_deberta_v2.py
src/transformers/models/deberta_v2/modeling_deberta_v2.py
src/transformers/models/deberta_v2/tokenization_deberta_v2.py
src/transformers/models/deberta_v2/tokenization_deberta_v2_fast.py
src/transformers/models/decision_transformer/configuration_decision_transformer.py
src/transformers/models/deformable_detr/configuration_deformable_detr.py
src/transformers/models/deformable_detr/feature_extraction_deformable_detr.py
src/transformers/models/deformable_detr/image_processing_deformable_detr.py
src/transformers/models/deformable_detr/modeling_deformable_detr.py
src/transformers/models/deit/configuration_deit.py
src/transformers/models/deit/feature_extraction_deit.py
src/transformers/models/deit/image_processing_deit.py
src/transformers/models/deit/modeling_deit.py
src/transformers/models/deit/modeling_tf_deit.py
src/transformers/models/deta/configuration_deta.py
src/transformers/models/deta/image_processing_deta.py
src/transformers/models/deta/modeling_deta.py
src/transformers/models/detr/configuration_detr.py
src/transformers/models/detr/feature_extraction_detr.py
src/transformers/models/detr/image_processing_detr.py
src/transformers/models/detr/modeling_detr.py
src/transformers/models/dinat/configuration_dinat.py
src/transformers/models/dinat/modeling_dinat.py
src/transformers/models/distilbert/configuration_distilbert.py
src/transformers/models/distilbert/tokenization_distilbert.py
src/transformers/models/distilbert/tokenization_distilbert_fast.py
src/transformers/models/donut/feature_extraction_donut.py
src/transformers/models/donut/image_processing_donut.py
src/transformers/models/donut/processing_donut.py
src/transformers/models/dpr/configuration_dpr.py
src/transformers/models/dpr/tokenization_dpr.py
src/transformers/models/dpr/tokenization_dpr_fast.py
src/transformers/models/dpt/feature_extraction_dpt.py
src/transformers/models/dpt/image_processing_dpt.py
src/transformers/models/dpt/modeling_dpt.py
src/transformers/models/efficientformer/image_processing_efficientformer.py
src/transformers/models/efficientformer/modeling_tf_efficientformer.py
src/transformers/models/efficientnet/image_processing_efficientnet.py
src/transformers/models/electra/configuration_electra.py
src/transformers/models/electra/modeling_electra.py
src/transformers/models/electra/modeling_tf_electra.py
src/transformers/models/electra/tokenization_electra.py
src/transformers/models/electra/tokenization_electra_fast.py
src/transformers/models/encodec/feature_extraction_encodec.py
src/transformers/models/encodec/modeling_encodec.py
src/transformers/models/ernie/configuration_ernie.py
src/transformers/models/ernie_m/configuration_ernie_m.py
src/transformers/models/ernie_m/modeling_ernie_m.py
src/transformers/models/ernie_m/tokenization_ernie_m.py
src/transformers/models/esm/tokenization_esm.py
src/transformers/models/flaubert/tokenization_flaubert.py
src/transformers/models/flava/configuration_flava.py
src/transformers/models/flava/feature_extraction_flava.py
src/transformers/models/flava/image_processing_flava.py
src/transformers/models/flava/processing_flava.py
src/transformers/models/fnet/configuration_fnet.py
src/transformers/models/fnet/tokenization_fnet.py
src/transformers/models/fnet/tokenization_fnet_fast.py
src/transformers/models/fsmt/configuration_fsmt.py
src/transformers/models/fsmt/tokenization_fsmt.py
src/transformers/models/funnel/tokenization_funnel.py
src/transformers/models/funnel/tokenization_funnel_fast.py
src/transformers/models/git/modeling_git.py
src/transformers/models/git/processing_git.py
src/transformers/models/glpn/feature_extraction_glpn.py
src/transformers/models/glpn/image_processing_glpn.py
src/transformers/models/glpn/modeling_glpn.py
src/transformers/models/gpt2/configuration_gpt2.py
src/transformers/models/gpt2/modeling_gpt2.py
src/transformers/models/gpt2/tokenization_gpt2.py
src/transformers/models/gpt2/tokenization_gpt2_fast.py
src/transformers/models/gpt2/tokenization_gpt2_tf.py
src/transformers/models/gpt_neo/configuration_gpt_neo.py
src/transformers/models/gpt_neox/configuration_gpt_neox.py
src/transformers/models/gpt_neox/tokenization_gpt_neox_fast.py
src/transformers/models/gpt_neox_japanese/configuration_gpt_neox_japanese.py
src/transformers/models/gpt_neox_japanese/tokenization_gpt_neox_japanese.py
src/transformers/models/gpt_sw3/tokenization_gpt_sw3.py
src/transformers/models/gptj/modeling_gptj.py
src/transformers/models/gptsan_japanese/tokenization_gptsan_japanese.py
src/transformers/models/groupvit/modeling_groupvit.py
src/transformers/models/groupvit/modeling_tf_groupvit.py
src/transformers/models/herbert/tokenization_herbert.py
src/transformers/models/herbert/tokenization_herbert_fast.py
src/transformers/models/hubert/modeling_hubert.py
src/transformers/models/imagegpt/configuration_imagegpt.py
src/transformers/models/imagegpt/feature_extraction_imagegpt.py
src/transformers/models/imagegpt/image_processing_imagegpt.py
src/transformers/models/imagegpt/modeling_imagegpt.py
src/transformers/models/jukebox/tokenization_jukebox.py
src/transformers/models/jukebox/tokenization_jukebox.py
src/transformers/models/layoutlm/configuration_layoutlm.py
src/transformers/models/layoutlm/modeling_layoutlm.py
src/transformers/models/layoutlm/modeling_tf_layoutlm.py
src/transformers/models/layoutlm/tokenization_layoutlm.py
src/transformers/models/layoutlm/tokenization_layoutlm_fast.py
src/transformers/models/layoutlmv2/configuration_layoutlmv2.py
src/transformers/models/layoutlmv2/feature_extraction_layoutlmv2.py
src/transformers/models/layoutlmv2/image_processing_layoutlmv2.py
src/transformers/models/layoutlmv2/modeling_layoutlmv2.py
src/transformers/models/layoutlmv2/processing_layoutlmv2.py
src/transformers/models/layoutlmv2/tokenization_layoutlmv2.py
src/transformers/models/layoutlmv2/tokenization_layoutlmv2_fast.py
src/transformers/models/layoutlmv3/configuration_layoutlmv3.py
src/transformers/models/layoutlmv3/feature_extraction_layoutlmv3.py
src/transformers/models/layoutlmv3/image_processing_layoutlmv3.py
src/transformers/models/layoutlmv3/modeling_layoutlmv3.py
src/transformers/models/layoutlmv3/modeling_tf_layoutlmv3.py
src/transformers/models/layoutlmv3/processing_layoutlmv3.py
src/transformers/models/layoutlmv3/tokenization_layoutlmv3.py
src/transformers/models/layoutlmv3/tokenization_layoutlmv3_fast.py
src/transformers/models/layoutxlm/processing_layoutxlm.py
src/transformers/models/layoutxlm/tokenization_layoutxlm.py
src/transformers/models/layoutxlm/tokenization_layoutxlm_fast.py
src/transformers/models/led/tokenization_led.py
src/transformers/models/led/tokenization_led_fast.py
src/transformers/models/levit/configuration_levit.py
src/transformers/models/levit/feature_extraction_levit.py
src/transformers/models/levit/image_processing_levit.py
src/transformers/models/lilt/modeling_lilt.py
src/transformers/models/llama/tokenization_llama.py
src/transformers/models/longformer/modeling_longformer.py
src/transformers/models/longformer/modeling_tf_longformer.py
src/transformers/models/longformer/tokenization_longformer.py
src/transformers/models/longformer/tokenization_longformer_fast.py
src/transformers/models/longt5/modeling_longt5.py
src/transformers/models/luke/tokenization_luke.py
src/transformers/models/lxmert/tokenization_lxmert.py
src/transformers/models/lxmert/tokenization_lxmert_fast.py
src/transformers/models/m2m_100/configuration_m2m_100.py
src/transformers/models/m2m_100/tokenization_m2m_100.py
src/transformers/models/marian/modeling_marian.py
src/transformers/models/marian/tokenization_marian.py
src/transformers/models/markuplm/modeling_markuplm.py
src/transformers/models/markuplm/processing_markuplm.py
src/transformers/models/markuplm/tokenization_markuplm.py
src/transformers/models/markuplm/tokenization_markuplm_fast.py
src/transformers/models/mask2former/configuration_mask2former.py
src/transformers/models/mask2former/image_processing_mask2former.py
src/transformers/models/mask2former/modeling_mask2former.py
src/transformers/models/maskformer/configuration_maskformer.py
src/transformers/models/maskformer/feature_extraction_maskformer.py
src/transformers/models/maskformer/image_processing_maskformer.py
src/transformers/models/maskformer/modeling_maskformer.py
src/transformers/models/mbart/configuration_mbart.py
src/transformers/models/mbart/modeling_mbart.py
src/transformers/models/mbart/modeling_tf_mbart.py
src/transformers/models/mbart/tokenization_mbart.py
src/transformers/models/mbart/tokenization_mbart_fast.py
src/transformers/models/mbart50/tokenization_mbart50.py
src/transformers/models/mbart50/tokenization_mbart50_fast.py
src/transformers/models/megatron_bert/configuration_megatron_bert.py
src/transformers/models/mgp_str/processing_mgp_str.py
src/transformers/models/mgp_str/tokenization_mgp_str.py
src/transformers/models/mluke/tokenization_mluke.py
src/transformers/models/mobilebert/configuration_mobilebert.py
src/transformers/models/mobilebert/modeling_mobilebert.py
src/transformers/models/mobilebert/modeling_tf_mobilebert.py
src/transformers/models/mobilebert/tokenization_mobilebert.py
src/transformers/models/mobilebert/tokenization_mobilebert_fast.py
src/transformers/models/mobilenet_v1/feature_extraction_mobilenet_v1.py
src/transformers/models/mobilenet_v1/image_processing_mobilenet_v1.py
src/transformers/models/mobilenet_v1/modeling_mobilenet_v1.py
src/transformers/models/mobilenet_v2/feature_extraction_mobilenet_v2.py
src/transformers/models/mobilenet_v2/image_processing_mobilenet_v2.py
src/transformers/models/mobilenet_v2/modeling_mobilenet_v2.py
src/transformers/models/mobilevit/feature_extraction_mobilevit.py
src/transformers/models/mobilevit/image_processing_mobilevit.py
src/transformers/models/mobilevit/modeling_mobilevit.py
src/transformers/models/mobilevit/modeling_tf_mobilevit.py
src/transformers/models/mobilevitv2/configuration_mobilevitv2.py
src/transformers/models/mobilevitv2/modeling_mobilevitv2.py
src/transformers/models/mpnet/tokenization_mpnet.py
src/transformers/models/mpnet/tokenization_mpnet_fast.py
src/transformers/models/musicgen/configuration_musicgen.py
src/transformers/models/musicgen/modeling_musicgen.py
src/transformers/models/musicgen/processing_musicgen.py
src/transformers/models/mvp/configuration_mvp.py
src/transformers/models/mvp/tokenization_mvp.py
src/transformers/models/mvp/tokenization_mvp_fast.py
src/transformers/models/nat/configuration_nat.py
src/transformers/models/nat/modeling_nat.py
src/transformers/models/nezha/configuration_nezha.py
src/transformers/models/nllb/tokenization_nllb.py
src/transformers/models/nllb/tokenization_nllb_fast.py
src/transformers/models/oneformer/configuration_oneformer.py
src/transformers/models/oneformer/image_processing_oneformer.py
src/transformers/models/oneformer/modeling_oneformer.py
src/transformers/models/oneformer/processing_oneformer.py
src/transformers/models/openai/configuration_openai.py
src/transformers/models/openai/tokenization_openai.py
src/transformers/models/openai/tokenization_openai_fast.py
src/transformers/models/opt/configuration_opt.py
src/transformers/models/opt/modeling_opt.py
src/transformers/models/opt/modeling_tf_opt.py
src/transformers/models/owlvit/feature_extraction_owlvit.py
src/transformers/models/owlvit/image_processing_owlvit.py
src/transformers/models/owlvit/modeling_owlvit.py
src/transformers/models/owlvit/processing_owlvit.py
src/transformers/models/pegasus/configuration_pegasus.py
src/transformers/models/pegasus/modeling_pegasus.py
src/transformers/models/pegasus/tokenization_pegasus.py
src/transformers/models/pegasus/tokenization_pegasus_fast.py
src/transformers/models/pegasus_x/configuration_pegasus_x.py
src/transformers/models/perceiver/feature_extraction_perceiver.py
src/transformers/models/perceiver/image_processing_perceiver.py
src/transformers/models/perceiver/modeling_perceiver.py
src/transformers/models/perceiver/tokenization_perceiver.py
src/transformers/models/phobert/tokenization_phobert.py
src/transformers/models/pix2struct/modeling_pix2struct.py
src/transformers/models/plbart/configuration_plbart.py
src/transformers/models/plbart/modeling_plbart.py
src/transformers/models/plbart/tokenization_plbart.py
src/transformers/models/poolformer/configuration_poolformer.py
src/transformers/models/poolformer/feature_extraction_poolformer.py
src/transformers/models/poolformer/image_processing_poolformer.py
src/transformers/models/poolformer/modeling_poolformer.py
src/transformers/models/prophetnet/tokenization_prophetnet.py
|
src/transformers/models/rag/tokenization_rag.py
|
||||||
|
src/transformers/models/realm/configuration_realm.py
|
||||||
|
src/transformers/models/realm/tokenization_realm.py
|
||||||
|
src/transformers/models/realm/tokenization_realm_fast.py
|
||||||
|
src/transformers/models/reformer/configuration_reformer.py
|
||||||
|
src/transformers/models/reformer/modeling_reformer.py
|
||||||
|
src/transformers/models/reformer/tokenization_reformer.py
|
||||||
|
src/transformers/models/reformer/tokenization_reformer_fast.py
|
||||||
|
src/transformers/models/regnet/modeling_regnet.py
|
||||||
|
src/transformers/models/regnet/modeling_tf_regnet.py
|
||||||
|
src/transformers/models/rembert/tokenization_rembert.py
|
||||||
|
src/transformers/models/rembert/tokenization_rembert_fast.py
|
||||||
|
src/transformers/models/resnet/configuration_resnet.py
|
||||||
|
src/transformers/models/resnet/modeling_resnet.py
|
||||||
|
src/transformers/models/resnet/modeling_tf_resnet.py
|
||||||
|
src/transformers/models/roberta/configuration_roberta.py
|
||||||
|
src/transformers/models/roberta/modeling_roberta.py
|
||||||
|
src/transformers/models/roberta/modeling_tf_roberta.py
|
||||||
|
src/transformers/models/roberta/tokenization_roberta.py
|
||||||
|
src/transformers/models/roberta/tokenization_roberta_fast.py
|
||||||
|
src/transformers/models/roberta_prelayernorm/configuration_roberta_prelayernorm.py
|
||||||
|
src/transformers/models/roberta_prelayernorm/modeling_roberta_prelayernorm.py
|
||||||
|
src/transformers/models/roberta_prelayernorm/modeling_tf_roberta_prelayernorm.py
|
||||||
|
src/transformers/models/roc_bert/modeling_roc_bert.py
|
||||||
|
src/transformers/models/roc_bert/tokenization_roc_bert.py
|
||||||
|
src/transformers/models/roformer/tokenization_roformer.py
|
||||||
|
src/transformers/models/roformer/tokenization_roformer_fast.py
|
||||||
|
src/transformers/models/roformer/tokenization_utils.py
|
||||||
|
src/transformers/models/segformer/feature_extraction_segformer.py
|
||||||
|
src/transformers/models/segformer/image_processing_segformer.py
|
||||||
|
src/transformers/models/segformer/modeling_segformer.py
|
||||||
|
src/transformers/models/segformer/modeling_tf_segformer.py
|
||||||
|
src/transformers/models/sew/configuration_sew.py
|
||||||
|
src/transformers/models/sew/modeling_sew.py
|
||||||
|
src/transformers/models/sew_d/configuration_sew_d.py
|
||||||
|
src/transformers/models/sew_d/modeling_sew_d.py
|
||||||
|
src/transformers/models/speech_encoder_decoder/modeling_speech_encoder_decoder.py
|
||||||
|
src/transformers/models/speech_to_text/configuration_speech_to_text.py
|
||||||
|
src/transformers/models/speech_to_text/feature_extraction_speech_to_text.py
|
||||||
|
src/transformers/models/speech_to_text/modeling_speech_to_text.py
|
||||||
|
src/transformers/models/speech_to_text/processing_speech_to_text.py
|
||||||
|
src/transformers/models/speech_to_text/tokenization_speech_to_text.py
|
||||||
|
src/transformers/models/speech_to_text_2/configuration_speech_to_text_2.py
|
||||||
|
src/transformers/models/speech_to_text_2/modeling_speech_to_text_2.py
|
||||||
|
src/transformers/models/speech_to_text_2/processing_speech_to_text_2.py
|
||||||
|
src/transformers/models/speech_to_text_2/tokenization_speech_to_text_2.py
|
||||||
|
src/transformers/models/speecht5/feature_extraction_speecht5.py
|
||||||
|
src/transformers/models/speecht5/modeling_speecht5.py
|
||||||
|
src/transformers/models/speecht5/processing_speecht5.py
|
||||||
|
src/transformers/models/speecht5/tokenization_speecht5.py
|
||||||
|
src/transformers/models/splinter/tokenization_splinter.py
|
||||||
|
src/transformers/models/splinter/tokenization_splinter_fast.py
|
||||||
|
src/transformers/models/squeezebert/configuration_squeezebert.py
|
||||||
|
src/transformers/models/squeezebert/tokenization_squeezebert.py
|
||||||
|
src/transformers/models/squeezebert/tokenization_squeezebert_fast.py
|
||||||
|
src/transformers/models/swin/configuration_swin.py
|
||||||
|
src/transformers/models/swin/modeling_swin.py
|
||||||
|
src/transformers/models/swin2sr/image_processing_swin2sr.py
|
||||||
|
src/transformers/models/swin2sr/modeling_swin2sr.py
|
||||||
|
src/transformers/models/swinv2/configuration_swinv2.py
|
||||||
|
src/transformers/models/t5/tokenization_t5.py
|
||||||
|
src/transformers/models/t5/tokenization_t5_fast.py
|
||||||
|
src/transformers/models/table_transformer/modeling_table_transformer.py
|
||||||
|
src/transformers/models/tapas/tokenization_tapas.py
|
||||||
|
src/transformers/models/time_series_transformer/configuration_time_series_transformer.py
|
||||||
|
src/transformers/models/time_series_transformer/modeling_time_series_transformer.py
|
||||||
|
src/transformers/models/timesformer/configuration_timesformer.py
|
||||||
|
src/transformers/models/timesformer/modeling_timesformer.py
|
||||||
|
src/transformers/models/transfo_xl/configuration_transfo_xl.py
|
||||||
|
src/transformers/models/transfo_xl/tokenization_transfo_xl.py
src/transformers/models/trocr/configuration_trocr.py
src/transformers/models/trocr/modeling_trocr.py
src/transformers/models/trocr/processing_trocr.py
src/transformers/models/tvlt/feature_extraction_tvlt.py
src/transformers/models/tvlt/image_processing_tvlt.py
src/transformers/models/tvlt/processing_tvlt.py
src/transformers/models/unispeech/configuration_unispeech.py
src/transformers/models/unispeech/modeling_unispeech.py
src/transformers/models/unispeech_sat/modeling_unispeech_sat.py
src/transformers/models/upernet/modeling_upernet.py
src/transformers/models/videomae/feature_extraction_videomae.py
src/transformers/models/videomae/image_processing_videomae.py
src/transformers/models/videomae/modeling_videomae.py
src/transformers/models/vilt/feature_extraction_vilt.py
src/transformers/models/vilt/image_processing_vilt.py
src/transformers/models/vilt/modeling_vilt.py
src/transformers/models/vilt/processing_vilt.py
src/transformers/models/vision_encoder_decoder/configuration_vision_encoder_decoder.py
src/transformers/models/vision_encoder_decoder/modeling_vision_encoder_decoder.py
src/transformers/models/vision_text_dual_encoder/configuration_vision_text_dual_encoder.py
src/transformers/models/vision_text_dual_encoder/modeling_tf_vision_text_dual_encoder.py
src/transformers/models/vision_text_dual_encoder/processing_vision_text_dual_encoder.py
src/transformers/models/visual_bert/configuration_visual_bert.py
src/transformers/models/vit/configuration_vit.py
src/transformers/models/vit/feature_extraction_vit.py
src/transformers/models/vit/image_processing_vit.py
src/transformers/models/vit/modeling_tf_vit.py
src/transformers/models/vit/modeling_vit.py
src/transformers/models/vit_hybrid/image_processing_vit_hybrid.py
src/transformers/models/vit_mae/configuration_vit_mae.py
src/transformers/models/vit_mae/modeling_vit_mae.py
src/transformers/models/vit_msn/modeling_vit_msn.py
src/transformers/models/vits/modeling_vits.py
src/transformers/models/vits/tokenization_vits.py
src/transformers/models/wav2vec2/configuration_wav2vec2.py
src/transformers/models/wav2vec2/feature_extraction_wav2vec2.py
src/transformers/models/wav2vec2/modeling_wav2vec2.py
src/transformers/models/wav2vec2/processing_wav2vec2.py
src/transformers/models/wav2vec2/tokenization_wav2vec2.py
src/transformers/models/wav2vec2_conformer/configuration_wav2vec2_conformer.py
src/transformers/models/wav2vec2_conformer/modeling_wav2vec2_conformer.py
src/transformers/models/wav2vec2_phoneme/tokenization_wav2vec2_phoneme.py
src/transformers/models/wav2vec2_with_lm/processing_wav2vec2_with_lm.py
src/transformers/models/wavlm/configuration_wavlm.py
src/transformers/models/wavlm/modeling_wavlm.py
src/transformers/models/whisper/configuration_whisper.py
src/transformers/models/whisper/feature_extraction_whisper.py
src/transformers/models/whisper/modeling_tf_whisper.py
src/transformers/models/whisper/modeling_whisper.py
src/transformers/models/whisper/processing_whisper.py
src/transformers/models/whisper/tokenization_whisper.py
src/transformers/models/whisper/tokenization_whisper_fast.py
src/transformers/models/x_clip/modeling_x_clip.py
src/transformers/models/x_clip/processing_x_clip.py
src/transformers/models/xglm/tokenization_xglm.py
src/transformers/models/xglm/tokenization_xglm_fast.py
src/transformers/models/xlm/configuration_xlm.py
src/transformers/models/xlm/tokenization_xlm.py
src/transformers/models/xlm_prophetnet/tokenization_xlm_prophetnet.py
src/transformers/models/xlm_roberta/configuration_xlm_roberta.py
src/transformers/models/xlm_roberta/tokenization_xlm_roberta.py
src/transformers/models/xlm_roberta/tokenization_xlm_roberta_fast.py
src/transformers/models/xlm_roberta_xl/configuration_xlm_roberta_xl.py
src/transformers/models/xlnet/configuration_xlnet.py
src/transformers/models/xlnet/tokenization_xlnet.py
src/transformers/models/xlnet/tokenization_xlnet_fast.py
src/transformers/models/xmod/configuration_xmod.py
src/transformers/models/xmod/modeling_xmod.py
src/transformers/models/yolos/configuration_yolos.py
src/transformers/models/yolos/feature_extraction_yolos.py
src/transformers/models/yolos/image_processing_yolos.py
src/transformers/models/yolos/modeling_yolos.py
src/transformers/models/yoso/configuration_yoso.py
src/transformers/pipelines/