Add bark (#24086)
* first raw version of the bark integration * working code on small models with single run * add converting script from suno weights 2 hf * many changes * correct past_kv output * working implementation for inference * update the converting script according to the architecture changes * add a working end-to-end inference code * remove some comments and make small changes * remove unecessary comment * add docstrings and ensure no unecessary intermediary output during audio generation * remove done TODOs * make style + add config docstrings * modification for batch inference support on the whole model * add details to .generation_audio method * add copyright * convert EncodecModel from original library to transformers implementation * add two class in order to facilitate model and sub-models loading from the hub * add support of loading the whole model * add BarkProcessor * correct modeling according to processor output * Add proper __init__ and auto support * Add up-to-date copyright/license message * add relative import instead of absolute * cleaner head_dim computation * small comment removal or changes * more verbose LayerNorm init method * specify eps for clearer comprehension * more verbose variable naming in the MLP module * remove unecessary BarkBlock parameter * clearer code in the forward pass of the BarkBlock * remove _initialize_modules method for cleaner code * Remove unnecessary methods from sub-models * move code to remove unnecessary function * rename a variable for clarity and change an assert * move code and change variable name for clarity * remove unnecessary asserts * correct small bug * correct a comment * change variable names for clarity * remove asserts * change import from absolute to relative * correct small error due to comma missing + correct import * Add attribute Bark config * add first version of tests * update attention_map * add tie_weights and resize_token_embeddings for fineModel * correct getting attention_mask in 
generate_text_semantic * remove Bark inference trick * leave more choices in barkProcessor * remove _no_split_modules * fixe error in forward of block and introduce clearer notations * correct converting script with last changes * make style + add draft bark.mdx * correct BarkModelTest::test_generate_text_semantic * add Bark in main README * add dummy_pt_objects for Bark * add missing models in the main init * correct test_decoder_model_past_with_large_inputs * disable torchscript test * change docstring of BarkProcessor * Add test_processor_bark * make style * correct copyrights * add bark.mdx + make style, quality and consistency * Apply suggestions from code review Co-authored-by: Sanchit Gandhi <93869735+sanchit-gandhi@users.noreply.github.com> * Remove unnecessary test method * simply logic of a test * Only check first ids for slow audio generation * split full end-to-end generation tests * remove unneccessary comment * change submodel names for clearer naming * remove ModuleDict from modeling_bark * combine two if statements * ensure that an edge misued won't happen * modify variable name * move code snippet to the right place (coarse instead of semantic) * change BarkSemanticModule -> BarkSemanticModel * align BarkProcessor with transformers paradigm * correct BarkProcessor tests with last commit changes * change _validate_voice_preset to an instance method instead of a class method * tie_weights already called with post_init * add codec_model config to configuration * update bark modeling tests with recent BarkProcessor changes * remove SubModelPretrainedModel + change speakers embeddings prompt type in BarkModel * change absolute imports to relative * remove TODO * change docstrings * add examples to docs and docstrings * make style * uses BatchFeature in BarkProcessor insteads of dict * continue improving docstrings and docs + make style * correct docstrings examples * more comprehensible speaker_embeddings load/Save * rename speaker_embeddings_dict -> 
speaker_embeddings * correct bark.mdx + add bark to documentation_tests * correct docstrings configuration_bark * integrate last nit suggestions * integrate BarkGeneration configs * make style * remove bark tests from documentation_tests.txt because timeout - tested manually * add proper generation config initialization * small bark.mdx documentation changes * rename bark.mdx -> bark.md * add torch.no_grad behind BarkModel.generate_audio() * replace assert by ValueError in convert_suno_to_hf.py * integrate a series of short comments from reviewer * move SemanticLogitsProcessors and remove .detach() from Bark docs and docstrings * actually remove SemanticLogitsProcessor from modeling_bark.oy * BarkProcessor returns a single output instead of tuple + correct docstrings * make style + correct bug * add initializer_range to BarkConfig + correct slow modeling tests * add .clone() to history_prompt.coarse_prompt to avoid modifying input array * Making sure no extra "`" are present * remove extra characters in modeling_bark.py * Correct output if history_prompt is None * remove TODOs * remove ravel comment * completing generation_configuration_bark.py docstrings * change docstrings - number of audio codebooks instead of Encodec codebooks * change 'bias' docstrings in configuration_bark.py * format code * rename BarkModel.generate_audio -> BarkModel.generate_speech * modify AutoConfig instead of EncodecConfig in BarkConfig * correct AutoConfig wrong init * refactor BarkModel and sub-models generate_coarse, generate_fine, generate_text_semantic * remove SemanticLogitsProcessor and replace it with SuppressTokensLogitsProcessor * move nb_codebook related config arguments to BarkFineConfig * rename bark.mdx -> bark.md * correcting BarkModelConfig from_pretrained + remove keys_to_ignore * correct bark.md with correct hub path * correct code bug in bark.md * correct list tokens_to_suppress * modify Processor to load nested speaker embeddings in a safer way * correct batch 
sampling in BarkFineModel.generate_fine * Apply suggestions from code review Small docstrings correction and code improvements Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com> * give more details about num_layers in docstrings * correct indentation mistake * correct submodelconfig order of docstring variables * put audio models in alphabetical order in utils/check_repo.my * remove useless line from test_modeling_bark.py * makes BarkCoarseModelTest inherits from (ModelTesterMixin, GenerationTesterMixin, unittest.TestCase) instead of BarkSemanticModelTest * make a Tester class for each sub-model instead of inheriting * add test_resize_embeddings=True for Bark sub-models * add Copied from transformers.models.gpt_neo.modeling_gpt_neo.GPTNeoSelfAttention._split_heads * remove 'Copied fom Bark' comment * remove unneccessary comment * change np.min -> min in modeling_bark.py * refactored all custom layers to have Bark prefix * add attention_mask as an argument of generate_text_semantic * refactor sub-models start docstrings to have more precise config class definition * move _tied_weights_keys overriding * add docstrings to generate_xxx in modeling_bark.py * add loading whole BarkModel to convert_suno_to_hf * refactor attribute and variable names * make style convert_suno * update bark checkpoints * remove never entered if statement * move bark_modeling docstrings after BarkPretrainedModel class definition * refactor modeling_bark.py: kv -> key_values * small nits - code refactoring and removing unecessary lines from _init_weights * nits - replace inplace method by variable assigning * remove *optional* when necessary * remove some lines in generate_speech * add default value for optional parameter * Refactor preprocess_histories_before_coarse -> preprocess_histories Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com> * correct usage after refactoring * refactor Bark's generate_xxx -> generate and modify docstrings and 
tests accordingly * update docstrings python in configuration_bark.py * add bark files in utils/documentation_test.txt * correct docstrings python snippet * add the ability to use parameters in the form of e.g coarse_temperature * add semantic_max_new_tokens in python snippet in docstrings for quicker generation * Reformate sub-models kwargs in BakModel.generate Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com> * correct kwargs in BarkModel.generate * correct attention_mask kwarg in BarkModel.generate * add tests for sub-models args in BarkModel.generate and correct BarkFineModel.test_generate_fp16 * enrich BarkModel.generate docstrings with a description of how to use the kwargs --------- Co-authored-by: Sanchit Gandhi <93869735+sanchit-gandhi@users.noreply.github.com> Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com> Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
@@ -291,6 +291,7 @@ Current number of checkpoints:
1. **[AltCLIP](https://huggingface.co/docs/transformers/model_doc/altclip)** (from BAAI) released with the paper [AltCLIP: Altering the Language Encoder in CLIP for Extended Language Capabilities](https://arxiv.org/abs/2211.06679) by Chen, Zhongzhi and Liu, Guang and Zhang, Bo-Wen and Ye, Fulong and Yang, Qinghong and Wu, Ledell.
|
||||
1. **[Audio Spectrogram Transformer](https://huggingface.co/docs/transformers/model_doc/audio-spectrogram-transformer)** (from MIT) released with the paper [AST: Audio Spectrogram Transformer](https://arxiv.org/abs/2104.01778) by Yuan Gong, Yu-An Chung, James Glass.
|
||||
1. **[Autoformer](https://huggingface.co/docs/transformers/model_doc/autoformer)** (from Tsinghua University) released with the paper [Autoformer: Decomposition Transformers with Auto-Correlation for Long-Term Series Forecasting](https://arxiv.org/abs/2106.13008) by Haixu Wu, Jiehui Xu, Jianmin Wang, Mingsheng Long.
|
||||
1. **[Bark](https://huggingface.co/docs/transformers/main/model_doc/bark)** (from Suno) released in the repository [suno-ai/bark](https://github.com/suno-ai/bark) by Suno AI team.
|
||||
1. **[BART](https://huggingface.co/docs/transformers/model_doc/bart)** (from Facebook) released with the paper [BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension](https://arxiv.org/abs/1910.13461) by Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov and Luke Zettlemoyer.
|
||||
1. **[BARThez](https://huggingface.co/docs/transformers/model_doc/barthez)** (from École polytechnique) released with the paper [BARThez: a Skilled Pretrained French Sequence-to-Sequence Model](https://arxiv.org/abs/2010.12321) by Moussa Kamal Eddine, Antoine J.-P. Tixier, Michalis Vazirgiannis.
|
||||
1. **[BARTpho](https://huggingface.co/docs/transformers/model_doc/bartpho)** (from VinAI Research) released with the paper [BARTpho: Pre-trained Sequence-to-Sequence Models for Vietnamese](https://arxiv.org/abs/2109.09701) by Nguyen Luong Tran, Duong Minh Le and Dat Quoc Nguyen.
|
||||
|
||||
@@ -268,6 +268,7 @@ Número actual de puntos de control:
1. **[AltCLIP](https://huggingface.co/docs/transformers/model_doc/altclip)** (from BAAI) released with the paper [AltCLIP: Altering the Language Encoder in CLIP for Extended Language Capabilities](https://arxiv.org/abs/2211.06679) by Chen, Zhongzhi and Liu, Guang and Zhang, Bo-Wen and Ye, Fulong and Yang, Qinghong and Wu, Ledell.
|
||||
1. **[Audio Spectrogram Transformer](https://huggingface.co/docs/transformers/model_doc/audio-spectrogram-transformer)** (from MIT) released with the paper [AST: Audio Spectrogram Transformer](https://arxiv.org/abs/2104.01778) by Yuan Gong, Yu-An Chung, James Glass.
|
||||
1. **[Autoformer](https://huggingface.co/docs/transformers/model_doc/autoformer)** (from Tsinghua University) released with the paper [Autoformer: Decomposition Transformers with Auto-Correlation for Long-Term Series Forecasting](https://arxiv.org/abs/2106.13008) by Haixu Wu, Jiehui Xu, Jianmin Wang, Mingsheng Long.
|
||||
1. **[Bark](https://huggingface.co/docs/transformers/main/model_doc/bark)** (from Suno) released in the repository [suno-ai/bark](https://github.com/suno-ai/bark) by Suno AI team.
|
||||
1. **[BART](https://huggingface.co/docs/transformers/model_doc/bart)** (from Facebook) released with the paper [BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension](https://arxiv.org/abs/1910.13461) by Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov and Luke Zettlemoyer.
|
||||
1. **[BARThez](https://huggingface.co/docs/transformers/model_doc/barthez)** (from École polytechnique) released with the paper [BARThez: a Skilled Pretrained French Sequence-to-Sequence Model](https://arxiv.org/abs/2010.12321) by Moussa Kamal Eddine, Antoine J.-P. Tixier, Michalis Vazirgiannis.
|
||||
1. **[BARTpho](https://huggingface.co/docs/transformers/model_doc/bartpho)** (from VinAI Research) released with the paper [BARTpho: Pre-trained Sequence-to-Sequence Models for Vietnamese](https://arxiv.org/abs/2109.09701) by Nguyen Luong Tran, Duong Minh Le and Dat Quoc Nguyen.
|
||||
|
||||
@@ -240,6 +240,7 @@ conda install -c huggingface transformers
|
||||
1. **[AltCLIP](https://huggingface.co/docs/transformers/model_doc/altclip)** (from BAAI) released with the paper [AltCLIP: Altering the Language Encoder in CLIP for Extended Language Capabilities](https://arxiv.org/abs/2211.06679) by Chen, Zhongzhi and Liu, Guang and Zhang, Bo-Wen and Ye, Fulong and Yang, Qinghong and Wu, Ledell.
|
||||
1. **[Audio Spectrogram Transformer](https://huggingface.co/docs/transformers/model_doc/audio-spectrogram-transformer)** (from MIT) released with the paper [AST: Audio Spectrogram Transformer](https://arxiv.org/abs/2104.01778) by Yuan Gong, Yu-An Chung, James Glass.
|
||||
1. **[Autoformer](https://huggingface.co/docs/transformers/model_doc/autoformer)** (from Tsinghua University) released with the paper [Autoformer: Decomposition Transformers with Auto-Correlation for Long-Term Series Forecasting](https://arxiv.org/abs/2106.13008) by Haixu Wu, Jiehui Xu, Jianmin Wang, Mingsheng Long.
|
||||
1. **[Bark](https://huggingface.co/docs/transformers/main/model_doc/bark)** (from Suno) released in the repository [suno-ai/bark](https://github.com/suno-ai/bark) by Suno AI team.
|
||||
1. **[BART](https://huggingface.co/docs/transformers/model_doc/bart)** (फेसबुक) साथ थीसिस [बार्ट: प्राकृतिक भाषा निर्माण, अनुवाद के लिए अनुक्रम-से-अनुक्रम पूर्व प्रशिक्षण , और समझ] (https://arxiv.org/pdf/1910.13461.pdf) पर निर्भर माइक लुईस, यिनहान लियू, नमन गोयल, मार्जन ग़ज़विनिनेजाद, अब्देलरहमान मोहम्मद, ओमर लेवी, वेस स्टोयानोव और ल्यूक ज़ेटलमॉयर
|
||||
1. **[BARThez](https://huggingface.co/docs/transformers/model_doc/barthez)** (से École polytechnique) साथ थीसिस [BARThez: a Skilled Pretrained French Sequence-to-Sequence Model](https://arxiv.org/abs/2010.12321) पर निर्भर Moussa Kamal Eddine, Antoine J.-P. Tixier, Michalis Vazirgiannis रिहाई।
|
||||
1. **[BARTpho](https://huggingface.co/docs/transformers/model_doc/bartpho)** (VinAI Research से) साथ में पेपर [BARTpho: Pre-trained Sequence-to-Sequence Models for Vietnamese](https://arxiv.org/abs/2109.09701)गुयेन लुओंग ट्रान, डुओंग मिन्ह ले और डाट क्वोक गुयेन द्वारा पोस्ट किया गया।
|
||||
|
||||
@@ -302,6 +302,7 @@ Flax、PyTorch、TensorFlowをcondaでインストールする方法は、それ
|
||||
1. **[AltCLIP](https://huggingface.co/docs/transformers/model_doc/altclip)** (BAAI から) Chen, Zhongzhi and Liu, Guang and Zhang, Bo-Wen and Ye, Fulong and Yang, Qinghong and Wu, Ledell から公開された研究論文: [AltCLIP: Altering the Language Encoder in CLIP for Extended Language Capabilities](https://arxiv.org/abs/2211.06679)
|
||||
1. **[Audio Spectrogram Transformer](https://huggingface.co/docs/transformers/model_doc/audio-spectrogram-transformer)** (MIT から) Yuan Gong, Yu-An Chung, James Glass から公開された研究論文: [AST: Audio Spectrogram Transformer](https://arxiv.org/abs/2104.01778)
|
||||
1. **[Autoformer](https://huggingface.co/docs/transformers/model_doc/autoformer)** (from Tsinghua University) released with the paper [Autoformer: Decomposition Transformers with Auto-Correlation for Long-Term Series Forecasting](https://arxiv.org/abs/2106.13008) by Haixu Wu, Jiehui Xu, Jianmin Wang, Mingsheng Long.
|
||||
1. **[Bark](https://huggingface.co/docs/transformers/main/model_doc/bark)** (from Suno) released in the repository [suno-ai/bark](https://github.com/suno-ai/bark) by Suno AI team.
|
||||
1. **[BART](https://huggingface.co/docs/transformers/model_doc/bart)** (Facebook から) Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov and Luke Zettlemoyer から公開された研究論文: [BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension](https://arxiv.org/abs/1910.13461)
|
||||
1. **[BARThez](https://huggingface.co/docs/transformers/model_doc/barthez)** (École polytechnique から) Moussa Kamal Eddine, Antoine J.-P. Tixier, Michalis Vazirgiannis から公開された研究論文: [BARThez: a Skilled Pretrained French Sequence-to-Sequence Model](https://arxiv.org/abs/2010.12321)
|
||||
1. **[BARTpho](https://huggingface.co/docs/transformers/model_doc/bartpho)** (VinAI Research から) Nguyen Luong Tran, Duong Minh Le and Dat Quoc Nguyen から公開された研究論文: [BARTpho: Pre-trained Sequence-to-Sequence Models for Vietnamese](https://arxiv.org/abs/2109.09701)
|
||||
|
||||
@@ -217,6 +217,7 @@ Flax, PyTorch, TensorFlow 설치 페이지에서 이들을 conda로 설치하는
|
||||
1. **[AltCLIP](https://huggingface.co/docs/transformers/model_doc/altclip)** (from BAAI) released with the paper [AltCLIP: Altering the Language Encoder in CLIP for Extended Language Capabilities](https://arxiv.org/abs/2211.06679) by Chen, Zhongzhi and Liu, Guang and Zhang, Bo-Wen and Ye, Fulong and Yang, Qinghong and Wu, Ledell.
|
||||
1. **[Audio Spectrogram Transformer](https://huggingface.co/docs/transformers/model_doc/audio-spectrogram-transformer)** (from MIT) released with the paper [AST: Audio Spectrogram Transformer](https://arxiv.org/abs/2104.01778) by Yuan Gong, Yu-An Chung, James Glass.
|
||||
1. **[Autoformer](https://huggingface.co/docs/transformers/model_doc/autoformer)** (from Tsinghua University) released with the paper [Autoformer: Decomposition Transformers with Auto-Correlation for Long-Term Series Forecasting](https://arxiv.org/abs/2106.13008) by Haixu Wu, Jiehui Xu, Jianmin Wang, Mingsheng Long.
|
||||
1. **[Bark](https://huggingface.co/docs/transformers/main/model_doc/bark)** (from Suno) released in the repository [suno-ai/bark](https://github.com/suno-ai/bark) by Suno AI team.
|
||||
1. **[BART](https://huggingface.co/docs/transformers/model_doc/bart)** (from Facebook) released with the paper [BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension](https://arxiv.org/pdf/1910.13461.pdf) by Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov and Luke Zettlemoyer.
|
||||
1. **[BARThez](https://huggingface.co/docs/transformers/model_doc/barthez)** (from École polytechnique) released with the paper [BARThez: a Skilled Pretrained French Sequence-to-Sequence Model](https://arxiv.org/abs/2010.12321) by Moussa Kamal Eddine, Antoine J.-P. Tixier, Michalis Vazirgiannis.
|
||||
1. **[BARTpho](https://huggingface.co/docs/transformers/model_doc/bartpho)** (from VinAI Research) released with the paper [BARTpho: Pre-trained Sequence-to-Sequence Models for Vietnamese](https://arxiv.org/abs/2109.09701) by Nguyen Luong Tran, Duong Minh Le and Dat Quoc Nguyen.
|
||||
|
||||
@@ -241,6 +241,7 @@ conda install -c huggingface transformers
|
||||
1. **[AltCLIP](https://huggingface.co/docs/transformers/model_doc/altclip)** (来自 BAAI) 伴随论文 [AltCLIP: Altering the Language Encoder in CLIP for Extended Language Capabilities](https://arxiv.org/abs/2211.06679) 由 Chen, Zhongzhi and Liu, Guang and Zhang, Bo-Wen and Ye, Fulong and Yang, Qinghong and Wu, Ledell 发布。
|
||||
1. **[Audio Spectrogram Transformer](https://huggingface.co/docs/transformers/model_doc/audio-spectrogram-transformer)** (来自 MIT) 伴随论文 [AST: Audio Spectrogram Transformer](https://arxiv.org/abs/2104.01778) 由 Yuan Gong, Yu-An Chung, James Glass 发布。
|
||||
1. **[Autoformer](https://huggingface.co/docs/transformers/model_doc/autoformer)** (from Tsinghua University) released with the paper [Autoformer: Decomposition Transformers with Auto-Correlation for Long-Term Series Forecasting](https://arxiv.org/abs/2106.13008) by Haixu Wu, Jiehui Xu, Jianmin Wang, Mingsheng Long.
|
||||
1. **[Bark](https://huggingface.co/docs/transformers/main/model_doc/bark)** (from Suno) released in the repository [suno-ai/bark](https://github.com/suno-ai/bark) by Suno AI team.
|
||||
1. **[BART](https://huggingface.co/docs/transformers/model_doc/bart)** (来自 Facebook) 伴随论文 [BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension](https://arxiv.org/pdf/1910.13461.pdf) 由 Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov and Luke Zettlemoyer 发布。
|
||||
1. **[BARThez](https://huggingface.co/docs/transformers/model_doc/barthez)** (来自 École polytechnique) 伴随论文 [BARThez: a Skilled Pretrained French Sequence-to-Sequence Model](https://arxiv.org/abs/2010.12321) 由 Moussa Kamal Eddine, Antoine J.-P. Tixier, Michalis Vazirgiannis 发布。
|
||||
1. **[BARTpho](https://huggingface.co/docs/transformers/model_doc/bartpho)** (来自 VinAI Research) 伴随论文 [BARTpho: Pre-trained Sequence-to-Sequence Models for Vietnamese](https://arxiv.org/abs/2109.09701) 由 Nguyen Luong Tran, Duong Minh Le and Dat Quoc Nguyen 发布。
|
||||
|
||||
@@ -253,6 +253,7 @@ conda install -c huggingface transformers
|
||||
1. **[AltCLIP](https://huggingface.co/docs/transformers/model_doc/altclip)** (from BAAI) released with the paper [AltCLIP: Altering the Language Encoder in CLIP for Extended Language Capabilities](https://arxiv.org/abs/2211.06679) by Chen, Zhongzhi and Liu, Guang and Zhang, Bo-Wen and Ye, Fulong and Yang, Qinghong and Wu, Ledell.
|
||||
1. **[Audio Spectrogram Transformer](https://huggingface.co/docs/transformers/model_doc/audio-spectrogram-transformer)** (from MIT) released with the paper [AST: Audio Spectrogram Transformer](https://arxiv.org/abs/2104.01778) by Yuan Gong, Yu-An Chung, James Glass.
|
||||
1. **[Autoformer](https://huggingface.co/docs/transformers/model_doc/autoformer)** (from Tsinghua University) released with the paper [Autoformer: Decomposition Transformers with Auto-Correlation for Long-Term Series Forecasting](https://arxiv.org/abs/2106.13008) by Haixu Wu, Jiehui Xu, Jianmin Wang, Mingsheng Long.
|
||||
1. **[Bark](https://huggingface.co/docs/transformers/main/model_doc/bark)** (from Suno) released in the repository [suno-ai/bark](https://github.com/suno-ai/bark) by Suno AI team.
|
||||
1. **[BART](https://huggingface.co/docs/transformers/model_doc/bart)** (from Facebook) released with the paper [BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension](https://arxiv.org/pdf/1910.13461.pdf) by Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov and Luke Zettlemoyer.
|
||||
1. **[BARThez](https://huggingface.co/docs/transformers/model_doc/barthez)** (from École polytechnique) released with the paper [BARThez: a Skilled Pretrained French Sequence-to-Sequence Model](https://arxiv.org/abs/2010.12321) by Moussa Kamal Eddine, Antoine J.-P. Tixier, Michalis Vazirgiannis.
|
||||
1. **[BARTpho](https://huggingface.co/docs/transformers/model_doc/bartpho)** (from VinAI Research) released with the paper [BARTpho: Pre-trained Sequence-to-Sequence Models for Vietnamese](https://arxiv.org/abs/2109.09701) by Nguyen Luong Tran, Duong Minh Le and Dat Quoc Nguyen.
|
||||
|
||||
@@ -545,6 +545,8 @@
|
||||
sections:
|
||||
- local: model_doc/audio-spectrogram-transformer
|
||||
title: Audio Spectrogram Transformer
|
||||
- local: model_doc/bark
|
||||
title: Bark
|
||||
- local: model_doc/clap
|
||||
title: CLAP
|
||||
- local: model_doc/encodec
|
||||
|
||||
@@ -57,6 +57,7 @@ The documentation is organized into five sections:
|
||||
1. **[AltCLIP](model_doc/altclip)** (from BAAI) released with the paper [AltCLIP: Altering the Language Encoder in CLIP for Extended Language Capabilities](https://arxiv.org/abs/2211.06679) by Chen, Zhongzhi and Liu, Guang and Zhang, Bo-Wen and Ye, Fulong and Yang, Qinghong and Wu, Ledell.
|
||||
1. **[Audio Spectrogram Transformer](model_doc/audio-spectrogram-transformer)** (from MIT) released with the paper [AST: Audio Spectrogram Transformer](https://arxiv.org/abs/2104.01778) by Yuan Gong, Yu-An Chung, James Glass.
|
||||
1. **[Autoformer](model_doc/autoformer)** (from Tsinghua University) released with the paper [Autoformer: Decomposition Transformers with Auto-Correlation for Long-Term Series Forecasting](https://arxiv.org/abs/2106.13008) by Haixu Wu, Jiehui Xu, Jianmin Wang, Mingsheng Long.
|
||||
1. **[Bark](model_doc/bark)** (from Suno) released in the repository [suno-ai/bark](https://github.com/suno-ai/bark) by Suno AI team.
|
||||
1. **[BART](model_doc/bart)** (from Facebook) released with the paper [BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension](https://arxiv.org/abs/1910.13461) by Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov and Luke Zettlemoyer.
|
||||
1. **[BARThez](model_doc/barthez)** (from École polytechnique) released with the paper [BARThez: a Skilled Pretrained French Sequence-to-Sequence Model](https://arxiv.org/abs/2010.12321) by Moussa Kamal Eddine, Antoine J.-P. Tixier, Michalis Vazirgiannis.
|
||||
1. **[BARTpho](model_doc/bartpho)** (from VinAI Research) released with the paper [BARTpho: Pre-trained Sequence-to-Sequence Models for Vietnamese](https://arxiv.org/abs/2109.09701) by Nguyen Luong Tran, Duong Minh Le and Dat Quoc Nguyen.
|
||||
@@ -282,6 +283,7 @@ Flax), PyTorch, and/or TensorFlow.
|
||||
| AltCLIP | ❌ | ❌ | ✅ | ❌ | ❌ |
|
||||
| Audio Spectrogram Transformer | ❌ | ❌ | ✅ | ❌ | ❌ |
|
||||
| Autoformer | ❌ | ❌ | ✅ | ❌ | ❌ |
|
||||
| Bark | ❌ | ❌ | ✅ | ❌ | ❌ |
|
||||
| BART | ✅ | ✅ | ✅ | ✅ | ✅ |
|
||||
| BEiT | ❌ | ❌ | ✅ | ❌ | ✅ |
|
||||
| BERT | ✅ | ✅ | ✅ | ✅ | ✅ |
|
||||
|
||||
141
docs/source/en/model_doc/bark.md
Normal file
@@ -0,0 +1,141 @@
|
||||
<!--Copyright 2023 The HuggingFace Team. All rights reserved.
|
||||
|
||||
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
|
||||
the License. You may obtain a copy of the License at
|
||||
|
||||
http://www.apache.org/licenses/LICENSE-2.0
|
||||
|
||||
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
|
||||
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
|
||||
specific language governing permissions and limitations under the License.
|
||||
-->
|
||||
|
||||
# Bark
|
||||
|
||||
## Overview
|
||||
|
||||
Bark is a transformer-based text-to-speech model proposed by Suno AI in [suno-ai/bark](https://github.com/suno-ai/bark).
|
||||
|
||||
|
||||
Bark is made of four main models:
|
||||
|
||||
- [`BarkSemanticModel`] (also referred to as the 'text' model): a causal autoregressive transformer that takes tokenized text as input and predicts semantic text tokens that capture the meaning of the text.
|
||||
- [`BarkCoarseModel`] (also referred to as the 'coarse acoustics' model): a causal autoregressive transformer that takes as input the output of the [`BarkSemanticModel`] and predicts the first two audio codebooks needed by EnCodec.
|
||||
- [`BarkFineModel`] (the 'fine acoustics' model): this time a non-causal autoencoder transformer, which iteratively predicts the remaining codebooks based on the sum of the embeddings of the previous codebooks.
|
||||
- Having predicted all the codebook channels, Bark uses the [`EncodecModel`] to decode them into the output audio array.
|
||||
|
||||
Each of the first three modules can take conditional speaker embeddings, which condition the output sound on a specific predefined voice.
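The hand-off between the four stages can be pictured with a toy sketch. Every function below is a hypothetical stand-in that emits random codes, not the Transformers implementation. The total of 8 codebooks, the codebook size of 1024, and the frame counts are assumptions chosen only for illustration. Only the "first two codebooks" split comes from the description above.

```python
import random

N_COARSE_CODEBOOKS = 2  # from the description: the coarse model predicts the first two codebooks
N_TOTAL_CODEBOOKS = 8   # assumption for illustration only
CODEBOOK_SIZE = 1024    # assumption for illustration only


def fake_semantic_stage(text_tokens, rng):
    """Stand-in for BarkSemanticModel: text tokens -> semantic tokens."""
    return [rng.randrange(CODEBOOK_SIZE) for _ in text_tokens]


def fake_coarse_stage(semantic_tokens, rng):
    """Stand-in for BarkCoarseModel: semantic tokens -> first two codebooks."""
    n_frames = len(semantic_tokens)  # arbitrary; the real frame ratio differs
    return [
        [rng.randrange(CODEBOOK_SIZE) for _ in range(n_frames)]
        for _ in range(N_COARSE_CODEBOOKS)
    ]


def fake_fine_stage(coarse_codes, rng):
    """Stand-in for BarkFineModel: adds the remaining codebooks one at a time."""
    codes = [list(row) for row in coarse_codes]
    n_frames = len(codes[0])
    while len(codes) < N_TOTAL_CODEBOOKS:
        # The real model conditions each new codebook on the summed embeddings
        # of the previous ones; this placeholder just draws random codes.
        codes.append([rng.randrange(CODEBOOK_SIZE) for _ in range(n_frames)])
    return codes


def fake_codec_decode(codes, samples_per_frame=4):
    """Stand-in for EncodecModel decoding: codebook matrix -> silent waveform."""
    return [0.0] * (len(codes[0]) * samples_per_frame)


rng = random.Random(0)
semantic = fake_semantic_stage([11, 22, 33, 44, 55], rng)
coarse = fake_coarse_stage(semantic, rng)
fine = fake_fine_stage(coarse, rng)
audio = fake_codec_decode(fine)
print(len(coarse), len(fine), len(audio))
```

The point of the sketch is the shape of the data: the fine stage keeps the coarse codebooks untouched and stacks the remaining ones on top before the codec turns the full matrix into a waveform.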
|
||||
|
||||
|
||||
### Tips:
|
||||
|
||||
Suno offers a library of voice presets in a number of languages [here](https://suno-ai.notion.site/8b8e8749ed514b0cbf3f699013548683?v=bc67cff786b04b50b3ceb756fd05f68c).
|
||||
These presets are also available on the Hub [here](https://huggingface.co/suno/bark-small/tree/main/speaker_embeddings) or [here](https://huggingface.co/suno/bark/tree/main/speaker_embeddings).
|
||||
|
||||
```python
|
||||
>>> from transformers import AutoProcessor, BarkModel
|
||||
|
||||
>>> processor = AutoProcessor.from_pretrained("suno/bark")
|
||||
>>> model = BarkModel.from_pretrained("suno/bark")
|
||||
|
||||
>>> voice_preset = "v2/en_speaker_6"
|
||||
|
||||
>>> inputs = processor("Hello, my dog is cute", voice_preset=voice_preset)
|
||||
|
||||
>>> audio_array = model.generate(**inputs)
|
||||
>>> audio_array = audio_array.cpu().numpy().squeeze()
|
||||
```
|
||||
|
||||
Bark can generate highly realistic, **multilingual** speech as well as other audio - including music, background noise and simple sound effects.
|
||||
|
||||
```python
|
||||
>>> # Multilingual speech - simplified Chinese
|
||||
>>> inputs = processor("惊人的!我会说中文")
|
||||
|
||||
>>> # Multilingual speech - French - let's use a voice_preset as well
|
||||
>>> inputs = processor("Incroyable! Je peux générer du son.", voice_preset="fr_speaker_5")
|
||||
|
||||
>>> # Bark can also generate music. You can help it out by adding music notes around your lyrics.
|
||||
>>> inputs = processor("♪ Hello, my dog is cute ♪")
|
||||
|
||||
>>> audio_array = model.generate(**inputs)
|
||||
>>> audio_array = audio_array.cpu().numpy().squeeze()
|
||||
```
|
||||
|
||||
The model can also produce **nonverbal communications** like laughing, sighing and crying.
|
||||
|
||||
|
||||
```python
|
||||
>>> # Adding non-speech cues to the input text
|
||||
>>> inputs = processor("Hello uh ... [clears throat], my dog is cute [laughter]")
|
||||
|
||||
>>> audio_array = model.generate(**inputs)
|
||||
>>> audio_array = audio_array.cpu().numpy().squeeze()
|
||||
```
|
||||
|
||||
To save the audio, take the sample rate from the model's generation config and use the `write` utility from `scipy.io.wavfile`:
|
||||
|
||||
```python
|
||||
>>> from scipy.io.wavfile import write as write_wav
|
||||
|
||||
>>> # save audio to disk, but first take the sample rate from the model's generation config
|
||||
>>> sample_rate = model.generation_config.sample_rate
|
||||
>>> write_wav("bark_generation.wav", sample_rate, audio_array)
|
||||
```
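When the array is floating point, `scipy.io.wavfile.write` stores it as a floating-point WAV file, which some players may not open. As a hedged alternative, the sketch below writes 16-bit PCM using only the standard library. The helper names `float_to_pcm16` and `save_pcm16_wav` are hypothetical, the 24 kHz rate is an assumption standing in for `model.generation_config.sample_rate`, and the sine tone is a synthetic placeholder for a generated waveform.

```python
import math
import os
import struct
import tempfile
import wave


def float_to_pcm16(samples):
    """Convert floats in [-1.0, 1.0] to signed 16-bit integers, clipping out-of-range values."""
    pcm = []
    for sample in samples:
        clipped = max(-1.0, min(1.0, float(sample)))
        pcm.append(int(round(clipped * 32767)))
    return pcm


def save_pcm16_wav(path, samples, sample_rate):
    """Write a mono 16-bit PCM WAV file with the standard-library wave module."""
    pcm = float_to_pcm16(samples)
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)  # 2 bytes per sample = 16 bits
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(struct.pack(f"<{len(pcm)}h", *pcm))


# Synthetic stand-in for a generated waveform: one second of a 440 Hz tone.
sample_rate = 24_000  # assumed value; read it from the generation config in practice
tone = [0.5 * math.sin(2 * math.pi * 440 * n / sample_rate) for n in range(sample_rate)]

# Written to the temp directory so the sketch leaves nothing in the working directory.
output_path = os.path.join(tempfile.gettempdir(), "bark_generation_pcm16.wav")
save_pcm16_wav(output_path, tone, sample_rate)
```

With the `audio_array` from the earlier examples, the equivalent call would be `save_pcm16_wav(path, audio_array.tolist(), sample_rate)`, since the helper accepts any iterable of floats.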

This model was contributed by [Yoach Lacombe (ylacombe)](https://huggingface.co/ylacombe) and [Sanchit Gandhi (sanchit-gandhi)](https://github.com/sanchit-gandhi).
The original code can be found [here](https://github.com/suno-ai/bark).

## BarkConfig

[[autodoc]] BarkConfig
    - all

## BarkProcessor

[[autodoc]] BarkProcessor
    - all
    - __call__

## BarkModel

[[autodoc]] BarkModel
    - generate

## BarkSemanticModel

[[autodoc]] BarkSemanticModel
    - forward

## BarkCoarseModel

[[autodoc]] BarkCoarseModel
    - forward

## BarkFineModel

[[autodoc]] BarkFineModel
    - forward

## BarkCausalModel

[[autodoc]] BarkCausalModel
    - forward

## BarkCoarseConfig

[[autodoc]] BarkCoarseConfig
    - all

## BarkFineConfig

[[autodoc]] BarkFineConfig
    - all

## BarkSemanticConfig

[[autodoc]] BarkSemanticConfig
    - all

@@ -160,6 +160,13 @@ _import_structure = {
|
||||
"AUTOFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP",
|
||||
"AutoformerConfig",
|
||||
],
|
||||
"models.bark": [
|
||||
"BarkCoarseConfig",
|
||||
"BarkConfig",
|
||||
"BarkFineConfig",
|
||||
"BarkProcessor",
|
||||
"BarkSemanticConfig",
|
||||
],
|
||||
"models.bart": ["BartConfig", "BartTokenizer"],
|
||||
"models.barthez": [],
|
||||
"models.bartpho": [],
|
||||
@@ -1136,6 +1143,17 @@ else:
|
||||
"AutoformerPreTrainedModel",
|
||||
]
|
||||
)
|
||||
_import_structure["models.bark"].extend(
|
||||
[
|
||||
"BARK_PRETRAINED_MODEL_ARCHIVE_LIST",
|
||||
"BarkCausalModel",
|
||||
"BarkCoarseModel",
|
||||
"BarkFineModel",
|
||||
"BarkModel",
|
||||
"BarkPreTrainedModel",
|
||||
"BarkSemanticModel",
|
||||
]
|
||||
)
|
||||
_import_structure["models.bart"].extend(
|
||||
[
|
||||
"BART_PRETRAINED_MODEL_ARCHIVE_LIST",
|
||||
@@ -4098,6 +4116,13 @@ if TYPE_CHECKING:
|
||||
AUTOFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP,
|
||||
AutoformerConfig,
|
||||
)
|
||||
from .models.bark import (
|
||||
BarkCoarseConfig,
|
||||
BarkConfig,
|
||||
BarkFineConfig,
|
||||
BarkProcessor,
|
||||
BarkSemanticConfig,
|
||||
)
|
||||
from .models.bart import BartConfig, BartTokenizer
|
||||
from .models.beit import BEIT_PRETRAINED_CONFIG_ARCHIVE_MAP, BeitConfig
|
||||
from .models.bert import (
|
||||
@@ -4978,6 +5003,15 @@ if TYPE_CHECKING:
|
||||
AutoformerModel,
|
||||
AutoformerPreTrainedModel,
|
||||
)
|
||||
from .models.bark import (
|
||||
BARK_PRETRAINED_MODEL_ARCHIVE_LIST,
|
||||
BarkCausalModel,
|
||||
BarkCoarseModel,
|
||||
BarkFineModel,
|
||||
BarkModel,
|
||||
BarkPreTrainedModel,
|
||||
BarkSemanticModel,
|
||||
)
|
||||
from .models.bart import (
|
||||
BART_PRETRAINED_MODEL_ARCHIVE_LIST,
|
||||
BartForCausalLM,
|
||||
|
||||
@@ -1148,3 +1148,39 @@ class ClassifierFreeGuidanceLogitsProcessor(LogitsProcessor):
|
||||
cond_logits, uncond_logits = scores.split(unguided_bsz, dim=0)
|
||||
scores = uncond_logits + (cond_logits - uncond_logits) * self.guidance_scale
|
||||
return scores
|
||||
|
||||
|
||||
class AlternatingCodebooksLogitsProcessor(LogitsProcessor):
    r"""
    [`LogitsProcessor`] enforcing alternated generation between the two codebooks of [`Bark`]'s fine submodel.

    Args:
        input_start_len (`int`):
            The length of the initial input sequence.
        semantic_vocab_size (`int`):
            Vocabulary size of the semantic part, i.e. number of tokens associated to the semantic vocabulary.
        codebook_size (`int`):
            Number of tokens associated to the codebook.
    """

    def __init__(self, input_start_len: int, semantic_vocab_size: int, codebook_size: int):
        if not isinstance(input_start_len, int) or input_start_len < 0:
            raise ValueError(f"`input_start_len` has to be a non-negative integer, but is {input_start_len}")

        self.input_start_len = input_start_len
        self.semantic_vocab_size = semantic_vocab_size
        self.codebook_size = codebook_size

    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor) -> torch.FloatTensor:
        curr_len = input_ids.shape[-1]

        # even -> first codebook, odd -> second codebook
        is_first_codebook = ((curr_len - self.input_start_len) % 2) == 0

        if is_first_codebook:
            scores[:, : self.semantic_vocab_size] = -float("inf")
            scores[:, self.semantic_vocab_size + self.codebook_size :] = -float("inf")
        else:
            scores[:, : self.semantic_vocab_size + self.codebook_size] = -float("inf")

        return scores
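
Review note: the masking rule in `__call__` can be checked without torch. The sketch below is a plain-Python illustration, not part of the PR; `allowed_token_range` and `mask_scores` are hypothetical helpers, and the vocabulary sizes are toy values chosen only to make the ranges easy to read.

```python
def allowed_token_range(curr_len, input_start_len, semantic_vocab_size, codebook_size, vocab_size):
    """Return the half-open range [start, stop) of token ids left unmasked at this step."""
    is_first_codebook = (curr_len - input_start_len) % 2 == 0
    if is_first_codebook:
        return semantic_vocab_size, semantic_vocab_size + codebook_size
    return semantic_vocab_size + codebook_size, vocab_size


def mask_scores(scores, start, stop):
    """Set every score outside [start, stop) to -inf, mirroring the slicing in `__call__`."""
    return [s if start <= i < stop else float("-inf") for i, s in enumerate(scores)]


# Toy sizes (assumptions): 4 semantic tokens followed by two codebooks of 3 tokens each.
semantic_vocab_size, codebook_size, vocab_size = 4, 3, 10
scores = [0.0] * vocab_size

# Even offset from the prompt length -> first codebook; odd offset -> second codebook.
step0 = mask_scores(scores, *allowed_token_range(5, 5, semantic_vocab_size, codebook_size, vocab_size))
step1 = mask_scores(scores, *allowed_token_range(6, 5, semantic_vocab_size, codebook_size, vocab_size))
```

With these toy sizes, the first step leaves only ids 4 to 6 selectable and the next step only ids 7 to 9, so sampling alternates between the two codebook ranges and never returns a semantic token.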
|
||||
|
||||
@@ -19,6 +19,7 @@ from . import (
|
||||
audio_spectrogram_transformer,
|
||||
auto,
|
||||
autoformer,
|
||||
bark,
|
||||
bart,
|
||||
barthez,
|
||||
bartpho,
|
||||
|
||||
@@ -35,6 +35,7 @@ CONFIG_MAPPING_NAMES = OrderedDict(
|
||||
("altclip", "AltCLIPConfig"),
|
||||
("audio-spectrogram-transformer", "ASTConfig"),
|
||||
("autoformer", "AutoformerConfig"),
|
||||
("bark", "BarkConfig"),
|
||||
("bart", "BartConfig"),
|
||||
("beit", "BeitConfig"),
|
||||
("bert", "BertConfig"),
|
||||
@@ -237,6 +238,7 @@ CONFIG_ARCHIVE_MAP_MAPPING_NAMES = OrderedDict(
|
||||
("altclip", "ALTCLIP_PRETRAINED_CONFIG_ARCHIVE_MAP"),
|
||||
("audio-spectrogram-transformer", "AUDIO_SPECTROGRAM_TRANSFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP"),
|
||||
("autoformer", "AUTOFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP"),
|
||||
("bark", "BARK_PRETRAINED_CONFIG_ARCHIVE_MAP"),
|
||||
("bart", "BART_PRETRAINED_CONFIG_ARCHIVE_MAP"),
|
||||
("beit", "BEIT_PRETRAINED_CONFIG_ARCHIVE_MAP"),
|
||||
("bert", "BERT_PRETRAINED_CONFIG_ARCHIVE_MAP"),
|
||||
@@ -419,6 +421,7 @@ MODEL_NAMES_MAPPING = OrderedDict(
|
||||
("altclip", "AltCLIP"),
|
||||
("audio-spectrogram-transformer", "Audio Spectrogram Transformer"),
|
||||
("autoformer", "Autoformer"),
|
||||
("bark", "Bark"),
|
||||
("bart", "BART"),
|
||||
("barthez", "BARThez"),
|
||||
("bartpho", "BARTpho"),
|
||||
|
||||
@@ -33,6 +33,7 @@ MODEL_MAPPING_NAMES = OrderedDict(
|
||||
("altclip", "AltCLIPModel"),
|
||||
("audio-spectrogram-transformer", "ASTModel"),
|
||||
("autoformer", "AutoformerModel"),
|
||||
("bark", "BarkModel"),
|
||||
("bart", "BartModel"),
|
||||
("beit", "BeitModel"),
|
||||
("bert", "BertModel"),
|
||||
|
||||
@@ -44,6 +44,7 @@ PROCESSOR_MAPPING_NAMES = OrderedDict(
|
||||
[
|
||||
("align", "AlignProcessor"),
|
||||
("altclip", "AltCLIPProcessor"),
|
||||
("bark", "BarkProcessor"),
|
||||
("blip", "BlipProcessor"),
|
||||
("blip-2", "Blip2Processor"),
|
||||
("bridgetower", "BridgeTowerProcessor"),
|
||||
|
||||
79
src/transformers/models/bark/__init__.py
Normal file
@@ -0,0 +1,79 @@
|
||||
# Copyright 2023 The HuggingFace Team. All rights reserved.
|
||||
#
|
||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||
# you may not use this file except in compliance with the License.
|
||||
# You may obtain a copy of the License at
|
||||
#
|
||||
# http://www.apache.org/licenses/LICENSE-2.0
|
||||
#
|
||||
# Unless required by applicable law or agreed to in writing, software
|
||||
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
from typing import TYPE_CHECKING
|
||||
|
||||
from ...utils import (
|
||||
OptionalDependencyNotAvailable,
|
||||
_LazyModule,
|
||||
is_torch_available,
|
||||
)
|
||||
|
||||
|
||||
_import_structure = {
|
||||
"configuration_bark": [
|
||||
"BARK_PRETRAINED_CONFIG_ARCHIVE_MAP",
|
||||
"BarkCoarseConfig",
|
||||
"BarkConfig",
|
||||
"BarkFineConfig",
|
||||
"BarkSemanticConfig",
|
||||
],
|
||||
"processing_bark": ["BarkProcessor"],
|
||||
}
|
||||
|
||||
try:
|
||||
if not is_torch_available():
|
||||
raise OptionalDependencyNotAvailable()
|
||||
except OptionalDependencyNotAvailable:
|
||||
pass
|
||||
else:
|
||||
_import_structure["modeling_bark"] = [
|
||||
"BARK_PRETRAINED_MODEL_ARCHIVE_LIST",
|
||||
"BarkFineModel",
|
||||
"BarkSemanticModel",
|
||||
"BarkCoarseModel",
|
||||
"BarkModel",
|
||||
"BarkPreTrainedModel",
|
||||
"BarkCausalModel",
|
||||
]
|
||||
|
||||
if TYPE_CHECKING:
|
||||
from .configuration_bark import (
|
||||
BARK_PRETRAINED_CONFIG_ARCHIVE_MAP,
|
||||
BarkCoarseConfig,
|
||||
BarkConfig,
|
||||
BarkFineConfig,
|
||||
BarkSemanticConfig,
|
||||
)
|
||||
from .processing_bark import BarkProcessor
|
||||
|
||||
try:
|
||||
if not is_torch_available():
|
||||
raise OptionalDependencyNotAvailable()
|
||||
except OptionalDependencyNotAvailable:
|
||||
pass
|
||||
else:
|
||||
from .modeling_bark import (
|
||||
BARK_PRETRAINED_MODEL_ARCHIVE_LIST,
|
||||
BarkCausalModel,
|
||||
BarkCoarseModel,
|
||||
BarkFineModel,
|
||||
BarkModel,
|
||||
BarkPreTrainedModel,
|
||||
BarkSemanticModel,
|
||||
)
|
||||
|
||||
else:
|
||||
import sys
|
||||
|
||||
sys.modules[__name__] = _LazyModule(__name__, globals()["__file__"], _import_structure, module_spec=__spec__)
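
Review note: the `_import_structure` dictionary maps submodule names to the public names they export, so nothing heavy is imported until a name is first used. The sketch below shows that idea in isolation. It is a simplified illustration and does not reproduce how `_LazyModule` is implemented; `LazyNamespace` is a hypothetical class, and standard-library modules stand in for `configuration_bark`, `processing_bark` and `modeling_bark`.

```python
import importlib


class LazyNamespace:
    """Resolve attribute names to objects in other modules on first access."""

    def __init__(self, import_structure):
        # Invert {module: [names]} into {name: module} for quick lookup.
        self._name_to_module = {
            name: module for module, names in import_structure.items() for name in names
        }

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails, i.e. before the name is cached.
        if name not in self._name_to_module:
            raise AttributeError(f"no lazy attribute named {name!r}")
        module = importlib.import_module(self._name_to_module[name])
        value = getattr(module, name)
        setattr(self, name, value)  # cache so the import runs only once
        return value


# Standard-library modules are used here purely as stand-ins.
lazy = LazyNamespace({"json": ["dumps"], "math": ["sqrt"]})
root = lazy.sqrt(9)
```

The `TYPE_CHECKING` branch in the file exists for the opposite reason: static analysers need real import statements, so the eager imports are kept in a branch that never executes at runtime.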
|
||||
348
src/transformers/models/bark/configuration_bark.py
Normal file
@@ -0,0 +1,348 @@
|
||||
# coding=utf-8
|
||||
# Copyright 2023 The Suno AI Authors and The HuggingFace Inc. team. All rights reserved.
|
||||
#
|
||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||
# you may not use this file except in compliance with the License.
|
||||
# You may obtain a copy of the License at
|
||||
#
|
||||
# http://www.apache.org/licenses/LICENSE-2.0
|
||||
#
|
||||
# Unless required by applicable law or agreed to in writing, software
|
||||
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
""" BARK model configuration"""
|
||||
|
||||
import copy
|
||||
import os
|
||||
from typing import Dict, Optional, Union
|
||||
|
||||
from ...configuration_utils import PretrainedConfig
|
||||
from ...utils import add_start_docstrings, logging
|
||||
from ..auto import AutoConfig
|
||||
|
||||
|
||||
logger = logging.get_logger(__name__)
|
||||
|
||||
|
||||
BARK_PRETRAINED_CONFIG_ARCHIVE_MAP = {
|
||||
"suno/bark-small": "https://huggingface.co/suno/bark-small/resolve/main/config.json",
|
||||
"suno/bark": "https://huggingface.co/suno/bark/resolve/main/config.json",
|
||||
}
|
||||
|
||||
BARK_SUBMODELCONFIG_START_DOCSTRING = """
    This is the configuration class to store the configuration of a [`{model}`]. It is used to instantiate the model
    according to the specified arguments, defining the model architecture. Instantiating a configuration with the
    defaults will yield a similar configuration to that of the Bark [suno/bark](https://huggingface.co/suno/bark)
    architecture.

    Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
    documentation from [`PretrainedConfig`] for more information.

    Args:
        block_size (`int`, *optional*, defaults to 1024):
            The maximum sequence length that this model might ever be used with. Typically set this to something large
            just in case (e.g., 512 or 1024 or 2048).
        input_vocab_size (`int`, *optional*, defaults to 10_048):
            Vocabulary size of a Bark sub-model. Defines the number of different tokens that can be represented by the
            `input_ids` passed when calling [`{model}`]. Defaults to 10_048 but should be chosen carefully with
            regard to the sub-model.
        output_vocab_size (`int`, *optional*, defaults to 10_048):
            Output vocabulary size of a Bark sub-model. Defines the number of different tokens that can be represented
            by the `output_ids` when passing forward a [`{model}`]. Defaults to 10_048 but should be chosen carefully
            with regard to the sub-model.
        num_layers (`int`, *optional*, defaults to 12):
            Number of hidden layers in the given sub-model.
        num_heads (`int`, *optional*, defaults to 12):
            Number of attention heads for each attention layer in the Transformer architecture.
        hidden_size (`int`, *optional*, defaults to 768):
            Dimensionality of the hidden states (the embedding dimension, `n_embd` in the original Bark checkpoints).
        dropout (`float`, *optional*, defaults to 0.0):
            The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
        bias (`bool`, *optional*, defaults to `True`):
            Whether or not to use bias in the linear layers and layer norm layers.
        initializer_range (`float`, *optional*, defaults to 0.02):
            The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
        use_cache (`bool`, *optional*, defaults to `True`):
            Whether or not the model should return the last key/values attentions (not used by all models).
"""
|
||||
|
||||
|
||||
class BarkSubModelConfig(PretrainedConfig):
|
||||
model_type = "bark_module"
|
||||
keys_to_ignore_at_inference = ["past_key_values"]
|
||||
|
||||
attribute_map = {
|
||||
"num_attention_heads": "num_heads",
|
||||
"num_hidden_layers": "num_layers",
|
||||
"vocab_size": "input_vocab_size",
|
||||
"window_size": "block_size",
|
||||
}
|
||||
|
||||
def __init__(
|
||||
self,
|
||||
block_size=1024,
|
||||
input_vocab_size=10_048,
|
||||
output_vocab_size=10_048,
|
||||
num_layers=12,
|
||||
num_heads=12,
|
||||
hidden_size=768,
|
||||
dropout=0.0,
|
||||
bias=True, # True: bias in Linears and LayerNorms, like GPT-2. False: a bit better and faster
|
||||
initializer_range=0.02,
|
||||
use_cache=True,
|
||||
**kwargs,
|
||||
):
|
||||
self.block_size = block_size
|
||||
self.input_vocab_size = input_vocab_size
|
||||
self.output_vocab_size = output_vocab_size
|
||||
self.num_layers = num_layers
|
||||
self.num_heads = num_heads
|
||||
self.hidden_size = hidden_size
|
||||
self.dropout = dropout
|
||||
self.bias = bias
|
||||
self.use_cache = use_cache
|
||||
self.initializer_range = initializer_range
|
||||
|
||||
super().__init__(**kwargs)
|
||||
|
||||
@classmethod
|
||||
def from_pretrained(
|
||||
cls,
|
||||
pretrained_model_name_or_path: Union[str, os.PathLike],
|
||||
cache_dir: Optional[Union[str, os.PathLike]] = None,
|
||||
force_download: bool = False,
|
||||
local_files_only: bool = False,
|
||||
token: Optional[Union[str, bool]] = None,
|
||||
revision: str = "main",
|
||||
**kwargs,
|
||||
) -> "PretrainedConfig":
|
||||
kwargs["cache_dir"] = cache_dir
|
||||
kwargs["force_download"] = force_download
|
||||
kwargs["local_files_only"] = local_files_only
|
||||
kwargs["revision"] = revision
|
||||
|
||||
cls._set_token_in_kwargs(kwargs, token)
|
||||
|
||||
config_dict, kwargs = cls.get_config_dict(pretrained_model_name_or_path, **kwargs)
|
||||
|
||||
# get the config dict if we are loading from Bark
|
||||
if config_dict.get("model_type") == "bark":
|
||||
config_dict = config_dict[f"{cls.model_type}_config"]
|
||||
|
||||
if "model_type" in config_dict and hasattr(cls, "model_type") and config_dict["model_type"] != cls.model_type:
|
||||
logger.warning(
|
||||
f"You are using a model of type {config_dict['model_type']} to instantiate a model of type "
|
||||
f"{cls.model_type}. This is not supported for all configurations of models and can yield errors."
|
||||
)
|
||||
|
||||
return cls.from_dict(config_dict, **kwargs)
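
Review note: the `model_type == "bark"` branch is what lets a sub-model config be loaded from a full Bark checkpoint, because each sub-config is stored under the key `"<model_type>_config"`. The sketch below isolates that lookup with plain dictionaries. `select_sub_config` is a hypothetical helper and the composite dictionary is a heavily trimmed, made-up example.

```python
def select_sub_config(config_dict, sub_model_type):
    """Return the sub-model section of a composite config, or the dict itself if it is already a sub-config."""
    if config_dict.get("model_type") == "bark":
        return config_dict[f"{sub_model_type}_config"]
    return config_dict


# Hypothetical, heavily trimmed composite config.
composite = {
    "model_type": "bark",
    "semantic_config": {"model_type": "semantic", "num_layers": 12},
    "coarse_acoustics_config": {"model_type": "coarse_acoustics", "num_layers": 12},
}

# Loading from the composite checkpoint picks out the matching section...
semantic = select_sub_config(composite, "semantic")
# ...while a standalone sub-model config is passed through unchanged.
standalone = select_sub_config({"model_type": "fine_acoustics", "num_layers": 12}, "fine_acoustics")
```

This is why the `model_type` strings of the three sub-configs (`semantic`, `coarse_acoustics`, `fine_acoustics`) have to match the attribute names used by `BarkConfig`.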
|
||||
|
||||
|
||||
@add_start_docstrings(
|
||||
BARK_SUBMODELCONFIG_START_DOCSTRING.format(config="BarkSemanticConfig", model="BarkSemanticModel"),
|
||||
"""
|
||||
Example:
|
||||
|
||||
```python
|
||||
>>> from transformers import BarkSemanticConfig, BarkSemanticModel
|
||||
|
||||
>>> # Initializing a Bark sub-module style configuration
|
||||
>>> configuration = BarkSemanticConfig()
|
||||
|
||||
>>> # Initializing a model (with random weights) from the suno/bark style configuration
|
||||
>>> model = BarkSemanticModel(configuration)
|
||||
|
||||
>>> # Accessing the model configuration
|
||||
>>> configuration = model.config
|
||||
```""",
|
||||
)
|
||||
class BarkSemanticConfig(BarkSubModelConfig):
|
||||
model_type = "semantic"
|
||||
|
||||
|
||||
@add_start_docstrings(
|
||||
BARK_SUBMODELCONFIG_START_DOCSTRING.format(config="BarkCoarseConfig", model="BarkCoarseModel"),
|
||||
"""
|
||||
Example:
|
||||
|
||||
```python
|
||||
>>> from transformers import BarkCoarseConfig, BarkCoarseModel
|
||||
|
||||
>>> # Initializing a Bark sub-module style configuration
|
||||
>>> configuration = BarkCoarseConfig()
|
||||
|
||||
>>> # Initializing a model (with random weights) from the suno/bark style configuration
|
||||
>>> model = BarkCoarseModel(configuration)
|
||||
|
||||
>>> # Accessing the model configuration
|
||||
>>> configuration = model.config
|
||||
```""",
|
||||
)
|
||||
class BarkCoarseConfig(BarkSubModelConfig):
|
||||
model_type = "coarse_acoustics"
|
||||
|
||||
|
||||
@add_start_docstrings(
|
||||
BARK_SUBMODELCONFIG_START_DOCSTRING.format(config="BarkFineConfig", model="BarkFineModel"),
|
||||
"""
|
||||
n_codes_total (`int`, *optional*, defaults to 8):
|
||||
The total number of audio codebooks predicted. Used in the fine acoustics sub-model.
|
||||
n_codes_given (`int`, *optional*, defaults to 1):
|
||||
The number of audio codebooks predicted in the coarse acoustics sub-model. Used in the acoustics
|
||||
sub-models.
|
||||
Example:
|
||||
|
||||
```python
|
||||
>>> from transformers import BarkFineConfig, BarkFineModel
|
||||
|
||||
>>> # Initializing a Bark sub-module style configuration
|
||||
>>> configuration = BarkFineConfig()
|
||||
|
||||
>>> # Initializing a model (with random weights) from the suno/bark style configuration
|
||||
>>> model = BarkFineModel(configuration)
|
||||
|
||||
>>> # Accessing the model configuration
|
||||
>>> configuration = model.config
|
||||
```""",
|
||||
)
|
||||
class BarkFineConfig(BarkSubModelConfig):
|
||||
model_type = "fine_acoustics"
|
||||
|
||||
def __init__(self, tie_word_embeddings=True, n_codes_total=8, n_codes_given=1, **kwargs):
|
||||
self.n_codes_total = n_codes_total
|
||||
self.n_codes_given = n_codes_given
|
||||
|
||||
super().__init__(tie_word_embeddings=tie_word_embeddings, **kwargs)
|
||||
|
||||
|
||||
class BarkConfig(PretrainedConfig):
|
||||
"""
|
||||
This is the configuration class to store the configuration of a [`BarkModel`]. It is used to instantiate a Bark
|
||||
model according to the specified sub-models configurations, defining the model architecture.
|
||||
|
||||
Instantiating a configuration with the defaults will yield a similar configuration to that of the Bark
|
||||
[suno/bark](https://huggingface.co/suno/bark) architecture.
|
||||
|
||||
Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
|
||||
documentation from [`PretrainedConfig`] for more information.
|
||||
|
||||
Args:
|
||||
semantic_config ([`BarkSemanticConfig`], *optional*):
|
||||
Configuration of the underlying semantic sub-model.
|
||||
coarse_acoustics_config ([`BarkCoarseConfig`], *optional*):
|
||||
Configuration of the underlying coarse acoustics sub-model.
|
||||
fine_acoustics_config ([`BarkFineConfig`], *optional*):
|
||||
Configuration of the underlying fine acoustics sub-model.
|
||||
codec_config ([`AutoConfig`], *optional*):
|
||||
Configuration of the underlying codec sub-model.
|
||||
|
||||
Example:
|
||||
|
||||
```python
|
||||
>>> from transformers import (
|
||||
... BarkSemanticConfig,
|
||||
... BarkCoarseConfig,
|
||||
... BarkFineConfig,
|
||||
... BarkModel,
|
||||
... BarkConfig,
|
||||
... AutoConfig,
|
||||
... )
|
||||
|
||||
>>> # Initializing Bark sub-modules configurations.
|
||||
>>> semantic_config = BarkSemanticConfig()
|
||||
>>> coarse_acoustics_config = BarkCoarseConfig()
|
||||
>>> fine_acoustics_config = BarkFineConfig()
|
||||
>>> codec_config = AutoConfig.from_pretrained("facebook/encodec_24khz")
|
||||
|
||||
|
||||
>>> # Initializing a Bark module style configuration
|
||||
>>> configuration = BarkConfig.from_sub_model_configs(
|
||||
... semantic_config, coarse_acoustics_config, fine_acoustics_config, codec_config
|
||||
... )
|
||||
|
||||
>>> # Initializing a model (with random weights)
|
||||
>>> model = BarkModel(configuration)
|
||||
|
||||
>>> # Accessing the model configuration
|
||||
>>> configuration = model.config
|
||||
```
|
||||
"""
|
||||
|
||||
model_type = "bark"
|
||||
is_composition = True
|
||||
|
||||
def __init__(
|
||||
self,
|
||||
semantic_config: Dict = None,
|
||||
coarse_acoustics_config: Dict = None,
|
||||
fine_acoustics_config: Dict = None,
|
||||
codec_config: Dict = None,
|
||||
initializer_range=0.02,
|
||||
**kwargs,
|
||||
):
|
||||
if semantic_config is None:
|
||||
semantic_config = {}
|
||||
logger.info("semantic_config is None. initializing the semantic model with default values.")
|
||||
|
||||
if coarse_acoustics_config is None:
|
||||
coarse_acoustics_config = {}
|
||||
logger.info("coarse_acoustics_config is None. initializing the coarse model with default values.")
|
||||
|
||||
if fine_acoustics_config is None:
|
||||
fine_acoustics_config = {}
|
||||
logger.info("fine_acoustics_config is None. initializing the fine model with default values.")
|
||||
|
||||
if codec_config is None:
|
||||
codec_config = {}
|
||||
logger.info("codec_config is None. initializing the codec model with default values.")
|
||||
|
||||
self.semantic_config = BarkSemanticConfig(**semantic_config)
|
||||
self.coarse_acoustics_config = BarkCoarseConfig(**coarse_acoustics_config)
|
||||
self.fine_acoustics_config = BarkFineConfig(**fine_acoustics_config)
|
||||
self.codec_config = AutoConfig.for_model(**codec_config)
|
||||
|
||||
self.initializer_range = initializer_range
|
||||
|
||||
super().__init__(**kwargs)
|
||||
|
||||
@classmethod
|
||||
def from_sub_model_configs(
|
||||
cls,
|
||||
semantic_config: BarkSemanticConfig,
|
||||
coarse_acoustics_config: BarkCoarseConfig,
|
||||
fine_acoustics_config: BarkFineConfig,
|
||||
codec_config: AutoConfig,
|
||||
**kwargs,
|
||||
):
|
||||
r"""
|
||||
Instantiate a [`BarkConfig`] (or a derived class) from bark sub-models configuration.
|
||||
|
||||
Returns:
|
||||
[`BarkConfig`]: An instance of a configuration object
|
||||
"""
|
||||
return cls(
|
||||
semantic_config=semantic_config.to_dict(),
|
||||
coarse_acoustics_config=coarse_acoustics_config.to_dict(),
|
||||
fine_acoustics_config=fine_acoustics_config.to_dict(),
|
||||
codec_config=codec_config.to_dict(),
|
||||
**kwargs,
|
||||
)
|
||||
|
||||
def to_dict(self):
|
||||
"""
|
||||
Serializes this instance to a Python dictionary. Override the default [`~PretrainedConfig.to_dict`].
|
||||
|
||||
Returns:
|
||||
`Dict[str, Any]`: Dictionary of all the attributes that make up this configuration instance.
|
||||
"""
|
||||
output = copy.deepcopy(self.__dict__)
|
||||
|
||||
output["semantic_config"] = self.semantic_config.to_dict()
|
||||
output["coarse_acoustics_config"] = self.coarse_acoustics_config.to_dict()
|
||||
output["fine_acoustics_config"] = self.fine_acoustics_config.to_dict()
|
||||
output["codec_config"] = self.codec_config.to_dict()
|
||||
|
||||
output["model_type"] = self.__class__.model_type
|
||||
return output
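
Review note: `to_dict` is overridden because a deep copy of `__dict__` would still hold nested config objects, which are not plain dictionaries. The sketch below reproduces the pattern with two toy classes. `SubConfig` and `CompositeConfig` are hypothetical stand-ins, not the Transformers classes, and they carry only the attributes needed for the illustration.

```python
import copy


class SubConfig:
    """Toy sub-config: stores arbitrary keyword arguments as attributes."""

    def __init__(self, **kwargs):
        self.__dict__.update(kwargs)

    def to_dict(self):
        return copy.deepcopy(self.__dict__)


class CompositeConfig:
    """Toy composite config holding two nested sub-configs."""

    model_type = "composite"

    def __init__(self, semantic_config=None, codec_config=None):
        # A missing section falls back to an empty dict, i.e. to default values.
        self.semantic_config = SubConfig(**(semantic_config or {}))
        self.codec_config = SubConfig(**(codec_config or {}))

    def to_dict(self):
        output = copy.deepcopy(self.__dict__)
        # Replace nested objects with their dictionary form.
        output["semantic_config"] = self.semantic_config.to_dict()
        output["codec_config"] = self.codec_config.to_dict()
        output["model_type"] = self.__class__.model_type
        return output


config = CompositeConfig(semantic_config={"num_layers": 12})
as_dict = config.to_dict()
restored = CompositeConfig(
    semantic_config=as_dict["semantic_config"], codec_config=as_dict["codec_config"]
)
```

Because the constructor accepts dictionaries and `to_dict` emits dictionaries, a config survives a round trip through its serialized form.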
|
||||
262
src/transformers/models/bark/convert_suno_to_hf.py
Normal file
@@ -0,0 +1,262 @@
|
||||
"""Convert Bark checkpoint."""
|
||||
import argparse
|
||||
import os
|
||||
from pathlib import Path
|
||||
|
||||
import torch
|
||||
from bark.generation import _load_model as _bark_load_model
|
||||
from huggingface_hub import hf_hub_download
|
||||
|
||||
from transformers import EncodecConfig, EncodecModel, set_seed
|
||||
from transformers.models.bark.configuration_bark import (
|
||||
BarkCoarseConfig,
|
||||
BarkConfig,
|
||||
BarkFineConfig,
|
||||
BarkSemanticConfig,
|
||||
)
|
||||
from transformers.models.bark.generation_configuration_bark import (
|
||||
BarkCoarseGenerationConfig,
|
||||
BarkFineGenerationConfig,
|
||||
BarkGenerationConfig,
|
||||
BarkSemanticGenerationConfig,
|
||||
)
|
||||
from transformers.models.bark.modeling_bark import BarkCoarseModel, BarkFineModel, BarkModel, BarkSemanticModel
|
||||
from transformers.utils import logging
|
||||
|
||||
|
||||
logging.set_verbosity_info()
|
||||
logger = logging.get_logger(__name__)
|
||||
|
||||
set_seed(770)
|
||||
|
||||
|
||||
new_layer_name_dict = {
|
||||
"c_attn": "att_proj",
|
||||
"c_proj": "out_proj",
|
||||
"c_fc": "in_proj",
|
||||
"transformer.": "",
|
||||
"h.": "layers.",
|
||||
"ln_1": "layernorm_1",
|
||||
"ln_2": "layernorm_2",
|
||||
"ln_f": "layernorm_final",
|
||||
"wpe": "position_embeds_layer",
|
||||
"wte": "input_embeds_layer",
|
||||
}
|
||||
|
||||
|
||||
REMOTE_MODEL_PATHS = {
|
||||
"text_small": {
|
||||
"repo_id": "suno/bark",
|
||||
"file_name": "text.pt",
|
||||
},
|
||||
"coarse_small": {
|
||||
"repo_id": "suno/bark",
|
||||
"file_name": "coarse.pt",
|
||||
},
|
||||
"fine_small": {
|
||||
"repo_id": "suno/bark",
|
||||
"file_name": "fine.pt",
|
||||
},
|
||||
"text": {
|
||||
"repo_id": "suno/bark",
|
||||
"file_name": "text_2.pt",
|
||||
},
|
||||
"coarse": {
|
||||
"repo_id": "suno/bark",
|
||||
"file_name": "coarse_2.pt",
|
||||
},
|
||||
"fine": {
|
||||
"repo_id": "suno/bark",
|
||||
"file_name": "fine_2.pt",
|
||||
},
|
||||
}
|
||||
|
||||
CUR_PATH = os.path.dirname(os.path.abspath(__file__))
|
||||
default_cache_dir = os.path.join(os.path.expanduser("~"), ".cache")
|
||||
CACHE_DIR = os.path.join(os.getenv("XDG_CACHE_HOME", default_cache_dir), "suno", "bark_v0")
|
||||
|
||||
|
||||
def _get_ckpt_path(model_type, use_small=False):
|
||||
key = model_type
|
||||
if use_small:
|
||||
key += "_small"
|
||||
return os.path.join(CACHE_DIR, REMOTE_MODEL_PATHS[key]["file_name"])
|
||||
|
||||
|
||||
def _download(from_hf_path, file_name):
|
||||
os.makedirs(CACHE_DIR, exist_ok=True)
|
||||
hf_hub_download(repo_id=from_hf_path, filename=file_name, local_dir=CACHE_DIR)
|
||||
|
||||
|
||||
def _load_model(ckpt_path, device, use_small=False, model_type="text"):
    if model_type == "text":
        ModelClass = BarkSemanticModel
        ConfigClass = BarkSemanticConfig
        GenerationConfigClass = BarkSemanticGenerationConfig
    elif model_type == "coarse":
        ModelClass = BarkCoarseModel
        ConfigClass = BarkCoarseConfig
        GenerationConfigClass = BarkCoarseGenerationConfig
    elif model_type == "fine":
        ModelClass = BarkFineModel
        ConfigClass = BarkFineConfig
        GenerationConfigClass = BarkFineGenerationConfig
    else:
        raise NotImplementedError()
    model_key = f"{model_type}_small" if use_small else model_type
    model_info = REMOTE_MODEL_PATHS[model_key]
    if not os.path.exists(ckpt_path):
        logger.info(f"{model_type} model not found, downloading into `{CACHE_DIR}`.")
        _download(model_info["repo_id"], model_info["file_name"])
    checkpoint = torch.load(ckpt_path, map_location=device)
    # this is a hack: checkpoints that only store `vocab_size` use it for both the input and output vocabulary
    model_args = checkpoint["model_args"]
    if "input_vocab_size" not in model_args:
        model_args["input_vocab_size"] = model_args["vocab_size"]
        model_args["output_vocab_size"] = model_args["vocab_size"]
        del model_args["vocab_size"]

    # convert Bark model arguments to HF Bark model arguments
    model_args["num_heads"] = model_args.pop("n_head")
    model_args["hidden_size"] = model_args.pop("n_embd")
    model_args["num_layers"] = model_args.pop("n_layer")

    model_config = ConfigClass(**checkpoint["model_args"])
    model = ModelClass(config=model_config)
    model_generation_config = GenerationConfigClass()

    model.generation_config = model_generation_config
    state_dict = checkpoint["model"]
    # fixup checkpoint
    unwanted_prefix = "_orig_mod."
    for k, v in list(state_dict.items()):
        if k.startswith(unwanted_prefix):
            # replace part of the key with corresponding layer name in HF implementation
            new_k = k[len(unwanted_prefix) :]
            for old_layer_name in new_layer_name_dict:
                new_k = new_k.replace(old_layer_name, new_layer_name_dict[old_layer_name])

            state_dict[new_k] = state_dict.pop(k)

    extra_keys = set(state_dict.keys()) - set(model.state_dict().keys())
    extra_keys = {k for k in extra_keys if not k.endswith(".attn.bias")}
    missing_keys = set(model.state_dict().keys()) - set(state_dict.keys())
    missing_keys = {k for k in missing_keys if not k.endswith(".attn.bias")}
    if len(extra_keys) != 0:
        raise ValueError(f"extra keys found: {extra_keys}")
    if len(missing_keys) != 0:
        raise ValueError(f"missing keys: {missing_keys}")
    model.load_state_dict(state_dict, strict=False)
    n_params = model.num_parameters(exclude_embeddings=True)
    val_loss = checkpoint["best_val_loss"].item()
    logger.info(f"model loaded: {round(n_params/1e6,1)}M params, {round(val_loss,3)} loss")
    model.eval()
    model.to(device)
    del checkpoint, state_dict

    return model
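
Review note: the key renaming is the part of `_load_model` most likely to break silently, so it helps to see it on its own. The sketch below applies the same prefix stripping and substring replacements to a few checkpoint-style keys. `rename_key` is a hypothetical helper, the example keys are assumptions about what a GPT-style checkpoint contains, and the values are dummies rather than tensors.

```python
# Same mapping as `new_layer_name_dict` in the conversion script.
RENAMES = {
    "c_attn": "att_proj",
    "c_proj": "out_proj",
    "c_fc": "in_proj",
    "transformer.": "",
    "h.": "layers.",
    "ln_1": "layernorm_1",
    "ln_2": "layernorm_2",
    "ln_f": "layernorm_final",
    "wpe": "position_embeds_layer",
    "wte": "input_embeds_layer",
}
UNWANTED_PREFIX = "_orig_mod."


def rename_key(key):
    """Strip the compile prefix and map original layer names to the HF layer names."""
    if not key.startswith(UNWANTED_PREFIX):
        return key  # like the script, leave keys without the prefix untouched
    new_key = key[len(UNWANTED_PREFIX):]
    for old_name, new_name in RENAMES.items():
        new_key = new_key.replace(old_name, new_name)
    return new_key


# Assumed example keys with dummy values.
state_dict = {
    "_orig_mod.transformer.h.0.attn.c_attn.weight": 0,
    "_orig_mod.transformer.h.3.mlp.c_fc.weight": 1,
    "_orig_mod.transformer.ln_f.bias": 2,
    "_orig_mod.transformer.wte.weight": 3,
}
renamed = {rename_key(k): v for k, v in state_dict.items()}
```

The replacements are plain substring substitutions applied in dictionary order, so the `extra_keys` and `missing_keys` checks in the script are what catch a key that was renamed incorrectly.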
|
||||
|
||||
|
||||
def load_model(pytorch_dump_folder_path, use_small=False, model_type="text"):
    if model_type not in ("text", "coarse", "fine"):
        raise NotImplementedError()

    device = "cpu"  # do conversion on cpu

    ckpt_path = _get_ckpt_path(model_type, use_small=use_small)
    model = _load_model(ckpt_path, device, model_type=model_type, use_small=use_small)

    # load bark initial model
    bark_model = _bark_load_model(ckpt_path, "cpu", model_type=model_type, use_small=use_small)

    if model_type == "text":
        bark_model = bark_model["model"]

    if model.num_parameters(exclude_embeddings=True) != bark_model.get_num_params():
        raise ValueError("initial and new models don't have the same number of parameters")

    # check if same output as the bark model
    batch_size = 5
    sequence_length = 10

    if model_type in ["text", "coarse"]:
        vec = torch.randint(256, (batch_size, sequence_length), dtype=torch.int)
        output_old_model = bark_model(vec)[0]

        output_new_model_total = model(vec)

        # take last logits
        output_new_model = output_new_model_total.logits[:, [-1], :]

    else:
        prediction_codebook_channel = 3
        n_codes_total = 8
        vec = torch.randint(256, (batch_size, sequence_length, n_codes_total), dtype=torch.int)

        output_new_model_total = model(prediction_codebook_channel, vec)
        output_old_model = bark_model(prediction_codebook_channel, vec)

        output_new_model = output_new_model_total.logits

    # output difference should come from the difference of self-attention implementation design
    if output_new_model.shape != output_old_model.shape:
        raise ValueError("initial and new outputs don't have the same shape")
    if (output_new_model - output_old_model).abs().max().item() > 1e-3:
        raise ValueError("initial and new outputs are not equal")

    Path(pytorch_dump_folder_path).mkdir(exist_ok=True)
    model.save_pretrained(pytorch_dump_folder_path)
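
Review note: the equivalence check above compares the two models by the largest absolute difference between their logits, with a tolerance of `1e-3`. The sketch below shows the same test on flat Python lists. `max_abs_diff` and `outputs_match` are hypothetical helpers, and the numbers are made up for illustration.

```python
def max_abs_diff(old_output, new_output):
    """Largest element-wise absolute difference between two equally sized sequences."""
    if len(old_output) != len(new_output):
        raise ValueError("initial and new outputs don't have the same shape")
    return max(abs(old - new) for old, new in zip(old_output, new_output))


def outputs_match(old_output, new_output, tolerance=1e-3):
    """True when no element differs by more than `tolerance`."""
    return max_abs_diff(old_output, new_output) <= tolerance


# Made-up logits: small numerical noise passes, a real discrepancy does not.
close_enough = outputs_match([0.1, 0.2, 0.3], [0.1004, 0.2, 0.2998])
too_far = outputs_match([0.1, 0.2, 0.3], [0.102, 0.2, 0.3])
```

An absolute tolerance is a reasonable choice here because the two implementations are expected to differ only by floating-point noise from their attention code, not by a scale factor.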
|
||||
|
||||
|
||||
def load_whole_bark_model(
    semantic_path,
    coarse_path,
    fine_path,
    append_text,
    hub_path,
    folder_path,
):
    pytorch_dump_folder_path = os.path.join(folder_path, append_text)

    semanticConfig = BarkSemanticConfig.from_pretrained(os.path.join(semantic_path, "config.json"))
    coarseAcousticConfig = BarkCoarseConfig.from_pretrained(os.path.join(coarse_path, "config.json"))
    fineAcousticConfig = BarkFineConfig.from_pretrained(os.path.join(fine_path, "config.json"))
    codecConfig = EncodecConfig.from_pretrained("facebook/encodec_24khz")

    semantic = BarkSemanticModel.from_pretrained(semantic_path)
    coarseAcoustic = BarkCoarseModel.from_pretrained(coarse_path)
    fineAcoustic = BarkFineModel.from_pretrained(fine_path)
    codec = EncodecModel.from_pretrained("facebook/encodec_24khz")

    bark_config = BarkConfig.from_sub_model_configs(
        semanticConfig, coarseAcousticConfig, fineAcousticConfig, codecConfig
    )

    bark_generation_config = BarkGenerationConfig.from_sub_model_configs(
        semantic.generation_config, coarseAcoustic.generation_config, fineAcoustic.generation_config
    )

    bark = BarkModel(bark_config)

    bark.semantic = semantic
    bark.coarse_acoustics = coarseAcoustic
    bark.fine_acoustics = fineAcoustic
    bark.codec_model = codec

    bark.generation_config = bark_generation_config

    Path(pytorch_dump_folder_path).mkdir(exist_ok=True)
    bark.save_pretrained(pytorch_dump_folder_path, repo_id=hub_path, push_to_hub=True)
|
||||
|
||||
|
||||
if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    # Required parameters

    parser.add_argument("model_type", type=str, help="text, coarse or fine.")
    parser.add_argument("pytorch_dump_folder_path", default=None, type=str, help="Path to the output PyTorch model.")
    parser.add_argument("--is_small", action="store_true", help="convert the small version instead of the large.")

    args = parser.parse_args()

    load_model(args.pytorch_dump_folder_path, model_type=args.model_type, use_small=args.is_small)
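The entry point above takes two positional arguments and one flag. As an illustration, the same parser can be exercised without touching any model weights; the output path and the `build_parser` wrapper below are hypothetical, chosen only for the example:

```python
import argparse


def build_parser():
    # Mirrors the three arguments declared in the conversion script's entry point.
    parser = argparse.ArgumentParser()
    parser.add_argument("model_type", type=str, help="text, coarse or fine.")
    parser.add_argument("pytorch_dump_folder_path", type=str, help="Path to the output PyTorch model.")
    parser.add_argument("--is_small", action="store_true", help="convert the small version instead of the large.")
    return parser


# Hypothetical invocation: convert the small coarse model into ./bark_coarse_small
args = build_parser().parse_args(["coarse", "./bark_coarse_small", "--is_small"])
print(args.model_type, args.pytorch_dump_folder_path, args.is_small)
```

Because `--is_small` uses `action="store_true"`, omitting it leaves `args.is_small` as `False`, which selects the large checkpoint.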
|
||||
318
src/transformers/models/bark/generation_configuration_bark.py
Normal file
@@ -0,0 +1,318 @@
|
||||
# coding=utf-8
|
||||
# Copyright 2023 The Suno AI Authors and The HuggingFace Inc. team. All rights reserved.
|
||||
#
|
||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||
# you may not use this file except in compliance with the License.
|
||||
# You may obtain a copy of the License at
|
||||
#
|
||||
# http://www.apache.org/licenses/LICENSE-2.0
|
||||
#
|
||||
# Unless required by applicable law or agreed to in writing, software
|
||||
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
""" BARK model generation configuration"""
|
||||
|
||||
import copy
|
||||
from typing import Dict
|
||||
|
||||
from ...generation.configuration_utils import GenerationConfig
|
||||
from ...utils import logging
|
||||
|
||||
|
||||
logger = logging.get_logger(__name__)
|
||||
|
||||
|
||||
class BarkSemanticGenerationConfig(GenerationConfig):
|
||||
model_type = "semantic"
|
||||
|
||||
def __init__(
|
||||
self,
|
||||
eos_token_id=10_000,
|
||||
renormalize_logits=True,
|
||||
max_new_tokens=768,
|
||||
output_scores=False,
|
||||
return_dict_in_generate=False,
|
||||
output_hidden_states=False,
|
||||
output_attentions=False,
|
||||
temperature=0.7,
|
||||
do_sample=True,
|
||||
text_encoding_offset=10_048,
|
||||
text_pad_token=129_595,
|
||||
semantic_infer_token=129_599,
|
||||
semantic_vocab_size=10_000,
|
||||
max_input_semantic_length=256,
|
||||
semantic_rate_hz=49.9,
|
||||
**kwargs,
|
||||
):
|
||||
"""Class that holds a generation configuration for [`BarkSemanticModel`].
|
||||
|
||||
This configuration inherits from [`GenerationConfig`] and can be used to control the model generation. Read the
|
||||
documentation from [`GenerationConfig`] for more information.
|
||||
|
||||
Args:
|
||||
eos_token_id (`int`, *optional*, defaults to 10_000):
|
||||
The id of the *end-of-sequence* token.
|
||||
renormalize_logits (`bool`, *optional*, defaults to `True`):
|
||||
Whether to renormalize the logits after applying all the logits processors or warpers (including the
|
||||
custom ones). It's highly recommended to set this flag to `True` as the search algorithms suppose the
|
||||
score logits are normalized but some logit processors or warpers break the normalization.
|
||||
max_new_tokens (`int`, *optional*, defaults to 768):
|
||||
The maximum number of tokens to generate, ignoring the number of tokens in the prompt.
|
||||
output_scores (`bool`, *optional*, defaults to `False`):
|
||||
Whether or not to return the prediction scores. See `scores` under returned tensors for more details.
|
||||
return_dict_in_generate (`bool`, *optional*, defaults to `False`):
|
||||
Whether or not to return a [`~utils.ModelOutput`] instead of a plain tuple.
|
||||
output_hidden_states (`bool`, *optional*, defaults to `False`):
|
||||
Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors
|
||||
for more details.
|
||||
output_attentions (`bool`, *optional*, defaults to `False`):
|
||||
Whether or not to return the attentions tensors of all attention layers. See `attentions` under
|
||||
returned tensors for more details.
|
||||
temperature (`float`, *optional*, defaults to 0.7):
|
||||
The value used to modulate the next token probabilities.
|
||||
do_sample (`bool`, *optional*, defaults to `True`):
|
||||
Whether or not to use sampling; use greedy decoding otherwise.
|
||||
text_encoding_offset (`int`, *optional*, defaults to 10_048):
|
||||
Text encoding offset.
|
||||
text_pad_token (`int`, *optional*, defaults to 129_595):
|
||||
Text pad token.
|
||||
semantic_infer_token (`int`, *optional*, defaults to 129_599):
|
||||
Semantic infer token.
|
||||
semantic_vocab_size (`int`, *optional*, defaults to 10_000):
|
||||
Semantic vocab size.
|
||||
max_input_semantic_length (`int`, *optional*, defaults to 256):
|
||||
Max length of semantic input vector.
|
||||
semantic_rate_hz (`float`, *optional*, defaults to 49.9):
|
||||
Semantic rate in Hertz.
|
||||
"""
|
||||
super().__init__(
|
||||
temperature=temperature,
|
||||
do_sample=do_sample,
|
||||
eos_token_id=eos_token_id,
|
||||
renormalize_logits=renormalize_logits,
|
||||
max_new_tokens=max_new_tokens,
|
||||
output_scores=output_scores,
|
||||
return_dict_in_generate=return_dict_in_generate,
|
||||
output_hidden_states=output_hidden_states,
|
||||
output_attentions=output_attentions,
|
||||
**kwargs,
|
||||
)
|
||||
|
||||
self.text_encoding_offset = text_encoding_offset
|
||||
self.text_pad_token = text_pad_token
|
||||
self.semantic_pad_token = eos_token_id
|
||||
self.semantic_infer_token = semantic_infer_token
|
||||
self.semantic_vocab_size = semantic_vocab_size
|
||||
self.max_input_semantic_length = max_input_semantic_length
|
||||
self.semantic_rate_hz = semantic_rate_hz
|
||||
|
||||
|
||||
class BarkCoarseGenerationConfig(GenerationConfig):
|
||||
model_type = "coarse_acoustics"
|
||||
|
||||
def __init__(
|
||||
self,
|
||||
renormalize_logits=True,
|
||||
output_scores=False,
|
||||
return_dict_in_generate=False,
|
||||
output_hidden_states=False,
|
||||
output_attentions=False,
|
||||
temperature=0.7,
|
||||
do_sample=True,
|
||||
coarse_semantic_pad_token=12_048,
|
||||
coarse_rate_hz=75,
|
||||
n_coarse_codebooks=2,
|
||||
coarse_infer_token=12_050,
|
||||
max_coarse_input_length=256,
|
||||
max_coarse_history: int = 630,
|
||||
sliding_window_len: int = 60,
|
||||
**kwargs,
|
||||
):
|
||||
"""Class that holds a generation configuration for [`BarkCoarseModel`].
|
||||
|
||||
This configuration inherits from [`GenerationConfig`] and can be used to control the model generation. Read the
|
||||
documentation from [`GenerationConfig`] for more information.
|
||||
|
||||
Args:
|
||||
renormalize_logits (`bool`, *optional*, defaults to `True`):
|
||||
Whether to renormalize the logits after applying all the logits processors or warpers (including the
|
||||
custom ones). It's highly recommended to set this flag to `True` as the search algorithms suppose the
|
||||
score logits are normalized but some logit processors or warpers break the normalization.
|
||||
output_scores (`bool`, *optional*, defaults to `False`):
|
||||
Whether or not to return the prediction scores. See `scores` under returned tensors for more details.
|
||||
return_dict_in_generate (`bool`, *optional*, defaults to `False`):
|
||||
Whether or not to return a [`~utils.ModelOutput`] instead of a plain tuple.
|
||||
output_hidden_states (`bool`, *optional*, defaults to `False`):
|
||||
Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors
|
||||
for more details.
|
||||
output_attentions (`bool`, *optional*, defaults to `False`):
|
||||
Whether or not to return the attentions tensors of all attention layers. See `attentions` under
|
||||
returned tensors for more details.
|
||||
temperature (`float`, *optional*, defaults to 0.7):
|
||||
The value used to modulate the next token probabilities.
|
||||
do_sample (`bool`, *optional*, defaults to `True`):
|
||||
Whether or not to use sampling; use greedy decoding otherwise.
|
||||
coarse_semantic_pad_token (`int`, *optional*, defaults to 12_048):
|
||||
Coarse semantic pad token.
|
||||
coarse_rate_hz (`int`, *optional*, defaults to 75):
|
||||
Coarse rate in Hertz.
|
||||
n_coarse_codebooks (`int`, *optional*, defaults to 2):
|
||||
Number of coarse codebooks.
|
||||
coarse_infer_token (`int`, *optional*, defaults to 12_050):
|
||||
Coarse infer token.
|
||||
max_coarse_input_length (`int`, *optional*, defaults to 256):
|
||||
Max length of input coarse vector.
|
||||
max_coarse_history (`int`, *optional*, defaults to 630):
|
||||
Max length of the output of the coarse acoustics model used in the fine generation step.
|
||||
sliding_window_len (`int`, *optional*, defaults to 60):
|
||||
The coarse generation step uses a sliding window to generate raw audio.
|
||||
"""
|
||||
super().__init__(
|
||||
temperature=temperature,
|
||||
do_sample=do_sample,
|
||||
renormalize_logits=renormalize_logits,
|
||||
output_scores=output_scores,
|
||||
return_dict_in_generate=return_dict_in_generate,
|
||||
output_hidden_states=output_hidden_states,
|
||||
output_attentions=output_attentions,
|
||||
**kwargs,
|
||||
)
|
||||
|
||||
self.coarse_semantic_pad_token = coarse_semantic_pad_token
|
||||
self.coarse_rate_hz = coarse_rate_hz
|
||||
self.n_coarse_codebooks = n_coarse_codebooks
|
||||
self.coarse_infer_token = coarse_infer_token
|
||||
self.max_coarse_input_length = max_coarse_input_length
|
||||
self.max_coarse_history = max_coarse_history
|
||||
self.sliding_window_len = sliding_window_len
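The rate parameters in the two configurations above relate token counts to audio duration. A rough sketch of that arithmetic, assuming `semantic_rate_hz` and `coarse_rate_hz` count tokens per second of audio, with the coarse rate applying per codebook (an interpretation, not something the docstrings state), and using only the default values from this file:

```python
# Default values copied from BarkSemanticGenerationConfig and BarkCoarseGenerationConfig.
semantic_rate_hz = 49.9
max_new_tokens = 768
coarse_rate_hz = 75
n_coarse_codebooks = 2

# Assumption: the semantic rate is the number of semantic tokens per second of audio.
max_semantic_seconds = max_new_tokens / semantic_rate_hz

# Assumption: each coarse codebook emits `coarse_rate_hz` tokens per second, so the
# coarse stage produces this many tokens for every semantic token.
semantic_to_coarse_ratio = coarse_rate_hz / semantic_rate_hz * n_coarse_codebooks

print(f"{max_semantic_seconds:.2f} s of audio at most per semantic generation")
print(f"{semantic_to_coarse_ratio:.3f} coarse tokens per semantic token")
```

Under these assumptions the default `max_new_tokens` caps a single semantic generation at roughly fifteen seconds of audio, and the coarse stage emits about three tokens per semantic token.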
|
||||
|
||||
|
||||
class BarkFineGenerationConfig(GenerationConfig):
|
||||
model_type = "fine_acoustics"
|
||||
|
||||
def __init__(
|
||||
self,
|
||||
temperature=0.5,
|
||||
max_fine_history_length=512,
|
||||
max_fine_input_length=1024,
|
||||
n_fine_codebooks=8,
|
||||
**kwargs,
|
||||
):
|
||||
"""Class that holds a generation configuration for [`BarkFineModel`].
|
||||
|
||||
[`BarkFineModel`] is an autoencoder model, so it should not usually be used for generation. However, under the
hood, it uses `temperature` when used by [`BarkModel`].
|
||||
|
||||
This configuration inherits from [`GenerationConfig`] and can be used to control the model generation. Read the
|
||||
documentation from [`GenerationConfig`] for more information.
|
||||
|
||||
Args:
|
||||
temperature (`float`, *optional*, defaults to 0.5):
|
||||
The value used to modulate the next token probabilities.
|
||||
max_fine_history_length (`int`, *optional*, defaults to 512):
|
||||
Max length of the fine history vector.
|
||||
max_fine_input_length (`int`, *optional*, defaults to 1024):
|
||||
Max length of fine input vector.
|
||||
n_fine_codebooks (`int`, *optional*, defaults to 8):
|
||||
Number of codebooks used.
|
||||
"""
|
||||
super().__init__(temperature=temperature)
|
||||
|
||||
self.max_fine_history_length = max_fine_history_length
|
||||
self.max_fine_input_length = max_fine_input_length
|
||||
self.n_fine_codebooks = n_fine_codebooks
|
||||
|
||||
|
||||
class BarkGenerationConfig(GenerationConfig):
|
||||
model_type = "bark"
|
||||
is_composition = True
|
||||
|
||||
# TODO (joao): nested from_dict
|
||||
|
||||
def __init__(
|
||||
self,
|
||||
semantic_config: Dict = None,
|
||||
coarse_acoustics_config: Dict = None,
|
||||
fine_acoustics_config: Dict = None,
|
||||
sample_rate=24_000,
|
||||
codebook_size=1024,
|
||||
**kwargs,
|
||||
):
|
||||
"""Class that holds a generation configuration for [`BarkModel`].
|
||||
|
||||
The [`BarkModel`] does not have a `generate` method, but uses this class to generate speech with a nested
|
||||
[`BarkGenerationConfig`] which uses [`BarkSemanticGenerationConfig`], [`BarkCoarseGenerationConfig`],
|
||||
[`BarkFineGenerationConfig`].
|
||||
|
||||
This configuration inherits from [`GenerationConfig`] and can be used to control the model generation. Read the
|
||||
documentation from [`GenerationConfig`] for more information.
|
||||
|
||||
Args:
|
||||
semantic_config (`Dict`, *optional*):
|
||||
Semantic generation configuration.
|
||||
coarse_acoustics_config (`Dict`, *optional*):
|
||||
Coarse generation configuration.
|
||||
fine_acoustics_config (`Dict`, *optional*):
|
||||
Fine generation configuration.
|
||||
sample_rate (`int`, *optional*, defaults to 24_000):
|
||||
Sample rate.
|
||||
codebook_size (`int`, *optional*, defaults to 1024):
|
||||
Vector length for each codebook.
|
||||
"""
|
||||
if semantic_config is None:
|
||||
semantic_config = {}
|
||||
logger.info("semantic_config is None. initializing the semantic model with default values.")
|
||||
|
||||
if coarse_acoustics_config is None:
|
||||
coarse_acoustics_config = {}
|
||||
logger.info("coarse_acoustics_config is None. initializing the coarse model with default values.")
|
||||
|
||||
if fine_acoustics_config is None:
|
||||
fine_acoustics_config = {}
|
||||
logger.info("fine_acoustics_config is None. initializing the fine model with default values.")
|
||||
|
||||
self.semantic_config = BarkSemanticGenerationConfig(**semantic_config)
|
||||
self.coarse_acoustics_config = BarkCoarseGenerationConfig(**coarse_acoustics_config)
|
||||
self.fine_acoustics_config = BarkFineGenerationConfig(**fine_acoustics_config)
|
||||
|
||||
self.sample_rate = sample_rate
|
||||
self.codebook_size = codebook_size
|
||||
|
||||
@classmethod
|
||||
def from_sub_model_configs(
|
||||
cls,
|
||||
semantic_config: BarkSemanticGenerationConfig,
|
||||
coarse_acoustics_config: BarkCoarseGenerationConfig,
|
||||
fine_acoustics_config: BarkFineGenerationConfig,
|
||||
**kwargs,
|
||||
):
|
||||
r"""
|
||||
Instantiate a [`BarkGenerationConfig`] (or a derived class) from bark sub-models generation configuration.
|
||||
|
||||
Returns:
|
||||
[`BarkGenerationConfig`]: An instance of a configuration object
|
||||
"""
|
||||
return cls(
|
||||
semantic_config=semantic_config.to_dict(),
|
||||
coarse_acoustics_config=coarse_acoustics_config.to_dict(),
|
||||
fine_acoustics_config=fine_acoustics_config.to_dict(),
|
||||
**kwargs,
|
||||
)
|
||||
|
||||
def to_dict(self):
|
||||
"""
|
||||
Serializes this instance to a Python dictionary. Override the default [`~PretrainedConfig.to_dict`].
|
||||
|
||||
Returns:
|
||||
`Dict[str, any]`: Dictionary of all the attributes that make up this configuration instance.
|
||||
"""
|
||||
output = copy.deepcopy(self.__dict__)
|
||||
|
||||
output["semantic_config"] = self.semantic_config.to_dict()
|
||||
output["coarse_acoustics_config"] = self.coarse_acoustics_config.to_dict()
|
||||
output["fine_acoustics_config"] = self.fine_acoustics_config.to_dict()
|
||||
|
||||
output["model_type"] = self.__class__.model_type
|
||||
return output
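`BarkGenerationConfig` composes three sub-configurations: `from_sub_model_configs` flattens each one with `to_dict()`, `__init__` rebuilds them from those dictionaries, and the overridden `to_dict` converts the nested objects back to dictionaries. A minimal sketch of that pattern, using hypothetical stand-in classes (`SubConfig`, `ComposedConfig`) rather than the real `GenerationConfig` hierarchy:

```python
import copy
import json


class SubConfig:
    """Hypothetical stand-in for a sub-model generation config."""

    def __init__(self, temperature=0.7, **kwargs):
        self.temperature = temperature

    def to_dict(self):
        return copy.deepcopy(self.__dict__)


class ComposedConfig:
    """Hypothetical stand-in for a composed config holding two sub-configs."""

    def __init__(self, semantic_config=None, fine_config=None, sample_rate=24_000):
        # Sub-configs arrive as plain dicts and are rebuilt into objects.
        self.semantic_config = SubConfig(**(semantic_config or {}))
        self.fine_config = SubConfig(**(fine_config or {}))
        self.sample_rate = sample_rate

    @classmethod
    def from_sub_configs(cls, semantic_config, fine_config, **kwargs):
        return cls(semantic_config=semantic_config.to_dict(), fine_config=fine_config.to_dict(), **kwargs)

    def to_dict(self):
        output = copy.deepcopy(self.__dict__)
        # Replace the nested objects with their dict form so the result is JSON-serializable.
        output["semantic_config"] = self.semantic_config.to_dict()
        output["fine_config"] = self.fine_config.to_dict()
        return output


composed = ComposedConfig.from_sub_configs(SubConfig(temperature=0.7), SubConfig(temperature=0.5))
as_dict = composed.to_dict()
serialized = json.dumps(as_dict)  # works because no config object remains in the dict
restored = ComposedConfig(**as_dict)
```

Overriding `to_dict` matters here because a plain deep copy of `__dict__` would leave config objects inside the result, which `json.dumps` cannot serialize.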
|
||||
1503
src/transformers/models/bark/modeling_bark.py
Normal file
File diff suppressed because it is too large
286
src/transformers/models/bark/processing_bark.py
Normal file
@@ -0,0 +1,286 @@
|
||||
# coding=utf-8
|
||||
# Copyright 2023 The Suno AI Authors and The HuggingFace Inc. team. All rights reserved.
|
||||
#
|
||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||
# you may not use this file except in compliance with the License.
|
||||
# You may obtain a copy of the License at
|
||||
#
|
||||
# http://www.apache.org/licenses/LICENSE-2.0
|
||||
#
|
||||
# Unless required by applicable law or agreed to in writing, software
|
||||
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
"""
|
||||
Processor class for Bark
|
||||
"""
|
||||
import json
|
||||
import os
|
||||
from typing import Optional
|
||||
|
||||
import numpy as np
|
||||
|
||||
from ...feature_extraction_utils import BatchFeature
|
||||
from ...processing_utils import ProcessorMixin
|
||||
from ...utils import logging
|
||||
from ...utils.hub import get_file_from_repo
|
||||
from ..auto import AutoTokenizer
|
||||
|
||||
|
||||
logger = logging.get_logger(__name__)
|
||||
|
||||
|
||||
class BarkProcessor(ProcessorMixin):
|
||||
r"""
|
||||
Constructs a Bark processor which wraps a text tokenizer and optional Bark voice presets into a single processor.
|
||||
|
||||
Args:
|
||||
tokenizer ([`PreTrainedTokenizer`]):
|
||||
An instance of [`PreTrainedTokenizer`].
|
||||
speaker_embeddings (`Dict[Dict[str]]`, *optional*, defaults to `None`):
|
||||
Optional nested speaker embeddings dictionary. The first level contains voice preset names (e.g.
|
||||
`"en_speaker_4"`). The second level contains `"semantic_prompt"`, `"coarse_prompt"` and `"fine_prompt"`
|
||||
embeddings. The values correspond to the path of the corresponding `np.ndarray`. See
|
||||
[here](https://suno-ai.notion.site/8b8e8749ed514b0cbf3f699013548683?v=bc67cff786b04b50b3ceb756fd05f68c) for
|
||||
a list of `voice_preset_names`.
|
||||
|
||||
"""
|
||||
tokenizer_class = "AutoTokenizer"
|
||||
attributes = ["tokenizer"]
|
||||
|
||||
preset_shape = {
|
||||
"semantic_prompt": 1,
|
||||
"coarse_prompt": 2,
|
||||
"fine_prompt": 2,
|
||||
}
|
||||
|
||||
def __init__(self, tokenizer, speaker_embeddings=None):
|
||||
super().__init__(tokenizer)
|
||||
|
||||
self.speaker_embeddings = speaker_embeddings
|
||||
|
||||
@classmethod
|
||||
def from_pretrained(
|
||||
cls, pretrained_processor_name_or_path, speaker_embeddings_dict_path="speaker_embeddings_path.json", **kwargs
|
||||
):
|
||||
r"""
|
||||
Instantiate a Bark processor associated with a pretrained model.
|
||||
|
||||
Args:
|
||||
pretrained_processor_name_or_path (`str` or `os.PathLike`):
|
||||
This can be either:
|
||||
|
||||
- a string, the *model id* of a pretrained [`BarkProcessor`] hosted inside a model repo on
|
||||
huggingface.co. Valid model ids can be located at the root-level, like `bert-base-uncased`, or
|
||||
namespaced under a user or organization name, like `dbmdz/bert-base-german-cased`.
|
||||
- a path to a *directory* containing a processor saved using the [`~BarkProcessor.save_pretrained`]
|
||||
method, e.g., `./my_model_directory/`.
|
||||
speaker_embeddings_dict_path (`str`, *optional*, defaults to `"speaker_embeddings_path.json"`):
|
||||
The name of the `.json` file containing the speaker_embeddings dictionary located in
`pretrained_processor_name_or_path`. If `None`, no speaker_embeddings is loaded.
|
||||
**kwargs
|
||||
Additional keyword arguments passed along to
|
||||
[`~tokenization_utils_base.PreTrainedTokenizer.from_pretrained`].
|
||||
"""
|
||||
|
||||
if speaker_embeddings_dict_path is not None:
|
||||
speaker_embeddings_path = get_file_from_repo(
|
||||
pretrained_processor_name_or_path,
|
||||
speaker_embeddings_dict_path,
|
||||
subfolder=kwargs.pop("subfolder", None),
|
||||
cache_dir=kwargs.pop("cache_dir", None),
|
||||
force_download=kwargs.pop("force_download", False),
|
||||
proxies=kwargs.pop("proxies", None),
|
||||
resume_download=kwargs.pop("resume_download", False),
|
||||
local_files_only=kwargs.pop("local_files_only", False),
|
||||
use_auth_token=kwargs.pop("use_auth_token", None),
|
||||
revision=kwargs.pop("revision", None),
|
||||
)
|
||||
if speaker_embeddings_path is None:
|
||||
logger.warning(
f"`{os.path.join(pretrained_processor_name_or_path, speaker_embeddings_dict_path)}` does not exist, "
"no preloaded speaker embeddings will be used. Make sure to provide a correct path to the JSON "
"dictionary if wanted, otherwise set `speaker_embeddings_dict_path=None`."
)
|
||||
speaker_embeddings = None
|
||||
else:
|
||||
with open(speaker_embeddings_path) as speaker_embeddings_json:
|
||||
speaker_embeddings = json.load(speaker_embeddings_json)
|
||||
else:
|
||||
speaker_embeddings = None
|
||||
|
||||
tokenizer = AutoTokenizer.from_pretrained(pretrained_processor_name_or_path, **kwargs)
|
||||
|
||||
return cls(tokenizer=tokenizer, speaker_embeddings=speaker_embeddings)
|
||||
|
||||
def save_pretrained(
|
||||
self,
|
||||
save_directory,
|
||||
speaker_embeddings_dict_path="speaker_embeddings_path.json",
|
||||
speaker_embeddings_directory="speaker_embeddings",
|
||||
push_to_hub: bool = False,
|
||||
**kwargs,
|
||||
):
|
||||
"""
|
||||
Saves the attributes of this processor (tokenizer...) in the specified directory so that it can be reloaded
|
||||
using the [`~BarkProcessor.from_pretrained`] method.
|
||||
|
||||
Args:
|
||||
save_directory (`str` or `os.PathLike`):
|
||||
Directory where the tokenizer files and the speaker embeddings will be saved (directory will be created
|
||||
if it does not exist).
|
||||
speaker_embeddings_dict_path (`str`, *optional*, defaults to `"speaker_embeddings_path.json"`):
|
||||
The name of the `.json` file that will contain the speaker_embeddings nested path dictionary, if it
exists, and that will be located in `save_directory`.
|
||||
speaker_embeddings_directory (`str`, *optional*, defaults to `"speaker_embeddings"`):
|
||||
The name of the folder in which the speaker_embeddings arrays will be saved.
|
||||
push_to_hub (`bool`, *optional*, defaults to `False`):
|
||||
Whether or not to push your model to the Hugging Face model hub after saving it. You can specify the
|
||||
repository you want to push to with `repo_id` (will default to the name of `save_directory` in your
|
||||
namespace).
|
||||
kwargs:
|
||||
Additional keyword arguments passed along to the [`~utils.PushToHubMixin.push_to_hub`] method.
|
||||
"""
|
||||
if self.speaker_embeddings is not None:
|
||||
os.makedirs(os.path.join(save_directory, speaker_embeddings_directory, "v2"), exist_ok=True)
|
||||
|
||||
embeddings_dict = {}
|
||||
|
||||
embeddings_dict["repo_or_path"] = save_directory
|
||||
|
||||
for prompt_key in self.speaker_embeddings:
|
||||
if prompt_key != "repo_or_path":
|
||||
voice_preset = self._load_voice_preset(prompt_key)
|
||||
|
||||
tmp_dict = {}
|
||||
for key in self.speaker_embeddings[prompt_key]:
|
||||
np.save(
|
||||
os.path.join(
|
||||
embeddings_dict["repo_or_path"], speaker_embeddings_directory, f"{prompt_key}_{key}"
|
||||
),
|
||||
voice_preset[key],
|
||||
allow_pickle=False,
|
||||
)
|
||||
tmp_dict[key] = os.path.join(speaker_embeddings_directory, f"{prompt_key}_{key}.npy")
|
||||
|
||||
embeddings_dict[prompt_key] = tmp_dict
|
||||
|
||||
with open(os.path.join(save_directory, speaker_embeddings_dict_path), "w") as fp:
|
||||
json.dump(embeddings_dict, fp)
|
||||
|
||||
super().save_pretrained(save_directory, push_to_hub, **kwargs)
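`save_pretrained` writes each preset array to a `.npy` file and records the relative paths in a JSON index next to the tokenizer files. A sketch of how that index is laid out, assuming a hypothetical helper `build_embeddings_index` that only computes the paths and writes nothing to disk; the preset name and directory below are illustrative:

```python
import json
import os


def build_embeddings_index(save_directory, speaker_embeddings, speaker_embeddings_directory="speaker_embeddings"):
    """Return the nested path dictionary for the given presets (no arrays are written here)."""
    index = {"repo_or_path": save_directory}
    for preset_name, prompts in speaker_embeddings.items():
        if preset_name == "repo_or_path":
            continue
        # One relative `.npy` path per prompt type, named `<preset>_<prompt type>.npy`.
        index[preset_name] = {
            key: os.path.join(speaker_embeddings_directory, f"{preset_name}_{key}.npy") for key in prompts
        }
    return index


# Hypothetical input: a single preset whose current paths point somewhere else.
loaded = {
    "repo_or_path": "some/hub-repo",
    "v2/en_speaker_1": {
        "semantic_prompt": "old/semantic.npy",
        "coarse_prompt": "old/coarse.npy",
        "fine_prompt": "old/fine.npy",
    },
}
index = build_embeddings_index("./my_bark_processor", loaded)
print(json.dumps(index, indent=2))
```

Storing paths relative to `repo_or_path` keeps the index valid whether the processor is later loaded from a local directory or from a repository with the same layout.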
|
||||
|
||||
def _load_voice_preset(self, voice_preset: str = None, **kwargs):
|
||||
voice_preset_paths = self.speaker_embeddings[voice_preset]
|
||||
|
||||
voice_preset_dict = {}
|
||||
for key in ["semantic_prompt", "coarse_prompt", "fine_prompt"]:
|
||||
if key not in voice_preset_paths:
|
||||
raise ValueError(
|
||||
f"Voice preset unrecognized, missing {key} as a key in self.speaker_embeddings[{voice_preset}]."
|
||||
)
|
||||
|
||||
path = get_file_from_repo(
|
||||
self.speaker_embeddings.get("repo_or_path", "/"),
|
||||
voice_preset_paths[key],
|
||||
subfolder=kwargs.pop("subfolder", None),
|
||||
cache_dir=kwargs.pop("cache_dir", None),
|
||||
force_download=kwargs.pop("force_download", False),
|
||||
proxies=kwargs.pop("proxies", None),
|
||||
resume_download=kwargs.pop("resume_download", False),
|
||||
local_files_only=kwargs.pop("local_files_only", False),
|
||||
use_auth_token=kwargs.pop("use_auth_token", None),
|
||||
revision=kwargs.pop("revision", None),
|
||||
)
|
||||
if path is None:
|
||||
raise ValueError(
f"`{os.path.join(self.speaker_embeddings.get('repo_or_path', '/'), voice_preset_paths[key])}` does not "
f"exist. Make sure to provide correct paths to the {voice_preset} embeddings."
)
|
||||
|
||||
voice_preset_dict[key] = np.load(path)
|
||||
|
||||
return voice_preset_dict
|
||||
|
||||
def _validate_voice_preset_dict(self, voice_preset: Optional[dict] = None):
|
||||
for key in ["semantic_prompt", "coarse_prompt", "fine_prompt"]:
|
||||
if key not in voice_preset:
|
||||
raise ValueError(f"Voice preset unrecognized, missing {key} as a key.")
|
||||
|
||||
if not isinstance(voice_preset[key], np.ndarray):
|
||||
raise ValueError(f"{key} voice preset must be a {str(self.preset_shape[key])}D ndarray.")
|
||||
|
||||
if len(voice_preset[key].shape) != self.preset_shape[key]:
|
||||
raise ValueError(f"{key} voice preset must be a {str(self.preset_shape[key])}D ndarray.")
|
||||
|
||||
def __call__(
|
||||
self,
|
||||
text=None,
|
||||
voice_preset=None,
|
||||
return_tensors="pt",
|
||||
max_length=256,
|
||||
add_special_tokens=False,
|
||||
return_attention_mask=True,
|
||||
return_token_type_ids=False,
|
||||
**kwargs,
|
||||
):
|
||||
"""
|
||||
Main method to prepare for the model one or several sequence(s). This method forwards the `text` and `kwargs`
|
||||
arguments to the AutoTokenizer's [`~AutoTokenizer.__call__`] to encode the text. The method also proposes a
|
||||
voice preset which is a dictionary of arrays that conditions `Bark`'s output. `kwargs` arguments are forwarded
|
||||
to the tokenizer and to `cached_file` method if `voice_preset` is a valid filename.
|
||||
|
||||
Args:
|
||||
text (`str`, `List[str]`, `List[List[str]]`):
|
||||
The sequence or batch of sequences to be encoded. Each sequence can be a string or a list of strings
|
||||
(pretokenized string). If the sequences are provided as list of strings (pretokenized), you must set
|
||||
`is_split_into_words=True` (to lift the ambiguity with a batch of sequences).
|
||||
voice_preset (`str`, `Dict[np.ndarray]`):
|
||||
The voice preset, i.e. the speaker embeddings. It can either be a valid voice_preset name, e.g.
`"en_speaker_1"`, a dictionary of `np.ndarray` embeddings for each submodel of `Bark`, or a valid
file name of a local `.npz` single voice preset.
|
||||
return_tensors (`str` or [`~utils.TensorType`], *optional*):
|
||||
If set, will return tensors of a particular framework. Acceptable values are:
|
||||
|
||||
- `'pt'`: Return PyTorch `torch.Tensor` objects.
|
||||
- `'np'`: Return NumPy `np.ndarray` objects.
|
||||
|
||||
Returns:
|
||||
[`BatchEncoding`]: The output of the `tokenizer`. If `voice_preset` is provided, it also contains a
`history_prompt` entry holding the voice preset as a [`BatchFeature`] with the requested tensor type.
|
||||
"""
|
||||
if voice_preset is not None and not isinstance(voice_preset, dict):
|
||||
if (
|
||||
isinstance(voice_preset, str)
|
||||
and self.speaker_embeddings is not None
|
||||
and voice_preset in self.speaker_embeddings
|
||||
):
|
||||
voice_preset = self._load_voice_preset(voice_preset)
|
||||
|
||||
else:
|
||||
if isinstance(voice_preset, str) and not voice_preset.endswith(".npz"):
|
||||
voice_preset = voice_preset + ".npz"
|
||||
|
||||
voice_preset = np.load(voice_preset)
|
||||
|
||||
if voice_preset is not None:
|
||||
self._validate_voice_preset_dict(voice_preset)
|
||||
voice_preset = BatchFeature(data=voice_preset, tensor_type=return_tensors)
|
||||
|
||||
encoded_text = self.tokenizer(
|
||||
text,
|
||||
return_tensors=return_tensors,
|
||||
padding="max_length",
|
||||
max_length=max_length,
|
||||
return_attention_mask=return_attention_mask,
|
||||
return_token_type_ids=return_token_type_ids,
|
||||
add_special_tokens=add_special_tokens,
|
||||
**kwargs,
|
||||
)
|
||||
|
||||
if voice_preset is not None:
|
||||
encoded_text["history_prompt"] = voice_preset
|
||||
|
||||
return encoded_text
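`__call__` accepts the voice preset in three forms: a dictionary of arrays, the name of a preset known to the processor, or the name of a local `.npz` file. The branching can be sketched without loading any arrays; `classify_voice_preset` is a hypothetical helper that only reports which branch would be taken:

```python
def classify_voice_preset(voice_preset, speaker_embeddings=None):
    """Return (kind, value) describing how a voice preset argument would be resolved."""
    if voice_preset is None:
        return ("none", None)
    if isinstance(voice_preset, dict):
        # Already a dictionary of arrays: used as-is (after validation).
        return ("dict", voice_preset)
    if isinstance(voice_preset, str) and speaker_embeddings is not None and voice_preset in speaker_embeddings:
        # Known preset name: arrays are loaded from the paths in `speaker_embeddings`.
        return ("named_preset", voice_preset)
    if isinstance(voice_preset, str) and not voice_preset.endswith(".npz"):
        # Anything else is treated as a local file; the `.npz` suffix is appended if missing.
        voice_preset = voice_preset + ".npz"
    return ("npz_file", voice_preset)


# Hypothetical preset table with a single known name.
known = {"v2/en_speaker_1": {}}
print(classify_voice_preset("v2/en_speaker_1", known))
print(classify_voice_preset("my_voice", known))
```

One consequence of this ordering is that a misspelled preset name is not reported as unknown; it falls through to the file branch and fails later when the `.npz` file cannot be opened.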
|
||||
@@ -816,6 +816,51 @@ class AutoformerPreTrainedModel(metaclass=DummyObject):
|
||||
requires_backends(self, ["torch"])
|
||||
|
||||
|
||||
BARK_PRETRAINED_MODEL_ARCHIVE_LIST = None


class BarkCausalModel(metaclass=DummyObject):
    _backends = ["torch"]

    def __init__(self, *args, **kwargs):
        requires_backends(self, ["torch"])


class BarkCoarseModel(metaclass=DummyObject):
    _backends = ["torch"]

    def __init__(self, *args, **kwargs):
        requires_backends(self, ["torch"])


class BarkFineModel(metaclass=DummyObject):
    _backends = ["torch"]

    def __init__(self, *args, **kwargs):
        requires_backends(self, ["torch"])


class BarkModel(metaclass=DummyObject):
    _backends = ["torch"]

    def __init__(self, *args, **kwargs):
        requires_backends(self, ["torch"])


class BarkPreTrainedModel(metaclass=DummyObject):
    _backends = ["torch"]

    def __init__(self, *args, **kwargs):
        requires_backends(self, ["torch"])


class BarkSemanticModel(metaclass=DummyObject):
    _backends = ["torch"]

    def __init__(self, *args, **kwargs):
        requires_backends(self, ["torch"])
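The dummy classes above let the package import cleanly when `torch` is absent and fail only when a Bark class is actually used. A simplified sketch of how such a placeholder can work; `requires_backends_sketch` and `DummyObjectSketch` are hypothetical stand-ins written for this example and are not the library's own `requires_backends` or `DummyObject`:

```python
import importlib.util


def requires_backends_sketch(obj, backends):
    """Raise ImportError if any backend module in `backends` cannot be found."""
    name = obj.__name__ if hasattr(obj, "__name__") else obj.__class__.__name__
    missing = [backend for backend in backends if importlib.util.find_spec(backend) is None]
    if missing:
        raise ImportError(f"{name} requires the following missing backends: {', '.join(missing)}")


class DummyObjectSketch(type):
    """Metaclass: public class-attribute lookups that fail trigger the backend check."""

    def __getattr__(cls, key):
        if key.startswith("_"):
            raise AttributeError(key)
        requires_backends_sketch(cls, cls._backends)
        raise AttributeError(key)


class PlaceholderModel(metaclass=DummyObjectSketch):
    # Hypothetical backend name chosen so that it is not installed.
    _backends = ["a_backend_that_is_not_installed"]

    def __init__(self, *args, **kwargs):
        requires_backends_sketch(self, self._backends)


try:
    PlaceholderModel()
except ImportError as error:
    message = str(error)
    print(message)
```

With the metaclass in place, class-level access such as `PlaceholderModel.from_pretrained` raises the same `ImportError` as instantiation, so the user sees a message about the missing backend instead of an `AttributeError`.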
|
||||
|
||||
|
||||
BART_PRETRAINED_MODEL_ARCHIVE_LIST = None
|
||||
|
||||
|
||||
|
||||
0
tests/models/bark/__init__.py
Normal file
991
tests/models/bark/test_modeling_bark.py
Normal file
@@ -0,0 +1,991 @@
|
||||
# coding=utf-8
|
||||
# Copyright 2023 The HuggingFace Inc. team. All rights reserved.
|
||||
#
|
||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||
# you may not use this file except in compliance with the License.
|
||||
# You may obtain a copy of the License at
|
||||
#
|
||||
# http://www.apache.org/licenses/LICENSE-2.0
|
||||
#
|
||||
# Unless required by applicable law or agreed to in writing, software
|
||||
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
""" Testing suite for the PyTorch Bark model. """
|
||||
|
||||
|
||||
import copy
|
||||
import inspect
|
||||
import tempfile
|
||||
import unittest
|
||||
|
||||
from transformers import (
|
||||
BarkCoarseConfig,
|
||||
BarkFineConfig,
|
||||
BarkSemanticConfig,
|
||||
is_torch_available,
|
||||
)
|
||||
from transformers.models.bark.generation_configuration_bark import (
|
||||
BarkCoarseGenerationConfig,
|
||||
BarkFineGenerationConfig,
|
||||
BarkSemanticGenerationConfig,
|
||||
)
|
||||
from transformers.testing_utils import require_torch, slow, torch_device
|
||||
from transformers.utils import cached_property
|
||||
|
||||
from ...generation.test_utils import GenerationTesterMixin
|
||||
from ...test_configuration_common import ConfigTester
|
||||
from ...test_modeling_common import ModelTesterMixin, ids_tensor, random_attention_mask
|
||||
|
||||
|
||||
if is_torch_available():
|
||||
import torch
|
||||
|
||||
from transformers import (
|
||||
BarkCausalModel,
|
||||
BarkCoarseModel,
|
||||
BarkFineModel,
|
||||
BarkModel,
|
||||
BarkProcessor,
|
||||
BarkSemanticModel,
|
||||
)
|
||||
|
||||
|
||||
class BarkSemanticModelTester:
|
||||
    def __init__(
        self,
        parent,
        batch_size=2,
        seq_length=4,
        is_training=False,  # for now training is not supported
        use_input_mask=True,
        use_labels=True,
        vocab_size=33,
        output_vocab_size=33,
        hidden_size=16,
        num_hidden_layers=2,
        num_attention_heads=2,
        intermediate_size=15,
        dropout=0.1,
        window_size=256,
        initializer_range=0.02,
        n_codes_total=8,  # for BarkFineModel
        n_codes_given=1,  # for BarkFineModel
        config_class=None,
        model_class=None,
    ):
        self.parent = parent
        self.batch_size = batch_size
        self.seq_length = seq_length
        self.is_training = is_training
        self.use_input_mask = use_input_mask
        self.use_labels = use_labels
        self.vocab_size = vocab_size
        self.output_vocab_size = output_vocab_size
        self.hidden_size = hidden_size
        self.num_hidden_layers = num_hidden_layers
        self.num_attention_heads = num_attention_heads
        self.intermediate_size = intermediate_size
        self.dropout = dropout
        self.window_size = window_size
        self.initializer_range = initializer_range
        self.bos_token_id = output_vocab_size - 1
        self.eos_token_id = output_vocab_size - 1
        self.pad_token_id = output_vocab_size - 1

        self.n_codes_total = n_codes_total
        self.n_codes_given = n_codes_given

        self.is_encoder_decoder = False
        self.config_class = config_class
        self.model_class = model_class

    def prepare_config_and_inputs(self):
        input_ids = ids_tensor([self.batch_size, self.seq_length], self.vocab_size)

        input_mask = None
        if self.use_input_mask:
            input_mask = random_attention_mask([self.batch_size, self.seq_length])

        config = self.get_config()

        head_mask = ids_tensor([self.num_hidden_layers, self.num_attention_heads], 2)

        inputs_dict = {
            "input_ids": input_ids,
            "head_mask": head_mask,
            "attention_mask": input_mask,
        }

        return config, inputs_dict

    def get_config(self):
        return self.config_class(
            vocab_size=self.vocab_size,
            output_vocab_size=self.output_vocab_size,
            hidden_size=self.hidden_size,
            num_layers=self.num_hidden_layers,
            num_heads=self.num_attention_heads,
            use_cache=True,
            bos_token_id=self.bos_token_id,
            eos_token_id=self.eos_token_id,
            pad_token_id=self.pad_token_id,
            window_size=self.window_size,
        )

    def get_pipeline_config(self):
        config = self.get_config()
        config.vocab_size = 300
        return config

    def prepare_config_and_inputs_for_common(self):
        config, inputs_dict = self.prepare_config_and_inputs()
        return config, inputs_dict

    def create_and_check_decoder_model_past_large_inputs(self, config, inputs_dict):
        model = self.model_class(config=config).to(torch_device).eval()

        input_ids = inputs_dict["input_ids"]
        attention_mask = inputs_dict["attention_mask"]

        # first forward pass
        outputs = model(input_ids, attention_mask=attention_mask, use_cache=True)

        output, past_key_values = outputs.to_tuple()

        # create hypothetical multiple next tokens and extend to next_input_ids
        next_tokens = ids_tensor((self.batch_size, 3), config.vocab_size)
        next_attn_mask = ids_tensor((self.batch_size, 3), 2)

        # append to next input_ids and next attention_mask
        next_input_ids = torch.cat([input_ids, next_tokens], dim=-1)
        next_attention_mask = torch.cat([attention_mask, next_attn_mask], dim=-1)

        output_from_no_past = model(next_input_ids, attention_mask=next_attention_mask)["logits"]
        output_from_past = model(next_tokens, attention_mask=next_attention_mask, past_key_values=past_key_values)[
            "logits"
        ]

        # select random slice
        random_slice_idx = ids_tensor((1,), output_from_past.shape[-1]).item()
        output_from_no_past_slice = output_from_no_past[:, -3:, random_slice_idx].detach()
        output_from_past_slice = output_from_past[:, :, random_slice_idx].detach()

        self.parent.assertTrue(output_from_past_slice.shape[1] == next_tokens.shape[1])

        # test that outputs are equal for slice
        self.parent.assertTrue(torch.allclose(output_from_past_slice, output_from_no_past_slice, atol=1e-3))

        # test no attention_mask works
        outputs = model(input_ids, use_cache=True)
        _, past_key_values = outputs.to_tuple()
        output_from_no_past = model(next_input_ids)["logits"]

        output_from_past = model(next_tokens, past_key_values=past_key_values)["logits"]

        random_slice_idx = ids_tensor((1,), output_from_past.shape[-1]).item()
        output_from_no_past_slice = output_from_no_past[:, -3:, random_slice_idx].detach()
        output_from_past_slice = output_from_past[:, :, random_slice_idx].detach()
        # test that outputs are equal for slice
        self.parent.assertTrue(torch.allclose(output_from_past_slice, output_from_no_past_slice, atol=1e-3))


class BarkCoarseModelTester:
    def __init__(
        self,
        parent,
        batch_size=2,
        seq_length=4,
        is_training=False,  # for now training is not supported
        use_input_mask=True,
        use_labels=True,
        vocab_size=33,
        output_vocab_size=33,
        hidden_size=16,
        num_hidden_layers=2,
        num_attention_heads=2,
        intermediate_size=15,
        dropout=0.1,
        window_size=256,
        initializer_range=0.02,
        n_codes_total=8,  # for BarkFineModel
        n_codes_given=1,  # for BarkFineModel
        config_class=None,
        model_class=None,
    ):
        self.parent = parent
        self.batch_size = batch_size
        self.seq_length = seq_length
        self.is_training = is_training
        self.use_input_mask = use_input_mask
        self.use_labels = use_labels
        self.vocab_size = vocab_size
        self.output_vocab_size = output_vocab_size
        self.hidden_size = hidden_size
        self.num_hidden_layers = num_hidden_layers
        self.num_attention_heads = num_attention_heads
        self.intermediate_size = intermediate_size
        self.dropout = dropout
        self.window_size = window_size
        self.initializer_range = initializer_range
        self.bos_token_id = output_vocab_size - 1
        self.eos_token_id = output_vocab_size - 1
        self.pad_token_id = output_vocab_size - 1

        self.n_codes_total = n_codes_total
        self.n_codes_given = n_codes_given

        self.is_encoder_decoder = False
        self.config_class = config_class
        self.model_class = model_class

    def prepare_config_and_inputs(self):
        input_ids = ids_tensor([self.batch_size, self.seq_length], self.vocab_size)

        input_mask = None
        if self.use_input_mask:
            input_mask = random_attention_mask([self.batch_size, self.seq_length])

        config = self.get_config()

        head_mask = ids_tensor([self.num_hidden_layers, self.num_attention_heads], 2)

        inputs_dict = {
            "input_ids": input_ids,
            "head_mask": head_mask,
            "attention_mask": input_mask,
        }

        return config, inputs_dict

    def get_config(self):
        return self.config_class(
            vocab_size=self.vocab_size,
            output_vocab_size=self.output_vocab_size,
            hidden_size=self.hidden_size,
            num_layers=self.num_hidden_layers,
            num_heads=self.num_attention_heads,
            use_cache=True,
            bos_token_id=self.bos_token_id,
            eos_token_id=self.eos_token_id,
            pad_token_id=self.pad_token_id,
            window_size=self.window_size,
        )

    def get_pipeline_config(self):
        config = self.get_config()
        config.vocab_size = 300
        return config

    def prepare_config_and_inputs_for_common(self):
        config, inputs_dict = self.prepare_config_and_inputs()
        return config, inputs_dict

    def create_and_check_decoder_model_past_large_inputs(self, config, inputs_dict):
        model = self.model_class(config=config).to(torch_device).eval()

        input_ids = inputs_dict["input_ids"]
        attention_mask = inputs_dict["attention_mask"]

        # first forward pass
        outputs = model(input_ids, attention_mask=attention_mask, use_cache=True)

        output, past_key_values = outputs.to_tuple()

        # create hypothetical multiple next tokens and extend to next_input_ids
        next_tokens = ids_tensor((self.batch_size, 3), config.vocab_size)
        next_attn_mask = ids_tensor((self.batch_size, 3), 2)

        # append to next input_ids and next attention_mask
        next_input_ids = torch.cat([input_ids, next_tokens], dim=-1)
        next_attention_mask = torch.cat([attention_mask, next_attn_mask], dim=-1)

        output_from_no_past = model(next_input_ids, attention_mask=next_attention_mask)["logits"]
        output_from_past = model(next_tokens, attention_mask=next_attention_mask, past_key_values=past_key_values)[
            "logits"
        ]

        # select random slice
        random_slice_idx = ids_tensor((1,), output_from_past.shape[-1]).item()
        output_from_no_past_slice = output_from_no_past[:, -3:, random_slice_idx].detach()
        output_from_past_slice = output_from_past[:, :, random_slice_idx].detach()

        self.parent.assertTrue(output_from_past_slice.shape[1] == next_tokens.shape[1])

        # test that outputs are equal for slice
        self.parent.assertTrue(torch.allclose(output_from_past_slice, output_from_no_past_slice, atol=1e-3))

        # test no attention_mask works
        outputs = model(input_ids, use_cache=True)
        _, past_key_values = outputs.to_tuple()
        output_from_no_past = model(next_input_ids)["logits"]

        output_from_past = model(next_tokens, past_key_values=past_key_values)["logits"]

        random_slice_idx = ids_tensor((1,), output_from_past.shape[-1]).item()
        output_from_no_past_slice = output_from_no_past[:, -3:, random_slice_idx].detach()
        output_from_past_slice = output_from_past[:, :, random_slice_idx].detach()
        # test that outputs are equal for slice
        self.parent.assertTrue(torch.allclose(output_from_past_slice, output_from_no_past_slice, atol=1e-3))


class BarkFineModelTester:
    def __init__(
        self,
        parent,
        batch_size=2,
        seq_length=4,
        is_training=False,  # for now training is not supported
        use_input_mask=True,
        use_labels=True,
        vocab_size=33,
        output_vocab_size=33,
        hidden_size=16,
        num_hidden_layers=2,
        num_attention_heads=2,
        intermediate_size=15,
        dropout=0.1,
        window_size=256,
        initializer_range=0.02,
        n_codes_total=8,  # for BarkFineModel
        n_codes_given=1,  # for BarkFineModel
        config_class=None,
        model_class=None,
    ):
        self.parent = parent
        self.batch_size = batch_size
        self.seq_length = seq_length
        self.is_training = is_training
        self.use_input_mask = use_input_mask
        self.use_labels = use_labels
        self.vocab_size = vocab_size
        self.output_vocab_size = output_vocab_size
        self.hidden_size = hidden_size
        self.num_hidden_layers = num_hidden_layers
        self.num_attention_heads = num_attention_heads
        self.intermediate_size = intermediate_size
        self.dropout = dropout
        self.window_size = window_size
        self.initializer_range = initializer_range
        self.bos_token_id = output_vocab_size - 1
        self.eos_token_id = output_vocab_size - 1
        self.pad_token_id = output_vocab_size - 1

        self.n_codes_total = n_codes_total
        self.n_codes_given = n_codes_given

        self.is_encoder_decoder = False
        self.config_class = config_class
        self.model_class = model_class

    def prepare_config_and_inputs(self):
        input_ids = ids_tensor([self.batch_size, self.seq_length, self.n_codes_total], self.vocab_size)

        input_mask = None
        if self.use_input_mask:
            input_mask = random_attention_mask([self.batch_size, self.seq_length])

        config = self.get_config()

        head_mask = ids_tensor([self.num_hidden_layers, self.num_attention_heads], 2)

        # random int between self.n_codes_given and self.n_codes_total - 1 (both inclusive)
        codebook_idx = ids_tensor((1,), self.n_codes_total - self.n_codes_given).item() + self.n_codes_given

        inputs_dict = {
            "codebook_idx": codebook_idx,
            "input_ids": input_ids,
            "head_mask": head_mask,
            "attention_mask": input_mask,
        }

        return config, inputs_dict

    def get_config(self):
        return self.config_class(
            vocab_size=self.vocab_size,
            output_vocab_size=self.output_vocab_size,
            hidden_size=self.hidden_size,
            num_layers=self.num_hidden_layers,
            num_heads=self.num_attention_heads,
            use_cache=True,
            bos_token_id=self.bos_token_id,
            eos_token_id=self.eos_token_id,
            pad_token_id=self.pad_token_id,
            window_size=self.window_size,
        )

    def get_pipeline_config(self):
        config = self.get_config()
        config.vocab_size = 300
        return config

    def prepare_config_and_inputs_for_common(self):
        config, inputs_dict = self.prepare_config_and_inputs()
        return config, inputs_dict

    def create_and_check_decoder_model_past_large_inputs(self, config, inputs_dict):
        model = self.model_class(config=config).to(torch_device).eval()

        input_ids = inputs_dict["input_ids"]
        attention_mask = inputs_dict["attention_mask"]

        # first forward pass
        outputs = model(input_ids, attention_mask=attention_mask, use_cache=True)

        output, past_key_values = outputs.to_tuple()

        # create hypothetical multiple next tokens and extend to next_input_ids
        next_tokens = ids_tensor((self.batch_size, 3), config.vocab_size)
        next_attn_mask = ids_tensor((self.batch_size, 3), 2)

        # append to next input_ids and next attention_mask
        next_input_ids = torch.cat([input_ids, next_tokens], dim=-1)
        next_attention_mask = torch.cat([attention_mask, next_attn_mask], dim=-1)

        output_from_no_past = model(next_input_ids, attention_mask=next_attention_mask)["logits"]
        output_from_past = model(next_tokens, attention_mask=next_attention_mask, past_key_values=past_key_values)[
            "logits"
        ]

        # select random slice
        random_slice_idx = ids_tensor((1,), output_from_past.shape[-1]).item()
        output_from_no_past_slice = output_from_no_past[:, -3:, random_slice_idx].detach()
        output_from_past_slice = output_from_past[:, :, random_slice_idx].detach()

        self.parent.assertTrue(output_from_past_slice.shape[1] == next_tokens.shape[1])

        # test that outputs are equal for slice
        self.parent.assertTrue(torch.allclose(output_from_past_slice, output_from_no_past_slice, atol=1e-3))

        # test no attention_mask works
        outputs = model(input_ids, use_cache=True)
        _, past_key_values = outputs.to_tuple()
        output_from_no_past = model(next_input_ids)["logits"]

        output_from_past = model(next_tokens, past_key_values=past_key_values)["logits"]

        random_slice_idx = ids_tensor((1,), output_from_past.shape[-1]).item()
        output_from_no_past_slice = output_from_no_past[:, -3:, random_slice_idx].detach()
        output_from_past_slice = output_from_past[:, :, random_slice_idx].detach()
        # test that outputs are equal for slice
        self.parent.assertTrue(torch.allclose(output_from_past_slice, output_from_no_past_slice, atol=1e-3))


@require_torch
class BarkSemanticModelTest(ModelTesterMixin, GenerationTesterMixin, unittest.TestCase):
    all_model_classes = (BarkSemanticModel,) if is_torch_available() else ()
    all_generative_model_classes = (BarkCausalModel,) if is_torch_available() else ()

    is_encoder_decoder = False
    fx_compatible = False
    test_missing_keys = False
    test_pruning = False
    test_model_parallel = False
    # no model_parallel for now

    test_resize_embeddings = True

    def setUp(self):
        self.model_tester = BarkSemanticModelTester(
            self, config_class=BarkSemanticConfig, model_class=BarkSemanticModel
        )
        self.config_tester = ConfigTester(self, config_class=BarkSemanticConfig, n_embd=37)

    def test_config(self):
        self.config_tester.run_common_tests()

    def test_save_load_strict(self):
        config, inputs_dict = self.model_tester.prepare_config_and_inputs()
        for model_class in self.all_model_classes:
            model = model_class(config)

            with tempfile.TemporaryDirectory() as tmpdirname:
                model.save_pretrained(tmpdirname)
                model2, info = model_class.from_pretrained(tmpdirname, output_loading_info=True)
            self.assertEqual(info["missing_keys"], [])

    def test_decoder_model_past_with_large_inputs(self):
        config_and_inputs = self.model_tester.prepare_config_and_inputs()
        self.model_tester.create_and_check_decoder_model_past_large_inputs(*config_and_inputs)

    def test_inputs_embeds(self):
        config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()

        for model_class in self.all_model_classes:
            model = model_class(config)
            model.to(torch_device)
            model.eval()

            inputs = copy.deepcopy(self._prepare_for_class(inputs_dict, model_class))

            input_ids = inputs["input_ids"]
            del inputs["input_ids"]

            wte = model.get_input_embeddings()
            inputs["input_embeds"] = wte(input_ids)

            with torch.no_grad():
                model(**inputs)[0]

    def test_generate_fp16(self):
        config, input_dict = self.model_tester.prepare_config_and_inputs()
        input_ids = input_dict["input_ids"]
        attention_mask = input_ids.ne(1).to(torch_device)
        model = self.all_generative_model_classes[0](config).eval().to(torch_device)
        if torch_device == "cuda":
            model.half()
        model.generate(input_ids, attention_mask=attention_mask)
        model.generate(num_beams=4, do_sample=True, early_stopping=False, num_return_sequences=3)


@require_torch
class BarkCoarseModelTest(ModelTesterMixin, GenerationTesterMixin, unittest.TestCase):
    # Same tester as BarkSemanticModelTest, except for model_class and config_class
    all_model_classes = (BarkCoarseModel,) if is_torch_available() else ()
    all_generative_model_classes = (BarkCausalModel,) if is_torch_available() else ()

    is_encoder_decoder = False
    fx_compatible = False
    test_missing_keys = False
    test_pruning = False
    test_model_parallel = False
    # no model_parallel for now

    test_resize_embeddings = True

    def setUp(self):
        self.model_tester = BarkCoarseModelTester(self, config_class=BarkCoarseConfig, model_class=BarkCoarseModel)
        self.config_tester = ConfigTester(self, config_class=BarkCoarseConfig, n_embd=37)

    def test_config(self):
        self.config_tester.run_common_tests()

    def test_save_load_strict(self):
        config, inputs_dict = self.model_tester.prepare_config_and_inputs()
        for model_class in self.all_model_classes:
            model = model_class(config)

            with tempfile.TemporaryDirectory() as tmpdirname:
                model.save_pretrained(tmpdirname)
                model2, info = model_class.from_pretrained(tmpdirname, output_loading_info=True)
            self.assertEqual(info["missing_keys"], [])

    def test_decoder_model_past_with_large_inputs(self):
        config_and_inputs = self.model_tester.prepare_config_and_inputs()
        self.model_tester.create_and_check_decoder_model_past_large_inputs(*config_and_inputs)

    def test_inputs_embeds(self):
        config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()

        for model_class in self.all_model_classes:
            model = model_class(config)
            model.to(torch_device)
            model.eval()

            inputs = copy.deepcopy(self._prepare_for_class(inputs_dict, model_class))

            input_ids = inputs["input_ids"]
            del inputs["input_ids"]

            wte = model.get_input_embeddings()
            inputs["input_embeds"] = wte(input_ids)

            with torch.no_grad():
                model(**inputs)[0]

    def test_generate_fp16(self):
        config, input_dict = self.model_tester.prepare_config_and_inputs()
        input_ids = input_dict["input_ids"]
        attention_mask = input_ids.ne(1).to(torch_device)
        model = self.all_generative_model_classes[0](config).eval().to(torch_device)
        if torch_device == "cuda":
            model.half()
        model.generate(input_ids, attention_mask=attention_mask)
        model.generate(num_beams=4, do_sample=True, early_stopping=False, num_return_sequences=3)


@require_torch
class BarkFineModelTest(ModelTesterMixin, unittest.TestCase):
    all_model_classes = (BarkFineModel,) if is_torch_available() else ()

    is_encoder_decoder = False
    fx_compatible = False
    test_missing_keys = False
    test_pruning = False
    # no model_parallel for now
    test_model_parallel = False

    # torchscript disabled for now because forward takes an int (codebook_idx)
    test_torchscript = False

    test_resize_embeddings = True

    def setUp(self):
        self.model_tester = BarkFineModelTester(self, config_class=BarkFineConfig, model_class=BarkFineModel)
        self.config_tester = ConfigTester(self, config_class=BarkFineConfig, n_embd=37)

    def test_config(self):
        self.config_tester.run_common_tests()

    def test_save_load_strict(self):
        config, inputs_dict = self.model_tester.prepare_config_and_inputs()
        for model_class in self.all_model_classes:
            model = model_class(config)

            with tempfile.TemporaryDirectory() as tmpdirname:
                model.save_pretrained(tmpdirname)
                model2, info = model_class.from_pretrained(tmpdirname, output_loading_info=True)
            self.assertEqual(info["missing_keys"], [])

    def test_inputs_embeds(self):
        config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()

        for model_class in self.all_model_classes:
            model = model_class(config)
            model.to(torch_device)
            model.eval()

            inputs = copy.deepcopy(self._prepare_for_class(inputs_dict, model_class))

            input_ids = inputs["input_ids"]
            del inputs["input_ids"]

            wte = model.get_input_embeddings()[inputs_dict["codebook_idx"]]

            inputs["input_embeds"] = wte(input_ids[:, :, inputs_dict["codebook_idx"]])

            with torch.no_grad():
                model(**inputs)[0]

    def test_generate_fp16(self):
        config, input_dict = self.model_tester.prepare_config_and_inputs()
        input_ids = input_dict["input_ids"]
        # take first codebook channel

        model = self.all_model_classes[0](config).eval().to(torch_device)
        if torch_device == "cuda":
            model.half()

        # toy generation_configs
        semantic_generation_config = BarkSemanticGenerationConfig(semantic_vocab_size=0)
        coarse_generation_config = BarkCoarseGenerationConfig(n_coarse_codebooks=config.n_codes_given)
        fine_generation_config = BarkFineGenerationConfig(
            max_fine_history_length=config.block_size // 2,
            max_fine_input_length=config.block_size,
            n_fine_codebooks=config.n_codes_total,
        )
        codebook_size = config.vocab_size - 1

        model.generate(
            input_ids,
            history_prompt=None,
            temperature=None,
            semantic_generation_config=semantic_generation_config,
            coarse_generation_config=coarse_generation_config,
            fine_generation_config=fine_generation_config,
            codebook_size=codebook_size,
        )

        model.generate(
            input_ids,
            history_prompt=None,
            temperature=0.7,
            semantic_generation_config=semantic_generation_config,
            coarse_generation_config=coarse_generation_config,
            fine_generation_config=fine_generation_config,
            codebook_size=codebook_size,
        )

    def test_forward_signature(self):
        config, _ = self.model_tester.prepare_config_and_inputs_for_common()

        for model_class in self.all_model_classes:
            model = model_class(config)
            signature = inspect.signature(model.forward)
            # signature.parameters is an OrderedDict => so arg_names order is deterministic
            arg_names = [*signature.parameters.keys()]

            expected_arg_names = ["codebook_idx", "input_ids"]
            self.assertListEqual(arg_names[:2], expected_arg_names)

    def test_model_common_attributes(self):
        # one embedding layer per codebook
        config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()

        for model_class in self.all_model_classes:
            model = model_class(config)
            self.assertIsInstance(model.get_input_embeddings()[0], (torch.nn.Embedding))
            model.set_input_embeddings(
                torch.nn.ModuleList([torch.nn.Embedding(10, 10) for _ in range(config.n_codes_total)])
            )
            x = model.get_output_embeddings()
            self.assertTrue(x is None or isinstance(x[0], torch.nn.Linear))

    def test_resize_tokens_embeddings(self):
        # resizing tokens_embeddings of a ModuleList
        original_config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()
        if not self.test_resize_embeddings:
            return

        for model_class in self.all_model_classes:
            config = copy.deepcopy(original_config)
            model = model_class(config)
            model.to(torch_device)

            if self.model_tester.is_training is False:
                model.eval()

            model_vocab_size = config.vocab_size
            # Retrieve the embeddings and clone them
            model_embed_list = model.resize_token_embeddings(model_vocab_size)
            cloned_embeddings_list = [model_embed.weight.clone() for model_embed in model_embed_list]

            # Check that resizing the token embeddings with a larger vocab size increases the model's vocab size
            model_embed_list = model.resize_token_embeddings(model_vocab_size + 10)
            self.assertEqual(model.config.vocab_size, model_vocab_size + 10)

            # Check that it actually resizes the embeddings matrix for each codebook
            for model_embed, cloned_embeddings in zip(model_embed_list, cloned_embeddings_list):
                self.assertEqual(model_embed.weight.shape[0], cloned_embeddings.shape[0] + 10)

            # Check that the model can still do a forward pass successfully (every parameter should be resized)
            model(**self._prepare_for_class(inputs_dict, model_class))

            # Check that resizing the token embeddings with a smaller vocab size decreases the model's vocab size
            model_embed_list = model.resize_token_embeddings(model_vocab_size - 15)
            self.assertEqual(model.config.vocab_size, model_vocab_size - 15)
            for model_embed, cloned_embeddings in zip(model_embed_list, cloned_embeddings_list):
                self.assertEqual(model_embed.weight.shape[0], cloned_embeddings.shape[0] - 15)

            # Check that the model can still do a forward pass successfully (every parameter should be resized)
            # Input ids should be clamped to the maximum size of the vocabulary
            inputs_dict["input_ids"].clamp_(max=model_vocab_size - 15 - 1)

            model(**self._prepare_for_class(inputs_dict, model_class))

            # Check that adding and removing tokens has not modified the first part of the embedding matrix.
            # only check for the first embedding matrix
            models_equal = True
            for p1, p2 in zip(cloned_embeddings_list[0], model_embed_list[0].weight):
                if p1.data.ne(p2.data).sum() > 0:
                    models_equal = False

            self.assertTrue(models_equal)

    def test_resize_embeddings_untied(self):
        # resizing tokens_embeddings of a ModuleList
        original_config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()
        if not self.test_resize_embeddings:
            return

        original_config.tie_word_embeddings = False

        for model_class in self.all_model_classes:
            config = copy.deepcopy(original_config)
            model = model_class(config).to(torch_device)

            # if no output embeddings -> leave test
            if model.get_output_embeddings() is None:
                continue

            # Check that resizing the token embeddings with a larger vocab size increases the model's vocab size
            model_vocab_size = config.vocab_size
            model.resize_token_embeddings(model_vocab_size + 10)
            self.assertEqual(model.config.vocab_size, model_vocab_size + 10)
            output_embeds_list = model.get_output_embeddings()

            for output_embeds in output_embeds_list:
                self.assertEqual(output_embeds.weight.shape[0], model_vocab_size + 10)

                # Check bias if present
                if output_embeds.bias is not None:
                    self.assertEqual(output_embeds.bias.shape[0], model_vocab_size + 10)

            # Check that the model can still do a forward pass successfully (every parameter should be resized)
            model(**self._prepare_for_class(inputs_dict, model_class))

            # Check that resizing the token embeddings with a smaller vocab size decreases the model's vocab size
            model.resize_token_embeddings(model_vocab_size - 15)
            self.assertEqual(model.config.vocab_size, model_vocab_size - 15)
            # Check that it actually resizes the embeddings matrix
            output_embeds_list = model.get_output_embeddings()

            for output_embeds in output_embeds_list:
                self.assertEqual(output_embeds.weight.shape[0], model_vocab_size - 15)
                # Check bias if present
                if output_embeds.bias is not None:
                    self.assertEqual(output_embeds.bias.shape[0], model_vocab_size - 15)

            # Check that the model can still do a forward pass successfully (every parameter should be resized)
            # Input ids should be clamped to the maximum size of the vocabulary
            inputs_dict["input_ids"].clamp_(max=model_vocab_size - 15 - 1)

            # Check that the model can still do a forward pass successfully (every parameter should be resized)
            model(**self._prepare_for_class(inputs_dict, model_class))


@require_torch
class BarkModelIntegrationTests(unittest.TestCase):
    @cached_property
    def model(self):
        return BarkModel.from_pretrained("ylacombe/bark-large").to(torch_device)

    @cached_property
    def processor(self):
        return BarkProcessor.from_pretrained("ylacombe/bark-large")

    @cached_property
    def inputs(self):
        input_ids = self.processor("In the light of the moon, a little egg lay on a leaf", voice_preset="en_speaker_6")

        input_ids = input_ids.to(torch_device)

        return input_ids

    @cached_property
    def semantic_generation_config(self):
        semantic_generation_config = BarkSemanticGenerationConfig(**self.model.generation_config.semantic_config)
        return semantic_generation_config

    @cached_property
    def coarse_generation_config(self):
        coarse_generation_config = BarkCoarseGenerationConfig(**self.model.generation_config.coarse_acoustics_config)
        return coarse_generation_config

    @cached_property
    def fine_generation_config(self):
        fine_generation_config = BarkFineGenerationConfig(**self.model.generation_config.fine_acoustics_config)
        return fine_generation_config

    @slow
    def test_generate_semantic(self):
input_ids = self.inputs
|
||||
|
||||
# fmt: off
|
||||
# check first ids
|
||||
expected_output_ids = [7363, 321, 41, 1461, 6915, 952, 326, 41, 41, 927,]
|
||||
# fmt: on
|
||||
|
||||
# greedy decoding
|
||||
with torch.no_grad():
|
||||
output_ids = self.model.semantic.generate(
|
||||
**input_ids,
|
||||
do_sample=False,
|
||||
semantic_generation_config=self.semantic_generation_config,
|
||||
)
|
||||
|
||||
self.assertListEqual(output_ids[0, : len(expected_output_ids)].tolist(), expected_output_ids)
|
||||
|
||||
    @slow
    def test_generate_coarse(self):
        input_ids = self.inputs

        history_prompt = input_ids["history_prompt"]

        # fmt: off
        # check first ids
        expected_output_ids = [11018, 11391, 10651, 11418, 10857, 11620, 10642, 11366, 10312, 11528, 10531, 11516, 10474, 11051, 10524, 11051, ]
        # fmt: on

        with torch.no_grad():
            output_ids = self.model.semantic.generate(
                **input_ids,
                do_sample=False,
                semantic_generation_config=self.semantic_generation_config,
            )

            output_ids = self.model.coarse_acoustics.generate(
                output_ids,
                history_prompt=history_prompt,
                do_sample=False,
                semantic_generation_config=self.semantic_generation_config,
                coarse_generation_config=self.coarse_generation_config,
                codebook_size=self.model.generation_config.codebook_size,
            )

        self.assertListEqual(output_ids[0, : len(expected_output_ids)].tolist(), expected_output_ids)

    @slow
    def test_generate_fine(self):
        input_ids = self.inputs

        history_prompt = input_ids["history_prompt"]

        # fmt: off
        expected_output_ids = [
            [1018, 651, 857, 642, 312, 531, 474, 524, 524, 776,],
            [367, 394, 596, 342, 504, 492, 27, 27, 822, 822,],
            [961, 955, 221, 955, 955, 686, 939, 939, 479, 176,],
            [638, 365, 218, 944, 853, 363, 639, 22, 884, 456,],
            [302, 912, 524, 38, 174, 209, 879, 23, 910, 227,],
            [440, 673, 861, 666, 372, 558, 49, 172, 232, 342,],
            [244, 358, 123, 356, 586, 520, 499, 877, 542, 637,],
            [806, 685, 905, 848, 803, 810, 921, 208, 625, 203,],
        ]
        # fmt: on

        with torch.no_grad():
            output_ids = self.model.semantic.generate(
                **input_ids,
                do_sample=False,
                semantic_generation_config=self.semantic_generation_config,
            )

            output_ids = self.model.coarse_acoustics.generate(
                output_ids,
                history_prompt=history_prompt,
                do_sample=False,
                semantic_generation_config=self.semantic_generation_config,
                coarse_generation_config=self.coarse_generation_config,
                codebook_size=self.model.generation_config.codebook_size,
            )

            # greedy decoding
            output_ids = self.model.fine_acoustics.generate(
                output_ids,
                history_prompt=history_prompt,
                temperature=None,
                semantic_generation_config=self.semantic_generation_config,
                coarse_generation_config=self.coarse_generation_config,
                fine_generation_config=self.fine_generation_config,
                codebook_size=self.model.generation_config.codebook_size,
            )

        self.assertListEqual(output_ids[0, :, : len(expected_output_ids[0])].tolist(), expected_output_ids)

    @slow
    def test_generate_end_to_end(self):
        input_ids = self.inputs

        with torch.no_grad():
            self.model.generate(**input_ids)
            self.model.generate(**{key: val for (key, val) in input_ids.items() if key != "history_prompt"})

    @slow
    def test_generate_end_to_end_with_args(self):
        input_ids = self.inputs

        with torch.no_grad():
            self.model.generate(**input_ids, do_sample=True, temperature=0.6, penalty_alpha=0.6)
            self.model.generate(**input_ids, do_sample=True, temperature=0.6, num_beams=4)

    @slow
    def test_generate_end_to_end_with_sub_models_args(self):
        input_ids = self.inputs

        with torch.no_grad():
            self.model.generate(**input_ids, do_sample=False, coarse_do_sample=True, coarse_temperature=0.7)
            self.model.generate(
                **input_ids, do_sample=False, coarse_do_sample=True, coarse_temperature=0.7, fine_temperature=0.3
            )
            self.model.generate(
                **input_ids,
                do_sample=True,
                temperature=0.6,
                penalty_alpha=0.6,
                semantic_temperature=0.9,
                coarse_temperature=0.2,
                fine_temperature=0.1,
            )
127
tests/models/bark/test_processor_bark.py
Normal file
@@ -0,0 +1,127 @@
# Copyright 2023 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import os
import shutil
import tempfile
import unittest

import numpy as np

from transformers import AutoTokenizer, BarkProcessor
from transformers.testing_utils import require_torch, slow


@require_torch
class BarkProcessorTest(unittest.TestCase):
    def setUp(self):
        self.checkpoint = "ylacombe/bark-small"
        self.tmpdirname = tempfile.mkdtemp()
        self.voice_preset = "en_speaker_1"
        self.input_string = "This is a test string"
        self.speaker_embeddings_dict_path = "speaker_embeddings_path.json"
        self.speaker_embeddings_directory = "speaker_embeddings"

    def get_tokenizer(self, **kwargs):
        return AutoTokenizer.from_pretrained(self.checkpoint, **kwargs)

    def tearDown(self):
        shutil.rmtree(self.tmpdirname)

    def test_save_load_pretrained_default(self):
        tokenizer = self.get_tokenizer()

        processor = BarkProcessor(tokenizer=tokenizer)

        processor.save_pretrained(self.tmpdirname)
        processor = BarkProcessor.from_pretrained(self.tmpdirname)

        self.assertEqual(processor.tokenizer.get_vocab(), tokenizer.get_vocab())

    @slow
    def test_save_load_pretrained_additional_features(self):
        processor = BarkProcessor.from_pretrained(
            pretrained_processor_name_or_path=self.checkpoint,
            speaker_embeddings_dict_path=self.speaker_embeddings_dict_path,
        )
        processor.save_pretrained(
            self.tmpdirname,
            speaker_embeddings_dict_path=self.speaker_embeddings_dict_path,
            speaker_embeddings_directory=self.speaker_embeddings_directory,
        )

        tokenizer_add_kwargs = self.get_tokenizer(bos_token="(BOS)", eos_token="(EOS)")

        processor = BarkProcessor.from_pretrained(
            self.tmpdirname,
            self.speaker_embeddings_dict_path,
            bos_token="(BOS)",
            eos_token="(EOS)",
        )

        self.assertEqual(processor.tokenizer.get_vocab(), tokenizer_add_kwargs.get_vocab())

    def test_speaker_embeddings(self):
        processor = BarkProcessor.from_pretrained(
            pretrained_processor_name_or_path=self.checkpoint,
            speaker_embeddings_dict_path=self.speaker_embeddings_dict_path,
        )

        seq_len = 35
        nb_codebooks_coarse = 2
        nb_codebooks_total = 8

        voice_preset = {
            "semantic_prompt": np.ones(seq_len),
            "coarse_prompt": np.ones((nb_codebooks_coarse, seq_len)),
            "fine_prompt": np.ones((nb_codebooks_total, seq_len)),
        }

        # test providing already loaded voice_preset
        inputs = processor(text=self.input_string, voice_preset=voice_preset)

        processed_voice_preset = inputs["history_prompt"]
        for key in voice_preset:
            self.assertListEqual(voice_preset[key].tolist(), processed_voice_preset.get(key, np.array([])).tolist())

        # test loading voice preset from npz file
        tmpfilename = os.path.join(self.tmpdirname, "file.npz")
        np.savez(tmpfilename, **voice_preset)
        inputs = processor(text=self.input_string, voice_preset=tmpfilename)
        processed_voice_preset = inputs["history_prompt"]

        for key in voice_preset:
            self.assertListEqual(voice_preset[key].tolist(), processed_voice_preset.get(key, np.array([])).tolist())

        # test loading voice preset from the hub
        inputs = processor(text=self.input_string, voice_preset=self.voice_preset)

    def test_tokenizer(self):
        tokenizer = self.get_tokenizer()

        processor = BarkProcessor(tokenizer=tokenizer)

        encoded_processor = processor(text=self.input_string)

        encoded_tok = tokenizer(
            self.input_string,
            padding="max_length",
            max_length=256,
            add_special_tokens=False,
            return_attention_mask=True,
            return_token_type_ids=False,
        )

        for key in encoded_tok.keys():
            self.assertListEqual(encoded_tok[key], encoded_processor[key].squeeze().tolist())
@@ -167,6 +167,8 @@ IGNORE_NON_TESTED = PRIVATE_MODELS.copy() + [
    "SpeechT5SpeechEncoder", # Building part of bigger (tested) model.
    "SpeechT5TextDecoder", # Building part of bigger (tested) model.
    "SpeechT5TextEncoder", # Building part of bigger (tested) model.
    "BarkCausalModel", # Building part of bigger (tested) model.
    "BarkModel", # Does not have a forward signature - generation tested with integration tests
]

# Update this list with test files that don't have a tester with a `all_model_classes` variable and which don't
@@ -188,6 +190,7 @@ TEST_FILES_WITH_NO_COMMON_TESTS = [
    "models/vision_text_dual_encoder/test_modeling_tf_vision_text_dual_encoder.py",
    "models/vision_text_dual_encoder/test_modeling_flax_vision_text_dual_encoder.py",
    "models/decision_transformer/test_modeling_decision_transformer.py",
    "models/bark/test_modeling_bark.py",
]

# Update this list for models that are not in any of the auto MODEL_XXX_MAPPING. Being in this list is an exception and
@@ -332,11 +335,15 @@ IGNORE_NON_AUTO_CONFIGURED = PRIVATE_MODELS.copy() + [
    "AltCLIPVisionModel",
    "AltRobertaModel",
    "TvltForAudioVisualClassification",
    "BarkCausalModel",
    "BarkCoarseModel",
    "BarkFineModel",
    "BarkSemanticModel",
    "MusicgenModel",
    "MusicgenForConditionalGeneration",
    "SpeechT5ForSpeechToSpeech",
    "SpeechT5ForTextToSpeech",
    "SpeechT5HifiGan",
    "MusicgenModel",
    "MusicgenForConditionalGeneration",
]

# DO NOT edit this list!

@@ -28,6 +28,9 @@ src/transformers/models/auto/feature_extraction_auto.py
src/transformers/models/auto/image_processing_auto.py
src/transformers/models/auto/processing_auto.py
src/transformers/models/auto/tokenization_auto.py
src/transformers/models/bark/configuration_bark.py
src/transformers/models/bark/modeling_bark.py
src/transformers/models/bark/processing_bark.py
src/transformers/models/bart/configuration_bart.py
src/transformers/models/bart/modeling_bart.py
src/transformers/models/bart/tokenization_bart.py