* first raw version of the bark integration

* working code on small models with single run

* add converting script from suno weights 2 hf

* many changes

* correct past_kv output

* working implementation for inference

* update the converting script according to the architecture changes

* add a working end-to-end inference code

* remove some comments and make small changes

* remove unecessary comment

* add docstrings and ensure no unecessary intermediary output during audio generation

* remove done TODOs

* make style + add config docstrings

* modification for batch inference support on the whole model

* add details to .generation_audio method

* add copyright

* convert EncodecModel from original library to transformers implementation

* add two class in order to facilitate model and sub-models loading from the hub

* add support of loading the whole model

* add BarkProcessor

* correct modeling according to processor output

* Add proper __init__ and auto support

* Add up-to-date copyright/license message

* add relative import instead of absolute

* cleaner head_dim computation

* small comment removal or changes

* more verbose LayerNorm init method

* specify eps for clearer comprehension

* more verbose variable naming in the MLP module

* remove unecessary BarkBlock parameter

* clearer code in the forward pass of the BarkBlock

* remove _initialize_modules method for cleaner code

* Remove unnecessary methods from sub-models

* move code to remove unnecessary function

* rename a variable for clarity and change an assert

* move code and change variable name for clarity

* remove unnecessary asserts

* correct small bug

* correct a comment

* change variable names for clarity

* remove asserts

* change import from absolute to relative

* correct small error due to comma missing + correct import

* Add attribute Bark config

* add first version of tests

* update attention_map

* add tie_weights and resize_token_embeddings for fineModel

* correct getting attention_mask in generate_text_semantic

* remove Bark inference trick

* leave more choices in barkProcessor

* remove _no_split_modules

* fixe error in forward of block and introduce clearer notations

* correct converting script with last changes

* make style + add draft bark.mdx

* correct BarkModelTest::test_generate_text_semantic

* add Bark in main README

* add dummy_pt_objects for Bark

* add missing models in the main init

* correct test_decoder_model_past_with_large_inputs

* disable torchscript test

* change docstring of BarkProcessor

* Add test_processor_bark

* make style

* correct copyrights

* add bark.mdx + make style, quality and consistency

* Apply suggestions from code review

Co-authored-by: Sanchit Gandhi <93869735+sanchit-gandhi@users.noreply.github.com>

* Remove unnecessary test method

* simply logic of a test

* Only check first ids for slow audio generation

* split full end-to-end generation tests

* remove unneccessary comment

* change submodel names for clearer naming

* remove ModuleDict from modeling_bark

* combine two if statements

* ensure that an edge misued won't happen

* modify variable name

* move code snippet to the right place (coarse instead of semantic)

* change BarkSemanticModule -> BarkSemanticModel

* align BarkProcessor with transformers paradigm

* correct BarkProcessor tests with last commit changes

* change _validate_voice_preset to an instance method instead of a class method

* tie_weights already called with post_init

* add codec_model config to configuration

* update bark modeling tests with recent BarkProcessor changes

* remove SubModelPretrainedModel + change speakers embeddings prompt type in BarkModel

* change absolute imports to relative

* remove TODO

* change docstrings

* add examples to docs and docstrings

* make style

* uses BatchFeature in BarkProcessor insteads of dict

* continue improving docstrings and docs + make style

* correct docstrings examples

* more comprehensible speaker_embeddings load/Save

* rename speaker_embeddings_dict -> speaker_embeddings

* correct bark.mdx + add bark to documentation_tests

* correct docstrings configuration_bark

* integrate last nit suggestions

* integrate BarkGeneration configs

* make style

* remove bark tests from documentation_tests.txt because timeout - tested manually

* add proper generation config initialization

* small bark.mdx documentation changes

* rename bark.mdx -> bark.md

* add torch.no_grad behind BarkModel.generate_audio()

* replace assert by ValueError in convert_suno_to_hf.py

* integrate a series of short comments from reviewer

* move SemanticLogitsProcessors and remove .detach() from Bark docs and docstrings

* actually remove SemanticLogitsProcessor from modeling_bark.oy

* BarkProcessor returns a single output instead of tuple + correct docstrings

* make style + correct bug

* add initializer_range to BarkConfig + correct slow modeling tests

* add .clone() to history_prompt.coarse_prompt to avoid modifying input array

* Making sure no extra "`" are present

* remove extra characters in modeling_bark.py

* Correct output if history_prompt is None

* remove TODOs

* remove ravel comment

* completing generation_configuration_bark.py docstrings

* change docstrings - number of audio codebooks instead of Encodec codebooks

* change 'bias' docstrings in configuration_bark.py

* format code

* rename BarkModel.generate_audio -> BarkModel.generate_speech

* modify AutoConfig instead of EncodecConfig in BarkConfig

* correct AutoConfig wrong init

* refactor BarkModel and sub-models generate_coarse, generate_fine, generate_text_semantic

* remove SemanticLogitsProcessor and replace it with SuppressTokensLogitsProcessor

* move nb_codebook related config arguments to BarkFineConfig

* rename bark.mdx -> bark.md

* correcting BarkModelConfig from_pretrained + remove keys_to_ignore

* correct bark.md with correct hub path

* correct code bug in bark.md

* correct list tokens_to_suppress

* modify Processor to load nested speaker embeddings in a safer way

* correct batch sampling in BarkFineModel.generate_fine

* Apply suggestions from code review

Small docstrings correction and code improvements

Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

* give more details about num_layers in docstrings

* correct indentation mistake

* correct submodelconfig order of docstring variables

* put audio models in alphabetical order in utils/check_repo.my

* remove useless line from test_modeling_bark.py

* makes BarkCoarseModelTest inherits from (ModelTesterMixin, GenerationTesterMixin, unittest.TestCase) instead of BarkSemanticModelTest

* make a Tester class for each sub-model instead of inheriting

* add test_resize_embeddings=True for Bark sub-models

* add Copied from transformers.models.gpt_neo.modeling_gpt_neo.GPTNeoSelfAttention._split_heads

* remove 'Copied fom Bark' comment

* remove unneccessary comment

* change np.min -> min in modeling_bark.py

* refactored all custom layers to have Bark prefix

* add attention_mask as an argument of generate_text_semantic

* refactor sub-models start docstrings to have more precise config class definition

* move _tied_weights_keys overriding

* add docstrings to generate_xxx in modeling_bark.py

* add loading whole BarkModel to convert_suno_to_hf

* refactor attribute and variable names

* make style convert_suno

* update bark checkpoints

* remove never entered if statement

* move bark_modeling docstrings after BarkPretrainedModel class definition

* refactor modeling_bark.py: kv -> key_values

* small nits - code refactoring and removing unecessary lines from _init_weights

* nits - replace inplace method by variable assigning

* remove *optional* when necessary

* remove some lines in generate_speech

* add default value for optional parameter

* Refactor preprocess_histories_before_coarse -> preprocess_histories

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* correct usage after refactoring

* refactor Bark's generate_xxx -> generate and modify docstrings and tests accordingly

* update docstrings python in configuration_bark.py

* add bark files in utils/documentation_test.txt

* correct docstrings python snippet

* add the ability to use parameters in the form of e.g coarse_temperature

* add semantic_max_new_tokens in python snippet in docstrings for quicker generation

* Reformate sub-models kwargs in BakModel.generate

Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

* correct kwargs in BarkModel.generate

* correct attention_mask kwarg in BarkModel.generate

* add tests for sub-models args in BarkModel.generate and correct BarkFineModel.test_generate_fp16

* enrich BarkModel.generate docstrings with a description of how to use the kwargs

---------

Co-authored-by: Sanchit Gandhi <93869735+sanchit-gandhi@users.noreply.github.com>
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
This commit is contained in:
Yoach Lacombe
2023-07-17 18:53:24 +02:00
committed by GitHub
parent c21c3737c1
commit f42a35e611
28 changed files with 4199 additions and 2 deletions

View File

@@ -291,6 +291,7 @@ Current number of checkpoints: ![](https://img.shields.io/endpoint?url=https://h
1. **[AltCLIP](https://huggingface.co/docs/transformers/model_doc/altclip)** (from BAAI) released with the paper [AltCLIP: Altering the Language Encoder in CLIP for Extended Language Capabilities](https://arxiv.org/abs/2211.06679) by Chen, Zhongzhi and Liu, Guang and Zhang, Bo-Wen and Ye, Fulong and Yang, Qinghong and Wu, Ledell.
1. **[Audio Spectrogram Transformer](https://huggingface.co/docs/transformers/model_doc/audio-spectrogram-transformer)** (from MIT) released with the paper [AST: Audio Spectrogram Transformer](https://arxiv.org/abs/2104.01778) by Yuan Gong, Yu-An Chung, James Glass.
1. **[Autoformer](https://huggingface.co/docs/transformers/model_doc/autoformer)** (from Tsinghua University) released with the paper [Autoformer: Decomposition Transformers with Auto-Correlation for Long-Term Series Forecasting](https://arxiv.org/abs/2106.13008) by Haixu Wu, Jiehui Xu, Jianmin Wang, Mingsheng Long.
1. **[Bark](https://huggingface.co/docs/transformers/main/model_doc/bark)** (from Suno) released in the repository [suno-ai/bark](https://github.com/suno-ai/bark) by Suno AI team.
1. **[BART](https://huggingface.co/docs/transformers/model_doc/bart)** (from Facebook) released with the paper [BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension](https://arxiv.org/abs/1910.13461) by Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov and Luke Zettlemoyer.
1. **[BARThez](https://huggingface.co/docs/transformers/model_doc/barthez)** (from École polytechnique) released with the paper [BARThez: a Skilled Pretrained French Sequence-to-Sequence Model](https://arxiv.org/abs/2010.12321) by Moussa Kamal Eddine, Antoine J.-P. Tixier, Michalis Vazirgiannis.
1. **[BARTpho](https://huggingface.co/docs/transformers/model_doc/bartpho)** (from VinAI Research) released with the paper [BARTpho: Pre-trained Sequence-to-Sequence Models for Vietnamese](https://arxiv.org/abs/2109.09701) by Nguyen Luong Tran, Duong Minh Le and Dat Quoc Nguyen.

View File

@@ -268,6 +268,7 @@ Número actual de puntos de control: ![](https://img.shields.io/endpoint?url=htt
1. **[AltCLIP](https://huggingface.co/docs/transformers/model_doc/altclip)** (from BAAI) released with the paper [AltCLIP: Altering the Language Encoder in CLIP for Extended Language Capabilities](https://arxiv.org/abs/2211.06679) by Chen, Zhongzhi and Liu, Guang and Zhang, Bo-Wen and Ye, Fulong and Yang, Qinghong and Wu, Ledell.
1. **[Audio Spectrogram Transformer](https://huggingface.co/docs/transformers/model_doc/audio-spectrogram-transformer)** (from MIT) released with the paper [AST: Audio Spectrogram Transformer](https://arxiv.org/abs/2104.01778) by Yuan Gong, Yu-An Chung, James Glass.
1. **[Autoformer](https://huggingface.co/docs/transformers/model_doc/autoformer)** (from Tsinghua University) released with the paper [Autoformer: Decomposition Transformers with Auto-Correlation for Long-Term Series Forecasting](https://arxiv.org/abs/2106.13008) by Haixu Wu, Jiehui Xu, Jianmin Wang, Mingsheng Long.
1. **[Bark](https://huggingface.co/docs/transformers/main/model_doc/bark)** (from Suno) released in the repository [suno-ai/bark](https://github.com/suno-ai/bark) by Suno AI team.
1. **[BART](https://huggingface.co/docs/transformers/model_doc/bart)** (from Facebook) released with the paper [BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension](https://arxiv.org/abs/1910.13461) by Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov and Luke Zettlemoyer.
1. **[BARThez](https://huggingface.co/docs/transformers/model_doc/barthez)** (from École polytechnique) released with the paper [BARThez: a Skilled Pretrained French Sequence-to-Sequence Model](https://arxiv.org/abs/2010.12321) by Moussa Kamal Eddine, Antoine J.-P. Tixier, Michalis Vazirgiannis.
1. **[BARTpho](https://huggingface.co/docs/transformers/model_doc/bartpho)** (from VinAI Research) released with the paper [BARTpho: Pre-trained Sequence-to-Sequence Models for Vietnamese](https://arxiv.org/abs/2109.09701) by Nguyen Luong Tran, Duong Minh Le and Dat Quoc Nguyen.

View File

@@ -240,6 +240,7 @@ conda install -c huggingface transformers
1. **[AltCLIP](https://huggingface.co/docs/transformers/model_doc/altclip)** (from BAAI) released with the paper [AltCLIP: Altering the Language Encoder in CLIP for Extended Language Capabilities](https://arxiv.org/abs/2211.06679) by Chen, Zhongzhi and Liu, Guang and Zhang, Bo-Wen and Ye, Fulong and Yang, Qinghong and Wu, Ledell.
1. **[Audio Spectrogram Transformer](https://huggingface.co/docs/transformers/model_doc/audio-spectrogram-transformer)** (from MIT) released with the paper [AST: Audio Spectrogram Transformer](https://arxiv.org/abs/2104.01778) by Yuan Gong, Yu-An Chung, James Glass.
1. **[Autoformer](https://huggingface.co/docs/transformers/model_doc/autoformer)** (from Tsinghua University) released with the paper [Autoformer: Decomposition Transformers with Auto-Correlation for Long-Term Series Forecasting](https://arxiv.org/abs/2106.13008) by Haixu Wu, Jiehui Xu, Jianmin Wang, Mingsheng Long.
1. **[Bark](https://huggingface.co/docs/transformers/main/model_doc/bark)** (from Suno) released in the repository [suno-ai/bark](https://github.com/suno-ai/bark) by Suno AI team.
1. **[BART](https://huggingface.co/docs/transformers/model_doc/bart)** (फेसबुक) साथ थीसिस [बार्ट: प्राकृतिक भाषा निर्माण, अनुवाद के लिए अनुक्रम-से-अनुक्रम पूर्व प्रशिक्षण , और समझ] (https://arxiv.org/pdf/1910.13461.pdf) पर निर्भर माइक लुईस, यिनहान लियू, नमन गोयल, मार्जन ग़ज़विनिनेजाद, अब्देलरहमान मोहम्मद, ओमर लेवी, वेस स्टोयानोव और ल्यूक ज़ेटलमॉयर
1. **[BARThez](https://huggingface.co/docs/transformers/model_doc/barthez)** (से École polytechnique) साथ थीसिस [BARThez: a Skilled Pretrained French Sequence-to-Sequence Model](https://arxiv.org/abs/2010.12321) पर निर्भर Moussa Kamal Eddine, Antoine J.-P. Tixier, Michalis Vazirgiannis रिहाई।
1. **[BARTpho](https://huggingface.co/docs/transformers/model_doc/bartpho)** (VinAI Research से) साथ में पेपर [BARTpho: Pre-trained Sequence-to-Sequence Models for Vietnamese](https://arxiv.org/abs/2109.09701)गुयेन लुओंग ट्रान, डुओंग मिन्ह ले और डाट क्वोक गुयेन द्वारा पोस्ट किया गया।

View File

@@ -302,6 +302,7 @@ Flax、PyTorch、TensorFlowをcondaでインストールする方法は、それ
1. **[AltCLIP](https://huggingface.co/docs/transformers/model_doc/altclip)** (BAAI から) Chen, Zhongzhi and Liu, Guang and Zhang, Bo-Wen and Ye, Fulong and Yang, Qinghong and Wu, Ledell から公開された研究論文: [AltCLIP: Altering the Language Encoder in CLIP for Extended Language Capabilities](https://arxiv.org/abs/2211.06679)
1. **[Audio Spectrogram Transformer](https://huggingface.co/docs/transformers/model_doc/audio-spectrogram-transformer)** (MIT から) Yuan Gong, Yu-An Chung, James Glass から公開された研究論文: [AST: Audio Spectrogram Transformer](https://arxiv.org/abs/2104.01778)
1. **[Autoformer](https://huggingface.co/docs/transformers/model_doc/autoformer)** (from Tsinghua University) released with the paper [Autoformer: Decomposition Transformers with Auto-Correlation for Long-Term Series Forecasting](https://arxiv.org/abs/2106.13008) by Haixu Wu, Jiehui Xu, Jianmin Wang, Mingsheng Long.
1. **[Bark](https://huggingface.co/docs/transformers/main/model_doc/bark)** (from Suno) released in the repository [suno-ai/bark](https://github.com/suno-ai/bark) by Suno AI team.
1. **[BART](https://huggingface.co/docs/transformers/model_doc/bart)** (Facebook から) Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov and Luke Zettlemoyer から公開された研究論文: [BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension](https://arxiv.org/abs/1910.13461)
1. **[BARThez](https://huggingface.co/docs/transformers/model_doc/barthez)** (École polytechnique から) Moussa Kamal Eddine, Antoine J.-P. Tixier, Michalis Vazirgiannis から公開された研究論文: [BARThez: a Skilled Pretrained French Sequence-to-Sequence Model](https://arxiv.org/abs/2010.12321)
1. **[BARTpho](https://huggingface.co/docs/transformers/model_doc/bartpho)** (VinAI Research から) Nguyen Luong Tran, Duong Minh Le and Dat Quoc Nguyen から公開された研究論文: [BARTpho: Pre-trained Sequence-to-Sequence Models for Vietnamese](https://arxiv.org/abs/2109.09701)

View File

@@ -217,6 +217,7 @@ Flax, PyTorch, TensorFlow 설치 페이지에서 이들을 conda로 설치하는
1. **[AltCLIP](https://huggingface.co/docs/transformers/model_doc/altclip)** (from BAAI) released with the paper [AltCLIP: Altering the Language Encoder in CLIP for Extended Language Capabilities](https://arxiv.org/abs/2211.06679) by Chen, Zhongzhi and Liu, Guang and Zhang, Bo-Wen and Ye, Fulong and Yang, Qinghong and Wu, Ledell.
1. **[Audio Spectrogram Transformer](https://huggingface.co/docs/transformers/model_doc/audio-spectrogram-transformer)** (from MIT) released with the paper [AST: Audio Spectrogram Transformer](https://arxiv.org/abs/2104.01778) by Yuan Gong, Yu-An Chung, James Glass.
1. **[Autoformer](https://huggingface.co/docs/transformers/model_doc/autoformer)** (from Tsinghua University) released with the paper [Autoformer: Decomposition Transformers with Auto-Correlation for Long-Term Series Forecasting](https://arxiv.org/abs/2106.13008) by Haixu Wu, Jiehui Xu, Jianmin Wang, Mingsheng Long.
1. **[Bark](https://huggingface.co/docs/transformers/main/model_doc/bark)** (from Suno) released in the repository [suno-ai/bark](https://github.com/suno-ai/bark) by Suno AI team.
1. **[BART](https://huggingface.co/docs/transformers/model_doc/bart)** (from Facebook) released with the paper [BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension](https://arxiv.org/pdf/1910.13461.pdf) by Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov and Luke Zettlemoyer.
1. **[BARThez](https://huggingface.co/docs/transformers/model_doc/barthez)** (from École polytechnique) released with the paper [BARThez: a Skilled Pretrained French Sequence-to-Sequence Model](https://arxiv.org/abs/2010.12321) by Moussa Kamal Eddine, Antoine J.-P. Tixier, Michalis Vazirgiannis.
1. **[BARTpho](https://huggingface.co/docs/transformers/model_doc/bartpho)** (from VinAI Research) released with the paper [BARTpho: Pre-trained Sequence-to-Sequence Models for Vietnamese](https://arxiv.org/abs/2109.09701) by Nguyen Luong Tran, Duong Minh Le and Dat Quoc Nguyen.

View File

@@ -241,6 +241,7 @@ conda install -c huggingface transformers
1. **[AltCLIP](https://huggingface.co/docs/transformers/model_doc/altclip)** (来自 BAAI) 伴随论文 [AltCLIP: Altering the Language Encoder in CLIP for Extended Language Capabilities](https://arxiv.org/abs/2211.06679) 由 Chen, Zhongzhi and Liu, Guang and Zhang, Bo-Wen and Ye, Fulong and Yang, Qinghong and Wu, Ledell 发布。
1. **[Audio Spectrogram Transformer](https://huggingface.co/docs/transformers/model_doc/audio-spectrogram-transformer)** (来自 MIT) 伴随论文 [AST: Audio Spectrogram Transformer](https://arxiv.org/abs/2104.01778) 由 Yuan Gong, Yu-An Chung, James Glass 发布。
1. **[Autoformer](https://huggingface.co/docs/transformers/model_doc/autoformer)** (from Tsinghua University) released with the paper [Autoformer: Decomposition Transformers with Auto-Correlation for Long-Term Series Forecasting](https://arxiv.org/abs/2106.13008) by Haixu Wu, Jiehui Xu, Jianmin Wang, Mingsheng Long.
1. **[Bark](https://huggingface.co/docs/transformers/main/model_doc/bark)** (from Suno) released in the repository [suno-ai/bark](https://github.com/suno-ai/bark) by Suno AI team.
1. **[BART](https://huggingface.co/docs/transformers/model_doc/bart)** (来自 Facebook) 伴随论文 [BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension](https://arxiv.org/pdf/1910.13461.pdf) 由 Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov and Luke Zettlemoyer 发布。
1. **[BARThez](https://huggingface.co/docs/transformers/model_doc/barthez)** (来自 École polytechnique) 伴随论文 [BARThez: a Skilled Pretrained French Sequence-to-Sequence Model](https://arxiv.org/abs/2010.12321) 由 Moussa Kamal Eddine, Antoine J.-P. Tixier, Michalis Vazirgiannis 发布。
1. **[BARTpho](https://huggingface.co/docs/transformers/model_doc/bartpho)** (来自 VinAI Research) 伴随论文 [BARTpho: Pre-trained Sequence-to-Sequence Models for Vietnamese](https://arxiv.org/abs/2109.09701) 由 Nguyen Luong Tran, Duong Minh Le and Dat Quoc Nguyen 发布。

View File

@@ -253,6 +253,7 @@ conda install -c huggingface transformers
1. **[AltCLIP](https://huggingface.co/docs/transformers/model_doc/altclip)** (from BAAI) released with the paper [AltCLIP: Altering the Language Encoder in CLIP for Extended Language Capabilities](https://arxiv.org/abs/2211.06679) by Chen, Zhongzhi and Liu, Guang and Zhang, Bo-Wen and Ye, Fulong and Yang, Qinghong and Wu, Ledell.
1. **[Audio Spectrogram Transformer](https://huggingface.co/docs/transformers/model_doc/audio-spectrogram-transformer)** (from MIT) released with the paper [AST: Audio Spectrogram Transformer](https://arxiv.org/abs/2104.01778) by Yuan Gong, Yu-An Chung, James Glass.
1. **[Autoformer](https://huggingface.co/docs/transformers/model_doc/autoformer)** (from Tsinghua University) released with the paper [Autoformer: Decomposition Transformers with Auto-Correlation for Long-Term Series Forecasting](https://arxiv.org/abs/2106.13008) by Haixu Wu, Jiehui Xu, Jianmin Wang, Mingsheng Long.
1. **[Bark](https://huggingface.co/docs/transformers/main/model_doc/bark)** (from Suno) released in the repository [suno-ai/bark](https://github.com/suno-ai/bark) by Suno AI team.
1. **[BART](https://huggingface.co/docs/transformers/model_doc/bart)** (from Facebook) released with the paper [BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension](https://arxiv.org/pdf/1910.13461.pdf) by Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov and Luke Zettlemoyer.
1. **[BARThez](https://huggingface.co/docs/transformers/model_doc/barthez)** (from École polytechnique) released with the paper [BARThez: a Skilled Pretrained French Sequence-to-Sequence Model](https://arxiv.org/abs/2010.12321) by Moussa Kamal Eddine, Antoine J.-P. Tixier, Michalis Vazirgiannis.
1. **[BARTpho](https://huggingface.co/docs/transformers/model_doc/bartpho)** (from VinAI Research) released with the paper [BARTpho: Pre-trained Sequence-to-Sequence Models for Vietnamese](https://arxiv.org/abs/2109.09701) by Nguyen Luong Tran, Duong Minh Le and Dat Quoc Nguyen.

View File

@@ -545,6 +545,8 @@
sections:
- local: model_doc/audio-spectrogram-transformer
title: Audio Spectrogram Transformer
- local: model_doc/bark
title: Bark
- local: model_doc/clap
title: CLAP
- local: model_doc/encodec

View File

@@ -57,6 +57,7 @@ The documentation is organized into five sections:
1. **[AltCLIP](model_doc/altclip)** (from BAAI) released with the paper [AltCLIP: Altering the Language Encoder in CLIP for Extended Language Capabilities](https://arxiv.org/abs/2211.06679) by Chen, Zhongzhi and Liu, Guang and Zhang, Bo-Wen and Ye, Fulong and Yang, Qinghong and Wu, Ledell.
1. **[Audio Spectrogram Transformer](model_doc/audio-spectrogram-transformer)** (from MIT) released with the paper [AST: Audio Spectrogram Transformer](https://arxiv.org/abs/2104.01778) by Yuan Gong, Yu-An Chung, James Glass.
1. **[Autoformer](model_doc/autoformer)** (from Tsinghua University) released with the paper [Autoformer: Decomposition Transformers with Auto-Correlation for Long-Term Series Forecasting](https://arxiv.org/abs/2106.13008) by Haixu Wu, Jiehui Xu, Jianmin Wang, Mingsheng Long.
1. **[Bark](model_doc/bark)** (from Suno) released in the repository [suno-ai/bark](https://github.com/suno-ai/bark) by Suno AI team.
1. **[BART](model_doc/bart)** (from Facebook) released with the paper [BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension](https://arxiv.org/abs/1910.13461) by Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov and Luke Zettlemoyer.
1. **[BARThez](model_doc/barthez)** (from École polytechnique) released with the paper [BARThez: a Skilled Pretrained French Sequence-to-Sequence Model](https://arxiv.org/abs/2010.12321) by Moussa Kamal Eddine, Antoine J.-P. Tixier, Michalis Vazirgiannis.
1. **[BARTpho](model_doc/bartpho)** (from VinAI Research) released with the paper [BARTpho: Pre-trained Sequence-to-Sequence Models for Vietnamese](https://arxiv.org/abs/2109.09701) by Nguyen Luong Tran, Duong Minh Le and Dat Quoc Nguyen.
@@ -282,6 +283,7 @@ Flax), PyTorch, and/or TensorFlow.
| AltCLIP | ❌ | ❌ | ✅ | ❌ | ❌ |
| Audio Spectrogram Transformer | ❌ | ❌ | ✅ | ❌ | ❌ |
| Autoformer | ❌ | ❌ | ✅ | ❌ | ❌ |
| Bark | ❌ | ❌ | ✅ | ❌ | ❌ |
| BART | ✅ | ✅ | ✅ | ✅ | ✅ |
| BEiT | ❌ | ❌ | ✅ | ❌ | ✅ |
| BERT | ✅ | ✅ | ✅ | ✅ | ✅ |

View File

@@ -0,0 +1,141 @@
<!--Copyright 2023 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# Bark
## Overview
Bark is a transformer-based text-to-speech model proposed by Suno AI in [suno-ai/bark](https://github.com/suno-ai/bark).
Bark is made of 4 main models:
- [`BarkSemanticModel`] (also referred to as the 'text' model): a causal auto-regressive transformer model that takes as input tokenized text, and predicts semantic text tokens that capture the meaning of the text.
- [`BarkCoarseModel`] (also referred to as the 'coarse acoustics' model): a causal autoregressive transformer, that takes as input the results of the [`BarkSemanticModel`] model. It aims at predicting the first two audio codebooks necessary for EnCodec.
- [`BarkFineModel`] (the 'fine acoustics' model), this time a non-causal autoencoder transformer, which iteratively predicts the last codebooks based on the sum of the previous codebooks embeddings.
- having predicted all the codebook channels from the [`EncodecModel`], Bark uses it to decode the output audio array.
It should be noted that each of the first three modules can support conditional speaker embeddings to condition the output sound according to specific predefined voice.
### Tips:
Suno offers a library of voice presets in a number of languages [here](https://suno-ai.notion.site/8b8e8749ed514b0cbf3f699013548683?v=bc67cff786b04b50b3ceb756fd05f68c).
These presets are also uploaded in the hub [here](https://huggingface.co/suno/bark-small/tree/main/speaker_embeddings) or [here](https://huggingface.co/suno/bark/tree/main/speaker_embeddings).
```python
>>> from transformers import AutoProcessor, BarkModel
>>> processor = AutoProcessor.from_pretrained("suno/bark")
>>> model = BarkModel.from_pretrained("suno/bark")
>>> voice_preset = "v2/en_speaker_6"
>>> inputs = processor("Hello, my dog is cute", voice_preset=voice_preset)
>>> audio_array = model.generate(**inputs)
>>> audio_array = audio_array.cpu().numpy().squeeze()
```
Bark can generate highly realistic, **multilingual** speech as well as other audio - including music, background noise and simple sound effects.
```python
>>> # Multilingual speech - simplified Chinese
>>> inputs = processor("惊人的!我会说中文")
>>> # Multilingual speech - French - let's use a voice_preset as well
>>> inputs = processor("Incroyable! Je peux générer du son.", voice_preset="fr_speaker_5")
>>> # Bark can also generate music. You can help it out by adding music notes around your lyrics.
>>> inputs = processor("♪ Hello, my dog is cute ♪")
>>> audio_array = model.generate(**inputs)
>>> audio_array = audio_array.cpu().numpy().squeeze()
```
The model can also produce **nonverbal communications** like laughing, sighing and crying.
```python
>>> # Adding non-speech cues to the input text
>>> inputs = processor("Hello uh ... [clears throat], my dog is cute [laughter]")
>>> audio_array = model.generate(**inputs)
>>> audio_array = audio_array.cpu().numpy().squeeze()
```
To save the audio, simply take the sample rate from the model config and some scipy utility:
```python
>>> from scipy.io.wavfile import write as write_wav
>>> # save audio to disk, but first take the sample rate from the model config
>>> sample_rate = model.generation_config.sample_rate
>>> write_wav("bark_generation.wav", sample_rate, audio_array)
```
This model was contributed by [Yoach Lacombe (ylacombe)](https://huggingface.co/ylacombe) and [Sanchit Gandhi (sanchit-gandhi)](https://github.com/sanchit-gandhi).
The original code can be found [here](https://github.com/suno-ai/bark).
## BarkConfig
[[autodoc]] BarkConfig
- all
## BarkProcessor
[[autodoc]] BarkProcessor
- all
- __call__
## BarkModel
[[autodoc]] BarkModel
- generate
## BarkSemanticModel
[[autodoc]] BarkSemanticModel
- forward
## BarkCoarseModel
[[autodoc]] BarkCoarseModel
- forward
## BarkFineModel
[[autodoc]] BarkFineModel
- forward
## BarkCausalModel
[[autodoc]] BarkCausalModel
- forward
## BarkCoarseConfig
[[autodoc]] BarkCoarseConfig
- all
## BarkFineConfig
[[autodoc]] BarkFineConfig
- all
## BarkSemanticConfig
[[autodoc]] BarkSemanticConfig
- all

View File

@@ -160,6 +160,13 @@ _import_structure = {
"AUTOFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP",
"AutoformerConfig",
],
"models.bark": [
"BarkCoarseConfig",
"BarkConfig",
"BarkFineConfig",
"BarkProcessor",
"BarkSemanticConfig",
],
"models.bart": ["BartConfig", "BartTokenizer"],
"models.barthez": [],
"models.bartpho": [],
@@ -1136,6 +1143,17 @@ else:
"AutoformerPreTrainedModel",
]
)
_import_structure["models.bark"].extend(
[
"BARK_PRETRAINED_MODEL_ARCHIVE_LIST",
"BarkCausalModel",
"BarkCoarseModel",
"BarkFineModel",
"BarkModel",
"BarkPreTrainedModel",
"BarkSemanticModel",
]
)
_import_structure["models.bart"].extend(
[
"BART_PRETRAINED_MODEL_ARCHIVE_LIST",
@@ -4098,6 +4116,13 @@ if TYPE_CHECKING:
AUTOFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP,
AutoformerConfig,
)
from .models.bark import (
BarkCoarseConfig,
BarkConfig,
BarkFineConfig,
BarkProcessor,
BarkSemanticConfig,
)
from .models.bart import BartConfig, BartTokenizer
from .models.beit import BEIT_PRETRAINED_CONFIG_ARCHIVE_MAP, BeitConfig
from .models.bert import (
@@ -4978,6 +5003,15 @@ if TYPE_CHECKING:
AutoformerModel,
AutoformerPreTrainedModel,
)
from .models.bark import (
BARK_PRETRAINED_MODEL_ARCHIVE_LIST,
BarkCausalModel,
BarkCoarseModel,
BarkFineModel,
BarkModel,
BarkPreTrainedModel,
BarkSemanticModel,
)
from .models.bart import (
BART_PRETRAINED_MODEL_ARCHIVE_LIST,
BartForCausalLM,

View File

@@ -1148,3 +1148,39 @@ class ClassifierFreeGuidanceLogitsProcessor(LogitsProcessor):
cond_logits, uncond_logits = scores.split(unguided_bsz, dim=0)
scores = uncond_logits + (cond_logits - uncond_logits) * self.guidance_scale
return scores
class AlternatingCodebooksLogitsProcessor(LogitsProcessor):
r"""
[`LogitsProcessor`] enforcing alternated generation between the two codebooks of [`Bark`]'s fine submodel.
Args:
input_start_len (`int`):
The length of the initial input sequence.
semantic_vocab_size (`int`):
Vocabulary size of the semantic part, i.e number of tokens associated to the semantic vocabulary.
codebook_size (`int`):
Number of tokens associated to the codebook.
"""
def __init__(self, input_start_len: int, semantic_vocab_size: int, codebook_size: int):
if not isinstance(input_start_len, int) or input_start_len < 0:
raise ValueError(f"`input_starting_length` has to be a non-negative integer, but is {input_start_len}")
self.input_start_len = input_start_len
self.semantic_vocab_size = semantic_vocab_size
self.codebook_size = codebook_size
def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor) -> torch.FloatTensor:
curr_len = input_ids.shape[-1]
# even -> first codebook, odd -> second codebook
is_first_codebook = ((curr_len - self.input_start_len) % 2) == 0
if is_first_codebook:
scores[:, : self.semantic_vocab_size] = -float("inf")
scores[:, self.semantic_vocab_size + self.codebook_size :] = -float("inf")
else:
scores[:, : self.semantic_vocab_size + self.codebook_size] = -float("inf")
return scores

View File

@@ -19,6 +19,7 @@ from . import (
audio_spectrogram_transformer,
auto,
autoformer,
bark,
bart,
barthez,
bartpho,

View File

@@ -35,6 +35,7 @@ CONFIG_MAPPING_NAMES = OrderedDict(
("altclip", "AltCLIPConfig"),
("audio-spectrogram-transformer", "ASTConfig"),
("autoformer", "AutoformerConfig"),
("bark", "BarkConfig"),
("bart", "BartConfig"),
("beit", "BeitConfig"),
("bert", "BertConfig"),
@@ -237,6 +238,7 @@ CONFIG_ARCHIVE_MAP_MAPPING_NAMES = OrderedDict(
("altclip", "ALTCLIP_PRETRAINED_CONFIG_ARCHIVE_MAP"),
("audio-spectrogram-transformer", "AUDIO_SPECTROGRAM_TRANSFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP"),
("autoformer", "AUTOFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP"),
("bark", "BARK_PRETRAINED_CONFIG_ARCHIVE_MAP"),
("bart", "BART_PRETRAINED_CONFIG_ARCHIVE_MAP"),
("beit", "BEIT_PRETRAINED_CONFIG_ARCHIVE_MAP"),
("bert", "BERT_PRETRAINED_CONFIG_ARCHIVE_MAP"),
@@ -419,6 +421,7 @@ MODEL_NAMES_MAPPING = OrderedDict(
("altclip", "AltCLIP"),
("audio-spectrogram-transformer", "Audio Spectrogram Transformer"),
("autoformer", "Autoformer"),
("bark", "Bark"),
("bart", "BART"),
("barthez", "BARThez"),
("bartpho", "BARTpho"),

View File

@@ -33,6 +33,7 @@ MODEL_MAPPING_NAMES = OrderedDict(
("altclip", "AltCLIPModel"),
("audio-spectrogram-transformer", "ASTModel"),
("autoformer", "AutoformerModel"),
("bark", "BarkModel"),
("bart", "BartModel"),
("beit", "BeitModel"),
("bert", "BertModel"),

View File

@@ -44,6 +44,7 @@ PROCESSOR_MAPPING_NAMES = OrderedDict(
[
("align", "AlignProcessor"),
("altclip", "AltCLIPProcessor"),
("bark", "BarkProcessor"),
("blip", "BlipProcessor"),
("blip-2", "Blip2Processor"),
("bridgetower", "BridgeTowerProcessor"),

View File

@@ -0,0 +1,79 @@
# Copyright 2023 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from typing import TYPE_CHECKING
from ...utils import (
OptionalDependencyNotAvailable,
_LazyModule,
is_torch_available,
)
_import_structure = {
"configuration_bark": [
"BARK_PRETRAINED_CONFIG_ARCHIVE_MAP",
"BarkCoarseConfig",
"BarkConfig",
"BarkFineConfig",
"BarkSemanticConfig",
],
"processing_bark": ["BarkProcessor"],
}
try:
if not is_torch_available():
raise OptionalDependencyNotAvailable()
except OptionalDependencyNotAvailable:
pass
else:
_import_structure["modeling_bark"] = [
"BARK_PRETRAINED_MODEL_ARCHIVE_LIST",
"BarkFineModel",
"BarkSemanticModel",
"BarkCoarseModel",
"BarkModel",
"BarkPreTrainedModel",
"BarkCausalModel",
]
if TYPE_CHECKING:
from .configuration_bark import (
BARK_PRETRAINED_CONFIG_ARCHIVE_MAP,
BarkCoarseConfig,
BarkConfig,
BarkFineConfig,
BarkSemanticConfig,
)
from .processing_bark import BarkProcessor
try:
if not is_torch_available():
raise OptionalDependencyNotAvailable()
except OptionalDependencyNotAvailable:
pass
else:
from .modeling_bark import (
BARK_PRETRAINED_MODEL_ARCHIVE_LIST,
BarkCausalModel,
BarkCoarseModel,
BarkFineModel,
BarkModel,
BarkPreTrainedModel,
BarkSemanticModel,
)
else:
import sys
sys.modules[__name__] = _LazyModule(__name__, globals()["__file__"], _import_structure, module_spec=__spec__)

View File

@@ -0,0 +1,348 @@
# coding=utf-8
# Copyright 2023 The Suno AI Authors and The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" BARK model configuration"""
import copy
import os
from typing import Dict, Optional, Union
from ...configuration_utils import PretrainedConfig
from ...utils import add_start_docstrings, logging
from ..auto import AutoConfig
logger = logging.get_logger(__name__)
BARK_PRETRAINED_CONFIG_ARCHIVE_MAP = {
"suno/bark-small": "https://huggingface.co/suno/bark-small/resolve/main/config.json",
"suno/bark": "https://huggingface.co/suno/bark/resolve/main/config.json",
}
BARK_SUBMODELCONFIG_START_DOCSTRING = """
This is the configuration class to store the configuration of a [`{model}`]. It is used to instantiate the model
according to the specified arguments, defining the model architecture. Instantiating a configuration with the
defaults will yield a similar configuration to that of the Bark [suno/bark](https://huggingface.co/suno/bark)
architecture.
Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
documentation from [`PretrainedConfig`] for more information.
Args:
block_size (`int`, *optional*, defaults to 1024):
The maximum sequence length that this model might ever be used with. Typically set this to something large
just in case (e.g., 512 or 1024 or 2048).
input_vocab_size (`int`, *optional*, defaults to 10_048):
Vocabulary size of a Bark sub-model. Defines the number of different tokens that can be represented by the
`inputs_ids` passed when calling [`{model}`]. Defaults to 10_048 but should be carefully thought with
regards to the chosen sub-model.
output_vocab_size (`int`, *optional*, defaults to 10_048):
Output vocabulary size of a Bark sub-model. Defines the number of different tokens that can be represented
by the: `output_ids` when passing forward a [`{model}`]. Defaults to 10_048 but should be carefully thought
with regards to the chosen sub-model.
num_layers (`int`, *optional*, defaults to 12):
Number of hidden layers in the given sub-model.
num_heads (`int`, *optional*, defaults to 12):
Number of attention heads for each attention layer in the Transformer architecture.
hidden_size (`int`, *optional*, defaults to 768):
Dimensionality of the "intermediate" (often named feed-forward) layer in the architecture.
dropout (`float`, *optional*, defaults to 0.0):
The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
bias (`bool`, *optional*, defaults to `True`):
Whether or not to use bias in the linear layers and layer norm layers.
initializer_range (`float`, *optional*, defaults to 0.02):
The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
use_cache (`bool`, *optional*, defaults to `True`):
Whether or not the model should return the last key/values attentions (not used by all models).
"""
class BarkSubModelConfig(PretrainedConfig):
model_type = "bark_module"
keys_to_ignore_at_inference = ["past_key_values"]
attribute_map = {
"num_attention_heads": "num_heads",
"num_hidden_layers": "num_layers",
"vocab_size": "input_vocab_size",
"window_size": "block_size",
}
def __init__(
self,
block_size=1024,
input_vocab_size=10_048,
output_vocab_size=10_048,
num_layers=12,
num_heads=12,
hidden_size=768,
dropout=0.0,
bias=True, # True: bias in Linears and LayerNorms, like GPT-2. False: a bit better and faster
initializer_range=0.02,
use_cache=True,
**kwargs,
):
self.block_size = block_size
self.input_vocab_size = input_vocab_size
self.output_vocab_size = output_vocab_size
self.num_layers = num_layers
self.num_heads = num_heads
self.hidden_size = hidden_size
self.dropout = dropout
self.bias = bias
self.use_cache = use_cache
self.initializer_range = initializer_range
super().__init__(**kwargs)
@classmethod
def from_pretrained(
cls,
pretrained_model_name_or_path: Union[str, os.PathLike],
cache_dir: Optional[Union[str, os.PathLike]] = None,
force_download: bool = False,
local_files_only: bool = False,
token: Optional[Union[str, bool]] = None,
revision: str = "main",
**kwargs,
) -> "PretrainedConfig":
kwargs["cache_dir"] = cache_dir
kwargs["force_download"] = force_download
kwargs["local_files_only"] = local_files_only
kwargs["revision"] = revision
cls._set_token_in_kwargs(kwargs, token)
config_dict, kwargs = cls.get_config_dict(pretrained_model_name_or_path, **kwargs)
# get the config dict if we are loading from Bark
if config_dict.get("model_type") == "bark":
config_dict = config_dict[f"{cls.model_type}_config"]
if "model_type" in config_dict and hasattr(cls, "model_type") and config_dict["model_type"] != cls.model_type:
logger.warning(
f"You are using a model of type {config_dict['model_type']} to instantiate a model of type "
f"{cls.model_type}. This is not supported for all configurations of models and can yield errors."
)
return cls.from_dict(config_dict, **kwargs)
@add_start_docstrings(
BARK_SUBMODELCONFIG_START_DOCSTRING.format(config="BarkSemanticConfig", model="BarkSemanticModel"),
"""
Example:
```python
>>> from transformers import BarkSemanticConfig, BarkSemanticModel
>>> # Initializing a Bark sub-module style configuration
>>> configuration = BarkSemanticConfig()
>>> # Initializing a model (with random weights) from the suno/bark style configuration
>>> model = BarkSemanticModel(configuration)
>>> # Accessing the model configuration
>>> configuration = model.config
```""",
)
class BarkSemanticConfig(BarkSubModelConfig):
model_type = "semantic"
@add_start_docstrings(
BARK_SUBMODELCONFIG_START_DOCSTRING.format(config="BarkCoarseConfig", model="BarkCoarseModel"),
"""
Example:
```python
>>> from transformers import BarkCoarseConfig, BarkCoarseModel
>>> # Initializing a Bark sub-module style configuration
>>> configuration = BarkCoarseConfig()
>>> # Initializing a model (with random weights) from the suno/bark style configuration
>>> model = BarkCoarseModel(configuration)
>>> # Accessing the model configuration
>>> configuration = model.config
```""",
)
class BarkCoarseConfig(BarkSubModelConfig):
model_type = "coarse_acoustics"
@add_start_docstrings(
BARK_SUBMODELCONFIG_START_DOCSTRING.format(config="BarkFineConfig", model="BarkFineModel"),
"""
n_codes_total (`int`, *optional*, defaults to 8):
The total number of audio codebooks predicted. Used in the fine acoustics sub-model.
n_codes_given (`int`, *optional*, defaults to 1):
The number of audio codebooks predicted in the coarse acoustics sub-model. Used in the acoustics
sub-models.
Example:
```python
>>> from transformers import BarkFineConfig, BarkFineModel
>>> # Initializing a Bark sub-module style configuration
>>> configuration = BarkFineConfig()
>>> # Initializing a model (with random weights) from the suno/bark style configuration
>>> model = BarkFineModel(configuration)
>>> # Accessing the model configuration
>>> configuration = model.config
```""",
)
class BarkFineConfig(BarkSubModelConfig):
model_type = "fine_acoustics"
def __init__(self, tie_word_embeddings=True, n_codes_total=8, n_codes_given=1, **kwargs):
self.n_codes_total = n_codes_total
self.n_codes_given = n_codes_given
super().__init__(tie_word_embeddings=tie_word_embeddings, **kwargs)
class BarkConfig(PretrainedConfig):
"""
This is the configuration class to store the configuration of a [`BarkModel`]. It is used to instantiate a Bark
model according to the specified sub-models configurations, defining the model architecture.
Instantiating a configuration with the defaults will yield a similar configuration to that of the Bark
[suno/bark](https://huggingface.co/suno/bark) architecture.
Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
documentation from [`PretrainedConfig`] for more information.
Args:
semantic_config ([`BarkSemanticConfig`], *optional*):
Configuration of the underlying semantic sub-model.
coarse_acoustics_config ([`BarkCoarseConfig`], *optional*):
Configuration of the underlying coarse acoustics sub-model.
fine_acoustics_config ([`BarkFineConfig`], *optional*):
Configuration of the underlying fine acoustics sub-model.
codec_config ([`AutoConfig`], *optional*):
Configuration of the underlying codec sub-model.
Example:
```python
>>> from transformers import (
... BarkSemanticConfig,
... BarkCoarseConfig,
... BarkFineConfig,
... BarkModel,
... BarkConfig,
... AutoConfig,
... )
>>> # Initializing Bark sub-modules configurations.
>>> semantic_config = BarkSemanticConfig()
>>> coarse_acoustics_config = BarkCoarseConfig()
>>> fine_acoustics_config = BarkFineConfig()
>>> codec_config = AutoConfig.from_pretrained("facebook/encodec_24khz")
>>> # Initializing a Bark module style configuration
>>> configuration = BarkConfig.from_sub_model_configs(
... semantic_config, coarse_acoustics_config, fine_acoustics_config, codec_config
... )
>>> # Initializing a model (with random weights)
>>> model = BarkModel(configuration)
>>> # Accessing the model configuration
>>> configuration = model.config
```
"""
model_type = "bark"
is_composition = True
def __init__(
self,
semantic_config: Dict = None,
coarse_acoustics_config: Dict = None,
fine_acoustics_config: Dict = None,
codec_config: Dict = None,
initializer_range=0.02,
**kwargs,
):
if semantic_config is None:
semantic_config = {}
logger.info("semantic_config is None. initializing the semantic model with default values.")
if coarse_acoustics_config is None:
coarse_acoustics_config = {}
logger.info("coarse_acoustics_config is None. initializing the coarse model with default values.")
if fine_acoustics_config is None:
fine_acoustics_config = {}
logger.info("fine_acoustics_config is None. initializing the fine model with default values.")
if codec_config is None:
codec_config = {}
logger.info("codec_config is None. initializing the codec model with default values.")
self.semantic_config = BarkSemanticConfig(**semantic_config)
self.coarse_acoustics_config = BarkCoarseConfig(**coarse_acoustics_config)
self.fine_acoustics_config = BarkFineConfig(**fine_acoustics_config)
self.codec_config = AutoConfig.for_model(**codec_config)
self.initializer_range = initializer_range
super().__init__(**kwargs)
@classmethod
def from_sub_model_configs(
cls,
semantic_config: BarkSemanticConfig,
coarse_acoustics_config: BarkCoarseConfig,
fine_acoustics_config: BarkFineConfig,
codec_config: AutoConfig,
**kwargs,
):
r"""
Instantiate a [`BarkConfig`] (or a derived class) from bark sub-models configuration.
Returns:
[`BarkConfig`]: An instance of a configuration object
"""
return cls(
semantic_config=semantic_config.to_dict(),
coarse_acoustics_config=coarse_acoustics_config.to_dict(),
fine_acoustics_config=fine_acoustics_config.to_dict(),
codec_config=codec_config.to_dict(),
**kwargs,
)
def to_dict(self):
"""
Serializes this instance to a Python dictionary. Override the default [`~PretrainedConfig.to_dict`].
Returns:
`Dict[str, any]`: Dictionary of all the attributes that make up this configuration instance,
"""
output = copy.deepcopy(self.__dict__)
output["semantic_config"] = self.semantic_config.to_dict()
output["coarse_acoustics_config"] = self.coarse_acoustics_config.to_dict()
output["fine_acoustics_config"] = self.fine_acoustics_config.to_dict()
output["codec_config"] = self.codec_config.to_dict()
output["model_type"] = self.__class__.model_type
return output

View File

@@ -0,0 +1,262 @@
"""Convert Bark checkpoint."""
import argparse
import os
from pathlib import Path
import torch
from bark.generation import _load_model as _bark_load_model
from huggingface_hub import hf_hub_download
from transformers import EncodecConfig, EncodecModel, set_seed
from transformers.models.bark.configuration_bark import (
BarkCoarseConfig,
BarkConfig,
BarkFineConfig,
BarkSemanticConfig,
)
from transformers.models.bark.generation_configuration_bark import (
BarkCoarseGenerationConfig,
BarkFineGenerationConfig,
BarkGenerationConfig,
BarkSemanticGenerationConfig,
)
from transformers.models.bark.modeling_bark import BarkCoarseModel, BarkFineModel, BarkModel, BarkSemanticModel
from transformers.utils import logging
logging.set_verbosity_info()
logger = logging.get_logger(__name__)
set_seed(770)
new_layer_name_dict = {
"c_attn": "att_proj",
"c_proj": "out_proj",
"c_fc": "in_proj",
"transformer.": "",
"h.": "layers.",
"ln_1": "layernorm_1",
"ln_2": "layernorm_2",
"ln_f": "layernorm_final",
"wpe": "position_embeds_layer",
"wte": "input_embeds_layer",
}
REMOTE_MODEL_PATHS = {
"text_small": {
"repo_id": "suno/bark",
"file_name": "text.pt",
},
"coarse_small": {
"repo_id": "suno/bark",
"file_name": "coarse.pt",
},
"fine_small": {
"repo_id": "suno/bark",
"file_name": "fine.pt",
},
"text": {
"repo_id": "suno/bark",
"file_name": "text_2.pt",
},
"coarse": {
"repo_id": "suno/bark",
"file_name": "coarse_2.pt",
},
"fine": {
"repo_id": "suno/bark",
"file_name": "fine_2.pt",
},
}
CUR_PATH = os.path.dirname(os.path.abspath(__file__))
default_cache_dir = os.path.join(os.path.expanduser("~"), ".cache")
CACHE_DIR = os.path.join(os.getenv("XDG_CACHE_HOME", default_cache_dir), "suno", "bark_v0")
def _get_ckpt_path(model_type, use_small=False):
key = model_type
if use_small:
key += "_small"
return os.path.join(CACHE_DIR, REMOTE_MODEL_PATHS[key]["file_name"])
def _download(from_hf_path, file_name):
os.makedirs(CACHE_DIR, exist_ok=True)
hf_hub_download(repo_id=from_hf_path, filename=file_name, local_dir=CACHE_DIR)
def _load_model(ckpt_path, device, use_small=False, model_type="text"):
if model_type == "text":
ModelClass = BarkSemanticModel
ConfigClass = BarkSemanticConfig
GenerationConfigClass = BarkSemanticGenerationConfig
elif model_type == "coarse":
ModelClass = BarkCoarseModel
ConfigClass = BarkCoarseConfig
GenerationConfigClass = BarkCoarseGenerationConfig
elif model_type == "fine":
ModelClass = BarkFineModel
ConfigClass = BarkFineConfig
GenerationConfigClass = BarkFineGenerationConfig
else:
raise NotImplementedError()
model_key = f"{model_type}_small" if use_small else model_type
model_info = REMOTE_MODEL_PATHS[model_key]
if not os.path.exists(ckpt_path):
logger.info(f"{model_type} model not found, downloading into `{CACHE_DIR}`.")
_download(model_info["repo_id"], model_info["file_name"])
checkpoint = torch.load(ckpt_path, map_location=device)
# this is a hack
model_args = checkpoint["model_args"]
if "input_vocab_size" not in model_args:
model_args["input_vocab_size"] = model_args["vocab_size"]
model_args["output_vocab_size"] = model_args["vocab_size"]
del model_args["vocab_size"]
# convert Bark model arguments to HF Bark model arguments
model_args["num_heads"] = model_args.pop("n_head")
model_args["hidden_size"] = model_args.pop("n_embd")
model_args["num_layers"] = model_args.pop("n_layer")
model_config = ConfigClass(**checkpoint["model_args"])
model = ModelClass(config=model_config)
model_generation_config = GenerationConfigClass()
model.generation_config = model_generation_config
state_dict = checkpoint["model"]
# fixup checkpoint
unwanted_prefix = "_orig_mod."
for k, v in list(state_dict.items()):
if k.startswith(unwanted_prefix):
# replace part of the key with corresponding layer name in HF implementation
new_k = k[len(unwanted_prefix) :]
for old_layer_name in new_layer_name_dict:
new_k = new_k.replace(old_layer_name, new_layer_name_dict[old_layer_name])
state_dict[new_k] = state_dict.pop(k)
extra_keys = set(state_dict.keys()) - set(model.state_dict().keys())
extra_keys = {k for k in extra_keys if not k.endswith(".attn.bias")}
missing_keys = set(model.state_dict().keys()) - set(state_dict.keys())
missing_keys = {k for k in missing_keys if not k.endswith(".attn.bias")}
if len(extra_keys) != 0:
raise ValueError(f"extra keys found: {extra_keys}")
if len(missing_keys) != 0:
raise ValueError(f"missing keys: {missing_keys}")
model.load_state_dict(state_dict, strict=False)
n_params = model.num_parameters(exclude_embeddings=True)
val_loss = checkpoint["best_val_loss"].item()
logger.info(f"model loaded: {round(n_params/1e6,1)}M params, {round(val_loss,3)} loss")
model.eval()
model.to(device)
del checkpoint, state_dict
return model
def load_model(pytorch_dump_folder_path, use_small=False, model_type="text"):
if model_type not in ("text", "coarse", "fine"):
raise NotImplementedError()
device = "cpu" # do conversion on cpu
ckpt_path = _get_ckpt_path(model_type, use_small=use_small)
model = _load_model(ckpt_path, device, model_type=model_type, use_small=use_small)
# load bark initial model
bark_model = _bark_load_model(ckpt_path, "cpu", model_type=model_type, use_small=use_small)
if model_type == "text":
bark_model = bark_model["model"]
if model.num_parameters(exclude_embeddings=True) != bark_model.get_num_params():
raise ValueError("initial and new models don't have the same number of parameters")
# check if same output as the bark model
batch_size = 5
sequence_length = 10
if model_type in ["text", "coarse"]:
vec = torch.randint(256, (batch_size, sequence_length), dtype=torch.int)
output_old_model = bark_model(vec)[0]
output_new_model_total = model(vec)
# take last logits
output_new_model = output_new_model_total.logits[:, [-1], :]
else:
prediction_codeboook_channel = 3
n_codes_total = 8
vec = torch.randint(256, (batch_size, sequence_length, n_codes_total), dtype=torch.int)
output_new_model_total = model(prediction_codeboook_channel, vec)
output_old_model = bark_model(prediction_codeboook_channel, vec)
output_new_model = output_new_model_total.logits
# output difference should come from the difference of self-attention implementation design
if output_new_model.shape != output_old_model.shape:
raise ValueError("initial and new outputs don't have the same shape")
if (output_new_model - output_old_model).abs().max().item() > 1e-3:
raise ValueError("initial and new outputs are not equal")
Path(pytorch_dump_folder_path).mkdir(exist_ok=True)
model.save_pretrained(pytorch_dump_folder_path)
def load_whole_bark_model(
semantic_path,
coarse_path,
fine_path,
append_text,
hub_path,
folder_path,
):
pytorch_dump_folder_path = os.path.join(folder_path, append_text)
semanticConfig = BarkSemanticConfig.from_pretrained(os.path.join(semantic_path, "config.json"))
coarseAcousticConfig = BarkCoarseConfig.from_pretrained(os.path.join(coarse_path, "config.json"))
fineAcousticConfig = BarkFineConfig.from_pretrained(os.path.join(fine_path, "config.json"))
codecConfig = EncodecConfig.from_pretrained("facebook/encodec_24khz")
semantic = BarkSemanticModel.from_pretrained(semantic_path)
coarseAcoustic = BarkCoarseModel.from_pretrained(coarse_path)
fineAcoustic = BarkFineModel.from_pretrained(fine_path)
codec = EncodecModel.from_pretrained("facebook/encodec_24khz")
bark_config = BarkConfig.from_sub_model_configs(
semanticConfig, coarseAcousticConfig, fineAcousticConfig, codecConfig
)
bark_generation_config = BarkGenerationConfig.from_sub_model_configs(
semantic.generation_config, coarseAcoustic.generation_config, fineAcoustic.generation_config
)
bark = BarkModel(bark_config)
bark.semantic = semantic
bark.coarse_acoustics = coarseAcoustic
bark.fine_acoustics = fineAcoustic
bark.codec_model = codec
bark.generation_config = bark_generation_config
Path(pytorch_dump_folder_path).mkdir(exist_ok=True)
bark.save_pretrained(pytorch_dump_folder_path, repo_id=hub_path, push_to_hub=True)
if __name__ == "__main__":
parser = argparse.ArgumentParser()
# Required parameters
parser.add_argument("model_type", type=str, help="text, coarse or fine.")
parser.add_argument("pytorch_dump_folder_path", default=None, type=str, help="Path to the output PyTorch model.")
parser.add_argument("--is_small", action="store_true", help="convert the small version instead of the large.")
args = parser.parse_args()
load_model(args.pytorch_dump_folder_path, model_type=args.model_type, use_small=args.is_small)

View File

@@ -0,0 +1,318 @@
# coding=utf-8
# Copyright 2023 The Suno AI Authors and The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" BARK model generation configuration"""
import copy
from typing import Dict
from ...generation.configuration_utils import GenerationConfig
from ...utils import logging
logger = logging.get_logger(__name__)
class BarkSemanticGenerationConfig(GenerationConfig):
model_type = "semantic"
def __init__(
self,
eos_token_id=10_000,
renormalize_logits=True,
max_new_tokens=768,
output_scores=False,
return_dict_in_generate=False,
output_hidden_states=False,
output_attentions=False,
temperature=0.7,
do_sample=True,
text_encoding_offset=10_048,
text_pad_token=129_595,
semantic_infer_token=129_599,
semantic_vocab_size=10_000,
max_input_semantic_length=256,
semantic_rate_hz=49.9,
**kwargs,
):
"""Class that holds a generation configuration for [`BarkSemanticModel`].
This configuration inherit from [`GenerationConfig`] and can be used to control the model generation. Read the
documentation from [`GenerationConfig`] for more information.
Args:
eos_token_id (`int`, *optional*, defaults to 10_000):
The id of the *end-of-sequence* token.
renormalize_logits (`bool`, *optional*, defaults to `True`):
Whether to renormalize the logits after applying all the logits processors or warpers (including the
custom ones). It's highly recommended to set this flag to `True` as the search algorithms suppose the
score logits are normalized but some logit processors or warpers break the normalization.
max_new_tokens (`int`, *optional*, defaults to 768):
The maximum numbers of tokens to generate, ignoring the number of tokens in the prompt.
output_scores (`bool`, *optional*, defaults to `False`):
Whether or not to return the prediction scores. See `scores` under returned tensors for more details.
return_dict_in_generate (`bool`, *optional*, defaults to `False`):
Whether or not to return a [`~utils.ModelOutput`] instead of a plain tuple.
output_hidden_states (`bool`, *optional*, defaults to `False`):
Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors
for more details.
output_attentions (`bool`, *optional*, defaults to `False`):
Whether or not to return the attentions tensors of all attention layers. See `attentions` under
returned tensors for more details.
temperature (`float`, *optional*, defaults to 0.7):
The value used to modulate the next token probabilities.
do_sample (`bool`, *optional*, defaults to `True`):
Whether or not to use sampling ; use greedy decoding otherwise.
text_encoding_offset (`int`, *optional*, defaults to 10_048):
Text encoding offset.
text_pad_token (`int`, *optional*, defaults to 129_595):
Text pad token.
semantic_infer_token (`int`, *optional*, defaults to 129_599):
Semantic infer token.
semantic_vocab_size (`int`, *optional*, defaults to 10_000):
Semantic vocab size.
max_input_semantic_length (`int`, *optional*, defaults to 256):
Max length of semantic input vector.
semantic_rate_hz (`float`, *optional*, defaults to 49.9):
Semantic rate in Hertz.
"""
super().__init__(
temperature=temperature,
do_sample=do_sample,
eos_token_id=eos_token_id,
renormalize_logits=renormalize_logits,
max_new_tokens=max_new_tokens,
output_scores=output_scores,
return_dict_in_generate=return_dict_in_generate,
output_hidden_states=output_hidden_states,
output_attentions=output_attentions,
**kwargs,
)
self.text_encoding_offset = text_encoding_offset
self.text_pad_token = text_pad_token
self.semantic_pad_token = eos_token_id
self.semantic_infer_token = semantic_infer_token
self.semantic_vocab_size = semantic_vocab_size
self.max_input_semantic_length = max_input_semantic_length
self.semantic_rate_hz = semantic_rate_hz
class BarkCoarseGenerationConfig(GenerationConfig):
model_type = "coarse_acoustics"
def __init__(
self,
renormalize_logits=True,
output_scores=False,
return_dict_in_generate=False,
output_hidden_states=False,
output_attentions=False,
temperature=0.7,
do_sample=True,
coarse_semantic_pad_token=12_048,
coarse_rate_hz=75,
n_coarse_codebooks=2,
coarse_infer_token=12_050,
max_coarse_input_length=256,
max_coarse_history: int = 630,
sliding_window_len: int = 60,
**kwargs,
):
"""Class that holds a generation configuration for [`BarkCoarseModel`].
This configuration inherit from [`GenerationConfig`] and can be used to control the model generation. Read the
documentation from [`GenerationConfig`] for more information.
Args:
renormalize_logits (`bool`, *optional*, defaults to `True`):
Whether to renormalize the logits after applying all the logits processors or warpers (including the
custom ones). It's highly recommended to set this flag to `True` as the search algorithms suppose the
score logits are normalized but some logit processors or warpers break the normalization.
output_scores (`bool`, *optional*, defaults to `False`):
Whether or not to return the prediction scores. See `scores` under returned tensors for more details.
return_dict_in_generate (`bool`, *optional*, defaults to `False`):
Whether or not to return a [`~utils.ModelOutput`] instead of a plain tuple.
output_hidden_states (`bool`, *optional*, defaults to `False`):
Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors
for more details.
output_attentions (`bool`, *optional*, defaults to `False`):
Whether or not to return the attentions tensors of all attention layers. See `attentions` under
returned tensors for more details.
temperature (`float`, *optional*, defaults to 0.7):
The value used to modulate the next token probabilities.
do_sample (`bool`, *optional*, defaults to `True`):
Whether or not to use sampling ; use greedy decoding otherwise.
coarse_semantic_pad_token (`int`, *optional*, defaults to 12_048):
Coarse semantic pad token.
coarse_rate_hz (`int`, *optional*, defaults to 75):
Coarse rate in Hertz.
n_coarse_codebooks (`int`, *optional*, defaults to 2):
Number of coarse codebooks.
coarse_infer_token (`int`, *optional*, defaults to 12_050):
Coarse infer token.
max_coarse_input_length (`int`, *optional*, defaults to 256):
Max length of input coarse vector.
max_coarse_history (`int`, *optional*, defaults to 630):
Max length of the output of the coarse acoustics model used in the fine generation step.
sliding_window_len (`int`, *optional*, defaults to 60):
The coarse generation step uses a sliding window to generate raw audio.
"""
super().__init__(
temperature=temperature,
do_sample=do_sample,
renormalize_logits=renormalize_logits,
output_scores=output_scores,
return_dict_in_generate=return_dict_in_generate,
output_hidden_states=output_hidden_states,
output_attentions=output_attentions,
**kwargs,
)
self.coarse_semantic_pad_token = coarse_semantic_pad_token
self.coarse_rate_hz = coarse_rate_hz
self.n_coarse_codebooks = n_coarse_codebooks
self.coarse_infer_token = coarse_infer_token
self.max_coarse_input_length = max_coarse_input_length
self.max_coarse_history = max_coarse_history
self.sliding_window_len = sliding_window_len
class BarkFineGenerationConfig(GenerationConfig):
model_type = "fine_acoustics"
def __init__(
self,
temperature=0.5,
max_fine_history_length=512,
max_fine_input_length=1024,
n_fine_codebooks=8,
**kwargs,
):
"""Class that holds a generation configuration for [`BarkFineModel`].
[`BarkFineModel`] is an autoencoder model, so should not usually be used for generation. However, under the
hood, it uses `temperature` when used by [`BarkModel`]
This configuration inherit from [`GenerationConfig`] and can be used to control the model generation. Read the
documentation from [`GenerationConfig`] for more information.
Args:
temperature (`float`, *optional*, defaults to 0.5):
The value used to modulate the next token probabilities.
max_fine_history_length (`int`, *optional*, defaults to 512):
Max length of the fine history vector.
max_fine_input_length (`int`, *optional*, defaults to 1024):
Max length of fine input vector.
n_fine_codebooks (`int`, *optional*, defaults to 8):
Number of codebooks used.
"""
super().__init__(temperature=temperature)
self.max_fine_history_length = max_fine_history_length
self.max_fine_input_length = max_fine_input_length
self.n_fine_codebooks = n_fine_codebooks
class BarkGenerationConfig(GenerationConfig):
model_type = "bark"
is_composition = True
# TODO (joao): nested from_dict
def __init__(
self,
semantic_config: Dict = None,
coarse_acoustics_config: Dict = None,
fine_acoustics_config: Dict = None,
sample_rate=24_000,
codebook_size=1024,
**kwargs,
):
"""Class that holds a generation configuration for [`BarkModel`].
The [`BarkModel`] does not have a `generate` method, but uses this class to generate speeches with a nested
[`BarkGenerationConfig`] which uses [`BarkSemanticGenerationConfig`], [`BarkCoarseGenerationConfig`],
[`BarkFineGenerationConfig`].
This configuration inherit from [`GenerationConfig`] and can be used to control the model generation. Read the
documentation from [`GenerationConfig`] for more information.
Args:
semantic_config (`Dict`, *optional*):
Semantic generation configuration.
coarse_acoustics_config (`Dict`, *optional*):
Coarse generation configuration.
fine_acoustics_config (`Dict`, *optional*):
Fine generation configuration.
sample_rate (`int`, *optional*, defaults to 24_000):
Sample rate.
codebook_size (`int`, *optional*, defaults to 1024):
Vector length for each codebook.
"""
if semantic_config is None:
semantic_config = {}
logger.info("semantic_config is None. initializing the semantic model with default values.")
if coarse_acoustics_config is None:
coarse_acoustics_config = {}
logger.info("coarse_acoustics_config is None. initializing the coarse model with default values.")
if fine_acoustics_config is None:
fine_acoustics_config = {}
logger.info("fine_acoustics_config is None. initializing the fine model with default values.")
self.semantic_config = BarkSemanticGenerationConfig(**semantic_config)
self.coarse_acoustics_config = BarkCoarseGenerationConfig(**coarse_acoustics_config)
self.fine_acoustics_config = BarkFineGenerationConfig(**fine_acoustics_config)
self.sample_rate = sample_rate
self.codebook_size = codebook_size
@classmethod
def from_sub_model_configs(
cls,
semantic_config: BarkSemanticGenerationConfig,
coarse_acoustics_config: BarkCoarseGenerationConfig,
fine_acoustics_config: BarkFineGenerationConfig,
**kwargs,
):
r"""
Instantiate a [`BarkGenerationConfig`] (or a derived class) from bark sub-models generation configuration.
Returns:
[`BarkGenerationConfig`]: An instance of a configuration object
"""
return cls(
semantic_config=semantic_config.to_dict(),
coarse_acoustics_config=coarse_acoustics_config.to_dict(),
fine_acoustics_config=fine_acoustics_config.to_dict(),
**kwargs,
)
def to_dict(self):
"""
Serializes this instance to a Python dictionary. Override the default [`~PretrainedConfig.to_dict`].
Returns:
`Dict[str, any]`: Dictionary of all the attributes that make up this configuration instance,
"""
output = copy.deepcopy(self.__dict__)
output["semantic_config"] = self.semantic_config.to_dict()
output["coarse_acoustics_config"] = self.coarse_acoustics_config.to_dict()
output["fine_acoustics_config"] = self.fine_acoustics_config.to_dict()
output["model_type"] = self.__class__.model_type
return output

File diff suppressed because it is too large Load Diff

View File

@@ -0,0 +1,286 @@
# coding=utf-8
# Copyright 2023 The Suno AI Authors and The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Processor class for Bark
"""
import json
import os
from typing import Optional
import numpy as np
from ...feature_extraction_utils import BatchFeature
from ...processing_utils import ProcessorMixin
from ...utils import logging
from ...utils.hub import get_file_from_repo
from ..auto import AutoTokenizer
logger = logging.get_logger(__name__)
class BarkProcessor(ProcessorMixin):
r"""
Constructs a Bark processor which wraps a text tokenizer and optional Bark voice presets into a single processor.
Args:
tokenizer ([`PreTrainedTokenizer`]):
An instance of [`PreTrainedTokenizer`].
speaker_embeddings (`Dict[Dict[str]]`, *optional*, defaults to `None`):
Optional nested speaker embeddings dictionary. The first level contains voice preset names (e.g
`"en_speaker_4"`). The second level contains `"semantic_prompt"`, `"coarse_prompt"` and `"fine_prompt"`
embeddings. The values correspond to the path of the corresponding `np.ndarray`. See
[here](https://suno-ai.notion.site/8b8e8749ed514b0cbf3f699013548683?v=bc67cff786b04b50b3ceb756fd05f68c) for
a list of `voice_preset_names`.
"""
tokenizer_class = "AutoTokenizer"
attributes = ["tokenizer"]
preset_shape = {
"semantic_prompt": 1,
"coarse_prompt": 2,
"fine_prompt": 2,
}
def __init__(self, tokenizer, speaker_embeddings=None):
super().__init__(tokenizer)
self.speaker_embeddings = speaker_embeddings
@classmethod
def from_pretrained(
cls, pretrained_processor_name_or_path, speaker_embeddings_dict_path="speaker_embeddings_path.json", **kwargs
):
r"""
Instantiate a Bark processor associated with a pretrained model.
Args:
pretrained_model_name_or_path (`str` or `os.PathLike`):
This can be either:
- a string, the *model id* of a pretrained [`BarkProcessor`] hosted inside a model repo on
huggingface.co. Valid model ids can be located at the root-level, like `bert-base-uncased`, or
namespaced under a user or organization name, like `dbmdz/bert-base-german-cased`.
- a path to a *directory* containing a processor saved using the [`~BarkProcessor.save_pretrained`]
method, e.g., `./my_model_directory/`.
speaker_embeddings_dict_path (`str`, *optional*, defaults to `"speaker_embeddings_path.json"`):
The name of the `.json` file containing the speaker_embeddings dictionnary located in
`pretrained_model_name_or_path`. If `None`, no speaker_embeddings is loaded.
**kwargs
Additional keyword arguments passed along to both
[`~tokenization_utils_base.PreTrainedTokenizer.from_pretrained`].
"""
if speaker_embeddings_dict_path is not None:
speaker_embeddings_path = get_file_from_repo(
pretrained_processor_name_or_path,
speaker_embeddings_dict_path,
subfolder=kwargs.pop("subfolder", None),
cache_dir=kwargs.pop("cache_dir", None),
force_download=kwargs.pop("force_download", False),
proxies=kwargs.pop("proxies", None),
resume_download=kwargs.pop("resume_download", False),
local_files_only=kwargs.pop("local_files_only", False),
use_auth_token=kwargs.pop("use_auth_token", None),
revision=kwargs.pop("revision", None),
)
if speaker_embeddings_path is None:
logger.warning(
f"""`{os.path.join(pretrained_processor_name_or_path,speaker_embeddings_dict_path)}` does not exists
, no preloaded speaker embeddings will be used - Make sure to provide a correct path to the json
dictionnary if wanted, otherwise set `speaker_embeddings_dict_path=None`."""
)
speaker_embeddings = None
else:
with open(speaker_embeddings_path) as speaker_embeddings_json:
speaker_embeddings = json.load(speaker_embeddings_json)
else:
speaker_embeddings = None
tokenizer = AutoTokenizer.from_pretrained(pretrained_processor_name_or_path, **kwargs)
return cls(tokenizer=tokenizer, speaker_embeddings=speaker_embeddings)
def save_pretrained(
self,
save_directory,
speaker_embeddings_dict_path="speaker_embeddings_path.json",
speaker_embeddings_directory="speaker_embeddings",
push_to_hub: bool = False,
**kwargs,
):
"""
Saves the attributes of this processor (tokenizer...) in the specified directory so that it can be reloaded
using the [`~BarkProcessor.from_pretrained`] method.
Args:
save_directory (`str` or `os.PathLike`):
Directory where the tokenizer files and the speaker embeddings will be saved (directory will be created
if it does not exist).
speaker_embeddings_dict_path (`str`, *optional*, defaults to `"speaker_embeddings_path.json"`):
The name of the `.json` file that will contains the speaker_embeddings nested path dictionnary, if it
exists, and that will be located in `pretrained_model_name_or_path/speaker_embeddings_directory`.
speaker_embeddings_directory (`str`, *optional*, defaults to `"speaker_embeddings/"`):
The name of the folder in which the speaker_embeddings arrays will be saved.
push_to_hub (`bool`, *optional*, defaults to `False`):
Whether or not to push your model to the Hugging Face model hub after saving it. You can specify the
repository you want to push to with `repo_id` (will default to the name of `save_directory` in your
namespace).
kwargs:
Additional key word arguments passed along to the [`~utils.PushToHubMixin.push_to_hub`] method.
"""
if self.speaker_embeddings is not None:
os.makedirs(os.path.join(save_directory, speaker_embeddings_directory, "v2"), exist_ok=True)
embeddings_dict = {}
embeddings_dict["repo_or_path"] = save_directory
for prompt_key in self.speaker_embeddings:
if prompt_key != "repo_or_path":
voice_preset = self._load_voice_preset(prompt_key)
tmp_dict = {}
for key in self.speaker_embeddings[prompt_key]:
np.save(
os.path.join(
embeddings_dict["repo_or_path"], speaker_embeddings_directory, f"{prompt_key}_{key}"
),
voice_preset[key],
allow_pickle=False,
)
tmp_dict[key] = os.path.join(speaker_embeddings_directory, f"{prompt_key}_{key}.npy")
embeddings_dict[prompt_key] = tmp_dict
with open(os.path.join(save_directory, speaker_embeddings_dict_path), "w") as fp:
json.dump(embeddings_dict, fp)
super().save_pretrained(save_directory, push_to_hub, **kwargs)
def _load_voice_preset(self, voice_preset: str = None, **kwargs):
voice_preset_paths = self.speaker_embeddings[voice_preset]
voice_preset_dict = {}
for key in ["semantic_prompt", "coarse_prompt", "fine_prompt"]:
if key not in voice_preset_paths:
raise ValueError(
f"Voice preset unrecognized, missing {key} as a key in self.speaker_embeddings[{voice_preset}]."
)
path = get_file_from_repo(
self.speaker_embeddings.get("repo_or_path", "/"),
voice_preset_paths[key],
subfolder=kwargs.pop("subfolder", None),
cache_dir=kwargs.pop("cache_dir", None),
force_download=kwargs.pop("force_download", False),
proxies=kwargs.pop("proxies", None),
resume_download=kwargs.pop("resume_download", False),
local_files_only=kwargs.pop("local_files_only", False),
use_auth_token=kwargs.pop("use_auth_token", None),
revision=kwargs.pop("revision", None),
)
if path is None:
raise ValueError(
f"""`{os.path.join(self.speaker_embeddings.get("repo_or_path", "/"),voice_preset_paths[key])}` does not exists
, no preloaded voice preset will be used - Make sure to provide correct paths to the {voice_preset}
embeddings."""
)
voice_preset_dict[key] = np.load(path)
return voice_preset_dict
def _validate_voice_preset_dict(self, voice_preset: Optional[dict] = None):
for key in ["semantic_prompt", "coarse_prompt", "fine_prompt"]:
if key not in voice_preset:
raise ValueError(f"Voice preset unrecognized, missing {key} as a key.")
if not isinstance(voice_preset[key], np.ndarray):
raise ValueError(f"{key} voice preset must be a {str(self.preset_shape[key])}D ndarray.")
if len(voice_preset[key].shape) != self.preset_shape[key]:
raise ValueError(f"{key} voice preset must be a {str(self.preset_shape[key])}D ndarray.")
def __call__(
self,
text=None,
voice_preset=None,
return_tensors="pt",
max_length=256,
add_special_tokens=False,
return_attention_mask=True,
return_token_type_ids=False,
**kwargs,
):
"""
Main method to prepare for the model one or several sequences(s). This method forwards the `text` and `kwargs`
arguments to the AutoTokenizer's [`~AutoTokenizer.__call__`] to encode the text. The method also proposes a
voice preset which is a dictionary of arrays that conditions `Bark`'s output. `kwargs` arguments are forwarded
to the tokenizer and to `cached_file` method if `voice_preset` is a valid filename.
Args:
text (`str`, `List[str]`, `List[List[str]]`):
The sequence or batch of sequences to be encoded. Each sequence can be a string or a list of strings
(pretokenized string). If the sequences are provided as list of strings (pretokenized), you must set
`is_split_into_words=True` (to lift the ambiguity with a batch of sequences).
voice_preset (`str`, `Dict[np.ndarray]`):
The voice preset, i.e the speaker embeddings. It can either be a valid voice_preset name, e.g
`"en_speaker_1"`, or directly a dictionnary of `np.ndarray` embeddings for each submodel of `Bark`. Or
it can be a valid file name of a local `.npz` single voice preset.
return_tensors (`str` or [`~utils.TensorType`], *optional*):
If set, will return tensors of a particular framework. Acceptable values are:
- `'pt'`: Return PyTorch `torch.Tensor` objects.
- `'np'`: Return NumPy `np.ndarray` objects.
Returns:
Tuple([`BatchEncoding`], [`BatchFeature`]): A tuple composed of a [`BatchEncoding`], i.e the output of the
`tokenizer` and a [`BatchFeature`], i.e the voice preset with the right tensors type.
"""
if voice_preset is not None and not isinstance(voice_preset, dict):
if (
isinstance(voice_preset, str)
and self.speaker_embeddings is not None
and voice_preset in self.speaker_embeddings
):
voice_preset = self._load_voice_preset(voice_preset)
else:
if isinstance(voice_preset, str) and not voice_preset.endswith(".npz"):
voice_preset = voice_preset + ".npz"
voice_preset = np.load(voice_preset)
if voice_preset is not None:
self._validate_voice_preset_dict(voice_preset, **kwargs)
voice_preset = BatchFeature(data=voice_preset, tensor_type=return_tensors)
encoded_text = self.tokenizer(
text,
return_tensors=return_tensors,
padding="max_length",
max_length=max_length,
return_attention_mask=return_attention_mask,
return_token_type_ids=return_token_type_ids,
add_special_tokens=add_special_tokens,
**kwargs,
)
if voice_preset is not None:
encoded_text["history_prompt"] = voice_preset
return encoded_text

View File

@@ -816,6 +816,51 @@ class AutoformerPreTrainedModel(metaclass=DummyObject):
requires_backends(self, ["torch"])
BARK_PRETRAINED_MODEL_ARCHIVE_LIST = None
class BarkCausalModel(metaclass=DummyObject):
_backends = ["torch"]
def __init__(self, *args, **kwargs):
requires_backends(self, ["torch"])
class BarkCoarseModel(metaclass=DummyObject):
_backends = ["torch"]
def __init__(self, *args, **kwargs):
requires_backends(self, ["torch"])
class BarkFineModel(metaclass=DummyObject):
_backends = ["torch"]
def __init__(self, *args, **kwargs):
requires_backends(self, ["torch"])
class BarkModel(metaclass=DummyObject):
_backends = ["torch"]
def __init__(self, *args, **kwargs):
requires_backends(self, ["torch"])
class BarkPreTrainedModel(metaclass=DummyObject):
_backends = ["torch"]
def __init__(self, *args, **kwargs):
requires_backends(self, ["torch"])
class BarkSemanticModel(metaclass=DummyObject):
_backends = ["torch"]
def __init__(self, *args, **kwargs):
requires_backends(self, ["torch"])
BART_PRETRAINED_MODEL_ARCHIVE_LIST = None

View File

View File

@@ -0,0 +1,991 @@
# coding=utf-8
# Copyright 2023 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" Testing suite for the PyTorch Bark model. """
import copy
import inspect
import tempfile
import unittest
from transformers import (
BarkCoarseConfig,
BarkFineConfig,
BarkSemanticConfig,
is_torch_available,
)
from transformers.models.bark.generation_configuration_bark import (
BarkCoarseGenerationConfig,
BarkFineGenerationConfig,
BarkSemanticGenerationConfig,
)
from transformers.testing_utils import require_torch, slow, torch_device
from transformers.utils import cached_property
from ...generation.test_utils import GenerationTesterMixin
from ...test_configuration_common import ConfigTester
from ...test_modeling_common import ModelTesterMixin, ids_tensor, random_attention_mask
if is_torch_available():
import torch
from transformers import (
BarkCausalModel,
BarkCoarseModel,
BarkFineModel,
BarkModel,
BarkProcessor,
BarkSemanticModel,
)
class BarkSemanticModelTester:
def __init__(
self,
parent,
batch_size=2,
seq_length=4,
is_training=False, # for now training is not supported
use_input_mask=True,
use_labels=True,
vocab_size=33,
output_vocab_size=33,
hidden_size=16,
num_hidden_layers=2,
num_attention_heads=2,
intermediate_size=15,
dropout=0.1,
window_size=256,
initializer_range=0.02,
n_codes_total=8, # for BarkFineModel
n_codes_given=1, # for BarkFineModel
config_class=None,
model_class=None,
):
self.parent = parent
self.batch_size = batch_size
self.seq_length = seq_length
self.is_training = is_training
self.use_input_mask = use_input_mask
self.use_labels = use_labels
self.vocab_size = vocab_size
self.output_vocab_size = output_vocab_size
self.hidden_size = hidden_size
self.num_hidden_layers = num_hidden_layers
self.num_attention_heads = num_attention_heads
self.intermediate_size = intermediate_size
self.dropout = dropout
self.window_size = window_size
self.initializer_range = initializer_range
self.bos_token_id = output_vocab_size - 1
self.eos_token_id = output_vocab_size - 1
self.pad_token_id = output_vocab_size - 1
self.n_codes_total = n_codes_total
self.n_codes_given = n_codes_given
self.is_encoder_decoder = False
self.config_class = config_class
self.model_class = model_class
def prepare_config_and_inputs(self):
input_ids = ids_tensor([self.batch_size, self.seq_length], self.vocab_size)
input_mask = None
if self.use_input_mask:
input_mask = random_attention_mask([self.batch_size, self.seq_length])
config = self.get_config()
head_mask = ids_tensor([self.num_hidden_layers, self.num_attention_heads], 2)
inputs_dict = {
"input_ids": input_ids,
"head_mask": head_mask,
"attention_mask": input_mask,
}
return config, inputs_dict
def get_config(self):
return self.config_class(
vocab_size=self.vocab_size,
output_vocab_size=self.output_vocab_size,
hidden_size=self.hidden_size,
num_layers=self.num_hidden_layers,
num_heads=self.num_attention_heads,
use_cache=True,
bos_token_id=self.bos_token_id,
eos_token_id=self.eos_token_id,
pad_token_id=self.pad_token_id,
window_size=self.window_size,
)
def get_pipeline_config(self):
config = self.get_config()
config.vocab_size = 300
return config
def prepare_config_and_inputs_for_common(self):
config, inputs_dict = self.prepare_config_and_inputs()
return config, inputs_dict
def create_and_check_decoder_model_past_large_inputs(self, config, inputs_dict):
model = self.model_class(config=config).to(torch_device).eval()
input_ids = inputs_dict["input_ids"]
attention_mask = inputs_dict["attention_mask"]
# first forward pass
outputs = model(input_ids, attention_mask=attention_mask, use_cache=True)
output, past_key_values = outputs.to_tuple()
# create hypothetical multiple next token and extent to next_input_ids
next_tokens = ids_tensor((self.batch_size, 3), config.vocab_size)
next_attn_mask = ids_tensor((self.batch_size, 3), 2)
# append to next input_ids and
next_input_ids = torch.cat([input_ids, next_tokens], dim=-1)
next_attention_mask = torch.cat([attention_mask, next_attn_mask], dim=-1)
output_from_no_past = model(next_input_ids, attention_mask=next_attention_mask)["logits"]
output_from_past = model(next_tokens, attention_mask=next_attention_mask, past_key_values=past_key_values)[
"logits"
]
# select random slice
random_slice_idx = ids_tensor((1,), output_from_past.shape[-1]).item()
output_from_no_past_slice = output_from_no_past[:, -3:, random_slice_idx].detach()
output_from_past_slice = output_from_past[:, :, random_slice_idx].detach()
self.parent.assertTrue(output_from_past_slice.shape[1] == next_tokens.shape[1])
# test that outputs are equal for slice
self.parent.assertTrue(torch.allclose(output_from_past_slice, output_from_no_past_slice, atol=1e-3))
# test no attention_mask works
outputs = model(input_ids, use_cache=True)
_, past_key_values = outputs.to_tuple()
output_from_no_past = model(next_input_ids)["logits"]
output_from_past = model(next_tokens, past_key_values=past_key_values)["logits"]
random_slice_idx = ids_tensor((1,), output_from_past.shape[-1]).item()
output_from_no_past_slice = output_from_no_past[:, -3:, random_slice_idx].detach()
output_from_past_slice = output_from_past[:, :, random_slice_idx].detach()
# test that outputs are equal for slice
self.parent.assertTrue(torch.allclose(output_from_past_slice, output_from_no_past_slice, atol=1e-3))
class BarkCoarseModelTester:
def __init__(
self,
parent,
batch_size=2,
seq_length=4,
is_training=False, # for now training is not supported
use_input_mask=True,
use_labels=True,
vocab_size=33,
output_vocab_size=33,
hidden_size=16,
num_hidden_layers=2,
num_attention_heads=2,
intermediate_size=15,
dropout=0.1,
window_size=256,
initializer_range=0.02,
n_codes_total=8, # for BarkFineModel
n_codes_given=1, # for BarkFineModel
config_class=None,
model_class=None,
):
self.parent = parent
self.batch_size = batch_size
self.seq_length = seq_length
self.is_training = is_training
self.use_input_mask = use_input_mask
self.use_labels = use_labels
self.vocab_size = vocab_size
self.output_vocab_size = output_vocab_size
self.hidden_size = hidden_size
self.num_hidden_layers = num_hidden_layers
self.num_attention_heads = num_attention_heads
self.intermediate_size = intermediate_size
self.dropout = dropout
self.window_size = window_size
self.initializer_range = initializer_range
self.bos_token_id = output_vocab_size - 1
self.eos_token_id = output_vocab_size - 1
self.pad_token_id = output_vocab_size - 1
self.n_codes_total = n_codes_total
self.n_codes_given = n_codes_given
self.is_encoder_decoder = False
self.config_class = config_class
self.model_class = model_class
def prepare_config_and_inputs(self):
input_ids = ids_tensor([self.batch_size, self.seq_length], self.vocab_size)
input_mask = None
if self.use_input_mask:
input_mask = random_attention_mask([self.batch_size, self.seq_length])
config = self.get_config()
head_mask = ids_tensor([self.num_hidden_layers, self.num_attention_heads], 2)
inputs_dict = {
"input_ids": input_ids,
"head_mask": head_mask,
"attention_mask": input_mask,
}
return config, inputs_dict
def get_config(self):
return self.config_class(
vocab_size=self.vocab_size,
output_vocab_size=self.output_vocab_size,
hidden_size=self.hidden_size,
num_layers=self.num_hidden_layers,
num_heads=self.num_attention_heads,
use_cache=True,
bos_token_id=self.bos_token_id,
eos_token_id=self.eos_token_id,
pad_token_id=self.pad_token_id,
window_size=self.window_size,
)
def get_pipeline_config(self):
config = self.get_config()
config.vocab_size = 300
return config
def prepare_config_and_inputs_for_common(self):
config, inputs_dict = self.prepare_config_and_inputs()
return config, inputs_dict
def create_and_check_decoder_model_past_large_inputs(self, config, inputs_dict):
model = self.model_class(config=config).to(torch_device).eval()
input_ids = inputs_dict["input_ids"]
attention_mask = inputs_dict["attention_mask"]
# first forward pass
outputs = model(input_ids, attention_mask=attention_mask, use_cache=True)
output, past_key_values = outputs.to_tuple()
# create hypothetical multiple next token and extent to next_input_ids
next_tokens = ids_tensor((self.batch_size, 3), config.vocab_size)
next_attn_mask = ids_tensor((self.batch_size, 3), 2)
# append to next input_ids and
next_input_ids = torch.cat([input_ids, next_tokens], dim=-1)
next_attention_mask = torch.cat([attention_mask, next_attn_mask], dim=-1)
output_from_no_past = model(next_input_ids, attention_mask=next_attention_mask)["logits"]
output_from_past = model(next_tokens, attention_mask=next_attention_mask, past_key_values=past_key_values)[
"logits"
]
# select random slice
random_slice_idx = ids_tensor((1,), output_from_past.shape[-1]).item()
output_from_no_past_slice = output_from_no_past[:, -3:, random_slice_idx].detach()
output_from_past_slice = output_from_past[:, :, random_slice_idx].detach()
self.parent.assertTrue(output_from_past_slice.shape[1] == next_tokens.shape[1])
# test that outputs are equal for slice
self.parent.assertTrue(torch.allclose(output_from_past_slice, output_from_no_past_slice, atol=1e-3))
# test no attention_mask works
outputs = model(input_ids, use_cache=True)
_, past_key_values = outputs.to_tuple()
output_from_no_past = model(next_input_ids)["logits"]
output_from_past = model(next_tokens, past_key_values=past_key_values)["logits"]
random_slice_idx = ids_tensor((1,), output_from_past.shape[-1]).item()
output_from_no_past_slice = output_from_no_past[:, -3:, random_slice_idx].detach()
output_from_past_slice = output_from_past[:, :, random_slice_idx].detach()
# test that outputs are equal for slice
self.parent.assertTrue(torch.allclose(output_from_past_slice, output_from_no_past_slice, atol=1e-3))
class BarkFineModelTester:
def __init__(
self,
parent,
batch_size=2,
seq_length=4,
is_training=False, # for now training is not supported
use_input_mask=True,
use_labels=True,
vocab_size=33,
output_vocab_size=33,
hidden_size=16,
num_hidden_layers=2,
num_attention_heads=2,
intermediate_size=15,
dropout=0.1,
window_size=256,
initializer_range=0.02,
n_codes_total=8, # for BarkFineModel
n_codes_given=1, # for BarkFineModel
config_class=None,
model_class=None,
):
self.parent = parent
self.batch_size = batch_size
self.seq_length = seq_length
self.is_training = is_training
self.use_input_mask = use_input_mask
self.use_labels = use_labels
self.vocab_size = vocab_size
self.output_vocab_size = output_vocab_size
self.hidden_size = hidden_size
self.num_hidden_layers = num_hidden_layers
self.num_attention_heads = num_attention_heads
self.intermediate_size = intermediate_size
self.dropout = dropout
self.window_size = window_size
self.initializer_range = initializer_range
self.bos_token_id = output_vocab_size - 1
self.eos_token_id = output_vocab_size - 1
self.pad_token_id = output_vocab_size - 1
self.n_codes_total = n_codes_total
self.n_codes_given = n_codes_given
self.is_encoder_decoder = False
self.config_class = config_class
self.model_class = model_class
def prepare_config_and_inputs(self):
input_ids = ids_tensor([self.batch_size, self.seq_length, self.n_codes_total], self.vocab_size)
input_mask = None
if self.use_input_mask:
input_mask = random_attention_mask([self.batch_size, self.seq_length])
config = self.get_config()
head_mask = ids_tensor([self.num_hidden_layers, self.num_attention_heads], 2)
# randint between self.n_codes_given - 1 and self.n_codes_total - 1
codebook_idx = ids_tensor((1,), self.n_codes_total - self.n_codes_given).item() + self.n_codes_given
inputs_dict = {
"codebook_idx": codebook_idx,
"input_ids": input_ids,
"head_mask": head_mask,
"attention_mask": input_mask,
}
return config, inputs_dict
def get_config(self):
return self.config_class(
vocab_size=self.vocab_size,
output_vocab_size=self.output_vocab_size,
hidden_size=self.hidden_size,
num_layers=self.num_hidden_layers,
num_heads=self.num_attention_heads,
use_cache=True,
bos_token_id=self.bos_token_id,
eos_token_id=self.eos_token_id,
pad_token_id=self.pad_token_id,
window_size=self.window_size,
)
def get_pipeline_config(self):
config = self.get_config()
config.vocab_size = 300
return config
def prepare_config_and_inputs_for_common(self):
config, inputs_dict = self.prepare_config_and_inputs()
return config, inputs_dict
def create_and_check_decoder_model_past_large_inputs(self, config, inputs_dict):
model = self.model_class(config=config).to(torch_device).eval()
input_ids = inputs_dict["input_ids"]
attention_mask = inputs_dict["attention_mask"]
# first forward pass
outputs = model(input_ids, attention_mask=attention_mask, use_cache=True)
output, past_key_values = outputs.to_tuple()
# create hypothetical multiple next token and extent to next_input_ids
next_tokens = ids_tensor((self.batch_size, 3), config.vocab_size)
next_attn_mask = ids_tensor((self.batch_size, 3), 2)
# append to next input_ids and
next_input_ids = torch.cat([input_ids, next_tokens], dim=-1)
next_attention_mask = torch.cat([attention_mask, next_attn_mask], dim=-1)
output_from_no_past = model(next_input_ids, attention_mask=next_attention_mask)["logits"]
output_from_past = model(next_tokens, attention_mask=next_attention_mask, past_key_values=past_key_values)[
"logits"
]
# select random slice
random_slice_idx = ids_tensor((1,), output_from_past.shape[-1]).item()
output_from_no_past_slice = output_from_no_past[:, -3:, random_slice_idx].detach()
output_from_past_slice = output_from_past[:, :, random_slice_idx].detach()
self.parent.assertTrue(output_from_past_slice.shape[1] == next_tokens.shape[1])
# test that outputs are equal for slice
self.parent.assertTrue(torch.allclose(output_from_past_slice, output_from_no_past_slice, atol=1e-3))
# test no attention_mask works
outputs = model(input_ids, use_cache=True)
_, past_key_values = outputs.to_tuple()
output_from_no_past = model(next_input_ids)["logits"]
output_from_past = model(next_tokens, past_key_values=past_key_values)["logits"]
random_slice_idx = ids_tensor((1,), output_from_past.shape[-1]).item()
output_from_no_past_slice = output_from_no_past[:, -3:, random_slice_idx].detach()
output_from_past_slice = output_from_past[:, :, random_slice_idx].detach()
# test that outputs are equal for slice
self.parent.assertTrue(torch.allclose(output_from_past_slice, output_from_no_past_slice, atol=1e-3))
@require_torch
class BarkSemanticModelTest(ModelTesterMixin, GenerationTesterMixin, unittest.TestCase):
all_model_classes = (BarkSemanticModel,) if is_torch_available() else ()
all_generative_model_classes = (BarkCausalModel,) if is_torch_available() else ()
is_encoder_decoder = False
fx_compatible = False
test_missing_keys = False
test_pruning = False
test_model_parallel = False
# no model_parallel for now
test_resize_embeddings = True
def setUp(self):
self.model_tester = BarkSemanticModelTester(
self, config_class=BarkSemanticConfig, model_class=BarkSemanticModel
)
self.config_tester = ConfigTester(self, config_class=BarkSemanticConfig, n_embd=37)
def test_config(self):
self.config_tester.run_common_tests()
def test_save_load_strict(self):
config, inputs_dict = self.model_tester.prepare_config_and_inputs()
for model_class in self.all_model_classes:
model = model_class(config)
with tempfile.TemporaryDirectory() as tmpdirname:
model.save_pretrained(tmpdirname)
model2, info = model_class.from_pretrained(tmpdirname, output_loading_info=True)
self.assertEqual(info["missing_keys"], [])
def test_decoder_model_past_with_large_inputs(self):
config_and_inputs = self.model_tester.prepare_config_and_inputs()
self.model_tester.create_and_check_decoder_model_past_large_inputs(*config_and_inputs)
def test_inputs_embeds(self):
config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()
for model_class in self.all_model_classes:
model = model_class(config)
model.to(torch_device)
model.eval()
inputs = copy.deepcopy(self._prepare_for_class(inputs_dict, model_class))
input_ids = inputs["input_ids"]
del inputs["input_ids"]
wte = model.get_input_embeddings()
inputs["input_embeds"] = wte(input_ids)
with torch.no_grad():
model(**inputs)[0]
def test_generate_fp16(self):
config, input_dict = self.model_tester.prepare_config_and_inputs()
input_ids = input_dict["input_ids"]
attention_mask = input_ids.ne(1).to(torch_device)
model = self.all_generative_model_classes[0](config).eval().to(torch_device)
if torch_device == "cuda":
model.half()
model.generate(input_ids, attention_mask=attention_mask)
model.generate(num_beams=4, do_sample=True, early_stopping=False, num_return_sequences=3)
@require_torch
class BarkCoarseModelTest(ModelTesterMixin, GenerationTesterMixin, unittest.TestCase):
# Same tester as BarkSemanticModelTest, except for model_class and config_class
all_model_classes = (BarkCoarseModel,) if is_torch_available() else ()
all_generative_model_classes = (BarkCausalModel,) if is_torch_available() else ()
is_encoder_decoder = False
fx_compatible = False
test_missing_keys = False
test_pruning = False
test_model_parallel = False
# no model_parallel for now
test_resize_embeddings = True
def setUp(self):
self.model_tester = BarkCoarseModelTester(self, config_class=BarkCoarseConfig, model_class=BarkCoarseModel)
self.config_tester = ConfigTester(self, config_class=BarkCoarseConfig, n_embd=37)
def test_config(self):
self.config_tester.run_common_tests()
def test_save_load_strict(self):
config, inputs_dict = self.model_tester.prepare_config_and_inputs()
for model_class in self.all_model_classes:
model = model_class(config)
with tempfile.TemporaryDirectory() as tmpdirname:
model.save_pretrained(tmpdirname)
model2, info = model_class.from_pretrained(tmpdirname, output_loading_info=True)
self.assertEqual(info["missing_keys"], [])
def test_decoder_model_past_with_large_inputs(self):
config_and_inputs = self.model_tester.prepare_config_and_inputs()
self.model_tester.create_and_check_decoder_model_past_large_inputs(*config_and_inputs)
def test_inputs_embeds(self):
config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()
for model_class in self.all_model_classes:
model = model_class(config)
model.to(torch_device)
model.eval()
inputs = copy.deepcopy(self._prepare_for_class(inputs_dict, model_class))
input_ids = inputs["input_ids"]
del inputs["input_ids"]
wte = model.get_input_embeddings()
inputs["input_embeds"] = wte(input_ids)
with torch.no_grad():
model(**inputs)[0]
def test_generate_fp16(self):
config, input_dict = self.model_tester.prepare_config_and_inputs()
input_ids = input_dict["input_ids"]
attention_mask = input_ids.ne(1).to(torch_device)
model = self.all_generative_model_classes[0](config).eval().to(torch_device)
if torch_device == "cuda":
model.half()
model.generate(input_ids, attention_mask=attention_mask)
model.generate(num_beams=4, do_sample=True, early_stopping=False, num_return_sequences=3)
@require_torch
class BarkFineModelTest(ModelTesterMixin, unittest.TestCase):
all_model_classes = (BarkFineModel,) if is_torch_available() else ()
is_encoder_decoder = False
fx_compatible = False
test_missing_keys = False
test_pruning = False
# no model_parallel for now
test_model_parallel = False
# torchscript disabled for now because forward with an int
test_torchscript = False
test_resize_embeddings = True
def setUp(self):
self.model_tester = BarkFineModelTester(self, config_class=BarkFineConfig, model_class=BarkFineModel)
self.config_tester = ConfigTester(self, config_class=BarkFineConfig, n_embd=37)
def test_config(self):
self.config_tester.run_common_tests()
def test_save_load_strict(self):
config, inputs_dict = self.model_tester.prepare_config_and_inputs()
for model_class in self.all_model_classes:
model = model_class(config)
with tempfile.TemporaryDirectory() as tmpdirname:
model.save_pretrained(tmpdirname)
model2, info = model_class.from_pretrained(tmpdirname, output_loading_info=True)
self.assertEqual(info["missing_keys"], [])
def test_inputs_embeds(self):
config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()
for model_class in self.all_model_classes:
model = model_class(config)
model.to(torch_device)
model.eval()
inputs = copy.deepcopy(self._prepare_for_class(inputs_dict, model_class))
input_ids = inputs["input_ids"]
del inputs["input_ids"]
wte = model.get_input_embeddings()[inputs_dict["codebook_idx"]]
inputs["input_embeds"] = wte(input_ids[:, :, inputs_dict["codebook_idx"]])
with torch.no_grad():
model(**inputs)[0]
def test_generate_fp16(self):
config, input_dict = self.model_tester.prepare_config_and_inputs()
input_ids = input_dict["input_ids"]
# take first codebook channel
model = self.all_model_classes[0](config).eval().to(torch_device)
if torch_device == "cuda":
model.half()
# toy generation_configs
semantic_generation_config = BarkSemanticGenerationConfig(semantic_vocab_size=0)
coarse_generation_config = BarkCoarseGenerationConfig(n_coarse_codebooks=config.n_codes_given)
fine_generation_config = BarkFineGenerationConfig(
max_fine_history_length=config.block_size // 2,
max_fine_input_length=config.block_size,
n_fine_codebooks=config.n_codes_total,
)
codebook_size = config.vocab_size - 1
model.generate(
input_ids,
history_prompt=None,
temperature=None,
semantic_generation_config=semantic_generation_config,
coarse_generation_config=coarse_generation_config,
fine_generation_config=fine_generation_config,
codebook_size=codebook_size,
)
model.generate(
input_ids,
history_prompt=None,
temperature=0.7,
semantic_generation_config=semantic_generation_config,
coarse_generation_config=coarse_generation_config,
fine_generation_config=fine_generation_config,
codebook_size=codebook_size,
)
def test_forward_signature(self):
config, _ = self.model_tester.prepare_config_and_inputs_for_common()
for model_class in self.all_model_classes:
model = model_class(config)
signature = inspect.signature(model.forward)
# signature.parameters is an OrderedDict => so arg_names order is deterministic
arg_names = [*signature.parameters.keys()]
expected_arg_names = ["codebook_idx", "input_ids"]
self.assertListEqual(arg_names[:2], expected_arg_names)
def test_model_common_attributes(self):
# one embedding layer per codebook
config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()
for model_class in self.all_model_classes:
model = model_class(config)
self.assertIsInstance(model.get_input_embeddings()[0], (torch.nn.Embedding))
model.set_input_embeddings(
torch.nn.ModuleList([torch.nn.Embedding(10, 10) for _ in range(config.n_codes_total)])
)
x = model.get_output_embeddings()
self.assertTrue(x is None or isinstance(x[0], torch.nn.Linear))
def test_resize_tokens_embeddings(self):
# resizing tokens_embeddings of a ModuleList
original_config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()
if not self.test_resize_embeddings:
return
for model_class in self.all_model_classes:
config = copy.deepcopy(original_config)
model = model_class(config)
model.to(torch_device)
if self.model_tester.is_training is False:
model.eval()
model_vocab_size = config.vocab_size
# Retrieve the embeddings and clone theme
model_embed_list = model.resize_token_embeddings(model_vocab_size)
cloned_embeddings_list = [model_embed.weight.clone() for model_embed in model_embed_list]
# Check that resizing the token embeddings with a larger vocab size increases the model's vocab size
model_embed_list = model.resize_token_embeddings(model_vocab_size + 10)
self.assertEqual(model.config.vocab_size, model_vocab_size + 10)
# Check that it actually resizes the embeddings matrix for each codebook
for model_embed, cloned_embeddings in zip(model_embed_list, cloned_embeddings_list):
self.assertEqual(model_embed.weight.shape[0], cloned_embeddings.shape[0] + 10)
# Check that the model can still do a forward pass successfully (every parameter should be resized)
model(**self._prepare_for_class(inputs_dict, model_class))
# Check that resizing the token embeddings with a smaller vocab size decreases the model's vocab size
model_embed_list = model.resize_token_embeddings(model_vocab_size - 15)
self.assertEqual(model.config.vocab_size, model_vocab_size - 15)
for model_embed, cloned_embeddings in zip(model_embed_list, cloned_embeddings_list):
self.assertEqual(model_embed.weight.shape[0], cloned_embeddings.shape[0] - 15)
# Check that the model can still do a forward pass successfully (every parameter should be resized)
# Input ids should be clamped to the maximum size of the vocabulary
inputs_dict["input_ids"].clamp_(max=model_vocab_size - 15 - 1)
model(**self._prepare_for_class(inputs_dict, model_class))
# Check that adding and removing tokens has not modified the first part of the embedding matrix.
# only check for the first embedding matrix
models_equal = True
for p1, p2 in zip(cloned_embeddings_list[0], model_embed_list[0].weight):
if p1.data.ne(p2.data).sum() > 0:
models_equal = False
self.assertTrue(models_equal)
def test_resize_embeddings_untied(self):
# resizing tokens_embeddings of a ModuleList
original_config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()
if not self.test_resize_embeddings:
return
original_config.tie_word_embeddings = False
for model_class in self.all_model_classes:
config = copy.deepcopy(original_config)
model = model_class(config).to(torch_device)
# if no output embeddings -> leave test
if model.get_output_embeddings() is None:
continue
# Check that resizing the token embeddings with a larger vocab size increases the model's vocab size
model_vocab_size = config.vocab_size
model.resize_token_embeddings(model_vocab_size + 10)
self.assertEqual(model.config.vocab_size, model_vocab_size + 10)
output_embeds_list = model.get_output_embeddings()
for output_embeds in output_embeds_list:
self.assertEqual(output_embeds.weight.shape[0], model_vocab_size + 10)
# Check bias if present
if output_embeds.bias is not None:
self.assertEqual(output_embeds.bias.shape[0], model_vocab_size + 10)
# Check that the model can still do a forward pass successfully (every parameter should be resized)
model(**self._prepare_for_class(inputs_dict, model_class))
# Check that resizing the token embeddings with a smaller vocab size decreases the model's vocab size
model.resize_token_embeddings(model_vocab_size - 15)
self.assertEqual(model.config.vocab_size, model_vocab_size - 15)
# Check that it actually resizes the embeddings matrix
output_embeds_list = model.get_output_embeddings()
for output_embeds in output_embeds_list:
self.assertEqual(output_embeds.weight.shape[0], model_vocab_size - 15)
# Check bias if present
if output_embeds.bias is not None:
self.assertEqual(output_embeds.bias.shape[0], model_vocab_size - 15)
# Check that the model can still do a forward pass successfully (every parameter should be resized)
# Input ids should be clamped to the maximum size of the vocabulary
inputs_dict["input_ids"].clamp_(max=model_vocab_size - 15 - 1)
# Check that the model can still do a forward pass successfully (every parameter should be resized)
model(**self._prepare_for_class(inputs_dict, model_class))
@require_torch
class BarkModelIntegrationTests(unittest.TestCase):
@cached_property
def model(self):
return BarkModel.from_pretrained("ylacombe/bark-large").to(torch_device)
@cached_property
def processor(self):
return BarkProcessor.from_pretrained("ylacombe/bark-large")
@cached_property
def inputs(self):
input_ids = self.processor("In the light of the moon, a little egg lay on a leaf", voice_preset="en_speaker_6")
input_ids = input_ids.to(torch_device)
return input_ids
@cached_property
def semantic_generation_config(self):
semantic_generation_config = BarkSemanticGenerationConfig(**self.model.generation_config.semantic_config)
return semantic_generation_config
@cached_property
def coarse_generation_config(self):
coarse_generation_config = BarkCoarseGenerationConfig(**self.model.generation_config.coarse_acoustics_config)
return coarse_generation_config
@cached_property
def fine_generation_config(self):
fine_generation_config = BarkFineGenerationConfig(**self.model.generation_config.fine_acoustics_config)
return fine_generation_config
@slow
def test_generate_semantic(self):
input_ids = self.inputs
# fmt: off
# check first ids
expected_output_ids = [7363, 321, 41, 1461, 6915, 952, 326, 41, 41, 927,]
# fmt: on
# greedy decoding
with torch.no_grad():
output_ids = self.model.semantic.generate(
**input_ids,
do_sample=False,
semantic_generation_config=self.semantic_generation_config,
)
self.assertListEqual(output_ids[0, : len(expected_output_ids)].tolist(), expected_output_ids)
@slow
def test_generate_coarse(self):
input_ids = self.inputs
history_prompt = input_ids["history_prompt"]
# fmt: off
# check first ids
expected_output_ids = [11018, 11391, 10651, 11418, 10857, 11620, 10642, 11366, 10312, 11528, 10531, 11516, 10474, 11051, 10524, 11051, ]
# fmt: on
with torch.no_grad():
output_ids = self.model.semantic.generate(
**input_ids,
do_sample=False,
semantic_generation_config=self.semantic_generation_config,
)
output_ids = self.model.coarse_acoustics.generate(
output_ids,
history_prompt=history_prompt,
do_sample=False,
semantic_generation_config=self.semantic_generation_config,
coarse_generation_config=self.coarse_generation_config,
codebook_size=self.model.generation_config.codebook_size,
)
self.assertListEqual(output_ids[0, : len(expected_output_ids)].tolist(), expected_output_ids)
@slow
def test_generate_fine(self):
input_ids = self.inputs
history_prompt = input_ids["history_prompt"]
# fmt: off
expected_output_ids = [
[1018, 651, 857, 642, 312, 531, 474, 524, 524, 776,],
[367, 394, 596, 342, 504, 492, 27, 27, 822, 822,],
[961, 955, 221, 955, 955, 686, 939, 939, 479, 176,],
[638, 365, 218, 944, 853, 363, 639, 22, 884, 456,],
[302, 912, 524, 38, 174, 209, 879, 23, 910, 227,],
[440, 673, 861, 666, 372, 558, 49, 172, 232, 342,],
[244, 358, 123, 356, 586, 520, 499, 877, 542, 637,],
[806, 685, 905, 848, 803, 810, 921, 208, 625, 203,],
]
# fmt: on
with torch.no_grad():
output_ids = self.model.semantic.generate(
**input_ids,
do_sample=False,
semantic_generation_config=self.semantic_generation_config,
)
output_ids = self.model.coarse_acoustics.generate(
output_ids,
history_prompt=history_prompt,
do_sample=False,
semantic_generation_config=self.semantic_generation_config,
coarse_generation_config=self.coarse_generation_config,
codebook_size=self.model.generation_config.codebook_size,
)
# greedy decoding
output_ids = self.model.fine_acoustics.generate(
output_ids,
history_prompt=history_prompt,
temperature=None,
semantic_generation_config=self.semantic_generation_config,
coarse_generation_config=self.coarse_generation_config,
fine_generation_config=self.fine_generation_config,
codebook_size=self.model.generation_config.codebook_size,
)
self.assertListEqual(output_ids[0, :, : len(expected_output_ids[0])].tolist(), expected_output_ids)
@slow
def test_generate_end_to_end(self):
input_ids = self.inputs
with torch.no_grad():
self.model.generate(**input_ids)
self.model.generate(**{key: val for (key, val) in input_ids.items() if key != "history_prompt"})
@slow
def test_generate_end_to_end_with_args(self):
input_ids = self.inputs
with torch.no_grad():
self.model.generate(**input_ids, do_sample=True, temperature=0.6, penalty_alpha=0.6)
self.model.generate(**input_ids, do_sample=True, temperature=0.6, num_beams=4)
@slow
def test_generate_end_to_end_with_sub_models_args(self):
input_ids = self.inputs
with torch.no_grad():
self.model.generate(**input_ids, do_sample=False, coarse_do_sample=True, coarse_temperature=0.7)
self.model.generate(
**input_ids, do_sample=False, coarse_do_sample=True, coarse_temperature=0.7, fine_temperature=0.3
)
self.model.generate(
**input_ids,
do_sample=True,
temperature=0.6,
penalty_alpha=0.6,
semantic_temperature=0.9,
coarse_temperature=0.2,
fine_temperature=0.1,
)

View File

@@ -0,0 +1,127 @@
# Copyright 2023 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import os
import shutil
import tempfile
import unittest
import numpy as np
from transformers import AutoTokenizer, BarkProcessor
from transformers.testing_utils import require_torch, slow
@require_torch
class BarkProcessorTest(unittest.TestCase):
def setUp(self):
self.checkpoint = "ylacombe/bark-small"
self.tmpdirname = tempfile.mkdtemp()
self.voice_preset = "en_speaker_1"
self.input_string = "This is a test string"
self.speaker_embeddings_dict_path = "speaker_embeddings_path.json"
self.speaker_embeddings_directory = "speaker_embeddings"
def get_tokenizer(self, **kwargs):
return AutoTokenizer.from_pretrained(self.checkpoint, **kwargs)
def tearDown(self):
shutil.rmtree(self.tmpdirname)
def test_save_load_pretrained_default(self):
tokenizer = self.get_tokenizer()
processor = BarkProcessor(tokenizer=tokenizer)
processor.save_pretrained(self.tmpdirname)
processor = BarkProcessor.from_pretrained(self.tmpdirname)
self.assertEqual(processor.tokenizer.get_vocab(), tokenizer.get_vocab())
@slow
def test_save_load_pretrained_additional_features(self):
processor = BarkProcessor.from_pretrained(
pretrained_processor_name_or_path=self.checkpoint,
speaker_embeddings_dict_path=self.speaker_embeddings_dict_path,
)
processor.save_pretrained(
self.tmpdirname,
speaker_embeddings_dict_path=self.speaker_embeddings_dict_path,
speaker_embeddings_directory=self.speaker_embeddings_directory,
)
tokenizer_add_kwargs = self.get_tokenizer(bos_token="(BOS)", eos_token="(EOS)")
processor = BarkProcessor.from_pretrained(
self.tmpdirname,
self.speaker_embeddings_dict_path,
bos_token="(BOS)",
eos_token="(EOS)",
)
self.assertEqual(processor.tokenizer.get_vocab(), tokenizer_add_kwargs.get_vocab())
def test_speaker_embeddings(self):
processor = BarkProcessor.from_pretrained(
pretrained_processor_name_or_path=self.checkpoint,
speaker_embeddings_dict_path=self.speaker_embeddings_dict_path,
)
seq_len = 35
nb_codebooks_coarse = 2
nb_codebooks_total = 8
voice_preset = {
"semantic_prompt": np.ones(seq_len),
"coarse_prompt": np.ones((nb_codebooks_coarse, seq_len)),
"fine_prompt": np.ones((nb_codebooks_total, seq_len)),
}
# test providing already loaded voice_preset
inputs = processor(text=self.input_string, voice_preset=voice_preset)
processed_voice_preset = inputs["history_prompt"]
for key in voice_preset:
self.assertListEqual(voice_preset[key].tolist(), processed_voice_preset.get(key, np.array([])).tolist())
# test loading voice preset from npz file
tmpfilename = os.path.join(self.tmpdirname, "file.npz")
np.savez(tmpfilename, **voice_preset)
inputs = processor(text=self.input_string, voice_preset=tmpfilename)
processed_voice_preset = inputs["history_prompt"]
for key in voice_preset:
self.assertListEqual(voice_preset[key].tolist(), processed_voice_preset.get(key, np.array([])).tolist())
# test loading voice preset from the hub
inputs = processor(text=self.input_string, voice_preset=self.voice_preset)
def test_tokenizer(self):
tokenizer = self.get_tokenizer()
processor = BarkProcessor(tokenizer=tokenizer)
encoded_processor = processor(text=self.input_string)
encoded_tok = tokenizer(
self.input_string,
padding="max_length",
max_length=256,
add_special_tokens=False,
return_attention_mask=True,
return_token_type_ids=False,
)
for key in encoded_tok.keys():
self.assertListEqual(encoded_tok[key], encoded_processor[key].squeeze().tolist())

View File

@@ -167,6 +167,8 @@ IGNORE_NON_TESTED = PRIVATE_MODELS.copy() + [
"SpeechT5SpeechEncoder", # Building part of bigger (tested) model.
"SpeechT5TextDecoder", # Building part of bigger (tested) model.
"SpeechT5TextEncoder", # Building part of bigger (tested) model.
"BarkCausalModel", # Building part of bigger (tested) model.
"BarkModel", # Does not have a forward signature - generation tested with integration tests
]
# Update this list with test files that don't have a tester with a `all_model_classes` variable and which don't
@@ -188,6 +190,7 @@ TEST_FILES_WITH_NO_COMMON_TESTS = [
"models/vision_text_dual_encoder/test_modeling_tf_vision_text_dual_encoder.py",
"models/vision_text_dual_encoder/test_modeling_flax_vision_text_dual_encoder.py",
"models/decision_transformer/test_modeling_decision_transformer.py",
"models/bark/test_modeling_bark.py",
]
# Update this list for models that are not in any of the auto MODEL_XXX_MAPPING. Being in this list is an exception and
@@ -332,11 +335,15 @@ IGNORE_NON_AUTO_CONFIGURED = PRIVATE_MODELS.copy() + [
"AltCLIPVisionModel",
"AltRobertaModel",
"TvltForAudioVisualClassification",
"BarkCausalModel",
"BarkCoarseModel",
"BarkFineModel",
"BarkSemanticModel",
"MusicgenModel",
"MusicgenForConditionalGeneration",
"SpeechT5ForSpeechToSpeech",
"SpeechT5ForTextToSpeech",
"SpeechT5HifiGan",
"MusicgenModel",
"MusicgenForConditionalGeneration",
]
# DO NOT edit this list!

View File

@@ -28,6 +28,9 @@ src/transformers/models/auto/feature_extraction_auto.py
src/transformers/models/auto/image_processing_auto.py
src/transformers/models/auto/processing_auto.py
src/transformers/models/auto/tokenization_auto.py
src/transformers/models/bark/configuration_bark.py
src/transformers/models/bark/modeling_bark.py
src/transformers/models/bark/processing_bark.py
src/transformers/models/bart/configuration_bart.py
src/transformers/models/bart/modeling_bart.py
src/transformers/models/bart/tokenization_bart.py