Add bark (#24086)

* first raw version of the bark integration * working code on small models with single run * add converting script from suno weights 2 hf * many changes * correct past_kv output * working implementation for inference * update the converting script according to the architecture changes * add a working end-to-end inference code * remove some comments and make small changes * remove unecessary comment * add docstrings and ensure no unecessary intermediary output during audio generation * remove done TODOs * make style + add config docstrings * modification for batch inference support on the whole model * add details to .generation_audio method * add copyright * convert EncodecModel from original library to transformers implementation * add two class in order to facilitate model and sub-models loading from the hub * add support of loading the whole model * add BarkProcessor * correct modeling according to processor output * Add proper __init__ and auto support * Add up-to-date copyright/license message * add relative import instead of absolute * cleaner head_dim computation * small comment removal or changes * more verbose LayerNorm init method * specify eps for clearer comprehension * more verbose variable naming in the MLP module * remove unecessary BarkBlock parameter * clearer code in the forward pass of the BarkBlock * remove _initialize_modules method for cleaner code * Remove unnecessary methods from sub-models * move code to remove unnecessary function * rename a variable for clarity and change an assert * move code and change variable name for clarity * remove unnecessary asserts * correct small bug * correct a comment * change variable names for clarity * remove asserts * change import from absolute to relative * correct small error due to comma missing + correct import * Add attribute Bark config * add first version of tests * update attention_map * add tie_weights and resize_token_embeddings for fineModel * correct getting attention_mask in 
generate_text_semantic * remove Bark inference trick * leave more choices in barkProcessor * remove _no_split_modules * fixe error in forward of block and introduce clearer notations * correct converting script with last changes * make style + add draft bark.mdx * correct BarkModelTest::test_generate_text_semantic * add Bark in main README * add dummy_pt_objects for Bark * add missing models in the main init * correct test_decoder_model_past_with_large_inputs * disable torchscript test * change docstring of BarkProcessor * Add test_processor_bark * make style * correct copyrights * add bark.mdx + make style, quality and consistency * Apply suggestions from code review Co-authored-by: Sanchit Gandhi <93869735+sanchit-gandhi@users.noreply.github.com> * Remove unnecessary test method * simply logic of a test * Only check first ids for slow audio generation * split full end-to-end generation tests * remove unneccessary comment * change submodel names for clearer naming * remove ModuleDict from modeling_bark * combine two if statements * ensure that an edge misued won't happen * modify variable name * move code snippet to the right place (coarse instead of semantic) * change BarkSemanticModule -> BarkSemanticModel * align BarkProcessor with transformers paradigm * correct BarkProcessor tests with last commit changes * change _validate_voice_preset to an instance method instead of a class method * tie_weights already called with post_init * add codec_model config to configuration * update bark modeling tests with recent BarkProcessor changes * remove SubModelPretrainedModel + change speakers embeddings prompt type in BarkModel * change absolute imports to relative * remove TODO * change docstrings * add examples to docs and docstrings * make style * uses BatchFeature in BarkProcessor insteads of dict * continue improving docstrings and docs + make style * correct docstrings examples * more comprehensible speaker_embeddings load/Save * rename speaker_embeddings_dict -> 
speaker_embeddings * correct bark.mdx + add bark to documentation_tests * correct docstrings configuration_bark * integrate last nit suggestions * integrate BarkGeneration configs * make style * remove bark tests from documentation_tests.txt because timeout - tested manually * add proper generation config initialization * small bark.mdx documentation changes * rename bark.mdx -> bark.md * add torch.no_grad behind BarkModel.generate_audio() * replace assert by ValueError in convert_suno_to_hf.py * integrate a series of short comments from reviewer * move SemanticLogitsProcessors and remove .detach() from Bark docs and docstrings * actually remove SemanticLogitsProcessor from modeling_bark.oy * BarkProcessor returns a single output instead of tuple + correct docstrings * make style + correct bug * add initializer_range to BarkConfig + correct slow modeling tests * add .clone() to history_prompt.coarse_prompt to avoid modifying input array * Making sure no extra "`" are present * remove extra characters in modeling_bark.py * Correct output if history_prompt is None * remove TODOs * remove ravel comment * completing generation_configuration_bark.py docstrings * change docstrings - number of audio codebooks instead of Encodec codebooks * change 'bias' docstrings in configuration_bark.py * format code * rename BarkModel.generate_audio -> BarkModel.generate_speech * modify AutoConfig instead of EncodecConfig in BarkConfig * correct AutoConfig wrong init * refactor BarkModel and sub-models generate_coarse, generate_fine, generate_text_semantic * remove SemanticLogitsProcessor and replace it with SuppressTokensLogitsProcessor * move nb_codebook related config arguments to BarkFineConfig * rename bark.mdx -> bark.md * correcting BarkModelConfig from_pretrained + remove keys_to_ignore * correct bark.md with correct hub path * correct code bug in bark.md * correct list tokens_to_suppress * modify Processor to load nested speaker embeddings in a safer way * correct batch 
sampling in BarkFineModel.generate_fine * Apply suggestions from code review Small docstrings correction and code improvements Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com> * give more details about num_layers in docstrings * correct indentation mistake * correct submodelconfig order of docstring variables * put audio models in alphabetical order in utils/check_repo.my * remove useless line from test_modeling_bark.py * makes BarkCoarseModelTest inherits from (ModelTesterMixin, GenerationTesterMixin, unittest.TestCase) instead of BarkSemanticModelTest * make a Tester class for each sub-model instead of inheriting * add test_resize_embeddings=True for Bark sub-models * add Copied from transformers.models.gpt_neo.modeling_gpt_neo.GPTNeoSelfAttention._split_heads * remove 'Copied fom Bark' comment * remove unneccessary comment * change np.min -> min in modeling_bark.py * refactored all custom layers to have Bark prefix * add attention_mask as an argument of generate_text_semantic * refactor sub-models start docstrings to have more precise config class definition * move _tied_weights_keys overriding * add docstrings to generate_xxx in modeling_bark.py * add loading whole BarkModel to convert_suno_to_hf * refactor attribute and variable names * make style convert_suno * update bark checkpoints * remove never entered if statement * move bark_modeling docstrings after BarkPretrainedModel class definition * refactor modeling_bark.py: kv -> key_values * small nits - code refactoring and removing unecessary lines from _init_weights * nits - replace inplace method by variable assigning * remove *optional* when necessary * remove some lines in generate_speech * add default value for optional parameter * Refactor preprocess_histories_before_coarse -> preprocess_histories Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com> * correct usage after refactoring * refactor Bark's generate_xxx -> generate and modify docstrings and 
tests accordingly * update docstrings python in configuration_bark.py * add bark files in utils/documentation_test.txt * correct docstrings python snippet * add the ability to use parameters in the form of e.g coarse_temperature * add semantic_max_new_tokens in python snippet in docstrings for quicker generation * Reformate sub-models kwargs in BakModel.generate Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com> * correct kwargs in BarkModel.generate * correct attention_mask kwarg in BarkModel.generate * add tests for sub-models args in BarkModel.generate and correct BarkFineModel.test_generate_fp16 * enrich BarkModel.generate docstrings with a description of how to use the kwargs --------- Co-authored-by: Sanchit Gandhi <93869735+sanchit-gandhi@users.noreply.github.com> Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com> Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
2023-07-17 18:53:24 +02:00
parent c21c3737c1
commit f42a35e611
28 changed files with 4199 additions and 2 deletions
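
The commit message above mentions the ability to pass sub-model parameters to `BarkModel.generate` "in the form of e.g coarse_temperature". Below is a minimal sketch of how such prefix-based routing could work. It assumes three prefixes (`semantic_`, `coarse_`, `fine_`) and that unprefixed arguments are shared by every sub-model. `split_generate_kwargs` is a hypothetical helper written for illustration; it is not the code in `modeling_bark.py`.

```python
from typing import Any, Dict

# Assumed prefixes, one per Bark sub-model; illustration only.
SUB_MODEL_PREFIXES = ("semantic", "coarse", "fine")


def split_generate_kwargs(**kwargs: Any) -> Dict[str, Dict[str, Any]]:
    """Route keyword arguments to sub-models using a `<prefix>_` naming convention."""
    shared: Dict[str, Any] = {}
    specific: Dict[str, Dict[str, Any]] = {name: {} for name in SUB_MODEL_PREFIXES}

    for key, value in kwargs.items():
        for name in SUB_MODEL_PREFIXES:
            prefix = name + "_"
            if key.startswith(prefix):
                # Strip the prefix so the sub-model sees the plain argument name.
                specific[name][key[len(prefix):]] = value
                break
        else:
            # No prefix matched: the argument applies to every sub-model.
            shared[key] = value

    # A prefixed value overrides a shared value of the same name for its own sub-model.
    return {name: {**shared, **specific[name]} for name in SUB_MODEL_PREFIXES}


routed = split_generate_kwargs(
    do_sample=True,
    coarse_temperature=0.7,
    semantic_max_new_tokens=100,
)
print(routed["coarse"])
```

Here `do_sample` reaches all three sub-models, while `temperature` only reaches the coarse one and `max_new_tokens` only the semantic one.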
--- a/README.md
+++ b/README.md
@@ -291,6 +291,7 @@ Current number of checkpoints: ![](https://img.shields.io/endpoint?url=https://h
 1. **[AltCLIP](https://huggingface.co/docs/transformers/model_doc/altclip)** (from BAAI) released with the paper [AltCLIP: Altering the Language Encoder in CLIP for Extended Language Capabilities](https://arxiv.org/abs/2211.06679) by Chen, Zhongzhi and Liu, Guang and Zhang, Bo-Wen and Ye, Fulong and Yang, Qinghong and Wu, Ledell.
 1. **[Audio Spectrogram Transformer](https://huggingface.co/docs/transformers/model_doc/audio-spectrogram-transformer)** (from MIT) released with the paper [AST: Audio Spectrogram Transformer](https://arxiv.org/abs/2104.01778) by Yuan Gong, Yu-An Chung, James Glass.
 1. **[Autoformer](https://huggingface.co/docs/transformers/model_doc/autoformer)** (from Tsinghua University) released with the paper [Autoformer: Decomposition Transformers with Auto-Correlation for Long-Term Series Forecasting](https://arxiv.org/abs/2106.13008) by Haixu Wu, Jiehui Xu, Jianmin Wang, Mingsheng Long.
+1. **[Bark](https://huggingface.co/docs/transformers/main/model_doc/bark)** (from Suno) released in the repository [suno-ai/bark](https://github.com/suno-ai/bark) by Suno AI team.
 1. **[BART](https://huggingface.co/docs/transformers/model_doc/bart)** (from Facebook) released with the paper [BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension](https://arxiv.org/abs/1910.13461) by Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov and Luke Zettlemoyer.
 1. **[BARThez](https://huggingface.co/docs/transformers/model_doc/barthez)** (from École polytechnique) released with the paper [BARThez: a Skilled Pretrained French Sequence-to-Sequence Model](https://arxiv.org/abs/2010.12321) by Moussa Kamal Eddine, Antoine J.-P. Tixier, Michalis Vazirgiannis.
 1. **[BARTpho](https://huggingface.co/docs/transformers/model_doc/bartpho)** (from VinAI Research) released with the paper [BARTpho: Pre-trained Sequence-to-Sequence Models for Vietnamese](https://arxiv.org/abs/2109.09701) by Nguyen Luong Tran, Duong Minh Le and Dat Quoc Nguyen.
--- a/README_es.md
+++ b/README_es.md
@@ -268,6 +268,7 @@ Número actual de puntos de control: ![](https://img.shields.io/endpoint?url=htt
 1. **[AltCLIP](https://huggingface.co/docs/transformers/model_doc/altclip)** (from BAAI) released with the paper [AltCLIP: Altering the Language Encoder in CLIP for Extended Language Capabilities](https://arxiv.org/abs/2211.06679) by Chen, Zhongzhi and Liu, Guang and Zhang, Bo-Wen and Ye, Fulong and Yang, Qinghong and Wu, Ledell.
 1. **[Audio Spectrogram Transformer](https://huggingface.co/docs/transformers/model_doc/audio-spectrogram-transformer)** (from MIT) released with the paper [AST: Audio Spectrogram Transformer](https://arxiv.org/abs/2104.01778) by Yuan Gong, Yu-An Chung, James Glass.
 1. **[Autoformer](https://huggingface.co/docs/transformers/model_doc/autoformer)** (from Tsinghua University) released with the paper [Autoformer: Decomposition Transformers with Auto-Correlation for Long-Term Series Forecasting](https://arxiv.org/abs/2106.13008) by Haixu Wu, Jiehui Xu, Jianmin Wang, Mingsheng Long.
+1. **[Bark](https://huggingface.co/docs/transformers/main/model_doc/bark)** (from Suno) released in the repository [suno-ai/bark](https://github.com/suno-ai/bark) by Suno AI team.
 1. **[BART](https://huggingface.co/docs/transformers/model_doc/bart)** (from Facebook) released with the paper [BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension](https://arxiv.org/abs/1910.13461) by Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov and Luke Zettlemoyer.
 1. **[BARThez](https://huggingface.co/docs/transformers/model_doc/barthez)** (from École polytechnique) released with the paper [BARThez: a Skilled Pretrained French Sequence-to-Sequence Model](https://arxiv.org/abs/2010.12321) by Moussa Kamal Eddine, Antoine J.-P. Tixier, Michalis Vazirgiannis.
 1. **[BARTpho](https://huggingface.co/docs/transformers/model_doc/bartpho)** (from VinAI Research) released with the paper [BARTpho: Pre-trained Sequence-to-Sequence Models for Vietnamese](https://arxiv.org/abs/2109.09701) by Nguyen Luong Tran, Duong Minh Le and Dat Quoc Nguyen.
--- a/README_hd.md
+++ b/README_hd.md
@@ -240,6 +240,7 @@ conda install -c huggingface transformers
 1. **[AltCLIP](https://huggingface.co/docs/transformers/model_doc/altclip)** (from BAAI) released with the paper [AltCLIP: Altering the Language Encoder in CLIP for Extended Language Capabilities](https://arxiv.org/abs/2211.06679) by Chen, Zhongzhi and Liu, Guang and Zhang, Bo-Wen and Ye, Fulong and Yang, Qinghong and Wu, Ledell.
 1. **[Audio Spectrogram Transformer](https://huggingface.co/docs/transformers/model_doc/audio-spectrogram-transformer)** (from MIT) released with the paper [AST: Audio Spectrogram Transformer](https://arxiv.org/abs/2104.01778) by Yuan Gong, Yu-An Chung, James Glass.
 1. **[Autoformer](https://huggingface.co/docs/transformers/model_doc/autoformer)** (from Tsinghua University) released with the paper [Autoformer: Decomposition Transformers with Auto-Correlation for Long-Term Series Forecasting](https://arxiv.org/abs/2106.13008) by Haixu Wu, Jiehui Xu, Jianmin Wang, Mingsheng Long.
+1. **[Bark](https://huggingface.co/docs/transformers/main/model_doc/bark)** (from Suno) released in the repository [suno-ai/bark](https://github.com/suno-ai/bark) by Suno AI team.
 1. **[BART](https://huggingface.co/docs/transformers/model_doc/bart)** (फेसबुक) साथ थीसिस [बार्ट: प्राकृतिक भाषा निर्माण, अनुवाद के लिए अनुक्रम-से-अनुक्रम पूर्व प्रशिक्षण , और समझ] (https://arxiv.org/pdf/1910.13461.pdf) पर निर्भर माइक लुईस, यिनहान लियू, नमन गोयल, मार्जन ग़ज़विनिनेजाद, अब्देलरहमान मोहम्मद, ओमर लेवी, वेस स्टोयानोव और ल्यूक ज़ेटलमॉयर
 1. **[BARThez](https://huggingface.co/docs/transformers/model_doc/barthez)** (से École polytechnique) साथ थीसिस [BARThez: a Skilled Pretrained French Sequence-to-Sequence Model](https://arxiv.org/abs/2010.12321) पर निर्भर Moussa Kamal Eddine, Antoine J.-P. Tixier, Michalis Vazirgiannis रिहाई।
 1. **[BARTpho](https://huggingface.co/docs/transformers/model_doc/bartpho)** (VinAI Research से) साथ में पेपर [BARTpho: Pre-trained Sequence-to-Sequence Models for Vietnamese](https://arxiv.org/abs/2109.09701)गुयेन लुओंग ट्रान, डुओंग मिन्ह ले और डाट क्वोक गुयेन द्वारा पोस्ट किया गया।
--- a/README_ja.md
+++ b/README_ja.md
@@ -302,6 +302,7 @@ Flax、PyTorch、TensorFlowをcondaでインストールする方法は、それ
 1. **[AltCLIP](https://huggingface.co/docs/transformers/model_doc/altclip)** (BAAI から) Chen, Zhongzhi and Liu, Guang and Zhang, Bo-Wen and Ye, Fulong and Yang, Qinghong and Wu, Ledell から公開された研究論文: [AltCLIP: Altering the Language Encoder in CLIP for Extended Language Capabilities](https://arxiv.org/abs/2211.06679)
 1. **[Audio Spectrogram Transformer](https://huggingface.co/docs/transformers/model_doc/audio-spectrogram-transformer)** (MIT から) Yuan Gong, Yu-An Chung, James Glass から公開された研究論文: [AST: Audio Spectrogram Transformer](https://arxiv.org/abs/2104.01778)
 1. **[Autoformer](https://huggingface.co/docs/transformers/model_doc/autoformer)** (from Tsinghua University) released with the paper [Autoformer: Decomposition Transformers with Auto-Correlation for Long-Term Series Forecasting](https://arxiv.org/abs/2106.13008) by Haixu Wu, Jiehui Xu, Jianmin Wang, Mingsheng Long.
+1. **[Bark](https://huggingface.co/docs/transformers/main/model_doc/bark)** (from Suno) released in the repository [suno-ai/bark](https://github.com/suno-ai/bark) by Suno AI team.
 1. **[BART](https://huggingface.co/docs/transformers/model_doc/bart)** (Facebook から) Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov and Luke Zettlemoyer から公開された研究論文: [BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension](https://arxiv.org/abs/1910.13461)
 1. **[BARThez](https://huggingface.co/docs/transformers/model_doc/barthez)** (École polytechnique から) Moussa Kamal Eddine, Antoine J.-P. Tixier, Michalis Vazirgiannis から公開された研究論文: [BARThez: a Skilled Pretrained French Sequence-to-Sequence Model](https://arxiv.org/abs/2010.12321)
 1. **[BARTpho](https://huggingface.co/docs/transformers/model_doc/bartpho)** (VinAI Research から) Nguyen Luong Tran, Duong Minh Le and Dat Quoc Nguyen から公開された研究論文: [BARTpho: Pre-trained Sequence-to-Sequence Models for Vietnamese](https://arxiv.org/abs/2109.09701)
--- a/README_ko.md
+++ b/README_ko.md
@@ -217,6 +217,7 @@ Flax, PyTorch, TensorFlow 설치 페이지에서 이들을 conda로 설치하는
 1. **[AltCLIP](https://huggingface.co/docs/transformers/model_doc/altclip)** (from BAAI) released with the paper [AltCLIP: Altering the Language Encoder in CLIP for Extended Language Capabilities](https://arxiv.org/abs/2211.06679) by Chen, Zhongzhi and Liu, Guang and Zhang, Bo-Wen and Ye, Fulong and Yang, Qinghong and Wu, Ledell.
 1. **[Audio Spectrogram Transformer](https://huggingface.co/docs/transformers/model_doc/audio-spectrogram-transformer)** (from MIT) released with the paper [AST: Audio Spectrogram Transformer](https://arxiv.org/abs/2104.01778) by Yuan Gong, Yu-An Chung, James Glass.
 1. **[Autoformer](https://huggingface.co/docs/transformers/model_doc/autoformer)** (from Tsinghua University) released with the paper [Autoformer: Decomposition Transformers with Auto-Correlation for Long-Term Series Forecasting](https://arxiv.org/abs/2106.13008) by Haixu Wu, Jiehui Xu, Jianmin Wang, Mingsheng Long.
+1. **[Bark](https://huggingface.co/docs/transformers/main/model_doc/bark)** (from Suno) released in the repository [suno-ai/bark](https://github.com/suno-ai/bark) by Suno AI team.
 1. **[BART](https://huggingface.co/docs/transformers/model_doc/bart)** (from Facebook) released with the paper [BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension](https://arxiv.org/pdf/1910.13461.pdf) by Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov and Luke Zettlemoyer.
 1. **[BARThez](https://huggingface.co/docs/transformers/model_doc/barthez)** (from École polytechnique) released with the paper [BARThez: a Skilled Pretrained French Sequence-to-Sequence Model](https://arxiv.org/abs/2010.12321) by Moussa Kamal Eddine, Antoine J.-P. Tixier, Michalis Vazirgiannis.
 1. **[BARTpho](https://huggingface.co/docs/transformers/model_doc/bartpho)** (from VinAI Research) released with the paper [BARTpho: Pre-trained Sequence-to-Sequence Models for Vietnamese](https://arxiv.org/abs/2109.09701) by Nguyen Luong Tran, Duong Minh Le and Dat Quoc Nguyen.
--- a/README_zh-hans.md
+++ b/README_zh-hans.md
@@ -241,6 +241,7 @@ conda install -c huggingface transformers
 1. **[AltCLIP](https://huggingface.co/docs/transformers/model_doc/altclip)** (来自 BAAI) 伴随论文 [AltCLIP: Altering the Language Encoder in CLIP for Extended Language Capabilities](https://arxiv.org/abs/2211.06679) 由 Chen, Zhongzhi and Liu, Guang and Zhang, Bo-Wen and Ye, Fulong and Yang, Qinghong and Wu, Ledell 发布。
 1. **[Audio Spectrogram Transformer](https://huggingface.co/docs/transformers/model_doc/audio-spectrogram-transformer)** (来自 MIT) 伴随论文 [AST: Audio Spectrogram Transformer](https://arxiv.org/abs/2104.01778) 由 Yuan Gong, Yu-An Chung, James Glass 发布。
 1. **[Autoformer](https://huggingface.co/docs/transformers/model_doc/autoformer)** (from Tsinghua University) released with the paper [Autoformer: Decomposition Transformers with Auto-Correlation for Long-Term Series Forecasting](https://arxiv.org/abs/2106.13008) by Haixu Wu, Jiehui Xu, Jianmin Wang, Mingsheng Long.
+1. **[Bark](https://huggingface.co/docs/transformers/main/model_doc/bark)** (from Suno) released in the repository [suno-ai/bark](https://github.com/suno-ai/bark) by Suno AI team.
 1. **[BART](https://huggingface.co/docs/transformers/model_doc/bart)** (来自 Facebook) 伴随论文 [BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension](https://arxiv.org/pdf/1910.13461.pdf) 由 Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov and Luke Zettlemoyer 发布。
 1. **[BARThez](https://huggingface.co/docs/transformers/model_doc/barthez)** (来自 École polytechnique) 伴随论文 [BARThez: a Skilled Pretrained French Sequence-to-Sequence Model](https://arxiv.org/abs/2010.12321) 由 Moussa Kamal Eddine, Antoine J.-P. Tixier, Michalis Vazirgiannis 发布。
 1. **[BARTpho](https://huggingface.co/docs/transformers/model_doc/bartpho)** (来自 VinAI Research) 伴随论文 [BARTpho: Pre-trained Sequence-to-Sequence Models for Vietnamese](https://arxiv.org/abs/2109.09701) 由 Nguyen Luong Tran, Duong Minh Le and Dat Quoc Nguyen 发布。
--- a/README_zh-hant.md
+++ b/README_zh-hant.md
@@ -253,6 +253,7 @@ conda install -c huggingface transformers
 1. **[AltCLIP](https://huggingface.co/docs/transformers/model_doc/altclip)** (from BAAI) released with the paper [AltCLIP: Altering the Language Encoder in CLIP for Extended Language Capabilities](https://arxiv.org/abs/2211.06679) by Chen, Zhongzhi and Liu, Guang and Zhang, Bo-Wen and Ye, Fulong and Yang, Qinghong and Wu, Ledell.
 1. **[Audio Spectrogram Transformer](https://huggingface.co/docs/transformers/model_doc/audio-spectrogram-transformer)** (from MIT) released with the paper [AST: Audio Spectrogram Transformer](https://arxiv.org/abs/2104.01778) by Yuan Gong, Yu-An Chung, James Glass.
 1. **[Autoformer](https://huggingface.co/docs/transformers/model_doc/autoformer)** (from Tsinghua University) released with the paper [Autoformer: Decomposition Transformers with Auto-Correlation for Long-Term Series Forecasting](https://arxiv.org/abs/2106.13008) by Haixu Wu, Jiehui Xu, Jianmin Wang, Mingsheng Long.
+1. **[Bark](https://huggingface.co/docs/transformers/main/model_doc/bark)** (from Suno) released in the repository [suno-ai/bark](https://github.com/suno-ai/bark) by Suno AI team.
 1. **[BART](https://huggingface.co/docs/transformers/model_doc/bart)** (from Facebook) released with the paper [BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension](https://arxiv.org/pdf/1910.13461.pdf) by Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov and Luke Zettlemoyer.
 1. **[BARThez](https://huggingface.co/docs/transformers/model_doc/barthez)** (from École polytechnique) released with the paper [BARThez: a Skilled Pretrained French Sequence-to-Sequence Model](https://arxiv.org/abs/2010.12321) by Moussa Kamal Eddine, Antoine J.-P. Tixier, Michalis Vazirgiannis.
 1. **[BARTpho](https://huggingface.co/docs/transformers/model_doc/bartpho)** (from VinAI Research) released with the paper [BARTpho: Pre-trained Sequence-to-Sequence Models for Vietnamese](https://arxiv.org/abs/2109.09701) by Nguyen Luong Tran, Duong Minh Le and Dat Quoc Nguyen.
--- a/docs/source/en/_toctree.yml
+++ b/docs/source/en/_toctree.yml
@@ -545,6 +545,8 @@
      sections:
      - local: model_doc/audio-spectrogram-transformer
        title: Audio Spectrogram Transformer
+      - local: model_doc/bark
+        title: Bark
      - local: model_doc/clap
        title: CLAP
      - local: model_doc/encodec
--- a/docs/source/en/index.md
+++ b/docs/source/en/index.md
@@ -57,6 +57,7 @@ The documentation is organized into five sections:
 1. **[AltCLIP](model_doc/altclip)** (from BAAI) released with the paper [AltCLIP: Altering the Language Encoder in CLIP for Extended Language Capabilities](https://arxiv.org/abs/2211.06679) by Chen, Zhongzhi and Liu, Guang and Zhang, Bo-Wen and Ye, Fulong and Yang, Qinghong and Wu, Ledell.
 1. **[Audio Spectrogram Transformer](model_doc/audio-spectrogram-transformer)** (from MIT) released with the paper [AST: Audio Spectrogram Transformer](https://arxiv.org/abs/2104.01778) by Yuan Gong, Yu-An Chung, James Glass.
 1. **[Autoformer](model_doc/autoformer)** (from Tsinghua University) released with the paper [Autoformer: Decomposition Transformers with Auto-Correlation for Long-Term Series Forecasting](https://arxiv.org/abs/2106.13008) by Haixu Wu, Jiehui Xu, Jianmin Wang, Mingsheng Long.
+1. **[Bark](model_doc/bark)** (from Suno) released in the repository [suno-ai/bark](https://github.com/suno-ai/bark) by Suno AI team.
 1. **[BART](model_doc/bart)** (from Facebook) released with the paper [BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension](https://arxiv.org/abs/1910.13461) by Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov and Luke Zettlemoyer.
 1. **[BARThez](model_doc/barthez)** (from École polytechnique) released with the paper [BARThez: a Skilled Pretrained French Sequence-to-Sequence Model](https://arxiv.org/abs/2010.12321) by Moussa Kamal Eddine, Antoine J.-P. Tixier, Michalis Vazirgiannis.
 1. **[BARTpho](model_doc/bartpho)** (from VinAI Research) released with the paper [BARTpho: Pre-trained Sequence-to-Sequence Models for Vietnamese](https://arxiv.org/abs/2109.09701) by Nguyen Luong Tran, Duong Minh Le and Dat Quoc Nguyen.
@@ -282,6 +283,7 @@ Flax), PyTorch, and/or TensorFlow.
 |            AltCLIP            |       ❌       |       ❌       |       ✅        |         ❌         |      ❌      |
 | Audio Spectrogram Transformer |       ❌       |       ❌       |       ✅        |         ❌         |      ❌      |
 |          Autoformer           |       ❌       |       ❌       |       ✅        |         ❌         |      ❌      |
+|             Bark              |       ❌       |       ❌       |       ✅        |         ❌         |      ❌      |
 |             BART              |       ✅       |       ✅       |       ✅        |         ✅         |      ✅      |
 |             BEiT              |       ❌       |       ❌       |       ✅        |         ❌         |      ✅      |
 |             BERT              |       ✅       |       ✅       |       ✅        |         ✅         |      ✅      |
--- a/docs/source/en/model_doc/bark.md
+++ b/docs/source/en/model_doc/bark.md
@@ -0,0 +1,141 @@
+<!--Copyright 2023 The HuggingFace Team. All rights reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
+an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
+specific language governing permissions and limitations under the License.
+-->
+
+# Bark
+
+## Overview
+
+Bark is a transformer-based text-to-speech model proposed by Suno AI in [suno-ai/bark](https://github.com/suno-ai/bark). 
+
+
+Bark is made up of four main models:
+
+- [`BarkSemanticModel`] (also referred to as the 'text' model): a causal autoregressive transformer that takes tokenized text as input and predicts semantic text tokens that capture the meaning of the text.
+- [`BarkCoarseModel`] (also referred to as the 'coarse acoustics' model): a causal autoregressive transformer that takes the output of the [`BarkSemanticModel`] as input. It predicts the first two audio codebooks required by EnCodec.
+- [`BarkFineModel`] (the 'fine acoustics' model): a non-causal autoencoder transformer that iteratively predicts the remaining codebooks based on the sum of the previous codebooks' embeddings.
+- [`EncodecModel`]: once all the codebook channels have been predicted, Bark uses it to decode the output audio array.
+
+Each of the first three modules can take conditional speaker embeddings, which condition the output sound on a specific predefined voice.
+
+
+### Tips:
+
+Suno offers a library of voice presets in a number of languages [here](https://suno-ai.notion.site/8b8e8749ed514b0cbf3f699013548683?v=bc67cff786b04b50b3ceb756fd05f68c).
+These presets are also available on the Hub [here](https://huggingface.co/suno/bark-small/tree/main/speaker_embeddings) or [here](https://huggingface.co/suno/bark/tree/main/speaker_embeddings).
+
+```python
+>>> from transformers import AutoProcessor, BarkModel
+
+>>> processor = AutoProcessor.from_pretrained("suno/bark")
+>>> model = BarkModel.from_pretrained("suno/bark")
+
+>>> voice_preset = "v2/en_speaker_6"
+
+>>> inputs = processor("Hello, my dog is cute", voice_preset=voice_preset)
+
+>>> audio_array = model.generate(**inputs)
+>>> audio_array = audio_array.cpu().numpy().squeeze()
+```
+
+Bark can generate highly realistic, **multilingual** speech as well as other audio - including music, background noise and simple sound effects. 
+
+```python
+>>> # Multilingual speech - simplified Chinese
+>>> inputs = processor("惊人的！我会说中文")
+
+>>> # Multilingual speech - French - let's use a voice_preset as well
+>>> inputs = processor("Incroyable! Je peux générer du son.", voice_preset="fr_speaker_5")
+
+>>> # Bark can also generate music. You can help it out by adding music notes around your lyrics.
+>>> inputs = processor("♪ Hello, my dog is cute ♪")
+
+>>> audio_array = model.generate(**inputs)
+>>> audio_array = audio_array.cpu().numpy().squeeze()
+```
+
+The model can also produce **nonverbal sounds** such as laughing, sighing and crying.
+
+
+```python
+>>> # Adding non-speech cues to the input text
+>>> inputs = processor("Hello uh ... [clears throat], my dog is cute [laughter]")
+
+>>> audio_array = model.generate(**inputs)
+>>> audio_array = audio_array.cpu().numpy().squeeze()
+```
+
+To save the audio, take the sample rate from the model's generation config and pass it to SciPy's `scipy.io.wavfile.write`:
+
+```python
+>>> from scipy.io.wavfile import write as write_wav
+
+>>> # save audio to disk, but first take the sample rate from the model's generation config
+>>> sample_rate = model.generation_config.sample_rate
+>>> write_wav("bark_generation.wav", sample_rate, audio_array)
+```
+
+
+This model was contributed by [Yoach Lacombe (ylacombe)](https://huggingface.co/ylacombe) and [Sanchit Gandhi (sanchit-gandhi)](https://github.com/sanchit-gandhi).
+The original code can be found [here](https://github.com/suno-ai/bark).
+
+
+## BarkConfig
+
+[[autodoc]] BarkConfig
+    - all
+
+## BarkProcessor
+
+[[autodoc]] BarkProcessor
+    - all
+    - __call__
+
+## BarkModel
+
+[[autodoc]] BarkModel
+    - generate
+
+## BarkSemanticModel
+
+[[autodoc]] BarkSemanticModel
+    - forward
+
+## BarkCoarseModel
+
+[[autodoc]] BarkCoarseModel
+    - forward
+
+## BarkFineModel
+
+[[autodoc]] BarkFineModel
+    - forward
+
+## BarkCausalModel
+
+[[autodoc]] BarkCausalModel
+    - forward
+
+## BarkCoarseConfig
+
+[[autodoc]] BarkCoarseConfig
+    - all
+
+## BarkFineConfig
+
+[[autodoc]] BarkFineConfig
+    - all
+
+## BarkSemanticConfig
+
+[[autodoc]] BarkSemanticConfig
+    - all
+
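Note on the saving step documented above: the docs write the waveform with scipy. As a rough illustration of what that step involves, here is a standard-library-only sketch that converts float samples in [-1, 1] to 16-bit PCM and writes a mono WAV. The `float_to_wav_bytes` helper, the sine tone and the 24 kHz rate are assumptions made for this example; in real use the rate would come from `model.generation_config.sample_rate` and the samples from `model.generate`.

```python
import io
import math
import struct
import wave


def float_to_wav_bytes(samples, sample_rate):
    """Encode float samples in [-1.0, 1.0] as a mono 16-bit PCM WAV (hypothetical helper)."""
    # Clip to the valid range, then scale to signed 16-bit integers.
    pcm = [int(max(-1.0, min(1.0, s)) * 32767) for s in samples]
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(1)  # mono
        wav_file.setsampwidth(2)  # 2 bytes per sample = 16-bit
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(struct.pack(f"<{len(pcm)}h", *pcm))
    return buffer.getvalue()


# Assumed rate for the example; stands in for model.generation_config.sample_rate.
sample_rate = 24_000
# One second of a 440 Hz tone stands in for the generated `audio_array`.
audio = [0.5 * math.sin(2 * math.pi * 440 * n / sample_rate) for n in range(sample_rate)]
wav_bytes = float_to_wav_bytes(audio, sample_rate)

# Read the bytes back to confirm the header carries the expected rate and length.
with wave.open(io.BytesIO(wav_bytes), "rb") as check:
    frame_rate = check.getframerate()
    duration_seconds = check.getnframes() / frame_rate
```

Writing with the wrong sample rate still produces a valid file, but it plays back at the wrong speed and pitch, which is why the docs read the rate from the generation config rather than hard-coding it.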
--- a/src/transformers/__init__.py
+++ b/src/transformers/__init__.py
@@ -160,6 +160,13 @@ _import_structure = {
        "AUTOFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP",
        "AutoformerConfig",
    ],
+    "models.bark": [
+        "BarkCoarseConfig",
+        "BarkConfig",
+        "BarkFineConfig",
+        "BarkProcessor",
+        "BarkSemanticConfig",
+    ],
    "models.bart": ["BartConfig", "BartTokenizer"],
    "models.barthez": [],
    "models.bartpho": [],
@@ -1136,6 +1143,17 @@ else:
            "AutoformerPreTrainedModel",
        ]
    )
+    _import_structure["models.bark"].extend(
+        [
+            "BARK_PRETRAINED_MODEL_ARCHIVE_LIST",
+            "BarkCausalModel",
+            "BarkCoarseModel",
+            "BarkFineModel",
+            "BarkModel",
+            "BarkPreTrainedModel",
+            "BarkSemanticModel",
+        ]
+    )
    _import_structure["models.bart"].extend(
        [
            "BART_PRETRAINED_MODEL_ARCHIVE_LIST",
@@ -4098,6 +4116,13 @@ if TYPE_CHECKING:
        AUTOFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP,
        AutoformerConfig,
    )
+    from .models.bark import (
+        BarkCoarseConfig,
+        BarkConfig,
+        BarkFineConfig,
+        BarkProcessor,
+        BarkSemanticConfig,
+    )
    from .models.bart import BartConfig, BartTokenizer
    from .models.beit import BEIT_PRETRAINED_CONFIG_ARCHIVE_MAP, BeitConfig
    from .models.bert import (
@@ -4978,6 +5003,15 @@ if TYPE_CHECKING:
            AutoformerModel,
            AutoformerPreTrainedModel,
        )
+        from .models.bark import (
+            BARK_PRETRAINED_MODEL_ARCHIVE_LIST,
+            BarkCausalModel,
+            BarkCoarseModel,
+            BarkFineModel,
+            BarkModel,
+            BarkPreTrainedModel,
+            BarkSemanticModel,
+        )
        from .models.bart import (
            BART_PRETRAINED_MODEL_ARCHIVE_LIST,
            BartForCausalLM,
--- a/src/transformers/generation/logits_process.py
+++ b/src/transformers/generation/logits_process.py
@@ -1148,3 +1148,39 @@ class ClassifierFreeGuidanceLogitsProcessor(LogitsProcessor):
        cond_logits, uncond_logits = scores.split(unguided_bsz, dim=0)
        scores = uncond_logits + (cond_logits - uncond_logits) * self.guidance_scale
        return scores
+
+
+class AlternatingCodebooksLogitsProcessor(LogitsProcessor):
+    r"""
+    [`LogitsProcessor`] enforcing alternated generation between the two codebooks of [`Bark`]'s fine submodel.
+
+    Args:
+        input_start_len (`int`):
+            The length of the initial input sequence.
+        semantic_vocab_size (`int`):
+            Vocabulary size of the semantic part, i.e. the number of tokens associated with the semantic vocabulary.
+        codebook_size (`int`):
+            Number of tokens associated with the codebook.
+    """
+
+    def __init__(self, input_start_len: int, semantic_vocab_size: int, codebook_size: int):
+        if not isinstance(input_start_len, int) or input_start_len < 0:
+            raise ValueError(f"`input_start_len` has to be a non-negative integer, but is {input_start_len}")
+
+        self.input_start_len = input_start_len
+        self.semantic_vocab_size = semantic_vocab_size
+        self.codebook_size = codebook_size
+
+    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor) -> torch.FloatTensor:
+        curr_len = input_ids.shape[-1]
+
+        # even -> first codebook, odd -> second codebook
+        is_first_codebook = ((curr_len - self.input_start_len) % 2) == 0
+
+        if is_first_codebook:
+            scores[:, : self.semantic_vocab_size] = -float("inf")
+            scores[:, self.semantic_vocab_size + self.codebook_size :] = -float("inf")
+        else:
+            scores[:, : self.semantic_vocab_size + self.codebook_size] = -float("inf")
+
+        return scores
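Note on `AlternatingCodebooksLogitsProcessor`: the masking rule can be hard to read from the slice expressions alone. The sketch below restates it on a plain Python list instead of a tensor. The function name `mask_alternating_codebooks` and the toy sizes are assumptions for illustration, not part of the patch.

```python
NEG_INF = float("-inf")


def mask_alternating_codebooks(scores, curr_len, input_start_len, semantic_vocab_size, codebook_size):
    """Return a copy of one row of logits with the disallowed token ids set to -inf."""
    first_start = semantic_vocab_size
    second_start = semantic_vocab_size + codebook_size
    # An even number of generated tokens means the next token comes from the first codebook.
    is_first_codebook = (curr_len - input_start_len) % 2 == 0
    masked = list(scores)
    for token_id in range(len(masked)):
        if is_first_codebook:
            allowed = first_start <= token_id < second_start
        else:
            allowed = token_id >= second_start
        if not allowed:
            masked[token_id] = NEG_INF
    return masked


# Toy layout (assumed): ids 0-3 semantic, 4-6 first codebook, 7-9 second codebook.
logits = [0.0] * 10
step_0 = mask_alternating_codebooks(logits, curr_len=5, input_start_len=5, semantic_vocab_size=4, codebook_size=3)
step_1 = mask_alternating_codebooks(logits, curr_len=6, input_start_len=5, semantic_vocab_size=4, codebook_size=3)

allowed_step_0 = [i for i, s in enumerate(step_0) if s != NEG_INF]
allowed_step_1 = [i for i, s in enumerate(step_1) if s != NEG_INF]
```

In this toy layout the first generated position may only pick ids 4 to 6 and the next one only ids 7 to 9, after which the pattern repeats.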
--- a/src/transformers/models/__init__.py
+++ b/src/transformers/models/__init__.py
@@ -19,6 +19,7 @@ from . import (
    audio_spectrogram_transformer,
    auto,
    autoformer,
+    bark,
    bart,
    barthez,
    bartpho,
--- a/src/transformers/models/auto/configuration_auto.py
+++ b/src/transformers/models/auto/configuration_auto.py
@@ -35,6 +35,7 @@ CONFIG_MAPPING_NAMES = OrderedDict(
        ("altclip", "AltCLIPConfig"),
        ("audio-spectrogram-transformer", "ASTConfig"),
        ("autoformer", "AutoformerConfig"),
+        ("bark", "BarkConfig"),
        ("bart", "BartConfig"),
        ("beit", "BeitConfig"),
        ("bert", "BertConfig"),
@@ -237,6 +238,7 @@ CONFIG_ARCHIVE_MAP_MAPPING_NAMES = OrderedDict(
        ("altclip", "ALTCLIP_PRETRAINED_CONFIG_ARCHIVE_MAP"),
        ("audio-spectrogram-transformer", "AUDIO_SPECTROGRAM_TRANSFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP"),
        ("autoformer", "AUTOFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP"),
+        ("bark", "BARK_PRETRAINED_CONFIG_ARCHIVE_MAP"),
        ("bart", "BART_PRETRAINED_CONFIG_ARCHIVE_MAP"),
        ("beit", "BEIT_PRETRAINED_CONFIG_ARCHIVE_MAP"),
        ("bert", "BERT_PRETRAINED_CONFIG_ARCHIVE_MAP"),
@@ -419,6 +421,7 @@ MODEL_NAMES_MAPPING = OrderedDict(
        ("altclip", "AltCLIP"),
        ("audio-spectrogram-transformer", "Audio Spectrogram Transformer"),
        ("autoformer", "Autoformer"),
+        ("bark", "Bark"),
        ("bart", "BART"),
        ("barthez", "BARThez"),
        ("bartpho", "BARTpho"),
--- a/src/transformers/models/auto/modeling_auto.py
+++ b/src/transformers/models/auto/modeling_auto.py
@@ -33,6 +33,7 @@ MODEL_MAPPING_NAMES = OrderedDict(
        ("altclip", "AltCLIPModel"),
        ("audio-spectrogram-transformer", "ASTModel"),
        ("autoformer", "AutoformerModel"),
+        ("bark", "BarkModel"),
        ("bart", "BartModel"),
        ("beit", "BeitModel"),
        ("bert", "BertModel"),
--- a/src/transformers/models/auto/processing_auto.py
+++ b/src/transformers/models/auto/processing_auto.py
@@ -44,6 +44,7 @@ PROCESSOR_MAPPING_NAMES = OrderedDict(
    [
        ("align", "AlignProcessor"),
        ("altclip", "AltCLIPProcessor"),
+        ("bark", "BarkProcessor"),
        ("blip", "BlipProcessor"),
        ("blip-2", "Blip2Processor"),
        ("bridgetower", "BridgeTowerProcessor"),
--- a/src/transformers/models/bark/__init__.py
+++ b/src/transformers/models/bark/__init__.py
@@ -0,0 +1,79 @@
+# Copyright 2023 The HuggingFace Team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+from typing import TYPE_CHECKING
+
+from ...utils import (
+    OptionalDependencyNotAvailable,
+    _LazyModule,
+    is_torch_available,
+)
+
+
+_import_structure = {
+    "configuration_bark": [
+        "BARK_PRETRAINED_CONFIG_ARCHIVE_MAP",
+        "BarkCoarseConfig",
+        "BarkConfig",
+        "BarkFineConfig",
+        "BarkSemanticConfig",
+    ],
+    "processing_bark": ["BarkProcessor"],
+}
+
+try:
+    if not is_torch_available():
+        raise OptionalDependencyNotAvailable()
+except OptionalDependencyNotAvailable:
+    pass
+else:
+    _import_structure["modeling_bark"] = [
+        "BARK_PRETRAINED_MODEL_ARCHIVE_LIST",
+        "BarkFineModel",
+        "BarkSemanticModel",
+        "BarkCoarseModel",
+        "BarkModel",
+        "BarkPreTrainedModel",
+        "BarkCausalModel",
+    ]
+
+if TYPE_CHECKING:
+    from .configuration_bark import (
+        BARK_PRETRAINED_CONFIG_ARCHIVE_MAP,
+        BarkCoarseConfig,
+        BarkConfig,
+        BarkFineConfig,
+        BarkSemanticConfig,
+    )
+    from .processing_bark import BarkProcessor
+
+    try:
+        if not is_torch_available():
+            raise OptionalDependencyNotAvailable()
+    except OptionalDependencyNotAvailable:
+        pass
+    else:
+        from .modeling_bark import (
+            BARK_PRETRAINED_MODEL_ARCHIVE_LIST,
+            BarkCausalModel,
+            BarkCoarseModel,
+            BarkFineModel,
+            BarkModel,
+            BarkPreTrainedModel,
+            BarkSemanticModel,
+        )
+
+else:
+    import sys
+
+    sys.modules[__name__] = _LazyModule(__name__, globals()["__file__"], _import_structure, module_spec=__spec__)
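Note on the `_import_structure` pattern used above: the dictionary maps submodule names to the public names they export, so that nothing is imported until a name is first accessed. The following is a deliberately simplified sketch of that idea. `LazyNamespace` is a hypothetical class written for this note, it is not the library's `_LazyModule`, and it resolves standard-library modules only to stay self-contained.

```python
import importlib
import types


class LazyNamespace(types.ModuleType):
    """Hypothetical, simplified lazy module: resolves exported names on first access."""

    def __init__(self, name, import_structure):
        super().__init__(name)
        # Invert {module: [names]} into {name: module} for quick lookup.
        self._name_to_module = {
            attr: module for module, attrs in import_structure.items() for attr in attrs
        }

    def __getattr__(self, attr):
        # Only reached when normal lookup fails, i.e. before the name was first resolved.
        if attr not in self._name_to_module:
            raise AttributeError(f"module {self.__name__!r} has no attribute {attr!r}")
        module = importlib.import_module(self._name_to_module[attr])
        value = getattr(module, attr)
        setattr(self, attr, value)  # cache so later lookups bypass __getattr__
        return value


# Standard-library modules stand in for `configuration_bark`, `modeling_bark`, etc.
lazy = LazyNamespace("lazy_demo", {"fractions": ["Fraction"], "decimal": ["Decimal"]})
half = lazy.Fraction(1, 2)
```

After the first access, `Fraction` is cached on the namespace while `Decimal` remains unresolved, which mirrors why torch-dependent model classes cost nothing until they are actually used.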
--- a/src/transformers/models/bark/configuration_bark.py
+++ b/src/transformers/models/bark/configuration_bark.py
@@ -0,0 +1,348 @@
+# coding=utf-8
+# Copyright 2023 The Suno AI Authors and The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+""" BARK model configuration"""
+
+import copy
+import os
+from typing import Dict, Optional, Union
+
+from ...configuration_utils import PretrainedConfig
+from ...utils import add_start_docstrings, logging
+from ..auto import AutoConfig
+
+
+logger = logging.get_logger(__name__)
+
+
+BARK_PRETRAINED_CONFIG_ARCHIVE_MAP = {
+    "suno/bark-small": "https://huggingface.co/suno/bark-small/resolve/main/config.json",
+    "suno/bark": "https://huggingface.co/suno/bark/resolve/main/config.json",
+}
+
+BARK_SUBMODELCONFIG_START_DOCSTRING = """
+    This is the configuration class to store the configuration of a [`{model}`]. It is used to instantiate the model
+    according to the specified arguments, defining the model architecture. Instantiating a configuration with the
+    defaults will yield a similar configuration to that of the Bark [suno/bark](https://huggingface.co/suno/bark)
+    architecture.
+
+    Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
+    documentation from [`PretrainedConfig`] for more information.
+
+    Args:
+        block_size (`int`, *optional*, defaults to 1024):
+            The maximum sequence length that this model might ever be used with. Typically set this to something large
+            just in case (e.g., 512 or 1024 or 2048).
+        input_vocab_size (`int`, *optional*, defaults to 10_048):
+            Vocabulary size of a Bark sub-model. Defines the number of different tokens that can be represented by the
+            `input_ids` passed when calling [`{model}`]. Defaults to 10_048, but should be chosen carefully for the
+            given sub-model.
+        output_vocab_size (`int`, *optional*, defaults to 10_048):
+            Output vocabulary size of a Bark sub-model. Defines the number of different tokens that can be represented
+            by the `output_ids` produced by a forward pass of [`{model}`]. Defaults to 10_048, but should be chosen
+            carefully for the given sub-model.
+        num_layers (`int`, *optional*, defaults to 12):
+            Number of hidden layers in the given sub-model.
+        num_heads (`int`, *optional*, defaults to 12):
+            Number of attention heads for each attention layer in the Transformer architecture.
+        hidden_size (`int`, *optional*, defaults to 768):
+            Dimensionality of the embeddings and hidden states (`n_embd` in the original Bark checkpoints).
+        dropout (`float`, *optional*, defaults to 0.0):
+            The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
+        bias (`bool`, *optional*, defaults to `True`):
+            Whether or not to use bias in the linear layers and layer norm layers.
+        initializer_range (`float`, *optional*, defaults to 0.02):
+            The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
+        use_cache (`bool`, *optional*, defaults to `True`):
+            Whether or not the model should return the last key/values attentions (not used by all models).
+"""
+
+
+class BarkSubModelConfig(PretrainedConfig):
+    model_type = "bark_module"
+    keys_to_ignore_at_inference = ["past_key_values"]
+
+    attribute_map = {
+        "num_attention_heads": "num_heads",
+        "num_hidden_layers": "num_layers",
+        "vocab_size": "input_vocab_size",
+        "window_size": "block_size",
+    }
+
+    def __init__(
+        self,
+        block_size=1024,
+        input_vocab_size=10_048,
+        output_vocab_size=10_048,
+        num_layers=12,
+        num_heads=12,
+        hidden_size=768,
+        dropout=0.0,
+        bias=True,  # True: bias in Linears and LayerNorms, like GPT-2. False: a bit better and faster
+        initializer_range=0.02,
+        use_cache=True,
+        **kwargs,
+    ):
+        self.block_size = block_size
+        self.input_vocab_size = input_vocab_size
+        self.output_vocab_size = output_vocab_size
+        self.num_layers = num_layers
+        self.num_heads = num_heads
+        self.hidden_size = hidden_size
+        self.dropout = dropout
+        self.bias = bias
+        self.use_cache = use_cache
+        self.initializer_range = initializer_range
+
+        super().__init__(**kwargs)
+
+    @classmethod
+    def from_pretrained(
+        cls,
+        pretrained_model_name_or_path: Union[str, os.PathLike],
+        cache_dir: Optional[Union[str, os.PathLike]] = None,
+        force_download: bool = False,
+        local_files_only: bool = False,
+        token: Optional[Union[str, bool]] = None,
+        revision: str = "main",
+        **kwargs,
+    ) -> "PretrainedConfig":
+        kwargs["cache_dir"] = cache_dir
+        kwargs["force_download"] = force_download
+        kwargs["local_files_only"] = local_files_only
+        kwargs["revision"] = revision
+
+        cls._set_token_in_kwargs(kwargs, token)
+
+        config_dict, kwargs = cls.get_config_dict(pretrained_model_name_or_path, **kwargs)
+
+        # get the config dict if we are loading from Bark
+        if config_dict.get("model_type") == "bark":
+            config_dict = config_dict[f"{cls.model_type}_config"]
+
+        if "model_type" in config_dict and hasattr(cls, "model_type") and config_dict["model_type"] != cls.model_type:
+            logger.warning(
+                f"You are using a model of type {config_dict['model_type']} to instantiate a model of type "
+                f"{cls.model_type}. This is not supported for all configurations of models and can yield errors."
+            )
+
+        return cls.from_dict(config_dict, **kwargs)
+
+
+@add_start_docstrings(
+    BARK_SUBMODELCONFIG_START_DOCSTRING.format(config="BarkSemanticConfig", model="BarkSemanticModel"),
+    """
+    Example:
+
+    ```python
+    >>> from transformers import BarkSemanticConfig, BarkSemanticModel
+
+    >>> # Initializing a Bark sub-module style configuration
+    >>> configuration = BarkSemanticConfig()
+
+    >>> # Initializing a model (with random weights) from the suno/bark style configuration
+    >>> model = BarkSemanticModel(configuration)
+
+    >>> # Accessing the model configuration
+    >>> configuration = model.config
+    ```""",
+)
+class BarkSemanticConfig(BarkSubModelConfig):
+    model_type = "semantic"
+
+
+@add_start_docstrings(
+    BARK_SUBMODELCONFIG_START_DOCSTRING.format(config="BarkCoarseConfig", model="BarkCoarseModel"),
+    """
+    Example:
+
+    ```python
+    >>> from transformers import BarkCoarseConfig, BarkCoarseModel
+
+    >>> # Initializing a Bark sub-module style configuration
+    >>> configuration = BarkCoarseConfig()
+
+    >>> # Initializing a model (with random weights) from the suno/bark style configuration
+    >>> model = BarkCoarseModel(configuration)
+
+    >>> # Accessing the model configuration
+    >>> configuration = model.config
+    ```""",
+)
+class BarkCoarseConfig(BarkSubModelConfig):
+    model_type = "coarse_acoustics"
+
+
+@add_start_docstrings(
+    BARK_SUBMODELCONFIG_START_DOCSTRING.format(config="BarkFineConfig", model="BarkFineModel"),
+    """
+        n_codes_total (`int`, *optional*, defaults to 8):
+            The total number of audio codebooks predicted. Used in the fine acoustics sub-model.
+        n_codes_given (`int`, *optional*, defaults to 1):
+            The number of audio codebooks predicted in the coarse acoustics sub-model. Used in the acoustics
+            sub-models.
+    Example:
+
+    ```python
+    >>> from transformers import BarkFineConfig, BarkFineModel
+
+    >>> # Initializing a Bark sub-module style configuration
+    >>> configuration = BarkFineConfig()
+
+    >>> # Initializing a model (with random weights) from the suno/bark style configuration
+    >>> model = BarkFineModel(configuration)
+
+    >>> # Accessing the model configuration
+    >>> configuration = model.config
+    ```""",
+)
+class BarkFineConfig(BarkSubModelConfig):
+    model_type = "fine_acoustics"
+
+    def __init__(self, tie_word_embeddings=True, n_codes_total=8, n_codes_given=1, **kwargs):
+        self.n_codes_total = n_codes_total
+        self.n_codes_given = n_codes_given
+
+        super().__init__(tie_word_embeddings=tie_word_embeddings, **kwargs)
+
+
+class BarkConfig(PretrainedConfig):
+    """
+    This is the configuration class to store the configuration of a [`BarkModel`]. It is used to instantiate a Bark
+    model according to the specified sub-models configurations, defining the model architecture.
+
+    Instantiating a configuration with the defaults will yield a similar configuration to that of the Bark
+    [suno/bark](https://huggingface.co/suno/bark) architecture.
+
+    Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
+    documentation from [`PretrainedConfig`] for more information.
+
+    Args:
+        semantic_config ([`BarkSemanticConfig`], *optional*):
+            Configuration of the underlying semantic sub-model.
+        coarse_acoustics_config ([`BarkCoarseConfig`], *optional*):
+            Configuration of the underlying coarse acoustics sub-model.
+        fine_acoustics_config ([`BarkFineConfig`], *optional*):
+            Configuration of the underlying fine acoustics sub-model.
+        codec_config ([`AutoConfig`], *optional*):
+            Configuration of the underlying codec sub-model.
+
+    Example:
+
+    ```python
+    >>> from transformers import (
+    ...     BarkSemanticConfig,
+    ...     BarkCoarseConfig,
+    ...     BarkFineConfig,
+    ...     BarkModel,
+    ...     BarkConfig,
+    ...     AutoConfig,
+    ... )
+
+    >>> # Initializing Bark sub-modules configurations.
+    >>> semantic_config = BarkSemanticConfig()
+    >>> coarse_acoustics_config = BarkCoarseConfig()
+    >>> fine_acoustics_config = BarkFineConfig()
+    >>> codec_config = AutoConfig.from_pretrained("facebook/encodec_24khz")
+
+
+    >>> # Initializing a Bark module style configuration
+    >>> configuration = BarkConfig.from_sub_model_configs(
+    ...     semantic_config, coarse_acoustics_config, fine_acoustics_config, codec_config
+    ... )
+
+    >>> # Initializing a model (with random weights)
+    >>> model = BarkModel(configuration)
+
+    >>> # Accessing the model configuration
+    >>> configuration = model.config
+    ```
+    """
+
+    model_type = "bark"
+    is_composition = True
+
+    def __init__(
+        self,
+        semantic_config: Dict = None,
+        coarse_acoustics_config: Dict = None,
+        fine_acoustics_config: Dict = None,
+        codec_config: Dict = None,
+        initializer_range=0.02,
+        **kwargs,
+    ):
+        if semantic_config is None:
+            semantic_config = {}
+            logger.info("semantic_config is None. initializing the semantic model with default values.")
+
+        if coarse_acoustics_config is None:
+            coarse_acoustics_config = {}
+            logger.info("coarse_acoustics_config is None. initializing the coarse model with default values.")
+
+        if fine_acoustics_config is None:
+            fine_acoustics_config = {}
+            logger.info("fine_acoustics_config is None. initializing the fine model with default values.")
+
+        if codec_config is None:
+            codec_config = {}
+            logger.info("codec_config is None. initializing the codec model with default values.")
+
+        self.semantic_config = BarkSemanticConfig(**semantic_config)
+        self.coarse_acoustics_config = BarkCoarseConfig(**coarse_acoustics_config)
+        self.fine_acoustics_config = BarkFineConfig(**fine_acoustics_config)
+        self.codec_config = AutoConfig.for_model(**codec_config)
+
+        self.initializer_range = initializer_range
+
+        super().__init__(**kwargs)
+
+    @classmethod
+    def from_sub_model_configs(
+        cls,
+        semantic_config: BarkSemanticConfig,
+        coarse_acoustics_config: BarkCoarseConfig,
+        fine_acoustics_config: BarkFineConfig,
+        codec_config: AutoConfig,
+        **kwargs,
+    ):
+        r"""
+        Instantiate a [`BarkConfig`] (or a derived class) from Bark sub-model configurations.
+
+        Returns:
+            [`BarkConfig`]: An instance of a configuration object
+        """
+        return cls(
+            semantic_config=semantic_config.to_dict(),
+            coarse_acoustics_config=coarse_acoustics_config.to_dict(),
+            fine_acoustics_config=fine_acoustics_config.to_dict(),
+            codec_config=codec_config.to_dict(),
+            **kwargs,
+        )
+
+    def to_dict(self):
+        """
+        Serializes this instance to a Python dictionary. Overrides the default [`~PretrainedConfig.to_dict`].
+
+        Returns:
+            `Dict[str, Any]`: Dictionary of all the attributes that make up this configuration instance.
+        """
+        output = copy.deepcopy(self.__dict__)
+
+        output["semantic_config"] = self.semantic_config.to_dict()
+        output["coarse_acoustics_config"] = self.coarse_acoustics_config.to_dict()
+        output["fine_acoustics_config"] = self.fine_acoustics_config.to_dict()
+        output["codec_config"] = self.codec_config.to_dict()
+
+        output["model_type"] = self.__class__.model_type
+        return output
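Note on the composite configuration: `BarkConfig` rebuilds sub-config objects from plain dicts, `to_dict` flattens them back, and each sub-config's `from_pretrained` picks its own entry out of a composite dict via `f"{cls.model_type}_config"`. The sketch below mimics that round trip with plain classes. `SubConfig`, `CompositeConfig`, `from_config_dict` and the `"composite"` type string are hypothetical stand-ins for this note, not the library's classes.

```python
import copy
import json


class SubConfig:
    """Hypothetical stand-in for a Bark sub-model config."""

    model_type = "semantic"

    def __init__(self, num_layers=12, hidden_size=768):
        self.num_layers = num_layers
        self.hidden_size = hidden_size

    def to_dict(self):
        return {"model_type": self.model_type, **copy.deepcopy(self.__dict__)}

    @classmethod
    def from_config_dict(cls, config_dict):
        # Mirrors the branch in `from_pretrained`: when handed a composite config,
        # select the nested entry named after this class's `model_type`.
        if config_dict.get("model_type") == "composite":
            config_dict = config_dict[f"{cls.model_type}_config"]
        kwargs = {k: v for k, v in config_dict.items() if k != "model_type"}
        return cls(**kwargs)


class CompositeConfig:
    """Hypothetical stand-in for the top-level config holding sub-config objects."""

    model_type = "composite"

    def __init__(self, semantic_config=None):
        # Sub-configs arrive as plain dicts and are rebuilt into objects.
        self.semantic_config = SubConfig.from_config_dict(semantic_config or {})

    def to_dict(self):
        output = copy.deepcopy(self.__dict__)
        # Swap the nested object for its dict form so the result is JSON-serializable.
        output["semantic_config"] = self.semantic_config.to_dict()
        output["model_type"] = self.model_type
        return output


composite = CompositeConfig(semantic_config={"num_layers": 24, "hidden_size": 1024})
serialized = composite.to_dict()
as_json = json.dumps(serialized)
# A sub-config can be loaded directly from the composite dictionary.
restored = SubConfig.from_config_dict(serialized)
```

Overriding `to_dict` matters because the default deep copy would leave config objects inside the dictionary, and those cannot be written to `config.json`.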
--- a/src/transformers/models/bark/convert_suno_to_hf.py
+++ b/src/transformers/models/bark/convert_suno_to_hf.py
@@ -0,0 +1,262 @@
+"""Convert Bark checkpoint."""
+import argparse
+import os
+from pathlib import Path
+
+import torch
+from bark.generation import _load_model as _bark_load_model
+from huggingface_hub import hf_hub_download
+
+from transformers import EncodecConfig, EncodecModel, set_seed
+from transformers.models.bark.configuration_bark import (
+    BarkCoarseConfig,
+    BarkConfig,
+    BarkFineConfig,
+    BarkSemanticConfig,
+)
+from transformers.models.bark.generation_configuration_bark import (
+    BarkCoarseGenerationConfig,
+    BarkFineGenerationConfig,
+    BarkGenerationConfig,
+    BarkSemanticGenerationConfig,
+)
+from transformers.models.bark.modeling_bark import BarkCoarseModel, BarkFineModel, BarkModel, BarkSemanticModel
+from transformers.utils import logging
+
+
+logging.set_verbosity_info()
+logger = logging.get_logger(__name__)
+
+set_seed(770)
+
+
+new_layer_name_dict = {
+    "c_attn": "att_proj",
+    "c_proj": "out_proj",
+    "c_fc": "in_proj",
+    "transformer.": "",
+    "h.": "layers.",
+    "ln_1": "layernorm_1",
+    "ln_2": "layernorm_2",
+    "ln_f": "layernorm_final",
+    "wpe": "position_embeds_layer",
+    "wte": "input_embeds_layer",
+}
+
+
+REMOTE_MODEL_PATHS = {
+    "text_small": {
+        "repo_id": "suno/bark",
+        "file_name": "text.pt",
+    },
+    "coarse_small": {
+        "repo_id": "suno/bark",
+        "file_name": "coarse.pt",
+    },
+    "fine_small": {
+        "repo_id": "suno/bark",
+        "file_name": "fine.pt",
+    },
+    "text": {
+        "repo_id": "suno/bark",
+        "file_name": "text_2.pt",
+    },
+    "coarse": {
+        "repo_id": "suno/bark",
+        "file_name": "coarse_2.pt",
+    },
+    "fine": {
+        "repo_id": "suno/bark",
+        "file_name": "fine_2.pt",
+    },
+}
+
+CUR_PATH = os.path.dirname(os.path.abspath(__file__))
+default_cache_dir = os.path.join(os.path.expanduser("~"), ".cache")
+CACHE_DIR = os.path.join(os.getenv("XDG_CACHE_HOME", default_cache_dir), "suno", "bark_v0")
+
+
+def _get_ckpt_path(model_type, use_small=False):
+    key = model_type
+    if use_small:
+        key += "_small"
+    return os.path.join(CACHE_DIR, REMOTE_MODEL_PATHS[key]["file_name"])
+
+
+def _download(from_hf_path, file_name):
+    os.makedirs(CACHE_DIR, exist_ok=True)
+    hf_hub_download(repo_id=from_hf_path, filename=file_name, local_dir=CACHE_DIR)
+
+
+def _load_model(ckpt_path, device, use_small=False, model_type="text"):
+    if model_type == "text":
+        ModelClass = BarkSemanticModel
+        ConfigClass = BarkSemanticConfig
+        GenerationConfigClass = BarkSemanticGenerationConfig
+    elif model_type == "coarse":
+        ModelClass = BarkCoarseModel
+        ConfigClass = BarkCoarseConfig
+        GenerationConfigClass = BarkCoarseGenerationConfig
+    elif model_type == "fine":
+        ModelClass = BarkFineModel
+        ConfigClass = BarkFineConfig
+        GenerationConfigClass = BarkFineGenerationConfig
+    else:
+        raise NotImplementedError()
+    model_key = f"{model_type}_small" if use_small else model_type
+    model_info = REMOTE_MODEL_PATHS[model_key]
+    if not os.path.exists(ckpt_path):
+        logger.info(f"{model_type} model not found, downloading into `{CACHE_DIR}`.")
+        _download(model_info["repo_id"], model_info["file_name"])
+    checkpoint = torch.load(ckpt_path, map_location=device)
+    # hack: some checkpoints only store `vocab_size`; use it for both the input and output vocabularies
+    model_args = checkpoint["model_args"]
+    if "input_vocab_size" not in model_args:
+        model_args["input_vocab_size"] = model_args["vocab_size"]
+        model_args["output_vocab_size"] = model_args["vocab_size"]
+        del model_args["vocab_size"]
+
+    # convert Bark model arguments to HF Bark model arguments
+    model_args["num_heads"] = model_args.pop("n_head")
+    model_args["hidden_size"] = model_args.pop("n_embd")
+    model_args["num_layers"] = model_args.pop("n_layer")
+
+    model_config = ConfigClass(**checkpoint["model_args"])
+    model = ModelClass(config=model_config)
+    model_generation_config = GenerationConfigClass()
+
+    model.generation_config = model_generation_config
+    state_dict = checkpoint["model"]
+    # fixup checkpoint
+    unwanted_prefix = "_orig_mod."
+    for k, v in list(state_dict.items()):
+        if k.startswith(unwanted_prefix):
+            # replace part of the key with corresponding layer name in HF implementation
+            new_k = k[len(unwanted_prefix) :]
+            for old_layer_name in new_layer_name_dict:
+                new_k = new_k.replace(old_layer_name, new_layer_name_dict[old_layer_name])
+
+            state_dict[new_k] = state_dict.pop(k)
+
+    extra_keys = set(state_dict.keys()) - set(model.state_dict().keys())
+    extra_keys = {k for k in extra_keys if not k.endswith(".attn.bias")}
+    missing_keys = set(model.state_dict().keys()) - set(state_dict.keys())
+    missing_keys = {k for k in missing_keys if not k.endswith(".attn.bias")}
+    if len(extra_keys) != 0:
+        raise ValueError(f"extra keys found: {extra_keys}")
+    if len(missing_keys) != 0:
+        raise ValueError(f"missing keys: {missing_keys}")
+    model.load_state_dict(state_dict, strict=False)
+    n_params = model.num_parameters(exclude_embeddings=True)
+    val_loss = checkpoint["best_val_loss"].item()
+    logger.info(f"model loaded: {round(n_params/1e6,1)}M params, {round(val_loss,3)} loss")
+    model.eval()
+    model.to(device)
+    del checkpoint, state_dict
+
+    return model
+
+
+def load_model(pytorch_dump_folder_path, use_small=False, model_type="text"):
+    if model_type not in ("text", "coarse", "fine"):
+        raise NotImplementedError()
+
+    device = "cpu"  # do conversion on cpu
+
+    ckpt_path = _get_ckpt_path(model_type, use_small=use_small)
+    model = _load_model(ckpt_path, device, model_type=model_type, use_small=use_small)
+
+    # load bark initial model
+    bark_model = _bark_load_model(ckpt_path, "cpu", model_type=model_type, use_small=use_small)
+
+    if model_type == "text":
+        bark_model = bark_model["model"]
+
+    if model.num_parameters(exclude_embeddings=True) != bark_model.get_num_params():
+        raise ValueError("initial and new models don't have the same number of parameters")
+
+    # check if same output as the bark model
+    batch_size = 5
+    sequence_length = 10
+
+    if model_type in ["text", "coarse"]:
+        vec = torch.randint(256, (batch_size, sequence_length), dtype=torch.int)
+        output_old_model = bark_model(vec)[0]
+
+        output_new_model_total = model(vec)
+
+        # take last logits
+        output_new_model = output_new_model_total.logits[:, [-1], :]
+
+    else:
+        prediction_codebook_channel = 3
+        n_codes_total = 8
+        vec = torch.randint(256, (batch_size, sequence_length, n_codes_total), dtype=torch.int)
+
+        output_new_model_total = model(prediction_codebook_channel, vec)
+        output_old_model = bark_model(prediction_codebook_channel, vec)
+
+        output_new_model = output_new_model_total.logits
+
+    # any remaining output difference should only come from the different self-attention implementations
+    if output_new_model.shape != output_old_model.shape:
+        raise ValueError("initial and new outputs don't have the same shape")
+    if (output_new_model - output_old_model).abs().max().item() > 1e-3:
+        raise ValueError("initial and new outputs are not equal")
+
+    Path(pytorch_dump_folder_path).mkdir(exist_ok=True)
+    model.save_pretrained(pytorch_dump_folder_path)
+
+
+def load_whole_bark_model(
+    semantic_path,
+    coarse_path,
+    fine_path,
+    append_text,
+    hub_path,
+    folder_path,
+):
+    pytorch_dump_folder_path = os.path.join(folder_path, append_text)
+
+    semanticConfig = BarkSemanticConfig.from_pretrained(os.path.join(semantic_path, "config.json"))
+    coarseAcousticConfig = BarkCoarseConfig.from_pretrained(os.path.join(coarse_path, "config.json"))
+    fineAcousticConfig = BarkFineConfig.from_pretrained(os.path.join(fine_path, "config.json"))
+    codecConfig = EncodecConfig.from_pretrained("facebook/encodec_24khz")
+
+    semantic = BarkSemanticModel.from_pretrained(semantic_path)
+    coarseAcoustic = BarkCoarseModel.from_pretrained(coarse_path)
+    fineAcoustic = BarkFineModel.from_pretrained(fine_path)
+    codec = EncodecModel.from_pretrained("facebook/encodec_24khz")
+
+    bark_config = BarkConfig.from_sub_model_configs(
+        semanticConfig, coarseAcousticConfig, fineAcousticConfig, codecConfig
+    )
+
+    bark_generation_config = BarkGenerationConfig.from_sub_model_configs(
+        semantic.generation_config, coarseAcoustic.generation_config, fineAcoustic.generation_config
+    )
+
+    bark = BarkModel(bark_config)
+
+    bark.semantic = semantic
+    bark.coarse_acoustics = coarseAcoustic
+    bark.fine_acoustics = fineAcoustic
+    bark.codec_model = codec
+
+    bark.generation_config = bark_generation_config
+
+    Path(pytorch_dump_folder_path).mkdir(exist_ok=True)
+    bark.save_pretrained(pytorch_dump_folder_path, repo_id=hub_path, push_to_hub=True)
+
+
+if __name__ == "__main__":
+    parser = argparse.ArgumentParser()
+    # Required parameters
+
+    parser.add_argument("model_type", type=str, help="text, coarse or fine.")
+    parser.add_argument("pytorch_dump_folder_path", default=None, type=str, help="Path to the output PyTorch model.")
+    parser.add_argument("--is_small", action="store_true", help="convert the small version instead of the large.")
+
+    args = parser.parse_args()
+
+    load_model(args.pytorch_dump_folder_path, model_type=args.model_type, use_small=args.is_small)
--- a/src/transformers/models/bark/generation_configuration_bark.py
+++ b/src/transformers/models/bark/generation_configuration_bark.py
@@ -0,0 +1,318 @@
+# coding=utf-8
+# Copyright 2023 The Suno AI Authors and The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+""" BARK model generation configuration"""
+
+import copy
+from typing import Dict
+
+from ...generation.configuration_utils import GenerationConfig
+from ...utils import logging
+
+
+logger = logging.get_logger(__name__)
+
+
+class BarkSemanticGenerationConfig(GenerationConfig):
+    model_type = "semantic"
+
+    def __init__(
+        self,
+        eos_token_id=10_000,
+        renormalize_logits=True,
+        max_new_tokens=768,
+        output_scores=False,
+        return_dict_in_generate=False,
+        output_hidden_states=False,
+        output_attentions=False,
+        temperature=0.7,
+        do_sample=True,
+        text_encoding_offset=10_048,
+        text_pad_token=129_595,
+        semantic_infer_token=129_599,
+        semantic_vocab_size=10_000,
+        max_input_semantic_length=256,
+        semantic_rate_hz=49.9,
+        **kwargs,
+    ):
+        """Class that holds a generation configuration for [`BarkSemanticModel`].
+
+        This configuration inherits from [`GenerationConfig`] and can be used to control the model generation. Read the
+        documentation from [`GenerationConfig`] for more information.
+
+        Args:
+            eos_token_id (`int`, *optional*, defaults to 10_000):
+                The id of the *end-of-sequence* token.
+            renormalize_logits (`bool`, *optional*, defaults to `True`):
+                Whether to renormalize the logits after applying all the logits processors or warpers (including the
+                custom ones). It's highly recommended to set this flag to `True` as the search algorithms suppose the
+                score logits are normalized but some logit processors or warpers break the normalization.
+            max_new_tokens (`int`, *optional*, defaults to 768):
+                The maximum number of tokens to generate, ignoring the number of tokens in the prompt.
+            output_scores (`bool`, *optional*, defaults to `False`):
+                Whether or not to return the prediction scores. See `scores` under returned tensors for more details.
+            return_dict_in_generate (`bool`, *optional*, defaults to `False`):
+                Whether or not to return a [`~utils.ModelOutput`] instead of a plain tuple.
+            output_hidden_states (`bool`, *optional*, defaults to `False`):
+                Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors
+                for more details.
+            output_attentions (`bool`, *optional*, defaults to `False`):
+                Whether or not to return the attentions tensors of all attention layers. See `attentions` under
+                returned tensors for more details.
+            temperature (`float`, *optional*, defaults to 0.7):
+                The value used to modulate the next token probabilities.
+            do_sample (`bool`, *optional*, defaults to `True`):
+                Whether or not to use sampling; use greedy decoding otherwise.
+            text_encoding_offset (`int`, *optional*, defaults to 10_048):
+                Text encoding offset.
+            text_pad_token (`int`, *optional*, defaults to 129_595):
+                Text pad token.
+            semantic_infer_token (`int`, *optional*, defaults to 129_599):
+                Semantic infer token.
+            semantic_vocab_size (`int`, *optional*, defaults to 10_000):
+                Semantic vocab size.
+            max_input_semantic_length (`int`, *optional*, defaults to 256):
+                Max length of semantic input vector.
+            semantic_rate_hz (`float`, *optional*, defaults to 49.9):
+                Semantic rate in Hertz.
+        """
+        super().__init__(
+            temperature=temperature,
+            do_sample=do_sample,
+            eos_token_id=eos_token_id,
+            renormalize_logits=renormalize_logits,
+            max_new_tokens=max_new_tokens,
+            output_scores=output_scores,
+            return_dict_in_generate=return_dict_in_generate,
+            output_hidden_states=output_hidden_states,
+            output_attentions=output_attentions,
+            **kwargs,
+        )
+
+        self.text_encoding_offset = text_encoding_offset
+        self.text_pad_token = text_pad_token
+        self.semantic_pad_token = eos_token_id
+        self.semantic_infer_token = semantic_infer_token
+        self.semantic_vocab_size = semantic_vocab_size
+        self.max_input_semantic_length = max_input_semantic_length
+        self.semantic_rate_hz = semantic_rate_hz
+
+
+class BarkCoarseGenerationConfig(GenerationConfig):
+    model_type = "coarse_acoustics"
+
+    def __init__(
+        self,
+        renormalize_logits=True,
+        output_scores=False,
+        return_dict_in_generate=False,
+        output_hidden_states=False,
+        output_attentions=False,
+        temperature=0.7,
+        do_sample=True,
+        coarse_semantic_pad_token=12_048,
+        coarse_rate_hz=75,
+        n_coarse_codebooks=2,
+        coarse_infer_token=12_050,
+        max_coarse_input_length=256,
+        max_coarse_history: int = 630,
+        sliding_window_len: int = 60,
+        **kwargs,
+    ):
+        """Class that holds a generation configuration for [`BarkCoarseModel`].
+
+        This configuration inherits from [`GenerationConfig`] and can be used to control the model generation. Read the
+        documentation from [`GenerationConfig`] for more information.
+
+        Args:
+            renormalize_logits (`bool`, *optional*, defaults to `True`):
+                Whether to renormalize the logits after applying all the logits processors or warpers (including the
+                custom ones). It's highly recommended to set this flag to `True` as the search algorithms suppose the
+                score logits are normalized but some logit processors or warpers break the normalization.
+            output_scores (`bool`, *optional*, defaults to `False`):
+                Whether or not to return the prediction scores. See `scores` under returned tensors for more details.
+            return_dict_in_generate (`bool`, *optional*, defaults to `False`):
+                Whether or not to return a [`~utils.ModelOutput`] instead of a plain tuple.
+            output_hidden_states (`bool`, *optional*, defaults to `False`):
+                Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors
+                for more details.
+            output_attentions (`bool`, *optional*, defaults to `False`):
+                Whether or not to return the attentions tensors of all attention layers. See `attentions` under
+                returned tensors for more details.
+            temperature (`float`, *optional*, defaults to 0.7):
+                The value used to modulate the next token probabilities.
+            do_sample (`bool`, *optional*, defaults to `True`):
+                Whether or not to use sampling; use greedy decoding otherwise.
+            coarse_semantic_pad_token (`int`, *optional*, defaults to 12_048):
+                Coarse semantic pad token.
+            coarse_rate_hz (`int`, *optional*, defaults to 75):
+                Coarse rate in Hertz.
+            n_coarse_codebooks (`int`, *optional*, defaults to 2):
+                Number of coarse codebooks.
+            coarse_infer_token (`int`, *optional*, defaults to 12_050):
+                Coarse infer token.
+            max_coarse_input_length (`int`, *optional*, defaults to 256):
+                Max length of input coarse vector.
+            max_coarse_history (`int`, *optional*, defaults to 630):
+                Max length of the output of the coarse acoustics model used in the fine generation step.
+            sliding_window_len (`int`, *optional*, defaults to 60):
+                The coarse generation step uses a sliding window to generate raw audio.
+        """
+        super().__init__(
+            temperature=temperature,
+            do_sample=do_sample,
+            renormalize_logits=renormalize_logits,
+            output_scores=output_scores,
+            return_dict_in_generate=return_dict_in_generate,
+            output_hidden_states=output_hidden_states,
+            output_attentions=output_attentions,
+            **kwargs,
+        )
+
+        self.coarse_semantic_pad_token = coarse_semantic_pad_token
+        self.coarse_rate_hz = coarse_rate_hz
+        self.n_coarse_codebooks = n_coarse_codebooks
+        self.coarse_infer_token = coarse_infer_token
+        self.max_coarse_input_length = max_coarse_input_length
+        self.max_coarse_history = max_coarse_history
+        self.sliding_window_len = sliding_window_len
+
+
+class BarkFineGenerationConfig(GenerationConfig):
+    model_type = "fine_acoustics"
+
+    def __init__(
+        self,
+        temperature=0.5,
+        max_fine_history_length=512,
+        max_fine_input_length=1024,
+        n_fine_codebooks=8,
+        **kwargs,
+    ):
+        """Class that holds a generation configuration for [`BarkFineModel`].
+
+        [`BarkFineModel`] is an autoencoder model, so it should not usually be used for generation. However, under the
+        hood, it uses `temperature` when used by [`BarkModel`].
+
+        This configuration inherits from [`GenerationConfig`] and can be used to control the model generation. Read the
+        documentation from [`GenerationConfig`] for more information.
+
+        Args:
+            temperature (`float`, *optional*, defaults to 0.5):
+                The value used to modulate the next token probabilities.
+            max_fine_history_length (`int`, *optional*, defaults to 512):
+                Max length of the fine history vector.
+            max_fine_input_length (`int`, *optional*, defaults to 1024):
+                Max length of fine input vector.
+            n_fine_codebooks (`int`, *optional*, defaults to 8):
+                Number of codebooks used.
+        """
+        super().__init__(temperature=temperature)
+
+        self.max_fine_history_length = max_fine_history_length
+        self.max_fine_input_length = max_fine_input_length
+        self.n_fine_codebooks = n_fine_codebooks
+
+
+class BarkGenerationConfig(GenerationConfig):
+    model_type = "bark"
+    is_composition = True
+
+    # TODO (joao): nested from_dict
+
+    def __init__(
+        self,
+        semantic_config: Dict = None,
+        coarse_acoustics_config: Dict = None,
+        fine_acoustics_config: Dict = None,
+        sample_rate=24_000,
+        codebook_size=1024,
+        **kwargs,
+    ):
+        """Class that holds a generation configuration for [`BarkModel`].
+
+        The [`BarkModel`] does not have a `generate` method, but uses this class to generate speech with a nested
+        [`BarkGenerationConfig`] which uses [`BarkSemanticGenerationConfig`], [`BarkCoarseGenerationConfig`],
+        [`BarkFineGenerationConfig`].
+
+        This configuration inherits from [`GenerationConfig`] and can be used to control the model generation. Read the
+        documentation from [`GenerationConfig`] for more information.
+
+        Args:
+            semantic_config (`Dict`, *optional*):
+                Semantic generation configuration.
+            coarse_acoustics_config (`Dict`, *optional*):
+                Coarse generation configuration.
+            fine_acoustics_config (`Dict`, *optional*):
+                Fine generation configuration.
+            sample_rate (`int`, *optional*, defaults to 24_000):
+                Sample rate.
+            codebook_size (`int`, *optional*, defaults to 1024):
+                Vector length for each codebook.
+        """
+        if semantic_config is None:
+            semantic_config = {}
+            logger.info("semantic_config is None. initializing the semantic model with default values.")
+
+        if coarse_acoustics_config is None:
+            coarse_acoustics_config = {}
+            logger.info("coarse_acoustics_config is None. initializing the coarse model with default values.")
+
+        if fine_acoustics_config is None:
+            fine_acoustics_config = {}
+            logger.info("fine_acoustics_config is None. initializing the fine model with default values.")
+
+        self.semantic_config = BarkSemanticGenerationConfig(**semantic_config)
+        self.coarse_acoustics_config = BarkCoarseGenerationConfig(**coarse_acoustics_config)
+        self.fine_acoustics_config = BarkFineGenerationConfig(**fine_acoustics_config)
+
+        self.sample_rate = sample_rate
+        self.codebook_size = codebook_size
+
+    @classmethod
+    def from_sub_model_configs(
+        cls,
+        semantic_config: BarkSemanticGenerationConfig,
+        coarse_acoustics_config: BarkCoarseGenerationConfig,
+        fine_acoustics_config: BarkFineGenerationConfig,
+        **kwargs,
+    ):
+        r"""
+        Instantiate a [`BarkGenerationConfig`] (or a derived class) from bark sub-models generation configuration.
+
+        Returns:
+            [`BarkGenerationConfig`]: An instance of a configuration object
+        """
+        return cls(
+            semantic_config=semantic_config.to_dict(),
+            coarse_acoustics_config=coarse_acoustics_config.to_dict(),
+            fine_acoustics_config=fine_acoustics_config.to_dict(),
+            **kwargs,
+        )
+
+    def to_dict(self):
+        """
+        Serializes this instance to a Python dictionary. Override the default [`~PretrainedConfig.to_dict`].
+
+        Returns:
+            `Dict[str, any]`: Dictionary of all the attributes that make up this configuration instance.
+        """
+        output = copy.deepcopy(self.__dict__)
+
+        output["semantic_config"] = self.semantic_config.to_dict()
+        output["coarse_acoustics_config"] = self.coarse_acoustics_config.to_dict()
+        output["fine_acoustics_config"] = self.fine_acoustics_config.to_dict()
+
+        output["model_type"] = self.__class__.model_type
+        return output
--- a/src/transformers/models/bark/modeling_bark.py
+++ b/src/transformers/models/bark/modeling_bark.py
--- a/src/transformers/models/bark/processing_bark.py
+++ b/src/transformers/models/bark/processing_bark.py
@@ -0,0 +1,286 @@
+# coding=utf-8
+# Copyright 2023 The Suno AI Authors and The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""
+Processor class for Bark
+"""
+import json
+import os
+from typing import Optional
+
+import numpy as np
+
+from ...feature_extraction_utils import BatchFeature
+from ...processing_utils import ProcessorMixin
+from ...utils import logging
+from ...utils.hub import get_file_from_repo
+from ..auto import AutoTokenizer
+
+
+logger = logging.get_logger(__name__)
+
+
+class BarkProcessor(ProcessorMixin):
+    r"""
+    Constructs a Bark processor which wraps a text tokenizer and optional Bark voice presets into a single processor.
+
+    Args:
+        tokenizer ([`PreTrainedTokenizer`]):
+            An instance of [`PreTrainedTokenizer`].
+        speaker_embeddings (`Dict[Dict[str]]`, *optional*, defaults to `None`):
+            Optional nested speaker embeddings dictionary. The first level contains voice preset names (e.g.
+            `"en_speaker_4"`). The second level contains `"semantic_prompt"`, `"coarse_prompt"` and `"fine_prompt"`
+            embeddings. The values correspond to the path of the corresponding `np.ndarray`. See
+            [here](https://suno-ai.notion.site/8b8e8749ed514b0cbf3f699013548683?v=bc67cff786b04b50b3ceb756fd05f68c) for
+            a list of `voice_preset_names`.
+
+    """
+    tokenizer_class = "AutoTokenizer"
+    attributes = ["tokenizer"]
+
+    preset_shape = {
+        "semantic_prompt": 1,
+        "coarse_prompt": 2,
+        "fine_prompt": 2,
+    }
+
+    def __init__(self, tokenizer, speaker_embeddings=None):
+        super().__init__(tokenizer)
+
+        self.speaker_embeddings = speaker_embeddings
+
+    @classmethod
+    def from_pretrained(
+        cls, pretrained_processor_name_or_path, speaker_embeddings_dict_path="speaker_embeddings_path.json", **kwargs
+    ):
+        r"""
+        Instantiate a Bark processor associated with a pretrained model.
+
+        Args:
+            pretrained_processor_name_or_path (`str` or `os.PathLike`):
+                This can be either:
+
+                - a string, the *model id* of a pretrained [`BarkProcessor`] hosted inside a model repo on
+                  huggingface.co. Valid model ids can be located at the root-level, like `bert-base-uncased`, or
+                  namespaced under a user or organization name, like `dbmdz/bert-base-german-cased`.
+                - a path to a *directory* containing a processor saved using the [`~BarkProcessor.save_pretrained`]
+                  method, e.g., `./my_model_directory/`.
+            speaker_embeddings_dict_path (`str`, *optional*, defaults to `"speaker_embeddings_path.json"`):
+                The name of the `.json` file containing the speaker_embeddings dictionary located in
+                `pretrained_processor_name_or_path`. If `None`, no speaker embeddings are loaded.
+            **kwargs
+                Additional keyword arguments passed along to
+                [`~tokenization_utils_base.PreTrainedTokenizer.from_pretrained`].
+        """
+
+        if speaker_embeddings_dict_path is not None:
+            speaker_embeddings_path = get_file_from_repo(
+                pretrained_processor_name_or_path,
+                speaker_embeddings_dict_path,
+                subfolder=kwargs.pop("subfolder", None),
+                cache_dir=kwargs.pop("cache_dir", None),
+                force_download=kwargs.pop("force_download", False),
+                proxies=kwargs.pop("proxies", None),
+                resume_download=kwargs.pop("resume_download", False),
+                local_files_only=kwargs.pop("local_files_only", False),
+                use_auth_token=kwargs.pop("use_auth_token", None),
+                revision=kwargs.pop("revision", None),
+            )
+            if speaker_embeddings_path is None:
+                logger.warning(
+                    f"""`{os.path.join(pretrained_processor_name_or_path, speaker_embeddings_dict_path)}` does not exist
+                    , no preloaded speaker embeddings will be used - Make sure to provide a correct path to the json
+                    dictionary if wanted, otherwise set `speaker_embeddings_dict_path=None`."""
+                )
+                speaker_embeddings = None
+            else:
+                with open(speaker_embeddings_path) as speaker_embeddings_json:
+                    speaker_embeddings = json.load(speaker_embeddings_json)
+        else:
+            speaker_embeddings = None
+
+        tokenizer = AutoTokenizer.from_pretrained(pretrained_processor_name_or_path, **kwargs)
+
+        return cls(tokenizer=tokenizer, speaker_embeddings=speaker_embeddings)
+
+    def save_pretrained(
+        self,
+        save_directory,
+        speaker_embeddings_dict_path="speaker_embeddings_path.json",
+        speaker_embeddings_directory="speaker_embeddings",
+        push_to_hub: bool = False,
+        **kwargs,
+    ):
+        """
+        Saves the attributes of this processor (tokenizer...) in the specified directory so that it can be reloaded
+        using the [`~BarkProcessor.from_pretrained`] method.
+
+        Args:
+            save_directory (`str` or `os.PathLike`):
+                Directory where the tokenizer files and the speaker embeddings will be saved (directory will be created
+                if it does not exist).
+            speaker_embeddings_dict_path (`str`, *optional*, defaults to `"speaker_embeddings_path.json"`):
+                The name of the `.json` file that will contain the speaker_embeddings nested path dictionary, if it
+                exists, and that will be located in `save_directory`.
+            speaker_embeddings_directory (`str`, *optional*, defaults to `"speaker_embeddings"`):
+                The name of the folder in which the speaker_embeddings arrays will be saved.
+            push_to_hub (`bool`, *optional*, defaults to `False`):
+                Whether or not to push your model to the Hugging Face model hub after saving it. You can specify the
+                repository you want to push to with `repo_id` (will default to the name of `save_directory` in your
+                namespace).
+            kwargs:
+                Additional keyword arguments passed along to the [`~utils.PushToHubMixin.push_to_hub`] method.
+        """
+        if self.speaker_embeddings is not None:
+            os.makedirs(os.path.join(save_directory, speaker_embeddings_directory, "v2"), exist_ok=True)
+
+            embeddings_dict = {}
+
+            embeddings_dict["repo_or_path"] = save_directory
+
+            for prompt_key in self.speaker_embeddings:
+                if prompt_key != "repo_or_path":
+                    voice_preset = self._load_voice_preset(prompt_key)
+
+                    tmp_dict = {}
+                    for key in self.speaker_embeddings[prompt_key]:
+                        np.save(
+                            os.path.join(
+                                embeddings_dict["repo_or_path"], speaker_embeddings_directory, f"{prompt_key}_{key}"
+                            ),
+                            voice_preset[key],
+                            allow_pickle=False,
+                        )
+                        tmp_dict[key] = os.path.join(speaker_embeddings_directory, f"{prompt_key}_{key}.npy")
+
+                    embeddings_dict[prompt_key] = tmp_dict
+
+            with open(os.path.join(save_directory, speaker_embeddings_dict_path), "w") as fp:
+                json.dump(embeddings_dict, fp)
+
+        super().save_pretrained(save_directory, push_to_hub, **kwargs)
+
+    def _load_voice_preset(self, voice_preset: str = None, **kwargs):
+        voice_preset_paths = self.speaker_embeddings[voice_preset]
+
+        voice_preset_dict = {}
+        for key in ["semantic_prompt", "coarse_prompt", "fine_prompt"]:
+            if key not in voice_preset_paths:
+                raise ValueError(
+                    f"Voice preset unrecognized, missing {key} as a key in self.speaker_embeddings[{voice_preset}]."
+                )
+
+            path = get_file_from_repo(
+                self.speaker_embeddings.get("repo_or_path", "/"),
+                voice_preset_paths[key],
+                subfolder=kwargs.pop("subfolder", None),
+                cache_dir=kwargs.pop("cache_dir", None),
+                force_download=kwargs.pop("force_download", False),
+                proxies=kwargs.pop("proxies", None),
+                resume_download=kwargs.pop("resume_download", False),
+                local_files_only=kwargs.pop("local_files_only", False),
+                use_auth_token=kwargs.pop("use_auth_token", None),
+                revision=kwargs.pop("revision", None),
+            )
+            if path is None:
+                raise ValueError(
+                    f"""`{os.path.join(self.speaker_embeddings.get("repo_or_path", "/"), voice_preset_paths[key])}` does not exist
+                    , no preloaded voice preset will be used - Make sure to provide correct paths to the {voice_preset}
+                    embeddings."""
+                )
+
+            voice_preset_dict[key] = np.load(path)
+
+        return voice_preset_dict
+
+    def _validate_voice_preset_dict(self, voice_preset: Optional[dict] = None):
+        for key in ["semantic_prompt", "coarse_prompt", "fine_prompt"]:
+            if key not in voice_preset:
+                raise ValueError(f"Voice preset unrecognized, missing {key} as a key.")
+
+            if not isinstance(voice_preset[key], np.ndarray):
+                raise ValueError(f"{key} voice preset must be a {str(self.preset_shape[key])}D ndarray.")
+
+            if len(voice_preset[key].shape) != self.preset_shape[key]:
+                raise ValueError(f"{key} voice preset must be a {str(self.preset_shape[key])}D ndarray.")
+
+    def __call__(
+        self,
+        text=None,
+        voice_preset=None,
+        return_tensors="pt",
+        max_length=256,
+        add_special_tokens=False,
+        return_attention_mask=True,
+        return_token_type_ids=False,
+        **kwargs,
+    ):
+        """
+        Main method to prepare for the model one or several sequence(s). This method forwards the `text` and `kwargs`
+        arguments to the AutoTokenizer's [`~AutoTokenizer.__call__`] to encode the text. The method also proposes a
+        voice preset which is a dictionary of arrays that conditions `Bark`'s output. `kwargs` arguments are forwarded
+        to the tokenizer and to `cached_file` method if `voice_preset` is a valid filename.
+
+        Args:
+            text (`str`, `List[str]`, `List[List[str]]`):
+                The sequence or batch of sequences to be encoded. Each sequence can be a string or a list of strings
+                (pretokenized string). If the sequences are provided as list of strings (pretokenized), you must set
+                `is_split_into_words=True` (to lift the ambiguity with a batch of sequences).
+            voice_preset (`str`, `Dict[np.ndarray]`):
+                The voice preset, i.e. the speaker embeddings. It can either be a valid voice_preset name, e.g.
+                `"en_speaker_1"`, or directly a dictionary of `np.ndarray` embeddings for each submodel of `Bark`. Or
+                it can be a valid file name of a local `.npz` single voice preset.
+            return_tensors (`str` or [`~utils.TensorType`], *optional*):
+                If set, will return tensors of a particular framework. Acceptable values are:
+
+                - `'pt'`: Return PyTorch `torch.Tensor` objects.
+                - `'np'`: Return NumPy `np.ndarray` objects.
+
+        Returns:
+            [`BatchEncoding`]: The output of the `tokenizer`. If `voice_preset` is provided, it is added under the
+            `"history_prompt"` key as a [`BatchFeature`], i.e. the voice preset with the right tensors type.
+        """
+        if voice_preset is not None and not isinstance(voice_preset, dict):
+            if (
+                isinstance(voice_preset, str)
+                and self.speaker_embeddings is not None
+                and voice_preset in self.speaker_embeddings
+            ):
+                voice_preset = self._load_voice_preset(voice_preset)
+
+            else:
+                if isinstance(voice_preset, str) and not voice_preset.endswith(".npz"):
+                    voice_preset = voice_preset + ".npz"
+
+                voice_preset = np.load(voice_preset)
+
+        if voice_preset is not None:
+            self._validate_voice_preset_dict(voice_preset)
+            voice_preset = BatchFeature(data=voice_preset, tensor_type=return_tensors)
+
+        encoded_text = self.tokenizer(
+            text,
+            return_tensors=return_tensors,
+            padding="max_length",
+            max_length=max_length,
+            return_attention_mask=return_attention_mask,
+            return_token_type_ids=return_token_type_ids,
+            add_special_tokens=add_special_tokens,
+            **kwargs,
+        )
+
+        if voice_preset is not None:
+            encoded_text["history_prompt"] = voice_preset
+
+        return encoded_text
--- a/src/transformers/utils/dummy_pt_objects.py
+++ b/src/transformers/utils/dummy_pt_objects.py
@@ -816,6 +816,51 @@ class AutoformerPreTrainedModel(metaclass=DummyObject):
        requires_backends(self, ["torch"])


+BARK_PRETRAINED_MODEL_ARCHIVE_LIST = None
+
+
+class BarkCausalModel(metaclass=DummyObject):
+    _backends = ["torch"]
+
+    def __init__(self, *args, **kwargs):
+        requires_backends(self, ["torch"])
+
+
+class BarkCoarseModel(metaclass=DummyObject):
+    _backends = ["torch"]
+
+    def __init__(self, *args, **kwargs):
+        requires_backends(self, ["torch"])
+
+
+class BarkFineModel(metaclass=DummyObject):
+    _backends = ["torch"]
+
+    def __init__(self, *args, **kwargs):
+        requires_backends(self, ["torch"])
+
+
+class BarkModel(metaclass=DummyObject):
+    _backends = ["torch"]
+
+    def __init__(self, *args, **kwargs):
+        requires_backends(self, ["torch"])
+
+
+class BarkPreTrainedModel(metaclass=DummyObject):
+    _backends = ["torch"]
+
+    def __init__(self, *args, **kwargs):
+        requires_backends(self, ["torch"])
+
+
+class BarkSemanticModel(metaclass=DummyObject):
+    _backends = ["torch"]
+
+    def __init__(self, *args, **kwargs):
+        requires_backends(self, ["torch"])
+
+
 BART_PRETRAINED_MODEL_ARCHIVE_LIST = None


--- a/tests/models/bark/__init__.py
+++ b/tests/models/bark/__init__.py
--- a/tests/models/bark/test_modeling_bark.py
+++ b/tests/models/bark/test_modeling_bark.py
@@ -0,0 +1,991 @@
+# coding=utf-8
+# Copyright 2023 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+""" Testing suite for the PyTorch Bark model. """
+
+
+import copy
+import inspect
+import tempfile
+import unittest
+
+from transformers import (
+    BarkCoarseConfig,
+    BarkFineConfig,
+    BarkSemanticConfig,
+    is_torch_available,
+)
+from transformers.models.bark.generation_configuration_bark import (
+    BarkCoarseGenerationConfig,
+    BarkFineGenerationConfig,
+    BarkSemanticGenerationConfig,
+)
+from transformers.testing_utils import require_torch, slow, torch_device
+from transformers.utils import cached_property
+
+from ...generation.test_utils import GenerationTesterMixin
+from ...test_configuration_common import ConfigTester
+from ...test_modeling_common import ModelTesterMixin, ids_tensor, random_attention_mask
+
+
+if is_torch_available():
+    import torch
+
+    from transformers import (
+        BarkCausalModel,
+        BarkCoarseModel,
+        BarkFineModel,
+        BarkModel,
+        BarkProcessor,
+        BarkSemanticModel,
+    )
+
+
+class BarkSemanticModelTester:
+    def __init__(
+        self,
+        parent,
+        batch_size=2,
+        seq_length=4,
+        is_training=False,  # for now training is not supported
+        use_input_mask=True,
+        use_labels=True,
+        vocab_size=33,
+        output_vocab_size=33,
+        hidden_size=16,
+        num_hidden_layers=2,
+        num_attention_heads=2,
+        intermediate_size=15,
+        dropout=0.1,
+        window_size=256,
+        initializer_range=0.02,
+        n_codes_total=8,  # for BarkFineModel
+        n_codes_given=1,  # for BarkFineModel
+        config_class=None,
+        model_class=None,
+    ):
+        self.parent = parent
+        self.batch_size = batch_size
+        self.seq_length = seq_length
+        self.is_training = is_training
+        self.use_input_mask = use_input_mask
+        self.use_labels = use_labels
+        self.vocab_size = vocab_size
+        self.output_vocab_size = output_vocab_size
+        self.hidden_size = hidden_size
+        self.num_hidden_layers = num_hidden_layers
+        self.num_attention_heads = num_attention_heads
+        self.intermediate_size = intermediate_size
+        self.dropout = dropout
+        self.window_size = window_size
+        self.initializer_range = initializer_range
+        self.bos_token_id = output_vocab_size - 1
+        self.eos_token_id = output_vocab_size - 1
+        self.pad_token_id = output_vocab_size - 1
+
+        self.n_codes_total = n_codes_total
+        self.n_codes_given = n_codes_given
+
+        self.is_encoder_decoder = False
+        self.config_class = config_class
+        self.model_class = model_class
+
+    def prepare_config_and_inputs(self):
+        input_ids = ids_tensor([self.batch_size, self.seq_length], self.vocab_size)
+
+        input_mask = None
+        if self.use_input_mask:
+            input_mask = random_attention_mask([self.batch_size, self.seq_length])
+
+        config = self.get_config()
+
+        head_mask = ids_tensor([self.num_hidden_layers, self.num_attention_heads], 2)
+
+        inputs_dict = {
+            "input_ids": input_ids,
+            "head_mask": head_mask,
+            "attention_mask": input_mask,
+        }
+
+        return config, inputs_dict
+
+    def get_config(self):
+        return self.config_class(
+            vocab_size=self.vocab_size,
+            output_vocab_size=self.output_vocab_size,
+            hidden_size=self.hidden_size,
+            num_layers=self.num_hidden_layers,
+            num_heads=self.num_attention_heads,
+            use_cache=True,
+            bos_token_id=self.bos_token_id,
+            eos_token_id=self.eos_token_id,
+            pad_token_id=self.pad_token_id,
+            window_size=self.window_size,
+        )
+
+    def get_pipeline_config(self):
+        config = self.get_config()
+        config.vocab_size = 300
+        return config
+
+    def prepare_config_and_inputs_for_common(self):
+        config, inputs_dict = self.prepare_config_and_inputs()
+        return config, inputs_dict
+
+    def create_and_check_decoder_model_past_large_inputs(self, config, inputs_dict):
+        model = self.model_class(config=config).to(torch_device).eval()
+
+        input_ids = inputs_dict["input_ids"]
+        attention_mask = inputs_dict["attention_mask"]
+
+        # first forward pass
+        outputs = model(input_ids, attention_mask=attention_mask, use_cache=True)
+
+        output, past_key_values = outputs.to_tuple()
+
+        # create hypothetical multiple next tokens and extend to next_input_ids
+        next_tokens = ids_tensor((self.batch_size, 3), config.vocab_size)
+        next_attn_mask = ids_tensor((self.batch_size, 3), 2)
+
+        # append to next input_ids and next attention_mask
+        next_input_ids = torch.cat([input_ids, next_tokens], dim=-1)
+        next_attention_mask = torch.cat([attention_mask, next_attn_mask], dim=-1)
+
+        output_from_no_past = model(next_input_ids, attention_mask=next_attention_mask)["logits"]
+        output_from_past = model(next_tokens, attention_mask=next_attention_mask, past_key_values=past_key_values)[
+            "logits"
+        ]
+
+        # select random slice
+        random_slice_idx = ids_tensor((1,), output_from_past.shape[-1]).item()
+        output_from_no_past_slice = output_from_no_past[:, -3:, random_slice_idx].detach()
+        output_from_past_slice = output_from_past[:, :, random_slice_idx].detach()
+
+        self.parent.assertTrue(output_from_past_slice.shape[1] == next_tokens.shape[1])
+
+        # test that outputs are equal for slice
+        self.parent.assertTrue(torch.allclose(output_from_past_slice, output_from_no_past_slice, atol=1e-3))
+
+        # test no attention_mask works
+        outputs = model(input_ids, use_cache=True)
+        _, past_key_values = outputs.to_tuple()
+        output_from_no_past = model(next_input_ids)["logits"]
+
+        output_from_past = model(next_tokens, past_key_values=past_key_values)["logits"]
+
+        random_slice_idx = ids_tensor((1,), output_from_past.shape[-1]).item()
+        output_from_no_past_slice = output_from_no_past[:, -3:, random_slice_idx].detach()
+        output_from_past_slice = output_from_past[:, :, random_slice_idx].detach()
+        # test that outputs are equal for slice
+        self.parent.assertTrue(torch.allclose(output_from_past_slice, output_from_no_past_slice, atol=1e-3))
+
+
+class BarkCoarseModelTester:
+    def __init__(
+        self,
+        parent,
+        batch_size=2,
+        seq_length=4,
+        is_training=False,  # for now training is not supported
+        use_input_mask=True,
+        use_labels=True,
+        vocab_size=33,
+        output_vocab_size=33,
+        hidden_size=16,
+        num_hidden_layers=2,
+        num_attention_heads=2,
+        intermediate_size=15,
+        dropout=0.1,
+        window_size=256,
+        initializer_range=0.02,
+        n_codes_total=8,  # for BarkFineModel
+        n_codes_given=1,  # for BarkFineModel
+        config_class=None,
+        model_class=None,
+    ):
+        self.parent = parent
+        self.batch_size = batch_size
+        self.seq_length = seq_length
+        self.is_training = is_training
+        self.use_input_mask = use_input_mask
+        self.use_labels = use_labels
+        self.vocab_size = vocab_size
+        self.output_vocab_size = output_vocab_size
+        self.hidden_size = hidden_size
+        self.num_hidden_layers = num_hidden_layers
+        self.num_attention_heads = num_attention_heads
+        self.intermediate_size = intermediate_size
+        self.dropout = dropout
+        self.window_size = window_size
+        self.initializer_range = initializer_range
+        self.bos_token_id = output_vocab_size - 1
+        self.eos_token_id = output_vocab_size - 1
+        self.pad_token_id = output_vocab_size - 1
+
+        self.n_codes_total = n_codes_total
+        self.n_codes_given = n_codes_given
+
+        self.is_encoder_decoder = False
+        self.config_class = config_class
+        self.model_class = model_class
+
+    def prepare_config_and_inputs(self):
+        input_ids = ids_tensor([self.batch_size, self.seq_length], self.vocab_size)
+
+        input_mask = None
+        if self.use_input_mask:
+            input_mask = random_attention_mask([self.batch_size, self.seq_length])
+
+        config = self.get_config()
+
+        head_mask = ids_tensor([self.num_hidden_layers, self.num_attention_heads], 2)
+
+        inputs_dict = {
+            "input_ids": input_ids,
+            "head_mask": head_mask,
+            "attention_mask": input_mask,
+        }
+
+        return config, inputs_dict
+
+    def get_config(self):
+        return self.config_class(
+            vocab_size=self.vocab_size,
+            output_vocab_size=self.output_vocab_size,
+            hidden_size=self.hidden_size,
+            num_layers=self.num_hidden_layers,
+            num_heads=self.num_attention_heads,
+            use_cache=True,
+            bos_token_id=self.bos_token_id,
+            eos_token_id=self.eos_token_id,
+            pad_token_id=self.pad_token_id,
+            window_size=self.window_size,
+        )
+
+    def get_pipeline_config(self):
+        config = self.get_config()
+        config.vocab_size = 300
+        return config
+
+    def prepare_config_and_inputs_for_common(self):
+        config, inputs_dict = self.prepare_config_and_inputs()
+        return config, inputs_dict
+
+    def create_and_check_decoder_model_past_large_inputs(self, config, inputs_dict):
+        model = self.model_class(config=config).to(torch_device).eval()
+
+        input_ids = inputs_dict["input_ids"]
+        attention_mask = inputs_dict["attention_mask"]
+
+        # first forward pass
+        outputs = model(input_ids, attention_mask=attention_mask, use_cache=True)
+
+        output, past_key_values = outputs.to_tuple()
+
+        # create hypothetical multiple next tokens and extend to next_input_ids
+        next_tokens = ids_tensor((self.batch_size, 3), config.vocab_size)
+        next_attn_mask = ids_tensor((self.batch_size, 3), 2)
+
+        # append to next input_ids and next attention_mask
+        next_input_ids = torch.cat([input_ids, next_tokens], dim=-1)
+        next_attention_mask = torch.cat([attention_mask, next_attn_mask], dim=-1)
+
+        output_from_no_past = model(next_input_ids, attention_mask=next_attention_mask)["logits"]
+        output_from_past = model(next_tokens, attention_mask=next_attention_mask, past_key_values=past_key_values)[
+            "logits"
+        ]
+
+        # select random slice
+        random_slice_idx = ids_tensor((1,), output_from_past.shape[-1]).item()
+        output_from_no_past_slice = output_from_no_past[:, -3:, random_slice_idx].detach()
+        output_from_past_slice = output_from_past[:, :, random_slice_idx].detach()
+
+        self.parent.assertTrue(output_from_past_slice.shape[1] == next_tokens.shape[1])
+
+        # test that outputs are equal for slice
+        self.parent.assertTrue(torch.allclose(output_from_past_slice, output_from_no_past_slice, atol=1e-3))
+
+        # test no attention_mask works
+        outputs = model(input_ids, use_cache=True)
+        _, past_key_values = outputs.to_tuple()
+        output_from_no_past = model(next_input_ids)["logits"]
+
+        output_from_past = model(next_tokens, past_key_values=past_key_values)["logits"]
+
+        random_slice_idx = ids_tensor((1,), output_from_past.shape[-1]).item()
+        output_from_no_past_slice = output_from_no_past[:, -3:, random_slice_idx].detach()
+        output_from_past_slice = output_from_past[:, :, random_slice_idx].detach()
+        # test that outputs are equal for slice
+        self.parent.assertTrue(torch.allclose(output_from_past_slice, output_from_no_past_slice, atol=1e-3))
+
+
+class BarkFineModelTester:
+    def __init__(
+        self,
+        parent,
+        batch_size=2,
+        seq_length=4,
+        is_training=False,  # for now training is not supported
+        use_input_mask=True,
+        use_labels=True,
+        vocab_size=33,
+        output_vocab_size=33,
+        hidden_size=16,
+        num_hidden_layers=2,
+        num_attention_heads=2,
+        intermediate_size=15,
+        dropout=0.1,
+        window_size=256,
+        initializer_range=0.02,
+        n_codes_total=8,  # for BarkFineModel
+        n_codes_given=1,  # for BarkFineModel
+        config_class=None,
+        model_class=None,
+    ):
+        self.parent = parent
+        self.batch_size = batch_size
+        self.seq_length = seq_length
+        self.is_training = is_training
+        self.use_input_mask = use_input_mask
+        self.use_labels = use_labels
+        self.vocab_size = vocab_size
+        self.output_vocab_size = output_vocab_size
+        self.hidden_size = hidden_size
+        self.num_hidden_layers = num_hidden_layers
+        self.num_attention_heads = num_attention_heads
+        self.intermediate_size = intermediate_size
+        self.dropout = dropout
+        self.window_size = window_size
+        self.initializer_range = initializer_range
+        self.bos_token_id = output_vocab_size - 1
+        self.eos_token_id = output_vocab_size - 1
+        self.pad_token_id = output_vocab_size - 1
+
+        self.n_codes_total = n_codes_total
+        self.n_codes_given = n_codes_given
+
+        self.is_encoder_decoder = False
+        self.config_class = config_class
+        self.model_class = model_class
+
+    def prepare_config_and_inputs(self):
+        input_ids = ids_tensor([self.batch_size, self.seq_length, self.n_codes_total], self.vocab_size)
+
+        input_mask = None
+        if self.use_input_mask:
+            input_mask = random_attention_mask([self.batch_size, self.seq_length])
+
+        config = self.get_config()
+
+        head_mask = ids_tensor([self.num_hidden_layers, self.num_attention_heads], 2)
+
+        # randint between self.n_codes_given and self.n_codes_total - 1 (inclusive)
+        codebook_idx = ids_tensor((1,), self.n_codes_total - self.n_codes_given).item() + self.n_codes_given
+
+        inputs_dict = {
+            "codebook_idx": codebook_idx,
+            "input_ids": input_ids,
+            "head_mask": head_mask,
+            "attention_mask": input_mask,
+        }
+
+        return config, inputs_dict
+
+    def get_config(self):
+        return self.config_class(
+            vocab_size=self.vocab_size,
+            output_vocab_size=self.output_vocab_size,
+            hidden_size=self.hidden_size,
+            num_layers=self.num_hidden_layers,
+            num_heads=self.num_attention_heads,
+            use_cache=True,
+            bos_token_id=self.bos_token_id,
+            eos_token_id=self.eos_token_id,
+            pad_token_id=self.pad_token_id,
+            window_size=self.window_size,
+        )
+
+    def get_pipeline_config(self):
+        config = self.get_config()
+        config.vocab_size = 300
+        return config
+
+    def prepare_config_and_inputs_for_common(self):
+        config, inputs_dict = self.prepare_config_and_inputs()
+        return config, inputs_dict
+
+    def create_and_check_decoder_model_past_large_inputs(self, config, inputs_dict):
+        model = self.model_class(config=config).to(torch_device).eval()
+
+        input_ids = inputs_dict["input_ids"]
+        attention_mask = inputs_dict["attention_mask"]
+
+        # first forward pass
+        outputs = model(input_ids, attention_mask=attention_mask, use_cache=True)
+
+        output, past_key_values = outputs.to_tuple()
+
+        # create hypothetical multiple next tokens and extend to next_input_ids
+        next_tokens = ids_tensor((self.batch_size, 3), config.vocab_size)
+        next_attn_mask = ids_tensor((self.batch_size, 3), 2)
+
+        # append to next input_ids and next attention_mask
+        next_input_ids = torch.cat([input_ids, next_tokens], dim=-1)
+        next_attention_mask = torch.cat([attention_mask, next_attn_mask], dim=-1)
+
+        output_from_no_past = model(next_input_ids, attention_mask=next_attention_mask)["logits"]
+        output_from_past = model(next_tokens, attention_mask=next_attention_mask, past_key_values=past_key_values)[
+            "logits"
+        ]
+
+        # select random slice
+        random_slice_idx = ids_tensor((1,), output_from_past.shape[-1]).item()
+        output_from_no_past_slice = output_from_no_past[:, -3:, random_slice_idx].detach()
+        output_from_past_slice = output_from_past[:, :, random_slice_idx].detach()
+
+        self.parent.assertTrue(output_from_past_slice.shape[1] == next_tokens.shape[1])
+
+        # test that outputs are equal for slice
+        self.parent.assertTrue(torch.allclose(output_from_past_slice, output_from_no_past_slice, atol=1e-3))
+
+        # test no attention_mask works
+        outputs = model(input_ids, use_cache=True)
+        _, past_key_values = outputs.to_tuple()
+        output_from_no_past = model(next_input_ids)["logits"]
+
+        output_from_past = model(next_tokens, past_key_values=past_key_values)["logits"]
+
+        random_slice_idx = ids_tensor((1,), output_from_past.shape[-1]).item()
+        output_from_no_past_slice = output_from_no_past[:, -3:, random_slice_idx].detach()
+        output_from_past_slice = output_from_past[:, :, random_slice_idx].detach()
+        # test that outputs are equal for slice
+        self.parent.assertTrue(torch.allclose(output_from_past_slice, output_from_no_past_slice, atol=1e-3))
+
+
+@require_torch
+class BarkSemanticModelTest(ModelTesterMixin, GenerationTesterMixin, unittest.TestCase):
+    all_model_classes = (BarkSemanticModel,) if is_torch_available() else ()
+    all_generative_model_classes = (BarkCausalModel,) if is_torch_available() else ()
+
+    is_encoder_decoder = False
+    fx_compatible = False
+    test_missing_keys = False
+    test_pruning = False
+    test_model_parallel = False
+    # no model_parallel for now
+
+    test_resize_embeddings = True
+
+    def setUp(self):
+        self.model_tester = BarkSemanticModelTester(
+            self, config_class=BarkSemanticConfig, model_class=BarkSemanticModel
+        )
+        self.config_tester = ConfigTester(self, config_class=BarkSemanticConfig, n_embd=37)
+
+    def test_config(self):
+        self.config_tester.run_common_tests()
+
+    def test_save_load_strict(self):
+        config, inputs_dict = self.model_tester.prepare_config_and_inputs()
+        for model_class in self.all_model_classes:
+            model = model_class(config)
+
+            with tempfile.TemporaryDirectory() as tmpdirname:
+                model.save_pretrained(tmpdirname)
+                model2, info = model_class.from_pretrained(tmpdirname, output_loading_info=True)
+            self.assertEqual(info["missing_keys"], [])
+
+    def test_decoder_model_past_with_large_inputs(self):
+        config_and_inputs = self.model_tester.prepare_config_and_inputs()
+        self.model_tester.create_and_check_decoder_model_past_large_inputs(*config_and_inputs)
+
+    def test_inputs_embeds(self):
+        config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()
+
+        for model_class in self.all_model_classes:
+            model = model_class(config)
+            model.to(torch_device)
+            model.eval()
+
+            inputs = copy.deepcopy(self._prepare_for_class(inputs_dict, model_class))
+
+            input_ids = inputs["input_ids"]
+            del inputs["input_ids"]
+
+            wte = model.get_input_embeddings()
+            inputs["input_embeds"] = wte(input_ids)
+
+            with torch.no_grad():
+                model(**inputs)[0]
+
+    def test_generate_fp16(self):
+        config, input_dict = self.model_tester.prepare_config_and_inputs()
+        input_ids = input_dict["input_ids"]
+        attention_mask = input_ids.ne(1).to(torch_device)
+        model = self.all_generative_model_classes[0](config).eval().to(torch_device)
+        if torch_device == "cuda":
+            model.half()
+        model.generate(input_ids, attention_mask=attention_mask)
+        model.generate(num_beams=4, do_sample=True, early_stopping=False, num_return_sequences=3)
+
+
+@require_torch
+class BarkCoarseModelTest(ModelTesterMixin, GenerationTesterMixin, unittest.TestCase):
+    # Same tester as BarkSemanticModelTest, except for model_class and config_class
+    all_model_classes = (BarkCoarseModel,) if is_torch_available() else ()
+    all_generative_model_classes = (BarkCausalModel,) if is_torch_available() else ()
+
+    is_encoder_decoder = False
+    fx_compatible = False
+    test_missing_keys = False
+    test_pruning = False
+    test_model_parallel = False
+    # no model_parallel for now
+
+    test_resize_embeddings = True
+
+    def setUp(self):
+        self.model_tester = BarkCoarseModelTester(self, config_class=BarkCoarseConfig, model_class=BarkCoarseModel)
+        self.config_tester = ConfigTester(self, config_class=BarkCoarseConfig, n_embd=37)
+
+    def test_config(self):
+        self.config_tester.run_common_tests()
+
+    def test_save_load_strict(self):
+        config, inputs_dict = self.model_tester.prepare_config_and_inputs()
+        for model_class in self.all_model_classes:
+            model = model_class(config)
+
+            with tempfile.TemporaryDirectory() as tmpdirname:
+                model.save_pretrained(tmpdirname)
+                model2, info = model_class.from_pretrained(tmpdirname, output_loading_info=True)
+            self.assertEqual(info["missing_keys"], [])
+
+    def test_decoder_model_past_with_large_inputs(self):
+        config_and_inputs = self.model_tester.prepare_config_and_inputs()
+        self.model_tester.create_and_check_decoder_model_past_large_inputs(*config_and_inputs)
+
+    def test_inputs_embeds(self):
+        config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()
+
+        for model_class in self.all_model_classes:
+            model = model_class(config)
+            model.to(torch_device)
+            model.eval()
+
+            inputs = copy.deepcopy(self._prepare_for_class(inputs_dict, model_class))
+
+            input_ids = inputs["input_ids"]
+            del inputs["input_ids"]
+
+            wte = model.get_input_embeddings()
+            inputs["input_embeds"] = wte(input_ids)
+
+            with torch.no_grad():
+                model(**inputs)[0]
+
+    def test_generate_fp16(self):
+        config, input_dict = self.model_tester.prepare_config_and_inputs()
+        input_ids = input_dict["input_ids"]
+        attention_mask = input_ids.ne(1).to(torch_device)
+        model = self.all_generative_model_classes[0](config).eval().to(torch_device)
+        if torch_device == "cuda":
+            model.half()
+        model.generate(input_ids, attention_mask=attention_mask)
+        model.generate(num_beams=4, do_sample=True, early_stopping=False, num_return_sequences=3)
+
+
+@require_torch
+class BarkFineModelTest(ModelTesterMixin, unittest.TestCase):
+    all_model_classes = (BarkFineModel,) if is_torch_available() else ()
+
+    is_encoder_decoder = False
+    fx_compatible = False
+    test_missing_keys = False
+    test_pruning = False
+    # no model_parallel for now
+    test_model_parallel = False
+
+    # torchscript disabled for now because forward takes an int argument (codebook_idx)
+    test_torchscript = False
+
+    test_resize_embeddings = True
+
+    def setUp(self):
+        self.model_tester = BarkFineModelTester(self, config_class=BarkFineConfig, model_class=BarkFineModel)
+        self.config_tester = ConfigTester(self, config_class=BarkFineConfig, n_embd=37)
+
+    def test_config(self):
+        self.config_tester.run_common_tests()
+
+    def test_save_load_strict(self):
+        config, inputs_dict = self.model_tester.prepare_config_and_inputs()
+        for model_class in self.all_model_classes:
+            model = model_class(config)
+
+            with tempfile.TemporaryDirectory() as tmpdirname:
+                model.save_pretrained(tmpdirname)
+                model2, info = model_class.from_pretrained(tmpdirname, output_loading_info=True)
+            self.assertEqual(info["missing_keys"], [])
+
+    def test_inputs_embeds(self):
+        config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()
+
+        for model_class in self.all_model_classes:
+            model = model_class(config)
+            model.to(torch_device)
+            model.eval()
+
+            inputs = copy.deepcopy(self._prepare_for_class(inputs_dict, model_class))
+
+            input_ids = inputs["input_ids"]
+            del inputs["input_ids"]
+
+            wte = model.get_input_embeddings()[inputs_dict["codebook_idx"]]
+
+            inputs["input_embeds"] = wte(input_ids[:, :, inputs_dict["codebook_idx"]])
+
+            with torch.no_grad():
+                model(**inputs)[0]
+
+    def test_generate_fp16(self):
+        config, input_dict = self.model_tester.prepare_config_and_inputs()
+        input_ids = input_dict["input_ids"]
+        # take first codebook channel
+
+        model = self.all_model_classes[0](config).eval().to(torch_device)
+        if torch_device == "cuda":
+            model.half()
+
+        # toy generation_configs
+        semantic_generation_config = BarkSemanticGenerationConfig(semantic_vocab_size=0)
+        coarse_generation_config = BarkCoarseGenerationConfig(n_coarse_codebooks=config.n_codes_given)
+        fine_generation_config = BarkFineGenerationConfig(
+            max_fine_history_length=config.block_size // 2,
+            max_fine_input_length=config.block_size,
+            n_fine_codebooks=config.n_codes_total,
+        )
+        codebook_size = config.vocab_size - 1
+
+        model.generate(
+            input_ids,
+            history_prompt=None,
+            temperature=None,
+            semantic_generation_config=semantic_generation_config,
+            coarse_generation_config=coarse_generation_config,
+            fine_generation_config=fine_generation_config,
+            codebook_size=codebook_size,
+        )
+
+        model.generate(
+            input_ids,
+            history_prompt=None,
+            temperature=0.7,
+            semantic_generation_config=semantic_generation_config,
+            coarse_generation_config=coarse_generation_config,
+            fine_generation_config=fine_generation_config,
+            codebook_size=codebook_size,
+        )
+
+    def test_forward_signature(self):
+        config, _ = self.model_tester.prepare_config_and_inputs_for_common()
+
+        for model_class in self.all_model_classes:
+            model = model_class(config)
+            signature = inspect.signature(model.forward)
+            # signature.parameters is an OrderedDict => so arg_names order is deterministic
+            arg_names = [*signature.parameters.keys()]
+
+            expected_arg_names = ["codebook_idx", "input_ids"]
+            self.assertListEqual(arg_names[:2], expected_arg_names)
+
+    def test_model_common_attributes(self):
+        # one embedding layer per codebook
+        config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()
+
+        for model_class in self.all_model_classes:
+            model = model_class(config)
+            self.assertIsInstance(model.get_input_embeddings()[0], (torch.nn.Embedding))
+            model.set_input_embeddings(
+                torch.nn.ModuleList([torch.nn.Embedding(10, 10) for _ in range(config.n_codes_total)])
+            )
+            x = model.get_output_embeddings()
+            self.assertTrue(x is None or isinstance(x[0], torch.nn.Linear))
+
+    def test_resize_tokens_embeddings(self):
+        # resizing tokens_embeddings of a ModuleList
+        original_config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()
+        if not self.test_resize_embeddings:
+            return
+
+        for model_class in self.all_model_classes:
+            config = copy.deepcopy(original_config)
+            model = model_class(config)
+            model.to(torch_device)
+
+            if self.model_tester.is_training is False:
+                model.eval()
+
+            model_vocab_size = config.vocab_size
+            # Retrieve the embeddings and clone them
+            model_embed_list = model.resize_token_embeddings(model_vocab_size)
+            cloned_embeddings_list = [model_embed.weight.clone() for model_embed in model_embed_list]
+
+            # Check that resizing the token embeddings with a larger vocab size increases the model's vocab size
+            model_embed_list = model.resize_token_embeddings(model_vocab_size + 10)
+            self.assertEqual(model.config.vocab_size, model_vocab_size + 10)
+
+            # Check that it actually resizes the embeddings matrix for each codebook
+            for model_embed, cloned_embeddings in zip(model_embed_list, cloned_embeddings_list):
+                self.assertEqual(model_embed.weight.shape[0], cloned_embeddings.shape[0] + 10)
+
+            # Check that the model can still do a forward pass successfully (every parameter should be resized)
+            model(**self._prepare_for_class(inputs_dict, model_class))
+
+            # Check that resizing the token embeddings with a smaller vocab size decreases the model's vocab size
+            model_embed_list = model.resize_token_embeddings(model_vocab_size - 15)
+            self.assertEqual(model.config.vocab_size, model_vocab_size - 15)
+            for model_embed, cloned_embeddings in zip(model_embed_list, cloned_embeddings_list):
+                self.assertEqual(model_embed.weight.shape[0], cloned_embeddings.shape[0] - 15)
+
+            # Check that the model can still do a forward pass successfully (every parameter should be resized)
+            # Input ids should be clamped to the maximum size of the vocabulary
+            inputs_dict["input_ids"].clamp_(max=model_vocab_size - 15 - 1)
+
+            model(**self._prepare_for_class(inputs_dict, model_class))
+
+            # Check that adding and removing tokens has not modified the first part of the embedding matrix.
+            # only check for the first embedding matrix
+            models_equal = True
+            for p1, p2 in zip(cloned_embeddings_list[0], model_embed_list[0].weight):
+                if p1.data.ne(p2.data).sum() > 0:
+                    models_equal = False
+
+            self.assertTrue(models_equal)
+
+    def test_resize_embeddings_untied(self):
+        # resizing tokens_embeddings of a ModuleList
+        original_config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()
+        if not self.test_resize_embeddings:
+            return
+
+        original_config.tie_word_embeddings = False
+
+        for model_class in self.all_model_classes:
+            config = copy.deepcopy(original_config)
+            model = model_class(config).to(torch_device)
+
+            # if no output embeddings -> leave test
+            if model.get_output_embeddings() is None:
+                continue
+
+            # Check that resizing the token embeddings with a larger vocab size increases the model's vocab size
+            model_vocab_size = config.vocab_size
+            model.resize_token_embeddings(model_vocab_size + 10)
+            self.assertEqual(model.config.vocab_size, model_vocab_size + 10)
+            output_embeds_list = model.get_output_embeddings()
+
+            for output_embeds in output_embeds_list:
+                self.assertEqual(output_embeds.weight.shape[0], model_vocab_size + 10)
+
+                # Check bias if present
+                if output_embeds.bias is not None:
+                    self.assertEqual(output_embeds.bias.shape[0], model_vocab_size + 10)
+
+            # Check that the model can still do a forward pass successfully (every parameter should be resized)
+            model(**self._prepare_for_class(inputs_dict, model_class))
+
+            # Check that resizing the token embeddings with a smaller vocab size decreases the model's vocab size
+            model.resize_token_embeddings(model_vocab_size - 15)
+            self.assertEqual(model.config.vocab_size, model_vocab_size - 15)
+            # Check that it actually resizes the embeddings matrix
+            output_embeds_list = model.get_output_embeddings()
+
+            for output_embeds in output_embeds_list:
+                self.assertEqual(output_embeds.weight.shape[0], model_vocab_size - 15)
+                # Check bias if present
+                if output_embeds.bias is not None:
+                    self.assertEqual(output_embeds.bias.shape[0], model_vocab_size - 15)
+
+            # Check that the model can still do a forward pass successfully (every parameter should be resized)
+            # Input ids should be clamped to the maximum size of the vocabulary
+            inputs_dict["input_ids"].clamp_(max=model_vocab_size - 15 - 1)
+
+            # Run the forward pass with the clamped input ids
+            model(**self._prepare_for_class(inputs_dict, model_class))
+
+
+@require_torch
+class BarkModelIntegrationTests(unittest.TestCase):
+    @cached_property
+    def model(self):
+        return BarkModel.from_pretrained("ylacombe/bark-large").to(torch_device)
+
+    @cached_property
+    def processor(self):
+        return BarkProcessor.from_pretrained("ylacombe/bark-large")
+
+    @cached_property
+    def inputs(self):
+        input_ids = self.processor("In the light of the moon, a little egg lay on a leaf", voice_preset="en_speaker_6")
+
+        input_ids = input_ids.to(torch_device)
+
+        return input_ids
+
+    @cached_property
+    def semantic_generation_config(self):
+        semantic_generation_config = BarkSemanticGenerationConfig(**self.model.generation_config.semantic_config)
+        return semantic_generation_config
+
+    @cached_property
+    def coarse_generation_config(self):
+        coarse_generation_config = BarkCoarseGenerationConfig(**self.model.generation_config.coarse_acoustics_config)
+        return coarse_generation_config
+
+    @cached_property
+    def fine_generation_config(self):
+        fine_generation_config = BarkFineGenerationConfig(**self.model.generation_config.fine_acoustics_config)
+        return fine_generation_config
+
+    @slow
+    def test_generate_semantic(self):
+        input_ids = self.inputs
+
+        # fmt: off
+        # check first ids
+        expected_output_ids = [7363, 321, 41, 1461, 6915, 952, 326, 41, 41, 927,]
+        # fmt: on
+
+        # greedy decoding
+        with torch.no_grad():
+            output_ids = self.model.semantic.generate(
+                **input_ids,
+                do_sample=False,
+                semantic_generation_config=self.semantic_generation_config,
+            )
+
+        self.assertListEqual(output_ids[0, : len(expected_output_ids)].tolist(), expected_output_ids)
+
+    @slow
+    def test_generate_coarse(self):
+        input_ids = self.inputs
+
+        history_prompt = input_ids["history_prompt"]
+
+        # fmt: off
+        # check first ids
+        expected_output_ids = [11018, 11391, 10651, 11418, 10857, 11620, 10642, 11366, 10312, 11528, 10531, 11516, 10474, 11051, 10524, 11051, ]
+        # fmt: on
+
+        with torch.no_grad():
+            output_ids = self.model.semantic.generate(
+                **input_ids,
+                do_sample=False,
+                semantic_generation_config=self.semantic_generation_config,
+            )
+
+            output_ids = self.model.coarse_acoustics.generate(
+                output_ids,
+                history_prompt=history_prompt,
+                do_sample=False,
+                semantic_generation_config=self.semantic_generation_config,
+                coarse_generation_config=self.coarse_generation_config,
+                codebook_size=self.model.generation_config.codebook_size,
+            )
+
+        self.assertListEqual(output_ids[0, : len(expected_output_ids)].tolist(), expected_output_ids)
+
+    @slow
+    def test_generate_fine(self):
+        input_ids = self.inputs
+
+        history_prompt = input_ids["history_prompt"]
+
+        # fmt: off
+        expected_output_ids = [
+            [1018, 651, 857, 642, 312, 531, 474, 524, 524, 776,],
+            [367, 394, 596, 342, 504, 492, 27, 27, 822, 822,],
+            [961, 955, 221, 955, 955, 686, 939, 939, 479, 176,],
+            [638, 365, 218, 944, 853, 363, 639, 22, 884, 456,],
+            [302, 912, 524, 38, 174, 209, 879, 23, 910, 227,],
+            [440, 673, 861, 666, 372, 558, 49, 172, 232, 342,],
+            [244, 358, 123, 356, 586, 520, 499, 877, 542, 637,],
+            [806, 685, 905, 848, 803, 810, 921, 208, 625, 203,],
+        ]
+        # fmt: on
+
+        with torch.no_grad():
+            output_ids = self.model.semantic.generate(
+                **input_ids,
+                do_sample=False,
+                semantic_generation_config=self.semantic_generation_config,
+            )
+
+            output_ids = self.model.coarse_acoustics.generate(
+                output_ids,
+                history_prompt=history_prompt,
+                do_sample=False,
+                semantic_generation_config=self.semantic_generation_config,
+                coarse_generation_config=self.coarse_generation_config,
+                codebook_size=self.model.generation_config.codebook_size,
+            )
+
+            # greedy decoding
+            output_ids = self.model.fine_acoustics.generate(
+                output_ids,
+                history_prompt=history_prompt,
+                temperature=None,
+                semantic_generation_config=self.semantic_generation_config,
+                coarse_generation_config=self.coarse_generation_config,
+                fine_generation_config=self.fine_generation_config,
+                codebook_size=self.model.generation_config.codebook_size,
+            )
+
+        self.assertListEqual(output_ids[0, :, : len(expected_output_ids[0])].tolist(), expected_output_ids)
+
+    @slow
+    def test_generate_end_to_end(self):
+        input_ids = self.inputs
+
+        with torch.no_grad():
+            self.model.generate(**input_ids)
+            self.model.generate(**{key: val for (key, val) in input_ids.items() if key != "history_prompt"})
+
+    @slow
+    def test_generate_end_to_end_with_args(self):
+        input_ids = self.inputs
+
+        with torch.no_grad():
+            self.model.generate(**input_ids, do_sample=True, temperature=0.6, penalty_alpha=0.6)
+            self.model.generate(**input_ids, do_sample=True, temperature=0.6, num_beams=4)
+
+    @slow
+    def test_generate_end_to_end_with_sub_models_args(self):
+        input_ids = self.inputs
+
+        with torch.no_grad():
+            self.model.generate(**input_ids, do_sample=False, coarse_do_sample=True, coarse_temperature=0.7)
+            self.model.generate(
+                **input_ids, do_sample=False, coarse_do_sample=True, coarse_temperature=0.7, fine_temperature=0.3
+            )
+            self.model.generate(
+                **input_ids,
+                do_sample=True,
+                temperature=0.6,
+                penalty_alpha=0.6,
+                semantic_temperature=0.9,
+                coarse_temperature=0.2,
+                fine_temperature=0.1,
+            )
--- a/tests/models/bark/test_processor_bark.py
+++ b/tests/models/bark/test_processor_bark.py
@@ -0,0 +1,127 @@
+# Copyright 2023 The HuggingFace Team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import os
+import shutil
+import tempfile
+import unittest
+
+import numpy as np
+
+from transformers import AutoTokenizer, BarkProcessor
+from transformers.testing_utils import require_torch, slow
+
+
+@require_torch
+class BarkProcessorTest(unittest.TestCase):
+    def setUp(self):
+        self.checkpoint = "ylacombe/bark-small"
+        self.tmpdirname = tempfile.mkdtemp()
+        self.voice_preset = "en_speaker_1"
+        self.input_string = "This is a test string"
+        self.speaker_embeddings_dict_path = "speaker_embeddings_path.json"
+        self.speaker_embeddings_directory = "speaker_embeddings"
+
+    def get_tokenizer(self, **kwargs):
+        return AutoTokenizer.from_pretrained(self.checkpoint, **kwargs)
+
+    def tearDown(self):
+        shutil.rmtree(self.tmpdirname)
+
+    def test_save_load_pretrained_default(self):
+        tokenizer = self.get_tokenizer()
+
+        processor = BarkProcessor(tokenizer=tokenizer)
+
+        processor.save_pretrained(self.tmpdirname)
+        processor = BarkProcessor.from_pretrained(self.tmpdirname)
+
+        self.assertEqual(processor.tokenizer.get_vocab(), tokenizer.get_vocab())
+
+    @slow
+    def test_save_load_pretrained_additional_features(self):
+        processor = BarkProcessor.from_pretrained(
+            pretrained_processor_name_or_path=self.checkpoint,
+            speaker_embeddings_dict_path=self.speaker_embeddings_dict_path,
+        )
+        processor.save_pretrained(
+            self.tmpdirname,
+            speaker_embeddings_dict_path=self.speaker_embeddings_dict_path,
+            speaker_embeddings_directory=self.speaker_embeddings_directory,
+        )
+
+        tokenizer_add_kwargs = self.get_tokenizer(bos_token="(BOS)", eos_token="(EOS)")
+
+        processor = BarkProcessor.from_pretrained(
+            self.tmpdirname,
+            self.speaker_embeddings_dict_path,
+            bos_token="(BOS)",
+            eos_token="(EOS)",
+        )
+
+        self.assertEqual(processor.tokenizer.get_vocab(), tokenizer_add_kwargs.get_vocab())
+
+    def test_speaker_embeddings(self):
+        processor = BarkProcessor.from_pretrained(
+            pretrained_processor_name_or_path=self.checkpoint,
+            speaker_embeddings_dict_path=self.speaker_embeddings_dict_path,
+        )
+
+        seq_len = 35
+        nb_codebooks_coarse = 2
+        nb_codebooks_total = 8
+
+        voice_preset = {
+            "semantic_prompt": np.ones(seq_len),
+            "coarse_prompt": np.ones((nb_codebooks_coarse, seq_len)),
+            "fine_prompt": np.ones((nb_codebooks_total, seq_len)),
+        }
+
+        # test providing already loaded voice_preset
+        inputs = processor(text=self.input_string, voice_preset=voice_preset)
+
+        processed_voice_preset = inputs["history_prompt"]
+        for key in voice_preset:
+            self.assertListEqual(voice_preset[key].tolist(), processed_voice_preset.get(key, np.array([])).tolist())
+
+        # test loading voice preset from npz file
+        tmpfilename = os.path.join(self.tmpdirname, "file.npz")
+        np.savez(tmpfilename, **voice_preset)
+        inputs = processor(text=self.input_string, voice_preset=tmpfilename)
+        processed_voice_preset = inputs["history_prompt"]
+
+        for key in voice_preset:
+            self.assertListEqual(voice_preset[key].tolist(), processed_voice_preset.get(key, np.array([])).tolist())
+
+        # test loading voice preset from the hub
+        inputs = processor(text=self.input_string, voice_preset=self.voice_preset)
+
+    def test_tokenizer(self):
+        tokenizer = self.get_tokenizer()
+
+        processor = BarkProcessor(tokenizer=tokenizer)
+
+        encoded_processor = processor(text=self.input_string)
+
+        encoded_tok = tokenizer(
+            self.input_string,
+            padding="max_length",
+            max_length=256,
+            add_special_tokens=False,
+            return_attention_mask=True,
+            return_token_type_ids=False,
+        )
+
+        for key in encoded_tok.keys():
+            self.assertListEqual(encoded_tok[key], encoded_processor[key].squeeze().tolist())
--- a/utils/check_repo.py
+++ b/utils/check_repo.py
@@ -167,6 +167,8 @@ IGNORE_NON_TESTED = PRIVATE_MODELS.copy() + [
    "SpeechT5SpeechEncoder",  # Building part of bigger (tested) model.
    "SpeechT5TextDecoder",  # Building part of bigger (tested) model.
    "SpeechT5TextEncoder",  # Building part of bigger (tested) model.
+    "BarkCausalModel",  # Building part of bigger (tested) model.
+    "BarkModel",  # Does not have a forward signature - generation tested with integration tests
 ]

 # Update this list with test files that don't have a tester with a `all_model_classes` variable and which don't
@@ -188,6 +190,7 @@ TEST_FILES_WITH_NO_COMMON_TESTS = [
    "models/vision_text_dual_encoder/test_modeling_tf_vision_text_dual_encoder.py",
    "models/vision_text_dual_encoder/test_modeling_flax_vision_text_dual_encoder.py",
    "models/decision_transformer/test_modeling_decision_transformer.py",
+    "models/bark/test_modeling_bark.py",
 ]

 # Update this list for models that are not in any of the auto MODEL_XXX_MAPPING. Being in this list is an exception and
@@ -332,11 +335,15 @@ IGNORE_NON_AUTO_CONFIGURED = PRIVATE_MODELS.copy() + [
    "AltCLIPVisionModel",
    "AltRobertaModel",
    "TvltForAudioVisualClassification",
+    "BarkCausalModel",
+    "BarkCoarseModel",
+    "BarkFineModel",
+    "BarkSemanticModel",
+    "MusicgenModel",
+    "MusicgenForConditionalGeneration",
    "SpeechT5ForSpeechToSpeech",
    "SpeechT5ForTextToSpeech",
    "SpeechT5HifiGan",
-    "MusicgenModel",
-    "MusicgenForConditionalGeneration",
 ]

 # DO NOT edit this list!
--- a/utils/documentation_tests.txt
+++ b/utils/documentation_tests.txt
@@ -28,6 +28,9 @@ src/transformers/models/auto/feature_extraction_auto.py
 src/transformers/models/auto/image_processing_auto.py
 src/transformers/models/auto/processing_auto.py
 src/transformers/models/auto/tokenization_auto.py
+src/transformers/models/bark/configuration_bark.py
+src/transformers/models/bark/modeling_bark.py
+src/transformers/models/bark/processing_bark.py
 src/transformers/models/bart/configuration_bart.py
 src/transformers/models/bart/modeling_bart.py
 src/transformers/models/bart/tokenization_bart.py