Add Pop2Piano (#21785)
* init commit
* config updated also some modeling
* Processor and Model config combined
* extraction pipeline(upto before spectrogram & mel_conditioner) added but not properly tested
* model loading successful!
* feature extractor done!
* FE can now be called from HF
* postprocessing added in fe file
* same as prev commit
* Pop2PianoConfig doc done
* cfg docs slightly changed
* fe docs done
* batched
* batched working!
* temp
* v1
* checking
* trying to go with generate
* with generate and model tests passed
* before rebasing
* .
* tests done docs done remaining others & nits
* nits
* LogMelSpectogram shifted to FeatureExtractor
* is_tf removed from pop2piano/init
* import solved
* tokenization tests added
* minor fixed regarding modeling_pop2piano
* tokenizer changed to only return midi_object and other changes
* Updated paper abstract(Camera-ready version) (#2)
* more comments and nits
* ruff changes
* code quality fix
* sg comments
* t5 change added and rebased
* comments except batching
* batching done
* comments
* small doc fix
* example removed from modeling
* ckpt
* forward it compatible with fe and generation done
* comments
* comments
* code-quality fix(maybe)
* ckpts changed
* doc file changed from mdx to md
* test fixes
* tokenizer test fix
* changes
* nits done main changes remaining
* code modified
* Pop2PianoProcessor added with tests
* other comments
* added Pop2PianoProcessor to dummy_objects
* added require_onnx to modeling file
* changes
* update .md file
* remove extra line in index.md
* back to the main index
* added pop2piano to index
* Added tokenizer.__call__ with valid args and batch_decode and aligned the processor part too
* changes
* added return types to 2 tokenizer methods
* the PR build test might work now
* added backends
* PR build fix
* vocab added
* comments
* refactored vocab into 1 file
* added conversion script
* comments
* essentia version changed in .md
* comments
* more tokenizer tests added
* minor fix
* tests extended for outputs acc check
* small fix

---------

Co-authored-by: Jongho Choi <sweetcocoa@snu.ac.kr>
@@ -434,6 +434,7 @@ Current number of checkpoints: ** (from Google) released with the paper [Pix2Struct: Screenshot Parsing as Pretraining for Visual Language Understanding](https://arxiv.org/abs/2210.03347) by Kenton Lee, Mandar Joshi, Iulia Turc, Hexiang Hu, Fangyu Liu, Julian Eisenschlos, Urvashi Khandelwal, Peter Shaw, Ming-Wei Chang, Kristina Toutanova.
1. **[PLBart](https://huggingface.co/docs/transformers/model_doc/plbart)** (from UCLA NLP) released with the paper [Unified Pre-training for Program Understanding and Generation](https://arxiv.org/abs/2103.06333) by Wasi Uddin Ahmad, Saikat Chakraborty, Baishakhi Ray, Kai-Wei Chang.
1. **[PoolFormer](https://huggingface.co/docs/transformers/model_doc/poolformer)** (from Sea AI Labs) released with the paper [MetaFormer is Actually What You Need for Vision](https://arxiv.org/abs/2111.11418) by Yu, Weihao and Luo, Mi and Zhou, Pan and Si, Chenyang and Zhou, Yichen and Wang, Xinchao and Feng, Jiashi and Yan, Shuicheng.
1. **[Pop2Piano](https://huggingface.co/docs/transformers/main/model_doc/pop2piano)** released with the paper [Pop2Piano : Pop Audio-based Piano Cover Generation](https://arxiv.org/abs/2211.00895) by Jongho Choi and Kyogu Lee.
1. **[ProphetNet](https://huggingface.co/docs/transformers/model_doc/prophetnet)** (from Microsoft Research) released with the paper [ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training](https://arxiv.org/abs/2001.04063) by Yu Yan, Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang and Ming Zhou.
1. **[PVT](https://huggingface.co/docs/transformers/model_doc/pvt)** (from Nanjing University, The University of Hong Kong etc.) released with the paper [Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions](https://arxiv.org/pdf/2102.12122.pdf) by Wenhai Wang, Enze Xie, Xiang Li, Deng-Ping Fan, Kaitao Song, Ding Liang, Tong Lu, Ping Luo, Ling Shao.
1. **[QDQBert](https://huggingface.co/docs/transformers/model_doc/qdqbert)** (from NVIDIA) released with the paper [Integer Quantization for Deep Learning Inference: Principles and Empirical Evaluation](https://arxiv.org/abs/2004.09602) by Hao Wu, Patrick Judd, Xiaojie Zhang, Mikhail Isaev and Paulius Micikevicius.
@@ -411,6 +411,7 @@ Número actual de puntos de control: ** (from Google) released with the paper [Pix2Struct: Screenshot Parsing as Pretraining for Visual Language Understanding](https://arxiv.org/abs/2210.03347) by Kenton Lee, Mandar Joshi, Iulia Turc, Hexiang Hu, Fangyu Liu, Julian Eisenschlos, Urvashi Khandelwal, Peter Shaw, Ming-Wei Chang, Kristina Toutanova.
1. **[PLBart](https://huggingface.co/docs/transformers/model_doc/plbart)** (from UCLA NLP) released with the paper [Unified Pre-training for Program Understanding and Generation](https://arxiv.org/abs/2103.06333) by Wasi Uddin Ahmad, Saikat Chakraborty, Baishakhi Ray, Kai-Wei Chang.
1. **[PoolFormer](https://huggingface.co/docs/transformers/model_doc/poolformer)** (from Sea AI Labs) released with the paper [MetaFormer is Actually What You Need for Vision](https://arxiv.org/abs/2111.11418) by Yu, Weihao and Luo, Mi and Zhou, Pan and Si, Chenyang and Zhou, Yichen and Wang, Xinchao and Feng, Jiashi and Yan, Shuicheng.
1. **[Pop2Piano](https://huggingface.co/docs/transformers/main/model_doc/pop2piano)** released with the paper [Pop2Piano : Pop Audio-based Piano Cover Generation](https://arxiv.org/abs/2211.00895) by Jongho Choi, Kyogu Lee.
1. **[ProphetNet](https://huggingface.co/docs/transformers/model_doc/prophetnet)** (from Microsoft Research) released with the paper [ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training](https://arxiv.org/abs/2001.04063) by Yu Yan, Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang and Ming Zhou.
1. **[PVT](https://huggingface.co/docs/transformers/model_doc/pvt)** (from Nanjing University, The University of Hong Kong etc.) released with the paper [Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions](https://arxiv.org/pdf/2102.12122.pdf) by Wenhai Wang, Enze Xie, Xiang Li, Deng-Ping Fan, Kaitao Song, Ding Liang, Tong Lu, Ping Luo, Ling Shao.
1. **[QDQBert](https://huggingface.co/docs/transformers/model_doc/qdqbert)** (from NVIDIA) released with the paper [Integer Quantization for Deep Learning Inference: Principles and Empirical Evaluation](https://arxiv.org/abs/2004.09602) by Hao Wu, Patrick Judd, Xiaojie Zhang, Mikhail Isaev and Paulius Micikevicius.
@@ -383,6 +383,7 @@ conda install -c huggingface transformers
1. **[Pix2Struct](https://huggingface.co/docs/transformers/model_doc/pix2struct)** (Google से) Kenton Lee, Mandar Joshi, Iulia Turc, Hexiang Hu, Fangyu Liu, Julian Eisenschlos, Urvashi Khandelwal, Peter Shaw, Ming-Wei Chang, Kristina Toutanova. द्वाराअनुसंधान पत्र [Pix2Struct: Screenshot Parsing as Pretraining for Visual Language Understanding](https://arxiv.org/abs/2210.03347) के साथ जारी किया गया
1. **[PLBart](https://huggingface.co/docs/transformers/model_doc/plbart)** (UCLA NLP से) साथ वाला पेपर [प्रोग्राम अंडरस्टैंडिंग एंड जेनरेशन के लिए यूनिफाइड प्री-ट्रेनिंग](https://arxiv.org/abs/2103.06333) वसी उद्दीन अहमद, सैकत चक्रवर्ती, बैशाखी रे, काई-वेई चांग द्वारा।
1. **[PoolFormer](https://huggingface.co/docs/transformers/model_doc/poolformer)** (from Sea AI Labs) released with the paper [MetaFormer is Actually What You Need for Vision](https://arxiv.org/abs/2111.11418) by Yu, Weihao and Luo, Mi and Zhou, Pan and Si, Chenyang and Zhou, Yichen and Wang, Xinchao and Feng, Jiashi and Yan, Shuicheng.
1. **[Pop2Piano](https://huggingface.co/docs/transformers/main/model_doc/pop2piano)** released with the paper [Pop2Piano : Pop Audio-based Piano Cover Generation](https://arxiv.org/abs/2211.00895) by Jongho Choi, Kyogu Lee.
1. **[ProphetNet](https://huggingface.co/docs/transformers/model_doc/prophetnet)** (माइक्रोसॉफ्ट रिसर्च से) साथ में पेपर [ProphetNet: प्रेडिक्टिंग फ्यूचर एन-ग्राम फॉर सीक्वेंस-टू-सीक्वेंस प्री-ट्रेनिंग ](https://arxiv.org/abs/2001.04063) यू यान, वीज़ेन क्यूई, येयुन गोंग, दयाहेंग लियू, नान डुआन, जिउशेंग चेन, रुओफ़ेई झांग और मिंग झोउ द्वारा पोस्ट किया गया।
1. **[PVT](https://huggingface.co/docs/transformers/model_doc/pvt)** (Nanjing University, The University of Hong Kong etc. से) Wenhai Wang, Enze Xie, Xiang Li, Deng-Ping Fan, Kaitao Song, Ding Liang, Tong Lu, Ping Luo, Ling Shao. द्वाराअनुसंधान पत्र [Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions](https://arxiv.org/pdf/2102.12122.pdf) के साथ जारी किया गया
1. **[QDQBert](https://huggingface.co/docs/transformers/model_doc/qdqbert)** (NVIDIA से) साथ वाला पेपर [डीप लर्निंग इंफ़ेक्शन के लिए इंटीजर क्वांटिज़ेशन: प्रिंसिपल्स एंड एम्पिरिकल इवैल्यूएशन](https://arxiv.org/abs/2004.09602) हाओ वू, पैट्रिक जुड, जिआओजी झांग, मिखाइल इसेव और पॉलियस माइकेविसियस द्वारा।
@@ -445,6 +445,7 @@ Flax、PyTorch、TensorFlowをcondaでインストールする方法は、それ
1. **[Pix2Struct](https://huggingface.co/docs/transformers/model_doc/pix2struct)** (Google から) Kenton Lee, Mandar Joshi, Iulia Turc, Hexiang Hu, Fangyu Liu, Julian Eisenschlos, Urvashi Khandelwal, Peter Shaw, Ming-Wei Chang, Kristina Toutanova. から公開された研究論文 [Pix2Struct: Screenshot Parsing as Pretraining for Visual Language Understanding](https://arxiv.org/abs/2210.03347)
1. **[PLBart](https://huggingface.co/docs/transformers/model_doc/plbart)** (UCLA NLP から) Wasi Uddin Ahmad, Saikat Chakraborty, Baishakhi Ray, Kai-Wei Chang から公開された研究論文: [Unified Pre-training for Program Understanding and Generation](https://arxiv.org/abs/2103.06333)
1. **[PoolFormer](https://huggingface.co/docs/transformers/model_doc/poolformer)** (Sea AI Labs から) Yu, Weihao and Luo, Mi and Zhou, Pan and Si, Chenyang and Zhou, Yichen and Wang, Xinchao and Feng, Jiashi and Yan, Shuicheng から公開された研究論文: [MetaFormer is Actually What You Need for Vision](https://arxiv.org/abs/2111.11418)
1. **[Pop2Piano](https://huggingface.co/docs/transformers/main/model_doc/pop2piano)** released with the paper [Pop2Piano : Pop Audio-based Piano Cover Generation](https://arxiv.org/abs/2211.00895) by Jongho Choi, Kyogu Lee.
1. **[ProphetNet](https://huggingface.co/docs/transformers/model_doc/prophetnet)** (Microsoft Research から) Yu Yan, Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang and Ming Zhou から公開された研究論文: [ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training](https://arxiv.org/abs/2001.04063)
1. **[PVT](https://huggingface.co/docs/transformers/model_doc/pvt)** (Nanjing University, The University of Hong Kong etc. から) Wenhai Wang, Enze Xie, Xiang Li, Deng-Ping Fan, Kaitao Song, Ding Liang, Tong Lu, Ping Luo, Ling Shao. から公開された研究論文 [Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions](https://arxiv.org/pdf/2102.12122.pdf)
1. **[QDQBert](https://huggingface.co/docs/transformers/model_doc/qdqbert)** (NVIDIA から) Hao Wu, Patrick Judd, Xiaojie Zhang, Mikhail Isaev and Paulius Micikevicius から公開された研究論文: [Integer Quantization for Deep Learning Inference: Principles and Empirical Evaluation](https://arxiv.org/abs/2004.09602)
@@ -360,6 +360,7 @@ Flax, PyTorch, TensorFlow 설치 페이지에서 이들을 conda로 설치하는
1. **[Pix2Struct](https://huggingface.co/docs/transformers/model_doc/pix2struct)** (Google 에서 제공)은 Kenton Lee, Mandar Joshi, Iulia Turc, Hexiang Hu, Fangyu Liu, Julian Eisenschlos, Urvashi Khandelwal, Peter Shaw, Ming-Wei Chang, Kristina Toutanova.의 [Pix2Struct: Screenshot Parsing as Pretraining for Visual Language Understanding](https://arxiv.org/abs/2210.03347)논문과 함께 발표했습니다.
1. **[PLBart](https://huggingface.co/docs/transformers/model_doc/plbart)** (UCLA NLP 에서) Wasi Uddin Ahmad, Saikat Chakraborty, Baishakhi Ray, Kai-Wei Chang 의 [Unified Pre-training for Program Understanding and Generation](https://arxiv.org/abs/2103.06333) 논문과 함께 발표했습니다.
1. **[PoolFormer](https://huggingface.co/docs/transformers/model_doc/poolformer)** (Sea AI Labs 에서) Yu, Weihao and Luo, Mi and Zhou, Pan and Si, Chenyang and Zhou, Yichen and Wang, Xinchao and Feng, Jiashi and Yan, Shuicheng 의 [MetaFormer is Actually What You Need for Vision](https://arxiv.org/abs/2111.11418) 논문과 함께 발표했습니다.
1. **[Pop2Piano](https://huggingface.co/docs/transformers/main/model_doc/pop2piano)** released with the paper [Pop2Piano : Pop Audio-based Piano Cover Generation](https://arxiv.org/abs/2211.00895) by Jongho Choi, Kyogu Lee.
1. **[ProphetNet](https://huggingface.co/docs/transformers/model_doc/prophetnet)** (Microsoft Research 에서) Yu Yan, Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang and Ming Zhou 의 [ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training](https://arxiv.org/abs/2001.04063) 논문과 함께 발표했습니다.
1. **[PVT](https://huggingface.co/docs/transformers/model_doc/pvt)** (Nanjing University, The University of Hong Kong etc. 에서 제공)은 Wenhai Wang, Enze Xie, Xiang Li, Deng-Ping Fan, Kaitao Song, Ding Liang, Tong Lu, Ping Luo, Ling Shao.의 [Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions](https://arxiv.org/pdf/2102.12122.pdf)논문과 함께 발표했습니다.
1. **[QDQBert](https://huggingface.co/docs/transformers/model_doc/qdqbert)** (NVIDIA 에서) Hao Wu, Patrick Judd, Xiaojie Zhang, Mikhail Isaev and Paulius Micikevicius 의 [Integer Quantization for Deep Learning Inference: Principles and Empirical Evaluation](https://arxiv.org/abs/2004.09602) 논문과 함께 발표했습니다.
@@ -384,6 +384,7 @@ conda install -c huggingface transformers
1. **[Pix2Struct](https://huggingface.co/docs/transformers/model_doc/pix2struct)** (来自 Google) 伴随论文 [Pix2Struct: Screenshot Parsing as Pretraining for Visual Language Understanding](https://arxiv.org/abs/2210.03347) 由 Kenton Lee, Mandar Joshi, Iulia Turc, Hexiang Hu, Fangyu Liu, Julian Eisenschlos, Urvashi Khandelwal, Peter Shaw, Ming-Wei Chang, Kristina Toutanova 发布。
1. **[PLBart](https://huggingface.co/docs/transformers/model_doc/plbart)** (来自 UCLA NLP) 伴随论文 [Unified Pre-training for Program Understanding and Generation](https://arxiv.org/abs/2103.06333) 由 Wasi Uddin Ahmad, Saikat Chakraborty, Baishakhi Ray, Kai-Wei Chang 发布。
1. **[PoolFormer](https://huggingface.co/docs/transformers/model_doc/poolformer)** (来自 Sea AI Labs) 伴随论文 [MetaFormer is Actually What You Need for Vision](https://arxiv.org/abs/2111.11418) 由 Yu, Weihao and Luo, Mi and Zhou, Pan and Si, Chenyang and Zhou, Yichen and Wang, Xinchao and Feng, Jiashi and Yan, Shuicheng 发布。
1. **[Pop2Piano](https://huggingface.co/docs/transformers/main/model_doc/pop2piano)** released with the paper [Pop2Piano : Pop Audio-based Piano Cover Generation](https://arxiv.org/abs/2211.00895) by Jongho Choi, Kyogu Lee.
1. **[ProphetNet](https://huggingface.co/docs/transformers/model_doc/prophetnet)** (来自 Microsoft Research) 伴随论文 [ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training](https://arxiv.org/abs/2001.04063) 由 Yu Yan, Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang and Ming Zhou 发布。
1. **[PVT](https://huggingface.co/docs/transformers/model_doc/pvt)** (来自 Nanjing University, The University of Hong Kong etc.) 伴随论文 [Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions](https://arxiv.org/pdf/2102.12122.pdf) 由 Wenhai Wang, Enze Xie, Xiang Li, Deng-Ping Fan, Kaitao Song, Ding Liang, Tong Lu, Ping Luo, Ling Shao 发布。
1. **[QDQBert](https://huggingface.co/docs/transformers/model_doc/qdqbert)** (来自 NVIDIA) 伴随论文 [Integer Quantization for Deep Learning Inference: Principles and Empirical Evaluation](https://arxiv.org/abs/2004.09602) 由 Hao Wu, Patrick Judd, Xiaojie Zhang, Mikhail Isaev and Paulius Micikevicius 发布。
@@ -396,6 +396,7 @@ conda install -c huggingface transformers
1. **[Pix2Struct](https://huggingface.co/docs/transformers/model_doc/pix2struct)** (from Google) released with the paper [Pix2Struct: Screenshot Parsing as Pretraining for Visual Language Understanding](https://arxiv.org/abs/2210.03347) by Kenton Lee, Mandar Joshi, Iulia Turc, Hexiang Hu, Fangyu Liu, Julian Eisenschlos, Urvashi Khandelwal, Peter Shaw, Ming-Wei Chang, Kristina Toutanova.
1. **[PLBart](https://huggingface.co/docs/transformers/model_doc/plbart)** (from UCLA NLP) released with the paper [Unified Pre-training for Program Understanding and Generation](https://arxiv.org/abs/2103.06333) by Wasi Uddin Ahmad, Saikat Chakraborty, Baishakhi Ray, Kai-Wei Chang.
1. **[PoolFormer](https://huggingface.co/docs/transformers/model_doc/poolformer)** (from Sea AI Labs) released with the paper [MetaFormer is Actually What You Need for Vision](https://arxiv.org/abs/2111.11418) by Yu, Weihao and Luo, Mi and Zhou, Pan and Si, Chenyang and Zhou, Yichen and Wang, Xinchao and Feng, Jiashi and Yan, Shuicheng.
1. **[Pop2Piano](https://huggingface.co/docs/transformers/main/model_doc/pop2piano)** released with the paper [Pop2Piano : Pop Audio-based Piano Cover Generation](https://arxiv.org/abs/2211.00895) by Jongho Choi, Kyogu Lee.
1. **[ProphetNet](https://huggingface.co/docs/transformers/model_doc/prophetnet)** (from Microsoft Research) released with the paper [ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training](https://arxiv.org/abs/2001.04063) by Yu Yan, Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang and Ming Zhou.
1. **[PVT](https://huggingface.co/docs/transformers/model_doc/pvt)** (from Nanjing University, The University of Hong Kong etc.) released with the paper [Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions](https://arxiv.org/pdf/2102.12122.pdf) by Wenhai Wang, Enze Xie, Xiang Li, Deng-Ping Fan, Kaitao Song, Ding Liang, Tong Lu, Ping Luo, Ling Shao.
1. **[QDQBert](https://huggingface.co/docs/transformers/model_doc/qdqbert)** (from NVIDIA) released with the paper [Integer Quantization for Deep Learning Inference: Principles and Empirical Evaluation](https://arxiv.org/abs/2004.09602) by Hao Wu, Patrick Judd, Xiaojie Zhang, Mikhail Isaev and Paulius Micikevicius.
@@ -584,6 +584,8 @@
  title: MMS
- local: model_doc/musicgen
  title: MusicGen
- local: model_doc/pop2piano
  title: Pop2Piano
- local: model_doc/sew
  title: SEW
- local: model_doc/sew-d
@@ -200,6 +200,7 @@ The documentation is organized into five sections:
1. **[Pix2Struct](model_doc/pix2struct)** (from Google) released with the paper [Pix2Struct: Screenshot Parsing as Pretraining for Visual Language Understanding](https://arxiv.org/abs/2210.03347) by Kenton Lee, Mandar Joshi, Iulia Turc, Hexiang Hu, Fangyu Liu, Julian Eisenschlos, Urvashi Khandelwal, Peter Shaw, Ming-Wei Chang, Kristina Toutanova.
1. **[PLBart](model_doc/plbart)** (from UCLA NLP) released with the paper [Unified Pre-training for Program Understanding and Generation](https://arxiv.org/abs/2103.06333) by Wasi Uddin Ahmad, Saikat Chakraborty, Baishakhi Ray, Kai-Wei Chang.
1. **[PoolFormer](model_doc/poolformer)** (from Sea AI Labs) released with the paper [MetaFormer is Actually What You Need for Vision](https://arxiv.org/abs/2111.11418) by Yu, Weihao and Luo, Mi and Zhou, Pan and Si, Chenyang and Zhou, Yichen and Wang, Xinchao and Feng, Jiashi and Yan, Shuicheng.
1. **[Pop2Piano](model_doc/pop2piano)** released with the paper [Pop2Piano : Pop Audio-based Piano Cover Generation](https://arxiv.org/abs/2211.00895) by Jongho Choi and Kyogu Lee.
1. **[ProphetNet](model_doc/prophetnet)** (from Microsoft Research) released with the paper [ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training](https://arxiv.org/abs/2001.04063) by Yu Yan, Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang and Ming Zhou.
1. **[PVT](model_doc/pvt)** (from Nanjing University, The University of Hong Kong etc.) released with the paper [Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions](https://arxiv.org/pdf/2102.12122.pdf) by Wenhai Wang, Enze Xie, Xiang Li, Deng-Ping Fan, Kaitao Song, Ding Liang, Tong Lu, Ping Luo, Ling Shao.
1. **[QDQBert](model_doc/qdqbert)** (from NVIDIA) released with the paper [Integer Quantization for Deep Learning Inference: Principles and Empirical Evaluation](https://arxiv.org/abs/2004.09602) by Hao Wu, Patrick Judd, Xiaojie Zhang, Mikhail Isaev and Paulius Micikevicius.
@@ -415,6 +416,7 @@ Flax), PyTorch, and/or TensorFlow.
| Pix2Struct | ✅ | ❌ | ❌ |
| PLBart | ✅ | ❌ | ❌ |
| PoolFormer | ✅ | ❌ | ❌ |
| Pop2Piano | ✅ | ❌ | ❌ |
| ProphetNet | ✅ | ❌ | ❌ |
| PVT | ✅ | ❌ | ❌ |
| QDQBert | ✅ | ❌ | ❌ |
190
docs/source/en/model_doc/pop2piano.md
Normal file
@@ -0,0 +1,190 @@
<!--Copyright 2023 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->

# Pop2Piano

## Overview

The Pop2Piano model was proposed in [Pop2Piano : Pop Audio-based Piano Cover Generation](https://arxiv.org/abs/2211.00895) by Jongho Choi and Kyogu Lee.

Piano covers of pop music are widely enjoyed, but generating them from audio is not a trivial task. It requires
considerable expertise in playing the piano as well as knowledge of a song's characteristics and melodies. With
Pop2Piano, you can generate a cover directly from a song's audio waveform. According to its authors, it is the first
model to generate a piano cover directly from pop audio without melody and chord extraction modules.

Pop2Piano is an encoder-decoder Transformer model based on [T5](https://arxiv.org/pdf/1910.10683.pdf). The input audio
is transformed to its waveform and passed to the encoder, which transforms it to a latent representation. The decoder
uses these latent representations to generate token ids autoregressively. Each token id corresponds to one of four
different token types: time, velocity, note and 'special'. The token ids are then decoded to their equivalent MIDI file.

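To make the four token types more concrete, here is a minimal, purely illustrative sketch of how such an event stream could be turned into notes. The `(type, value)` tuple representation, the `decode_events` helper, and the rule that a velocity of 0 means "note off" are assumptions made for this toy example; they are not the actual Pop2Piano vocabulary or tokenizer API.

```python
from typing import Dict, List, Tuple

# Hypothetical, simplified token representation: (token_type, value).
Token = Tuple[str, int]
# A decoded note: (pitch, onset_step, offset_step).
Note = Tuple[int, int, int]


def decode_events(tokens: List[Token]) -> List[Note]:
    """Turn a toy token stream into notes (illustration only)."""
    current_time = 0          # updated by "time" tokens
    note_on = False           # updated by "velocity" tokens (0 = off, >0 = on)
    active: Dict[int, int] = {}  # pitch -> onset step of a sounding note
    notes: List[Note] = []

    for kind, value in tokens:
        if kind == "special":
            break  # in this toy, any special token ends the sequence
        elif kind == "time":
            current_time = value
        elif kind == "velocity":
            note_on = value > 0
        elif kind == "note":
            if note_on:
                # start the note unless it is already sounding
                active.setdefault(value, current_time)
            elif value in active:
                # close a sounding note at the current time step
                notes.append((value, active.pop(value), current_time))
        else:
            raise ValueError(f"unknown token type: {kind!r}")
    return notes


example = [
    ("time", 0), ("velocity", 1), ("note", 60), ("note", 64),
    ("time", 4), ("velocity", 0), ("note", 60),
    ("time", 6), ("note", 64),
    ("special", 0),
]
print(decode_events(example))
```

In practice you should rely on `Pop2PianoTokenizer` (or `Pop2PianoProcessor`) for decoding, as shown in the examples further down.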
The abstract from the paper is the following:

*Piano covers of pop music are enjoyed by many people. However, the task of automatically generating piano covers of pop music is still understudied. This is partly due to the lack of synchronized {Pop, Piano Cover} data pairs, which made it challenging to apply the latest data-intensive deep learning-based methods. To leverage the power of the data-driven approach, we make a large amount of paired and synchronized {Pop, Piano Cover} data using an automated pipeline. In this paper, we present Pop2Piano, a Transformer network that generates piano covers given waveforms of pop music. To the best of our knowledge, this is the first model to generate a piano cover directly from pop audio without using melody and chord extraction modules. We show that Pop2Piano, trained with our dataset, is capable of producing plausible piano covers.*

Tips:

1. To use Pop2Piano, you will need to install the 🤗 Transformers library, as well as the following third-party modules:
```
pip install pretty-midi==0.2.9 essentia==2.1b6.dev1034 librosa scipy
```
Please note that you may need to restart your runtime after installation.
2. Pop2Piano is an Encoder-Decoder based model like T5.
3. Pop2Piano can be used to generate MIDI files for a given audio sequence.
4. Choosing different composers in `Pop2PianoForConditionalGeneration.generate()` can lead to a variety of different results.
5. Setting the sampling rate to 44.1 kHz when loading the audio file can give good performance.
6. Though Pop2Piano was mainly trained on Korean Pop music, it also does pretty well on Western Pop or Hip Hop songs.

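Before running the examples below, it can help to confirm that the third-party modules from the first tip are importable. The following is a small sketch using only the standard library; the `missing_modules` helper is hypothetical, and the import names (`pretty_midi`, `essentia`, `librosa`, `scipy`) are assumed to correspond to the packages in the `pip install` command above.

```python
import importlib.util
from typing import Iterable, List

# Import names assumed for the packages listed in the install command.
REQUIRED_MODULES = ("pretty_midi", "essentia", "librosa", "scipy")


def missing_modules(names: Iterable[str] = REQUIRED_MODULES) -> List[str]:
    """Return the top-level module names that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_modules()
if missing:
    print("Missing optional dependencies:", ", ".join(missing))
else:
    print("All Pop2Piano dependencies appear to be installed.")
```

This only checks that the modules can be located; it does not verify the pinned versions.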
This model was contributed by [Susnato Dhar](https://huggingface.co/susnato).
The original code can be found [here](https://github.com/sweetcocoa/pop2piano).

## Examples

- Example using HuggingFace Dataset:

```python
>>> from datasets import load_dataset
>>> from transformers import Pop2PianoForConditionalGeneration, Pop2PianoProcessor

>>> model = Pop2PianoForConditionalGeneration.from_pretrained("sweetcocoa/pop2piano")
>>> processor = Pop2PianoProcessor.from_pretrained("sweetcocoa/pop2piano")
>>> ds = load_dataset("sweetcocoa/pop2piano_ci", split="test")

>>> inputs = processor(
...     audio=ds["audio"][0]["array"], sampling_rate=ds["audio"][0]["sampling_rate"], return_tensors="pt"
... )
>>> model_output = model.generate(input_features=inputs["input_features"], composer="composer1")
>>> tokenizer_output = processor.batch_decode(
...     token_ids=model_output, feature_extractor_output=inputs
... )["pretty_midi_objects"][0]
>>> tokenizer_output.write("./Outputs/midi_output.mid")
```

- Example using your own audio file:

```python
>>> import librosa
>>> from transformers import Pop2PianoForConditionalGeneration, Pop2PianoProcessor

>>> audio, sr = librosa.load("<your_audio_file_here>", sr=44100)  # feel free to change the sr to a suitable value.
>>> model = Pop2PianoForConditionalGeneration.from_pretrained("sweetcocoa/pop2piano")
>>> processor = Pop2PianoProcessor.from_pretrained("sweetcocoa/pop2piano")

>>> inputs = processor(audio=audio, sampling_rate=sr, return_tensors="pt")
>>> model_output = model.generate(input_features=inputs["input_features"], composer="composer1")
>>> tokenizer_output = processor.batch_decode(
...     token_ids=model_output, feature_extractor_output=inputs
... )["pretty_midi_objects"][0]
>>> tokenizer_output.write("./Outputs/midi_output.mid")
```

- Example of processing multiple audio files in batch:

```python
>>> import librosa
>>> from transformers import Pop2PianoForConditionalGeneration, Pop2PianoProcessor

>>> # feel free to change the sr to a suitable value.
>>> audio1, sr1 = librosa.load("<your_first_audio_file_here>", sr=44100)
>>> audio2, sr2 = librosa.load("<your_second_audio_file_here>", sr=44100)
>>> model = Pop2PianoForConditionalGeneration.from_pretrained("sweetcocoa/pop2piano")
>>> processor = Pop2PianoProcessor.from_pretrained("sweetcocoa/pop2piano")

>>> inputs = processor(audio=[audio1, audio2], sampling_rate=[sr1, sr2], return_attention_mask=True, return_tensors="pt")
>>> # Since we are now generating in batch (2 audios), we must pass the attention_mask
>>> model_output = model.generate(
...     input_features=inputs["input_features"],
...     attention_mask=inputs["attention_mask"],
...     composer="composer1",
... )
>>> tokenizer_output = processor.batch_decode(
...     token_ids=model_output, feature_extractor_output=inputs
... )["pretty_midi_objects"]

>>> # Since we now have 2 generated MIDI files
>>> tokenizer_output[0].write("./Outputs/midi_output1.mid")
>>> tokenizer_output[1].write("./Outputs/midi_output2.mid")
```

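As general background on why an attention mask accompanies batched inputs, the sketch below pads variable-length sequences to a common length and builds a matching mask (1 for real values, 0 for padding). It is a generic illustration in plain Python with a hypothetical `pad_with_mask` helper; it does not reproduce how `Pop2PianoFeatureExtractor` builds its `attention_mask` internally.

```python
from typing import List, Sequence, Tuple


def pad_with_mask(
    sequences: Sequence[Sequence[float]], pad_value: float = 0.0
) -> Tuple[List[List[float]], List[List[int]]]:
    """Right-pad sequences to equal length and return (padded, mask)."""
    max_len = max(len(seq) for seq in sequences)
    padded: List[List[float]] = []
    mask: List[List[int]] = []
    for seq in sequences:
        n_pad = max_len - len(seq)
        padded.append(list(seq) + [pad_value] * n_pad)
        mask.append([1] * len(seq) + [0] * n_pad)
    return padded, mask


padded, mask = pad_with_mask([[0.1, 0.2, 0.3], [0.4]])
print(padded)
print(mask)
```

Without such a mask, a model has no way to tell padded positions apart from real input, which is why the batched example passes `attention_mask` to `generate()`.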
- Example of processing multiple audio files in batch (Using `Pop2PianoFeatureExtractor` and `Pop2PianoTokenizer`):

```python
>>> import librosa
>>> from transformers import Pop2PianoForConditionalGeneration, Pop2PianoFeatureExtractor, Pop2PianoTokenizer

>>> # feel free to change the sr to a suitable value.
>>> audio1, sr1 = librosa.load("<your_first_audio_file_here>", sr=44100)
>>> audio2, sr2 = librosa.load("<your_second_audio_file_here>", sr=44100)
>>> model = Pop2PianoForConditionalGeneration.from_pretrained("sweetcocoa/pop2piano")
>>> feature_extractor = Pop2PianoFeatureExtractor.from_pretrained("sweetcocoa/pop2piano")
>>> tokenizer = Pop2PianoTokenizer.from_pretrained("sweetcocoa/pop2piano")

>>> inputs = feature_extractor(
...     audio=[audio1, audio2],
...     sampling_rate=[sr1, sr2],
...     return_attention_mask=True,
...     return_tensors="pt",
... )
>>> # Since we are now generating in batch (2 audios), we must pass the attention_mask
>>> model_output = model.generate(
...     input_features=inputs["input_features"],
...     attention_mask=inputs["attention_mask"],
...     composer="composer1",
... )
>>> tokenizer_output = tokenizer.batch_decode(
...     token_ids=model_output, feature_extractor_output=inputs
... )["pretty_midi_objects"]

>>> # Since we now have 2 generated MIDI files
>>> tokenizer_output[0].write("./Outputs/midi_output1.mid")
>>> tokenizer_output[1].write("./Outputs/midi_output2.mid")
```

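The examples above write to `./Outputs/`, which is assumed to exist already. As a small convenience sketch, the hypothetical `save_all` helper below creates the output directory if needed and writes every decoded object in a loop. It only assumes that each object exposes a `write(path)` method, as the `pretty_midi_objects` in the examples do.

```python
from pathlib import Path
from typing import Iterable, List


def save_all(midi_objects: Iterable, out_dir: str, stem: str = "midi_output") -> List[str]:
    """Write each object to <out_dir>/<stem><i>.mid and return the paths."""
    out_path = Path(out_dir)
    # Create the directory (and any parents) if it does not exist yet.
    out_path.mkdir(parents=True, exist_ok=True)

    written: List[str] = []
    for index, midi in enumerate(midi_objects, start=1):
        target = out_path / f"{stem}{index}.mid"
        midi.write(str(target))  # any object with a write(path) method works
        written.append(str(target))
    return written
```

With the batch examples above, this could be called as `save_all(tokenizer_output, "./Outputs")`, which would produce `midi_output1.mid` and `midi_output2.mid`.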
## Pop2PianoConfig

[[autodoc]] Pop2PianoConfig

## Pop2PianoFeatureExtractor

[[autodoc]] Pop2PianoFeatureExtractor
    - __call__

## Pop2PianoForConditionalGeneration

[[autodoc]] Pop2PianoForConditionalGeneration
    - forward
    - generate

## Pop2PianoTokenizer

[[autodoc]] Pop2PianoTokenizer
    - __call__

## Pop2PianoProcessor

[[autodoc]] Pop2PianoProcessor
    - __call__
@@ -28,8 +28,12 @@ from .utils import (
    OptionalDependencyNotAvailable,
    _LazyModule,
    is_bitsandbytes_available,
    is_essentia_available,
    is_flax_available,
    is_keras_nlp_available,
    is_librosa_available,
    is_pretty_midi_available,
    is_scipy_available,
    is_sentencepiece_available,
    is_speech_available,
    is_tensorflow_text_available,
@@ -475,6 +479,10 @@ _import_structure = {
    ],
    "models.plbart": ["PLBART_PRETRAINED_CONFIG_ARCHIVE_MAP", "PLBartConfig"],
    "models.poolformer": ["POOLFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP", "PoolFormerConfig"],
    "models.pop2piano": [
        "POP2PIANO_PRETRAINED_CONFIG_ARCHIVE_MAP",
        "Pop2PianoConfig",
    ],
    "models.prophetnet": ["PROPHETNET_PRETRAINED_CONFIG_ARCHIVE_MAP", "ProphetNetConfig", "ProphetNetTokenizer"],
    "models.pvt": ["PVT_PRETRAINED_CONFIG_ARCHIVE_MAP", "PvtConfig"],
    "models.qdqbert": ["QDQBERT_PRETRAINED_CONFIG_ARCHIVE_MAP", "QDQBertConfig"],
@@ -2430,6 +2438,13 @@ else:
            "PoolFormerPreTrainedModel",
        ]
    )
    _import_structure["models.pop2piano"].extend(
        [
            "POP2PIANO_PRETRAINED_MODEL_ARCHIVE_LIST",
            "Pop2PianoForConditionalGeneration",
            "Pop2PianoPreTrainedModel",
        ]
    )
    _import_structure["models.prophetnet"].extend(
        [
            "PROPHETNET_PRETRAINED_MODEL_ARCHIVE_LIST",
@@ -3783,6 +3798,29 @@ else:
|
||||
_import_structure["trainer_tf"] = ["TFTrainer"]
|
||||
|
||||
|
||||
try:
|
||||
if not (
|
||||
is_librosa_available()
|
||||
and is_essentia_available()
|
||||
and is_scipy_available()
|
||||
and is_torch_available()
|
||||
and is_pretty_midi_available()
|
||||
):
|
||||
raise OptionalDependencyNotAvailable()
|
||||
except OptionalDependencyNotAvailable:
|
||||
from .utils import dummy_essentia_and_librosa_and_pretty_midi_and_scipy_and_torch_objects
|
||||
|
||||
_import_structure["utils.dummy_essentia_and_librosa_and_pretty_midi_and_scipy_and_torch_objects"] = [
|
||||
name
|
||||
for name in dir(dummy_essentia_and_librosa_and_pretty_midi_and_scipy_and_torch_objects)
|
||||
if not name.startswith("_")
|
||||
]
|
||||
else:
|
||||
_import_structure["models.pop2piano"].append("Pop2PianoFeatureExtractor")
|
||||
_import_structure["models.pop2piano"].append("Pop2PianoTokenizer")
|
||||
_import_structure["models.pop2piano"].append("Pop2PianoProcessor")
|
||||
|
||||
|
||||
# FLAX-backed objects
|
||||
try:
|
||||
if not is_flax_available():
|
||||
@@ -4478,6 +4516,10 @@ if TYPE_CHECKING:
|
||||
)
|
||||
from .models.plbart import PLBART_PRETRAINED_CONFIG_ARCHIVE_MAP, PLBartConfig
|
||||
from .models.poolformer import POOLFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP, PoolFormerConfig
|
||||
from .models.pop2piano import (
|
||||
POP2PIANO_PRETRAINED_CONFIG_ARCHIVE_MAP,
|
||||
Pop2PianoConfig,
|
||||
)
|
||||
from .models.prophetnet import PROPHETNET_PRETRAINED_CONFIG_ARCHIVE_MAP, ProphetNetConfig, ProphetNetTokenizer
|
||||
from .models.pvt import PVT_PRETRAINED_CONFIG_ARCHIVE_MAP, PvtConfig
|
||||
from .models.qdqbert import QDQBERT_PRETRAINED_CONFIG_ARCHIVE_MAP, QDQBertConfig
|
||||
@@ -6122,6 +6164,11 @@ if TYPE_CHECKING:
|
||||
PoolFormerModel,
|
||||
PoolFormerPreTrainedModel,
|
||||
)
|
||||
from .models.pop2piano import (
|
||||
POP2PIANO_PRETRAINED_MODEL_ARCHIVE_LIST,
|
||||
Pop2PianoForConditionalGeneration,
|
||||
Pop2PianoPreTrainedModel,
|
||||
)
|
||||
from .models.prophetnet import (
|
||||
PROPHETNET_PRETRAINED_MODEL_ARCHIVE_LIST,
|
||||
ProphetNetDecoder,
|
||||
@@ -7212,6 +7259,20 @@ if TYPE_CHECKING:
|
||||
# Trainer
|
||||
from .trainer_tf import TFTrainer
|
||||
|
||||
try:
|
||||
if not (
|
||||
is_librosa_available()
|
||||
and is_essentia_available()
|
||||
and is_scipy_available()
|
||||
and is_torch_available()
|
||||
and is_pretty_midi_available()
|
||||
):
|
||||
raise OptionalDependencyNotAvailable()
|
||||
except OptionalDependencyNotAvailable:
|
||||
from .utils.dummy_essentia_and_librosa_and_pretty_midi_and_scipy_and_torch_objects import *
|
||||
else:
|
||||
from .models.pop2piano import Pop2PianoFeatureExtractor, Pop2PianoProcessor, Pop2PianoTokenizer
|
||||
|
||||
try:
|
||||
if not is_flax_available():
|
||||
raise OptionalDependencyNotAvailable()
|
||||
|
||||
@@ -157,6 +157,7 @@ from . import (
|
||||
pix2struct,
|
||||
plbart,
|
||||
poolformer,
|
||||
pop2piano,
|
||||
prophetnet,
|
||||
pvt,
|
||||
qdqbert,
|
||||
|
||||
@@ -162,6 +162,7 @@ CONFIG_MAPPING_NAMES = OrderedDict(
|
||||
("pix2struct", "Pix2StructConfig"),
|
||||
("plbart", "PLBartConfig"),
|
||||
("poolformer", "PoolFormerConfig"),
|
||||
("pop2piano", "Pop2PianoConfig"),
|
||||
("prophetnet", "ProphetNetConfig"),
|
||||
("pvt", "PvtConfig"),
|
||||
("qdqbert", "QDQBertConfig"),
|
||||
@@ -361,6 +362,7 @@ CONFIG_ARCHIVE_MAP_MAPPING_NAMES = OrderedDict(
|
||||
("pix2struct", "PIX2STRUCT_PRETRAINED_CONFIG_ARCHIVE_MAP"),
|
||||
("plbart", "PLBART_PRETRAINED_CONFIG_ARCHIVE_MAP"),
|
||||
("poolformer", "POOLFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP"),
|
||||
("pop2piano", "POP2PIANO_PRETRAINED_CONFIG_ARCHIVE_MAP"),
|
||||
("prophetnet", "PROPHETNET_PRETRAINED_CONFIG_ARCHIVE_MAP"),
|
||||
("pvt", "PVT_PRETRAINED_CONFIG_ARCHIVE_MAP"),
|
||||
("qdqbert", "QDQBERT_PRETRAINED_CONFIG_ARCHIVE_MAP"),
|
||||
@@ -578,6 +580,7 @@ MODEL_NAMES_MAPPING = OrderedDict(
|
||||
("pix2struct", "Pix2Struct"),
|
||||
("plbart", "PLBart"),
|
||||
("poolformer", "PoolFormer"),
|
||||
("pop2piano", "Pop2Piano"),
|
||||
("prophetnet", "ProphetNet"),
|
||||
("pvt", "PVT"),
|
||||
("qdqbert", "QDQBert"),
|
||||
|
||||
@@ -73,6 +73,7 @@ FEATURE_EXTRACTOR_MAPPING_NAMES = OrderedDict(
|
||||
("owlvit", "OwlViTFeatureExtractor"),
|
||||
("perceiver", "PerceiverFeatureExtractor"),
|
||||
("poolformer", "PoolFormerFeatureExtractor"),
|
||||
("pop2piano", "Pop2PianoFeatureExtractor"),
|
||||
("regnet", "ConvNextFeatureExtractor"),
|
||||
("resnet", "ConvNextFeatureExtractor"),
|
||||
("segformer", "SegformerFeatureExtractor"),
|
||||
|
||||
@@ -346,6 +346,7 @@ MODEL_WITH_LM_HEAD_MAPPING_NAMES = OrderedDict(
|
||||
("openai-gpt", "OpenAIGPTLMHeadModel"),
|
||||
("pegasus_x", "PegasusXForConditionalGeneration"),
|
||||
("plbart", "PLBartForConditionalGeneration"),
|
||||
("pop2piano", "Pop2PianoForConditionalGeneration"),
|
||||
("qdqbert", "QDQBertForMaskedLM"),
|
||||
("reformer", "ReformerModelWithLMHead"),
|
||||
("rembert", "RemBertForMaskedLM"),
|
||||
@@ -670,6 +671,7 @@ MODEL_FOR_SEQ_TO_SEQ_CAUSAL_LM_MAPPING_NAMES = OrderedDict(
|
||||
|
||||
MODEL_FOR_SPEECH_SEQ_2_SEQ_MAPPING_NAMES = OrderedDict(
|
||||
[
|
||||
("pop2piano", "Pop2PianoForConditionalGeneration"),
|
||||
("speech-encoder-decoder", "SpeechEncoderDecoderModel"),
|
||||
("speech_to_text", "Speech2TextForConditionalGeneration"),
|
||||
("speecht5", "SpeechT5ForSpeechToText"),
|
||||
|
||||
@@ -67,6 +67,7 @@ PROCESSOR_MAPPING_NAMES = OrderedDict(
|
||||
("oneformer", "OneFormerProcessor"),
|
||||
("owlvit", "OwlViTProcessor"),
|
||||
("pix2struct", "Pix2StructProcessor"),
|
||||
("pop2piano", "Pop2PianoProcessor"),
|
||||
("sam", "SamProcessor"),
|
||||
("sew", "Wav2Vec2Processor"),
|
||||
("sew-d", "Wav2Vec2Processor"),
|
||||
|
||||
122
src/transformers/models/pop2piano/__init__.py
Normal file
@@ -0,0 +1,122 @@
|
||||
# Copyright 2023 The HuggingFace Team. All rights reserved.
|
||||
#
|
||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||
# you may not use this file except in compliance with the License.
|
||||
# You may obtain a copy of the License at
|
||||
#
|
||||
# http://www.apache.org/licenses/LICENSE-2.0
|
||||
#
|
||||
# Unless required by applicable law or agreed to in writing, software
|
||||
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
from typing import TYPE_CHECKING
|
||||
|
||||
from ...utils import (
|
||||
OptionalDependencyNotAvailable,
|
||||
_LazyModule,
|
||||
is_essentia_available,
|
||||
is_librosa_available,
|
||||
is_pretty_midi_available,
|
||||
is_scipy_available,
|
||||
is_torch_available,
|
||||
)
|
||||
|
||||
|
||||
_import_structure = {
|
||||
"configuration_pop2piano": ["POP2PIANO_PRETRAINED_CONFIG_ARCHIVE_MAP", "Pop2PianoConfig"],
|
||||
}
|
||||
|
||||
try:
|
||||
if not is_torch_available():
|
||||
raise OptionalDependencyNotAvailable()
|
||||
except OptionalDependencyNotAvailable:
|
||||
pass
|
||||
else:
|
||||
_import_structure["modeling_pop2piano"] = [
|
||||
"POP2PIANO_PRETRAINED_MODEL_ARCHIVE_LIST",
|
||||
"Pop2PianoForConditionalGeneration",
|
||||
"Pop2PianoPreTrainedModel",
|
||||
]
|
||||
|
||||
try:
|
||||
if not (is_librosa_available() and is_essentia_available() and is_scipy_available() and is_torch_available()):
|
||||
raise OptionalDependencyNotAvailable()
|
||||
except OptionalDependencyNotAvailable:
|
||||
pass
|
||||
else:
|
||||
_import_structure["feature_extraction_pop2piano"] = ["Pop2PianoFeatureExtractor"]
|
||||
|
||||
try:
|
||||
if not (is_pretty_midi_available() and is_torch_available()):
|
||||
raise OptionalDependencyNotAvailable()
|
||||
except OptionalDependencyNotAvailable:
|
||||
pass
|
||||
else:
|
||||
_import_structure["tokenization_pop2piano"] = ["Pop2PianoTokenizer"]
|
||||
|
||||
try:
|
||||
if not (
|
||||
is_pretty_midi_available()
|
||||
and is_torch_available()
|
||||
and is_librosa_available()
|
||||
and is_essentia_available()
|
||||
and is_scipy_available()
|
||||
):
|
||||
raise OptionalDependencyNotAvailable()
|
||||
except OptionalDependencyNotAvailable:
|
||||
pass
|
||||
else:
|
||||
_import_structure["processing_pop2piano"] = ["Pop2PianoProcessor"]
|
||||
|
||||
|
||||
if TYPE_CHECKING:
|
||||
from .configuration_pop2piano import POP2PIANO_PRETRAINED_CONFIG_ARCHIVE_MAP, Pop2PianoConfig
|
||||
|
||||
try:
|
||||
if not is_torch_available():
|
||||
raise OptionalDependencyNotAvailable()
|
||||
except OptionalDependencyNotAvailable:
|
||||
pass
|
||||
else:
|
||||
from .modeling_pop2piano import (
|
||||
POP2PIANO_PRETRAINED_MODEL_ARCHIVE_LIST,
|
||||
Pop2PianoForConditionalGeneration,
|
||||
Pop2PianoPreTrainedModel,
|
||||
)
|
||||
|
||||
try:
|
||||
if not (is_librosa_available() and is_essentia_available() and is_scipy_available() and is_torch_available()):
|
||||
raise OptionalDependencyNotAvailable()
|
||||
except OptionalDependencyNotAvailable:
|
||||
pass
|
||||
else:
|
||||
from .feature_extraction_pop2piano import Pop2PianoFeatureExtractor
|
||||
|
||||
try:
|
||||
if not (is_pretty_midi_available() and is_torch_available()):
|
||||
raise OptionalDependencyNotAvailable()
|
||||
except OptionalDependencyNotAvailable:
|
||||
pass
|
||||
else:
|
||||
from .tokenization_pop2piano import Pop2PianoTokenizer
|
||||
|
||||
try:
|
||||
if not (
|
||||
is_pretty_midi_available()
|
||||
and is_torch_available()
|
||||
and is_librosa_available()
|
||||
and is_essentia_available()
|
||||
and is_scipy_available()
|
||||
):
|
||||
raise OptionalDependencyNotAvailable()
|
||||
except OptionalDependencyNotAvailable:
|
||||
pass
|
||||
else:
|
||||
from .processing_pop2piano import Pop2PianoProcessor
|
||||
|
||||
else:
|
||||
import sys
|
||||
|
||||
sys.modules[__name__] = _LazyModule(__name__, globals()["__file__"], _import_structure, module_spec=__spec__)
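The guards above expose each Pop2Piano class only when all of its optional backends can be imported. The following is a minimal sketch of that idea using only the standard library. `is_available`, `REQUIREMENTS` and `exposed_names` are hypothetical names introduced for illustration and are not part of the `transformers` API; the backend lists are copied from the `try`/`except` blocks in this file.

```python
import importlib.util


def is_available(module_name: str) -> bool:
    """Hypothetical helper: True if `module_name` can be found without importing it."""
    return importlib.util.find_spec(module_name) is not None


# Backends required by each public name, mirroring the guards in this __init__.py.
REQUIREMENTS = {
    "Pop2PianoForConditionalGeneration": ["torch"],
    "Pop2PianoFeatureExtractor": ["librosa", "essentia", "scipy", "torch"],
    "Pop2PianoTokenizer": ["pretty_midi", "torch"],
    "Pop2PianoProcessor": ["pretty_midi", "torch", "librosa", "essentia", "scipy"],
}


def exposed_names(available=is_available):
    """List the names that would be importable given the installed backends."""
    names = ["Pop2PianoConfig"]  # the configuration has no optional dependencies
    for name, backends in REQUIREMENTS.items():
        if all(available(backend) for backend in backends):
            names.append(name)
    return names


print(exposed_names())
```

Passing a custom `available` callable makes it easy to see what a partial install would expose, for example `exposed_names(lambda b: b == "torch")`.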
|
||||
129
src/transformers/models/pop2piano/configuration_pop2piano.py
Normal file
@@ -0,0 +1,129 @@
|
||||
# coding=utf-8
|
||||
# Copyright 2023 The HuggingFace Inc. team. All rights reserved.
|
||||
#
|
||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||
# you may not use this file except in compliance with the License.
|
||||
# You may obtain a copy of the License at
|
||||
#
|
||||
# http://www.apache.org/licenses/LICENSE-2.0
|
||||
#
|
||||
# Unless required by applicable law or agreed to in writing, software
|
||||
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
""" Pop2Piano model configuration"""
|
||||
|
||||
|
||||
from ...configuration_utils import PretrainedConfig
|
||||
from ...utils import logging
|
||||
|
||||
|
||||
logger = logging.get_logger(__name__)
|
||||
|
||||
POP2PIANO_PRETRAINED_CONFIG_ARCHIVE_MAP = {
|
||||
"sweetcocoa/pop2piano": "https://huggingface.co/sweetcocoa/pop2piano/blob/main/config.json"
|
||||
}
|
||||
|
||||
|
||||
class Pop2PianoConfig(PretrainedConfig):
|
||||
r"""
|
||||
This is the configuration class to store the configuration of a [`Pop2PianoForConditionalGeneration`]. It is used
|
||||
to instantiate a Pop2PianoForConditionalGeneration model according to the specified arguments, defining the model
|
||||
architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the
|
||||
Pop2Piano [sweetcocoa/pop2piano](https://huggingface.co/sweetcocoa/pop2piano) architecture.
|
||||
|
||||
Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
|
||||
documentation from [`PretrainedConfig`] for more information.
|
||||
|
||||
Arguments:
|
||||
vocab_size (`int`, *optional*, defaults to 2400):
|
||||
Vocabulary size of the `Pop2PianoForConditionalGeneration` model. Defines the number of different tokens
|
||||
that can be represented by the `input_ids` passed when calling [`Pop2PianoForConditionalGeneration`].
|
||||
composer_vocab_size (`int`, *optional*, defaults to 21):
|
||||
Denotes the number of composers.
|
||||
d_model (`int`, *optional*, defaults to 512):
|
||||
Size of the encoder layers and the pooler layer.
|
||||
d_kv (`int`, *optional*, defaults to 64):
|
||||
Size of the key, query, value projections per attention head. The `inner_dim` of the projection layer will
|
||||
be defined as `num_heads * d_kv`.
|
||||
d_ff (`int`, *optional*, defaults to 2048):
|
||||
Size of the intermediate feed forward layer in each `Pop2PianoBlock`.
|
||||
num_layers (`int`, *optional*, defaults to 6):
|
||||
Number of hidden layers in the Transformer encoder.
|
||||
num_decoder_layers (`int`, *optional*):
|
||||
Number of hidden layers in the Transformer decoder. Will use the same value as `num_layers` if not set.
|
||||
num_heads (`int`, *optional*, defaults to 8):
|
||||
Number of attention heads for each attention layer in the Transformer encoder.
|
||||
relative_attention_num_buckets (`int`, *optional*, defaults to 32):
|
||||
The number of buckets to use for each attention layer.
|
||||
relative_attention_max_distance (`int`, *optional*, defaults to 128):
|
||||
The maximum distance of the longer sequences for the bucket separation.
|
||||
dropout_rate (`float`, *optional*, defaults to 0.1):
|
||||
The ratio for all dropout layers.
|
||||
layer_norm_epsilon (`float`, *optional*, defaults to 1e-6):
|
||||
The epsilon used by the layer normalization layers.
|
||||
initializer_factor (`float`, *optional*, defaults to 1.0):
|
||||
A factor for initializing all weight matrices (should be kept to 1.0, used internally for initialization
|
||||
testing).
|
||||
feed_forward_proj (`string`, *optional*, defaults to `"gated-gelu"`):
|
||||
Type of feed forward layer to be used. Should be one of `"relu"` or `"gated-gelu"`.
|
||||
use_cache (`bool`, *optional*, defaults to `True`):
|
||||
Whether or not the model should return the last key/values attentions (not used by all models).
|
||||
dense_act_fn (`string`, *optional*, defaults to `"relu"`):
|
||||
Type of Activation Function to be used in `Pop2PianoDenseActDense` and in `Pop2PianoDenseGatedActDense`.
|
||||
"""
|
||||
|
||||
model_type = "pop2piano"
|
||||
keys_to_ignore_at_inference = ["past_key_values"]
|
||||
|
||||
def __init__(
|
||||
self,
|
||||
vocab_size=2400,
|
||||
composer_vocab_size=21,
|
||||
d_model=512,
|
||||
d_kv=64,
|
||||
d_ff=2048,
|
||||
num_layers=6,
|
||||
num_decoder_layers=None,
|
||||
num_heads=8,
|
||||
relative_attention_num_buckets=32,
|
||||
relative_attention_max_distance=128,
|
||||
dropout_rate=0.1,
|
||||
layer_norm_epsilon=1e-6,
|
||||
initializer_factor=1.0,
|
||||
feed_forward_proj="gated-gelu", # noqa
|
||||
is_encoder_decoder=True,
|
||||
use_cache=True,
|
||||
pad_token_id=0,
|
||||
eos_token_id=1,
|
||||
dense_act_fn="relu",
|
||||
**kwargs,
|
||||
):
|
||||
self.vocab_size = vocab_size
|
||||
self.composer_vocab_size = composer_vocab_size
|
||||
self.d_model = d_model
|
||||
self.d_kv = d_kv
|
||||
self.d_ff = d_ff
|
||||
self.num_layers = num_layers
|
||||
self.num_decoder_layers = num_decoder_layers if num_decoder_layers is not None else self.num_layers
|
||||
self.num_heads = num_heads
|
||||
self.relative_attention_num_buckets = relative_attention_num_buckets
|
||||
self.relative_attention_max_distance = relative_attention_max_distance
|
||||
self.dropout_rate = dropout_rate
|
||||
self.layer_norm_epsilon = layer_norm_epsilon
|
||||
self.initializer_factor = initializer_factor
|
||||
self.feed_forward_proj = feed_forward_proj
|
||||
self.use_cache = use_cache
|
||||
self.dense_act_fn = dense_act_fn
|
||||
self.is_gated_act = self.feed_forward_proj.split("-")[0] == "gated"
|
||||
self.hidden_size = self.d_model
|
||||
self.num_attention_heads = num_heads
|
||||
self.num_hidden_layers = num_layers
|
||||
|
||||
super().__init__(
|
||||
pad_token_id=pad_token_id,
|
||||
eos_token_id=eos_token_id,
|
||||
is_encoder_decoder=is_encoder_decoder,
|
||||
**kwargs,
|
||||
)
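Several attributes set in `__init__` are derived from other arguments rather than passed in directly. The sketch below reproduces only that derivation in plain Python so it can be run without `transformers`. `derive_config_fields` is a hypothetical helper written for illustration; the defaults are the ones listed in the signature above.

```python
def derive_config_fields(
    num_layers=6,
    num_decoder_layers=None,
    feed_forward_proj="gated-gelu",
    d_model=512,
    num_heads=8,
):
    """Hypothetical helper mirroring the derived attributes of Pop2PianoConfig."""
    return {
        # the decoder depth falls back to the encoder depth when it is not set
        "num_decoder_layers": num_decoder_layers if num_decoder_layers is not None else num_layers,
        # "gated-gelu" -> gated feed-forward block, "relu" -> plain one
        "is_gated_act": feed_forward_proj.split("-")[0] == "gated",
        # aliases kept alongside the original names
        "hidden_size": d_model,
        "num_attention_heads": num_heads,
        "num_hidden_layers": num_layers,
    }


print(derive_config_fields())
print(derive_config_fields(num_decoder_layers=4, feed_forward_proj="relu"))
```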
|
||||
@@ -0,0 +1,190 @@
|
||||
# Copyright 2023 The HuggingFace Inc. team. All rights reserved.
|
||||
#
|
||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||
# you may not use this file except in compliance with the License.
|
||||
# You may obtain a copy of the License at
|
||||
#
|
||||
# http://www.apache.org/licenses/LICENSE-2.0
|
||||
#
|
||||
# Unless required by applicable law or agreed to in writing, software
|
||||
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
|
||||
""" File for loading the Pop2Piano model weights from the official repository and to show how tokenizer vocab was
|
||||
constructed"""
|
||||
|
||||
import json
|
||||
|
||||
import torch
|
||||
|
||||
from transformers import Pop2PianoConfig, Pop2PianoForConditionalGeneration
|
||||
|
||||
|
||||
########################## MODEL WEIGHTS ##########################
|
||||
|
||||
# These weights were downloaded from the official pop2piano repository
|
||||
# https://huggingface.co/sweetcocoa/pop2piano/blob/main/model-1999-val_0.67311615.ckpt
|
||||
official_weights = torch.load("./model-1999-val_0.67311615.ckpt")
|
||||
state_dict = {}
|
||||
|
||||
|
||||
# load the config and init the model
|
||||
cfg = Pop2PianoConfig.from_pretrained("sweetcocoa/pop2piano")
|
||||
model = Pop2PianoForConditionalGeneration(cfg)
|
||||
|
||||
|
||||
# load relative attention bias
|
||||
state_dict["encoder.block.0.layer.0.SelfAttention.relative_attention_bias.weight"] = official_weights["state_dict"][
|
||||
"transformer.encoder.block.0.layer.0.SelfAttention.relative_attention_bias.weight"
|
||||
]
|
||||
state_dict["decoder.block.0.layer.0.SelfAttention.relative_attention_bias.weight"] = official_weights["state_dict"][
|
||||
"transformer.decoder.block.0.layer.0.SelfAttention.relative_attention_bias.weight"
|
||||
]
|
||||
|
||||
# load embed tokens and final layer norm for both encoder and decoder
|
||||
state_dict["encoder.embed_tokens.weight"] = official_weights["state_dict"]["transformer.encoder.embed_tokens.weight"]
|
||||
state_dict["decoder.embed_tokens.weight"] = official_weights["state_dict"]["transformer.decoder.embed_tokens.weight"]
|
||||
|
||||
state_dict["encoder.final_layer_norm.weight"] = official_weights["state_dict"][
|
||||
"transformer.encoder.final_layer_norm.weight"
|
||||
]
|
||||
state_dict["decoder.final_layer_norm.weight"] = official_weights["state_dict"][
|
||||
"transformer.decoder.final_layer_norm.weight"
|
||||
]
|
||||
|
||||
# load lm_head, mel_conditioner.emb and shared
|
||||
state_dict["lm_head.weight"] = official_weights["state_dict"]["transformer.lm_head.weight"]
|
||||
state_dict["mel_conditioner.embedding.weight"] = official_weights["state_dict"]["mel_conditioner.embedding.weight"]
|
||||
state_dict["shared.weight"] = official_weights["state_dict"]["transformer.shared.weight"]
|
||||
|
||||
# load each encoder block
|
||||
for i in range(cfg.num_layers):
|
||||
# layer 0
|
||||
state_dict[f"encoder.block.{i}.layer.0.SelfAttention.q.weight"] = official_weights["state_dict"][
|
||||
f"transformer.encoder.block.{i}.layer.0.SelfAttention.q.weight"
|
||||
]
|
||||
state_dict[f"encoder.block.{i}.layer.0.SelfAttention.k.weight"] = official_weights["state_dict"][
|
||||
f"transformer.encoder.block.{i}.layer.0.SelfAttention.k.weight"
|
||||
]
|
||||
state_dict[f"encoder.block.{i}.layer.0.SelfAttention.v.weight"] = official_weights["state_dict"][
|
||||
f"transformer.encoder.block.{i}.layer.0.SelfAttention.v.weight"
|
||||
]
|
||||
state_dict[f"encoder.block.{i}.layer.0.SelfAttention.o.weight"] = official_weights["state_dict"][
|
||||
f"transformer.encoder.block.{i}.layer.0.SelfAttention.o.weight"
|
||||
]
|
||||
state_dict[f"encoder.block.{i}.layer.0.layer_norm.weight"] = official_weights["state_dict"][
|
||||
f"transformer.encoder.block.{i}.layer.0.layer_norm.weight"
|
||||
]
|
||||
|
||||
# layer 1
|
||||
state_dict[f"encoder.block.{i}.layer.1.DenseReluDense.wi_0.weight"] = official_weights["state_dict"][
|
||||
f"transformer.encoder.block.{i}.layer.1.DenseReluDense.wi_0.weight"
|
||||
]
|
||||
state_dict[f"encoder.block.{i}.layer.1.DenseReluDense.wi_1.weight"] = official_weights["state_dict"][
|
||||
f"transformer.encoder.block.{i}.layer.1.DenseReluDense.wi_1.weight"
|
||||
]
|
||||
state_dict[f"encoder.block.{i}.layer.1.DenseReluDense.wo.weight"] = official_weights["state_dict"][
|
||||
f"transformer.encoder.block.{i}.layer.1.DenseReluDense.wo.weight"
|
||||
]
|
||||
state_dict[f"encoder.block.{i}.layer.1.layer_norm.weight"] = official_weights["state_dict"][
|
||||
f"transformer.encoder.block.{i}.layer.1.layer_norm.weight"
|
||||
]
|
||||
|
||||
# load each decoder block
for i in range(cfg.num_decoder_layers):
|
||||
# layer 0
|
||||
state_dict[f"decoder.block.{i}.layer.0.SelfAttention.q.weight"] = official_weights["state_dict"][
|
||||
f"transformer.decoder.block.{i}.layer.0.SelfAttention.q.weight"
|
||||
]
|
||||
state_dict[f"decoder.block.{i}.layer.0.SelfAttention.k.weight"] = official_weights["state_dict"][
|
||||
f"transformer.decoder.block.{i}.layer.0.SelfAttention.k.weight"
|
||||
]
|
||||
state_dict[f"decoder.block.{i}.layer.0.SelfAttention.v.weight"] = official_weights["state_dict"][
|
||||
f"transformer.decoder.block.{i}.layer.0.SelfAttention.v.weight"
|
||||
]
|
||||
state_dict[f"decoder.block.{i}.layer.0.SelfAttention.o.weight"] = official_weights["state_dict"][
|
||||
f"transformer.decoder.block.{i}.layer.0.SelfAttention.o.weight"
|
||||
]
|
||||
state_dict[f"decoder.block.{i}.layer.0.layer_norm.weight"] = official_weights["state_dict"][
|
||||
f"transformer.decoder.block.{i}.layer.0.layer_norm.weight"
|
||||
]
|
||||
|
||||
# layer 1
|
||||
state_dict[f"decoder.block.{i}.layer.1.EncDecAttention.q.weight"] = official_weights["state_dict"][
|
||||
f"transformer.decoder.block.{i}.layer.1.EncDecAttention.q.weight"
|
||||
]
|
||||
state_dict[f"decoder.block.{i}.layer.1.EncDecAttention.k.weight"] = official_weights["state_dict"][
|
||||
f"transformer.decoder.block.{i}.layer.1.EncDecAttention.k.weight"
|
||||
]
|
||||
state_dict[f"decoder.block.{i}.layer.1.EncDecAttention.v.weight"] = official_weights["state_dict"][
|
||||
f"transformer.decoder.block.{i}.layer.1.EncDecAttention.v.weight"
|
||||
]
|
||||
state_dict[f"decoder.block.{i}.layer.1.EncDecAttention.o.weight"] = official_weights["state_dict"][
|
||||
f"transformer.decoder.block.{i}.layer.1.EncDecAttention.o.weight"
|
||||
]
|
||||
state_dict[f"decoder.block.{i}.layer.1.layer_norm.weight"] = official_weights["state_dict"][
|
||||
f"transformer.decoder.block.{i}.layer.1.layer_norm.weight"
|
||||
]
|
||||
|
||||
# layer 2
|
||||
state_dict[f"decoder.block.{i}.layer.2.DenseReluDense.wi_0.weight"] = official_weights["state_dict"][
|
||||
f"transformer.decoder.block.{i}.layer.2.DenseReluDense.wi_0.weight"
|
||||
]
|
||||
state_dict[f"decoder.block.{i}.layer.2.DenseReluDense.wi_1.weight"] = official_weights["state_dict"][
|
||||
f"transformer.decoder.block.{i}.layer.2.DenseReluDense.wi_1.weight"
|
||||
]
|
||||
state_dict[f"decoder.block.{i}.layer.2.DenseReluDense.wo.weight"] = official_weights["state_dict"][
|
||||
f"transformer.decoder.block.{i}.layer.2.DenseReluDense.wo.weight"
|
||||
]
|
||||
state_dict[f"decoder.block.{i}.layer.2.layer_norm.weight"] = official_weights["state_dict"][
|
||||
f"transformer.decoder.block.{i}.layer.2.layer_norm.weight"
|
||||
]
|
||||
|
||||
model.load_state_dict(state_dict, strict=True)
|
||||
|
||||
# save the weights
|
||||
torch.save(state_dict, "./pytorch_model.bin")
|
||||
|
||||
########################## TOKENIZER ##########################

# the tokenize and detokenize methods are taken from the official implementation


# link : https://github.com/sweetcocoa/pop2piano/blob/fac11e8dcfc73487513f4588e8d0c22a22f2fdc5/midi_tokenizer.py#L34
def tokenize(idx, token_type, n_special=4, n_note=128, n_velocity=2):
    if token_type == "TOKEN_TIME":
        return n_special + n_note + n_velocity + idx
    elif token_type == "TOKEN_VELOCITY":
        return n_special + n_note + idx
    elif token_type == "TOKEN_NOTE":
        return n_special + idx
    elif token_type == "TOKEN_SPECIAL":
        return idx
    else:
        return -1


# link : https://github.com/sweetcocoa/pop2piano/blob/fac11e8dcfc73487513f4588e8d0c22a22f2fdc5/midi_tokenizer.py#L48
def detokenize(idx, n_special=4, n_note=128, n_velocity=2, time_idx_offset=0):
    if idx >= n_special + n_note + n_velocity:
        return "TOKEN_TIME", (idx - (n_special + n_note + n_velocity)) + time_idx_offset
    elif idx >= n_special + n_note:
        return "TOKEN_VELOCITY", idx - (n_special + n_note)
    elif idx >= n_special:
        return "TOKEN_NOTE", idx - n_special
    else:
        return "TOKEN_SPECIAL", idx


# create the decoder and then the encoder of the tokenizer
decoder = {}
for i in range(cfg.vocab_size):
    decoder.update({i: f"{detokenize(i)[1]}_{detokenize(i)[0]}"})

encoder = {v: k for k, v in decoder.items()}

# save the vocab
with open("./vocab.json", "w") as file:
    file.write(json.dumps(encoder))
|
||||
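The vocabulary built by the conversion script lays token ids out in four contiguous ranges: special tokens, notes, velocities, then time steps. The sketch below rebuilds that mapping without loading the checkpoint, to make the ranges concrete. `detokenize` is copied from the script; `build_vocab` is a hypothetical wrapper, and the vocabulary size of 2400 is taken from the default of `Pop2PianoConfig` rather than read from the hosted config.

```python
def detokenize(idx, n_special=4, n_note=128, n_velocity=2, time_idx_offset=0):
    """Map a token id back to its (type, index) pair, as in the conversion script."""
    if idx >= n_special + n_note + n_velocity:
        return "TOKEN_TIME", (idx - (n_special + n_note + n_velocity)) + time_idx_offset
    elif idx >= n_special + n_note:
        return "TOKEN_VELOCITY", idx - (n_special + n_note)
    elif idx >= n_special:
        return "TOKEN_NOTE", idx - n_special
    else:
        return "TOKEN_SPECIAL", idx


def build_vocab(vocab_size):
    """Hypothetical wrapper: return the `"<index>_<type>" -> id` mapping."""
    decoder = {i: f"{detokenize(i)[1]}_{detokenize(i)[0]}" for i in range(vocab_size)}
    return {token: idx for idx, token in decoder.items()}


encoder = build_vocab(2400)  # 2400 is the default vocab_size of Pop2PianoConfig

# ids 0-3 are special, 4-131 are notes, 132-133 are velocities, 134+ are time steps
for token in ("0_TOKEN_SPECIAL", "60_TOKEN_NOTE", "1_TOKEN_VELOCITY", "0_TOKEN_TIME"):
    print(token, "->", encoder[token])
```

With the default sizes, note index 60 maps to id 64 because the four special tokens come first, and the first time step starts at id 134 (4 + 128 + 2).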
@@ -0,0 +1,463 @@
|
||||
# coding=utf-8
|
||||
# Copyright 2023 The HuggingFace Inc. team.
|
||||
#
|
||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||
# you may not use this file except in compliance with the License.
|
||||
# You may obtain a copy of the License at
|
||||
#
|
||||
# http://www.apache.org/licenses/LICENSE-2.0
|
||||
#
|
||||
# Unless required by applicable law or agreed to in writing, software
|
||||
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
""" Feature extractor class for Pop2Piano"""
|
||||
|
||||
import copy
|
||||
import warnings
|
||||
from typing import List, Optional, Union
|
||||
|
||||
import numpy
|
||||
import numpy as np
|
||||
|
||||
from ...audio_utils import mel_filter_bank, spectrogram
|
||||
from ...feature_extraction_sequence_utils import SequenceFeatureExtractor
|
||||
from ...feature_extraction_utils import BatchFeature
|
||||
from ...utils import (
|
||||
TensorType,
|
||||
is_essentia_available,
|
||||
is_librosa_available,
|
||||
is_scipy_available,
|
||||
logging,
|
||||
requires_backends,
|
||||
)
|
||||
|
||||
|
||||
if is_essentia_available():
|
||||
import essentia
|
||||
import essentia.standard
|
||||
|
||||
if is_librosa_available():
|
||||
import librosa
|
||||
|
||||
if is_scipy_available():
|
||||
import scipy
|
||||
|
||||
|
||||
logger = logging.get_logger(__name__)
|
||||
|
||||
|
||||
class Pop2PianoFeatureExtractor(SequenceFeatureExtractor):
|
||||
r"""
|
||||
Constructs a Pop2Piano feature extractor.
|
||||
|
||||
This feature extractor inherits from [`~feature_extraction_sequence_utils.SequenceFeatureExtractor`] which contains
|
||||
most of the main methods. Users should refer to this superclass for more information regarding those methods.
|
||||
|
||||
This class extracts rhythm and preprocesses the audio before it is passed to the model. First the audio is passed
to the `RhythmExtractor2013` algorithm, which extracts the beat_times and beat positions and estimates their
confidence as well as the tempo in bpm. Then beat_times is interpolated to get beatsteps, from which we later
calculate extrapolated_beatsteps to be used in the tokenizer. Separately, the audio is resampled to
self.sampling_rate and preprocessed, and a log-mel spectrogram is computed from it to be used in our transformer model.
|
||||
|
||||
Args:
|
||||
sampling_rate (`int`, *optional*, defaults to 22050):
|
||||
Target Sampling rate of audio signal. It's the sampling rate that we forward to the model.
|
||||
padding_value (`int`, *optional*, defaults to 0):
|
||||
Padding value used to pad the audio. Should correspond to silences.
|
||||
window_size (`int`, *optional*, defaults to 4096):
|
||||
Length of the window in samples to which the Fourier transform is applied.
|
||||
hop_length (`int`, *optional*, defaults to 1024):
|
||||
Step size between each window of the waveform, in samples.
|
||||
min_frequency (`float`, *optional*, defaults to 10.0):
|
||||
Lowest frequency that will be used in the log-mel spectrogram.
|
||||
feature_size (`int`, *optional*, defaults to 512):
|
||||
The feature dimension of the extracted features.
|
||||
num_bars (`int`, *optional*, defaults to 2):
|
||||
Determines interval between each sequence.
|
||||
"""
|
||||
model_input_names = ["input_features", "beatsteps", "extrapolated_beatstep"]
|
||||
|
||||
def __init__(
|
||||
self,
|
||||
sampling_rate: int = 22050,
|
||||
padding_value: int = 0,
|
||||
window_size: int = 4096,
|
||||
hop_length: int = 1024,
|
||||
min_frequency: float = 10.0,
|
||||
feature_size: int = 512,
|
||||
num_bars: int = 2,
|
||||
**kwargs,
|
||||
):
|
||||
super().__init__(
|
||||
feature_size=feature_size,
|
||||
sampling_rate=sampling_rate,
|
||||
padding_value=padding_value,
|
||||
**kwargs,
|
||||
)
|
||||
self.sampling_rate = sampling_rate
|
||||
self.padding_value = padding_value
|
||||
self.window_size = window_size
|
||||
self.hop_length = hop_length
|
||||
self.min_frequency = min_frequency
|
||||
self.feature_size = feature_size
|
||||
self.num_bars = num_bars
|
||||
self.mel_filters = mel_filter_bank(
|
||||
num_frequency_bins=(self.window_size // 2) + 1,
|
||||
num_mel_filters=self.feature_size,
|
||||
min_frequency=self.min_frequency,
|
||||
max_frequency=float(self.sampling_rate // 2),
|
||||
sampling_rate=self.sampling_rate,
|
||||
norm=None,
|
||||
mel_scale="htk",
|
||||
)
|
||||
|
||||
def mel_spectrogram(self, sequence: np.ndarray):
|
||||
"""
|
||||
Generates MelSpectrogram.
|
||||
|
||||
Args:
|
||||
sequence (`numpy.ndarray`):
|
||||
The sequence of which the mel-spectrogram will be computed.
|
||||
"""
|
||||
mel_specs = []
|
||||
for seq in sequence:
|
||||
window = np.hanning(self.window_size + 1)[:-1]
|
||||
mel_specs.append(
|
||||
spectrogram(
|
||||
waveform=seq,
|
||||
window=window,
|
||||
frame_length=self.window_size,
|
||||
hop_length=self.hop_length,
|
||||
power=2.0,
|
||||
mel_filters=self.mel_filters,
|
||||
)
|
||||
)
|
||||
mel_specs = np.array(mel_specs)
|
||||
|
||||
return mel_specs
|
||||
|
||||
def extract_rhythm(self, audio: np.ndarray):
|
||||
"""
|
||||
This algorithm(`RhythmExtractor2013`) extracts the beat positions and estimates their confidence as well as
|
||||
tempo in bpm for an audio signal. For more information please visit
|
||||
https://essentia.upf.edu/reference/std_RhythmExtractor2013.html .
|
||||
|
||||
Args:
|
||||
audio(`numpy.ndarray`):
|
||||
raw audio waveform which is passed to the Rhythm Extractor.
|
||||
"""
|
||||
requires_backends(self, ["essentia"])
|
||||
essentia_tracker = essentia.standard.RhythmExtractor2013(method="multifeature")
|
||||
bpm, beat_times, confidence, estimates, essentia_beat_intervals = essentia_tracker(audio)
|
||||
|
||||
return bpm, beat_times, confidence, estimates, essentia_beat_intervals
|
||||
|
||||
def interpolate_beat_times(
|
||||
self, beat_times: np.ndarray, steps_per_beat: int, n_extend: int
|
||||
):
|
||||
"""
|
||||
Fits a linear interpolation over `beat_times` with `scipy.interpolate.interp1d` and samples it on a denser,
extended grid. The resulting beat steps are used to segment the raw audio before it is converted to a
log-mel-spectrogram.
|
||||
|
||||
Args:
|
||||
beat_times (`numpy.ndarray`):
|
||||
Beat times (in seconds) to interpolate; passed to `scipy.interpolate.interp1d`.
|
||||
steps_per_beat (`int`):
|
||||
Number of interpolated steps per beat; controls how densely the beat grid is sampled.
|
||||
n_extend (`int`):
|
||||
Number of beat indices to extrapolate beyond the last detected beat.
|
||||
"""
|
||||
|
||||
requires_backends(self, ["scipy"])
|
||||
beat_times_function = scipy.interpolate.interp1d(
|
||||
np.arange(beat_times.size),
|
||||
beat_times,
|
||||
bounds_error=False,
|
||||
fill_value="extrapolate",
|
||||
)
|
||||
|
||||
ext_beats = beat_times_function(
|
||||
np.linspace(0, beat_times.size + n_extend - 1, beat_times.size * steps_per_beat + n_extend)
|
||||
)
|
||||
|
||||
return ext_beats
|
||||
|
||||
def preprocess_mel(self, audio: np.ndarray, beatstep: np.ndarray):
|
||||
"""
|
||||
Splits the audio into beat-aligned segments and zero-pads them to a common length, as preprocessing for
the log-mel-spectrogram.
|
||||
|
||||
Args:
|
||||
audio (`numpy.ndarray` of shape `(audio_length, )` ):
|
||||
Raw audio waveform to be processed.
|
||||
beatstep (`numpy.ndarray`):
|
||||
Interpolated beat times (in seconds) for the audio. If `beatstep[0]` is greater than 0.0, every value
is shifted by `beatstep[0]` so that the sequence starts at 0.
|
||||
"""
|
||||
|
||||
if audio is not None and len(audio.shape) != 1:
|
||||
raise ValueError(
|
||||
f"Expected `audio` to be a single channel audio input of shape `(n, )` but found shape {audio.shape}."
|
||||
)
|
||||
if beatstep[0] > 0.0:
|
||||
beatstep = beatstep - beatstep[0]
|
||||
|
||||
num_steps = self.num_bars * 4
|
||||
num_target_steps = len(beatstep)
|
||||
extrapolated_beatstep = self.interpolate_beat_times(
|
||||
beat_times=beatstep, steps_per_beat=1, n_extend=(self.num_bars + 1) * 4 + 1
|
||||
)
|
||||
|
||||
sample_indices = []
|
||||
max_feature_length = 0
|
||||
for i in range(0, num_target_steps, num_steps):
|
||||
start_idx = i
|
||||
end_idx = min(i + num_steps, num_target_steps)
|
||||
start_sample = int(extrapolated_beatstep[start_idx] * self.sampling_rate)
|
||||
end_sample = int(extrapolated_beatstep[end_idx] * self.sampling_rate)
|
||||
sample_indices.append((start_sample, end_sample))
|
||||
max_feature_length = max(max_feature_length, end_sample - start_sample)
|
||||
padded_batch = []
|
||||
for start_sample, end_sample in sample_indices:
|
||||
feature = audio[start_sample:end_sample]
|
||||
padded_feature = np.pad(
|
||||
feature,
|
||||
((0, max_feature_length - feature.shape[0]),),
|
||||
"constant",
|
||||
constant_values=0,
|
||||
)
|
||||
padded_batch.append(padded_feature)
|
||||
|
||||
padded_batch = np.asarray(padded_batch)
|
||||
return padded_batch, extrapolated_beatstep
|
||||
|
||||
def _pad(self, features: np.ndarray, add_zero_line=True):
|
||||
features_shapes = [each_feature.shape for each_feature in features]
|
||||
attention_masks, padded_features = [], []
|
||||
for i, each_feature in enumerate(features):
|
||||
# To pad "input_features".
|
||||
if len(each_feature.shape) == 3:
|
||||
features_pad_value = max([*zip(*features_shapes)][1]) - features_shapes[i][1]
|
||||
attention_mask = np.ones(features_shapes[i][:2], dtype=np.int64)
|
||||
feature_padding = ((0, 0), (0, features_pad_value), (0, 0))
|
||||
attention_mask_padding = (feature_padding[0], feature_padding[1])
|
||||
|
||||
# To pad "beatsteps" and "extrapolated_beatstep".
|
||||
else:
|
||||
each_feature = each_feature.reshape(1, -1)
|
||||
features_pad_value = max([*zip(*features_shapes)][0]) - features_shapes[i][0]
|
||||
attention_mask = np.ones(features_shapes[i], dtype=np.int64).reshape(1, -1)
|
||||
feature_padding = attention_mask_padding = ((0, 0), (0, features_pad_value))
|
||||
|
||||
each_padded_feature = np.pad(each_feature, feature_padding, "constant", constant_values=self.padding_value)
|
||||
attention_mask = np.pad(
|
||||
attention_mask, attention_mask_padding, "constant", constant_values=self.padding_value
|
||||
)
|
||||
|
||||
if add_zero_line:
|
||||
# if it is batched then we separate the examples with a zero array
|
||||
zero_array_len = max([*zip(*features_shapes)][1])
|
||||
|
||||
# we concatenate the zero array line here
|
||||
each_padded_feature = np.concatenate(
|
||||
[each_padded_feature, np.zeros([1, zero_array_len, self.feature_size])], axis=0
|
||||
)
|
||||
attention_mask = np.concatenate(
|
||||
[attention_mask, np.zeros([1, zero_array_len], dtype=attention_mask.dtype)], axis=0
|
||||
)
|
||||
|
||||
padded_features.append(each_padded_feature)
|
||||
attention_masks.append(attention_mask)
|
||||
|
||||
padded_features = np.concatenate(padded_features, axis=0).astype(np.float32)
|
||||
attention_masks = np.concatenate(attention_masks, axis=0).astype(np.int64)
|
||||
|
||||
return padded_features, attention_masks
|
||||
|
||||
def pad(
|
||||
self,
|
||||
inputs: BatchFeature,
|
||||
is_batched: bool,
|
||||
return_attention_mask: bool,
|
||||
return_tensors: Optional[Union[str, TensorType]] = None,
|
||||
):
|
||||
"""
|
||||
Pads the inputs to the same length and returns the attention masks.
|
||||
|
||||
Args:
|
||||
inputs (`BatchFeature`):
|
||||
Processed audio features.
|
||||
is_batched (`bool`):
|
||||
Whether inputs are batched or not.
|
||||
return_attention_mask (`bool`):
|
||||
Whether to return attention mask or not.
|
||||
return_tensors (`str` or [`~utils.TensorType`], *optional*):
|
||||
If set, will return tensors instead of list of python integers. Acceptable values are:
|
||||
- `'pt'`: Return PyTorch `torch.Tensor` objects.
|
||||
- `'np'`: Return Numpy `np.ndarray` objects.
|
||||
If nothing is specified, it will return list of `np.ndarray` arrays.
|
||||
Return:
|
||||
`BatchFeature` with attention_mask, attention_mask_beatsteps and attention_mask_extrapolated_beatstep added
|
||||
to it:
|
||||
- **attention_mask** numpy.ndarray of shape `(batch_size, max_input_features_seq_length)` --
|
||||
Example :
|
||||
1, 1, 1, 0, 0 (audio 1, padded to the max length of 5, which is why the 2 zeros at the end mark
padded positions)
|
||||
|
||||
0, 0, 0, 0, 0 (zero pad to separate audio 1 and 2)
|
||||
|
||||
1, 1, 1, 1, 1 (audio 2)
|
||||
|
||||
0, 0, 0, 0, 0 (zero pad to separate audio 2 and 3)
|
||||
|
||||
1, 1, 1, 1, 1 (audio 3)
|
||||
- **attention_mask_beatsteps** numpy.ndarray of shape `(batch_size, max_beatsteps_seq_length)`
|
||||
- **attention_mask_extrapolated_beatstep** numpy.ndarray of shape `(batch_size,
|
||||
max_extrapolated_beatstep_seq_length)`
|
||||
"""
|
||||
|
||||
processed_features_dict = {}
|
||||
for feature_name, feature_value in inputs.items():
|
||||
if feature_name == "input_features":
|
||||
padded_feature_values, attention_mask = self._pad(feature_value, add_zero_line=True)
|
||||
processed_features_dict[feature_name] = padded_feature_values
|
||||
if return_attention_mask:
|
||||
processed_features_dict["attention_mask"] = attention_mask
|
||||
else:
|
||||
padded_feature_values, attention_mask = self._pad(feature_value, add_zero_line=False)
|
||||
processed_features_dict[feature_name] = padded_feature_values
|
||||
if return_attention_mask:
|
||||
processed_features_dict[f"attention_mask_{feature_name}"] = attention_mask
|
||||
|
||||
# If we are processing only one example, we should remove the zero array line since we don't need it to
# separate examples from each other.
|
||||
if not is_batched and not return_attention_mask:
|
||||
processed_features_dict["input_features"] = processed_features_dict["input_features"][:-1, ...]
|
||||
|
||||
outputs = BatchFeature(processed_features_dict, tensor_type=return_tensors)
|
||||
|
||||
return outputs
|
||||
|
||||
def __call__(
|
||||
self,
|
||||
audio: Union[np.ndarray, List[float], List[np.ndarray], List[List[float]]],
|
||||
sampling_rate: Union[int, List[int]],
|
||||
steps_per_beat: int = 2,
|
||||
resample: Optional[bool] = True,
|
||||
return_attention_mask: Optional[bool] = False,
|
||||
return_tensors: Optional[Union[str, TensorType]] = None,
|
||||
**kwargs,
|
||||
) -> BatchFeature:
|
||||
"""
|
||||
Main method to featurize and prepare for the model.
|
||||
|
||||
Args:
|
||||
audio (`np.ndarray`, `List`):
|
||||
The audio or batch of audio to be processed. Each audio can be a numpy array, a list of float values, a
|
||||
list of numpy arrays or a list of list of float values.
|
||||
sampling_rate (`int` or `List[int]`):
The sampling rate at which the `audio` input was sampled. This argument is required; for batched
input, pass a list with one sampling rate per audio.
|
||||
steps_per_beat (`int`, *optional*, defaults to 2):
|
||||
This is used in interpolating `beat_times`.
|
||||
resample (`bool`, *optional*, defaults to `True`):
|
||||
Determines whether to resample the audio to `sampling_rate` or not before processing. Must be True
|
||||
during inference.
|
||||
return_attention_mask (`bool`, *optional*, defaults to `False`):
|
||||
Whether to return attention masks for `input_features`, `beatsteps` and `extrapolated_beatstep`.
If `None` is passed, it is set to `True` for batched inputs and `False` otherwise.
|
||||
return_tensors (`str` or [`~utils.TensorType`], *optional*):
|
||||
If set, will return tensors instead of list of python integers. Acceptable values are:
|
||||
- `'pt'`: Return PyTorch `torch.Tensor` objects.
|
||||
- `'np'`: Return Numpy `np.ndarray` objects.
|
||||
If nothing is specified, it will return list of `np.ndarray` arrays.
|
||||
"""
|
||||
|
||||
requires_backends(self, ["librosa"])
|
||||
is_batched = bool(isinstance(audio, (list, tuple)) and isinstance(audio[0], (np.ndarray, tuple, list)))
|
||||
if is_batched:
|
||||
# This enables the user to process files of different sampling_rate at same time
|
||||
if not isinstance(sampling_rate, list):
|
||||
raise ValueError(
|
||||
"Please give sampling_rate of each audio separately when you are passing multiple raw_audios at the same time. "
|
||||
f"Received {sampling_rate}, expected [audio_1_sr, ..., audio_n_sr]."
|
||||
)
|
||||
return_attention_mask = True if return_attention_mask is None else return_attention_mask
|
||||
else:
|
||||
audio = [audio]
|
||||
sampling_rate = [sampling_rate]
|
||||
return_attention_mask = False if return_attention_mask is None else return_attention_mask
|
||||
|
||||
batch_input_features, batch_beatsteps, batch_ext_beatstep = [], [], []
|
||||
for single_raw_audio, single_sampling_rate in zip(audio, sampling_rate):
|
||||
bpm, beat_times, confidence, estimates, essentia_beat_intervals = self.extract_rhythm(
|
||||
audio=single_raw_audio
|
||||
)
|
||||
beatsteps = self.interpolate_beat_times(beat_times=beat_times, steps_per_beat=steps_per_beat, n_extend=1)
|
||||
|
||||
if self.sampling_rate != single_sampling_rate and self.sampling_rate is not None:
|
||||
if resample:
|
||||
# Change sampling_rate to self.sampling_rate
|
||||
single_raw_audio = librosa.core.resample(
|
||||
single_raw_audio,
|
||||
orig_sr=single_sampling_rate,
|
||||
target_sr=self.sampling_rate,
|
||||
res_type="kaiser_best",
|
||||
)
|
||||
else:
|
||||
warnings.warn(
|
||||
f"The sampling_rate of the provided audio is different from the target sampling_rate "
|
||||
f"of the Feature Extractor, {self.sampling_rate} vs {single_sampling_rate}. "
|
||||
f"In these cases it is recommended to use `resample=True` in the `__call__` method to "
|
||||
f"get the optimal behaviour."
|
||||
)
|
||||
|
||||
single_sampling_rate = self.sampling_rate
|
||||
start_sample = int(beatsteps[0] * single_sampling_rate)
|
||||
end_sample = int(beatsteps[-1] * single_sampling_rate)
|
||||
|
||||
input_features, extrapolated_beatstep = self.preprocess_mel(
|
||||
single_raw_audio[start_sample:end_sample], beatsteps - beatsteps[0]
|
||||
)
|
||||
|
||||
mel_specs = self.mel_spectrogram(input_features.astype(np.float32))
|
||||
|
||||
# apply np.log to get log mel-spectrograms
|
||||
log_mel_specs = np.log(np.clip(mel_specs, a_min=1e-6, a_max=None))
|
||||
|
||||
input_features = np.transpose(log_mel_specs, (0, -1, -2))
|
||||
|
||||
batch_input_features.append(input_features)
|
||||
batch_beatsteps.append(beatsteps)
|
||||
batch_ext_beatstep.append(extrapolated_beatstep)
|
||||
|
||||
output = BatchFeature(
|
||||
{
|
||||
"input_features": batch_input_features,
|
||||
"beatsteps": batch_beatsteps,
|
||||
"extrapolated_beatstep": batch_ext_beatstep,
|
||||
}
|
||||
)
|
||||
|
||||
output = self.pad(
|
||||
output,
|
||||
is_batched=is_batched,
|
||||
return_attention_mask=return_attention_mask,
|
||||
return_tensors=return_tensors,
|
||||
)
|
||||
|
||||
return output
|
||||
|
||||
def to_dict(self):
|
||||
"""
|
||||
Serializes this instance to a Python dictionary.
|
||||
|
||||
Returns:
|
||||
`Dict[str, Any]`: Dictionary of all the attributes that make up this feature extractor instance.
|
||||
"""
|
||||
output = copy.deepcopy(self.__dict__)
|
||||
output["feature_extractor_type"] = self.__class__.__name__
|
||||
if "mel_filters" in output:
|
||||
del output["mel_filters"]
|
||||
return output
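The beat-grid resampling done by `interpolate_beat_times` in the file above can be illustrated without SciPy. What follows is a minimal sketch rather than the code from this commit: it assumes linear interpolation with linear extrapolation from the last segment (the expected behaviour of `interp1d` with its default linear kind and `fill_value="extrapolate"`), and the name `interpolate_beat_times_sketch` is hypothetical.

```python
def interpolate_beat_times_sketch(beat_times, steps_per_beat, n_extend):
    """Sample beat times on a denser beat-index grid, extrapolating past the end.

    Hypothetical pure-Python stand-in for the SciPy-based method; it assumes
    piecewise-linear interpolation over the beat index 0, 1, ..., size - 1.
    """
    size = len(beat_times)
    if size < 2:
        raise ValueError("need at least two beat times to interpolate")

    num_points = size * steps_per_beat + n_extend
    last_x = size + n_extend - 1
    # Evenly spaced query positions, mirroring np.linspace(0, last_x, num_points).
    xs = [last_x * i / (num_points - 1) for i in range(num_points)]

    out = []
    for x in xs:
        # Pick the segment to the left of x; clamp to the final segment so that
        # positions beyond the last beat are extrapolated along its slope.
        lo = min(int(x), size - 2)
        slope = beat_times[lo + 1] - beat_times[lo]
        out.append(beat_times[lo] + slope * (x - lo))
    return out


# Four beats, half a second apart, resampled at two steps per beat plus one extra step.
beatsteps = interpolate_beat_times_sketch([0.0, 0.5, 1.0, 1.5], steps_per_beat=2, n_extend=1)
print(beatsteps)
```

With evenly spaced beats the result is an evenly spaced grid at half the beat interval; with an uneven tempo, each segment keeps its own slope, which is why the feature extractor can cut the audio on musically aligned boundaries.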
|
||||
1377
src/transformers/models/pop2piano/modeling_pop2piano.py
Normal file
File diff suppressed because it is too large
138
src/transformers/models/pop2piano/processing_pop2piano.py
Normal file
@@ -0,0 +1,138 @@
|
||||
# coding=utf-8
|
||||
# Copyright 2023 The HuggingFace Inc. team.
|
||||
#
|
||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||
# you may not use this file except in compliance with the License.
|
||||
# You may obtain a copy of the License at
|
||||
#
|
||||
# http://www.apache.org/licenses/LICENSE-2.0
|
||||
#
|
||||
# Unless required by applicable law or agreed to in writing, software
|
||||
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
""" Processor class for Pop2Piano."""
|
||||
|
||||
import os
|
||||
from typing import List, Optional, Union
|
||||
|
||||
import numpy as np
|
||||
|
||||
from ...feature_extraction_utils import BatchFeature
|
||||
from ...processing_utils import ProcessorMixin
|
||||
from ...tokenization_utils import BatchEncoding, PaddingStrategy, TruncationStrategy
|
||||
from ...utils import TensorType
|
||||
|
||||
|
||||
class Pop2PianoProcessor(ProcessorMixin):
|
||||
r"""
|
||||
Constructs a Pop2Piano processor which wraps a Pop2Piano Feature Extractor and Pop2Piano Tokenizer into a single
|
||||
processor.
|
||||
|
||||
[`Pop2PianoProcessor`] offers all the functionalities of [`Pop2PianoFeatureExtractor`] and [`Pop2PianoTokenizer`].
|
||||
See the docstring of [`~Pop2PianoProcessor.__call__`] and [`~Pop2PianoProcessor.decode`] for more information.
|
||||
|
||||
Args:
|
||||
feature_extractor (`Pop2PianoFeatureExtractor`):
|
||||
An instance of [`Pop2PianoFeatureExtractor`]. The feature extractor is a required input.
|
||||
tokenizer (`Pop2PianoTokenizer`):
|
||||
An instance of [`Pop2PianoTokenizer`]. The tokenizer is a required input.
|
||||
"""
|
||||
attributes = ["feature_extractor", "tokenizer"]
|
||||
feature_extractor_class = "Pop2PianoFeatureExtractor"
|
||||
tokenizer_class = "Pop2PianoTokenizer"
|
||||
|
||||
def __init__(self, feature_extractor, tokenizer):
|
||||
super().__init__(feature_extractor, tokenizer)
|
||||
|
||||
def __call__(
|
||||
self,
|
||||
audio: Union[np.ndarray, List[float], List[np.ndarray]] = None,
|
||||
sampling_rate: Union[int, List[int]] = None,
|
||||
steps_per_beat: int = 2,
|
||||
resample: Optional[bool] = True,
|
||||
notes: Union[List, TensorType] = None,
|
||||
padding: Union[bool, str, PaddingStrategy] = False,
|
||||
truncation: Union[bool, str, TruncationStrategy] = None,
|
||||
max_length: Optional[int] = None,
|
||||
pad_to_multiple_of: Optional[int] = None,
|
||||
verbose: bool = True,
|
||||
**kwargs,
|
||||
) -> Union[BatchFeature, BatchEncoding]:
|
||||
"""
|
||||
This method uses [`Pop2PianoFeatureExtractor.__call__`] method to prepare log-mel-spectrograms for the model,
|
||||
and [`Pop2PianoTokenizer.__call__`] to prepare token_ids from notes.
|
||||
|
||||
Please refer to the docstring of the above two methods for more information.
|
||||
"""
|
||||
|
||||
# Since Feature Extractor needs both audio and sampling_rate and tokenizer needs both token_ids and
|
||||
# feature_extractor_output, we must check for both.
|
||||
if (audio is None or sampling_rate is None) and (notes is None):
|
||||
raise ValueError(
|
||||
"You have to specify at least `audio` and `sampling_rate` in order to use feature extractor or "
|
||||
"notes to use the tokenizer part."
|
||||
)
|
||||
|
||||
if audio is not None and sampling_rate is not None:
|
||||
inputs = self.feature_extractor(
|
||||
audio=audio,
|
||||
sampling_rate=sampling_rate,
|
||||
steps_per_beat=steps_per_beat,
|
||||
resample=resample,
|
||||
**kwargs,
|
||||
)
|
||||
if notes is not None:
|
||||
encoded_token_ids = self.tokenizer(
|
||||
notes=notes,
|
||||
padding=padding,
|
||||
truncation=truncation,
|
||||
max_length=max_length,
|
||||
pad_to_multiple_of=pad_to_multiple_of,
|
||||
verbose=verbose,
|
||||
**kwargs,
|
||||
)
|
||||
|
||||
if notes is None:
|
||||
return inputs
|
||||
|
||||
elif audio is None or sampling_rate is None:
|
||||
return encoded_token_ids
|
||||
|
||||
else:
|
||||
inputs["token_ids"] = encoded_token_ids["token_ids"]
|
||||
return inputs
|
||||
|
||||
def batch_decode(
|
||||
self,
|
||||
token_ids,
|
||||
feature_extractor_output: BatchFeature,
|
||||
return_midi: bool = True,
|
||||
) -> BatchEncoding:
|
||||
"""
|
||||
This method uses [`Pop2PianoTokenizer.batch_decode`] method to convert model generated token_ids to midi_notes.
|
||||
|
||||
Please refer to the docstring of the above method for more information.
|
||||
"""
|
||||
|
||||
return self.tokenizer.batch_decode(
|
||||
token_ids=token_ids, feature_extractor_output=feature_extractor_output, return_midi=return_midi
|
||||
)
|
||||
|
||||
@property
|
||||
def model_input_names(self):
|
||||
tokenizer_input_names = self.tokenizer.model_input_names
|
||||
feature_extractor_input_names = self.feature_extractor.model_input_names
|
||||
return list(dict.fromkeys(tokenizer_input_names + feature_extractor_input_names))
|
||||
|
||||
def save_pretrained(self, save_directory, **kwargs):
|
||||
if os.path.isfile(save_directory):
|
||||
raise ValueError(f"Provided path ({save_directory}) should be a directory, not a file")
|
||||
os.makedirs(save_directory, exist_ok=True)
|
||||
return super().save_pretrained(save_directory, **kwargs)
|
||||
|
||||
@classmethod
|
||||
def from_pretrained(cls, pretrained_model_name_or_path, **kwargs):
|
||||
args = cls._get_arguments_from_pretrained(pretrained_model_name_or_path, **kwargs)
|
||||
return cls(*args)
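The routing in `Pop2PianoProcessor.__call__` above (feature extractor for audio, tokenizer for notes, merged output when both are given) can be sketched in isolation. This is a simplified illustration under stated assumptions: `dispatch_call`, `fake_feature_extractor` and `fake_tokenizer` are hypothetical names, plain dicts stand in for `BatchFeature` and `BatchEncoding`, and the sketch treats audio without a sampling rate as missing audio.

```python
def dispatch_call(feature_extractor, tokenizer, audio=None, sampling_rate=None, notes=None):
    """Route inputs to the feature extractor and/or the tokenizer (simplified)."""
    has_audio = audio is not None and sampling_rate is not None
    if not has_audio and notes is None:
        raise ValueError("Pass `audio` together with `sampling_rate`, or `notes`, or both.")

    inputs = feature_extractor(audio, sampling_rate) if has_audio else None
    encoded = tokenizer(notes) if notes is not None else None

    if encoded is None:
        return inputs  # audio only
    if inputs is None:
        return encoded  # notes only
    # Both were given: attach the token ids to the audio features.
    inputs["token_ids"] = encoded["token_ids"]
    return inputs


# Hypothetical stand-ins so the sketch runs without essentia, librosa or pretty_midi.
def fake_feature_extractor(audio, sampling_rate):
    return {"input_features": [len(audio)], "beatsteps": [0.0]}


def fake_tokenizer(notes):
    return {"token_ids": list(range(len(notes)))}


audio_only = dispatch_call(fake_feature_extractor, fake_tokenizer, audio=[0.0, 0.1], sampling_rate=44100)
both = dispatch_call(
    fake_feature_extractor, fake_tokenizer, audio=[0.0, 0.1], sampling_rate=44100, notes=[(0, 1, 60, 77)]
)
print(sorted(audio_only), sorted(both))
```

Computing both intermediate results first and then choosing what to return keeps every return path defined, whichever combination of arguments the caller supplies.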
|
||||
714
src/transformers/models/pop2piano/tokenization_pop2piano.py
Normal file
@@ -0,0 +1,714 @@
|
||||
# coding=utf-8
|
||||
# Copyright 2023 The Pop2Piano Authors and The HuggingFace Inc. team.
|
||||
#
|
||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||
# you may not use this file except in compliance with the License.
|
||||
# You may obtain a copy of the License at
|
||||
#
|
||||
# http://www.apache.org/licenses/LICENSE-2.0
|
||||
#
|
||||
# Unless required by applicable law or agreed to in writing, software
|
||||
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
"""Tokenization class for Pop2Piano."""
|
||||
|
||||
import json
|
||||
import os
|
||||
from typing import List, Optional, Tuple, Union
|
||||
|
||||
import numpy as np
|
||||
|
||||
from ...feature_extraction_utils import BatchFeature
|
||||
from ...tokenization_utils import AddedToken, BatchEncoding, PaddingStrategy, PreTrainedTokenizer, TruncationStrategy
|
||||
from ...utils import TensorType, is_pretty_midi_available, logging, requires_backends, to_numpy
|
||||
|
||||
|
||||
if is_pretty_midi_available():
|
||||
import pretty_midi
|
||||
|
||||
logger = logging.get_logger(__name__)
|
||||
|
||||
## TODO : changing checkpoints from `susnato/pop2piano_dev` to `sweetcocoa/pop2piano` after the PR is approved
|
||||
|
||||
VOCAB_FILES_NAMES = {
|
||||
"vocab": "vocab.json",
|
||||
}
|
||||
|
||||
PRETRAINED_VOCAB_FILES_MAP = {
|
||||
"vocab": {
|
||||
"susnato/pop2piano_dev": "https://huggingface.co/susnato/pop2piano_dev/blob/main/vocab.json",
|
||||
},
|
||||
}
|
||||
|
||||
|
||||
def token_time_to_note(number, cutoff_time_idx, current_idx):
|
||||
current_idx += number
|
||||
if cutoff_time_idx is not None:
|
||||
current_idx = min(current_idx, cutoff_time_idx)
|
||||
|
||||
return current_idx
|
||||
|
||||
|
||||
def token_note_to_note(number, current_velocity, default_velocity, note_onsets_ready, current_idx, notes):
|
||||
if note_onsets_ready[number] is not None:
|
||||
# offset with onset
|
||||
onset_idx = note_onsets_ready[number]
|
||||
if onset_idx < current_idx:
|
||||
# Time shift after previous note_on
|
||||
offset_idx = current_idx
|
||||
notes.append([onset_idx, offset_idx, number, default_velocity])
|
||||
onsets_ready = None if current_velocity == 0 else current_idx
|
||||
note_onsets_ready[number] = onsets_ready
|
||||
else:
|
||||
note_onsets_ready[number] = current_idx
|
||||
return notes
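How the two helpers above pair note tokens into `[onset, offset, pitch, velocity]` rows is easiest to see on a tiny sequence. The sketch below repeats both helpers so that it runs on its own; their indentation is reconstructed, since this listing does not preserve it. The `decode` driver is a simplified, hypothetical stand-in for `relative_tokens_ids_to_notes`: it takes ready-made `(token_type, value)` pairs, skips the special-token and forced-offset handling, and assumes a 128-slot onset table (the tokenizer sizes it from the vocabulary).

```python
def token_time_to_note(number, cutoff_time_idx, current_idx):
    # Advance the time cursor, never past the cutoff (if one is set).
    current_idx += number
    if cutoff_time_idx is not None:
        current_idx = min(current_idx, cutoff_time_idx)
    return current_idx


def token_note_to_note(number, current_velocity, default_velocity, note_onsets_ready, current_idx, notes):
    if note_onsets_ready[number] is not None:
        # The pitch is already sounding: close it if time has moved on.
        onset_idx = note_onsets_ready[number]
        if onset_idx < current_idx:
            notes.append([onset_idx, current_idx, number, default_velocity])
            # Velocity 0 means note-off; anything else re-triggers the pitch.
            note_onsets_ready[number] = None if current_velocity == 0 else current_idx
    else:
        # First time we see this pitch: remember where it started.
        note_onsets_ready[number] = current_idx
    return notes


def decode(words, default_velocity=77, cutoff_time_idx=None, num_pitches=128):
    """Hypothetical driver: fold (token_type, value) pairs into note rows."""
    current_idx, current_velocity = 0, 0
    note_onsets_ready = [None] * num_pitches
    notes = []
    for token_type, number in words:
        if token_type == "TOKEN_TIME":
            current_idx = token_time_to_note(number, cutoff_time_idx, current_idx)
        elif token_type == "TOKEN_VELOCITY":
            current_velocity = number
        elif token_type == "TOKEN_NOTE":
            notes = token_note_to_note(
                number, current_velocity, default_velocity, note_onsets_ready, current_idx, notes
            )
    return notes


# Note-on for pitch 60 at step 0, time shift of 2 steps, then note-off for pitch 60.
words = [
    ("TOKEN_VELOCITY", 77),
    ("TOKEN_NOTE", 60),
    ("TOKEN_TIME", 2),
    ("TOKEN_VELOCITY", 0),
    ("TOKEN_NOTE", 60),
]
print(decode(words))  # [[0, 2, 60, 77]]
```

The same `TOKEN_NOTE` value therefore acts as both note-on and note-off; which one it is depends on whether the pitch already has a pending onset and on the most recent velocity token.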
|
||||
|
||||
|
||||
class Pop2PianoTokenizer(PreTrainedTokenizer):
|
||||
"""
|
||||
Constructs a Pop2Piano tokenizer. This tokenizer does not require training.
|
||||
|
||||
This tokenizer inherits from [`PreTrainedTokenizer`] which contains most of the main methods. Users should refer to
|
||||
this superclass for more information regarding those methods.
|
||||
|
||||
Args:
|
||||
vocab (`str`):
|
||||
Path to the vocab file which contains the vocabulary.
|
||||
default_velocity (`int`, *optional*, defaults to 77):
|
||||
Determines the default velocity to be used while creating midi Notes.
|
||||
num_bars (`int`, *optional*, defaults to 2):
|
||||
Determines `cutoff_time_idx` for each token.
|
||||
"""
|
||||
|
||||
model_input_names = ["token_ids", "attention_mask"]
|
||||
vocab_files_names = VOCAB_FILES_NAMES
|
||||
pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP
|
||||
|
||||
def __init__(
|
||||
self,
|
||||
vocab,
|
||||
default_velocity=77,
|
||||
num_bars=2,
|
||||
unk_token="-1",
|
||||
eos_token="1",
|
||||
pad_token="0",
|
||||
bos_token="2",
|
||||
**kwargs,
|
||||
):
|
||||
unk_token = AddedToken(unk_token, lstrip=False, rstrip=False) if isinstance(unk_token, str) else unk_token
|
||||
eos_token = AddedToken(eos_token, lstrip=False, rstrip=False) if isinstance(eos_token, str) else eos_token
|
||||
pad_token = AddedToken(pad_token, lstrip=False, rstrip=False) if isinstance(pad_token, str) else pad_token
|
||||
bos_token = AddedToken(bos_token, lstrip=False, rstrip=False) if isinstance(bos_token, str) else bos_token
|
||||
|
||||
super().__init__(
|
||||
unk_token=unk_token,
|
||||
eos_token=eos_token,
|
||||
pad_token=pad_token,
|
||||
bos_token=bos_token,
|
||||
**kwargs,
|
||||
)
|
||||
|
||||
self.default_velocity = default_velocity
|
||||
self.num_bars = num_bars
|
||||
|
||||
# Load the vocab
|
||||
with open(vocab, "rb") as file:
|
||||
self.encoder = json.load(file)
|
||||
|
||||
# create mappings for encoder
|
||||
self.decoder = {v: k for k, v in self.encoder.items()}
|
||||
|
||||
@property
|
||||
def vocab_size(self):
|
||||
"""Returns the vocabulary size of the tokenizer."""
|
||||
return len(self.encoder)
|
||||
|
||||
def get_vocab(self):
|
||||
"""Returns the vocabulary of the tokenizer."""
|
||||
return dict(self.encoder, **self.added_tokens_encoder)
|
||||
|
||||
def _convert_id_to_token(self, token_id: int) -> list:
|
||||
"""
|
||||
Decodes the token ids generated by the transformer into notes.
|
||||
|
||||
Args:
|
||||
token_id (`int`):
|
||||
This denotes the ids generated by the transformers to be converted to Midi tokens.
|
||||
|
||||
Returns:
|
||||
`List`: A list consisting of token_type (`str`) and value (`int`).
|
||||
"""
|
||||
|
||||
token_type_value = self.decoder.get(token_id, f"{self.unk_token}_TOKEN_TIME")
|
||||
token_type_value = token_type_value.split("_")
|
||||
token_type, value = "_".join(token_type_value[1:]), int(token_type_value[0])
|
||||
|
||||
return [token_type, value]
|
||||
|
||||
def _convert_token_to_id(self, token, token_type="TOKEN_TIME") -> int:
|
||||
"""
|
||||
Encodes the Midi tokens to transformer generated token ids.
|
||||
|
||||
Args:
|
||||
token (`int`):
|
||||
This denotes the token value.
|
||||
token_type (`str`):
|
||||
This denotes the type of the token. There are four types of midi tokens: "TOKEN_TIME",
|
||||
"TOKEN_VELOCITY", "TOKEN_NOTE" and "TOKEN_SPECIAL".
|
||||
|
||||
Returns:
|
||||
`int`: returns the id of the token.
|
||||
"""
|
||||
return self.encoder.get(f"{token}_{token_type}", int(self.unk_token))
|
||||
|
||||
def relative_batch_tokens_ids_to_notes(
|
||||
self,
|
||||
tokens: np.ndarray,
|
||||
beat_offset_idx: int,
|
||||
bars_per_batch: int,
|
||||
cutoff_time_idx: int,
|
||||
):
|
||||
"""
|
||||
Converts relative tokens to notes which are then used to generate pretty midi object.
|
||||
|
||||
Args:
|
||||
tokens (`numpy.ndarray`):
|
||||
Tokens to be converted to notes.
|
||||
beat_offset_idx (`int`):
|
||||
Denotes beat offset index for each note in generated Midi.
|
||||
bars_per_batch (`int`):
|
||||
A parameter to control the Midi output generation.
|
||||
cutoff_time_idx (`int`):
|
||||
Denotes the cutoff time index for each note in generated Midi.
|
||||
"""
|
||||
|
||||
notes = None
|
||||
|
||||
for index in range(len(tokens)):
|
||||
_tokens = tokens[index]
|
||||
_start_idx = beat_offset_idx + index * bars_per_batch * 4
|
||||
_cutoff_time_idx = cutoff_time_idx + _start_idx
|
||||
_notes = self.relative_tokens_ids_to_notes(
|
||||
_tokens,
|
||||
start_idx=_start_idx,
|
||||
cutoff_time_idx=_cutoff_time_idx,
|
||||
)
|
||||
|
||||
if len(_notes) == 0:
|
||||
pass
|
||||
elif notes is None:
|
||||
notes = _notes
|
||||
else:
|
||||
notes = np.concatenate((notes, _notes), axis=0)
|
||||
|
||||
if notes is None:
|
||||
return []
|
||||
return notes
|
||||
|
||||
def relative_batch_tokens_ids_to_midi(
|
||||
self,
|
||||
tokens: np.ndarray,
|
||||
beatstep: np.ndarray,
|
||||
beat_offset_idx: int = 0,
|
||||
bars_per_batch: int = 2,
|
||||
cutoff_time_idx: int = 12,
|
||||
):
|
||||
"""
|
||||
Converts tokens to Midi. This method calls `relative_batch_tokens_ids_to_notes` method to convert batch tokens
|
||||
to notes then uses `notes_to_midi` method to convert them to Midi.
|
||||
|
||||
Args:
|
||||
tokens (`numpy.ndarray`):
|
||||
Denotes tokens which alongside beatstep will be converted to Midi.
|
||||
beatstep (`np.ndarray`):
|
||||
We get beatstep from feature extractor which is also used to get Midi.
|
||||
beat_offset_idx (`int`, *optional*, defaults to 0):
|
||||
Denotes beat offset index for each note in generated Midi.
|
||||
bars_per_batch (`int`, *optional*, defaults to 2):
|
||||
A parameter to control the Midi output generation.
|
||||
cutoff_time_idx (`int`, *optional*, defaults to 12):
|
||||
Denotes the cutoff time index for each note in generated Midi.
|
||||
"""
|
||||
beat_offset_idx = 0 if beat_offset_idx is None else beat_offset_idx
|
||||
notes = self.relative_batch_tokens_ids_to_notes(
|
||||
tokens=tokens,
|
||||
beat_offset_idx=beat_offset_idx,
|
||||
bars_per_batch=bars_per_batch,
|
||||
cutoff_time_idx=cutoff_time_idx,
|
||||
)
|
||||
midi = self.notes_to_midi(notes, beatstep, offset_sec=beatstep[beat_offset_idx])
|
||||
return midi
|
||||
|
||||
# Taken from the original code
|
||||
# Please see https://github.com/sweetcocoa/pop2piano/blob/fac11e8dcfc73487513f4588e8d0c22a22f2fdc5/midi_tokenizer.py#L257
|
||||
def relative_tokens_ids_to_notes(self, tokens: np.ndarray, start_idx: float, cutoff_time_idx: float = None):
|
||||
"""
|
||||
Converts relative tokens to notes which will then be used to create Pretty Midi objects.
|
||||
|
||||
Args:
|
||||
tokens (`numpy.ndarray`):
|
||||
Relative Tokens which will be converted to notes.
|
||||
start_idx (`float`):
|
||||
A parameter which denotes the starting index.
|
||||
cutoff_time_idx (`float`, *optional*):
|
||||
A parameter used while converting tokens to notes.
|
||||
"""
|
||||
words = [self._convert_id_to_token(token) for token in tokens]
|
||||
|
||||
current_idx = start_idx
|
||||
current_velocity = 0
|
||||
note_onsets_ready = [None for i in range(sum([k.endswith("NOTE") for k in self.encoder.keys()]) + 1)]
|
||||
notes = []
|
||||
for token_type, number in words:
|
||||
if token_type == "TOKEN_SPECIAL":
|
||||
if number == 1:
|
||||
break
|
||||
elif token_type == "TOKEN_TIME":
|
||||
current_idx = token_time_to_note(
|
||||
number=number, cutoff_time_idx=cutoff_time_idx, current_idx=current_idx
|
||||
)
|
||||
elif token_type == "TOKEN_VELOCITY":
|
||||
current_velocity = number
|
||||
|
||||
elif token_type == "TOKEN_NOTE":
|
||||
notes = token_note_to_note(
|
||||
number=number,
|
||||
current_velocity=current_velocity,
|
||||
default_velocity=self.default_velocity,
|
||||
note_onsets_ready=note_onsets_ready,
|
||||
current_idx=current_idx,
|
||||
notes=notes,
|
||||
)
|
||||
else:
|
||||
raise ValueError("Token type not understood!")
|
||||
|
||||
for pitch, note_onset in enumerate(note_onsets_ready):
|
||||
# force offset if no offset for each pitch
|
||||
if note_onset is not None:
|
||||
if cutoff_time_idx is None:
|
||||
cutoff = note_onset + 1
|
||||
else:
|
||||
cutoff = max(cutoff_time_idx, note_onset + 1)
|
||||
|
||||
offset_idx = max(current_idx, cutoff)
|
||||
notes.append([note_onset, offset_idx, pitch, self.default_velocity])
|
||||
|
||||
if len(notes) == 0:
|
||||
return []
|
||||
else:
|
||||
notes = np.array(notes)
|
||||
note_order = notes[:, 0] * 128 + notes[:, 1]
|
||||
notes = notes[note_order.argsort()]
|
||||
return notes
|
||||
|
||||
def notes_to_midi(self, notes: np.ndarray, beatstep: np.ndarray, offset_sec: float = 0.0):
|
||||
"""
|
||||
Converts notes to Midi.
|
||||
|
||||
Args:
|
||||
notes (`numpy.ndarray`):
|
||||
This is used to create Pretty Midi objects.
|
||||
beatstep (`numpy.ndarray`):
|
||||
This is the extrapolated beatstep that we get from feature extractor.
|
||||
offset_sec (`float`, *optional*, defaults to 0.0):
|
||||
This represents the offset seconds which is used while creating each Pretty Midi Note.
|
||||
"""
|
||||
|
||||
requires_backends(self, ["pretty_midi"])
|
||||
|
||||
new_pm = pretty_midi.PrettyMIDI(resolution=384, initial_tempo=120.0)
|
||||
new_inst = pretty_midi.Instrument(program=0)
|
||||
new_notes = []
|
||||
|
||||
for onset_idx, offset_idx, pitch, velocity in notes:
|
||||
new_note = pretty_midi.Note(
|
||||
velocity=velocity,
|
||||
pitch=pitch,
|
||||
start=beatstep[onset_idx] - offset_sec,
|
||||
end=beatstep[offset_idx] - offset_sec,
|
||||
)
|
||||
new_notes.append(new_note)
|
||||
new_inst.notes = new_notes
|
||||
new_pm.instruments.append(new_inst)
|
||||
new_pm.remove_invalid_notes()
|
||||
return new_pm
|
||||
|
||||
def save_vocabulary(self, save_directory: str, filename_prefix: Optional[str] = None) -> Tuple[str]:
|
||||
"""
|
||||
Saves the tokenizer's vocabulary dictionary to the provided save_directory.
|
||||
|
||||
Args:
|
||||
save_directory (`str`):
|
||||
A path to the directory in which to save the vocabulary. It will be created if it doesn't exist.
|
||||
filename_prefix (`Optional[str]`, *optional*):
|
||||
A prefix to add to the names of the files saved by the tokenizer.
|
||||
"""
|
||||
if not os.path.isdir(save_directory):
|
||||
logger.error(f"Vocabulary path ({save_directory}) should be a directory")
|
||||
return
|
||||
|
||||
# Save the encoder.
|
||||
out_vocab_file = os.path.join(
|
||||
save_directory, (filename_prefix + "-" if filename_prefix else "") + VOCAB_FILES_NAMES["vocab"]
|
||||
)
|
||||
with open(out_vocab_file, "w") as file:
|
||||
file.write(json.dumps(self.encoder))
|
||||
|
||||
return (out_vocab_file,)
|
||||
|
||||
def encode_plus(
|
||||
self,
|
||||
notes: Union[np.ndarray, List[pretty_midi.Note]],
|
||||
truncation_strategy: Optional[TruncationStrategy] = None,
|
||||
max_length: Optional[int] = None,
|
||||
**kwargs,
|
||||
) -> BatchEncoding:
|
||||
r"""
|
||||
This is the `encode_plus` method for `Pop2PianoTokenizer`. It converts the midi notes to the transformer
|
||||
generated token ids. It only works on a single batch; to process multiple batches, please use the
|
||||
`batch_encode_plus` or `__call__` method.
|
||||
|
||||
Args:
|
||||
notes (`numpy.ndarray` of shape `[sequence_length, 4]` or `list` of `pretty_midi.Note` objects):
|
||||
This represents the midi notes. If `notes` is a `numpy.ndarray`:
|
||||
- Each sequence must have 4 values, they are `onset idx`, `offset idx`, `pitch` and `velocity`.
|
||||
If `notes` is a `list` containing `pretty_midi.Note` objects:
|
||||
- Each sequence must have 4 attributes, they are `start`, `end`, `pitch` and `velocity`.
|
||||
truncation_strategy ([`~tokenization_utils_base.TruncationStrategy`], *optional*):
|
||||
Indicates the truncation strategy that is going to be used during truncation.
|
||||
max_length (`int`, *optional*):
|
||||
Maximum length of the returned list and optionally padding length (see above).
|
||||
|
||||
Returns:
|
||||
`BatchEncoding` containing the token ids.
|
||||
"""
|
||||
|
||||
requires_backends(self, ["pretty_midi"])
|
||||
|
||||
# check if notes is a pretty_midi object or not, if yes then extract the attributes and put them into a numpy
|
||||
# array.
|
||||
if isinstance(notes[0], pretty_midi.Note):
|
||||
notes = np.array(
|
||||
[[each_note.start, each_note.end, each_note.pitch, each_note.velocity] for each_note in notes]
|
||||
).reshape(-1, 4)
|
||||
|
||||
# to round up all the values to the closest int values.
|
||||
notes = np.round(notes).astype(np.int32)
|
||||
max_time_idx = notes[:, :2].max()
|
||||
|
||||
times = [[] for i in range((max_time_idx + 1))]
|
||||
for onset, offset, pitch, velocity in notes:
|
||||
times[onset].append([pitch, velocity])
|
||||
times[offset].append([pitch, 0])
|
||||
|
||||
tokens = []
|
||||
current_velocity = 0
|
||||
for i, time in enumerate(times):
|
||||
if len(time) == 0:
|
||||
continue
|
||||
tokens.append(self._convert_token_to_id(i, "TOKEN_TIME"))
|
||||
for pitch, velocity in time:
|
||||
velocity = int(velocity > 0)
|
||||
if current_velocity != velocity:
|
||||
current_velocity = velocity
|
||||
tokens.append(self._convert_token_to_id(velocity, "TOKEN_VELOCITY"))
|
||||
tokens.append(self._convert_token_to_id(pitch, "TOKEN_NOTE"))
|
||||
|
||||
total_len = len(tokens)
|
||||
|
||||
# truncation
|
||||
if truncation_strategy != TruncationStrategy.DO_NOT_TRUNCATE and max_length and total_len > max_length:
|
||||
tokens, _, _ = self.truncate_sequences(
|
||||
ids=tokens,
|
||||
num_tokens_to_remove=total_len - max_length,
|
||||
truncation_strategy=truncation_strategy,
|
||||
**kwargs,
|
||||
)
|
||||
|
||||
return BatchEncoding({"token_ids": tokens})
|
||||
|
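The loop in `encode_plus` buckets note-on and note-off events by time index, then emits a time token per non-empty bucket and a velocity token only when the on/off state changes. A minimal sketch of that flow is below; it emits symbolic `(type, value)` pairs instead of vocabulary ids, because the real id mapping lives in `_convert_token_to_id`, and `notes_to_events` is a hypothetical name.

```python
def notes_to_events(notes):
    """Hypothetical sketch: emit symbolic (type, value) events instead of vocabulary ids."""
    max_time_idx = max(max(onset, offset) for onset, offset, _, _ in notes)
    times = [[] for _ in range(max_time_idx + 1)]
    for onset, offset, pitch, velocity in notes:
        times[onset].append((pitch, velocity))
        times[offset].append((pitch, 0))  # velocity 0 marks a note-off

    events = []
    current_velocity = 0
    for time_idx, bucket in enumerate(times):
        if not bucket:
            continue
        events.append(("TOKEN_TIME", time_idx))
        for pitch, velocity in bucket:
            velocity = int(velocity > 0)
            if velocity != current_velocity:
                current_velocity = velocity
                events.append(("TOKEN_VELOCITY", velocity))
            events.append(("TOKEN_NOTE", pitch))
    return events


events = notes_to_events([(0, 2, 60, 77), (1, 2, 64, 77)])
```

Because the velocity state is carried across time steps, the second note-on at index 1 needs no extra velocity token.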
||||
def batch_encode_plus(
|
||||
self,
|
||||
notes: Union[np.ndarray, List[pretty_midi.Note]],
|
||||
truncation_strategy: Optional[TruncationStrategy] = None,
|
||||
max_length: Optional[int] = None,
|
||||
**kwargs,
|
||||
) -> BatchEncoding:
|
||||
r"""
|
||||
This is the `batch_encode_plus` method for `Pop2PianoTokenizer`. It converts the midi notes to the transformer
|
||||
generated token ids. It works on multiple batches by calling `encode_plus` multiple times in a loop.
|
||||
|
||||
Args:
|
||||
notes (`numpy.ndarray` of shape `[batch_size, sequence_length, 4]` or `list` of `pretty_midi.Note` objects):
|
||||
This represents the midi notes. If `notes` is a `numpy.ndarray`:
|
||||
- Each sequence must have 4 values, they are `onset idx`, `offset idx`, `pitch` and `velocity`.
|
||||
If `notes` is a `list` containing `pretty_midi.Note` objects:
|
||||
- Each sequence must have 4 attributes, they are `start`, `end`, `pitch` and `velocity`.
|
||||
truncation_strategy ([`~tokenization_utils_base.TruncationStrategy`], *optional*):
|
||||
Indicates the truncation strategy that is going to be used during truncation.
|
||||
max_length (`int`, *optional*):
|
||||
Maximum length of the returned list and optionally padding length (see above).
|
||||
|
||||
Returns:
|
||||
`BatchEncoding` containing the token ids.
|
||||
"""
|
||||
|
||||
encoded_batch_token_ids = []
|
||||
for i in range(len(notes)):
|
||||
encoded_batch_token_ids.append(
|
||||
self.encode_plus(
|
||||
notes[i],
|
||||
truncation_strategy=truncation_strategy,
|
||||
max_length=max_length,
|
||||
**kwargs,
|
||||
)["token_ids"]
|
||||
)
|
||||
|
||||
return BatchEncoding({"token_ids": encoded_batch_token_ids})
|
||||
|
||||
def __call__(
|
||||
self,
|
||||
notes: Union[
|
||||
np.ndarray,
|
||||
List[pretty_midi.Note],
|
||||
List[List[pretty_midi.Note]],
|
||||
],
|
||||
padding: Union[bool, str, PaddingStrategy] = False,
|
||||
truncation: Union[bool, str, TruncationStrategy] = None,
|
||||
max_length: Optional[int] = None,
|
||||
pad_to_multiple_of: Optional[int] = None,
|
||||
return_attention_mask: Optional[bool] = None,
|
||||
return_tensors: Optional[Union[str, TensorType]] = None,
|
||||
verbose: bool = True,
|
||||
**kwargs,
|
||||
) -> BatchEncoding:
|
||||
r"""
|
||||
This is the `__call__` method for `Pop2PianoTokenizer`. It converts the midi notes to the transformer generated
|
||||
token ids.
|
||||
|
||||
Args:
|
||||
notes (`numpy.ndarray` of shape `[batch_size, max_sequence_length, 4]` or `list` of `pretty_midi.Note` objects):
|
||||
This represents the midi notes.
|
||||
|
||||
If `notes` is a `numpy.ndarray`:
|
||||
- Each sequence must have 4 values, they are `onset idx`, `offset idx`, `pitch` and `velocity`.
|
||||
If `notes` is a `list` containing `pretty_midi.Note` objects:
|
||||
- Each sequence must have 4 attributes, they are `start`, `end`, `pitch` and `velocity`.
|
||||
padding (`bool`, `str` or [`~file_utils.PaddingStrategy`], *optional*, defaults to `False`):
|
||||
Activates and controls padding. Accepts the following values:
|
||||
|
||||
- `True` or `'longest'`: Pad to the longest sequence in the batch (or no padding if only a single
|
||||
sequence is provided).
|
||||
- `'max_length'`: Pad to a maximum length specified with the argument `max_length` or to the maximum
|
||||
acceptable input length for the model if that argument is not provided.
|
||||
- `False` or `'do_not_pad'` (default): No padding (i.e., can output a batch with sequences of different
|
||||
lengths).
|
||||
truncation (`bool`, `str` or [`~tokenization_utils_base.TruncationStrategy`], *optional*, defaults to `False`):
|
||||
Activates and controls truncation. Accepts the following values:
|
||||
|
||||
- `True` or `'longest_first'`: Truncate to a maximum length specified with the argument `max_length` or
|
||||
to the maximum acceptable input length for the model if that argument is not provided. This will
|
||||
truncate token by token, removing a token from the longest sequence in the pair if a pair of
|
||||
sequences (or a batch of pairs) is provided.
|
||||
- `'only_first'`: Truncate to a maximum length specified with the argument `max_length` or to the
|
||||
maximum acceptable input length for the model if that argument is not provided. This will only
|
||||
truncate the first sequence of a pair if a pair of sequences (or a batch of pairs) is provided.
|
||||
- `'only_second'`: Truncate to a maximum length specified with the argument `max_length` or to the
|
||||
maximum acceptable input length for the model if that argument is not provided. This will only
|
||||
truncate the second sequence of a pair if a pair of sequences (or a batch of pairs) is provided.
|
||||
- `False` or `'do_not_truncate'` (default): No truncation (i.e., can output batch with sequence lengths
|
||||
greater than the model maximum admissible input size).
|
||||
max_length (`int`, *optional*):
|
||||
Controls the maximum length to use by one of the truncation/padding parameters. If left unset or set to
|
||||
`None`, this will use the predefined model maximum length if a maximum length is required by one of the
|
||||
truncation/padding parameters. If the model has no specific maximum input length (like XLNet)
|
||||
truncation/padding to a maximum length will be deactivated.
|
||||
pad_to_multiple_of (`int`, *optional*):
|
||||
If set will pad the sequence to a multiple of the provided value. This is especially useful to enable
|
||||
the use of Tensor Cores on NVIDIA hardware with compute capability `>= 7.5` (Volta).
|
||||
return_attention_mask (`bool`, *optional*):
|
||||
Whether to return the attention mask. If left to the default, will return the attention mask according
|
||||
to the specific tokenizer's default, defined by the `return_outputs` attribute.
|
||||
|
||||
[What are attention masks?](../glossary#attention-mask)
|
||||
return_tensors (`str` or [`~file_utils.TensorType`], *optional*):
|
||||
If set, will return tensors instead of list of python integers. Acceptable values are:
|
||||
|
||||
- `'tf'`: Return TensorFlow `tf.constant` objects.
|
||||
- `'pt'`: Return PyTorch `torch.Tensor` objects.
|
||||
- `'np'`: Return Numpy `np.ndarray` objects.
|
||||
verbose (`bool`, *optional*, defaults to `True`):
|
||||
Whether or not to print more information and warnings.
|
||||
|
||||
Returns:
|
||||
`BatchEncoding` containing the token_ids.
|
||||
"""
|
||||
|
||||
# check if it is batched or not
|
||||
# it is batched if it's a list containing a list of `pretty_midi.Notes` where the outer list contains all the
|
||||
# batches and the inner list contains all Notes for a single batch. Otherwise if np.ndarray is passed it will be
|
||||
# considered batched if it has shape of `[batch_size, sequence_length, 4]` or ndim=3.
|
||||
is_batched = notes.ndim == 3 if isinstance(notes, np.ndarray) else isinstance(notes[0], list)
|
||||
|
||||
# get the truncation and padding strategy
|
||||
padding_strategy, truncation_strategy, max_length, kwargs = self._get_padding_truncation_strategies(
|
||||
padding=padding,
|
||||
truncation=truncation,
|
||||
max_length=max_length,
|
||||
pad_to_multiple_of=pad_to_multiple_of,
|
||||
verbose=verbose,
|
||||
**kwargs,
|
||||
)
|
||||
|
||||
if is_batched:
|
||||
# If the user has not explicitly mentioned `return_attention_mask` as False, we change it to True
|
||||
return_attention_mask = True if return_attention_mask is None else return_attention_mask
|
||||
token_ids = self.batch_encode_plus(
|
||||
notes=notes,
|
||||
truncation_strategy=truncation_strategy,
|
||||
max_length=max_length,
|
||||
**kwargs,
|
||||
)
|
||||
else:
|
||||
token_ids = self.encode_plus(
|
||||
notes=notes,
|
||||
truncation_strategy=truncation_strategy,
|
||||
max_length=max_length,
|
||||
**kwargs,
|
||||
)
|
||||
|
||||
# since the sequences are already truncated, only padding is left to do
|
||||
token_ids = self.pad(
|
||||
token_ids,
|
||||
padding=padding_strategy,
|
||||
max_length=max_length,
|
||||
pad_to_multiple_of=pad_to_multiple_of,
|
||||
return_attention_mask=return_attention_mask,
|
||||
return_tensors=return_tensors,
|
||||
verbose=verbose,
|
||||
)
|
||||
|
||||
return token_ids
|
||||
|
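The batching check in `__call__` depends only on the container shape: a 3-D array or a list of lists counts as a batch. A small sketch of that rule follows, using plain strings in place of `pretty_midi.Note` objects so it runs without the library; the sample inputs are made up.

```python
import numpy as np


def is_batched(notes):
    """Same rule as the check above: ndim == 3 for arrays, nested lists otherwise."""
    if isinstance(notes, np.ndarray):
        return notes.ndim == 3
    return isinstance(notes[0], list)


single = np.zeros((5, 4), dtype=np.int32)    # [sequence_length, 4]
batch = np.zeros((2, 5, 4), dtype=np.int32)  # [batch_size, sequence_length, 4]
nested = [["note_a", "note_b"], ["note_c"]]  # stand-ins for lists of Note objects
flat = ["note_a", "note_b"]
```

Note that an empty list would raise an `IndexError` under this rule, since it inspects `notes[0]`.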
||||
def batch_decode(
|
||||
self,
|
||||
token_ids,
|
||||
feature_extractor_output: BatchFeature,
|
||||
return_midi: bool = True,
|
||||
):
|
||||
r"""
|
||||
This is the `batch_decode` method for `Pop2PianoTokenizer`. It converts the token_ids generated by the
|
||||
transformer to midi_notes and returns them.
|
||||
|
||||
Args:
|
||||
token_ids (`Union[np.ndarray, torch.Tensor, tf.Tensor]`):
|
||||
Output token_ids of `Pop2PianoConditionalGeneration` model.
|
||||
feature_extractor_output (`BatchFeature`):
|
||||
Denotes the output of `Pop2PianoFeatureExtractor.__call__`. It must contain `"beatstep"` and
|
||||
`"extrapolated_beatstep"`. Also `"attention_mask_beatsteps"` and
|
||||
`"attention_mask_extrapolated_beatstep"`
|
||||
should be present if they were returned by the feature extractor.
|
||||
return_midi (`bool`, *optional*, defaults to `True`):
|
||||
Whether to return midi object or not.
|
||||
Returns:
|
||||
If `return_midi` is True:
|
||||
- `BatchEncoding` containing both `notes` and `pretty_midi.pretty_midi.PrettyMIDI` objects.
|
||||
If `return_midi` is False:
|
||||
- `BatchEncoding` containing `notes`.
|
||||
"""
|
||||
|
||||
# check if they have attention_masks(attention_mask, attention_mask_beatsteps, attention_mask_extrapolated_beatstep) or not
|
||||
attention_masks_present = bool(
|
||||
hasattr(feature_extractor_output, "attention_mask")
|
||||
and hasattr(feature_extractor_output, "attention_mask_beatsteps")
|
||||
and hasattr(feature_extractor_output, "attention_mask_extrapolated_beatstep")
|
||||
)
|
||||
|
||||
# batched inputs require attention masks
|
||||
if not attention_masks_present and feature_extractor_output["beatsteps"].shape[0] > 1:
|
||||
raise ValueError(
|
||||
"attention_mask, attention_mask_beatsteps and attention_mask_extrapolated_beatstep must be present "
|
||||
"for batched inputs! But at least one of them was not present."
|
||||
)
|
||||
|
||||
# check for length mismatch between inputs_embeds, beatsteps and extrapolated_beatstep
|
||||
if attention_masks_present:
|
||||
# since we know about the number of examples in token_ids from attention_mask
|
||||
if (
|
||||
sum(feature_extractor_output["attention_mask"][:, 0] == 0)
|
||||
!= feature_extractor_output["beatsteps"].shape[0]
|
||||
or feature_extractor_output["beatsteps"].shape[0]
|
||||
!= feature_extractor_output["extrapolated_beatstep"].shape[0]
|
||||
):
|
||||
raise ValueError(
|
||||
"Length mismatch between token_ids, beatsteps and extrapolated_beatstep! Found "
|
||||
f"token_ids length - {token_ids.shape[0]}, beatsteps shape - {feature_extractor_output['beatsteps'].shape[0]} "
|
||||
f"and extrapolated_beatsteps shape - {feature_extractor_output['extrapolated_beatstep'].shape[0]}"
|
||||
)
|
||||
if feature_extractor_output["attention_mask"].shape[0] != token_ids.shape[0]:
|
||||
raise ValueError(
|
||||
f"Found attention_mask of length - {feature_extractor_output['attention_mask'].shape[0]} but token_ids of length - {token_ids.shape[0]}"
|
||||
)
|
||||
else:
|
||||
# if there is no attention mask present then it's surely a single example
|
||||
if (
|
||||
feature_extractor_output["beatsteps"].shape[0] != 1
|
||||
or feature_extractor_output["extrapolated_beatstep"].shape[0] != 1
|
||||
):
|
||||
raise ValueError(
|
||||
"Length mismatch of beatsteps and extrapolated_beatstep! Since attention_mask is not present the number of examples must be 1, "
|
||||
f"But found beatsteps length - {feature_extractor_output['beatsteps'].shape[0]}, extrapolated_beatsteps length - {feature_extractor_output['extrapolated_beatstep'].shape[0]}."
|
||||
)
|
||||
|
||||
if attention_masks_present:
|
||||
# check for zeros (since token_ids are separated by zero arrays)
|
||||
batch_idx = np.where(feature_extractor_output["attention_mask"][:, 0] == 0)[0]
|
||||
else:
|
||||
batch_idx = [token_ids.shape[0]]
|
||||
|
||||
notes_list = []
|
||||
pretty_midi_objects_list = []
|
||||
start_idx = 0
|
||||
for index, end_idx in enumerate(batch_idx):
|
||||
each_tokens_ids = token_ids[start_idx:end_idx]
|
||||
# check where the whole example ended by searching for eos_token_id and getting the upper bound
|
||||
each_tokens_ids = each_tokens_ids[:, : np.max(np.where(each_tokens_ids == int(self.eos_token))[1]) + 1]
|
||||
beatsteps = feature_extractor_output["beatsteps"][index]
|
||||
extrapolated_beatstep = feature_extractor_output["extrapolated_beatstep"][index]
|
||||
|
||||
# if attention mask is present then mask out real array/tensor
|
||||
if attention_masks_present:
|
||||
attention_mask_beatsteps = feature_extractor_output["attention_mask_beatsteps"][index]
|
||||
attention_mask_extrapolated_beatstep = feature_extractor_output[
|
||||
"attention_mask_extrapolated_beatstep"
|
||||
][index]
|
||||
beatsteps = beatsteps[: np.max(np.where(attention_mask_beatsteps == 1)[0]) + 1]
|
||||
extrapolated_beatstep = extrapolated_beatstep[
|
||||
: np.max(np.where(attention_mask_extrapolated_beatstep == 1)[0]) + 1
|
||||
]
|
||||
|
||||
each_tokens_ids = to_numpy(each_tokens_ids)
|
||||
beatsteps = to_numpy(beatsteps)
|
||||
extrapolated_beatstep = to_numpy(extrapolated_beatstep)
|
||||
|
||||
pretty_midi_object = self.relative_batch_tokens_ids_to_midi(
|
||||
tokens=each_tokens_ids,
|
||||
beatstep=extrapolated_beatstep,
|
||||
bars_per_batch=self.num_bars,
|
||||
cutoff_time_idx=(self.num_bars + 1) * 4,
|
||||
)
|
||||
|
||||
for note in pretty_midi_object.instruments[0].notes:
|
||||
note.start += beatsteps[0]
|
||||
note.end += beatsteps[0]
|
||||
notes_list.append(note)
|
||||
|
||||
pretty_midi_objects_list.append(pretty_midi_object)
|
||||
start_idx = end_idx + 1  # skip the zero array that separates examples
|
||||
|
||||
if return_midi:
|
||||
return BatchEncoding({"notes": notes_list, "pretty_midi_objects": pretty_midi_objects_list})
|
||||
|
||||
return BatchEncoding({"notes": notes_list})
|
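`batch_decode` recovers per-example chunks from stacked rows by locating separator rows whose attention mask is zero. Below is a minimal sketch of that splitting idea on toy arrays; `split_examples` is a hypothetical helper, and it assumes every example, including the last, is followed by exactly one separator row.

```python
import numpy as np


def split_examples(token_ids, attention_mask):
    """Hypothetical helper: split stacked rows into per-example chunks."""
    separator_rows = np.where(attention_mask[:, 0] == 0)[0]
    chunks, start_idx = [], 0
    for end_idx in separator_rows:
        chunks.append(token_ids[start_idx:end_idx])
        start_idx = end_idx + 1  # skip the separator row
    return chunks


# Toy data: two examples (2 rows and 1 row), each followed by a zero separator row.
token_ids = np.array([[1, 2], [3, 4], [0, 0], [5, 6], [0, 0]])
attention_mask = np.array([[1, 1], [1, 1], [0, 0], [1, 1], [0, 0]])
chunks = split_examples(token_ids, attention_mask)
```

The number of separator rows equals the number of examples, which is the same invariant the length checks in the method rely on.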
||||
@@ -32,6 +32,7 @@ is_torch_greater_or_equal_than_2_0 = parsed_torch_version_base >= version.parse(
|
||||
is_torch_greater_or_equal_than_1_12 = parsed_torch_version_base >= version.parse("1.12")
|
||||
is_torch_greater_or_equal_than_1_11 = parsed_torch_version_base >= version.parse("1.11")
|
||||
is_torch_less_than_1_11 = parsed_torch_version_base < version.parse("1.11")
|
||||
is_torch_1_8_0 = parsed_torch_version_base == version.parse("1.8.0")
|
||||
|
||||
|
||||
def softmax_backward_data(parent, grad_output, output, dim, self):
|
||||
|
||||
@@ -57,6 +57,7 @@ from .utils import (
|
||||
is_cython_available,
|
||||
is_decord_available,
|
||||
is_detectron2_available,
|
||||
is_essentia_available,
|
||||
is_faiss_available,
|
||||
is_flax_available,
|
||||
is_ftfy_available,
|
||||
@@ -71,6 +72,7 @@ from .utils import (
|
||||
is_pandas_available,
|
||||
is_peft_available,
|
||||
is_phonemizer_available,
|
||||
is_pretty_midi_available,
|
||||
is_pyctcdecode_available,
|
||||
is_pytesseract_available,
|
||||
is_pytest_available,
|
||||
@@ -825,6 +827,20 @@ def require_librosa(test_case):
|
||||
return unittest.skipUnless(is_librosa_available(), "test requires librosa")(test_case)
|
||||
|
||||
|
||||
def require_essentia(test_case):
|
||||
"""
|
||||
Decorator marking a test that requires essentia
|
||||
"""
|
||||
return unittest.skipUnless(is_essentia_available(), "test requires essentia")(test_case)
|
||||
|
||||
|
||||
def require_pretty_midi(test_case):
|
||||
"""
|
||||
Decorator marking a test that requires pretty_midi
|
||||
"""
|
||||
return unittest.skipUnless(is_pretty_midi_available(), "test requires pretty_midi")(test_case)
|
||||
|
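The two decorators above wrap `unittest.skipUnless` around an availability check. A generic sketch of the same pattern is shown here; `require_module` is a hypothetical helper, not part of `transformers.testing_utils`, and it probes availability with `importlib.util.find_spec` only.

```python
import importlib.util
import unittest


def require_module(module_name):
    """Hypothetical generic variant of the per-library decorators above."""
    available = importlib.util.find_spec(module_name) is not None
    return unittest.skipUnless(available, f"test requires {module_name}")


class ExampleTest(unittest.TestCase):
    @require_module("json")
    def test_runs(self):
        self.assertTrue(True)

    @require_module("a_module_that_does_not_exist_xyz")
    def test_skipped(self):
        self.fail("should have been skipped")


suite = unittest.defaultTestLoader.loadTestsFromTestCase(ExampleTest)
result = unittest.TestResult()
suite.run(result)
```

The test guarded by the missing module is reported as skipped rather than failed, so a suite stays green on machines without the optional dependency.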
||||
|
||||
def cmd_exists(cmd):
|
||||
return shutil.which(cmd) is not None
|
||||
|
||||
|
||||
@@ -112,6 +112,7 @@ from .import_utils import (
|
||||
is_datasets_available,
|
||||
is_decord_available,
|
||||
is_detectron2_available,
|
||||
is_essentia_available,
|
||||
is_faiss_available,
|
||||
is_flax_available,
|
||||
is_ftfy_available,
|
||||
@@ -130,6 +131,7 @@ from .import_utils import (
|
||||
is_pandas_available,
|
||||
is_peft_available,
|
||||
is_phonemizer_available,
|
||||
is_pretty_midi_available,
|
||||
is_protobuf_available,
|
||||
is_psutil_available,
|
||||
is_py3nvml_available,
|
||||
|
||||
@@ -0,0 +1,23 @@
|
||||
# This file is autogenerated by the command `make fix-copies`, do not edit.
|
||||
from ..utils import DummyObject, requires_backends
|
||||
|
||||
|
||||
class Pop2PianoFeatureExtractor(metaclass=DummyObject):
|
||||
_backends = ["essentia", "librosa", "pretty_midi", "scipy", "torch"]
|
||||
|
||||
def __init__(self, *args, **kwargs):
|
||||
requires_backends(self, ["essentia", "librosa", "pretty_midi", "scipy", "torch"])
|
||||
|
||||
|
||||
class Pop2PianoTokenizer(metaclass=DummyObject):
|
||||
_backends = ["essentia", "librosa", "pretty_midi", "scipy", "torch"]
|
||||
|
||||
def __init__(self, *args, **kwargs):
|
||||
requires_backends(self, ["essentia", "librosa", "pretty_midi", "scipy", "torch"])
|
||||
|
||||
|
||||
class Pop2PianoProcessor(metaclass=DummyObject):
|
||||
_backends = ["essentia", "librosa", "pretty_midi", "scipy", "torch"]
|
||||
|
||||
def __init__(self, *args, **kwargs):
|
||||
requires_backends(self, ["essentia", "librosa", "pretty_midi", "scipy", "torch"])
|
||||
16
src/transformers/utils/dummy_music_objects.py
Normal file
@@ -0,0 +1,16 @@
|
||||
# This file is autogenerated by the command `make fix-copies`, do not edit.
|
||||
from ..utils import DummyObject, requires_backends
|
||||
|
||||
|
||||
class Pop2PianoFeatureExtractor(metaclass=DummyObject):
|
||||
_backends = ["music"]
|
||||
|
||||
def __init__(self, *args, **kwargs):
|
||||
requires_backends(self, ["music"])
|
||||
|
||||
|
||||
class Pop2PianoTokenizer(metaclass=DummyObject):
|
||||
_backends = ["music"]
|
||||
|
||||
def __init__(self, *args, **kwargs):
|
||||
requires_backends(self, ["music"])
|
||||
@@ -5935,6 +5935,23 @@ class PoolFormerPreTrainedModel(metaclass=DummyObject):
|
||||
requires_backends(self, ["torch"])
|
||||
|
||||
|
||||
POP2PIANO_PRETRAINED_MODEL_ARCHIVE_LIST = None
|
||||
|
||||
|
||||
class Pop2PianoForConditionalGeneration(metaclass=DummyObject):
|
||||
_backends = ["torch"]
|
||||
|
||||
def __init__(self, *args, **kwargs):
|
||||
requires_backends(self, ["torch"])
|
||||
|
||||
|
||||
class Pop2PianoPreTrainedModel(metaclass=DummyObject):
|
||||
_backends = ["torch"]
|
||||
|
||||
def __init__(self, *args, **kwargs):
|
||||
requires_backends(self, ["torch"])
|
||||
|
||||
|
||||
PROPHETNET_PRETRAINED_MODEL_ARCHIVE_LIST = None
|
||||
|
||||
|
||||
|
||||
@@ -185,6 +185,22 @@ else:
|
||||
logger.info("Disabling Tensorflow because USE_TORCH is set")
|
||||
|
||||
|
||||
_essentia_available = importlib.util.find_spec("essentia") is not None
|
||||
try:
|
||||
_essentia_version = importlib.metadata.version("essentia")
|
||||
logger.debug(f"Successfully imported essentia version {_essentia_version}")
|
||||
except importlib.metadata.PackageNotFoundError:
|
||||
_essentia_available = False
|
||||
|
||||
|
||||
_pretty_midi_available = importlib.util.find_spec("pretty_midi") is not None
|
||||
try:
|
||||
_pretty_midi_version = importlib.metadata.version("pretty_midi")
|
||||
logger.debug(f"Successfully imported pretty_midi version {_pretty_midi_version}")
|
||||
except importlib.metadata.PackageNotFoundError:
|
||||
_pretty_midi_available = False
|
||||
|
||||
|
||||
ccl_version = "N/A"
|
||||
_is_ccl_available = (
|
||||
importlib.util.find_spec("torch_ccl") is not None
|
||||
@@ -242,6 +258,14 @@ def is_librosa_available():
|
||||
return _librosa_available
|
||||
|
||||
|
||||
def is_essentia_available():
|
||||
return _essentia_available
|
||||
|
||||
|
||||
def is_pretty_midi_available():
|
||||
return _pretty_midi_available
|
||||
|
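The availability flags above combine two probes: `importlib.util.find_spec` for importability and `importlib.metadata.version` for an installed distribution. A small self-contained sketch of that combination follows; `probe_package` is a hypothetical helper, and treating missing metadata as "unavailable" is an assumption modeled on the `pretty_midi` block.

```python
import importlib.metadata
import importlib.util


def probe_package(import_name, dist_name=None):
    """Hypothetical helper: return (available, version) for an optional dependency."""
    available = importlib.util.find_spec(import_name) is not None
    version = None
    try:
        version = importlib.metadata.version(dist_name or import_name)
    except importlib.metadata.PackageNotFoundError:
        # An importable name without distribution metadata is treated as missing.
        available = False
    return available, version


missing_available, missing_version = probe_package("a_package_that_does_not_exist_xyz")
```

The optional `dist_name` argument covers packages whose distribution name differs from their import name.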
||||
|
||||
def is_torch_cuda_available():
|
||||
if is_torch_available():
|
||||
import torch
|
||||
@@ -986,6 +1010,27 @@ CCL_IMPORT_ERROR = """
|
||||
Please note that you may need to restart your runtime after installation.
|
||||
"""
|
||||
|
||||
# docstyle-ignore
|
||||
ESSENTIA_IMPORT_ERROR = """
|
||||
{0} requires the essentia library, but it was not found in your environment. You can install it with pip:
|
||||
`pip install essentia==2.1b6.dev1034`
|
||||
Please note that you may need to restart your runtime after installation.
|
||||
"""
|
||||
|
||||
# docstyle-ignore
|
||||
LIBROSA_IMPORT_ERROR = """
|
||||
{0} requires the librosa library, but it was not found in your environment. You can install it with pip:
|
||||
`pip install librosa`
|
||||
Please note that you may need to restart your runtime after installation.
|
||||
"""
|
||||
|
||||
# docstyle-ignore
|
||||
PRETTY_MIDI_IMPORT_ERROR = """
|
||||
{0} requires the pretty_midi library, but it was not found in your environment. You can install it with pip:
|
||||
`pip install pretty_midi`
|
||||
Please note that you may need to restart your runtime after installation.
|
||||
"""
|
||||
|
||||
DECORD_IMPORT_ERROR = """
|
||||
{0} requires the decord library but it was not found in your environment. You can install it with pip: `pip install
|
||||
decord`. Please note that you may need to restart your runtime after installation.
|
||||
@@ -1011,11 +1056,14 @@ BACKENDS_MAPPING = OrderedDict(
|
||||
("bs4", (is_bs4_available, BS4_IMPORT_ERROR)),
|
||||
("datasets", (is_datasets_available, DATASETS_IMPORT_ERROR)),
|
||||
("detectron2", (is_detectron2_available, DETECTRON2_IMPORT_ERROR)),
|
||||
("essentia", (is_essentia_available, ESSENTIA_IMPORT_ERROR)),
|
||||
("faiss", (is_faiss_available, FAISS_IMPORT_ERROR)),
|
||||
("flax", (is_flax_available, FLAX_IMPORT_ERROR)),
|
||||
("ftfy", (is_ftfy_available, FTFY_IMPORT_ERROR)),
|
||||
("pandas", (is_pandas_available, PANDAS_IMPORT_ERROR)),
|
||||
("phonemizer", (is_phonemizer_available, PHONEMIZER_IMPORT_ERROR)),
|
||||
("pretty_midi", (is_pretty_midi_available, PRETTY_MIDI_IMPORT_ERROR)),
|
||||
("librosa", (is_librosa_available, LIBROSA_IMPORT_ERROR)),
|
||||
("protobuf", (is_protobuf_available, PROTOBUF_IMPORT_ERROR)),
|
||||
("pyctcdecode", (is_pyctcdecode_available, PYCTCDECODE_IMPORT_ERROR)),
|
||||
("pytesseract", (is_pytesseract_available, PYTESSERACT_IMPORT_ERROR)),
|
||||
|
||||
0
tests/models/pop2piano/__init__.py
Normal file
291
tests/models/pop2piano/test_feature_extraction_pop2piano.py
Normal file
@@ -0,0 +1,291 @@
|
||||
# coding=utf-8
|
||||
# Copyright 2023 HuggingFace Inc.
|
||||
#
|
||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||
# you may not use this file except in compliance with the License.
|
||||
# You may obtain a copy of the License at
|
||||
#
|
||||
# http://www.apache.org/licenses/LICENSE-2.0
|
||||
#
|
||||
# Unless required by applicable law or agreed to in writing, software
|
||||
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
|
||||
import os
|
||||
import tempfile
|
||||
import unittest
|
||||
|
||||
import numpy as np
|
||||
from datasets import load_dataset
|
||||
|
||||
from transformers.testing_utils import (
|
||||
check_json_file_has_correct_format,
|
||||
require_essentia,
|
||||
require_librosa,
|
||||
require_scipy,
|
||||
require_tf,
|
||||
require_torch,
|
||||
)
|
||||
from transformers.utils.import_utils import (
|
||||
is_essentia_available,
|
||||
is_librosa_available,
|
||||
is_scipy_available,
|
||||
is_torch_available,
|
||||
)
|
||||
|
||||
from ...test_sequence_feature_extraction_common import SequenceFeatureExtractionTestMixin
|
||||
|
||||
|
||||
requirements_available = (
|
||||
is_torch_available() and is_essentia_available() and is_scipy_available() and is_librosa_available()
|
||||
)
|
||||
|
||||
if requirements_available:
|
||||
import torch
|
||||
|
||||
from transformers import Pop2PianoFeatureExtractor
|
||||
|
||||
|
||||
class Pop2PianoFeatureExtractionTester(unittest.TestCase):
|
||||
def __init__(
|
||||
self,
|
||||
parent,
|
||||
n_bars=2,
|
||||
sample_rate=22050,
|
||||
use_mel=True,
|
||||
padding_value=0,
|
||||
vocab_size_special=4,
|
||||
vocab_size_note=128,
|
||||
vocab_size_velocity=2,
|
||||
vocab_size_time=100,
|
||||
):
|
||||
self.parent = parent
|
||||
self.n_bars = n_bars
|
||||
self.sample_rate = sample_rate
|
||||
self.use_mel = use_mel
|
||||
self.padding_value = padding_value
|
||||
self.vocab_size_special = vocab_size_special
|
||||
self.vocab_size_note = vocab_size_note
|
||||
self.vocab_size_velocity = vocab_size_velocity
|
||||
self.vocab_size_time = vocab_size_time
|
||||
|
||||
def prepare_feat_extract_dict(self):
|
||||
return {
|
||||
"n_bars": self.n_bars,
|
||||
"sample_rate": self.sample_rate,
|
||||
"use_mel": self.use_mel,
|
||||
"padding_value": self.padding_value,
|
||||
"vocab_size_special": self.vocab_size_special,
|
||||
"vocab_size_note": self.vocab_size_note,
|
||||
"vocab_size_velocity": self.vocab_size_velocity,
|
||||
"vocab_size_time": self.vocab_size_time,
|
||||
}
|
||||
|
||||
|
||||
@require_torch
|
||||
@require_essentia
|
||||
@require_librosa
|
||||
@require_scipy
|
||||
class Pop2PianoFeatureExtractionTest(SequenceFeatureExtractionTestMixin, unittest.TestCase):
|
||||
feature_extraction_class = Pop2PianoFeatureExtractor if requirements_available else None
|
||||
|
||||
def setUp(self):
|
||||
self.feat_extract_tester = Pop2PianoFeatureExtractionTester(self)
|
||||
|
||||
def test_feat_extract_from_and_save_pretrained(self):
|
||||
feat_extract_first = self.feature_extraction_class(**self.feat_extract_dict)
|
||||
|
||||
with tempfile.TemporaryDirectory() as tmpdirname:
|
||||
saved_file = feat_extract_first.save_pretrained(tmpdirname)[0]
|
||||
check_json_file_has_correct_format(saved_file)
|
||||
feat_extract_second = self.feature_extraction_class.from_pretrained(tmpdirname)
|
||||
|
||||
dict_first = feat_extract_first.to_dict()
|
||||
dict_second = feat_extract_second.to_dict()
|
||||
mel_1 = feat_extract_first.use_mel
|
||||
mel_2 = feat_extract_second.use_mel
|
||||
self.assertTrue(np.allclose(mel_1, mel_2))
|
||||
self.assertEqual(dict_first, dict_second)
|
||||
|
||||
def test_feat_extract_to_json_file(self):
|
||||
feat_extract_first = self.feature_extraction_class(**self.feat_extract_dict)
|
||||
|
||||
with tempfile.TemporaryDirectory() as tmpdirname:
|
||||
json_file_path = os.path.join(tmpdirname, "feat_extract.json")
|
||||
feat_extract_first.to_json_file(json_file_path)
|
||||
feat_extract_second = self.feature_extraction_class.from_json_file(json_file_path)
|
||||
|
||||
dict_first = feat_extract_first.to_dict()
|
||||
dict_second = feat_extract_second.to_dict()
|
||||
mel_1 = feat_extract_first.use_mel
|
||||
mel_2 = feat_extract_second.use_mel
|
||||
self.assertTrue(np.allclose(mel_1, mel_2))
|
||||
self.assertEqual(dict_first, dict_second)
|
||||
|
||||
def test_call(self):
|
||||
feature_extractor = self.feature_extraction_class(**self.feat_extract_tester.prepare_feat_extract_dict())
|
||||
speech_input = np.zeros([1000000], dtype=np.float32)
|
||||
|
||||
input_features = feature_extractor(speech_input, sampling_rate=16_000, return_tensors="np")
|
||||
self.assertTrue(input_features.input_features.ndim == 3)
|
||||
self.assertEqual(input_features.input_features.shape[-1], 512)
|
||||
|
||||
self.assertTrue(input_features.beatsteps.ndim == 2)
|
||||
self.assertTrue(input_features.extrapolated_beatstep.ndim == 2)
|
||||
|
||||
def test_integration(self):
|
||||
ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
|
||||
speech_samples = ds.sort("id").select([0])["audio"]
|
||||
input_speech = [x["array"] for x in speech_samples][0]
|
||||
sampling_rate = [x["sampling_rate"] for x in speech_samples][0]
|
||||
feature_extractor = Pop2PianoFeatureExtractor.from_pretrained("sweetcocoa/pop2piano")
|
||||
input_features = feature_extractor(
|
||||
input_speech, sampling_rate=sampling_rate, return_tensors="pt"
|
||||
).input_features
|
||||
|
||||
EXPECTED_INPUT_FEATURES = torch.tensor(
|
||||
[[-7.1493, -6.8701, -4.3214], [-5.9473, -5.7548, -3.8438], [-6.1324, -5.9018, -4.3778]]
|
||||
)
|
||||
self.assertTrue(torch.allclose(input_features[0, :3, :3], EXPECTED_INPUT_FEATURES, atol=1e-4))
|
||||
|
||||
def test_attention_mask(self):
|
||||
feature_extractor = self.feature_extraction_class(**self.feat_extract_tester.prepare_feat_extract_dict())
|
||||
speech_input1 = np.zeros([1_000_000], dtype=np.float32)
|
||||
speech_input2 = np.random.randint(low=0, high=10, size=500_000).astype(np.float32)
|
||||
input_features = feature_extractor(
|
||||
[speech_input1, speech_input2],
|
||||
sampling_rate=[44_100, 16_000],
|
||||
return_tensors="np",
|
||||
return_attention_mask=True,
|
||||
)
|
||||
|
||||
self.assertTrue(hasattr(input_features, "attention_mask"))
|
||||
|
||||
# check shapes
|
||||
self.assertTrue(input_features["attention_mask"].ndim == 2)
|
||||
self.assertEqual(input_features["attention_mask_beatsteps"].shape[0], 2)
|
||||
self.assertEqual(input_features["attention_mask_extrapolated_beatstep"].shape[0], 2)
|
||||
|
||||
# check that each mask contains both 0 and 1
|
||||
self.assertTrue(np.max(input_features["attention_mask"]) == 1)
|
||||
self.assertTrue(np.max(input_features["attention_mask_beatsteps"]) == 1)
|
||||
self.assertTrue(np.max(input_features["attention_mask_extrapolated_beatstep"]) == 1)
|
||||
|
||||
self.assertTrue(np.min(input_features["attention_mask"]) == 0)
|
||||
self.assertTrue(np.min(input_features["attention_mask_beatsteps"]) == 0)
|
||||
self.assertTrue(np.min(input_features["attention_mask_extrapolated_beatstep"]) == 0)
|
||||
|
||||
def test_batch_feature(self):
|
||||
feature_extractor = self.feature_extraction_class(**self.feat_extract_tester.prepare_feat_extract_dict())
|
||||
speech_input1 = np.zeros([1_000_000], dtype=np.float32)
|
||||
speech_input2 = np.ones([2_000_000], dtype=np.float32)
|
||||
speech_input3 = np.random.randint(low=0, high=10, size=500_000).astype(np.float32)
|
||||
|
||||
input_features = feature_extractor(
|
||||
[speech_input1, speech_input2, speech_input3],
|
||||
sampling_rate=[44_100, 16_000, 48_000],
|
||||
return_attention_mask=True,
|
||||
)
|
||||
|
||||
self.assertEqual(len(input_features["input_features"].shape), 3)
|
||||
# check shape
|
||||
self.assertEqual(input_features["beatsteps"].shape[0], 3)
|
||||
self.assertEqual(input_features["extrapolated_beatstep"].shape[0], 3)
|
||||
|
||||
def test_batch_feature_np(self):
|
||||
feature_extractor = self.feature_extraction_class(**self.feat_extract_tester.prepare_feat_extract_dict())
|
||||
speech_input1 = np.zeros([1_000_000], dtype=np.float32)
|
||||
speech_input2 = np.ones([2_000_000], dtype=np.float32)
|
||||
speech_input3 = np.random.randint(low=0, high=10, size=500_000).astype(np.float32)
|
||||
|
||||
input_features = feature_extractor(
|
||||
[speech_input1, speech_input2, speech_input3],
|
||||
sampling_rate=[44_100, 16_000, 48_000],
|
||||
return_tensors="np",
|
||||
return_attention_mask=True,
|
||||
)
|
||||
|
||||
# check that the output is a NumPy array
|
||||
self.assertEqual(type(input_features["input_features"]), np.ndarray)
|
||||
|
||||
# check shape
|
||||
self.assertEqual(len(input_features["input_features"].shape), 3)
|
||||
|
||||
def test_batch_feature_pt(self):
|
||||
feature_extractor = self.feature_extraction_class(**self.feat_extract_tester.prepare_feat_extract_dict())
|
||||
speech_input1 = np.zeros([1_000_000], dtype=np.float32)
|
||||
speech_input2 = np.ones([2_000_000], dtype=np.float32)
|
||||
speech_input3 = np.random.randint(low=0, high=10, size=500_000).astype(np.float32)
|
||||
|
||||
input_features = feature_extractor(
|
||||
[speech_input1, speech_input2, speech_input3],
|
||||
sampling_rate=[44_100, 16_000, 48_000],
|
||||
return_tensors="pt",
|
||||
return_attention_mask=True,
|
||||
)
|
||||
|
||||
# check that the output is a PyTorch tensor
|
||||
self.assertEqual(type(input_features["input_features"]), torch.Tensor)
|
||||
|
||||
# check shape
|
||||
self.assertEqual(len(input_features["input_features"].shape), 3)
|
||||
|
||||
@require_tf
|
||||
def test_batch_feature_tf(self):
|
||||
import tensorflow as tf
|
||||
|
||||
feature_extractor = self.feature_extraction_class(**self.feat_extract_tester.prepare_feat_extract_dict())
|
||||
speech_input1 = np.zeros([1_000_000], dtype=np.float32)
|
||||
speech_input2 = np.ones([2_000_000], dtype=np.float32)
|
||||
speech_input3 = np.random.randint(low=0, high=10, size=500_000).astype(np.float32)
|
||||
|
||||
input_features = feature_extractor(
|
||||
[speech_input1, speech_input2, speech_input3],
|
||||
sampling_rate=[44_100, 16_000, 48_000],
|
||||
return_tensors="tf",
|
||||
return_attention_mask=True,
|
||||
)
|
||||
|
||||
# check that the output is a TensorFlow tensor
|
||||
self.assertTrue(tf.is_tensor(input_features["input_features"]))
|
||||
|
||||
# check shape
|
||||
self.assertEqual(len(input_features["input_features"].shape), 3)
|
||||
|
||||
@unittest.skip(
|
||||
"Pop2PianoFeatureExtractor does not support external padding (when processing audio in batches, padding is automatically applied to max_length)"
|
||||
)
|
||||
def test_padding_accepts_tensors_pt(self):
|
||||
pass
|
||||
|
||||
@unittest.skip(
|
||||
"Pop2PianoFeatureExtractor does not support external padding (when processing audio in batches, padding is automatically applied to max_length)"
|
||||
)
|
||||
def test_padding_accepts_tensors_tf(self):
|
||||
pass
|
||||
|
||||
@unittest.skip(
|
||||
"Pop2PianoFeatureExtractor does not support external padding (when processing audio in batches, padding is automatically applied to max_length)"
|
||||
)
|
||||
def test_padding_from_list(self):
|
||||
pass
|
||||
|
||||
@unittest.skip(
|
||||
"Pop2PianoFeatureExtractor does not support external padding (when processing audio in batches, padding is automatically applied to max_length)"
|
||||
)
|
||||
def test_padding_from_array(self):
|
||||
pass
|
||||
|
||||
@unittest.skip("Pop2PianoFeatureExtractor does not support truncation")
|
||||
def test_attention_mask_with_truncation(self):
|
||||
pass
|
||||
|
||||
@unittest.skip("Pop2PianoFeatureExtractor does not support truncation")
|
||||
def test_truncation_from_array(self):
|
||||
pass
|
||||
|
||||
@unittest.skip("Pop2PianoFeatureExtractor does not support truncation")
|
||||
def test_truncation_from_list(self):
|
||||
pass
|
||||
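The save/load tests in the feature-extraction file above all follow the same round-trip pattern: serialize the object, reload it from disk, then compare `to_dict()` outputs. The sketch below reproduces that pattern with the standard library only. `ToyExtractor`, its two fields, and `round_trip` are hypothetical stand-ins for illustration, not the `Pop2PianoFeatureExtractor` API.

```python
import json
import os
import tempfile


class ToyExtractor:
    """Hypothetical stand-in for a feature extractor (not the transformers class)."""

    def __init__(self, sampling_rate=22050, hop_length=1024):
        self.sampling_rate = sampling_rate
        self.hop_length = hop_length

    def to_dict(self):
        # Everything needed to rebuild the object must appear here.
        return {"sampling_rate": self.sampling_rate, "hop_length": self.hop_length}

    def to_json_file(self, path):
        with open(path, "w", encoding="utf-8") as f:
            json.dump(self.to_dict(), f, indent=2, sort_keys=True)

    @classmethod
    def from_json_file(cls, path):
        with open(path, encoding="utf-8") as f:
            return cls(**json.load(f))


def round_trip(extractor):
    """Write the extractor to a temporary JSON file and load a fresh copy."""
    with tempfile.TemporaryDirectory() as tmpdirname:
        path = os.path.join(tmpdirname, "feat_extract.json")
        extractor.to_json_file(path)
        return type(extractor).from_json_file(path)


first = ToyExtractor(sampling_rate=16000, hop_length=512)
second = round_trip(first)
```

Comparing `first.to_dict()` with `second.to_dict()` afterwards is the same check the tests perform with `assertEqual(dict_first, dict_second)`.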
778
tests/models/pop2piano/test_modeling_pop2piano.py
Normal file
@@ -0,0 +1,778 @@
|
||||
# coding=utf-8
|
||||
# Copyright 2023 The HuggingFace Inc. team. All rights reserved.
|
||||
#
|
||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||
# you may not use this file except in compliance with the License.
|
||||
# You may obtain a copy of the License at
|
||||
#
|
||||
# http://www.apache.org/licenses/LICENSE-2.0
|
||||
#
|
||||
# Unless required by applicable law or agreed to in writing, software
|
||||
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
""" Testing suite for the PyTorch Pop2Piano model. """
|
||||
|
||||
import copy
|
||||
import tempfile
|
||||
import unittest
|
||||
|
||||
import numpy as np
|
||||
from datasets import load_dataset
|
||||
|
||||
from transformers import Pop2PianoConfig
|
||||
from transformers.feature_extraction_utils import BatchFeature
|
||||
from transformers.testing_utils import (
|
||||
require_essentia,
|
||||
require_librosa,
|
||||
require_onnx,
|
||||
require_scipy,
|
||||
require_torch,
|
||||
slow,
|
||||
torch_device,
|
||||
)
|
||||
from transformers.utils import is_essentia_available, is_librosa_available, is_scipy_available, is_torch_available
|
||||
|
||||
from ...generation.test_utils import GenerationTesterMixin
|
||||
from ...test_configuration_common import ConfigTester
|
||||
from ...test_modeling_common import ModelTesterMixin, ids_tensor
|
||||
|
||||
|
||||
if is_torch_available():
|
||||
import torch
|
||||
|
||||
from transformers import Pop2PianoForConditionalGeneration
|
||||
from transformers.models.pop2piano.modeling_pop2piano import POP2PIANO_PRETRAINED_MODEL_ARCHIVE_LIST
|
||||
from transformers.pytorch_utils import is_torch_1_8_0
|
||||
|
||||
else:
|
||||
is_torch_1_8_0 = False
|
||||
|
||||
|
||||
@require_torch
|
||||
class Pop2PianoModelTester:
|
||||
def __init__(
|
||||
self,
|
||||
parent,
|
||||
vocab_size=99,
|
||||
batch_size=13,
|
||||
encoder_seq_length=7,
|
||||
decoder_seq_length=9,
|
||||
# For common tests
|
||||
is_training=False,
|
||||
use_attention_mask=True,
|
||||
use_labels=True,
|
||||
hidden_size=64,
|
||||
num_hidden_layers=5,
|
||||
num_attention_heads=4,
|
||||
d_ff=37,
|
||||
relative_attention_num_buckets=8,
|
||||
dropout_rate=0.1,
|
||||
initializer_factor=0.002,
|
||||
eos_token_id=1,
|
||||
pad_token_id=0,
|
||||
decoder_start_token_id=0,
|
||||
scope=None,
|
||||
decoder_layers=None,
|
||||
):
|
||||
self.parent = parent
|
||||
self.batch_size = batch_size
|
||||
self.encoder_seq_length = encoder_seq_length
|
||||
self.decoder_seq_length = decoder_seq_length
|
||||
# For common tests
|
||||
self.seq_length = self.decoder_seq_length
|
||||
self.is_training = is_training
|
||||
self.use_attention_mask = use_attention_mask
|
||||
self.use_labels = use_labels
|
||||
self.vocab_size = vocab_size
|
||||
self.hidden_size = hidden_size
|
||||
self.num_hidden_layers = num_hidden_layers
|
||||
self.num_attention_heads = num_attention_heads
|
||||
self.d_ff = d_ff
|
||||
self.relative_attention_num_buckets = relative_attention_num_buckets
|
||||
self.dropout_rate = dropout_rate
|
||||
self.initializer_factor = initializer_factor
|
||||
self.eos_token_id = eos_token_id
|
||||
self.pad_token_id = pad_token_id
|
||||
self.decoder_start_token_id = decoder_start_token_id
|
||||
self.scope = scope
|
||||
self.decoder_layers = decoder_layers
|
||||
|
||||
def prepare_config_and_inputs(self):
|
||||
input_ids = ids_tensor([self.batch_size, self.encoder_seq_length], self.vocab_size)
|
||||
decoder_input_ids = ids_tensor([self.batch_size, self.decoder_seq_length], self.vocab_size)
|
||||
|
||||
attention_mask = None
|
||||
decoder_attention_mask = None
|
||||
if self.use_attention_mask:
|
||||
attention_mask = ids_tensor([self.batch_size, self.encoder_seq_length], vocab_size=2)
|
||||
decoder_attention_mask = ids_tensor([self.batch_size, self.decoder_seq_length], vocab_size=2)
|
||||
|
||||
lm_labels = (
|
||||
ids_tensor([self.batch_size, self.decoder_seq_length], self.vocab_size) if self.use_labels else None
|
||||
)
|
||||
|
||||
return self.get_config(), input_ids, decoder_input_ids, attention_mask, decoder_attention_mask, lm_labels
|
||||
|
||||
def get_pipeline_config(self):
|
||||
return Pop2PianoConfig(
|
||||
vocab_size=166, # Pop2Piano forces 100 extra tokens
|
||||
d_model=self.hidden_size,
|
||||
d_ff=self.d_ff,
|
||||
d_kv=self.hidden_size // self.num_attention_heads,
|
||||
num_layers=self.num_hidden_layers,
|
||||
num_decoder_layers=self.decoder_layers,
|
||||
num_heads=self.num_attention_heads,
|
||||
relative_attention_num_buckets=self.relative_attention_num_buckets,
|
||||
dropout_rate=self.dropout_rate,
|
||||
initializer_factor=self.initializer_factor,
|
||||
eos_token_id=self.eos_token_id,
|
||||
bos_token_id=self.pad_token_id,
|
||||
pad_token_id=self.pad_token_id,
|
||||
decoder_start_token_id=self.decoder_start_token_id,
|
||||
)
|
||||
|
||||
def get_config(self):
|
||||
return Pop2PianoConfig(
|
||||
vocab_size=self.vocab_size,
|
||||
d_model=self.hidden_size,
|
||||
d_ff=self.d_ff,
|
||||
d_kv=self.hidden_size // self.num_attention_heads,
|
||||
num_layers=self.num_hidden_layers,
|
||||
num_decoder_layers=self.decoder_layers,
|
||||
num_heads=self.num_attention_heads,
|
||||
relative_attention_num_buckets=self.relative_attention_num_buckets,
|
||||
dropout_rate=self.dropout_rate,
|
||||
initializer_factor=self.initializer_factor,
|
||||
eos_token_id=self.eos_token_id,
|
||||
bos_token_id=self.pad_token_id,
|
||||
pad_token_id=self.pad_token_id,
|
||||
decoder_start_token_id=self.decoder_start_token_id,
|
||||
)
|
||||
|
||||
def check_prepare_lm_labels_via_shift_left(
|
||||
self,
|
||||
config,
|
||||
input_ids,
|
||||
decoder_input_ids,
|
||||
attention_mask,
|
||||
decoder_attention_mask,
|
||||
lm_labels,
|
||||
):
|
||||
model = Pop2PianoForConditionalGeneration(config=config)
|
||||
model.to(torch_device)
|
||||
model.eval()
|
||||
|
||||
# make sure that lm_labels are correctly padded from the right
|
||||
lm_labels.masked_fill_((lm_labels == self.decoder_start_token_id), self.eos_token_id)
|
||||
|
||||
# add causal pad token mask
|
||||
triangular_mask = torch.tril(lm_labels.new_ones(lm_labels.shape)).logical_not()
|
||||
lm_labels.masked_fill_(triangular_mask, self.pad_token_id)
|
||||
decoder_input_ids = model._shift_right(lm_labels)
|
||||
|
||||
for i, (decoder_input_ids_slice, lm_labels_slice) in enumerate(zip(decoder_input_ids, lm_labels)):
|
||||
# first item
|
||||
self.parent.assertEqual(decoder_input_ids_slice[0].item(), self.decoder_start_token_id)
|
||||
if i < decoder_input_ids_slice.shape[-1]:
|
||||
if i < decoder_input_ids.shape[-1] - 1:
|
||||
# items before diagonal
|
||||
self.parent.assertListEqual(
|
||||
decoder_input_ids_slice[1 : i + 1].tolist(), lm_labels_slice[:i].tolist()
|
||||
)
|
||||
# pad items after diagonal
|
||||
if i < decoder_input_ids.shape[-1] - 2:
|
||||
self.parent.assertListEqual(
|
||||
decoder_input_ids_slice[i + 2 :].tolist(), lm_labels_slice[i + 1 : -1].tolist()
|
||||
)
|
||||
else:
|
||||
# all items after square
|
||||
self.parent.assertListEqual(decoder_input_ids_slice[1:].tolist(), lm_labels_slice[:-1].tolist())
|
||||
|
||||
def create_and_check_model(
|
||||
self,
|
||||
config,
|
||||
input_ids,
|
||||
decoder_input_ids,
|
||||
attention_mask,
|
||||
decoder_attention_mask,
|
||||
lm_labels,
|
||||
):
|
||||
model = Pop2PianoForConditionalGeneration(config=config)
|
||||
model.to(torch_device)
|
||||
model.eval()
|
||||
result = model(
|
||||
input_ids=input_ids,
|
||||
decoder_input_ids=decoder_input_ids,
|
||||
attention_mask=attention_mask,
|
||||
decoder_attention_mask=decoder_attention_mask,
|
||||
)
|
||||
result = model(input_ids=input_ids, decoder_input_ids=decoder_input_ids)
|
||||
decoder_past = result.past_key_values
|
||||
encoder_output = result.encoder_last_hidden_state
|
||||
|
||||
self.parent.assertEqual(encoder_output.size(), (self.batch_size, self.encoder_seq_length, self.hidden_size))
|
||||
# There should be `num_layers` key value embeddings stored in decoder_past
|
||||
self.parent.assertEqual(len(decoder_past), config.num_layers)
|
||||
# There should be a self attn key, a self attn value, a cross attn key and a cross attn value stored in each decoder_past tuple
|
||||
self.parent.assertEqual(len(decoder_past[0]), 4)
|
||||
|
||||
def create_and_check_with_lm_head(
|
||||
self,
|
||||
config,
|
||||
input_ids,
|
||||
decoder_input_ids,
|
||||
attention_mask,
|
||||
decoder_attention_mask,
|
||||
lm_labels,
|
||||
):
|
||||
model = Pop2PianoForConditionalGeneration(config=config).to(torch_device).eval()
|
||||
outputs = model(
|
||||
input_ids=input_ids,
|
||||
decoder_input_ids=decoder_input_ids,
|
||||
decoder_attention_mask=decoder_attention_mask,
|
||||
labels=lm_labels,
|
||||
)
|
||||
self.parent.assertEqual(len(outputs), 4)
|
||||
self.parent.assertEqual(outputs["logits"].size(), (self.batch_size, self.decoder_seq_length, self.vocab_size))
|
||||
self.parent.assertEqual(outputs["loss"].size(), ())
|
||||
|
||||
def create_and_check_decoder_model_past(
|
||||
self,
|
||||
config,
|
||||
input_ids,
|
||||
decoder_input_ids,
|
||||
attention_mask,
|
||||
decoder_attention_mask,
|
||||
lm_labels,
|
||||
):
|
||||
model = Pop2PianoForConditionalGeneration(config=config).get_decoder().to(torch_device).eval()
|
||||
# first forward pass
|
||||
outputs = model(input_ids, use_cache=True)
|
||||
outputs_use_cache_conf = model(input_ids)
|
||||
outputs_no_past = model(input_ids, use_cache=False)
|
||||
|
||||
self.parent.assertTrue(len(outputs) == len(outputs_use_cache_conf))
|
||||
self.parent.assertTrue(len(outputs) == len(outputs_no_past) + 1)
|
||||
|
||||
output, past_key_values = outputs.to_tuple()
|
||||
|
||||
# create hypothetical next token and extend to next_input_ids
|
||||
next_tokens = ids_tensor((self.batch_size, 1), config.vocab_size)
|
||||
|
||||
# append to next input_ids
|
||||
next_input_ids = torch.cat([input_ids, next_tokens], dim=-1)
|
||||
|
||||
output_from_no_past = model(next_input_ids)["last_hidden_state"]
|
||||
output_from_past = model(next_tokens, past_key_values=past_key_values)["last_hidden_state"]
|
||||
|
||||
# select random slice
|
||||
random_slice_idx = ids_tensor((1,), output_from_past.shape[-1]).item()
|
||||
output_from_no_past_slice = output_from_no_past[:, -1, random_slice_idx].detach()
|
||||
output_from_past_slice = output_from_past[:, 0, random_slice_idx].detach()
|
||||
|
||||
# test that outputs are equal for slice
|
||||
self.parent.assertTrue(torch.allclose(output_from_past_slice, output_from_no_past_slice, atol=1e-3))
|
||||
|
||||
def create_and_check_decoder_model_attention_mask_past(
|
||||
self,
|
||||
config,
|
||||
input_ids,
|
||||
decoder_input_ids,
|
||||
attention_mask,
|
||||
decoder_attention_mask,
|
||||
lm_labels,
|
||||
):
|
||||
model = Pop2PianoForConditionalGeneration(config=config).get_decoder()
|
||||
model.to(torch_device)
|
||||
model.eval()
|
||||
|
||||
# create attention mask
|
||||
attn_mask = torch.ones(input_ids.shape, dtype=torch.long, device=torch_device)
|
||||
|
||||
half_seq_length = input_ids.shape[-1] // 2
|
||||
attn_mask[:, half_seq_length:] = 0
|
||||
|
||||
# first forward pass
|
||||
output, past_key_values = model(input_ids, attention_mask=attn_mask, use_cache=True).to_tuple()
|
||||
|
||||
# create hypothetical next token and extend to next_input_ids
|
||||
next_tokens = ids_tensor((self.batch_size, 1), config.vocab_size)
|
||||
|
||||
# change a random masked slice from input_ids
|
||||
random_seq_idx_to_change = ids_tensor((1,), half_seq_length).item() + 1
|
||||
random_other_next_tokens = ids_tensor((self.batch_size, 1), config.vocab_size).squeeze(-1)
|
||||
input_ids[:, -random_seq_idx_to_change] = random_other_next_tokens
|
||||
|
||||
# append to next input_ids and attn_mask
|
||||
next_input_ids = torch.cat([input_ids, next_tokens], dim=-1)
|
||||
attn_mask = torch.cat(
|
||||
[attn_mask, torch.ones((attn_mask.shape[0], 1), dtype=torch.long, device=torch_device)],
|
||||
dim=1,
|
||||
)
|
||||
|
||||
# get two different outputs
|
||||
output_from_no_past = model(next_input_ids, attention_mask=attn_mask)["last_hidden_state"]
|
||||
output_from_past = model(next_tokens, past_key_values=past_key_values, attention_mask=attn_mask)[
|
||||
"last_hidden_state"
|
||||
]
|
||||
|
||||
# select random slice
|
||||
random_slice_idx = ids_tensor((1,), output_from_past.shape[-1]).item()
|
||||
output_from_no_past_slice = output_from_no_past[:, -1, random_slice_idx].detach()
|
||||
output_from_past_slice = output_from_past[:, 0, random_slice_idx].detach()
|
||||
|
||||
# test that outputs are equal for slice
|
||||
self.parent.assertTrue(torch.allclose(output_from_past_slice, output_from_no_past_slice, atol=1e-3))
|
||||
|
||||
def create_and_check_decoder_model_past_large_inputs(
|
||||
self,
|
||||
config,
|
||||
input_ids,
|
||||
decoder_input_ids,
|
||||
attention_mask,
|
||||
decoder_attention_mask,
|
||||
lm_labels,
|
||||
):
|
||||
model = Pop2PianoForConditionalGeneration(config=config).get_decoder().to(torch_device).eval()
|
||||
# first forward pass
|
||||
outputs = model(input_ids, attention_mask=attention_mask, use_cache=True)
|
||||
|
||||
output, past_key_values = outputs.to_tuple()
|
||||
|
||||
# create multiple hypothetical next tokens and extend to next_input_ids
|
||||
next_tokens = ids_tensor((self.batch_size, 3), config.vocab_size)
|
||||
next_mask = ids_tensor((self.batch_size, 3), vocab_size=2)
|
||||
|
||||
# append to next input_ids and attention mask
|
||||
next_input_ids = torch.cat([input_ids, next_tokens], dim=-1)
|
||||
next_attention_mask = torch.cat([attention_mask, next_mask], dim=-1)
|
||||
|
||||
output_from_no_past = model(next_input_ids, attention_mask=next_attention_mask)["last_hidden_state"]
|
||||
output_from_past = model(next_tokens, attention_mask=next_attention_mask, past_key_values=past_key_values)[
|
||||
"last_hidden_state"
|
||||
]
|
||||
|
||||
# select random slice
|
||||
random_slice_idx = ids_tensor((1,), output_from_past.shape[-1]).item()
|
||||
output_from_no_past_slice = output_from_no_past[:, -3:, random_slice_idx].detach()
|
||||
output_from_past_slice = output_from_past[:, :, random_slice_idx].detach()
|
||||
|
||||
self.parent.assertTrue(output_from_past_slice.shape[1] == next_tokens.shape[1])
|
||||
|
||||
# test that outputs are equal for slice
|
||||
self.parent.assertTrue(torch.allclose(output_from_past_slice, output_from_no_past_slice, atol=1e-3))
|
||||
|
||||
def create_and_check_generate_with_past_key_values(
|
||||
self,
|
||||
config,
|
||||
input_ids,
|
||||
decoder_input_ids,
|
||||
attention_mask,
|
||||
decoder_attention_mask,
|
||||
lm_labels,
|
||||
):
|
||||
model = Pop2PianoForConditionalGeneration(config=config).to(torch_device).eval()
|
||||
torch.manual_seed(0)
|
||||
output_without_past_cache = model.generate(
|
||||
input_ids[:1], num_beams=2, max_length=5, do_sample=True, use_cache=False
|
||||
)
|
||||
torch.manual_seed(0)
|
||||
output_with_past_cache = model.generate(input_ids[:1], num_beams=2, max_length=5, do_sample=True)
|
||||
self.parent.assertTrue(torch.all(output_with_past_cache == output_without_past_cache))
|
||||
|
||||
def create_and_check_model_fp16_forward(
|
||||
self,
|
||||
config,
|
||||
input_ids,
|
||||
decoder_input_ids,
|
||||
attention_mask,
|
||||
decoder_attention_mask,
|
||||
lm_labels,
|
||||
):
|
||||
model = Pop2PianoForConditionalGeneration(config=config).to(torch_device).half().eval()
|
||||
output = model(input_ids, decoder_input_ids=input_ids, attention_mask=attention_mask)[
|
||||
"encoder_last_hidden_state"
|
||||
]
|
||||
self.parent.assertFalse(torch.isnan(output).any().item())
|
||||
|
||||
def create_and_check_encoder_decoder_shared_weights(
|
||||
self,
|
||||
config,
|
||||
input_ids,
|
||||
decoder_input_ids,
|
||||
attention_mask,
|
||||
decoder_attention_mask,
|
||||
lm_labels,
|
||||
):
|
||||
for model_class in [Pop2PianoForConditionalGeneration]:
|
||||
torch.manual_seed(0)
|
||||
model = model_class(config=config).to(torch_device).eval()
|
||||
# load state dict copies weights but does not tie them
|
||||
model.encoder.load_state_dict(model.decoder.state_dict(), strict=False)
|
||||
|
||||
torch.manual_seed(0)
|
||||
tied_config = copy.deepcopy(config)
|
||||
tied_config.tie_encoder_decoder = True
|
||||
tied_model = model_class(config=tied_config).to(torch_device).eval()
|
||||
|
||||
model_result = model(
|
||||
input_ids=input_ids,
|
||||
decoder_input_ids=decoder_input_ids,
|
||||
attention_mask=attention_mask,
|
||||
decoder_attention_mask=decoder_attention_mask,
|
||||
)
|
||||
|
||||
tied_model_result = tied_model(
|
||||
input_ids=input_ids,
|
||||
decoder_input_ids=decoder_input_ids,
|
||||
attention_mask=attention_mask,
|
||||
decoder_attention_mask=decoder_attention_mask,
|
||||
)
|
||||
|
||||
# check that the tied model has fewer parameters
|
||||
self.parent.assertLess(
|
||||
sum(p.numel() for p in tied_model.parameters()), sum(p.numel() for p in model.parameters())
|
||||
)
|
||||
random_slice_idx = ids_tensor((1,), model_result[0].shape[-1]).item()
|
||||
|
||||
# check that outputs are equal
|
||||
self.parent.assertTrue(
|
||||
torch.allclose(
|
||||
model_result[0][0, :, random_slice_idx], tied_model_result[0][0, :, random_slice_idx], atol=1e-4
|
||||
)
|
||||
)
|
||||
|
||||
# check that outputs after saving and loading are equal
|
||||
with tempfile.TemporaryDirectory() as tmpdirname:
|
||||
tied_model.save_pretrained(tmpdirname)
|
||||
tied_model = model_class.from_pretrained(tmpdirname)
|
||||
tied_model.to(torch_device)
|
||||
tied_model.eval()
|
||||
|
||||
# check that the tied model has fewer parameters
|
||||
self.parent.assertLess(
|
||||
sum(p.numel() for p in tied_model.parameters()), sum(p.numel() for p in model.parameters())
|
||||
)
|
||||
random_slice_idx = ids_tensor((1,), model_result[0].shape[-1]).item()
|
||||
|
||||
tied_model_result = tied_model(
|
||||
input_ids=input_ids,
|
||||
decoder_input_ids=decoder_input_ids,
|
||||
attention_mask=attention_mask,
|
||||
decoder_attention_mask=decoder_attention_mask,
|
||||
)
|
||||
|
||||
# check that outputs are equal
|
||||
self.parent.assertTrue(
|
||||
torch.allclose(
|
||||
model_result[0][0, :, random_slice_idx],
|
||||
tied_model_result[0][0, :, random_slice_idx],
|
||||
atol=1e-4,
|
||||
)
|
||||
)
|
||||
|
||||
def check_resize_embeddings_pop2piano_v1_1(
|
||||
self,
|
||||
config,
|
||||
):
|
||||
prev_vocab_size = config.vocab_size
|
||||
|
||||
config.tie_word_embeddings = False
|
||||
model = Pop2PianoForConditionalGeneration(config=config).to(torch_device).eval()
|
||||
model.resize_token_embeddings(prev_vocab_size - 10)
|
||||
|
||||
self.parent.assertEqual(model.get_input_embeddings().weight.shape[0], prev_vocab_size - 10)
|
||||
self.parent.assertEqual(model.get_output_embeddings().weight.shape[0], prev_vocab_size - 10)
|
||||
self.parent.assertEqual(model.config.vocab_size, prev_vocab_size - 10)
|
||||
|
||||
def prepare_config_and_inputs_for_common(self):
|
||||
config_and_inputs = self.prepare_config_and_inputs()
|
||||
(
|
||||
config,
|
||||
input_ids,
|
||||
decoder_input_ids,
|
||||
attention_mask,
|
||||
decoder_attention_mask,
|
||||
lm_labels,
|
||||
) = config_and_inputs
|
||||
|
||||
inputs_dict = {
|
||||
"input_ids": input_ids,
|
||||
"attention_mask": attention_mask,
|
||||
"decoder_input_ids": decoder_input_ids,
|
||||
"decoder_attention_mask": decoder_attention_mask,
|
||||
"use_cache": False,
|
||||
}
|
||||
return config, inputs_dict
|
||||
|
||||
|
||||
@require_torch
|
||||
class Pop2PianoModelTest(ModelTesterMixin, GenerationTesterMixin, unittest.TestCase):
|
||||
all_model_classes = (Pop2PianoForConditionalGeneration,) if is_torch_available() else ()
|
||||
all_generative_model_classes = ()
|
||||
all_parallelizable_model_classes = ()
|
||||
fx_compatible = False
|
||||
test_pruning = False
|
||||
test_resize_embeddings = True
|
||||
test_model_parallel = False
|
||||
is_encoder_decoder = True
|
||||
|
||||
def setUp(self):
|
||||
self.model_tester = Pop2PianoModelTester(self)
|
||||
self.config_tester = ConfigTester(self, config_class=Pop2PianoConfig, d_model=37)
|
||||
|
||||
def test_config(self):
|
||||
self.config_tester.run_common_tests()
|
||||
|
||||
def test_shift_right(self):
|
||||
config_and_inputs = self.model_tester.prepare_config_and_inputs()
|
||||
self.model_tester.check_prepare_lm_labels_via_shift_left(*config_and_inputs)
|
||||
|
||||
def test_model(self):
|
||||
config_and_inputs = self.model_tester.prepare_config_and_inputs()
|
||||
self.model_tester.create_and_check_model(*config_and_inputs)
|
||||
|
||||
def test_model_v1_1(self):
|
||||
config_and_inputs = self.model_tester.prepare_config_and_inputs()
|
||||
# check that gated gelu feed forward and different word embeddings work
|
||||
config = config_and_inputs[0]
|
||||
config.tie_word_embeddings = False
|
||||
config.feed_forward_proj = "gated-gelu"
|
||||
self.model_tester.create_and_check_model(config, *config_and_inputs[1:])
|
||||
|
||||
def test_config_and_model_silu_gated(self):
|
||||
config_and_inputs = self.model_tester.prepare_config_and_inputs()
|
||||
config = config_and_inputs[0]
|
||||
config.feed_forward_proj = "gated-silu"
|
||||
self.model_tester.create_and_check_model(*config_and_inputs)
|
||||
|
||||
def test_with_lm_head(self):
|
||||
config_and_inputs = self.model_tester.prepare_config_and_inputs()
|
||||
self.model_tester.create_and_check_with_lm_head(*config_and_inputs)
|
||||
|
||||
def test_decoder_model_past(self):
|
||||
config_and_inputs = self.model_tester.prepare_config_and_inputs()
|
||||
self.model_tester.create_and_check_decoder_model_past(*config_and_inputs)
|
||||
|
||||
def test_decoder_model_past_with_attn_mask(self):
|
||||
config_and_inputs = self.model_tester.prepare_config_and_inputs()
|
||||
self.model_tester.create_and_check_decoder_model_attention_mask_past(*config_and_inputs)
|
||||
|
||||
def test_decoder_model_past_with_3d_attn_mask(self):
|
||||
(
|
||||
config,
|
||||
input_ids,
|
||||
decoder_input_ids,
|
||||
attention_mask,
|
||||
decoder_attention_mask,
|
||||
lm_labels,
|
||||
) = self.model_tester.prepare_config_and_inputs()
|
||||
|
||||
attention_mask = ids_tensor(
|
||||
[self.model_tester.batch_size, self.model_tester.encoder_seq_length, self.model_tester.encoder_seq_length],
|
||||
vocab_size=2,
|
||||
)
|
||||
decoder_attention_mask = ids_tensor(
|
||||
[self.model_tester.batch_size, self.model_tester.decoder_seq_length, self.model_tester.decoder_seq_length],
|
||||
vocab_size=2,
|
||||
)
|
||||
|
||||
self.model_tester.create_and_check_decoder_model_attention_mask_past(
|
||||
config,
|
||||
input_ids,
|
||||
decoder_input_ids,
|
||||
attention_mask,
|
||||
decoder_attention_mask,
|
||||
lm_labels,
|
||||
)
|
||||
|
||||
def test_decoder_model_past_with_large_inputs(self):
|
||||
config_and_inputs = self.model_tester.prepare_config_and_inputs()
|
||||
self.model_tester.create_and_check_decoder_model_past_large_inputs(*config_and_inputs)
|
||||
|
||||
def test_encoder_decoder_shared_weights(self):
|
||||
config_and_inputs = self.model_tester.prepare_config_and_inputs()
|
||||
self.model_tester.create_and_check_encoder_decoder_shared_weights(*config_and_inputs)
|
||||
|
||||
@unittest.skipIf(torch_device == "cpu", "Can't do half precision")
|
||||
def test_model_fp16_forward(self):
|
||||
config_and_inputs = self.model_tester.prepare_config_and_inputs()
|
||||
self.model_tester.create_and_check_model_fp16_forward(*config_and_inputs)
|
||||
|
||||
def test_v1_1_resize_embeddings(self):
|
||||
config = self.model_tester.prepare_config_and_inputs()[0]
|
||||
self.model_tester.check_resize_embeddings_pop2piano_v1_1(config)
|
||||
|
||||
@slow
|
||||
def test_model_from_pretrained(self):
|
||||
for model_name in POP2PIANO_PRETRAINED_MODEL_ARCHIVE_LIST[:1]:
|
||||
model = Pop2PianoForConditionalGeneration.from_pretrained(model_name)
|
||||
self.assertIsNotNone(model)
|
||||
|
||||
@require_onnx
|
||||
@unittest.skipIf(
|
||||
is_torch_1_8_0,
|
||||
reason="Test has a segmentation fault on torch 1.8.0",
|
||||
)
|
||||
def test_export_to_onnx(self):
|
||||
config_and_inputs = self.model_tester.prepare_config_and_inputs()
|
||||
model = Pop2PianoForConditionalGeneration(config_and_inputs[0]).to(torch_device)
|
||||
with tempfile.TemporaryDirectory() as tmpdirname:
|
||||
torch.onnx.export(
|
||||
model,
|
||||
(config_and_inputs[1], config_and_inputs[3], config_and_inputs[2]),
|
||||
f"{tmpdirname}/Pop2Piano_test.onnx",
|
||||
export_params=True,
|
||||
opset_version=9,
|
||||
input_names=["input_ids", "decoder_input_ids"],
|
||||
)
|
||||
|
||||
def test_pass_with_input_features(self):
|
||||
input_features = BatchFeature(
|
||||
{
|
||||
"input_features": torch.rand((75, 100, 512)).type(torch.float32),
|
||||
"beatsteps": torch.randint(size=(1, 955), low=0, high=100).type(torch.float32),
|
||||
"extrapolated_beatstep": torch.randint(size=(1, 900), low=0, high=100).type(torch.float32),
|
||||
}
|
||||
)
|
||||
model = Pop2PianoForConditionalGeneration.from_pretrained("sweetcocoa/pop2piano")
|
||||
model_opts = model.generate(input_features=input_features["input_features"], return_dict_in_generate=True)
|
||||
|
||||
self.assertEqual(model_opts.sequences.ndim, 2)
|
||||
|
||||
def test_pass_with_batched_input_features(self):
|
||||
input_features = BatchFeature(
|
||||
{
|
||||
"input_features": torch.rand((220, 70, 512)).type(torch.float32),
|
||||
"beatsteps": torch.randint(size=(5, 955), low=0, high=100).type(torch.float32),
|
||||
"extrapolated_beatstep": torch.randint(size=(5, 900), low=0, high=100).type(torch.float32),
|
||||
"attention_mask": torch.concatenate(
|
||||
[
|
||||
torch.ones([120, 70], dtype=torch.int32),
|
||||
torch.zeros([1, 70], dtype=torch.int32),
|
||||
torch.ones([50, 70], dtype=torch.int32),
|
||||
torch.zeros([1, 70], dtype=torch.int32),
|
||||
torch.ones([47, 70], dtype=torch.int32),
|
||||
torch.zeros([1, 70], dtype=torch.int32),
|
||||
],
|
||||
axis=0,
|
||||
),
|
||||
"attention_mask_beatsteps": torch.ones((5, 955)).type(torch.int32),
|
||||
"attention_mask_extrapolated_beatstep": torch.ones((5, 900)).type(torch.int32),
|
||||
}
|
||||
)
|
||||
model = Pop2PianoForConditionalGeneration.from_pretrained("sweetcocoa/pop2piano")
|
||||
model_opts = model.generate(
|
||||
input_features=input_features["input_features"],
|
||||
attention_mask=input_features["attention_mask"],
|
||||
return_dict_in_generate=True,
|
||||
)
|
||||
|
||||
self.assertEqual(model_opts.sequences.ndim, 2)
|
||||
|
||||
|
||||
@require_torch
|
||||
class Pop2PianoModelIntegrationTests(unittest.TestCase):
|
||||
@slow
|
||||
def test_mel_conditioner_integration(self):
|
||||
composer = "composer1"
|
||||
model = Pop2PianoForConditionalGeneration.from_pretrained("sweetcocoa/pop2piano")
|
||||
input_embeds = torch.ones([10, 100, 512])
|
||||
|
||||
composer_value = model.generation_config.composer_to_feature_token[composer]
|
||||
composer_value = torch.tensor(composer_value)
|
||||
composer_value = composer_value.repeat(input_embeds.size(0))
|
||||
outputs = model.mel_conditioner(
|
||||
input_embeds, composer_value, min(model.generation_config.composer_to_feature_token.values())
|
||||
)
|
||||
|
||||
# check shape
|
||||
self.assertEqual(outputs.size(), torch.Size([10, 101, 512]))
|
||||
|
||||
# check values
|
||||
EXPECTED_OUTPUTS = torch.tensor(
|
||||
[[1.0475305318832397, 0.29052114486694336, -0.47778210043907166], [1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]
|
||||
)
|
||||
|
||||
self.assertTrue(torch.allclose(outputs[0, :3, :3], EXPECTED_OUTPUTS, atol=1e-4))
|
||||
|
||||
@slow
|
||||
@require_essentia
|
||||
@require_librosa
|
||||
@require_scipy
|
||||
def test_full_model_integration(self):
|
||||
if is_librosa_available() and is_scipy_available() and is_essentia_available() and is_torch_available():
|
||||
from transformers import Pop2PianoProcessor
|
||||
|
||||
speech_input1 = np.zeros([1_000_000], dtype=np.float32)
|
||||
sampling_rate = 44_100
|
||||
|
||||
processor = Pop2PianoProcessor.from_pretrained("sweetcocoa/pop2piano")
|
||||
input_features = processor.feature_extractor(
|
||||
speech_input1, sampling_rate=sampling_rate, return_tensors="pt"
|
||||
)
|
||||
|
||||
model = Pop2PianoForConditionalGeneration.from_pretrained("sweetcocoa/pop2piano")
|
||||
outputs = model.generate(
|
||||
input_features=input_features["input_features"], return_dict_in_generate=True
|
||||
).sequences
|
||||
|
||||
# check for shapes
|
||||
self.assertEqual(outputs.size(0), 70)
|
||||
|
||||
# check for values
|
||||
self.assertEqual(outputs[0, :2].detach().cpu().numpy().tolist(), [0, 1])
|
||||
|
||||
# This test runs on a real music sample from the K-pop genre.
|
||||
@slow
|
||||
@require_essentia
|
||||
@require_librosa
|
||||
@require_scipy
|
||||
def test_real_music(self):
|
||||
if is_librosa_available() and is_scipy_available() and is_essentia_available() and is_torch_available():
|
||||
from transformers import Pop2PianoFeatureExtractor, Pop2PianoTokenizer
|
||||
|
||||
model = Pop2PianoForConditionalGeneration.from_pretrained("susnato/pop2piano_dev")
|
||||
model.eval()
|
||||
feature_extractor = Pop2PianoFeatureExtractor.from_pretrained("susnato/pop2piano_dev")
|
||||
tokenizer = Pop2PianoTokenizer.from_pretrained("susnato/pop2piano_dev")
|
||||
ds = load_dataset("sweetcocoa/pop2piano_ci", split="test")
|
||||
|
||||
output_fe = feature_extractor(
|
||||
ds["audio"][0]["array"], sampling_rate=ds["audio"][0]["sampling_rate"], return_tensors="pt"
|
||||
)
|
||||
output_model = model.generate(input_features=output_fe["input_features"], composer="composer1")
|
||||
output_tokenizer = tokenizer.batch_decode(token_ids=output_model, feature_extractor_output=output_fe)
|
||||
pretty_midi_object = output_tokenizer["pretty_midi_objects"][0]
|
||||
|
||||
# Check that the number of notes matches the expected count
|
||||
self.assertEqual(len(pretty_midi_object.instruments[0].notes), 59)
|
||||
predicted_timings = []
|
||||
for i in pretty_midi_object.instruments[0].notes:
|
||||
predicted_timings.append(i.start)
|
||||
|
||||
# Checking note start timings(first 6)
|
||||
EXPECTED_START_TIMINGS = [
|
||||
0.4876190423965454,
|
||||
0.7314285635948181,
|
||||
0.9752380847930908,
|
||||
1.4396371841430664,
|
||||
1.6718367338180542,
|
||||
1.904036283493042,
|
||||
]
|
||||
|
||||
self.assertTrue(np.allclose(EXPECTED_START_TIMINGS, predicted_timings[:6]))
|
||||
|
||||
# Checking note end timings(last 6)
|
||||
EXPECTED_END_TIMINGS = [
|
||||
12.341403007507324,
|
||||
12.567797183990479,
|
||||
12.567797183990479,
|
||||
12.567797183990479,
|
||||
12.794191360473633,
|
||||
12.794191360473633,
|
||||
]
|
||||
|
||||
np.allclose(EXPECTED_END_TIMINGS, predicted_timings[-6:])
|
||||
266
tests/models/pop2piano/test_processor_pop2piano.py
Normal file
@@ -0,0 +1,266 @@
|
||||
# Copyright 2023 The HuggingFace Team. All rights reserved.
|
||||
#
|
||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||
# you may not use this file except in compliance with the License.
|
||||
# You may obtain a copy of the License at
|
||||
#
|
||||
# http://www.apache.org/licenses/LICENSE-2.0
|
||||
#
|
||||
# Unless required by applicable law or agreed to in writing, software
|
||||
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
|
||||
import shutil
|
||||
import tempfile
|
||||
import unittest
|
||||
|
||||
import numpy as np
|
||||
import pytest
|
||||
from datasets import load_dataset
|
||||
|
||||
from transformers.testing_utils import (
|
||||
require_essentia,
|
||||
require_librosa,
|
||||
require_pretty_midi,
|
||||
require_scipy,
|
||||
require_torch,
|
||||
)
|
||||
from transformers.tokenization_utils import BatchEncoding
|
||||
from transformers.utils.import_utils import (
|
||||
is_essentia_available,
|
||||
is_librosa_available,
|
||||
is_pretty_midi_available,
|
||||
is_scipy_available,
|
||||
is_torch_available,
|
||||
)
|
||||
|
||||
|
||||
requirements_available = (
|
||||
is_torch_available()
|
||||
and is_essentia_available()
|
||||
and is_scipy_available()
|
||||
and is_librosa_available()
|
||||
and is_pretty_midi_available()
|
||||
)
|
||||
|
||||
if requirements_available:
|
||||
import pretty_midi
|
||||
|
||||
from transformers import (
|
||||
Pop2PianoFeatureExtractor,
|
||||
Pop2PianoForConditionalGeneration,
|
||||
Pop2PianoProcessor,
|
||||
Pop2PianoTokenizer,
|
||||
)
|
||||
|
||||
# TODO: change checkpoints from `susnato/pop2piano_dev` to `sweetcocoa/pop2piano` after the PR is approved
|
||||
|
||||
|
||||
@require_scipy
|
||||
@require_torch
|
||||
@require_librosa
|
||||
@require_essentia
|
||||
@require_pretty_midi
|
||||
class Pop2PianoProcessorTest(unittest.TestCase):
|
||||
def setUp(self):
|
||||
self.tmpdirname = tempfile.mkdtemp()
|
||||
|
||||
feature_extractor = Pop2PianoFeatureExtractor.from_pretrained("susnato/pop2piano_dev")
|
||||
tokenizer = Pop2PianoTokenizer.from_pretrained("susnato/pop2piano_dev")
|
||||
processor = Pop2PianoProcessor(feature_extractor, tokenizer)
|
||||
|
||||
processor.save_pretrained(self.tmpdirname)
|
||||
|
||||
def get_tokenizer(self, **kwargs):
|
||||
return Pop2PianoTokenizer.from_pretrained(self.tmpdirname, **kwargs)
|
||||
|
||||
def get_feature_extractor(self, **kwargs):
|
||||
return Pop2PianoFeatureExtractor.from_pretrained(self.tmpdirname, **kwargs)
|
||||
|
||||
def tearDown(self):
|
||||
shutil.rmtree(self.tmpdirname)
|
||||
|
||||
def test_save_load_pretrained_additional_features(self):
|
||||
processor = Pop2PianoProcessor(
|
||||
tokenizer=self.get_tokenizer(),
|
||||
feature_extractor=self.get_feature_extractor(),
|
||||
)
|
||||
processor.save_pretrained(self.tmpdirname)
|
||||
|
||||
tokenizer_add_kwargs = self.get_tokenizer(
|
||||
unk_token="-1",
|
||||
eos_token="1",
|
||||
pad_token="0",
|
||||
bos_token="2",
|
||||
)
|
||||
feature_extractor_add_kwargs = self.get_feature_extractor()
|
||||
|
||||
processor = Pop2PianoProcessor.from_pretrained(
|
||||
self.tmpdirname,
|
||||
unk_token="-1",
|
||||
eos_token="1",
|
||||
pad_token="0",
|
||||
bos_token="2",
|
||||
)
|
||||
|
||||
self.assertEqual(processor.tokenizer.get_vocab(), tokenizer_add_kwargs.get_vocab())
|
||||
self.assertIsInstance(processor.tokenizer, Pop2PianoTokenizer)
|
||||
|
||||
self.assertEqual(processor.feature_extractor.to_json_string(), feature_extractor_add_kwargs.to_json_string())
|
||||
self.assertIsInstance(processor.feature_extractor, Pop2PianoFeatureExtractor)
|
||||
|
||||
def get_inputs(self):
|
||||
"""get inputs for both feature extractor and tokenizer"""
|
||||
ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
|
||||
speech_samples = ds.sort("id").select([0])["audio"]
|
||||
input_speech = [x["array"] for x in speech_samples][0]
|
||||
sampling_rate = [x["sampling_rate"] for x in speech_samples][0]
|
||||
|
||||
feature_extractor_outputs = self.get_feature_extractor()(
|
||||
audio=input_speech, sampling_rate=sampling_rate, return_tensors="pt"
|
||||
)
|
||||
model = Pop2PianoForConditionalGeneration.from_pretrained("susnato/pop2piano_dev")
|
||||
token_ids = model.generate(input_features=feature_extractor_outputs["input_features"], composer="composer1")
|
||||
dummy_notes = [
|
||||
[
|
||||
pretty_midi.Note(start=0.441179, end=2.159456, pitch=70, velocity=77),
|
||||
pretty_midi.Note(start=0.673379, end=0.905578, pitch=73, velocity=77),
|
||||
pretty_midi.Note(start=0.905578, end=2.159456, pitch=73, velocity=77),
|
||||
pretty_midi.Note(start=1.114558, end=2.159456, pitch=78, velocity=77),
|
||||
pretty_midi.Note(start=1.323537, end=1.532517, pitch=80, velocity=77),
|
||||
],
|
||||
[
|
||||
pretty_midi.Note(start=0.441179, end=2.159456, pitch=70, velocity=77),
|
||||
],
|
||||
]
|
||||
|
||||
return input_speech, sampling_rate, token_ids, dummy_notes
|
||||
|
||||
def test_feature_extractor(self):
|
||||
feature_extractor = self.get_feature_extractor()
|
||||
tokenizer = self.get_tokenizer()
|
||||
|
||||
processor = Pop2PianoProcessor(
|
||||
tokenizer=tokenizer,
|
||||
feature_extractor=feature_extractor,
|
||||
)
|
||||
|
||||
input_speech, sampling_rate, _, _ = self.get_inputs()
|
||||
|
||||
feature_extractor_outputs = feature_extractor(
|
||||
audio=input_speech, sampling_rate=sampling_rate, return_tensors="np"
|
||||
)
|
||||
processor_outputs = processor(audio=input_speech, sampling_rate=sampling_rate, return_tensors="np")
|
||||
|
||||
for key in feature_extractor_outputs.keys():
|
||||
self.assertTrue(np.allclose(feature_extractor_outputs[key], processor_outputs[key], atol=1e-4))
|
||||
|
||||
def test_processor_batch_decode(self):
|
||||
feature_extractor = self.get_feature_extractor()
|
||||
tokenizer = self.get_tokenizer()
|
||||
|
||||
processor = Pop2PianoProcessor(
|
||||
tokenizer=tokenizer,
|
||||
feature_extractor=feature_extractor,
|
||||
)
|
||||
|
||||
audio, sampling_rate, token_ids, _ = self.get_inputs()
|
||||
feature_extractor_output = feature_extractor(audio=audio, sampling_rate=sampling_rate, return_tensors="pt")
|
||||
|
||||
encoded_processor = processor.batch_decode(
|
||||
token_ids=token_ids,
|
||||
feature_extractor_output=feature_extractor_output,
|
||||
return_midi=True,
|
||||
)
|
||||
|
||||
encoded_tokenizer = tokenizer.batch_decode(
|
||||
token_ids=token_ids,
|
||||
feature_extractor_output=feature_extractor_output,
|
||||
return_midi=True,
|
||||
)
|
||||
# check start timings
|
||||
encoded_processor_start_timings = [token.start for token in encoded_processor["notes"]]
|
||||
encoded_tokenizer_start_timings = [token.start for token in encoded_tokenizer["notes"]]
|
||||
self.assertListEqual(encoded_processor_start_timings, encoded_tokenizer_start_timings)
|
||||
|
||||
# check end timings
|
||||
encoded_processor_end_timings = [token.end for token in encoded_processor["notes"]]
|
||||
encoded_tokenizer_end_timings = [token.end for token in encoded_tokenizer["notes"]]
|
||||
self.assertListEqual(encoded_processor_end_timings, encoded_tokenizer_end_timings)
|
||||
|
||||
# check pitch
|
||||
encoded_processor_pitch = [token.pitch for token in encoded_processor["notes"]]
|
||||
encoded_tokenizer_pitch = [token.pitch for token in encoded_tokenizer["notes"]]
|
||||
self.assertListEqual(encoded_processor_pitch, encoded_tokenizer_pitch)
|
||||
|
||||
# check velocity
|
||||
encoded_processor_velocity = [token.velocity for token in encoded_processor["notes"]]
|
||||
encoded_tokenizer_velocity = [token.velocity for token in encoded_tokenizer["notes"]]
|
||||
self.assertListEqual(encoded_processor_velocity, encoded_tokenizer_velocity)
|
||||
|
||||
def test_tokenizer_call(self):
|
||||
feature_extractor = self.get_feature_extractor()
|
||||
tokenizer = self.get_tokenizer()
|
||||
|
||||
processor = Pop2PianoProcessor(
|
||||
tokenizer=tokenizer,
|
||||
feature_extractor=feature_extractor,
|
||||
)
|
||||
|
||||
_, _, _, notes = self.get_inputs()
|
||||
|
||||
encoded_processor = processor(
|
||||
notes=notes,
|
||||
)
|
||||
|
||||
self.assertIsInstance(encoded_processor, BatchEncoding)
|
||||
|
||||
def test_processor(self):
|
||||
feature_extractor = self.get_feature_extractor()
|
||||
tokenizer = self.get_tokenizer()
|
||||
|
||||
processor = Pop2PianoProcessor(
|
||||
tokenizer=tokenizer,
|
||||
feature_extractor=feature_extractor,
|
||||
)
|
||||
|
||||
audio, sampling_rate, _, notes = self.get_inputs()
|
||||
|
||||
inputs = processor(
|
||||
audio=audio,
|
||||
sampling_rate=sampling_rate,
|
||||
notes=notes,
|
||||
)
|
||||
|
||||
self.assertListEqual(
|
||||
list(inputs.keys()),
|
||||
["input_features", "beatsteps", "extrapolated_beatstep", "token_ids"],
|
||||
)
|
||||
|
||||
# test if it raises when no input is passed
|
||||
with pytest.raises(ValueError):
|
||||
processor()
|
||||
|
||||
def test_model_input_names(self):
|
||||
feature_extractor = self.get_feature_extractor()
|
||||
tokenizer = self.get_tokenizer()
|
||||
|
||||
processor = Pop2PianoProcessor(
|
||||
tokenizer=tokenizer,
|
||||
feature_extractor=feature_extractor,
|
||||
)
|
||||
|
||||
audio, sampling_rate, _, notes = self.get_inputs()
|
||||
feature_extractor(audio, sampling_rate, return_tensors="pt")
|
||||
|
||||
inputs = processor(
|
||||
audio=audio,
|
||||
sampling_rate=sampling_rate,
|
||||
notes=notes,
|
||||
)
|
||||
self.assertListEqual(
|
||||
list(inputs.keys()),
|
||||
["input_features", "beatsteps", "extrapolated_beatstep", "token_ids"],
|
||||
)
|
||||
418
tests/models/pop2piano/test_tokenization_pop2piano.py
Normal file
@@ -0,0 +1,418 @@
|
||||
# coding=utf-8
|
||||
# Copyright 2023 The HuggingFace Team. All rights reserved.
|
||||
#
|
||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||
# you may not use this file except in compliance with the License.
|
||||
# You may obtain a copy of the License at
|
||||
#
|
||||
# http://www.apache.org/licenses/LICENSE-2.0
|
||||
#
|
||||
# Unless required by applicable law or agreed to in writing, software
|
||||
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
"""
|
||||
Please note that Pop2PianoTokenizer differs too much from our usual tokenizers and thus cannot use the TokenizerTesterMixin class.
|
||||
"""
|
||||
|
||||
import os
|
||||
import pickle
|
||||
import shutil
|
||||
import tempfile
|
||||
import unittest
|
||||
|
||||
from transformers.feature_extraction_utils import BatchFeature
|
||||
from transformers.testing_utils import (
|
||||
is_pretty_midi_available,
|
||||
is_torch_available,
|
||||
require_pretty_midi,
|
||||
require_torch,
|
||||
)
|
||||
from transformers.tokenization_utils import BatchEncoding
|
||||
|
||||
|
||||
if is_torch_available():
|
||||
import torch
|
||||
|
||||
|
||||
requirements_available = is_torch_available() and is_pretty_midi_available()
|
||||
if requirements_available:
|
||||
import pretty_midi
|
||||
|
||||
from transformers import Pop2PianoTokenizer
|
||||
|
||||
|
||||
# TODO: change checkpoints from `susnato/pop2piano_dev` to `sweetcocoa/pop2piano` after the PR is approved
|
||||
|
||||
|
||||
@require_torch
|
||||
@require_pretty_midi
|
||||
class Pop2PianoTokenizerTest(unittest.TestCase):
|
||||
def setUp(self):
|
||||
super().setUp()
|
||||
self.tokenizer = Pop2PianoTokenizer.from_pretrained("susnato/pop2piano_dev")
|
||||
|
||||
def get_input_notes(self):
|
||||
notes = [
|
||||
[
|
||||
pretty_midi.Note(start=0.441179, end=2.159456, pitch=70, velocity=77),
|
||||
pretty_midi.Note(start=0.673379, end=0.905578, pitch=73, velocity=77),
|
||||
pretty_midi.Note(start=0.905578, end=2.159456, pitch=73, velocity=77),
|
||||
pretty_midi.Note(start=1.114558, end=2.159456, pitch=78, velocity=77),
|
||||
pretty_midi.Note(start=1.323537, end=1.532517, pitch=80, velocity=77),
|
||||
],
|
||||
[
|
||||
pretty_midi.Note(start=0.441179, end=2.159456, pitch=70, velocity=77),
|
||||
],
|
||||
]
|
||||
|
||||
return notes
|
||||
|
||||
def test_call(self):
|
||||
notes = self.get_input_notes()
|
||||
|
||||
output = self.tokenizer(
|
||||
notes,
|
||||
return_tensors="pt",
|
||||
padding="max_length",
|
||||
truncation=True,
|
||||
max_length=10,
|
||||
return_attention_mask=True,
|
||||
)
|
||||
|
||||
# check the output type
|
||||
self.assertIsInstance(output, BatchEncoding)
|
||||
|
||||
# check the values
|
||||
expected_output_token_ids = torch.tensor(
|
||||
[[134, 133, 74, 135, 77, 132, 77, 133, 77, 82], [134, 133, 74, 136, 132, 74, 134, 134, 134, 134]]
|
||||
)
|
||||
expected_output_attention_mask = torch.tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]])
|
||||
|
||||
self.assertTrue(torch.allclose(output["token_ids"], expected_output_token_ids, atol=1e-4))
|
||||
self.assertTrue(torch.allclose(output["attention_mask"], expected_output_attention_mask, atol=1e-4))
|
||||
|
||||
def test_batch_decode(self):
|
||||
# Test batch_decode with model outputs and feature-extractor outputs (beatsteps, extrapolated_beatstep)
|
||||
|
||||
# Please note that this test does not check the accuracy of the outputs; it is designed to make sure that
|
||||
# the tokenizer's batch_decode can deal with attention_mask in feature-extractor outputs. For the accuracy check
|
||||
# please see the `test_batch_decode_outputs` test.
|
||||
|
||||
model_output = torch.concatenate(
|
||||
[
|
||||
torch.randint(size=[120, 96], low=0, high=70, dtype=torch.long),
|
||||
torch.zeros(size=[1, 96], dtype=torch.long),
|
||||
torch.randint(size=[50, 96], low=0, high=40, dtype=torch.long),
|
||||
torch.zeros(size=[1, 96], dtype=torch.long),
|
||||
],
|
||||
axis=0,
|
||||
)
|
||||
input_features = BatchFeature(
|
||||
{
|
||||
"beatsteps": torch.ones([2, 955]),
|
||||
"extrapolated_beatstep": torch.ones([2, 1000]),
|
||||
"attention_mask": torch.concatenate(
|
||||
[
|
||||
torch.ones([120, 96], dtype=torch.long),
|
||||
torch.zeros([1, 96], dtype=torch.long),
|
||||
torch.ones([50, 96], dtype=torch.long),
|
||||
torch.zeros([1, 96], dtype=torch.long),
|
||||
],
|
||||
axis=0,
|
||||
),
|
||||
"attention_mask_beatsteps": torch.ones([2, 955]),
|
||||
"attention_mask_extrapolated_beatstep": torch.ones([2, 1000]),
|
||||
}
|
||||
)
|
||||
|
||||
output = self.tokenizer.batch_decode(token_ids=model_output, feature_extractor_output=input_features)[
|
||||
"pretty_midi_objects"
|
||||
]
|
||||
|
||||
# check length
|
||||
self.assertEqual(len(output), 2)
|
||||
|
||||
# check object type
|
||||
self.assertIsInstance(output[0], pretty_midi.pretty_midi.PrettyMIDI)
|
||||
self.assertIsInstance(output[1], pretty_midi.pretty_midi.PrettyMIDI)
|
||||
|
||||
def test_batch_decode_outputs(self):
|
||||
# Test batch_decode with model outputs and feature-extractor outputs (beatsteps, extrapolated_beatstep)
|
||||
|
||||
# This test checks the accuracy of the outputs of the tokenizer's `batch_decode` method.
|
||||
|
||||
model_output = torch.tensor(
|
||||
[
|
||||
[134, 133, 74, 135, 77, 82, 84, 136, 132, 74, 77, 82, 84],
|
||||
[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0],
|
||||
]
|
||||
)
|
||||
input_features = BatchEncoding(
|
||||
{
|
||||
"beatsteps": torch.tensor([[0.0697, 0.1103, 0.1509, 0.1916]]),
|
||||
"extrapolated_beatstep": torch.tensor([[0.0000, 0.0406, 0.0813, 0.1219]]),
|
||||
}
|
||||
)
|
||||
|
||||
output = self.tokenizer.batch_decode(token_ids=model_output, feature_extractor_output=input_features)
|
||||
|
||||
# check outputs
|
||||
self.assertEqual(len(output["notes"]), 4)
|
||||
|
||||
predicted_start_timings, predicted_end_timings = [], []
|
||||
for i in output["notes"]:
|
||||
predicted_start_timings.append(i.start)
|
||||
predicted_end_timings.append(i.end)
|
||||
|
||||
# Checking note start timings
|
||||
expected_start_timings = torch.tensor(
|
||||
[
|
||||
0.069700,
|
||||
0.110300,
|
||||
0.110300,
|
||||
0.110300,
|
||||
]
|
||||
)
|
||||
predicted_start_timings = torch.tensor(predicted_start_timings)
|
||||
|
||||
self.assertTrue(torch.allclose(expected_start_timings, predicted_start_timings, atol=1e-4))
|
||||
|
||||
# Checking note end timings
|
||||
expected_end_timings = torch.tensor(
|
||||
[
|
||||
0.191600,
|
||||
0.191600,
|
||||
0.191600,
|
||||
0.191600,
|
||||
]
|
||||
)
|
||||
predicted_end_timings = torch.tensor(predicted_end_timings)
|
||||
|
||||
self.assertTrue(torch.allclose(expected_end_timings, predicted_end_timings, atol=1e-4))
|
||||
|
||||
def test_get_vocab(self):
|
||||
vocab_dict = self.tokenizer.get_vocab()
|
||||
self.assertIsInstance(vocab_dict, dict)
|
||||
self.assertGreaterEqual(len(self.tokenizer), len(vocab_dict))
|
||||
|
||||
vocab = [self.tokenizer.convert_ids_to_tokens(i) for i in range(len(self.tokenizer))]
|
||||
self.assertEqual(len(vocab), len(self.tokenizer))
|
||||
|
||||
self.tokenizer.add_tokens(["asdfasdfasdfasdf"])
|
||||
vocab = [self.tokenizer.convert_ids_to_tokens(i) for i in range(len(self.tokenizer))]
|
||||
self.assertEqual(len(vocab), len(self.tokenizer))
|
||||
|
||||
def test_save_and_load_tokenizer(self):
|
||||
tmpdirname = tempfile.mkdtemp()
|
||||
|
||||
sample_notes = self.get_input_notes()
|
||||
|
||||
self.tokenizer.add_tokens(["bim", "bambam"])
|
||||
additional_special_tokens = self.tokenizer.additional_special_tokens
|
||||
additional_special_tokens.append("new_additional_special_token")
|
||||
self.tokenizer.add_special_tokens({"additional_special_tokens": additional_special_tokens})
|
||||
before_token_ids = self.tokenizer(sample_notes)["token_ids"]
|
||||
before_vocab = self.tokenizer.get_vocab()
|
||||
self.tokenizer.save_pretrained(tmpdirname)
|
||||
|
||||
after_tokenizer = self.tokenizer.__class__.from_pretrained(tmpdirname)
|
||||
after_token_ids = after_tokenizer(sample_notes)["token_ids"]
|
||||
after_vocab = after_tokenizer.get_vocab()
|
||||
self.assertDictEqual(before_vocab, after_vocab)
|
||||
self.assertListEqual(before_token_ids, after_token_ids)
|
||||
self.assertIn("bim", after_vocab)
|
||||
self.assertIn("bambam", after_vocab)
|
||||
self.assertIn("new_additional_special_token", after_tokenizer.additional_special_tokens)
|
||||
|
||||
shutil.rmtree(tmpdirname)
|
||||
|
||||
def test_pickle_tokenizer(self):
|
||||
tmpdirname = tempfile.mkdtemp()
|
||||
|
||||
notes = self.get_input_notes()
|
||||
subwords = self.tokenizer(notes)["token_ids"]
|
||||
|
||||
filename = os.path.join(tmpdirname, "tokenizer.bin")
|
||||
with open(filename, "wb") as handle:
|
||||
pickle.dump(self.tokenizer, handle)
|
||||
|
||||
with open(filename, "rb") as handle:
|
||||
tokenizer_new = pickle.load(handle)
|
||||
|
||||
subwords_loaded = tokenizer_new(notes)["token_ids"]
|
||||
|
||||
self.assertListEqual(subwords, subwords_loaded)
|
||||
|
||||
def test_padding_side_in_kwargs(self):
|
||||
tokenizer_p = Pop2PianoTokenizer.from_pretrained("susnato/pop2piano_dev", padding_side="left")
|
||||
self.assertEqual(tokenizer_p.padding_side, "left")
|
||||
|
||||
tokenizer_p = Pop2PianoTokenizer.from_pretrained("susnato/pop2piano_dev", padding_side="right")
|
||||
self.assertEqual(tokenizer_p.padding_side, "right")
|
||||
|
||||
self.assertRaises(
|
||||
ValueError,
|
||||
Pop2PianoTokenizer.from_pretrained,
|
||||
"susnato/pop2piano_dev",
|
||||
padding_side="unauthorized",
|
||||
)
|
||||
|
||||
def test_truncation_side_in_kwargs(self):
|
||||
tokenizer_p = Pop2PianoTokenizer.from_pretrained("susnato/pop2piano_dev", truncation_side="left")
|
||||
self.assertEqual(tokenizer_p.truncation_side, "left")
|
||||
|
||||
tokenizer_p = Pop2PianoTokenizer.from_pretrained("susnato/pop2piano_dev", truncation_side="right")
|
||||
self.assertEqual(tokenizer_p.truncation_side, "right")
|
||||
|
||||
self.assertRaises(
|
||||
ValueError,
|
||||
Pop2PianoTokenizer.from_pretrained,
|
||||
"susnato/pop2piano_dev",
|
||||
truncation_side="unauthorized",
|
||||
)
|
||||
|
||||
def test_right_and_left_padding(self):
|
||||
tokenizer = self.tokenizer
|
||||
notes = self.get_input_notes()
|
||||
notes = notes[0]
|
||||
max_length = 20
|
||||
|
||||
padding_idx = tokenizer.pad_token_id
|
||||
|
||||
# RIGHT PADDING - Check that it correctly pads when a maximum length is specified along with the padding flag set to True
|
||||
tokenizer.padding_side = "right"
|
||||
padded_notes = tokenizer(notes, padding="max_length", max_length=max_length)["token_ids"]
|
||||
padded_notes_length = len(padded_notes)
|
||||
notes_without_padding = tokenizer(notes, padding="do_not_pad")["token_ids"]
|
||||
padding_size = max_length - len(notes_without_padding)
|
||||
|
||||
self.assertEqual(padded_notes_length, max_length)
|
||||
self.assertEqual(notes_without_padding + [padding_idx] * padding_size, padded_notes)
|
||||
|
||||
# LEFT PADDING - Check that it correctly pads when a maximum length is specified along with the padding flag set to True
|
||||
tokenizer.padding_side = "left"
|
||||
padded_notes = tokenizer(notes, padding="max_length", max_length=max_length)["token_ids"]
|
||||
padded_notes_length = len(padded_notes)
|
||||
notes_without_padding = tokenizer(notes, padding="do_not_pad")["token_ids"]
|
||||
padding_size = max_length - len(notes_without_padding)
|
||||
|
||||
self.assertEqual(padded_notes_length, max_length)
|
||||
self.assertEqual([padding_idx] * padding_size + notes_without_padding, padded_notes)
|
||||
|
||||
# RIGHT & LEFT PADDING - Check that nothing is done for 'longest' and 'no_padding'
|
||||
notes_without_padding = tokenizer(notes)["token_ids"]
|
||||
|
||||
tokenizer.padding_side = "right"
|
||||
padded_notes_right = tokenizer(notes, padding=False)["token_ids"]
|
||||
self.assertEqual(len(padded_notes_right), len(notes_without_padding))
|
||||
self.assertEqual(padded_notes_right, notes_without_padding)
|
||||
|
||||
tokenizer.padding_side = "left"
|
||||
padded_notes_left = tokenizer(notes, padding="longest")["token_ids"]
|
||||
self.assertEqual(len(padded_notes_left), len(notes_without_padding))
|
||||
self.assertEqual(padded_notes_left, notes_without_padding)
|
||||
|
||||
tokenizer.padding_side = "right"
|
||||
padded_notes_right = tokenizer(notes, padding="longest")["token_ids"]
|
||||
self.assertEqual(len(padded_notes_right), len(notes_without_padding))
|
||||
self.assertEqual(padded_notes_right, notes_without_padding)
|
||||
|
||||
tokenizer.padding_side = "left"
|
||||
padded_notes_left = tokenizer(notes, padding=False)["token_ids"]
|
||||
self.assertEqual(len(padded_notes_left), len(notes_without_padding))
|
||||
self.assertEqual(padded_notes_left, notes_without_padding)
|
||||
|
||||
def test_right_and_left_truncation(self):
|
||||
tokenizer = self.tokenizer
|
||||
notes = self.get_input_notes()
|
||||
notes = notes[0]
|
||||
truncation_size = 3
|
||||
|
||||
# RIGHT TRUNCATION - Check that it correctly truncates when a maximum length is specified along with the truncation flag set to True
|
||||
tokenizer.truncation_side = "right"
|
||||
full_encoded_notes = tokenizer(notes)["token_ids"]
|
||||
full_encoded_notes_length = len(full_encoded_notes)
|
||||
truncated_notes = tokenizer(notes, max_length=full_encoded_notes_length - truncation_size, truncation=True)[
|
||||
"token_ids"
|
||||
]
|
||||
self.assertEqual(full_encoded_notes_length, len(truncated_notes) + truncation_size)
|
||||
self.assertEqual(full_encoded_notes[:-truncation_size], truncated_notes)
|
||||
|
||||
# LEFT TRUNCATION - Check that it correctly truncates when a maximum length is specified along with the truncation flag set to True
|
||||
tokenizer.truncation_side = "left"
|
||||
full_encoded_notes = tokenizer(notes)["token_ids"]
|
||||
full_encoded_notes_length = len(full_encoded_notes)
|
||||
truncated_notes = tokenizer(notes, max_length=full_encoded_notes_length - truncation_size, truncation=True)[
|
||||
"token_ids"
|
||||
]
|
||||
self.assertEqual(full_encoded_notes_length, len(truncated_notes) + truncation_size)
|
||||
self.assertEqual(full_encoded_notes[truncation_size:], truncated_notes)
|
||||
|
||||
# RIGHT & LEFT TRUNCATION - Check that nothing is done for 'longest' and 'no_truncation'
|
||||
tokenizer.truncation_side = "right"
|
||||
truncated_notes_right = tokenizer(notes, truncation=True)["token_ids"]
|
||||
self.assertEqual(full_encoded_notes_length, len(truncated_notes_right))
|
||||
self.assertEqual(full_encoded_notes, truncated_notes_right)
|
||||
|
||||
tokenizer.truncation_side = "left"
|
||||
truncated_notes_left = tokenizer(notes, truncation="longest_first")["token_ids"]
|
||||
self.assertEqual(len(truncated_notes_left), full_encoded_notes_length)
|
||||
self.assertEqual(truncated_notes_left, full_encoded_notes)
|
||||
|
||||
tokenizer.truncation_side = "right"
|
||||
truncated_notes_right = tokenizer(notes, truncation="longest_first")["token_ids"]
|
||||
self.assertEqual(len(truncated_notes_right), full_encoded_notes_length)
|
||||
self.assertEqual(truncated_notes_right, full_encoded_notes)
|
||||
|
||||
tokenizer.truncation_side = "left"
|
||||
truncated_notes_left = tokenizer(notes, truncation=True)["token_ids"]
|
||||
self.assertEqual(len(truncated_notes_left), full_encoded_notes_length)
|
||||
self.assertEqual(truncated_notes_left, full_encoded_notes)
|
||||
|
    def test_padding_to_multiple_of(self):
        notes = self.get_input_notes()

        if self.tokenizer.pad_token is None:
            self.skipTest("No padding token.")
        else:
            normal_tokens = self.tokenizer(notes[0], padding=True, pad_to_multiple_of=8)
            for key, value in normal_tokens.items():
                self.assertEqual(len(value) % 8, 0, f"BatchEncoding.{key} is not multiple of 8")

            normal_tokens = self.tokenizer(notes[0], pad_to_multiple_of=8)
            for key, value in normal_tokens.items():
                self.assertNotEqual(len(value) % 8, 0, f"BatchEncoding.{key} is not multiple of 8")

            # Should also work with truncation
            normal_tokens = self.tokenizer(notes[0], padding=True, truncation=True, pad_to_multiple_of=8)
            for key, value in normal_tokens.items():
                self.assertEqual(len(value) % 8, 0, f"BatchEncoding.{key} is not multiple of 8")

            # truncation to something which is not a multiple of pad_to_multiple_of raises an error
            self.assertRaises(
                ValueError,
                self.tokenizer.__call__,
                notes[0],
                padding=True,
                truncation=True,
                max_length=12,
                pad_to_multiple_of=8,
            )

    def test_padding_with_attention_mask(self):
        if self.tokenizer.pad_token is None:
            self.skipTest("No padding token.")
        if "attention_mask" not in self.tokenizer.model_input_names:
            self.skipTest("This model does not use attention mask.")

        features = [
            {"token_ids": [1, 2, 3, 4, 5, 6], "attention_mask": [1, 1, 1, 1, 1, 0]},
            {"token_ids": [1, 2, 3], "attention_mask": [1, 1, 0]},
        ]
        padded_features = self.tokenizer.pad(features)
        if self.tokenizer.padding_side == "right":
            self.assertListEqual(padded_features["attention_mask"], [[1, 1, 1, 1, 1, 0], [1, 1, 0, 0, 0, 0]])
        else:
            self.assertListEqual(padded_features["attention_mask"], [[1, 1, 1, 1, 1, 0], [0, 0, 0, 1, 1, 0]])
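Note on the two padding tests above: the expected masks and the multiple-of-8 lengths follow from simple list arithmetic. The sketch below is a rough illustration under that assumption; `round_up_to_multiple` and `pad_attention_masks` are hypothetical names, not the tokenizer's own `pad` implementation.

```python
def round_up_to_multiple(length, multiple):
    """Smallest multiple of `multiple` that is >= length."""
    return ((length + multiple - 1) // multiple) * multiple


def pad_attention_masks(masks, padding_side="right", pad_to_multiple_of=None):
    """Hypothetical helper: pad every mask with zeros to a common length."""
    target = max(len(mask) for mask in masks)
    if pad_to_multiple_of is not None:
        target = round_up_to_multiple(target, pad_to_multiple_of)
    padded = []
    for mask in masks:
        fill = [0] * (target - len(mask))
        # Existing mask values (including zeros) are kept; only the fill is new.
        padded.append(mask + fill if padding_side == "right" else fill + mask)
    return padded


masks = [[1, 1, 1, 1, 1, 0], [1, 1, 0]]
right_padded = pad_attention_masks(masks, "right")
left_padded = pad_attention_masks(masks, "left")
padded_to_8 = pad_attention_masks(masks, "right", pad_to_multiple_of=8)
```

The `max_length=12` case in `test_padding_to_multiple_of` is expected to raise because 12 is not a multiple of 8, so truncating to 12 and padding to a multiple of 8 cannot both hold.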
@@ -58,6 +58,8 @@ SPECIAL_CASES_TO_ALLOW = {
    # used internally in the configuration class file
    "LongT5Config": ["feed_forward_proj"],
    # used internally in the configuration class file
    "Pop2PianoConfig": ["feed_forward_proj"],
    # used internally in the configuration class file
    "SwitchTransformersConfig": ["feed_forward_proj"],
    # having default values other than `1e-5` - we can't fix them without breaking
    "BioGptConfig": ["layer_norm_eps"],

@@ -66,6 +66,7 @@ PRIVATE_MODELS = [
    "T5Stack",
    "MT5Stack",
    "UMT5Stack",
    "Pop2PianoStack",
    "SwitchTransformersStack",
    "TFDPRSpanPredictor",
    "MaskFormerSwinModel",

@@ -346,6 +346,8 @@ src/transformers/models/poolformer/configuration_poolformer.py
src/transformers/models/poolformer/feature_extraction_poolformer.py
src/transformers/models/poolformer/image_processing_poolformer.py
src/transformers/models/poolformer/modeling_poolformer.py
src/transformers/models/pop2piano/configuration_pop2piano.py
src/transformers/models/pop2piano/modeling_pop2piano.py
src/transformers/models/prophetnet/tokenization_prophetnet.py
src/transformers/models/rag/tokenization_rag.py
src/transformers/models/realm/configuration_realm.py