Add Audio Spectogram Transformer (#19981)

* First draft

* Make conversion script work

* Add id2label mapping, run code quality

* Fix copies

* Add first draft of feature extractor

* Update conversion script to use feature extractor

* Make more tests pass

* Add docs

* update input_features to input_values + pad by default to max length

* Fix doc tests

* Add feature extractor tests

* Add proper padding/truncation to feature extractor

* Add support for conversion of all audioset checkpoints

* Improve docs and extend conversion script

* Fix README

* Rename spectogram to spectrogram

* Fix copies

* Add integration test

* Remove dummy conv

* Update to ast

* Update organization

* Fix init

* Rename model to AST

* Add require_torchaudio annotator

* Move import of ASTFeatureExtractor under a is_speech_available

* Fix rebase

* Add pipeline config

* Update name of classifier head

* Rename time_dimension and frequency_dimension for clarity

* Remove print statement

* Fix pipeline test

* Fix pipeline test

* Fix index table

* Fix init

* Fix conversion script

* Rename to ForAudioClassification

* Fix index table

Co-authored-by: Niels Rogge <nielsrogge@Nielss-MacBook-Pro.local>
NielsRogge
2022-11-21 18:58:54 +01:00
committed by GitHub
parent 1e3f17b5ab
commit 4973d2a04c
28 changed files with 2014 additions and 147 deletions

@@ -447,6 +447,8 @@
title: Vision models
- isExpanded: false
sections:
- local: model_doc/audio-spectrogram-transformer
title: Audio Spectrogram Transformer
- local: model_doc/hubert
title: Hubert
- local: model_doc/mctct

@@ -50,6 +50,7 @@ The documentation is organized into five sections:
<!--This list is updated automatically from the README with _make fix-copies_. Do not update manually! -->
1. **[ALBERT](model_doc/albert)** (from Google Research and the Toyota Technological Institute at Chicago) released with the paper [ALBERT: A Lite BERT for Self-supervised Learning of Language Representations](https://arxiv.org/abs/1909.11942), by Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, Radu Soricut.
1. **[Audio Spectrogram Transformer](model_doc/audio-spectrogram-transformer)** (from MIT) released with the paper [AST: Audio Spectrogram Transformer](https://arxiv.org/abs/2104.01778) by Yuan Gong, Yu-An Chung, James Glass.
1. **[BART](model_doc/bart)** (from Facebook) released with the paper [BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension](https://arxiv.org/abs/1910.13461) by Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov and Luke Zettlemoyer.
1. **[BARThez](model_doc/barthez)** (from École polytechnique) released with the paper [BARThez: a Skilled Pretrained French Sequence-to-Sequence Model](https://arxiv.org/abs/2010.12321) by Moussa Kamal Eddine, Antoine J.-P. Tixier, Michalis Vazirgiannis.
1. **[BARTpho](model_doc/bartpho)** (from VinAI Research) released with the paper [BARTpho: Pre-trained Sequence-to-Sequence Models for Vietnamese](https://arxiv.org/abs/2109.09701) by Nguyen Luong Tran, Duong Minh Le and Dat Quoc Nguyen.
@@ -216,150 +217,151 @@ Flax), PyTorch, and/or TensorFlow.
<!--This table is updated automatically from the auto modules with _make fix-copies_. Do not update manually!-->
| Model | Tokenizer slow | Tokenizer fast | PyTorch support | TensorFlow support | Flax Support |
|:---------------------------:|:--------------:|:--------------:|:---------------:|:------------------:|:------------:|
| ALBERT | ✅ | ✅ | ✅ | ✅ | ✅ |
| BART | | | ✅ | | |
| BEiT | | | ✅ | | ✅ |
| BERT | | | ✅ | | ✅ |
| Bert Generation | ✅ | | ✅ | | |
| BigBird | ✅ | | ✅ | ❌ | |
| BigBird-Pegasus | | | ✅ | ❌ | |
| Blenderbot | | | ✅ | | |
| BlenderbotSmall | ✅ | ✅ | ✅ | ✅ | ✅ |
| BLOOM | | ✅ | ✅ | | |
| CamemBERT | | ✅ | ✅ | | ❌ |
| CANINE | ✅ | | ✅ | | ❌ |
| CLIP | ✅ | | ✅ | | |
| CLIPSeg | | | ✅ | | |
| CodeGen | | | ✅ | ❌ | ❌ |
| Conditional DETR | | | ✅ | ❌ | ❌ |
| ConvBERT | | | ✅ | | ❌ |
| ConvNeXT | | | ✅ | ✅ | ❌ |
| CTRL | | ❌ | ✅ | ✅ | ❌ |
| CvT | | ❌ | ✅ | ✅ | ❌ |
| Data2VecAudio | ❌ | ❌ | ✅ | | ❌ |
| Data2VecText | ❌ | ❌ | ✅ | ❌ | ❌ |
| Data2VecVision | ❌ | ❌ | ✅ | | ❌ |
| DeBERTa | | | ✅ | ✅ | ❌ |
| DeBERTa-v2 | ✅ | ✅ | ✅ | ✅ | ❌ |
| Decision Transformer | | | ✅ | | ❌ |
| Deformable DETR | ❌ | ❌ | ✅ | ❌ | ❌ |
| DeiT | ❌ | ❌ | ✅ | | ❌ |
| DETR | ❌ | ❌ | ✅ | | ❌ |
| DiNAT | ❌ | ❌ | ✅ | ❌ | ❌ |
| DistilBERT | | | ✅ | | |
| DonutSwin | | | ✅ | | |
| DPR | | | ✅ | | ❌ |
| DPT | | | ✅ | | ❌ |
| ELECTRA | | | ✅ | | |
| Encoder decoder | | | ✅ | ✅ | ✅ |
| ERNIE | ❌ | ❌ | ✅ | | |
| ESM | | ❌ | ✅ | | ❌ |
| FairSeq Machine-Translation | ✅ | ❌ | ✅ | | ❌ |
| FlauBERT | ✅ | ❌ | ✅ | | ❌ |
| FLAVA | | ❌ | ✅ | | ❌ |
| FNet | | | ✅ | ❌ | ❌ |
| Funnel Transformer | ✅ | ✅ | ✅ | | ❌ |
| GLPN | | | ✅ | | ❌ |
| GPT Neo | ❌ | ❌ | ✅ | ❌ | |
| GPT NeoX | ❌ | | ✅ | ❌ | |
| GPT NeoX Japanese | | | ✅ | ❌ | ❌ |
| GPT-J | | ❌ | ✅ | | |
| GroupViT | ❌ | ❌ | ✅ | ✅ | |
| Hubert | ❌ | ❌ | ✅ | ✅ | ❌ |
| I-BERT | ❌ | ❌ | ✅ | | ❌ |
| ImageGPT | ❌ | ❌ | ✅ | ❌ | ❌ |
| Jukebox | | ❌ | ✅ | ❌ | ❌ |
| LayoutLM | ✅ | | ✅ | | ❌ |
| LayoutLMv2 | ✅ | ✅ | ✅ | | ❌ |
| LayoutLMv3 | ✅ | ✅ | ✅ | | ❌ |
| LED | ✅ | ✅ | ✅ | ✅ | ❌ |
| LeViT | | | ✅ | | ❌ |
| LiLT | ❌ | ❌ | ✅ | ❌ | ❌ |
| Longformer | | | ✅ | | ❌ |
| LongT5 | | | ✅ | | |
| LUKE | | ❌ | ✅ | ❌ | |
| LXMERT | ✅ | | ✅ | | ❌ |
| M-CTC-T | | | ✅ | | ❌ |
| M2M100 | | ❌ | ✅ | ❌ | ❌ |
| Marian | ✅ | ❌ | ✅ | | |
| MarkupLM | ✅ | | ✅ | | |
| MaskFormer | | | ✅ | ❌ | ❌ |
| mBART | | | ✅ | | |
| Megatron-BERT | | | ✅ | | |
| MobileBERT | | | ✅ | | ❌ |
| MobileNetV1 | | | ✅ | | ❌ |
| MobileNetV2 | ❌ | ❌ | ✅ | ❌ | ❌ |
| MobileViT | ❌ | ❌ | ✅ | | ❌ |
| MPNet | | | ✅ | ✅ | ❌ |
| MT5 | ✅ | ✅ | ✅ | ✅ | |
| MVP | ✅ | ✅ | ✅ | | |
| NAT | | | ✅ | ❌ | ❌ |
| Nezha | ❌ | ❌ | ✅ | ❌ | ❌ |
| Nyströmformer | ❌ | ❌ | ✅ | ❌ | ❌ |
| OpenAI GPT | | | ✅ | | ❌ |
| OpenAI GPT-2 | ✅ | ✅ | ✅ | ✅ | |
| OPT | | | ✅ | ✅ | ✅ |
| OWL-ViT | ❌ | ❌ | ✅ | | |
| Pegasus | | | ✅ | | |
| PEGASUS-X | | | ✅ | | |
| Perceiver | | ❌ | ✅ | ❌ | ❌ |
| PLBart | ✅ | ❌ | ✅ | ❌ | ❌ |
| PoolFormer | | ❌ | ✅ | ❌ | ❌ |
| ProphetNet | | ❌ | ✅ | ❌ | ❌ |
| QDQBert | | ❌ | ✅ | ❌ | ❌ |
| RAG | | ❌ | ✅ | | ❌ |
| REALM | ✅ | | ✅ | | ❌ |
| Reformer | ✅ | ✅ | ✅ | ❌ | ❌ |
| RegNet | | | ✅ | | ❌ |
| RemBERT | | | ✅ | ✅ | ❌ |
| ResNet | | | ✅ | ✅ | ❌ |
| RetriBERT | | | ✅ | | ❌ |
| RoBERTa | ✅ | ✅ | ✅ | | |
| RoCBert | ✅ | | ✅ | | |
| RoFormer | ✅ | | ✅ | | |
| SegFormer | | | ✅ | ✅ | |
| SEW | ❌ | ❌ | ✅ | | ❌ |
| SEW-D | ❌ | ❌ | ✅ | ❌ | ❌ |
| Speech Encoder decoder | ❌ | ❌ | ✅ | ❌ | |
| Speech2Text | | ❌ | ✅ | | |
| Speech2Text2 | ✅ | ❌ | | | ❌ |
| Splinter | ✅ | | | ❌ | ❌ |
| SqueezeBERT | ✅ | ✅ | ✅ | ❌ | ❌ |
| Swin Transformer | | | ✅ | | ❌ |
| Swin Transformer V2 | ❌ | ❌ | ✅ | | ❌ |
| SwitchTransformers | ❌ | ❌ | ✅ | ❌ | ❌ |
| T5 | | | ✅ | | |
| Table Transformer | | | ✅ | | |
| TAPAS | | ❌ | ✅ | | ❌ |
| Time Series Transformer | | ❌ | ✅ | | ❌ |
| Trajectory Transformer | ❌ | ❌ | ✅ | ❌ | ❌ |
| Transformer-XL | | ❌ | ✅ | | ❌ |
| TrOCR | | ❌ | ✅ | | ❌ |
| UniSpeech | ❌ | ❌ | ✅ | ❌ | ❌ |
| UniSpeechSat | ❌ | ❌ | ✅ | ❌ | ❌ |
| VAN | ❌ | ❌ | ✅ | ❌ | ❌ |
| VideoMAE | ❌ | ❌ | ✅ | ❌ | ❌ |
| ViLT | ❌ | ❌ | ✅ | ❌ | ❌ |
| Vision Encoder decoder | ❌ | ❌ | ✅ | | |
| VisionTextDualEncoder | ❌ | ❌ | ✅ | | ✅ |
| VisualBERT | ❌ | ❌ | ✅ | ❌ | |
| ViT | ❌ | ❌ | ✅ | | |
| ViTMAE | ❌ | ❌ | ✅ | ✅ | |
| ViTMSN | ❌ | ❌ | ✅ | | ❌ |
| Wav2Vec2 | | ❌ | ✅ | | |
| Wav2Vec2-Conformer | | ❌ | ✅ | | |
| WavLM | ❌ | ❌ | ✅ | ❌ | ❌ |
| Whisper | | ❌ | ✅ | | ❌ |
| X-CLIP | | ❌ | ✅ | | ❌ |
| XGLM | | | ✅ | | |
| XLM | ✅ | | ✅ | ✅ | |
| XLM-ProphetNet | ✅ | ❌ | ✅ | | ❌ |
| XLM-RoBERTa | ✅ | | ✅ | | |
| XLM-RoBERTa-XL | | | ✅ | | |
| XLNet | | | ✅ | | ❌ |
| YOLOS | | | ✅ | | ❌ |
| YOSO | ❌ | ❌ | ✅ | ❌ | ❌ |
| Model | Tokenizer slow | Tokenizer fast | PyTorch support | TensorFlow support | Flax Support |
|:-----------------------------:|:--------------:|:--------------:|:---------------:|:------------------:|:------------:|
| ALBERT | ✅ | ✅ | ✅ | ✅ | ✅ |
| Audio Spectrogram Transformer | | | ✅ | | |
| BART | | | ✅ | | ✅ |
| BEiT | | | ✅ | | ✅ |
| BERT | ✅ | | ✅ | | |
| Bert Generation | ✅ | | ✅ | ❌ | |
| BigBird | | | ✅ | ❌ | |
| BigBird-Pegasus | | | ✅ | | |
| Blenderbot | ✅ | ✅ | ✅ | ✅ | ✅ |
| BlenderbotSmall | | ✅ | ✅ | | |
| BLOOM | | ✅ | ✅ | | ❌ |
| CamemBERT | ✅ | | ✅ | | ❌ |
| CANINE | ✅ | | ✅ | | |
| CLIP | | | ✅ | | |
| CLIPSeg | | | ✅ | ❌ | ❌ |
| CodeGen | | | ✅ | ❌ | ❌ |
| Conditional DETR | | | ✅ | | ❌ |
| ConvBERT | | | ✅ | ✅ | ❌ |
| ConvNeXT | | ❌ | ✅ | ✅ | ❌ |
| CTRL | | ❌ | ✅ | ✅ | ❌ |
| CvT | ❌ | ❌ | ✅ | | ❌ |
| Data2VecAudio | ❌ | ❌ | ✅ | ❌ | ❌ |
| Data2VecText | ❌ | ❌ | ✅ | | ❌ |
| Data2VecVision | | | ✅ | ✅ | ❌ |
| DeBERTa | ✅ | ✅ | ✅ | ✅ | ❌ |
| DeBERTa-v2 | | | ✅ | | ❌ |
| Decision Transformer | ❌ | ❌ | ✅ | ❌ | ❌ |
| Deformable DETR | ❌ | ❌ | ✅ | | ❌ |
| DeiT | ❌ | ❌ | ✅ | | ❌ |
| DETR | ❌ | ❌ | ✅ | ❌ | ❌ |
| DiNAT | | | ✅ | | |
| DistilBERT | | | ✅ | | |
| DonutSwin | | | ✅ | | ❌ |
| DPR | | | ✅ | | ❌ |
| DPT | | | ✅ | | |
| ELECTRA | | | ✅ | ✅ | ✅ |
| Encoder decoder | ❌ | ❌ | ✅ | | |
| ERNIE | | ❌ | ✅ | | ❌ |
| ESM | ✅ | ❌ | ✅ | | ❌ |
| FairSeq Machine-Translation | ✅ | ❌ | ✅ | | ❌ |
| FlauBERT | | ❌ | ✅ | | ❌ |
| FLAVA | | | ✅ | ❌ | ❌ |
| FNet | ✅ | ✅ | ✅ | | ❌ |
| Funnel Transformer | | | ✅ | | ❌ |
| GLPN | ❌ | ❌ | ✅ | ❌ | |
| GPT Neo | ❌ | | ✅ | ❌ | |
| GPT NeoX | | | ✅ | ❌ | ❌ |
| GPT NeoX Japanese | | ❌ | ✅ | | |
| GPT-J | ❌ | ❌ | ✅ | ✅ | |
| GroupViT | ❌ | ❌ | ✅ | ✅ | ❌ |
| Hubert | ❌ | ❌ | ✅ | | ❌ |
| I-BERT | ❌ | ❌ | ✅ | ❌ | ❌ |
| ImageGPT | | ❌ | ✅ | ❌ | ❌ |
| Jukebox | ✅ | | ✅ | | ❌ |
| LayoutLM | ✅ | ✅ | ✅ | | ❌ |
| LayoutLMv2 | ✅ | ✅ | ✅ | | ❌ |
| LayoutLMv3 | ✅ | ✅ | ✅ | ✅ | ❌ |
| LED | | | ✅ | | ❌ |
| LeViT | ❌ | ❌ | ✅ | ❌ | ❌ |
| LiLT | | | ✅ | | ❌ |
| Longformer | | | ✅ | | |
| LongT5 | | ❌ | ✅ | ❌ | |
| LUKE | ✅ | | ✅ | | ❌ |
| LXMERT | | | ✅ | | ❌ |
| M-CTC-T | | ❌ | ✅ | ❌ | ❌ |
| M2M100 | ✅ | ❌ | ✅ | | |
| Marian | ✅ | | ✅ | | |
| MarkupLM | | | ✅ | ❌ | ❌ |
| MaskFormer | | | ✅ | | |
| mBART | | | ✅ | | |
| Megatron-BERT | | | ✅ | | ❌ |
| MobileBERT | | | ✅ | | ❌ |
| MobileNetV1 | ❌ | ❌ | ✅ | ❌ | ❌ |
| MobileNetV2 | ❌ | ❌ | ✅ | | ❌ |
| MobileViT | | | ✅ | ✅ | ❌ |
| MPNet | ✅ | ✅ | ✅ | ✅ | |
| MT5 | ✅ | ✅ | ✅ | | |
| MVP | | | ✅ | ❌ | ❌ |
| NAT | ❌ | ❌ | ✅ | ❌ | ❌ |
| Nezha | ❌ | ❌ | ✅ | ❌ | ❌ |
| Nyströmformer | | | ✅ | | ❌ |
| OpenAI GPT | ✅ | ✅ | ✅ | ✅ | |
| OpenAI GPT-2 | | | ✅ | ✅ | ✅ |
| OPT | ❌ | ❌ | ✅ | | |
| OWL-ViT | | | ✅ | | |
| Pegasus | | | ✅ | | |
| PEGASUS-X | | ❌ | ✅ | ❌ | ❌ |
| Perceiver | ✅ | ❌ | ✅ | ❌ | ❌ |
| PLBart | | ❌ | ✅ | ❌ | ❌ |
| PoolFormer | | ❌ | ✅ | ❌ | ❌ |
| ProphetNet | | ❌ | ✅ | ❌ | ❌ |
| QDQBert | | ❌ | ✅ | | ❌ |
| RAG | ✅ | | ✅ | | ❌ |
| REALM | ✅ | ✅ | ✅ | ❌ | ❌ |
| Reformer | | | ✅ | | ❌ |
| RegNet | | | ✅ | ✅ | ❌ |
| RemBERT | | | ✅ | ✅ | ❌ |
| ResNet | | | ✅ | | ❌ |
| RetriBERT | ✅ | ✅ | ✅ | | |
| RoBERTa | ✅ | | ✅ | | |
| RoCBert | ✅ | | ✅ | | |
| RoFormer | | | ✅ | ✅ | |
| SegFormer | ❌ | ❌ | ✅ | | ❌ |
| SEW | ❌ | ❌ | ✅ | ❌ | ❌ |
| SEW-D | ❌ | ❌ | ✅ | ❌ | |
| Speech Encoder decoder | | ❌ | ✅ | | |
| Speech2Text | ✅ | ❌ | | | ❌ |
| Speech2Text2 | ✅ | | | ❌ | ❌ |
| Splinter | ✅ | ✅ | ✅ | ❌ | ❌ |
| SqueezeBERT | | | ✅ | | ❌ |
| Swin Transformer | ❌ | ❌ | ✅ | | ❌ |
| Swin Transformer V2 | ❌ | ❌ | ✅ | ❌ | ❌ |
| SwitchTransformers | | | ✅ | | |
| T5 | | | ✅ | | |
| Table Transformer | | ❌ | ✅ | | ❌ |
| TAPAS | | ❌ | ✅ | | ❌ |
| Time Series Transformer | ❌ | ❌ | ✅ | ❌ | ❌ |
| Trajectory Transformer | | ❌ | ✅ | | ❌ |
| Transformer-XL | | ❌ | ✅ | | ❌ |
| TrOCR | ❌ | ❌ | ✅ | ❌ | ❌ |
| UniSpeech | ❌ | ❌ | ✅ | ❌ | ❌ |
| UniSpeechSat | ❌ | ❌ | ✅ | ❌ | ❌ |
| VAN | ❌ | ❌ | ✅ | ❌ | ❌ |
| VideoMAE | ❌ | ❌ | ✅ | ❌ | ❌ |
| ViLT | ❌ | ❌ | ✅ | | |
| Vision Encoder decoder | ❌ | ❌ | ✅ | | ✅ |
| VisionTextDualEncoder | ❌ | ❌ | ✅ | ❌ | |
| VisualBERT | ❌ | ❌ | ✅ | | |
| ViT | ❌ | ❌ | ✅ | ✅ | |
| ViTMAE | ❌ | ❌ | ✅ | | ❌ |
| ViTMSN | | ❌ | ✅ | | |
| Wav2Vec2 | | ❌ | ✅ | | |
| Wav2Vec2-Conformer | ❌ | ❌ | ✅ | ❌ | ❌ |
| WavLM | | ❌ | ✅ | | ❌ |
| Whisper | | ❌ | ✅ | | ❌ |
| X-CLIP | | | ✅ | | |
| XGLM | ✅ | | ✅ | ✅ | |
| XLM | ✅ | ❌ | ✅ | | ❌ |
| XLM-ProphetNet | ✅ | | ✅ | | |
| XLM-RoBERTa | | | ✅ | | |
| XLM-RoBERTa-XL | | | ✅ | | ❌ |
| XLNet | | | ✅ | | ❌ |
| YOLOS | ❌ | ❌ | ✅ | ❌ | ❌ |
| YOSO | ❌ | ❌ | ✅ | ❌ | ❌ |
<!-- End table-->

@@ -0,0 +1,60 @@
<!--Copyright 2022 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# Audio Spectrogram Transformer
## Overview
The Audio Spectrogram Transformer model was proposed in [AST: Audio Spectrogram Transformer](https://arxiv.org/abs/2104.01778) by Yuan Gong, Yu-An Chung, James Glass.
The Audio Spectrogram Transformer applies a [Vision Transformer](vit) to audio by first converting the audio into an image (a spectrogram). According to the paper, the model obtains state-of-the-art results
for audio classification.
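
To make the "spectrogram as an image" idea concrete, the sketch below counts how many patches a Vision Transformer style patch embedding would produce from one spectrogram. The patch size of 16, the stride of 10 (overlapping patches), the 128 mel bins and the 1024 time frames are assumptions based on the default setup described in the paper; they are not stated on this page, and `num_patches` is a hypothetical helper, not part of the library.

```python
def num_patches(num_frames, num_mel_bins, patch_size=16, stride=10):
    """Count the patches along each axis of a (mel_bins x frames) spectrogram.

    Uses the same arithmetic as a strided 2D convolution without padding.
    All default values are assumptions for illustration.
    """
    if num_frames < patch_size or num_mel_bins < patch_size:
        raise ValueError("spectrogram is smaller than a single patch")
    time_patches = (num_frames - patch_size) // stride + 1
    freq_patches = (num_mel_bins - patch_size) // stride + 1
    return freq_patches, time_patches


# Assumed input size: 128 mel bins and 1024 frames
freq_patches, time_patches = num_patches(num_frames=1024, num_mel_bins=128)
sequence_length = freq_patches * time_patches
print(freq_patches, time_patches, sequence_length)  # 12 101 1212
```

Each patch is then flattened and projected to an embedding, so under these assumptions the Transformer sees a sequence of 1212 patch embeddings, before any special tokens are added.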
The abstract from the paper is the following:
*In the past decade, convolutional neural networks (CNNs) have been widely adopted as the main building block for end-to-end audio classification models, which aim to learn a direct mapping from audio spectrograms to corresponding labels. To better capture long-range global context, a recent trend is to add a self-attention mechanism on top of the CNN, forming a CNN-attention hybrid model. However, it is unclear whether the reliance on a CNN is necessary, and if neural networks purely based on attention are sufficient to obtain good performance in audio classification. In this paper, we answer the question by introducing the Audio Spectrogram Transformer (AST), the first convolution-free, purely attention-based model for audio classification. We evaluate AST on various audio classification benchmarks, where it achieves new state-of-the-art results of 0.485 mAP on AudioSet, 95.6% accuracy on ESC-50, and 98.1% accuracy on Speech Commands V2.*
Tips:
- When fine-tuning the Audio Spectrogram Transformer (AST) on your own dataset, it's recommended to take care of the input normalization (to make
sure the input has a mean of 0 and a std of 0.5). [`ASTFeatureExtractor`] takes care of this. Note that it uses the AudioSet
mean and std by default. You can check [`ast/src/get_norm_stats.py`](https://github.com/YuanGongND/ast/blob/master/src/get_norm_stats.py) to see how
the authors compute the stats for a downstream dataset.
- Note that the AST needs a low learning rate (the authors use a 10 times smaller learning rate compared to their CNN model proposed in the
[PSLA paper](https://arxiv.org/abs/2102.01243)) and converges quickly, so please search for a suitable learning rate and learning rate scheduler for your task.
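
The normalization target from the first tip (mean of 0, std of 0.5) can be sketched with NumPy as follows. This is a minimal illustration and not the authors' `get_norm_stats.py` script: the helper names `compute_norm_stats` and `normalize` are hypothetical, the random arrays only stand in for real log-mel spectrograms, and the statistics are computed over all values at once as a simplification.

```python
import numpy as np


def compute_norm_stats(spectrograms):
    """Return the mean and std over every value of a list of 2D spectrograms."""
    values = np.concatenate([s.ravel() for s in spectrograms])
    return float(values.mean()), float(values.std())


def normalize(spectrogram, mean, std):
    # Subtracting the mean centers the data at 0;
    # dividing by 2 * std gives a standard deviation of 0.5.
    return (spectrogram - mean) / (2.0 * std)


# Stand-in data: 8 fake spectrograms of shape (frames, mel_bins)
rng = np.random.default_rng(0)
dataset = [rng.normal(loc=-5.0, scale=4.0, size=(100, 128)) for _ in range(8)]

mean, std = compute_norm_stats(dataset)
normalized = np.concatenate([normalize(s, mean, std).ravel() for s in dataset])
print(round(float(normalized.mean()), 3), round(float(normalized.std()), 3))
```

Statistics computed this way for a downstream dataset would be used in place of the AudioSet defaults when preparing inputs.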
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/audio_spectogram_transformer_architecture.png"
alt="drawing" width="600"/>
<small> Audio Spectrogram Transformer architecture. Taken from the <a href="https://arxiv.org/abs/2104.01778">original paper</a>.</small>
This model was contributed by [nielsr](https://huggingface.co/nielsr).
The original code can be found [here](https://github.com/YuanGongND/ast).
## ASTConfig
[[autodoc]] ASTConfig
## ASTFeatureExtractor
[[autodoc]] ASTFeatureExtractor
- __call__
## ASTModel
[[autodoc]] ASTModel
- forward
## ASTForAudioClassification
[[autodoc]] ASTForAudioClassification
- forward