ast-finetuned-audioset-10-10-0.4593
MIT
Introduction
The Audio Spectrogram Transformer (AST) is an audio classification model; this checkpoint is fine-tuned on AudioSet. Audio is first converted into a spectrogram, which is then processed much like an image by a Vision Transformer (ViT) architecture. The authors report state-of-the-art results on several audio classification benchmarks.
Architecture
The AST model adapts the Vision Transformer (ViT) architecture to audio data. Audio signals are first converted into spectrograms, which are two-dimensional time-frequency representations of the audio. The spectrogram is then treated as an image: it is fed to the transformer in the same way ViT consumes image data, and the transformer's output is used for classification.
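To make the "spectrogram as image" idea concrete, here is a minimal sketch of how many patches a ViT-style front end would cut from one spectrogram. The numbers are assumptions taken from the default setup described in the AST paper, not from this page: 128 mel bins, 1024 time frames, 16x16 patches, and a stride of 10 along both frequency and time (presumably what the two 10s in the model name refer to). The function name is hypothetical.

```python
def count_patches(n_mels=128, n_frames=1024, patch=16, f_stride=10, t_stride=10):
    """Count overlapping patches cut from an (n_mels x n_frames) spectrogram.

    All defaults are assumptions based on the AST paper's standard setup.
    Returns (patches along frequency, patches along time, total patches).
    """
    # Standard sliding-window formula with no padding.
    along_freq = (n_mels - patch) // f_stride + 1
    along_time = (n_frames - patch) // t_stride + 1
    return along_freq, along_time, along_freq * along_time


print(count_patches())  # (12, 101, 1212)
```

Under these assumptions each spectrogram becomes a sequence of 1212 patch tokens, which the transformer processes just as ViT processes image patches.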
Training
The model was fine-tuned on AudioSet, a large-scale dataset for audio event detection. AST was introduced by Gong et al. in the paper "AST: Audio Spectrogram Transformer", and the original code is maintained in the GitHub repository linked by the authors. Fine-tuning adapts the pre-trained AST weights to AudioSet's classification task.
Guide: Running Locally
To run the AST model locally, follow these steps:
- Install Dependencies: Ensure Python and PyTorch are installed on your system. Install the Transformers library from Hugging Face using:
pip install transformers
- Load the Model: Use the Transformers library to download and initialize the AST model.
- Prepare Input Data: Convert your audio data into spectrograms suitable for input into the model.
- Inference: Use the model to classify audio inputs into predefined classes.
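The steps above can be sketched roughly as follows. This is an illustrative outline, not the authors' reference code. It assumes the checkpoint id on the Hugging Face Hub is MIT/ast-finetuned-audioset-10-10-0.4593, that the model expects 16 kHz mono audio, and that NumPy and PyTorch are installed (some Transformers versions also need torchaudio for the feature extractor). The helper names make_test_tone and classify are hypothetical, and applying a sigmoid to the logits is a choice made here because AudioSet clips can carry several labels.

```python
import math

MODEL_ID = "MIT/ast-finetuned-audioset-10-10-0.4593"  # assumed Hub id
SAMPLE_RATE = 16_000  # assumed input rate: 16 kHz mono


def make_test_tone(freq_hz=440.0, seconds=1.0, sample_rate=SAMPLE_RATE):
    """Return a mono sine wave as a list of floats (stand-in for real audio)."""
    n = int(seconds * sample_rate)
    return [math.sin(2 * math.pi * freq_hz * i / sample_rate) for i in range(n)]


def classify(waveform, sample_rate=SAMPLE_RATE, top_k=3):
    """Return the top_k (label, score) pairs for a mono waveform."""
    # Heavy dependencies are imported lazily so the helper above works alone.
    import numpy as np
    import torch
    from transformers import ASTFeatureExtractor, ASTForAudioClassification

    # Load the model: weights are downloaded on first use.
    extractor = ASTFeatureExtractor.from_pretrained(MODEL_ID)
    model = ASTForAudioClassification.from_pretrained(MODEL_ID)
    model.eval()

    # Prepare input data: the extractor turns the waveform into a spectrogram.
    inputs = extractor(
        np.asarray(waveform, dtype=np.float32),
        sampling_rate=sample_rate,
        return_tensors="pt",
    )

    # Inference: one logit per class; sigmoid gives independent per-class scores.
    with torch.no_grad():
        logits = model(**inputs).logits[0]
    scores, indices = torch.sigmoid(logits).topk(top_k)
    return [(model.config.id2label[i.item()], s.item())
            for s, i in zip(scores, indices)]


# Example call (needs network access on first run):
# print(classify(make_test_tone()))
```

To classify a real recording, load it as a mono float array, resample it to the assumed 16 kHz, and pass it to classify in place of the test tone.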
For efficient execution, especially with large datasets or real-time classification, consider using cloud GPUs provided by platforms such as AWS, Google Cloud, or Azure.
License
The Audio Spectrogram Transformer model is released under the BSD-3-Clause license. This license permits redistribution and use in source and binary forms, with or without modification, under certain conditions.