MIT/ast-finetuned-audioset-14-14-0.443
Introduction
The Audio Spectrogram Transformer (AST) is an audio classification model fine-tuned on AudioSet. It is architecturally similar to the Vision Transformer (ViT) but tailored to audio: waveforms are converted into spectrogram images before being passed to the transformer. The model achieves state-of-the-art performance on several audio classification benchmarks.
Architecture
The AST model first converts audio into a visual representation, a spectrogram. The spectrogram is then split into patches and processed by a Vision Transformer-style encoder, an architecture adept at handling image data, which allows the model to classify audio into the categories defined by AudioSet.
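To make the input pipeline concrete, here is a minimal sketch, assuming a 16 kHz mono waveform held in a NumPy array (random noise stands in for real audio). It shows how the feature extractor turns raw audio into the fixed-size spectrogram that the transformer consumes:

    import numpy as np
    from transformers import ASTFeatureExtractor

    # Hypothetical 10-second mono clip at 16 kHz (random noise as a placeholder for real audio).
    waveform = np.random.randn(16000 * 10).astype(np.float32)

    feature_extractor = ASTFeatureExtractor.from_pretrained("MIT/ast-finetuned-audioset-14-14-0.443")

    # The extractor computes log-mel filterbank features and pads or crops them to a fixed
    # number of frames, producing the spectrogram "image" the ViT-style encoder splits into patches.
    inputs = feature_extractor(waveform, sampling_rate=16000, return_tensors="pt")
    print(inputs["input_values"].shape)  # (batch, num_frames, num_mel_bins)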
Training
The model was introduced by Gong et al. in the paper AST: Audio Spectrogram Transformer. It has been fine-tuned on AudioSet, a large-scale dataset of labeled audio events, to improve its accuracy on audio classification benchmarks.
Guide: Running Locally
- Installation: Ensure you have Python and PyTorch installed, then install the transformers library from Hugging Face using pip:

      pip install transformers

- Load the Model: Use the transformers library to load the AST model and its feature extractor:

      from transformers import ASTForAudioClassification, ASTFeatureExtractor

      model = ASTForAudioClassification.from_pretrained("MIT/ast-finetuned-audioset-14-14-0.443")
      feature_extractor = ASTFeatureExtractor.from_pretrained("MIT/ast-finetuned-audioset-14-14-0.443")

- Inference: Prepare your audio data and pass it through the model for classification (a complete end-to-end sketch follows this list):

      # Assuming the audio waveform is already loaded and resampled to 16 kHz
      inputs = feature_extractor(audio, sampling_rate=16000, return_tensors="pt")
      outputs = model(**inputs)

- Hardware: The model runs on a CPU, but a GPU significantly speeds up inference. Cloud providers such as AWS, Google Cloud, and Azure offer GPU instances for rent.
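Putting the steps together, the following is a minimal end-to-end sketch. It assumes a local file named example.wav (a hypothetical path) and uses librosa for loading and resampling; any loader that yields a mono waveform at the extractor's sampling rate works equally well:

    import torch
    import librosa
    from transformers import ASTFeatureExtractor, ASTForAudioClassification

    model_id = "MIT/ast-finetuned-audioset-14-14-0.443"
    feature_extractor = ASTFeatureExtractor.from_pretrained(model_id)
    model = ASTForAudioClassification.from_pretrained(model_id)
    model.eval()

    # "example.wav" is a placeholder; librosa resamples to the 16 kHz rate the extractor expects.
    waveform, _ = librosa.load("example.wav", sr=feature_extractor.sampling_rate)

    inputs = feature_extractor(waveform, sampling_rate=feature_extractor.sampling_rate, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits

    # Map the highest-scoring logit to its AudioSet label.
    predicted_id = logits.argmax(-1).item()
    print(model.config.id2label[predicted_id])

Note that AudioSet is a multi-label dataset, so for multi-label use cases you may prefer applying a sigmoid to the logits and thresholding rather than taking a single argmax.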
License
The Audio Spectrogram Transformer model is released under the BSD-3-Clause license, allowing for modification and distribution with certain conditions.