ast-finetuned-audioset-14-14-0.443

MIT

Introduction

The Audio Spectrogram Transformer (AST) is an audio classification model, and this checkpoint is fine-tuned on AudioSet. It works like the Vision Transformer (ViT), but applied to audio: the waveform is first converted into a spectrogram, which is then treated as an image. The authors report state-of-the-art results on several audio classification benchmarks.

Architecture

AST first converts the audio into a spectrogram, a time-frequency representation that can be handled like an image. The spectrogram is then fed into a Vision Transformer, which classifies the clip into the categories defined by AudioSet.
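
To make the "spectrogram as image" idea concrete, the sketch below counts how many patches a ViT-style model would cut from one spectrogram. All sizes are assumptions for illustration and are not read from the model configuration: 128 mel bins, 1024 time frames, 16x16 patches, and a stride of 14 on both axes (one reading of the two 14s in the model name). `patch_grid` is a hypothetical helper, not part of any library.

```python
def patch_grid(n_mels=128, n_frames=1024, patch=16, f_stride=14, t_stride=14):
    """Return (patches along frequency, patches along time) for a strided split."""
    # Default sizes are illustrative assumptions, not values from the checkpoint.
    n_freq = (n_mels - patch) // f_stride + 1
    n_time = (n_frames - patch) // t_stride + 1
    return n_freq, n_time


n_freq, n_time = patch_grid()
# Each patch becomes one token in the Transformer's input sequence.
print(n_freq, n_time, n_freq * n_time)  # 9 73 657
```

Under these assumptions a smaller stride means more overlap between patches and a longer token sequence, which costs more compute per clip.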

Training

The model was introduced by Gong et al. in the paper "AST: Audio Spectrogram Transformer". This checkpoint was fine-tuned on AudioSet, a large-scale dataset of labeled audio events.

Guide: Running Locally

  1. Installation: Ensure you have Python and PyTorch installed. Install the transformers library from Hugging Face using pip:

    pip install transformers
    
  2. Load the Model: Use the transformers library to load the AST model:

    from transformers import ASTForAudioClassification, ASTFeatureExtractor
    model = ASTForAudioClassification.from_pretrained("MIT/ast-finetuned-audioset-14-14-0.443")
    feature_extractor = ASTFeatureExtractor.from_pretrained("MIT/ast-finetuned-audioset-14-14-0.443")
    
  3. Inference: Prepare your audio data and pass it through the model for classification:

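    import torch

    # `audio` is assumed to be a 1-D float waveform already sampled at 16 kHz
    inputs = feature_extractor(audio, sampling_rate=16000, return_tensors="pt")
    with torch.no_grad():  # inference only, no gradients needed
        outputs = model(**inputs)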
    
  4. Hardware: While the model can run on a CPU, using a GPU can significantly speed up processing times. Cloud services like AWS, Google Cloud, or Azure provide GPUs for rent.
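
The `outputs` object from step 3 carries raw logits, one per AudioSet class. A minimal sketch of turning them into readable labels follows. It assumes a multi-label reading of the logits (an independent sigmoid per class), and `top_k_labels` is a hypothetical helper. The demo values are made up and stand in for `outputs.logits[0].tolist()` and `model.config.id2label`.

```python
import math


def top_k_labels(logits, id2label, k=5):
    """Rank labels by independent sigmoid scores and return the top k."""
    scores = [1.0 / (1.0 + math.exp(-x)) for x in logits]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [(id2label[i], round(scores[i], 3)) for i in ranked[:k]]


# Hypothetical logits and label map, used only to show the call shape.
demo_logits = [2.0, -1.0, 0.0]
demo_labels = {0: "Speech", 1: "Music", 2: "Dog"}
print(top_k_labels(demo_logits, demo_labels, k=2))
# [('Speech', 0.881), ('Dog', 0.5)]
```

If only the single most likely class is needed, taking the index of the largest logit gives the same top entry without computing the sigmoid.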

License

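The Audio Spectrogram Transformer model is released under the BSD-3-Clause license, which permits use, modification, and redistribution provided the copyright notice, license conditions, and disclaimer are retained, and the authors' names are not used for endorsement without permission.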
