Wav2Vec2 Base 100K GTZAN Music Genres

m3hrdadfi

Introduction

The WAV2VEC2-BASE-100K-GTZAN-MUSIC-GENRES model by m3hrdadfi is designed for music genre classification using the Wav2Vec 2.0 architecture. It adapts a speech encoder originally developed for automatic speech recognition to audio classification, and is implemented with the PyTorch and Hugging Face Transformers libraries.

Architecture

The model is based on the Wav2Vec 2.0 architecture, a framework for self-supervised learning of speech representations. A convolutional feature encoder turns the raw waveform into latent frames, a Transformer encoder contextualizes them, and a classification head maps the pooled representation to a music genre label. As the checkpoint name indicates, the backbone is the wav2vec2-base-100k-voxpopuli encoder, pretrained on unlabeled VoxPopuli speech.
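
The classification head itself is not part of the transformers library; the checkpoint was trained with a custom Wav2Vec2ForSpeechClassification class from the author's training code. Below is a minimal sketch of what such a class can look like. The mean-pooling strategy and the head layout (dense, tanh, dropout, projection) are assumptions based on common practice, not the author's exact implementation:

import torch
import torch.nn as nn
from transformers import Wav2Vec2Model, Wav2Vec2PreTrainedModel
from transformers.modeling_outputs import SequenceClassifierOutput

class Wav2Vec2ForSpeechClassification(Wav2Vec2PreTrainedModel):
    """Wav2Vec 2.0 encoder with a pooled classification head (sketch)."""

    def __init__(self, config):
        super().__init__(config)
        self.wav2vec2 = Wav2Vec2Model(config)
        # Head: dense -> tanh -> dropout -> projection to num_labels genres.
        self.dense = nn.Linear(config.hidden_size, config.hidden_size)
        self.dropout = nn.Dropout(config.final_dropout)
        self.out_proj = nn.Linear(config.hidden_size, config.num_labels)
        self.init_weights()

    def forward(self, input_values, attention_mask=None):
        hidden_states = self.wav2vec2(
            input_values, attention_mask=attention_mask
        ).last_hidden_state
        pooled = hidden_states.mean(dim=1)  # mean-pool over time frames
        x = torch.tanh(self.dense(self.dropout(pooled)))
        logits = self.out_proj(self.dropout(x))
        return SequenceClassifierOutput(logits=logits)

For the checkpoint's head weights to load via from_pretrained, the attribute names must match those used at training time; the author's repository has the canonical definition.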

Training

The model was fine-tuned on the GTZAN music genre dataset, which covers ten genres: blues, classical, country, disco, hip hop, jazz, metal, pop, reggae, and rock. Performance was assessed with per-genre precision, recall, and F1-score, and the model achieves an overall accuracy of 77.5%.
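
Such metrics can be reproduced with a standard classification report once predictions have been collected for a held-out split. A minimal sketch using scikit-learn (assuming it is installed; the label lists here are placeholders, not the author's evaluation data):

from sklearn.metrics import classification_report

# Placeholder labels; in practice, collect y_true from the GTZAN test split
# and y_pred by running the model over the corresponding audio files.
y_true = ["disco", "rock", "jazz", "blues"]
y_pred = ["disco", "metal", "jazz", "blues"]

# Prints per-genre precision, recall, and F1, plus overall accuracy.
print(classification_report(y_true, y_pred))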

Guide: Running Locally

Requirements

To run the model locally, the following packages are required:

  • Hugging Face's datasets
  • Hugging Face's transformers
  • torchaudio
  • librosa

Install them using:

!pip install git+https://github.com/huggingface/datasets.git
!pip install git+https://github.com/huggingface/transformers.git
!pip install torchaudio
!pip install librosa

Prediction

  1. Load the model and feature extractor:

    from transformers import AutoConfig, Wav2Vec2FeatureExtractor
    # Wav2Vec2ForSpeechClassification is not part of transformers; it is the
    # custom class from the author's training code (sketched in the
    # Architecture section above) and must be defined before this step.
    import torch
    
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model_name_or_path = "m3hrdadfi/wav2vec2-base-100k-voxpopuli-gtzan-music"
    config = AutoConfig.from_pretrained(model_name_or_path)
    feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained(model_name_or_path)
    model = Wav2Vec2ForSpeechClassification.from_pretrained(model_name_or_path).to(device)
    
  2. Define helper functions and run prediction:

    import torchaudio
    
    def speech_file_to_array_fn(path, sampling_rate):
        # Load the file and resample it to the rate the feature extractor expects.
        speech_array, _sampling_rate = torchaudio.load(path)
        resampler = torchaudio.transforms.Resample(_sampling_rate, sampling_rate)
        speech = resampler(speech_array).squeeze().numpy()
        return speech
    
    def predict(path, sampling_rate):
        speech = speech_file_to_array_fn(path, sampling_rate)
        inputs = feature_extractor(speech, sampling_rate=sampling_rate, return_tensors="pt", padding=True)
        inputs = {key: inputs[key].to(device) for key in inputs}
    
        with torch.no_grad():
            logits = model(**inputs).logits
    
        scores = torch.nn.functional.softmax(logits, dim=1).detach().cpu().numpy()[0]
        outputs = [{"Label": config.id2label[i], "Score": f"{score * 100:.1f}%"} for i, score in enumerate(scores)]
        return outputs
    
    path = "genres_original/disco/disco.00067.wav"
    outputs = predict(path, feature_extractor.sampling_rate)
    print(outputs)
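
The printed outputs list one score per genre. To report only the most likely genre, the percentage strings can be compared numerically (a small convenience on top of the snippet above, not part of the original example):

# Pick the entry with the highest score; strip the "%" before comparing.
best = max(outputs, key=lambda o: float(o["Score"].rstrip("%")))
print(f"Predicted genre: {best['Label']} ({best['Score']})")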
    

Suggested Cloud GPUs

For optimal performance, consider using cloud GPUs such as NVIDIA Tesla V100 or A100, available on platforms like AWS, Google Cloud, or Azure.

License

Specific licensing details are not stated here; consult the project repository and review its license terms before use to ensure compliance.
