Wav2Vec2 Base 100K GTZAN Music Genres

m3hrdadfi

Introduction

The WAV2VEC2-BASE-100K-GTZAN-MUSIC-GENRES model by m3hrdadfi is designed for music genre classification using the Wav2Vec 2.0 architecture. It adapts a speech encoder originally developed for automatic speech recognition to audio classification, and is implemented with the PyTorch and Hugging Face Transformers libraries.

Architecture

The model is based on the Wav2Vec 2.0 architecture, a framework for self-supervised learning of speech representations. A convolutional feature encoder turns the raw waveform into latent frames, a Transformer encoder contextualizes them, and a classification head maps the pooled representation to a music genre label. As the checkpoint name indicates, the backbone is the wav2vec2-base-100k-voxpopuli encoder, pretrained on unlabeled VoxPopuli speech.
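
The classification head itself is not part of the transformers library; the checkpoint was trained with a custom Wav2Vec2ForSpeechClassification class from the author's training code. Below is a minimal sketch of what such a class can look like. The mean-pooling strategy and the head layout (dense, tanh, dropout, projection) are assumptions based on common practice, not the author's exact implementation:

import torch
import torch.nn as nn
from transformers import Wav2Vec2Model, Wav2Vec2PreTrainedModel
from transformers.modeling_outputs import SequenceClassifierOutput

class Wav2Vec2ForSpeechClassification(Wav2Vec2PreTrainedModel):
    """Wav2Vec 2.0 encoder with a pooled classification head (sketch)."""

    def __init__(self, config):
        super().__init__(config)
        self.wav2vec2 = Wav2Vec2Model(config)
        # Head: dense -> tanh -> dropout -> projection to num_labels genres.
        self.dense = nn.Linear(config.hidden_size, config.hidden_size)
        self.dropout = nn.Dropout(config.final_dropout)
        self.out_proj = nn.Linear(config.hidden_size, config.num_labels)
        self.init_weights()

    def forward(self, input_values, attention_mask=None):
        hidden_states = self.wav2vec2(
            input_values, attention_mask=attention_mask
        ).last_hidden_state
        pooled = hidden_states.mean(dim=1)  # mean-pool over time frames
        x = torch.tanh(self.dense(self.dropout(pooled)))
        logits = self.out_proj(self.dropout(x))
        return SequenceClassifierOutput(logits=logits)

For the checkpoint's head weights to load via from_pretrained, the attribute names must match those used at training time; the author's repository has the canonical definition.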

Training

The model was fine-tuned on the GTZAN music genre dataset, which covers ten genres: blues, classical, country, disco, hip hop, jazz, metal, pop, reggae, and rock. Performance was assessed with per-genre precision, recall, and F1-score, and the model achieves an overall accuracy of 77.5%.
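
Such metrics can be reproduced with a standard classification report once predictions have been collected for a held-out split. A minimal sketch using scikit-learn (assuming it is installed; the label lists here are placeholders, not the author's evaluation data):

from sklearn.metrics import classification_report

# Placeholder labels; in practice, collect y_true from the GTZAN test split
# and y_pred by running the model over the corresponding audio files.
y_true = ["disco", "rock", "jazz", "blues"]
y_pred = ["disco", "metal", "jazz", "blues"]

# Prints per-genre precision, recall, and F1, plus overall accuracy.
print(classification_report(y_true, y_pred))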

Guide: Running Locally

Requirements

To run the model locally, the following packages are required:

  • Hugging Face's datasets
  • Hugging Face's transformers
  • torchaudio
  • librosa

Install them using:

!pip install git+https://github.com/huggingface/datasets.git
!pip install git+https://github.com/huggingface/transformers.git
!pip install torchaudio
!pip install librosa

Prediction

  1. Load the model and feature extractor:

    from transformers import AutoConfig, Wav2Vec2FeatureExtractor
    # Wav2Vec2ForSpeechClassification is not part of transformers; it is the
    # custom class from the author's training code (sketched in the
    # Architecture section above) and must be defined before this step.
    import torch
    
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model_name_or_path = "m3hrdadfi/wav2vec2-base-100k-voxpopuli-gtzan-music"
    config = AutoConfig.from_pretrained(model_name_or_path)
    feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained(model_name_or_path)
    model = Wav2Vec2ForSpeechClassification.from_pretrained(model_name_or_path).to(device)
    
  2. Define helper functions and run prediction:

    import torchaudio
    
    def speech_file_to_array_fn(path, sampling_rate):
        # Load the file and resample it to the rate the feature extractor expects.
        speech_array, _sampling_rate = torchaudio.load(path)
        resampler = torchaudio.transforms.Resample(_sampling_rate, sampling_rate)
        speech = resampler(speech_array).squeeze().numpy()
        return speech
    
    def predict(path, sampling_rate):
        speech = speech_file_to_array_fn(path, sampling_rate)
        inputs = feature_extractor(speech, sampling_rate=sampling_rate, return_tensors="pt", padding=True)
        inputs = {key: inputs[key].to(device) for key in inputs}
    
        with torch.no_grad():
            logits = model(**inputs).logits
    
        scores = torch.nn.functional.softmax(logits, dim=1).detach().cpu().numpy()[0]
        outputs = [{"Label": config.id2label[i], "Score": f"{score * 100:.1f}%"} for i, score in enumerate(scores)]
        return outputs
    
    path = "genres_original/disco/disco.00067.wav"
    outputs = predict(path, feature_extractor.sampling_rate)
    print(outputs)
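
The printed outputs list one score per genre. To report only the most likely genre, the percentage strings can be compared numerically (a small convenience on top of the snippet above, not part of the original example):

# Pick the entry with the highest score; strip the "%" before comparing.
best = max(outputs, key=lambda o: float(o["Score"].rstrip("%")))
print(f"Predicted genre: {best['Label']} ({best['Score']})")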
    

Suggested Cloud GPUs

For optimal performance, consider using cloud GPUs such as NVIDIA Tesla V100 or A100, available on platforms like AWS, Google Cloud, or Azure.

License

Specific licensing details are not stated here; consult the project repository and review its license terms before use to ensure compliance.
