Wav2Vec2 Base 100K GTZAN Music Genres
Introduction
The wav2vec2-base-100k-voxpopuli-gtzan-music model by m3hrdadfi is designed for music genre classification using the Wav2Vec 2.0 architecture. It adapts speech representations learned through self-supervised pretraining to audio classification, and is built on PyTorch and the Hugging Face Transformers library.
Architecture
The model is based on Wav2Vec 2.0, a framework for self-supervised learning of speech representations. A convolutional feature encoder and a Transformer encoder turn raw audio into contextual representations, and a classification head on top assigns one of the GTZAN music genres.
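As a quick way to see the labels the classification head was trained on, the model config can be inspected directly (a minimal sketch; it assumes the repository id used in the prediction code below and network access to the Hugging Face Hub):

```python
from transformers import AutoConfig

# Inspect the genre labels attached to the model's classification head.
config = AutoConfig.from_pretrained("m3hrdadfi/wav2vec2-base-100k-voxpopuli-gtzan-music")
print(config.id2label)  # e.g. {0: "blues", 1: "classical", ...}
```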
Training
The model was trained using the GTZAN music genre dataset, which includes various genres such as blues, classical, country, disco, hip hop, jazz, metal, pop, reggae, and rock. Evaluation metrics such as precision, recall, and F1-score were used to assess performance, with the model achieving an overall accuracy of 77.5%.
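For reference, metrics of this kind can be computed from held-out predictions with scikit-learn (a hedged sketch; `y_true` and `y_pred` below are hypothetical placeholder lists, not the model's actual outputs):

```python
from sklearn.metrics import accuracy_score, classification_report

# Hypothetical ground-truth and predicted genres for a held-out split.
y_true = ["disco", "jazz", "rock", "blues", "metal"]
y_pred = ["disco", "jazz", "metal", "blues", "metal"]

print(f"Accuracy: {accuracy_score(y_true, y_pred):.3f}")
# Per-class precision, recall, and F1-score.
print(classification_report(y_true, y_pred, zero_division=0))
```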
Guide: Running Locally
Requirements
To run the model locally, the following packages are required:
- Hugging Face's datasets
- Hugging Face's transformers
- torchaudio
- librosa
Install them using:
```bash
!pip install git+https://github.com/huggingface/datasets.git
!pip install git+https://github.com/huggingface/transformers.git
!pip install torchaudio
!pip install librosa
```
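A quick way to confirm the environment is ready is to import the four packages and print their versions (a minimal sanity check, assuming the installs above succeeded):

```python
# Verify that all four dependencies import and report their versions.
import datasets, transformers, torchaudio, librosa

for pkg in (datasets, transformers, torchaudio, librosa):
    print(pkg.__name__, pkg.__version__)
```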
Prediction
- Load the model and feature extractor:

```python
from transformers import AutoConfig, Wav2Vec2FeatureExtractor, Wav2Vec2ForSpeechClassification
import torch

# Run on GPU when available, otherwise fall back to CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model_name_or_path = "m3hrdadfi/wav2vec2-base-100k-voxpopuli-gtzan-music"

# The config carries the id2label mapping for the ten GTZAN genres.
config = AutoConfig.from_pretrained(model_name_or_path)
feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained(model_name_or_path)
model = Wav2Vec2ForSpeechClassification.from_pretrained(model_name_or_path).to(device)
```
- Define helper functions and run prediction:

```python
import torchaudio

def speech_file_to_array_fn(path, sampling_rate):
    # Load the audio file and resample it to the model's expected rate.
    speech_array, _sampling_rate = torchaudio.load(path)
    resampler = torchaudio.transforms.Resample(_sampling_rate, sampling_rate)
    speech = resampler(speech_array).squeeze().numpy()
    return speech

def predict(path, sampling_rate):
    speech = speech_file_to_array_fn(path, sampling_rate)
    inputs = feature_extractor(speech, sampling_rate=sampling_rate, return_tensors="pt", padding=True)
    inputs = {key: inputs[key].to(device) for key in inputs}

    with torch.no_grad():
        logits = model(**inputs).logits

    # Convert logits to per-genre probabilities.
    scores = torch.nn.functional.softmax(logits, dim=1).detach().cpu().numpy()[0]
    outputs = [
        {"Label": config.id2label[i], "Score": f"{score * 100:.1f}%"}
        for i, score in enumerate(scores)
    ]
    return outputs

path = "genres_original/disco/disco.00067.wav"
outputs = predict(path, feature_extractor.sampling_rate)
print(outputs)
```
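To surface just the top prediction, the `outputs` list returned above can be reduced with `max` (a small convenience sketch; the score shown in the comment is illustrative):

```python
# Each entry looks like {"Label": "disco", "Score": "96.1%"} (value illustrative).
# Strip the "%" suffix to compare scores numerically.
best = max(outputs, key=lambda o: float(o["Score"].rstrip("%")))
print(f"Predicted genre: {best['Label']} ({best['Score']})")
```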
Suggested Cloud GPUs
For optimal performance, consider using cloud GPUs such as NVIDIA Tesla V100 or A100, available on platforms like AWS, Google Cloud, or Azure.
License
Consult the project repository for specific licensing details, and review the license terms to ensure compliant use.