Wav2Vec2-XLSR Greek Speech Emotion Recognition
Introduction
This document provides an overview and usage guide for the wav2vec2-xlsr-greek-speech-emotion-recognition model by m3hrdadfi, which recognizes emotions in Greek speech. It fine-tunes Wav2Vec 2.0 on the AESDD (Acted Emotional Speech Dynamic Database) dataset for emotion recognition tasks.
Architecture
The model is built on the Wav2Vec 2.0 XLSR architecture, a cross-lingually pretrained speech representation model, with a classification head fine-tuned for emotion recognition. It takes Greek-language audio as input and outputs emotion labels with confidence scores.
Training
The model was trained and evaluated for emotion recognition on Greek speech. The reported results show high precision and recall across the emotion classes, with an overall accuracy of 0.91.
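The AESDD dataset covers five acted emotions: anger, disgust, fear, happiness, and sadness. The label mapping ships with the checkpoint and can be inspected from the model config; the snippet below is a minimal sketch (the exact id-to-label ordering is whatever the published checkpoint defines):

from transformers import AutoConfig

config = AutoConfig.from_pretrained("m3hrdadfi/wav2vec2-xlsr-greek-speech-emotion-recognition")
print(config.id2label)  # maps class indices to emotion names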
Guide: Running Locally
Requirements
To run the model locally, ensure the following packages are installed:
!pip install git+https://github.com/huggingface/datasets.git
!pip install git+https://github.com/huggingface/transformers.git
!pip install torchaudio
!pip install librosa
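After installation, a quick import check (a minimal sketch) confirms the environment is ready:

import torch, torchaudio, transformers, librosa

print("torch:", torch.__version__)
print("torchaudio:", torchaudio.__version__)
print("transformers:", transformers.__version__)
print("librosa:", librosa.__version__)
print("CUDA available:", torch.cuda.is_available())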
Prediction
- Import Libraries: ensure the necessary libraries, such as torch, torchaudio, and transformers, are imported.
- Load Model: use AutoConfig and Wav2Vec2FeatureExtractor to load the model configuration and feature extractor.
- Preprocess Audio: convert audio files to the expected array format and sampling rate using torchaudio.
- Run Prediction: pass the processed audio to the model. The predictions include emotion labels and their respective confidence scores, as shown in the script below.
import torch
import torch.nn.functional as F
import torchaudio
from transformers import AutoConfig, Wav2Vec2FeatureExtractor

# Wav2Vec2ForSpeechClassification is the author's custom classification head
# (not part of the core transformers library); see the model repository.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model_name_or_path = "m3hrdadfi/wav2vec2-xlsr-greek-speech-emotion-recognition"
config = AutoConfig.from_pretrained(model_name_or_path)
feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained(model_name_or_path)
sampling_rate = feature_extractor.sampling_rate
model = Wav2Vec2ForSpeechClassification.from_pretrained(model_name_or_path).to(device)

def speech_file_to_array_fn(path, sampling_rate):
    # Load the audio file and resample it to the model's expected rate.
    speech_array, source_rate = torchaudio.load(path)
    resampler = torchaudio.transforms.Resample(source_rate, sampling_rate)
    return resampler(speech_array).squeeze().numpy()

def predict(path, sampling_rate):
    speech = speech_file_to_array_fn(path, sampling_rate)
    inputs = feature_extractor(speech, sampling_rate=sampling_rate, return_tensors="pt", padding=True)
    inputs = {key: inputs[key].to(device) for key in inputs}
    with torch.no_grad():
        logits = model(**inputs).logits
    # Convert logits to per-class probabilities.
    scores = F.softmax(logits, dim=1).detach().cpu().numpy()[0]
    outputs = [{"Emotion": config.id2label[i], "Score": f"{score * 100:.1f}%"} for i, score in enumerate(scores)]
    return outputs

path = "/path/to/disgust.wav"
outputs = predict(path, sampling_rate)
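Each entry in outputs pairs an emotion label with a confidence score. As a small usage sketch (assuming the score format produced by predict above), the top prediction can be extracted like this:

top = max(outputs, key=lambda o: float(o["Score"].rstrip("%")))
print(top)  # prints the highest-scoring emotion and its confidence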
Cloud GPUs
For enhanced performance, consider running the model on cloud platforms that provide GPU support, such as AWS, Google Cloud, or Azure.
License
The model is licensed under the Apache 2.0 License, allowing for both personal and commercial use.