Wav2Vec English Speech Emotion Recognition
Introduction
The wav2vec-english-speech-emotion-recognition model is a fine-tuned version of jonatasgrosman/wav2vec2-large-xlsr-53-english, designed specifically for Speech Emotion Recognition (SER). It was trained on several emotional speech datasets and classifies audio into seven emotions: angry, disgust, fear, happy, neutral, sad, and surprise. The model achieves high accuracy and low loss on its evaluation set; exact figures are given under Training.
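The label set can be checked directly from the published checkpoint's configuration; the snippet below is a minimal sketch (the id-to-label ordering shown in the comment is an assumption and should be verified against the actual output).

```python
from transformers import AutoConfig

# Fetch only the configuration of the fine-tuned checkpoint and
# inspect its emotion label mapping.
config = AutoConfig.from_pretrained("r-f/wav2vec-english-speech-emotion-recognition")
print(config.id2label)
# Expected to cover the seven emotions above, e.g. (ordering assumed):
# {0: 'angry', 1: 'disgust', 2: 'fear', 3: 'happy', 4: 'neutral', 5: 'sad', 6: 'surprise'}
```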
Architecture
This model uses the wav2vec2 architecture, implemented in PyTorch and available through the Transformers library. The architecture was originally designed for automatic speech recognition and has here been fine-tuned for emotion recognition on the datasets listed below.
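To make the architecture concrete, the sketch below (an illustrative example, assuming the large base checkpoint downloads successfully) runs one second of silence through the base wav2vec2 encoder to show the frame-level representations the emotion head is trained on.

```python
import torch
from transformers import Wav2Vec2Model

# Load only the base encoder (no task head) and switch off dropout.
model = Wav2Vec2Model.from_pretrained("jonatasgrosman/wav2vec2-large-xlsr-53-english")
model.eval()

waveform = torch.zeros(1, 16000)  # one second of silent 16 kHz audio
with torch.no_grad():
    hidden = model(waveform).last_hidden_state

# The encoder emits one hidden vector per ~20 ms frame,
# e.g. torch.Size([1, 49, 1024]) for this large checkpoint.
print(hidden.shape)
```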
Training
The model was trained using the following datasets:
- Surrey Audio-Visual Expressed Emotion (SAVEE)
- Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS)
- Toronto emotional speech set (TESS)
Key hyperparameters used during training include (see the sketch after this list):
- Learning rate: 0.0001
- Train and evaluation batch size: 4
- Number of epochs: 4
- Optimizer: Adam
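The original training script is not published; as a rough guide, these hyperparameters could map onto the Transformers Trainer API as in the minimal sketch below (the argument names and the output_dir value are assumptions, not the author's actual configuration).

```python
from transformers import TrainingArguments

# Hypothetical mapping of the listed hyperparameters onto TrainingArguments;
# the actual training setup for this checkpoint was not released.
training_args = TrainingArguments(
    output_dir="wav2vec-english-speech-emotion-recognition",  # assumed name
    learning_rate=1e-4,             # learning rate: 0.0001
    per_device_train_batch_size=4,  # train batch size: 4
    per_device_eval_batch_size=4,   # evaluation batch size: 4
    num_train_epochs=4,             # number of epochs: 4
    # Adam (AdamW) is the Trainer's default optimizer.
)
```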
The model achieved an accuracy of 0.97463 with a loss of 0.104075 on the evaluation set.
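An accuracy figure of this kind is commonly produced during evaluation with a compute_metrics callback; the following is a generic sketch under the assumption of a Trainer-based setup, not the author's published code.

```python
import numpy as np

# Generic accuracy metric for the Trainer (hypothetical; the evaluation
# code for this checkpoint is not published). Frame-level logits from a
# CTC-style head are mean-pooled over time to one prediction per clip.
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    if logits.ndim == 3:  # (batch, frames, classes)
        logits = logits.mean(axis=1)
    predictions = np.argmax(logits, axis=-1)
    return {"accuracy": float((predictions == labels).mean())}
```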
Guide: Running Locally
To use the model locally, follow these steps:
1. Install the required libraries:

```bash
pip install transformers librosa torch
```
2. Load the model and predict:

```python
import librosa
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2ForCTC

# Load the fine-tuned feature extractor and model from the Hugging Face Hub.
feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained("r-f/wav2vec-english-speech-emotion-recognition")
model = Wav2Vec2ForCTC.from_pretrained("r-f/wav2vec-english-speech-emotion-recognition")
model.eval()  # disable dropout for inference

def predict_emotion(audio_path):
    # Load the audio and resample it to the 16 kHz rate the model expects.
    audio, rate = librosa.load(audio_path, sr=16000)
    inputs = feature_extractor(audio, sampling_rate=rate, return_tensors="pt", padding=True)
    with torch.no_grad():
        outputs = model(inputs.input_values)
    # Average the per-frame logits over time, then softmax to obtain one
    # probability distribution over the seven emotion classes.
    predictions = torch.nn.functional.softmax(outputs.logits.mean(dim=1), dim=-1)
    predicted_label = torch.argmax(predictions, dim=-1)
    return model.config.id2label[predicted_label.item()]

emotion = predict_emotion("example_audio.wav")
print(f"Predicted emotion: {emotion}")
```
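Because the checkpoint is loaded through Wav2Vec2ForCTC, the head emits one logit vector per audio frame rather than per clip; mean-pooling those frame-level logits over time before the softmax collapses them into a single clip-level distribution over the seven emotions.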
3. Suggested environment:
   - Consider using cloud GPU services such as AWS, Google Cloud, or Azure for efficient computation.
License
This model is licensed under the Apache-2.0 License, which permits free use, modification, and distribution with appropriate attribution.