wav2vec english speech emotion recognition

r-f

Introduction

WAV2VEC-ENGLISH-SPEECH-EMOTION-RECOGNITION is a fine-tuned model based on jonatasgrosman/wav2vec2-large-xlsr-53-english, built specifically for Speech Emotion Recognition (SER). The model was trained on several emotional speech datasets and classifies audio into seven emotions: angry, disgust, fear, happy, neutral, sad, and surprise. On its evaluation set it reaches an accuracy of about 0.97 with a loss of about 0.10 (exact figures in the Training section below).

Architecture

This model uses the wav2vec2 architecture from the Transformers library, implemented with PyTorch. The base checkpoint was trained for automatic speech recognition; this version has been fine-tuned for emotion classification on the datasets listed in the Training section below.
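
The seven emotion classes are stored in the checkpoint's configuration as a label mapping. A minimal sketch for inspecting it (assuming the id2label field is populated, which the prediction code in the guide below also relies on):

    from transformers import AutoConfig
    
    # Load only the configuration (no weights) to inspect the class labels
    config = AutoConfig.from_pretrained("r-f/wav2vec-english-speech-emotion-recognition")
    print(config.id2label)  # seven emotions: angry, disgust, fear, happy, neutral, sad, surprise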

Training

The model was trained using the following datasets:

  • Surrey Audio-Visual Expressed Emotion (SAVEE)
  • Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS)
  • Toronto emotional speech set (TESS)

Key hyperparameters used during training include:

  • Learning rate: 0.0001
  • Train and evaluation batch size: 4
  • Number of epochs: 4
  • Optimizer: Adam

The model achieved an accuracy of 0.97463 with a loss of 0.104075 on the evaluation set.
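
The full training script is not part of the model card; the following is a rough sketch of how the hyperparameters above could be expressed with the Hugging Face TrainingArguments API (the output directory is hypothetical, and the Trainer's default AdamW optimizer differs slightly from the reported Adam):

    from transformers import TrainingArguments
    
    # Sketch only: dataset preparation, the classification head, and the Trainer call are omitted
    training_args = TrainingArguments(
        output_dir="wav2vec-english-speech-emotion-recognition",  # hypothetical output path
        learning_rate=1e-4,              # reported learning rate
        per_device_train_batch_size=4,   # reported train batch size
        per_device_eval_batch_size=4,    # reported eval batch size
        num_train_epochs=4,              # reported number of epochs
    )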

Guide: Running Locally

To use the model locally, follow these steps:

  1. Install Required Libraries:

    pip install transformers librosa torch
    
  2. Load and Predict with the Model:

    from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2ForCTC
    import librosa
    import torch
    
    # Load the feature extractor and the fine-tuned checkpoint from the Hugging Face Hub
    feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained("r-f/wav2vec-english-speech-emotion-recognition")
    model = Wav2Vec2ForCTC.from_pretrained("r-f/wav2vec-english-speech-emotion-recognition")
    
    def predict_emotion(audio_path):
        # Load the audio file and resample it to 16 kHz, the rate the model expects
        audio, rate = librosa.load(audio_path, sr=16000)
        inputs = feature_extractor(audio, sampling_rate=rate, return_tensors="pt", padding=True)
        
        with torch.no_grad():
            outputs = model(inputs.input_values)
            # Average the per-frame logits over time, then normalize to class probabilities
            predictions = torch.nn.functional.softmax(outputs.logits.mean(dim=1), dim=-1)
            predicted_label = torch.argmax(predictions, dim=-1)
            # Map the predicted class index back to its emotion name
            emotion = model.config.id2label[predicted_label.item()]
        return emotion
    
    emotion = predict_emotion("example_audio.wav")
    print(f"Predicted emotion: {emotion}")
    
  3. Suggested Environment:

    • Consider using cloud GPU services like AWS, Google Cloud, or Azure for efficient computation.
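
    • Whether on a local or cloud GPU, the model and input tensors can be moved onto the device explicitly. A minimal sketch extending the predict_emotion example from step 2 (the device-selection line is an assumption, not part of the original card):

      device = "cuda" if torch.cuda.is_available() else "cpu"  # pick a GPU when one is available
      model = model.to(device)
      # inside predict_emotion, run the forward pass on the same device:
      #     outputs = model(inputs.input_values.to(device))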

License

This model is released under the Apache-2.0 License, which permits free use, modification, and redistribution provided the license and copyright notices are retained.
