wav2vec2-lg-xlsr-en-speech-emotion-recognition

ehcalabres

Introduction

The wav2vec2-lg-xlsr-en-speech-emotion-recognition model is a fine-tuned version of the wav2vec2-large-xlsr-53-english model, adapted for Speech Emotion Recognition (SER). It was fine-tuned on the RAVDESS dataset, which contains recordings of actors expressing eight emotions in English: angry, calm, disgust, fearful, happy, neutral, sad, and surprised. On the evaluation set, the model reaches a loss of 0.5023 and an accuracy of 82.23%.

Architecture

The model builds on the wav2vec 2.0 architecture, which learns speech representations directly from raw audio waveforms through self-supervised pretraining and is widely used for speech recognition and audio classification. Here the pretrained encoder is fine-tuned with a sequence classification head so that each speech input is assigned one of the eight emotion labels.

Training

The model was trained using the following hyperparameters:

  • Learning Rate: 0.0001
  • Train Batch Size: 4
  • Eval Batch Size: 4
  • Seed: 42
  • Gradient Accumulation Steps: 2
  • Total Train Batch Size: 8
  • Optimizer: Adam (betas=(0.9,0.999), epsilon=1e-08)
  • Learning Rate Scheduler: Linear
  • Num Epochs: 3
  • Mixed Precision Training: Native AMP

Loss and accuracy improved gradually over the three epochs, with accuracy reaching 82.23% at the final step.
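The relationship between the batch-size settings and the linear scheduler listed above can be sketched in a few lines of plain Python. This is an illustration only: the helper names, the single-device default, the absence of warmup, and the `total_steps` value are assumptions, since the source does not state them.

```python
# Hyperparameters copied from the list above.
LEARNING_RATE = 1e-4
TRAIN_BATCH_SIZE = 4
GRAD_ACCUM_STEPS = 2


def effective_batch_size(per_device, accum_steps, num_devices=1):
    """Samples contributing to one optimizer update.

    num_devices=1 is an assumption; the source does not say how many
    devices were used.
    """
    return per_device * accum_steps * num_devices


def linear_lr(step, total_steps, base_lr=LEARNING_RATE):
    """Linear decay from base_lr to 0, assuming no warmup phase."""
    remaining = max(0.0, 1.0 - step / total_steps)
    return base_lr * remaining


if __name__ == "__main__":
    print(effective_batch_size(TRAIN_BATCH_SIZE, GRAD_ACCUM_STEPS))
    # total_steps=100 is a hypothetical value for illustration.
    for step in (0, 50, 100):
        print(step, linear_lr(step, total_steps=100))
```

With a per-device batch of 4 and 2 accumulation steps, each optimizer update sees 8 samples, which matches the total train batch size in the list.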

Guide: Running Locally

To run the model locally, follow these steps:

  1. Environment Setup:

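    • Ensure you have Python installed.
    • Install necessary libraries with:
      pip install transformers==4.8.2 torch==1.9.0+cu102 datasets==1.9.0 tokenizers==0.10.3
    • The +cu102 suffix selects the CUDA 10.2 build of PyTorch, which is typically served from the PyTorch wheel index rather than PyPI, so the command may need an extra index option (pip's -f or --extra-index-url). If the import in step 2 fails under transformers 4.8.2, a later release that includes Wav2Vec2ForSequenceClassification is needed.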
      
  2. Download Model:

    • Use the Hugging Face Transformers library to load the model from the hub:
      from transformers import Wav2Vec2ForSequenceClassification
      model = Wav2Vec2ForSequenceClassification.from_pretrained("ehcalabres/wav2vec2-lg-xlsr-en-speech-emotion-recognition")
      
  3. Prepare Data:

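    • Provide your audio as mono waveforms sampled at 16 kHz, the rate used to pretrain wav2vec 2.0 XLSR checkpoints; resample recordings at other rates first.

  4. Inference:

    • Pass the prepared waveforms through the model and take the highest-scoring class as the predicted emotion.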
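
The data preparation and inference steps above might look roughly like the following. This is a minimal sketch, not the author's pipeline: `rank_emotions` and `predict_emotion` are hypothetical helper names, the code assumes the hub repository provides a feature-extractor configuration, and it assumes an installed transformers release that includes `Wav2Vec2ForSequenceClassification`. Audio loading is left out; the function expects samples that are already mono and at 16 kHz.

```python
import math

MODEL_ID = "ehcalabres/wav2vec2-lg-xlsr-en-speech-emotion-recognition"


def rank_emotions(logits, id2label):
    """Turn raw class scores into (label, probability) pairs, best first."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    scored = [(id2label[i], e / total) for i, e in enumerate(exps)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def predict_emotion(waveform, sampling_rate=16_000):
    """Classify one mono waveform (1-D sequence of floats) sampled at 16 kHz."""
    # Heavy imports stay local so rank_emotions works without them.
    import torch
    from transformers import (
        Wav2Vec2FeatureExtractor,
        Wav2Vec2ForSequenceClassification,
    )

    # Assumption: the hub repository provides a feature-extractor config.
    extractor = Wav2Vec2FeatureExtractor.from_pretrained(MODEL_ID)
    model = Wav2Vec2ForSequenceClassification.from_pretrained(MODEL_ID)
    model.eval()

    inputs = extractor(waveform, sampling_rate=sampling_rate, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits  # shape: (1, num_labels)
    # Label names come from the model config rather than a hard-coded list.
    return rank_emotions(logits[0].tolist(), model.config.id2label)
```

Calling `predict_emotion(samples)` with a list or NumPy array of 16 kHz samples returns every label ranked by probability, so the first entry is the predicted emotion. Reading the labels from `model.config.id2label` avoids guessing the class order.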

Cloud GPUs

For improved performance, especially with large datasets, consider using cloud-based GPU services such as AWS EC2, Google Cloud Platform, or Azure.
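
Whether the code runs on a local CPU or a cloud GPU, device selection can be handled in one place. The sketch below uses a hypothetical helper, `pick_device`, that falls back to the CPU when PyTorch is missing or no CUDA device is visible.

```python
def pick_device():
    """Return "cuda" when PyTorch can see a GPU, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        # PyTorch is not installed, so only the CPU is usable.
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


if __name__ == "__main__":
    print(pick_device())
```

The returned string can be passed to `model.to(...)` and to each input tensor's `.to(...)` so that the model and its inputs sit on the same device.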

License

This project is licensed under the Apache-2.0 License, which permits use, modification, and distribution provided that the license text and copyright notices are retained.
