hubert-base-ch-speech-emotion-recognition

xmj2002

Introduction

The hubert-base-ch-speech-emotion-recognition model uses TencentGameMate/chinese-hubert-base as its pre-trained backbone and is fine-tuned on the CASIA dataset to recognize six emotions in Chinese speech: anger, fear, happiness, neutrality, sadness, and surprise.

Architecture

The model is based on HuBERT, a transformer architecture for learning speech representations from raw audio. On top of the HuBERT encoder sits a HubertClassificationHead for emotion classification, consisting of a dense layer, dropout, and an output projection layer.
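A minimal sketch of such a head (the hidden size of 768 matches the base model; the tanh activation and mean-pooling over time are assumptions about details the card leaves implicit):

    import torch
    from torch import nn

    class HubertClassificationHead(nn.Module):
        """Dense -> dropout -> projection head over pooled HuBERT features."""

        def __init__(self, hidden_size=768, num_labels=6, dropout=0.1):
            super().__init__()
            self.dense = nn.Linear(hidden_size, hidden_size)
            self.dropout = nn.Dropout(dropout)
            self.out_proj = nn.Linear(hidden_size, num_labels)

        def forward(self, hidden_states):
            # hidden_states: (batch, seq_len, hidden) from the HuBERT encoder
            x = hidden_states.mean(dim=1)  # mean-pool over time (an assumption)
            x = torch.tanh(self.dense(x))
            x = self.dropout(x)
            return self.out_proj(x)        # (batch, num_labels) emotion logits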

Training

Training splits the dataset into 60% training, 20% validation, and 20% testing. The model is trained with a batch size of 36, a learning rate of 2e-4, and the AdamW optimizer with tuned beta and weight-decay values, along with a dropout of 0.1 for the classifier. A step learning-rate scheduler is employed with a step size of 10 and a gamma of 0.3. On the test set, the model reaches a loss of 0.1165 and an accuracy of 97.2%.
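A sketch of the corresponding optimizer and scheduler setup; the betas and weight decay below are placeholders, since the card does not state their values:

    from torch import nn
    from torch.optim import AdamW
    from torch.optim.lr_scheduler import StepLR

    model = nn.Linear(768, 6)  # stand-in for HuBERT plus its classification head

    optimizer = AdamW(
        model.parameters(),
        lr=2e-4,             # learning rate from the card
        betas=(0.9, 0.999),  # placeholder: the card does not state the betas
        weight_decay=0.01,   # placeholder: the card does not state the weight decay
    )
    # Multiply the learning rate by gamma=0.3 every 10 scheduler steps
    scheduler = StepLR(optimizer, step_size=10, gamma=0.3)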

Guide: Running Locally

  1. Environment Setup:

    • Install necessary libraries: librosa, torch, transformers.
    • Ensure a suitable Python environment (Python 3.6+).
  2. Model and Data Preparation:

    • Load the model and processor using transformers.
    • Prepare audio files with a sample rate of 16,000 Hz and a duration of 6 seconds.
  3. Running Predictions:

    • Use the provided predict function to classify emotions from audio files; a hedged sketch of the full pipeline follows this list.
    • Ensure the files are located in a directory, e.g., test_data.
  4. Cloud GPUs:

    • For enhanced performance, consider using cloud GPU services such as AWS EC2 with GPU instances, Google Cloud Platform (GCP) AI Notebooks, or Azure Machine Learning.
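A minimal end-to-end sketch of steps 2 and 3, assuming the checkpoint id xmj2002/hubert-base-ch-speech-emotion-recognition loads through transformers' standard audio-classification head (the original card ships its own model class and predict helper, so treat this as an approximation):

    # pip install librosa torch transformers
    import librosa
    import torch
    from transformers import AutoModelForAudioClassification, Wav2Vec2FeatureExtractor

    MODEL_ID = "xmj2002/hubert-base-ch-speech-emotion-recognition"
    processor = Wav2Vec2FeatureExtractor.from_pretrained(MODEL_ID)
    model = AutoModelForAudioClassification.from_pretrained(MODEL_ID)
    model.eval()

    def predict(path, sample_rate=16_000, duration=6):
        # Load, resample to 16 kHz, and pad/truncate to 6 seconds
        speech, _ = librosa.load(path, sr=sample_rate)
        speech = librosa.util.fix_length(speech, size=sample_rate * duration)
        inputs = processor(speech, sampling_rate=sample_rate, return_tensors="pt")
        with torch.no_grad():
            logits = model(**inputs).logits
        probs = torch.softmax(logits, dim=-1)[0]
        label_id = int(probs.argmax())
        return model.config.id2label[label_id], float(probs[label_id])

    # Hypothetical file path; point it at any 16 kHz WAV in test_data/
    label, score = predict("test_data/example.wav")
    print(label, score)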

License

This model is released under the Apache 2.0 license, which permits personal and commercial use, modification, and distribution, provided the license and copyright notices are retained.
