# hubert-base-ch-speech-emotion-recognition

by xmj2002

## Introduction

The hubert-base-ch-speech-emotion-recognition model uses TencentGameMate/chinese-hubert-base as its pre-trained base and is fine-tuned on the CASIA dataset to recognize six emotions in Chinese speech: anger, fear, happiness, neutrality, sadness, and surprise.
## Architecture

The model is based on HuBERT, a transformer architecture designed for processing audio data. On top of it sits a HubertClassificationHead for emotion classification, consisting of a dense layer, dropout, and an output projection layer.
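A minimal sketch of what such a classification head could look like. The layer sizes, activation, and forward pass below are assumptions for illustration, not the model's exact implementation; only the dense/dropout/projection structure is taken from the card.

```python
import torch
import torch.nn as nn

class HubertClassificationHead(nn.Module):
    """Maps pooled HuBERT hidden states to six emotion logits.

    hidden_size=768 matches hubert-base; num_classes=6 covers anger, fear,
    happiness, neutrality, sadness, and surprise. Layer names/sizes are
    illustrative assumptions.
    """
    def __init__(self, hidden_size=768, num_classes=6, dropout=0.1):
        super().__init__()
        self.dense = nn.Linear(hidden_size, hidden_size)
        self.dropout = nn.Dropout(dropout)
        self.out_proj = nn.Linear(hidden_size, num_classes)

    def forward(self, features):
        x = self.dropout(features)
        x = torch.tanh(self.dense(x))  # assumed activation
        x = self.dropout(x)
        return self.out_proj(x)

head = HubertClassificationHead()
logits = head(torch.randn(4, 768))  # batch of 4 pooled embeddings
print(logits.shape)  # torch.Size([4, 6])
```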
## Training

The dataset is split 60% training / 20% validation / 20% testing. Training uses a batch size of 36, a learning rate of 2e-4, and the AdamW optimizer with specific beta and weight decay settings. A step learning-rate scheduler is employed with a step size of 10 and a gamma of 0.3, alongside a dropout of 0.1 for the classifier. On the test set the model achieves a loss of 0.1165 and an accuracy of 97.2%.
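Using the stated hyperparameters, the optimizer and scheduler setup could be sketched as follows. The learning rate, step size, and gamma come from the card; the betas and weight decay are placeholders, since the card only says "specific" settings were used, and a single linear layer stands in for the full model.

```python
import torch

model = torch.nn.Linear(768, 6)  # stand-in for the full HuBERT classifier

# lr=2e-4 and StepLR(step_size=10, gamma=0.3) are from the card;
# betas/weight_decay below are illustrative defaults, not the actual values.
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-4,
                              betas=(0.9, 0.999), weight_decay=0.01)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.3)

for epoch in range(10):
    # ... iterate over batches of size 36 and call loss.backward() here ...
    optimizer.step()
    scheduler.step()

# After 10 epochs the scheduler has decayed the lr once: 2e-4 * 0.3 ≈ 6e-5
print(optimizer.param_groups[0]["lr"])
```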
## Guide: Running Locally

1. **Environment Setup**
   - Install the necessary libraries: `librosa`, `torch`, `transformers`.
   - Ensure a suitable Python environment (Python 3.6+).
2. **Model and Data Preparation**
   - Load the model and processor using `transformers`.
   - Prepare audio files with a sample rate of 16,000 Hz and a duration of 6 seconds.
3. **Running Predictions**
   - Use the provided `predict` function to classify emotions from audio files.
   - Ensure the files are located in a directory, e.g., `test_data`.
4. **Cloud GPUs**
   - For enhanced performance, consider cloud GPU services such as AWS EC2 GPU instances, Google Cloud Platform (GCP) AI Notebooks, or Azure Machine Learning.
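The data preparation and prediction steps above can be sketched as below. This is a hypothetical outline, not the model's shipped `predict` function: the 6-second padding/truncation and the 16 kHz sample rate follow the card, while the label ordering and helper names are assumptions, and dummy data replaces `librosa.load` plus a real model forward pass so the sketch runs offline.

```python
import numpy as np

SAMPLE_RATE = 16_000
DURATION_S = 6
TARGET_LEN = SAMPLE_RATE * DURATION_S  # 96,000 samples

# The six emotions the model predicts, in an assumed (alphabetical) order.
LABELS = ["anger", "fear", "happiness", "neutrality", "sadness", "surprise"]

def pad_or_trim(waveform: np.ndarray) -> np.ndarray:
    """Force a mono waveform to exactly 6 s at 16 kHz, as the card requires."""
    if len(waveform) >= TARGET_LEN:
        return waveform[:TARGET_LEN]
    return np.pad(waveform, (0, TARGET_LEN - len(waveform)))

def label_from_logits(logits: np.ndarray) -> str:
    """Map the model's 6-way logits to an emotion name."""
    return LABELS[int(np.argmax(logits))]

# In practice the waveform would come from e.g.
# librosa.load("test_data/sample.wav", sr=SAMPLE_RATE) and the logits from the
# fine-tuned HuBERT model loaded via transformers; dummy values are used here.
wave = pad_or_trim(np.random.randn(50_000))
print(len(wave))  # 96000
print(label_from_logits(np.array([0.1, 2.3, 0.5, 0.2, 0.1, 0.4])))  # fear
```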
## License

This model is released under the Apache 2.0 license, which permits personal and commercial use, distribution, and modification, provided proper attribution is given to the original authors.