Wav2Vec2 XLSR Greek Speech Emotion Recognition

m3hrdadfi

Introduction

This document provides an overview and usage guide for the wav2vec2-xlsr-greek-speech-emotion-recognition model, which recognizes emotions in Greek speech. It builds on the cross-lingual (XLSR) variant of Wav2Vec 2.0 and is trained on the AESDD (Acted Emotional Speech Dynamic Database) dataset for speech emotion recognition.

Architecture

The model is based on the cross-lingual XLSR variant of the Wav2Vec 2.0 architecture, with a classification head placed on top of the pretrained speech encoder. It takes Greek-language audio as input and outputs an emotion label together with confidence scores.
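In code terms, this amounts to pooling the encoder's hidden states over time and passing the pooled vector through a classifier. The snippet below is only an illustrative sketch of such a head; the class name and mean-pooling choice are assumptions for illustration, and the actual Wav2Vec2ForSpeechClassification class used later in this guide comes from the author's code, not from the transformers library.

import torch
import torch.nn as nn
from transformers import Wav2Vec2Model

class SpeechEmotionClassifierSketch(nn.Module):
    # Illustrative classification head: mean-pool Wav2Vec 2.0 hidden states over time,
    # then map the pooled vector to one logit per emotion label.
    def __init__(self, model_name_or_path, num_labels):
        super().__init__()
        self.wav2vec2 = Wav2Vec2Model.from_pretrained(model_name_or_path)
        self.classifier = nn.Linear(self.wav2vec2.config.hidden_size, num_labels)

    def forward(self, input_values):
        hidden_states = self.wav2vec2(input_values).last_hidden_state  # (batch, time, hidden)
        pooled = hidden_states.mean(dim=1)                             # mean-pool over time
        return self.classifier(pooled)                                 # (batch, num_labels) logits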

Training

The model has been trained and evaluated for emotion recognition in Greek. It reports high precision and recall for each of the AESDD emotion classes (anger, disgust, fear, happiness, sadness), with an overall accuracy of 0.91.
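For reference, per-emotion precision and recall of this kind can be computed from predicted and true labels with scikit-learn. The snippet below is only a sketch: scikit-learn is an extra dependency not listed in the requirements, and the label lists are placeholders, not results from the model card.

from sklearn.metrics import classification_report

# Placeholder labels for illustration only; real evaluation would use the AESDD test split.
y_true = ["anger", "disgust", "fear", "happiness", "sadness", "anger"]
y_pred = ["anger", "disgust", "fear", "happiness", "sadness", "disgust"]

# Prints per-emotion precision, recall, F1, and the overall accuracy.
print(classification_report(y_true, y_pred, digits=2))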

Guide: Running Locally

Requirements

To run the model locally, ensure the following packages are installed:

!pip install git+https://github.com/huggingface/datasets.git
!pip install git+https://github.com/huggingface/transformers.git
!pip install torchaudio
!pip install librosa

Prediction

  1. Import Libraries: Import the necessary libraries, including torch, torchaudio, and transformers.
  2. Load Model: Use AutoConfig and Wav2Vec2FeatureExtractor to load the model configuration and feature extractor, then load the classification model itself.
  3. Preprocess Audio: Load the audio file with torchaudio and resample it to the feature extractor's sampling rate.
  4. Run Prediction: Feed the processed audio to the model; the prediction includes each emotion label with its confidence score. The complete code is shown below.

import torch
import torch.nn.functional as F
import torchaudio
from transformers import AutoConfig, Wav2Vec2FeatureExtractor
# Wav2Vec2ForSpeechClassification is not part of the transformers library; the import path
# below is an assumption, and the class is provided by the author's repository.
from src.models import Wav2Vec2ForSpeechClassification

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model_name_or_path = "m3hrdadfi/wav2vec2-xlsr-greek-speech-emotion-recognition"
config = AutoConfig.from_pretrained(model_name_or_path)
feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained(model_name_or_path)
sampling_rate = feature_extractor.sampling_rate
model = Wav2Vec2ForSpeechClassification.from_pretrained(model_name_or_path).to(device)

def speech_file_to_array_fn(path, sampling_rate):
    # Load the audio file and resample it to the feature extractor's sampling rate.
    speech_array, _sampling_rate = torchaudio.load(path)
    resampler = torchaudio.transforms.Resample(_sampling_rate, sampling_rate)
    return resampler(speech_array).squeeze().numpy()

def predict(path, sampling_rate):
    speech = speech_file_to_array_fn(path, sampling_rate)
    inputs = feature_extractor(speech, sampling_rate=sampling_rate, return_tensors="pt", padding=True)
    inputs = {key: inputs[key].to(device) for key in inputs}

    with torch.no_grad():
        logits = model(**inputs).logits

    # Convert logits to per-emotion probabilities and format them as percentages.
    scores = F.softmax(logits, dim=1).detach().cpu().numpy()[0]
    outputs = [{"Emotion": config.id2label[i], "Score": f"{score * 100:.1f}%"} for i, score in enumerate(scores)]
    return outputs

path = "/path/to/disgust.wav"
outputs = predict(path, sampling_rate)
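The returned outputs is a list with one entry per emotion label. As a small follow-up snippet (not part of the original card), the top prediction can be extracted like this:

# Pick the highest-scoring emotion from the predict() output.
top = max(outputs, key=lambda o: float(o["Score"].rstrip("%")))
print(top["Emotion"], top["Score"])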

Cloud GPUs

For enhanced performance, consider running the model on cloud platforms that provide GPU support, such as AWS, Google Cloud, or Azure.

License

The model is licensed under the Apache 2.0 License, allowing for both personal and commercial use.
