wav2vec2-large-robust-12-ft-emotion-msp-dim

audEERING

Introduction

The wav2vec2-large-robust-12-ft-emotion-msp-dim model by audEERING is a fine-tuned version of Wav2Vec2 for dimensional speech emotion recognition. It predicts emotional dimensions such as arousal, dominance, and valence from raw audio signals. This model is intended for research purposes, with commercial licenses available through audEERING for models trained on larger datasets.

Architecture

This model is based on the Wav2Vec2 architecture, specifically fine-tuned from the Wav2Vec2-Large-Robust variant. It has been pruned to 12 transformer layers from the original 24 before fine-tuning. The model processes raw audio inputs and outputs predictions for emotional dimensions as well as the pooled states of the last transformer layer. An ONNX export is available for deployment purposes.

Training

The model was trained using the MSP-Podcast dataset (version 1.7) and fine-tuned to predict emotional dimensions from audio. The training involved reducing the number of transformer layers and adapting the model for audio classification tasks.

Guide: Running Locally

To run the model locally, follow these steps:

  1. Install Dependencies: Ensure you have Python and PyTorch installed. Install the transformers library from Hugging Face.

    pip install transformers torch
    
  2. Load the Model: Use the Wav2Vec2Processor and EmotionModel classes to load the model and processor.

    from transformers import Wav2Vec2Processor
    from your_custom_module import EmotionModel  # Ensure this class is defined as shown in the model usage
    
    model_name = 'audeering/wav2vec2-large-robust-12-ft-emotion-msp-dim'
    processor = Wav2Vec2Processor.from_pretrained(model_name)
    model = EmotionModel.from_pretrained(model_name).to('cpu')
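
The `EmotionModel` class is not shipped with `transformers`. A minimal definition, adapted from the usage example on the model card, wraps `Wav2Vec2Model` with a small regression head and returns both the pooled hidden states and the three emotion scores:

```python
import torch
import torch.nn as nn
from transformers import Wav2Vec2Model, Wav2Vec2PreTrainedModel


class RegressionHead(nn.Module):
    """Feed-forward head mapping pooled hidden states to emotion scores."""

    def __init__(self, config):
        super().__init__()
        self.dense = nn.Linear(config.hidden_size, config.hidden_size)
        self.dropout = nn.Dropout(config.final_dropout)
        self.out_proj = nn.Linear(config.hidden_size, config.num_labels)

    def forward(self, features):
        x = self.dropout(features)
        x = torch.tanh(self.dense(x))
        x = self.dropout(x)
        return self.out_proj(x)


class EmotionModel(Wav2Vec2PreTrainedModel):
    """Wav2Vec2 backbone with a regression head for arousal, dominance, valence."""

    def __init__(self, config):
        super().__init__(config)
        self.wav2vec2 = Wav2Vec2Model(config)
        self.classifier = RegressionHead(config)
        self.init_weights()

    def forward(self, input_values):
        hidden_states = self.wav2vec2(input_values)[0]
        pooled = torch.mean(hidden_states, dim=1)  # average over time axis
        return pooled, self.classifier(pooled)
```

With this definition in place, `EmotionModel.from_pretrained(model_name)` loads the fine-tuned weights, and a forward pass returns a `(pooled_states, logits)` tuple, matching the two outputs described in the Architecture section.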
    
  3. Prepare Audio Input: Prepare your audio signal as a NumPy array with the correct sampling rate (16,000 Hz).
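For example, a dummy one-second signal at 16 kHz (in practice, load a real recording, e.g. with `soundfile` or `librosa`, and resample to 16 kHz if necessary):

```python
import numpy as np

sampling_rate = 16000

# Dummy one-second signal of silence; replace with your own audio
# loaded as float32 at 16 kHz.
signal = np.zeros((1, sampling_rate), dtype=np.float32)
```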

  4. Process the Audio: Use the provided process_func to predict emotions or extract embeddings.
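`process_func` is likewise defined in the model card rather than a library. A sketch, written here to take `model` and `processor` as explicit arguments: with `embeddings=False` it returns the predicted emotional dimensions, with `embeddings=True` the pooled hidden states.

```python
import numpy as np
import torch


def process_func(model, processor, x, sampling_rate, embeddings=False):
    """Predict emotion dimensions (or pooled embeddings) for raw audio x."""
    y = processor(x, sampling_rate=sampling_rate)
    y = torch.from_numpy(np.asarray(y['input_values'][0]))
    with torch.no_grad():
        out = model(y)
    # out is (pooled_states, logits); pick the requested output
    y = out[0] if embeddings else out[1]
    return y.cpu().numpy()
```

Calling `process_func(model, processor, signal, 16000)` yields an array of shape `(1, 3)` with the arousal, dominance, and valence predictions (approximately in the 0–1 range).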

  5. Cloud GPUs: For faster inference, consider running the model on a cloud GPU service such as AWS, Google Cloud, or Azure.

License

This model is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (CC BY-NC-SA 4.0). It can be used for non-commercial research with appropriate attribution. Commercial use requires a separate license from audEERING.
