hubert-large-superb-er


Introduction

The hubert-large-superb-er model is a ported version of S3PRL's Hubert designed for the SUPERB (Speech processing Universal PERformance Benchmark) Emotion Recognition (ER) task. The model is based on hubert-large-ll60k, pretrained on 16kHz sampled speech audio, and it is intended for classifying emotions in speech.

Architecture

The model uses the Hubert architecture, specifically the large variant (hubert-large-ll60k). It expects audio sampled at 16kHz; input recorded at other rates should be resampled to 16kHz so that it matches the conditions under which the base model was pretrained.
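
As a quick illustration of the sampling-rate requirement, a clip recorded at a different rate can be resampled to 16kHz with librosa before it is handed to the feature extractor. This is a minimal sketch; the file path is a placeholder, not part of the original guide:

    import librosa
    
    # Load a clip and resample it to the 16kHz rate the model expects.
    # "my_clip.wav" is a placeholder path used only for illustration.
    speech, sampling_rate = librosa.load("my_clip.wav", sr=16000, mono=True)
    assert sampling_rate == 16000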

Training

The downstream model is trained on the IEMOCAP dataset, a widely used resource for emotion recognition. Following the conventional evaluation protocol, the data is reduced to the four emotion classes that have enough examples to keep the class distribution balanced, and cross-validation is performed over the five standard session splits. For detailed training instructions, refer to the S3PRL downstream task documentation.
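
The label set used by the published checkpoint can be read directly from its configuration. The snippet below is a small sketch using the standard transformers API; the exact label names and their ordering come from the checkpoint itself:

    from transformers import AutoConfig
    
    # The checkpoint's config maps each class index to its emotion label.
    config = AutoConfig.from_pretrained("superb/hubert-large-superb-er")
    print(config.id2label)  # the four IEMOCAP emotion classes, e.g. neutral, happy, angry, sad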

Guide: Running Locally

You can run the model locally using Python and the transformers library. Below are the basic steps:

  1. Install dependencies:

    pip install torch transformers datasets librosa
    
  2. Load the dataset and model:

    from datasets import load_dataset
    from transformers import pipeline
    
    # Load the ER demo data and wrap the checkpoint in an audio-classification pipeline.
    dataset = load_dataset("anton-l/superb_demo", "er", split="session1")
    classifier = pipeline("audio-classification", model="superb/hubert-large-superb-er")
    
    # Classify the first clip; the pipeline returns the highest-scoring labels with their scores.
    labels = classifier(dataset[0]["file"], top_k=5)
    
  3. Using the model directly:

    import torch
    import librosa
    from datasets import load_dataset
    from transformers import HubertForSequenceClassification, Wav2Vec2FeatureExtractor
    
    # Decode each audio file to a 16kHz mono waveform, as expected by the model.
    def map_to_array(example):
        speech, _ = librosa.load(example["file"], sr=16000, mono=True)
        example["speech"] = speech
        return example
    
    dataset = load_dataset("anton-l/superb_demo", "er", split="session1")
    dataset = dataset.map(map_to_array)
    
    model = HubertForSequenceClassification.from_pretrained("superb/hubert-large-superb-er")
    feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained("superb/hubert-large-superb-er")
    
    # Batch the first four utterances, padding them to a common length.
    inputs = feature_extractor(dataset[:4]["speech"], sampling_rate=16000, padding=True, return_tensors="pt")
    logits = model(**inputs).logits
    
    # Take the highest-scoring class per utterance and map it back to its emotion label.
    predicted_ids = torch.argmax(logits, dim=-1)
    labels = [model.config.id2label[_id] for _id in predicted_ids.tolist()]
    
  4. Cloud GPUs: For more intensive processing, consider using cloud GPU services such as AWS, Google Cloud, or Azure; a minimal sketch of moving the model onto a GPU follows this list.
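
If a GPU is available, either locally or on one of the services above, inference only requires moving the model and the batched inputs onto the device. The sketch below builds on the variables from step 3 and assumes a CUDA-capable GPU may be present:

    import torch
    
    # Run on a GPU when available; otherwise fall back to the CPU.
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = model.to(device)
    inputs = {name: tensor.to(device) for name, tensor in inputs.items()}
    
    with torch.no_grad():
        logits = model(**inputs).logits
    predicted_ids = torch.argmax(logits, dim=-1)
    labels = [model.config.id2label[_id] for _id in predicted_ids.cpu().tolist()]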

License

The hubert-large-superb-er model is licensed under the Apache 2.0 License. This allows for both personal and commercial use, modification, and distribution of the model.
