hubert-large-superb-er
Introduction
The hubert-large-superb-er model is a ported version of S3PRL's HuBERT for the SUPERB (Speech processing Universal PERformance Benchmark) Emotion Recognition (ER) task. It is based on hubert-large-ll60k, pretrained on 16 kHz sampled speech audio, and is intended for classifying emotions in speech.
Architecture
The model uses the HuBERT architecture in its large variant (hubert-large-ll60k). Input speech must be sampled at 16 kHz, the rate used during pretraining, for the model to produce meaningful predictions.
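If your audio is stored at a different sample rate, resample it to 16 kHz before inference. A minimal sketch using librosa (the file path is a placeholder):

```python
import librosa

# librosa.load decodes the file and resamples it to the requested rate;
# "speech.wav" is a placeholder path.
speech, sr = librosa.load("speech.wav", sr=16000, mono=True)
assert sr == 16000  # matches the model's pretraining sample rate
```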
Training
The model is fine-tuned on the IEMOCAP dataset, a widely used resource for emotion recognition. Following the conventional evaluation protocol, unbalanced emotion classes are dropped so that four classes with a similar number of data points remain, and cross-validation is performed over the five standard splits. For detailed training instructions, refer to the S3PRL downstream task documentation.
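To make the evaluation protocol concrete, the sketch below illustrates leave-one-session-out cross-validation, assuming the five standard folds correspond to IEMOCAP's five recording sessions; train_and_evaluate is a hypothetical placeholder, not an S3PRL API:

```python
# Hypothetical sketch: average accuracy over five leave-one-session-out folds.
def cross_validate(train_and_evaluate):
    sessions = [f"session{i}" for i in range(1, 6)]
    scores = []
    for held_out in sessions:
        train_sessions = [s for s in sessions if s != held_out]
        scores.append(train_and_evaluate(train_sessions, held_out))
    return sum(scores) / len(scores)  # mean accuracy across folds
```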
Guide: Running Locally
You can run the model locally using Python and the transformers library. Below are the basic steps:
- Install dependencies:

  ```bash
  pip install transformers datasets librosa
  ```
- Load the dataset and model:

  ```python
  from datasets import load_dataset
  from transformers import pipeline

  # Load a demo split of the SUPERB ER data.
  dataset = load_dataset("anton-l/superb_demo", "er", split="session1")

  # Build an audio-classification pipeline on top of the fine-tuned model.
  classifier = pipeline("audio-classification", model="superb/hubert-large-superb-er")
  labels = classifier(dataset[0]["file"], top_k=5)
  ```
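  The pipeline returns the top-k predictions as a list of dictionaries, each with a label and a score; a short example of printing them:

  ```python
  # Each prediction is a dict with a "label" and a "score".
  for prediction in labels:
      print(f"{prediction['label']}: {prediction['score']:.3f}")
  ```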
- Use the model directly:

  ```python
  import torch
  import librosa
  from datasets import load_dataset
  from transformers import HubertForSequenceClassification, Wav2Vec2FeatureExtractor

  # Decode each file to a 16 kHz mono waveform.
  def map_to_array(example):
      speech, _ = librosa.load(example["file"], sr=16000, mono=True)
      example["speech"] = speech
      return example

  dataset = load_dataset("anton-l/superb_demo", "er", split="session1")
  dataset = dataset.map(map_to_array)

  model = HubertForSequenceClassification.from_pretrained("superb/hubert-large-superb-er")
  feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained("superb/hubert-large-superb-er")

  # Batch the first four utterances, pad them to a common length, and classify.
  inputs = feature_extractor(dataset[:4]["speech"], sampling_rate=16000, padding=True, return_tensors="pt")
  logits = model(**inputs).logits
  predicted_ids = torch.argmax(logits, dim=-1)
  labels = [model.config.id2label[_id] for _id in predicted_ids.tolist()]
  ```
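  If you need class probabilities rather than hard predictions, the logits can be normalized with a softmax; a minimal sketch continuing the snippet above:

  ```python
  # Normalize raw logits into per-class probabilities.
  probabilities = torch.softmax(logits, dim=-1)
  for probs in probabilities:
      top_id = int(torch.argmax(probs))
      print(model.config.id2label[top_id], f"{probs[top_id].item():.3f}")
  ```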
- Cloud GPUs: for more intensive workloads, consider cloud GPU services such as AWS, Google Cloud, or Azure; the sketch below shows how to move inference onto a GPU.
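Whether locally or on a cloud instance, inference runs faster on a CUDA device when one is available; a minimal sketch continuing the direct-usage example above:

```python
import torch

# Move the model and the batched inputs to a GPU when one is available,
# falling back to CPU otherwise.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)
inputs = {name: tensor.to(device) for name, tensor in inputs.items()}
with torch.no_grad():
    logits = model(**inputs).logits
```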
License
The hubert-large-superb-er model is licensed under the Apache License 2.0, which permits personal and commercial use, modification, and distribution.