lang id voxlingua107 ecapa

speechbrain

Introduction

The VoxLingua107 ECAPA-TDNN model is a spoken language identification system trained on the VoxLingua107 dataset with the SpeechBrain toolkit. It classifies a speech utterance into one of 107 languages.

Architecture

The model uses the ECAPA-TDNN architecture, which is typically used for speaker recognition. To improve language identification performance, it adds fully connected hidden layers after the embedding layer. The model is trained with cross-entropy loss on recordings sampled at 16 kHz.
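To illustrate the cross-entropy objective mentioned above, here is a minimal sketch in plain Python. It is not the SpeechBrain training code: the function names, the toy scores, and the class index 94 are hypothetical and chosen only for illustration. A real model would produce one score per language from its classifier layers.

```python
import math

NUM_LANGUAGES = 107  # number of classes in VoxLingua107


def log_softmax(logits):
    """Convert raw class scores into log-probabilities (numerically stable)."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_norm for x in logits]


def cross_entropy(logits, target_index):
    """Negative log-probability assigned to the correct language."""
    return -log_softmax(logits)[target_index]


# Toy scores: every class gets 0.0 except one strongly favored class.
# The index 94 is a hypothetical choice, not a documented label mapping.
logits = [0.0] * NUM_LANGUAGES
logits[94] = 5.0

loss_correct = cross_entropy(logits, 94)  # lower loss: favored class is the target
loss_wrong = cross_entropy(logits, 3)     # higher loss: target is another class
```

Training pushes the loss down by raising the score of the reference language relative to the other 106 classes.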

Training

The model is trained on the VoxLingua107 dataset, which comprises 6,628 hours of speech from YouTube videos and covers 107 languages. The dataset was built by labeling speech segments based on video titles and descriptions, followed by post-processing to filter out false positives. Training follows the SpeechBrain recipe, and the model achieves a 6.7% error rate on the VoxLingua107 development set.
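Assuming the reported figure is a plain classification error rate, it is the fraction of development utterances whose predicted language differs from the reference label. A minimal sketch follows. The function name is hypothetical and the labels are made up for illustration; they are not VoxLingua107 results.

```python
def error_rate(predicted, reference):
    """Fraction of utterances whose predicted language differs from the reference."""
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")
    if not reference:
        raise ValueError("at least one utterance is required")
    errors = sum(p != r for p, r in zip(predicted, reference))
    return errors / len(reference)


# Made-up labels for four utterances; the last one is misclassified.
reference = ["th", "en", "et", "fi"]
predicted = ["th", "en", "et", "et"]

rate = error_rate(predicted, reference)
print(f"{rate:.1%}")  # prints 25.0%
```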

Guide: Running Locally

  1. Install SpeechBrain:
    Use the command:

    pip install git+https://github.com/speechbrain/speechbrain.git@develop
    
  2. Import Required Libraries:

    import torchaudio
    from speechbrain.inference.classifiers import EncoderClassifier
    
  3. Load and Use the Model:

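    # Download the pretrained model from the Hugging Face Hub and store it in ./tmp
    language_id = EncoderClassifier.from_hparams(source="speechbrain/lang-id-voxlingua107-ecapa", savedir="tmp")
    # Load the sample recording that ships with the model repository
    signal = language_id.load_audio("speechbrain/lang-id-voxlingua107-ecapa/udhr_th.wav")
    # Classify the utterance and print the raw prediction
    prediction = language_id.classify_batch(signal)
    print(prediction)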
    
  4. Run Inference on a GPU: Pass run_opts={"device": "cuda"} to the from_hparams method to run the model on a GPU.

  5. Cloud GPU Suggestion: Consider using cloud services like AWS or Google Cloud for access to GPUs, which can speed up processing.
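The steps above can be combined into a small script that falls back to the CPU when no GPU is available and turns the raw prediction into a readable result. This is a sketch under stated assumptions, not an official SpeechBrain example: pick_device, load_classifier, and summarize are hypothetical helper names, and summarize assumes that classify_batch returns a tuple laid out as (log-probabilities, best log-score, best index, text labels).

```python
import math

MODEL_SOURCE = "speechbrain/lang-id-voxlingua107-ecapa"


def pick_device():
    """Return "cuda" when PyTorch reports a usable GPU, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


def load_classifier(savedir="tmp"):
    """Load the model on the selected device (downloads files on first use)."""
    from speechbrain.inference.classifiers import EncoderClassifier

    return EncoderClassifier.from_hparams(
        source=MODEL_SOURCE,
        savedir=savedir,
        run_opts={"device": pick_device()},
    )


def summarize(prediction):
    """Return the top label and its linear-scale likelihood.

    Assumption: prediction is (log-probabilities, best log-score,
    best index, text labels), with one entry per utterance in the batch.
    """
    _, best_score, _, labels = prediction
    return labels[0], math.exp(float(best_score[0]))


# Usage (requires SpeechBrain and network access):
#   classifier = load_classifier()
#   signal = classifier.load_audio(MODEL_SOURCE + "/udhr_th.wav")
#   label, likelihood = summarize(classifier.classify_batch(signal))
#   print(label, likelihood)
```

Exponentiating the best score converts it from the log domain to a value between 0 and 1, which is easier to read than the raw tensor output.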

License

The model and associated content are licensed under the Apache-2.0 license.
