SepFormer WHAM16k Enhancement

SpeechBrain

Introduction

The SepFormer model, implemented with SpeechBrain, performs speech enhancement (denoising). It is pre-trained on the WHAM! dataset at a 16 kHz sampling rate, aims to improve speech quality in noisy environments, and is available on Hugging Face.

Architecture

The model employs the SepFormer architecture, a transformer-based source-separation network. It is trained on the WHAM! dataset, which includes environmental noise and reverberation. The model achieves 14.3 dB SI-SNR on the WHAM! test set.
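SI-SNR (scale-invariant signal-to-noise ratio) scores an enhanced signal against the clean reference after projecting out any overall gain, so a louder or quieter output is not penalized. A minimal sketch of the metric follows; the function name and `eps` guard are illustrative, not part of SpeechBrain's API:

```python
import numpy as np

def si_snr(est, ref, eps=1e-8):
    """Scale-invariant SNR in dB between an estimate and a clean reference."""
    est = est - est.mean()  # zero-mean both signals first
    ref = ref - ref.mean()
    # Project the estimate onto the reference; the projection is the "signal" part.
    proj = (np.dot(est, ref) / (np.dot(ref, ref) + eps)) * ref
    noise = est - proj  # whatever is left counts as "noise"
    return 10.0 * np.log10(np.dot(proj, proj) / (np.dot(noise, noise) + eps))

rng = np.random.default_rng(0)
clean = rng.standard_normal(16000)          # one second of "clean" speech at 16 kHz
noisy = clean + 0.2 * rng.standard_normal(16000)
print(si_snr(noisy, clean))                 # roughly 14 dB at this noise level
```

Because of the projection step, `si_snr(3 * noisy, clean)` gives essentially the same score as `si_snr(noisy, clean)`, which is the "scale-invariant" part of the name.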

Training

Currently, the training script is under development and not yet available. Updates will be provided once the script is finalized. Training results, including models and logs, can be accessed on Google Drive.

Guide: Running Locally

  1. Install SpeechBrain:

    pip install speechbrain
    
  2. Perform Speech Enhancement:

    from speechbrain.pretrained import SepformerSeparation as separator
    import torchaudio
    
    model = separator.from_hparams(
        source="speechbrain/sepformer-wham16k-enhancement",
        savedir="pretrained_models/sepformer-wham16k-enhancement",
    )
    est_sources = model.separate_file(
        path="speechbrain/sepformer-wham16k-enhancement/example_wham16k.wav"
    )
    # est_sources: (batch, time, num_sources); keep source 0 as a mono channel
    torchaudio.save("enhanced_wham16k.wav", est_sources[:, :, 0].detach().cpu(), 16000)
    
  3. Inference on GPU:

    • Pass run_opts={"device": "cuda"} to from_hparams to load the model on a GPU and run inference there.
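The indexing in step 2 is worth unpacking: est_sources is assumed (per SpeechBrain's separation interface) to have shape (batch, time, num_sources), and this enhancement model produces a single source. Slicing [:, :, 0] therefore yields a (1, time) array, which torchaudio.save reads as one channel of audio frames. A NumPy stand-in illustrates the shapes without loading the model:

```python
import numpy as np

# Dummy stand-in for the output of model.separate_file in step 2.
# Shape assumed from the SpeechBrain separation API: (batch, time, num_sources).
est_sources = np.zeros((1, 16000, 1), dtype=np.float32)

# Selecting source 0 gives (batch, time) == (1, 16000), i.e. one mono
# channel of 16000 frames in the layout torchaudio.save expects.
enhanced = est_sources[:, :, 0]
print(enhanced.shape)  # (1, 16000)
```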

For optimal performance, consider utilizing cloud GPUs from providers like AWS, Google Cloud, or Azure.

License

The SepFormer model is released under the Apache 2.0 license, allowing for both personal and commercial use.
