SepFormer-WHAM-Enhancement

SpeechBrain

Introduction

The SepFormer-WHAM-Enhancement model, developed with SpeechBrain, performs speech enhancement (denoising). It is pretrained on the WHAM! dataset, which pairs speech with real environmental noise recordings, and operates at an 8 kHz sampling rate. On the WHAM! test set it achieves 14.35 dB SI-SNR.
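
For context, SI-SNR (scale-invariant signal-to-noise ratio) measures how much of the enhanced signal's energy lies along the clean reference after an optimal rescaling. The helper below is a minimal sketch of the standard definition, written for this card rather than taken from SpeechBrain's codebase:

    import torch

    def si_snr(est: torch.Tensor, ref: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
        """Scale-invariant SNR in dB; est and ref have shape (..., time)."""
        est = est - est.mean(dim=-1, keepdim=True)  # remove DC offset
        ref = ref - ref.mean(dim=-1, keepdim=True)
        # Project the estimate onto the reference to get the scaled target.
        scale = (est * ref).sum(-1, keepdim=True) / (ref.pow(2).sum(-1, keepdim=True) + eps)
        s_target = scale * ref
        e_noise = est - s_target
        return 10 * torch.log10(s_target.pow(2).sum(-1) / (e_noise.pow(2).sum(-1) + eps))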

Architecture

The model uses the SepFormer architecture, a transformer-based masking network implemented in PyTorch as part of the SpeechBrain toolkit. Originally proposed for source separation, SepFormer is applied here to the audio-to-audio task of speech enhancement, treating the clean speech as the single source to estimate.
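
SepFormer's central idea is dual-path processing: the encoded signal is split into chunks, and transformer layers alternate between modeling positions within each chunk and the same position across chunks. The block below is an illustrative sketch of that idea only; the class name and dimensions are ours, not SpeechBrain's implementation:

    import torch
    import torch.nn as nn

    class DualPathBlock(nn.Module):
        def __init__(self, d_model: int = 64, nhead: int = 4):
            super().__init__()
            self.intra = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
            self.inter = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # x: (batch, n_chunks, chunk_len, d_model)
            b, s, k, d = x.shape
            x = self.intra(x.reshape(b * s, k, d)).reshape(b, s, k, d)  # within each chunk
            x = x.transpose(1, 2)                                       # (b, k, s, d)
            x = self.inter(x.reshape(b * k, s, d)).reshape(b, k, s, d)  # across chunks
            return x.transpose(1, 2)                                    # back to (b, s, k, d)

    chunks = torch.randn(2, 10, 16, 64)   # dummy chunked encoder output
    print(DualPathBlock()(chunks).shape)  # torch.Size([2, 10, 16, 64])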

Training

The training recipe is still under development in an open pull request. Models and training logs are accessible through a shared Google Drive link, and this model card will be updated with full training details once they are finalized.

Guide: Running Locally

  1. Install SpeechBrain:
    Execute pip install speechbrain to set up the necessary toolkit.

  2. Perform Speech Enhancement:

    from speechbrain.inference.separation import SepformerSeparation as separator
    import torchaudio

    # Download the pretrained model from the Hugging Face Hub and cache it locally.
    model = separator.from_hparams(
        source="speechbrain/sepformer-wham-enhancement",
        savedir="pretrained_models/sepformer-wham-enhancement",
    )

    # Enhance the bundled example file; the output has shape (batch, time, n_sources).
    est_sources = model.separate_file(path="speechbrain/sepformer-wham-enhancement/example_wham.wav")

    # Save the single estimated source at the model's 8 kHz sampling rate.
    torchaudio.save("enhanced_wham.wav", est_sources[:, :, 0].detach().cpu(), 8000)
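
    To enhance audio that is already loaded in memory, the same pretrained object also provides a separate_batch method. The sketch below assumes a mono input already sampled at 8 kHz, matching the model's training conditions; the filename is hypothetical:

    noisy, fs = torchaudio.load("my_noisy_recording.wav")  # hypothetical input file
    assert fs == 8000, "resample to 8 kHz before enhancement"
    enhanced = model.separate_batch(noisy)                 # (batch, time, n_sources)
    torchaudio.save("enhanced_batch.wav", enhanced[:, :, 0].detach().cpu(), 8000)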
    
  3. Inference on GPU:
    Add run_opts={"device":"cuda"} to the from_hparams call to run inference on a GPU, as in the sketch below.
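
    # Same call as in step 2, with run_opts added to place the model on the GPU.
    model = separator.from_hparams(
        source="speechbrain/sepformer-wham-enhancement",
        savedir="pretrained_models/sepformer-wham-enhancement",
        run_opts={"device": "cuda"},
    )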

  4. Cloud GPUs:
    Consider using cloud platforms like AWS, Google Cloud, or Azure to access GPU resources if local hardware is insufficient.

License

The SepFormer-WHAM-Enhancement model is released under the Apache 2.0 license, which permits commercial use, modification, and redistribution, provided the license text and copyright notices are preserved.
