SepFormer-WHAM-Enhancement
Introduction
The SepFormer-WHAM-Enhancement model, developed with SpeechBrain, performs speech enhancement (denoising). Pretrained on the WHAM! dataset, which features environmental noise and reverberation, the model operates at an 8 kHz sampling rate and achieves 14.35 dB SI-SNR on the WHAM! test set.
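For reference, SI-SNR (scale-invariant signal-to-noise ratio) measures how closely the enhanced signal matches the clean target after removing any overall scaling difference. Below is a minimal sketch of the metric in PyTorch; the function and variable names are illustrative, not taken from the SpeechBrain codebase.

```python
import torch

def si_snr(estimate: torch.Tensor, target: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Scale-invariant SNR in dB for two 1-D waveforms of equal length."""
    # Remove DC offsets so the measure is invariant to constant shifts.
    estimate = estimate - estimate.mean()
    target = target - target.mean()
    # Project the estimate onto the target; scaling the target this way
    # makes the metric invariant to the estimate's overall gain.
    s_target = (torch.dot(estimate, target) / (torch.dot(target, target) + eps)) * target
    e_noise = estimate - s_target
    return 10 * torch.log10(s_target.pow(2).sum() / (e_noise.pow(2).sum() + eps))
```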
Architecture
The model uses the SepFormer architecture from the SpeechBrain toolkit, which applies transformer-based modeling to enhance speech signals. Implemented in PyTorch, it targets audio-to-audio tasks, specifically speech enhancement.
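As a quick check that the pretrained model is an ordinary PyTorch module, the sketch below loads it and counts its trainable parameters. This assumes SpeechBrain's pretrained interface exposes its underlying modules via a `.mods` attribute, as recent releases do; treat it as a sketch rather than a guaranteed API.

```python
from speechbrain.inference.separation import SepformerSeparation as separator

model = separator.from_hparams(
    source="speechbrain/sepformer-wham-enhancement",
    savedir="pretrained_models/sepformer-wham-enhancement",
)
# model.mods is a torch.nn.ModuleDict holding the model's submodules;
# sum the trainable parameters across all of them.
n_params = sum(p.numel() for p in model.mods.parameters() if p.requires_grad)
print(f"Trainable parameters: {n_params:,}")
```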
Training
The training script is currently under development in an open pull request. Training results, including models and logs, are available through a shared Google Drive link; the model card will be updated once the training recipe is finalized.
Guide: Running Locally
- Install SpeechBrain: execute `pip install speechbrain` to set up the necessary toolkit.
- Perform Speech Enhancement:

```python
from speechbrain.inference.separation import SepformerSeparation as separator
import torchaudio

# Download the pretrained model and cache it locally.
model = separator.from_hparams(
    source="speechbrain/sepformer-wham-enhancement",
    savedir="pretrained_models/sepformer-wham-enhancement",
)

# Enhance an example noisy file and save the result at 8 kHz.
est_sources = model.separate_file(path="speechbrain/sepformer-wham-enhancement/example_wham.wav")
torchaudio.save("enhanced_wham.wav", est_sources[:, :, 0].detach().cpu(), 8000)
```
- Inference on GPU: add `run_opts={"device": "cuda"}` to the `from_hparams` call to utilize GPU resources for faster inference (see the sketch after this list).
- Cloud GPUs: consider using cloud platforms like AWS, Google Cloud, or Azure to access GPU resources if local hardware is insufficient.
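As referenced in the GPU step above, here is a minimal end-to-end sketch of GPU inference, assuming a CUDA device is available; it reuses the checkpoint and example file from the enhancement step.

```python
from speechbrain.inference.separation import SepformerSeparation as separator
import torchaudio

# run_opts places the model (and subsequent computation) on the GPU.
model = separator.from_hparams(
    source="speechbrain/sepformer-wham-enhancement",
    savedir="pretrained_models/sepformer-wham-enhancement",
    run_opts={"device": "cuda"},
)
est_sources = model.separate_file(path="speechbrain/sepformer-wham-enhancement/example_wham.wav")
# Move the enhanced source back to the CPU before saving at 8 kHz.
torchaudio.save("enhanced_wham.wav", est_sources[:, :, 0].detach().cpu(), 8000)
```

For waveforms already loaded as tensors, the same class also provides `separate_batch`, which takes a batch of 8 kHz mixtures of shape [batch, time] and returns the enhanced signals.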
License
The SepFormer-WHAM-Enhancement model is released under the Apache 2.0 license, which allows for wide use and distribution with appropriate credit.