SepFormer WHAM! 16 kHz Speech Enhancement
Introduction
The SepFormer model, implemented with SpeechBrain, is designed for speech enhancement and denoising tasks. It is pre-trained on the WHAM! dataset at a 16kHz sampling frequency. This model aims to improve speech quality in noisy environments and is available on Hugging Face.
Architecture
The model employs the SepFormer architecture, a transformer-based source-separation network, here applied to denoising. It is trained on the WHAM! dataset, which includes environmental noise and reverberation, and achieves 14.3 dB SI-SNR on the test set.
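To make the reported metric concrete, here is a minimal sketch of scale-invariant SNR (SI-SNR), the measure behind the 14.3 dB figure. This is an illustrative NumPy implementation, not SpeechBrain's own; the `si_snr` helper and the synthetic signals are assumptions for the example.

```python
import numpy as np

def si_snr(estimate, reference, eps=1e-8):
    """Scale-invariant SNR in dB: project the estimate onto the reference,
    then compare the projected target energy to the residual energy."""
    estimate = estimate - np.mean(estimate)
    reference = reference - np.mean(reference)
    # Scale-invariant target: the component of the estimate along the reference.
    s_target = (np.dot(estimate, reference) /
                (np.dot(reference, reference) + eps)) * reference
    e_noise = estimate - s_target
    return 10 * np.log10((np.sum(s_target ** 2) + eps) /
                         (np.sum(e_noise ** 2) + eps))

rng = np.random.default_rng(0)
clean = rng.standard_normal(16000)                 # 1 s of "speech" at 16 kHz
noisy = clean + 0.1 * rng.standard_normal(16000)   # noise power ~1% of signal
print(si_snr(noisy, clean))                        # roughly 20 dB at this noise level
```

Because the target is a projection, rescaling the estimate leaves the score unchanged, which is what makes the metric scale-invariant.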
Training
Currently, the training script is under development and not yet available. Updates will be provided once the script is finalized. Training results, including models and logs, can be accessed on Google Drive.
Guide: Running Locally
- Install SpeechBrain:

      pip install speechbrain
- Perform speech enhancement:

      import torchaudio
      from speechbrain.pretrained import SepformerSeparation as separator

      model = separator.from_hparams(
          source="speechbrain/sepformer-wham16k-enhancement",
          savedir="pretrained_models/sepformer-wham16k-enhancement",
      )
      est_sources = model.separate_file(
          path="speechbrain/sepformer-wham16k-enhancement/example_wham16k.wav"
      )
      torchaudio.save("enhanced_wham16k.wav", est_sources[:, :, 0].detach().cpu(), 16000)
- Inference on GPU:

  Pass run_opts={"device": "cuda"} to the from_hparams method to run inference on a GPU.

For optimal performance, consider cloud GPUs from providers such as AWS, Google Cloud, or Azure.
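The GPU option above can be sketched as follows. The model source and savedir match the earlier example; pick_device is a hypothetical helper added here so the code falls back to CPU when no GPU is visible.

```python
import torch

def pick_device():
    """Choose "cuda" when a GPU is available, otherwise fall back to "cpu"."""
    return "cuda" if torch.cuda.is_available() else "cpu"

def load_enhancer(savedir="pretrained_models/sepformer-wham16k-enhancement"):
    """Load the pretrained enhancer on the selected device via run_opts."""
    # Imported here so pick_device() works even without SpeechBrain installed.
    from speechbrain.pretrained import SepformerSeparation as separator
    return separator.from_hparams(
        source="speechbrain/sepformer-wham16k-enhancement",
        savedir=savedir,
        run_opts={"device": pick_device()},
    )
```

For audio already loaded in memory, the returned model also exposes separate_batch, which takes a [batch, time] waveform tensor at 16 kHz instead of a file path.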
License
The SepFormer model is released under the Apache 2.0 license, allowing for both personal and commercial use.