bigvgan_v2_22khz_80band_256x

nvidia

Introduction

BigVGAN is a universal neural vocoder developed by NVIDIA and designed for large-scale training on diverse audio. It generates high-quality audio waveforms from mel spectrograms. The model is implemented in PyTorch and provides an optional custom CUDA kernel for faster inference.

Architecture

BigVGAN provides an optional custom CUDA kernel that fuses the upsampling, activation, and downsampling operations of its anti-aliased activation into a single step. Fusing these operations speeds up inference when the model runs on a compatible NVIDIA GPU.
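
The upsampling, activation, and downsampling sequence can be illustrated with a small pure-Python sketch. It is not the fused CUDA kernel and not the repository's implementation: the function names, the Hann-windowed sinc filter, the 13-tap length, and the Snake activation with `alpha=1.0` are assumptions chosen for illustration (the Snake form follows the BigVGAN paper's periodic activation).

```python
import math

def snake(x, alpha=1.0):
    """Periodic Snake activation: x + sin^2(alpha * x) / alpha."""
    return x + math.sin(alpha * x) ** 2 / alpha

def lowpass_kernel(taps=13, cutoff=0.25):
    """Hann-windowed sinc low-pass filter, normalized to unit DC gain.

    cutoff is in cycles per sample; 0.25 matches the original Nyquist
    frequency once the signal has been upsampled by 2.
    """
    half = (taps - 1) / 2
    kernel = []
    for n in range(taps):
        t = n - half
        ideal = 2 * cutoff if t == 0 else math.sin(2 * math.pi * cutoff * t) / (math.pi * t)
        window = 0.5 - 0.5 * math.cos(2 * math.pi * n / (taps - 1))
        kernel.append(ideal * window)
    total = sum(kernel)
    return [k / total for k in kernel]

def filter_same(signal, kernel):
    """Apply a symmetric FIR kernel with zero padding; output length equals input length."""
    half = len(kernel) // 2
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for j, k in enumerate(kernel):
            idx = i + j - half
            if 0 <= idx < len(signal):
                acc += k * signal[idx]
        out.append(acc)
    return out

def antialiased_activation(signal, alpha=1.0):
    """Upsample 2x, apply the nonlinearity, then low-pass and downsample 2x."""
    kernel = lowpass_kernel()
    # 1. Upsample: insert a zero after every sample, then low-pass.
    #    The factor 2 restores the amplitude lost to zero insertion.
    stuffed = []
    for s in signal:
        stuffed.extend([s, 0.0])
    upsampled = [2.0 * v for v in filter_same(stuffed, kernel)]
    # 2. Apply the activation at the higher rate, where the harmonics it
    #    creates have more headroom before they alias.
    activated = [snake(v, alpha) for v in upsampled]
    # 3. Low-pass again and keep every second sample.
    return filter_same(activated, kernel)[::2]
```

Away from the edges, a constant input `c` comes out close to `snake(c)`, which is a quick sanity check that the two filters preserve the signal level.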

Training

BigVGAN-v2 was trained at large scale on a diverse set of audio datasets. Training uses a multi-scale sub-band CQT discriminator and a multi-scale mel spectrogram loss to improve audio quality. BigVGAN-v2 checkpoints also support higher sampling rates and upsampling ratios than the original release (up to 44 kHz and 512x), so a checkpoint can be chosen to match the target audio format.
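
The checkpoint name `bigvgan_v2_22khz_80band_256x` encodes the sampling rate, the number of mel bands, and the upsampling ratio. The sketch below shows the arithmetic this implies; the constants are assumptions read off the name (22 kHz taken as 22,050 Hz), and the helper names are hypothetical. In practice the values should be read from `model.h`.

```python
import math

SAMPLING_RATE = 22050  # Hz; assumed from "22khz" in the checkpoint name
NUM_MELS = 80          # mel bands; from "80band"
HOP_SIZE = 256         # samples generated per mel frame; from "256x"

def audio_length(num_frames):
    """Return (num_samples, seconds) generated from num_frames mel frames."""
    num_samples = num_frames * HOP_SIZE
    return num_samples, num_samples / SAMPLING_RATE

def frames_for_duration(seconds):
    """Return the number of mel frames needed to cover at least `seconds` of audio."""
    return math.ceil(seconds * SAMPLING_RATE / HOP_SIZE)

samples, seconds = audio_length(100)
print(f"mel [1, {NUM_MELS}, 100] -> waveform [1, 1, {samples}] ({seconds:.3f} s)")
```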

Guide: Running Locally

  1. Prerequisites: Install Git LFS, then clone the repository:

    git lfs install
    git clone https://huggingface.co/nvidia/bigvgan_v2_22khz_80band_256x
    
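  2. Setup Environment: Install PyTorch and librosa, plus the CUDA dependencies if you plan to use GPU acceleration. Run the following steps from inside the cloned directory so that the bigvgan and meldataset modules can be imported.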

  3. Load Pretrained Model:

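    import torch
    import bigvgan
    
    # Fall back to the CPU when no GPU is available.
    device = 'cuda' if torch.cuda.is_available() else 'cpu'
    
    # The fused CUDA kernel is optional; it needs a GPU and is built with nvcc and ninja on first use.
    model = bigvgan.BigVGAN.from_pretrained('nvidia/bigvgan_v2_22khz_80band_256x', use_cuda_kernel=(device == 'cuda'))
    
    # Remove weight norm and switch to eval mode for inference.
    model.remove_weight_norm()
    model = model.eval().to(device)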
    
  4. Inference Pipeline: Load your audio file, compute its mel spectrogram, and generate audio:

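    import librosa
    from meldataset import get_mel_spectrogram
    
    # Load the audio as mono, resampled to the model's sampling rate.
    wav_path = '/path/to/your/audio.wav'
    wav, sr = librosa.load(wav_path, sr=model.h.sampling_rate, mono=True)
    wav = torch.FloatTensor(wav).unsqueeze(0)  # shape [1, T_time]
    
    # Compute the mel spectrogram with the model's own settings.
    mel = get_mel_spectrogram(wav, model.h).to(device)  # shape [1, C_mel, T_frame]
    
    # Generate the waveform from the mel spectrogram.
    with torch.inference_mode():
        wav_gen = model(mel)  # shape [1, 1, T_time], values in [-1, 1]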
    
  5. Recommended Cloud GPUs: For faster inference, consider a cloud instance with a data-center GPU such as the NVIDIA A100.
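
The waveform produced in step 4 is a float tensor with values in [-1, 1], so it usually has to be converted to 16-bit PCM before it is saved or played. The following is a minimal sketch using only the Python standard library; `float_to_pcm16` and `write_wav` are hypothetical helper names, not part of the BigVGAN repository.

```python
import struct
import wave

def float_to_pcm16(samples):
    """Clip floats to [-1, 1] and scale them to signed 16-bit integers."""
    pcm = []
    for s in samples:
        s = max(-1.0, min(1.0, s))
        pcm.append(int(round(s * 32767)))
    return pcm

def write_wav(path, samples, sample_rate):
    """Write a mono 16-bit PCM WAV file from a sequence of floats."""
    pcm = float_to_pcm16(samples)
    with wave.open(path, "wb") as f:
        f.setnchannels(1)   # mono
        f.setsampwidth(2)   # 2 bytes = 16 bits per sample
        f.setframerate(sample_rate)
        f.writeframes(struct.pack("<%dh" % len(pcm), *pcm))
```

With the variables from step 4, a call such as `write_wav('generated.wav', wav_gen.squeeze().cpu().tolist(), model.h.sampling_rate)` would save the generated audio. Converting through a Python list is slow for long clips, so a tensor-based conversion may be preferable there.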

License

BigVGAN is released under the MIT License; the full license text is included with the repository. This permissive license allows modification, distribution, and private use, subject to its terms and conditions.
