bigvgan_v2_44khz_128band_512x

nvidia

Introduction

BigVGAN is a universal neural vocoder trained at large scale, developed by Sang-Gil Lee, Wei Ping, Boris Ginsburg, Bryan Catanzaro, and Sungroh Yoon. It takes a mel spectrogram as input and generates the corresponding audio waveform. Pretrained checkpoints are available in several audio configurations; this one, bigvgan_v2_44khz_128band_512x, uses a 44 kHz sampling rate, 128 mel bands, and a 512x upsampling ratio.

Architecture

BigVGAN includes a custom CUDA kernel for accelerated inference: a fused upsampling and activation kernel, for which the authors report 1.5 to 3x faster inference on a single NVIDIA A100 GPU. Separately, the model is trained with a multi-scale sub-band CQT discriminator and a multi-scale mel spectrogram loss. These training components target the quality of the generated audio, not inference speed.
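
Inference speed depends on the GPU and on input length, so it can be worth measuring on your own hardware. The following is a minimal timing sketch, not part of BigVGAN: `benchmark` and its parameters are hypothetical helper names, and the two workloads are CPU stand-ins for a model call.

```python
import time
from statistics import median


def benchmark(fn, n_warmup=3, n_runs=10, sync=None):
    """Return the median wall-clock seconds per call of fn().

    sync is an optional callable that blocks until pending device work
    has finished, so that asynchronous GPU calls are timed correctly.
    """
    # Warm-up calls keep one-time setup costs out of the measurement.
    for _ in range(n_warmup):
        fn()
    timings = []
    for _ in range(n_runs):
        if sync is not None:
            sync()
        start = time.perf_counter()
        fn()
        if sync is not None:
            sync()
        timings.append(time.perf_counter() - start)
    return median(timings)


# CPU stand-ins for a slower and a faster code path.
slow = benchmark(lambda: sum(i * i for i in range(200_000)))
fast = benchmark(lambda: sum(i * i for i in range(20_000)))
print(f"speedup: {slow / fast:.1f}x")
```

With a loaded model and a mel spectrogram on the GPU, the same helper could be called as `benchmark(lambda: model(mel), sync=torch.cuda.synchronize)`, once for each setting of use_cuda_kernel.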

Training

BigVGAN is trained on a large-scale compilation of datasets covering diverse audio types, including speech in multiple languages, environmental sounds, and musical instruments. Pretrained checkpoints support sampling rates up to 44 kHz and upsampling ratios up to 512x; this checkpoint uses both maximums.
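
To make the upsampling ratio concrete: each mel frame is expanded into 512 waveform samples, so the frame count fixes the output length. The sketch below works through that arithmetic. It assumes that "44khz" means 44,100 Hz and that the mel hop size equals the upsampling ratio; the function names are illustrative only.

```python
SAMPLING_RATE = 44_100   # assumption: "44khz" read as 44,100 Hz
UPSAMPLE_RATIO = 512     # the "512x" in the checkpoint name


def mel_frames_to_samples(n_frames, ratio=UPSAMPLE_RATIO):
    """Number of waveform samples generated from n_frames mel frames."""
    return n_frames * ratio


def frames_per_second(sampling_rate=SAMPLING_RATE, ratio=UPSAMPLE_RATIO):
    """Number of mel frames needed per second of audio."""
    return sampling_rate / ratio


def duration_seconds(n_frames, sampling_rate=SAMPLING_RATE, ratio=UPSAMPLE_RATIO):
    """Duration of the audio generated from n_frames mel frames."""
    return mel_frames_to_samples(n_frames, ratio) / sampling_rate


n_frames = 862
print(mel_frames_to_samples(n_frames))        # 441344
print(round(frames_per_second(), 2))          # 86.13
print(round(duration_seconds(n_frames), 2))   # 10.01
```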

Guide: Running Locally

  1. Installation

    • Install Git LFS if it is not already present, then enable it: git lfs install
    • Clone the repository:
      git clone https://huggingface.co/nvidia/bigvgan_v2_44khz_128band_512x
      
  2. Usage

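    • Load the pretrained model, compute the mel spectrogram of an input waveform, and generate the synthesized waveform from it. Run the code from inside the cloned repository so that the bigvgan and meldataset modules can be imported.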
    • Example code snippet:
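      import torch
      import librosa
      # bigvgan and meldataset are modules shipped in the cloned repository
      import bigvgan
      from meldataset import get_mel_spectrogram
      
      device = 'cuda'
      
      # Load the pretrained generator; use_cuda_kernel=False selects the plain PyTorch path
      model = bigvgan.BigVGAN.from_pretrained('nvidia/bigvgan_v2_44khz_128band_512x', use_cuda_kernel=False)
      # Remove weight normalization for inference, then switch to eval mode
      model.remove_weight_norm()
      model = model.eval().to(device)
      
      # Load the audio as mono, resampled to the model's sampling rate
      wav_path = '/path/to/your/audio.wav'
      wav, sr = librosa.load(wav_path, sr=model.h.sampling_rate, mono=True)  # np.ndarray, shape [T_time]
      wav = torch.FloatTensor(wav).unsqueeze(0)  # FloatTensor, shape [1, T_time]
      
      # Compute the mel spectrogram, shape [1, C_mel, T_frame]
      mel = get_mel_spectrogram(wav, model.h).to(device)
      # Generate the waveform from the mel spectrogram, shape [1, 1, T_time]
      with torch.inference_mode():
          wav_gen = model(mel)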
      
  3. Using Custom CUDA Kernel

    • For faster synthesis, load the model with use_cuda_kernel=True. The kernel is built on first use, which requires nvcc and ninja to be installed:
      model = bigvgan.BigVGAN.from_pretrained('nvidia/bigvgan_v2_44khz_128band_512x', use_cuda_kernel=True)
      
  4. Suggested Cloud GPUs

    • Consider an NVIDIA A100 or a comparable GPU for the fastest inference.
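
The usage example in step 2 stops at the generated tensor. To listen to the result, it can be written to a 16-bit PCM WAV file. The sketch below uses NumPy and the standard-library wave module; `save_pcm16_wav` is a hypothetical helper name, and a one-second sine tone stands in for the model output so that the block runs without a GPU.

```python
import os
import tempfile
import wave

import numpy as np


def save_pcm16_wav(path, samples, sampling_rate):
    """Write a mono float waveform in [-1, 1] to a 16-bit PCM WAV file.

    Returns the number of samples written.
    """
    # Flatten shapes such as [1, 1, T] or [1, T] to [T].
    samples = np.asarray(samples, dtype=np.float32).reshape(-1)
    # Clip before scaling so out-of-range values cannot wrap around.
    samples = np.clip(samples, -1.0, 1.0)
    pcm = (samples * 32767.0).astype("<i2")  # little-endian int16
    with wave.open(path, "wb") as f:
        f.setnchannels(1)
        f.setsampwidth(2)  # 2 bytes per sample = 16 bit
        f.setframerate(sampling_rate)
        f.writeframes(pcm.tobytes())
    return pcm.shape[0]


# Stand-in for the model output: one second of a 440 Hz tone.
sr = 44_100
t = np.arange(sr) / sr
tone = 0.5 * np.sin(2 * np.pi * 440.0 * t)

out_path = os.path.join(tempfile.gettempdir(), "bigvgan_demo.wav")
n_written = save_pcm16_wav(out_path, tone, sr)
print(n_written)
```

With the variables from step 2, the call would be `save_pcm16_wav('out.wav', wav_gen.squeeze().cpu().numpy(), model.h.sampling_rate)`.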

License

BigVGAN is licensed under the MIT License.
