Stable Audio Open 1.0

stabilityai

Introduction

Stable Audio Open 1.0 is a text-to-audio model developed by Stability AI. It generates stereo audio at 44.1 kHz, up to 47 seconds long, from text prompts. The model combines three components: an autoencoder, a T5-based text encoder, and a transformer-based diffusion model (DiT). Aimed at research and experimentation, it provides insight into AI-based music and audio generation.

Architecture

The architecture of Stable Audio Open 1.0 includes:

  • Autoencoder: Compresses waveforms to a manageable sequence length.
  • T5-based Text Embedding: For conditioning on text prompts.
  • Transformer-based Diffusion Model (DiT): Operates in the autoencoder's latent space, facilitating audio generation.
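The data flow through these three components can be illustrated with a toy sketch. Everything here is a stand-in: the dimensions, the update rule, and the helper functions are assumptions for illustration, not the real model internals.

```python
import numpy as np

SAMPLE_RATE = 44100   # output sample rate (from the model card)
LATENT_DIM = 64       # assumed latent channel count (illustrative)
DOWNSAMPLE = 2048     # assumed autoencoder compression factor (illustrative)

def encode_text(prompt: str) -> np.ndarray:
    """Stand-in for the frozen T5 text encoder: prompt -> embedding sequence."""
    rng = np.random.default_rng(abs(hash(prompt)) % 2**32)
    return rng.standard_normal((len(prompt.split()), 768))

def denoise(latents: np.ndarray, text_emb: np.ndarray, steps: int = 10) -> np.ndarray:
    """Stand-in for the DiT: iteratively refine latents conditioned on text."""
    for _ in range(steps):
        latents = 0.9 * latents  # placeholder update, not a real sampler
    return latents

def decode(latents: np.ndarray) -> np.ndarray:
    """Stand-in for the autoencoder decoder: latents -> stereo waveform."""
    n_samples = latents.shape[-1] * DOWNSAMPLE
    return np.zeros((2, n_samples))  # 2 channels = stereo

seconds = 10
noise = np.random.default_rng(0).standard_normal(
    (LATENT_DIM, seconds * SAMPLE_RATE // DOWNSAMPLE))
audio = decode(denoise(noise, encode_text("a drum loop")))
print(audio.shape)  # (channels, samples): stereo output
```

The point of the sketch is only the shape of the pipeline: diffusion runs over a short latent sequence rather than raw samples, which is what makes 47 seconds of 44.1 kHz stereo tractable.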

Training

Stable Audio Open 1.0 was trained on a dataset of 486,492 audio recordings sourced from Freesound and the Free Music Archive (FMA), all under CC0, CC BY, or CC Sampling+ licenses. Text conditioning used a pre-trained T5 model. Extensive analysis with tools such as the PANNs music classifier and Audible Magic's content detection service was applied to remove copyrighted music from the training data.

Guide: Running Locally

Basic Steps

  1. Setup Environment: Ensure Python, PyTorch, and required libraries (torchaudio, einops, etc.) are installed.
  2. Download Model: Use the get_pretrained_model function from the stable-audio-tools library.
  3. Define Conditions: Set text prompts and timing for audio generation.
  4. Generate Audio: Utilize either stable-audio-tools or diffusers library for inference.
  5. Post-process Output: Normalize and save the generated audio in the desired format.
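The steps above can be sketched end-to-end with stable-audio-tools. The prompt, sampler settings, and output file name are illustrative choices, not requirements; running this downloads the model weights from Hugging Face and is best done on a CUDA GPU.

```python
import torch
import torchaudio
from einops import rearrange
from stable_audio_tools import get_pretrained_model
from stable_audio_tools.inference.generation import generate_diffusion_cond

device = "cuda" if torch.cuda.is_available() else "cpu"

# Steps 1-2: download the model and read its native sample rate / size.
model, model_config = get_pretrained_model("stabilityai/stable-audio-open-1.0")
sample_rate = model_config["sample_rate"]   # 44100
sample_size = model_config["sample_size"]
model = model.to(device)

# Step 3: define text and timing conditioning.
conditioning = [{
    "prompt": "128 BPM tech house drum loop",  # illustrative prompt
    "seconds_start": 0,
    "seconds_total": 30,
}]

# Step 4: run diffusion sampling in the autoencoder's latent space.
output = generate_diffusion_cond(
    model,
    steps=100,
    cfg_scale=7,
    conditioning=conditioning,
    sample_size=sample_size,
    sigma_min=0.3,
    sigma_max=500,
    sampler_type="dpmpp-3m-sde",
    device=device,
)

# Step 5: collapse the batch, peak-normalize, convert to int16, and save.
output = rearrange(output, "b d n -> d (b n)")
output = (output.to(torch.float32)
          .div(torch.max(torch.abs(output)))
          .clamp(-1, 1)
          .mul(32767)
          .to(torch.int16)
          .cpu())
torchaudio.save("output.wav", output, sample_rate)
```

The `seconds_total` value controls the generated duration; the model supports up to 47 seconds.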

Suggested Cloud GPUs

For optimal performance, particularly with the diffusers library, a cloud GPU such as an NVIDIA V100 or A100 is recommended.
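With the diffusers integration, generation looks roughly like the following. The prompt, duration, seed, and output path are illustrative, and the first call downloads the model weights.

```python
import torch
import soundfile as sf
from diffusers import StableAudioPipeline

# Load the pipeline in half precision and move it to the GPU.
pipe = StableAudioPipeline.from_pretrained(
    "stabilityai/stable-audio-open-1.0", torch_dtype=torch.float16)
pipe = pipe.to("cuda")

# Generate stereo audio from a text prompt (up to 47 s).
generator = torch.Generator("cuda").manual_seed(0)
audio = pipe(
    "The sound of a hammer hitting a wooden surface.",  # illustrative prompt
    negative_prompt="Low quality.",
    num_inference_steps=200,
    audio_end_in_s=10.0,
    generator=generator,
).audios[0]

# Save as WAV at the model's native 44.1 kHz sample rate.
sf.write("hammer.wav", audio.T.float().cpu().numpy(), pipe.vae.sampling_rate)
```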

License

Stable Audio Open 1.0 is released under the Stability AI Community License. For commercial usage, refer to Stability AI's commercial license.
