Stable Audio Open 1.0
Introduction
Stable Audio Open 1.0 is a text-to-audio model developed by Stability AI. It generates stereo audio up to 47 seconds long at 44.1kHz using text prompts. The model uses an autoencoder, a T5-based text embedding, and a transformer-based diffusion model. Aimed at research and experimentation, it provides insights into AI-based music and audio generation.
Architecture
The architecture of Stable Audio Open 1.0 includes:
- Autoencoder: Compresses waveforms to a manageable sequence length.
- T5-based Text Embedding: For conditioning on text prompts.
- Transformer-based Diffusion Model (DiT): Operates in the autoencoder's latent space, facilitating audio generation.
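The autoencoder's job can be made concrete with a back-of-the-envelope shape calculation. The specific numbers below (a roughly 2048x temporal downsampling factor, 64 latent channels, and a ~2,097,152-sample generation window) are taken from the published model configuration and should be treated as illustrative assumptions rather than guarantees:

```python
# Sketch of the latent shapes the diffusion transformer (DiT) operates on.
# DOWNSAMPLE, LATENT_CHANNELS, and MAX_SAMPLES are assumptions drawn from
# the published model config; verify against the actual checkpoint.

SAMPLE_RATE = 44_100        # output sample rate in Hz
MAX_SAMPLES = 2_097_152     # ~47.5 s generation window, in samples
DOWNSAMPLE = 2048           # autoencoder temporal compression factor
LATENT_CHANNELS = 64        # channels per latent frame

def latent_seq_len(num_samples: int, downsample: int = DOWNSAMPLE) -> int:
    """Sequence length the DiT sees for a waveform of `num_samples` samples."""
    return num_samples // downsample

frames = latent_seq_len(MAX_SAMPLES)
print(f"{MAX_SAMPLES} samples ({MAX_SAMPLES / SAMPLE_RATE:.1f} s) "
      f"-> {frames} latent frames x {LATENT_CHANNELS} channels")
```

The point of the compression is visible in the numbers: the DiT attends over on the order of a thousand latent frames rather than millions of raw samples.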
Training
Stable Audio Open 1.0 was trained on a dataset comprising 486,492 audio recordings, sourced from Freesound and the Free Music Archive (FMA), under licenses CC0, CC BY, or CC Sampling+. Text conditioning utilized a pre-trained T5 model. Extensive analysis ensured the removal of copyrighted music, leveraging tools like the PANNs music classifier and Audible Magic's content detection services.
Guide: Running Locally
Basic Steps
- Set Up Environment: Ensure Python, PyTorch, and the required libraries (torchaudio, einops, etc.) are installed.
- Download Model: Use the `get_pretrained_model` function from the `stable-audio-tools` library.
- Define Conditions: Set text prompts and timing for audio generation.
- Generate Audio: Use either the `stable-audio-tools` or `diffusers` library for inference.
- Post-process Output: Normalize and save the generated audio in the desired format.
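The steps above can be sketched end to end with `stable-audio-tools`. The calls below follow the library's published inference example, but the sampling parameters (steps, CFG scale, sampler type) and the example prompt are illustrative choices, and actually running `generate_and_save` requires downloading the model and, realistically, a CUDA GPU. The `peak_normalize_to_int16` helper implements the post-processing step in plain NumPy:

```python
import numpy as np

def peak_normalize_to_int16(audio: np.ndarray) -> np.ndarray:
    """Peak-normalize float audio to [-1, 1] and convert to 16-bit PCM."""
    peak = np.max(np.abs(audio))
    if peak > 0:
        audio = audio / peak
    return (np.clip(audio, -1.0, 1.0) * 32767).astype(np.int16)

def generate_and_save(prompt: str, seconds: int = 30, out_path: str = "output.wav"):
    """Download the model, condition on text + timing, sample, and save a WAV.

    Sketch only: needs `stable-audio-tools` installed and (in practice) a GPU.
    """
    import torch
    import torchaudio
    from einops import rearrange
    from stable_audio_tools import get_pretrained_model
    from stable_audio_tools.inference.generation import generate_diffusion_cond

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model, model_config = get_pretrained_model("stabilityai/stable-audio-open-1.0")
    model = model.to(device)

    # Conditioning couples the text prompt with timing information.
    conditioning = [{"prompt": prompt, "seconds_start": 0, "seconds_total": seconds}]

    output = generate_diffusion_cond(
        model,
        steps=100,                  # diffusion sampling steps (illustrative)
        cfg_scale=7,                # classifier-free guidance strength (illustrative)
        conditioning=conditioning,
        sample_size=model_config["sample_size"],
        sampler_type="dpmpp-3m-sde",
        device=device,
    )

    # (batch, channels, samples) -> (channels, samples), then normalize and save.
    audio = rearrange(output, "b d n -> d (b n)").to(torch.float32).cpu().numpy()
    pcm = torch.from_numpy(peak_normalize_to_int16(audio))
    torchaudio.save(out_path, pcm, model_config["sample_rate"])
```

A call like `generate_and_save("128 BPM tech house drum loop", seconds=30)` would then write `output.wav` at 44.1kHz.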
Suggested Cloud GPUs
For optimal performance, especially when using the `diffusers` library, it is recommended to use cloud GPUs such as an NVIDIA V100 or A100.
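For reference, a minimal `diffusers` inference sketch is below. It assumes `StableAudioPipeline` support in a recent `diffusers` release plus `soundfile` for writing the WAV; the prompt, seed, and step count are illustrative, and running `generate_with_diffusers` requires a CUDA GPU. The small `to_soundfile_layout` helper handles the channel-order conversion:

```python
import numpy as np

def to_soundfile_layout(audio: np.ndarray) -> np.ndarray:
    """The pipeline yields (channels, samples); soundfile expects (samples, channels)."""
    return np.ascontiguousarray(audio.T)

def generate_with_diffusers(prompt: str, seconds: float = 10.0,
                            out_path: str = "output.wav"):
    """Sketch of text-to-audio inference via diffusers; needs a CUDA GPU."""
    import torch
    import soundfile as sf
    from diffusers import StableAudioPipeline

    pipe = StableAudioPipeline.from_pretrained(
        "stabilityai/stable-audio-open-1.0", torch_dtype=torch.float16
    ).to("cuda")

    generator = torch.Generator("cuda").manual_seed(0)  # reproducible sampling
    audio = pipe(
        prompt,
        negative_prompt="Low quality.",   # illustrative negative prompt
        num_inference_steps=200,
        audio_end_in_s=seconds,
        generator=generator,
    ).audios[0]

    sf.write(out_path,
             to_soundfile_layout(audio.float().cpu().numpy()),
             pipe.vae.sampling_rate)
```

On a V100- or A100-class GPU the half-precision pipeline above fits comfortably in memory; on smaller cards, CPU offloading would be needed.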
License
Stable Audio Open 1.0 is released under the Stability AI Community License. For commercial usage, refer to Stability AI's commercial license.