riffusion model v1
Introduction
Riffusion is a real-time music generation application built on a latent text-to-image diffusion model that produces spectrogram images from text prompts. These spectrograms are then converted into audio clips. The project was developed by Seth Forsgren and Hayk Martiros as a hobby project and is built upon the Stable-Diffusion-v1-5 model.
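The spectrogram-to-audio step requires recovering phase information that a magnitude-only spectrogram image lacks. A common approach is the Griffin-Lim algorithm, sketched below with NumPy and SciPy; this is a simplified illustration, not Riffusion's actual converter (which works on mel-scaled spectrograms with its own parameters):

```python
import numpy as np
from scipy.signal import stft, istft

def griffin_lim(magnitude, n_iter=32, nperseg=512, noverlap=384):
    """Estimate a waveform from a magnitude spectrogram by iteratively
    refining a phase estimate (Griffin-Lim). Parameters are illustrative."""
    rng = np.random.default_rng(0)
    # start from random phase
    angles = np.exp(2j * np.pi * rng.random(magnitude.shape))
    for _ in range(n_iter):
        # impose the target magnitude, go to the time domain and back
        _, audio = istft(magnitude * angles, nperseg=nperseg, noverlap=noverlap)
        _, _, spec = stft(audio, nperseg=nperseg, noverlap=noverlap)
        # frame counts from a round trip can differ by one; align defensively
        phase = np.angle(spec)[:, : magnitude.shape[1]]
        if phase.shape[1] < magnitude.shape[1]:
            phase = np.pad(phase, ((0, 0), (0, magnitude.shape[1] - phase.shape[1])))
        angles = np.exp(1j * phase)
    _, audio = istft(magnitude * angles, nperseg=nperseg, noverlap=noverlap)
    return audio

# demo: take the magnitude of a known signal's STFT and recover a waveform
fs = 22050
tone = np.sin(2 * np.pi * 440 * np.arange(fs) / fs)
_, _, Z = stft(tone, nperseg=512, noverlap=384)
reconstructed = griffin_lim(np.abs(Z))
```

The hann window with 75% overlap used here satisfies the constant-overlap-add condition that `istft` needs for faithful inversion.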
Architecture
The Riffusion model is a latent diffusion text-to-image model that conditions on text through a pretrained encoder, CLIP ViT-L/14, and generates spectrogram images that can be converted into audio. It was created by fine-tuning a Stable-Diffusion-v1-5 checkpoint, whose underlying training data (drawn from LAION-5B) covers a broad range of language, including musical concepts.
Training
The Riffusion model can be fine-tuned on datasets of spectrogram images paired with descriptive text. Because the CLIP text encoder is pretrained on a broad corpus, the model can link text to audio concepts even for words absent from the fine-tuning dataset. Fine-tuning examples and methodologies, including the dreambooth technique for custom styles, can be found in Hugging Face's diffusers training resources.
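Preparing such a dataset means rendering audio clips as images. The sketch below shows one simplified way to turn a waveform into an 8-bit log-magnitude spectrogram image paired with a caption; the actual Riffusion pipeline uses mel-scaled spectrograms with its own normalization, so treat the parameters here as illustrative assumptions:

```python
import numpy as np
from scipy.signal import stft

def audio_to_spectrogram_image(audio, nperseg=512, noverlap=384):
    """Render a mono waveform as an 8-bit log-magnitude spectrogram image.
    Simplified linear-frequency version; Riffusion itself uses mel scaling."""
    _, _, Z = stft(audio, nperseg=nperseg, noverlap=noverlap)
    db = 20 * np.log10(np.abs(Z) + 1e-6)       # log-magnitude in dB
    db = np.clip(db, db.max() - 80, db.max())  # keep an 80 dB dynamic range
    img = 255 * (db - db.min()) / (db.max() - db.min() + 1e-9)
    return img.astype(np.uint8)[::-1]          # low frequencies at the bottom

# a training example is an (image, caption) pair
fs = 22050
clip = np.sin(2 * np.pi * 440 * np.arange(fs) / fs)
example = {"image": audio_to_spectrogram_image(clip),
           "text": "a sustained 440 Hz sine tone"}
```

Pairs like `example` can then be fed to a standard diffusers fine-tuning script in place of ordinary image/caption data.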
Guide: Running Locally
- Prerequisites: Ensure you have Python and necessary libraries installed, such as PyTorch and Hugging Face Transformers.
- Clone Repositories: Obtain the Riffusion code from GitHub:
  - Model: github.com/riffusion/riffusion
  - Web App: github.com/hmartiro/riffusion-app
- Download Model: Access the model checkpoint from Hugging Face: riffusion-model-v1.
- Set Up Environment: Configure a Python environment and install dependencies.
- Run Locally: Execute the Riffusion app or scripts to start generating audio from text prompts.
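The steps above can be sketched as a shell session. The requirements file and server module path follow the riffusion repository's README as of this writing; verify them against the current repository before running:

```shell
# clone the inference server (the web app repo is separate)
git clone https://github.com/riffusion/riffusion.git
cd riffusion

# isolated Python environment with the project's dependencies
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

# launch the inference server; the checkpoint is fetched from
# Hugging Face (riffusion-model-v1) on first use
python -m riffusion.server --host 127.0.0.1 --port 3013
```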
For enhanced performance, consider using cloud GPUs from providers like AWS, Google Cloud, or Azure.
License
Riffusion is released under the CreativeML OpenRAIL-M license, which allows open access subject to specific use restrictions: the model may not be used to create illegal or harmful content. You own the outputs you generate, and commercial use and redistribution are permitted provided the same use restrictions are passed on to downstream users. For full details, refer to the license document: CreativeML OpenRAIL-M License.