bark
suno
Introduction
Bark is a transformer-based text-to-audio model developed by Suno. It can generate realistic, multilingual speech, music, background noise, and simple sound effects. Bark also produces nonverbal sounds like laughter and sighs. It is primarily intended for research, with pretrained model checkpoints available for inference.
Architecture
Bark consists of a series of transformer models that convert text into audio through several stages:
- Text to Semantic Tokens: Takes tokenized text as input and produces semantic tokens that encode the audio to be generated.
- Semantic to Coarse Tokens: Converts semantic tokens into tokens from the first two codebooks of the EnCodec codec.
- Coarse to Fine Tokens: Takes the first two EnCodec codebooks as input and produces all 8 EnCodec codebooks.
Each model has 80 million or 300 million parameters, depending on the checkpoint size. The first two stages use causal attention and the final stage uses non-causal attention. Output vocabulary sizes go up to 10,000.
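The token flow through the three stages can be illustrated with a toy sketch. Nothing below is Bark's actual code: the stage functions are hypothetical stand-ins that emit random tokens, the codebook size of 1,024 and the token ids are assumptions chosen for illustration, and the sketch keeps the same sequence length in every stage, which the real models need not do. Only the 10,000 vocabulary bound and the 2-then-8 codebook layout come from the description above.

```python
import random

SEMANTIC_VOCAB = 10_000   # upper bound on vocab size mentioned above
CODEBOOK_SIZE = 1_024     # assumption: entries per EnCodec codebook
N_COARSE = 2              # first two EnCodec codebooks
N_FINE = 8                # full set of EnCodec codebooks

rng = random.Random(0)    # fixed seed so the sketch is repeatable


def text_to_semantic(text_tokens):
    """Stand-in for stage 1: one random semantic token per text token."""
    return [rng.randrange(SEMANTIC_VOCAB) for _ in text_tokens]


def semantic_to_coarse(semantic_tokens):
    """Stand-in for stage 2: N_COARSE rows of random codebook entries."""
    steps = len(semantic_tokens)
    return [[rng.randrange(CODEBOOK_SIZE) for _ in range(steps)]
            for _ in range(N_COARSE)]


def coarse_to_fine(coarse_tokens):
    """Stand-in for stage 3: keep the coarse rows, add the remaining codebooks."""
    steps = len(coarse_tokens[0])
    extra = [[rng.randrange(CODEBOOK_SIZE) for _ in range(steps)]
             for _ in range(N_FINE - len(coarse_tokens))]
    return coarse_tokens + extra


text_tokens = [101, 7592, 2088, 102]  # hypothetical token ids
semantic = text_to_semantic(text_tokens)
coarse = semantic_to_coarse(semantic)
fine = coarse_to_fine(coarse)
print(len(semantic), len(coarse), len(fine))  # prints: 4 2 8
```

In the real pipeline each stage is a trained transformer and the final codebooks are decoded to a waveform by EnCodec; the sketch only shows how the number of codebooks grows from stage to stage.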
Training
Bark uses a three-stage process to transform text into audio, using BERT tokenization for the text input and EnCodec for audio encoding. The model is intended to enhance accessibility tools across different languages.
Guide: Running Locally
Using Hugging Face Transformers
- Install Dependencies:
    pip install --upgrade pip
    pip install --upgrade transformers scipy
- Run Inference:
    from transformers import pipeline
    import scipy.io.wavfile

    # Load the text-to-speech pipeline with the Bark checkpoint
    synthesiser = pipeline("text-to-speech", "suno/bark")

    # Generate speech with sampling enabled
    speech = synthesiser("Hello, my dog is cooler than you!", forward_params={"do_sample": True})

    # Write the result to a WAV file
    scipy.io.wavfile.write("bark_out.wav", rate=speech["sampling_rate"], data=speech["audio"])
- Listen to Audio:
  Convert the generated speech to a .wav file and play it using available tools such as IPython's Audio.
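What the save step involves can be sketched with only Python's standard library. This is an illustration under stated assumptions, not part of Bark or Transformers: `save_wav` is a hypothetical helper, it assumes mono float samples in the range [-1, 1], and the 24,000 Hz rate and the sine tone are stand-ins for the sampling rate and audio returned by the model.

```python
import math
import os
import struct
import tempfile
import wave


def save_wav(path, samples, sample_rate):
    """Write mono float samples in [-1, 1] as 16-bit PCM (hypothetical helper)."""
    frames = bytearray()
    for s in samples:
        s = max(-1.0, min(1.0, float(s)))            # clip to the valid range
        frames += struct.pack("<h", int(s * 32767))  # little-endian int16
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)           # mono
        wf.setsampwidth(2)           # 2 bytes = 16 bits per sample
        wf.setframerate(sample_rate)
        wf.writeframes(bytes(frames))


# Stand-in signal: one second of a 440 Hz tone at an assumed 24 kHz rate
rate = 24_000
tone = [0.3 * math.sin(2 * math.pi * 440 * n / rate) for n in range(rate)]
out_path = os.path.join(tempfile.gettempdir(), "tone_example.wav")
save_wav(out_path, tone, rate)
```

With real model output, the tone would be replaced by the generated samples flattened to one dimension, and `rate` by the sampling rate reported alongside them.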
Using Original Bark Library
- Install Bark Library:
  Follow the instructions at Suno's GitHub.
- Generate Audio:
    from bark import SAMPLE_RATE, generate_audio, preload_models
    from IPython.display import Audio

    # Download and load all models
    preload_models()

    # Generate audio from a text prompt
    text_prompt = "Hello, my name is Suno. And, uh — and I like pizza. [laughs]"
    speech_array = generate_audio(text_prompt)

    # Play the audio in a notebook
    Audio(speech_array, rate=SAMPLE_RATE)
- Save as WAV:
    from scipy.io.wavfile import write as write_wav

    write_wav("/path/to/audio.wav", SAMPLE_RATE, speech_array)
Cloud GPUs Suggestion
Consider using cloud services like Google Colab for access to GPUs, which can significantly speed up processing times.
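Whether a GPU is actually visible to the runtime can be checked before loading the model. This is a minimal sketch, assuming PyTorch is the backend. `pick_device` and `build_synthesiser` are hypothetical helper names, and falling back to the CPU when PyTorch is missing is a choice made for this example.

```python
def pick_device():
    """Return "cuda" when PyTorch sees a GPU, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        # Assumption for this sketch: without PyTorch, report the CPU
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


def build_synthesiser(device):
    """Load the text-to-speech pipeline on the chosen device.

    Calling this downloads the checkpoint, so it is not run here.
    """
    from transformers import pipeline
    return pipeline("text-to-speech", "suno/bark", device=device)


device = pick_device()
print(f"Using device: {device}")
```

On a Colab runtime with a GPU enabled this would be expected to report `cuda`; on a CPU-only machine inference still works but is considerably slower.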
License
Bark is released under the MIT License, allowing for broad use and modification with appropriate attribution.