Introduction

Bark is a transformer-based text-to-audio model developed by Suno. It can generate realistic, multilingual speech, music, background noise, and simple sound effects. Bark also produces nonverbal sounds like laughter and sighs. It is primarily intended for research, with pretrained model checkpoints available for inference.

Architecture

Bark consists of a series of transformer models that convert text into audio through several stages:

  • Text to Semantic Tokens: Takes tokenized text as input and produces semantic tokens that encode the audio to be generated.
  • Semantic to Coarse Tokens: Converts the semantic tokens into tokens from the first two EnCodec codebooks.
  • Coarse to Fine Tokens: Takes the first two EnCodec codebooks as input and produces all 8 codebooks.

Each model has roughly 80 million or 300 million parameters, depending on the checkpoint size. The first two stages use causal attention and the final stage uses non-causal attention. Vocabulary sizes go up to 10,000.
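
To make the hand-off between the stages concrete, the sketch below traces only the token shapes. Random integers stand in for the three transformers, so none of this is Bark's actual code: the function names, the 1,024-entry codebook size, and the token and frame counts are illustrative assumptions.

```python
import random

SEMANTIC_VOCAB = 10_000  # upper bound on vocabulary size quoted above
CODEBOOK_SIZE = 1_024    # assumption: entries per EnCodec codebook
N_COARSE = 2             # the coarse stage fills the first two codebooks
N_TOTAL = 8              # the fine stage returns all eight codebooks

rng = random.Random(0)


def text_to_semantic(n_semantic):
    """Stage 1 stand-in: one semantic token id per step."""
    return [rng.randrange(SEMANTIC_VOCAB) for _ in range(n_semantic)]


def semantic_to_coarse(semantic, n_frames):
    """Stage 2 stand-in: N_COARSE rows of codebook ids, one column per frame."""
    assert semantic, "stage 2 needs at least one semantic token"
    return [[rng.randrange(CODEBOOK_SIZE) for _ in range(n_frames)]
            for _ in range(N_COARSE)]


def coarse_to_fine(coarse):
    """Stage 3 stand-in: keep the coarse rows and add the remaining codebooks."""
    n_frames = len(coarse[0])
    extra = [[rng.randrange(CODEBOOK_SIZE) for _ in range(n_frames)]
             for _ in range(N_TOTAL - len(coarse))]
    return coarse + extra


semantic = text_to_semantic(n_semantic=50)             # hypothetical length
coarse = semantic_to_coarse(semantic, n_frames=75)     # hypothetical length
fine = coarse_to_fine(coarse)
print(len(semantic), len(coarse), len(fine), len(fine[0]))
```

In the real model, the full set of codebooks is then decoded by EnCodec into a waveform, a step this sketch leaves out.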

Training

Bark transforms text into audio in three stages, using BERT tokenization for the text and EnCodec for audio encoding. The model is intended to improve accessibility tools across different languages.

Guide: Running Locally

Using Hugging Face Transformers

  1. Install Dependencies:

    pip install --upgrade pip
    pip install --upgrade transformers scipy
    
  2. Run Inference:

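    from transformers import pipeline
    import scipy.io.wavfile
    
    # Load the Bark checkpoint behind the text-to-speech pipeline
    synthesiser = pipeline("text-to-speech", "suno/bark")
    # do_sample=True enables sampling during generation
    speech = synthesiser("Hello, my dog is cooler than you!", forward_params={"do_sample": True})
    # Save the waveform at the sampling rate reported by the pipeline
    scipy.io.wavfile.write("bark_out.wav", rate=speech["sampling_rate"], data=speech["audio"])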
    
  3. Listen to Audio:
    Play the saved bark_out.wav file with any audio player, or listen to the generated speech directly in a notebook with IPython's Audio.

Using Original Bark Library

  1. Install Bark Library:
    Follow the installation instructions in Suno's Bark repository on GitHub.

  2. Generate Audio:

    from bark import SAMPLE_RATE, generate_audio, preload_models
    from IPython.display import Audio
    
    preload_models()
    text_prompt = "Hello, my name is Suno. And, uh — and I like pizza. [laughs]"
    speech_array = generate_audio(text_prompt)
    Audio(speech_array, rate=SAMPLE_RATE)
    
  3. Save as WAV:

    from scipy.io.wavfile import write as write_wav
    write_wav("/path/to/audio.wav", SAMPLE_RATE, speech_array)
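
When scipy's writer is given a floating-point array, it stores the samples as floating-point WAV data, which some players may not open. A minimal sketch of a 16-bit PCM alternative follows, using only the standard library. The helper name write_pcm16, the assumption that samples lie in [-1.0, 1.0], and the test tone are illustrative and not part of Bark.

```python
import array
import math
import os
import sys
import tempfile
import wave


def write_pcm16(path, samples, sample_rate):
    """Write mono float samples (assumed to lie in [-1.0, 1.0]) as 16-bit PCM."""
    pcm = array.array("h")  # signed 16-bit integers
    for s in samples:
        s = max(-1.0, min(1.0, float(s)))  # clip out-of-range values
        pcm.append(int(round(s * 32767)))
    if sys.byteorder == "big":
        pcm.byteswap()  # WAV sample data is little-endian
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes per sample = 16 bits
        wav.setframerate(sample_rate)
        wav.writeframes(pcm.tobytes())


# Stand-in for generated speech: one second of a 440 Hz tone.
rate = 24_000  # placeholder rate for this demo
tone = [0.5 * math.sin(2 * math.pi * 440 * n / rate) for n in range(rate)]
out_path = os.path.join(tempfile.gettempdir(), "tone_pcm16.wav")
write_pcm16(out_path, tone, rate)
```

With Bark, the same call would presumably take speech_array and SAMPLE_RATE in place of the test tone. For long clips, a vectorized NumPy conversion would be faster than this per-sample loop.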
    

Cloud GPUs Suggestion

Consider using cloud services like Google Colab for access to GPUs, which can significantly speed up processing times.
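
Before starting a long generation, it can help to confirm that a GPU is actually visible to the runtime. The following is a small sketch, assuming PyTorch is the backend in use. The helper name pick_device is hypothetical.

```python
def pick_device():
    """Return "cuda" when PyTorch can see a GPU, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        return "cpu"  # PyTorch is not installed, so fall back to CPU
    return "cuda" if torch.cuda.is_available() else "cpu"


device = pick_device()
print(f"Running on: {device}")
```

The result can be passed as the device argument when building the Transformers pipeline, for example pipeline("text-to-speech", "suno/bark", device=device).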

License

Bark is released under the MIT License, which allows broad use and modification provided the copyright and license notice is retained.
