MERaLiON-AudioLLM-Whisper-SEA-LION


Introduction

MERaLiON-AudioLLM is a speech-text large language model designed for Singapore's multilingual and multicultural environment. It integrates a localized Whisper-large-v2 speech encoder with a SEA-LION V3 text decoder, tailored to the diverse linguistic nuances of Singaporean accents and dialects. The model supports automatic speech recognition, speech translation, spoken question answering, spoken dialogue summarization, speech instruction, and paralinguistics tasks.

Architecture

The architecture of MERaLiON-AudioLLM comprises:

  • An audio encoder that converts speech inputs into vector representations.
  • A text decoder that interprets and responds to natural language instructions.
  • An adaptor module that aligns the encoder's output with the text decoder's embedding size.

The model uses Whisper-large-v2 as the audio encoder and SEA-LION V3 as the text decoder.
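The adaptor's role can be illustrated with a toy NumPy sketch. The dimensions below are hypothetical, not the model's actual sizes: the point is only that the encoder's frame vectors are projected to the decoder's embedding width so they can sit alongside text embeddings.

```python
import numpy as np

# Hypothetical sizes for illustration only (not MERaLiON's real dimensions):
# the encoder emits frame vectors of width d_enc; the adaptor projects them
# to the decoder's embedding width d_dec.
rng = np.random.default_rng(0)
d_enc, d_dec, n_frames = 1280, 3584, 100

encoder_out = rng.standard_normal((n_frames, d_enc))

# A simple linear adaptor: one projection matrix. Real adaptor modules may
# also downsample frames or use a small MLP, but the alignment idea is the same.
W = rng.standard_normal((d_enc, d_dec)) / np.sqrt(d_enc)
adapted = encoder_out @ W

print(adapted.shape)  # (100, 3584) — now matches the decoder embedding size
```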

Training

MERaLiON-AudioLLM is trained on 260,000 hours of audio from diverse datasets, including synthesized and augmented samples. Training was conducted on the ASPIRE 2A+ Supercomputer Cluster in Singapore, using 128 Nvidia H100 GPUs over 200,000 steps, completing in approximately two days.
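As a back-of-envelope check on the figures above (200,000 steps on 128 GPUs in roughly two days), the implied optimizer step rate and total accelerator budget work out as follows:

```python
# Rough arithmetic from the quoted training figures; "two days" is approximate,
# so these are order-of-magnitude estimates only.
steps = 200_000
wall_clock_s = 2 * 24 * 3600            # ~2 days in seconds

steps_per_sec = steps / wall_clock_s
print(f"{steps_per_sec:.2f} steps/s")   # ≈ 1.16 optimizer steps per second

gpu_count = 128
gpu_hours = gpu_count * 2 * 24
print(f"{gpu_hours} GPU-hours")         # 6144 H100 GPU-hours
```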

Guide: Running Locally

Basic Steps

  1. Install the necessary Python libraries:

    pip install transformers datasets vllm==0.6.4.post1
    
  2. Load the model and processor:

    from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor
    
    repo_id = "MERaLiON/MERaLiON-AudioLLM-Whisper-SEA-LION"
    processor = AutoProcessor.from_pretrained(repo_id, trust_remote_code=True)
    model = AutoModelForSpeechSeq2Seq.from_pretrained(repo_id, use_safetensors=True, trust_remote_code=True)
    
  3. Prepare and process audio data:

    from datasets import load_dataset
    
    libri_data = load_dataset("distil-whisper/librispeech_long", "clean", split="validation")
    audio_array = libri_data[0]["audio"]["array"]
    
  4. Build the chat prompt, then generate and decode outputs:

    # The processor expects a prompt containing an audio placeholder; the
    # template below follows the model card's documented usage.
    prompt = "Given the following audio context: <SpeechHere>\n\nText instruction: {query}"
    conversation = [{"role": "user", "content": prompt.format(query="Please transcribe this speech.")}]
    chat_prompt = processor.tokenizer.apply_chat_template(
        conversation=conversation, tokenize=False, add_generation_prompt=True
    )
    
    inputs = processor(text=chat_prompt, audios=audio_array)
    outputs = model.generate(**inputs, max_new_tokens=128)
    response = processor.batch_decode(outputs[:, inputs['input_ids'].size(1):], skip_special_tokens=True)[0]
    
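Note that Whisper-family encoders expect 16 kHz mono audio, so waveforms at other sample rates should be resampled before they reach the processor. Below is a minimal linear-interpolation sketch for illustration only; in practice a proper resampler such as `librosa.resample` or `torchaudio` is preferable:

```python
import numpy as np

def resample_linear(audio: np.ndarray, src_sr: int, dst_sr: int = 16_000) -> np.ndarray:
    """Resample a mono waveform via linear interpolation (illustrative only)."""
    if src_sr == dst_sr:
        return audio
    n_out = int(round(len(audio) / src_sr * dst_sr))
    t_src = np.arange(len(audio)) / src_sr     # original sample times (s)
    t_dst = np.arange(n_out) / dst_sr          # target sample times (s)
    return np.interp(t_dst, t_src, audio)

# One second of a 440 Hz tone at 44.1 kHz, resampled down to 16 kHz.
audio_44k = np.sin(2 * np.pi * 440 * np.arange(44_100) / 44_100)
audio_16k = resample_linear(audio_44k, 44_100)
print(len(audio_16k))  # 16000 samples for 1 second of audio
```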

Cloud GPUs

For optimal performance, using cloud GPU resources such as AWS, Google Cloud, or Azure is recommended.

License

MERaLiON-AudioLLM is distributed under the MERaLiON Public License. For more details, visit the license document.
