Introduction

The Tamil Text-to-Speech (TTS) model is part of Facebook's Massively Multilingual Speech (MMS) project, which aims to provide speech technology for a wide range of languages. The model uses the VITS (Variational Inference with adversarial learning for end-to-end Text-to-Speech) framework to generate speech waveforms directly from Tamil text.

Architecture

VITS is an end-to-end speech synthesis system built on a conditional variational autoencoder (VAE), comprising a posterior encoder, a decoder, and a conditional prior. Spectrogram-based acoustic features are predicted by a flow-based module built from a Transformer text encoder and multiple coupling layers; these features are then decoded into a waveform by a stack of transposed convolutional layers, much in the style of the HiFi-GAN vocoder. A stochastic duration predictor lets the model synthesize speech with different rhythms from the same input text. Training is end-to-end, combining losses derived from the variational lower bound with adversarial losses, while normalizing flows increase the expressiveness of the prior.
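
Because the duration predictor is stochastic, the synthesized rhythm varies from run to run unless the random seed is fixed. The sketch below shows how this surfaces in the Transformers API; the noise_scale, noise_scale_duration, and speaking_rate attributes are exposed by the library's VitsModel, and the values shown are the usual defaults rather than tuned settings:

    import torch
    from transformers import VitsModel, AutoTokenizer, set_seed

    model = VitsModel.from_pretrained("facebook/mms-tts-tam")
    tokenizer = AutoTokenizer.from_pretrained("facebook/mms-tts-tam")
    inputs = tokenizer("வணக்கம்", return_tensors="pt")

    set_seed(555)  # fix the sampled noise so repeated runs produce identical audio

    model.noise_scale = 0.667         # randomness of the acoustic latents
    model.noise_scale_duration = 0.8  # randomness of the stochastic duration predictor
    model.speaking_rate = 1.0         # >1.0 speaks faster, <1.0 slower

    with torch.no_grad():
        waveform = model(**inputs).waveform  # shape: (batch_size, num_samples)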

Training

Each language in the MMS project, including Tamil, has its own VITS checkpoint. Training is end-to-end, combining the variational and adversarial objectives described above.
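
Checkpoints are published per language and keyed by ISO 639-3 code ("tam" for Tamil). As a minimal sketch of the naming convention (repository names as published on the Hugging Face Hub):

    from transformers import VitsModel

    # Per-language MMS TTS checkpoints follow the pattern "facebook/mms-tts-<iso-639-3>":
    # "tam" loads the Tamil model; another code, e.g. "eng", loads that language's
    # independently trained checkpoint.
    model = VitsModel.from_pretrained("facebook/mms-tts-tam")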

Guide: Running Locally

  1. Install Dependencies: Ensure you have a recent version of the Transformers library, along with Accelerate; SciPy is also needed below to save the waveform.

    pip install --upgrade transformers accelerate scipy
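
    To confirm the environment is ready, a quick sanity check (the VITS classes only exist in Transformers releases that include VITS support):

    import transformers
    from transformers import VitsModel, VitsTokenizer  # these imports fail on versions without VITS

    print(transformers.__version__)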
    
  2. Load and Run the Model:

    from transformers import VitsModel, AutoTokenizer
    import torch
    
    model = VitsModel.from_pretrained("facebook/mms-tts-tam")
    tokenizer = AutoTokenizer.from_pretrained("facebook/mms-tts-tam")
    
    text = "some example text in the Tamil language"
    inputs = tokenizer(text, return_tensors="pt")
    
    with torch.no_grad():
        output = model(**inputs).waveform
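
    Note: Tamil is written in a non-Latin script, and some MMS TTS checkpoints expect romanized (uroman) input rather than the native script. The tokenizer reports this through its is_uroman attribute; if it is True, pre-process the text with the uroman tool before tokenizing:

    print(tokenizer.is_uroman)  # True means the checkpoint expects uroman-romanized input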
    
  3. Output the Waveform: Save or play the resulting waveform.

    import scipy.io.wavfile

    # output has shape (batch_size, num_samples); take the first waveform so the file is mono
    scipy.io.wavfile.write("output.wav", rate=model.config.sampling_rate, data=output[0].numpy())
    

    In a Jupyter Notebook:

    from IPython.display import Audio
    Audio(output[0].numpy(), rate=model.config.sampling_rate)
    
  4. Cloud GPUs: For faster inference, consider running the model on a GPU, either locally or through cloud services such as AWS, Google Cloud, or Azure, as sketched below.
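
    A minimal sketch of GPU inference (standard PyTorch device handling; nothing here is specific to this checkpoint):

    import torch
    from transformers import VitsModel, AutoTokenizer

    device = "cuda" if torch.cuda.is_available() else "cpu"

    model = VitsModel.from_pretrained("facebook/mms-tts-tam").to(device)
    tokenizer = AutoTokenizer.from_pretrained("facebook/mms-tts-tam")

    inputs = tokenizer("வணக்கம், உலகம்", return_tensors="pt").to(device)
    with torch.no_grad():
        # move the waveform back to the CPU for saving or playback
        waveform = model(**inputs).waveform.cpu()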

License

The model is licensed under CC-BY-NC 4.0, permitting non-commercial use with attribution.
