IndicBART

ai4bharat

Introduction

IndicBART is a multilingual, sequence-to-sequence pre-trained model designed for Indic languages and English. It supports 11 Indian languages and is based on the mBART architecture. The model is suitable for natural language generation tasks such as machine translation, summarization, and question generation.

Architecture

IndicBART is smaller than mBART and mT5, which makes it computationally cheaper to fine-tune and decode. It was pre-trained on a large corpus of Indic languages comprising 452 million sentences and 9 billion tokens. All languages except English are represented in the Devanagari script to facilitate transfer learning among related languages.
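
Because Indic-language text is mapped to Devanagari during pretraining, text in other Indic scripts must be transliterated before it is fed to the model. Below is a minimal sketch using the Indic NLP Library's UnicodeIndicTransliterator; the Tamil sentence and language codes are illustrative:

    # Requires: pip install indic-nlp-library
    from indicnlp.transliterate.unicode_transliterate import UnicodeIndicTransliterator

    tamil_text = "நான் ஒரு பையன்"  # illustrative Tamil input
    # Map the Tamil script to Devanagari (script-level "ta" -> "hi" conversion)
    devanagari_text = UnicodeIndicTransliterator.transliterate(tamil_text, "ta", "hi")
    print(devanagari_text)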

Training

The model was trained using the text-infilling objective, similar to mBART, on the IndicCorp dataset, which includes Indian English content.
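
To make the objective concrete, the sketch below shows a simplified version of text infilling: random spans are replaced by a single mask token, and the model is trained to reconstruct the original sentence. The 35% masking ratio follows mBART's convention and "[MASK]" is the mask token in IndicBART's vocabulary, but the span sampling here is simplified (mBART draws span lengths from a Poisson(3.5) distribution):

    import random

    def text_infill(tokens, mask_ratio=0.35, mask_token="[MASK]"):
        """Corrupt a token sequence by replacing random spans with one mask token."""
        tokens = list(tokens)
        num_to_mask = max(1, int(len(tokens) * mask_ratio))
        masked = 0
        while masked < num_to_mask:
            # Simplified span sampling; mBART uses Poisson(3.5) span lengths.
            span_len = min(random.randint(1, 5), num_to_mask - masked)
            start = random.randrange(0, max(1, len(tokens) - span_len + 1))
            tokens[start:start + span_len] = [mask_token]
            masked += span_len
        return tokens

    source = "मैं एक लड़का हूँ और मुझे पढ़ना पसंद है".split()
    corrupted = text_infill(source)
    # Training pair: the encoder sees the corrupted text, the decoder reconstructs the source.
    print(" ".join(corrupted), "->", " ".join(source))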

Guide: Running Locally

  1. Installation: Install the Hugging Face Transformers library and the Indic NLP Library (needed for script conversion), for example pip install transformers indic-nlp-library.
  2. Setup: Use the following code snippet to load the model and tokenizer:
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
    
    tokenizer = AutoTokenizer.from_pretrained("ai4bharat/IndicBART", do_lower_case=False, use_fast=False, keep_accents=True)
    model = AutoModelForSeq2SeqLM.from_pretrained("ai4bharat/IndicBART")
    
  3. Tokenization and Model Usage: IndicBART expects explicit language tags: format inputs as "sentence </s> <2xx>" and outputs as "<2yy> sentence </s>", where xx and yy are language codes (for example en, hi, bn, ta). Convert text in non-Devanagari scripts to Devanagari before processing; a full usage sketch follows this list.
  4. Fine-tuning: The model can be fine-tuned using the YANMTT toolkit or official Hugging Face scripts for translation and summarization.
  5. Execution: For efficient execution, consider using cloud GPUs such as those offered by AWS, GCP, or Azure.
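
The sketch below pulls steps 2 and 3 together, following the input/output format IndicBART was trained with ("sentence </s> <2xx>" on the source side, a "<2yy>" tag to start decoding). Note that the released checkpoint is a pretrained denoiser, so generated text is only meaningful after fine-tuning for a downstream task; the Hindi target tag here is illustrative:

    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("ai4bharat/IndicBART", do_lower_case=False, use_fast=False, keep_accents=True)
    model = AutoModelForSeq2SeqLM.from_pretrained("ai4bharat/IndicBART")

    # Inputs end with "</s> <2xx>"; special tokens are added manually.
    inp = tokenizer("I am a boy </s> <2en>", add_special_tokens=False, return_tensors="pt", padding=True).input_ids

    # Resolve the special-token ids needed for generation.
    pad_id = tokenizer._convert_token_to_id_with_added_voc("<pad>")
    bos_id = tokenizer._convert_token_to_id_with_added_voc("<s>")
    eos_id = tokenizer._convert_token_to_id_with_added_voc("</s>")

    model.eval()  # disable dropout for inference
    out = model.generate(
        inp,
        use_cache=True,
        num_beams=4,
        max_length=20,
        min_length=1,
        early_stopping=True,
        pad_token_id=pad_id,
        bos_token_id=bos_id,
        eos_token_id=eos_id,
        # Decoding starts from the target-language tag, e.g. <2hi> for Hindi.
        decoder_start_token_id=tokenizer._convert_token_to_id_with_added_voc("<2hi>"),
    )
    print(tokenizer.decode(out[0], skip_special_tokens=True, clean_up_tokenization_spaces=False))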

License

IndicBART is available under the MIT License.