IndicBERT
Introduction
IndicBERT is a multilingual ALBERT model developed as part of the AI4Bharat initiative. It is pretrained exclusively on 12 major Indian languages using a monolingual corpus of around 9 billion tokens. IndicBERT has far fewer parameters than other multilingual models such as mBERT and XLM-R, while offering comparable or better performance.
Architecture
The model architecture is based on ALBERT, which keeps the parameter count low through cross-layer parameter sharing and a factorized embedding parameterization. The model supports the following languages: Assamese, Bengali, English, Gujarati, Hindi, Kannada, Malayalam, Marathi, Oriya, Punjabi, Tamil, and Telugu.
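A quick way to confirm the architecture locally is to load the model configuration. This is a minimal sketch; the Hugging Face model ID `ai4bharat/indic-bert` is an assumption, as this section does not state the published ID.

```python
# Hedged sketch: inspect the model configuration to confirm the ALBERT
# backbone. The model ID "ai4bharat/indic-bert" is assumed, not confirmed
# by this section.
from transformers import AutoConfig

config = AutoConfig.from_pretrained("ai4bharat/indic-bert")
print(config.model_type)                             # expected: "albert"
print(config.num_hidden_layers, config.hidden_size)  # ALBERT hyperparameters
```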
Training
IndicBERT was pretrained on AI4Bharat's monolingual corpus, which contains 8.9 billion tokens distributed across the 12 supported languages. The model was evaluated on the IndicGLUE benchmark and additional tasks, showing competitive results against mBERT and XLM-R across a range of classification and other language-understanding tasks.
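Because pretraining follows ALBERT's masked-language-modelling objective, one way to sanity-check the pretrained weights is to fill a masked token. This is a hedged sketch: it assumes the released checkpoint retains the MLM head and is published as `ai4bharat/indic-bert`, and the tokenizer additionally requires the `sentencepiece` package.

```python
# Hedged sketch: probe the pretrained masked-LM head. Assumes the
# "ai4bharat/indic-bert" checkpoint ships MLM weights (an assumption).
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("ai4bharat/indic-bert")
model = AutoModelForMaskedLM.from_pretrained("ai4bharat/indic-bert")
model.eval()

text = f"New Delhi is the {tokenizer.mask_token} of India."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Locate the masked position and decode the top-scoring token.
mask_index = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero()[0, 1]
predicted_id = logits[0, mask_index].argmax(dim=-1)
print(tokenizer.decode([predicted_id.item()]))
```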
Guide: Running Locally
- Download the Model: Access IndicBERT from AI4Bharat's storage or Hugging Face. The package includes both TensorFlow checkpoints and PyTorch model binaries.
- Set Up Environment: Ensure Python, PyTorch, and the Hugging Face Transformers library are installed. Use a virtual environment to manage dependencies.
- Load the Model: Use the Hugging Face Transformers library to load IndicBERT in your environment.
- Inference: Run your text data through the model to perform tasks such as classification or sentiment analysis, as shown in the sketch after this list.
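The steps above can be combined into one short script. This is a sketch under the assumption that the model is published as `ai4bharat/indic-bert` on Hugging Face; it extracts mean-pooled sentence embeddings to use as features for a downstream classifier, rather than running a ready-made sentiment model.

```python
# Hedged end-to-end sketch for the load and inference steps.
# Assumed model ID: "ai4bharat/indic-bert". Install dependencies first, e.g.
#   pip install torch transformers sentencepiece
import torch
from transformers import AutoTokenizer, AutoModel

MODEL_ID = "ai4bharat/indic-bert"

device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModel.from_pretrained(MODEL_ID).to(device).eval()

# Any of the 12 supported languages can appear in one batch; Hindi and
# English are used here as examples.
sentences = ["यह फिल्म बहुत अच्छी थी।", "This movie was very good."]
inputs = tokenizer(sentences, padding=True, truncation=True,
                   return_tensors="pt").to(device)

with torch.no_grad():
    outputs = model(**inputs)

# Mean-pool token embeddings into one vector per sentence, ignoring
# padding via the attention mask. These features can back a lightweight
# downstream classifier (e.g., for sentiment analysis).
mask = inputs["attention_mask"].unsqueeze(-1)
embeddings = (outputs.last_hidden_state * mask).sum(1) / mask.sum(1)
print(embeddings.shape)  # torch.Size([2, hidden_size])
```

Pooling over the attention mask keeps padding tokens out of the sentence vectors; the resulting embeddings can then be fed to a small classification head trained on task data.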
For better performance, consider using cloud GPUs such as those offered by AWS, GCP, or Azure.
License
IndicBERT is released under the MIT License, allowing for wide usage and modification with minimal restrictions.