MuRIL Base Cased (google/muril-base-cased)
Introduction
MuRIL (Multilingual Representations for Indian Languages) is a BERT-based model pre-trained on 17 Indian languages and their transliterated forms. It is developed to handle NLP tasks in these languages, leveraging both translated and transliterated data.
Architecture
MuRIL adopts a BERT base architecture, trained from scratch on datasets such as Wikipedia, Common Crawl, PMINDIA, and Dakshina. It incorporates both translation and transliteration segment pairs during training. Low-resource languages are upsampled with an exponent of 0.3 rather than the more common 0.7, which flattens the sampling distribution further and gives low-resource languages a larger share of the training data.
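The effect of the exponent can be seen with a small sketch: language sampling probabilities are taken proportional to each language's corpus share raised to the exponent, so a smaller exponent flattens the distribution. The corpus sizes below are invented purely for illustration, not MuRIL's actual data statistics.

```python
# Illustrative exponent-smoothed sampling: probabilities proportional to
# (corpus share) ** s. Corpus sizes here are made up for demonstration.
corpus_tokens = {"hi": 500_000_000, "ta": 80_000_000, "sa": 2_000_000}

def sampling_probs(sizes, s):
    total = sum(sizes.values())
    weights = {lang: (n / total) ** s for lang, n in sizes.items()}
    norm = sum(weights.values())
    return {lang: w / norm for lang, w in weights.items()}

print(sampling_probs(corpus_tokens, s=0.7))  # closer to raw corpus proportions
print(sampling_probs(corpus_tokens, s=0.3))  # flatter: low-resource languages upsampled
```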
Training
MuRIL is pre-trained using monolingual and parallel data. Monolingual data comes from Wikipedia and Common Crawl, while parallel data includes:
- Translated Data: Translation segment pairs are generated with the Google NMT pipeline, supplemented by the PMINDIA parallel corpus.
- Transliterated Data: Transliteration segment pairs are generated with the IndicTrans library, supplemented by the Dakshina dataset.
The model is pre-trained with a self-supervised masked language modeling objective using whole-word masking, for 1,000K (one million) steps with a batch size of 4096 and a maximum sequence length of 512.
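The masked language modeling objective can be probed directly with the fill-mask pipeline. This is a minimal sketch assuming the transformers library and the google/muril-base-cased checkpoint on the Hugging Face Hub; it only gives meaningful predictions if the hosted checkpoint ships the pre-training MLM head, otherwise transformers will initialize one randomly and warn.

```python
# Probe the whole-word-masking MLM objective with a fill-mask pipeline.
# Assumes the google/muril-base-cased checkpoint includes the MLM head;
# if it does not, transformers initializes it randomly and warns.
from transformers import pipeline

fill = pipeline("fill-mask", model="google/muril-base-cased")

# Hindi: "India is a [MASK] country." -- print the top predicted fillers.
for prediction in fill("भारत एक [MASK] देश है।"):
    print(prediction["token_str"], round(prediction["score"], 3))
```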
Guide: Running Locally
- Setup Environment: Ensure Python and necessary libraries like PyTorch or TensorFlow are installed.
- Download Pre-trained Model: Access the MuRIL model on Hugging Face's Model Hub.
- Load Model: Use the transformers library to load the model and fine-tune it for your specific task.
- Run Inference: Feed input data to the model and obtain predictions, as in the sketch after this list.
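A minimal end-to-end sketch of these steps, assuming the transformers library, PyTorch, and the google/muril-base-cased checkpoint. It extracts sentence embeddings via mean pooling; for a downstream task you would instead load a task-specific head (e.g. AutoModelForSequenceClassification) and fine-tune it.

```python
# Minimal sketch: load MuRIL and run inference, assuming transformers + PyTorch.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("google/muril-base-cased")
model = AutoModel.from_pretrained("google/muril-base-cased")
model.eval()

# Mixed native-script and transliterated input, both of which MuRIL was trained on.
sentences = ["यह एक उदाहरण वाक्य है।", "yah ek udaharan vaakya hai."]
inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Mean-pool token embeddings into one vector per sentence (a simple, common choice).
mask = inputs["attention_mask"].unsqueeze(-1).float()
embeddings = (outputs.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)
print(embeddings.shape)  # (2, 768) for the base model
```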
For enhanced performance, consider using cloud-based GPUs from providers like AWS, Google Cloud, or Azure to handle computations efficiently.
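When a GPU is available, moving the model and inputs onto it is a small change with PyTorch; this sketch assumes the same objects as the example above.

```python
# Move the model and the tokenized inputs to a GPU when one is available.
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)
inputs = {name: tensor.to(device) for name, tensor in inputs.items()}

with torch.no_grad():
    outputs = model(**inputs)
```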
License
MuRIL is licensed under the Apache-2.0 License, allowing for broad usage and modification with proper attribution.