RoBERTa Hindi (flax-community)
Introduction
RoBERTa Hindi is a transformer model pretrained on a large corpus of Hindi text with a masked language modeling (MLM) objective. It was developed as part of the Flax/JAX Community Week organized by Hugging Face, with TPU usage sponsored by Google, and was trained on Hindi data drawn from several publicly available datasets.
Architecture
RoBERTa Hindi follows the RoBERTa transformer architecture and was pretrained on a mixture of datasets that includes mC4, OSCAR, and Indic NLP corpora. It uses Byte-Pair Encoding (BPE) with a vocabulary size of 50,265 and processes inputs as sequences of 512 tokens, with special tokens marking the start and end of each document.
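As a rough illustration (not part of the original model card), the tokenizer shipped with the checkpoint can be inspected with the Hugging Face transformers library; the printed values come from the checkpoint's own configuration:

from transformers import AutoTokenizer

# Load the BPE tokenizer distributed with the checkpoint
tokenizer = AutoTokenizer.from_pretrained("flax-community/roberta-hindi")

print(tokenizer.vocab_size)        # vocabulary size (reported above as 50,265)
print(tokenizer.model_max_length)  # maximum sequence length (reported above as 512)
print(tokenizer.bos_token, tokenizer.eos_token)  # special tokens marking document boundaries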
Training
Training Data
The model was trained on a combination of the following datasets (a loading sketch appears after the list):
- OSCAR: A multilingual corpus obtained by language classification and filtering of the Common Crawl corpus.
- mC4: A multilingual, cleaned version of Common Crawl's web crawl corpus.
- IndicGLUE and Samanantar: Benchmarks and parallel corpora for Indic languages.
- Hindi Text Summarization Corpus and Old Newspapers Hindi: Collections from Hindi news sources.
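A minimal sketch of pulling the Hindi portions of the two largest sources with the datasets library is shown below; the configuration names ("unshuffled_deduplicated_hi", "hi") are assumptions based on the public Hub configs and may require a recent datasets version:

from datasets import load_dataset

# Hindi split of OSCAR (config name assumed; newer datasets versions may need trust_remote_code=True)
oscar_hi = load_dataset("oscar", "unshuffled_deduplicated_hi", split="train")

# Hindi split of mC4, streamed because the full corpus is very large
mc4_hi = load_dataset("mc4", "hi", split="train", streaming=True)

print(oscar_hi[0]["text"][:200])  # peek at one raw training document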
Training Procedure
- Preprocessing: Non-Hindi characters were removed, and texts were tokenized with BPE. Dynamic masking is applied during pretraining: 15% of tokens are selected, of which 80% are replaced with the <mask> token, 10% with a random token, and 10% are left unchanged; a data-collator sketch of this mechanism follows the list.
- Pretraining: Training ran on a Google Cloud TPU v3-8. The combined dataset was randomly shuffled before training, and the training logs are available on Weights & Biases.
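The dynamic masking mechanism can be reproduced with the standard transformers data collator. This is a sketch of the general technique under the 15% masking rate stated above, not the project's actual training script, and it assumes PyTorch is installed:

from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("flax-community/roberta-hindi")

# Masks 15% of tokens on the fly, so every pass over the data sees a fresh masking pattern
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

encoded = tokenizer("हिंदी में एक छोटा वाक्य")  # "A short sentence in Hindi"
batch = collator([encoded])
print(batch["input_ids"])  # some token ids replaced by <mask> or random tokens
print(batch["labels"])     # original ids at masked positions, -100 elsewhere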
Evaluation Results
RoBERTa Hindi has been evaluated on downstream tasks such as genre classification, token classification, and sentiment analysis, showing competitive performance against other Hindi language models; a hedged fine-tuning sketch follows.
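The released checkpoint only carries the masked-language-modeling head, so downstream results come from fine-tuning. Below is a sketch of preparing the model for such a task; the label count is a placeholder and the new classification head starts out randomly initialised:

from transformers import AutoModelForSequenceClassification, AutoTokenizer

# num_labels is a placeholder; set it to the number of classes in your downstream task
model = AutoModelForSequenceClassification.from_pretrained(
    "flax-community/roberta-hindi", num_labels=3
)
tokenizer = AutoTokenizer.from_pretrained("flax-community/roberta-hindi")
# The classification head must be fine-tuned (e.g. with the Trainer API)
# before it produces meaningful predictions.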
Guide: Running Locally
To use the RoBERTa Hindi model for masked language modeling:
from transformers import pipeline

# Load the fill-mask pipeline with the pretrained checkpoint
unmasker = pipeline('fill-mask', model='flax-community/roberta-hindi')
# "We wish you a pleasant <mask>" -- the model ranks candidates for the masked token
results = unmasker("हम आपके सुखद <mask> की कामना करते हैं")
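Each entry in results is a dictionary containing the completed sequence, a confidence score, and the predicted token id and string, which is the standard output format of the fill-mask pipeline.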
Suggested Cloud GPUs
- Google Cloud Platform (GCP)
- Amazon Web Services (AWS)
- Microsoft Azure
These platforms offer GPU instances that can speed up inference and fine-tuning of models such as RoBERTa Hindi.
License
The model and its associated data are subject to the licenses provided by the original dataset sources and Hugging Face. It is essential to review and comply with these licenses when using the model.