RoBERTa Hindi (flax-community)
Introduction
RoBERTa Hindi is a transformer model pretrained on a large corpus of Hindi text with a masked language modeling (MLM) objective. It was developed as part of the Flax/JAX Community Week organized by Hugging Face, with TPU usage sponsored by Google, and was trained on Hindi data drawn from several publicly available datasets.
Architecture
RoBERTa Hindi follows the RoBERTa transformer architecture and was pretrained on a mixture of datasets that includes mC4, OSCAR, and Indic NLP corpora. It uses Byte-Pair Encoding (BPE) with a vocabulary size of 50,265 and processes inputs as sequences of 512 tokens, with special tokens marking the start and end of each document.
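As a rough illustration (not part of the original model card), the tokenizer shipped with the checkpoint can be inspected with the Hugging Face transformers library; the printed values come from the checkpoint's own configuration:

from transformers import AutoTokenizer

# Load the BPE tokenizer distributed with the checkpoint
tokenizer = AutoTokenizer.from_pretrained("flax-community/roberta-hindi")

print(tokenizer.vocab_size)        # vocabulary size (reported above as 50,265)
print(tokenizer.model_max_length)  # maximum sequence length (reported above as 512)
print(tokenizer.bos_token, tokenizer.eos_token)  # special tokens marking document boundaries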
Training
Training Data
The model was trained on a combination of the following datasets (a loading sketch appears after the list):
- OSCAR: A multilingual corpus obtained by language classification and filtering of the Common Crawl corpus.
- mC4: A multilingual, cleaned version of Common Crawl's web crawl corpus.
- IndicGLUE and Samanantar: Benchmarks and parallel corpora for Indic languages.
- Hindi Text Summarization Corpus and Old Newspapers Hindi: Collections from Hindi news sources.
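A minimal sketch of pulling the Hindi portions of the two largest sources with the datasets library is shown below; the configuration names ("unshuffled_deduplicated_hi", "hi") are assumptions based on the public Hub configs and may require a recent datasets version:

from datasets import load_dataset

# Hindi split of OSCAR (config name assumed; newer datasets versions may need trust_remote_code=True)
oscar_hi = load_dataset("oscar", "unshuffled_deduplicated_hi", split="train")

# Hindi split of mC4, streamed because the full corpus is very large
mc4_hi = load_dataset("mc4", "hi", split="train", streaming=True)

print(oscar_hi[0]["text"][:200])  # peek at one raw training document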
Training Procedure
- Preprocessing: Non-Hindi characters were removed, and texts were tokenized with BPE. Dynamic masking is applied during pretraining: 15% of tokens are selected, of which 80% are replaced with the <mask> token, 10% with a random token, and 10% are left unchanged; a data-collator sketch of this mechanism follows the list.
- Pretraining: Training ran on a Google Cloud TPU v3-8. The combined dataset was randomly shuffled before training, and the training logs are available on Weights & Biases.
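The dynamic masking mechanism can be reproduced with the standard transformers data collator. This is a sketch of the general technique under the 15% masking rate stated above, not the project's actual training script, and it assumes PyTorch is installed:

from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("flax-community/roberta-hindi")

# Masks 15% of tokens on the fly, so every pass over the data sees a fresh masking pattern
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

encoded = tokenizer("हिंदी में एक छोटा वाक्य")  # "A short sentence in Hindi"
batch = collator([encoded])
print(batch["input_ids"])  # some token ids replaced by <mask> or random tokens
print(batch["labels"])     # original ids at masked positions, -100 elsewhere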
Evaluation Results
RoBERTa Hindi has been evaluated on downstream tasks such as genre classification, token classification, and sentiment analysis, showing competitive performance against other Hindi language models; a hedged fine-tuning sketch follows.
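The released checkpoint only carries the masked-language-modeling head, so downstream results come from fine-tuning. Below is a sketch of preparing the model for such a task; the label count is a placeholder and the new classification head starts out randomly initialised:

from transformers import AutoModelForSequenceClassification, AutoTokenizer

# num_labels is a placeholder; set it to the number of classes in your downstream task
model = AutoModelForSequenceClassification.from_pretrained(
    "flax-community/roberta-hindi", num_labels=3
)
tokenizer = AutoTokenizer.from_pretrained("flax-community/roberta-hindi")
# The classification head must be fine-tuned (e.g. with the Trainer API)
# before it produces meaningful predictions.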
Guide: Running Locally
To use the RoBERTa Hindi model for masked language modeling:
from transformers import pipeline

# Load the fill-mask pipeline with the pretrained checkpoint
unmasker = pipeline('fill-mask', model='flax-community/roberta-hindi')
# "We wish you a pleasant <mask>" -- the model ranks candidates for the masked token
results = unmasker("हम आपके सुखद <mask> की कामना करते हैं")
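Each entry in results is a dictionary containing the completed sequence, a confidence score, and the predicted token id and string, which is the standard output format of the fill-mask pipeline.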
Suggested Cloud GPUs
- Google Cloud Platform (GCP)
- Amazon Web Services (AWS)
- Microsoft Azure
These platforms offer GPU instances that can speed up inference and fine-tuning of models such as RoBERTa Hindi.
License
The model and its associated data are subject to the licenses provided by the original dataset sources and Hugging Face. It is essential to review and comply with these licenses when using the model.