roberta-kaz-large
Introduction
roberta-kaz-large is a RoBERTa-based language model developed specifically for the Kazakh language. It uses the RobertaForMaskedLM architecture and was trained on the "kz-transformers/multidomain-kazakh-dataset" to ensure robust generalization across various domains.
Architecture
The model follows the RoBERTa architecture and is trained for masked language modeling. It can be used through the Hugging Face Transformers library, which allows flexible integration into different applications.
Training
Training was carried out on two NVIDIA A100 GPUs over roughly 5.3 million examples from the dataset above. The model was trained for 10 epochs, using gradient accumulation to reach larger effective batch sizes. The learning rate was gradually increased (warmup) to keep training stable, and training comprised 208,100 steps in total, all aimed at strengthening the model's Kazakh language proficiency.
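For illustration only, a minimal sketch of such an MLM training setup with the Hugging Face Trainer is shown below. It is not the authors' training script: the batch size, accumulation steps, warmup steps, learning rate, and the dataset's text column name are assumptions; only the 10 epochs, the use of gradient accumulation, and the learning-rate warmup are stated in this card.

# Minimal MLM training sketch (not the authors' script). Continues masked-language-model
# training from the released checkpoint; requires the 'datasets' library.
from datasets import load_dataset
from transformers import (
    DataCollatorForLanguageModeling,
    RobertaForMaskedLM,
    RobertaTokenizerFast,
    Trainer,
    TrainingArguments,
)

tokenizer = RobertaTokenizerFast.from_pretrained('nur-dev/roberta-kaz-large')
model = RobertaForMaskedLM.from_pretrained('nur-dev/roberta-kaz-large')

dataset = load_dataset('kz-transformers/multidomain-kazakh-dataset', split='train')

def tokenize(batch):
    # Assumes the raw text lives in a 'text' column.
    return tokenizer(batch['text'], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True, remove_columns=dataset.column_names)

collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

args = TrainingArguments(
    output_dir='roberta-kaz-large-mlm',
    num_train_epochs=10,                # stated in this card
    per_device_train_batch_size=32,     # assumed
    gradient_accumulation_steps=8,      # assumed size; the card only says accumulation was used
    warmup_steps=10_000,                # assumed; the card only says the LR was gradually increased
    learning_rate=1e-4,                 # assumed
    fp16=True,
)

Trainer(model=model, args=args, train_dataset=tokenized, data_collator=collator).train()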
Guide: Running Locally
To use roberta-kaz-large locally, follow these steps:
- Install the Hugging Face Transformers library:
  pip install transformers
- Load the model and tokenizer:
  from transformers import RobertaTokenizerFast, RobertaForMaskedLM
  tokenizer = RobertaTokenizerFast.from_pretrained('nur-dev/roberta-kaz-large')
  model = RobertaForMaskedLM.from_pretrained('nur-dev/roberta-kaz-large')
- Alternatively, use a pipeline for masked language modeling (MLM), as shown in the usage example after this list:
  from transformers import pipeline
  pipe = pipeline('fill-mask', model='nur-dev/roberta-kaz-large')
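As a quick sanity check, the pipeline can be called on a sentence containing the tokenizer's mask token. The Kazakh example sentence and the top_k value below are illustrative and not taken from the model card:

from transformers import pipeline

pipe = pipeline('fill-mask', model='nur-dev/roberta-kaz-large')
# Illustrative sentence: "Almaty is the largest <mask> in Kazakhstan."
masked = f"Алматы Қазақстандағы ең үлкен {pipe.tokenizer.mask_token}."
for prediction in pipe(masked, top_k=3):
    print(prediction['token_str'], round(prediction['score'], 3))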
For optimal performance, especially during training or large-scale inference, consider using cloud GPU services such as AWS, GCP, or Azure.
License
The model is licensed under the Academic Free License 3.0 (AFL-3.0).