SlovakBERT
Introduction
SlovakBERT is a base-sized pretrained language model for the Slovak language, trained with a masked language modeling (MLM) objective. It is case-sensitive and is primarily intended for fine-tuning on downstream tasks. Certain characters need to be adjusted before tokenization so that input text matches the pretraining preprocessing (see Training).
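Because the model is trained with an MLM objective, it can be probed directly through the standard Transformers fill-mask pipeline. A minimal sketch (the Slovak example sentence is illustrative):

    from transformers import pipeline

    # RoBERTa-style models use <mask> as the mask token
    unmasker = pipeline('fill-mask', model='gerulata/slovakbert')
    print(unmasker("Deti sa <mask> na ihrisku."))  # prints the top <mask> completions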
Architecture
SlovakBERT is based on the RoBERTa architecture and can be used from both PyTorch and TensorFlow. It supports masked language modeling and can also serve as an encoder for extracting text features.
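For TensorFlow users, a minimal loading sketch; whether native TensorFlow weights are published on the Hub is an assumption here, so from_pt=True is passed to convert the PyTorch checkpoint if needed:

    from transformers import TFRobertaModel

    # Load SlovakBERT in TensorFlow; from_pt=True converts the PyTorch
    # checkpoint in case no native TensorFlow weights are available
    model = TFRobertaModel.from_pretrained('gerulata/slovakbert', from_pt=True)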
Training
SlovakBERT was pretrained on a diverse set of Slovak-language datasets: Wikipedia, OpenSubtitles, OSCAR, Gerulata WebCrawl, Gerulata Monitoring, and blbec.online. The training data was cleaned by replacing URLs and email addresses, reducing elongated punctuation, removing Markdown syntax, and eliminating text within braces. Pretraining was run with fairseq on four Nvidia A100 GPUs for 300K steps, with a learning rate of 5e-4 and a dropout rate of 0.1.
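The cleaning scripts themselves are not reproduced in this card; the following is a minimal sketch of the described steps, in which every regular expression, the 'url'/'mail' placeholder tokens, and the final whitespace collapse are illustrative assumptions:

    import re

    def preprocess(text: str) -> str:
        # Replace URLs and e-mail addresses with placeholder tokens
        text = re.sub(r'https?://\S+|www\.\S+', 'url', text)
        text = re.sub(r'\S+@\S+\.\S+', 'mail', text)
        # Reduce elongated punctuation, e.g. '!!!' -> '!'
        text = re.sub(r'([!?.,])\1+', r'\1', text)
        # Strip basic Markdown emphasis/heading syntax
        text = re.sub(r'[*_`#]+', '', text)
        # Remove text enclosed in braces
        text = re.sub(r'\{[^{}]*\}', '', text)
        # Collapse leftover whitespace (not listed in the card; added for tidiness)
        return re.sub(r'\s+', ' ', text).strip()

    print(preprocess('Pozri {poznámka} na www.example.sk!!!'))
    # -> 'Pozri na url'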
Guide: Running Locally
- Install the Transformers library:

    pip install transformers

- Load the model and tokenizer:

    from transformers import RobertaTokenizer, RobertaModel

    tokenizer = RobertaTokenizer.from_pretrained('gerulata/slovakbert')
    model = RobertaModel.from_pretrained('gerulata/slovakbert')

- Prepare and encode text (see the pooling sketch after this list for turning the output into a single sentence vector):

    text = "Text ktorý sa má embedovať."
    encoded_input = tokenizer(text, return_tensors='pt')
    output = model(**encoded_input)

- Use cloud GPUs: consider cloud platforms such as AWS, GCP, or Azure to access GPUs like the Nvidia A100 for faster training and inference.
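The encoder returns token-level hidden states rather than a single sentence vector. A common way to obtain a sentence embedding is mean pooling over non-padding tokens; this is a sketch of one conventional approach, not something the model card prescribes:

    import torch
    from transformers import RobertaTokenizer, RobertaModel

    tokenizer = RobertaTokenizer.from_pretrained('gerulata/slovakbert')
    model = RobertaModel.from_pretrained('gerulata/slovakbert')

    text = "Text ktorý sa má embedovať."
    encoded_input = tokenizer(text, return_tensors='pt')
    with torch.no_grad():  # inference only, no gradients needed
        output = model(**encoded_input)

    # Mean-pool token embeddings, ignoring padding positions
    mask = encoded_input['attention_mask'].unsqueeze(-1)   # (1, seq_len, 1)
    summed = (output.last_hidden_state * mask).sum(dim=1)  # (1, hidden_size)
    embedding = summed / mask.sum(dim=1)                   # (1, hidden_size)
    print(embedding.shape)  # torch.Size([1, 768]) for a base-sized model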
License
SlovakBERT is released under the MIT License, which permits both academic and commercial use.