SlovenBERTcina
Introduction
SlovenBERTcina is a Slovak RoBERTa-based masked language model with 83 million parameters, designed for a range of natural language processing tasks. The model ships with a pretrained tokenizer and is suitable for fine-tuning on downstream applications such as part-of-speech tagging and question answering.
Architecture
SlovenBERTcina is built on the RoBERTa architecture, a robustly optimized variant of BERT. It uses a ByteLevelBPETokenizer pretrained on the same Slovak dataset as the model itself. The tokenizer defines special tokens for sentence boundaries (<s>, </s>) and for masking (<mask>), which the model relies on when predicting masked words in a text.
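As a quick sanity check, the tokenizer can be loaded on its own to inspect these special tokens and to see how a Slovak sentence is segmented (a minimal sketch; the attribute names follow standard RoBERTa conventions):

from transformers import RobertaTokenizer

# Load just the tokenizer to inspect its special tokens.
tokenizer = RobertaTokenizer.from_pretrained("IMJONEZZ/SlovenBERTcina")

# Sentence-boundary and mask tokens (standard RoBERTa conventions).
print(tokenizer.bos_token, tokenizer.eos_token, tokenizer.mask_token)

# Byte-level BPE segments a Slovak sentence into subword pieces.
print(tokenizer.tokenize("Mnoho ľudí tu žije."))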
Training
The model was trained on an 8 GB Slovak monolingual dataset drawn from sources such as ParaCrawl and OSCAR, supplemented with additionally collected and cleaned data. The text was tokenized with an uncased ByteLevelBPETokenizer trained on the same corpus. Fill-mask evaluation examples show the model accurately predicting masked words in Slovak sentences.
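For illustration, an uncased ByteLevelBPETokenizer can be trained with the Hugging Face tokenizers library roughly as follows; the corpus path, vocabulary size, and minimum frequency here are placeholder assumptions, not the values actually used for SlovenBERTcina:

from tokenizers import ByteLevelBPETokenizer

# lowercase=True makes the tokenizer uncased, as described above.
tokenizer = ByteLevelBPETokenizer(lowercase=True)

# "sk_corpus.txt" is a placeholder for the cleaned Slovak text files.
tokenizer.train(
    files=["sk_corpus.txt"],
    vocab_size=52000,  # assumed value, not taken from the model card
    min_frequency=2,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)

# Write vocab.json and merges.txt for later use with the model.
tokenizer.save_model("slovenbertcina-tokenizer")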
Guide: Running Locally
- Install Dependencies: Ensure you have the PyTorch and Transformers libraries installed.
pip install torch transformers
- Download the Model: Retrieve SlovenBERTcina from Hugging Face's Model Hub. The from_pretrained calls in the next step download and cache the files automatically, or you can fetch them ahead of time as shown below.
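If you prefer to pre-download the files, the huggingface_hub CLI can fetch the repository (assuming a recent huggingface_hub release that ships the download command):

huggingface-cli download IMJONEZZ/SlovenBERTcina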
- Load the Model: Use the Transformers library to load the model.
from transformers import RobertaTokenizer, RobertaForMaskedLM

# Fetch the pretrained tokenizer and masked-language-model weights from the Hub.
tokenizer = RobertaTokenizer.from_pretrained("IMJONEZZ/SlovenBERTcina")
model = RobertaForMaskedLM.from_pretrained("IMJONEZZ/SlovenBERTcina")
- Inference: Use the model for tasks like masked language modeling.
# Tokenize a Slovak sentence containing the <mask> placeholder.
inputs = tokenizer("Mnoho ľudí tu <mask>", return_tensors="pt")

# Forward pass; the logits score every vocabulary token at each position.
outputs = model(**inputs)
predictions = outputs.logits
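Continuing from the step above, the logits can be turned into an actual word by taking the highest-scoring vocabulary token at the <mask> position; a minimal sketch:

# Locate the <mask> position in the input ids.
mask_index = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]

# Decode the top-scoring token at that position.
top_token_id = predictions[0, mask_index].argmax(dim=-1)
print(tokenizer.decode(top_token_id))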
For efficient training or large-scale inference, consider using cloud GPUs from providers like AWS, Google Cloud, or Azure.
License
The SlovenBERTcina model is provided under the condition that users credit Christopher Brousseau when it is used in research or professional projects. Ensure adherence to these terms when integrating the model into your applications.