BERTić (classla/bcms-bertic)

Introduction
BERTić is a transformer-based language model specifically designed for Bosnian, Croatian, Montenegrin, and Serbian. It was developed in Zagreb, Croatia, and trained on over 8 billion tokens of text in these languages. Fine-tuned versions of the model are available for tasks such as named entity recognition and hate speech detection.
Architecture
BERTić employs the ELECTRA architecture: instead of masked language modeling, a discriminator is pretrained to detect tokens that a small generator network has replaced, which makes pretraining more sample-efficient. The model has been benchmarked against multilingual BERT and CroSloEngual BERT, demonstrating superior results across a range of natural language processing tasks.
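As a concrete illustration, the sketch below queries the discriminator directly: assuming the published checkpoint includes ELECTRA's pretraining head, each token receives one logit, and a positive value means the discriminator judges that token to have been replaced. The example sentence is arbitrary.

```python
# A minimal sketch of ELECTRA's replaced-token detection, assuming the
# checkpoint ships with the discriminator head.
import torch
from transformers import AutoTokenizer, ElectraForPreTraining

tokenizer = AutoTokenizer.from_pretrained("classla/bcms-bertic")
model = ElectraForPreTraining.from_pretrained("classla/bcms-bertic")

inputs = tokenizer("Zagreb je glavni grad Hrvatske.", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # one logit per token

# A positive logit means the discriminator flags the token as "replaced".
flags = (logits > 0).squeeze().tolist()
for token, replaced in zip(tokenizer.convert_ids_to_tokens(inputs["input_ids"][0]), flags):
    print(token, "replaced" if replaced else "original")
```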
Training
The model was pretrained on a large corpus of Bosnian, Croatian, Montenegrin, and Serbian text. Task-specific fine-tuned versions are also published, for example for named entity recognition and hate speech detection.
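As an illustration of the fine-tuned versions, the sketch below runs a named entity recognition variant through the token-classification pipeline. The checkpoint id classla/bcms-bertic-ner is an assumption about where the NER fine-tune is published on the Hugging Face Hub.

```python
# A minimal sketch using the NER fine-tune; the checkpoint id
# "classla/bcms-bertic-ner" is assumed here.
from transformers import pipeline

ner = pipeline(
    "ner",
    model="classla/bcms-bertic-ner",
    aggregation_strategy="simple",  # merge subword pieces into whole entities
)

for entity in ner("Nikola Tesla rođen je u Smiljanu."):
    print(entity["entity_group"], entity["word"], round(entity["score"], 3))
```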
Guide: Running Locally
To run BERTić locally, follow these basic steps:
- Environment Setup: Ensure you have Python installed, and set up a virtual environment.

```bash
python -m venv bertic-env
source bertic-env/bin/activate
```

- Install Dependencies: Use pip to install the necessary packages, such as `transformers` and `torch`.

```bash
pip install transformers torch
```

- Load the Model: Use the Hugging Face `transformers` library to load the tokenizer and model.

```python
from transformers import AutoModelForPreTraining, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("classla/bcms-bertic")
model = AutoModelForPreTraining.from_pretrained("classla/bcms-bertic")
```

- Inference: Prepare your text data and use the model for inference; a sketch after these steps shows how to get sentence embeddings from the encoder instead of the pretraining head.

```python
inputs = tokenizer("Your input text here", return_tensors="pt")
outputs = model(**inputs)
```
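For most downstream uses you want the encoder's hidden states rather than the pretraining head's output. The sketch below is one way to pool those states into a sentence embedding; mean pooling over the last hidden layer is an assumption here, not a recommendation from the model authors.

```python
# A minimal sketch: sentence embeddings via mean pooling over the encoder's
# last hidden layer (the pooling strategy is an assumption, not part of the
# official model documentation).
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("classla/bcms-bertic")
model = AutoModel.from_pretrained("classla/bcms-bertic")  # encoder only, no head

inputs = tokenizer("Dobar dan svima!", return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state  # (batch, tokens, hidden_size)

# Mask out padding positions so only real tokens contribute to the average.
mask = inputs["attention_mask"].unsqueeze(-1)
embedding = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
print(embedding.shape)  # (1, hidden_size)
```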
Cloud GPUs: For optimal performance, especially for training or large-scale inference, consider using cloud-based GPUs from providers like AWS, Google Cloud, or Azure.
License
BERTić is distributed under the Apache 2.0 license, which allows for both personal and commercial use, modification, and distribution.