RobeCzech Base
Introduction
RobeCzech is a monolingual RoBERTa language representation model developed by the Institute of Formal and Applied Linguistics (ÚFAL), Charles University, Prague. It is trained exclusively on Czech data and supports the fill-mask task as well as downstream applications such as morphological tagging, lemmatization, dependency parsing, named entity recognition, and semantic parsing.
Architecture
- Model Type: Fill-Mask
- Language: Czech
- Base Architecture: RoBERTa
- Tokenization: Byte-level BPE (BBPE) tokenizer
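Because the model uses a byte-level BPE vocabulary, it can be instructive to look at how the tokenizer segments Czech text. The snippet below is a minimal sketch using the Hugging Face transformers API; the example sentence is an arbitrary illustration.
from transformers import AutoTokenizer

# Download the RobeCzech tokenizer from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("ufal/robeczech-base")

# Inspect how the byte-level BPE tokenizer splits a Czech sentence (arbitrary example)
tokens = tokenizer.tokenize("Praha je hlavní město České republiky.")
print(tokens)
print(tokenizer.convert_tokens_to_ids(tokens))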
Training
RobeCzech was trained with the Fairseq implementation on a corpus of Czech texts, including SYN v4, Czes, and the Czech Wikipedia. Training used a batch size of 8,192 with samples up to 512 tokens long, and the Adam optimizer minimized the masked language-modeling objective. The model was then evaluated on several NLP tasks and achieved high accuracy in morphological analysis, dependency parsing, named entity recognition, and semantic parsing.
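The original pre-training was done in Fairseq, but the masked language-modeling objective itself can be illustrated with the transformers API. The following is a rough sketch, not the original training pipeline: it assumes random masking of 15% of tokens (the default of DataCollatorForLanguageModeling) and shows how the model returns the cross-entropy loss over the masked positions.
from transformers import AutoTokenizer, AutoModelForMaskedLM, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("ufal/robeczech-base")
model = AutoModelForMaskedLM.from_pretrained("ufal/robeczech-base")

# Randomly mask 15% of tokens, as in standard RoBERTa-style pre-training
collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)

# Encode an arbitrary example sentence and let the collator create masked inputs and labels
encoding = tokenizer("RobeCzech je jazykový model pro češtinu.", return_tensors="pt")
batch = collator([{k: v[0] for k, v in encoding.items()}])

# The model computes the masked language-modeling loss from the labels
outputs = model(**batch)
print(outputs.loss)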
Guide: Running Locally
To run RobeCzech locally, you can use the following code snippet:
# Load the RobeCzech tokenizer and masked-language-model head from the Hugging Face Hub
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("ufal/robeczech-base")
model = AutoModelForMaskedLM.from_pretrained("ufal/robeczech-base")
Basic Steps
- Install the transformers library.
- Load the tokenizer and model using the AutoTokenizer and AutoModelForMaskedLM classes, then query the model as sketched below.
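Once the model is loaded, the simplest way to query it is through the fill-mask pipeline. The sketch below assumes the ufal/robeczech-base checkpoint from the Hub; the Czech sentence is an arbitrary example, and the mask token is read from the tokenizer so the placeholder always matches the model.
from transformers import pipeline

# Build a fill-mask pipeline backed by RobeCzech
unmasker = pipeline("fill-mask", model="ufal/robeczech-base")

# Use the tokenizer's own mask token instead of hard-coding it
mask = unmasker.tokenizer.mask_token
for prediction in unmasker(f"Praha je {mask} město České republiky."):
    print(prediction["token_str"], prediction["score"])
Each prediction is a candidate completion for the masked position, ranked by the model's score.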
Suggestion: Cloud GPUs
For efficient training and inference, consider using cloud GPU services like AWS EC2, Google Cloud Platform, or Microsoft Azure.
License
This model is released under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (cc-by-nc-sa-4.0).