roberta-base-latin-cased3
Introduction
The RoBERTa Latin Model, Version 3, is a Transformer-based language model designed for Latin text. It serves two primary purposes: evaluating Handwritten Text Recognition (HTR) results and functioning as a decoder for the TrOCR architecture. This model is an updated iteration with improved training data compared to its predecessors.
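For the TrOCR use case, the Transformers library lets you pair an image encoder with a pretrained text decoder via VisionEncoderDecoderModel. The sketch below is illustrative only: the vision encoder checkpoint is an assumption, as the model card does not state which encoder was actually used.

```python
# Hedged sketch: wiring this RoBERTa model as the text decoder of a
# TrOCR-style vision encoder-decoder. The ViT encoder named below is an
# assumption for illustration, not the author's documented setup.
from transformers import VisionEncoderDecoderModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("pstroe/roberta-base-latin-cased3")

model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(
    "google/vit-base-patch16-224-in21k",   # hypothetical image encoder
    "pstroe/roberta-base-latin-cased3",    # this model as the decoder
)

# Special tokens the decoder needs for generation.
model.config.decoder_start_token_id = tokenizer.cls_token_id
model.config.pad_token_id = tokenizer.pad_token_id
model.config.eos_token_id = tokenizer.sep_token_id
```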
Architecture
The model is based on RoBERTa, a robustly optimized BERT pretraining approach, adapted here for Latin. It uses the standard Transformer encoder architecture and is trained with a masked language modeling objective.
Training
The model was trained on a corpus drawn from the Corpus Corporum, maintained by the University of Zurich, comprising roughly 1.5 GB of text, about three times as much data as in previous versions. Preprocessing steps included:
- Normalization of text using the Classical Language Toolkit (CLTK) with sentence splitting.
- Language identification with langid to retain only Latin lines.

These steps resulted in a corpus of approximately 232 million tokens. The training dataset will soon be available on Hugging Face.
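The preprocessing scripts themselves are not published in this card. The following is a minimal sketch of the langid filtering step only, assuming one sentence per line (as produced by the CLTK sentence splitting mentioned above) and langid's ISO code "la" for Latin; the file names are placeholders.

```python
# Minimal sketch of the Latin-filtering step, assuming the corpus has
# already been sentence-split to one sentence per line.
import langid

with open("corpus_sentences.txt", encoding="utf-8") as src, \
     open("corpus_latin_only.txt", "w", encoding="utf-8") as dst:
    for line in src:
        line = line.strip()
        if not line:
            continue
        lang, score = langid.classify(line)
        if lang == "la":  # keep only lines identified as Latin
            dst.write(line + "\n")
```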
Guide: Running Locally
- Installation: Ensure Python and PyTorch are installed, then use pip to install the Hugging Face Transformers library:

  ```bash
  pip install transformers
  ```
- Load the Model:

  ```python
  from transformers import AutoModel, AutoTokenizer

  tokenizer = AutoTokenizer.from_pretrained("pstroe/roberta-base-latin-cased3")
  model = AutoModel.from_pretrained("pstroe/roberta-base-latin-cased3")
  ```
- Inference: Use the tokenizer and model for inference on Latin text; a fill-mask example follows this list.
- Cloud GPUs: For efficient training and inference, consider using cloud GPU services such as AWS EC2, Google Cloud, or Azure.
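Because the model is trained with a masked language modeling objective, a quick local check is the fill-mask pipeline. The sketch below uses an illustrative Latin sentence and RoBERTa's `<mask>` token; it is a sanity check, not part of the author's evaluation setup.

```python
from transformers import pipeline

# Fill-mask sanity check; the example sentence is illustrative only.
fill_mask = pipeline("fill-mask", model="pstroe/roberta-base-latin-cased3")

# RoBERTa models use "<mask>" as the mask token.
for prediction in fill_mask("Gallia est omnis divisa in partes <mask>."):
    print(prediction["token_str"], round(prediction["score"], 3))
```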
License
For licensing details, contact Phillip Ströbel via email at pstroebel@cl.uzh.ch or on Twitter (@CLingophil).