roberta-base-latin-cased3

pstroe

Introduction

The RoBERTa Latin Model, Version 3, is a Transformer-based language model designed for Latin text. It serves two primary purposes: evaluating Handwritten Text Recognition (HTR) results and functioning as a decoder for the TrOCR architecture. This model is an updated iteration with improved training data compared to its predecessors.

Architecture

The model is based on RoBERTa (a Robustly Optimized BERT Pretraining Approach) and is tailored to Latin. It uses a Transformer encoder architecture trained with a masked language modeling objective.
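
To illustrate that objective, the sketch below masks one token of a Latin sentence and computes the masked-LM loss; the checkpoint name is taken from the guide further down, and the sentence and masked position are purely illustrative.

    import torch
    from transformers import AutoTokenizer, AutoModelForMaskedLM

    tokenizer = AutoTokenizer.from_pretrained("pstroe/roberta-base-latin-cased3")
    model = AutoModelForMaskedLM.from_pretrained("pstroe/roberta-base-latin-cased3")

    text = "Gallia est omnis divisa in partes tres."
    inputs = tokenizer(text, return_tensors="pt")

    # Mask a token in the middle of the sequence; the model must recover it from context.
    mask_pos = inputs["input_ids"].shape[1] // 2
    labels = torch.full_like(inputs["input_ids"], -100)  # positions set to -100 are ignored by the loss
    labels[0, mask_pos] = inputs["input_ids"][0, mask_pos]
    inputs["input_ids"][0, mask_pos] = tokenizer.mask_token_id

    with torch.no_grad():
        loss = model(**inputs, labels=labels).loss
    print(f"masked LM loss: {loss.item():.3f}")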

Training

The model was trained on a corpus drawn from the Corpus Corporum, maintained by the University of Zurich, comprising 1.5 GB of text, three times the amount used for previous versions. Preprocessing steps included:

  • Normalization of the text with the Classical Language Toolkit (CLTK), including sentence splitting.
  • Language identification with langid to retain only lines identified as Latin.

This preprocessing resulted in a corpus of approximately 232 million tokens. The training dataset will soon be available on Hugging Face.
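
A rough sketch of the langid filtering step is shown below, assuming one sentence per line; the example lines are invented for illustration, and the CLTK normalization and sentence splitting described above are not reproduced here.

    import langid

    def keep_latin_lines(lines):
        """Yield only the lines that langid identifies as Latin ('la')."""
        for line in lines:
            lang, _score = langid.classify(line)
            if lang == "la":
                yield line

    # Illustrative input; in practice this would be the CLTK-normalized, sentence-split corpus.
    raw_lines = [
        "Gallia est omnis divisa in partes tres.",
        "This English sentence would be filtered out.",
    ]
    print(list(keep_latin_lines(raw_lines)))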

Guide: Running Locally

  1. Installation: Ensure Python and PyTorch are installed. Use pip to install the Hugging Face Transformers library:

    pip install transformers
    
  2. Load the Model:

    from transformers import AutoModel, AutoTokenizer
    
    tokenizer = AutoTokenizer.from_pretrained("pstroe/roberta-base-latin-cased3")
    model = AutoModel.from_pretrained("pstroe/roberta-base-latin-cased3")
    
  3. Inference: Use the tokenizer and model for inference on Latin text.
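
     A minimal fill-mask sketch is shown below; the Latin example sentence is only an illustration.

    from transformers import pipeline

    fill = pipeline("fill-mask", model="pstroe/roberta-base-latin-cased3")

    # RoBERTa tokenizers use "<mask>"; take the token from the tokenizer to be safe.
    sentence = f"Gallia est omnis divisa in partes {fill.tokenizer.mask_token}."
    for prediction in fill(sentence):
        print(prediction["token_str"], round(prediction["score"], 3))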

  4. Cloud GPUs: For efficient training and inference, consider using cloud GPU services such as AWS EC2, Google Cloud, or Azure.

License

For licensing details, contact Phillip Ströbel via email at pstroebel@cl.uzh.ch or on Twitter at @CLingophil.
