bert-base-italian-xxl-cased

dbmdz

Introduction

The bert-base-italian-xxl-cased model is an open-source Italian language model created by the MDZ Digital Library team (dbmdz) at the Bavarian State Library. It is part of a family of BERT and ELECTRA models trained for Italian natural language processing tasks.

Architecture

The Italian BERT models follow the standard BERT architecture and are released in both cased and uncased versions. The XXL variant is trained on a substantially extended dataset, which improves its coverage of Italian and its downstream performance. The ELECTRA models share the same underlying Transformer architecture but are trained through a more sample-efficient generator-discriminator framework.
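
The key architectural hyperparameters can be read directly from the model configuration. The following is a minimal sketch, assuming the transformers library is installed and the Hugging Face Hub is reachable:

    from transformers import AutoConfig

    # Downloads only the small configuration file, not the model weights.
    config = AutoConfig.from_pretrained("dbmdz/bert-base-italian-xxl-cased")

    # BERT-base dimensions: 12 layers, hidden size 768, 12 attention heads.
    print(config.num_hidden_layers, config.hidden_size, config.num_attention_heads)
    print(config.vocab_size)  # size of the Italian subword vocabulary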

Training

The base training data consists of a recent Wikipedia dump and texts from the OPUS corpora. The XXL models extend this with the Italian portion of the OSCAR corpus, resulting in a final corpus of 81GB and over 13 billion tokens. The BERT models were trained with an initial sequence length of 512 subwords for approximately 2-3 million steps. ELECTRA training followed the standard procedure and ran for 1 million steps with a batch size of 128.
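
The subword vocabulary learned from this corpus can be inspected by tokenizing a sentence. This is an illustrative sketch; the example sentence is an assumption, not taken from the model card:

    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-base-italian-xxl-cased")

    # Words missing from the vocabulary are split into "##"-prefixed subword pieces.
    print(tokenizer.tokenize("La Biblioteca di Stato bavarese si trova a Monaco."))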

Guide: Running Locally

  1. Install Transformers Library:

    pip install transformers
    
  2. Load the Model and Tokenizer:

    from transformers import AutoModel, AutoTokenizer
    
    model_name = "dbmdz/bert-base-italian-xxl-cased"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    
  3. Run Inference: Use the tokenizer and model to process Italian text. Since this is a masked language model, the fill-mask task is the most direct way to try it out, as sketched below.
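
     A minimal fill-mask sketch (the pipeline API usage and the example sentence are illustrative assumptions, not taken from the model card):

    from transformers import pipeline

    # BERT is pretrained with masked language modeling, so fill-mask is its native task.
    fill_mask = pipeline("fill-mask", model="dbmdz/bert-base-italian-xxl-cased")

    # Print the top predictions for the masked token.
    for prediction in fill_mask("Umberto Eco è stato un grande [MASK]."):
        print(prediction["token_str"], round(prediction["score"], 3))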

  4. Cloud GPUs: For efficient training and inference, consider cloud services such as AWS, Google Cloud, or Azure that offer GPU instances. Once a GPU is available, move the model and its inputs onto it, as sketched below.
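
     A minimal device-placement sketch, assuming PyTorch is installed and reusing the model and tokenizer loaded in step 2 (the example sentence is arbitrary):

    import torch

    # Fall back to CPU when no GPU is present.
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = model.to(device)

    # The model and its inputs must live on the same device.
    inputs = tokenizer("Buongiorno a tutti!", return_tensors="pt").to(device)
    outputs = model(**inputs)
    print(outputs.last_hidden_state.shape)  # (batch, sequence_length, hidden_size)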

License

The bert-base-italian-xxl-cased model is distributed under the MIT License, allowing for wide usage and modification.
