HerBERT Base Cased

Developed by Allegro

Introduction

HerBERT is a BERT-based language model specifically trained for the Polish language using Masked Language Modeling (MLM) and Sentence Structural Objective (SSO). It employs dynamic masking of whole words to enhance its language understanding capabilities.
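Dynamic whole-word masking means that when any subword piece of a word is selected, every piece of that word is masked, and the selection is re-drawn on each pass over the data. The sketch below illustrates the idea in plain Python; the piece grouping, the `<mask>` token string, and the 15% rate are illustrative assumptions, not HerBERT's exact preprocessing:

```python
import random

def whole_word_mask(words, mask_rate=0.15, mask_token="<mask>", seed=None):
    """Mask whole words: if a word is selected, all of its subword
    pieces are replaced by the mask token. `words` is a list of lists
    of subword pieces, e.g. [["Her", "BERT"], ["rozumie"]]."""
    rng = random.Random(seed)
    masked = []
    for pieces in words:
        if rng.random() < mask_rate:
            # Mask every piece of the selected word, not just one
            masked.append([mask_token] * len(pieces))
        else:
            masked.append(list(pieces))
    return masked

# Because the selection is re-drawn on every call ("dynamic"), the
# same sentence yields different masks across training epochs.
example = [["Her", "BERT"], ["rozumie"], ["pol", "ski"]]
masked = whole_word_mask(example, seed=0)
```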

Architecture

HerBERT utilizes a transformer architecture and was developed using version 2.9 of the Hugging Face transformers library. The model leverages a character-level byte-pair encoding tokenizer with a vocabulary size of 50k tokens. The tokenizer, HerbertTokenizerFast, is optimized for speed and efficiency.

Training

The model was trained on six distinct Polish corpora, including CCNet Middle, CCNet Head, the National Corpus of Polish, Open Subtitles, Wikipedia, and Wolne Lektury. These corpora provide a diverse range of language data to ensure comprehensive language modeling.

Guide: Running Locally

To run HerBERT locally, follow these steps:

  1. Install Dependencies:

    pip install transformers torch
    
  2. Load Model and Tokenizer:

    from transformers import AutoTokenizer, AutoModel
    
    tokenizer = AutoTokenizer.from_pretrained("allegro/herbert-base-cased")
    model = AutoModel.from_pretrained("allegro/herbert-base-cased")
    
  3. Inference Example:

    # Calling the tokenizer directly supersedes the older
    # batch_encode_plus(); a list of string tuples is a batch of pairs.
    output = model(
        **tokenizer(
            [
                (
                    "A potem szedł środkiem drogi w kurzawie, bo zamiatał nogami, ślepy dziad prowadzony przez tłustego kundla na sznurku.",
                    "A potem leciał od lasu chłopak z butelką, ale ten ujrzawszy księdza przy drodze okrążył go z dala i biegł na przełaj pól do karczmy."
                )
            ],
            padding='longest',
            add_special_tokens=True,
            return_tensors='pt'
        )
    )
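
The forward pass returns token-level hidden states rather than a single sentence vector. One common way to derive a sentence embedding is attention-masked mean pooling over the last hidden state; the pooling choice below is an assumption for illustration, not part of the model card:

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("allegro/herbert-base-cased")
model = AutoModel.from_pretrained("allegro/herbert-base-cased")

enc = tokenizer("Ala ma kota.", return_tensors="pt")
with torch.no_grad():
    out = model(**enc)

# Token-level embeddings: (batch, seq_len, hidden_size)
token_embeddings = out.last_hidden_state

# Mean-pool over tokens, weighting out padding via the attention mask
mask = enc["attention_mask"].unsqueeze(-1)
sentence_embedding = (token_embeddings * mask).sum(1) / mask.sum(1)
```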
    

For enhanced performance, consider using cloud GPUs available from providers such as AWS, GCP, or Azure.
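On a GPU machine, the same pipeline can be moved to the accelerator explicitly. The device-selection pattern below is a generic PyTorch idiom, not specific to HerBERT:

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Fall back to CPU when no CUDA device is available
device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained("allegro/herbert-base-cased")
model = AutoModel.from_pretrained("allegro/herbert-base-cased").to(device)
model.eval()  # disable dropout for inference

# Inputs must live on the same device as the model
inputs = tokenizer("Zażółć gęślą jaźń.", return_tensors="pt").to(device)
with torch.no_grad():
    outputs = model(**inputs)
```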

License

HerBERT is licensed under the Creative Commons Attribution 4.0 International (CC BY 4.0). Users are free to share and adapt the model as long as appropriate credit is given.
