LVBERT Model Documentation

Introduction

LVBERT is a pretrained BERT model for the Latvian language. It is intended for natural language understanding tasks such as text classification, named entity recognition, and question answering, and it can also produce contextual embeddings for applications like text similarity, clustering, and semantic search. The model is case-sensitive. It was first introduced in a research paper and released on GitHub, and the current version is available on the Hugging Face Hub.

Architecture

LVBERT uses the BERT-base configuration:

  • 12 layers
  • 768 hidden units
  • 12 attention heads
  • Maximum sequence length of 512
  • Mini-batch size of 128
  • Vocabulary size of 32,000 tokens
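
As a quick sanity check, these settings can be read back from the published configuration with the transformers library. The repository identifier below is an assumption (check the Hugging Face Hub for the exact LVBERT repository name); the rest is a minimal sketch.

  from transformers import AutoConfig

  # NOTE: the repository id is an assumption; verify the exact LVBERT
  # identifier on the Hugging Face Hub before running.
  config = AutoConfig.from_pretrained("AiLab-IMCS-UL/lvbert")

  print(config.num_hidden_layers)        # expected: 12
  print(config.hidden_size)              # expected: 768
  print(config.num_attention_heads)      # expected: 12
  print(config.max_position_embeddings)  # expected: 512
  print(config.vocab_size)               # expected: 32000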

The model's tokenizer is based on a SentencePiece model trained on the pretraining corpus and later converted to the WordPiece format used by BERT.
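
The tokenizer can be exercised directly, again assuming the hypothetical repository id used above. Because the model is case-sensitive, cased and lowercased forms of the same word tokenize differently, as the sketch below illustrates.

  from transformers import AutoTokenizer

  # Repository id is an assumption; adjust to the actual LVBERT repository.
  tokenizer = AutoTokenizer.from_pretrained("AiLab-IMCS-UL/lvbert")

  # WordPiece tokenization of a Latvian sentence.
  print(tokenizer.tokenize("Rīga ir Latvijas galvaspilsēta."))

  # Case sensitivity: the lowercased form yields different tokens.
  print(tokenizer.tokenize("rīga"))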

Training

The model was pretrained using a corpus of approximately 500 million tokens. Sources include:

  • Balanced Corpus of Modern Latvian
  • Latvian Wikipedia
  • Corpus of News Portal Articles
  • Corpus of News Portal Comments

The training objectives included masked language modeling and next sentence prediction.
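Because the model was pretrained with masked language modeling, it can be probed through a fill-mask pipeline. This is a sketch under the same repository-id assumption as above; the mask token is taken from the tokenizer rather than hard-coded.

  from transformers import pipeline

  # Repository id is an assumption; replace with the actual LVBERT identifier.
  fill_mask = pipeline("fill-mask", model="AiLab-IMCS-UL/lvbert")

  # Use the tokenizer's own mask token ([MASK] for BERT-style models).
  mask = fill_mask.tokenizer.mask_token
  for prediction in fill_mask(f"Rīga ir Latvijas {mask}."):
      print(prediction["token_str"], round(prediction["score"], 3))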

Guide: Running Locally

  1. Install Dependencies: Ensure Python and the necessary libraries, such as transformers and torch, are installed.
  2. Download the Model: Fetch LVBERT from its Hugging Face repository; the transformers library can also download and cache it automatically on first use.
  3. Load the Model: Use the transformers library to load the LVBERT model and tokenizer.
  4. Fine-tune or Use as Needed: Depending on your task, fine-tune the model or use it to generate embeddings, as in the sketch after this list.
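
Put together, the steps above might look like the following minimal sketch, which extracts a sentence embedding by mean-pooling the final hidden states. The repository id is the same assumption as earlier; install the dependencies first (for example, pip install transformers torch).

  import torch
  from transformers import AutoModel, AutoTokenizer

  # Steps 2-3: from_pretrained downloads and caches the model automatically;
  # the repository id is an assumption, so use the actual LVBERT repo name.
  tokenizer = AutoTokenizer.from_pretrained("AiLab-IMCS-UL/lvbert")
  model = AutoModel.from_pretrained("AiLab-IMCS-UL/lvbert")
  model.eval()

  # Step 4: compute a sentence embedding by mean-pooling the token vectors.
  inputs = tokenizer("Rīga ir Latvijas galvaspilsēta.", return_tensors="pt")
  with torch.no_grad():
      outputs = model(**inputs)

  # Mask out padding before averaging (a no-op for a single unpadded sentence).
  mask = inputs["attention_mask"].unsqueeze(-1)
  embedding = (outputs.last_hidden_state * mask).sum(1) / mask.sum(1)
  print(embedding.shape)  # torch.Size([1, 768])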

For optimal performance, consider using cloud GPU services such as AWS, Google Cloud, or Azure.

License

LVBERT is licensed under the Apache 2.0 License, which permits both academic and commercial use in compliance with its terms.