lvbert
AiLab-IMCS-ULLVBERT Model Documentation
Introduction
LVBERT is a pretrained BERT model tailored for the Latvian language. It was designed for tasks involving natural language understanding, such as text classification, named entity recognition, and question answering. The model can also compute contextual embeddings for applications like text similarity, clustering, and semantic search. LVBERT is case-sensitive and was first introduced in a research paper and released on GitHub. The current version is available in the Hugging Face repository.
Architecture
LVBERT uses the BERT-base configuration:
- 12 layers
- 768 hidden units
- 12 attention heads
- Maximum sequence length of 512
- Mini-batch size of 128
- Vocabulary size of 32,000 tokens
The model's tokenizer is based on a SentencePiece model trained on the dataset, later converted to the WordPiece format used by BERT.
Training
The model was pretrained using a corpus of approximately 500 million tokens. Sources include:
- Balanced Corpus of Modern Latvian
- Latvian Wikipedia
- Corpus of News Portal Articles
- Corpus of News Portal Comments
The training objectives included masked language modeling and next sentence prediction.
Guide: Running Locally
- Install Dependencies: Ensure you have Python and necessary libraries installed, such as
transformers
andtorch
. - Clone Repository: Download the model from its Hugging Face repository.
- Load the Model: Use the
transformers
library to load the LVBERT model and tokenizer. - Fine-tune or Use as Needed: Depending on your task, you may fine-tune the model or use it to generate embeddings.
For optimal performance, consider using cloud GPU services such as AWS, Google Cloud, or Azure.
License
LVBERT is licensed under the Apache 2.0 License, allowing for both academic and commercial use with compliance to the license terms.