LuxemBERT
Introduction
LuxemBERT is a BERT-based language model specifically developed for the Luxembourgish language. It was trained on a dataset comprising 12.2 million sentences, sourced from the Luxembourgish Wikipedia, the Leipzig Corpora Collection, and rtl.lu. Additionally, data augmentation was performed by partially translating sentences from the German Wikipedia into Luxembourgish.
Architecture
LuxemBERT follows the standard BERT architecture, a family of pre-trained language models that has proven highly effective across natural language processing (NLP) tasks. The model is tailored to the specific linguistic characteristics of Luxembourgish.
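For readers who want to confirm the concrete hyper-parameters (number of layers, hidden size, attention heads), the sketch below inspects the published checkpoint's configuration. It assumes the transformers library is installed (see the guide further down) and that the checkpoint ships a standard BERT-style config.

```python
# Sketch: inspect LuxemBERT's architecture hyper-parameters.
# Assumes `transformers` is installed and the Hugging Face Hub is reachable.
from transformers import AutoConfig

config = AutoConfig.from_pretrained("lothritz/LuxemBERT")

# Standard BERT-style configuration fields (assumed to be present).
print(f"layers:          {config.num_hidden_layers}")
print(f"hidden size:     {config.hidden_size}")
print(f"attention heads: {config.num_attention_heads}")
print(f"vocab size:      {config.vocab_size}")
```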
Training
The LuxemBERT model was trained on a dataset of 12.2 million sentences: 6.1 million Luxembourgish sentences collected from the sources listed above, plus another 6.1 million sentences translated from German into Luxembourgish. On downstream Luxembourgish NLP tasks, LuxemBERT outperforms both simple baselines and the multilingual BERT (mBERT) model.
Guide: Running Locally
- Setup Environment: Ensure you have Python and PyTorch installed. Consider using a virtual environment.
- Install Transformers Library: Use pip to install the Hugging Face transformers package.

  ```bash
  pip install transformers
  ```
- Load LuxemBERT: Utilize the Hugging Face API to load the LuxemBERT model.

  ```python
  from transformers import AutoModel, AutoTokenizer

  tokenizer = AutoTokenizer.from_pretrained("lothritz/LuxemBERT")
  model = AutoModel.from_pretrained("lothritz/LuxemBERT")
  ```
- Inference: Prepare your text data, tokenize it, and pass it through the model to obtain contextual representations or task predictions, as sketched after this list.
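A minimal sketch of the tokenize-and-forward step, reusing the model and tokenizer loaded above. Note that AutoModel loads the bare encoder without a task head, so the output is contextual embeddings rather than predictions; the Luxembourgish example sentence is illustrative only. For masked-word prediction, AutoModelForMaskedLM could be used instead, assuming the checkpoint includes the pre-training head.

```python
# Sketch: encode a Luxembourgish sentence with LuxemBERT.
# The sentence below is an illustrative example, not taken from the model's docs.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("lothritz/LuxemBERT")
model = AutoModel.from_pretrained("lothritz/LuxemBERT")
model.eval()

text = "Lëtzebuerg ass e klengt Land an Europa."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Token-level contextual embeddings: (batch, sequence_length, hidden_size).
print(outputs.last_hidden_state.shape)
```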
Cloud GPUs: For extensive training or inference, consider using cloud GPU services such as AWS EC2, Google Cloud, or Azure, which provide scalable resources.
License
The LuxemBERT model is available for use under specific terms outlined by the authors. Users should refer to the original publication or repository for detailed licensing information.