mbeukman/xlm-roberta-base-finetuned-luo-finetuned-ner-swahili
Introduction
The xlm-roberta-base-finetuned-luo-finetuned-ner-swahili model is a token classification model for Named Entity Recognition (NER). It is based on the XLM-RoBERTa architecture and fine-tuned on the MasakhaNER dataset, focusing on the Swahili language.
Architecture
This model is built on the transformer-based XLM-RoBERTa architecture. It was fine-tuned for 50 epochs with a maximum sequence length of 200, a batch size of 32, and a learning rate of 5e-5. Fine-tuning was repeated with five different random seeds, and the best-performing model was selected based on the aggregate F1 score on the test set.
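These hyperparameters map directly onto the Hugging Face Trainer API. The snippet below is a minimal sketch of that configuration, not the author's actual training script; the starting checkpoint, output directory, label count (MasakhaNER uses nine BIO tags), and seed value are assumptions.

from transformers import AutoTokenizer, AutoModelForTokenClassification, TrainingArguments

base_checkpoint = "xlm-roberta-base"  # assumed starting checkpoint
tokenizer = AutoTokenizer.from_pretrained(base_checkpoint)
model = AutoModelForTokenClassification.from_pretrained(base_checkpoint, num_labels=9)  # assumed label count

training_args = TrainingArguments(
    output_dir="xlm-roberta-ner-swahili",  # hypothetical output directory
    num_train_epochs=50,                   # 50 epochs, as stated above
    per_device_train_batch_size=32,        # batch size of 32
    learning_rate=5e-5,                    # learning rate of 5e-5
    seed=1,                                # one of the five random seeds tried
)

# Inputs would be truncated to the stated maximum sequence length of 200, e.g.:
# tokenizer(words, is_split_into_words=True, truncation=True, max_length=200)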
Training
The model was fine-tuned by Michael Beukman as part of a project at the University of the Witwatersrand. Training was conducted on the MasakhaNER dataset, which contains news articles in ten African languages. The fine-tuning process took between 10 and 30 minutes per model on an NVIDIA RTX 3090 GPU, requiring at least 14 GB of VRAM for a batch size of 32.
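For reference, the Swahili portion of MasakhaNER can be inspected with the datasets library. The following is a brief sketch under the assumption that the dataset is published on the Hugging Face Hub under the id "masakhaner" with a "swa" (Swahili) configuration; it is not part of the author's original workflow.

from datasets import load_dataset

# Assumed dataset id and configuration name; recent versions of datasets
# may additionally require trust_remote_code=True for script-based datasets.
masakhaner_swa = load_dataset("masakhaner", "swa")

sample = masakhaner_swa["train"][0]
print(sample["tokens"])    # word-level tokens of one news sentence
print(sample["ner_tags"])  # the corresponding NER tag ids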
Guide: Running Locally
To use this model locally, follow these steps:
- Install the transformers library from Hugging Face:

  pip install transformers
- Load the model and tokenizer:

  from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

  model_name = 'mbeukman/xlm-roberta-base-finetuned-luo-finetuned-ner-swahili'
  tokenizer = AutoTokenizer.from_pretrained(model_name)
  model = AutoModelForTokenClassification.from_pretrained(model_name)
  nlp = pipeline("ner", model=model, tokenizer=tokenizer)
- Use the model for NER:

  example = "Wizara ya afya ya Tanzania imeripoti Jumatatu kuwa , watu takriban 14 zaidi wamepata maambukizi ya Covid - 19 ."
  ner_results = nlp(example)
  print(ner_results)
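The raw pipeline output lists one entry per subword token. If you prefer whole entity spans, the transformers NER pipeline accepts an aggregation_strategy argument; the optional snippet below reuses the model, tokenizer, and example defined in the steps above.

grouped_nlp = pipeline(
    "ner",
    model=model,
    tokenizer=tokenizer,
    aggregation_strategy="simple",  # merge subword pieces into whole entity spans
)
print(grouped_nlp(example))  # each entry has entity_group, score, word, start, end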
For optimal performance, consider using cloud GPUs such as those from AWS, Google Cloud, or Azure.
License
This model is released under the Apache License, Version 2.0.