xlm-roberta-base-finetuned-luo-finetuned-ner-swahili

mbeukman

Introduction

The xlm-roberta-base-finetuned-luo-finetuned-ner-swahili model is a token classification model for Named Entity Recognition (NER). As its name indicates, it starts from an XLM-RoBERTa base model that was first fine-tuned on the Luo language and was then further fine-tuned on the Swahili portion of the MasakhaNER dataset.

Architecture

This model is built on the transformer-based XLM-RoBERTa architecture. It has been fine-tuned for 50 epochs with a maximum sequence length of 200, a batch size of 32, and a learning rate of 5e-5. It has been evaluated across five different random seeds, with the best-performing model selected based on the aggregate F1 score on the test set.
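The reported hyperparameters can be summarized in a small configuration dict. This is only an illustrative sketch; the key names loosely follow the conventions of transformers' TrainingArguments but are not taken from the original training script:

```python
# Fine-tuning hyperparameters as reported above. The key names mirror
# transformers' TrainingArguments-style naming, but this dict is an
# illustrative summary, not the original training configuration.
FINETUNE_CONFIG = {
    "num_train_epochs": 50,
    "max_seq_length": 200,
    "per_device_train_batch_size": 32,
    "learning_rate": 5e-5,
    "num_seeds": 5,  # best model chosen by aggregate test-set F1 across seeds
}

print(FINETUNE_CONFIG["learning_rate"])  # → 5e-05
```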

Training

The model was fine-tuned by Michael Beukman as part of a project at the University of the Witwatersrand. Training was conducted on the MasakhaNER dataset, which contains news articles in ten African languages. Fine-tuning took between 10 and 30 minutes per model on an NVIDIA RTX3090 GPU, requiring at least 14GB of VRAM for a batch size of 32.

Guide: Running Locally

To use this model locally, follow these steps:

  1. Install the transformers library from Hugging Face:

    pip install transformers
    
  2. Load the model and tokenizer:

    from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline
    
    model_name = 'mbeukman/xlm-roberta-base-finetuned-luo-finetuned-ner-swahili'
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForTokenClassification.from_pretrained(model_name)
    
    nlp = pipeline("ner", model=model, tokenizer=tokenizer)
    
  3. Use the model for NER:

    example = "Wizara ya afya ya Tanzania imeripoti Jumatatu kuwa , watu takriban 14 zaidi wamepata maambukizi ya Covid - 19 ."
    ner_results = nlp(example)
    print(ner_results)
    
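Note that the "ner" pipeline returns one entry per subword token, each tagged with an IOB label such as B-LOC or I-LOC (XLM-RoBERTa's SentencePiece tokenizer marks word starts with "▁"). A small post-processing helper can merge these into whole entity spans; the sketch below assumes the dict shape shown, which mirrors the pipeline's output, and uses hypothetical example tokens:

```python
def group_entities(tokens):
    """Group subword-level IOB predictions into entity spans.

    Each item in `tokens` is assumed to look like the "ner" pipeline
    output: {"entity": "B-LOC", "word": "▁Tanzania", ...}, where "▁"
    marks the start of a new word (SentencePiece convention).
    """
    spans = []
    for tok in tokens:
        prefix, _, etype = tok["entity"].partition("-")  # "B-LOC" -> ("B", "LOC")
        starts_word = tok["word"].startswith("▁")
        text = tok["word"].lstrip("▁")
        if prefix == "B" or not spans or spans[-1]["type"] != etype:
            spans.append({"type": etype, "text": text})       # start a new span
        elif starts_word:
            spans[-1]["text"] += " " + text                   # new word, same entity
        else:
            spans[-1]["text"] += text                         # subword continuation
    return spans

# Hypothetical pipeline-style output for illustration:
example_tokens = [
    {"entity": "B-LOC", "word": "▁Tanzania"},
    {"entity": "B-PER", "word": "▁Mich"},
    {"entity": "I-PER", "word": "ael"},
]
print(group_entities(example_tokens))
# → [{'type': 'LOC', 'text': 'Tanzania'}, {'type': 'PER', 'text': 'Michael'}]
```

Recent versions of transformers can do this grouping for you via `pipeline("ner", ..., aggregation_strategy="simple")`, which returns merged entity spans directly.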

For optimal performance, consider using cloud GPUs such as those from AWS, Google Cloud, or Azure.

License

This model is released under the Apache License, Version 2.0.
