RoBERTa NER Multilingual

julian-schelb

Introduction

The RoBERTa NER Multilingual model is designed for named entity recognition (NER), classifying tokens in text according to the IOB format. It is a fine-tuned version of XLM-RoBERTa, capable of recognizing entities such as persons, organizations, and locations across 21 languages.

Architecture

The model is based on XLM-RoBERTa, a transformer pre-trained with masked language modeling (MLM). During pre-training, words in a sentence are masked and the model learns to predict them, which yields bidirectional representations learned from large corpora covering 100 languages.
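
As a quick illustration of MLM, the fill-mask pipeline can be pointed at the base checkpoint. This is a minimal sketch using xlm-roberta-base (the pre-trained base model, not this fine-tuned NER model) and a hypothetical example sentence:

    from transformers import pipeline

    # Mask a word and let the pre-trained base model predict it.
    fill_mask = pipeline("fill-mask", model="xlm-roberta-base")
    for prediction in fill_mask("The capital of France is <mask>."):
        print(prediction["token_str"], round(prediction["score"], 3))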

Training

The model is fine-tuned on the WikiANN dataset, which provides entity-annotated sentences across 21 languages. The training set comprises 375,100 sentences and the validation set 173,100. The NER tags follow the IOB2 format, labeling entities as persons, organizations, or locations. Evaluation results indicate high precision and recall, particularly for person entities.
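
For illustration, IOB2 marks the first token of an entity with a B- prefix and continuation tokens with I-, while O marks tokens outside any entity. A hypothetical annotated sentence:

    tokens = ["Angela", "Merkel", "visited", "Paris", "."]
    labels = ["B-PER",  "I-PER",  "O",       "B-LOC", "O"]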

Guide: Running Locally

  1. Install Dependencies: Ensure you have Python and PyTorch installed, along with the transformers library.

    pip install torch transformers
    
  2. Load the Model: Use the AutoTokenizer and AutoModelForTokenClassification classes to load the model.

    from transformers import AutoTokenizer, AutoModelForTokenClassification
    
    tokenizer = AutoTokenizer.from_pretrained("julian-schelb/roberta-ner-multilingual", add_prefix_space=True)
    model = AutoModelForTokenClassification.from_pretrained("julian-schelb/roberta-ner-multilingual")
    
  3. Prepare Input: Tokenize your input text.

    text = "Your text here."
    inputs = tokenizer(text, add_special_tokens=False, return_tensors="pt")
    
  4. Inference: Run the model and get predictions.

    import torch

    with torch.no_grad():  # disable gradient tracking for inference
        logits = model(**inputs).logits
    predicted_token_class_ids = logits.argmax(-1)
    
  5. Interpret Results: Map predicted IDs to token classes.

    predicted_tokens_classes = [model.config.id2label[t.item()] for t in predicted_token_class_ids[0]]
    print(predicted_tokens_classes)
    
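Alternatively, steps 2 through 5 can be collapsed into a single call with the transformers pipeline API, which also merges sub-word pieces into entity spans. A minimal sketch; the example sentence is hypothetical:

    from transformers import pipeline

    # One call wraps tokenization, inference, and label mapping;
    # "simple" aggregation merges sub-word pieces into entity spans.
    ner = pipeline(
        "token-classification",
        model="julian-schelb/roberta-ner-multilingual",
        aggregation_strategy="simple",
    )
    print(ner("Angela Merkel visited Paris."))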

Cloud GPUs: For faster processing, consider using cloud-based GPU services like AWS, Google Cloud, or Azure.
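
On a machine with a CUDA GPU, the guide's code runs on the accelerator by moving the model and tokenized inputs to the device. A minimal sketch, reusing the model and inputs objects from the steps above:

    import torch

    # Place the model and inputs on the GPU when one is available.
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model.to(device)
    inputs = inputs.to(device)
    with torch.no_grad():
        logits = model(**inputs).logits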

License

The model is licensed under the MIT License, which permits broad use and modification provided the copyright and license notice are retained.
