wikineural-multilingual-ner

Babelscape

Introduction

WikiNEuRal is a multilingual Named Entity Recognition (NER) model, developed as part of the EMNLP 2021 paper "WikiNEuRal: Combined Neural and Knowledge-Based Silver Data Creation for Multilingual NER." The model is fine-tuned on the WikiNEuRal dataset using a multilingual BERT (mBERT) model and supports nine languages: German, English, Spanish, French, Italian, Dutch, Polish, Portuguese, and Russian.

Architecture

The model is an mBERT checkpoint fine-tuned for three epochs on the WikiNEuRal dataset. The dataset is built by combining neural and knowledge-based approaches to address data scarcity in multilingual NER.

Training

The model is trained on the WikiNEuRal dataset, which draws on Wikipedia text and novel domain adaptation techniques to create high-quality silver-annotated training corpora for NER. On standard NER benchmarks, the paper reports consistent improvements of up to 6 span-based F1 points over previous state-of-the-art systems for data creation.
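
Span-based F1 counts a predicted entity as correct only when both its type and its exact boundaries match the gold annotation. The following is a minimal sketch of that metric for a single BIO-tagged sentence. It is not the paper's evaluation script; the helper names `extract_spans` and `span_f1` and the example tags are assumptions made for illustration.

```python
from typing import List, Set, Tuple

# (entity type, start token index, end token index exclusive)
Span = Tuple[str, int, int]


def extract_spans(tags: List[str]) -> Set[Span]:
    """Collect entity spans from a BIO-tagged sequence."""
    spans: Set[Span] = set()
    start, label = None, None
    # A trailing sentinel "O" flushes a span that runs to the end of the sentence.
    for i, tag in enumerate(tags + ["O"]):
        continues = tag.startswith("I-") and label == tag[2:]
        if start is not None and not continues:
            spans.add((label, start, i))
            start, label = None, None
        # A stray "I-" tag with no open span is treated leniently as a span start.
        if tag.startswith("B-") or (tag.startswith("I-") and start is None):
            start, label = i, tag[2:]
    return spans


def span_f1(gold: List[str], pred: List[str]) -> float:
    """Exact-match span F1 between gold and predicted tag sequences."""
    gold_spans, pred_spans = extract_spans(gold), extract_spans(pred)
    correct = len(gold_spans & pred_spans)
    precision = correct / len(pred_spans) if pred_spans else 0.0
    recall = correct / len(gold_spans) if gold_spans else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


gold = ["B-PER", "I-PER", "O", "B-LOC"]
pred = ["B-PER", "O", "O", "B-LOC"]  # PER span is truncated, LOC span matches
print(span_f1(gold, pred))  # 0.5
```

Here the truncated PER span counts as an error even though one of its tokens is tagged correctly, which is what separates span-level scoring from token-level accuracy. Benchmark scores are normally aggregated over a whole corpus rather than computed per sentence.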

Guide: Running Locally

To use the model locally, the Transformers library from Hugging Face is required. Here are the basic steps:

  1. Install the Transformers library, together with a backend such as PyTorch:

    pip install transformers torch
    
  2. Load the model and tokenizer:

    from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline
    
    tokenizer = AutoTokenizer.from_pretrained("Babelscape/wikineural-multilingual-ner")
    model = AutoModelForTokenClassification.from_pretrained("Babelscape/wikineural-multilingual-ner")
    
  3. Create a Named Entity Recognition pipeline:

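    # aggregation_strategy="simple" merges word pieces into whole entities
    # (it replaces the deprecated grouped_entities=True argument)
    nlp = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")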
    example = "My name is Wolfgang and I live in Berlin"
    ner_results = nlp(example)
    print(ner_results)
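
With entity grouping enabled, the pipeline returns a list of dictionaries with the keys `entity_group`, `score`, `word`, `start`, and `end`. The sketch below shows one way to filter that list by confidence and collect entity strings by type. The helper `group_entities`, the 0.5 threshold, and the hand-written `sample` (including its scores and labels) are illustrative assumptions, not real model output.

```python
from collections import defaultdict
from typing import Dict, List


def group_entities(results: List[dict], min_score: float = 0.5) -> Dict[str, List[str]]:
    """Group entity strings by type, dropping low-confidence predictions."""
    grouped = defaultdict(list)
    for ent in results:
        if ent["score"] >= min_score:
            grouped[ent["entity_group"]].append(ent["word"])
    return dict(grouped)


# Hand-written sample in the shape the aggregated pipeline returns.
# Scores and labels are made up for illustration.
sample = [
    {"entity_group": "PER", "score": 0.99, "word": "Wolfgang", "start": 11, "end": 19},
    {"entity_group": "ORG", "score": 0.31, "word": "I", "start": 24, "end": 25},
    {"entity_group": "LOC", "score": 0.98, "word": "Berlin", "start": 34, "end": 40},
]

print(group_entities(sample))  # {'PER': ['Wolfgang'], 'LOC': ['Berlin']}
```

In practice, `group_entities(ner_results)` can be called directly on the pipeline output. The `start` and `end` fields are character offsets into the input string, so `example[ent["start"]:ent["end"]]` recovers the original surface form.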
    

The model runs on CPU, but inference is considerably faster on a GPU. If no local GPU is available, consider cloud GPUs such as those provided by AWS, Google Cloud, or Azure.
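
To take advantage of a GPU when one is present, the pipeline can be given a `device` argument. The sketch below is one possible way to do this, assuming PyTorch is the backend. The helper names `pick_device` and `build_ner_pipeline` are hypothetical and not part of the Transformers library.

```python
def pick_device() -> int:
    """Return 0 (first CUDA GPU) if PyTorch can see one, otherwise -1 (CPU)."""
    try:
        import torch
    except ImportError:
        # Without PyTorch installed there is nothing to query; fall back to CPU.
        return -1
    return 0 if torch.cuda.is_available() else -1


def build_ner_pipeline(model_id: str = "Babelscape/wikineural-multilingual-ner"):
    """Create the NER pipeline on the selected device (downloads the model on first use)."""
    from transformers import pipeline

    return pipeline(
        "ner",
        model=model_id,
        aggregation_strategy="simple",
        device=pick_device(),
    )


print(pick_device())
```

Calling `nlp = build_ner_pipeline()` and then `nlp("My name is Wolfgang and I live in Berlin")` behaves like the pipeline from step 3, except that the model is placed on the GPU when one is detected.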

License

This model and its dataset are licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (CC BY-NC-SA 4.0). Usage is restricted to non-commercial research purposes, and the copyright remains with the original authors. For details, see the full text of the CC BY-NC-SA 4.0 license.
