bert-base-multilingual-cased-ner-hrl

Davlan

Introduction

bert-base-multilingual-cased-ner-hrl is a Named Entity Recognition (NER) model for ten high-resource languages. It recognizes three types of entities: location (LOC), organization (ORG), and person (PER). It is a fine-tuned version of the multilingual BERT (mBERT) base model.

Architecture

The model is based on the bert-base-multilingual-cased architecture. While mBERT itself is pretrained on over a hundred languages, this fine-tuned model targets ten of them: Arabic, German, English, Spanish, French, Italian, Latvian, Dutch, Portuguese, and Chinese. NER is framed as token classification: each token in the input is assigned an entity tag.
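
To see the exact tag set the classification head predicts, you can load just the model's configuration and print its label mapping. A minimal sketch; the IOB2-style tags named in the comment follow from the three entity types above, and printing id2label confirms the actual mapping:

    from transformers import AutoConfig

    # Load only the configuration (no model weights) to inspect the label set
    config = AutoConfig.from_pretrained("Davlan/bert-base-multilingual-cased-ner-hrl")
    print(config.id2label)
    # Expected: the IOB2 scheme, i.e. O plus B-/I- tags for PER, ORG, and LOC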

Training

The model was fine-tuned using datasets specific to each language:

  • Arabic: ANERcorp
  • German: CoNLL 2003
  • English: CoNLL 2003
  • Spanish: CoNLL 2002
  • French: Europeana Newspapers
  • Italian: Italian I-CAB
  • Latvian: Latvian NER
  • Dutch: CoNLL 2002
  • Portuguese: Paramopama + Second Harem
  • Chinese: MSRA

The model was fine-tuned on a single NVIDIA V100 GPU using Hugging Face's recommended hyperparameters.
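
The exact training script is not reproduced here, but the central preprocessing step in any BERT token-classification run is aligning word-level NER tags with the wordpiece tokens the tokenizer produces. The sketch below illustrates that alignment on one made-up CoNLL-style example; the sentence, tags, and label2id mapping are illustrative assumptions, not taken from the training data:

    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")

    # One hypothetical CoNLL-style training example: words with IOB2 tags
    words = ["Nader", "Jokhadar", "had", "given", "Syria", "the", "lead", "."]
    tags = ["B-PER", "I-PER", "O", "O", "B-LOC", "O", "O", "O"]
    label2id = {"O": 0, "B-PER": 1, "I-PER": 2, "B-ORG": 3,
                "I-ORG": 4, "B-LOC": 5, "I-LOC": 6}

    encoding = tokenizer(words, is_split_into_words=True)

    # Label the first subword of each word; mask special tokens and
    # trailing subwords with -100 so the loss function ignores them
    labels, prev_word = [], None
    for word_idx in encoding.word_ids():
        if word_idx is None:
            labels.append(-100)  # [CLS] / [SEP]
        elif word_idx != prev_word:
            labels.append(label2id[tags[word_idx]])
        else:
            labels.append(-100)  # continuation subword
        prev_word = word_idx

    print(list(zip(encoding.tokens(), labels)))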

Guide: Running Locally

To run the model locally, follow these steps:

  1. Install the Transformers library and a backend such as PyTorch:

    pip install transformers torch
    
  2. Load the tokenizer and model:

    from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline
    
    # Download the tokenizer and fine-tuned NER model from the Hugging Face Hub
    tokenizer = AutoTokenizer.from_pretrained("Davlan/bert-base-multilingual-cased-ner-hrl")
    model = AutoModelForTokenClassification.from_pretrained("Davlan/bert-base-multilingual-cased-ner-hrl")
    # Wrap both in a token-classification (NER) pipeline
    nlp = pipeline("ner", model=model, tokenizer=tokenizer)
    
  3. Perform Named Entity Recognition:

    example = "Nader Jokhadar had given Syria the lead with a well-struck header in the seventh minute."
    ner_results = nlp(example)
    # Each result is a dict with the token, its entity tag, a confidence
    # score, and character offsets
    print(ner_results)
    
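The raw output has one record per wordpiece token. If you would rather get whole entities, the pipeline accepts an aggregation_strategy argument that merges subword predictions into entity spans. Continuing from the snippet above:

    # Merge subword predictions into whole-entity spans
    nlp_grouped = pipeline("ner", model=model, tokenizer=tokenizer,
                           aggregation_strategy="simple")
    print(nlp_grouped(example))
    # Expect a PER span for "Nader Jokhadar" and a LOC span for "Syria",
    # each with a score and start/end character offsets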

For faster inference on large volumes of text, consider running the model on a GPU, such as a cloud instance from AWS or Google Cloud.

License

The model is released under the Academic Free License v3.0 (AFL-3.0).
