bert-base-romanian-ner

dumitrescustefan

Introduction

bert-base-romanian-ner is a fine-tuned BERT model for Named Entity Recognition (NER) in Romanian. It recognizes 15 entity classes and achieves state-of-the-art performance on Romanian NER. The model is based on bert-base-romanian-cased-v1 and fine-tuned on the RONEC version 2.0 dataset, which contains over 80,000 annotated entities across 12,330 sentences.

Architecture

The model is a BERT-based architecture fine-tuned for token classification tasks, specifically targeting Romanian text for NER. It employs a BIO2 annotation scheme, which labels entities with "B-" (beginning) and "I-" (inside) prefixes, and "O" for non-entity tokens. The model is compatible with the Hugging Face Transformers library and uses PyTorch as its backend.
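
To see the exact BIO2 tag set the checkpoint predicts, the label mapping stored in the model configuration can be printed. This is a minimal sketch using only standard Transformers APIs; the precise tag names depend on the published checkpoint.

    from transformers import AutoConfig

    # Load only the model configuration, which carries the id2label mapping
    config = AutoConfig.from_pretrained("dumitrescustefan/bert-base-romanian-ner")

    # Each entry is a BIO2 tag: "O" or a "B-"/"I-"-prefixed entity class
    for idx, label in sorted(config.id2label.items()):
        print(idx, label)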

Training

bert-base-romanian-ner was trained on the RONEC version 2.0 dataset, which comprises over half a million tokens and 80,283 annotated entities. The dataset covers a wide variety of entity types, from persons and organizations to money and events, allowing the model to generalize effectively across entity categories.
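
For readers who want to inspect the training data itself, the following sketch loads RONEC with the datasets library. The Hub identifier dumitrescustefan/ronec used below is an assumption and should be replaced with wherever the RONEC v2.0 release is actually hosted.

    from datasets import load_dataset

    # NOTE: the dataset identifier is an assumption; point it at the actual
    # location of the RONEC v2.0 release on the Hugging Face Hub.
    ronec = load_dataset("dumitrescustefan/ronec")

    # Show the available splits and one annotated example (tokens plus NER tags)
    print(ronec)
    print(ronec["train"][0])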

Guide: Running Locally

To run the model locally, follow these basic steps:

  1. Install the necessary libraries:

    pip install transformers torch
    
  2. Use the model with the Hugging Face Transformers pipeline:

    from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline
    
    tokenizer = AutoTokenizer.from_pretrained("dumitrescustefan/bert-base-romanian-ner")
    model = AutoModelForTokenClassification.from_pretrained("dumitrescustefan/bert-base-romanian-ner")
    nlp = pipeline("ner", model=model, tokenizer=tokenizer)
    
    text = "Alex cumpără un bilet pentru trenul 3118 în direcția Cluj cu plecare la ora 13:00."
    ner_results = nlp(text)
    print(ner_results)
    
  3. Sanitize the input text by replacing cedilla-style Romanian diacritics with the comma-below variants the model's vocabulary expects (a combined example appears at the end of this guide):

    text = text.replace("ţ", "ț").replace("ş", "ș").replace("Ţ", "Ț").replace("Ş", "Ș")
    

For faster inference, consider running the model on a GPU, for example through cloud providers such as AWS or Google Cloud Platform.
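
Putting the steps together, the sketch below applies the diacritic sanitization first and then runs the pipeline with aggregation_strategy="simple", a standard Transformers option that merges B-/I- word pieces into whole entity spans; the example sentence is the one from step 2.

    from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

    tokenizer = AutoTokenizer.from_pretrained("dumitrescustefan/bert-base-romanian-ner")
    model = AutoModelForTokenClassification.from_pretrained("dumitrescustefan/bert-base-romanian-ner")

    # aggregation_strategy="simple" groups B-/I- sub-tokens into complete entities
    nlp = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")

    text = "Alex cumpără un bilet pentru trenul 3118 în direcția Cluj cu plecare la ora 13:00."
    # Replace cedilla-style diacritics with the comma-below variants the model expects
    text = text.replace("ţ", "ț").replace("ş", "ș").replace("Ţ", "Ț").replace("Ş", "Ș")

    for entity in nlp(text):
        print(f"{entity['entity_group']}: {entity['word']} ({entity['score']:.3f})")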

License

The bert-base-romanian-ner model is licensed under the MIT License, which permits reuse with minimal restrictions.
