bert base romanian ner
dumitrescustefan
Introduction
bert-base-romanian-ner is a fine-tuned BERT model designed for Named Entity Recognition (NER) in Romanian. It recognizes 15 entity types, achieving state-of-the-art performance on the NER task. The model is based on bert-base-romanian-cased-v1 and fine-tuned on the RONEC version 2.0 dataset, which contains over 80,000 annotated entities across 12,330 sentences.
Architecture
The model is a BERT architecture fine-tuned for token classification, targeting NER on Romanian text. It uses the BIO2 annotation scheme: the first token of an entity is labeled with a "B-" (beginning) prefix, each following token of the same entity with an "I-" (inside) prefix, and every non-entity token with "O". The model is compatible with the Hugging Face Transformers library and uses PyTorch as its backend.
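To make the BIO2 scheme concrete, here is a minimal sketch of how a tagged token sequence can be collapsed into entity spans. The function name `bio2_to_spans`, the example sentence, and the `PERSON` and `GPE` labels are illustrative assumptions, not the model's own post-processing code.

```python
from typing import List, Tuple


def bio2_to_spans(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Collapse BIO2-tagged tokens into (entity_type, text) spans.

    Hypothetical helper for illustration only.
    """
    spans: List[Tuple[str, str]] = []
    current_type = None
    current_tokens: List[str] = []

    def flush() -> None:
        # Close the entity that is currently open, if any.
        nonlocal current_type, current_tokens
        if current_type is not None:
            spans.append((current_type, " ".join(current_tokens)))
        current_type = None
        current_tokens = []

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # "B-" always starts a new entity, closing any open one.
            flush()
            current_type = tag[2:]
            current_tokens = [token]
        elif tag.startswith("I-") and current_type == tag[2:]:
            # "I-" continues the open entity of the same type.
            current_tokens.append(token)
        else:
            # "O", or an "I-" tag with no matching open entity, ends the span.
            flush()
    flush()
    return spans


# Illustrative input; the labels are assumed for the example.
tokens = ["Alex", "cumpără", "un", "bilet", "spre", "Cluj", "Napoca"]
tags = ["B-PERSON", "O", "O", "O", "O", "B-GPE", "I-GPE"]
print(bio2_to_spans(tokens, tags))
# [('PERSON', 'Alex'), ('GPE', 'Cluj Napoca')]
```

This sketch silently drops an "I-" tag that does not continue an open entity of the same type; a stricter decoder could raise an error instead.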
Training
The bert-base-romanian-ner model was trained on the RONEC version 2.0 dataset, which comprises over half a million tokens and 80,283 distinct annotated entities. The dataset covers a wide variety of entity types, from persons and organizations to money and events, allowing the model to generalize across different categories effectively.
Guide: Running Locally
To run the model locally, follow these basic steps:
- Install the necessary libraries:
    pip install transformers torch
- Use the model with the Hugging Face Transformers pipeline:
    from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

    # Load the tokenizer and the fine-tuned token classification model
    tokenizer = AutoTokenizer.from_pretrained("dumitrescustefan/bert-base-romanian-ner")
    model = AutoModelForTokenClassification.from_pretrained("dumitrescustefan/bert-base-romanian-ner")

    # Build an NER pipeline and run it on a sample sentence
    nlp = pipeline("ner", model=model, tokenizer=tokenizer)
    text = "Alex cumpără un bilet pentru trenul 3118 în direcția Cluj cu plecare la ora 13:00."
    ner_results = nlp(text)
    print(ner_results)
- Sanitize the input before passing it to the model by replacing the cedilla forms of Romanian characters with their comma-below forms:
    text = text.replace("ţ", "ț").replace("ş", "ș").replace("Ţ", "Ț").replace("Ş", "Ș")
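The character replacement above can be wrapped in a small reusable helper. The sketch below is one possible way to do it; the name `sanitize` is an assumption for illustration, and the characters are written as Unicode escapes so the cedilla and comma-below forms cannot be confused.

```python
# Translation table from cedilla forms to comma-below forms.
_CEDILLA_TO_COMMA = str.maketrans({
    "\u0163": "\u021b",  # t with cedilla -> t with comma below
    "\u015f": "\u0219",  # s with cedilla -> s with comma below
    "\u0162": "\u021a",  # T with cedilla -> T with comma below
    "\u015e": "\u0218",  # S with cedilla -> S with comma below
})


def sanitize(text: str) -> str:
    """Replace cedilla letters with their comma-below equivalents.

    Hypothetical helper; equivalent to the chained str.replace calls.
    """
    return text.translate(_CEDILLA_TO_COMMA)
```

Calling `sanitize(text)` on each input before handing it to the pipeline keeps the replacement in one place.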
For better performance, consider using cloud GPUs from providers like AWS or Google Cloud Platform.
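If a GPU is available, locally or on a cloud instance, the pipeline can be placed on it through its `device` argument, where `-1` selects the CPU and a non-negative integer selects a CUDA device. The following is a minimal sketch under that assumption; `pipeline_device` and `build_ner_pipeline` are hypothetical helper names, not part of the model card.

```python
def pipeline_device(cuda_available: bool) -> int:
    """Return the pipeline device index: 0 for the first GPU, -1 for CPU."""
    return 0 if cuda_available else -1


def build_ner_pipeline():
    """Build the NER pipeline on a GPU when one is available (illustrative)."""
    # Imported lazily so the helper above can be used without these packages.
    import torch
    from transformers import pipeline

    return pipeline(
        "ner",
        model="dumitrescustefan/bert-base-romanian-ner",
        device=pipeline_device(torch.cuda.is_available()),
    )
```

Calling `build_ner_pipeline()` returns a pipeline that can be applied to text in the same way as the `nlp` object in the steps above.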
License
The bert-base-romanian-ner model is licensed under the MIT License, which permits reuse with minimal restrictions.