roberta base finetuned cluener2020 chinese
uer
Introduction
This Chinese RoBERTa-Base model is fine-tuned for Named Entity Recognition (NER) on the CLUENER2020 dataset. It was fine-tuned with UER-py; it can also be fine-tuned with TencentPretrain, which targets larger models and multimodal frameworks.
Architecture
The model is based on the RoBERTa architecture and is fine-tuned for token classification on Chinese text. It uses a sequence length of 512 and is initialized from the pre-trained chinese_roberta_L-12_H-768 model.
Training
Training Data
The model is fine-tuned on CLUENER2020, a fine-grained NER dataset for Chinese. Only the training set is used for fine-tuning.
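To make the data preparation concrete, here is a minimal sketch of turning span annotations into per-character tags. It rests on several assumptions that this document does not confirm: that each CLUENER2020 record is a JSON object with a "text" string and a "label" map of entity type to mention to [start, end] pairs with an inclusive end index, that a BIO tagging scheme is used, and that the TSV rows hold space-separated characters and tags split by a tab. The sample record and the spans_to_bio helper are hypothetical; check label2id.json and the real files before relying on this layout.

```python
import json


def spans_to_bio(text, label):
    """Convert span annotations into one BIO tag per character (assumed scheme)."""
    tags = ["O"] * len(text)
    for entity_type, mentions in label.items():
        for _mention, positions in mentions.items():
            for start, end in positions:  # assumption: end index is inclusive
                tags[start] = "B-" + entity_type
                for i in range(start + 1, end + 1):
                    tags[i] = "I-" + entity_type
    return tags


# Hypothetical record written in the assumed annotation layout.
record = json.loads(
    '{"text": "张三在北京工作", '
    '"label": {"name": {"张三": [[0, 1]]}, "address": {"北京": [[3, 4]]}}}'
)

tags = spans_to_bio(record["text"], record["label"])

# Assumed TSV row: characters and tags, each space-separated, joined by a tab.
tsv_row = " ".join(record["text"]) + "\t" + " ".join(tags)
print(tsv_row)
```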
Training Procedure
The model is fine-tuned over five epochs with a batch size of 32 and a learning rate of 3e-5. The training is conducted on Tencent Cloud. The model's performance is evaluated at the end of each epoch, and the best-performing model on the development set is saved. The model is finally converted to Hugging Face's format.
python3 finetune/run_ner.py --pretrained_model_path models/cluecorpussmall_roberta_base_seq512_model.bin-250000 \
--vocab_path models/google_zh_vocab.txt \
--train_path datasets/cluener2020/train.tsv \
--dev_path datasets/cluener2020/dev.tsv \
--label2id_path datasets/cluener2020/label2id.json \
--output_model_path models/cluener2020_ner_model.bin \
--learning_rate 3e-5 --epochs_num 5 --batch_size 32 --seq_length 512
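The per-epoch evaluation and best-model selection described above can be sketched in plain Python. This is not the run_ner.py implementation: the document does not say which metric is used, so the sketch assumes entity-level F1 over BIO tags, and extract_spans, span_f1, and select_best_epoch are hypothetical helper names. The dev labels and predictions are made up for illustration.

```python
def extract_spans(tags):
    """Return the set of (type, start, end) entity spans encoded by BIO tags."""
    spans, start, kind = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes a trailing span
        continues = tag.startswith("I-") and kind == tag[2:]
        if kind is not None and not continues:
            spans.add((kind, start, i - 1))
            start, kind = None, None
        if tag.startswith("B-"):
            start, kind = i, tag[2:]
    return spans


def span_f1(gold_seqs, pred_seqs):
    """Entity-level F1: a predicted span counts only if type and bounds match."""
    tp = n_gold = n_pred = 0
    for gold, pred in zip(gold_seqs, pred_seqs):
        g, p = extract_spans(gold), extract_spans(pred)
        tp += len(g & p)
        n_gold += len(g)
        n_pred += len(p)
    if tp == 0:
        return 0.0
    precision, recall = tp / n_pred, tp / n_gold
    return 2 * precision * recall / (precision + recall)


def select_best_epoch(gold_seqs, preds_by_epoch):
    """Return (epoch, f1) with the highest dev F1, keeping the earliest on ties."""
    best_epoch, best_f1 = None, -1.0
    for epoch, pred_seqs in enumerate(preds_by_epoch, start=1):
        f1 = span_f1(gold_seqs, pred_seqs)
        if f1 > best_f1:
            best_epoch, best_f1 = epoch, f1  # a real loop would save the model here
    return best_epoch, best_f1


# Made-up dev labels and per-epoch predictions, for illustration only.
gold = [["B-name", "I-name", "O", "B-address", "I-address"]]
preds = [
    [["O", "O", "O", "B-address", "I-address"]],            # epoch 1: one of two spans
    [["B-name", "I-name", "O", "B-address", "I-address"]],  # epoch 2: both spans
    [["B-name", "O", "O", "B-address", "I-address"]],       # epoch 3: wrong name bounds
]
best_epoch, best_f1 = select_best_epoch(gold, preds)
print(best_epoch, best_f1)
```

Scoring whole spans rather than individual tags means a prediction with the right type but the wrong boundaries earns no credit, which is the usual convention for NER evaluation.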
Guide: Running Locally
To use the model locally:
- Install the Hugging Face Transformers library.
- Load the model and tokenizer using AutoModelForTokenClassification and AutoTokenizer.
- Use the pipeline for NER.
from transformers import AutoModelForTokenClassification, AutoTokenizer, pipeline
model = AutoModelForTokenClassification.from_pretrained('uer/roberta-base-finetuned-cluener2020-chinese')
tokenizer = AutoTokenizer.from_pretrained('uer/roberta-base-finetuned-cluener2020-chinese')
ner = pipeline('ner', model=model, tokenizer=tokenizer)
result = ner("江苏警方通报特斯拉冲进店铺")
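The pipeline returns one prediction per token, so a common follow-up step is to merge those into whole entities. The sketch below assumes each item in the result is a dict carrying "entity", "start", and "end" keys (character offsets, as produced with a fast tokenizer) and that tags follow a B-/I- scheme. The group_entities helper is hypothetical, and the sample predictions are hand-written stand-ins, not actual output from this model.

```python
def group_entities(text, token_preds):
    """Merge per-token B-/I- predictions into entity spans over `text`."""
    entities = []
    for tok in token_preds:
        prefix, _, kind = tok["entity"].partition("-")
        last = entities[-1] if entities else None
        if (prefix == "I" and last is not None
                and last["type"] == kind and last["end"] == tok["start"]):
            last["end"] = tok["end"]  # extend the currently open span
        elif prefix in ("B", "I"):
            # Lenient choice: a stray I- tag opens a new span instead of being dropped.
            entities.append({"type": kind, "start": tok["start"], "end": tok["end"]})
    for ent in entities:
        ent["text"] = text[ent["start"]:ent["end"]]
    return entities


# Hand-written predictions in the assumed pipeline output shape (not real output).
sample_text = "张三在北京工作"
sample_preds = [
    {"entity": "B-name", "start": 0, "end": 1},
    {"entity": "I-name", "start": 1, "end": 2},
    {"entity": "B-address", "start": 3, "end": 4},
    {"entity": "I-address", "start": 4, "end": 5},
]
entities = group_entities(sample_text, sample_preds)
print(entities)
```

With the real pipeline, the equivalent call would be group_entities("江苏警方通报特斯拉冲进店铺", result), provided the items carry the keys assumed here. The pipeline also accepts an aggregation_strategy argument that performs similar grouping internally.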
For faster performance, it is recommended to use cloud GPU services from providers like AWS, Google Cloud, or Azure.
License
The use of this model and code is governed by the respective licenses of UER-py and TencentPretrain. Please refer to their repositories for specific licensing terms.