roberta-base-cold

thu-coai

Introduction

The roberta-base-cold model from THU-COAI is a fine-tuned version of hfl/chinese-roberta-wwm-ext for Chinese offensive language detection. Trained on the COLDataset, it reaches an accuracy of 82.75% and a macro-F1 score of 82.39% on the test set.
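For readers unfamiliar with the two reported metrics, here is a minimal sketch of how accuracy and macro-F1 are commonly computed for a binary classifier. The function name `accuracy_and_macro_f1` and the toy labels are illustrative assumptions; this is not the evaluation script behind the reported numbers.

```python
def accuracy_and_macro_f1(y_true, y_pred, labels=(0, 1)):
    """Return (accuracy, macro-F1) for two parallel lists of integer labels."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and of equal length")

    pairs = list(zip(y_true, y_pred))
    accuracy = sum(t == p for t, p in pairs) / len(pairs)

    f1_scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # Per-class F1 = 2*TP / (2*TP + FP + FN); defined as 0 if the class never occurs
        f1_scores.append(2 * tp / denom if denom else 0.0)

    # Macro-F1 is the unweighted mean of the per-class F1 scores
    macro_f1 = sum(f1_scores) / len(f1_scores)
    return accuracy, macro_f1


# Toy labels made up for illustration (not COLDataset data).
gold = [1, 1, 0, 0, 1, 0]
pred = [1, 0, 0, 0, 1, 1]
acc, f1 = accuracy_and_macro_f1(gold, pred)
print(f"accuracy={acc:.4f} macro-F1={f1:.4f}")
```

Because macro-F1 averages the per-class scores without weighting by class size, it penalizes a model that does well on only one of the two classes.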

Architecture

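The model uses the RoBERTa architecture, a variant of BERT, and is implemented in PyTorch. The base checkpoint, hfl/chinese-roberta-wwm-ext, is distributed in BERT format, which is why the guide below loads it with BertTokenizer and BertForSequenceClassification. It is set up for text classification and labels Chinese text as non-offensive (0) or offensive (1).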

Training

The model is fine-tuned on the COLDataset, a benchmark dataset for Chinese offensive language detection. Training starts from the pre-trained hfl/chinese-roberta-wwm-ext checkpoint and continues on COLDataset to adapt it to this task.

Guide: Running Locally

  1. Install Dependencies: Ensure you have PyTorch and the Transformers library installed.

    pip install torch transformers
    
  2. Load Model and Tokenizer:

    import torch
    # The checkpoint is in BERT format, so the BERT classes are used to load it
    from transformers import BertTokenizer, BertForSequenceClassification
    
    tokenizer = BertTokenizer.from_pretrained('thu-coai/roberta-base-cold')
    model = BertForSequenceClassification.from_pretrained('thu-coai/roberta-base-cold')
    model.eval()  # inference mode: disables dropout
    
  3. Prepare Input and Make Predictions:

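    texts = ['你就是个傻逼!','黑人很多都好吃懒做,偷奸耍滑!','男女平等,黑人也很优秀。']
    # Tokenize the texts as one padded batch of PyTorch tensors
    model_input = tokenizer(texts, return_tensors="pt", padding=True)
    # Inference only, so gradient tracking is not needed
    with torch.no_grad():
        model_output = model(**model_input, return_dict=False)
    # model_output[0] holds the logits; argmax picks the predicted class for each text
    prediction = torch.argmax(model_output[0].cpu(), dim=-1)
    prediction = [p.item() for p in prediction]
    print(prediction)  # Outputs [1, 1, 0] (0 for Non-Offensive, 1 for Offensive)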
    
  4. Compute Resources: The model can run on a CPU for small inputs. For large batches, a GPU (local or on a cloud service such as AWS EC2 or Google Cloud Platform) speeds up inference.
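The integer predictions from step 3 can be turned into readable labels with a confidence score. The sketch below works on plain Python lists, which could be obtained with `model_output[0].tolist()`. The helper names (`softmax`, `describe`), the 0.5 threshold, and the example logits are illustrative assumptions, not part of the model's API.

```python
import math

# Label mapping as stated in the guide: 0 = Non-Offensive, 1 = Offensive.
ID2LABEL = {0: "Non-Offensive", 1: "Offensive"}


def softmax(logits):
    """Numerically stable softmax over one row of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def describe(logit_rows, threshold=0.5):
    """Map rows of [non_offensive_logit, offensive_logit] to (label, P(offensive))."""
    results = []
    for row in logit_rows:
        p_offensive = softmax(row)[1]
        # Assumed decision rule: flag the text when P(offensive) reaches the threshold
        label_id = 1 if p_offensive >= threshold else 0
        results.append((ID2LABEL[label_id], round(p_offensive, 4)))
    return results


# Hypothetical logits for illustration only; real values come from the model.
example_logits = [[-2.0, 2.0], [1.0, 0.5], [3.0, -1.0]]
results = describe(example_logits)
for label, prob in results:
    print(label, prob)
```

With the default threshold of 0.5 this matches the argmax used in step 3, except when the two logits are exactly equal. Raising the threshold trades recall for precision on the Offensive class.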

License

Refer to the official Hugging Face model card for the licensing terms. If you use this model, please cite the original paper:

@inproceedings{deng2022cold,
  title={{COLD}: A Benchmark for {C}hinese Offensive Language Detection},
  author={Deng, Jiawen and Zhou, Jingyan and Sun, Hao and Zheng, Chujie and Mi, Fei and Meng, Helen and Huang, Minlie},
  booktitle={Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing},
  year={2022}
}
