Erlangshen-Roberta-330M-Similarity
IDEA-CCNL

Introduction
Erlangshen-Roberta-330M-Similarity is a fine-tuned version of the Chinese RoBERTa-wwm-ext-large model developed for similarity tasks. It has been trained on 20 Chinese paraphrase datasets, amounting to 2,773,880 samples, to enhance its performance in natural language understanding (NLU) applications.
Architecture
The model is based on the chinese-roberta-wwm-ext-large architecture. It is designed for tasks requiring semantic similarity assessment between Chinese text pairs. The model features 330 million parameters, making it suitable for a variety of NLU tasks.
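To make the 330M figure concrete, here is a minimal sketch (assuming `transformers` and `torch` are installed and the checkpoint can be downloaded) that loads the model and counts its parameters:

```python
# Minimal sketch: load the checkpoint and count its parameters to confirm
# the roughly 330M figure quoted above.
from transformers import BertForSequenceClassification

model = BertForSequenceClassification.from_pretrained(
    'IDEA-CCNL/Erlangshen-Roberta-330M-Similarity'
)
n_params = sum(p.numel() for p in model.parameters())
print(f'{n_params / 1e6:.0f}M parameters')
```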
Training
The model has been fine-tuned on a substantial dataset of 2,773,880 samples from 20 Chinese paraphrase datasets. This extensive training has resulted in improved performance metrics across various benchmarks, including BQ, BUSTM, and AFQMC, demonstrating its effectiveness in identifying semantic similarities in Chinese text.
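For readers curious about what such a fine-tune looks like mechanically, the sketch below shows a single training step of a BERT sequence classifier on toy paraphrase pairs. The in-memory dataset, hyperparameters, and the `hfl/chinese-roberta-wwm-ext-large` base checkpoint are illustrative assumptions, not the exact recipe IDEA-CCNL used.

```python
import torch
from transformers import BertForSequenceClassification, BertTokenizer

# Base checkpoint and hyperparameters are assumptions for illustration only.
tokenizer = BertTokenizer.from_pretrained('hfl/chinese-roberta-wwm-ext-large')
model = BertForSequenceClassification.from_pretrained(
    'hfl/chinese-roberta-wwm-ext-large', num_labels=2)

# Toy paraphrase pairs: label 1 = similar, label 0 = dissimilar.
pairs = [('今天天气很好', '今天天气不错', 1),
         ('今天的饭不好吃', '今天心情不好', 0)]

# Encode each (texta, textb) pair as one [CLS] a [SEP] b [SEP] sequence.
enc = tokenizer([a for a, _, _ in pairs], [b for _, b, _ in pairs],
                padding=True, truncation=True, return_tensors='pt')
labels = torch.tensor([label for _, _, label in pairs])

# One optimization step: cross-entropy loss over the two labels.
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()
loss = model(**enc, labels=labels).loss
loss.backward()
optimizer.step()
print(f'loss: {loss.item():.4f}')
```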
Guide: Running Locally
To use Erlangshen-Roberta-330M-Similarity locally, follow these steps:
1. Install required libraries: make sure the `transformers` library and `torch` are installed in your Python environment (`pip install transformers torch` if needed).
2. Load the model and tokenizer:

   ```python
   from transformers import BertForSequenceClassification, BertTokenizer
   import torch

   tokenizer = BertTokenizer.from_pretrained('IDEA-CCNL/Erlangshen-Roberta-330M-Similarity')
   model = BertForSequenceClassification.from_pretrained('IDEA-CCNL/Erlangshen-Roberta-330M-Similarity')
   ```
3. Prepare input data:

   ```python
   # Example sentence pair: "Today's food is not tasty" / "I'm in a bad mood today".
   texta = '今天的饭不好吃'
   textb = '今天心情不好'
   ```
4. Make predictions:

   ```python
   # Encode the pair as one sequence, run the classifier, and print the
   # softmax probability distribution over the two similarity labels.
   output = model(torch.tensor([tokenizer.encode(texta, textb)]))
   print(torch.nn.functional.softmax(output.logits, dim=-1))
   ```
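An equivalent, slightly more idiomatic inference variant (my own restatement, not from the original card) builds the batch with the tokenizer's `__call__` and disables gradient tracking. Which softmax index corresponds to "similar" should be checked against the model's `id2label` mapping rather than assumed:

```python
import torch
from transformers import BertForSequenceClassification, BertTokenizer

tokenizer = BertTokenizer.from_pretrained('IDEA-CCNL/Erlangshen-Roberta-330M-Similarity')
model = BertForSequenceClassification.from_pretrained('IDEA-CCNL/Erlangshen-Roberta-330M-Similarity')
model.eval()  # inference mode: disables dropout

inputs = tokenizer('今天的饭不好吃', '今天心情不好', return_tensors='pt')
with torch.no_grad():
    probs = torch.nn.functional.softmax(model(**inputs).logits, dim=-1)
# Check model.config.id2label to interpret the two columns of probs.
print(probs, model.config.id2label)
```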
For improved performance, consider using cloud-based GPUs from providers like AWS, Google Cloud, or Azure.
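Continuing from the snippet above, moving the model and inputs onto a GPU follows the standard PyTorch pattern and is not specific to this model:

```python
import torch

# Select a CUDA device when available, otherwise fall back to CPU.
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)
inputs = {k: v.to(device) for k, v in inputs.items()}
with torch.no_grad():
    probs = torch.nn.functional.softmax(model(**inputs).logits, dim=-1)
print(probs)
```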
License
The Erlangshen-Roberta-330M-Similarity model is released under the Apache 2.0 License, which permits both personal and commercial use, provided the license text and copyright notices are retained.