text2vec-base-chinese Model

Introduction

The text2vec-base-chinese model by shibing624 is a CoSENT (Cosine Sentence) model designed to map Chinese sentences into a 768-dimensional dense vector space. This model is particularly useful for tasks such as sentence embeddings, text matching, and semantic search.
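To illustrate how sentence embeddings support text matching and semantic search, here is a minimal sketch that ranks candidate vectors by cosine similarity to a query vector. The helper names `cosine_similarity` and `rank_by_similarity` are hypothetical, and the toy 3-dimensional vectors only stand in for the 768-dimensional embeddings the model would produce.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Return the cosine of the angle between a and b (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_vec: Vector, corpus_vecs: Sequence[Vector]) -> List[Tuple[int, float]]:
    """Return (corpus index, score) pairs sorted from most to least similar."""
    scored = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(corpus_vecs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy vectors standing in for real sentence embeddings (assumption for illustration).
query = [1.0, 0.0, 1.0]
corpus = [[0.0, 1.0, 0.0], [2.0, 0.0, 2.0], [1.0, 1.0, 0.0]]
ranking = rank_by_similarity(query, corpus)
print(ranking)
```

In practice the query and corpus vectors would come from encoding sentences with the model, and a vector index would usually replace the linear scan for large corpora.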

Architecture

The model architecture is based on the CoSENT framework, which includes:

- A Transformer model (BertModel) configured with a maximum sequence length of 128 tokens and case-sensitive input (no lowercasing).
- A pooling layer that averages the token embeddings (mean pooling) to produce a single sentence embedding.
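The mean pooling step can be sketched in plain Python as follows. This is an illustrative version that assumes each token embedding is a list of floats and that an attention mask marks real tokens with 1 and padding with 0; the function name `mean_pool` is hypothetical and is not part of any library.

```python
from typing import List, Sequence


def mean_pool(token_embeddings: Sequence[Sequence[float]],
              attention_mask: Sequence[int]) -> List[float]:
    """Average the token vectors whose mask entry is non-zero."""
    if not token_embeddings or len(token_embeddings) != len(attention_mask):
        raise ValueError("need one mask entry per token embedding")
    dim = len(token_embeddings[0])
    total = [0.0] * dim
    count = 0
    for vec, keep in zip(token_embeddings, attention_mask):
        if keep:  # skip padding positions
            for d in range(dim):
                total[d] += vec[d]
            count += 1
    count = max(count, 1)  # avoid division by zero for an all-padding input
    return [t / count for t in total]


# Two real tokens and one padding token, each with a 2-dimensional embedding.
tokens = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 0]
sentence_vec = mean_pool(tokens, mask)
print(sentence_vec)  # [2.0, 3.0]
```

Masking matters because padded positions would otherwise pull the average toward values that carry no meaning for the sentence.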

Training

Pre-training: Starts from the pre-trained hfl/chinese-macbert-base checkpoint.
Fine-tuning: The model is fine-tuned with a contrastive objective over the cosine similarities of sentence pairs, optimized with a ranking loss so that more similar pairs score higher than less similar ones.
Hyperparameters:
- Training dataset: nli_zh
- Maximum sequence length: 128
- Best epoch: 5
- Sentence embedding dimension: 768
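The ranking loss used in CoSENT-style fine-tuning can be sketched as below. This is a simplified illustration, not the project's training code: it assumes the per-pair cosine similarities are already computed, the function name `cosent_loss` is hypothetical, and the scale factor of 20 is an assumed value.

```python
import math
from typing import Sequence


def cosent_loss(cos_sims: Sequence[float], labels: Sequence[float], scale: float = 20.0) -> float:
    """log(1 + sum of exp(scale * (cos_j - cos_i))) over all pairs with labels[i] > labels[j].

    cos_sims[k] is the cosine similarity predicted for sentence pair k,
    labels[k] is its gold similarity label.
    """
    if len(cos_sims) != len(labels):
        raise ValueError("cos_sims and labels must have the same length")
    # The leading 0.0 represents the "1 +" inside the log, since exp(0) == 1.
    terms = [0.0]
    for i in range(len(labels)):
        for j in range(len(labels)):
            if labels[i] > labels[j]:
                # Penalize cases where the less similar pair j scores above pair i.
                terms.append(scale * (cos_sims[j] - cos_sims[i]))
    # Numerically stable log-sum-exp.
    m = max(terms)
    return m + math.log(sum(math.exp(t - m) for t in terms))


# Pair 0 is labeled similar (1), pair 1 dissimilar (0).
well_ordered = cosent_loss([0.9, 0.1], [1, 0])  # similar pair scores higher
mis_ordered = cosent_loss([0.1, 0.9], [1, 0])   # similar pair scores lower
print(well_ordered, mis_ordered)
```

The loss stays near zero when every higher-labeled pair already has the higher cosine similarity, and grows roughly linearly with the size of any ordering violation.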

Guide: Running Locally

Install Required Libraries:
- text2vec for simplified usage:
```
pip install -U text2vec
```
- transformers and torch for direct usage:
```
pip install -U transformers torch
```
- sentence-transformers for sentence encoding:
```
pip install -U sentence-transformers
```

Load and Use the Model:

Using text2vec:

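```
from text2vec import SentenceModel

sentences = ['如何更换花呗绑定银行卡', '花呗更改绑定银行卡']

# Downloads the model from the Hugging Face Hub on first use
model = SentenceModel('shibing624/text2vec-base-chinese')

# encode() returns one 768-dimensional embedding per input sentence
embeddings = model.encode(sentences)
print(embeddings)
```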

Using transformers:

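```
from transformers import BertTokenizer, BertModel
import torch

# Load the tokenizer and encoder from the Hugging Face Hub
tokenizer = BertTokenizer.from_pretrained('shibing624/text2vec-base-chinese')
model = BertModel.from_pretrained('shibing624/text2vec-base-chinese')
sentences = ['如何更换花呗绑定银行卡', '花呗更改绑定银行卡']

# Tokenize with padding, truncating to the 128-token limit described above
encoded_input = tokenizer(sentences, padding=True, truncation=True, max_length=128, return_tensors='pt')
with torch.no_grad():
    model_output = model(**encoded_input)

# Mean-pool the token embeddings, ignoring padding, to get one vector per sentence
mask = encoded_input['attention_mask'].unsqueeze(-1).float()
embeddings = (model_output.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)
print(embeddings)
```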

For faster inference, consider running on a GPU (for example, a cloud instance) or using an ONNX or OpenVINO runtime.

License

The text2vec-base-chinese model is released under the Apache 2.0 license, allowing for both personal and commercial use.
