text2vec-base-chinese by shibing624

Introduction
The `text2vec-base-chinese` model by shibing624 is a CoSENT (Cosine Sentence) model designed to map Chinese sentences into a 768-dimensional dense vector space. It is particularly useful for tasks such as sentence embeddings, text matching, and semantic search.
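As a quick illustration of semantic search with these embeddings, here is a sketch that ranks candidate sentences against a query by cosine similarity. It assumes the `text2vec` package described under Guide: Running Locally below, and that `encode` returns numpy arrays (its default):

```python
import numpy as np
from text2vec import SentenceModel

# Rank candidates against a query by cosine similarity of their 768-dim embeddings.
model = SentenceModel('shibing624/text2vec-base-chinese')
query = '如何更换花呗绑定银行卡'  # "How do I change the bank card linked to Huabei?"
candidates = ['花呗更改绑定银行卡', '我什么时候开通了花呗']

q = model.encode([query])[0]
c = model.encode(candidates)
scores = c @ q / (np.linalg.norm(c, axis=1) * np.linalg.norm(q))
for sentence, score in sorted(zip(candidates, scores), key=lambda p: -p[1]):
    print(f'{score:.4f}  {sentence}')
```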
Architecture
The model architecture is based on the CoSENT framework, which includes the following (a code sketch follows the list):
- A Transformer model (`BertModel`) configured with a maximum sequence length of 128 and case sensitivity (input is not lowercased).
- A pooling layer that applies mean token pooling to generate sentence embeddings.
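As an illustration, the same two-module stack can be assembled with the `sentence-transformers` `models` API. This is a sketch that rebuilds an equivalent pipeline from the base checkpoint named under Training; it does not load this model's fine-tuned weights:

```python
from sentence_transformers import SentenceTransformer, models

# Transformer module: a BERT encoder truncating inputs at 128 tokens.
word_embedding = models.Transformer('hfl/chinese-macbert-base', max_seq_length=128)

# Pooling module: mean token pooling over the 768-dim token embeddings.
pooling = models.Pooling(word_embedding.get_word_embedding_dimension(),
                         pooling_mode='mean')

model = SentenceTransformer(modules=[word_embedding, pooling])
```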
Training
- Pre-training: Uses the pre-trained `hfl/chinese-macbert-base` model.
- Fine-tuning: The model is fine-tuned with a contrastive learning objective that computes cosine similarity between sentence pairs, employing a ranking loss for optimization (a loss sketch follows this list).
- Hyperparameters:
  - Training dataset: `nli_zh`
  - Maximum sequence length: 128
  - Best epoch: 5
  - Sentence embedding dimension: 768
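The card does not spell out the loss itself; the following is a sketch of a CoSENT-style ranking loss over cosine similarities, not this model's exact training code. The `scale` factor of 20 is an assumed common default:

```python
import torch

def cosent_loss(emb_a, emb_b, labels, scale=20.0):
    """Sketch of a CoSENT-style ranking loss (not the exact training code).

    emb_a, emb_b: (batch, dim) embeddings for the two sides of each sentence pair.
    labels: (batch,) similarity labels; higher means the pair is more similar.
    """
    cos = torch.nn.functional.cosine_similarity(emb_a, emb_b) * scale
    diff = cos[None, :] - cos[:, None]        # diff[i, j] = cos_j - cos_i
    mask = labels[:, None] > labels[None, :]  # pair i should outscore pair j
    # loss = log(1 + sum over ranked pairs of exp(cos_j - cos_i)),
    # computed stably by prepending a zero term to a logsumexp.
    zero = torch.zeros(1, device=cos.device)
    return torch.logsumexp(torch.cat([zero, diff[mask]]), dim=0)
```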
Guide: Running Locally
- Install Required Libraries:
  - `text2vec` for simplified usage: `pip install -U text2vec`
  - `transformers` and `torch` for direct usage: `pip install transformers torch`
  - `sentence-transformers` for sentence encoding: `pip install -U sentence-transformers`
- Load and Use the Model:
  - Using `text2vec`:

    ```python
    from text2vec import SentenceModel

    sentences = ['如何更换花呗绑定银行卡', '花呗更改绑定银行卡']
    model = SentenceModel('shibing624/text2vec-base-chinese')
    embeddings = model.encode(sentences)
    print(embeddings)
    ```
  - Using `transformers` directly (mean token pooling, as described under Architecture, turns the token states into sentence embeddings):

    ```python
    from transformers import BertTokenizer, BertModel
    import torch

    tokenizer = BertTokenizer.from_pretrained('shibing624/text2vec-base-chinese')
    model = BertModel.from_pretrained('shibing624/text2vec-base-chinese')
    sentences = ['如何更换花呗绑定银行卡', '花呗更改绑定银行卡']
    encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
    with torch.no_grad():
        model_output = model(**encoded_input)

    # Mean token pooling over non-padding tokens yields the sentence embeddings.
    mask = encoded_input['attention_mask'].unsqueeze(-1).float()
    token_embeddings = model_output.last_hidden_state
    sentence_embeddings = (token_embeddings * mask).sum(1) / mask.sum(1).clamp(min=1e-9)
    print(sentence_embeddings)
    ```
  - Using `sentence-transformers`: load the checkpoint with `SentenceTransformer` and call `encode` (a sketch follows this list).
- For faster inference, consider running on a cloud GPU and exporting the model to ONNX or OpenVINO.
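A minimal `sentence-transformers` sketch, assuming the published checkpoint loads directly through `SentenceTransformer`, as the install step above suggests:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('shibing624/text2vec-base-chinese')
sentences = ['如何更换花呗绑定银行卡', '花呗更改绑定银行卡']
embeddings = model.encode(sentences)  # shape: (2, 768)
print(embeddings)
```

For the ONNX route, one possibility (an assumption, not something the model card prescribes) is Hugging Face's `optimum` package, which can export the encoder to ONNX Runtime; mean pooling must still be applied as in the `transformers` example:

```python
from optimum.onnxruntime import ORTModelForFeatureExtraction
from transformers import AutoTokenizer

# Hypothetical ONNX export via optimum (pip install optimum[onnxruntime]).
ort_model = ORTModelForFeatureExtraction.from_pretrained(
    'shibing624/text2vec-base-chinese', export=True)
tokenizer = AutoTokenizer.from_pretrained('shibing624/text2vec-base-chinese')
inputs = tokenizer(['如何更换花呗绑定银行卡'], padding=True, return_tensors='pt')
token_states = ort_model(**inputs).last_hidden_state  # pool these as above
```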
License
The `text2vec-base-chinese` model is released under the Apache 2.0 license, allowing both personal and commercial use.