text2vec-base-chinese

shibing624

Introduction

The text2vec-base-chinese model by shibing624 is a CoSENT (Cosine Sentence) model designed to map Chinese sentences into a 768-dimensional dense vector space. This model is particularly useful for tasks such as sentence embeddings, text matching, and semantic search.

Architecture

The model architecture is based on the CoSENT framework, which includes:

  • A Transformer model (BertModel) configured with a maximum sequence length of 128 and lowercasing disabled, so input case is preserved.
  • A pooling layer that averages the token embeddings (mean token pooling) to produce a single sentence embedding.
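
To make the pooling step concrete, here is a minimal, dependency-free sketch of mean token pooling for a single sentence. The function name `mean_pool` and the toy 4-dimensional vectors are illustrative assumptions (the real model produces 768-dimensional token vectors); the point is only that padded positions are excluded from the average.

```python
def mean_pool(token_embeddings, attention_mask):
    """Average the token vectors of one sentence, skipping padded positions.

    token_embeddings: list of per-token vectors (each a list of floats)
    attention_mask:   list of 1 (real token) / 0 (padding), same length
    """
    dim = len(token_embeddings[0])
    summed = [0.0] * dim
    count = 0
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vector):
                summed[i] += value
    # Guard against an all-padding input to avoid division by zero
    count = max(count, 1)
    return [s / count for s in summed]


# Toy example: 3 token positions, 4 dimensions (hypothetical values);
# the last position is padding and must not affect the result.
tokens = [
    [1.0, 2.0, 3.0, 4.0],
    [3.0, 4.0, 5.0, 6.0],
    [9.0, 9.0, 9.0, 9.0],
]
mask = [1, 1, 0]
sentence_embedding = mean_pool(tokens, mask)
print(sentence_embedding)  # [2.0, 3.0, 4.0, 5.0]
```

In practice the same computation is done on batched tensors, but the masking logic is the same.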

Training

  • Pre-training: Starts from the pre-trained hfl/chinese-macbert-base checkpoint.
  • Fine-tuning: The model is fine-tuned with a contrastive objective. Cosine similarity is computed for the sentence pairs in a batch, and a ranking loss pushes pairs labeled as similar to score higher than pairs labeled as dissimilar.
  • Hyperparameters:
    • Training dataset: nli_zh
    • Maximum sequence length: 128
    • Best epoch: 5
    • Sentence embedding dimension: 768
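
The ranking loss mentioned under fine-tuning can be sketched as follows. This is an illustrative reading of a CoSENT-style loss, not the project's training code: the function names, the toy 2-dimensional embeddings, and the `scale` value of 20 are assumptions made for the example. The loss adds a penalty whenever a pair with a lower label receives a higher cosine similarity than a pair with a higher label.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cosent_loss(pairs, labels, scale=20.0):
    """CoSENT-style ranking loss over a batch of sentence pairs.

    pairs:  list of (embedding_a, embedding_b) tuples
    labels: similarity label per pair (higher means more similar)
    scale:  assumed factor applied to cosine differences
    """
    sims = [cosine(a, b) for a, b in pairs]
    total = 1.0  # the "1 +" term inside the logarithm
    for i, label_i in enumerate(labels):
        for j, label_j in enumerate(labels):
            # Pair i is labeled more similar than pair j, so sims[i]
            # should exceed sims[j]; the term grows when it does not.
            if label_i > label_j:
                total += math.exp(scale * (sims[j] - sims[i]))
    return math.log(total)


# Hypothetical embeddings: one near-identical pair, one orthogonal pair
close_pair = ([1.0, 0.0], [1.0, 0.1])
far_pair = ([1.0, 0.0], [0.0, 1.0])

good = cosent_loss([close_pair, far_pair], [1, 0])  # labels agree with the geometry
bad = cosent_loss([close_pair, far_pair], [0, 1])   # labels flipped: ranking violated
print(good < bad)  # True
```

A production implementation would typically compute this on batched tensors with a numerically stable log-sum-exp rather than explicit loops.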

Guide: Running Locally

  1. Install Required Libraries:

    • text2vec for simplified usage:
      pip install -U text2vec
      
    • transformers and torch for direct usage:
      pip install -U transformers torch
      
    • sentence-transformers for sentence encoding:
      pip install -U sentence-transformers
      
  2. Load and Use the Model:

    • Using text2vec:
      from text2vec import SentenceModel
      sentences = ['如何更换花呗绑定银行卡', '花呗更改绑定银行卡']
      model = SentenceModel('shibing624/text2vec-base-chinese')
      embeddings = model.encode(sentences)
      print(embeddings)
      
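    • Using transformers (mean pooling is applied manually, matching the pooling layer described under Architecture):
      import torch
      from transformers import BertTokenizer, BertModel

      tokenizer = BertTokenizer.from_pretrained('shibing624/text2vec-base-chinese')
      model = BertModel.from_pretrained('shibing624/text2vec-base-chinese')
      sentences = ['如何更换花呗绑定银行卡', '花呗更改绑定银行卡']
      encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
      # Run the model without tracking gradients (inference only)
      with torch.no_grad():
          model_output = model(**encoded_input)
      # Mean-pool the token embeddings, ignoring padding positions
      mask = encoded_input['attention_mask'].unsqueeze(-1).float()
      embeddings = (model_output[0] * mask).sum(1) / mask.sum(1).clamp(min=1e-9)
      print(embeddings)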
      
    • For faster inference, consider running on a GPU or exporting the model to ONNX or OpenVINO.
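
Once embeddings are available, text matching and semantic search come down to comparing vectors by cosine similarity. The sketch below is dependency-free and uses hypothetical 3-dimensional vectors in place of real 768-dimensional embeddings; `cosine_similarity` and `rank_by_similarity` are illustrative helper names, not part of text2vec or transformers.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_by_similarity(query_embedding, corpus_embeddings):
    """Return (index, score) pairs sorted from most to least similar."""
    scored = [
        (i, cosine_similarity(query_embedding, emb))
        for i, emb in enumerate(corpus_embeddings)
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical vectors standing in for real sentence embeddings
query = [1.0, 0.0, 0.0]
corpus = [
    [0.0, 1.0, 0.0],  # orthogonal to the query
    [2.0, 0.0, 0.0],  # same direction as the query
    [1.0, 1.0, 0.0],  # 45 degrees from the query
]
ranking = rank_by_similarity(query, corpus)
print([index for index, _ in ranking])  # [1, 2, 0]
```

With real model output, each embedding would first be converted to a plain list of floats (or the same arithmetic would be done directly on arrays or tensors).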

License

The text2vec-base-chinese model is released under the Apache 2.0 license, allowing for both personal and commercial use.
