Conan embedding v1
TencentBAC
Introduction
The Conan-embedding-v1 model, developed by the Tencent BAC Group, is a general-purpose text embedding model built with the sentence-transformers library on PyTorch. It is optimized for Chinese and is designed for tasks including semantic textual similarity (STS), classification, clustering, reranking, and retrieval.
Architecture
The model is built on the BERT architecture and is packaged for sentence-transformers, a library for computing sentence embeddings. Its weights are available in the safetensors format. According to the technical report on arXiv, the model improves text embedding performance by using more and better negative samples.
Training
Training details are discussed in the associated technical report (arXiv:2408.15710). The report outlines the methods used, including how negative samples are selected and handled, which is central to the quality of the resulting embeddings.
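To illustrate why negative samples matter, here is a minimal sketch of a generic InfoNCE-style contrastive loss for a single query. It is a common formulation for training embedding models and is not taken from the report, so it should not be read as the authors' exact objective. The function name `info_nce_loss` and the default temperature are assumptions made for illustration.

```python
import math
from typing import Sequence


def info_nce_loss(pos_sim: float, neg_sims: Sequence[float],
                  temperature: float = 0.05) -> float:
    """Generic InfoNCE-style loss for one query, given similarity scores.

    pos_sim: similarity between the query and its positive passage.
    neg_sims: similarities between the query and each negative passage.
    temperature: illustrative default (an assumption, not from the report).
    """
    logits = [pos_sim / temperature] + [s / temperature for s in neg_sims]
    # Log-sum-exp with the maximum subtracted for numerical stability.
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
    # loss = -log( exp(positive) / sum(exp(all candidates)) )
    return log_denominator - logits[0]
```

Under this formulation, a negative whose similarity to the query is close to the positive's contributes far more to the loss than an easy one, and adding negatives can only increase the loss. That is the general intuition behind using both more and harder negatives.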
Guide: Running Locally
- Environment Setup: Ensure you have Python and PyTorch installed. Use a virtual environment to manage dependencies.
- Install Sentence-Transformers:
pip install -U sentence-transformers
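- Download Model: The model is hosted on the Hugging Face model hub as TencentBAC/Conan-embedding-v1. sentence-transformers downloads and caches it automatically the first time it is loaded.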
- Load the Model:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('TencentBAC/Conan-embedding-v1')
- Inference: Call model.encode on a list of strings to generate one embedding vector per input text.
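The steps above can be combined into a short sketch that encodes a few sentences and compares them by cosine similarity. The helper names (`cosine_similarity`, `embed`, `demo`) and the sample sentences are hypothetical. Only `SentenceTransformer` and its `encode` method come from the sentence-transformers library.

```python
import math
from typing import List, Sequence

MODEL_NAME = "TencentBAC/Conan-embedding-v1"


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def embed(sentences: List[str], model_name: str = MODEL_NAME):
    """Encode sentences into one embedding vector per sentence."""
    # Imported lazily so cosine_similarity works without the library installed.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_name)  # downloads the model on first use
    return model.encode(sentences)  # NumPy array, one row per sentence


def demo() -> None:
    """Hypothetical demo: compare a query with two candidate sentences."""
    sentences = [
        "今天天气很好",  # "The weather is nice today"
        "今天阳光明媚",  # "It is sunny today"
        "我喜欢吃苹果",  # "I like eating apples"
    ]
    vectors = embed(sentences)
    for candidate, vector in zip(sentences[1:], vectors[1:]):
        print(candidate, cosine_similarity(vectors[0], vector))
```

Calling `demo()` downloads the model if it is not already cached and prints one similarity score per candidate. The second sentence would be expected to score higher than the third, though the exact values depend on the model.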
Suggested Cloud GPUs
For large datasets or high-throughput workloads, consider running the model on a cloud GPU instance from a provider such as AWS (EC2), Google Cloud Platform, or Azure.
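Whether the model runs on a cloud GPU or a local machine, the device can be chosen at load time. The sketch below assumes a CUDA-capable instance and falls back to the CPU otherwise. The helper names `pick_device` and `load_model` are hypothetical, while the `device` argument of `SentenceTransformer` is part of the library.

```python
def pick_device() -> str:
    """Return "cuda" when PyTorch can see a GPU, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        # PyTorch is not installed, so no GPU can be used.
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


def load_model(model_name: str = "TencentBAC/Conan-embedding-v1"):
    """Load the model onto the best available device."""
    from sentence_transformers import SentenceTransformer

    return SentenceTransformer(model_name, device=pick_device())
```

On a GPU instance, passing a larger `batch_size` to `encode` (for example `load_model().encode(texts, batch_size=64)`, where 64 is an arbitrary starting point) may improve throughput. The best value depends on the available GPU memory and the length of the texts.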
License
The Conan-embedding-v1 model is licensed under the Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0). This permits sharing and adaptation for non-commercial purposes, provided appropriate credit is given. More details are available in the license text published by Creative Commons.