KURE-v1
Introduction
KURE-v1, developed by Korea University's NLP&AI Lab, is a Korean text retrieval embedding model that outperforms most multilingual embedding models on Korean retrieval benchmarks. It is regarded as one of the strongest publicly available Korean retrieval models.
Architecture
The model is fine-tuned from BAAI/bge-m3 on Korean data using the CachedGISTEmbedLoss objective. It supports Korean and English and is released under the MIT License.
Training
Training Data
- 2,000,000 examples of Korean query-document-hard_negative(5) data: each query is paired with a positive document and five hard negatives.
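A single training example can be pictured as follows. This is a hedged sketch of the query-document-hard_negative(5) structure; the field names here are illustrative, not the actual schema of the training set:

```python
# Hypothetical record illustrating the query-document-hard_negative(5) layout.
# Field names are assumptions; the real dataset's schema may differ.
record = {
    "query": "...",        # Korean search query
    "document": "...",     # positive (relevant) passage
    "hard_negatives": [    # five passages that look similar but are not relevant
        "...", "...", "...", "...", "...",
    ],
}

assert len(record["hard_negatives"]) == 5
```

Hard negatives of this kind are what contrastive losses such as CachedGISTEmbedLoss use to push near-miss passages away from the query in embedding space.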
Training Procedure
- Loss Function: CachedGISTEmbedLoss
- Batch Size: 4096
- Learning Rate: 2e-05
- Epochs: 1
Evaluation
KURE-v1 was evaluated on several Korean retrieval benchmarks, including Ko-StrategyQA and AutoRAGRetrieval, achieving high recall, precision, NDCG, and F1 scores across datasets.
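To make the metrics concrete, here is a minimal sketch of binary-relevance recall@k and NDCG@k on a toy ranking; the documents and labels are invented for illustration:

```python
import math

def recall_at_k(retrieved, relevant, k):
    # Fraction of relevant documents found in the top-k retrieved list.
    hits = sum(1 for doc in retrieved[:k] if doc in relevant)
    return hits / len(relevant)

def ndcg_at_k(retrieved, relevant, k):
    # Binary-relevance NDCG: DCG of the ranking divided by the ideal DCG.
    dcg = sum(1.0 / math.log2(i + 2)
              for i, doc in enumerate(retrieved[:k]) if doc in relevant)
    ideal = sum(1.0 / math.log2(i + 2)
                for i in range(min(len(relevant), k)))
    return dcg / ideal

retrieved = ["d3", "d1", "d7", "d2"]   # ranked results from a model
relevant = {"d1", "d2"}                # gold relevance labels

print(recall_at_k(retrieved, relevant, 4))            # 1.0
print(round(ndcg_at_k(retrieved, relevant, 4), 3))    # 0.651
```

Both relevant documents are retrieved (recall 1.0), but NDCG is penalized because they sit at ranks 2 and 4 rather than at the top.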
Guide: Running Locally
Install Dependencies
First, install the required Sentence Transformers library:
pip install -U sentence-transformers
Python Code Example
from sentence_transformers import SentenceTransformer
# Load model from the Hugging Face Hub
model = SentenceTransformer("nlpai-lab/KURE-v1")
# Sample sentences for inference
sentences = [
'헌법과 법원조직법은 어떤 방식을 통해 기본권 보장 등의 다양한 법적 모색을 가능하게 했어',
# ("Through what methods did the Constitution and the Court Organization Act
#  enable various legal avenues such as the guarantee of basic rights?")
# add more sentences as needed
]
# Generate embeddings
embeddings = model.encode(sentences)
print(embeddings.shape)
# Calculate similarity scores
similarities = model.similarity(embeddings, embeddings)
print(similarities)
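By default, model.similarity computes cosine similarity between the embedding rows. The ranking it produces can be reproduced with plain NumPy; in this sketch, small toy vectors stand in for real KURE-v1 embeddings:

```python
import numpy as np

def cosine_similarity(a, b):
    # Normalize rows, then a dot product gives pairwise cosine similarities,
    # matching what SentenceTransformer.similarity computes by default.
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T

# Toy stand-ins for real query/document embeddings.
query = np.array([[1.0, 0.0, 1.0]])
docs = np.array([[1.0, 0.1, 0.9],    # near-duplicate of the query
                 [0.0, 1.0, 0.0]])   # orthogonal, unrelated content

scores = cosine_similarity(query, docs)
ranking = np.argsort(-scores[0])     # indices sorted best match first
print(ranking)                       # doc 0 ranks above doc 1
```

In a retrieval pipeline, the same argsort over query-document scores yields the ranked list that metrics like recall and NDCG are computed over.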
Cloud GPUs
For enhanced performance, consider using cloud-based GPU services such as AWS EC2 with NVIDIA GPUs, Google Cloud's AI Platform, or Azure's ML compute options.
License
The KURE-v1 model is distributed under the MIT License, allowing use in both academic and commercial settings.