Introduction

KURE-v1, developed by Korea University's NLP&AI Lab, is a Korean text-retrieval embedding model that outperforms most multilingual embedding models on Korean retrieval benchmarks and is regarded as one of the strongest publicly released Korean retrieval models.

Architecture

The model is fine-tuned from BAAI/bge-m3 on Korean data, trained with the CachedGISTEmbedLoss objective. It supports both Korean and English and is released under the MIT License.

Training

Training Data

  • 2,000,000 Korean query–document training examples, each pairing a query with a positive document and five hard negatives.

Training Procedure

  • Loss Function: CachedGISTEmbedLoss
  • Batch Size: 4096
  • Learning Rate: 2e-05
  • Epochs: 1

Evaluation

KURE-v1 was evaluated on several Korean retrieval benchmarks, including Ko-StrategyQA and AutoRAGRetrieval, reporting strong recall, precision, NDCG, and F1 scores across datasets.
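Of the metrics listed, NDCG is the least self-explanatory: it discounts relevant results by how far down the ranking they appear, then normalizes by the best possible ranking. A minimal sketch of NDCG@k (not the benchmark's own evaluation code):

```python
import math

def dcg_at_k(relevances, k):
    # Discounted cumulative gain: relevance at rank i is divided
    # by log2(i + 2), so lower-ranked hits contribute less
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k):
    # Normalize by the DCG of the ideal (relevance-sorted) ranking
    ideal_dcg = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal_dcg if ideal_dcg > 0 else 0.0

# Relevance labels of retrieved documents, in ranked order (1 = relevant)
ranking = [1, 0, 1, 0, 0]
print(ndcg_at_k(ranking, k=5))  # < 1.0: a relevant doc sits at rank 3
```

A perfect ranking scores 1.0; pushing relevant documents lower in the list reduces the score.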

Guide: Running Locally

Install Dependencies

First, install the required Sentence Transformers library:

pip install -U sentence-transformers

Python Code Example

from sentence_transformers import SentenceTransformer

# Load model from the Hugging Face Hub
model = SentenceTransformer("nlpai-lab/KURE-v1")

# Sample Korean query for inference (roughly: "Through what means did the
# Constitution and the Court Organization Act enable various legal avenues,
# such as the guarantee of fundamental rights?")
sentences = [
    '헌법과 법원조직법은 어떤 방식을 통해 기본권 보장 등의 다양한 법적 모색을 가능하게 했어',
    # add more sentences as needed
]

# Generate embeddings; bge-m3-based models produce 1024-dimensional vectors
embeddings = model.encode(sentences)
print(embeddings.shape)  # (num_sentences, 1024)

# Calculate similarity scores
similarities = model.similarity(embeddings, embeddings)
print(similarities)
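In a retrieval setting, the similarity matrix is used to rank documents per query. A minimal sketch with NumPy, using random stand-in arrays in place of real `model.encode` outputs (the dimension 1024 matches bge-m3-based models):

```python
import numpy as np

# Stand-in embeddings; in practice these come from model.encode(...)
rng = np.random.default_rng(0)
query_emb = rng.normal(size=(2, 1024))   # (n_queries, dim)
doc_emb = rng.normal(size=(10, 1024))    # (n_docs, dim)

def normalize(x):
    # L2-normalize rows so dot products become cosine similarities
    return x / np.linalg.norm(x, axis=1, keepdims=True)

# (n_queries, n_docs) cosine similarity matrix
scores = normalize(query_emb) @ normalize(doc_emb).T

# Indices of the top-3 documents per query, best first
top_k = np.argsort(-scores, axis=1)[:, :3]
print(top_k)
```

Swapping the random arrays for real query and corpus embeddings turns this into a basic dense retriever.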

Cloud GPUs

For enhanced performance, consider using cloud-based GPU services such as AWS EC2 with NVIDIA GPUs, Google Cloud's AI Platform, or Azure's ML compute options.

License

The KURE-v1 model is distributed under the MIT License, allowing flexible use in both academic and commercial settings.
