Introduction

The BGE-M3-KO model is a sentence-transformers model designed for sentence similarity tasks. It maps sentences and paragraphs to a 1024-dimensional dense vector space, which can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, and clustering. The model is trained primarily on Korean and English datasets.
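
As a quick illustration of the semantic search use case, the sketch below embeds a small corpus and retrieves the closest passages for a query using sentence-transformers' util.semantic_search. The corpus and query here are made-up examples, not from the model's training data.

    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer("dragonkue/bge-m3-ko")

    # Hypothetical corpus and query, for illustration only.
    corpus = ["서울은 대한민국의 수도이다.", "The Eiffel Tower is in Paris.", "고양이는 포유류이다."]
    query = "What is the capital of South Korea?"

    corpus_embeddings = model.encode(corpus, convert_to_tensor=True)
    query_embedding = model.encode(query, convert_to_tensor=True)

    # For each query, returns the top-k corpus entries ranked by cosine similarity.
    hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=2)
    for hit in hits[0]:
        print(corpus[hit["corpus_id"]], hit["score"])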

Architecture

The model uses a Sentence Transformer architecture, which includes a Transformer encoder based on XLM-RoBERTa. Its key features are listed below; the short sketch after the list shows how these properties can be inspected programmatically:

  • Maximum sequence length of 8192 tokens.
  • Output embedding dimensionality of 1024.
  • Utilizes cosine similarity to assess sentence similarity.
  • Comprises a Transformer model with a pooling layer and normalization.
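
As a quick check, the following sketch loads the model and prints these properties via standard sentence-transformers attributes:

    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("dragonkue/bge-m3-ko")

    # Maximum input length in tokens (8192 for this model).
    print(model.max_seq_length)
    # Embedding dimensionality (1024 for this model).
    print(model.get_sentence_embedding_dimension())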

Training

The model was trained with a batch size of 32768 and a learning rate of 3e-05, employing FP16 precision. The training process involved weakly-supervised contrastive pre-training, as detailed in the paper "Text Embeddings by Weakly-Supervised Contrastive Pre-training."
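
For readers who want to see what contrastive training looks like in practice, here is a minimal sketch of in-batch-negative contrastive fine-tuning using sentence-transformers' MultipleNegativesRankingLoss. The pair data, batch size, and step count are placeholders for illustration; the released model was trained with the much larger configuration described above.

    from torch.utils.data import DataLoader
    from sentence_transformers import SentenceTransformer, InputExample, losses

    model = SentenceTransformer("dragonkue/bge-m3-ko")

    # Hypothetical (query, relevant passage) pairs; other passages in the
    # batch act as in-batch negatives under this loss.
    train_examples = [
        InputExample(texts=["한국의 수도는 어디인가요?", "대한민국의 수도는 서울이다."]),
        InputExample(texts=["What is the capital of France?", "Paris is the capital of France."]),
    ]
    train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)
    train_loss = losses.MultipleNegativesRankingLoss(model)

    # Tiny illustrative run; the actual model used batch size 32768, lr 3e-05, and FP16.
    model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=10)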

Guide: Running Locally

  1. Install Dependencies:
    pip install -U sentence-transformers
    
  2. Load the Model:
    from sentence_transformers import SentenceTransformer
    model = SentenceTransformer("dragonkue/bge-m3-ko")
    
  3. Inference: Encode sentences and calculate similarity.
    sentences = ['Sentence 1', 'Sentence 2']
    embeddings = model.encode(sentences)
    # Returns a 2x2 matrix of pairwise cosine similarities.
    similarities = model.similarity(embeddings, embeddings)
    
  4. Hardware Recommendation: For optimal performance, especially with large batch sizes or long sequences, consider a cloud GPU such as an AWS EC2 P3 instance or a Google Cloud A100; the sketch below shows GPU encoding with an explicit batch size.
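
A minimal sketch of batch encoding on a GPU, assuming a CUDA device is available; the batch size is only an example and should be tuned to the available memory:

    import torch
    from sentence_transformers import SentenceTransformer

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = SentenceTransformer("dragonkue/bge-m3-ko", device=device)

    # Encode in batches on the selected device; lower batch_size if memory is tight.
    sentences = ["Sentence 1", "Sentence 2"]
    embeddings = model.encode(sentences, batch_size=64, show_progress_bar=True)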

License

This model is released under the Apache 2.0 License, allowing for both personal and commercial use, as well as modification and distribution of the model.
