ModernBERT-Korean-Large-Preview
Introduction
The ModernBERT-Korean-Large-Preview is a sentence-transformers model fine-tuned for Korean sentence-similarity tasks. It maps sentences and paragraphs to a 1024-dimensional dense vector space, enabling applications such as semantic textual similarity, semantic search, paraphrase mining, and text classification.
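As a quick, hedged illustration of one of these applications, the `paraphrase_mining` utility from sentence-transformers scores every sentence pair in a small corpus; the Korean sentences below are invented for the example:

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import paraphrase_mining

model = SentenceTransformer("sigridjineth/ModernBERT-korean-large-preview")

corpus = [
    "고양이가 소파 위에서 잔다.",        # "The cat sleeps on the sofa."
    "소파 위에서 고양이가 자고 있다.",   # "A cat is sleeping on the sofa."
    "내일 회의는 오전 10시입니다.",      # "Tomorrow's meeting is at 10 a.m."
]

# Returns (score, i, j) triples sorted by descending similarity;
# the two paraphrases should rank above the unrelated pair.
for score, i, j in paraphrase_mining(model, corpus):
    print(f"{score:.3f}  {corpus[i]}  <->  {corpus[j]}")
```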
Architecture
The model is based on the Sentence Transformer architecture, using answerdotai/ModernBERT-large as the base model. Key architectural details include a maximum sequence length of 8192 tokens, an output dimensionality of 1024, and cosine similarity as the similarity function. The model pairs a Transformer encoder with a pooling layer configured for mean pooling over token embeddings to produce sentence embeddings.
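A minimal sketch of how such a stack is typically assembled in sentence-transformers; this reconstructs the described configuration from the base model rather than loading the released checkpoint, so treat the module wiring as illustrative:

```python
from sentence_transformers import SentenceTransformer, models

# Encoder: ModernBERT-large with the 8192-token context described above.
word_embedding = models.Transformer("answerdotai/ModernBERT-large", max_seq_length=8192)

# Pooling: mean over token embeddings yields one fixed-size sentence vector.
pooling = models.Pooling(word_embedding.get_word_embedding_dimension(), pooling_mode="mean")

model = SentenceTransformer(modules=[word_embedding, pooling])
print(model.get_sentence_embedding_dimension())  # 1024 (ModernBERT-large hidden size)
```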
Training
The model was trained on the korean_nli_dataset_reranker_v1 dataset, which consists of 1,120,235 samples, using the CachedMultipleNegativesRankingLoss function to optimize training. The training logs report a development-set cosine accuracy of 0.877, demonstrating the model's effectiveness on sentence-similarity tasks.
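The sketch below shows how this kind of training is typically wired up with the sentence-transformers trainer. The Hub path for korean_nli_dataset_reranker_v1, its column layout, and the mini_batch_size value are assumptions, not the model's actual training configuration:

```python
from datasets import load_dataset
from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer
from sentence_transformers.losses import CachedMultipleNegativesRankingLoss

model = SentenceTransformer("answerdotai/ModernBERT-large")

# Caches embedding chunks (GradCache) so large effective batch sizes fit in
# memory; mini_batch_size is an illustrative value, not the actual hyperparameter.
loss = CachedMultipleNegativesRankingLoss(model, mini_batch_size=32)

# Hypothetical Hub path and column layout: an (anchor, positive, negative)
# triplet format is typical for NLI-derived ranking data.
train_dataset = load_dataset("sigridjineth/korean_nli_dataset_reranker_v1", split="train")

trainer = SentenceTransformerTrainer(model=model, train_dataset=train_dataset, loss=loss)
trainer.train()
```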
Guide: Running Locally
To run the model locally:
- Install Dependencies: Ensure Python 3.11.9 and the necessary libraries, including sentence-transformers, transformers, torch, accelerate, datasets, and tokenizers, are installed:

  ```bash
  pip install sentence-transformers transformers torch accelerate datasets tokenizers
  ```
- Load the Model: Use the sentence-transformers library to load the model:

  ```python
  from sentence_transformers import SentenceTransformer

  model = SentenceTransformer('sigridjineth/ModernBERT-korean-large-preview')
  ```
- Inference: Prepare your sentences and use the model to generate embeddings (a fuller similarity example follows this list):

  ```python
  sentences = ["여기에 문장을 입력하세요."]  # "Enter your sentence here."
  embeddings = model.encode(sentences)
  ```
- GPU Recommendation: For optimal performance, consider using a CUDA-capable GPU, for example through a cloud service such as AWS EC2, Google Cloud, or Azure.
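Putting the steps together, the sketch below (referenced from the inference step above) encodes a few sentences and scores them pairwise. The example sentences and the device argument are assumptions, and `model.similarity` requires sentence-transformers >= 3.0:

```python
from sentence_transformers import SentenceTransformer

# device="cuda" assumes a CUDA GPU is available; drop it to run on CPU.
model = SentenceTransformer("sigridjineth/ModernBERT-korean-large-preview", device="cuda")

sentences = [
    "오늘 날씨가 정말 좋네요.",          # "The weather is really nice today."
    "하늘이 맑고 화창한 하루입니다.",    # "It's a clear, sunny day."
    "주가가 급락했습니다.",              # "Stock prices plunged."
]
embeddings = model.encode(sentences)  # shape: (3, 1024)

# Pairwise cosine similarity matrix; the first two sentences should score
# higher with each other than either does with the third.
print(model.similarity(embeddings, embeddings))
```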
License
The model and its components are distributed under the Apache License 2.0, allowing for both personal and commercial use with attribution. For specific terms and conditions, refer to the license documentation provided with the model.