KR-SBERT-V40K-klueNLI-augSTS
snunlp
Introduction
The KR-SBERT-V40K-klueNLI-augSTS model is a Korean-specific sentence-transformers model developed by SNUNLP. It maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for tasks such as clustering and semantic search.
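To illustrate how embeddings in a dense vector space support semantic search, the following is a minimal sketch that ranks a small corpus by cosine similarity to a query. The vectors are hypothetical 4-dimensional stand-ins for the model's 768-dimensional embeddings, and the helper names `cosine_similarity` and `rank_by_similarity` are assumptions made for this example, not part of the model or any library.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_by_similarity(query_vec, corpus_vecs):
    """Return (index, score) pairs sorted from most to least similar."""
    scores = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(corpus_vecs)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

# Toy 4-dimensional vectors (hypothetical values); real embeddings have 768 dims.
corpus = [
    [0.9, 0.1, 0.0, 0.2],
    [0.0, 0.8, 0.6, 0.1],
    [0.7, 0.2, 0.1, 0.3],
]
query = [1.0, 0.0, 0.0, 0.2]

ranking = rank_by_similarity(query, corpus)
print(ranking)  # corpus indices ordered 0, 2, 1 for these toy vectors
```

In practice the query and corpus vectors would come from the model's encoder rather than being written by hand; the ranking logic stays the same.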
Architecture
The model architecture is based on SentenceTransformer, which includes:
- A Transformer model (BertModel) with a maximum sequence length of 128 and no lower casing.
- A pooling layer that performs mean pooling on token embeddings to generate sentence embeddings.
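The mean pooling step described above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the library's implementation: the function name `mean_pool` and the toy 3-dimensional token vectors are hypothetical, and the attention mask is assumed to use 1 for real tokens and 0 for padding.

```python
def mean_pool(token_embeddings, attention_mask):
    """Average the token vectors whose mask entry is 1; ignore padding."""
    dim = len(token_embeddings[0])
    totals = [0.0] * dim
    count = 0
    for vec, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for j, value in enumerate(vec):
                totals[j] += value
    count = max(count, 1)  # guard against an all-padding input
    return [t / count for t in totals]

# Toy token embeddings (hypothetical values); the last row is a padding position.
tokens = [
    [1.0, 2.0, 3.0],
    [3.0, 4.0, 5.0],
    [100.0, 100.0, 100.0],
]
mask = [1, 1, 0]

sentence_embedding = mean_pool(tokens, mask)
print(sentence_embedding)  # [2.0, 3.0, 4.0]
```

Because the padding row is masked out, its large values do not affect the result, which is why the attention mask has to be passed to the pooling step alongside the token embeddings.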
Training
The model was evaluated using the Sentence Embeddings Benchmark and achieved an accuracy of 0.8628. It is designed to handle various sentence similarity tasks and belongs to the same family as other KR-SBERT models such as KR-SBERT-Medium-NLI-STS and KR-SBERT-V40K-NLI-augSTS.
Guide: Running Locally
- Install Dependencies:
  - For Sentence-Transformers:
        pip install -U sentence-transformers
  - For HuggingFace Transformers:
        pip install torch transformers
- Using Sentence-Transformers:
    from sentence_transformers import SentenceTransformer

    # Load the pretrained model from the Hugging Face Hub
    model = SentenceTransformer('snunlp/KR-SBERT-V40K-klueNLI-augSTS')
    sentences = ["This is an example sentence", "Each sentence is converted"]

    # Encode each sentence into one embedding vector
    embeddings = model.encode(sentences)
    print(embeddings)
- Using HuggingFace Transformers:
    from transformers import AutoTokenizer, AutoModel
    import torch

    def mean_pooling(model_output, attention_mask):
        # The first element of model_output holds all token embeddings
        token_embeddings = model_output[0]
        # Expand the mask to the embedding shape so padding tokens contribute zero
        input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
        # Sum the unmasked token embeddings and divide by the number of real tokens
        return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

    tokenizer = AutoTokenizer.from_pretrained('snunlp/KR-SBERT-V40K-klueNLI-augSTS')
    model = AutoModel.from_pretrained('snunlp/KR-SBERT-V40K-klueNLI-augSTS')
    sentences = ['This is an example sentence', 'Each sentence is converted']

    # Tokenize the sentences into a padded batch of PyTorch tensors
    encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

    # Compute token embeddings without tracking gradients
    with torch.no_grad():
        model_output = model(**encoded_input)

    # Pool the token embeddings into one vector per sentence
    sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
    print("Sentence embeddings:")
    print(sentence_embeddings)
- Cloud GPUs: Consider using cloud-based GPUs from providers like AWS, Google Cloud, or Azure for faster processing, especially when handling large datasets or performing extensive computations.
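When a dataset is too large to encode in one call, one common pattern is to feed the encoder fixed-size batches. The sketch below shows that pattern only; the helper names `iter_batches` and `encode_in_batches` are hypothetical, and `fake_encode` is a stand-in for a real encoder so that the example runs without downloading the model.

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of `items`, each at most `batch_size` long."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def encode_in_batches(sentences, encode_fn, batch_size=32):
    """Encode `sentences` batch by batch and collect the resulting vectors."""
    embeddings = []
    for batch in iter_batches(sentences, batch_size):
        embeddings.extend(encode_fn(batch))
    return embeddings

# Stand-in encoder (hypothetical): maps each sentence to a 1-dim "embedding".
def fake_encode(batch):
    return [[float(len(s))] for s in batch]

corpus = ["a", "bb", "ccc", "dddd", "eeeee"]
batches = list(iter_batches(corpus, 2))
vectors = encode_in_batches(corpus, fake_encode, batch_size=2)
print(len(batches), vectors)
```

With the real model, `encode_fn` could be the `model.encode` method from the Sentence-Transformers example; that method also accepts its own `batch_size` argument, so an outer loop like this is mainly useful for streaming results to disk or reporting progress between chunks.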
License
The KR-SBERT model is a publicly available resource published by SNUNLP. For citation and further details, refer to the KR-SBERT GitHub repository.