bert-base-turkish-cased-mean-nli-stsb-tr Model

Introduction

The bert-base-turkish-cased-mean-nli-stsb-tr model by emrecan is designed for sentence similarity tasks. It maps sentences and paragraphs to a 768-dimensional dense vector space, which makes it suitable for tasks such as clustering and semantic search. It was trained on Turkish machine-translated versions of the NLI and STS-b datasets, using training scripts from the sentence-transformers repository.

Architecture

The model is implemented as a SentenceTransformer made up of two modules: a Transformer module with a BertModel backbone, followed by a Pooling layer. The Transformer module uses a maximum sequence length of 75 tokens and does not lowercase its input. The Pooling layer applies mean token pooling, averaging the token embeddings into a single sentence vector.
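
To make mean token pooling concrete, here is a minimal plain-Python sketch on toy data. The function name mean_pool, the 2-dimensional vectors, and the mask values are illustrative assumptions; the real model works on 768-dimensional tensors.

```python
def mean_pool(token_embeddings, attention_mask):
    """Average the token vectors whose mask entry is 1; padding (0) is ignored."""
    dim = len(token_embeddings[0])
    totals = [0.0] * dim
    count = 0
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vector):
                totals[i] += value
    # Guard against an all-padding input to avoid division by zero.
    count = max(count, 1)
    return [total / count for total in totals]


# Toy input (assumption): three real tokens plus one padding token,
# using 2 dimensions instead of 768.
tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [9.0, 9.0]]
mask = [1, 1, 1, 0]
pooled = mean_pool(tokens, mask)
print(pooled)  # [3.0, 4.0]
```

The padding vector [9.0, 9.0] does not affect the result because its mask entry is 0.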

Training

Training used scripts such as training_nli_v2.py and training_stsbenchmark_continue_training.py. The model was trained for four epochs with CosineSimilarityLoss and a batch size of 16. Key hyperparameters were a learning rate of 2e-5, 144 warmup steps, and a weight decay of 0.01.
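
As a rough illustration of the objective, a cosine similarity loss compares the cosine similarity of two sentence embeddings with a gold similarity score, typically through mean squared error. The sketch below is plain Python, not the sentence-transformers implementation; the function names, the toy vectors, and the gold scores (assumed to be already scaled to the range 0 to 1) are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cosine_similarity_loss(pairs, gold_scores):
    """Mean squared error between predicted cosine similarity and gold score."""
    errors = [
        (cosine_similarity(u, v) - gold) ** 2
        for (u, v), gold in zip(pairs, gold_scores)
    ]
    return sum(errors) / len(errors)


# Hypothetical toy batch of embedding pairs (2 dimensions instead of 768).
pairs = [
    ([1.0, 0.0], [1.0, 0.0]),  # same direction, cosine = 1.0
    ([1.0, 0.0], [0.0, 1.0]),  # orthogonal, cosine = 0.0
]
gold_scores = [1.0, 0.5]  # assumed to be scaled to [0, 1]
loss = cosine_similarity_loss(pairs, gold_scores)
print(loss)  # 0.125
```

The first pair matches its gold score exactly and contributes no error; the second contributes (0.0 - 0.5)^2 = 0.25, giving a batch mean of 0.125.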

Guide: Running Locally

Installation: Ensure sentence-transformers is installed via pip install -U sentence-transformers.

Usage: Load the model using the SentenceTransformer class and encode sentences to obtain embeddings.

from sentence_transformers import SentenceTransformer
sentences = ["Bu örnek bir cümle", "Her cümle vektöre çevriliyor"]
model = SentenceTransformer('emrecan/bert-base-turkish-cased-mean-nli-stsb-tr')
embeddings = model.encode(sentences)
print(embeddings)

Alternative Approach: Without sentence-transformers, use Hugging Face Transformers directly and apply mean pooling to the token embeddings yourself.

from transformers import AutoTokenizer, AutoModel
import torch

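def mean_pooling(model_output, attention_mask):
    # The first element of model_output holds the per-token embeddings.
    token_embeddings = model_output[0]
    # Expand the mask to the embedding shape so padding tokens contribute zero.
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    # Sum the unmasked token vectors and divide by the number of real tokens.
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)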

sentences = ["Bu örnek bir cümle", "Her cümle vektöre çevriliyor"]
tokenizer = AutoTokenizer.from_pretrained('emrecan/bert-base-turkish-cased-mean-nli-stsb-tr')
model = AutoModel.from_pretrained('emrecan/bert-base-turkish-cased-mean-nli-stsb-tr')
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
with torch.no_grad():
    model_output = model(**encoded_input)
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
print("Sentence embeddings:", sentence_embeddings)
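
Once embeddings are available from either approach, a common next step for semantic search is to rank candidate sentences by cosine similarity to a query. The following is a minimal sketch in plain Python; the helper names and the 3-dimensional toy vectors are hypothetical stand-ins for real 768-dimensional embeddings.

```python
import math


def normalize(vector):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def rank_by_similarity(query, candidates):
    """Return (index, score) pairs sorted from most to least similar."""
    q = normalize(query)
    scores = []
    for index, candidate in enumerate(candidates):
        c = normalize(candidate)
        # The dot product of unit vectors equals their cosine similarity.
        scores.append((index, sum(a * b for a, b in zip(q, c))))
    return sorted(scores, key=lambda item: item[1], reverse=True)


# Hypothetical toy embeddings (3 dimensions instead of 768).
query_embedding = [1.0, 0.0, 0.0]
candidate_embeddings = [
    [0.0, 1.0, 0.0],  # orthogonal to the query
    [2.0, 0.0, 0.0],  # same direction as the query
    [1.0, 1.0, 0.0],  # 45 degrees from the query
]
ranking = rank_by_similarity(query_embedding, candidate_embeddings)
print([index for index, _ in ranking])  # [1, 2, 0]
```

With real data, the rows of the embeddings returned above could be passed in after converting them to lists with .tolist().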

Cloud GPUs: For efficient computation, consider using cloud GPUs such as those available on AWS or Google Cloud.

License

This model is released under the Apache-2.0 license.
