bert-base-turkish-cased-mean-nli-stsb-tr
emrecan
Introduction
The bert-base-turkish-cased-mean-nli-stsb-tr model by emrecan is designed for sentence similarity tasks. It maps sentences and paragraphs into a 768-dimensional dense vector space that can be used for tasks such as clustering and semantic search. It was trained on Turkish machine-translated versions of the NLI and STS-b datasets, using training scripts from the sentence-transformers repository.
Architecture
The model is implemented as a SentenceTransformer that combines a Transformer module, built on a BertModel backbone, with a Pooling layer. The Transformer module uses a maximum sequence length of 75 and does not lowercase its input. The Pooling layer applies mean token pooling.
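To make the pooling step concrete, here is a minimal NumPy sketch of masked mean pooling. It is an illustration under stated assumptions, not the library's implementation: the function name `mean_pool` and the toy arrays are hypothetical, and the hidden size is reduced from 768 to 4 for readability. The truncation to 75 positions mirrors the maximum sequence length stated above.

```python
import numpy as np

MAX_SEQ_LENGTH = 75  # maximum sequence length stated for this model


def mean_pool(token_embeddings, attention_mask, max_seq_length=MAX_SEQ_LENGTH):
    """Average token vectors over real (non-padding) positions.

    token_embeddings: array of shape (batch, seq_len, hidden)
    attention_mask:   array of shape (batch, seq_len), 1 = token, 0 = padding
    """
    # Keep at most max_seq_length positions (normally the tokenizer truncates earlier).
    tokens = np.asarray(token_embeddings, dtype=np.float64)[:, :max_seq_length]
    mask = np.asarray(attention_mask, dtype=np.float64)[:, :max_seq_length, None]
    # Zero out padding positions, then sum over the sequence axis.
    summed = (tokens * mask).sum(axis=1)
    # Divide by the number of real tokens; the clip guards against division by zero.
    counts = np.clip(mask.sum(axis=1), 1e-9, None)
    return summed / counts


# Hypothetical toy batch: 2 sentences, 3 token slots, hidden size 4.
toy_tokens = np.array([
    [[1.0, 2.0, 3.0, 4.0], [3.0, 4.0, 5.0, 6.0], [9.0, 9.0, 9.0, 9.0]],  # last slot is padding
    [[1.0, 1.0, 1.0, 1.0], [2.0, 2.0, 2.0, 2.0], [3.0, 3.0, 3.0, 3.0]],
])
toy_mask = np.array([[1, 1, 0], [1, 1, 1]])

pooled = mean_pool(toy_tokens, toy_mask)
print(pooled)  # one vector per sentence
```

The padded slot in the first sentence does not affect its pooled vector, which is why the attention mask is needed when averaging.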
Training
Training used scripts such as training_nli_v2.py and training_stsbenchmark_continue_training.py. The model was trained for four epochs with CosineSimilarityLoss and a batch size of 16. Key parameters included a learning rate of 2e-5, 144 warmup steps, and a weight decay of 0.01.
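As a rough illustration of the training objective, the sketch below computes a cosine-similarity loss in NumPy: the mean squared error between the cosine similarity of each sentence pair and its gold score. It also shows a linear warmup of the learning rate over the 144 steps mentioned above. The function names and toy vectors are hypothetical, the gold scores are assumed to be scaled to the range 0 to 1, and the warmup shape is an assumption since the text does not name the scheduler.

```python
import numpy as np


def cosine_similarity_loss(u, v, gold_scores):
    """Mean squared error between cos(u_i, v_i) and the gold score of pair i."""
    u = np.asarray(u, dtype=np.float64)
    v = np.asarray(v, dtype=np.float64)
    gold = np.asarray(gold_scores, dtype=np.float64)
    # Row-wise cosine similarity between paired sentence embeddings.
    cos = (u * v).sum(axis=1) / (np.linalg.norm(u, axis=1) * np.linalg.norm(v, axis=1))
    return float(np.mean((cos - gold) ** 2))


def warmup_lr(step, base_lr=2e-5, warmup_steps=144):
    """Learning rate under linear warmup (assumed shape); any later decay is omitted."""
    return base_lr * min(1.0, step / warmup_steps)


# Hypothetical 2-dimensional embeddings for two sentence pairs.
u = [[1.0, 0.0], [1.0, 0.0]]
v = [[1.0, 0.0], [0.0, 1.0]]
gold = [1.0, 0.5]  # assumed similarity labels in [0, 1]

loss = cosine_similarity_loss(u, v, gold)
print(loss)
print(warmup_lr(72), warmup_lr(144))
```

The first pair is predicted perfectly, so only the second pair contributes to the loss.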
Guide: Running Locally
- Installation: Ensure sentence-transformers is installed:

      pip install -U sentence-transformers

- Usage: Load the model using the SentenceTransformer class and encode sentences to obtain embeddings.

      from sentence_transformers import SentenceTransformer

      sentences = ["Bu örnek bir cümle", "Her cümle vektöre çevriliyor"]

      model = SentenceTransformer('emrecan/bert-base-turkish-cased-mean-nli-stsb-tr')
      embeddings = model.encode(sentences)
      print(embeddings)

- Alternative Approach: Use Hugging Face Transformers directly and apply mean pooling to the token embeddings yourself.

      from transformers import AutoTokenizer, AutoModel
      import torch

      # Mean pooling: average the token embeddings, using the attention mask to ignore padding.
      def mean_pooling(model_output, attention_mask):
          token_embeddings = model_output[0]  # first element holds all token embeddings
          input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
          return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

      sentences = ["Bu örnek bir cümle", "Her cümle vektöre çevriliyor"]

      # Load the tokenizer and model from the Hugging Face Hub.
      tokenizer = AutoTokenizer.from_pretrained('emrecan/bert-base-turkish-cased-mean-nli-stsb-tr')
      model = AutoModel.from_pretrained('emrecan/bert-base-turkish-cased-mean-nli-stsb-tr')

      # Tokenize the sentences as a padded batch.
      encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

      # Compute token embeddings without tracking gradients.
      with torch.no_grad():
          model_output = model(**encoded_input)

      # Pool token embeddings into one vector per sentence.
      sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
      print("Sentence embeddings:", sentence_embeddings)

- Cloud GPUs: For efficient computation, consider using cloud GPUs such as those available on AWS or Google Cloud.
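Once embeddings are available from either approach above, sentence similarity and semantic search usually come down to comparing vectors with cosine similarity. The sketch below is a minimal illustration, not part of the model's own tooling: `rank_by_cosine` is a hypothetical helper, and the 3-dimensional toy vectors stand in for the 768-dimensional embeddings the model would return.

```python
import numpy as np


def rank_by_cosine(query_vec, corpus_vecs):
    """Return (corpus index, cosine similarity) pairs, most similar first."""
    q = np.asarray(query_vec, dtype=np.float64)
    c = np.asarray(corpus_vecs, dtype=np.float64)
    # Normalize to unit length so the dot product equals cosine similarity.
    q = q / np.linalg.norm(q)
    c = c / np.linalg.norm(c, axis=1, keepdims=True)
    scores = c @ q
    order = np.argsort(-scores)  # descending similarity
    return [(int(i), float(scores[i])) for i in order]


# Hypothetical toy vectors; real inputs would be the embeddings of the query and corpus sentences.
corpus = [[1.0, 0.0, 0.0], [0.6, 0.8, 0.0], [0.0, 0.0, 1.0]]
query = [1.0, 0.1, 0.0]

ranking = rank_by_cosine(query, corpus)
for index, score in ranking:
    print(index, round(score, 3))
```

With real embeddings, the indices in the ranking map back to the corpus sentences, so the top entries are the closest semantic matches to the query.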
License
This model is released under the Apache-2.0 license.