bert-base-turkish-cased-mean-nli-stsb-tr

emrecan

Introduction

The emrecan/bert-base-turkish-cased-mean-nli-stsb-tr model is a sentence-transformers model for sentence similarity. It maps sentences and paragraphs to a 768-dimensional dense vector space, which makes it suitable for tasks such as clustering and semantic search. It was trained on Turkish machine-translated versions of the NLI and STS-b datasets, using training scripts from the sentence-transformers repository.
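Semantic search over such embeddings usually comes down to ranking candidate vectors by cosine similarity to a query vector. The following is a minimal sketch of that idea, not part of the model itself: the helper names are hypothetical, and the 3-dimensional vectors are toy stand-ins for real 768-dimensional embeddings.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_by_similarity(query, corpus):
    """Return (index, score) pairs for the corpus vectors, most similar first."""
    scores = [(i, cosine_similarity(query, vec)) for i, vec in enumerate(corpus)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

# Toy 3-dimensional vectors standing in for real 768-dimensional embeddings.
corpus_embeddings = [
    [1.0, 0.0, 0.0],
    [0.6, 0.8, 0.0],
    [0.0, 0.0, 1.0],
]
query_embedding = [0.8, 0.6, 0.0]

ranking = rank_by_similarity(query_embedding, corpus_embeddings)
for index, score in ranking:
    print(index, round(score, 3))
```

With real data, the query and corpus vectors would come from encoding text with the model; the ranking logic stays the same.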

Architecture

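The model is a SentenceTransformer made of two modules: a Transformer module with a BertModel backbone, followed by a Pooling module. The Transformer module uses a maximum sequence length of 75 tokens and does not lowercase its input, since the model is cased. The Pooling module applies mean token pooling: it averages the token embeddings to produce one 768-dimensional vector per input.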
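To make mean token pooling concrete, here is a small worked sketch in plain Python. It assumes the usual convention that an attention mask of 1 marks a real token and 0 marks padding; the function name and the 2-dimensional toy vectors are illustrative only.

```python
def masked_mean_pool(token_embeddings, attention_mask):
    """Average the token vectors whose mask is 1; ignore padding (mask 0)."""
    dim = len(token_embeddings[0])
    totals = [0.0] * dim
    count = 0
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for j in range(dim):
                totals[j] += vector[j]
    # Guard against an all-padding input to avoid division by zero.
    return [t / max(count, 1) for t in totals]

# Toy example: 4 token positions, 2 dimensions (the real model uses 768).
token_embeddings = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [9.0, 9.0]]
attention_mask = [1, 1, 1, 0]  # the last position is padding

sentence_embedding = masked_mean_pool(token_embeddings, attention_mask)
print(sentence_embedding)  # [3.0, 4.0]
```

The padded position does not influence the result, which is why the mask matters when sentences of different lengths are batched together.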

Training

Training used the sentence-transformers scripts training_nli_v2.py and training_stsbenchmark_continue_training.py. The model was trained for four epochs with CosineSimilarityLoss and a batch size of 16. Key parameters were a learning rate of 2e-5, 144 warmup steps, and a weight decay of 0.01.
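For intuition, CosineSimilarityLoss in sentence-transformers by default takes the cosine similarity of the two sentence embeddings in a pair and compares it to a gold similarity score with mean squared error. The sketch below reproduces that computation in plain Python on toy vectors. The gold scores are hypothetical and assumed to be already scaled to the range 0 to 1; the function names are illustrative.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def cosine_similarity_loss(pairs, gold_scores):
    """Mean squared error between predicted cosine similarity and gold score."""
    errors = [
        (cosine_similarity(u, v) - gold) ** 2
        for (u, v), gold in zip(pairs, gold_scores)
    ]
    return sum(errors) / len(errors)

# Toy embedding pairs standing in for encoded sentence pairs.
pairs = [
    ([1.0, 0.0], [1.0, 0.0]),  # same direction, cosine = 1.0
    ([1.0, 0.0], [0.0, 1.0]),  # orthogonal, cosine = 0.0
]
gold_scores = [1.0, 0.5]  # hypothetical labels, assumed scaled to [0, 1]

loss = cosine_similarity_loss(pairs, gold_scores)
print(loss)  # 0.125
```

The first pair matches its label exactly and contributes nothing; the second pair is off by 0.5, so the mean squared error over the two pairs is 0.125.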

Guide: Running Locally

  1. Installation: Install or upgrade sentence-transformers with pip install -U sentence-transformers.
  2. Usage: Load the model with the SentenceTransformer class and call encode to obtain one embedding per sentence.
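    from sentence_transformers import SentenceTransformer

    # Turkish inputs: "This is an example sentence", "Each sentence is converted to a vector"
    sentences = ["Bu örnek bir cümle", "Her cümle vektöre çevriliyor"]
    # Downloads the model from the Hugging Face Hub on first use
    model = SentenceTransformer('emrecan/bert-base-turkish-cased-mean-nli-stsb-tr')
    # One 768-dimensional vector per input sentence
    embeddings = model.encode(sentences)
    print(embeddings)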
    
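  3. Alternative Approach: Without sentence-transformers, run the sentences through the model with Hugging Face Transformers and apply mean pooling to the token embeddings yourself.
    from transformers import AutoTokenizer, AutoModel
    import torch
    
    # Mean pooling: average the token embeddings, using the attention mask to ignore padding
    def mean_pooling(model_output, attention_mask):
        # The first element of model_output contains all token embeddings
        token_embeddings = model_output[0]
        input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
        # The clamp avoids division by zero if a row has no unmasked tokens
        return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)
    
    sentences = ["Bu örnek bir cümle", "Her cümle vektöre çevriliyor"]
    # Load the tokenizer and model from the Hugging Face Hub
    tokenizer = AutoTokenizer.from_pretrained('emrecan/bert-base-turkish-cased-mean-nli-stsb-tr')
    model = AutoModel.from_pretrained('emrecan/bert-base-turkish-cased-mean-nli-stsb-tr')
    # Tokenize the sentences as one padded batch
    encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
    # Compute token embeddings without tracking gradients
    with torch.no_grad():
        model_output = model(**encoded_input)
    # Pool the token embeddings into one vector per sentence
    sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
    print("Sentence embeddings:", sentence_embeddings)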
    
  4. Cloud GPUs: Encoding large numbers of sentences is considerably faster on a GPU. If none is available locally, consider a cloud GPU instance, for example on AWS or Google Cloud.

License

This model is released under the Apache-2.0 license.
