nb-sbert-base

NbAiLab

Introduction

NB-SBERT-BASE is a SentenceTransformers model for sentence similarity in Norwegian. It was trained on a machine-translated version of the MNLI dataset, starting from the nb-bert-base model. The model maps each sentence to a 768-dimensional vector, which can then be used for tasks such as clustering and semantic search. It also supports cross-lingual similarity: an English sentence and a Norwegian sentence with similar meanings should receive a high similarity score.
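
To make the notion of a "similarity score" concrete, here is a minimal pure-Python sketch of cosine similarity, the measure used to compare sentence vectors. The three 4-dimensional vectors are invented stand-ins for the model's 768-dimensional embeddings, not real model output, and the helper name `cosine_similarity` is illustrative.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Toy vectors (assumed values) standing in for real sentence embeddings.
english = [0.9, 0.1, 0.3, 0.0]
norwegian = [0.8, 0.2, 0.4, 0.1]
unrelated = [-0.1, 0.9, -0.5, 0.2]

print(cosine_similarity(english, norwegian))  # close to 1: similar direction
print(cosine_similarity(english, unrelated))  # negative: dissimilar direction
```

Scores range from -1 to 1, and vectors pointing in nearly the same direction score close to 1 regardless of their length.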

Architecture

The model architecture includes:

  • Transformer: A BERT model with a maximum sequence length of 75 tokens.
  • Pooling: Mean pooling of token embeddings to generate sentence embeddings.
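
The pooling step can be sketched in a few lines of plain Python. This is an illustration of masked mean pooling in general, not the library's implementation: the function name `mean_pool` and the toy 2-dimensional token vectors are assumptions made for the example.

```python
def mean_pool(token_embeddings, attention_mask):
    """Average the token vectors whose mask entry is 1; padding (0) is skipped."""
    if not token_embeddings:
        raise ValueError("expected at least one token embedding")
    if len(token_embeddings) != len(attention_mask):
        raise ValueError("one mask entry is required per token")
    total = [0.0] * len(token_embeddings[0])
    count = 0
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vector):
                total[i] += value
    if count == 0:
        raise ValueError("at least one token must be unmasked")
    return [component / count for component in total]


# Two real tokens followed by one padding token (toy values).
tokens = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 0]
print(mean_pool(tokens, mask))  # [2.0, 3.0]
```

Ignoring padded positions matters when sentences of different lengths are batched together, since otherwise the padding vectors would distort the average.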

Training

NB-SBERT-BASE was trained using the MultipleNegativesRankingLoss with cosine similarity. The key parameters include:

  • Batch Size: 32
  • Epochs: 1
  • Learning Rate: 2e-5
  • Warmup Steps: 1648
  • Optimizer: AdamW with weight decay of 0.01
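
As a rough illustration of the loss, the sketch below treats each anchor's own positive as the correct "class" and every other positive in the batch as a negative, then applies cross-entropy over scaled cosine similarities. It is a simplified pure-Python rendering of the idea, not the SentenceTransformers code; the scale factor of 20 and the toy 2-dimensional vectors are assumptions, since the source does not state them.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero, equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def multiple_negatives_ranking_loss(anchors, positives, scale=20.0):
    """Mean cross-entropy where positives[i] is the target for anchors[i].

    All other positives in the batch act as in-batch negatives.
    The default scale of 20.0 is an assumed value for illustration.
    """
    if len(anchors) != len(positives) or not anchors:
        raise ValueError("need the same non-zero number of anchors and positives")
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [scale * cosine(anchor, positive) for positive in positives]
        # Numerically stable log-sum-exp over the row of logits.
        max_logit = max(logits)
        log_sum_exp = max_logit + math.log(sum(math.exp(x - max_logit) for x in logits))
        losses.append(log_sum_exp - logits[i])
    return sum(losses) / len(losses)


anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[1.0, 0.0], [0.0, 1.0]]   # each positive matches its anchor
shuffled = [[0.0, 1.0], [1.0, 0.0]]  # positives swapped

print(multiple_negatives_ranking_loss(anchors, aligned))   # near zero
print(multiple_negatives_ranking_loss(anchors, shuffled))  # large
```

Because negatives come from the rest of the batch, a larger batch gives each anchor more negatives to be ranked against.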

Evaluation on the STS-test dataset yielded Pearson and Spearman scores of 0.8275 and 0.8245, respectively, for cosine similarity.
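
For readers unfamiliar with the two metrics, the following sketch computes Pearson and Spearman correlation between gold similarity labels and predicted cosine scores. The `gold` and `predicted` lists are made-up values for illustration and are unrelated to the reported results; the function names are likewise illustrative.

```python
import math


def pearson(x, y):
    """Pearson correlation: linear agreement between two score lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        average_rank = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            result[order[k]] = average_rank
        pos = end + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical gold labels (0 to 5 scale) and model cosine scores.
gold = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
predicted = [0.10, 0.20, 0.50, 0.60, 0.90, 0.95]

print(pearson(gold, predicted))
print(spearman(gold, predicted))
```

Pearson rewards a linear relationship between the two lists, while Spearman only checks that they are ordered the same way, which is why the two reported scores can differ slightly.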

Guide: Running Locally

Basic Steps

  1. Install SentenceTransformers:

    pip install -U sentence-transformers
    
  2. Load and Use the Model:

    from sentence_transformers import SentenceTransformer, util
    model = SentenceTransformer('NbAiLab/nb-sbert-base')
    sentences = ["This is a Norwegian boy", "Dette er en norsk gutt"]
    embeddings = model.encode(sentences)
    cosine_scores = util.cos_sim(embeddings[0], embeddings[1])
    print(cosine_scores)
    
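  3. Alternative with Hugging Face Transformers (the pooling step must be applied manually):

    from transformers import AutoTokenizer, AutoModel
    import torch
    sentences = ["This is a Norwegian boy", "Dette er en norsk gutt"]
    tokenizer = AutoTokenizer.from_pretrained('NbAiLab/nb-sbert-base')
    model = AutoModel.from_pretrained('NbAiLab/nb-sbert-base')
    encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
    with torch.no_grad():
        model_output = model(**encoded_input)
    # Mean pooling: average the token embeddings, ignoring padding positions
    mask = encoded_input['attention_mask'].unsqueeze(-1).float()
    embeddings = (model_output[0] * mask).sum(1) / mask.sum(1).clamp(min=1e-9)
    print(embeddings.shape)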
    

Suggested Cloud GPUs

For larger datasets or faster processing, consider using cloud GPU services like AWS EC2, Google Cloud Platform, or Azure.

License

NB-SBERT-BASE is released under the Apache 2.0 License.
