NB-SBERT-BASE
Introduction
NB-SBERT-BASE is a SentenceTransformers model from NbAiLab, designed for Norwegian sentence-similarity tasks. It was trained on a machine-translated version of the MNLI dataset, starting from the nb-bert-base checkpoint. The model maps sentences to 768-dimensional vectors, which makes it suitable for tasks such as clustering and semantic search. It also supports cross-lingual similarity: an English-Norwegian sentence pair with the same meaning should receive a high similarity score.
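The semantic-search use case can be illustrated with a short, hedged sketch; the Norwegian corpus and English query below are made-up examples, not taken from the model card:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('NbAiLab/nb-sbert-base')

# Made-up Norwegian corpus; an English query still retrieves the right hit.
corpus = ["Oslo er hovedstaden i Norge.", "Katten sover på sofaen."]
corpus_embeddings = model.encode(corpus, convert_to_tensor=True)

query_embedding = model.encode("What is the capital of Norway?", convert_to_tensor=True)
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=1)
print(corpus[hits[0][0]['corpus_id']])  # expected: the Oslo sentence
```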
Architecture
The model architecture consists of two modules (a sketch of the pipeline follows the list):
- Transformer: a BERT encoder (nb-bert-base) with a maximum sequence length of 75 tokens.
- Pooling: mean pooling over token embeddings to produce a fixed-size sentence embedding.
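The following sketch shows how such a two-module pipeline is typically assembled; it uses the generic sentence-transformers building blocks, not code from the model card itself:

```python
from sentence_transformers import SentenceTransformer, models

# Assemble the Transformer + mean-pooling pipeline described above.
word_embedding_model = models.Transformer('NbAiLab/nb-sbert-base', max_seq_length=75)
pooling_model = models.Pooling(
    word_embedding_model.get_word_embedding_dimension(),  # 768 for this model
    pooling_mode_mean_tokens=True,
)
model = SentenceTransformer(modules=[word_embedding_model, pooling_model])
```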
Training
NB-SBERT-BASE was trained with MultipleNegativesRankingLoss, using cosine similarity as the scoring function. Key training parameters (a training sketch follows the list):
- Batch Size: 32
- Epochs: 1
- Learning Rate: 2e-5
- Warmup Steps: 1648
- Optimizer: AdamW with weight decay of 0.01
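A minimal sketch of this setup, assuming the standard SentenceTransformers training API; the Norwegian example pairs are illustrative placeholders, not the actual translated MNLI data:

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Start from the base checkpoint; sentence-transformers adds mean
# pooling automatically when given a plain BERT model.
model = SentenceTransformer('NbAiLab/nb-bert-base')

# Placeholder (premise, entailment) pairs standing in for the
# machine-translated MNLI data; other sentences in a batch serve as
# in-batch negatives for the ranking loss.
train_examples = [
    InputExample(texts=["En mann spiser.", "En person spiser mat."]),
    InputExample(texts=["Barna leker ute.", "Noen barn er utendørs."]),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=32)
train_loss = losses.MultipleNegativesRankingLoss(model)  # cosine similarity by default

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    warmup_steps=1648,
    optimizer_params={'lr': 2e-5},
    weight_decay=0.01,
)
```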
Evaluation on the STS test set yielded a Pearson correlation of 0.8275 and a Spearman correlation of 0.8245 for cosine similarity.
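Such an STS evaluation can be reproduced with the built-in EmbeddingSimilarityEvaluator; in this hedged sketch, the sentence pairs and gold scores are tiny placeholders, not the actual test set:

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator

# Illustrative stand-ins for the Norwegian STS test split.
sentences1 = ["En mann spiller gitar.", "En kvinne leser en bok."]
sentences2 = ["En mann spiller en gitar.", "En hund løper i parken."]
gold_scores = [0.95, 0.05]  # similarity labels normalized to [0, 1]

model = SentenceTransformer('NbAiLab/nb-sbert-base')
evaluator = EmbeddingSimilarityEvaluator(sentences1, sentences2, gold_scores)
print(evaluator(model))  # correlates cosine scores with the gold labels
```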
Guide: Running Locally
Basic Steps
- Install SentenceTransformers:

```bash
pip install -U sentence-transformers
```
- Load and use the model:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('NbAiLab/nb-sbert-base')

# An English-Norwegian pair with the same meaning should score high.
sentences = ["This is a Norwegian boy", "Dette er en norsk gutt"]
embeddings = model.encode(sentences)

cosine_scores = util.cos_sim(embeddings[0], embeddings[1])
print(cosine_scores)
```
- Alternative with Hugging Face Transformers (mean pooling is applied manually to obtain sentence embeddings):

```python
from transformers import AutoTokenizer, AutoModel
import torch

tokenizer = AutoTokenizer.from_pretrained('NbAiLab/nb-sbert-base')
model = AutoModel.from_pretrained('NbAiLab/nb-sbert-base')

sentences = ["This is a Norwegian boy", "Dette er en norsk gutt"]
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

with torch.no_grad():
    model_output = model(**encoded_input)

# Mean pooling over the token embeddings, weighted by the attention
# mask, reproduces the model's pooling module.
token_embeddings = model_output[0]
mask = encoded_input['attention_mask'].unsqueeze(-1).float()
sentence_embeddings = (token_embeddings * mask).sum(dim=1) / mask.sum(dim=1)
```
Cloud GPUs
For larger datasets or faster processing, consider cloud GPU services such as AWS EC2, Google Cloud Platform, or Azure; a minimal encoding sketch follows.
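On a GPU instance, encoding is typically done in large batches. This sketch assumes a CUDA device is available and uses a made-up placeholder corpus:

```python
from sentence_transformers import SentenceTransformer

# Batch-encode a large corpus on a GPU (assumes CUDA is available).
model = SentenceTransformer('NbAiLab/nb-sbert-base', device='cuda')
corpus = [f"Dette er setning nummer {i}." for i in range(100_000)]  # placeholder
embeddings = model.encode(corpus, batch_size=256, show_progress_bar=True)
print(embeddings.shape)  # (100000, 768)
```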
License
NB-SBERT-BASE is released under the Apache 2.0 License.