sbert cased finnish paraphrase
TurkuNLP
Introduction
The sbert-cased-finnish-paraphrase model by TurkuNLP is a Finnish Sentence BERT model trained for sentence similarity tasks. It is based on FinBERT and is designed to identify paraphrases in Finnish text.
Architecture
The model architecture consists of a SentenceTransformer with two main components:
- Transformer: Uses a BERT model with a maximum sequence length of 128 tokens and does not lowercase input text (the model is cased).
- Pooling Layer: Applies mean pooling on word embeddings to generate sentence embeddings.
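To make the pooling step concrete, here is a minimal sketch of masked mean pooling in plain Python. The helper name mean_pool and the toy two-dimensional vectors are illustrative assumptions, not part of the model's code; real token embeddings come from the BERT component and have many more dimensions.

```python
def mean_pool(token_embeddings, attention_mask):
    """Average token vectors, skipping positions where the mask is 0 (padding)."""
    dim = len(token_embeddings[0])
    summed = [0.0] * dim
    count = 0
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vector):
                summed[i] += value
    # Guard against an all-padding input to avoid division by zero.
    count = max(count, 1)
    return [s / count for s in summed]


# Toy input (hypothetical values): three real tokens, then one padding position.
tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [9.0, 9.0]]
mask = [1, 1, 1, 0]

sentence_embedding = mean_pool(tokens, mask)
print(sentence_embedding)  # [3.0, 4.0]
```

The padding vector is excluded from the average, so sentences of different lengths in the same batch yield embeddings that depend only on their real tokens.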
Training
- Library: The model is implemented using the sentence-transformers library.
- Base Model: It uses TurkuNLP/bert-base-finnish-cased-v1.
- Dataset: Training data includes the Finnish Paraphrase Corpus and 5.5 million paraphrase candidates, with 500K positive and 5M negative samples.
- Task: Binary classification of sentence pairs, with labels marking each pair as a paraphrase or a non-paraphrase.
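At inference time, one common way to turn sentence embeddings into a paraphrase decision is to threshold their cosine similarity. The sketch below assumes that approach. The source does not state how pairs are scored, so the function names, the threshold value of 0.8, and the toy three-dimensional vectors are all hypothetical and would need tuning on labeled data.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)


# Hypothetical cut-off; not a value published for this model.
PARAPHRASE_THRESHOLD = 0.8


def is_paraphrase(emb_a, emb_b, threshold=PARAPHRASE_THRESHOLD):
    """Label a pair as a paraphrase when similarity reaches the threshold."""
    return cosine_similarity(emb_a, emb_b) >= threshold


# Toy embeddings (made-up values standing in for model output).
emb_1 = [0.9, 0.1, 0.2]
emb_2 = [0.8, 0.2, 0.1]
emb_3 = [-0.1, 0.9, 0.3]

print(is_paraphrase(emb_1, emb_2))  # True
print(is_paraphrase(emb_1, emb_3))  # False
```

In practice the two input vectors would be rows of the array returned by the model's encode call, and the threshold would be chosen on a held-out set of labeled pairs.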
Guide: Running Locally
To run the model locally, you can use either the sentence-transformers library or HuggingFace Transformers.
Using SentenceTransformers
from sentence_transformers import SentenceTransformer
sentences = ["Tämä on esimerkkilause.", "Tämä on toinen lause."]
model = SentenceTransformer('TurkuNLP/sbert-cased-finnish-paraphrase')
embeddings = model.encode(sentences)
print(embeddings)
Using HuggingFace Transformers
from transformers import AutoTokenizer, AutoModel
import torch

# Mean pooling: average the token embeddings, using the attention mask
# so that padding tokens do not contribute.
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]  # first element holds all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

sentences = ["Tämä on esimerkkilause.", "Tämä on toinen lause."]

# Load the tokenizer and model from the HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained('TurkuNLP/sbert-cased-finnish-paraphrase')
model = AutoModel.from_pretrained('TurkuNLP/sbert-cased-finnish-paraphrase')

# Tokenize the sentences as a padded batch
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings without tracking gradients
with torch.no_grad():
    model_output = model(**encoded_input)

# Pool the token embeddings into one vector per sentence
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
print("Sentence embeddings:")
print(sentence_embeddings)
Suggested Cloud GPUs
For efficient processing, consider using cloud GPUs like AWS EC2 P3 instances, Google Cloud GPUs, or Azure NV-series VMs.
License
The licensing information for this model is not provided in the given documentation. Please refer to the TurkuNLP repository or contact the authors for detailed licensing terms.