sbert-cased-finnish-paraphrase

TurkuNLP

Introduction

The sbert-cased-finnish-paraphrase model by TurkuNLP is a Finnish Sentence-BERT (SBERT) model trained for sentence similarity. It is based on FinBERT and is intended for identifying paraphrases in Finnish text.

Architecture

The model architecture consists of a SentenceTransformer with two main components:

  1. Transformer: A cased BERT model with a maximum sequence length of 128 tokens; input text is not lowercased.
  2. Pooling Layer: Applies mean pooling over the token embeddings to produce a single sentence embedding.
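
To illustrate the pooling step, here is a minimal pure-Python sketch of masked mean pooling on toy data. The function name `mean_pool` and the 2-dimensional vectors are hypothetical and chosen for readability; the real model operates on much larger tensors.

```python
def mean_pool(token_embeddings, attention_mask):
    """Average token vectors, skipping positions where the mask is 0 (padding)."""
    dim = len(token_embeddings[0])
    summed = [0.0] * dim
    count = 0
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vector):
                summed[i] += value
    # Guard against an all-padding input to avoid division by zero.
    count = max(count, 1)
    return [s / count for s in summed]


# Toy input: three real tokens followed by one padding position.
tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [9.0, 9.0]]
mask = [1, 1, 1, 0]
sentence_vector = mean_pool(tokens, mask)
print(sentence_vector)  # [3.0, 4.0]
```

The padding vector `[9.0, 9.0]` has no effect on the result because its mask entry is 0, which is why the attention mask matters when sentences in a batch have different lengths.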

Training

  • Library: The model is implemented using the sentence-transformers library.
  • Base Model: TurkuNLP/bert-base-finnish-cased-v1 (FinBERT).
  • Dataset: Training data includes the Finnish Paraphrase Corpus and 5.5 million paraphrase candidates, with 500K positive and 5M negative samples.
  • Task: Binary classification of whether two sentences are paraphrases, with specific labels designating the paraphrase and non-paraphrase classes.

Guide: Running Locally

To run the model locally, use either the sentence-transformers library or Hugging Face Transformers.

Using SentenceTransformers

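from sentence_transformers import SentenceTransformer

sentences = ["Tämä on esimerkkilause.", "Tämä on toinen lause."]

# Downloads the model from the Hugging Face Hub on first use.
model = SentenceTransformer('TurkuNLP/sbert-cased-finnish-paraphrase')

# encode() returns one embedding vector per input sentence.
embeddings = model.encode(sentences)
print(embeddings)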
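
Since the model targets sentence similarity, a typical next step is to compare two embeddings. The following is a minimal sketch of cosine similarity in pure Python; the helper `cosine_similarity` and the toy vectors are hypothetical stand-ins for real model output.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # Treat a zero vector as having no similarity to anything.
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy vectors standing in for sentence embeddings.
same_direction = cosine_similarity([3.0, 4.0], [3.0, 4.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
print(same_direction, orthogonal)  # 1.0 0.0
```

With real output, the two rows of the embedding array returned by `encode` could be passed in as `a` and `b`. The sentence-transformers library also provides `util.cos_sim` for the same purpose.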

Using Hugging Face Transformers

from transformers import AutoTokenizer, AutoModel
import torch

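def mean_pooling(model_output, attention_mask):
    # The first element of model_output holds the per-token embeddings.
    token_embeddings = model_output[0]
    # Expand the mask to the embedding shape so padding tokens contribute zero.
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    # Sum the unmasked embeddings and divide by the number of real tokens.
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)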

sentences = ["Tämä on esimerkkilause.", "Tämä on toinen lause."]
tokenizer = AutoTokenizer.from_pretrained('TurkuNLP/sbert-cased-finnish-paraphrase')
model = AutoModel.from_pretrained('TurkuNLP/sbert-cased-finnish-paraphrase')

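# Tokenize the sentences as one padded batch of PyTorch tensors.
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
# Inference only, so gradient tracking is disabled.
with torch.no_grad():
    model_output = model(**encoded_input)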

sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
print("Sentence embeddings:")
print(sentence_embeddings)
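
Once embeddings are available, one way to look for paraphrase candidates is to rank a set of sentences by similarity to a query sentence. The sketch below assumes plain Python lists of floats; `l2_normalize`, `rank_by_similarity`, and the toy vectors are hypothetical and not part of the model or either library.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm > 0 else list(vector)


def rank_by_similarity(query, candidates):
    """Return (candidate_index, score) pairs, most similar first."""
    q = l2_normalize(query)
    scored = []
    for index, candidate in enumerate(candidates):
        c = l2_normalize(candidate)
        # The dot product of unit vectors equals their cosine similarity.
        scored.append((sum(x * y for x, y in zip(q, c)), index))
    scored.sort(reverse=True)
    return [(index, score) for score, index in scored]


# Toy vectors standing in for sentence embeddings.
query = [1.0, 0.0]
candidates = [[0.0, 2.0], [3.0, 0.0], [1.0, 1.0]]
ranking = rank_by_similarity(query, candidates)
print([index for index, _ in ranking])  # [1, 2, 0]
```

To use this with the tensor produced above, it could first be converted with `sentence_embeddings.tolist()`. Any cutoff score for treating a pair as a paraphrase would need to be chosen empirically; none is given in the model documentation.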

Suggested Cloud GPUs

For efficient processing, consider using cloud GPUs like AWS EC2 P3 instances, Google Cloud GPUs, or Azure NV-series VMs.

License

The license for this model is not stated in the available documentation. Refer to the TurkuNLP repository or contact the authors for the licensing terms.
