Paraphrase-MPNET-Base-V2-Fuzzy-Matcher

Author: shahrukhx01

Introduction

Paraphrase-MPNET-Base-V2-Fuzzy-Matcher is a model for fuzzy string matching built on character-level embeddings. It uses a Siamese BERT architecture to produce embeddings whose cosine similarity reflects spelling-level closeness, making it useful for tasks such as entity resolution, record linkage, and structured data search.

Architecture

The model uses a Siamese BERT architecture based on MPNet and operates on character-level token embeddings: each input string is split into its individual characters before encoding, so similarity is measured at the spelling level rather than the word level. The model can be loaded via the Sentence Transformers library or directly through Hugging Face Transformers.
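
As a quick illustration of what character-level input means in practice (the helper name below is ours, not part of any library; the same preprocessing appears in the usage examples further down):

    # Space-separate the characters so the tokenizer treats each
    # character as its own token.
    def to_char_tokens(text: str) -> str:  # hypothetical helper
        return " ".join(text)

    print(to_char_tokens("fuzzformer"))  # -> "f u z z f o r m e r"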

Training

The model is trained on character-level token inputs, which makes it suitable for fuzzy matching tasks. Training follows the Sentence Transformers framework, which provides the Siamese training setup and similarity-based losses used for this kind of matching.
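
The exact training data and loss are not documented here; the following is a minimal sketch of how such a model could be fine-tuned with Sentence Transformers, assuming labeled pairs of character-spaced strings (the pair data, base checkpoint, and choice of CosineSimilarityLoss are illustrative assumptions):

    from torch.utils.data import DataLoader
    from sentence_transformers import SentenceTransformer, InputExample, losses

    # Illustrative pairs: character-spaced strings with a similarity
    # label in [0, 1]. The real training data is not published.
    train_examples = [
        InputExample(texts=[" ".join("fuzzformer"), " ".join("fizzformer")], label=0.9),
        InputExample(texts=[" ".join("fuzzformer"), " ".join("watermelon")], label=0.1),
    ]
    train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)

    model = SentenceTransformer('paraphrase-mpnet-base-v2')  # assumed base checkpoint
    train_loss = losses.CosineSimilarityLoss(model)

    # Short fine-tuning pass; epochs and warmup are placeholder values.
    model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=10)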

Guide: Running Locally

To use the Paraphrase-MPNET-Base-V2-Fuzzy-Matcher model locally, follow the steps below:

  1. Install Dependencies
    Ensure you have the required libraries installed:

    pip install -U sentence-transformers
    pip install transformers torch
    
  2. Using Sentence Transformers
    Load and use the model with the Sentence Transformers library:

    from sentence_transformers import SentenceTransformer, util

    # Space out the characters of each string so the model receives
    # character-level tokens.
    word1 = " ".join("fuzzformer")
    word2 = " ".join("fizzformer")
    words = [word1, word2]

    model = SentenceTransformer('shahrukhx01/paraphrase-mpnet-base-v2-fuzzy-matcher')
    fuzzy_embeddings = model.encode(words)

    # Cosine similarity close to 1 indicates a likely fuzzy match.
    print(util.cos_sim(fuzzy_embeddings[0], fuzzy_embeddings[1]))
    
  3. Using Hugging Face Transformers
    Alternatively, use the model with Hugging Face Transformers:

    import torch
    from transformers import AutoTokenizer, AutoModel

    def cos_sim(a: torch.Tensor, b: torch.Tensor):
        # Promote 1-D embeddings to shape (1, dim) so the matrix
        # multiply below works for single vectors as well as batches.
        if a.dim() == 1:
            a = a.unsqueeze(0)
        if b.dim() == 1:
            b = b.unsqueeze(0)
        a_norm = torch.nn.functional.normalize(a, p=2, dim=1)
        b_norm = torch.nn.functional.normalize(b, p=2, dim=1)
        return torch.mm(a_norm, b_norm.transpose(0, 1))

    def mean_pooling(model_output, attention_mask):
        # Average the token embeddings, ignoring padding positions.
        token_embeddings = model_output[0]
        input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
        return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

    # Space out the characters of each string so the model receives
    # character-level tokens.
    word1 = " ".join("fuzzformer")
    word2 = " ".join("fizzformer")
    words = [word1, word2]

    tokenizer = AutoTokenizer.from_pretrained('shahrukhx01/paraphrase-mpnet-base-v2-fuzzy-matcher')
    model = AutoModel.from_pretrained('shahrukhx01/paraphrase-mpnet-base-v2-fuzzy-matcher')
    encoded_input = tokenizer(words, padding=True, truncation=True, return_tensors='pt')

    with torch.no_grad():
        model_output = model(**encoded_input)

    fuzzy_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
    print(cos_sim(fuzzy_embeddings[0], fuzzy_embeddings[1]))
    
  4. Cloud GPU Suggestion
    For faster inference, especially on large batches, consider running the model on a cloud GPU from a provider such as AWS, GCP, or Azure. A batch-matching sketch that builds on step 2 follows this guide.
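
Building on step 2, the sketch below matches a query string against a small candidate list, as in a simple entity-resolution setting. The query, candidates, helper name, and the 0.8 threshold are illustrative assumptions, not part of the original model card:

    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer('shahrukhx01/paraphrase-mpnet-base-v2-fuzzy-matcher')

    def to_chars(text: str) -> str:
        # Hypothetical helper: space out characters for character-level tokens.
        return " ".join(text)

    query = "jon smith"
    candidates = ["john smith", "jane smyth", "joan smitt", "bob jones"]

    # Encode the query and all candidates in a single batch.
    embeddings = model.encode([to_chars(query)] + [to_chars(c) for c in candidates])
    scores = util.cos_sim(embeddings[0], embeddings[1:])[0]

    # Report the best candidate; 0.8 is an arbitrary example threshold.
    best = int(scores.argmax())
    if float(scores[best]) >= 0.8:
        print(f"match: {candidates[best]} (score={float(scores[best]):.3f})")
    else:
        print("no confident match")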

License

No license is stated for this model. Refer to the Hugging Face model repository or contact the author for specific licensing information.
