Paraphrase-MPNET-Base-V2-Fuzzy-Matcher
shahrukhx01/paraphrase-mpnet-base-v2-fuzzy-matcher
Introduction
The Paraphrase-MPNET-Base-V2-Fuzzy-Matcher is a model for fuzzy string matching based on character-level embeddings. It employs a Siamese BERT architecture to map strings to vectors whose cosine similarity reflects surface-form similarity, making it useful for tasks such as entity resolution, record linkage, and search over structured data.
Architecture
This model uses a Siamese BERT architecture operating on character-level tokens: each input string is split into individual characters before tokenization, so the learned embeddings capture surface-form similarity rather than sentence meaning. It is based on the MPNet model and can be used via the Sentence Transformers library or directly through Hugging Face Transformers.
Training
The model is trained on character-level token inputs, which is what makes it suitable for fuzzy matching: a small spelling difference changes only a few character tokens, so near-identical strings remain close in embedding space. Training follows the Sentence Transformers framework, the standard toolkit for fine-tuning Siamese sentence-embedding models.
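Concretely, character-level input just means each string is pre-split into space-separated characters before tokenization, so every character becomes its own token. A minimal illustration of the preprocessing used throughout the examples below:

    word = "fuzzformer"
    char_input = " ".join([char for char in word])  # equivalent to " ".join(word)
    print(char_input)  # f u z z f o r m e r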
Guide: Running Locally
To use the Paraphrase-MPNET-Base-V2-Fuzzy-Matcher model locally, follow the steps below:
- Install Dependencies

  Ensure you have the required libraries installed:

      pip install -U sentence-transformers
      pip install transformers torch
- Using Sentence Transformers

  Load and use the model with the Sentence Transformers library:

      from sentence_transformers import SentenceTransformer, util

      # The model expects character-level input, so split each word into
      # space-separated characters first.
      word1 = " ".join([char for char in "fuzzformer"])  # "f u z z f o r m e r"
      word2 = " ".join([char for char in "fizzformer"])  # "f i z z f o r m e r"
      words = [word1, word2]

      model = SentenceTransformer('shahrukhx01/paraphrase-mpnet-base-v2-fuzzy-matcher')
      fuzzy_embeddings = model.encode(words)

      # Cosine similarity close to 1.0 indicates a likely fuzzy match.
      print(util.cos_sim(fuzzy_embeddings[0], fuzzy_embeddings[1]))
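  Building on this, the model supports the retrieval-style lookups mentioned in the introduction (entity resolution, record linkage): embed a list of candidate strings once, then rank them against a query. A minimal sketch using util.semantic_search from Sentence Transformers; the candidate strings and query are illustrative:

      from sentence_transformers import SentenceTransformer, util

      model = SentenceTransformer('shahrukhx01/paraphrase-mpnet-base-v2-fuzzy-matcher')

      # Candidate records and a misspelled query, all split into characters.
      candidates = ["fuzzformer", "transformer", "informer", "reformer"]
      corpus_embeddings = model.encode([" ".join(c) for c in candidates], convert_to_tensor=True)
      query_embedding = model.encode(" ".join("fuzformer"), convert_to_tensor=True)

      # Rank candidates by cosine similarity to the query.
      hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=3)[0]
      for hit in hits:
          print(candidates[hit['corpus_id']], round(hit['score'], 3))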
- Using Hugging Face Transformers

  Alternatively, use the model with Hugging Face Transformers:

      import torch
      from transformers import AutoTokenizer, AutoModel

      def cos_sim(a: torch.Tensor, b: torch.Tensor):
          # Cosine similarity matrix between two batches of vectors.
          a_norm = torch.nn.functional.normalize(a, p=2, dim=1)
          b_norm = torch.nn.functional.normalize(b, p=2, dim=1)
          return torch.mm(a_norm, b_norm.transpose(0, 1))

      def mean_pooling(model_output, attention_mask):
          # Average the token embeddings, ignoring padding positions.
          token_embeddings = model_output[0]
          input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
          return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

      # Character-level input: split each word into space-separated characters.
      word1 = " ".join([char for char in "fuzzformer"])
      word2 = " ".join([char for char in "fizzformer"])
      words = [word1, word2]

      tokenizer = AutoTokenizer.from_pretrained('shahrukhx01/paraphrase-mpnet-base-v2-fuzzy-matcher')
      model = AutoModel.from_pretrained('shahrukhx01/paraphrase-mpnet-base-v2-fuzzy-matcher')

      encoded_input = tokenizer(words, padding=True, truncation=True, return_tensors='pt')
      with torch.no_grad():
          model_output = model(**encoded_input)

      fuzzy_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])

      # cos_sim expects 2-D inputs, so keep a batch dimension on each embedding.
      print(cos_sim(fuzzy_embeddings[0].unsqueeze(0), fuzzy_embeddings[1].unsqueeze(0)))
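  Since the cos_sim helper returns a full similarity matrix, scoring several strings against each other needs no extra machinery. A small extension reusing the tokenizer, model, and helpers defined above (the example strings are illustrative):

      # Score several character-split strings against each other in one pass.
      names = [" ".join(n) for n in ["fuzzformer", "fizzformer", "transformer", "reformer"]]
      enc = tokenizer(names, padding=True, truncation=True, return_tensors='pt')
      with torch.no_grad():
          out = model(**enc)
      embs = mean_pooling(out, enc['attention_mask'])
      print(cos_sim(embs, embs))  # 4x4 matrix of pairwise cosine similarities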
- Cloud GPU Suggestion

  For faster inference, particularly when embedding large batches of strings, consider running the model on a cloud GPU from providers such as AWS, GCP, or Azure; see the device-selection sketch below.
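  Sentence Transformers can be pointed at a GPU via the device argument; a minimal sketch that falls back to CPU when no GPU is available:

      import torch
      from sentence_transformers import SentenceTransformer

      device = "cuda" if torch.cuda.is_available() else "cpu"
      model = SentenceTransformer('shahrukhx01/paraphrase-mpnet-base-v2-fuzzy-matcher', device=device)

      # Larger batch sizes generally improve GPU throughput.
      embeddings = model.encode([" ".join([char for char in "fuzzformer"])], batch_size=64)
      print(embeddings.shape)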
License
No license is explicitly stated for this model. Refer to the Hugging Face model repository or contact the author for specific licensing information.