Paraphrase-MPNET-Base-V2-Fuzzy-Matcher

Author: shahrukhx01

Introduction

Paraphrase-MPNET-Base-V2-Fuzzy-Matcher is a model for fuzzy string matching built on character-level embeddings. It uses a Siamese BERT architecture to produce embeddings whose cosine similarity reflects spelling-level closeness, making it useful for tasks such as entity resolution, record linkage, and structured data search.

Architecture

The model uses a Siamese BERT architecture based on MPNet and operates on character-level token embeddings: each input string is split into its individual characters before encoding, so similarity is measured at the spelling level rather than the word level. The model can be loaded via the Sentence Transformers library or directly through Hugging Face Transformers.
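
As a quick illustration of what character-level input means in practice (the helper name below is ours, not part of any library; the same preprocessing appears in the usage examples further down):

    # Space-separate the characters so the tokenizer treats each
    # character as its own token.
    def to_char_tokens(text: str) -> str:  # hypothetical helper
        return " ".join(text)

    print(to_char_tokens("fuzzformer"))  # -> "f u z z f o r m e r"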

Training

The model is trained on character-level token inputs, which makes it suitable for fuzzy matching tasks. Training follows the Sentence Transformers framework, which provides the Siamese training setup and similarity-based losses used for this kind of matching.
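
The exact training data and loss are not documented here; the following is a minimal sketch of how such a model could be fine-tuned with Sentence Transformers, assuming labeled pairs of character-spaced strings (the pair data, base checkpoint, and choice of CosineSimilarityLoss are illustrative assumptions):

    from torch.utils.data import DataLoader
    from sentence_transformers import SentenceTransformer, InputExample, losses

    # Illustrative pairs: character-spaced strings with a similarity
    # label in [0, 1]. The real training data is not published.
    train_examples = [
        InputExample(texts=[" ".join("fuzzformer"), " ".join("fizzformer")], label=0.9),
        InputExample(texts=[" ".join("fuzzformer"), " ".join("watermelon")], label=0.1),
    ]
    train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)

    model = SentenceTransformer('paraphrase-mpnet-base-v2')  # assumed base checkpoint
    train_loss = losses.CosineSimilarityLoss(model)

    # Short fine-tuning pass; epochs and warmup are placeholder values.
    model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=10)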

Guide: Running Locally

To use the Paraphrase-MPNET-Base-V2-Fuzzy-Matcher model locally, follow the steps below:

  1. Install Dependencies
    Ensure you have the required libraries installed:

    pip install -U sentence-transformers
    pip install transformers torch
    
  2. Using Sentence Transformers
    Load and use the model with the Sentence Transformers library:

    from sentence_transformers import SentenceTransformer, util

    # Space out the characters of each string so the model receives
    # character-level tokens.
    word1 = " ".join("fuzzformer")
    word2 = " ".join("fizzformer")
    words = [word1, word2]

    model = SentenceTransformer('shahrukhx01/paraphrase-mpnet-base-v2-fuzzy-matcher')
    fuzzy_embeddings = model.encode(words)

    # Cosine similarity close to 1 indicates a likely fuzzy match.
    print(util.cos_sim(fuzzy_embeddings[0], fuzzy_embeddings[1]))
    
  3. Using Hugging Face Transformers
    Alternatively, use the model with Hugging Face Transformers:

    import torch
    from transformers import AutoTokenizer, AutoModel

    def cos_sim(a: torch.Tensor, b: torch.Tensor):
        # Promote 1-D embeddings to shape (1, dim) so the matrix
        # multiply below works for single vectors as well as batches.
        if a.dim() == 1:
            a = a.unsqueeze(0)
        if b.dim() == 1:
            b = b.unsqueeze(0)
        a_norm = torch.nn.functional.normalize(a, p=2, dim=1)
        b_norm = torch.nn.functional.normalize(b, p=2, dim=1)
        return torch.mm(a_norm, b_norm.transpose(0, 1))

    def mean_pooling(model_output, attention_mask):
        # Average the token embeddings, ignoring padding positions.
        token_embeddings = model_output[0]
        input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
        return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

    # Space out the characters of each string so the model receives
    # character-level tokens.
    word1 = " ".join("fuzzformer")
    word2 = " ".join("fizzformer")
    words = [word1, word2]

    tokenizer = AutoTokenizer.from_pretrained('shahrukhx01/paraphrase-mpnet-base-v2-fuzzy-matcher')
    model = AutoModel.from_pretrained('shahrukhx01/paraphrase-mpnet-base-v2-fuzzy-matcher')
    encoded_input = tokenizer(words, padding=True, truncation=True, return_tensors='pt')

    with torch.no_grad():
        model_output = model(**encoded_input)

    fuzzy_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
    print(cos_sim(fuzzy_embeddings[0], fuzzy_embeddings[1]))
    
  4. Cloud GPU Suggestion
    For faster inference, especially on large batches, consider running the model on a cloud GPU from a provider such as AWS, GCP, or Azure. A batch-matching sketch that builds on step 2 follows this guide.
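
Building on step 2, the sketch below matches a query string against a small candidate list, as in a simple entity-resolution setting. The query, candidates, helper name, and the 0.8 threshold are illustrative assumptions, not part of the original model card:

    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer('shahrukhx01/paraphrase-mpnet-base-v2-fuzzy-matcher')

    def to_chars(text: str) -> str:
        # Hypothetical helper: space out characters for character-level tokens.
        return " ".join(text)

    query = "jon smith"
    candidates = ["john smith", "jane smyth", "joan smitt", "bob jones"]

    # Encode the query and all candidates in a single batch.
    embeddings = model.encode([to_chars(query)] + [to_chars(c) for c in candidates])
    scores = util.cos_sim(embeddings[0], embeddings[1:])[0]

    # Report the best candidate; 0.8 is an arbitrary example threshold.
    best = int(scores.argmax())
    if float(scores[best]) >= 0.8:
        print(f"match: {candidates[best]} (score={float(scores[best]):.3f})")
    else:
        print("no confident match")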

License

No license is stated for this model. Refer to the Hugging Face model repository or contact the author for specific licensing information.
