PatentSBERTa

AI-Growth-Lab

Introduction

PatentSBERTa is a deep natural language processing (NLP) model for patent distance and classification, built on the Augmented Sentence-BERT (SBERT) approach. Developed by the AI-Growth-Lab at Aalborg University Business School, the model maps sentences and paragraphs to a 768-dimensional dense vector space, supporting tasks such as clustering and semantic search.

Architecture

The PatentSBERTa model is built using the SentenceTransformer framework, which incorporates:

  • A Transformer module (MPNetModel) with a maximum sequence length of 512 tokens and no lowercasing.
  • A pooling layer configured for CLS-token pooling, which outputs 768-dimensional embeddings.

The corresponding SentenceTransformer configuration:
SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: MPNetModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': True})
)
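
For illustration, the same two-module stack can be assembled by hand with the sentence-transformers modules API. This is a minimal sketch: 'microsoft/mpnet-base' is assumed here as a stand-in backbone, whereas the released checkpoint ships its own fine-tuned weights, so loading the model by name (as shown under "Running Locally") is the normal route.

from sentence_transformers import SentenceTransformer, models

# Transformer module: MPNet backbone, 512-token window, no lowercasing.
# 'microsoft/mpnet-base' is a stand-in; the released PatentSBERTa
# checkpoint bundles its own fine-tuned weights.
word_embedding_model = models.Transformer(
    'microsoft/mpnet-base', max_seq_length=512, do_lower_case=False
)

# Pooling module: take the CLS token as the 768-dimensional sentence embedding.
pooling_model = models.Pooling(
    word_embedding_model.get_word_embedding_dimension(),
    pooling_mode_cls_token=True,
    pooling_mode_mean_tokens=False,
)

model = SentenceTransformer(modules=[word_embedding_model, pooling_model])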

Training

The model was trained using the following parameters (a hedged code sketch follows the list):

  • DataLoader: batch size of 16 with a random sampler.
  • Loss function: CosineSimilarityLoss.
  • Optimizer: AdamW with a learning rate of 2e-05, run for 1 epoch.
  • Schedule: 100 warmup steps and a weight decay of 0.01.
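
The sketch below reproduces that configuration with the classic sentence-transformers fit API. The training pairs are placeholders; the real pairs and similarity labels come from the authors' patent data, and starting from the published checkpoint is purely illustrative.

from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer('AI-Growth-Lab/PatentSBERTa')

# Placeholder pairs: (text_a, text_b, similarity label in [0, 1]).
train_examples = [
    InputExample(texts=['claim text A', 'claim text B'], label=0.9),
    InputExample(texts=['claim text C', 'claim text D'], label=0.1),
]

# Batch size 16; shuffle=True gives a random sampler.
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.CosineSimilarityLoss(model)

# One epoch, AdamW, lr 2e-05, 100 warmup steps, weight decay 0.01.
model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    warmup_steps=100,
    optimizer_params={'lr': 2e-05},
    weight_decay=0.01,
)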

Guide: Running Locally

Installation

To use PatentSBERTa, first install the sentence-transformers package:

pip install -U sentence-transformers

Using Sentence-Transformers

from sentence_transformers import SentenceTransformer

sentences = ["This is an example sentence", "Each sentence is converted"]
# Load the published checkpoint from the Hugging Face Hub.
model = SentenceTransformer('AI-Growth-Lab/PatentSBERTa')
# Encode: each sentence becomes a 768-dimensional embedding.
embeddings = model.encode(sentences)
print(embeddings)
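
Because the model targets patent distance, a natural follow-up is scoring pairwise similarity between the embeddings. The snippet below uses the util helpers from sentence-transformers, with the two example sentences standing in for patent claims:

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('AI-Growth-Lab/PatentSBERTa')

claims = ["This is an example sentence", "Each sentence is converted"]
embeddings = model.encode(claims, convert_to_tensor=True)

# Pairwise cosine similarity; higher scores mean smaller patent distance.
scores = util.cos_sim(embeddings, embeddings)
print(scores)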

Using Hugging Face Transformers

from transformers import AutoTokenizer, AutoModel
import torch

def cls_pooling(model_output, attention_mask):
    # CLS pooling: take the hidden state of the first ([CLS]) token.
    # The attention mask is unused here; it is kept in the signature for
    # parity with mean-pooling variants of this helper.
    return model_output[0][:, 0]

sentences = ['This is an example sentence', 'Each sentence is converted']
tokenizer = AutoTokenizer.from_pretrained('AI-Growth-Lab/PatentSBERTa')
model = AutoModel.from_pretrained('AI-Growth-Lab/PatentSBERTa')
# Tokenize with padding and truncation (up to the 512-token maximum).
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

with torch.no_grad():
    model_output = model(**encoded_input)

sentence_embeddings = cls_pooling(model_output, encoded_input['attention_mask'])
print("Sentence embeddings:")
print(sentence_embeddings)
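
To turn these raw embeddings into a distance, one option is cosine distance computed directly in PyTorch. This minimal sketch continues from the sentence_embeddings tensor produced above:

import torch.nn.functional as F

# Cosine similarity between the two example embeddings.
similarity = F.cosine_similarity(sentence_embeddings[0], sentence_embeddings[1], dim=0)

# Cosine distance = 1 - cosine similarity.
print("Cosine distance:", (1 - similarity).item())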

Cloud GPU Suggestion

For faster encoding of large patent corpora, consider running the model on a cloud service with GPU support, such as AWS, Google Cloud, or Azure.
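
As a small illustration, the target device can be selected explicitly when loading the model; this assumes a CUDA-capable GPU on the instance and falls back to CPU otherwise:

import torch
from sentence_transformers import SentenceTransformer

# Use the GPU when one is available, otherwise fall back to CPU.
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = SentenceTransformer('AI-Growth-Lab/PatentSBERTa', device=device)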

License

PatentSBERTa is available under the licensing terms specified by the authors; for details, refer to the PatentSBERTa repository on GitHub.
