PatentSBERTa
Introduction
PatentSBERTa is a deep natural language processing (NLP) model for patent distance and classification, built on an augmented Sentence-BERT (SBERT) architecture. Developed by the AI-Growth-Lab at Aalborg University Business School, the model maps sentences and paragraphs to a 768-dimensional dense vector space, which supports tasks such as clustering and semantic search.
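As a quick illustration of the distance use case, the sketch below embeds two hypothetical claim texts and scores them with cosine similarity. The claim texts are placeholders, and 1 - cosine similarity is one common distance choice, not necessarily the authors' exact metric:
from sentence_transformers import SentenceTransformer, util

# Load PatentSBERTa from the Hugging Face Hub
model = SentenceTransformer('AI-Growth-Lab/PatentSBERTa')

# Placeholder claim texts, for illustration only
claims = [
    "A method for wireless charging of an electric vehicle battery.",
    "An apparatus for inductive power transfer to a vehicle energy store.",
]

embeddings = model.encode(claims, convert_to_tensor=True)

# Cosine similarity between the two 768-dimensional embeddings
similarity = util.cos_sim(embeddings[0], embeddings[1]).item()
print(f"cosine similarity: {similarity:.4f}")
print(f"cosine distance:   {1 - similarity:.4f}")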
Architecture
The PatentSBERTa model is built using the SentenceTransformer framework, which incorporates:
- A Transformer model (MPNetModel) with a maximum sequence length of 512 and no lowercasing.
- A pooling layer configured for CLS token pooling, which outputs embeddings of dimension 768.
SentenceTransformer(
(0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: MPNetModel
(1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': True})
)
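Assuming the sentence-transformers package (installation is covered below), these settings can be checked programmatically:
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('AI-Growth-Lab/PatentSBERTa')

# Both values should match the configuration printed above
print(model.max_seq_length)                      # 512
print(model.get_sentence_embedding_dimension())  # 768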
Training
The model was trained using the following parameters:
- DataLoader: Batch size of 16, utilizing a random sampler.
- Loss Function: CosineSimilarityLoss.
- Epochs: 1, using the AdamW optimizer with a learning rate of 2e-05.
- Schedule: a warmup phase of 100 steps and a weight decay of 0.01 (see the sketch below).
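The following is a minimal sketch of how these parameters map onto the classic sentence-transformers fit API; the training pairs shown are placeholders, not the authors' dataset:
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Placeholder sentence pairs with similarity labels, for illustration only
train_examples = [
    InputExample(texts=["claim text A", "claim text B"], label=0.8),
    InputExample(texts=["claim text C", "claim text D"], label=0.2),
]

# Batch size 16 with random sampling, as described above
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)

model = SentenceTransformer('AI-Growth-Lab/PatentSBERTa')
train_loss = losses.CosineSimilarityLoss(model)

# One epoch, AdamW with lr 2e-05, 100 warmup steps, weight decay 0.01
model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    warmup_steps=100,
    optimizer_params={'lr': 2e-05},
    weight_decay=0.01,
)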
Guide: Running Locally
Installation
To use PatentSBERTa, first install the sentence-transformers package:
pip install -U sentence-transformers
Using Sentence-Transformers
from sentence_transformers import SentenceTransformer

sentences = ["This is an example sentence", "Each sentence is converted"]

# Load the model from the Hugging Face Hub and encode the sentences
model = SentenceTransformer('AI-Growth-Lab/PatentSBERTa')
embeddings = model.encode(sentences)
print(embeddings)
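By default, model.encode returns a NumPy array; for the two sentences above, embeddings.shape is (2, 768).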
Using Hugging Face Transformers
from transformers import AutoTokenizer, AutoModel
import torch

def cls_pooling(model_output, attention_mask):
    # Use the embedding of the first ([CLS]) token as the sentence embedding
    return model_output[0][:, 0]

sentences = ['This is an example sentence', 'Each sentence is converted']

tokenizer = AutoTokenizer.from_pretrained('AI-Growth-Lab/PatentSBERTa')
model = AutoModel.from_pretrained('AI-Growth-Lab/PatentSBERTa')

# Tokenize with padding and truncation to the model's maximum length
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

with torch.no_grad():
    model_output = model(**encoded_input)

sentence_embeddings = cls_pooling(model_output, encoded_input['attention_mask'])

print("Sentence embeddings:")
print(sentence_embeddings)
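Because the model's pooling layer is configured for CLS-token pooling (see the architecture above), this manual cls_pooling should reproduce the sentence-transformers embeddings up to floating-point differences.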
Cloud GPU Suggestion
For optimal performance, consider running the model on a cloud service that provides GPU support, such as AWS, Google Cloud, or Azure.
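On such an instance, the model can be placed on the GPU explicitly; a short sketch (the batch size here is an illustrative choice):
import torch
from sentence_transformers import SentenceTransformer

# Use the GPU when available, otherwise fall back to the CPU
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = SentenceTransformer('AI-Growth-Lab/PatentSBERTa', device=device)

embeddings = model.encode(["An example patent claim."], batch_size=32)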
License
PatentSBERTa is available under the licensing terms specified by the authors. For more details, refer to the project repository on GitHub: PatentSBERTa GitHub.