polyBERT


Introduction

polyBERT is a chemical language model designed for ultrafast polymer informatics. It converts PSMILES (polymer SMILES) strings into 600-dimensional dense fingerprints that numerically represent polymer chemical structures. The model is published on the Hugging Face Hub as kuelumbus/polyBERT and is tagged for sentence-similarity tasks.
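
As a quick illustration of the input and output formats, the sketch below encodes a single PSMILES string and checks the fingerprint dimensionality (the expected shape follows from the description above; it is not output copied from the authors):

    from sentence_transformers import SentenceTransformer

    # [*]CC[*] is the PSMILES string for polyethylene; the [*] atoms
    # mark the endpoints of the repeat unit.
    polyBERT = SentenceTransformer('kuelumbus/polyBERT')
    fingerprint = polyBERT.encode(["[*]CC[*]"])
    print(fingerprint.shape)  # expected: (1, 600)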

Architecture

The polyBERT model is built using the SentenceTransformer framework. It consists of two main components:

  • A Transformer module wrapping a DebertaV2Model, which processes tokenized PSMILES sequences up to a maximum length of 512 tokens.
  • A Pooling layer configured to compute mean token embeddings, producing a 600-dimensional fingerprint for each input.
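
This composition can be verified by printing the loaded model; a minimal sketch (the exact repr text may differ across sentence-transformers versions):

    from sentence_transformers import SentenceTransformer

    polyBERT = SentenceTransformer('kuelumbus/polyBERT')

    # The printed pipeline should list the two modules, roughly:
    #   (0): Transformer({'max_seq_length': 512, ...}) with a DebertaV2Model
    #   (1): Pooling({..., 'pooling_mode_mean_tokens': True, ...})
    print(polyBERT)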

Training

The model is trained to map PSMILES strings to vector representations that capture polymer chemical structure. For detailed evaluation results and training methodology, refer to the GitHub repository and the associated research paper.

Guide: Running Locally

To use polyBERT locally, follow these steps:

  1. Install Dependencies:

    pip install sentence-transformers
    
  2. Using Sentence-Transformers:

    from sentence_transformers import SentenceTransformer

    # PSMILES strings; the [*] atoms mark the endpoints of the repeat unit
    psmiles_strings = ["[*]CC[*]", "[*]COC[*]"]

    polyBERT = SentenceTransformer('kuelumbus/polyBERT')

    # Each row of the output is a 600-dimensional polymer fingerprint
    # (see the similarity sketch after this list)
    embeddings = polyBERT.encode(psmiles_strings)
    print(embeddings)
    
  3. Using Hugging Face Transformers:

    from transformers import AutoTokenizer, AutoModel
    import torch
    
    def mean_pooling(model_output, attention_mask):
        # First element of model_output holds the token embeddings (last hidden state)
        token_embeddings = model_output[0]
        # Expand the attention mask so padding tokens are excluded from the average
        input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
        return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)
    
    psmiles_strings = ["[*]CC[*]", "[*]COC[*]"]
    tokenizer = AutoTokenizer.from_pretrained('kuelumbus/polyBERT')
    polyBERT = AutoModel.from_pretrained('kuelumbus/polyBERT')

    # Pad and truncate so all sequences share one length; the attention
    # mask marks the padding for mean_pooling
    encoded_input = tokenizer(psmiles_strings, padding=True, truncation=True, return_tensors='pt')
    
    # Inference only: disable gradient tracking
    with torch.no_grad():
        model_output = polyBERT(**encoded_input)

    # Mean-pool the token embeddings to get one 600-dimensional fingerprint per polymer
    fingerprints = mean_pooling(model_output, encoded_input['attention_mask'])
    print("Fingerprints:")
    print(fingerprints)
    
  4. Cloud GPUs: For encoding large numbers of polymers, consider cloud services with GPU support, such as AWS, GCP, or Azure; see the device-selection sketch after this list.
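
Because polyBERT is exposed as a sentence-similarity model, fingerprints from steps 2 and 3 can be compared directly. A minimal sketch using the cos_sim utility from sentence-transformers:

    from sentence_transformers import SentenceTransformer, util

    polyBERT = SentenceTransformer('kuelumbus/polyBERT')
    embeddings = polyBERT.encode(["[*]CC[*]", "[*]COC[*]"], convert_to_tensor=True)

    # Cosine similarity between the two example polymer fingerprints
    similarity = util.cos_sim(embeddings[0], embeddings[1])
    print(similarity)  # a 1x1 tensor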
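
For step 4, moving the model to a GPU is a one-line change; the sketch below picks a device automatically (the batch size is an illustrative value, not a recommendation from the model authors):

    import torch
    from sentence_transformers import SentenceTransformer

    # Use a GPU when available, otherwise fall back to CPU
    device = 'cuda' if torch.cuda.is_available() else 'cpu'
    polyBERT = SentenceTransformer('kuelumbus/polyBERT', device=device)

    psmiles_strings = ["[*]CC[*]", "[*]COC[*]"]
    # Larger batches amortize host-to-device transfer overhead on a GPU
    embeddings = polyBERT.encode(psmiles_strings, batch_size=64)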

License

Please refer to the LICENSE file included in the model's repository for detailed licensing information.
