Introduction

The ALLENAI-SPECTER model is a port of the AllenAI SPECTER model to the sentence-transformers framework. It maps the titles and abstracts of scientific publications into a vector space, so that similar papers can be identified by their proximity in that space.
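
As a rough illustration of what proximity means here, the sketch below ranks papers by cosine similarity to a query vector. The three-dimensional vectors and the paper names are made-up stand-ins for the model's real embeddings, and `cosine_similarity` and `rank_by_similarity` are hypothetical helpers written for this example, not library functions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_by_similarity(query, corpus):
    """Return (paper_id, score) pairs sorted from most to least similar."""
    scored = [(pid, cosine_similarity(query, vec)) for pid, vec in corpus.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Toy vectors standing in for real embeddings (assumption for illustration).
corpus = {
    "paper_a": [0.9, 0.1, 0.0],
    "paper_b": [0.0, 1.0, 0.0],
    "paper_c": [0.7, 0.3, 0.1],
}
query = [1.0, 0.0, 0.0]

for paper_id, score in rank_by_similarity(query, corpus):
    print(f"{paper_id}: {score:.3f}")
```

In practice the query and corpus vectors would come from encoding real titles and abstracts with the model; only the ranking logic is shown here.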

Architecture

The model is implemented using the SentenceTransformer architecture, which consists of:

  • A Transformer layer with BertModel as the backbone, a maximum sequence length of 512, and case-sensitive input (do_lower_case=False).
  • A Pooling layer that uses CLS token pooling mode (pooling_mode_cls_token=True), with a word embedding dimension of 768.
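
To make the pooling step concrete: CLS pooling keeps only the vector at the first token position of each sequence and discards the rest. The following is a minimal sketch that uses nested Python lists in place of tensors. The tiny dimensions are for illustration only, and `cls_pool` is a hypothetical name.

```python
def cls_pool(token_embeddings):
    """Keep the first-token (CLS) vector of every sequence in the batch.

    token_embeddings: nested list shaped [batch][seq_len][hidden_dim].
    """
    return [sequence[0] for sequence in token_embeddings]

# Toy batch: 2 sequences, 3 tokens each, hidden size 2 (the model's is 768).
batch = [
    [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]],
    [[1.0, 1.1], [1.2, 1.3], [1.4, 1.5]],
]

pooled = cls_pool(batch)
print(pooled)
```

With PyTorch tensors the same selection is written as an index on the token axis, `token_embeddings[:, 0]`.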

Training

The ALLENAI-SPECTER model is trained to compute sentence embeddings for scientific publications. Detailed evaluation results are available through the Sentence Embeddings Benchmark.

Guide: Running Locally

Prerequisites

  1. Install the sentence-transformers library:
    pip install -U sentence-transformers
    

Using Sentence-Transformers

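  1. Load the model and encode the sentences:
    from sentence_transformers import SentenceTransformer
    # Example inputs; each string is mapped to one embedding
    sentences = ["This is an example sentence", "Each sentence is converted"]
    # Load the model (downloaded on first use)
    model = SentenceTransformer('sentence-transformers/allenai-specter')
    # Compute one embedding vector per input string
    embeddings = model.encode(sentences)
    print(embeddings)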
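
The example sentences above are generic. Because the model targets titles and abstracts, a paper is usually passed as a single string. The sketch below joins the two with a `[SEP]` marker, which appears to be the convention used in the sentence-transformers documentation for this model. Treat the separator, the sample papers, and the `format_paper` helper as assumptions and check them against the model card.

```python
def format_paper(title, abstract, sep="[SEP]"):
    """Join title and abstract into one input string (separator is an assumption)."""
    return title.strip() + sep + abstract.strip()

# Made-up papers, for illustration only.
papers = [
    {"title": "A study of graph embeddings", "abstract": "We compare several methods."},
    {"title": "Citation-aware retrieval", "abstract": "We rank papers using citations."},
]

texts = [format_paper(p["title"], p["abstract"]) for p in papers]
print(texts[0])

# With the model loaded as above, the strings would be encoded with:
# embeddings = model.encode(texts)
```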
    

Using Hugging Face Transformers

  1. Import libraries and define pooling function:

    from transformers import AutoTokenizer, AutoModel
    import torch
    
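    def cls_pooling(model_output, attention_mask):
        # model_output[0] is the last hidden state; keep the first ([CLS]) token of each sequence.
        # attention_mask is not needed for CLS pooling.
        return model_output[0][:,0]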
    
  2. Tokenize and compute embeddings:

    sentences = ['This is an example sentence', 'Each sentence is converted']
    # Load the tokenizer and model from the Hugging Face Hub
    tokenizer = AutoTokenizer.from_pretrained('sentence-transformers/allenai-specter')
    model = AutoModel.from_pretrained('sentence-transformers/allenai-specter')
    
    # Tokenize, padding to the longest input and truncating overlong ones
    encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
    # Run the model without tracking gradients (inference only)
    with torch.no_grad():
        model_output = model(**encoded_input)
    
    # Apply CLS pooling to get one embedding per sentence
    sentence_embeddings = cls_pooling(model_output, encoded_input['attention_mask'])
    print("Sentence embeddings:")
    print(sentence_embeddings)
    

Suggestion

For faster execution, especially with large datasets, consider running the model on a GPU, for example a cloud GPU instance from AWS, Google Cloud, or Azure.
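
Large datasets are typically encoded in batches rather than all at once, so that memory use stays bounded. The sketch below shows one way to split a list of inputs into fixed-size chunks. The `batched` helper, the batch size of 4, and the placeholder inputs are assumptions chosen for illustration.

```python
def batched(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

texts = [f"paper {i}" for i in range(10)]  # placeholder inputs
batches = list(batched(texts, 4))
print([len(b) for b in batches])

# Each batch could then be passed to the encoder, e.g. model.encode(batch).
```

Note that `model.encode` in sentence-transformers also accepts a `batch_size` argument, so manual chunking like this is mainly useful when the inputs are streamed from disk or the results are written out incrementally.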

License

This model is licensed under the Apache 2.0 License.
