allenai-specter
sentence-transformers
Introduction
The allenai-specter model is a conversion of the AllenAI SPECTER model to the sentence-transformers framework. It maps the titles and abstracts of scientific publications into a vector space in which similar papers lie close together, so related work can be identified by proximity.
Architecture
The model is implemented using the SentenceTransformer architecture, which consists of:
- A Transformer layer with BertModel as the backbone, using a maximum sequence length of 512 and preserving case (do_lower_case=False).
- A Pooling layer that uses CLS-token pooling (pooling_mode_cls_token=True) with a word embedding dimension of 768.
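As a quick check of this structure, printing a loaded SentenceTransformer lists its modules; a minimal sketch (exact module repr may vary slightly across library versions):

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('sentence-transformers/allenai-specter')

# Printing the model shows its modules: a Transformer (BertModel, max_seq_length=512)
# followed by a Pooling layer configured for CLS-token pooling.
print(model)
print(model.get_sentence_embedding_dimension())  # expected: 768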
Training
The allenai-specter model is trained to compute sentence embeddings for scientific publications. Detailed evaluation results are available through the Sentence Embeddings Benchmark.
Guide: Running Locally
Prerequisites
- Install the sentence-transformers library:
  pip install -U sentence-transformers
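To confirm the installation (an optional check, not part of the original instructions), the library version can be printed from Python:

# Optional sanity check: verify the library imports and report its version.
import sentence_transformers
print(sentence_transformers.__version__)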
Using Sentence-Transformers
- Import and load the model:
from sentence_transformers import SentenceTransformer

sentences = ["This is an example sentence", "Each sentence is converted"]

# Load the model and encode the sentences into 768-dimensional embeddings.
model = SentenceTransformer('sentence-transformers/allenai-specter')
embeddings = model.encode(sentences)
print(embeddings)
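Since the model is intended for finding similar papers, a natural follow-up is to compare embeddings with cosine similarity. A minimal sketch, assuming each paper is represented as its title and abstract joined by the tokenizer's separator token (the paper strings below are abbreviated examples for illustration):

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('sentence-transformers/allenai-specter')

# Example papers: title and abstract joined with the tokenizer's [SEP] token.
sep = model.tokenizer.sep_token
papers = [
    "BERT" + sep + "We introduce a new language representation model called BERT.",
    "Attention Is All You Need" + sep + "We propose a network architecture based solely on attention.",
]
embeddings = model.encode(papers, convert_to_tensor=True)

# Pairwise cosine similarities; higher values indicate more closely related papers.
scores = util.cos_sim(embeddings, embeddings)
print(scores)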
Using Hugging Face Transformers
- Import libraries and define the CLS pooling function:
from transformers import AutoTokenizer, AutoModel
import torch

# CLS pooling: take the hidden state of the first token ([CLS]) from the last layer.
def cls_pooling(model_output, attention_mask):
    return model_output[0][:, 0]
- Tokenize the sentences and compute embeddings:
sentences = ['This is an example sentence', 'Each sentence is converted']

tokenizer = AutoTokenizer.from_pretrained('sentence-transformers/allenai-specter')
model = AutoModel.from_pretrained('sentence-transformers/allenai-specter')

# Tokenize the input sentences.
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings without tracking gradients.
with torch.no_grad():
    model_output = model(**encoded_input)

# Apply CLS pooling to obtain one embedding per sentence.
sentence_embeddings = cls_pooling(model_output, encoded_input['attention_mask'])

print("Sentence embeddings:")
print(sentence_embeddings)
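The embeddings produced this way should match those from the sentence-transformers path above. As an illustrative follow-up, continuing from the snippet above (not part of the original card), the cosine similarity between the two embeddings can be computed directly with PyTorch:

import torch.nn.functional as F

# Cosine similarity between the two sentence embeddings computed above.
similarity = F.cosine_similarity(sentence_embeddings[0], sentence_embeddings[1], dim=0)
print(f"Cosine similarity: {similarity.item():.4f}")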
Suggestion
For more efficient execution, especially with large datasets, consider using cloud GPUs provided by services like AWS, Google Cloud, or Azure.
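As a sketch of how GPU execution might look with sentence-transformers (assuming a CUDA device is available; the batch size is an illustrative choice, not a recommendation from the original card):

from sentence_transformers import SentenceTransformer
import torch

# Place the model on a GPU when one is available; fall back to CPU otherwise.
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = SentenceTransformer('sentence-transformers/allenai-specter', device=device)

sentences = ["This is an example sentence", "Each sentence is converted"]
# batch_size is an illustrative value; larger batches generally improve GPU throughput.
embeddings = model.encode(sentences, batch_size=32, show_progress_bar=True)
print(embeddings.shape)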
License
This model is licensed under the Apache 2.0 License.