Introduction

SciNCL is a pre-trained BERT-based language model for generating document-level embeddings of research papers. It samples positive and negative examples for contrastive learning from the citation graph neighborhood, with weights initialized from scibert-scivocab-uncased and citation embeddings derived from the S2ORC citation graph. The associated paper is "Neighborhood Contrastive Learning for Scientific Document Representations with Citation Embeddings" (EMNLP 2022).

Architecture

SciNCL uses a standard BERT encoder initialized from SciBERT; the citation-informed contrastive objective changes only how the model is trained, not its architecture. A paper is represented by concatenating its title and abstract with the [SEP] token and taking the final-layer [CLS] token representation as the document embedding, so papers that are close in the citation network receive similar embeddings.
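
As a quick sanity check, the underlying configuration can be inspected with transformers. The expected values in the comments are assumptions based on SciBERT's BERT-base setup, not statements from the model card.

from transformers import AutoConfig

# Load only the model configuration (no weights are downloaded for this check)
config = AutoConfig.from_pretrained("malteos/scincl")

print(config.model_type)          # "bert"
print(config.num_hidden_layers)   # expected: 12 (BERT-base)
print(config.hidden_size)         # expected: 768 (BERT-base)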

Training

The model is trained with a contrastive (triplet) objective on the S2ORC citation graph. For each query paper, positives and hard negatives are mined from its neighborhood in a citation embedding space: papers close to the query are used as positives, while more distant papers serve as negatives. This neighborhood-based triplet mining is what drives the quality of the document representations; training yields significant improvements on the SciDocs benchmark, with detailed evaluations provided in the accompanying research paper.
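
For illustration, the snippet below is a minimal sketch of this kind of neighborhood-based triplet training. It is not the original training code: the mine_triplet helper, the margin value, and the neighbor bands used for sampling positives and negatives are simplified assumptions, and random tensors stand in for the real citation and document embeddings.

import torch

# Hypothetical pre-computed citation embeddings for a small corpus (n_papers x dim)
citation_embeddings = torch.randn(100, 64)

def mine_triplet(query_idx, k_pos=5, k_neg_start=20, k_neg_end=25):
    """Pick a positive from the query's nearest neighbors and a hard negative
    from a band of more distant neighbors (simplified sampling strategy)."""
    dists = torch.cdist(citation_embeddings[query_idx:query_idx + 1], citation_embeddings)[0]
    order = torch.argsort(dists)  # index 0 is the query itself
    pos_idx = order[1:1 + k_pos][torch.randint(k_pos, (1,))].item()
    neg_idx = order[k_neg_start:k_neg_end][torch.randint(k_neg_end - k_neg_start, (1,))].item()
    return pos_idx, neg_idx

# Triplet margin loss on (query, positive, negative) document embeddings
triplet_loss = torch.nn.TripletMarginLoss(margin=1.0)

# Dummy document embeddings standing in for the BERT [CLS] outputs
doc_embeddings = torch.randn(100, 768, requires_grad=True)
q = 0
p, n = mine_triplet(q)
loss = triplet_loss(doc_embeddings[q:q + 1], doc_embeddings[p:p + 1], doc_embeddings[n:n + 1])
loss.backward()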

Guide: Running Locally

Using Sentence Transformers

from sentence_transformers import SentenceTransformer

# Load the model
model = SentenceTransformer("malteos/scincl")

# Example papers
papers = [
    "BERT [SEP] We introduce a new language representation model called BERT",
    "Attention is all you need [SEP] The dominant sequence transduction models are based on complex recurrent or convolutional neural networks",
]

# Generate embeddings
embeddings = model.encode(papers)

# Compute similarity (cosine similarity by default)
similarity = model.similarity(embeddings[0], embeddings[1])
print(similarity.item())
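
Beyond a single pair, the same model can rank a small corpus against a query paper. The snippet below is a sketch using the semantic_search utility from sentence-transformers; the corpus and query strings are made-up examples in the "title [SEP] abstract" format used above.

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("malteos/scincl")

# Hypothetical corpus of "title [SEP] abstract" strings
corpus = [
    "BERT [SEP] We introduce a new language representation model called BERT",
    "Attention is all you need [SEP] The dominant sequence transduction models are based on complex recurrent or convolutional neural networks",
    "GloVe [SEP] We present a new global log-bilinear regression model for word representations",
]
query = "SciBERT [SEP] A pretrained language model for scientific text"

corpus_embeddings = model.encode(corpus, convert_to_tensor=True)
query_embedding = model.encode(query, convert_to_tensor=True)

# Retrieve the top-2 most similar papers by cosine similarity
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=2)[0]
for hit in hits:
    print(corpus[hit["corpus_id"]], hit["score"])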

Using Transformers

from transformers import AutoTokenizer, AutoModel
import torch

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained('malteos/scincl')
model = AutoModel.from_pretrained('malteos/scincl')

# Example papers
papers = [
    {'title': 'BERT', 'abstract': 'We introduce a new language representation model called BERT'},
    {'title': 'Attention is all you need', 'abstract': 'The dominant sequence transduction models are based on complex recurrent or convolutional neural networks'}
]

# Preprocessing
title_abs = [d['title'] + tokenizer.sep_token + (d.get('abstract') or '') for d in papers]
inputs = tokenizer(title_abs, padding=True, truncation=True, return_tensors="pt", max_length=512)

# Inference (no gradient tracking needed for embedding extraction)
with torch.no_grad():
    result = model(**inputs)

# Take the first token ([CLS]) of each sequence as the document embedding
embeddings = result.last_hidden_state[:, 0, :]

# L2-normalize so the dot product equals cosine similarity
embeddings = torch.nn.functional.normalize(embeddings, p=2, dim=1)
similarity = embeddings[0] @ embeddings[1]
print(similarity.item())
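
For larger collections it helps to run inference in batches without gradient tracking. The helper below is a rough sketch reusing the tokenizer, model, and papers defined above; the embed_papers name and batch size of 8 are arbitrary choices, not part of the model card.

import torch

def embed_papers(papers, tokenizer, model, batch_size=8):
    """Embed a list of {'title', 'abstract'} dicts in batches, returning [CLS] vectors."""
    model.eval()
    all_embeddings = []
    for start in range(0, len(papers), batch_size):
        batch = papers[start:start + batch_size]
        texts = [d['title'] + tokenizer.sep_token + (d.get('abstract') or '') for d in batch]
        inputs = tokenizer(texts, padding=True, truncation=True,
                           return_tensors="pt", max_length=512)
        with torch.no_grad():
            outputs = model(**inputs)
        all_embeddings.append(outputs.last_hidden_state[:, 0, :])
    return torch.cat(all_embeddings, dim=0)

embeddings = embed_papers(papers, tokenizer, model)
print(embeddings.shape)  # (num_papers, hidden_size)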

Cloud GPUs

For large-scale embedding jobs, a cloud GPU from a provider such as AWS, Google Cloud, or Azure speeds up inference considerably.
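
When a GPU is available, moving the model and inputs onto it is a small change. The sketch below assumes the variables from the Transformers guide above and a CUDA device.

import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

# Move the model once, then move each batch of tokenized inputs alongside it
model = model.to(device)
inputs = {k: v.to(device) for k, v in inputs.items()}

with torch.no_grad():
    embeddings = model(**inputs).last_hidden_state[:, 0, :]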

License

The SciNCL model is licensed under the MIT License, allowing for flexible use and distribution.
