BioLORD-STAMB2-v1
by FremyCompany
Introduction
BioLORD-STAMB2-v1 is a model designed to produce meaningful representations for clinical sentences and biomedical concepts. It leverages BioLORD, a novel pre-training strategy that grounds concept representations using definitions and short descriptions derived from biomedical ontologies, improving their semantic accuracy. The model is based on sentence-transformers/all-mpnet-base-v2 and was fine-tuned on the BioLORD-Dataset.
Architecture
BioLORD-STAMB2-v1 is a sentence-transformers model that maps sentences and paragraphs to a 768-dimensional dense vector space. It is optimized for the biomedical domain and delivers state-of-the-art performance on text-similarity tasks over clinical sentences and biomedical concepts.
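As a quick sanity check, you can verify the embedding dimensionality yourself; a minimal sketch using the sentence-transformers API (the same loading call used in the guide below):

  from sentence_transformers import SentenceTransformer

  model = SentenceTransformer('FremyCompany/BioLORD-STAMB2-v1')
  vector = model.encode("Cat scratch disease")
  print(vector.shape)  # (768,): one 768-dimensional vector per input sentence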
Training
BioLORD-STAMB2-v1 was fine-tuned on the BioLORD-Dataset with a training objective that maximizes the similarity between representations of names referring to the same concept. It uses contrastive learning and multi-relational knowledge graphs to ground its representations, which yields concept representations that are more semantic and better aligned with the hierarchical structure of the underlying ontologies.
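The full BioLORD recipe is more elaborate than any short snippet can show; purely as an illustration of the contrastive component, here is a minimal sketch in the sentence-transformers style, pairing concept names with their definitions. The (name, definition) pairs and hyperparameters below are made up for illustration and are not the authors' actual settings:

  from torch.utils.data import DataLoader
  from sentence_transformers import SentenceTransformer, InputExample, losses

  # Illustrative (name, definition) pairs; the real BioLORD-Dataset pairs
  # concept names with definitions derived from biomedical ontologies.
  train_examples = [
      InputExample(texts=["Cat scratch disease",
                          "An infectious disease caused by Bartonella henselae, typically transmitted through a cat scratch."]),
      InputExample(texts=["Hypertension",
                          "A disorder characterized by persistently elevated arterial blood pressure."]),
  ]
  train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)

  # Start from the same base model that BioLORD-STAMB2-v1 was derived from.
  model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')

  # In-batch contrastive loss: each name is pulled toward its own definition
  # and pushed away from the other definitions in the batch.
  train_loss = losses.MultipleNegativesRankingLoss(model)

  model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=10)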
Guide: Running Locally
To use the model locally, you can follow these steps:
- Install Sentence-Transformers:

  pip install -U sentence-transformers
- Using Sentence-Transformers:

  from sentence_transformers import SentenceTransformer

  sentences = ["Cat scratch injury", "Cat scratch disease", "Bartonellosis"]

  model = SentenceTransformer('FremyCompany/BioLORD-STAMB2-v1')
  embeddings = model.encode(sentences)
  print(embeddings)
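  Since the model is optimized for text similarity, a natural follow-up is to compare the embeddings. Below is a minimal sketch using the cos_sim utility bundled with sentence-transformers, continuing from the variables defined above:

  from sentence_transformers import util

  # Pairwise cosine similarities; closely related concepts such as
  # "Cat scratch disease" and "Bartonellosis" should score higher than
  # the lexically similar but distinct "Cat scratch injury".
  scores = util.cos_sim(embeddings, embeddings)
  print(scores)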
- Using Hugging Face Transformers:

  from transformers import AutoTokenizer, AutoModel
  import torch
  import torch.nn.functional as F

  # Mean pooling: average the token embeddings, weighting by the attention
  # mask so that padding tokens do not contribute.
  def mean_pooling(model_output, attention_mask):
      token_embeddings = model_output[0]
      input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
      return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

  sentences = ["Cat scratch injury", "Cat scratch disease", "Bartonellosis"]

  tokenizer = AutoTokenizer.from_pretrained('FremyCompany/BioLORD-STAMB2-v1')
  model = AutoModel.from_pretrained('FremyCompany/BioLORD-STAMB2-v1')

  encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

  with torch.no_grad():
      model_output = model(**encoded_input)

  sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
  sentence_embeddings = F.normalize(sentence_embeddings, p=2, dim=1)

  print("Sentence embeddings:")
  print(sentence_embeddings)
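  Because the embeddings were L2-normalized in the last step, cosine similarity reduces to a plain dot product. Continuing from the snippet above:

  # Dot products of unit-length vectors equal their cosine similarities.
  scores = sentence_embeddings @ sentence_embeddings.T
  print(scores)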
- Cloud GPUs: For enhanced performance, especially with larger datasets or more intensive tasks, consider using cloud GPU services such as AWS, Google Cloud, or Azure.
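  If a GPU is available, sentence-transformers can be told to use it explicitly when loading the model. A minimal sketch, assuming a CUDA device (omit the device argument to let the library auto-detect):

  from sentence_transformers import SentenceTransformer

  # device='cuda' assumes an NVIDIA GPU is present.
  model = SentenceTransformer('FremyCompany/BioLORD-STAMB2-v1', device='cuda')
  embeddings = model.encode(["Cat scratch disease"], batch_size=64)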
License
The contributions to this model are released under the MIT license. However, because the training data originates from UMLS, appropriate UMLS licensing is also required. UMLS licenses are free of charge in most countries, but you must create an account and report your usage annually to keep the license valid.