pubmedbert-base-embeddings-2M
NeuML

Introduction
pubmedbert-base-embeddings-2M is a distilled version of the PubMedBERT model, created with the Model2Vec library. It provides static embeddings, enabling much faster text embedding computation on both GPU and CPU. This makes it well suited to scenarios with limited computational resources or where real-time performance is critical.
Architecture
This model is built on the PubMedBERT architecture and uses the Model2Vec library to create static embeddings. It is designed for efficiency, offering a balance between speed and accuracy.
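As a rough illustration of how such a distillation is produced, the sketch below uses Model2Vec's distill function. The source model name and pca_dims value here are assumptions for illustration, not NeuML's exact settings.

```python
from model2vec.distill import distill

# Distill a transformer-based embedding model into static token embeddings.
# The source model and PCA dimensionality are illustrative assumptions.
m2v_model = distill(model_name="neuml/pubmedbert-base-embeddings", pca_dims=256)

# Save the distilled model so it can be reloaded later
m2v_model.save_pretrained("pubmedbert-static")
```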
Training
The model was trained with the Tokenlearn library. Data was first processed with the tokenlearn.featurize command. Training used a random sample of articles, and tokens were weighted with the BM25 method instead of the default SIF weighting. The overall process involved tokenizing the data, calculating token weights, applying PCA, and normalizing the embeddings.
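To make the weighting and post-processing steps concrete, here is a minimal numpy/scikit-learn sketch of BM25-style token weighting followed by PCA and L2 normalization. The shapes, random data, and constants are illustrative assumptions; this is not the Tokenlearn implementation itself.

```python
import numpy as np
from sklearn.decomposition import PCA

# Illustrative stand-ins: token vectors learned during training and the
# document frequency of each token in the training corpus
rng = np.random.default_rng(0)
vocab_size, dim, n_docs = 10000, 768, 5000
token_vectors = rng.normal(size=(vocab_size, dim))
doc_freq = rng.integers(1, n_docs, size=vocab_size)

# BM25-style IDF weighting (used here in place of the default SIF weighting)
idf = np.log((n_docs - doc_freq + 0.5) / (doc_freq + 0.5) + 1.0)
weighted = token_vectors * idf[:, None]

# Reduce dimensionality with PCA, then L2-normalize each token vector
reduced = PCA(n_components=256).fit_transform(weighted)
normalized = reduced / np.linalg.norm(reduced, axis=1, keepdims=True)
```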
Guide: Running Locally
To run this model locally, you can use one of the following methods:
- Using txtai:

  ```python
  import txtai

  # documents() is assumed to be a user-defined generator yielding
  # (id, text, tags) tuples or plain text strings to index
  embeddings = txtai.Embeddings(path="neuml/pubmedbert-base-embeddings-2M", content=True)
  embeddings.index(documents())
  embeddings.search("query to run")
  ```
- Using Sentence-Transformers:

  ```python
  from sentence_transformers import SentenceTransformer
  from sentence_transformers.models import StaticEmbedding

  # Load the Model2Vec weights as a static embedding module
  static = StaticEmbedding.from_model2vec("neuml/pubmedbert-base-embeddings-2M")
  model = SentenceTransformer(modules=[static])

  embeddings = model.encode(["This is an example sentence"])
  ```
- Using Model2Vec:

  ```python
  from model2vec import StaticModel

  model = StaticModel.from_pretrained("neuml/pubmedbert-base-embeddings-2M")
  embeddings = model.encode(["This is an example sentence"])
  ```
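Whichever method is used, the encoded output is a plain numpy array, so similarity search reduces to simple vector math. A short usage sketch with illustrative sentences:

```python
import numpy as np
from model2vec import StaticModel

model = StaticModel.from_pretrained("neuml/pubmedbert-base-embeddings-2M")

documents = ["Insulin regulates blood glucose levels", "The weather is sunny today"]
query = model.encode(["glucose metabolism"])
vectors = model.encode(documents)

# Cosine similarity between the query and each document
scores = (query @ vectors.T) / (
    np.linalg.norm(query, axis=1, keepdims=True) * np.linalg.norm(vectors, axis=1)
)
print(scores)  # the biomedical sentence should score highest
```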
Static embeddings run efficiently on CPU, but for very large datasets a cloud GPU service such as AWS EC2, Google Cloud Platform, or Azure can further accelerate processing.
License
This model is released under the Apache 2.0 license, which allows commercial use, modification, and distribution, provided that the license text and any required notices are included with the software.