pubmedbert-base-embeddings-2M

NeuML

Introduction

pubmedbert-base-embeddings-2M is a distilled version of the PubMedBERT model, created with the Model2Vec library. It provides static embeddings, which makes computing text embeddings faster on both GPU and CPU. The model suits scenarios with limited computational resources or where real-time performance is critical.

Architecture

The model is derived from PubMedBERT and uses the Model2Vec library to produce static embeddings: each token maps to a fixed vector, so no transformer forward pass is needed at inference time. The design aims for a balance between speed and accuracy.
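To make the idea of static embeddings concrete, here is a minimal sketch, assuming a toy whitespace tokenizer and a randomly initialized table. The names `vocab`, `table`, and `encode` are hypothetical and are not part of Model2Vec; the real model uses its own tokenizer and learned vectors.

```python
import numpy as np

# Hypothetical toy vocabulary and embedding table (not the model's real ones)
vocab = {"heart": 0, "attack": 1, "aspirin": 2, "[UNK]": 3}
rng = np.random.default_rng(0)
table = rng.normal(size=(len(vocab), 8)).astype(np.float32)

def encode(text: str) -> np.ndarray:
    """Embed text by averaging the static vectors of its tokens."""
    # Assumption: whitespace tokenization; unknown tokens map to [UNK]
    ids = [vocab.get(tok, vocab["[UNK]"]) for tok in text.lower().split()]
    if not ids:
        return np.zeros(table.shape[1], dtype=np.float32)
    # A table lookup plus a mean: no neural network is evaluated
    return table[ids].mean(axis=0)

vector = encode("heart attack")
```

Because encoding reduces to a lookup and an average, the cost grows with the number of tokens rather than with the size of a transformer.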

Training

The model was trained with the Tokenlearn library. The data was first processed with the tokenlearn.featurize command. Training used a random sample of articles, and tokens were weighted with BM25 rather than the default SIF weighting. The overall process consisted of tokenizing the data, calculating token weights, applying PCA, and normalizing the embeddings.
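The post-processing steps named above can be sketched roughly as follows. This is an illustration under stated assumptions, not Tokenlearn's implementation: the helper names `bm25_idf` and `postprocess` are hypothetical, the weight uses only the IDF term of BM25, and applying the weights after normalization is an assumed ordering. The sketch starts from token vectors and document frequencies that are already computed, so tokenization is not shown.

```python
import numpy as np

def bm25_idf(doc_freq: np.ndarray, n_docs: int) -> np.ndarray:
    """BM25-style inverse document frequency for each token (assumed form)."""
    return np.log(1.0 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))

def postprocess(token_vectors: np.ndarray, doc_freq: np.ndarray,
                n_docs: int, dims: int) -> np.ndarray:
    # Calculate one weight per token; rarer tokens get larger weights
    weights = bm25_idf(doc_freq, n_docs)
    # PCA: center the vectors and project onto the top `dims` components
    centered = token_vectors - token_vectors.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    reduced = centered @ vt[:dims].T
    # Normalize each token vector to unit length
    norms = np.linalg.norm(reduced, axis=1, keepdims=True)
    unit = reduced / np.clip(norms, 1e-12, None)
    # Assumption: weights are applied last, so they control how much
    # each token contributes when vectors are averaged
    return unit * weights[:, None]

# Toy inputs: 50 tokens with 16-dimensional vectors, 100 documents
rng = np.random.default_rng(0)
token_vectors = rng.normal(size=(50, 16))
doc_freq = rng.integers(1, 101, size=50)
static_table = postprocess(token_vectors, doc_freq, n_docs=100, dims=4)
```

With this ordering, the length of each row equals the token's weight, so frequent tokens pull a mean-pooled sentence vector less than rare ones.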

Guide: Running Locally

To run this model locally, you can use one of the following methods:

  1. Using txtai:

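    import txtai

    # Create the embeddings index; content=True stores the text alongside the vectors
    embeddings = txtai.Embeddings(path="neuml/pubmedbert-base-embeddings-2M", content=True)
    # documents() is a placeholder for your own iterable of texts
    embeddings.index(documents())
    # Run a query against the index
    embeddings.search("query to run")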
    
  2. Using Sentence-Transformers:

    from sentence_transformers import SentenceTransformer
    from sentence_transformers.models import StaticEmbedding

    # Wrap the Model2Vec weights in a static embedding module
    static = StaticEmbedding.from_model2vec("neuml/pubmedbert-base-embeddings-2M")
    model = SentenceTransformer(modules=[static])
    # Encode a list of sentences into embedding vectors
    embeddings = model.encode(["This is an example sentence"])
    
  3. Using Model2Vec:

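    from model2vec import StaticModel

    # Load the static model from the Hugging Face Hub
    model = StaticModel.from_pretrained("neuml/pubmedbert-base-embeddings-2M")
    # Encode a list of sentences into embedding vectors
    embeddings = model.encode(["This is an example sentence"])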
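
Once sentences are encoded, a common next step is to compare them. The following is a small sketch of pairwise cosine similarity using NumPy; the helper `cosine_similarity` and the stand-in vectors are illustrative assumptions, and in practice the inputs would be the arrays returned by `model.encode(...)`.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between the rows of a and the rows of b."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T

# Stand-in vectors; replace with the output of model.encode(...)
queries = np.array([[1.0, 0.0], [1.0, 1.0]])
documents = np.array([[2.0, 0.0], [0.0, 3.0]])
scores = cosine_similarity(queries, documents)
```

Each entry `scores[i, j]` compares query `i` with document `j`; values closer to 1 indicate more similar directions.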
    

Because the embeddings are static, the model runs efficiently on CPU. When processing large datasets, a cloud GPU service such as AWS EC2, Google Cloud Platform, or Azure can provide additional throughput.

License

This model is released under the Apache 2.0 license, which allows commercial use, modification, and distribution, provided that redistributed copies include a copy of the license and retain the required notices.
