BioBERT-mnli-snli-scinli-scitail-mednli-stsb

pritamdeka

Introduction

BioBERT-mnli-snli-scinli-scitail-mednli-stsb is a sentence-transformers model that maps sentences and paragraphs to a 768-dimensional dense vector space. It is intended for tasks such as clustering and semantic search, and was trained on the SNLI, MNLI, SCINLI, SCITAIL, MEDNLI, and STSB datasets.

Architecture

The model is a SentenceTransformer with two modules. The first is a Transformer module wrapping a BertModel, with a maximum sequence length of 100 tokens and no lowercasing of the input. The second is a Pooling module that produces the sentence embedding as the mean of the token embeddings.

SentenceTransformer(
  (0): Transformer({'max_seq_length': 100, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
)
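
To make the pooling step concrete, here is a minimal sketch of masked mean pooling in plain Python. It is not the library's implementation; the function name `masked_mean_pooling` and the toy 3-dimensional vectors (standing in for the model's 768 dimensions) are assumptions made for illustration.

```python
def masked_mean_pooling(token_embeddings, attention_mask):
    """Average token vectors, skipping positions where the mask is 0."""
    dim = len(token_embeddings[0])
    summed = [0.0] * dim
    kept = 0
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            kept += 1
            for i, value in enumerate(vector):
                summed[i] += value
    # Guard against an all-padding input so we never divide by zero.
    divisor = max(kept, 1)
    return [total / divisor for total in summed]


# Toy sentence: two real tokens followed by one padding token.
tokens = [[1.0, 2.0, 3.0], [3.0, 4.0, 5.0], [9.0, 9.0, 9.0]]
mask = [1, 1, 0]
pooled = masked_mean_pooling(tokens, mask)
print(pooled)  # the padding vector [9, 9, 9] does not affect the result
```

Only the two unmasked tokens contribute to the average, which is why the attention mask matters when sentences in a batch are padded to a common length.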

Training

The model was trained with a DataLoader at a batch size of 64, using CosineSimilarityLoss for 4 epochs with an evaluation every 1000 steps. Optimization used AdamW with a learning rate of 2e-05 and a weight decay of 0.01. The learning rate followed a WarmupLinear schedule with 36 warmup steps.
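
The two less familiar pieces of this setup can be sketched in plain Python. This is an illustrative approximation, not the training script: it assumes the loss is the mean squared error between the cosine similarity of an embedding pair and its gold score, and that WarmupLinear ramps linearly from zero and then decays linearly to zero. The `total_steps` value is hypothetical, since the source does not state the total number of training steps.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cosine_similarity_loss(pairs, gold_scores):
    """Mean squared error between predicted cosine similarity and gold score."""
    errors = [
        (cosine_similarity(u, v) - gold) ** 2
        for (u, v), gold in zip(pairs, gold_scores)
    ]
    return sum(errors) / len(errors)


def warmup_linear_lr(step, base_lr=2e-5, warmup_steps=36, total_steps=1000):
    """Learning rate at a given step; total_steps=1000 is a hypothetical value."""
    if step < warmup_steps:
        # Linear ramp from 0 up to base_lr during warmup.
        factor = step / max(1, warmup_steps)
    else:
        # Linear decay from base_lr down to 0 at total_steps.
        factor = max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))
    return base_lr * factor


# Orthogonal embeddings with a gold score of 0 incur no loss.
print(cosine_similarity_loss([([1.0, 0.0], [0.0, 1.0])], [0.0]))
# Learning rate at the start, mid-warmup, end of warmup, and end of training.
print([warmup_linear_lr(s) for s in (0, 18, 36, 1000)])
```

With 36 warmup steps the rate reaches its peak of 2e-05 almost immediately, so nearly all of training runs in the linear decay phase.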

Guide: Running Locally

To use the model locally, follow these steps:

  1. Install Sentence Transformers:

    pip install -U sentence-transformers
    
  2. Use the Sentence-Transformers library:

    from sentence_transformers import SentenceTransformer
    sentences = ["This is an example sentence", "Each sentence is converted"]
    model = SentenceTransformer('pritamdeka/BioBERT-mnli-snli-scinli-scitail-mednli-stsb')
    embeddings = model.encode(sentences)
    print(embeddings)
    
  3. Or use Hugging Face Transformers directly, applying mean pooling to the token embeddings yourself:

    from transformers import AutoTokenizer, AutoModel
    import torch

    # Mean pooling: average the token embeddings, weighting by the attention
    # mask so that padding tokens do not contribute.
    def mean_pooling(model_output, attention_mask):
        token_embeddings = model_output[0]  # first element holds all token embeddings
        input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
        return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

    # Sentences to embed
    sentences = ['This is an example sentence', 'Each sentence is converted']

    # Load the tokenizer and model from the Hugging Face Hub
    tokenizer = AutoTokenizer.from_pretrained('pritamdeka/BioBERT-mnli-snli-scinli-scitail-mednli-stsb')
    model = AutoModel.from_pretrained('pritamdeka/BioBERT-mnli-snli-scinli-scitail-mednli-stsb')

    # Tokenize the sentences into a padded batch of tensors
    encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

    # Compute token embeddings without tracking gradients
    with torch.no_grad():
        model_output = model(**encoded_input)

    # Pool token embeddings into one vector per sentence
    sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])

    print("Sentence embeddings:")
    print(sentence_embeddings)
    
  4. Optional: use a GPU. For faster encoding of large inputs, consider a cloud GPU from a provider such as AWS, Google Cloud, or Azure.
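
Once embeddings are available from either method above, semantic search reduces to ranking corpus vectors by their similarity to a query vector. The following is a minimal sketch in plain Python under that assumption; `rank_by_similarity` is a hypothetical helper, and the toy 3-dimensional vectors stand in for real 768-dimensional embeddings.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_by_similarity(query_vec, corpus_vecs):
    """Return (index, score) pairs sorted from most to least similar."""
    scored = [
        (index, cosine_similarity(query_vec, vec))
        for index, vec in enumerate(corpus_vecs)
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Toy vectors; in practice these would be rows of the embeddings array.
query = [1.0, 0.0, 0.0]
corpus = [[0.0, 1.0, 0.0], [2.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
ranking = rank_by_similarity(query, corpus)
print(ranking)
```

In real use, the query and corpus vectors would come from `model.encode(...)`. The sentence-transformers library also provides its own similarity helper, `util.cos_sim`, which is likely the better choice for large batches.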

License

The model is licensed under the Creative Commons Attribution-NonCommercial 3.0 (cc-by-nc-3.0). This license permits non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
