BioBERT-mnli-snli-scinli-scitail-mednli-stsb
pritamdeka
Introduction
The BioBERT-mnli-snli-scinli-scitail-mednli-stsb model is a sentence-transformers model that maps sentences and paragraphs to a 768-dimensional dense vector space. It is intended for tasks such as clustering and semantic search, and was trained on the SNLI, MNLI, SCINLI, SCITAIL, MEDNLI, and STSB datasets.
Architecture
The model is a SentenceTransformer that wraps a BertModel transformer with a maximum sequence length of 100 tokens; longer inputs are truncated. A pooling layer then produces the sentence embedding by taking the mean of the token embeddings.
SentenceTransformer(
  (0): Transformer({'max_seq_length': 100, 'do_lower_case': False}) with Transformer model: BertModel
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
)
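To make the pooling step concrete, here is a minimal, dependency-free sketch of mean pooling over token embeddings. The function name `mean_pool` and the toy 2-dimensional vectors are illustrative assumptions; the real model works on 768-dimensional tensors, and skipping padded positions via an attention mask follows the Hugging Face example later in this guide.

```python
from typing import List


def mean_pool(token_embeddings: List[List[float]], attention_mask: List[int]) -> List[float]:
    """Average the embeddings of real tokens, skipping padded positions."""
    dim = len(token_embeddings[0])
    summed = [0.0] * dim
    count = 0
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vector):
                summed[i] += value
    # Guard against an all-padding input to avoid dividing by zero.
    return [s / max(count, 1) for s in summed]


# Toy input (hypothetical values): three token positions, the last one is padding.
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]
pooled = mean_pool(tokens, mask)
print(pooled)
```

The padded third position does not affect the result, which is why the attention mask matters when sentences in a batch have different lengths.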
Training
The model was trained using a DataLoader with a batch size of 64 and a CosineSimilarityLoss, over 4 epochs with an evaluation every 1000 steps. Optimization used the AdamW optimizer with a learning rate of 2e-05 and a weight decay of 0.01. The learning rate followed a WarmupLinear scheduler with 36 warmup steps.
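The two less obvious pieces of this setup, the loss and the learning-rate schedule, can be sketched in plain Python. This is an illustration under stated assumptions, not the library's code: it assumes CosineSimilarityLoss is the mean squared error between the cosine similarity of an embedding pair and its gold score (the sentence-transformers default, as far as I know), and that WarmupLinear ramps linearly from 0 to the base rate and then decays linearly to 0. The `total_steps=1000` value is hypothetical, since the model card does not state the total number of training steps.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def cosine_similarity_loss(pairs: List[Tuple[Vector, Vector]], labels: List[float]) -> float:
    """Mean squared error between pairwise cosine similarity and gold scores."""
    errors = [(cosine_similarity(a, b) - label) ** 2 for (a, b), label in zip(pairs, labels)]
    return sum(errors) / len(errors)


def warmup_linear_lr(step: int, base_lr: float = 2e-5, warmup_steps: int = 36,
                     total_steps: int = 1000) -> float:
    """Linear warmup to base_lr, then linear decay to zero.

    total_steps is a hypothetical value chosen for illustration only.
    """
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


# Identical embeddings with a gold score of 1.0 give zero loss.
print(cosine_similarity_loss([([1.0, 0.0], [1.0, 0.0])], [1.0]))
# Learning rate at the start, at the end of warmup, and at the last step.
print(warmup_linear_lr(0), warmup_linear_lr(36), warmup_linear_lr(1000))
```

With only 36 warmup steps, the rate reaches its 2e-05 peak almost immediately, so most of training runs in the linear-decay phase.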
Guide: Running Locally
To use the model locally, follow these steps:
- Install Sentence Transformers:
    pip install -U sentence-transformers
- Using Sentence-Transformers Library:
    from sentence_transformers import SentenceTransformer

    sentences = ["This is an example sentence", "Each sentence is converted"]

    # Downloads the model from the Hugging Face Hub on first use
    model = SentenceTransformer('pritamdeka/BioBERT-mnli-snli-scinli-scitail-mednli-stsb')
    embeddings = model.encode(sentences)
    print(embeddings)
- Using Hugging Face Transformers:
    from transformers import AutoTokenizer, AutoModel
    import torch

    # Mean pooling: average the token embeddings, using the attention mask to ignore padding
    def mean_pooling(model_output, attention_mask):
        token_embeddings = model_output[0]  # first element holds all token embeddings
        input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
        return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

    sentences = ['This is an example sentence', 'Each sentence is converted']

    # Load the tokenizer and model from the Hugging Face Hub
    tokenizer = AutoTokenizer.from_pretrained('pritamdeka/BioBERT-mnli-snli-scinli-scitail-mednli-stsb')
    model = AutoModel.from_pretrained('pritamdeka/BioBERT-mnli-snli-scinli-scitail-mednli-stsb')

    # Tokenize the sentences
    encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

    # Compute token embeddings without tracking gradients
    with torch.no_grad():
        model_output = model(**encoded_input)

    # Pool the token embeddings into one vector per sentence
    sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])

    print("Sentence embeddings:")
    print(sentence_embeddings)
- Cloud GPUs Recommendation: For faster processing, consider using cloud GPUs from providers such as AWS, Google Cloud, or Azure.
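Once embeddings are computed, semantic search reduces to ranking corpus vectors by their similarity to a query vector. The sketch below shows that ranking step with cosine similarity in plain Python. The helper name `rank_by_similarity` and the 3-dimensional vectors are hypothetical stand-ins for the 768-dimensional output of `model.encode`; they are not values produced by this model.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_by_similarity(query: Vector, corpus: List[Vector]) -> List[Tuple[int, float]]:
    """Return (corpus index, score) pairs sorted from most to least similar."""
    scores = [(idx, cosine_similarity(query, emb)) for idx, emb in enumerate(corpus)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Hypothetical embeddings standing in for model.encode(...) output.
corpus_embeddings = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.7, 0.7, 0.0],
]
query_embedding = [0.9, 0.1, 0.0]

ranking = rank_by_similarity(query_embedding, corpus_embeddings)
for idx, score in ranking:
    print(idx, round(score, 3))
```

For real workloads, the sentence-transformers package ships a `util.cos_sim` helper that computes the same score on tensors in batch, which is likely a better fit than a pure-Python loop.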
License
The model is licensed under the Creative Commons Attribution-NonCommercial 3.0 (cc-by-nc-3.0). This license permits non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.