valurank/MiniLM-L6-Keyword-Extraction
Introduction
The MiniLM-L6-Keyword-Extraction model, built with the sentence-transformers library, maps sentences and paragraphs into a 384-dimensional dense vector space. This facilitates tasks such as clustering, semantic search, and sentence similarity. The model is a fine-tuned version of the MiniLM architecture.
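As a quick illustration of sentence similarity with such embeddings, the minimal sketch below encodes a few sentences and compares them with cosine similarity. It assumes the sentence-transformers package is installed and reuses the model identifier that appears in the usage guide further down; the example sentences are purely illustrative.

from sentence_transformers import SentenceTransformer, util

# Load the embedding model (identifier taken from the usage guide below)
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

sentences = ["Keyword extraction from documents",
             "Extracting key phrases from text",
             "The weather is sunny today"]

# Each sentence becomes a 384-dimensional dense vector
embeddings = model.encode(sentences)

# Cosine similarity: related sentences score higher than unrelated ones
similarities = util.cos_sim(embeddings, embeddings)
print(similarities)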
Architecture
The architecture utilizes the pretrained nreimers/MiniLM-L6-H384-uncased model, which is fine-tuned for sentence embeddings using a contrastive learning objective. The model processes input text to generate dense vector representations that capture semantic content.
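For readers who want to see how such an architecture is typically assembled with sentence-transformers, the sketch below stacks the pretrained MiniLM backbone with a mean-pooling layer. The max_seq_length and pooling choices are assumptions based on the training details described later, not a verbatim reproduction of the authors' setup.

from sentence_transformers import SentenceTransformer, models

# Pretrained MiniLM backbone that produces per-token embeddings
word_embedding_model = models.Transformer('nreimers/MiniLM-L6-H384-uncased', max_seq_length=128)

# Mean pooling turns token embeddings into a single sentence vector
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension(),
                               pooling_mode='mean')

model = SentenceTransformer(modules=[word_embedding_model, pooling_model])
print(model.get_sentence_embedding_dimension())  # 384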
Training
Pre-training
Training starts from the pretrained nreimers/MiniLM-L6-H384-uncased checkpoint, which provides the base language representations that are later adapted for dense vector generation and semantic understanding.
Fine-tuning
Fine-tuning uses a contrastive objective: cosine similarity is computed between sentence pairs in a batch, and a cross-entropy loss is applied against the true pairs. The model was trained on a TPU v3-8 for 100,000 steps with a batch size of 1024, using the AdamW optimizer, a learning rate of 2e-5, and a maximum sequence length of 128 tokens. The training data consists of over 1 billion sentence pairs from various datasets, detailed in the data_config.json file.
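To make the contrastive objective concrete, here is a minimal PyTorch sketch of an in-batch contrastive loss of this kind: every sentence is scored against every candidate in the batch via cosine similarity, and cross-entropy pushes the true pair to score highest. The scaling factor is an assumed illustrative value, not a documented training hyperparameter.

import torch
import torch.nn.functional as F

def in_batch_contrastive_loss(emb_a, emb_b, scale=20.0):
    # emb_a[i] and emb_b[i] form a true pair; all other combinations act as negatives
    emb_a = F.normalize(emb_a, p=2, dim=1)
    emb_b = F.normalize(emb_b, p=2, dim=1)
    scores = emb_a @ emb_b.t() * scale  # cosine similarity matrix, shape (batch, batch)
    labels = torch.arange(scores.size(0), device=scores.device)  # true pairs lie on the diagonal
    return F.cross_entropy(scores, labels)

# Example with random embeddings standing in for encoder outputs
loss = in_batch_contrastive_loss(torch.randn(8, 384), torch.randn(8, 384))
print(loss.item())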
Guide: Running Locally
Basic Steps
- Install Dependencies
  Ensure you have sentence-transformers installed:
  pip install -U sentence-transformers
- Load and Use the Model
  You can use the model with the following Python script:
  from sentence_transformers import SentenceTransformer

  sentences = ["This is an example sentence", "Each sentence is converted"]

  model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
  embeddings = model.encode(sentences)
  print(embeddings)
- Alternative Usage with Hugging Face Transformers
  If not using sentence-transformers, leverage Hugging Face's transformers library:
  from transformers import AutoTokenizer, AutoModel
  import torch
  import torch.nn.functional as F

  # Mean pooling: average the token embeddings, weighted by the attention mask
  def mean_pooling(model_output, attention_mask):
      token_embeddings = model_output[0]  # first element contains all token embeddings
      input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
      return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

  sentences = ['This is an example sentence', 'Each sentence is converted']

  tokenizer = AutoTokenizer.from_pretrained('sentence-transformers/all-MiniLM-L6-v2')
  model = AutoModel.from_pretrained('sentence-transformers/all-MiniLM-L6-v2')

  # Tokenize, run the model, pool, and L2-normalize the sentence embeddings
  encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
  with torch.no_grad():
      model_output = model(**encoded_input)

  sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
  sentence_embeddings = F.normalize(sentence_embeddings, p=2, dim=1)

  print("Sentence embeddings:")
  print(sentence_embeddings)
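Both approaches yield L2-normalized sentence embeddings; the transformers variant simply makes explicit the mean pooling and normalization steps that the sentence-transformers pipeline applies internally.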
Cloud GPUs
For large-scale or intensive tasks, consider using cloud GPU services such as AWS EC2, Google Cloud, or Azure.
License
The model is released under an "other" license, and users should consult the Hugging Face model card for specific licensing details.