valurank/MiniLM-L6-Keyword-Extraction
Introduction
The MiniLM-L6-Keyword-Extraction model, built with the sentence-transformers library, maps sentences and paragraphs into a 384-dimensional dense vector space. This facilitates tasks such as clustering, semantic search, and sentence similarity. The model is a fine-tuned version of the MiniLM architecture.
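As a quick illustration of sentence similarity with such embeddings, the minimal sketch below encodes a few sentences and compares them with cosine similarity. It assumes the sentence-transformers package is installed and reuses the model identifier that appears in the usage guide further down; the example sentences are purely illustrative.

from sentence_transformers import SentenceTransformer, util

# Load the embedding model (identifier taken from the usage guide below)
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

sentences = ["Keyword extraction from documents",
             "Extracting key phrases from text",
             "The weather is sunny today"]

# Each sentence becomes a 384-dimensional dense vector
embeddings = model.encode(sentences)

# Cosine similarity: related sentences score higher than unrelated ones
similarities = util.cos_sim(embeddings, embeddings)
print(similarities)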
Architecture
The architecture utilizes the pretrained nreimers/MiniLM-L6-H384-uncased model, which is fine-tuned for sentence embeddings using a contrastive learning objective. The model processes input text to generate dense vector representations that capture semantic content.
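For readers who want to see how such an architecture is typically assembled with sentence-transformers, the sketch below stacks the pretrained MiniLM backbone with a mean-pooling layer. The max_seq_length and pooling choices are assumptions based on the training details described later, not a verbatim reproduction of the authors' setup.

from sentence_transformers import SentenceTransformer, models

# Pretrained MiniLM backbone that produces per-token embeddings
word_embedding_model = models.Transformer('nreimers/MiniLM-L6-H384-uncased', max_seq_length=128)

# Mean pooling turns token embeddings into a single sentence vector
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension(),
                               pooling_mode='mean')

model = SentenceTransformer(modules=[word_embedding_model, pooling_model])
print(model.get_sentence_embedding_dimension())  # 384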
Training
Pre-training
Training starts from the pretrained nreimers/MiniLM-L6-H384-uncased checkpoint, which provides the base language representations that are later adapted for dense vector generation and semantic understanding.
Fine-tuning
Fine-tuning uses a contrastive objective: cosine similarity is computed between sentence pairs in a batch, and a cross-entropy loss is applied against the true pairs. The model was trained on a TPU v3-8 for 100,000 steps with a batch size of 1024, using the AdamW optimizer, a learning rate of 2e-5, and a maximum sequence length of 128 tokens. The training data consists of over 1 billion sentence pairs from various datasets, detailed in the data_config.json file.
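To make the contrastive objective concrete, here is a minimal PyTorch sketch of an in-batch contrastive loss of this kind: every sentence is scored against every candidate in the batch via cosine similarity, and cross-entropy pushes the true pair to score highest. The scaling factor is an assumed illustrative value, not a documented training hyperparameter.

import torch
import torch.nn.functional as F

def in_batch_contrastive_loss(emb_a, emb_b, scale=20.0):
    # emb_a[i] and emb_b[i] form a true pair; all other combinations act as negatives
    emb_a = F.normalize(emb_a, p=2, dim=1)
    emb_b = F.normalize(emb_b, p=2, dim=1)
    scores = emb_a @ emb_b.t() * scale  # cosine similarity matrix, shape (batch, batch)
    labels = torch.arange(scores.size(0), device=scores.device)  # true pairs lie on the diagonal
    return F.cross_entropy(scores, labels)

# Example with random embeddings standing in for encoder outputs
loss = in_batch_contrastive_loss(torch.randn(8, 384), torch.randn(8, 384))
print(loss.item())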
Guide: Running Locally
Basic Steps
- Install Dependencies
  Ensure you have sentence-transformers installed:
  pip install -U sentence-transformers
- Load and Use the Model
  You can use the model with the following Python script:
  from sentence_transformers import SentenceTransformer

  sentences = ["This is an example sentence", "Each sentence is converted"]

  model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
  embeddings = model.encode(sentences)
  print(embeddings)
- Alternative Usage with Hugging Face Transformers
  If not using sentence-transformers, leverage Hugging Face's transformers library:
  from transformers import AutoTokenizer, AutoModel
  import torch
  import torch.nn.functional as F

  # Mean pooling: average the token embeddings, weighted by the attention mask
  def mean_pooling(model_output, attention_mask):
      token_embeddings = model_output[0]  # first element contains all token embeddings
      input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
      return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

  sentences = ['This is an example sentence', 'Each sentence is converted']

  tokenizer = AutoTokenizer.from_pretrained('sentence-transformers/all-MiniLM-L6-v2')
  model = AutoModel.from_pretrained('sentence-transformers/all-MiniLM-L6-v2')

  # Tokenize, run the model, pool, and L2-normalize the sentence embeddings
  encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
  with torch.no_grad():
      model_output = model(**encoded_input)

  sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
  sentence_embeddings = F.normalize(sentence_embeddings, p=2, dim=1)

  print("Sentence embeddings:")
  print(sentence_embeddings)
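Both approaches yield L2-normalized sentence embeddings; the transformers variant simply makes explicit the mean pooling and normalization steps that the sentence-transformers pipeline applies internally.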
Cloud GPUs
For large-scale or intensive tasks, consider using cloud GPU services such as AWS EC2, Google Cloud, or Azure.
License
The model is released under an "other" license, and users should consult the Hugging Face model card for specific licensing details.