DistilBERT base cased
Introduction
DistilBERT is a distilled version of the BERT base model, designed to be smaller, faster, and more efficient while maintaining competitive performance. It is a cased model, which distinguishes between different capitalizations, such as "english" and "English."
Architecture
DistilBERT is a transformer-based model that reduces the size of the original BERT model while retaining most of its effectiveness. It is trained with three objectives: a distillation loss (matching BERT's output probabilities), masked language modeling (MLM), and a cosine embedding loss (aligning its hidden states with BERT's). Together, these objectives lead the model to learn an internal representation of language close to BERT's, while being faster at inference and on downstream tasks.
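To make the combination of objectives concrete, here is a minimal sketch in PyTorch. It assumes the teacher's and student's logits and hidden states are already available as tensors; the function name, the temperature value, and the plain sum of the three terms are illustrative assumptions, not the actual DistilBERT training code.

import torch
import torch.nn.functional as F

# Illustrative sketch of DistilBERT's triple training objective.
# Tensor names and the temperature value are assumptions.
def distillation_objective(student_logits, teacher_logits,
                           student_hidden, teacher_hidden,
                           mlm_labels, temperature=2.0):
    # 1. Distillation loss: match the teacher's softened output distribution.
    loss_distill = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)

    # 2. Masked language modeling loss on the masked positions only
    #    (unmasked positions carry the ignore label -100).
    loss_mlm = F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)),
        mlm_labels.view(-1),
        ignore_index=-100,
    )

    # 3. Cosine embedding loss: align student hidden states with the teacher's.
    flat_student = student_hidden.view(-1, student_hidden.size(-1))
    flat_teacher = teacher_hidden.view(-1, teacher_hidden.size(-1))
    target = torch.ones(flat_student.size(0))
    loss_cos = F.cosine_embedding_loss(flat_student, flat_teacher, target)

    # The full objective sums the three terms (loss weights are omitted here).
    return loss_distill + loss_mlm + loss_cos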
Training
The model was pre-trained on BookCorpus and English Wikipedia in a self-supervised fashion, with no human labeling. Training ran for 90 hours on 8 V100 GPUs (16 GB each). Because the model is cased, the text was not lowercased; it was tokenized with WordPiece using a 30,000-token vocabulary. Inputs were built as sentence pairs that, with probability 0.5, are consecutive in the original corpus and are otherwise drawn at random. 15% of the tokens were then masked: of those, 80% were replaced with [MASK], 10% with a random token, and the remaining 10% were left unchanged.
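As a rough sketch of that masking rule (an illustrative reimplementation, not the original training script; the function name and arguments such as mask_token_id and vocab_size are placeholders):

import torch

# Illustrative sketch of the masking rule described above: 15% of tokens are
# selected; of those, 80% become [MASK], 10% become a random token, and 10%
# stay unchanged.
def mask_tokens(input_ids, mask_token_id, vocab_size, mlm_probability=0.15):
    labels = input_ids.clone()

    # Select 15% of positions as prediction targets.
    masked_indices = torch.bernoulli(
        torch.full(input_ids.shape, mlm_probability)).bool()
    labels[~masked_indices] = -100  # loss is only computed on masked positions

    # 80% of the selected tokens are replaced with [MASK].
    replaced = torch.bernoulli(
        torch.full(input_ids.shape, 0.8)).bool() & masked_indices
    input_ids[replaced] = mask_token_id

    # Half of the remaining 20% (i.e. 10% overall) get a random token.
    randomized = torch.bernoulli(
        torch.full(input_ids.shape, 0.5)).bool() & masked_indices & ~replaced
    input_ids[randomized] = torch.randint(vocab_size, input_ids.shape)[randomized]

    # The last 10% are left as they are. input_ids is modified in place.
    return input_ids, labels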
Guide: Running Locally
To run DistilBERT locally:
- Install the Transformers library:
pip install transformers
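The PyTorch and TensorFlow examples below also require torch or tensorflow, respectively (for example, pip install torch or pip install tensorflow).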
- Use with a pipeline for masked language modeling:
from transformers import pipeline
unmasker = pipeline('fill-mask', model='distilbert-base-cased')
print(unmasker("Hello I'm a [MASK] model."))
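The pipeline returns a list of candidate completions, each a dictionary containing a confidence score, the predicted token id and token string, and the filled-in sequence.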
- PyTorch Example:
from transformers import DistilBertTokenizer, DistilBertModel
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-cased')
model = DistilBertModel.from_pretrained('distilbert-base-cased')
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
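Here output.last_hidden_state is a tensor of shape (batch_size, sequence_length, 768) holding the contextual embedding of each token; the TensorFlow example below produces the equivalent output.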
- TensorFlow Example:
from transformers import DistilBertTokenizer, TFDistilBertModel
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-cased')
model = TFDistilBertModel.from_pretrained('distilbert-base-cased')
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='tf')
output = model(encoded_input)
For optimal performance, consider using cloud GPU services like AWS EC2, Google Cloud, or Azure.
License
DistilBERT is licensed under the Apache 2.0 license, allowing for free use and distribution with proper attribution.