KenLM Model Documentation (edugp/kenlm)
Introduction
KenLM models are probabilistic n-gram language models used for fast perplexity estimation, which aids in filtering or sampling large datasets. For example, these models can filter out samples that are unlikely to appear in a reference dataset such as Wikipedia, because such samples receive high perplexity scores.
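As a minimal sketch of the idea, the raw kenlm Python binding (installation is covered in the guide below) can score text directly. The model path here is illustrative, and for this repository's models the text should first be tokenized with the matching SentencePiece model:

```python
# Minimal sketch: perplexity scoring with the raw kenlm binding.
# "en.arpa.bin" is an illustrative path; for the models in this repo,
# text should be SentencePiece-tokenized before scoring.
import kenlm

lm = kenlm.Model("en.arpa.bin")

# Lower perplexity means the text looks more like the training data
print(lm.perplexity("New York is a city in the United States ."))
print(lm.perplexity("zxqv blorp wug wug wug"))
```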
Architecture
The repository contains several KenLM models trained on tokenized datasets across various languages. Each language-specific model comprises three files:
- {language}.arpa.bin: The trained KenLM model binary.
- {language}.sp.model: The SentencePiece model used for tokenization.
- {language}.sp.vocab: The vocabulary file for the SentencePiece model.
These models are trained using preprocessing steps such as replacing numbers with zeros and normalizing punctuation, based on cc_net's methods.
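A minimal sketch of what such normalization might look like. The repository's model.py implements the exact cc_net-derived variant; this approximation shows only the digit-replacement step plus an assumed lower-casing option:

```python
import re

DIGIT_RE = re.compile(r"\d")

def normalize(text: str, lower_case: bool = False) -> str:
    # Replace every digit with "0" so all numbers collapse to one form,
    # mirroring the cc_net-style preprocessing described above
    text = DIGIT_RE.sub("0", text)
    if lower_case:
        text = text.lower()
    return text

print(normalize("Born in 1952, he scored 3 goals"))
# -> "Born in 0000, he scored 0 goals"
```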
Training
Models are trained on datasets such as Wikipedia and OSCAR with consistent preprocessing. For scores to be meaningful, inputs must be processed the same way at inference time, so keep the default values of parameters like lower_case, remove_accents, normalize_numbers, and punctuation.
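For illustration only, and assuming model.py exposes these as keyword arguments (check the repository for the exact signature and default values), keeping the defaults explicit at inference might look like:

```python
# Hypothetical keyword arguments shown explicitly for illustration;
# model.py in the repo defines the actual signature and defaults.
from model import KenlmModel

model = KenlmModel.from_pretrained(
    "wikipedia",
    "en",
    lower_case=False,        # assumed default
    remove_accents=False,    # assumed default
    normalize_numbers=True,  # assumed default
    punctuation=1,           # assumed default
)
```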
Guide: Running Locally
Dependencies:
- Install KenLM:
  pip install https://github.com/kpu/kenlm/archive/master.zip
- Install SentencePiece:
  pip install sentencepiece
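A quick, optional sanity check that both packages import correctly:

```python
# Optional sanity check that both dependencies installed correctly
import kenlm
import sentencepiece

print("kenlm exposes Model:", hasattr(kenlm, "Model"))
print("sentencepiece version:", sentencepiece.__version__)
```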
Load and Use the Model:
Example Python code to load a model and get perplexity:

```python
from model import KenlmModel

# Load model trained on English Wikipedia
model = KenlmModel.from_pretrained("wikipedia", "en")

# Get perplexity
print(model.get_perplexity("I am very perplexed"))
# Outputs a low perplexity score

print(model.get_perplexity("im hella trippin"))
# Outputs a high perplexity score
```
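Building on this, a minimal sketch of the perplexity-based filtering described in the introduction. The threshold of 1000 is illustrative, not a recommended value, and should be tuned per dataset:

```python
from model import KenlmModel

model = KenlmModel.from_pretrained("wikipedia", "en")

documents = [
    "The Eiffel Tower is a wrought-iron lattice tower in Paris.",
    "lol idk random spam text!!!",
]

# Keep only documents that look like the reference corpus;
# the threshold (1000) is illustrative and dataset-dependent
filtered = [doc for doc in documents if model.get_perplexity(doc) < 1000]
print(filtered)
```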
Cloud Compute: KenLM inference is CPU-bound, so GPUs provide no speedup. To score large datasets efficiently, consider multi-core cloud instances from providers like AWS, Google Cloud, or Azure and parallelize scoring across cores or machines.
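As a sketch of CPU parallelism, assuming each worker process can load its own model copy via the repository's model.py:

```python
# Sketch: parallel perplexity scoring across CPU cores.
# Each worker loads its own model, since loaded models may not
# be picklable across process boundaries.
from multiprocessing import Pool

def init_worker():
    global _model
    from model import KenlmModel
    _model = KenlmModel.from_pretrained("wikipedia", "en")

def score(doc: str) -> float:
    return _model.get_perplexity(doc)

if __name__ == "__main__":
    docs = ["first example document", "second example document"]
    with Pool(processes=4, initializer=init_worker) as pool:
        print(pool.map(score, docs))
```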
License
The KenLM models and associated files are released under the MIT License.